# Airbnb
## Availability Data

This dataset records whether each listing was available on a given date. We can use it to understand listing availability on Airbnb and to train a Machine Learning model that predicts availability.

Here, we focus on the `available` feature, which takes one of two values:

* `f` = False: the listing is not available (it's booked)
* `t` = True: the listing is available (it's not booked)


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

df = pd.read_parquet('/Users/rafaelduarte/Projects/airbnb/data/raw/new-york-city/calendar-2023-10-01.csv.parquet')
# Parse 'date' as datetime (infer_datetime_format is deprecated in pandas 2.x and no longer needed)
df['date'] = pd.to_datetime(df['date'])
df.head(10)

,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,2595,2023-10-02,f,$240.00,$240.00,30.0,1125.0
1,2595,2023-10-03,f,$240.00,$240.00,30.0,1125.0
2,2595,2023-10-04,f,$240.00,$240.00,30.0,1125.0
3,2595,2023-10-05,f,$240.00,$240.00,30.0,1125.0
4,2595,2023-10-06,f,$240.00,$240.00,30.0,1125.0
5,2595,2023-10-07,f,$240.00,$240.00,30.0,1125.0
6,2595,2023-10-08,f,$240.00,$240.00,30.0,1125.0
7,2595,2023-10-09,f,$240.00,$240.00,30.0,1125.0
8,2595,2023-10-10,f,$240.00,$240.00,30.0,1125.0
9,2595,2023-10-11,f,$240.00,$240.00,30.0,1125.0


In [2]:
df.shape

(14159104, 7)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14159104 entries, 0 to 14159103
Data columns (total 7 columns):
 #   Column          Dtype         
---  ------          -----         
 0   listing_id      int64         
 1   date            datetime64[ns]
 2   available       object        
 3   price           object        
 4   adjusted_price  object        
 5   minimum_nights  float64       
 6   maximum_nights  float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(3)
memory usage: 756.2+ MB


In [4]:
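# Derive calendar features from 'date'
# Note: isocalendar() returns the ISO year and ISO week, which can differ from the calendar year around January 1
df['Year'] = df['date'].dt.isocalendar().year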
df['Quarter'] = df['date'].dt.quarter
df['Month'] = df['date'].dt.month
df['Week'] = df['date'].dt.isocalendar().week
df['Weekday'] = df['date'].dt.weekday
df['Day'] = df['date'].dt.day
df['Dayofyear'] = df['date'].dt.dayofyear

# Add 'Weekend' column
df['Weekend'] = df['date'].dt.weekday // 5  # 0 for weekdays, 1 for weekends

# 'date' is already datetime64, so it can be used as the index without re-parsing
df.index = df['date']

In [5]:
df.head(30)

date,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights,Year,Quarter,Month,Week,Weekday,Day,Dayofyear,Weekend
2023-10-02,2595,2023-10-02,f,$240.00,$240.00,30.0,1125.0,2023,4,10,40,0,2,275,0
2023-10-03,2595,2023-10-03,f,$240.00,$240.00,30.0,1125.0,2023,4,10,40,1,3,276,0
2023-10-04,2595,2023-10-04,f,$240.00,$240.00,30.0,1125.0,2023,4,10,40,2,4,277,0
2023-10-05,2595,2023-10-05,f,$240.00,$240.00,30.0,1125.0,2023,4,10,40,3,5,278,0
2023-10-06,2595,2023-10-06,f,$240.00,$240.00,30.0,1125.0,2023,4,10,40,4,6,279,0
2023-10-07,2595,2023-10-07,f,$240.00,$240.00,30.0,1125.0,2023,4,10,40,5,7,280,1
2023-10-08,2595,2023-10-08,f,$240.00,$240.00,30.0,1125.0,2023,4,10,40,6,8,281,1
2023-10-09,2595,2023-10-09,f,$240.00,$240.00,30.0,1125.0,2023,4,10,41,0,9,282,0
2023-10-10,2595,2023-10-10,f,$240.00,$240.00,30.0,1125.0,2023,4,10,41,1,10,283,0
2023-10-11,2595,2023-10-11,f,$240.00,$240.00,30.0,1125.0,2023,4,10,41,2,11,284,0
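
As a quick sanity check on the derived features, here is a minimal sketch on a hypothetical three-row sample (the `sample` frame is an assumption for illustration; the dates are taken from the output above). It shows why integer division by 5 works as a weekend flag: `dt.weekday` numbers Monday as 0 and Sunday as 6, so only Saturday (5) and Sunday (6) map to 1.

```python
import pandas as pd

# Hypothetical sample: a Friday, a Saturday, and the following Monday
sample = pd.DataFrame({"date": pd.to_datetime(["2023-10-06", "2023-10-07", "2023-10-09"])})

# Same derivations as in the notebook cell
sample["Weekday"] = sample["date"].dt.weekday         # Monday = 0 ... Sunday = 6
sample["Weekend"] = sample["date"].dt.weekday // 5    # 0-4 -> 0, 5-6 -> 1
sample["Week"] = sample["date"].dt.isocalendar().week  # ISO week number

print(sample)
```

The values match the rows for the same dates in the table above: the Saturday is flagged as a weekend day, and the Monday starts ISO week 41.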


## Data Cleaning

### Encoding `available`

In the source, the `available` feature is encoded as:
* `t` for True or Available
* `f` for False or Unavailable

A Machine Learning model needs numeric input, so let's encode it as:

* `0` represents Available
* `1` represents Booked (Unavailable)

In [6]:
# Encode 'available' as numeric: 't' -> 0 (available), 'f' -> 1 (booked)
df['available'] = df['available'].map({'t': 0, 'f': 1}).astype(float)

df.available.value_counts()

1.0    8360329
0.0    5798775
Name: available, dtype: int64
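
With this encoding, the mean of the column is the share of booked rows; from the counts above, that is 8,360,329 of 14,159,104 rows, or about 59%. The sketch below illustrates the idea on a hypothetical five-row sample (the `raw` series is an assumption, not part of the dataset). It also checks for values outside the mapping, since `map` silently turns them into `NaN`.

```python
import pandas as pd

# Hypothetical sample in the source encoding ('t' / 'f')
raw = pd.Series(["f", "f", "t", "f", "t"], name="available")

# Same mapping as in the notebook: 0 = available, 1 = booked
encoded = raw.map({"t": 0, "f": 1}).astype(float)

# Values outside the mapping would become NaN, so verify there are none
assert encoded.notna().all()

# For a 0/1 column, the mean equals the share of rows equal to 1 (booked)
booked_share = encoded.mean()
print(f"Booked share: {booked_share:.2f}")
```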

### Transforming Monetary Values

In the source, `price` and `adjusted_price` are stored as strings (dtype `object`) that include a `$` sign.

To make these columns usable, we need to:

* Remove the $
* Remove commas (,)
* Convert them to float

In [7]:
# Remove commas and the '$' symbol and convert to float for 'price' column
# regex=False treats both patterns literally ('$' is otherwise a regex end-of-string anchor)
df['price'] = df['price'].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype(float)

# Remove commas and the '$' symbol and convert to float for 'adjusted_price' column
df['adjusted_price'] = df['adjusted_price'].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype(float)

# Checking results
df.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 14159104 entries, 2023-10-02 to 2024-09-29
Data columns (total 15 columns):
 #   Column          Dtype         
---  ------          -----         
 0   listing_id      int64         
 1   date            datetime64[ns]
 2   available       float64       
 3   price           float64       
 4   adjusted_price  float64       
 5   minimum_nights  float64       
 6   maximum_nights  float64       
 7   Year            UInt32        
 8   Quarter         int64         
 9   Month           int64         
 10  Week            UInt32        
 11  Weekday         int64         
 12  Day             int64         
 13  Dayofyear       int64         
 14  Weekend         int64         
dtypes: UInt32(2), datetime64[ns](1), float64(5), int64(7)
memory usage: 1.6 GB
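
The three cleaning steps can be sketched on a hypothetical sample (the `prices` series below is an assumption that mimics the source format). Passing `regex=False` makes each replacement literal, which matters for `$` because it is also the regular-expression end-of-string anchor.

```python
import pandas as pd

# Hypothetical sample using the same string format as the source columns
prices = pd.Series(["$240.00", "$1,125.00", "$85.00"])

# Strip the currency symbol and thousands separators literally, then cast to float
cleaned = (
    prices.str.replace("$", "", regex=False)
          .str.replace(",", "", regex=False)
          .astype(float)
)

print(cleaned.tolist())
```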


## EDA

Let's take a deeper look into our data.

In [8]:
# Step 1: Keep only booked rows ('available' == 1.0, originally 'f')
filtered_df = df[df['available'] == 1.0]

# Step 2: Count booked days per 'listing_id'
count_f_entries = filtered_df.groupby('listing_id')['available'].count()

# Step 3: Find the 'listing_id' with the most booked days
# (idxmax returns the first listing if several tie for the maximum)
listing_id_with_highest_f_count = count_f_entries.idxmax()

print("Listing ID with the highest occupancy rate:", listing_id_with_highest_f_count)


Listing ID with the highest occupancy rate: 11943
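
Note that a count of booked days equals an occupancy rate only if every listing has the same number of calendar rows. A minimal sketch of a rate-based alternative, on a hypothetical sample (the `sample` frame and the `per_listing` column names are assumptions for illustration), takes the mean of the 0/1 column per listing:

```python
import pandas as pd

# Hypothetical sample: listing 1 is booked 2 of 4 days, listing 2 is booked 2 of 2 days
sample = pd.DataFrame({
    "listing_id": [1, 1, 1, 1, 2, 2],
    "available":  [1.0, 1.0, 0.0, 0.0, 1.0, 1.0],  # 1 = booked, 0 = available
})

# Named aggregation: booked days, calendar rows, and booked share per listing
per_listing = sample.groupby("listing_id")["available"].agg(
    booked_days="sum",
    total_days="count",
    occupancy="mean",
)
print(per_listing)

# Both listings have two booked days, but only listing 2 is fully booked
print("Most booked days:", per_listing["booked_days"].idxmax())
print("Highest occupancy rate:", per_listing["occupancy"].idxmax())
```

In this sample the two approaches disagree: the count-based pick is listing 1 (the first of a tie), while the rate-based pick is listing 2.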


In [12]:
import random

# We can use this code to get a random listing_id

# Get a list of unique listing_ids
unique_listing_ids = df['listing_id'].unique()

# Select a random listing_id
random_listing_id = random.choice(unique_listing_ids)

# Print the random listing_id
print("Random Listing ID:", random_listing_id)

Random Listing ID: 660806119289637869


In [13]:
# selecting only the desired listing_id
df_01 = df[df['listing_id'] == 660806119289637869]

# checking the dataframe
print(df_01.available.value_counts())
df_01.head(10)

1.0    230
0.0    135
Name: available, dtype: int64


date,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights,Year,Quarter,Month,Week,Weekday,Day,Dayofyear,Weekend
2023-10-02,660806119289637869,2023-10-02,1.0,220.0,220.0,30.0,1125.0,2023,4,10,40,0,2,275,0
2023-10-03,660806119289637869,2023-10-03,1.0,220.0,220.0,30.0,1125.0,2023,4,10,40,1,3,276,0
2023-10-04,660806119289637869,2023-10-04,1.0,220.0,220.0,30.0,1125.0,2023,4,10,40,2,4,277,0
2023-10-05,660806119289637869,2023-10-05,1.0,220.0,220.0,30.0,1125.0,2023,4,10,40,3,5,278,0
2023-10-06,660806119289637869,2023-10-06,1.0,220.0,220.0,30.0,1125.0,2023,4,10,40,4,6,279,0
2023-10-07,660806119289637869,2023-10-07,1.0,220.0,220.0,30.0,1125.0,2023,4,10,40,5,7,280,1
2023-10-08,660806119289637869,2023-10-08,1.0,220.0,220.0,30.0,1125.0,2023,4,10,40,6,8,281,1
2023-10-09,660806119289637869,2023-10-09,1.0,220.0,220.0,30.0,1125.0,2023,4,10,41,0,9,282,0
2023-10-10,660806119289637869,2023-10-10,1.0,220.0,220.0,30.0,1125.0,2023,4,10,41,1,10,283,0
2023-10-11,660806119289637869,2023-10-11,1.0,220.0,220.0,30.0,1125.0,2023,4,10,41,2,11,284,0
