# Machine Learning Applications for Airbnb Data

### Group 3 - Dhruv Shah, Jenn Hong, Setu Shah, Sonya Dreyer

---



• State the problem

• Tell us who cares about this problem and why

• Describe your data – where it came from, what it contains

• Present some interesting descriptive analyses (plots/tables) that motivates your exercise

• Present your main results

• Which methods worked best for your problem?

• What were the challenges you faced? Tell us about the biggest challenge you faced and how you overcame it (or tried to but did not; that's fine too, since not every problem has a solution).

• Conclude – what have you learnt that can be put to practice?

# Data Cleaning

---



In [1]:
# Import preprocessing libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Download the file

!wget 'https://maven-datasets.s3.amazonaws.com/Airbnb/Airbnb+Data.zip'

--2023-11-18 20:27:11--  https://maven-datasets.s3.amazonaws.com/Airbnb/Airbnb+Data.zip
Resolving maven-datasets.s3.amazonaws.com (maven-datasets.s3.amazonaws.com)... 52.216.60.241, 3.5.29.215, 3.5.20.106, ...
Connecting to maven-datasets.s3.amazonaws.com (maven-datasets.s3.amazonaws.com)|52.216.60.241|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 91005234 (87M) [application/zip]
Saving to: ‘Airbnb+Data.zip’


2023-11-18 20:27:12 (87.8 MB/s) - ‘Airbnb+Data.zip’ saved [91005234/91005234]



In [3]:
# Unzip the file

!unzip Airbnb+Data.zip

Archive:  Airbnb+Data.zip
   creating: Airbnb Data/
  inflating: Airbnb Data/Listings.csv  
  inflating: Airbnb Data/Listings_data_dictionary.csv  
  inflating: Airbnb Data/Reviews.csv  
  inflating: Airbnb Data/Reviews_data_dictionary.csv  


In [4]:
# Load the data frames

listings =  pd.read_csv('/content/Airbnb Data/Listings.csv', encoding = 'latin1', low_memory = False)

reviews = pd.read_csv('/content/Airbnb Data/Reviews.csv', encoding = 'latin1', low_memory = False)

In [5]:
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279712 entries, 0 to 279711
Data columns (total 33 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   listing_id                   279712 non-null  int64  
 1   name                         279539 non-null  object 
 2   host_id                      279712 non-null  int64  
 3   host_since                   279547 non-null  object 
 4   host_location                278872 non-null  object 
 5   host_response_time           150930 non-null  object 
 6   host_response_rate           150930 non-null  float64
 7   host_acceptance_rate         166625 non-null  float64
 8   host_is_superhost            279547 non-null  object 
 9   host_total_listings_count    279547 non-null  float64
 10  host_has_profile_pic         279547 non-null  object 
 11  host_identity_verified       279547 non-null  object 
 12  neighbourhood                279712 non-null  object 
 13 

In [7]:
# Converting to datetime

listings.host_since = pd.to_datetime(listings.host_since)

In [8]:
# Converting to out-of-10 scale

listings.review_scores_rating = listings.review_scores_rating / 10

In [10]:
# Converting prices to USD

cities = listings['city'].unique()
# Rates are paired with cities by the order returned by unique(), so check that the order matches
exchange_rates = [1.0808, 1, 0.028388, 0.20328, 0.65462, 0.039480, 1.0808, 0.12777, 0.0493, 0.053215] # update these numbers before fitting models
currency_map = dict(zip(cities, exchange_rates))

listings['usd_price'] = listings['price'] * listings['city'].map(currency_map) # create new column (vectorized lookup instead of a row-wise apply)
listings.drop('price', axis = 1, inplace = True) # drop original column
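Because the conversion above depends on the order in which `unique()` returns the cities, an explicit lookup keyed by city name may be easier to audit. The following is a minimal sketch on a toy frame: the Paris rate is the one consistent with the `usd_price` values shown in `listings.head(2)`, while the New York pairing and the toy prices are assumptions for illustration only.

```python
import pandas as pd

# Toy listings; the prices are illustrative
toy = pd.DataFrame({
    "city": ["Paris", "New York", "Paris"],
    "price": [53.0, 120.0, 120.0],
})

# Explicit city -> USD rate lookup (only two cities shown; pairing is an assumption)
usd_per_local = {"Paris": 1.0808, "New York": 1.0}

# Series.map is vectorized and yields NaN for any city missing from the dict
rates = toy["city"].map(usd_per_local)
assert rates.notna().all(), "some cities have no exchange rate"

toy["usd_price"] = toy["price"] * rates
print(toy)
```

The assertion makes a missing or misspelled city fail loudly instead of silently producing a wrong price.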

In [11]:
# Converting to numerical category

# Potentially problematic: NULL values are also converted to 0, so a missing flag can no longer be told apart from a negative one

listings.host_is_superhost = listings.host_is_superhost.apply(lambda x: 1 if x == 't' else 0)
listings.host_has_profile_pic = listings.host_has_profile_pic.apply(lambda x: 1 if x == 't' else 0)
listings.host_identity_verified = listings.host_identity_verified.apply(lambda x: 1 if x == 't' else 0)
listings.instant_bookable = listings.instant_bookable.apply(lambda x: 1 if x == 't' else 0)
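If the missing flags should stay distinguishable, one option is to map only the known values and keep the rest as missing. This sketch assumes the negative value is stored as `'f'` (the cells above only test for `'t'`), and it uses a toy Series rather than the real column.

```python
import pandas as pd

# Toy column with 't'/'f' strings plus a missing value ('f' is an assumed encoding)
flags = pd.Series(["t", "f", None, "t"], name="host_is_superhost")

# Approach used above: anything that is not 't', including missing, becomes 0
as_zero = flags.apply(lambda x: 1 if x == "t" else 0)

# Alternative: map known values and keep missing entries as <NA>
keep_missing = flags.map({"t": 1, "f": 0}).astype("Int64")

print(as_zero.tolist())
print(keep_missing.isna().sum())
```

Keeping `<NA>` leaves the decision about missing values to a later imputation or `dropna` step instead of fixing it during cleaning.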

In [12]:
# Candidates to drop: listing_id, host_id, property, neighbourhood

# Districts can be dropped: it is populated only for New York and is NULL everywhere else

# name should be dropped, and possibly host_location too, unless we can extract a precise location from it (latitude and longitude can be used to create clusters, as in the lab)

# All host locations within each country have been mapped to the most prominent city in that country

# We may need to impute values (or drop the columns) for host_response_time, host_response_rate, host_acceptance_rate, and some of the ratings columns [IterativeImputer]
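The notes above mention the Iterative Imputer as a candidate for the sparsely populated host and rating columns. The following is a small sketch of how scikit-learn's `IterativeImputer` could be applied; the toy values are made up, and the column names are borrowed from the listings data only for readability.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Toy numeric frame with gaps; values are made up
toy = pd.DataFrame({
    "host_response_rate": [1.0, 0.9, np.nan, 0.5, 0.8],
    "host_acceptance_rate": [0.95, np.nan, 0.7, 0.4, 0.85],
    "review_scores_rating": [9.8, 9.5, 9.0, np.nan, 9.4],
})

# Each column with gaps is modeled from the other columns in turn
imputer = IterativeImputer(random_state=0, max_iter=10)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(imputed.round(2))
```

In a real workflow the imputer would normally be fitted on the training split only and then applied to the test split, so that test data does not leak into the imputation model.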

In [13]:
listings.head(2)

Unnamed: 0,listing_id,name,host_id,host_since,host_location,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_total_listings_count,...,maximum_nights,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,usd_price
0,281420,"Beautiful Flat in le Village Montmartre, Paris",1466919,2011-12-03,"Paris, Ile-de-France, France",,,,0,1.0,...,1125,10.0,10.0,10.0,10.0,10.0,10.0,10.0,0,57.2824
1,3705183,39 mÃÂ² Paris (Sacre CÃâur),10328771,2013-11-29,"Paris, Ile-de-France, France",,,,0,1.0,...,1125,10.0,10.0,10.0,10.0,10.0,10.0,10.0,0,129.696


In [14]:
listings.neighbourhood.value_counts()

I Centro Storico     14874
Sydney                8074
Copacabana            7712
Cuauhtemoc            7626
Buttes-Montmartre     7237
                     ...  
Lighthouse Hill          1
Willowbrook              1
Magalhaes Bastos         1
Woodrow                  1
Agua Santa               1
Name: neighbourhood, Length: 660, dtype: int64

# Preprocessing

---



In [17]:
# Jenn's one hot encoding for amenities
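The cell above is a placeholder, so the team's actual encoding is not shown. As one possible approach, the sketch below turns a list-valued column into one indicator column per amenity with `MultiLabelBinarizer`. It assumes each `amenities` cell is a JSON-style list stored as a string; that format, and the amenity names, are assumptions rather than something confirmed by this notebook.

```python
import json

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Assumed format: each cell is a JSON-style list stored as a string
amenities = pd.Series([
    '["Wifi", "Kitchen"]',
    '["Wifi", "Heating", "Washer"]',
    '[]',
])

# Parse each string into a Python list of amenity names
parsed = amenities.apply(json.loads)

# One indicator column per distinct amenity
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(parsed), columns=mlb.classes_, index=amenities.index
)
print(encoded)
```

A plain `OneHotEncoder` would treat every distinct amenity list as its own category, whereas this multi-label encoding yields one column per individual amenity.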

In [15]:
# Splitting the data into training and test sets to estimate generalization error

from sklearn.model_selection import train_test_split

X = listings.drop("usd_price", axis=1)
y = listings["usd_price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((223769, 32), (55943, 32), (223769,), (55943,))

In [16]:
# Building preprocessing pipeline

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

from sklearn import set_config
set_config(display='diagram')

cat_attribs = ['host_response_time', 'host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'city', 'room_type', 'amenities', 'instant_bookable'] # not sure if host_since (maybe split by months) is included here

num_attribs = ['host_response_rate', 'host_acceptance_rate', 'host_total_listings_count', 'accommodates', 'bedrooms', 'review_scores_rating', 'review_scores_accuracy', 'minimum_nights',
               'maximum_nights', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value'] # excluding latitude and longitude

preprocess_pipeline = ColumnTransformer([
        ("cat", OneHotEncoder(drop="first"), cat_attribs),
        ("num", StandardScaler(), num_attribs),
    ])

preprocess_pipeline
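Several of the numeric columns listed above contain missing values (for example, `host_response_rate` has 150930 non-null entries out of 279712), and `StandardScaler` passes NaN through unchanged, so a downstream linear model would fail on them. A hedged sketch of one way to handle this inside the transformer is shown below, using a median `SimpleImputer` ahead of the scaler. The toy rows and the `room_type` labels are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; the room_type labels are placeholders
toy = pd.DataFrame({
    "room_type": ["Entire place", "Private room", "Private room", "Entire place"],
    "bedrooms": [1.0, np.nan, 2.0, 3.0],
    "accommodates": [2.0, 1.0, np.nan, 6.0],
})

# Numeric branch: fill gaps with the column median, then standardize
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    # handle_unknown="ignore" avoids errors on categories unseen during fit
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
    ("num", num_pipeline, ["bedrooms", "accommodates"]),
])

Xt = preprocess.fit_transform(toy)
Xt = Xt.toarray() if hasattr(Xt, "toarray") else np.asarray(Xt)
print(Xt.shape)
```

The alternative taken later in this notebook is to restrict the feature set and call `dropna()`, which is simpler but discards rows.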

In [20]:
reviews.head(20)

Unnamed: 0,listing_id,review_id,date,reviewer_id
0,11798,330265172,2018-09-30,11863072
1,15383,330103585,2018-09-30,39147453
2,16455,329985788,2018-09-30,1125378
3,17919,330016899,2018-09-30,172717984
4,26827,329995638,2018-09-30,17542859
5,74561,330089224,2018-09-30,173044789
6,140355,330194958,2018-09-30,160093807
7,162163,329980859,2018-09-30,94026758
8,167998,329950677,2018-09-30,35388162
9,178188,330213008,2018-09-30,3652511


In [22]:
training_df = listings[['host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'city', 'instant_bookable', 'host_total_listings_count', 'accommodates', 'bedrooms', 'minimum_nights', 'maximum_nights', 'usd_price']]
training_df = training_df.dropna()

In [23]:
# Splitting the data into training and test sets to estimate generalization error

from sklearn.model_selection import train_test_split

X = training_df.drop("usd_price", axis=1)
y = training_df["usd_price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((200105, 10), (50027, 10), (200105,), (50027,))

In [24]:
# Building preprocessing pipeline

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

from sklearn import set_config
set_config(display='diagram')

cat_attribs = ['host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'city', 'instant_bookable'] # not sure if host_since (maybe split by months) is included here

num_attribs = ['host_total_listings_count', 'accommodates', 'bedrooms', 'minimum_nights',
               'maximum_nights'] # excluding latitude and longitude

preprocess_pipeline = ColumnTransformer([
        ("cat", OneHotEncoder(drop="first"), cat_attribs),
        ("num", StandardScaler(), num_attribs),
    ])

preprocess_pipeline

In [25]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

lr_pipeline = Pipeline([
    ('preprocessor', preprocess_pipeline),
    ('model', LinearRegression())
])

lr_pipeline.fit(X_train, y_train)

In [55]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

y_pred = lr_pipeline.predict(X_test)
mean_squared_error(y_test, y_pred, squared=False)

388.50873336824054

In [31]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.04433300081209146

In [32]:
from sklearn.metrics import median_absolute_error
median_absolute_error(y_test, y_pred)

41.84551322329613
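The RMSE (about 388.5) is much larger than the median absolute error (about 41.8). One plausible reading is that a small number of very expensive listings dominate the squared error. The sketch below illustrates that effect on made-up prices with a single extreme value; it is not derived from the listings data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, median_absolute_error, r2_score

# Made-up prices: nine ordinary listings and one extreme outlier
y_true = np.array([80, 90, 100, 110, 120, 95, 105, 85, 115, 5000], dtype=float)
# A naive model that predicts 100 for every listing
y_hat = np.full_like(y_true, 100.0)

# Taking the square root manually works across scikit-learn versions
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
medae = median_absolute_error(y_true, y_hat)
r2 = r2_score(y_true, y_hat)
print(f"RMSE={rmse:.1f}  MedAE={medae:.1f}  R2={r2:.3f}")
```

Reporting both metrics side by side helps separate typical error from error driven by a handful of extreme prices.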

In [35]:
from sklearn.linear_model import Lasso
lasso_pipeline = Pipeline([
    ('preprocessor', preprocess_pipeline),
    ('model', Lasso())
])


In [36]:
lasso_pipeline.fit(X_train, y_train)

In [56]:
y_pred = lasso_pipeline.predict(X_test)
mean_squared_error(y_test, y_pred, squared=False)

388.75808004317116

In [38]:
r2_score(y_test, y_pred)

0.043105904293091024

In [39]:
from sklearn.tree import DecisionTreeRegressor
dt_pipeline = Pipeline([
    ('preprocessor', preprocess_pipeline),
    ('model', DecisionTreeRegressor())
])

In [40]:
dt_pipeline.fit(X_train, y_train)

In [58]:
y_pred = dt_pipeline.predict(X_test)
mean_squared_error(y_test, y_pred, squared=False)

420.992797448858

In [42]:
r2_score(y_test, y_pred)

-0.1189457946843342

In [43]:
from sklearn.linear_model import Ridge
ridge_pipeline = Pipeline([
    ('preprocessor', preprocess_pipeline),
    ('model', Ridge())
])


In [46]:
ridge_pipeline.fit(X_train, y_train)

In [59]:
y_pred = ridge_pipeline.predict(X_test)
mean_squared_error(y_test, y_pred, squared=False)

388.5086998979579

In [48]:
r2_score(y_test, y_pred)

0.04433316547476385

In [49]:
from sklearn.ensemble import RandomForestRegressor
rf_pipeline = Pipeline([
    ('preprocessor', preprocess_pipeline),
    ('model', RandomForestRegressor())
])


In [51]:
rf_pipeline.fit(X_train, y_train)

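In [60]:
# Evaluate the random forest. The recorded run predicted with ridge_pipeline here,
# so its outputs repeated the Ridge scores; rerun this cell to obtain the forest's scores.
y_pred = rf_pipeline.predict(X_test)
mean_squared_error(y_test, y_pred, squared=False)

In [61]:
r2_score(y_test, y_pred)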
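Fitting and scoring each model in separate cells makes it easy to evaluate the wrong pipeline by accident. A loop over named estimators keeps every prediction tied to the model that was just fitted. The sketch below uses synthetic data from `make_regression` as a stand-in for the preprocessed listings, so its scores say nothing about the Airbnb problem, and the estimator settings are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed listings data
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(),
    "ridge": Ridge(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)  # always the model fitted in this iteration
    results[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, y_pred))),
        "r2": float(r2_score(y_test, y_pred)),
    }

for name, scores in results.items():
    print(f"{name:>7}: RMSE={scores['rmse']:.2f}  R2={scores['r2']:.3f}")
```

With the real data, each estimator would be wrapped in a `Pipeline` together with `preprocess_pipeline`, as in the cells above.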