# Machine Learning Applications for Airbnb Data

### Group 3 - Dhruv Shah, Jenn Hong, Setu Shah, Sonya Dreyer

---



• State the problem

• Tell us who cares about this problem and why

• Describe your data – where it came from, what it contains

• Present some interesting descriptive analyses (plots/tables) that motivates your exercise

• Present your main results

• Which methods worked best for your problem?

• What were the challenges you faced? Tell us about the biggest challenge you faced and how you overcame it (or tried but did not; that's fine too, since not every problem has a solution).

• Conclude – what have you learnt that can be put to practice?

# Data Cleaning

---



In [1]:
# Import preprocessing libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Download the file

!wget 'https://maven-datasets.s3.amazonaws.com/Airbnb/Airbnb+Data.zip'

--2023-12-02 00:07:20--  https://maven-datasets.s3.amazonaws.com/Airbnb/Airbnb+Data.zip
Resolving maven-datasets.s3.amazonaws.com (maven-datasets.s3.amazonaws.com)... 52.217.160.9, 52.217.118.169, 52.217.140.81, ...
Connecting to maven-datasets.s3.amazonaws.com (maven-datasets.s3.amazonaws.com)|52.217.160.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 91005234 (87M) [application/zip]
Saving to: ‘Airbnb+Data.zip.1’


2023-12-02 00:07:23 (30.2 MB/s) - ‘Airbnb+Data.zip.1’ saved [91005234/91005234]



In [3]:
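# Unzip the file
# -o overwrites files left by an earlier run instead of stopping at the interactive "replace?" prompt

!unzip -o Airbnb+Data.zip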

Archive:  Airbnb+Data.zip
replace Airbnb Data/Listings.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [4]:
# Load the data frames

listings =  pd.read_csv('/content/Airbnb Data/Listings.csv', encoding = 'latin1', low_memory = False)

#reviews = pd.read_csv('/content/Airbnb Data/Reviews.csv', encoding = 'latin1', low_memory = False)

In [5]:
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279712 entries, 0 to 279711
Data columns (total 33 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   listing_id                   279712 non-null  int64  
 1   name                         279539 non-null  object 
 2   host_id                      279712 non-null  int64  
 3   host_since                   279547 non-null  object 
 4   host_location                278872 non-null  object 
 5   host_response_time           150930 non-null  object 
 6   host_response_rate           150930 non-null  float64
 7   host_acceptance_rate         166625 non-null  float64
 8   host_is_superhost            279547 non-null  object 
 9   host_total_listings_count    279547 non-null  float64
 10  host_has_profile_pic         279547 non-null  object 
 11  host_identity_verified       279547 non-null  object 
 12  neighbourhood                279712 non-null  object 
 13 

In [6]:
# Converting to datetime

listings.host_since = pd.to_datetime(listings.host_since)

In [7]:
# Converting to out-of-10 scale

listings.review_scores_rating = listings.review_scores_rating / 10

In [8]:
# Converting prices to USD

# NOTE: rates are paired with cities by position, so exchange_rates must follow the order returned by unique()
cities = listings['city'].unique()
exchange_rates = [1.0808, 1, 0.028388, 0.20328, 0.65462, 0.039480, 1.0808, 0.12777, 0.0493, 0.053215] # update these numbers before fitting models
currency_map = dict(zip(cities, exchange_rates))

listings['usd_price'] = listings['price'] * listings['city'].map(currency_map) # create new column (vectorised instead of row-wise apply)
listings = listings.drop(columns='price') # drop original column
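
The conversion above pairs rates with cities by position, which silently breaks if the row order of the file changes. A minimal sketch of a name-keyed alternative follows; the city names and rates in it are hypothetical placeholders, not the values used in this notebook.

```python
import pandas as pd

# Hypothetical toy listings; city names and rates are placeholders only.
toy = pd.DataFrame({
    "city": ["CityA", "CityB", "CityA", "CityC"],
    "price": [100.0, 2000.0, 50.0, 80.0],
})

# Rates keyed by city name, so the order of rows cannot change the pairing.
rates_to_usd = {"CityA": 1.0, "CityB": 0.05}

rate = toy["city"].map(rates_to_usd)          # NaN for any city without a rate
unmapped = toy.loc[rate.isna(), "city"].unique().tolist()
toy["usd_price"] = toy["price"] * rate        # vectorised multiplication

print("Cities without a rate:", unmapped)
print(toy)
```

Listing the unmapped cities makes a missing rate visible before any model is fitted, instead of surfacing later as NaN prices.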

In [9]:
# Converting 't'/'f' flags to 1/0

# Potentially problematic -> NULL values are also converted to zero, i.e. treated the same as 'f'

flag_columns = ['host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'instant_bookable']
for col in flag_columns:
    listings[col] = (listings[col] == 't').astype(int)
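
Because the conversion above turns missing flags into 0, a missing value becomes indistinguishable from 'f'. One possible alternative, sketched here on a hypothetical toy column, keeps missing values as NaN and records them in a separate indicator; the result would still need imputation or explicit handling downstream.

```python
import pandas as pd

# Hypothetical toy column using the same 't'/'f' encoding, with one missing value.
flags = pd.Series(["t", "f", None, "t"], name="host_is_superhost")

# Series.map leaves entries that are not in the dictionary (here: missing) as NaN.
as_number = flags.map({"t": 1, "f": 0})

# Optional indicator so a model can still see that the value was missing.
was_missing = flags.isna().astype(int)

print(pd.DataFrame({"flag": as_number, "was_missing": was_missing}))
```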

In [10]:
# We should drop listing_id, host_id, property_type, and neighbourhood

# We can drop district: it is populated only for New York listings and is NULL everywhere else

# We should drop name, and possibly host_location, unless we can figure out how to extract a precise location from it (latitude and longitude can be used to create clusters, as in the lab)

# All host locations within each country have been mapped to the most prominent city in that country

# We may need to impute values (or drop the columns) for host_response_time, host_response_rate, host_acceptance_rate, and some of the review score columns [IterativeImputer]

In [11]:
# Dropping identifier, free-text, and sparsely populated location columns

columns_to_drop = ['listing_id', 'host_id', 'property_type', 'neighbourhood', 'district', 'name', 'host_location']

listings = listings.drop(columns=columns_to_drop)

In [12]:
# Dropping columns with roughly 40-46% missing values (see the non-null counts from listings.info() above)

missing_values_columns = ['host_response_time', 'host_response_rate', 'host_acceptance_rate']

listings = listings.drop(columns=missing_values_columns)
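
The share of missing values per column can be computed directly rather than read off the `info()` output. The sketch below uses a hypothetical toy frame and an assumed cutoff of 40%; neither comes from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with different amounts of missing data per column.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],   # 75% missing
    "b": [1.0, 2.0, np.nan, 4.0],         # 25% missing
    "c": [1.0, 2.0, 3.0, 4.0],            # complete
})

# isna().mean() gives the fraction of missing entries in each column.
missing_share = toy.isna().mean().sort_values(ascending=False)

threshold = 0.40  # assumed cutoff, chosen only for illustration
to_drop = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print("Columns above the threshold:", to_drop)
```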

# Preprocessing

---



In [13]:
# Splitting the data into training and test sets to estimate generalization error

from sklearn.model_selection import train_test_split

X = listings.drop("usd_price", axis=1)
y = listings["usd_price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((223769, 22), (55943, 22), (223769,), (55943,))

In [14]:
# # Iteratively impute missing values for numerical columns

# X_train_num = X_train.select_dtypes(include=[np.number])

# # explicitly require this experimental feature
# from sklearn.experimental import enable_iterative_imputer  # noqa

# # now you can import normally from sklearn.impute
# from sklearn.impute import IterativeImputer

# iter_imputer = IterativeImputer(random_state=42)
# X_train_imp = iter_imputer.fit_transform(X_train_num)
# X_train_imp_df = pd.DataFrame(X_train_imp, columns=X_train_num.columns, index=X_train_num.index)

In [15]:
# Building preprocessing pipeline
#### JENN: I changed a couple of things so that it works with the cost-complexity pruning code; we need to go over the differences

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

from sklearn import set_config
set_config(display='diagram')

cat_attribs = ['host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'city', 'room_type', 'instant_bookable'] # not sure if host_since (maybe split by months) is included here

num_attribs = ['host_total_listings_count', 'accommodates', 'bedrooms', 'review_scores_rating', 'review_scores_accuracy', 'minimum_nights',
               'maximum_nights', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value'] # excluding latitude and longitude

# missing_attribs = ['host_total_listings_count', 'bedrooms', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
#                'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value']

num_pipeline = make_pipeline(IterativeImputer(random_state = 42), StandardScaler())

# Dropping amenities for now

preprocess_pipeline = ColumnTransformer([
        ("cat", OneHotEncoder(drop="first", sparse_output=False), cat_attribs),
        ("num", num_pipeline, num_attribs),
    ])
preprocess_pipeline.set_output(transform='pandas')

preprocess_pipeline
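
The pipeline above leaves `host_since` out. If it were to be included, one option is to turn the date into a numeric tenure that can sit with the other numeric attributes. This is only a sketch: the reference date and the column name `host_tenure_years` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data with one missing date.
toy = pd.DataFrame({
    "host_since": pd.to_datetime(["2015-01-01", "2020-06-15", None]),
})

# Assumed snapshot date; in practice this would be the date the data was scraped.
reference_date = pd.Timestamp("2021-01-01")

# Tenure in years; missing dates stay NaN and can be imputed like other numeric columns.
toy["host_tenure_years"] = (reference_date - toy["host_since"]).dt.days / 365.25

print(toy)
```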

In [16]:
# # Checking data after pre-processing
# print(X_train.shape)
# X_train_prepared = preprocess_pipeline.fit_transform(X_train)
# print(X_train_prepared.shape)

In [17]:
# # Checking new column names

# preprocess_pipeline.get_feature_names_out()

In [18]:
# X_train_prepared_df = pd.DataFrame(X_train_prepared, # the numpy array containing the processed data
#                                    columns=preprocess_pipeline.get_feature_names_out(), # column names
#                                    index=X_train.index # row numbers/labels
#                                    )
# X_train_prepared_df.isna().sum()

In [19]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = make_pipeline(preprocess_pipeline, LinearRegression())
lin_reg.fit(X_train, y_train)
y_train_predictions = lin_reg.predict(X_train)

# RMSE as sqrt(MSE); the squared=False argument was deprecated in scikit-learn 1.4
lin_rmse = np.sqrt(mean_squared_error(y_train, y_train_predictions))
print(f"The training data RMSE is {lin_rmse:.0f} or about {(lin_rmse/y_train.mean()*100):.0f}% error")

The training data RMSE is 433 or about 389% error


In [20]:
from sklearn.metrics import r2_score

print(f'R-squared score from Linear Regression model is {r2_score(y_train, y_train_predictions):.3f}')

R-squared score from Linear Regression model is 0.036
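
An RMSE several times the mean price, together with an R-squared near zero, is often a sign of a heavily right-skewed target. One common remedy, not tried in this notebook, is to fit the model on a log-transformed target. The sketch below uses synthetic data and scikit-learn's `TransformedTargetRegressor`; whether it would help on the Airbnb prices is an open question.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic, right-skewed target (log-normal); not the Airbnb data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.exp(1.0 + X @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.3, size=500))

# Baseline: linear regression on the raw target.
plain = LinearRegression().fit(X, y)

# Same model fitted on log1p(y); predictions are mapped back with expm1.
logged = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
).fit(X, y)

r2_plain = r2_score(y, plain.predict(X))
r2_logged = r2_score(y, logged.predict(X))
print(f"R-squared on raw target: {r2_plain:.3f}")
print(f"R-squared with log-transformed target: {r2_logged:.3f}")
```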


In [21]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = make_pipeline(preprocess_pipeline, DecisionTreeRegressor(random_state=42))
tree_reg.fit(X_train, y_train)
y_train_predictions = tree_reg.predict(X_train)
# RMSE as sqrt(MSE); the squared=False argument was deprecated in scikit-learn 1.4
tree_rmse = np.sqrt(mean_squared_error(y_train, y_train_predictions))
print(f'Training data error for the tree {tree_rmse:.0f}')

Training data error for the tree 126


In [22]:
#from sklearn.metrics import r2_score

print(f'R-squared score from Decision Tree model is {r2_score(y_train, y_train_predictions):.3f}')

R-squared score from Decision Tree model is 0.919


In [23]:
from sklearn.model_selection import cross_val_score

# First for the linear regression
lin_cv_rmses = -cross_val_score(lin_reg, X_train, y_train,
                              scoring="neg_root_mean_squared_error", cv=3)
print(f"Average Linear Regression Cross-Validation RMSE: {lin_cv_rmses.mean():.0f}")

Average Linear Regression Cross-Validation RMSE: 422


In [24]:
# Then the decision tree regressor

tree_cv_rmses = -cross_val_score(tree_reg, X_train, y_train,
                              scoring="neg_root_mean_squared_error", cv=3)
print(f"Average Decision Tree Regression Cross-Validation RMSE: {tree_cv_rmses.mean():.0f}")

Average Decision Tree Regression Cross-Validation RMSE: 531
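
The gap between the tree's training RMSE and its cross-validation RMSE points to overfitting: an unconstrained tree can memorise the training rows. The sketch below reproduces the pattern on synthetic data and compares it with a depth-limited tree; the depth of 3 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one informative feature plus noise; not the Airbnb data.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=400)

results = {}
for depth in (None, 3):  # None = grow the tree until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    train_pred = tree.fit(X, y).predict(X)
    train_rmse = np.sqrt(np.mean((y - train_pred) ** 2))
    cv_rmse = -cross_val_score(
        tree, X, y, scoring="neg_root_mean_squared_error", cv=3
    ).mean()
    results[depth] = (train_rmse, cv_rmse)
    print(f"max_depth={depth}: train RMSE {train_rmse:.2f}, CV RMSE {cv_rmse:.2f}")
```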


In [25]:
##################################
#JENN CODE

In [26]:
X_train.columns

Index(['host_since', 'host_is_superhost', 'host_total_listings_count',
       'host_has_profile_pic', 'host_identity_verified', 'city', 'latitude',
       'longitude', 'room_type', 'accommodates', 'bedrooms', 'amenities',
       'minimum_nights', 'maximum_nights', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value', 'instant_bookable'],
      dtype='object')

In [27]:
# The preprocessing has to be applied outside the pipeline here, because the pruning path is computed on the prepared features
X_train_prepd = preprocess_pipeline.fit_transform(X_train)
X_test_prepd = preprocess_pipeline.transform(X_test)

tree_reg_2 = DecisionTreeRegressor(random_state=42, max_depth=5)
tree_reg_2.fit(X_train_prepd, y_train)

# from sklearn.tree import plot_tree

# plt.figure(dpi=200) # Makes the figure a little larger, easier to read.
# plot_tree(tree_reg_2, filled=True, feature_names=list(X_train_prepd.columns)); # graphically shows the tree


In [28]:
clf_full = DecisionTreeRegressor()
path = clf_full.cost_complexity_pruning_path(X_train_prepd, y_train) # a method on the decision tree regressor
ccp_alphas, impurities = path.ccp_alphas, path.impurities # for a regressor the default impurity is squared error (MSE), not Gini

In [29]:
ccp_alphas[:10]

array([ 0.00000000e+00, -1.94289029e-16, -1.62549011e-16, -1.30104261e-16,
       -1.17961196e-16, -6.93889390e-17, -6.93889390e-17, -6.59194921e-17,
       -5.72458747e-17, -5.69206141e-17])

In [30]:
print(f'There are {ccp_alphas.shape[0]} alpha values.')

There are 132807 alpha values.
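
The tiny negative values in the output above are floating-point noise, and scikit-learn requires `ccp_alpha` to be non-negative. With this many alphas it can also help to search over a small, evenly spaced subset. The sketch below shows one way to do both on a hypothetical stand-in array; the candidate count is an arbitrary choice.

```python
import numpy as np

# Hypothetical stand-in for the pruning path, including noise-level negatives
# and a duplicate value.
raw_alphas = np.array([0.0, -1.9e-16, -6.9e-17, 1e-6, 1e-6, 3e-4, 2e-2, 5.0])

# Clip negatives to zero, then sort and de-duplicate.
alphas = np.unique(np.clip(raw_alphas, 0.0, None))

# Pick evenly spaced positions along the sorted path (assumed count).
n_candidates = 4
idx = np.linspace(0, len(alphas) - 1, n_candidates).round().astype(int)
candidates = alphas[np.unique(idx)]

print(candidates)
```

Sampling by position rather than by value keeps candidates from both the dense low-alpha end and the sparse high-alpha end of the path.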


In [31]:
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# def alpha_generator():
#     for alpha in ccp_alphas:
#         yield alpha

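# Just one parameter, no pipeline steps, thus no __ syntax.
# Negative alphas are floating-point noise and are not valid ccp_alpha values, so they are filtered out.
param_grid = {'ccp_alpha': ccp_alphas[ccp_alphas >= 0]}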

random_search = RandomizedSearchCV(DecisionTreeRegressor(random_state=42), param_grid, n_iter=10, cv=3, scoring='neg_mean_squared_error')

random_search.fit(X_train_prepd, y_train)

# Display
random_cv_res = pd.DataFrame(random_search.cv_results_)
random_cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)
display(random_cv_res.filter(regex = '(^param_|mean_test_score)', axis=1).head())
best_tree = random_search.best_estimator_
print(f'The total number of nodes is {best_tree.tree_.node_count} and the max depth is {best_tree.tree_.max_depth}.')

|   | param_ccp_alpha | mean_test_score |
|---|---|---|
| 1 | 0.082816 | -286121.511379 |
| 4 | 0.003197 | -286825.624171 |
| 6 | 0.002851 | -286833.248569 |
| 9 | 0.002012 | -286852.045711 |
| 5 | 0.000754 | -286873.446436 |


The total number of nodes is 29161 and the max depth is 45.


In [32]:
print(f'R-squared-score of the best fit tree was {r2_score(y_test, best_tree.predict(X_test_prepd))}.')

R-squared-score of the best fit tree was -0.8872151133242141.


In [34]:
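!pip install scikit-optimize
from skopt import BayesSearchCV
from skopt.space import Real

# Assumption: the KeyError below comes from passing the raw array of alphas (including the slightly
# negative ones) as the search space, so a continuous non-negative range is used here instead
search_space = {'ccp_alpha': Real(0.0, float(ccp_alphas.max()), prior='uniform')}

bayes_search = BayesSearchCV(DecisionTreeRegressor(random_state=42), search_space, n_iter=50, cv=3, scoring='neg_mean_squared_error', random_state=42)
bayes_search.fit(X_train_prepd, y_train)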



KeyError: ignored