# Ensemble Methods on AirBnB 🦾🦾

# Let's do some voting and stacking!

AdaBoost and XGBoost are two popular boosting algorithms. The goal of this exercise is to apply both to a prediction problem and to evaluate their performance against different base models. The dataset contains Airbnb listings in Seattle, and the goal is to predict the nightly price of each listing.
This exercise involves quite a lot of preprocessing, as well as some interesting exploratory analysis and visualization. Do not hesitate to deviate from the questions to explore the data further.

1. Let's import the usual libraries.

In [160]:
# Load in our libraries
import pandas as pd
import numpy as np
import re
import ast
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import plotly.figure_factory as ff

import warnings
warnings.filterwarnings('ignore')

from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.tree import DecisionTreeRegressor
# import ensemble methods
from sklearn.ensemble import AdaBoostRegressor, VotingRegressor, StackingRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import Ridge, Lasso, LinearRegression

2. Import the ```listings.csv``` dataset from s3 using the following link: https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+Supervis%C3%A9/Boosting/listings.csv

In [130]:
dataset = pd.read_csv("https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+Supervis%C3%A9/Boosting/listings.csv")
dataset.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,241032,https://www.airbnb.com/rooms/241032,20160104002432,2016-01-04,Stylish Queen Anne Apartment,,Make your self at home in this charming one-be...,Make your self at home in this charming one-be...,none,,...,10.0,f,,WASHINGTON,f,moderate,f,f,2,4.07
1,953595,https://www.airbnb.com/rooms/953595,20160104002432,2016-01-04,Bright & Airy Queen Anne Apartment,Chemically sensitive? We've removed the irrita...,"Beautiful, hypoallergenic apartment in an extr...",Chemically sensitive? We've removed the irrita...,none,"Queen Anne is a wonderful, truly functional vi...",...,10.0,f,,WASHINGTON,f,strict,t,t,6,1.48
2,3308979,https://www.airbnb.com/rooms/3308979,20160104002432,2016-01-04,New Modern House-Amazing water view,New modern house built in 2013. Spectacular s...,"Our house is modern, light and fresh with a wa...",New modern house built in 2013. Spectacular s...,none,Upper Queen Anne is a charming neighborhood fu...,...,10.0,f,,WASHINGTON,f,strict,f,f,2,1.15
3,7421966,https://www.airbnb.com/rooms/7421966,20160104002432,2016-01-04,Queen Anne Chateau,A charming apartment that sits atop Queen Anne...,,A charming apartment that sits atop Queen Anne...,none,,...,,f,,WASHINGTON,f,flexible,f,f,1,
4,278830,https://www.airbnb.com/rooms/278830,20160104002432,2016-01-04,Charming craftsman 3 bdm house,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,none,We are in the beautiful neighborhood of Queen ...,...,9.0,f,,WASHINGTON,f,strict,f,f,1,0.89


3. There are a lot of columns in this dataset. Display the dataset info.

In [131]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 92 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                3818 non-null   int64  
 1   listing_url                       3818 non-null   object 
 2   scrape_id                         3818 non-null   int64  
 3   last_scraped                      3818 non-null   object 
 4   name                              3818 non-null   object 
 5   summary                           3641 non-null   object 
 6   space                             3249 non-null   object 
 7   description                       3818 non-null   object 
 8   experiences_offered               3818 non-null   object 
 9   neighborhood_overview             2786 non-null   object 
 10  notes                             2212 non-null   object 
 11  transit                           2884 non-null   object 
 12  thumbn

4. Let's move on to some visualization. First, display the distribution of the price variable. You will have to preprocess it, as it is not stored in a numerical format.

In [132]:
dataset["price"] = dataset["price"].apply(lambda x: float(re.sub(r"[,$]","", x)))
px.histogram(dataset["price"])

5. The distribution of the target variable is right-skewed, with a long tail of high values. This is very common with prices: most items sit around the average price range, and the higher the price, the fewer items there are. A standard way of handling such variables is to change the scale with the log function so that the distribution becomes more symmetric.
Create a price_log variable equal to log(price). The solution below uses the base-10 logarithm.

In [133]:
dataset["price_log"] = np.log10(dataset["price"])
px.histogram(dataset["price_log"])

The distribution looks a lot better for prediction purposes after the log transformation!
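Because the target is now `log10(price)`, any prediction made on this scale has to be raised to the power of 10 to be read as a price again. The short sketch below illustrates the round trip; the prices are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical nightly prices in dollars, for illustration only
prices = np.array([35.0, 85.0, 100.0, 200.0, 1000.0])

# Same transform as in the notebook: base-10 logarithm
price_log = np.log10(prices)

# Inverse transform: back from the log scale to dollars
recovered = 10 ** price_log

print(price_log.round(3))
print(recovered.round(2))
```

The same inverse transform would apply to model predictions before reporting errors in dollars.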

6. Visualize the price against the following variables : 

- ```room type```
- ```beds```
- ```property type```

In [134]:
px.box(dataset, x = 'room_type', y = 'price_log')

In [135]:
px.box(dataset, x = 'beds', y = 'price_log')


In [136]:
px.box(dataset, x = 'property_type', y = 'price_log')

7. Isolate the target variable in an object y and the other variables in an object X

In [137]:
target_variable = "price_log"

X = dataset.drop([target_variable,"price"], axis=1)
y = dataset[target_variable]

8. We have to remove a number of variables that we do not know how to use at this point. Start by removing the variables that can be interpreted as an ```id```. We will also remove the variables that contain long texts, as we haven't learned about text processing yet.

We also have to remove all other variables related to price, such as ```monthly price```, because their direct link to the target variable creates a risk of data leakage.

Some variables contain a very high proportion of missing values. In some cases these missing values carry information we can exploit, in others they do not. Start by checking the proportion of missing values for all variables, then remove the ones that are not useful from the dataset.

Your dataset should only contain categorical and numerical variables after this step. Check if your final dataset contains the following variables :

```
Index(['host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_listings_count',
       'host_total_listings_count', 'host_has_profile_pic',
       'host_identity_verified', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bedrooms', 'beds', 'bed_type', 'security_deposit', 'cleaning_fee',
       'guests_included', 'extra_people', 'minimum_nights', 'maximum_nights',
       'has_availability', 'availability_30', 'availability_60',
       'availability_90', 'availability_365', 'number_of_reviews',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'requires_license', 'instant_bookable',
       'cancellation_policy', 'require_guest_profile_picture',
       'require_guest_phone_verification', 'calculated_host_listings_count',
       'reviews_per_month'],
      dtype='object')
```
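One possible way to check the proportion of missing values and to drop the sparsest columns is sketched below. The toy frame, its column names and the 50% threshold are assumptions made for illustration; the exercise does not prescribe a cut-off.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the listings data
toy = pd.DataFrame({
    "square_feet": [np.nan, np.nan, np.nan, 800.0],
    "beds": [1.0, 2.0, np.nan, 3.0],
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
})

# Share of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Assumed cut-off: drop columns with more than half of their values missing
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
toy_reduced = toy.drop(columns=to_drop)
```

Columns whose missing values are informative (for example a fee that is simply absent) are better kept and filled explicitly than dropped by a blanket threshold.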

In [138]:
features_list = [
       "host_since",
       "host_response_time",
       "host_response_rate",
       "host_acceptance_rate",
       "host_is_superhost",
       "host_listings_count",
       "host_total_listings_count",
       "host_verifications",
       "host_has_profile_pic",
       "host_identity_verified",
       "neighbourhood_group_cleansed",
       "latitude",
       "longitude",
       "is_location_exact",
       "property_type",
       "room_type",
       "accommodates",
       "bathrooms",
       "bedrooms",
       "beds",
       "bed_type",
       "security_deposit",
       "cleaning_fee",
       "guests_included",
       "extra_people",
       "minimum_nights",
       "maximum_nights",
       "has_availability",
       "availability_30",
       "availability_60",
       "availability_90",
       "availability_365",
       "number_of_reviews",
       "review_scores_rating",
       "review_scores_accuracy",
       "review_scores_cleanliness",
       "review_scores_checkin",
       "review_scores_communication",
       "review_scores_location",
       "review_scores_value",
       "requires_license",
       "instant_bookable",
       "cancellation_policy",
       "require_guest_profile_picture",
       "require_guest_phone_verification",
       "calculated_host_listings_count",
       "reviews_per_month"
]

X = X.loc[:, features_list]

9. Are there any remaining missing values? Is there a relevant way to replace them without using imputation methods? Are all the variables in a numerical format? If not, run some preprocessing to create a clean dataset.

In [139]:
X.isnull().sum()/X.shape[0]*100

host_since                           0.052383
host_response_time                  13.698271
host_response_rate                  13.698271
host_acceptance_rate                20.246202
host_is_superhost                    0.052383
host_listings_count                  0.052383
host_total_listings_count            0.052383
host_verifications                   0.052383
host_has_profile_pic                 0.052383
host_identity_verified               0.052383
neighbourhood_group_cleansed         0.000000
latitude                             0.000000
longitude                            0.000000
is_location_exact                    0.000000
property_type                        0.026192
room_type                            0.000000
accommodates                         0.000000
bathrooms                            0.419068
bedrooms                             0.157150
beds                                 0.026192
bed_type                             0.000000
security_deposit                  

In [140]:
# Strip both "$" and thousands separators so that values such as "$1,000.00" also parse
X["cleaning_fee"] = X["cleaning_fee"].apply(lambda x: float(re.sub(r"[,$]", "", x)) if type(x) == str else x)
X["cleaning_fee"] = X["cleaning_fee"].fillna(0)

X["host_response_rate"] = X["host_response_rate"].apply(lambda x: float(re.sub(r"[%]", "", x)) if type(x) == str else x)

X["host_acceptance_rate"] = X["host_acceptance_rate"].apply(lambda x: float(re.sub(r"[%]", "", x)) if type(x) == str else x)

X["security_deposit"] = X["security_deposit"].apply(lambda x: float(re.sub(r"[,$]","", str(x))))
X["security_deposit"] = X["security_deposit"].fillna(0)

X["host_has_profile_pic"] = X["host_has_profile_pic"].fillna("f")

X["host_identity_verified"] = X["host_identity_verified"].fillna("f")

X["is_location_exact"] = X["is_location_exact"].fillna("f")

X["host_response_time"] = X["host_response_time"].fillna("unknown")

X["host_is_superhost"] = X["host_is_superhost"].fillna("unknown")

X["property_type"] = X["property_type"].fillna("unknown")

X["extra_people"] = X["extra_people"].apply(lambda x: float(re.sub(r"[,$]","", x)))

X["host_verifications"] = X['host_verifications'].fillna("[]")
X["host_verifications"] = X["host_verifications"].apply(lambda x : x.replace("None","[]"))
X["host_verifications"] = X["host_verifications"].apply(lambda x: len(ast.literal_eval(x)))

X["host_since"] = (pd.Timestamp("today") - pd.to_datetime(X["host_since"], format= "%Y-%m-%d"))
X["host_since"] = X["host_since"].apply(lambda x: x.days)
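The `host_since` feature above is computed relative to `pd.Timestamp("today")`, so its values change every time the notebook is run. A reproducible variant, sketched below, uses a fixed reference date instead. Using the scrape date `2016-01-04` (the `last_scraped` value visible in the data preview) is an assumption, and the host dates are made up for illustration.

```python
import pandas as pd

# Hypothetical host_since values, including one missing entry
host_since = pd.Series(["2011-08-11", "2013-02-21", None])

# Assumed reference date: the scrape date shown in the last_scraped column
reference_date = pd.Timestamp("2016-01-04")

# Number of days each host had been registered at the reference date
host_days = (reference_date - pd.to_datetime(host_since, format="%Y-%m-%d")).dt.days

print(host_days)
```

A fixed reference only shifts every value by the same constant, so tree-based models and standardized features are unaffected; the benefit is reproducibility.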


10. Check that all variables that can be converted are in numerical format. Do not forget to check y as well.

In [141]:
X.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 47 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   host_since                        3816 non-null   float64
 1   host_response_time                3818 non-null   object 
 2   host_response_rate                3295 non-null   float64
 3   host_acceptance_rate              3045 non-null   float64
 4   host_is_superhost                 3818 non-null   object 
 5   host_listings_count               3816 non-null   float64
 6   host_total_listings_count         3816 non-null   float64
 7   host_verifications                3818 non-null   int64  
 8   host_has_profile_pic              3818 non-null   object 
 9   host_identity_verified            3818 non-null   object 
 10  neighbourhood_group_cleansed      3818 non-null   object 
 11  latitude                          3818 non-null   float64
 12  longit

11. Apply ```train_test_split``` to create the X_train, X_test, y_train and y_test objects (with random_state = 1).

In [142]:
X_train_unproc, X_test_unproc, y_train, y_test = train_test_split(X, y, test_size =0.2, random_state = 1)

12. Separate the variables into two groups, one for the numerical variables and one for the categorical variables, and apply the appropriate preprocessing to each group.

In [143]:
numeric_features = X.select_dtypes(exclude="object").columns
categorical_features = X.select_dtypes(include="object").columns

numeric_transformer = Pipeline(
    steps=[
        ('imputer', KNNImputer()),
        ('scaler', StandardScaler())
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy = 'most_frequent')),
        ('encoder', OneHotEncoder())
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

X_train = preprocessor.fit_transform(X_train_unproc)
X_test = preprocessor.transform(X_test_unproc)
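An alternative layout, sketched below under stated assumptions, chains the preprocessing and the model in a single `Pipeline`, so that cross-validation refits the imputers, scaler and encoder on each training fold. It also sets `handle_unknown="ignore"` so that a category that only appears in the test set is encoded as all zeros instead of raising an error. The toy data, the median imputer and the `Ridge` model are illustrative choices, not the notebook's solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical feature
train = pd.DataFrame({
    "beds": [1.0, 2.0, np.nan, 4.0],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Entire home/apt"],
})
y_toy = np.array([1.7, 2.0, 1.8, 2.3])

# The test row contains a category never seen during fit
test = pd.DataFrame({"beds": [2.0], "room_type": ["Shared room"]})

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["beds"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
])

# Preprocessing and model are fitted together on the training data only
model = Pipeline([("preprocess", preprocess), ("regressor", Ridge(alpha=1.0))])
model.fit(train, y_toy)

pred = model.predict(test)
encoded = model.named_steps["preprocess"].transform(test)
print(pred, encoded.shape)
```

With this layout, a `GridSearchCV` can be run directly on `model`, addressing hyperparameters with the `regressor__` prefix.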

13. What score would you expect for a model that would always predict the average price?
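A model that always predicts the training mean has an R2 of exactly 0 on the training set, since R2 compares the model's squared error with that of the mean. On the test set its R2 is 0 or slightly negative, because the training mean generally differs a little from the test mean. The sketch below checks this with scikit-learn's `DummyRegressor` on synthetic data; the data is generated for illustration and is unrelated to the listings.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Baseline that ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)

r2_train = baseline.score(X_tr, y_tr)
r2_test = baseline.score(X_te, y_te)
print(r2_train, r2_test)
```

Any model worth keeping should therefore score clearly above 0 on the test set.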

14. Train an AdaBoost model with all its default parameters. What is the score?


In [146]:
adaboost_regressor = AdaBoostRegressor()
adaboost_regressor.fit(X_train, y_train)

print(f"R2 adaboost_regressor train : {adaboost_regressor.score(X_train, y_train)}")
print(f"R2 adaboost_regressor test : {adaboost_regressor.score(X_test, y_test)}")

R2 adaboost_regressor train : 0.6649173759909199
R2 adaboost_regressor test : 0.6626111041103181


15. Train an XGBoost model with all its default parameters except max_depth=3 (the same depth as AdaBoost's default base tree). What is the score?

In [148]:
xgb_regressor = XGBRegressor(max_depth=3)
xgb_regressor.fit(X_train, y_train)

print(f"R2 xgb_regressor train : {xgb_regressor.score(X_train, y_train)}")
print(f"R2 xgb_regressor test : {xgb_regressor.score(X_test, y_test)}")

R2 xgb_regressor train : 0.8655358199848072
R2 xgb_regressor test : 0.745687453825888


16. AdaBoost does not perform as well as XGBoost here, but it does not overfit the data as much. Try to improve it by tuning its ```learning_rate``` and ```n_estimators``` parameters with a grid search.

In [152]:
adaboost_regressor = AdaBoostRegressor()

params = {
    'n_estimators':[150, 200, 250, 300],
    "learning_rate":[1.0, 1.5, 2.0, 2.5]
}

gridsearch_ab_reg = GridSearchCV(adaboost_regressor, param_grid = params, cv = 3)
gridsearch_ab_reg.fit(X_train, y_train)

print("Best hyperparameters : ", gridsearch_ab_reg.best_params_)
print("Best validation R2 : ", gridsearch_ab_reg.best_score_)
print(f"R2 adaboost_regressor train : {gridsearch_ab_reg.score(X_train, y_train)}")
print(f"R2 adaboost_regressor test : {gridsearch_ab_reg.score(X_test, y_test)}")

Best hyperparameters :  {'learning_rate': 2.5, 'n_estimators': 250}
Best validation R2 :  0.6493485229184605
R2 adaboost_regressor train : 0.6841427644034382
R2 adaboost_regressor test : 0.6615840479056627


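We do not seem to be able to reach XGBoost's performance with AdaBoost in this case: the grid search leaves the test R2 essentially unchanged (about 0.66). Note that the best ```learning_rate``` (2.5) lies at the upper edge of the grid.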

17. Let's now run a sanity check to make sure that Adaboost and XGBoost actually improved the performance of their base models which are regression trees in this case. Train a regression tree model with max_depth = 3 (the default for Adaboost)

In [154]:
decision_tree_regressor = DecisionTreeRegressor(max_depth=3)
decision_tree_regressor.fit(X_train, y_train)

print(f"R2 decision_tree_regressor train {decision_tree_regressor.score(X_train, y_train)}")
print(f"R2 decision_tree_regressor test {decision_tree_regressor.score(X_test, y_test)}")

R2 decision_tree_regressor train 0.5883192169975924
R2 decision_tree_regressor test 0.6034582067623893


We conclude that both boosting algorithms have done their job: each improved test-set performance compared to the base model (test R2 of about 0.66 for AdaBoost and 0.75 for XGBoost, against 0.60 for the single tree). XGBoost performs better here, despite a larger gap between its train and test scores, which indicates more overfitting.

18. Train three independent models separately, then combine them in a voting ensemble. Do you get better results?

In [158]:
ridge = Ridge()

params = {
    'alpha': [0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0]
}

gridsearch_ridge = GridSearchCV(ridge, param_grid = params, cv = 3)
gridsearch_ridge.fit(X_train, y_train)

print("Best hyperparameters : ", gridsearch_ridge.best_params_)
print("Best validation R2 : ", gridsearch_ridge.best_score_)
print("R2 on training set : ", gridsearch_ridge.score(X_train, y_train))
print("R2 on test set : ", gridsearch_ridge.score(X_test, y_test))

Best hyperparameters :  {'alpha': 50.0}
Best validation R2 :  0.6764345759884707
R2 on training set :  0.704088550877146
R2 on test set :  0.6895178324940459


In [159]:
dt = DecisionTreeRegressor()

params = {
    'max_depth': [1, 2, 3], 
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [2, 3, 4]
}

dt_opt = GridSearchCV(dt, param_grid = params, cv = 3)
dt_opt.fit(X_train, y_train)

print("Best hyperparameters : ", dt_opt.best_params_)
print("Best validation R2 : ", dt_opt.best_score_)
print("R2 on training set : ", dt_opt.score(X_train, y_train))
print("R2 on test set : ", dt_opt.score(X_test, y_test))

Best hyperparameters :  {'max_depth': 3, 'min_samples_leaf': 3, 'min_samples_split': 2}
Best validation R2 :  0.5722125753256275
R2 on training set :  0.5883192169975924
R2 on test set :  0.6034582067623895


In [161]:
svm = SVR(kernel = 'rbf')

params = {
    'C': [0.1, 1.0, 10.0],
    'gamma': [0.1, 1.0, 10.0]
}

svm_opt = GridSearchCV(svm, param_grid = params, cv = 3)
svm_opt.fit(X_train, y_train)

print("Best hyperparameters : ", svm_opt.best_params_)
print("Best validation R2 : ", svm_opt.best_score_)
print("R2 on training set : ", svm_opt.score(X_train, y_train))
print("R2 on test set : ", svm_opt.score(X_test, y_test))

Best hyperparameters :  {'C': 1.0, 'gamma': 0.1}
Best validation R2 :  0.5066096405622634
R2 on training set :  0.8931489797650991
R2 on test set :  0.5298619542742087


In [162]:
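# Note: ridge, dt and svm are the untuned estimators created above,
# not the best estimators found by the grid searches.
voting = VotingRegressor(estimators=[("linear", ridge), ("tree", dt), ("svm", svm)])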
voting.fit(X_train, y_train)

print("R2 on training set : ", voting.score(X_train, y_train))
print("R2 on test set : ", voting.score(X_test, y_test))

R2 on training set :  0.917298147133873
R2 on test set :  0.7068546258424546
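The voting cell above passes the `ridge`, `dt` and `svm` objects as they were created before their grid searches, so the ensemble uses default hyperparameters. A variant that reuses the tuned models through `best_estimator_` is sketched below on synthetic data. The data, the reduced grids and the two-model ensemble are assumptions made to keep the example short. The sketch also verifies that a `VotingRegressor` prediction is the plain average of its members' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Tune each model separately, as in the cells above (smaller grids here)
ridge_search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=3).fit(X_demo, y_demo)
tree_search = GridSearchCV(
    DecisionTreeRegressor(random_state=0), {"max_depth": [1, 2, 3]}, cv=3
).fit(X_demo, y_demo)

# Build the ensemble from the tuned estimators rather than the untuned ones
voting_tuned = VotingRegressor(estimators=[
    ("linear", ridge_search.best_estimator_),
    ("tree", tree_search.best_estimator_),
])
voting_tuned.fit(X_demo, y_demo)

# With default (uniform) weights, the ensemble averages its members
individual = np.column_stack([est.predict(X_demo) for est in voting_tuned.estimators_])
averaged = individual.mean(axis=1)
```

`VotingRegressor` clones and refits each estimator, so the tuned hyperparameters are kept while the fitted state is rebuilt on the data passed to `fit`.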


19. Try a stacking method and conclude about the best model.

In [163]:
# The ridge model is a linear regression, so it is named "linear" here
stacking = StackingRegressor(estimators = [("linear", ridge), ("tree", dt), ("svm", svm)], cv = 3)
preds = stacking.fit_transform(X_train, y_train)

predictions = pd.DataFrame(preds, columns=stacking.named_estimators_.keys())

display(predictions)

print("R2 on training set : ", stacking.score(X_train, y_train))
print("R2 on test set : ", stacking.score(X_test, y_test))

Unnamed: 0,linear,tree,svm
0,1.983554,2.301030,2.202851
1,1.861308,1.544068,1.810400
2,1.917328,1.875061,1.975203
3,1.860830,1.845098,1.917997
4,1.938455,1.954243,1.931226
...,...,...,...
3049,2.414935,2.260071,2.283646
3050,1.981438,1.977724,1.877715
3051,2.058526,2.079181,2.058688
3052,2.000495,1.954243,2.014449


R2 on training set :  0.8814413960399532
R2 on test set :  0.7226925603610834


In [164]:
# plotly.figure_factory is already imported as ff in the first cell
corr_matrix = predictions.corr().round(2)

fig = ff.create_annotated_heatmap(corr_matrix.values,
                                  x = corr_matrix.columns.tolist(),
                                  y = corr_matrix.index.tolist())


fig.show()

**The tree model's predictions are highly correlated with those of the other models. Let's drop it and see whether performance improves.**

In [165]:
stacking = StackingRegressor(estimators = [("linear", ridge), ("svm", svm)], cv = 3)
preds = stacking.fit_transform(X_train, y_train)

predictions = pd.DataFrame(preds, columns=stacking.named_estimators_.keys())

display(predictions)

print("R2 on training set : ", stacking.score(X_train, y_train))
print("R2 on test set : ", stacking.score(X_test, y_test))

Unnamed: 0,linear,svm
0,1.983554,2.202851
1,1.861308,1.810400
2,1.917328,1.975203
3,1.860830,1.917997
4,1.938455,1.931226
...,...,...
3049,2.414935,2.283646
3050,1.981438,1.877715
3051,2.058526,2.058688
3052,2.000495,2.014449


R2 on training set :  0.8270807190174594
R2 on test set :  0.7175718960148018
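
To conclude, the test-set R2 values printed in the cells above can be gathered in one place and ranked. The sketch below copies those values by hand (rounded to four decimals); the labels are chosen here for readability and are not names used elsewhere in the notebook.

```python
import pandas as pd

# Test-set R2 values copied from the cell outputs above, rounded to 4 decimals
test_r2 = pd.Series({
    "AdaBoost (default)": 0.6626,
    "XGBoost (max_depth=3)": 0.7457,
    "AdaBoost (grid search)": 0.6616,
    "Decision tree (max_depth=3)": 0.6035,
    "Ridge (grid search)": 0.6895,
    "SVR (grid search)": 0.5299,
    "Voting (ridge, tree, svm)": 0.7069,
    "Stacking (ridge, tree, svm)": 0.7227,
    "Stacking (ridge, svm)": 0.7176,
}, name="test_r2").sort_values(ascending=False)

print(test_r2)

best_model = test_r2.index[0]
```

On this single train/test split, XGBoost has the highest test R2, followed by the two stacking models and the voting model. Dropping the tree from the stack slightly lowered the test score (0.7176 against 0.7227) rather than improving it. These gaps are small and come from one split, so they should be read as indicative rather than conclusive.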
