## 6.10 Homework

The goal of this homework is to create a tree-based regression model for predicting apartment prices (column `'price'`).

In this homework we'll again use the New York City Airbnb Open Data dataset - the same one we used in homework 2 and 3.

You can take it from [Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up for Kaggle.

Let's load the data:

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
columns = [
    'neighbourhood_group', 'room_type', 'latitude', 'longitude',
    'minimum_nights', 'number_of_reviews','reviews_per_month',
    'calculated_host_listings_count', 'availability_365',
    'price'
]
columns_without_price = columns.copy()
columns_without_price.remove('price')
df = pd.read_csv('data.csv', usecols=columns)
df.reviews_per_month = df.reviews_per_month.fillna(0)

In [3]:
df.isnull().sum()

neighbourhood_group               0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64

* Apply the log transform to `price`
* Do train/validation/test split with 60%/20%/20% distribution. 
* Use the `train_test_split` function and set the `random_state` parameter to 1
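
A quick check of the arithmetic behind these steps may help. `train_test_split` takes one fraction at a time, so a 60%/20%/20% split needs two calls: first hold out 20% for test, then hold out 0.25 of the remaining 80% for validation (0.25 × 0.8 = 0.2). The sketch below also shows that `np.log1p` is undone by `np.expm1`, which is what turns predictions back into prices. The sample prices are made up for illustration.

```python
import numpy as np

# Made-up nightly prices (illustration only, not taken from the dataset)
prices = np.array([0.0, 50.0, 150.0, 1000.0])

# log1p computes log(1 + x), so a price of 0 maps to 0 instead of -inf
log_prices = np.log1p(prices)

# expm1 computes exp(x) - 1 and inverts log1p
restored = np.expm1(log_prices)

# Two-step split: 20% for test, then 25% of the remaining 80% for validation
test_share = 0.2
val_share = (1 - test_share) * 0.25
train_share = 1 - test_share - val_share

print(np.allclose(restored, prices))
print(train_share, val_share, test_share)
```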

In [4]:
df['price'] = np.log1p(df.price.values)

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

In [7]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.price.values
del df_train['price']

y_val = df_val.price.values
del df_val['price']

y_test = df_test.price.values
del df_test['price']

Now, use `DictVectorizer` to turn train and validation into matrices:

In [8]:
from sklearn.feature_extraction import DictVectorizer

In [9]:
dv = DictVectorizer(sparse=False)

train_dict = df_train[columns_without_price].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val[columns_without_price].to_dict(orient='records')
X_val = dv.transform(val_dict)

## Question 1

Let's train a decision tree regressor to predict the price variable. 

* Train a model with `max_depth=1`

In [10]:
from sklearn.tree import DecisionTreeRegressor  # regression task, so the regressor rather than the classifier

In [None]:
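from sklearn.tree import DecisionTreeRegressor, export_text

# A depth-1 tree makes a single split, so it shows which one feature splits the data best
model = DecisionTreeRegressor(max_depth=1)
model.fit(X_train, y_train)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature = list(dv.get_feature_names_out())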

In [25]:
print(export_text(model, feature_names = feature))

|--- room_type=Entire home/apt <= 0.50
|   |--- value: [4.29]
|--- room_type=Entire home/apt >  0.50
|   |--- value: [5.15]
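
Reading the tree: `room_type=Entire home/apt` is a one-hot column, so `<= 0.50` means "not an entire home/apt" and `> 0.50` means "entire home/apt". Each leaf value is the mean `log1p(price)` of the training rows in that leaf, so `np.expm1` puts it back on the price scale. A small sketch using the two leaf values printed above (the result is approximate because the printed values are rounded, and it is not the arithmetic mean price of the leaf):

```python
import numpy as np

# Leaf values copied from the printed tree (on the log1p(price) scale)
leaf_other = 4.29        # room_type=Entire home/apt <= 0.50
leaf_entire_home = 5.15  # room_type=Entire home/apt >  0.50

# Invert the log1p transform to get back to the price scale
price_other = np.expm1(leaf_other)
price_entire_home = np.expm1(leaf_entire_home)

print(round(price_other), round(price_entire_home))
```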



Which feature is used for splitting the data?

* `room_type`
* `neighbourhood_group`
* `number_of_reviews`
* `reviews_per_month`

## Question 2

Train a random forest model with these parameters:

* `n_estimators=10`
* `random_state=1`
* `n_jobs=-1`  (optional - to make training faster)

In [12]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

model = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
rmse = mean_squared_error(y_val, y_pred)** 0.5
rmse

0.4615632303514057
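
For reference, `mean_squared_error(...) ** 0.5` is the root mean squared error: the square root of the average squared difference between targets and predictions. The same quantity written out with NumPy, on made-up numbers (`rmse` here is an illustrative helper name):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up log-prices and predictions (illustration only)
y_true_demo = [4.0, 5.0, 6.0]
y_pred_demo = [4.0, 5.0, 9.0]

print(rmse(y_true_demo, y_pred_demo))
```

Because `y_val` holds `log1p(price)`, the RMSE in this homework is measured on the log scale rather than in price units.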

What's the RMSE of this model on validation?

* 0.059
* 0.259
* 0.459
* 0.659

## Question 3

Now let's experiment with the `n_estimators` parameter:

* Try different values of this parameter from 10 to 200 with step 10
* Set `random_state` to `1`
* Evaluate the model on the validation dataset

In [13]:
for i in range(10, 201, 10):
    model = RandomForestRegressor(n_estimators = i, random_state=1, n_jobs = -1)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    rmse = mean_squared_error(y_val, y_pred) ** 0.5 
    print("For estimators %s, RMSE is %.3f" % (i, rmse))

For estimators 10, RMSE is 0.462
For estimators 20, RMSE is 0.448
For estimators 30, RMSE is 0.446
For estimators 40, RMSE is 0.444
For estimators 50, RMSE is 0.442
For estimators 60, RMSE is 0.442
For estimators 70, RMSE is 0.441
For estimators 80, RMSE is 0.441
For estimators 90, RMSE is 0.441
For estimators 100, RMSE is 0.440
For estimators 110, RMSE is 0.439
For estimators 120, RMSE is 0.439
For estimators 130, RMSE is 0.439
For estimators 140, RMSE is 0.439
For estimators 150, RMSE is 0.439
For estimators 160, RMSE is 0.439
For estimators 170, RMSE is 0.439
For estimators 180, RMSE is 0.439
For estimators 190, RMSE is 0.439
For estimators 200, RMSE is 0.439
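
One way to read a log like this without scanning it by eye is to find the first `n_estimators` at which the RMSE reaches its minimum. The sketch below does that for the rounded values printed above. `first_n_at_minimum` is a hypothetical helper, and since the values are rounded to three decimals, the unrounded scores could still change slightly past that point.

```python
# (n_estimators, RMSE) pairs copied from the printed output above
scores = [
    (10, 0.462), (20, 0.448), (30, 0.446), (40, 0.444), (50, 0.442),
    (60, 0.442), (70, 0.441), (80, 0.441), (90, 0.441), (100, 0.440),
    (110, 0.439), (120, 0.439), (130, 0.439), (140, 0.439), (150, 0.439),
    (160, 0.439), (170, 0.439), (180, 0.439), (190, 0.439), (200, 0.439),
]

def first_n_at_minimum(pairs):
    """Return the smallest n_estimators whose RMSE equals the lowest RMSE seen."""
    best = min(rmse for _, rmse in pairs)
    return next(n for n, rmse in pairs if rmse == best)

print(first_n_at_minimum(scores))
```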


After which value of `n_estimators` does RMSE stop improving?

- 10
- 50
- 70
- 120

## Question 4

Let's select the best `max_depth`:

* Try different values of `max_depth`: `[10, 15, 20, 25]`
* For each of these values, try different values of `n_estimators` from 10 to 200 (with step 10)
* Fix the random seed: `random_state=1`

In [14]:
for depth in [10,15,20,25]:
    print('For max_depth %s \n' % depth)
    for i in range(10, 201, 10):
        model = RandomForestRegressor(n_estimators = i, random_state=1, n_jobs = -1, max_depth = depth)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)
        rmse = mean_squared_error(y_val, y_pred) ** 0.5 
        print("For estimators %s, RMSE is %.3f" % (i, rmse))

For max_depth 10 

For estimators 10, RMSE is 0.446
For estimators 20, RMSE is 0.442
For estimators 30, RMSE is 0.441
For estimators 40, RMSE is 0.441
For estimators 50, RMSE is 0.441
For estimators 60, RMSE is 0.441
For estimators 70, RMSE is 0.441
For estimators 80, RMSE is 0.441
For estimators 90, RMSE is 0.440
For estimators 100, RMSE is 0.440
For estimators 110, RMSE is 0.440
For estimators 120, RMSE is 0.440
For estimators 130, RMSE is 0.440
For estimators 140, RMSE is 0.440
For estimators 150, RMSE is 0.440
For estimators 160, RMSE is 0.440
For estimators 170, RMSE is 0.440
For estimators 180, RMSE is 0.440
For estimators 190, RMSE is 0.440
For estimators 200, RMSE is 0.440
For max_depth 15 

For estimators 10, RMSE is 0.450
For estimators 20, RMSE is 0.441
For estimators 30, RMSE is 0.440
For estimators 40, RMSE is 0.439
For estimators 50, RMSE is 0.438
For estimators 60, RMSE is 0.438
For estimators 70, RMSE is 0.437
For estimators 80, RMSE is 0.437
For estimators 90, RMSE is 

What's the best `max_depth`?

* 10
* 15
* 20
* 25
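
With two parameters in play, it can be easier to store the scores than to print them. A minimal sketch on synthetic data (`make_regression` stands in for the Airbnb matrices, and the small grids are chosen only to keep it fast):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the homework dataset)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=1)

scores = {}
for depth in [2, 5]:
    for n in [5, 10]:
        rf = RandomForestRegressor(n_estimators=n, max_depth=depth, random_state=1)
        rf.fit(X_tr, y_tr)
        rmse = mean_squared_error(y_va, rf.predict(X_va)) ** 0.5
        scores[(depth, n)] = rmse

# The (max_depth, n_estimators) pair with the lowest validation RMSE
best_params = min(scores, key=scores.get)
print(best_params, round(scores[best_params], 3))
```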

Bonus question (not graded):

Will the answer be different if we change the seed for the model?

In [15]:
model1 = RandomForestRegressor(n_estimators = 10, random_state=1, n_jobs = -1, max_depth= 20)
model1.fit(X_train, y_train)
# Predict with model1, the forest fitted in this cell
y_pred = model1.predict(X_val)
rmse = mean_squared_error(y_val, y_pred) ** 0.5 
print(rmse)


In [16]:
model1 = RandomForestRegressor(n_estimators = 10, random_state=5, n_jobs = -1, max_depth= 20)
model1.fit(X_train, y_train)
# Predict with model1, the forest fitted in this cell
y_pred = model1.predict(X_val)
rmse = mean_squared_error(y_val, y_pred) ** 0.5 
print(rmse)
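
To compare seeds, each forest has to be scored with its own predictions. A minimal sketch on synthetic data (`make_regression` as a stand-in for the homework matrices, with arbitrarily chosen seed values), collecting one validation RMSE per seed:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the homework dataset)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=1)

rmse_by_seed = {}
for seed in [1, 5, 42]:
    rf = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=seed)
    rf.fit(X_tr, y_tr)
    # Predict with the forest that was just fitted for this seed
    y_hat = rf.predict(X_va)
    rmse_by_seed[seed] = mean_squared_error(y_va, y_hat) ** 0.5

# Difference between the worst and the best seed
spread = max(rmse_by_seed.values()) - min(rmse_by_seed.values())
print(rmse_by_seed)
print(spread)
```

If the spread across seeds is comparable to the gap between two `max_depth` settings, the ranking of those settings may not be stable.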


## Question 5

We can extract feature importance information from tree-based models. 

At each step of the decision tree learning algorithm, it finds the best split. 
When doing so, we can calculate the "gain": the reduction in impurity from before the split to after it. 
This gain is quite useful for understanding which features are important 
for tree-based models.
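
As a worked example of that gain, the sketch below scores one candidate split by hand, using variance as the impurity (the squared-error criterion for regression trees). The six target values and the binary feature are invented for illustration.

```python
import numpy as np

# Invented targets and a binary feature (e.g. a one-hot column)
y = np.array([4.0, 4.2, 4.4, 5.0, 5.2, 5.4])
x = np.array([0, 0, 0, 1, 1, 1])

left, right = y[x == 0], y[x == 1]

# Impurity before the split: variance of all targets in the node
impurity_before = y.var()

# Impurity after the split: size-weighted average of the child variances
impurity_after = (len(left) * left.var() + len(right) * right.var()) / len(y)

gain = impurity_before - impurity_after
print(round(gain, 4))
```

A split that separates low targets from high targets cleanly, as here, removes most of the variance and therefore gets a large gain.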

In Scikit-Learn, tree-based models contain this information in the `feature_importances_` field. 

For this homework question, we'll find the most important feature:

* Train the model with these parameters:
    * `n_estimators=10`,
    * `max_depth=20`,
    * `random_state=1`,
    * `n_jobs=-1` (optional)
* Get the feature importance information from this model

In [17]:
model = RandomForestRegressor(n_estimators = 10, max_depth = 20, random_state=1, n_jobs = -1)
model.fit(X_train, y_train)
y_pred = model.predict(X_val)

In [None]:
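# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
d = {'feature': dv.get_feature_names_out(), 'values': model.feature_importances_}
feature_info_values = pd.DataFrame(data = d)
# Most important features first
feature_info_values.sort_values('values', ascending = False)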

What's the most important feature? 

* `neighbourhood_group=Manhattan`
* `room_type=Entire home/apt`
* `longitude`
* `latitude`

## Question 6

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter.

* Install XGBoost
* Create DMatrix for train and validation
* Create a watchlist
* Train a model with these parameters for 100 rounds:

```
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
```

Now change `eta` first to `0.1` and then to `0.01`

In [19]:
import xgboost as xgb

In [20]:
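# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
features = list(dv.get_feature_names_out())
dtrain = xgb.DMatrix(X_train, label = y_train, feature_names = features)
dval = xgb.DMatrix(X_val, label = y_val, feature_names = features)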



In [21]:
watchlist = [(dtrain, 'train'), (dval, 'val')]
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,

    'objective': 'reg:squarederror',
    'nthread': 8,

    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round = 100)
y_pred = model.predict(dval)
rmse = mean_squared_error(y_pred,y_val) ** 0.5
print(rmse)

0.43621034591295677


In [22]:
xgb_params = {
    'eta': 0.1, 
    'max_depth': 6,
    'min_child_weight': 1,

    'objective': 'reg:squarederror',
    'nthread': 8,

    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round = 100)
y_pred = model.predict(dval)
rmse = mean_squared_error(y_pred,y_val) ** 0.5
print(rmse)

0.43249655247991464


In [23]:
xgb_params = {
    'eta': 0.01, 
    'max_depth': 6,
    'min_child_weight': 1,

    'objective': 'reg:squarederror',
    'nthread': 8,

    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round = 100)
y_pred = model.predict(dval)
rmse = mean_squared_error(y_pred,y_val) ** 0.5
print(rmse)

1.630452438951798
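
A rough way to see why a small `eta` can score so badly after only 100 rounds: `eta` is the learning rate, so each new tree's contribution is scaled by it. Under the idealized assumption that every tree fits the current residual perfectly, a fraction `(1 - eta)` of the gap between the starting prediction and the target survives each round. This is a back-of-the-envelope sketch with a hypothetical helper, not how XGBoost computes anything internally.

```python
def remaining_gap(eta, rounds):
    """Fraction of the initial residual left after `rounds` boosting rounds,
    assuming each tree fits the residual exactly and is scaled by `eta`."""
    return (1 - eta) ** rounds

# Same eta values and number of rounds as in the cells above
for eta in [0.3, 0.1, 0.01]:
    print(eta, remaining_gap(eta, 100))
```

With `eta=0.01`, about 37% of the initial gap is still left after 100 rounds under this assumption, so the model is likely underfit rather than inherently worse. More boosting rounds would be needed for a fair comparison.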


Which eta leads to the best RMSE score on the validation dataset?

* 0.3
* 0.1
* 0.01

## Submit the results


Submit your results here: https://forms.gle/wQgFkYE6CtdDed4w8

It's possible that your answers won't match exactly. If that's the case, select the closest one.


## Deadline


The deadline for submitting is 20 October 2021, 17:00 CET (Wednesday). After that, the form will be closed.

