# Homework 6: Decision Trees and Ensemble Learning

### Dataset

In this homework, we continue using the fuel efficiency dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv).

The goal of this homework is to create a regression model for predicting a car's fuel efficiency (column `fuel_efficiency_mpg`).

### Preparing the dataset

Preparation:

* Fill missing values with zeros.
* Do train/validation/test split with 60%/20%/20% distribution.
* Use the `train_test_split` function and set the random_state parameter to 1.
* Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices.
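
A 60%/20%/20% split takes two calls to `train_test_split`: hold out 20% for testing, then take 25% of the remaining 80% for validation (0.25 × 0.8 = 0.2). The sketch below checks this arithmetic on a stand-in array of 1000 row indices; the array and its size are assumptions for illustration, not the homework dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1000)  # hypothetical stand-in for the dataframe rows

# Step 1: hold out 20% of all rows for the test set.
full_train, test = train_test_split(rows, test_size=0.2, random_state=1)

# Step 2: 25% of the remaining 80% equals 20% of the original data.
train, val = train_test_split(full_train, test_size=0.25, random_state=1)

print(len(train), len(val), len(test))
```

The three parts are disjoint, so every row ends up in exactly one of them.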


In [21]:
import pandas as pd

DATESET_NAME = 'car_fuel_efficiency.csv'

In [22]:
from pathlib import Path

import requests


DATASET_URL = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/'

def fetch():
    resp = requests.get(
        f'{DATASET_URL}{DATESET_NAME}',
        allow_redirects=False,
        timeout=10,
    )

    resp.raise_for_status()

    with open(DATESET_NAME, 'w') as f:
        f.write(resp.text)

if not Path(DATESET_NAME).exists():
    fetch()

In [23]:
df = pd.read_csv(DATESET_NAME)
df.head()

| | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 170 | 3.0 | 159.0 | 3413.433759 | 17.7 | 2003 | Europe | Gasoline | All-wheel drive | 0.0 | 13.231729 |
| 1 | 130 | 5.0 | 97.0 | 3149.664934 | 17.8 | 2007 | USA | Gasoline | Front-wheel drive | 0.0 | 13.688217 |
| 2 | 170 | NaN | 78.0 | 3079.038997 | 15.1 | 2018 | Europe | Gasoline | Front-wheel drive | 0.0 | 14.246341 |
| 3 | 220 | 4.0 | NaN | 2542.392402 | 20.2 | 2009 | USA | Diesel | All-wheel drive | 2.0 | 16.912736 |
| 4 | 210 | 1.0 | 140.0 | 3460.87099 | 14.4 | 2009 | Europe | Gasoline | All-wheel drive | 2.0 | 12.488369 |


In [24]:
from sklearn.model_selection import train_test_split


df = df.fillna(0)

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.fuel_efficiency_mpg.values
y_val = df_val.fuel_efficiency_mpg.values
y_test = df_test.fuel_efficiency_mpg.values

del df_train['fuel_efficiency_mpg']
del df_val['fuel_efficiency_mpg']
del df_test['fuel_efficiency_mpg']


In [25]:
from sklearn.feature_extraction import DictVectorizer

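# Dense output is used here; the assignment text suggests sparse=True.
dv = DictVectorizer(sparse=False)

# Fit the vectorizer on the training data only.
train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

# Reuse the fitted vectorizer so validation and test share the training columns.
val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

test_dict = df_test.to_dict(orient='records')
X_test = dv.transform(test_dict)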


## Question 1

Let's train a decision tree regressor to predict the `fuel_efficiency_mpg` variable.

Train a model with `max_depth=1`.
Which feature is used for splitting the data?

* `vehicle_weight`
* `model_year`
* `origin`
* `fuel_type`

In [26]:
from sklearn.tree import DecisionTreeRegressor, export_text
import pprint

dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)

pprint.pprint(export_text(dt, feature_names=list(dv.get_feature_names_out()), show_weights=True))

('|--- vehicle_weight <= 3022.11\n'
 '|   |--- value: [16.88]\n'
 '|--- vehicle_weight >  3022.11\n'
 '|   |--- value: [12.94]\n')
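
Besides reading the text export, the split feature can be taken programmatically from the fitted tree: `tree_.feature[0]` holds the column index used at the root node. The sketch below uses synthetic data in which the target depends only on the first column; the data and the names `stump` and `names` are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
weight = rng.uniform(2000, 4000, size=200)   # synthetic vehicle weights
year = rng.integers(2000, 2020, size=200)    # synthetic model years
X_demo = np.column_stack([weight, year])
y_demo = 30 - 0.005 * weight                 # target depends only on weight

names = ["vehicle_weight", "model_year"]
stump = DecisionTreeRegressor(max_depth=1, random_state=1).fit(X_demo, y_demo)

# Node 0 is the root; `feature` stores the column index it splits on.
root_feature = names[stump.tree_.feature[0]]
print(root_feature, stump.tree_.threshold[0])
```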


## Question 2

Train a random forest regressor with these parameters:

* `n_estimators=10`
* `random_state=1`
* `n_jobs=-1` (optional - to make training faster)

What's the RMSE of this model on the validation data?

* 0.045
* 0.45
* 4.5
* 45.0

In [27]:
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_val)

# RMSE is the square root of the mean squared error.
np.sqrt(mean_squared_error(y_val, y_pred))

np.float64(0.4599777557336148)
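
For reference, RMSE is the square root of the mean of the squared prediction errors. The sketch below computes it by hand and through `mean_squared_error` on three made-up values (hypothetical numbers, not model output) to show that the two agree.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions for illustration.
y_true = np.array([13.0, 15.0, 17.0])
y_hat = np.array([13.5, 14.0, 17.5])

# Errors are -0.5, 1.0, -0.5, so the mean squared error is 0.5.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```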

## Question 3

Now let's experiment with the `n_estimators` parameter.

Try different values of this parameter from 10 to 200 with step 10.

Set `random_state` to 1.

Evaluate the model on the validation dataset.

After which value of `n_estimators` does RMSE stop improving? Consider 3 decimal places for calculating the answer.

* 10
* 25
* 80
* 200

If it doesn't stop improving, use the last iteration number in your answer.

In [28]:
prev = None
for n_estimators in range(10, 201, 10):
    rf = RandomForestRegressor(n_estimators=n_estimators, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)

    y_pred = rf.predict(X_val)

    # Compare RMSE values rounded to 3 decimal places, as the question asks.
    rmse = round(np.sqrt(mean_squared_error(y_val, y_pred)), 3)
    if prev is None:
        prev = rmse
        continue

    if rmse < prev:
        prev = rmse
    else:
        # First n_estimators value at which the rounded RMSE fails to improve.
        print(rmse, n_estimators)
        break

0.445 70
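
The loop above reports the first value at which the rounded RMSE fails to improve; the value *after which* improvement stops is the one just before it. A small helper can return both. This is a sketch: the function name `stop_point` and the scores passed to it are hypothetical, chosen only to exercise the logic.

```python
def stop_point(scores, ndigits=3):
    """Return (last improving n, first non-improving n) for (n, rmse) pairs.

    The second element is None if the rounded RMSE improves at every step.
    """
    best_n, best_rmse = None, None
    for n, rmse in scores:
        rounded = round(rmse, ndigits)
        if best_rmse is not None and rounded >= best_rmse:
            return best_n, n
        best_n, best_rmse = n, rounded
    return best_n, None


# Hypothetical scores, not results from the homework dataset.
demo_scores = [(10, 0.4601), (20, 0.4552), (30, 0.4519), (40, 0.4521), (50, 0.4490)]
print(stop_point(demo_scores))
```

Here the rounded RMSE at 40 equals the one at 30, so improvement stops after 30 even though a later value is lower again.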


## Question 4

Let's select the best `max_depth`:

* Try different values of `max_depth`: [10, 15, 20, 25]
* For each of these values,
    * try different values of `n_estimators` from 10 to 200 (with step 10)
    * calculate the mean RMSE
* Fix the random seed: `random_state=1`

What's the best `max_depth`, using the mean RMSE?

* 10
* 15
* 20
* 25

In [29]:
data = []
for max_depth in range(10, 26, 5):
    for n_estimators in range(10, 201, 10):
        rf = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=1, n_jobs=-1)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_val)
        # Store one (max_depth, n_estimators, rmse) row per trained forest.
        rmse = round(np.sqrt(mean_squared_error(y_val, y_pred)), 3)
        data.append((max_depth, n_estimators, rmse))


In [30]:
heat = pd.DataFrame(data, columns=["max_depth", "n_estimators", "rmse"])
heat.sort_values("rmse").head()

| | max_depth | n_estimators | rmse |
|---|---|---|---|
| 17 | 10 | 180 | 0.44 |
| 19 | 10 | 200 | 0.44 |
| 18 | 10 | 190 | 0.44 |
| 16 | 10 | 170 | 0.44 |
| 15 | 10 | 160 | 0.44 |
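
The question asks for the mean RMSE per `max_depth`, which sorting individual rows does not show directly. One way to get it is a `groupby` followed by `idxmin`. The sketch below runs on made-up scores (the numbers are hypothetical, not results from this dataset); the same two lines could be applied to a frame holding the real scores.

```python
import pandas as pd

# Hypothetical (max_depth, n_estimators, rmse) rows; the values are made up.
rows = [
    (10, 10, 0.46), (10, 20, 0.44),
    (15, 10, 0.48), (15, 20, 0.46),
]
scores = pd.DataFrame(rows, columns=["max_depth", "n_estimators", "rmse"])

# Average over all n_estimators values for each depth, then pick the lowest mean.
mean_rmse = scores.groupby("max_depth")["rmse"].mean()
best_depth = mean_rmse.idxmin()

print(mean_rmse)
print(best_depth)
```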


## Question 5

We can extract feature importance information from tree-based models.

At each step, the decision tree learning algorithm finds the best split. For each split we can calculate the "gain": the reduction in impurity that the split achieves. Accumulated over all the splits that use a feature, this gain indicates how important that feature is to a tree-based model.
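
For regression trees trained with the squared-error criterion, impurity is the variance of the target within a node, so the gain of a split is the parent's variance minus the size-weighted variance of its two children. The sketch below computes this for a toy split; the function name `variance_gain` and the four data points are assumptions for illustration.

```python
import numpy as np


def variance_gain(y, left_mask):
    """Impurity reduction of one split, using variance as the impurity."""
    y = np.asarray(y, dtype=float)
    left, right = y[left_mask], y[~left_mask]
    # Children are weighted by the share of samples they receive.
    children = (len(left) * left.var() + len(right) * right.var()) / len(y)
    return y.var() - children


# Toy data: lighter cars have higher fuel efficiency.
mpg = np.array([17.0, 16.0, 13.0, 12.0])
weight = np.array([2800, 2900, 3200, 3300])

gain = variance_gain(mpg, weight <= 3000)
print(gain)
```

The parent variance is 4.25 and each child has variance 0.25, so this split removes 4.0 of it.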

In Scikit-Learn, tree-based models contain this information in the [`feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.feature_importances_) field.

For this homework question, we'll find the most important feature:

Train the model with these parameters:
* `n_estimators=10`,
* `max_depth=20`,
* `random_state=1`,
* `n_jobs=-1` (optional)

Get the feature importance information from this model.
What's the most important feature (among these 4)?

* `vehicle_weight`
* `horsepower`
* `acceleration`
* `engine_displacement`

In [31]:
dt = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1, n_jobs=-1)
dt.fit(X_train, y_train)

series = pd.Series(dt.feature_importances_, index=dv.get_feature_names_out())
series.filter(items=['vehicle_weight', 'horsepower', 'acceleration', 'engine_displacement'])


vehicle_weight         0.959162
horsepower             0.016040
acceleration           0.011471
engine_displacement    0.003269
dtype: float64
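
To rank features rather than look up a fixed subset, the importances can be sorted. The sketch below does this on synthetic data in which only the first column drives the target; the data, the noise level, and the names `rf_demo` and `importances` are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 300
X_demo = np.column_stack([
    rng.uniform(2000, 4000, n),  # vehicle_weight
    rng.uniform(50, 250, n),     # horsepower
    rng.uniform(8, 22, n),       # acceleration
])
names = ["vehicle_weight", "horsepower", "acceleration"]
# Synthetic target: driven by weight plus a little noise.
y_demo = 30 - 0.005 * X_demo[:, 0] + rng.normal(0, 0.3, n)

rf_demo = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1)
rf_demo.fit(X_demo, y_demo)

# Importances are normalized to sum to 1; sort so the largest comes first.
importances = pd.Series(rf_demo.feature_importances_, index=names).sort_values(ascending=False)
top_feature = importances.index[0]
print(importances)
```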

## Question 6

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter:

* Install XGBoost
* Create DMatrix for train and validation
* Create a watchlist
* Train a model with these parameters for 100 rounds:

```python
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
```

Now change eta from `0.3` to `0.1`.

Which eta leads to the best RMSE score on the validation dataset?

* `0.3`
* `0.1`
* Both give equal value

In [57]:
import xgboost as xgb

features = list(dv.get_feature_names_out())

dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=features)
watchlist = [(dtrain, 'train'), (dval, 'val')]

xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}

data = {}

In [58]:
# with eta 0.3
model03 = xgb.train(xgb_params, dtrain, num_boost_round=100, evals=watchlist, verbose_eval=10)

y_pred03 = model03.predict(dval)
data[0.3] = round(np.sqrt(mean_squared_error(y_val, y_pred03)), 4)


[0]	train-rmse:1.81393	val-rmse:1.85444
[10]	train-rmse:0.37115	val-rmse:0.43896
[20]	train-rmse:0.33553	val-rmse:0.43376
[30]	train-rmse:0.31475	val-rmse:0.43752
[40]	train-rmse:0.30202	val-rmse:0.43968
[50]	train-rmse:0.28456	val-rmse:0.44140
[60]	train-rmse:0.26768	val-rmse:0.44290
[70]	train-rmse:0.25489	val-rmse:0.44531
[80]	train-rmse:0.24254	val-rmse:0.44689
[90]	train-rmse:0.23193	val-rmse:0.44839
[99]	train-rmse:0.21950	val-rmse:0.45018


In [None]:
xgb_params['eta'] = 0.1
model01 = xgb.train(xgb_params, dtrain, num_boost_round=100, evals=watchlist, verbose_eval=10)
y_pred01 = model01.predict(dval)
data[0.1] = round(np.sqrt(mean_squared_error(y_val, y_pred01)), 4)

[0]	train-rmse:2.28944	val-rmse:2.34561
[10]	train-rmse:0.91008	val-rmse:0.94062
[20]	train-rmse:0.48983	val-rmse:0.53064
[30]	train-rmse:0.38342	val-rmse:0.44289
[40]	train-rmse:0.35343	val-rmse:0.42746
[50]	train-rmse:0.33998	val-rmse:0.42498
[60]	train-rmse:0.33054	val-rmse:0.42456
[70]	train-rmse:0.32202	val-rmse:0.42503
[80]	train-rmse:0.31667	val-rmse:0.42563
[90]	train-rmse:0.31059	val-rmse:0.42586
[99]	train-rmse:0.30419	val-rmse:0.42623


In [64]:
if len(set(data.values())) == 1:
    print("Both give equal value")
else:
    key = min(data, key=data.get)
    print(f"Best result with `eta` = {key} ({data[key]})")


Best result with `eta` = 0.1 (0.4262)
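
The comparison above uses the RMSE after the final round, but the logs show that the validation error bottoms out earlier and then rises. The sketch below takes the `val-rmse` values copied from the two training logs and finds the logged round with the lowest value for each `eta`. Only every tenth round was printed, so the true minimum may lie between the logged rounds.

```python
# val-rmse values copied from the training logs above (round -> RMSE).
val_rmse = {
    0.3: {0: 1.85444, 10: 0.43896, 20: 0.43376, 30: 0.43752, 40: 0.43968,
          50: 0.44140, 60: 0.44290, 70: 0.44531, 80: 0.44689, 90: 0.44839,
          99: 0.45018},
    0.1: {0: 2.34561, 10: 0.94062, 20: 0.53064, 30: 0.44289, 40: 0.42746,
          50: 0.42498, 60: 0.42456, 70: 0.42503, 80: 0.42563, 90: 0.42586,
          99: 0.42623},
}

best_rounds = {}
for eta, history in val_rmse.items():
    # Logged round with the lowest validation RMSE for this eta.
    best_round = min(history, key=history.get)
    best_rounds[eta] = (best_round, history[best_round])
    print(f"eta={eta}: best logged round {best_round} "
          f"(val-rmse {history[best_round]}), final {history[99]}")
```

With either criterion, final round or best logged round, `eta=0.1` gives the lower validation RMSE in these logs.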
