## Homework

> Note: sometimes your answer doesn't match one of 
> the options exactly. That's fine. 
> Select the option that's closest to your solution.
> If it's exactly in between two options, select the higher value.


### Dataset

In this homework, we continue using the fuel efficiency dataset.
Download it from <a href='https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv'>here</a>.

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
```

The goal of this homework is to create a regression model for predicting the car fuel efficiency (column `'fuel_efficiency_mpg'`).



### Preparing the dataset 

Preparation:

* Fill missing values with zeros.
* Do train/validation/test split with 60%/20%/20% distribution. 
* Use the `train_test_split` function and set the `random_state` parameter to 1.
* Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices.

In [1]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv -O data.csv

--2025-11-12 01:13:34--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 874188 (854K) [text/plain]
Saving to: ‘data.csv’


2025-11-12 01:13:35 (37.2 MB/s) - ‘data.csv’ saved [874188/874188]



In [2]:
import pandas as pd
df = pd.read_csv('data.csv')
df.head()

Unnamed: 0,engine_displacement,num_cylinders,horsepower,vehicle_weight,acceleration,model_year,origin,fuel_type,drivetrain,num_doors,fuel_efficiency_mpg
0,170,3.0,159.0,3413.433759,17.7,2003,Europe,Gasoline,All-wheel drive,0.0,13.231729
1,130,5.0,97.0,3149.664934,17.8,2007,USA,Gasoline,Front-wheel drive,0.0,13.688217
2,170,,78.0,3079.038997,15.1,2018,Europe,Gasoline,Front-wheel drive,0.0,14.246341
3,220,4.0,,2542.392402,20.2,2009,USA,Diesel,All-wheel drive,2.0,16.912736
4,210,1.0,140.0,3460.87099,14.4,2009,Europe,Gasoline,All-wheel drive,2.0,12.488369


In [3]:
numerical = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical = df.select_dtypes(include=['object']).columns.tolist()
numerical.remove('fuel_efficiency_mpg')  # Remove target variable from numerical features
print("Numerical columns:", numerical)
print("Categorical columns:", categorical)

Numerical columns: ['engine_displacement', 'num_cylinders', 'horsepower', 'vehicle_weight', 'acceleration', 'model_year', 'num_doors']
Categorical columns: ['origin', 'fuel_type', 'drivetrain']


In [4]:
df.isnull().sum()

engine_displacement      0
num_cylinders          482
horsepower             708
vehicle_weight           0
acceleration           930
model_year               0
origin                   0
fuel_type                0
drivetrain               0
num_doors              502
fuel_efficiency_mpg      0
dtype: int64

In [5]:
df.fillna(0, inplace=True)

In [6]:
df.isnull().sum()

engine_displacement    0
num_cylinders          0
horsepower             0
vehicle_weight         0
acceleration           0
model_year             0
origin                 0
fuel_type              0
drivetrain             0
num_doors              0
fuel_efficiency_mpg    0
dtype: int64

In [7]:
for col in categorical + numerical:
    print(df[col].value_counts())

origin
Europe    3254
Asia      3247
USA       3203
Name: count, dtype: int64
fuel_type
Gasoline    4898
Diesel      4806
Name: count, dtype: int64
drivetrain
All-wheel drive      4876
Front-wheel drive    4828
Name: count, dtype: int64
engine_displacement
190    816
200    805
210    770
220    729
180    719
170    662
230    617
160    559
240    550
250    463
150    438
140    384
260    339
270    296
130    290
280    218
120    217
110    173
290    146
300    108
100     79
310     77
90      53
80      43
320     33
330     27
70      25
60      19
350     11
340     10
50       9
40       7
30       5
370      4
380      2
10       1
Name: count, dtype: int64
num_cylinders
4.0     1858
3.0     1792
2.0     1395
5.0     1376
6.0      946
1.0      681
0.0      665
7.0      537
8.0      258
9.0      115
10.0      52
11.0      21
12.0       6
13.0       2
Name: count, dtype: int64
horsepower
0.0      708
152.0    142
145.0    141
151.0    134
141.0    130
        ... 
40.0      

In [8]:
from sklearn.model_selection import train_test_split
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)  
len(df_train), len(df_val), len(df_test)

(5822, 1941, 1941)

In [9]:
y_train = df_train['fuel_efficiency_mpg'].values.round(2)
y_val = df_val['fuel_efficiency_mpg'].values.round(2)
y_test = df_test['fuel_efficiency_mpg'].values.round(2)


In [10]:
df_train.reset_index(drop=True, inplace=True)
df_val.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)

In [11]:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
columns = categorical + numerical
#columns = ["vehicle_weight","model_year","origin","fuel_type"]
print (columns)
X_train = dv.fit_transform(df_train[columns].to_dict(orient='records'))

['origin', 'fuel_type', 'drivetrain', 'engine_displacement', 'num_cylinders', 'horsepower', 'vehicle_weight', 'acceleration', 'model_year', 'num_doors']


In [12]:
X_train.shape, y_train.shape

((5822, 14), (5822,))

## Question 1

Let's train a decision tree regressor to predict the `fuel_efficiency_mpg` variable. 

* Train a model with `max_depth=1`.


Which feature is used for splitting the data?


* `'vehicle_weight'`
* `'model_year'`
* `'origin'`
* `'fuel_type'`

In [13]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import roc_auc_score
dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)


0,1,2
,criterion,'squared_error'
,splitter,'best'
,max_depth,1
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,
,random_state,
,max_leaf_nodes,
,min_impurity_decrease,0.0


In [14]:
from sklearn.tree import export_text
r = export_text(dt, feature_names=dv.get_feature_names_out().tolist())
print(r)

|--- vehicle_weight <= 3022.11
|   |--- value: [16.88]
|--- vehicle_weight >  3022.11
|   |--- value: [12.94]



## Answer 1 
vehicle_weight

## Question 2

Train a random forest regressor with these parameters:

* `n_estimators=10`
* `random_state=1`
* `n_jobs=-1` (optional - to make training faster)


What's the RMSE of this model on the validation data?

* 0.045
* 0.45
* 4.5
* 45.0

In [15]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

0,1,2
,n_estimators,10
,criterion,'squared_error'
,max_depth,
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,1.0
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,True


In [16]:
X_val = dv.transform(df_val[columns].to_dict(orient='records'))
y_pred = rf.predict(X_val)
y_pred, y_val

(array([18.519, 15.246, 18.107, ..., 14.785, 13.544, 16.009], shape=(1941,)),
 array([18.44, 15.34, 18.44, ..., 14.86, 13.83, 16.17], shape=(1941,)))

In [17]:
from sklearn.metrics import root_mean_squared_error
root_mean_squared_error(y_pred, y_val)  # RMSE

0.45830588911324105

## Answer 2
0.45

## Question 3

Now let's experiment with the `n_estimators` parameter

* Try different values of this parameter from 10 to 200 with step 10.
* Set `random_state` to `1`.
* Evaluate the model on the validation dataset.


After which value of `n_estimators` does RMSE stop improving?
Consider 3 decimal places for calculating the answer.

- 10
- 25
- 80
- 200

If it doesn't stop improving, use the latest iteration number in
your answer.

In [18]:
import numpy as np

In [19]:
n_range = np.arange(10, 201, 10)
for n in n_range:
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_val)
    rmse = root_mean_squared_error(y_pred, y_val)
    print(f'n_estimators: {n}, RMSE: {rmse:.3f}')

n_estimators: 10, RMSE: 0.458
n_estimators: 20, RMSE: 0.453
n_estimators: 30, RMSE: 0.451
n_estimators: 40, RMSE: 0.448
n_estimators: 50, RMSE: 0.446
n_estimators: 60, RMSE: 0.445
n_estimators: 70, RMSE: 0.444
n_estimators: 80, RMSE: 0.444
n_estimators: 90, RMSE: 0.444
n_estimators: 100, RMSE: 0.444
n_estimators: 110, RMSE: 0.443
n_estimators: 120, RMSE: 0.444
n_estimators: 130, RMSE: 0.444
n_estimators: 140, RMSE: 0.443
n_estimators: 150, RMSE: 0.443
n_estimators: 160, RMSE: 0.443
n_estimators: 170, RMSE: 0.443
n_estimators: 180, RMSE: 0.443
n_estimators: 190, RMSE: 0.443
n_estimators: 200, RMSE: 0.443


## Answer 3
n_estimators = 70

## Question 4

Let's select the best `max_depth`:

* Try different values of `max_depth`: `[10, 15, 20, 25]`
* For each of these values,
  * try different values of `n_estimators` from 10 till 200 (with step 10)
  * calculate the mean RMSE 
* Fix the random seed: `random_state=1`


What's the best `max_depth`, using the mean RMSE?

* 10
* 15
* 20
* 25


In [20]:
n_range = np.arange(10, 201, 10)
d_values = [10,15,20,25]

for d in d_values:
    rmses = []
    for n in n_range:
        rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1, max_depth=d)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_val)
        rmse = root_mean_squared_error(y_pred, y_val)
        #print(f'n_estimators: {n}, RMSE: {rmse:.3f}')
        rmses.append(round(rmse,3))
    print(f'decision_tree_depth: {d}, Mean RMSE: {np.mean(rmses):.3f}')

decision_tree_depth: 10, Mean RMSE: 0.442
decision_tree_depth: 15, Mean RMSE: 0.445
decision_tree_depth: 20, Mean RMSE: 0.445
decision_tree_depth: 25, Mean RMSE: 0.445


## Answer 4
best max_depth: 10

# Question 5

We can extract feature importance information from tree-based models. 

At each step of the decision tree learning algorithm, it finds the best split. 
When doing it, we can calculate "gain" - the reduction in impurity before and after the split. 
This gain is quite useful in understanding what are the important features for tree-based models.

In Scikit-Learn, tree-based models contain this information in the
[`feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.feature_importances_)
field. 

For this homework question, we'll find the most important feature:

* Train the model with these parameters:
  * `n_estimators=10`,
  * `max_depth=20`,
  * `random_state=1`,
  * `n_jobs=-1` (optional)
* Get the feature importance information from this model


What's the most important feature (among these 4)? 

* `vehicle_weight`
*	`horsepower`
* `acceleration`
* `engine_displacement`	


In [24]:
rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1, max_depth=20)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_val)

In [33]:
feature_importances = rf.feature_importances_.round(3)
print (list(zip (df_train.columns, feature_importances)))

[('engine_displacement', np.float64(0.011)), ('num_cylinders', np.float64(0.0)), ('horsepower', np.float64(0.0)), ('vehicle_weight', np.float64(0.003)), ('acceleration', np.float64(0.0)), ('model_year', np.float64(0.0)), ('origin', np.float64(0.016)), ('fuel_type', np.float64(0.003)), ('drivetrain', np.float64(0.002)), ('num_doors', np.float64(0.002)), ('fuel_efficiency_mpg', np.float64(0.0))]


## Answer 5

engine_displacement = 0.011
horsepower = 0.0 
accelaration = 0.0
vehicle_weight = 0.003

## Question 6

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter:

* Install XGBoost
* Create DMatrix for train and validation
* Create a watchlist
* Train a model with these parameters for 100 rounds:

```
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
```

Now change `eta` from `0.3` to `0.1`.

Which eta leads to the best RMSE score on the validation dataset?

* 0.3
* 0.1
* Both give equal value


In [41]:
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}

In [35]:
!pip install xgboost
import xgboost as xgb

Collecting xgboost
  Downloading xgboost-3.1.1-py3-none-manylinux_2_28_x86_64.whl.metadata (2.1 kB)
Collecting nvidia-nccl-cu12 (from xgboost)
  Downloading nvidia_nccl_cu12-2.28.7-py3-none-manylinux_2_18_x86_64.whl.metadata (2.0 kB)
Downloading xgboost-3.1.1-py3-none-manylinux_2_28_x86_64.whl (115.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.9/115.9 MB[0m [31m26.2 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hDownloading nvidia_nccl_cu12-2.28.7-py3-none-manylinux_2_18_x86_64.whl (296.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m296.8/296.8 MB[0m [31m31.1 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hInstalling collected packages: nvidia-nccl-cu12, xgboost
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2/2[0m [xgboost]m1/2[0m [xgboost]
[1A[2KSuccessfully installed nvidia-nccl-cu12-2.28.7 xgboost-3.1.1

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49

In [42]:
features = dv.get_feature_names_out().tolist()
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=features) 
model = xgb.train(xgb_params, dtrain, num_boost_round=100)
y_pred = model.predict(dval)
rmse = root_mean_squared_error(y_pred, y_val)
print(f'XGBoost RMSE for 0.3 eta : {rmse:.3f}')

XGBoost RMSE for 0.3 eta : 0.451


In [43]:
xgb_params['eta'] = 0.1
model = xgb.train(xgb_params, dtrain, num_boost_round=100)
y_pred = model.predict(dval)
rmse = root_mean_squared_error(y_pred, y_val)
print(f'XGBoost RMSE for 0.1 eta: {rmse:.3f}')

XGBoost RMSE for 0.1 eta: 0.428


## Answer 6
eta : rmse
0.3 : 0.451
0.1 : 0.428


## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2025/homework/hw06
* If your answer doesn't match options exactly, select the closest one. If the answer is exactly in between two options, select the higher value.