## Homework for Module 2: Machine Learning for Regression

For this homework, we'll use the Car Fuel Efficiency dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv).

The goal of this homework is to create a regression model for predicting the car fuel efficiency (column `fuel_efficiency_mpg`).

Use only the following columns:

`engine_displacement`,
`horsepower`,
`vehicle_weight`,
`model_year`,
`fuel_efficiency_mpg`

In [56]:
import pandas as pd
import requests

from pathlib import Path


FILE_NAME = "car_fuel_efficiency.csv"


def fetch():
    resp = requests.get(
        f"https://raw.githubusercontent.com/alexeygrigorev/datasets/master/{FILE_NAME}",
        allow_redirects=False,
        timeout=10,
    )

    resp.raise_for_status()

    with open(FILE_NAME, "w") as f:
        f.write(resp.text)

if not Path(FILE_NAME).exists():
    fetch()

df = pd.read_csv(FILE_NAME)[['engine_displacement', 'horsepower', 'vehicle_weight', 'model_year', 'fuel_efficiency_mpg']]
df.head()

Unnamed: 0,engine_displacement,horsepower,vehicle_weight,model_year,fuel_efficiency_mpg
0,170,159.0,3413.433759,2003,13.231729
1,130,97.0,3149.664934,2007,13.688217
2,170,78.0,3079.038997,2018,14.246341
3,220,,2542.392402,2009,16.912736
4,210,140.0,3460.87099,2009,12.488369


## Question 1

There's one column with missing values. What is it?

In [57]:
df.isnull().any()

engine_displacement    False
horsepower              True
vehicle_weight         False
model_year             False
fuel_efficiency_mpg    False
dtype: bool

## Question 2

What's the median (50% percentile) for variable `horsepower`?

In [58]:
df['horsepower'].median()

np.float64(149.0)

### Splitting dataset into train/val/test sets

Shuffle the dataset (the filtered one you created above), use seed 42.

Split your data in train/val/test sets, with 60%/20%/20% distribution.

Use the same code as in the lectures.


In [63]:
import numpy as np

def get_datasets(seed):
    len_df = len(df)

    len_val = int(len_df * 0.2)
    len_test = int(len_df * 0.2)
    len_train = len_df - len_val - len_test

    df_train = df.iloc[:len_train]
    df_val = df.iloc[len_train:len_train+len_val]
    df_test = df.iloc[len_train+len_val:]

    idx = np.arange(len_df)
    np.random.seed(seed)
    np.random.shuffle(idx)

    df_train = df.iloc[idx[:len_train]]
    df_val = df.iloc[idx[len_train:len_train+len_val]]
    df_test = df.iloc[idx[len_train+len_val:]]

    X_train = df_train.reset_index(drop=True)
    X_val = df_val.reset_index(drop=True)
    X_test = df_test.reset_index(drop=True)

    y_train = df_train.fuel_efficiency_mpg.values
    y_val = df_val.fuel_efficiency_mpg.values
    y_test = df_test.fuel_efficiency_mpg.values

    del X_train['fuel_efficiency_mpg']
    del X_val['fuel_efficiency_mpg']
    del X_test['fuel_efficiency_mpg']

    return X_train, X_val, X_test, y_train, y_val, y_test

In [67]:
X_train, X_val, X_test, y_train, y_val, y_test = get_datasets(42)

## Question 3

We need to deal with missing values for the column from Q1.

We have two options: fill it with 0 or with the mean of this variable.

Try both options. For each, train a linear regression model without regularization using the code from 
the lessons.

For computing the mean, use the training only!

Use the validation dataset to evaluate the models and compare the RMSE of each option.

Round the RMSE scores to 2 decimal digits using round(score, 2)

Which option gives better RMSE?

In [72]:
def train_linear_regression(X, y):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)
    
    return w_full[0], w_full[1:]

def predict(X, w0, w):
    return w0 + X.dot(w)

def rmse(y, y_pred):
    se = (y - y_pred) ** 2
    mse = se.mean()
    return np.sqrt(mse)


X_train_zero = X_train.copy()
X_train_zero = X_train_zero.fillna(0)

X_train_mean = X_train.copy()
X_train_mean = X_train_mean.fillna(X_train_mean['horsepower'].mean())

w0_zero, w_zero = train_linear_regression(X_train_zero, y_train)
w0_mean, w_mean = train_linear_regression(X_train_mean, y_train)


X_val_zero = X_val.copy()
X_val_zero = X_val_zero.fillna(0)

y_zero_pred = predict(X_val_zero, w0_zero, w_zero)

X_val_mean = X_val.copy()
X_val_mean = X_val_mean.fillna(X_val_mean['horsepower'].mean())

y_mean_pred = predict(X_val_mean, w0_mean, w_mean)

rmse_zero = rmse(y_val, y_zero_pred)
rmse_mean = rmse(y_val, y_mean_pred)

print(f"RMSE for filled with zero: {rmse_zero}")
print(f"RMSE for filled with mean: {rmse_mean}")

RMSE for filled with zero: 0.5173782638832584
RMSE for filled with mean: 0.46362369950052107


## Question 4

Now let's train a regularized linear regression.

For this question, fill the NAs with 0.

Try different values of r from this list: [0, 0.01, 0.1, 1, 5, 10, 100].

Use RMSE to evaluate the model on the validation dataset.

Round the RMSE scores to 2 decimal digits.

Which r gives the best RMSE?

If there are multiple options, select the smallest r.

In [73]:
def train_linear_regression_with_reg(X, y, r):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    XTX = X.T.dot(X)
    XTX = XTX + r * np.eye(XTX.shape[0])
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)
    
    return w_full[0], w_full[1:]

min_rmse = None
for r in [0, 0.01, 0.1, 1, 5, 10, 100]:
    X_train_zero = X_train.copy()
    X_train_zero = X_train_zero.fillna(0)
    
    w0, w = train_linear_regression_with_reg(X_train_zero, y_train, r)

    X_val_zero = X_val.copy()
    X_val_zero = X_val_zero.fillna(0)

    y_train_zero_pred = predict(X_val_zero, w0, w)
    error = rmse(y_val, y_train_zero_pred)

    if not min_rmse:
        min_rmse = (error, r)
    elif error < min_rmse[0]:
        min_rmse = (error, r)

min_rmse

(np.float64(0.5171115525773323), 0.01)

## Question 5

We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.

Try different seed values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].

For each seed, do the train/validation/test split with 60%/20%/20% distribution.

Fill the missing values with 0 and train a model without regularization.

For each seed, evaluate the model on the validation dataset and collect the RMSE scores.

What's the standard deviation of all the scores? To compute the standard deviation, use np.std.

Round the result to 3 decimal digits (round(std, 3))

In [74]:
errors = []
for seed in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]:
    X_train, X_val, X_test, y_train, y_val, y_test = get_datasets(seed)
    X_train_zero = X_train.copy()
    X_train_zero = X_train_zero.fillna(0)
    
    w0, w = train_linear_regression(X_train_zero, y_train)

    X_val_zero = X_val.copy()
    X_val_zero = X_val_zero.fillna(0)
    y_train_zero_pred = predict(X_val_zero, w0, w)
    errors.append(rmse(y_val, y_train_zero_pred))

round(np.std(errors), 3)

np.float64(0.007)

## Question 6


Split the dataset like previously, use seed 9.

Combine train and validation datasets.

Fill the missing values with 0 and train a model with r=0.001.

What's the RMSE on the test dataset?

In [77]:
X_train, X_val, X_test, y_train, y_val, y_test = get_datasets(9)

X_train = pd.concat([X_train, X_val], axis=0).reset_index(drop=True)
X_train = X_train.fillna(0)

y_train = np.concatenate([y_train, y_val])

w0, w = train_linear_regression_with_reg(X_train, y_train, 0.001)

X_test = X_test.fillna(0)
y_pred = predict(X_test, w0, w)
round(rmse(y_test, y_pred), 3)

np.float64(0.516)