# Homework

The goal of this homework is to create a regression model for predicting car fuel efficiency (column `fuel_efficiency_mpg`).

## Preparing the dataset

Use only the following columns:

- `engine_displacement`
- `horsepower`
- `vehicle_weight`
- `model_year`
- `fuel_efficiency_mpg`
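
A minimal sketch of keeping only these columns at load time, assuming the data is read with pandas. The inline CSV and its values are made up for illustration, and `df_small` is a hypothetical name; with the real file, the path would replace the `StringIO` buffer.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the real dataset file.
csv_text = """engine_displacement,num_cylinders,horsepower,vehicle_weight,model_year,fuel_efficiency_mpg
170,3,159,3413.4,2003,13.2
220,4,,2542.4,2009,16.9
"""

columns_to_use = [
    "engine_displacement",
    "horsepower",
    "vehicle_weight",
    "model_year",
    "fuel_efficiency_mpg",
]

# usecols tells the parser to keep only the listed columns.
df_small = pd.read_csv(io.StringIO(csv_text), usecols=columns_to_use)
print(df_small.columns.tolist())
```

Selecting the columns after loading (`df[columns_to_use]`) gives the same frame; `usecols` just avoids parsing columns that are never used.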

## Question 1
### There's one column with missing values. What is it?

In [23]:
import pandas as pd

In [24]:
df = pd.read_csv('../data/car_fuel_efficiency.csv')

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9704 entries, 0 to 9703
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   engine_displacement  9704 non-null   int64  
 1   num_cylinders        9222 non-null   float64
 2   horsepower           8996 non-null   float64
 3   vehicle_weight       9704 non-null   float64
 4   acceleration         8774 non-null   float64
 5   model_year           9704 non-null   int64  
 6   origin               9704 non-null   object 
 7   fuel_type            9704 non-null   object 
 8   drivetrain           9704 non-null   object 
 9   num_doors            9202 non-null   float64
 10  fuel_efficiency_mpg  9704 non-null   float64
dtypes: float64(6), int64(2), object(3)
memory usage: 834.1+ KB


In [None]:
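# Restrict the check to the columns used in this homework
cols = ['engine_displacement', 'horsepower', 'vehicle_weight', 'model_year', 'fuel_efficiency_mpg']
df[cols].columns[df[cols].isna().any()].tolist()


['horsepower']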
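
As a small illustration of the check above, the sketch below counts missing values per column on a toy frame and keeps the names of columns with at least one. The frame `df_demo` and its values are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks a missing entry.
df_demo = pd.DataFrame({
    "horsepower": [159.0, np.nan, 78.0, np.nan],
    "vehicle_weight": [3413.4, 3149.7, 3079.0, 2542.4],
})

# Number of missing values in each column.
missing_counts = df_demo.isna().sum()

# Names of the columns that have at least one missing value.
missing_cols = missing_counts[missing_counts > 0].index.tolist()
print(missing_counts)
print(missing_cols)
```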

## Question 2
### What's the median (50th percentile) of the variable 'horsepower'?

In [10]:
df['horsepower'].median()

149.0
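
A short sketch, on made-up values, of two properties this answer relies on: pandas skips missing values when computing the median, and the median is the same statistic as the 0.5 quantile.

```python
import numpy as np
import pandas as pd

# Made-up values; NaN marks a missing entry.
hp = pd.Series([100.0, np.nan, 150.0, 200.0])

# NaN is skipped by default (skipna=True), so only three values are used.
median_hp = hp.median()

# The 0.5 quantile is the same statistic as the median.
quantile_hp = hp.quantile(0.5)
print(median_hp, quantile_hp)
```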

In [13]:
df.columns = df.columns.str.lower().str.replace(' ', '_')

## Question 3
- We need to deal with missing values for the column from Q1.
- We have two options: fill it with 0 or with the mean of this variable.
- Try both options. For each, train a linear regression model without regularization using the code from the lessons.
- For computing the mean, use the training set only.
- Use the validation dataset to evaluate the models and compare the RMSE of each option.
- Round the RMSE scores to 2 decimal digits using `round(score, 2)`.
- Which option gives the better RMSE?
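
The question refers to "the code from the lessons", which is not reproduced here. As an assumption about what that code looks like, the sketch below fits ordinary least squares with the normal equation in NumPy and computes RMSE. The names `train_linear_regression` and `rmse`, and the synthetic data, are illustrative only.

```python
import numpy as np


def train_linear_regression(X, y):
    """Fit least squares via the normal equation; returns (bias, weights)."""
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])  # prepend a column of ones for the bias term
    XTX = X.T.dot(X)
    w_full = np.linalg.solve(XTX, X.T.dot(y))
    return w_full[0], w_full[1:]


def rmse(y, y_pred):
    """Root mean squared error between targets and predictions."""
    return np.sqrt(np.mean((y - y_pred) ** 2))


# Synthetic, noiseless data with known coefficients (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 + X_demo @ np.array([1.0, -0.5, 0.25])

w0, w = train_linear_regression(X_demo, y_demo)
score = rmse(y_demo, w0 + X_demo @ w)
print(w0, w, round(score, 2))
```

The cell below uses scikit-learn's `LinearRegression` instead, which solves the same least-squares problem.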

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

columns_to_use = [
    'engine_displacement',
    'horsepower', 
    'vehicle_weight',
    'model_year',
    'fuel_efficiency_mpg'
]

df_filtered = df[columns_to_use].copy()

print(df_filtered.head())
print(df_filtered.isnull().sum())

X = df_filtered.drop('fuel_efficiency_mpg', axis=1)
y = df_filtered['fuel_efficiency_mpg']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=42)

print(f"Train: {X_train.shape[0]} samples")
print(f"Validation: {X_val.shape[0]} samples")

# OPTION 1: Fill missing values with 0
print("\n=== OPTION 1: Fill missing values with 0 ===")
X_train_zero = X_train.copy()
X_val_zero = X_val.copy()

X_train_zero = X_train_zero.fillna(0)
X_val_zero = X_val_zero.fillna(0)

# Training model
model_zero = LinearRegression()
model_zero.fit(X_train_zero, y_train)

# Evaluate
y_pred_zero = model_zero.predict(X_val_zero)
rmse_zero = root_mean_squared_error(y_val, y_pred_zero)
print(f"RMSE (fill with 0): {round(rmse_zero, 2)}")

# OPTION 2: Fill missing values with the mean (ONLY from the training set)
print("\n=== OPTION 2: Fill missing values with the mean ===")
X_train_mean = X_train.copy()
X_val_mean = X_val.copy()

train_means = X_train_mean.mean()
print(f"Mean values from the training set: {train_means}")

X_train_mean = X_train_mean.fillna(train_means)
X_val_mean = X_val_mean.fillna(train_means)

# Training model
model_mean = LinearRegression()
model_mean.fit(X_train_mean, y_train)

# Evaluate
y_pred_mean = model_mean.predict(X_val_mean)
rmse_mean = root_mean_squared_error(y_val, y_pred_mean)
print(f"RMSE (fill with mean): {round(rmse_mean, 2)}")

# Compare results
print("\nCOMPARE")
print(f"RMSE w 0: {round(rmse_zero, 2)}")
print(f"RMSE w mean: {round(rmse_mean, 2)}")

if rmse_zero < rmse_mean:
    print("Option 1 (fill with 0) is better.")
else:
    print("Option 2 (fill with mean) is better.")

   engine_displacement  horsepower  vehicle_weight  model_year  \
0                  170       159.0     3413.433759        2003   
1                  130        97.0     3149.664934        2007   
2                  170        78.0     3079.038997        2018   
3                  220         NaN     2542.392402        2009   
4                  210       140.0     3460.870990        2009   

   fuel_efficiency_mpg  
0            13.231729  
1            13.688217  
2            14.246341  
3            16.912736  
4            12.488369  
engine_displacement      0
horsepower             708
vehicle_weight           0
model_year               0
fuel_efficiency_mpg      0
dtype: int64
Train: 5822 samples
Validation: 3882 samples

=== OPTION 1: Fill missing values with 0 ===
RMSE (fill with 0): 0.52

=== OPTION 2: Fill missing values with the mean ===
Mean values from the training set: engine_displacement     200.372724
horsepower              149.752081
vehicle_weight         2999.471344
model_year             2011.51

In [None]:
print("Available Columns:")
print(df.columns.tolist())
print("\nDataset:", df.shape)

Available Columns:
['engine_displacement', 'num_cylinders', 'horsepower', 'vehicle_weight', 'acceleration', 'model_year', 'origin', 'fuel_type', 'drivetrain', 'num_doors', 'fuel_efficiency_mpg']

Dataset: (9704, 11)


## Question 4
- Now let's train a regularized linear regression.
- For this question, fill the NAs with 0.
- Try different values of `r` from this list: `[0, 0.01, 0.1, 1, 5, 10, 100]`.
- Use RMSE to evaluate the model on the validation dataset.
- Round the RMSE scores to 2 decimal digits.
- Which `r` gives the best RMSE?
- If multiple options give the same best RMSE, select the smallest `r`.
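
One way to read `r` is as a constant added to the diagonal of the Gram matrix before solving the normal equation. The sketch below assumes that formulation; the function name `train_linear_regression_reg` and the synthetic data are illustrative. It records the norm of the fitted coefficients for each `r` to show the shrinking effect.

```python
import numpy as np


def train_linear_regression_reg(X, y, r=0.0):
    """Normal-equation fit with r added to the diagonal of X^T X."""
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])  # prepend a column of ones for the bias term
    XTX = X.T.dot(X) + r * np.eye(X.shape[1])
    w_full = np.linalg.solve(XTX, X.T.dot(y))
    return w_full[0], w_full[1:]


# Synthetic data with a little noise (illustration only).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = 1.0 + X_demo @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=100)

# Larger r pulls the coefficient vector toward zero.
norms = {}
for r in [0, 0.01, 0.1, 1, 5, 10, 100]:
    w0, w = train_linear_regression_reg(X_demo, y_demo, r=r)
    norms[r] = float(np.linalg.norm(np.concatenate([[w0], w])))
    print(r, round(norms[r], 4))
```

Note that this sketch penalizes the bias term along with the weights, whereas scikit-learn's `Ridge` leaves the intercept unpenalized, so the two can produce slightly different numbers for the same `r`.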

In [None]:
from sklearn.linear_model import Ridge

r_values = [0, 0.01, 0.1, 1, 5, 10, 100]
best_rmse = float('inf')
best_r = None

# Fill missing values with 0, as the question requires
X_train_ridge = X_train.fillna(0)
X_val_ridge = X_val.fillna(0)

for r in r_values:
    model_ridge = Ridge(alpha=r)
    model_ridge.fit(X_train_ridge, y_train)
    y_pred_ridge = model_ridge.predict(X_val_ridge)
    rmse_ridge = root_mean_squared_error(y_val, y_pred_ridge)
    rmse_ridge_rounded = round(rmse_ridge, 2)
    print(f"r: {r}, RMSE: {rmse_ridge_rounded}")
    # Strict "<" on the rounded score keeps the smallest r when scores tie
    if rmse_ridge_rounded < best_rmse:
        best_rmse = rmse_ridge_rounded
        best_r = r

print(f"\nBest r: {best_r} with RMSE: {best_rmse}")

print("\n=== COMPARISON ===")
print(f"RMSE with 0: {round(rmse_zero, 2)}")
print(f"RMSE with mean: {round(rmse_mean, 2)}")


r: 0, RMSE: 0.52
r: 0.01, RMSE: 0.52
r: 0.1, RMSE: 0.52
r: 1, RMSE: 0.52
r: 5, RMSE: 0.52
r: 10, RMSE: 0.52
r: 100, RMSE: 0.52

Best r: 0 with RMSE: 0.52

=== COMPARISON ===
RMSE with 0: 0.52
RMSE with mean: 0.46


## Question 5
- We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
- Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
- For each seed, do the train/validation/test split with 60%/20%/20% distribution.
- Fill the missing values with 0 and train a model without regularization.
- For each seed, evaluate the model on the validation dataset and collect the RMSE scores.
- What's the standard deviation of all the scores? To compute the standard deviation, use `np.std`.
- Round the result to 3 decimal digits (`round(std, 3)`).
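
The sketch below illustrates the two mechanics this question depends on: a seeded 60%/20%/20% split of row indices, and the fact that `np.std` computes the population standard deviation by default (`ddof=0`). The helper `split_indices` and the sample scores are hypothetical; the split is one possible way to do it, not necessarily the one used in the lessons.

```python
import numpy as np


def split_indices(n, seed, val_frac=0.2, test_frac=0.2):
    """Shuffle row indices with a seed and cut them into train/val/test."""
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    n_train = n - n_val - n_test

    idx = np.arange(n)
    rng = np.random.default_rng(seed)
    rng.shuffle(idx)

    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]
    return train_idx, val_idx, test_idx


# 9704 is the number of rows reported by df.info() above.
train_idx, val_idx, test_idx = split_indices(9704, seed=0)
sizes = (len(train_idx), len(val_idx), len(test_idx))
print(sizes)

# Made-up scores: np.std defaults to ddof=0 (population standard deviation),
# while ddof=1 gives the larger sample standard deviation.
scores = [1.0, 2.0, 3.0, 4.0]
std_population = np.std(scores)
std_sample = np.std(scores, ddof=1)
print(std_population, std_sample)
```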

In [None]:
import numpy as np

seeds = list(range(10))
rmse_scores = []

for seed in seeds:
    # 60/20/20 split: hold out 20% for test, then 25% of the rest (0.25 x 0.8 = 0.2) for validation
    X_temp_seed, X_test_seed, y_temp_seed, y_test_seed = train_test_split(X, y, test_size=0.2, random_state=seed)
    X_train_seed, X_val_seed, y_train_seed, y_val_seed = train_test_split(X_temp_seed, y_temp_seed, test_size=0.25, random_state=seed)

    X_train_seed = X_train_seed.fillna(0)
    X_val_seed = X_val_seed.fillna(0)

    model_seed = LinearRegression()
    model_seed.fit(X_train_seed, y_train_seed)

    y_pred_seed = model_seed.predict(X_val_seed)
    rmse_seed = root_mean_squared_error(y_val_seed, y_pred_seed)
    rmse_scores.append(rmse_seed)

# np.std with its default ddof=0, as the question asks
std_rmse = np.std(rmse_scores)
print(f"\nStandard deviation of RMSE scores: {round(std_rmse, 3)}")




Standard deviation of RMSE scores: 0.005


## Question 6
- Split the dataset like previously, use seed 9.
- Combine train and validation datasets.
- Fill the missing values with 0 and train a model with `r=0.001`.
- What's the RMSE on the test dataset?
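
A minimal sketch of the "combine train and validation" step, assuming the parts are pandas objects. The toy frames and the names `X_full` and `y_full` are made up for illustration.

```python
import pandas as pd

# Toy train and validation parts with made-up values.
X_train_demo = pd.DataFrame({
    "horsepower": [100.0, float("nan")],
    "vehicle_weight": [3000.0, 2500.0],
})
X_val_demo = pd.DataFrame({"horsepower": [150.0], "vehicle_weight": [2800.0]})
y_train_demo = pd.Series([14.0, 17.0])
y_val_demo = pd.Series([15.5])

# Stack features and targets in the same order so rows stay aligned,
# then fill the missing values with 0.
X_full = pd.concat([X_train_demo, X_val_demo]).reset_index(drop=True).fillna(0)
y_full = pd.concat([y_train_demo, y_val_demo]).reset_index(drop=True)
print(X_full)
print(y_full)
```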

In [None]:
# 60/20/20 split with seed 9: hold out 20% for test, then 25% of the rest (0.25 x 0.8 = 0.2) for validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=9)
X_train_final, X_val_final, y_train_final, y_val_final = train_test_split(X_temp, y_temp, test_size=0.25, random_state=9)

# Combine train and validation, then fill missing values with 0
X_full_train = pd.concat([X_train_final, X_val_final]).fillna(0)
y_full_train = pd.concat([y_train_final, y_val_final])
X_test = X_test.fillna(0)

model_final = Ridge(alpha=0.001)
model_final.fit(X_full_train, y_full_train)

y_pred_test = model_final.predict(X_test)
rmse_test = root_mean_squared_error(y_test, y_pred_test)
print(f"\nRMSE on the test dataset: {round(rmse_test, 2)}")



RMSE on the test dataset: 0.52
