### Regression Homework by Sameh Shehata

In [None]:
import pandas as pd
import numpy as np

# Models used in Questions 3 and 4
from sklearn.linear_model import LinearRegression, Ridge

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### Data Preparation

In [None]:
df = pd.read_csv('car_fuel_efficiency.csv')

In [None]:
df.columns = df.columns.str.lower().str.replace(' ', '_')

In [None]:
df.head()

In [None]:
strings = list(df.dtypes[df.dtypes == 'object'].index)
strings

In [None]:
for col in strings:
    df[col] = df[col].str.lower().str.replace(' ', '_')

In [None]:
df.dtypes

### Question 1. Missing values



In [None]:
df.isnull().sum()

### Question 2. Median for horsepower


In [None]:
df['horsepower'].median()

### Question 3. Filling NAs



### Prepare and split the dataset

* Shuffle the dataset (the filtered one you created above), use seed `42`.
* Split your data in train/val/test sets, with 60%/20%/20% distribution.

Use the same code as in the lectures


### Question 3

* We need to deal with missing values for the column from Q1.
* We have two options: fill it with 0 or with the mean of this variable.
* Try both options. For each, train a linear regression model without regularization using the code from the lessons.
* To compute the mean, use the training set only!
* Use the validation dataset to evaluate the models and compare the RMSE of each option.
* Round the RMSE scores to 2 decimal digits using `round(score, 2)`
* Which option gives better RMSE?

Options:

- With 0
- With mean
- Both are equally good


In [None]:
n = len(df)

n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

In [None]:
n

In [None]:
n_val, n_test, n_train

In [None]:
idx = np.arange(n)

In [None]:
np.random.seed(42)
np.random.shuffle(idx)

In [None]:
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

In [None]:
X = df.drop(['fuel_efficiency_mpg', 'origin', 'fuel_type', 'drivetrain'], axis=1)
y = df.fuel_efficiency_mpg.values

In [None]:
X_train = X.iloc[train_idx].reset_index(drop=True)
X_val = X.iloc[val_idx].reset_index(drop=True)
X_test = X.iloc[test_idx].reset_index(drop=True)

y_train = y[train_idx]
y_val = y[val_idx]
y_test = y[test_idx]

# Option A: fill missing values with 0
X_train_zero = X_train.fillna(0)
X_val_zero = X_val.fillna(0)

model_zero = LinearRegression()
model_zero.fit(X_train_zero, y_train)
y_pred_zero = model_zero.predict(X_val_zero)
rmse_zero = float(np.sqrt(((y_val - y_pred_zero) ** 2).mean()))
rmse_zero_round = round(rmse_zero, 2)

# Option B: fill missing values with mean (computed from training set)
train_mean = X_train.mean()
X_train_mean = X_train.fillna(train_mean)
X_val_mean = X_val.fillna(train_mean)

model_mean = LinearRegression()
model_mean.fit(X_train_mean, y_train)
y_pred_mean = model_mean.predict(X_val_mean)
rmse_mean = float(np.sqrt(((y_val - y_pred_mean) ** 2).mean()))
rmse_mean_round = round(rmse_mean, 2)

print("RMSE (fill 0):", rmse_zero_round)
print("RMSE (fill mean):", rmse_mean_round)



In [None]:
# Question 4: regularized linear regression (Ridge)
rs = [0, 0.01, 0.1, 1, 5, 10, 100]

scores = {}
for r in rs:
    model_r = Ridge(alpha=r)
    model_r.fit(X_train_zero, y_train)
    y_pred_r = model_r.predict(X_val_zero)
    rmse_r = float(np.sqrt(((y_val - y_pred_r) ** 2).mean()))
    # Question 4 asks for RMSE rounded to 2 decimal digits
    scores[r] = round(rmse_r, 2)

print("RMSE (rounded) for each r:", scores)

# select best r (if ties, choose smallest r)
best_rmse = min(scores.values())
best_rs = [r for r, s in scores.items() if s == best_rmse]
best_r = min(best_rs)
print("Best r:", best_r)



### Question 4

* Now let's train a regularized linear regression.
* For this question, fill the NAs with 0. 
* Try different values of `r` from this list: `[0, 0.01, 0.1, 1, 5, 10, 100]`.
* Use RMSE to evaluate the model on the validation dataset.
* Round the RMSE scores to 2 decimal digits.
* Which `r` gives the best RMSE?

If multiple options give the same best RMSE, select the smallest `r`.

Options:

- 0
- 0.01
- 1
- 10
- 100




### Question 5 

* We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
* Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
* For each seed, do the train/validation/test split with 60%/20%/20% distribution.
* Fill the missing values with 0 and train a model without regularization.
* For each seed, evaluate the model on the validation dataset and collect the RMSE scores. 
* What's the standard deviation of all the scores? To compute the standard deviation, use `np.std`.
* Round the result to 3 decimal digits (`round(std, 3)`)

What's the value of std?

- 0.001
- 0.006
- 0.060
- 0.600

> Note: Standard deviation shows how different the values are.
> If it's low, then all values are approximately the same.
> If it's high, the values are different. 
> If standard deviation of scores is low, then our model is *stable*.
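
The seed experiment described above can be sketched as follows. This is a minimal sketch under stated assumptions: the helper `rmse_for_seed` and the synthetic `demo` frame are hypothetical stand-ins so the block runs on its own. With the real data, you would pass the numeric columns of `df` and use `fuel_efficiency_mpg` as the target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def rmse_for_seed(data, target, seed):
    """Shuffle with `seed`, split 60/20/20, fill NAs with 0, return validation RMSE."""
    n = len(data)
    n_val = int(n * 0.2)
    n_test = int(n * 0.2)
    n_train = n - n_val - n_test

    idx = np.arange(n)
    np.random.seed(seed)
    np.random.shuffle(idx)

    train = data.iloc[idx[:n_train]]
    val = data.iloc[idx[n_train:n_train + n_val]]

    X_train = train.drop(columns=[target]).fillna(0)
    X_val = val.drop(columns=[target]).fillna(0)

    model = LinearRegression().fit(X_train, train[target].values)
    y_pred = model.predict(X_val)
    return float(np.sqrt(((val[target].values - y_pred) ** 2).mean()))


# Hypothetical synthetic data standing in for the numeric columns of `df`
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
})
demo["y"] = 3 * demo["x1"] - 2 * demo["x2"] + rng.normal(scale=0.5, size=200)
demo.loc[demo.index[::10], "x2"] = np.nan  # inject some missing values

seed_scores = [rmse_for_seed(demo, "y", seed) for seed in range(10)]
std = round(float(np.std(seed_scores)), 3)
print(seed_scores)
print("std:", std)
```

Wrapping the split in a function keeps each seed's run independent, so no state leaks from one split to the next.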



### Question 6

* Split the dataset like previously, use seed 9.
* Combine train and validation datasets.
* Fill the missing values with 0 and train a model with `r=0.001`. 
* What's the RMSE on the test dataset?

Options:

- 0.15
- 0.515
- 5.15
- 51.5
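
The steps for this question can be sketched as below. It is a sketch only: the synthetic `demo` frame is a hypothetical stand-in for the numeric columns of `df`, and it assumes that scikit-learn's `Ridge(alpha=...)` plays the role of `r`, as in the Question 4 cell of this notebook. The lecture's own regularized solver may give a slightly different number.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Hypothetical synthetic data standing in for the numeric columns of `df`
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
})
demo["y"] = 3 * demo["x1"] - 2 * demo["x2"] + rng.normal(scale=0.5, size=200)
demo.loc[demo.index[::10], "x2"] = np.nan  # inject some missing values

# 60/20/20 split with seed 9
n = len(demo)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

idx = np.arange(n)
np.random.seed(9)
np.random.shuffle(idx)

train = demo.iloc[idx[:n_train]]
val = demo.iloc[idx[n_train:n_train + n_val]]
test = demo.iloc[idx[n_train + n_val:]]

# Combine train and validation into one training set
full_train = pd.concat([train, val]).reset_index(drop=True)

X_full = full_train.drop(columns=["y"]).fillna(0)
X_test = test.drop(columns=["y"]).fillna(0)

# Assumption: alpha stands in for r=0.001
model = Ridge(alpha=0.001).fit(X_full, full_train["y"].values)
y_pred = model.predict(X_test)
rmse_test = float(np.sqrt(((test["y"].values - y_pred) ** 2).mean()))
print("Test RMSE:", round(rmse_test, 3))
```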

## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2025/homework/hw02
* If your answer doesn't match the options exactly, select the closest one.