0. Preparation
Before running any analyses, we prepare the dataset for the subsequent modelling.

1. Load the Auto dataset into R or Python.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt

# Read the raw data; missing values are coded as '?' and are handled in step 4
Auto = pd.read_csv("Auto.csv")
Auto

2. Drop all variables except the (potential) predictors ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year'] and the target variable 'mpg'.

In [None]:
# Keep only the predictors and the outcome variable
predictors = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']
target = 'mpg'
# .copy() gives an independent DataFrame rather than a view of Auto
auto = Auto[predictors + [target]].copy()
auto

3. Split the dataset into a training set (80%) and a validation set (20%). It is probably a good idea to set a random seed and shuffle the dataset prior to this.

In [None]:
# Split the dataset into training (80%) and validation (20%) sets, shuffling with a fixed seed
train_data, val_data = train_test_split(auto, test_size=0.2, shuffle=True, random_state=100)
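
As a small check of why the seed matters, the sketch below splits a toy DataFrame twice with the same `random_state` and confirms that both calls select the same validation rows. The `toy` frame and its column `x` are made up for illustration and are not part of the Auto data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 10 rows
toy = pd.DataFrame({"x": range(10)})

# Two splits with the same seed
a_train, a_val = train_test_split(toy, test_size=0.2, shuffle=True, random_state=100)
b_train, b_val = train_test_split(toy, test_size=0.2, shuffle=True, random_state=100)

print(len(a_train), len(a_val))         # 8 2
print(a_val.index.equals(b_val.index))  # True: same rows end up in validation
```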

4. Replace missing values (coded as '?') in both datasets with the mean of the given variable in the training set.

In [None]:

# Optional check: count the '?' placeholders in each split
# print((train_data == '?').sum())
# print((val_data == '?').sum())

# Work on explicit copies so the assignments below do not modify slices of `auto`
train_data = train_data.copy()
val_data = val_data.copy()

# Convert every column to numeric; '?' cannot be parsed and becomes NaN
for col in predictors + [target]:
    train_data[col] = pd.to_numeric(train_data[col], errors='coerce')
    val_data[col] = pd.to_numeric(val_data[col], errors='coerce')

# Fill the missing predictor values in both sets with the training-set mean
for col in predictors:
    mean_val = train_data[col].mean()
    train_data[col] = train_data[col].fillna(mean_val)
    val_data[col] = val_data[col].fillna(mean_val)
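
To make the mean-fill rule concrete, here is a minimal sketch on toy data. The frames `toy_train` and `toy_val` and their values are invented for illustration; only the pattern (convert '?' to NaN, then fill both sets with the training mean) mirrors the step above.

```python
import pandas as pd

# Hypothetical toy data: '?' marks a missing horsepower entry
toy_train = pd.DataFrame({"horsepower": ["100", "?", "140"]})
toy_val = pd.DataFrame({"horsepower": ["?", "90"]})

# '?' cannot be parsed and becomes NaN; numeric strings become floats
toy_train["horsepower"] = pd.to_numeric(toy_train["horsepower"], errors="coerce")
toy_val["horsepower"] = pd.to_numeric(toy_val["horsepower"], errors="coerce")

# The fill value is computed from the training rows only
train_mean = toy_train["horsepower"].mean()  # (100 + 140) / 2 = 120
toy_train["horsepower"] = toy_train["horsepower"].fillna(train_mean)
toy_val["horsepower"] = toy_val["horsepower"].fillna(train_mean)

print(toy_train["horsepower"].tolist())  # [100.0, 120.0, 140.0]
print(toy_val["horsepower"].tolist())    # [120.0, 90.0]
```

Note that the missing validation value is filled with 120, the training mean, and not with 90, the only observed validation value.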


5. Standardize the predictors in the training set using z-score standardization.

In [None]:
# Save the raw training means and standard deviations; step 6 reuses them for the validation set
train_means = {col: np.mean(train_data[col]) for col in predictors}
train_stds = {col: np.std(train_data[col]) for col in predictors}

for col in predictors:
    print(f'{col}: {train_means[col]:.2f} ({train_stds[col]:.2f})')

# z-score standardization
for col in predictors:
    train_data[col] = (train_data[col] - train_means[col]) / train_stds[col]

for col in predictors:
    print(f'{col} after: {np.mean(train_data[col]):.2f} ({np.std(train_data[col]):.2f})')

cylinders: 5.43 (1.71)
displacement: 191.34 (104.02)
horsepower: 104.22 (38.47)
weight: 2949.35 (847.54)
acceleration: 15.45 (2.72)
year: 76.00 (3.71)
cylinders after: 0.00 (1.00)
displacement after: -0.00 (1.00)
horsepower after: 0.00 (1.00)
weight after: 0.00 (1.00)
acceleration after: 0.00 (1.00)
year after: 0.00 (1.00)


6. Standardize the predictors in validation set based on the means and standard deviations from the training set.

In [65]:
# Make sure the validation predictors are numeric (a no-op if they already are)
for col in predictors:
    val_data[col] = pd.to_numeric(val_data[col], errors='coerce')

for col in predictors:
    print(f'{col}: {np.mean(val_data[col]):.2f} ({np.std(val_data[col]):.2f})')

# z-score standardization of the validation rows with the training statistics saved in step 5
for col in predictors:
    val_data[col] = (val_data[col] - train_means[col]) / train_stds[col]

# The validation means and SDs will be close to, but not exactly, 0 and 1
for col in predictors:
    print(f'{col} after: {np.mean(val_data[col]):.2f} ({np.std(val_data[col]):.2f})')
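
The same fit-on-training, apply-to-validation pattern can also be sketched with scikit-learn's `StandardScaler`, which is imported at the top of the notebook. The arrays `X_train` and `X_val` below are toy values made up for illustration. `StandardScaler` divides by the population standard deviation (`ddof=0`), which matches `np.std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 3 training rows, 1 validation row, 2 predictors
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_val = np.array([[2.0, 40.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learns means and SDs from the training rows
X_val_std = scaler.transform(X_val)          # reuses the training means and SDs

print(scaler.mean_)               # training means: [ 2. 20.]
print(X_train_std.mean(axis=0))   # approximately 0 for each column
print(X_val_std)                  # validation row on the training scale
```

The key point is that `fit_transform` is called only on the training data, while the validation data only ever sees `transform`.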


7. Reflection: Discuss briefly why it is a good idea (or even necessary?) to standardize the variables before fitting the LASSO models in assignment 2. Why do we mean-fill and standardize the validation set based on information from the training set?

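Standardizing is necessary because the LASSO penalty is applied to the coefficients directly, and the size of a coefficient depends on the unit of its predictor. A predictor measured on a large scale (such as weight) needs only a small coefficient to have the same effect as a predictor measured on a small scale (such as cylinders), so without standardization it is penalized less. Which variables get shrunk towards zero would then depend on their units rather than on their predictive value. After standardization all coefficients are on the same scale and the penalty treats them equally. We mean-fill and standardize the validation set with the training-set means and standard deviations because the validation set stands in for unseen data: using its own statistics would leak information about it into the preprocessing, and would put the validation predictors on a different scale from the one the model was fitted on.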
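
The scale dependence of the penalty can be illustrated with a small simulation. This is a sketch under assumptions, not part of the assignment: the data are synthetic, the two predictors are constructed to have the same effect per standard deviation, and `alpha = 0.5` is an arbitrary choice. On the raw data one would expect the small-scale predictor to be shrunk much more than the large-scale one, while after standardization both should be shrunk by a similar amount.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)         # small-scale predictor (SD about 1)
x2 = 1000 * rng.normal(size=n)  # large-scale predictor (SD about 1000)

# Both predictors have the same effect per standard deviation
y = 1.0 * x1 + 1.0 * (x2 / 1000) + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2])

alpha = 0.5  # arbitrary penalty strength for illustration
raw_fit = Lasso(alpha=alpha).fit(X, y)
X_std = StandardScaler().fit_transform(X)
std_fit = Lasso(alpha=alpha).fit(X_std, y)

# Express both fits as "effect per standard deviation" so they are comparable
raw_effect = raw_fit.coef_ * X.std(axis=0)
std_effect = std_fit.coef_
print("raw predictors:         ", np.round(raw_effect, 2))
print("standardized predictors:", np.round(std_effect, 2))
```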