# ML Zoomcamp 2023 - Homework #3

Name: Wong Chee Fah

Email: wongcheefah@gmail.com

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import accuracy_score, mean_squared_error

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv')
len(df)

11914

In [3]:
df.head()

| | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
| 2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |
| 3 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 | 29450 |
| 4 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury | Compact | Convertible | 28 | 18 | 3916 | 34500 |


### Selected features

For this homework, we will use only these columns:

* `Make`,
* `Model`,
* `Year`,
* `Engine HP`,
* `Engine Cylinders`,
* `Transmission Type`,
* `Vehicle Style`,
* `highway MPG`,
* `city mpg`,
* `MSRP`

In [4]:
df.drop(['Engine Fuel Type', 'Driven_Wheels', 'Number of Doors', 'Market Category', 'Vehicle Size', 'Popularity'], axis=1, inplace=True)

In [5]:
df.columns = df.columns.str.replace(' ', '_').str.lower()

In [6]:
df.head()

| | make | model | year | engine_hp | engine_cylinders | transmission_type | vehicle_style | highway_mpg | city_mpg | msrp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | 335.0 | 6.0 | MANUAL | Coupe | 26 | 19 | 46135 |
| 1 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Convertible | 28 | 19 | 40650 |
| 2 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Coupe | 28 | 20 | 36350 |
| 3 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Coupe | 28 | 18 | 29450 |
| 4 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Convertible | 28 | 18 | 34500 |


### Preparing the dataset

* Fill in the missing values of the selected features with 0.
* Rename `MSRP` variable to `price`.
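
The two preparation steps can be sketched on a toy frame first. The values below are made up for illustration; only the column names mirror the notebook, and `toy` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the notebook.
toy = pd.DataFrame({
    'engine_hp': [300.0, np.nan, 230.0],
    'engine_cylinders': [6.0, 4.0, np.nan],
    'msrp': [46135, 40650, 36350],
})

# Step 1: replace every missing value with 0.
toy = toy.fillna(0)

# Step 2: rename the target column.
toy = toy.rename(columns={'msrp': 'price'})

print(toy.isnull().sum().sum())  # total number of missing values left
print(list(toy.columns))
```

Returning new frames instead of using `inplace=True` is a style choice here; both approaches give the same result on a standalone frame.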

In [7]:
df.isnull().sum()

make                  0
model                 0
year                  0
engine_hp            69
engine_cylinders     30
transmission_type     0
vehicle_style         0
highway_mpg           0
city_mpg              0
msrp                  0
dtype: int64

In [8]:
df.fillna(0, inplace=True)

In [9]:
df.isnull().sum()

make                 0
model                0
year                 0
engine_hp            0
engine_cylinders     0
transmission_type    0
vehicle_style        0
highway_mpg          0
city_mpg             0
msrp                 0
dtype: int64

In [10]:
df.rename(columns={'msrp': 'price'}, inplace=True)

In [11]:
df.nunique()

make                   48
model                 915
year                   28
engine_hp             357
engine_cylinders        9
transmission_type       5
vehicle_style          16
highway_mpg            59
city_mpg               69
price                6049
dtype: int64

### Exploratory data analysis

### Question 1

##### What is the most frequent observation (mode) for the column `transmission_type`?
- `AUTOMATIC`
- `MANUAL`
- `AUTOMATED_MANUAL`
- `DIRECT_DRIVE`

In [12]:
df['transmission_type'].value_counts(ascending=False)

transmission_type
AUTOMATIC           8266
MANUAL              2935
AUTOMATED_MANUAL     626
DIRECT_DRIVE          68
UNKNOWN               19
Name: count, dtype: int64

##### Ans: `AUTOMATIC`

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.

What are the two features that have the biggest correlation in this dataset?
- `engine_hp` and `year`
- `engine_hp` and `engine_cylinders`
- `highway_mpg` and `engine_cylinders`
- `highway_mpg` and `city_mpg`

In [13]:
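# Note: `year` and `engine_cylinders` are also numeric, but they are left out of this list,
# so the matrix below covers three features and the Q3 code treats those two as categorical.
numerical = ['engine_hp', 'highway_mpg', 'city_mpg']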

df[numerical].corr()

| | engine_hp | highway_mpg | city_mpg |
|---|---|---|---|
| engine_hp | 1.0 | -0.415707 | -0.424918 |
| highway_mpg | -0.415707 | 1.0 | 0.886829 |
| city_mpg | -0.424918 | 0.886829 | 1.0 |


##### Ans: `highway_mpg` and `city_mpg`
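
The strongest pair can also be picked out programmatically rather than by eye. The sketch below rebuilds the matrix from the values printed above and ranks the off-diagonal entries by absolute value; the names `corr`, `mask`, `pairs`, and `top_pair` are illustrative.

```python
import numpy as np
import pandas as pd

# Correlation values copied from the output above.
names = ['engine_hp', 'highway_mpg', 'city_mpg']
corr = pd.DataFrame(
    [[1.0, -0.415707, -0.424918],
     [-0.415707, 1.0, 0.886829],
     [-0.424918, 0.886829, 1.0]],
    index=names,
    columns=names,
)

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (always 1.0) is ignored.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank by absolute value, since a strong negative correlation counts too.
top_pair = pairs.abs().idxmax()
print(top_pair, pairs[top_pair])
```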

### Make `price` binary

* Now we need to turn the `price` variable from numeric into a binary format.
* Let's create a variable `above_average` which is `1` if the `price` is above its mean value and `0` otherwise.

In [14]:
df['above_average'] = (df['price'] > df['price'].mean()).astype(int)

### Split the data

* Split the data into train/val/test sets with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value (`above_average`) is not in your dataframe.

In [15]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

In [16]:
y_train = df_train['above_average']
y_val = df_val['above_average']
y_test = df_test['above_average']

price_train = df_train['price']
price_val = df_val['price']
price_test = df_test['price']

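# Reassign instead of dropping in place: the frames returned by train_test_split
# are derived from `df`, and in-place edits on them can raise SettingWithCopyWarning.
df_train = df_train.drop(columns=['price', 'above_average'])
df_val = df_val.drop(columns=['price', 'above_average'])
df_test = df_test.drop(columns=['price', 'above_average'])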

In [17]:
df_train.shape, df_val.shape, df_test.shape

((7148, 9), (2383, 9), (2383, 9))
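
The second call uses `test_size=0.25` because a quarter of the remaining 80% equals 20% of the full dataset. A quick check on row indices alone, assuming the 11914 rows reported by `len(df)` above (`idx` is a stand-in for the dataframe):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 11914            # number of rows reported by len(df) above
idx = np.arange(n_rows)   # stand-in for the dataframe rows

# Same two-step split as in the notebook.
full_train, test = train_test_split(idx, test_size=0.2, random_state=42)
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

sizes = (len(train), len(val), len(test))
print(sizes)
print([round(s / n_rows, 3) for s in sizes])  # fraction of the data in each part
```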

### Question 3

* Calculate the mutual information score between `above_average` and other categorical variables in our dataset. 
  Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which variable has the lowest mutual information score?
- `make`
- `model`
- `transmission_type`
- `vehicle_style`

In [18]:
def mutual_info_abv_avg_score(col):
    return mutual_info_score(col, y_train)

In [19]:
all_features = df_train.columns
categorical = list(filter(lambda i: i not in numerical, all_features))

mi = df_train[categorical].apply(mutual_info_abv_avg_score)
print('Mutual information between above_average and')
mi.sort_values()

Mutual information between above_average and


transmission_type    0.020958
year                 0.071544
vehicle_style        0.084143
engine_cylinders     0.115903
make                 0.239769
model                0.462344
dtype: float64

##### Ans: `transmission_type`
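
To build intuition for what the score measures, here is a small sketch with made-up labels: one feature that determines the target exactly and one that is independent of it. The variable names and values are illustrative only.

```python
from sklearn.metrics import mutual_info_score

target = [1, 1, 0, 0]

# A feature that determines the target exactly shares all of its information.
informative = ['a', 'a', 'b', 'b']
# A feature whose values are spread evenly across both classes shares none.
uninformative = ['x', 'y', 'x', 'y']

mi_informative = mutual_info_score(informative, target)
mi_uninformative = mutual_info_score(uninformative, target)

print(round(mi_informative, 2), round(mi_uninformative, 2))
```

`mutual_info_score` reports the value in nats, so a perfectly informative feature for a balanced binary target scores ln 2, roughly 0.69.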

### One-hot encoding

In [20]:
dv = DictVectorizer(sparse=False)

train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

test_dict = df_test.to_dict(orient='records')
X_test = dv.transform(test_dict)

### Logistic Regression

### Question 4

* Now let's train a logistic regression.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What is the accuracy?
- 0.60
- 0.72
- 0.84
- 0.95

In [21]:
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

In [22]:
model.get_params()

{'C': 10,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 1000,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 42,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [23]:
y_pred = model.predict_proba(X_val)[:, 1]
pred_abv_avg = (y_pred >= 0.5)
accuracy = (y_val == pred_abv_avg).mean()
print(round(accuracy, 2))

0.95


In [24]:
# Using sklearn.metrics
print(round(accuracy_score(y_val, model.predict(X_val)), 2))

0.95


##### Ans: 0.95
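
The manual computation and `accuracy_score` agree because accuracy is simply the fraction of hard predictions that match the labels. A minimal sketch with made-up probabilities (`y_true`, `proba`, and `y_hat` are illustrative names):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0])
proba = np.array([0.9, 0.2, 0.4, 0.7, 0.6])  # made-up predicted probabilities

# Hard predictions at the 0.5 threshold.
y_hat = (proba >= 0.5).astype(int)

# Fraction of matching predictions, computed by hand and with scikit-learn.
manual = (y_true == y_hat).mean()
library = accuracy_score(y_true, y_hat)
print(manual, library)
```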

### Feature Elimination

### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which feature has the smallest difference?
- `year`
- `engine_hp`
- `transmission_type`
- `city_mpg`

In [25]:
print(f'Accuracy with all features: {accuracy}\n')
min_diff = float('inf')
least_useful_feature = None

for feature in all_features:
    # Train on every column except the one being evaluated.
    features = [col for col in all_features if col != feature]

    train_dict = df_train[features].to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)

    val_dict = df_val[features].to_dict(orient='records')
    X_val = dv.transform(val_dict)

    model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict_proba(X_val)[:, 1]
    pred_abv_avg = (y_pred >= 0.5)
    reduced_feat_set_acc = (y_val == pred_abv_avg).mean()

    # Absolute change in validation accuracy caused by removing the feature.
    difference = abs(accuracy - reduced_feat_set_acc)
    if difference < min_diff:
        min_diff = difference
        least_useful_feature = feature

    print(f'Accuracy without {feature}: {reduced_feat_set_acc}')
    print(f'Difference: {difference}\n')

print(f'Least useful feature is {least_useful_feature}')

Accuracy with all features: 0.9454469156525388

Accuracy without make: 0.9467058329836341
Difference: 0.0012589173310952884

Accuracy without model: 0.9194292908099034
Difference: 0.026017624842635367

Accuracy without year: 0.9471254720939991
Difference: 0.0016785564414603105

Accuracy without engine_hp: 0.9227864036928242
Difference: 0.022660511959714635

Accuracy without engine_cylinders: 0.9454469156525388
Difference: 0.0

Accuracy without transmission_type: 0.9404112463281578
Difference: 0.005035669324381042

Accuracy without vehicle_style: 0.9320184641208561
Difference: 0.013428451531682706

Accuracy without highway_mpg: 0.9467058329836341
Difference: 0.0012589173310952884

Accuracy without city_mpg: 0.9458665547629039
Difference: 0.00041963911036513313

Least useful feature is engine_cylinders


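##### Ans: `city_mpg`

Overall, `engine_cylinders` has the smallest difference (0.0), but it is not one of the given choices. Among the listed options, `city_mpg` has the smallest difference (about 0.0004).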
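
The elimination loop can also be wrapped in a helper that builds a fresh vectorizer and model per feature subset, so no shared variables are overwritten. The sketch below runs on synthetic data, which is an assumption for illustration: `signal` drives the target, while `noise` and `group` do not. The helper name `accuracy_without` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def accuracy_without(feature, df_tr, y_tr, df_va, y_va):
    """Validation accuracy of a model trained without `feature` (None keeps all)."""
    cols = [c for c in df_tr.columns if c != feature]
    vec = DictVectorizer(sparse=False)  # fresh vectorizer per feature subset
    X_tr = vec.fit_transform(df_tr[cols].to_dict(orient='records'))
    X_va = vec.transform(df_va[cols].to_dict(orient='records'))
    clf = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
    clf.fit(X_tr, y_tr)
    return (clf.predict(X_va) == y_va).mean()


# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    'signal': rng.normal(size=n),
    'noise': rng.normal(size=n),
    'group': rng.choice(['a', 'b'], size=n),
})
y = (toy['signal'] > 0).astype(int).to_numpy()

# Simple positional split: first 300 rows for training, the rest for validation.
df_tr, df_va = toy.iloc[:300], toy.iloc[300:]
y_tr, y_va = y[:300], y[300:]

baseline = accuracy_without(None, df_tr, y_tr, df_va, y_va)
diffs = {f: baseline - accuracy_without(f, df_tr, y_tr, df_va, y_va)
         for f in toy.columns}
print(baseline)
print(diffs)
```

This sketch keeps the sign of each difference, whereas the notebook takes the absolute value. A negative difference means the model did slightly better without the feature, which is worth seeing before deciding which convention to rank by.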

### Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column `price`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data with a solver `'sag'`. Set the seed to `42`.
* This model also has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`.
* Round the RMSE scores to 3 decimal digits.

Which alpha value leads to the best RMSE on the validation set?

- 0
- 0.01
- 0.1
- 1
- 10

In [26]:
y_train = np.log1p(price_train)
y_val = np.log1p(price_val)
y_test = np.log1p(price_test)

In [27]:
train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

test_dict = df_test.to_dict(orient='records')
X_test = dv.transform(test_dict)

In [28]:
min_RMSE = float('inf')
min_RMSE_alpha = None
for a in [0, 0.01, 0.1, 1, 10]:
    clf = Ridge(solver='sag', alpha=a, max_iter=10000, random_state=42)
    clf.fit(X_train, y_train)

    # Take the square root of the MSE explicitly; the `squared=False` argument
    # is deprecated in newer scikit-learn releases.
    rmse = np.sqrt(mean_squared_error(y_val, clf.predict(X_val)))

    # Strict `<` keeps the smallest alpha when two values give the same RMSE.
    if rmse < min_RMSE:
        min_RMSE = rmse
        min_RMSE_alpha = a

    print(f'Alpha: {a:<6}RMSE: {round(rmse, 3)} ({rmse})')

print(f'Best RMSE ({round(min_RMSE, 3)}) occurs when alpha equals {min_RMSE_alpha}')

Alpha: 0     RMSE: 0.421 (0.42082120603369644)
Alpha: 0.01  RMSE: 0.421 (0.42082495786127977)
Alpha: 0.1   RMSE: 0.421 (0.42088347607574905)
Alpha: 1     RMSE: 0.421 (0.4214776197656962)
Alpha: 10    RMSE: 0.427 (0.42733214479855136)
Best RMSE (0.421) occurs when alpha equals 0


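##### Ans: 0

Rounded to 3 decimal digits, alpha values 0, 0.01, 0.1, and 1 all give an RMSE of 0.421. The unrounded RMSE is lowest at alpha 0, which is also the smallest of the tied values.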
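
For reference, the log transform and the RMSE metric used in this question can be sketched in isolation. The prices are the five MSRP values shown by `df.head()` earlier; the predictions and the `rmse` helper are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

prices = np.array([46135, 40650, 36350, 29450, 34500])  # MSRP values from df.head()

log_prices = np.log1p(prices)    # log(1 + price), as applied to the targets above
restored = np.expm1(log_prices)  # inverse transform back to the price scale


def rmse(y_true, y_pred):
    """Root mean squared error, computed as the square root of the MSE."""
    return np.sqrt(mean_squared_error(y_true, y_pred))


# Hypothetical predictions that are off by a constant 0.1 in log space.
preds = log_prices + 0.1
print(rmse(log_prices, preds))
```

Because the model is evaluated in log space, an RMSE of about 0.42 describes a typical error in `log1p(price)`, not in the price itself.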