# 1. Info

This notebook contains all the code needed to solve the week 3 homework of the Machine Learning Zoomcamp.

## Import the required libraries

In [41]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split # SplitData
from sklearn.metrics import mutual_info_score #MI
from sklearn.preprocessing import OneHotEncoder # OneHotEncoding
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

## Getting the data

 
For this homework, we'll use the Car price dataset. Download it from the URL in the `wget` command below.

In [2]:
# !wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv

We'll keep working with the MSRP variable, and we'll transform it to a classification task.

## Features

For the rest of the homework, you'll need to use only these columns:

* Make
* Model
* Year
* Engine HP
* Engine Cylinders
* Transmission Type
* Vehicle Style
* highway MPG
* city mpg

## Data preparation

Select only the features listed above and transform their names using the following line:

__data.columns = data.columns.str.replace(' ', '_').str.lower()__

Fill in the missing values of the selected features with 0.

Rename MSRP variable to price.

In [3]:
data = pd.read_csv('./data.csv')

In [4]:
data.columns = data.columns.str.replace(' ','_').str.lower()

In [5]:
df = data[['make','model','year','engine_hp','engine_cylinders','transmission_type','vehicle_style','highway_mpg','city_mpg','msrp']].copy()

In [6]:
df = df.rename(columns={'msrp':'price'})

In [7]:
df.fillna(0, inplace=True)

# Question 1

What is the most frequent observation (mode) for the column transmission_type?

* AUTOMATIC
* MANUAL
* AUTOMATED_MANUAL
* DIRECT_DRIVE

In [8]:
df['transmission_type'].value_counts()

AUTOMATIC           8266
MANUAL              2935
AUTOMATED_MANUAL     626
DIRECT_DRIVE          68
UNKNOWN               19
Name: transmission_type, dtype: int64

The most frequent observation is __AUTOMATIC__.

# Question 2

Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.

What are the two features that have the biggest correlation in this dataset?

* engine_hp and year
* engine_hp and engine_cylinders
* highway_mpg and engine_cylinders
* highway_mpg and city_mpg

Make price binary

* Now we need to turn the price variable from numeric into a binary format.
* Let's create a variable above_average which is 1 if the price is above its mean value and 0 otherwise.

Split the data

* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
* Make sure that the target value (price) is not in your dataframe.

In [29]:
numerical_columns = ['year','engine_hp','engine_cylinders','highway_mpg','city_mpg']

In [10]:
df[numerical_columns].corr()

| | year | engine_hp | engine_cylinders | highway_mpg | city_mpg |
|---|---|---|---|---|---|
| year | 1.0 | 0.338714 | -0.040708 | 0.25824 | 0.198171 |
| engine_hp | 0.338714 | 1.0 | 0.774851 | -0.415707 | -0.424918 |
| engine_cylinders | -0.040708 | 0.774851 | 1.0 | -0.614541 | -0.587306 |
| highway_mpg | 0.25824 | -0.415707 | -0.614541 | 1.0 | 0.886829 |
| city_mpg | 0.198171 | -0.424918 | -0.587306 | 0.886829 | 1.0 |


The two features with the highest correlation (0.886829) are __highway_mpg and city_mpg__.
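Reading the largest off-diagonal value by eye works for five features, but it can also be done programmatically. The following is a minimal sketch, assuming a purely numerical dataframe; the helper name `top_correlated_pair` and the toy data are illustrative and not part of the homework.

```python
import numpy as np
import pandas as pd


def top_correlated_pair(frame: pd.DataFrame):
    """Return (column_a, column_b, |correlation|) for the most correlated pair."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle: this drops the diagonal (always 1.0)
    # and the mirrored duplicates below it.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    col_a, col_b = pairs.idxmax()
    return col_a, col_b, float(pairs.max())


# Hypothetical toy data, used only to exercise the helper.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})
col_a, col_b, value = top_correlated_pair(toy)
print(col_a, col_b, round(value, 3))
```

In this notebook the same helper could be called as `top_correlated_pair(df[numerical_columns])`. It ranks pairs by absolute value, so a strong negative correlation would also be reported.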

In [11]:
# Make price binary
df['above_average'] = (df['price'] > np.mean(df['price'])).astype(int)

In [12]:
# split the data
seed = 42

df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=seed)
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=seed)

In [13]:
df_val.shape[0], df_train.shape[0], df_test.shape[0]

(2383, 7148, 2383)
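The 60%/20%/20% distribution comes from two consecutive splits: 20% is held out for testing first, and then 25% of the remaining 80% (0.25 × 0.8 = 0.2 of the whole) becomes the validation set. A minimal sketch on 100 placeholder rows (the `rows` array is illustrative, not the homework data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)  # placeholder data: 100 row identifiers

# First split: hold out 20% of everything for testing.
rows_full_train, rows_test = train_test_split(rows, test_size=0.2, random_state=42)
# Second split: 25% of the remaining 80% is 20% of the original data.
rows_train, rows_val = train_test_split(rows_full_train, test_size=0.25, random_state=42)

sizes = (len(rows_train), len(rows_val), len(rows_test))
print(sizes)
```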

In [14]:
y_train = df_train['above_average']
y_val = df_val['above_average']
y_test = df_test['above_average']

del df_train['above_average']
del df_val['above_average']
del df_test['above_average']
del df_train['price']
del df_val['price']
del df_test['price']

In [15]:
df_val.head(2)

| | make | model | year | engine_hp | engine_cylinders | transmission_type | vehicle_style | highway_mpg | city_mpg |
|---|---|---|---|---|---|---|---|---|---|
| 1918 | Volkswagen | Beetle | 2015 | 210.0 | 4.0 | MANUAL | 2dr Hatchback | 31 | 23 |
| 9951 | Audi | SQ5 | 2015 | 354.0 | 6.0 | AUTOMATIC | 4dr SUV | 24 | 17 |


# Question 3

* Calculate the mutual information score between above_average and other categorical variables in our dataset. Use the training set only.
* Round the scores to 2 decimals using round(score, 2).

Which of these variables has the lowest mutual information score?

* make
* model
* transmission_type
* vehicle_style

In [30]:
categorical_variables = ['make','model','transmission_type','vehicle_style']

def calculate_mi(series):
    return mutual_info_score(series, y_train)

df_mi = df_train[categorical_variables].apply(calculate_mi).round(2)
df_mi = df_mi.sort_values(ascending=False).to_frame(name='MI')

In [17]:
df_mi

| | MI |
|---|---|
| model | 0.46 |
| make | 0.24 |
| vehicle_style | 0.08 |
| transmission_type | 0.02 |


The variable with the lowest mutual information score is __transmission_type__ (0.02).
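To build some intuition for what the score measures, here is a minimal sketch on made-up labels (all three lists are illustrative). A feature that fully determines a balanced binary target gets a score of ln 2 ≈ 0.693, because `mutual_info_score` uses natural logarithms, while a feature that is independent of the target gets a score of 0.

```python
import math

from sklearn.metrics import mutual_info_score

# Made-up example: a balanced binary target and two categorical features.
target = [0, 0, 1, 1, 0, 0, 1, 1]
informative = ["a", "a", "b", "b", "a", "a", "b", "b"]    # fully determines the target
uninformative = ["x", "y", "x", "y", "x", "y", "x", "y"]  # independent of the target

mi_informative = mutual_info_score(informative, target)
mi_uninformative = mutual_info_score(uninformative, target)

print(round(mi_informative, 3), round(mi_uninformative, 3), round(math.log(2), 3))
```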

# Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.

    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)

* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

* 0.60
* 0.72
* 0.84
* 0.95

In [18]:
dv = DictVectorizer(sparse=False)

train_dict = df_train[categorical_variables + numerical_columns].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val[categorical_variables + numerical_columns].to_dict(orient='records')
X_val = dv.transform(val_dict)

In [19]:
# initialize and train the Logistic Regression model
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

In [20]:
# make predictions
y_pred = model.predict(X_val)

# accuracy
accuracy = accuracy_score(y_val, y_pred)
print(f"accuracy = {accuracy.round(2)}")

accuracy = 0.93


# Question 5

* Let's find the least useful feature using the feature elimination technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?

* year
* engine_hp
* transmission_type
* city_mpg

Note: the difference doesn't have to be positive



In [33]:
for feature_col in ['year','engine_hp','transmission_type','city_mpg']:

    # keep every feature except the one being evaluated
    features = [col for col in categorical_variables + numerical_columns if col != feature_col]

    # encoding
    dv = DictVectorizer(sparse=False)

    train_dict = df_train[features].to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)
    val_dict = df_val[features].to_dict(orient='records')
    X_val = dv.transform(val_dict)

    # initialize and train the Logistic Regression model
    model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    # make predictions
    y_pred = model.predict(X_val)

    # accuracy
    new_accuracy = accuracy_score(y_val, y_pred)
    print(f"accuracy = {new_accuracy.round(5)} without {feature_col}, difference with previous = {abs((accuracy-new_accuracy).round(5))}")

accuracy = 0.94713 without year, difference with previous = 0.01259
accuracy = 0.93034 without engine_hp, difference with previous = 0.0042
accuracy = 0.94587 without transmission_type, difference with previous = 0.01133
accuracy = 0.94629 without city_mpg, difference with previous = 0.01175


The smallest difference is 0.0042, obtained when __engine_hp__ is excluded.
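The elimination loop can also be wrapped in a reusable helper. The sketch below is one possible arrangement, not the notebook's own code: the function names and the synthetic data are assumptions, and the differences are kept signed (baseline minus accuracy without the feature), so take the absolute value when ranking.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def fit_and_score(train_df, y_tr, val_df, y_va, features):
    """Train on the given features and return the validation accuracy."""
    vec = DictVectorizer(sparse=False)
    X_tr = vec.fit_transform(train_df[features].to_dict(orient="records"))
    X_va = vec.transform(val_df[features].to_dict(orient="records"))
    clf = LogisticRegression(solver="liblinear", C=10, max_iter=1000, random_state=42)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_va, clf.predict(X_va))


def elimination_report(train_df, y_tr, val_df, y_va, features):
    """Return the baseline accuracy and the signed difference per dropped feature."""
    baseline = fit_and_score(train_df, y_tr, val_df, y_va, features)
    diffs = {}
    for name in features:
        kept = [f for f in features if f != name]
        diffs[name] = baseline - fit_and_score(train_df, y_tr, val_df, y_va, kept)
    return baseline, diffs


# Synthetic data (an assumption for illustration): only "signal" drives the label.
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "kind": rng.choice(["a", "b"], size=n),
})
labels = (demo["signal"] > 0).astype(int)
train_demo, val_demo = demo.iloc[:300], demo.iloc[300:]
y_tr_demo, y_va_demo = labels.iloc[:300], labels.iloc[300:]

baseline, diffs = elimination_report(
    train_demo, y_tr_demo, val_demo, y_va_demo, ["signal", "noise", "kind"]
)
print(round(baseline, 3), {name: round(d, 3) for name, d in diffs.items()})
```

On this synthetic data, dropping `signal` should cost far more accuracy than dropping either of the other two columns, which is the pattern the technique is meant to expose.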

# Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column price. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data with a solver 'sag'. Set the seed to 42.
* This model also has a parameter alpha. Let's try the following values: [0, 0.01, 0.1, 1, 10].
* Round your RMSE scores to 3 decimal digits.

Which of these alphas leads to the best RMSE on the validation set?

* 0
* 0.01
* 0.1
* 1
* 10

Note: If there are multiple options, select the smallest alpha.

In [44]:
# get the data
df_6 = df[['make','model','year','engine_hp','engine_cylinders','transmission_type','vehicle_style','highway_mpg','city_mpg','price']].copy()

In [45]:
df_full_train, df_test = train_test_split(df_6, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

df_full_train = df_full_train.reset_index(drop=True)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [46]:
y_train = np.log1p(df_train['price'])
y_val = np.log1p(df_val['price'])
y_test = np.log1p(df_test['price'])

del df_train['price']
del df_val['price']
del df_test['price']
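`np.log1p(x)` computes log(1 + x), which stays defined when a price is 0, and `np.expm1` is its inverse. Predictions made on the transformed target therefore need `np.expm1` to be read as prices again. A minimal sketch with made-up prices:

```python
import numpy as np

# Made-up prices, including a zero to show that log1p is defined there.
prices = np.array([0.0, 2000.0, 29450.0, 2065902.0])

log_prices = np.log1p(prices)    # log(1 + price)
restored = np.expm1(log_prices)  # exp(value) - 1, the inverse transform

print(log_prices.round(3))
print(restored.round(2))
```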

In [47]:
# Vectorize
dv = DictVectorizer(sparse=False)

train_dicts = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

val_dicts = df_val.to_dict(orient='records')
X_val = dv.transform(val_dicts)

In [48]:
# suppress warnings in this case only, because the warning raised below is not critical
import warnings
warnings.filterwarnings("ignore")

In [49]:
train_dicts = df_train.to_dict(orient='records')
val_dicts = df_val.to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)
X_val = dv.transform(val_dicts)
alphas = [0, 0.01, 0.1, 1, 10]

for a in alphas:
    model = Ridge(solver='sag', alpha=a, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    # take the square root explicitly so the code does not depend on the `squared` argument
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    print(f"alpha = {a}, RMSE = {round(rmse, 3)}")

alpha = 0, RMSE = 0.487
alpha = 0.01, RMSE = 0.487
alpha = 0.1, RMSE = 0.487
alpha = 1, RMSE = 0.487
alpha = 10, RMSE = 0.487


All alphas give the same RMSE (0.487) after rounding, so the smallest alpha, 0, is selected.
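For reference, RMSE is the square root of the mean of the squared errors. The sketch below computes it by hand and through scikit-learn on three made-up values; the helper name `rmse_score` is an assumption, not a library function.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def rmse_score(y_true, y_hat):
    """Root mean squared error, built on scikit-learn's mean_squared_error."""
    return float(np.sqrt(mean_squared_error(y_true, y_hat)))


# Made-up values: the errors are 0, 0 and 2, so MSE = 4/3.
y_true_demo = np.array([1.0, 2.0, 3.0])
y_hat_demo = np.array([1.0, 2.0, 5.0])

by_hand = float(np.sqrt(np.mean((y_true_demo - y_hat_demo) ** 2)))
via_sklearn = rmse_score(y_true_demo, y_hat_demo)

print(round(by_hand, 3), round(via_sklearn, 3))
```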