# Homework 4

dataset - !wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv


In [53]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression



## Data preparation

* Keep only these columns: `Make`, `Model`, `Engine HP`, `Engine Cylinders`, `Transmission Type`, `Vehicle Style`, `highway MPG`, `city mpg`, `MSRP`
* Lowercase the column names and replace spaces with underscores
* Fill the missing values with 0
* Make the price binary (1 if above the average, 0 otherwise) - this will be our target variable `above_average`

In [20]:
df = pd.read_csv("data.csv")
df.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


In [21]:
columns = ['Make', 'Model', 'Engine HP', 'Engine Cylinders', 'Transmission Type','Vehicle Style','highway MPG','city mpg','MSRP']

df = df[columns]
df.head()

Unnamed: 0,Make,Model,Engine HP,Engine Cylinders,Transmission Type,Vehicle Style,highway MPG,city mpg,MSRP
0,BMW,1 Series M,335.0,6.0,MANUAL,Coupe,26,19,46135
1,BMW,1 Series,300.0,6.0,MANUAL,Convertible,28,19,40650
2,BMW,1 Series,300.0,6.0,MANUAL,Coupe,28,20,36350
3,BMW,1 Series,230.0,6.0,MANUAL,Coupe,28,18,29450
4,BMW,1 Series,230.0,6.0,MANUAL,Convertible,28,18,34500


In [22]:
df.columns = df.columns.str.lower().str.replace(" ",'_')

In [23]:
df = df.fillna(0)

In [30]:
mean_price = df['msrp'].mean()
# 1 if the price is strictly above the mean, 0 otherwise
df['above_average'] = np.where(df['msrp'] > mean_price, 1, 0)

del df['msrp']

In [31]:
from sklearn.model_selection import train_test_split

df_full_train, df_test = train_test_split(df, test_size= .2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size= .25, random_state=1)

df_train = df_train.reset_index(drop =True)
df_val = df_val.reset_index(drop =True)
df_test = df_test.reset_index(drop =True)

y_train = df_train.above_average.values
y_val = df_val.above_average.values
y_test = df_test.above_average.values

del df_train['above_average']
del df_val['above_average']
del df_test['above_average']
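
The two `train_test_split` calls above produce a 60%/20%/20% train/validation/test split: 20% is held out for testing first, and 25% of the remaining 80% is another 20% of the full data. A minimal sketch on a made-up 1000-row frame (`df_demo` and the `*_demo` names are hypothetical, not part of the homework) checks the arithmetic:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the prepared DataFrame: 1000 rows, one column.
df_demo = pd.DataFrame({"x": np.arange(1000)})

# Hold out 20% for testing, then take 25% of the remaining 80% for validation.
df_full_train_demo, df_test_demo = train_test_split(
    df_demo, test_size=0.2, random_state=1
)
df_train_demo, df_val_demo = train_test_split(
    df_full_train_demo, test_size=0.25, random_state=1
)

sizes = {
    "train": len(df_train_demo),
    "val": len(df_val_demo),
    "test": len(df_test_demo),
}
print(sizes)  # {'train': 600, 'val': 200, 'test': 200}
```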

# Question 1

ROC AUC could also be used to evaluate feature importance of numerical variables.

Let's do that

* For each numerical variable, use it as score and compute AUC with the `above_average` variable
* Use the training dataset for that

If your AUC is < 0.5, invert this variable by putting "-" in front

(e.g. `-df_train['engine_hp']`)

AUC can go below 0.5 if the variable is negatively correlated with the target variable. You can change the direction of the correlation by negating this variable: the negative correlation then becomes positive.
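
To illustrate the negation trick, here is a small sketch on made-up data (`y_toy` and `score_toy` are hypothetical values, not taken from the car dataset). Negating a score reverses its ranking, so the AUC of the negated score is one minus the original AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data, made up for illustration: lower scores tend to be positives.
y_toy = np.array([1, 1, 1, 0, 0, 0])
score_toy = np.array([10, 12, 30, 25, 28, 35])

auc_raw = roc_auc_score(y_toy, score_toy)   # below 0.5: ranking is reversed
auc_neg = roc_auc_score(y_toy, -score_toy)  # negation flips the ranking

print(round(auc_raw, 3), round(auc_neg, 3))  # 0.222 0.778
```

Only 2 of the 9 positive/negative pairs are ordered correctly by the raw score, which gives 2/9; after negation the other 7 pairs are ordered correctly, which gives 7/9.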

Which numerical variable (among the following 4) has the highest AUC?

- `engine_hp`
- `engine_cylinders`
- `highway_mpg`
- `city_mpg`

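Answer -> `engine_hp` has the highest AUC, 0.92.

A higher single-feature AUC means the feature on its own ranks above-average cars ahead of the rest more reliably, which makes it a stronger candidate for the model. `highway_mpg` and `city_mpg` score below 0.5 in the output below; after negation their AUCs are roughly 0.63 and 0.67, still well below `engine_hp`.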

In [49]:
numerical = ['engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']
categorical = list(df.dtypes[df.dtypes == 'object'].index)

print(f"Numerical-> {numerical}")
print(f'Categorical -> {categorical}')

Numerical-> ['engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']
Categorical -> ['make', 'model', 'transmission_type', 'vehicle_style']


In [51]:
for f in numerical:
    auc = roc_auc_score(y_train, df_train[f])
    print(f'{f}: auc score-> {round(auc,2)}')

    

engine_hp: auc score-> 0.92
engine_cylinders: auc score-> 0.77
highway_mpg: auc score-> 0.37
city_mpg: auc score-> 0.33
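
The loop above prints the raw AUCs without applying the inversion rule from the question. One possible way to fold the inversion in is sketched below; the helper name `auc_per_feature` and the toy data are assumptions for illustration, not part of the original notebook:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score


def auc_per_feature(df, y, features):
    """Single-feature AUC per column, negating the column when AUC < 0.5."""
    rows = []
    for f in features:
        auc = roc_auc_score(y, df[f])
        inverted = auc < 0.5
        if inverted:
            # Negating the feature flips the ranking it induces.
            auc = roc_auc_score(y, -df[f])
        rows.append({"feature": f, "auc": auc, "inverted": inverted})
    result = pd.DataFrame(rows).sort_values("auc", ascending=False)
    return result.reset_index(drop=True)


# Toy stand-in for df_train / y_train (values made up for illustration).
df_toy = pd.DataFrame({
    "engine_hp": [100, 150, 300, 400],
    "city_mpg": [30, 16, 18, 15],
})
y_toy = [0, 0, 1, 1]

result = auc_per_feature(df_toy, y_toy, ["engine_hp", "city_mpg"])
print(result)
```

With the notebook's own variables the call would be `auc_per_feature(df_train, y_train, numerical)`; judging from the printed values, `highway_mpg` and `city_mpg` would be the two columns flagged as inverted.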


# Question 2

Apply one-hot-encoding using `DictVectorizer` and train the logistic regression with these parameters:

`LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)`

What's the AUC of this model on the validation dataset? (round to 3 digits)

- 0.678
- 0.779
- 0.878
- 0.979

Answer -> 0.979 (the closest option; the computed validation AUC is about 0.981)

In [52]:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)

train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

In [55]:
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
model.fit(X_train, y_train)

In [60]:
y_pred = model.predict_proba(X_val)[:,1]

In [65]:
roc_auc_score(y_val, y_pred)

0.9813431779873112

# Question 3