# Classification

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

## Features

For the rest of the homework, you'll need to use only these columns:

* Make
* Model
* Year
* Engine HP
* Engine Cylinders
* Transmission Type
* Vehicle Style
* highway MPG
* city mpg

In [2]:
df = pd.read_csv("car-price_data.csv")
df.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


In [3]:
col_selected = ["Make", "Model", "Year", "Engine HP", "Engine Cylinders", 
                "Transmission Type", "Vehicle Style", "highway MPG", "city mpg", "MSRP"]
df_sub = df[col_selected]
df_sub.head()

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle Style,highway MPG,city mpg,MSRP
0,BMW,1 Series M,2011,335.0,6.0,MANUAL,Coupe,26,19,46135
1,BMW,1 Series,2011,300.0,6.0,MANUAL,Convertible,28,19,40650
2,BMW,1 Series,2011,300.0,6.0,MANUAL,Coupe,28,20,36350
3,BMW,1 Series,2011,230.0,6.0,MANUAL,Coupe,28,18,29450
4,BMW,1 Series,2011,230.0,6.0,MANUAL,Convertible,28,18,34500


## Data preparation 

* Select only the features from above and transform their names using the following line:
  data.columns = data.columns.str.replace(' ', '_').str.lower()
* Fill in the missing values of the selected features with 0.
* Rename MSRP variable to price.
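
The three preparation steps above can be sketched on a small hand-made frame. The frame `toy` and its injected missing value are assumptions for illustration only; the column names and the `str.replace(...).str.lower()` line come from the task text.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for the car data, with one missing value injected.
toy = pd.DataFrame({
    "Engine HP": [335.0, np.nan, 230.0],
    "city mpg": [19, 19, 18],
    "MSRP": [46135, 40650, 29450],
})

# Normalize column names: spaces become underscores, everything lower-case.
toy.columns = toy.columns.str.replace(" ", "_").str.lower()

# Rename the target column and fill missing values with 0.
toy = toy.rename(columns={"msrp": "price"}).fillna(0)

print(list(toy.columns))          # ['engine_hp', 'city_mpg', 'price']
print(toy["engine_hp"].tolist())  # [335.0, 0.0, 230.0]
```

Renaming after the lower-casing step means the key to rename is `msrp`, not `MSRP`.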

In [4]:
strings_col = df_sub.columns[df_sub.dtypes == "object"]
strings_col

Index(['Make', 'Model', 'Transmission Type', 'Vehicle Style'], dtype='object')

In [5]:
for col in strings_col:
    df_sub[col] = df[col].str.replace(' ', '_').str.lower()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sub[col] = df[col].str.replace(' ', '_').str.lower()
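
The warning above appears because `df_sub` was created by selecting columns from `df`, so pandas cannot tell whether the assignment is meant to reach the original frame. A common way to avoid it, sketched here on a toy frame (the rows are illustrative assumptions), is to take an explicit `.copy()` when selecting the columns.

```python
import pandas as pd

# Toy frame; the values are illustrative only.
df = pd.DataFrame({
    "Make": ["BMW", "Audi"],
    "Vehicle Style": ["Coupe", "4dr SUV"],
    "MSRP": [46135, 40650],
})

# An explicit copy makes df_sub an independent DataFrame, so the
# column assignments below do not raise SettingWithCopyWarning.
df_sub = df[["Make", "Vehicle Style"]].copy()

for col in df_sub.columns:
    df_sub[col] = df_sub[col].str.replace(" ", "_").str.lower()

print(df_sub["Vehicle Style"].tolist())  # ['coupe', '4dr_suv']
print(df["Vehicle Style"].tolist())      # ['Coupe', '4dr SUV'] (original untouched)
```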


In [6]:
df_sub.head()

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle Style,highway MPG,city mpg,MSRP
0,bmw,1_series_m,2011,335.0,6.0,manual,coupe,26,19,46135
1,bmw,1_series,2011,300.0,6.0,manual,convertible,28,19,40650
2,bmw,1_series,2011,300.0,6.0,manual,coupe,28,20,36350
3,bmw,1_series,2011,230.0,6.0,manual,coupe,28,18,29450
4,bmw,1_series,2011,230.0,6.0,manual,convertible,28,18,34500


In [7]:
df_sub.rename(columns={"MSRP":"price"}, inplace=True)
df_sub.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sub.rename(columns={"MSRP":"price"}, inplace=True)


Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle Style,highway MPG,city mpg,price
0,bmw,1_series_m,2011,335.0,6.0,manual,coupe,26,19,46135
1,bmw,1_series,2011,300.0,6.0,manual,convertible,28,19,40650
2,bmw,1_series,2011,300.0,6.0,manual,coupe,28,20,36350
3,bmw,1_series,2011,230.0,6.0,manual,coupe,28,18,29450
4,bmw,1_series,2011,230.0,6.0,manual,convertible,28,18,34500


## Question 1

In [8]:
df_sub.columns = df_sub.columns.str.replace(' ', '_').str.lower()
df_sub.head()

Unnamed: 0,make,model,year,engine_hp,engine_cylinders,transmission_type,vehicle_style,highway_mpg,city_mpg,price
0,bmw,1_series_m,2011,335.0,6.0,manual,coupe,26,19,46135
1,bmw,1_series,2011,300.0,6.0,manual,convertible,28,19,40650
2,bmw,1_series,2011,300.0,6.0,manual,coupe,28,20,36350
3,bmw,1_series,2011,230.0,6.0,manual,coupe,28,18,29450
4,bmw,1_series,2011,230.0,6.0,manual,convertible,28,18,34500


In [9]:
df_sub["transmission_type"].describe()

count         11914
unique            5
top       automatic
freq           8266
Name: transmission_type, dtype: object

The most frequent value (mode) of the **transmission_type** column is "automatic", which appears in 8266 of the 11914 rows.
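
As a cross-check on `describe()`, the mode can also be read from `value_counts()` or `mode()`. The short series below is a made-up example, not the real column.

```python
import pandas as pd

# Made-up sample of transmission types.
s = pd.Series(["automatic", "manual", "automatic", "automated_manual", "automatic"])

counts = s.value_counts()  # sorted by frequency, most common first
top = s.mode()[0]          # first (here: only) most frequent value

print(counts.index[0], counts.iloc[0])  # automatic 3
print(top)                              # automatic
```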

**Missing values**

In [10]:
df_sub.isnull().sum()

make                  0
model                 0
year                  0
engine_hp            69
engine_cylinders     30
transmission_type     0
vehicle_style         0
highway_mpg           0
city_mpg              0
price                 0
dtype: int64

In [11]:
# impute missing values
missing_col = ["engine_hp","engine_cylinders"]
for col in missing_col:
    mean_col = df_sub[col].mean()
    df_sub[col].fillna(mean_col, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sub[col].fillna(mean_col, inplace=True)


In [12]:
# impute missing values
#df_sub.fillna(0, inplace=True)
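
Note that the task asks for missing values to be filled with 0, while the cells above impute the column mean and leave the zero fill commented out. If mean imputation is what is wanted, assigning the result back to the column is a safer form than calling `fillna(..., inplace=True)` on a selected column, which may not update the parent frame in recent pandas versions. A minimal sketch on an assumed toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (illustrative values).
toy = pd.DataFrame({
    "engine_hp": [335.0, np.nan, 230.0],
    "engine_cylinders": [6.0, 6.0, np.nan],
})

for col in ["engine_hp", "engine_cylinders"]:
    # mean() skips NaN by default; assign the filled column back explicitly.
    toy[col] = toy[col].fillna(toy[col].mean())

print(toy["engine_hp"].tolist())         # [335.0, 282.5, 230.0]
print(toy["engine_cylinders"].tolist())  # [6.0, 6.0, 6.0]
```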

## Question 2

In [13]:
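# Correlation is only defined for numeric columns; selecting them explicitly
# also keeps this cell working on pandas versions that reject string columns.
df_sub.select_dtypes(include="number").corr()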

Unnamed: 0,year,engine_hp,engine_cylinders,highway_mpg,city_mpg,price
year,1.0,0.351288,-0.041446,0.25824,0.198171,0.22759
engine_hp,0.351288,1.0,0.764986,-0.353343,-0.346308,0.661644
engine_cylinders,-0.041446,0.764986,1.0,-0.602294,-0.56698,0.531272
highway_mpg,0.25824,-0.353343,-0.602294,1.0,0.886829,-0.160043
city_mpg,0.198171,-0.346308,-0.56698,0.886829,1.0,-0.157676
price,0.22759,0.661644,0.531272,-0.160043,-0.157676,1.0


The two features with the highest correlation in this dataset are **highway_mpg** and **city_mpg** (0.886829).
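
Rather than reading the matrix by eye, the strongest pair can be located programmatically. The sketch below rebuilds the matrix from the values printed above, zeroes the diagonal, and takes the largest absolute entry; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

cols = ["year", "engine_hp", "engine_cylinders", "highway_mpg", "city_mpg", "price"]
corr = pd.DataFrame(
    [
        [1.0, 0.351288, -0.041446, 0.25824, 0.198171, 0.22759],
        [0.351288, 1.0, 0.764986, -0.353343, -0.346308, 0.661644],
        [-0.041446, 0.764986, 1.0, -0.602294, -0.56698, 0.531272],
        [0.25824, -0.353343, -0.602294, 1.0, 0.886829, -0.160043],
        [0.198171, -0.346308, -0.56698, 0.886829, 1.0, -0.157676],
        [0.22759, 0.661644, 0.531272, -0.160043, -0.157676, 1.0],
    ],
    index=cols,
    columns=cols,
)

# Work on absolute values and ignore the trivial self-correlations.
values = corr.abs().to_numpy().copy()
np.fill_diagonal(values, 0.0)

# Position of the largest remaining entry.
i, j = np.unravel_index(values.argmax(), values.shape)
best_pair = (cols[i], cols[j])

print(best_pair, corr.iloc[i, j])  # ('highway_mpg', 'city_mpg') 0.886829
```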

## Make price binary

* Now we need to turn the price variable from numeric into a binary format.
* Let's create a variable above_average which is 1 if the price is above its mean value and 0 otherwise.

In [14]:
mean_price = df_sub["price"].mean()
df_sub["above_average"] = (df_sub.price > mean_price).astype(int)
#df_sub.drop(columns="price", inplace=True)
df_sub.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sub["above_average"] = (df_sub.price > mean_price).astype(int)


Unnamed: 0,make,model,year,engine_hp,engine_cylinders,transmission_type,vehicle_style,highway_mpg,city_mpg,price,above_average
0,bmw,1_series_m,2011,335.0,6.0,manual,coupe,26,19,46135,1
1,bmw,1_series,2011,300.0,6.0,manual,convertible,28,19,40650,1
2,bmw,1_series,2011,300.0,6.0,manual,coupe,28,20,36350,0
3,bmw,1_series,2011,230.0,6.0,manual,coupe,28,18,29450,0
4,bmw,1_series,2011,230.0,6.0,manual,convertible,28,18,34500,0


## Split the data 

* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
* Make sure that the target value (above_average) is not in your dataframe.
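
A 60%/20%/20% split takes two calls to `train_test_split`: first hold out 20% for test, then hold out 25% of the remaining 80% (0.25 × 0.8 = 0.2) for validation. The sketch below checks the resulting sizes on a plain index array of the dataset's length (11914 rows). It uses seed 42 as the task states, whereas the cells below pass `random_state=1`; the sizes do not depend on the seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 11914 rows of the dataset.
rows = np.arange(11914)

# 20% for test, then 25% of the remaining 80% for validation.
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))  # 7148 2383 2383
```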

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
data = df_sub.copy()
data.drop(columns="price", inplace=True)

In [17]:
df_full_train, df_test = train_test_split(data, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

In [18]:
len(df_train), len(df_val), len(df_test)

(7148, 2383, 2383)

In [19]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.above_average.values
y_val = df_val.above_average.values
y_test = df_test.above_average.values

del df_train['above_average']
del df_val['above_average']
del df_test['above_average']

## Question 3

* Calculate the mutual information score between above_average and other categorical variables in our dataset. Use the training set only.
* Round the scores to 2 decimals using round(score, 2).
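
To build intuition for the score, here is a toy comparison between a categorical feature that fully determines the target and one that is independent of it. The lists are made up; `mutual_info_score` reports the value in nats (natural logarithm), so a perfectly informative binary feature on a balanced target scores ln 2.

```python
from sklearn.metrics import mutual_info_score

target = [1, 1, 0, 0]
informative = ["a", "a", "b", "b"]    # determines the target exactly
uninformative = ["x", "y", "x", "y"]  # independent of the target

mi_high = mutual_info_score(target, informative)
mi_low = mutual_info_score(target, uninformative)

print(round(mi_high, 2))  # 0.69
print(round(mi_low, 2))   # 0.0
```

A higher score means that knowing the feature tells us more about the target.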

In [20]:
from sklearn.metrics import mutual_info_score

In [21]:
df_full_train.isnull().sum()

make                 0
model                0
year                 0
engine_hp            0
engine_cylinders     0
transmission_type    0
vehicle_style        0
highway_mpg          0
city_mpg             0
above_average        0
dtype: int64

In [22]:
data.dtypes

make                  object
model                 object
year                   int64
engine_hp            float64
engine_cylinders     float64
transmission_type     object
vehicle_style         object
highway_mpg            int64
city_mpg               int64
above_average          int32
dtype: object

In [23]:
categorical_var = list(df_train.dtypes[df_train.dtypes == "object"].index)
categorical_var

['make', 'model', 'transmission_type', 'vehicle_style']

In [24]:
numerical_var = list(df_train.dtypes[df_train.dtypes != "object"].index)
numerical_var

['year', 'engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']

In [25]:
for var in categorical_var :
    mut_score = mutual_info_score(df_full_train.above_average, df_full_train[var])
    print(f"Mutual information score between above_average and {var} is {round(mut_score,2)}")

Mutual information score between above_average and make is 0.24
Mutual information score between above_average and model is 0.46
Mutual information score between above_average and transmission_type is 0.02
Mutual information score between above_average and vehicle_style is 0.08


The variable with the lowest mutual information score is **transmission_type** (0.02).

## Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
  * To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
  * model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

### (a) Use One-hot encoding

In [26]:
from sklearn.feature_extraction import DictVectorizer

In [27]:
dv = DictVectorizer(sparse=False)

train_dict = df_train[categorical_var + numerical_var].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val[categorical_var + numerical_var].to_dict(orient='records')
X_val = dv.transform(val_dict)
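
To see what `DictVectorizer` produces, the sketch below encodes two hand-written records. String values become one indicator column per category, while numeric values pass through unchanged. The records, including the `direct_drive` category, are illustrative assumptions.

```python
from sklearn.feature_extraction import DictVectorizer

records = [
    {"transmission_type": "manual", "year": 2011},
    {"transmission_type": "automatic", "year": 2015},
]

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(records)

print(dv.feature_names_)
# ['transmission_type=automatic', 'transmission_type=manual', 'year']
print(X.tolist())
# [[0.0, 1.0, 2011.0], [1.0, 0.0, 2015.0]]

# A category not seen during fit is ignored: all indicator columns stay 0.
X_new = dv.transform([{"transmission_type": "direct_drive", "year": 2017}])
print(X_new.tolist())
# [[0.0, 0.0, 2017.0]]
```

This is why the encoder is fitted on the training records only and then reused with `transform` on the validation records.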

### (b) Logistic Regression

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [29]:
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

LogisticRegression(C=10, max_iter=1000, random_state=42, solver='liblinear')

In [30]:
y_pred = model.predict_proba(X_val)[:, 1]
y_pred

array([7.03421675e-01, 4.33307937e-04, 2.63429099e-01, ...,
       1.23003688e-03, 9.99999683e-01, 2.61551082e-02])

In [31]:
price_decision = (y_pred >= 0.5)
price_decision

array([ True, False, False, ..., False,  True, False])

In [32]:
score_init = accuracy_score(y_val, price_decision)
round(score_init,2)

0.94
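
Accuracy is simply the fraction of validation rows where the thresholded prediction matches the label. The small arrays below are made up to show that the manual computation agrees with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predicted probabilities of the positive class.
y_true = np.array([1, 0, 0, 1, 0])
proba = np.array([0.70, 0.01, 0.26, 0.40, 0.03])

# Same 0.5 threshold as in the cell above.
decision = (proba >= 0.5).astype(int)

manual = (decision == y_true).mean()
library = accuracy_score(y_true, decision)

print(manual, library)  # 0.8 0.8
```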

## Question 5

* Let's find the least useful feature using the feature elimination technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

In [33]:
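def col_test(l_col):
    """Train the Q4 model on the columns in l_col and return the validation accuracy."""
    # One-hot encode the selected columns; the encoder is fitted on train only
    dv = DictVectorizer(sparse=False)

    train_dict = df_train[l_col].to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)

    val_dict = df_val[l_col].to_dict(orient='records')
    X_val = dv.transform(val_dict)

    # Same parameters as in Question 4
    model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    # Probability of the positive class, thresholded at 0.5
    y_pred = model.predict_proba(X_val)[:, 1]
    price_decision = (y_pred >= 0.5)

    score_var = accuracy_score(y_val, price_decision)
    return score_var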

In [34]:
l_col = categorical_var + numerical_var
for col in l_col:
    new_col = l_col.copy()
    new_col.remove(col)
    
    score_var = col_test(new_col)
    
    diff = score_init - score_var
    
    print(col)
    print(f'accuracy score without {col} is {score_var}')
    print(f'Difference between the original accuracy and the accuracy without the feature {col} is {diff}')
    print()

make
accuracy score without make is 0.9437683592110785
Difference between the original accuracy and the accuracy without the feature make is -0.0054553084347461755

model
accuracy score without model is 0.9122954259336971
Difference between the original accuracy and the accuracy without the feature model is 0.026017624842635256

transmission_type
accuracy score without transmission_type is 0.9349559378934117
Difference between the original accuracy and the accuracy without the feature transmission_type is 0.003357112882920621

vehicle_style
accuracy score without vehicle_style is 0.936634494334872
Difference between the original accuracy and the accuracy without the feature vehicle_style is 0.0016785564414603105

year
accuracy score without year is 0.9496433067561897
Difference between the original accuracy and the accuracy without the feature year is -0.011330255979857373

engine_hp
accuracy score without engine_hp is 0.931598825010491
Difference between the original accuracy an

The feature which has the smallest difference is **transmission_type**

## Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column price. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data with a solver 'sag'. Set the seed to 42.
* This model also has a parameter alpha. Let's try the following values: [0, 0.01, 0.1, 1, 10].
* Round your RMSE scores to 3 decimal digits.
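
Two pieces of this task are easy to check in isolation: the log transform is invertible with `np.expm1`, and RMSE is the square root of the mean squared difference. The helper `rmse` below is a hypothetical name introduced for this sketch; the prices are the five MSRP values shown in the first rows of the data.

```python
import numpy as np

prices = np.array([46135, 40650, 36350, 29450, 34500], dtype=float)

# log1p(x) = log(1 + x); expm1 undoes it, so predictions can be mapped back to prices.
log_prices = np.log1p(prices)
recovered = np.expm1(log_prices)

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# RMSE of a constant baseline that always predicts the mean log price.
baseline = np.full_like(log_prices, log_prices.mean())
print(rmse(log_prices, baseline))
```

Because the target is on the log scale, the RMSE values reported for this question are in log-price units, not in dollars.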

In [35]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

In [36]:
df_f = df_sub.copy()
df_f.drop(columns="above_average", inplace=True)

In [37]:
df_full_train, df_test = train_test_split(df_f, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

In [38]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.price.values
y_val = df_val.price.values
y_test = df_test.price.values

del df_train['price']
del df_val['price']
del df_test['price']

In [39]:
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)
y_test_log = np.log1p(y_test)

In [40]:
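# X_train and X_val are the one-hot matrices built in Question 4. The split above
# uses the same random_state, so their rows line up with y_train_log and y_val_log.
for alpha in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=alpha, solver="sag", random_state=42)
    model.fit(X_train, y_train_log)

    y_pred = model.predict(X_val)

    rmse_score = np.sqrt(mean_squared_error(y_val_log, y_pred))

    print(f"For alpha {alpha} the RMSE score is {round(rmse_score, 3)}")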

For alpha 0 the RMSE score is 0.485
For alpha 0.01 the RMSE score is 0.485
For alpha 0.1 the RMSE score is 0.485
For alpha 1 the RMSE score is 0.486
For alpha 10 the RMSE score is 0.486


Alphas 0, 0.01 and 0.1 all give the lowest RMSE on the validation set (0.485 after rounding); the smallest of them is 0.