## Week 3 Homework Submission

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [167]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression, LinearRegression

### Dataset

In this homework, we will use the Car price dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv).

Or you can do it with `wget`:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
```

We'll keep working with the `MSRP` variable, and we'll transform it to a classification task. 

### Features

For the rest of the homework, you'll need to use only these columns:

* `Make`,
* `Model`,
* `Year`,
* `Engine HP`,
* `Engine Cylinders`,
* `Transmission Type`,
* `Vehicle Style`,
* `highway MPG`,
* `city mpg`

### Data preparation

* Select only the features from above and transform their names using the following line:
  ```
  data.columns = data.columns.str.replace(' ', '_').str.lower()
  ```
* Fill in the missing values of the selected features with 0.
* Rename the `MSRP` variable to `price`.

In [3]:
!wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv

--2023-10-02 05:29:54--  https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1475504 (1.4M) [text/plain]
Saving to: ‘data.csv’


2023-10-02 05:29:54 (2.95 MB/s) - ‘data.csv’ saved [1475504/1475504]



In [169]:
df = pd.read_csv('data.csv')
df.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


## EDA and Data Prep
- Filter the features to only the relevant set
- Select only the features from above and transform their names with `data.columns = data.columns.str.replace(' ', '_').str.lower()`
- Fill in the missing values of the selected features with 0.
- Rename the `MSRP` variable to `price`.

In [170]:
features = ['Make', 'Model', 'Year', 'Engine HP',
       'Engine Cylinders', 'Transmission Type', 'Vehicle Style',
       'highway MPG', 'city mpg', 'MSRP']
df = df[features]
df.columns = df.columns.str.replace('MSRP','price').str.replace(' ','_').str.lower()
df = df.fillna(0)
df.head(), df.isnull().sum()

(  make       model  year  engine_hp  engine_cylinders transmission_type  \
 0  BMW  1 Series M  2011      335.0               6.0            MANUAL   
 1  BMW    1 Series  2011      300.0               6.0            MANUAL   
 2  BMW    1 Series  2011      300.0               6.0            MANUAL   
 3  BMW    1 Series  2011      230.0               6.0            MANUAL   
 4  BMW    1 Series  2011      230.0               6.0            MANUAL   
 
   vehicle_style  highway_mpg  city_mpg  price  
 0         Coupe           26        19  46135  
 1   Convertible           28        19  40650  
 2         Coupe           28        20  36350  
 3         Coupe           28        18  29450  
 4   Convertible           28        18  34500  ,
 make                 0
 model                0
 year                 0
 engine_hp            0
 engine_cylinders     0
 transmission_type    0
 vehicle_style        0
 highway_mpg          0
 city_mpg             0
 price                0
 dtype:

### Question 1

What is the most frequent observation (mode) for the column `transmission_type`?

- `AUTOMATIC` <--
- `MANUAL`
- `AUTOMATED_MANUAL`
- `DIRECT_DRIVE`

In [72]:
df.groupby('transmission_type').transmission_type.count()

transmission_type
AUTOMATED_MANUAL     626
AUTOMATIC           8266
DIRECT_DRIVE          68
MANUAL              2935
UNKNOWN               19
Name: transmission_type, dtype: int64

In [73]:
print("the most frequent observation for the column `transmission_type` is `AUTOMATIC`")

the most frequent observation for the column `transmission_type` is `AUTOMATIC`
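
The mode can also be read off programmatically instead of by eye. The sketch below rebuilds the counts from the output above as a `pandas` Series (the variable names `counts` and `most_frequent` are only illustrative) and picks the label with the largest count. On the full dataframe, `df['transmission_type'].mode()[0]` should give the same label.

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.Series(
    {
        "AUTOMATED_MANUAL": 626,
        "AUTOMATIC": 8266,
        "DIRECT_DRIVE": 68,
        "MANUAL": 2935,
        "UNKNOWN": 19,
    }
)

# idxmax returns the index label of the largest value.
most_frequent = counts.idxmax()
print(most_frequent)
```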


### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.

What are the two features that have the biggest correlation in this dataset?

- `engine_hp` and `year`
- `engine_hp` and `engine_cylinders`
- `highway_mpg` and `engine_cylinders`
- `highway_mpg` and `city_mpg` <--

In [171]:
categorical = list(df.dtypes[df.dtypes == 'object'].index)
numerical = list(df.dtypes[df.dtypes != 'object'].index)

In [172]:
# make all categorical variables lower case and replace spaces with '_'
for c in categorical:
    df[c] = df[c].str.lower().str.replace(' ', '_')

df.head().T

Unnamed: 0,0,1,2,3,4
make,bmw,bmw,bmw,bmw,bmw
model,1_series_m,1_series,1_series,1_series,1_series
year,2011,2011,2011,2011,2011
engine_hp,335.0,300.0,300.0,230.0,230.0
engine_cylinders,6.0,6.0,6.0,6.0,6.0
transmission_type,manual,manual,manual,manual,manual
vehicle_style,coupe,convertible,coupe,coupe,convertible
highway_mpg,26,28,28,28,28
city_mpg,19,19,20,18,18
price,46135,40650,36350,29450,34500


In [173]:
# print the absolute correlation of every numerical feature with each column in turn
# (each printed block is one row of the correlation matrix)
for c in list(df[numerical]):
    print(f"the correlation matrix for {c} is: ")
    print(df[numerical].corrwith(df[c]).abs())

the correlation matrix for year is: 
year                1.000000
engine_hp           0.338714
engine_cylinders    0.040708
highway_mpg         0.258240
city_mpg            0.198171
price               0.227590
dtype: float64
the correlation matrix for engine_hp is: 
year                0.338714
engine_hp           1.000000
engine_cylinders    0.774851
highway_mpg         0.415707
city_mpg            0.424918
price               0.650095
dtype: float64
the correlation matrix for engine_cylinders is: 
year                0.040708
engine_hp           0.774851
engine_cylinders    1.000000
highway_mpg         0.614541
city_mpg            0.587306
price               0.526274
dtype: float64
the correlation matrix for highway_mpg is: 
year                0.258240
engine_hp           0.415707
engine_cylinders    0.614541
highway_mpg         1.000000
city_mpg            0.886829
price               0.160043
dtype: float64
the correlation matrix for city_mpg is: 
year                0.198171
en

In [94]:
print("the two features that have the biggest correlation are highway_mpg and city_mpg")

the two features that have the biggest correlation are highway_mpg and city_mpg
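
The strongest pair can also be extracted from the matrix directly rather than read from the printed blocks. This is a minimal sketch on a toy dataframe with made-up values (not the car dataset): it takes absolute correlations, keeps the upper triangle so each pair appears once, and sorts.

```python
import numpy as np
import pandas as pd

# Toy numerical data (hypothetical values, not the car dataset).
toy = pd.DataFrame(
    {
        "highway_mpg": [26, 28, 28, 31, 35, 22],
        "city_mpg": [19, 19, 20, 23, 27, 15],
        "engine_hp": [335, 150, 300, 230, 280, 200],
        "year": [2011, 2015, 2009, 2013, 2012, 2014],
    }
)

corr = toy.corr().abs()

# Upper triangle only (k=1 drops the diagonal), so every pair is listed once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

top_pair = pairs.index[0]
print(top_pair, round(float(pairs.iloc[0]), 3))
```

The same three lines after `corr = ...` should work unchanged on `df[numerical].corr().abs()`.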


### Make `price` binary

* Now we need to turn the `price` variable from numeric into a binary format.
* Let's create a variable `above_average` which is `1` if the `price` is above its mean value and `0` otherwise.

### Split the data

* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value (`above_average`) is not in your dataframe.
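
The two steps above (binary target, then a 60%/20%/20% split) can be sketched on a toy dataframe. The data and the `toy_*` names are made up for illustration. Since the second split is taken from the remaining 80%, a `test_size` of 0.25 there corresponds to 20% of the original data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 100 rows (hypothetical prices) so the proportions are easy to read.
toy = pd.DataFrame({"price": np.arange(100) * 1000})
toy["above_average"] = (toy["price"] > toy["price"].mean()).astype(int)

# First split: 80% full-train, 20% test.
toy_full_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% equals 20% of the original data.
toy_train, toy_val = train_test_split(toy_full_train, test_size=0.25, random_state=42)

# Pull the target out, then drop it (and price) from the feature frame.
y_toy_train = toy_train["above_average"].values
toy_train = toy_train.drop(columns=["price", "above_average"])

print(len(toy_train), len(toy_val), len(toy_test))
```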

### Question 3

* Calculate the mutual information score between `above_average` and other categorical variables in our dataset. 
  Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the lowest mutual information score?
  
- `make`
- `model`
- `transmission_type`
- `vehicle_style`
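
As a small illustration of the calculation, the sketch below scores two categorical columns of a toy training set against a binary target. All values and the names `toy_train`, `y_toy_train`, and `mi_with_target` are hypothetical. Here `make` fully determines the target while `transmission_type` is independent of it, so the latter should come out lowest.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Toy training data (hypothetical values).
toy_train = pd.DataFrame(
    {
        "make": ["bmw", "bmw", "audi", "audi", "kia", "kia"],
        "transmission_type": [
            "manual", "automatic", "manual", "automatic", "manual", "automatic",
        ],
    }
)
y_toy_train = pd.Series([1, 1, 1, 1, 0, 0])  # 1 = price above average


def mi_with_target(series):
    # Mutual information between one categorical column and the target.
    return mutual_info_score(series, y_toy_train)


scores = toy_train.apply(mi_with_target).round(2).sort_values()
print(scores)
```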


In [175]:
# del df['above_average']
df['above_average'] = (df.price > df.price.mean()).astype(int)
df.head()

Unnamed: 0,make,model,year,engine_hp,engine_cylinders,transmission_type,vehicle_style,highway_mpg,city_mpg,price,above_average
0,bmw,1_series_m,2011,335.0,6.0,manual,coupe,26,19,46135,1
1,bmw,1_series,2011,300.0,6.0,manual,convertible,28,19,40650,1
2,bmw,1_series,2011,300.0,6.0,manual,coupe,28,20,36350,0
3,bmw,1_series,2011,230.0,6.0,manual,coupe,28,18,29450,0
4,bmw,1_series,2011,230.0,6.0,manual,convertible,28,18,34500,0


In [176]:
df['above_average'].value_counts(normalize=False)

above_average
0    8645
1    3269
Name: count, dtype: int64

In [178]:
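# split the data
seed = 42

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=seed)
# note: test_size=0.2 of the remaining 80% gives a 64%/16%/20% split overall;
# test_size=0.25 here would give the requested 60%/20%/20%
df_train, df_val = train_test_split(df_full_train, test_size=0.2, random_state=seed)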

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.above_average.values
y_val = df_val.above_average.values
y_test = df_test.above_average.values

del df_train['price']
del df_train['above_average']

del df_val['price']
del df_val['above_average']

del df_test['price']
del df_test['above_average']

len(df_train), len(df_val), len(df_test)

(7624, 1907, 2383)

In [186]:
# calculate mutual information score between above_average and categorical variables
# note: this uses df_full_train (train + validation rows); for the training set only,
# score df_train[c] against y_train instead

def mutual_info_above_average_score(series):
    return mutual_info_score(series, df_full_train.above_average)

mi = df_full_train[categorical].apply(mutual_info_above_average_score)
mi.sort_values(ascending=False)

model                0.460994
make                 0.238724
vehicle_style        0.083390
transmission_type    0.020884
dtype: float64

### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.60
- 0.72
- 0.84
- 0.95


In [191]:
dv = DictVectorizer(sparse=False)

features = categorical+numerical
features.remove('price')

# one hot encoding of training set
train_dict = df_train[features].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

# one hot encoding of validation set
val_dict = df_val[features].to_dict(orient='records')
X_val = dv.transform(val_dict)

# create a LogisticRegression model object
# solver='liblinear' is set explicitly: the default solver changed to 'lbfgs'
# in scikit-learn 0.22, so relying on the default is not reproducible across versions
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=seed)

# fit the model on the training data
model.fit(X_train, y_train)

In [192]:
model.intercept_[0]
model.coef_[0].round(3)

# create predictions on the validation set
y_pred = model.predict_proba(X_val)[:, 1]
above_average_pred = (y_pred >= 0.5)
(y_val == above_average_pred).mean()

# create dataframe to store predicted values on validation set, binary values of those predicted values and the actuals from the validation set
df_pred = pd.DataFrame()
df_pred['probability'] = y_pred
df_pred['prediction'] = above_average_pred.astype(int)
df_pred['actual'] = y_val

# compare frequency of correct predictions
df_pred['correct'] = df_pred.prediction == df_pred.actual
df_pred.correct.mean()

0.9375983219716832

In [194]:
df_pred

Unnamed: 0,probability,prediction,actual,correct
0,0.003635,0,0,True
1,0.992688,1,1,True
2,0.001145,0,0,True
3,0.263359,0,0,True
4,0.003149,0,0,True
...,...,...,...,...
1902,0.106142,0,0,True
1903,0.133929,0,0,True
1904,0.010017,0,0,True
1905,0.000059,0,0,True


### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of the following features has the smallest difference?

- `year`
- `engine_hp`
- `transmission_type`
- `city_mpg`

> **Note**: the difference doesn't have to be positive

In [201]:
# dv = DictVectorizer(sparse=False)

# make sure price and above_average aren't in the feature list

all_features = categorical + numerical

if 'price' in all_features:
    all_features.remove('price')

if 'above_average' in all_features:
    all_features.remove('above_average')

print(all_features)

results = {}

# loop through each feature and train a model without it
for eliminated_feature in all_features:

    print(f"all features are: {all_features}")
    print(f"eliminating feature: {eliminated_feature}")
    features = all_features.copy()
    features.remove(eliminated_feature)
    print(f"features left are: {features}")
        
    train_dict = df_train[features].to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)
    
    # one hot encoding of validation set
    val_dict = df_val[features].to_dict(orient='records')
    X_val = dv.transform(val_dict)
    
    # create a LogisticRegression model object
    # solver='liblinear' is set explicitly: the default solver changed to 'lbfgs'
    # in scikit-learn 0.22, so relying on the default is not reproducible across versions
    model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=seed)
    
    # fit the model on the training data
    model.fit(X_train, y_train)

    y_pred = model.predict_proba(X_val)[:, 1]
    above_average_pred = (y_pred >= 0.5)
    (y_val == above_average_pred).mean()
    
    # create dataframe to store predicted values on validation set, binary values of those predicted values and the actuals from the validation set
    df_pred = pd.DataFrame()
    df_pred['probability'] = y_pred
    df_pred['prediction'] = above_average_pred.astype(int)
    df_pred['actual'] = y_val
    
    # compare frequency of correct predictions
    df_pred['correct'] = df_pred.prediction == df_pred.actual
    accuracy = df_pred.correct.mean()
    print(f"The accuracy for the model without {eliminated_feature} is {accuracy}")

    results[eliminated_feature] = accuracy

print(results)

['make', 'model', 'transmission_type', 'vehicle_style', 'year', 'engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']
all features are: ['make', 'model', 'transmission_type', 'vehicle_style', 'year', 'engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']
eliminating feature: make
features left are: ['model', 'transmission_type', 'vehicle_style', 'year', 'engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']
The accuracy for the model without make is 0.9276350288411117
all features are: ['make', 'model', 'transmission_type', 'vehicle_style', 'year', 'engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']
eliminating feature: model
features left are: ['make', 'transmission_type', 'vehicle_style', 'year', 'engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']
The accuracy for the model without model is 0.9202936549554274
all features are: ['make', 'model', 'transmission_type', 'vehicle_style', 'year', 'engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']
eli

In [202]:
sorted_results = sorted(results.items(), key=lambda x: x[1])

# Print the sorted results
for feature, accuracy in sorted_results:
    print(f"{feature}: {accuracy}")

model: 0.9202936549554274
make: 0.9276350288411117
vehicle_style: 0.9292081803880441
engine_hp: 0.9292081803880441
engine_cylinders: 0.9402202412165706
year: 0.9449396958573676
highway_mpg: 0.9449396958573676
transmission_type: 0.9459884635553225
city_mpg: 0.9465128474043
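
The question asks for the difference from the original accuracy, which the cells above do not compute. The sketch below does so for the four answer options, using the Question 4 validation accuracy as the baseline and the accuracies printed above. Whether "smallest" is meant in absolute or signed terms is left open here, so both readings are computed; the variable names are illustrative.

```python
# Accuracy of the full model on the validation set (Question 4 output above).
baseline_accuracy = 0.9375983219716832

# Accuracies without each feature, copied from the sorted results above.
accuracy_without = {
    "year": 0.9449396958573676,
    "engine_hp": 0.9292081803880441,
    "transmission_type": 0.9459884635553225,
    "city_mpg": 0.9465128474043,
}

# Positive difference: removing the feature hurt; negative: removing it helped.
differences = {f: baseline_accuracy - acc for f, acc in accuracy_without.items()}

for feature, diff in sorted(differences.items(), key=lambda item: abs(item[1])):
    print(f"{feature}: {diff:+.4f}")

# Two possible readings of "smallest difference".
smallest_abs = min(differences, key=lambda f: abs(differences[f]))
smallest_signed = min(differences, key=differences.get)
print(smallest_abs, smallest_signed)
```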


### Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column `price`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data with a solver `'sag'`. Set the seed to `42`.
* This model also has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`.
* Round your RMSE scores to 3 decimal digits.

Which of these alphas leads to the best RMSE on the validation set?

- 0
- 0.01
- 0.1
- 1
- 10

> **Note**: If there are multiple options, select the smallest `alpha`.
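
The alpha search can be sketched end to end on synthetic data. Everything below (the data, the 150/50 split, the scaling step, and the names `rmse_by_alpha` and `best_alpha`) is an assumption for illustration rather than part of the assignment. Scaling is included because the scikit-learn documentation notes that `'sag'` only converges quickly on features with roughly the same scale.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data (hypothetical): price depends log-linearly on two features.
X = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.1, size=200)
price = np.exp(10 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + noise)
y = np.log1p(price)

X_tr, X_va = X[:150], X[150:]
y_tr, y_va = y[:150], y[150:]

# Fit the scaler on the training part only, then apply it to validation.
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_va_s = scaler.transform(X_va)

rmse_by_alpha = {}
for alpha in [0, 0.01, 0.1, 1, 10]:
    ridge = Ridge(alpha=alpha, solver="sag", max_iter=10000, random_state=42)
    ridge.fit(X_tr_s, y_tr)
    rmse = np.sqrt(mean_squared_error(y_va, ridge.predict(X_va_s)))
    rmse_by_alpha[alpha] = round(float(rmse), 3)

# min() keeps the first minimum it meets; the dict is in ascending alpha
# order, so ties resolve to the smallest alpha.
best_alpha = min(rmse_by_alpha, key=rmse_by_alpha.get)
print(rmse_by_alpha, best_alpha)
```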

In [205]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# split the data
seed = 42

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=seed)
df_train, df_val = train_test_split(df_full_train, test_size=0.2, random_state=seed)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

# 1. Apply logarithmic transformation to the 'price' column
df_train['log_price'] = np.log1p(df_train['price'])
df_val['log_price'] = np.log1p(df_val['price'])

# drop the raw price and the binary target; the regression target is log_price,
# which is read into y_train / y_val below
for part in (df_train, df_val, df_test):
    del part['price']
    del part['above_average']

# 2. Prepare the training data
train_dict = df_train[all_features].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)
y_train = df_train['log_price'].values

val_dict = df_val[all_features].to_dict(orient='records')
X_val = dv.transform(val_dict)
y_val = df_val['log_price'].values

# Scaling the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Fit the Ridge regression model with increased max_iter and using scaled data
alphas = [0, 0.01, 0.1, 1, 10]

for alpha in alphas:
    model = Ridge(alpha=alpha, solver='sag', max_iter=10000, random_state=42)
    model.fit(X_train_scaled, y_train)
    
    y_pred = model.predict(X_val_scaled)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    
    print(f"Alpha: {alpha}, RMSE: {round(rmse, 3)}")

Alpha: 0, RMSE: 0.23
Alpha: 0.01, RMSE: 0.23
Alpha: 0.1, RMSE: 0.23
Alpha: 1, RMSE: 0.229
Alpha: 10, RMSE: 0.226
