## Homework

> Note: sometimes your answer doesn't match one of the options exactly. That's fine.
> Select the option that's closest to your solution.

### Dataset

In this homework, we will use the Car price dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv).

Or you can do it with `wget`:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
```

In [3]:
!mkdir -p data/homework03
!wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv -P data/homework03

--2023-09-30 02:58:05--  https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1475504 (1.4M) [text/plain]
Saving to: ‘data/homework03/data.csv’


2023-09-30 02:58:05 (24.6 MB/s) - ‘data/homework03/data.csv’ saved [1475504/1475504]



In [5]:
import pandas as pd

In [7]:
df = pd.read_csv("data/homework03/data.csv")
df.head()

| | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
| 2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |
| 3 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 | 29450 |
| 4 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury | Compact | Convertible | 28 | 18 | 3916 | 34500 |



We'll keep working with the `MSRP` variable, and we'll transform it to a classification task. 

### Features

For the rest of the homework, you'll need to use only these columns:

* `Make`,
* `Model`,
* `Year`,
* `Engine HP`,
* `Engine Cylinders`,
* `Transmission Type`,
* `Vehicle Style`,
* `highway MPG`,
* `city mpg`


In [10]:
features = ["make", "model", "year", "engine_hp", "engine_cylinders", "transmission_type", "vehicle_style", "highway_mpg", "city_mpg"]


### Data preparation

* Select only the features from above and transform their names using the following line:
  ```python
  data.columns = data.columns.str.replace(' ', '_').str.lower()
  ```
* Fill in the missing values of the selected features with 0.
* Rename the `MSRP` variable to `price`.
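
The three preparation steps above can be sketched on a tiny stand-in frame. The rows and the shortened feature list below are hypothetical and only illustrate the order of operations; they are not taken from the real file.

```python
import pandas as pd

# Toy stand-in for the car price data (hypothetical rows, not from the real file).
raw = pd.DataFrame({
    "Make": ["BMW", "Audi"],
    "Engine HP": [335.0, None],
    "city mpg": [19, 22],
    "MSRP": [46135, 30000],
})

# Normalize column names: spaces to underscores, then lower-case.
raw.columns = raw.columns.str.replace(" ", "_").str.lower()

# Keep the selected features plus the target, fill gaps with 0, rename the target.
selected = ["make", "engine_hp", "city_mpg"]  # shortened list, for illustration only
prepared = raw[selected + ["msrp"]].copy()
prepared[selected] = prepared[selected].fillna(0)
prepared = prepared.rename(columns={"msrp": "price"})

print(prepared.columns.tolist())
```

Renaming after the column names are normalized means the target is addressed as `msrp`, not `MSRP`.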


In [91]:
df[["make", "model", "year", "transmission_type", "vehicle_style"]].nunique()

make                  48
model                915
year                  28
transmission_type      5
vehicle_style         16
dtype: int64

In [100]:
vehicle_styles = df["vehicle_style"].unique()
vehicle_styles.sort()
vehicle_styles

array(['2dr Hatchback', '2dr SUV', '4dr Hatchback', '4dr SUV',
       'Cargo Minivan', 'Cargo Van', 'Convertible', 'Convertible SUV',
       'Coupe', 'Crew Cab Pickup', 'Extended Cab Pickup',
       'Passenger Minivan', 'Passenger Van', 'Regular Cab Pickup',
       'Sedan', 'Wagon'], dtype=object)

In [24]:
df.columns = df.columns.str.replace(' ', '_').str.lower()
df.columns

Index(['make', 'model', 'year', 'engine_fuel_type', 'engine_hp',
       'engine_cylinders', 'transmission_type', 'driven_wheels',
       'number_of_doors', 'market_category', 'vehicle_size', 'vehicle_style',
       'highway_mpg', 'city_mpg', 'popularity', 'price'],
      dtype='object')

In [63]:
df['engine_hp'] = df.engine_hp.fillna(0)
df['engine_cylinders'] = df.engine_cylinders.fillna(0)
df[features].isnull().sum()

make                 0
model                0
year                 0
engine_hp            0
engine_cylinders     0
transmission_type    0
vehicle_style        0
highway_mpg          0
city_mpg             0
dtype: int64

In [None]:
# Rename the target column; assigning to df.msrp.name would not rename the column in df.
df = df.rename(columns={"msrp": "price"})
df.columns


### Question 1

What is the most frequent observation (mode) for the column `transmission_type`?

- `AUTOMATIC`
- `MANUAL`
- `AUTOMATED_MANUAL`
- `DIRECT_DRIVE`



In [18]:
df.transmission_type.value_counts()

transmission_type
AUTOMATIC           8266
MANUAL              2935
AUTOMATED_MANUAL     626
DIRECT_DRIVE          68
UNKNOWN               19
Name: count, dtype: int64


### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.

What are the two features that have the biggest correlation in this dataset?

- `engine_hp` and `year`
- `engine_hp` and `engine_cylinders`
- `highway_mpg` and `engine_cylinders`
- `highway_mpg` and `city_mpg`
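
Reading the strongest pair off a correlation matrix by eye works for five columns, but it can also be done programmatically. The sketch below uses a hypothetical three-column frame and assumes "biggest correlation" means largest absolute value; the diagonal (always 1.0) and the duplicated lower triangle are masked out first.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns, chosen only to illustrate the lookup.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal of 1.0 values),
# then pick the pair with the largest absolute correlation.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
best_pair = pairs.abs().idxmax()

print(best_pair, round(pairs.loc[best_pair], 3))
```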


In [25]:
df[["engine_hp", "year", "engine_cylinders", "highway_mpg", "city_mpg"]].corr()

| | engine_hp | year | engine_cylinders | highway_mpg | city_mpg |
|---|---|---|---|---|---|
| engine_hp | 1.0 | 0.351794 | 0.779988 | -0.406563 | -0.439371 |
| year | 0.351794 | 1.0 | -0.041479 | 0.25824 | 0.198171 |
| engine_cylinders | 0.779988 | -0.041479 | 1.0 | -0.621606 | -0.600776 |
| highway_mpg | -0.406563 | 0.25824 | -0.621606 | 1.0 | 0.886829 |
| city_mpg | -0.439371 | 0.198171 | -0.600776 | 0.886829 | 1.0 |




### Make `price` binary

* Now we need to turn the `price` variable from numeric into a binary format.
* Let's create a variable `above_average` which is `1` if the `price` is above its mean value and `0` otherwise.


In [28]:
df["above_average"] = (df.price > df.price.mean()).astype(int)
df[["price", "above_average"]]

| | price | above_average |
|---|---|---|
| 0 | 46135 | 1 |
| 1 | 40650 | 1 |
| 2 | 36350 | 0 |
| 3 | 29450 | 0 |
| 4 | 34500 | 0 |
| ... | ... | ... |
| 11909 | 46120 | 1 |
| 11910 | 56670 | 1 |
| 11911 | 50620 | 1 |
| 11912 | 50920 | 1 |



### Split the data

* Split your data into train/val/test sets with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value (`above_average`) is not in your dataframe.
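
A 60%/20%/20% split takes two calls to `train_test_split`, because the function only splits in two. The sketch below uses a hypothetical 100-row frame so the resulting sizes are easy to check; the second call uses `test_size=0.25` because 20% of the whole is 25% of the remaining 80%.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row frame; only the row counts matter here.
toy = pd.DataFrame({"x": range(100), "above_average": [0, 1] * 50})

# First carve off 20% for test, leaving 80% as "full train".
full_train, test = train_test_split(toy, test_size=0.2, random_state=42)

# 20% of the whole is 25% of the remaining 80% (0.2 / 0.8 = 0.25).
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

# Pull the target out so it cannot leak into the feature matrix.
y_train = train["above_average"].values
y_val = val["above_average"].values
y_test = test["above_average"].values
train = train.drop(columns="above_average")
val = val.drop(columns="above_average")
test = test.drop(columns="above_average")

print(len(train), len(val), len(test))
```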


In [29]:
from sklearn.model_selection import train_test_split

In [125]:
df.shape, 11914*0.6, 11914*0.2, 7148.4*(2/6), 7148.4*0.3333, 7148*(1/3), 7148.4*0.25, 7148*0.25

((11914, 17),
 7148.4,
 2382.8,
 2382.7999999999997,
 2382.5617199999997,
 2382.6666666666665,
 1787.1,
 1787.0)

In [132]:
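# NOTE: the task specifies seed 42; the results recorded below come from random_state=1.
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
# test_size=0.25 because 20% of the full data is 25% of the remaining 80%.
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)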



In [134]:
df_train.shape, df_val.shape, df_test.shape


((7148, 17), (2383, 17), (2383, 17))


### Question 3

* Calculate the mutual information score between `above_average` and other categorical variables in our dataset. 
  Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the lowest mutual information score?
  
- `make`
- `model`
- `transmission_type`
- `vehicle_style`
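
Rather than calling `mutual_info_score` once per column, the scores can be collected in a loop. This is a minimal sketch on hypothetical rows in which `style` determines the target exactly and `color` is unrelated to it; the column names are invented for the example.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Hypothetical training rows: `style` tracks the target exactly, `color` does not.
toy_train = pd.DataFrame({
    "style": ["coupe", "coupe", "sedan", "sedan"],
    "color": ["red", "blue", "red", "blue"],
    "above_average": [1, 1, 0, 0],
})

categorical = ["style", "color"]

# One rounded score per categorical column, computed against the target.
scores = {
    col: round(mutual_info_score(toy_train["above_average"], toy_train[col]), 2)
    for col in categorical
}
lowest = min(scores, key=scores.get)

print(scores, lowest)
```

`mutual_info_score` reports values in nats (natural logarithm), so a perfectly informative binary variable scores about 0.69 rather than 1.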



In [135]:
from sklearn.metrics import mutual_info_score


In [136]:
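# NOTE: the task asks for the training set only (df_train); the scores recorded here use df_full_train.
round(mutual_info_score(df_full_train.above_average, df_full_train.make), 2)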

0.24

In [137]:
round(mutual_info_score(df_full_train.above_average, df_full_train.model), 2)

0.46

In [138]:
round(mutual_info_score(df_full_train.above_average, df_full_train.transmission_type), 2)

0.02

In [139]:
round(mutual_info_score(df_full_train.above_average, df_full_train.vehicle_style), 2)

0.08


### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.60
- 0.72
- 0.84
- 0.95


In [140]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

In [141]:
dv = DictVectorizer(sparse=False)

train_dict = df_train[features].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val[features].to_dict(orient='records')
X_val = dv.transform(val_dict)

In [142]:
df_train[["make", "model", "year", "transmission_type", "vehicle_style"]].nunique()

make                  48
model                880
year                  28
transmission_type      5
vehicle_style         16
dtype: int64

In [143]:
y_train = df_train.above_average.values
y_train

y_val = df_val.above_average.values
y_val

array([1, 0, 0, ..., 0, 1, 0])

In [151]:
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

In [153]:
y_pred = model.predict_proba(X_val)[:, 1]
above_avg_decision = (y_pred >= 0.5)
round((y_val == above_avg_decision).mean(), 2)

0.94



### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of the following features has the smallest difference?

- `year`
- `engine_hp`
- `transmission_type`
- `city_mpg`

> **Note**: the difference doesn't have to be positive
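
The elimination loop can be sketched on synthetic data where the answer is known in advance. Everything below is hypothetical: `signal` drives the target and `noise` does not, so dropping `noise` should barely move the accuracy. The sketch also assumes "smallest difference" is judged by absolute value, which is one reading of the note above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
# Hypothetical data: `signal` drives the target, `noise` is unrelated.
toy = pd.DataFrame({"signal": rng.normal(size=n), "noise": rng.normal(size=n)})
y = (toy["signal"] > 0).astype(int).values
train, val = toy.iloc[:150], toy.iloc[150:]
y_tr, y_va = y[:150], y[150:]

def val_accuracy(columns):
    """Fit on the given columns and return validation accuracy."""
    model = LogisticRegression(solver="liblinear", C=10, max_iter=1000, random_state=42)
    model.fit(train[columns], y_tr)
    return (model.predict(val[columns]) == y_va).mean()

all_cols = ["signal", "noise"]
baseline = val_accuracy(all_cols)

# Difference = accuracy with all features minus accuracy without one feature.
diffs = {
    col: baseline - val_accuracy([c for c in all_cols if c != col])
    for col in all_cols
}
least_useful = min(diffs, key=lambda c: abs(diffs[c]))

print(diffs, least_useful)
```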



In [154]:
def remove_feature_and_train(exclude_feature, df_train, df_val, y_train, y_val):
    # Drop the feature on new frames so the caller's data stays untouched.
    df_train = df_train[features].drop(columns=[exclude_feature])
    df_val = df_val[features].drop(columns=[exclude_feature])
    
    dv = DictVectorizer(sparse=False)

    train_dict = df_train.to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)

    val_dict = df_val.to_dict(orient='records')
    X_val = dv.transform(val_dict)

    model = LogisticRegression(
        solver='liblinear', C=10, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict_proba(X_val)[:, 1]
    above_avg_decision = (y_pred >= 0.5)
    return (y_val == above_avg_decision).mean()

In [162]:
import numpy as np

In [167]:
features_to_remove = np.array(["year", "engine_hp", "transmission_type", "city_mpg"])

accuracy_target = (y_val == above_avg_decision).mean()

smallest_feature = None
smallest_accuracy = 1

for feature_to_remove in features_to_remove:
    accuracy_removal = remove_feature_and_train(
        feature_to_remove, df_train, df_val, y_train, y_val)
    
    print(feature_to_remove, ": ", accuracy_removal - accuracy_target)

    if accuracy_removal - accuracy_target < smallest_accuracy:
        smallest_accuracy = accuracy_removal - accuracy_target
        smallest_feature = feature_to_remove

print("Feature with the smallest accuracy diff: ", smallest_feature, ": ", smallest_accuracy)

year :  0.015107007973143127
engine_hp :  0.0020981955518254436
transmission_type :  0.011749895090222395
city_mpg :  0.0012589173310952884
Feature with the smallest accuracy diff:  city_mpg :  0.0012589173310952884



### Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column `price`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data with the `'sag'` solver. Set the seed to `42`.
* This model also has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`.
* Round your RMSE scores to 3 decimal digits.

Which of these alphas leads to the best RMSE on the validation set?

- 0
- 0.01
- 0.1
- 1
- 10

> **Note**: If there are multiple options, select the smallest `alpha`.
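
The alpha sweep can be sketched on synthetic data as follows. The features, coefficients, and "prices" are hypothetical; the points to notice are that the target is transformed with `np.log1p` before fitting, and that ties on the rounded RMSE resolve to the smallest `alpha` because the candidates are tried in ascending order.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Hypothetical features and positive "prices".
X = rng.normal(size=(300, 3))
price = np.exp(10 + X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=300))

# Log-transform the target, then split into train and validation parts.
y = np.log1p(price)
X_tr, X_va, y_tr, y_va = X[:200], X[200:], y[:200], y[200:]

results = {}
for alpha in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=alpha, solver="sag", max_iter=1000, random_state=42)
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(np.mean((model.predict(X_va) - y_va) ** 2))
    results[alpha] = round(rmse, 3)

# min() returns the first minimum it meets, i.e. the smallest alpha on ties.
best_alpha = min(results, key=results.get)

print(results, best_alpha)
```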


In [174]:
import sys 
from sklearn.linear_model import Ridge

In [184]:
def ridge_train_eval(alpha, df_train, df_val, y_train, y_val):
    y_train = np.log1p(y_train)
    y_val = np.log1p(y_val)

    df_train = df_train[features]

    df_val = df_val[features]
    
    dv = DictVectorizer(sparse=True)

    train_dict = df_train.to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)

    val_dict = df_val.to_dict(orient='records')
    X_val = dv.transform(val_dict)

    model = Ridge(
        alpha=alpha, 
        solver='sag', 
        max_iter=1000,
        random_state=42, )
    model.fit(X_train, y_train)

    y_pred = model.predict(X_val)
    return np.sqrt(((y_pred - y_val)**2).mean())

In [178]:
df_train[features].columns

Index(['make', 'model', 'year', 'engine_hp', 'engine_cylinders',
       'transmission_type', 'vehicle_style', 'highway_mpg', 'city_mpg'],
      dtype='object')

In [185]:
alpha_values = np.array(
    [0, 0.01, 0.1, 1, 10])

smallest_valid_alpha = None
smallest_rmse = sys.maxsize

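for alpha in alpha_values:
    # NOTE: y_train and y_val here are the binary above_average labels from Question 4;
    # the task calls for the original prices (df_train.price.values, df_val.price.values).
    rmse = ridge_train_eval(
        alpha, df_train, df_val, y_train, y_val)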
    
    print(alpha, ": ", rmse)

    if rmse < smallest_rmse:
        smallest_rmse = rmse
        smallest_valid_alpha = alpha

print("Alpha with the smallest RMSE: ", smallest_valid_alpha, ": ", smallest_rmse)

0.0 :  0.1524735169869431
0.01 :  0.15204317715955382
0.1 :  0.15291499944397333
1.0 :  0.15442193314592131
10.0 :  0.1699586842611854
Alpha with the smallest RMSE:  0.01 :  0.15204317715955382




## Submit the results

* Submit your results here: https://forms.gle/FFfNjEP4jU4rxnL26
* You can submit your solution multiple times. In that case, only the last submission will be used.
* If your answer doesn't match the options exactly, select the closest one.


## Deadline

The deadline for submitting is 2 October (Monday), 23:00 CEST.

After that, the form will be closed.
