## Homework

> Note: sometimes your answer doesn't match one of the options exactly. That's fine. 
Select the option that's closest to your solution.

### Dataset

In this homework, we will use the Car price dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv).

Or you can do it with `wget`:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
```

We'll keep working with the `MSRP` variable, and we'll transform it to a classification task. 

### Features

For the rest of the homework, you'll need to use only these columns:

* `Make`,
* `Model`,
* `Year`,
* `Engine HP`,
* `Engine Cylinders`,
* `Transmission Type`,
* `Vehicle Style`,
* `highway MPG`,
* `city mpg`,
* `MSRP`

In [2]:
import pandas as pd

In [3]:
df_car_price = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv')

In [5]:
df_car_price.columns

Index(['Make', 'Model', 'Year', 'Engine Fuel Type', 'Engine HP',
       'Engine Cylinders', 'Transmission Type', 'Driven_Wheels',
       'Number of Doors', 'Market Category', 'Vehicle Size', 'Vehicle Style',
       'highway MPG', 'city mpg', 'Popularity', 'MSRP'],
      dtype='object')

### Data preparation

* Select only the features from above and transform their names using the next line:
  ```
  data.columns = data.columns.str.replace(' ', '_').str.lower()
  ```
* Fill in the missing values of the selected features with 0.
* Rename `MSRP` variable to `price`.


In [16]:
# Select the desired columns
selected_columns = ['Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders',
                    'Transmission Type', 'Vehicle Style', 'highway MPG',
                    'city mpg', 'MSRP']

# Subset the DataFrame with the selected columns
df_selected = df_car_price[selected_columns].copy()

# Transform column names
df_selected.columns = df_selected.columns.str.replace(' ', '_').str.lower()

In [17]:
# Fill missing values with 0
df_selected.fillna(0, inplace=True)

# Rename 'MSRP' variable to 'price'
df_selected.rename(columns={'msrp': 'price'}, inplace=True)

# Display the updated DataFrame
df_selected.head()

Unnamed: 0,make,model,year,engine_hp,engine_cylinders,transmission_type,vehicle_style,highway_mpg,city_mpg,price
0,BMW,1 Series M,2011,335.0,6.0,MANUAL,Coupe,26,19,46135
1,BMW,1 Series,2011,300.0,6.0,MANUAL,Convertible,28,19,40650
2,BMW,1 Series,2011,300.0,6.0,MANUAL,Coupe,28,20,36350
3,BMW,1 Series,2011,230.0,6.0,MANUAL,Coupe,28,18,29450
4,BMW,1 Series,2011,230.0,6.0,MANUAL,Convertible,28,18,34500


### Question 1

What is the most frequent observation (mode) for the column `transmission_type`?

- `AUTOMATIC`
- `MANUAL`
- `AUTOMATED_MANUAL`
- `DIRECT_DRIVE`

In [19]:
df_selected['transmission_type'].value_counts()

AUTOMATIC           8266
MANUAL              2935
AUTOMATED_MANUAL     626
DIRECT_DRIVE          68
UNKNOWN               19
Name: transmission_type, dtype: int64

In [20]:
df_selected['transmission_type'].mode()

0    AUTOMATIC
dtype: object

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.

What are the two features that have the biggest correlation in this dataset?

- `engine_hp` and `year`
- `engine_hp` and `engine_cylinders`
- `highway_mpg` and `engine_cylinders`
- `highway_mpg` and `city_mpg`




In [35]:
# Calculate the correlation matrix
correlation_matrix = df_selected.corr()

# Print the correlation matrix
print("Correlation Matrix:")
correlation_matrix

Correlation Matrix:


Unnamed: 0,year,engine_hp,engine_cylinders,highway_mpg,city_mpg,price
year,1.0,0.338714,-0.040708,0.25824,0.198171,0.22759
engine_hp,0.338714,1.0,0.774851,-0.415707,-0.424918,0.650095
engine_cylinders,-0.040708,0.774851,1.0,-0.614541,-0.587306,0.526274
highway_mpg,0.25824,-0.415707,-0.614541,1.0,0.886829,-0.160043
city_mpg,0.198171,-0.424918,-0.587306,0.886829,1.0,-0.157676
price,0.22759,0.650095,0.526274,-0.160043,-0.157676,1.0


In [29]:
correlation_matrix.abs().unstack().sort_values(ascending=False)

year              year                1.000000
engine_hp         engine_hp           1.000000
city_mpg          city_mpg            1.000000
highway_mpg       highway_mpg         1.000000
engine_cylinders  engine_cylinders    1.000000
price             price               1.000000
highway_mpg       city_mpg            0.886829
city_mpg          highway_mpg         0.886829
engine_hp         engine_cylinders    0.774851
engine_cylinders  engine_hp           0.774851
engine_hp         price               0.650095
price             engine_hp           0.650095
engine_cylinders  highway_mpg         0.614541
highway_mpg       engine_cylinders    0.614541
engine_cylinders  city_mpg            0.587306
city_mpg          engine_cylinders    0.587306
engine_cylinders  price               0.526274
price             engine_cylinders    0.526274
city_mpg          engine_hp           0.424918
engine_hp         city_mpg            0.424918
highway_mpg       engine_hp           0.415707
engine_hp    

In [36]:
# Find the two features with the biggest correlation
max_corr = correlation_matrix.abs().unstack().sort_values(ascending=False)
print("\nThe two features with the biggest correlation:")
max_corr[max_corr < 1].head(2)


The two features with the biggest correlation:


highway_mpg  city_mpg       0.886829
city_mpg     highway_mpg    0.886829
dtype: float64

### Make `price` binary

* Now we need to turn the `price` variable from numeric into a binary format.
* Let's create a variable `above_average` which is `1` if the `price` is above its mean value and `0` otherwise.



In [39]:
df_selected['price'].mean()

40594.737032063116

In [38]:
# Calculate the mean of the 'price' variable
mean_price = df_selected['price'].mean()

# Create a new column 'above_average' with binary values
df_selected['above_average'] = df_selected['price'].apply(lambda x: 1 if x > mean_price else 0)

# Display the updated DataFrame
df_selected.head()

Unnamed: 0,make,model,year,engine_hp,engine_cylinders,transmission_type,vehicle_style,highway_mpg,city_mpg,price,above_average
0,BMW,1 Series M,2011,335.0,6.0,MANUAL,Coupe,26,19,46135,1
1,BMW,1 Series,2011,300.0,6.0,MANUAL,Convertible,28,19,40650,1
2,BMW,1 Series,2011,300.0,6.0,MANUAL,Coupe,28,20,36350,0
3,BMW,1 Series,2011,230.0,6.0,MANUAL,Coupe,28,18,29450,0
4,BMW,1 Series,2011,230.0,6.0,MANUAL,Convertible,28,18,34500,0


### Split the data

* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value (`above_average`) is not in your dataframe.

In [56]:
from sklearn.model_selection import train_test_split

# Separate features (X) from the target (y)
X = df_selected.drop(columns=['above_average','price'])  # Features
y = df_selected['above_average']  # Target

# Split the data into train, validation, and test sets
# Use a 60%/20%/20% distribution and set the random seed to 42 for reproducibility
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Display the shapes of the resulting sets
print("Train set shape:", X_train.shape, y_train.shape)
print("Validation set shape:", X_val.shape, y_val.shape)
print("Test set shape:", X_test.shape, y_test.shape)

Train set shape: (7148, 9) (7148,)
Validation set shape: (2383, 9) (2383,)
Test set shape: (2383, 9) (2383,)


In [57]:
X.head(2)

Unnamed: 0,make,model,year,engine_hp,engine_cylinders,transmission_type,vehicle_style,highway_mpg,city_mpg
0,BMW,1 Series M,2011,335.0,6.0,MANUAL,Coupe,26,19
1,BMW,1 Series,2011,300.0,6.0,MANUAL,Convertible,28,19


In [43]:
y.head(2)

0    1
1    1
Name: above_average, dtype: int64

### Question 3

* Calculate the mutual information score between `above_average` and other categorical variables in our dataset. 
  Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the lowest mutual information score?
  
- `make`
- `model`
- `transmission_type`
- `vehicle_style`

In [58]:
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import mutual_info_classif

# Categorical variables for which we want to calculate mutual information
categorical_vars = ['make', 'model', 'transmission_type', 'vehicle_style']

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Label encode the categorical variables in the training set
X_train_encoded = X_train[categorical_vars].apply(label_encoder.fit_transform)

In [59]:
# Calculate mutual information scores using the training set
mi_scores = mutual_info_classif(X_train_encoded, y_train, discrete_features=True, random_state=42)

# Create a dictionary to store variable -> mutual information score
mi_scores_dict = dict(zip(categorical_vars, mi_scores))

# Print the mutual information scores for each variable, rounded to 2 decimals
for var, score in mi_scores_dict.items():
    print(f"Mutual Information score for {var}: {round(score, 2)}")

# Find the variable with the lowest mutual information score
lowest_mi_variable = min(mi_scores_dict, key=mi_scores_dict.get)
print(f"\nThe variable with the lowest mutual information score: {lowest_mi_variable}")

Mutual Information score for make: 0.24
Mutual Information score for model: 0.46
Mutual Information score for transmission_type: 0.02
Mutual Information score for vehicle_style: 0.08

The variable with the lowest mutual information score: transmission_type


### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.60
- 0.72
- 0.84
- 0.95

In [62]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Categorical variables for one-hot encoding
categorical_vars = ['make', 'model', 'transmission_type', 'vehicle_style']

# Create the logistic regression model with specified parameters
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)

# Create a ColumnTransformer to apply OneHotEncoder to specified columns
preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False), categorical_vars)],
    remainder='passthrough'
)

# Update the pipeline to use the preprocessor
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', model)
])

# Fit the model on the training dataset
pipeline.fit(X_train, y_train)

# Predict on the validation dataset
y_val_pred = pipeline.predict(X_val)

# Calculate accuracy on the validation dataset
accuracy = accuracy_score(y_val, y_val_pred)

# Print the rounded accuracy
print("Accuracy on the validation dataset:", round(accuracy, 2))

Accuracy on the validation dataset: 0.94


### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of following feature has the smallest difference?

- `year`
- `engine_hp`
- `transmission_type`
- `city_mpg`

> **Note**: the difference doesn't have to be positive

In [93]:
X_train.columns

Index(['make', 'model', 'year', 'engine_hp', 'engine_cylinders',
       'transmission_type', 'vehicle_style', 'highway_mpg', 'city_mpg'],
      dtype='object')

In [None]:
# Create a ColumnTransformer to apply OneHotEncoder to specified columns
preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False), categorical_vars)],
    remainder='passthrough'
)

# Update the pipeline to use the preprocessor
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', model)
])

In [96]:
# Features to be considered
features = ['year', 'engine_hp', 'transmission_type', 'city_mpg']
categorical_vars = ['make', 'model', 'transmission_type', 'vehicle_style']
# Train a model with all features
pipeline.fit(X_train, y_train)
accuracy_all_features = accuracy_score(y_val, pipeline.predict(X_val))

# Dictionary to store accuracy differences for each feature
accuracy_differences = {}

# Iterate through each feature and calculate accuracy difference when excluding it
for feature in features:
    # Exclude the feature
    features_subset = [f for f in categorical_vars if f != feature]
    print(features_subset)
    # Create a new preprocessor without the feature
    preprocessor_subset = ColumnTransformer(
        transformers=[
            ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False), 
             features_subset)],
        remainder='passthrough'
    )
    
    # Replace the preprocessor in a new pipeline
    new_pipeline = Pipeline([
        ('preprocessor', preprocessor_subset),
        ('model', LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42))
    ])
    
    X_train_subset = X_train.drop(feature, axis=1)
    X_val_subset = X_val.drop(feature, axis=1)
    
    # Fit and predict with the modified pipeline
    new_pipeline.fit(X_train_subset, y_train)
    accuracy_without_feature = accuracy_score(y_val, new_pipeline.predict(X_val_subset))
    
    # Calculate the difference in accuracy
    difference = accuracy_all_features - accuracy_without_feature
    
    # Store the difference for the feature
    accuracy_differences[feature] = difference

# Find the feature with the smallest difference
smallest_difference_feature = min(accuracy_differences, key=accuracy_differences.get)
smallest_difference = accuracy_differences[smallest_difference_feature]

# Print the feature with the smallest difference and the difference itself
print("Feature with the smallest difference:", smallest_difference_feature)
print("Smallest difference:", smallest_difference)


['make', 'model', 'transmission_type', 'vehicle_style']
['make', 'model', 'transmission_type', 'vehicle_style']
['make', 'model', 'vehicle_style']
['make', 'model', 'transmission_type', 'vehicle_style']
Feature with the smallest difference: city_mpg
Smallest difference: -0.004196391103650887


### Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column `price`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data with a solver `'sag'`. Set the seed to `42`.
* This model also has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`.
* Round your RMSE scores to 3 decimal digits.

Which of these alphas leads to the best RMSE on the validation set?

- 0
- 0.01
- 0.1
- 1
- 10

> **Note**: If there are multiple options, select the smallest `alpha`.

In [100]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
import numpy as np

In [101]:
# Apply logarithmic transformation to the 'price' column
df_selected['log_price'] = np.log1p(df_selected['price'])

# Separate features (X) and target (y)
X = df_selected.drop(['above_average', 'price'], axis=1)
y = df_selected['log_price']

In [105]:
# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Categorical variables for one-hot encoding
categorical_vars = ['make', 'model', 'transmission_type', 'vehicle_style']

# Create a preprocessor to apply one-hot encoding to categorical variables
preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False), categorical_vars)],
    remainder='passthrough'
)

# List of alpha values to try
alphas = [0, 0.01, 0.1, 1, 10]

# Dictionary to store RMSE for each alpha
rmse_scores = {}

# Fit Ridge regression models with different alpha values
for alpha in alphas:
    # Create a new pipeline with the current alpha
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('model', Ridge(alpha=alpha, solver='sag', random_state=42))
    ])
    
    # Fit the model and predict on the validation set
    pipeline.fit(X_train, y_train)
    y_val_pred = pipeline.predict(X_val)
    
    # Calculate RMSE for the current alpha
    rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
    rmse_scores[alpha] = round(rmse, 3)

# Find the alpha with the lowest RMSE
best_alpha = min(rmse_scores, key=rmse_scores.get)
best_rmse = rmse_scores[best_alpha]

# Print the results
print("RMSE scores for different alpha values:")
print(rmse_scores)
print("\nBest alpha with the lowest RMSE on the validation set:", best_alpha)
print("RMSE with the best alpha:", best_rmse)

RMSE scores for different alpha values:
{0: 0.097, 0.01: 0.097, 0.1: 0.097, 1: 0.097, 10: 0.099}

Best alpha with the lowest RMSE on the validation set: 0
RMSE with the best alpha: 0.097


## Submit the results

* Submit your results here: https://forms.gle/FFfNjEP4jU4rxnL26
* You can submit your solution multiple times. In this case, only the last submission will be used 
* If your answer doesn't match options exactly, select the closest one


## Deadline

The deadline for submitting is 2 October (Monday), 23:00 CEST.

After that, the form will be closed.
