### Dataset

In this homework, we will continue working with the New York City Airbnb Open Data. You can get it from
[Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up for Kaggle.

We'll keep working with the `'price'` variable, but this time we'll turn it into the target of a classification task.


### Features

For the rest of the homework, you'll need the features from the previous homework plus two additional ones: `'neighbourhood_group'` and `'room_type'`. The full feature set is as follows:

* `'neighbourhood_group'`,
* `'room_type'`,
* `'latitude'`,
* `'longitude'`,
* `'price'`,
* `'minimum_nights'`,
* `'number_of_reviews'`,
* `'reviews_per_month'`,
* `'calculated_host_listings_count'`,
* `'availability_365'`

Select only these columns and fill in the missing values with 0.

In [1]:
# Importing Libraries

from IPython.display import display
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


import dtale
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('AB_NYC_2019.csv')
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [3]:
cat_cols = list(df.dtypes[df.dtypes == 'O'].index)

for col in cat_cols:
    df[col] = df[col].str.lower().str.replace(' ', '_')
    
display(df.head())

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,clean_&_quiet_apt_home_by_the_park,2787,john,brooklyn,kensington,40.64749,-73.97237,private_room,149,1,9,2018-10-19,0.21,6,365
1,2595,skylit_midtown_castle,2845,jennifer,manhattan,midtown,40.75362,-73.98377,entire_home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,the_village_of_harlem....new_york_!,4632,elisabeth,manhattan,harlem,40.80902,-73.9419,private_room,150,3,0,,,1,365
3,3831,cozy_entire_floor_of_brownstone,4869,lisaroxanne,brooklyn,clinton_hill,40.68514,-73.95976,entire_home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,entire_apt:_spacious_studio/loft_by_central_park,7192,laura,manhattan,east_harlem,40.79851,-73.94399,entire_home/apt,80,10,9,2018-11-19,0.1,1,0


In [4]:
features = [
    'neighbourhood_group',
    'room_type',
    'latitude',
    'longitude',
    'price',
    'minimum_nights',
    'number_of_reviews',
    'reviews_per_month',
    'calculated_host_listings_count',
    'availability_365'
]

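# Select the features and fill missing values with 0 in one step.
# This avoids calling an in-place fillna on a slice of the original frame.
df = df[features].fillna(0)
df.head()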

Unnamed: 0,neighbourhood_group,room_type,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,brooklyn,private_room,40.64749,-73.97237,149,1,9,0.21,6,365
1,manhattan,entire_home/apt,40.75362,-73.98377,225,1,45,0.38,2,355
2,manhattan,private_room,40.80902,-73.9419,150,3,0,0.0,1,365
3,brooklyn,entire_home/apt,40.68514,-73.95976,89,1,270,4.64,1,194
4,manhattan,entire_home/apt,40.79851,-73.94399,80,10,9,0.1,1,0


### Question 1

What is the most frequent observation (mode) for the column `'neighbourhood_group'`?

In [92]:
df.neighbourhood_group.mode()

0    manhattan
dtype: object

### Split the data

* Split your data into train/val/test sets with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
* Make sure that the target value (`'price'`) is not in your dataframe.
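
The 60%/20%/20% split needs two calls to `train_test_split`, and the second `test_size` is easy to get wrong. Here is a minimal sketch on made-up data (the names `X_demo`, `y_demo` and `split_sizes` are illustrative, not part of the homework): holding out 20% first leaves 80%, and 0.25 of that 80% is again 20% of the original.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real feature matrix (illustrative only).
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000)

# Step 1: hold out 20% of all rows as the test set.
X_full, X_te, y_full, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 2: 0.25 of the remaining 80% equals 20% of the original data.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_full, y_full, test_size=0.25, random_state=42
)

split_sizes = (len(X_tr), len(X_va), len(X_te))
print(split_sizes)  # (600, 200, 200)
```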

In [6]:
X = df.drop('price', axis=1)
y = df.price

In [121]:
X_f, X_test, y_f, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [122]:
X_train, X_val, y_train, y_val = train_test_split(X_f, y_f, test_size=0.25, random_state=42)

In [123]:
X_list = [X_train, X_val, X_test]
y_list = [y_train, y_val, y_test]
df_list = list(zip(X_list, y_list))

In [124]:
for i in df_list:
    print(i[0].shape, i[1].shape)

(29337, 9) (29337,)
(9779, 9) (9779,)
(9779, 9) (9779,)


In [125]:
display(X_train)
print(y_train)

Unnamed: 0,neighbourhood_group,room_type,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
13575,brooklyn,entire_home/apt,40.72760,-73.94495,3,29,0.70,13,50
48476,manhattan,private_room,40.70847,-74.00498,1,0,0.00,1,7
44499,bronx,entire_home/apt,40.83149,-73.92766,40,0,0.00,1,0
17382,brooklyn,entire_home/apt,40.66448,-73.99407,2,3,0.08,1,0
14638,manhattan,private_room,40.74118,-74.00012,1,48,1.80,2,67
...,...,...,...,...,...,...,...,...,...
13198,brooklyn,private_room,40.71748,-73.95685,6,5,0.13,1,0
14583,brooklyn,private_room,40.66397,-73.98538,1,7,0.17,2,0
6168,manhattan,private_room,40.79994,-73.97001,1,1,0.64,1,88
12248,brooklyn,private_room,40.69585,-73.96344,60,0,0.00,1,0


13575     99
48476     57
44499     70
17382    130
14638    110
        ... 
13198     50
14583    125
6168     299
12248     65
20523     92
Name: price, Length: 29337, dtype: int64


In [12]:
display(X_val)
print(y_val)

Unnamed: 0,neighbourhood_group,room_type,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
27408,brooklyn,private_room,40.70239,-73.92931,1,35,1.80,1,52
7741,brooklyn,entire_home/apt,40.68498,-73.96618,14,4,0.11,2,343
4771,brooklyn,entire_home/apt,40.66911,-73.94824,3,153,2.64,1,260
1719,manhattan,private_room,40.79767,-73.96114,3,2,0.02,3,0
19153,manhattan,entire_home/apt,40.76075,-73.99893,30,0,0.00,18,365
...,...,...,...,...,...,...,...,...,...
31286,manhattan,entire_home/apt,40.75974,-73.98924,4,2,0.14,1,5
35694,bronx,shared_room,40.80952,-73.93005,1,17,1.79,7,83
14003,manhattan,private_room,40.73058,-73.98286,1,4,0.10,1,0
13892,brooklyn,private_room,40.68858,-73.95387,1,0,0.00,1,0


27408     65
7741      89
4771     200
1719     120
19153    748
        ... 
31286    300
35694     28
14003     39
13892     70
10029     77
Name: price, Length: 9779, dtype: int64


### Question 2

* Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your train dataset.
   * In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* What are the two features that have the biggest correlation in this dataset?

Example of a correlation matrix for the car price dataset:

<img src="https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/03-classification/images/correlation-matrix.png?raw=true" />
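
One possible way to read the strongest pair off a correlation matrix is to mask everything except the strict upper triangle, so that each pair appears once and the diagonal of ones is ignored. The sketch below uses a small synthetic frame (`toy`, with columns `a`, `b`, `c` invented for illustration) rather than the Airbnb data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),  # strongly tied to "a"
    "c": rng.normal(size=200),                     # independent noise
})

# Absolute correlations, because a strong negative correlation counts too.
corr = toy.corr().abs()

# Keep only the strict upper triangle: each pair once, no diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

top_pair = pairs.idxmax()  # tuple of the two column names
print(top_pair, round(pairs.max(), 3))
```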


In [13]:
num_cols = list(X_train.dtypes[X_train.dtypes != 'O'].index)
num_cols

['latitude',
 'longitude',
 'minimum_nights',
 'number_of_reviews',
 'reviews_per_month',
 'calculated_host_listings_count',
 'availability_365']

In [14]:
df_ = X_train.copy()
df_['price'] = y_train
display(df_.head())
df_[num_cols].corrwith(df_.price)

Unnamed: 0,neighbourhood_group,room_type,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,price
13575,brooklyn,entire_home/apt,40.7276,-73.94495,3,29,0.7,13,50,99
48476,manhattan,private_room,40.70847,-74.00498,1,0,0.0,1,7,57
44499,bronx,entire_home/apt,40.83149,-73.92766,40,0,0.0,1,0,70
17382,brooklyn,entire_home/apt,40.66448,-73.99407,2,3,0.08,1,0,130
14638,manhattan,private_room,40.74118,-74.00012,1,48,1.8,2,67,110


latitude                          0.035428
longitude                        -0.146318
minimum_nights                    0.046668
number_of_reviews                -0.048225
reviews_per_month                -0.052908
calculated_host_listings_count    0.053746
availability_365                  0.080121
dtype: float64

In [15]:
X_train[num_cols].corr()

Unnamed: 0,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
latitude,1.0,0.080301,0.027441,-0.006246,-0.007159,0.019375,-0.005891
longitude,0.080301,1.0,-0.06066,0.055084,0.134642,-0.117041,0.083666
minimum_nights,0.027441,-0.06066,1.0,-0.07602,-0.120703,0.118647,0.138901
number_of_reviews,-0.006246,0.055084,-0.07602,1.0,0.590374,-0.073167,0.174477
reviews_per_month,-0.007159,0.134642,-0.120703,0.590374,1.0,-0.048767,0.165376
calculated_host_listings_count,0.019375,-0.117041,0.118647,-0.073167,-0.048767,1.0,0.225913
availability_365,-0.005891,0.083666,0.138901,0.174477,0.165376,0.225913,1.0


In [117]:
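# Absolute pairwise correlations, flattened into a Series indexed by (feature, feature)
cor = X_train[num_cols].corr().abs().unstack().sort_values(kind="quicksort", ascending=False)

# Drop the diagonal (self-correlation of 1) and sort the pairs by ascending correlation
corl = sorted([(i, cor[i]) for i in cor.index if cor[i] != 1], key=lambda x: x[1])

# After abs() every value is non-negative, so the last pair is the most correlated one
mx = corl[-1]

print(mx)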

(('reviews_per_month', 'number_of_reviews'), 0.5903739015971663)


### Make price binary

* We need to turn the price variable from numeric into binary.
* Let's create a variable `above_average` which is `1` if the price is above (or equal to) `152`.
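
As a small illustration of the rule above, here is a sketch on a handful of made-up prices (the values and the name `THRESHOLD` are assumptions for the example). Note that a listing priced at exactly 152 gets label `1` under `>= 152`, whereas comparing with `>` against the unrounded mean would give it `0`.

```python
import pandas as pd

# Made-up prices, chosen to sit on both sides of the threshold.
prices = pd.Series([80, 151, 152, 153, 225], name="price")

THRESHOLD = 152
above_average = (prices >= THRESHOLD).astype(int)

print(above_average.tolist())  # [0, 0, 1, 1, 1]
```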

In [16]:
gross_mean = df.price.mean()
print(f'Average Price is {gross_mean}')

Average Price is 152.7206871868289


In [127]:
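# Binarize using the exact mean as the threshold. Prices are integers, so this
# differs from the `>= 152` rule only for listings priced at exactly 152.
y_train = np.where(y_train > gross_mean, 1, 0)
y_val = np.where(y_val > gross_mean, 1, 0)
y_test = np.where(y_test > gross_mean, 1, 0)
y_train, y_val, y_test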

(array([0, 0, 0, ..., 1, 0, 0]),
 array([0, 0, 1, ..., 0, 0, 0]),
 array([0, 0, 0, ..., 0, 1, 0]))

In [18]:
global_mean = np.mean([y_train.mean(),  y_val.mean(), y_test.mean()])
print(f'Global Mean of binarized price is {round(global_mean, 2)}')

Global Mean of binarized price is 0.3


### Question 3

* Calculate the mutual information score with the (binarized) price for the two categorical variables that we have. Use the training set only.
* Which of these two variables has bigger score?
* Round it to 2 decimal digits using `round(score, 2)`
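
To build some intuition for the score, here is a toy sketch (all values invented for illustration): one categorical feature determines the target completely, the other is independent of it. `mutual_info_score` uses natural logarithms, so a perfectly informative feature for a balanced binary target scores about `ln 2`.

```python
from sklearn.metrics import mutual_info_score

# Toy binary target and two made-up categorical features.
target = [1, 1, 0, 0, 1, 1, 0, 0]
informative = ["apt", "apt", "room", "room", "apt", "apt", "room", "room"]
uninformative = ["x", "y", "x", "y", "x", "y", "x", "y"]

mi_inf = round(mutual_info_score(informative, target), 2)
mi_unf = round(mutual_info_score(uninformative, target), 2)

print(mi_inf, mi_unf)  # 0.69 0.0
```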


In [19]:
cat_feat = list(set(X_train.columns) - set(num_cols))
cat_feat

['neighbourhood_group', 'room_type']

In [20]:
df_.price = y_train
for c in cat_feat:
    print(c, '\n')
    df_group = df_.groupby(c).price.agg(['mean', 'count'])
    df_group['diff'] = df_group['mean'] - global_mean
    df_group['risk'] = df_group['mean'] / global_mean
    display(df_group)
    print('\n')

neighbourhood_group 



neighbourhood_group,mean,count,diff,risk
bronx,0.066563,646,-0.237001,0.219273
brooklyn,0.21237,12078,-0.091195,0.699587
manhattan,0.454762,13031,0.151197,1.498074
queens,0.118609,3364,-0.184956,0.39072
staten_island,0.123853,218,-0.179711,0.407997




room_type 



room_type,mean,count,diff,risk
entire_home/apt,0.528654,15286,0.225089,1.741488
private_room,0.062874,13360,-0.24069,0.20712
shared_room,0.05644,691,-0.247124,0.185924






In [128]:
mis = lambda x: round(mutual_info_score(x, y_train), 2)
mutual = X_train[cat_feat].apply(mis)
print(mutual)

neighbourhood_group    0.05
room_type              0.14
dtype: float64


In [22]:
score = lambda x: round(mutual_info_score(X_train[x], y_train), 2)
score_df = pd.DataFrame({'Feature':cat_feat, 'Mutual Info Score':list(map(score, cat_feat))})
display(score_df)
print(score_df.max())

Unnamed: 0,Feature,Mutual Info Score
0,neighbourhood_group,0.05
1,room_type,0.14


Feature              room_type
Mutual Info Score         0.14
dtype: object


### Question 4

* Now let's train a logistic regression
* Remember that we have two categorical variables in the data. Include them using one-hot encoding.
* Fit the model on the training dataset.
   * To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
   * `model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
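
The steps above can be sketched end to end on a few made-up records (the listings, labels and names such as `dv_demo` and `clf` are illustrative assumptions). `DictVectorizer` one-hot encodes string values and passes numbers through unchanged; it is fitted on the training records only and then reused to transform the validation records.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up listings: one categorical and one numerical feature.
train_records = [
    {"room_type": "entire_home/apt", "minimum_nights": 2},
    {"room_type": "entire_home/apt", "minimum_nights": 3},
    {"room_type": "private_room", "minimum_nights": 1},
    {"room_type": "private_room", "minimum_nights": 2},
]
train_labels = [1, 1, 0, 0]
val_records = [
    {"room_type": "entire_home/apt", "minimum_nights": 1},
    {"room_type": "private_room", "minimum_nights": 4},
]
val_labels = [1, 0]

# Fit the encoder on training data only, then reuse it for validation.
dv_demo = DictVectorizer(sparse=False)
X_tr_demo = dv_demo.fit_transform(train_records)
X_va_demo = dv_demo.transform(val_records)
print(dv_demo.feature_names_)

clf = LogisticRegression(solver="lbfgs", C=1.0, random_state=42)
clf.fit(X_tr_demo, train_labels)

# Accuracy is the share of validation rows predicted correctly.
val_acc = round(float((clf.predict(X_va_demo) == val_labels).mean()), 2)
print(val_acc)
```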


In [23]:
# One Hot Encoding Using DictVectorizer

train_dict = X_train.to_dict(orient='records')
dv = DictVectorizer(sparse=False)
dv.fit(train_dict)
X_train_v = dv.transform(train_dict)

In [24]:
X_train_v

array([[ 50.     ,  13.     ,  40.7276 , ...,   1.     ,   0.     ,
          0.     ],
       [  7.     ,   1.     ,  40.70847, ...,   0.     ,   1.     ,
          0.     ],
       [  0.     ,   1.     ,  40.83149, ...,   1.     ,   0.     ,
          0.     ],
       ...,
       [ 88.     ,   1.     ,  40.79994, ...,   0.     ,   1.     ,
          0.     ],
       [  0.     ,   1.     ,  40.69585, ...,   0.     ,   1.     ,
          0.     ],
       [281.     ,   2.     ,  40.64438, ...,   1.     ,   0.     ,
          0.     ]])

In [25]:
model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X_train_v, y_train)

LogisticRegression(random_state=42)

In [26]:
model.intercept_

array([-0.00260767])

In [27]:
model.coef_.round(3)

array([[ 0.003,  0.004, -0.228, -0.095, -0.011, -0.396,  0.124,  1.216,
        -0.817, -0.13 , -0.003, -0.041,  1.646, -1.179, -0.47 ]])

In [28]:
train_score = (model.predict(X_train_v) == y_train).mean()
train_score

0.7888673006783243

In [29]:
val_dict = X_val.to_dict(orient='records')
X_val_v = dv.transform(val_dict)

In [30]:
df_val_pred = pd.DataFrame()
df_val_pred['probability of 0'] = model.predict_proba(X_val_v)[:, 0]
df_val_pred['probability of 1'] = model.predict_proba(X_val_v)[:, 1]
df_val_pred['prediction'] = model.predict(X_val_v)
df_val_pred['actual'] = y_val
df_val_pred.head(20)

Unnamed: 0,probability of 0,probability of 1,prediction,actual
0,0.967616,0.032384,0,0
1,0.418744,0.581256,1,0
2,0.600452,0.399548,0,1
3,0.909354,0.090646,0,0
4,0.201689,0.798311,1,1
5,0.642367,0.357633,0,0
6,0.372467,0.627533,1,1
7,0.386564,0.613436,1,0
8,0.801806,0.198194,0,0
9,0.317338,0.682662,1,1


In [31]:
((model.predict_proba(X_val_v)[:, 1] >= 0.5) == y_val).mean()

0.7857654156866756

In [32]:
(model.predict(X_val_v) == y_val).mean()

0.7857654156866756

In [33]:
round((model.predict(X_val_v) == y_val).mean(), 2)

0.79

### Question 5

* We have 9 features: 7 numerical features and 2 categorical.
* Let's find the least useful one using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 
* Which of the following features has the smallest difference?
   * `neighbourhood_group`
   * `room_type` 
   * `number_of_reviews`
   * `reviews_per_month`

> **note**: the difference doesn't have to be positive
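
A minimal sketch of the elimination loop on synthetic data follows. It assumes the accuracy is measured on a held-out validation split, and the helper `val_accuracy` as well as the columns `signal`, `noise` and `kind` are hypothetical names made up for the example. Each entry of `diffs` is the accuracy with all features minus the accuracy without that feature.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def val_accuracy(cols, df_tr, y_tr, df_va, y_va):
    """Train on the given columns and return the validation accuracy."""
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(df_tr[cols].to_dict(orient="records"))
    X_va = dv.transform(df_va[cols].to_dict(orient="records"))
    model = LogisticRegression(solver="lbfgs", C=1.0, random_state=42)
    model.fit(X_tr, y_tr)
    return (model.predict(X_va) == y_va).mean()


# Synthetic data: only "signal" carries information about the label.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "signal": rng.normal(size=600),
    "noise": rng.normal(size=600),
    "kind": rng.choice(["a", "b"], size=600),
})
labels = (toy["signal"] > 0).astype(int).to_numpy()
df_tr, df_va = toy.iloc[:400], toy.iloc[400:]
y_tr, y_va = labels[:400], labels[400:]

all_cols = list(toy.columns)
base = val_accuracy(all_cols, df_tr, y_tr, df_va, y_va)

# Original accuracy minus accuracy without the feature, one entry per feature.
diffs = {
    col: base - val_accuracy([c for c in all_cols if c != col],
                             df_tr, y_tr, df_va, y_va)
    for col in all_cols
}
print(diffs)
```

A feature whose difference is close to zero (or negative) contributes little to the model; a large positive difference marks a feature the model depends on.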

In [34]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
columns = dv.get_feature_names_out()

In [35]:
feature_imp = dict(zip(columns, model.coef_[0].round(3)))

In [36]:
sorted(feature_imp.items(), key=lambda x: x[1], reverse=True)

[('room_type=entire_home/apt', 1.646),
 ('neighbourhood_group=manhattan', 1.216),
 ('neighbourhood_group=brooklyn', 0.124),
 ('calculated_host_listings_count', 0.004),
 ('availability_365', 0.003),
 ('number_of_reviews', -0.003),
 ('minimum_nights', -0.011),
 ('reviews_per_month', -0.041),
 ('longitude', -0.095),
 ('neighbourhood_group=staten_island', -0.13),
 ('latitude', -0.228),
 ('neighbourhood_group=bronx', -0.396),
 ('room_type=shared_room', -0.47),
 ('neighbourhood_group=queens', -0.817),
 ('room_type=private_room', -1.179)]

In [131]:
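def gen_scores(inputs, label):
    """Refit the model with each feature left out and return the accuracy change.

    Accuracy is measured on the same data the model is fitted on (`inputs`).
    Each value is `accuracy_without_feature - train_score`, so a negative
    number means that dropping the feature lowered the accuracy.
    """
    scores_vs = {}
    cols = num_cols + cat_feat

    for col in cols:
        # All features except the one being eliminated
        feats = [c for c in cols if c != col]
        k_dict = inputs[feats].to_dict(orient='records')

        # One-hot encode the remaining features
        dv = DictVectorizer(sparse=False)
        k = dv.fit_transform(k_dict)

        model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
        model.fit(k, label)
        score = (model.predict(k) == label).mean()
        scores_vs[col] = round(score - train_score, 6)

    return scores_vs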

In [132]:
gen_scores(X_train, y_train)

{'latitude': -0.000136,
 'longitude': 0.000205,
 'minimum_nights': -0.001193,
 'number_of_reviews': -0.001091,
 'reviews_per_month': 0.0,
 'calculated_host_listings_count': -0.001023,
 'availability_365': -0.003715,
 'neighbourhood_group': -0.040734,
 'room_type': -0.071139}

In [56]:
X_train.head()

Unnamed: 0,neighbourhood_group,room_type,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
13575,brooklyn,entire_home/apt,40.7276,-73.94495,3,29,0.7,13,50
48476,manhattan,private_room,40.70847,-74.00498,1,0,0.0,1,7
44499,bronx,entire_home/apt,40.83149,-73.92766,40,0,0.0,1,0
17382,brooklyn,entire_home/apt,40.66448,-73.99407,2,3,0.08,1,0
14638,manhattan,private_room,40.74118,-74.00012,1,48,1.8,2,67


### Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn
* We'll need to use the original column `'price'`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data.
* This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`
* Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest `alpha`.
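
The procedure can be sketched on synthetic data as follows (the arrays, the helper `rmse` and the coefficients used to generate prices are assumptions for illustration, not the homework data). The target is transformed with `np.log1p`, one `Ridge` model is fitted per `alpha`, and ties in the rounded RMSE are broken in favour of the smallest `alpha`.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic features and strictly positive, skewed "prices".
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
price_demo = np.exp(4 + 0.5 * X_demo[:, 0] + rng.normal(scale=0.3, size=300))

# log1p compresses the long right tail; np.expm1 inverts it later.
y_log = np.log1p(price_demo)
X_tr, X_va = X_demo[:200], X_demo[200:]
y_tr, y_va = y_log[:200], y_log[200:]


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


scores = {}
for alpha in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=alpha)
    model.fit(X_tr, y_tr)
    scores[alpha] = round(rmse(y_va, model.predict(X_va)), 3)

# Lowest rounded RMSE wins; ties go to the smallest alpha.
best_rmse = min(scores.values())
best_alpha = min(a for a, s in scores.items() if s == best_rmse)
print(scores, best_alpha)
```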

In [61]:
X = df.drop('price', axis=1)
y = df.price

In [62]:
X_f, X_test, y_f, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [63]:
X_train, X_val, y_train, y_val = train_test_split(X_f, y_f, test_size=0.25, random_state=42)

In [76]:
# One Hot Encoding

def infuse_dummy(df, feature_list):
    dummy = pd.get_dummies(df[feature_list], prefix=feature_list)
    df = pd.concat([df, dummy,], axis=1).drop(feature_list, axis=1)
    return df

In [78]:
# One-hot encode the categorical columns (run this cell only once)
X_train = infuse_dummy(X_train, cat_feat)
X_train.head()

Unnamed: 0,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,neighbourhood_group_bronx,neighbourhood_group_brooklyn,neighbourhood_group_manhattan,neighbourhood_group_queens,neighbourhood_group_staten_island,room_type_entire_home/apt,room_type_private_room,room_type_shared_room
13575,40.7276,-73.94495,3,29,0.7,13,50,0,1,0,0,0,1,0,0
48476,40.70847,-74.00498,1,0,0.0,1,7,0,0,1,0,0,0,1,0
44499,40.83149,-73.92766,40,0,0.0,1,0,1,0,0,0,0,1,0,0
17382,40.66448,-73.99407,2,3,0.08,1,0,0,1,0,0,0,1,0,0
14638,40.74118,-74.00012,1,48,1.8,2,67,0,0,1,0,0,0,1,0


In [79]:
# One-hot encode the categorical columns (run this cell only once)
X_val = infuse_dummy(X_val, cat_feat)
X_val.head()

Unnamed: 0,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,neighbourhood_group_bronx,neighbourhood_group_brooklyn,neighbourhood_group_manhattan,neighbourhood_group_queens,neighbourhood_group_staten_island,room_type_entire_home/apt,room_type_private_room,room_type_shared_room
27408,40.70239,-73.92931,1,35,1.8,1,52,0,1,0,0,0,0,1,0
7741,40.68498,-73.96618,14,4,0.11,2,343,0,1,0,0,0,1,0,0
4771,40.66911,-73.94824,3,153,2.64,1,260,0,1,0,0,0,1,0,0
1719,40.79767,-73.96114,3,2,0.02,3,0,0,0,1,0,0,0,1,0
19153,40.76075,-73.99893,30,0,0.0,18,365,0,0,1,0,0,1,0,0


In [80]:
# Log-transform the target (run this cell only once)
y_train = np.log1p(y_train)

In [81]:
y_train

13575    4.605170
48476    4.060443
44499    4.262680
17382    4.875197
14638    4.709530
           ...   
13198    3.931826
14583    4.836282
6168     5.703782
12248    4.189655
20523    4.532599
Name: price, Length: 29337, dtype: float64

In [82]:
# Log-transform the target (run this cell only once)
y_val = np.log1p(y_val)
y_val

27408    4.189655
7741     4.499810
4771     5.303305
1719     4.795791
19153    6.618739
           ...   
31286    5.707110
35694    3.367296
14003    3.688879
13892    4.262680
10029    4.356709
Name: price, Length: 9779, dtype: float64

In [87]:
from sklearn.linear_model import Ridge


In [91]:
alphas = [0, 0.01, 0.1, 1, 10]
rmse_ = []
for i in alphas:
    rig = Ridge(alpha=i)
    rig.fit(X_train, y_train)
    y_pred = rig.predict(X_val)
    # RMSE on the validation set (targets are log1p-transformed prices)
    rmse = np.sqrt(((y_pred - y_val)**2).mean())
    rmse_.append(round(rmse, 3))

# Map each alpha to its rounded validation RMSE
score = dict(zip(alphas, rmse_))
print(score)

{0: 0.501, 0.01: 0.497, 0.1: 0.497, 1: 0.497, 10: 0.498}


## Submit the results

Submit your results here: https://forms.gle/xGpZhoq9Efm9E4RA9

It's possible that your answers won't match exactly. If it's the case, select the closest one.


## Deadline

The deadline for submitting is 27 September 2021, 17:00 CET. After that, the form will be closed.


## Navigation

* [Machine Learning Zoomcamp course](https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp)
* [Session 3: Machine Learning for Classification](https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp/03-classification)
* [Explore more](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/03-classification/14-explore-more.md)