# Practice assignment: Advanced ensembling techniques


In this programming assignment, you are going to work with a dataset based on the following data:

https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

_Citation:_

* _K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal._

The dataset contains information about online news articles. In this assignment, you are going to predict the number of shares of a news article (target column: `shares`). Information about the features is available at the link above. You are going to train several machine learning models (XGBoost, LightGBM, CatBoost and Lasso) and blend them into the final ensemble.

In [1]:
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

In [2]:
df = pd.read_csv('data.csv')

## 1

**q1:** How many missing values are there in the data? Provide the number of cells in the dataframe that contain NaNs.

In [3]:
def rmse_from_estimator(estimator, X_test, y_test):
    """Return the estimator's RMSE on (X_test, y_test), rounded to 5 decimal places."""
    return round(np.sqrt(mean_squared_error(y_test, estimator.predict(X_test))), 5)

In [4]:
df.isna().sum()

n_tokens_title                   0
n_tokens_content                 0
n_unique_tokens                  0
n_non_stop_words                 0
n_non_stop_unique_tokens         0
num_hrefs                        0
num_self_hrefs                   0
num_imgs                         0
num_videos                       0
average_token_length             0
num_keywords                     0
data_channel_is_lifestyle        0
data_channel_is_entertainment    0
data_channel_is_bus              0
data_channel_is_socmed           0
data_channel_is_tech             0
data_channel_is_world            0
kw_min_min                       0
kw_max_min                       0
kw_avg_min                       0
kw_min_max                       0
kw_max_max                       0
kw_avg_max                       0
kw_min_avg                       0
kw_max_avg                       0
kw_avg_avg                       0
self_reference_min_shares        0
self_reference_max_shares        0
self_reference_avg_s
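
Since q1 asks for a single number rather than per-column counts, the per-column result above can be summed once more. A minimal sketch on a made-up frame (the `toy` frame and its column names are illustrative assumptions, not part of the assignment data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `df`; the values are made up.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

per_column = toy.isna().sum()            # NaN count for each column
total_nan_cells = int(per_column.sum())  # NaN count over the whole frame
print(total_nan_cells)
```

On the assignment data, the equivalent call would be `df.isna().sum().sum()`.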

## 2

**q2:** What is the maximum number of shares among all the news articles presented in the data?

In [5]:
df.describe()

| | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | num_videos | average_token_length | ... | min_positive_polarity | max_positive_polarity | avg_negative_polarity | min_negative_polarity | max_negative_polarity | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity | shares |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | ... | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 |
| mean | 10.398749 | 546.514731 | 0.548216 | 0.996469 | 0.689175 | 10.88369 | 3.293638 | 4.544143 | 1.249874 | 4.548239 | ... | 0.095446 | 0.756728 | -0.259524 | -0.521944 | -0.1075 | 0.282353 | 0.071425 | 0.341843 | 0.156064 | 3395.380184 |
| std | 2.114037 | 471.107508 | 3.520708 | 5.231231 | 3.264816 | 11.332017 | 3.855141 | 8.309434 | 4.107855 | 0.844406 | ... | 0.071315 | 0.247786 | 0.127726 | 0.29029 | 0.095373 | 0.324247 | 0.26545 | 0.188791 | 0.226294 | 11626.950749 |
| min | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | -1.0 | -1.0 | -1.0 | 0.0 | -1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 9.0 | 246.0 | 0.47087 | 1.0 | 0.625739 | 4.0 | 1.0 | 1.0 | 0.0 | 4.478404 | ... | 0.05 | 0.6 | -0.328383 | -0.7 | -0.125 | 0.0 | 0.0 | 0.166667 | 0.0 | 946.0 |
| 50% | 10.0 | 409.0 | 0.539226 | 1.0 | 0.690476 | 8.0 | 3.0 | 1.0 | 0.0 | 4.664082 | ... | 0.1 | 0.8 | -0.253333 | -0.5 | -0.1 | 0.15 | 0.0 | 0.5 | 0.0 | 1400.0 |
| 75% | 12.0 | 716.0 | 0.608696 | 1.0 | 0.75463 | 14.0 | 4.0 | 4.0 | 1.0 | 4.854839 | ... | 0.1 | 1.0 | -0.186905 | -0.3 | -0.05 | 0.5 | 0.15 | 0.5 | 0.25 | 2800.0 |
| max | 23.0 | 8474.0 | 701.0 | 1042.0 | 650.0 | 304.0 | 116.0 | 128.0 | 91.0 | 8.041534 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.5 | 1.0 | 843300.0 |


## 3

**q3:** What is the median number of shares for the articles published on Monday?

In [6]:
df[df.weekday_is_monday == 1].shares.median()

1400.0

## 4

First, we separate the target from the dataframe with features (`df` -> `X`, `y`).

Next, let's split the data into train/val/test sets in the ratio 60:20:20. We will use the train set to fit our models, the val set to validate them, and the test set to calculate the final error of the blend, so the test set remains completely unseen data.

To do this, use a regular `train_test_split` from `sklearn` to split `X` and `y` into train and val/test parts in the ratio 60:40. Then use `train_test_split` again to split the obtained val/test part into validation and test parts in the ratio 50:50. In each `train_test_split` call, use `random_state=13` and default values for all other parameters.

In the end, you should obtain `X_train`, `X_val`, `X_test` with the following shapes, respectively: (23786, 58), (7929, 58), (7929, 58). The same logic is with `y_train`, `y_val`, `y_test`.

**q4:** What is the mean value of the target in the test part (`y_test`)? Provide the answer rounded to FIVE decimal places (e.g. 12.3456789 -> 12.34568).
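
The shapes quoted above can be checked by hand. The sketch below assumes that, when only a float `test_size` is given, `train_test_split` rounds the test part up and gives the remainder to the train part; the helper `split_sizes` is hypothetical and only reproduces that arithmetic, it does not split any data.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a float test_size.

    Assumption: the test part is rounded up and the train part takes
    whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# 39644 is the row count reported by df.describe().
n_train, n_rest = split_sizes(39644, 0.4)  # first split, 60:40
n_val, n_test = split_sizes(n_rest, 0.5)   # second split, 50:50
print(n_train, n_val, n_test)
```

Under that assumption, the two-stage 60:40 and 50:50 split gives the 60:20:20 proportions the task asks for.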

In [7]:
X = df.drop('shares', axis=1)
y = df['shares']

In [8]:
X_train, X_testval, y_train, y_testval = train_test_split(X, y, test_size=0.4, random_state=13)
X_val, X_test, y_val, y_test = train_test_split(X_testval, y_testval, test_size=0.5, random_state=13)

In [9]:
X_train.shape, X_val.shape, X_test.shape

((23786, 58), (7929, 58), (7929, 58))

In [10]:
round(y_test.mean(), 5)

3349.74057

## 5

Now let's train our first model - XGBoost. A link to the documentation: https://xgboost.readthedocs.io/en/latest/

We will use the scikit-learn wrapper interface for XGBoost (the same applies to the LightGBM and CatBoost models that follow). This is a regression task, so we will use `XGBRegressor`. Read about the parameters of `XGBRegressor`: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor

Look through the main list of XGBoost parameters so that you know which parameters the library offers: https://xgboost.readthedocs.io/en/latest/parameter.html

Take `XGBRegressor` with MSE objective (`objective='reg:squarederror'`), 200 trees (`n_estimators=200`), `learning_rate=0.01`, `max_depth=5`, `random_state=13` and all other default parameter values. Train it on the train set (`fit` function). 

**q5:** Calculate Root Mean Squared Error (RMSE) on the validation set. What is it equal to? Provide the answer rounded to FIVE decimal places (e.g. 12.3456789 -> 12.34568).
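
As a reminder of what the metric measures, RMSE is the square root of the mean of the squared differences between targets and predictions. A minimal plain-Python sketch on made-up numbers (the helper `rmse` and the values are illustrative, not part of the assignment):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up targets and predictions, not taken from the dataset.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]
value = rmse(y_true, y_pred)
print(round(value, 5))
```

Because the errors are squared before averaging, a few articles with very large share counts can dominate the value.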

In [11]:
xgb = XGBRegressor(objective='reg:squarederror', n_estimators=200, learning_rate=0.01, max_depth=5, random_state=13)

In [12]:
xgb.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.01, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=5, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=200, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=13,
             reg_alpha=0, reg_lambda=1, ...)

In [13]:
round(np.sqrt(mean_squared_error(y_val, xgb.predict(X_val))), 5)

10329.20768

## 6

In task 5, we decided to build 200 trees in our model. However, it is hard to tell whether that was a good decision. Maybe it is too many? Maybe 150 is a better number? Or 100? Or is 50 enough?

During the training process, it is possible to stop constructing the ensemble if we see that the validation error does not decrease anymore. Using the same XGBoost model, call the `fit` function (to train it) with `eval_set=[(X_val, y_val)]` (to evaluate the boosting model after building each new tree) and `early_stopping_rounds=50` (and other default parameter values). The `early_stopping_rounds` setting means that if the validation metric does not improve (for RMSE, does not decrease) for 50 consecutive iterations, training stops.

**q6:** Calculate RMSE on the validation set. What is it equal to? Provide the answer rounded to FIVE decimal places (e.g. 12.3456789 -> 12.34568).
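
The stopping rule itself is simple to state in code. The following is a simplified, library-independent sketch of the idea, not XGBoost's actual implementation (details such as tie handling may differ); the function name `early_stopping_trace` and the scores are hypothetical.

```python
def early_stopping_trace(val_scores, patience):
    """Return (best_iteration, last_iteration_run) for a metric to minimise.

    Training stops once `patience` consecutive iterations have passed
    without a new best validation score.
    """
    best_score = float("inf")
    best_iteration = -1
    for iteration, score in enumerate(val_scores):
        if score < best_score:
            best_score, best_iteration = score, iteration
        elif iteration - best_iteration >= patience:
            return best_iteration, iteration
    return best_iteration, len(val_scores) - 1


# Made-up validation RMSE values, one per boosting iteration.
scores = [10.0, 9.0, 8.5, 8.6, 8.7, 8.8, 8.4]
best_it, last_it = early_stopping_trace(scores, patience=3)
print(best_it, last_it)
```

With a patience of 3, the loop gives up after three iterations without improvement and never reaches the later, lower score. This is the trade-off behind choosing a larger `early_stopping_rounds` when the learning rate is small.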

In [14]:
xgb = XGBRegressor(objective='reg:squarederror', n_estimators=200, learning_rate=0.01, max_depth=5, random_state=13)
xgb.fit(X_train, y_train, early_stopping_rounds=50, eval_set=[(X_val, y_val)])

[0]	validation_0-rmse:9132.57896
[1]	validation_0-rmse:9118.46796
[2]	validation_0-rmse:9104.95068
[3]	validation_0-rmse:9091.79293
[4]	validation_0-rmse:9079.19770
[5]	validation_0-rmse:9067.49453
[6]	validation_0-rmse:9056.05565
[7]	validation_0-rmse:9045.18382
[8]	validation_0-rmse:9035.23396
[9]	validation_0-rmse:9025.48494
[10]	validation_0-rmse:9016.13997
[11]	validation_0-rmse:9007.63286
[12]	validation_0-rmse:8999.20179
[13]	validation_0-rmse:8991.06644
[14]	validation_0-rmse:8983.98021
[15]	validation_0-rmse:8976.73861
[16]	validation_0-rmse:8970.26101
[17]	validation_0-rmse:8963.87773
[18]	validation_0-rmse:8958.61072
[19]	validation_0-rmse:8953.27067
[20]	validation_0-rmse:8948.50669
[21]	validation_0-rmse:8944.50485
[22]	validation_0-rmse:8938.09517
[23]	validation_0-rmse:8934.24954
[24]	validation_0-rmse:8931.12920
[25]	validation_0-rmse:8925.65214
[26]	validation_0-rmse:8923.34015
[27]	validation_0-rmse:8918.59897
[28]	validation_0-rmse:8916.47308
[29]	validation_0-rmse:8912.38908
[30]	validation_0-rmse:8908.04187
[31]	validation_0-rmse:8907.26784
[32]	validation_0-rmse:8906.41027
[33]	validation_0-rmse:8903.25394
[34]	validation_0-rmse:8900.16386
[35]	validation_0-

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.01, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=5, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=200, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=13,
             reg_alpha=0, reg_lambda=1, ...)

In [15]:
round(np.sqrt(mean_squared_error(y_val, xgb.predict(X_val))), 5)

8890.23464

## 7

Notes on parameter tuning: https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html

Here, we tuned some parameters of the algorithm. Take `XGBRegressor` with the following parameters:

* `objective='reg:squarederror'`
* `n_estimators=5000`
* `learning_rate=0.001`
* `max_depth=4`
* `gamma=1`
* `subsample=0.5`
* `random_state=13`
* all other default parameter values

Train it in the same manner as in task 6, but with `early_stopping_rounds=500`.

**q7:** Calculate RMSE on the validation set. What is it equal to? Provide the answer rounded to FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Notice the speed of the algorithm.

In [16]:
xgb = XGBRegressor(
    objective='reg:squarederror',
    n_estimators=5000,
    learning_rate=0.001,
    max_depth=4,
    gamma=1,
    subsample=0.5,
    random_state=13
)
xgb.fit(X_train, y_train, early_stopping_rounds=500, eval_set=[(X_val, y_val)])

[0]	validation_0-rmse:9145.78842
[1]	validation_0-rmse:9144.42063
[2]	validation_0-rmse:9143.00983
[3]	validation_0-rmse:9141.59376
[4]	validation_0-rmse:9140.17952
[5]	validation_0-rmse:9138.77550
[6]	validation_0-rmse:9137.15972
[7]	validation_0-rmse:9135.65232
[8]	validation_0-rmse:9134.11349
[9]	validation_0-rmse:9132.71948
[10]	validation_0-rmse:9131.40694
[11]	validation_0-rmse:9129.92722
[12]	validation_0-rmse:9128.53498
[13]	validation_0-rmse:9127.15092
[14]	validation_0-rmse:9125.81464
[15]	validation_0-rmse:9124.47845
[16]	validation_0-rmse:9122.91544
[17]	validation_0-rmse:9121.50081
[18]	validation_0-rmse:9120.15113
[19]	validation_0-rmse:9118.70679
[20]	validation_0-rmse:9117.30311
[21]	validation_0-rmse:9115.88248
[22]	validation_0-rmse:9114.39763
[23]	validation_0-rmse:9112.98066
[24]	validation_0-rmse:9111.45363
[25]	validation_0-rmse:9110.14621
[26]	validation_0-rmse:9108.71929
[27]	validation_0-rmse:9107.29806
[28]	validation_0-rmse:9105.95208
[29]	validation_0-rmse:9104.58900
[30]	validation_0-rmse:9103.23103
[31]	validation_0-rmse:9101.82554
[32]	validation_0-rmse:9100.43117
[33]	validation_0-rmse:9099.14855
[34]	validation_0-rmse:9097.74098
[35]	validation_0-rmse:9096.33481
[36]	validation_0-rmse:9095.06801
[37]	validation_0-rmse:9093.63727
[38]	validation

[246]	validation_0-rmse:8872.58010
[247]	validation_0-rmse:8871.71013
[248]	validation_0-rmse:8871.08177
[249]	validation_0-rmse:8870.36210
[250]	validation_0-rmse:8869.47498
[251]	validation_0-rmse:8868.80058
[252]	validation_0-rmse:8867.94602
[253]	validation_0-rmse:8867.27876
[254]	validation_0-rmse:8866.61451
[255]	validation_0-rmse:8866.09402
[256]	validation_0-rmse:8865.22128
[257]	validation_0-rmse:8864.37969
[258]	validation_0-rmse:8863.51544
[259]	validation_0-rmse:8862.67123
[260]	validation_0-rmse:8861.73516
[261]	validation_0-rmse:8861.06955
[262]	validation_0-rmse:8860.45578
[263]	validation_0-rmse:8859.58217
[264]	validation_0-rmse:8858.78559
[265]	validation_0-rmse:8858.09208
[266]	validation_0-rmse:8857.15210
[267]	validation_0-rmse:8856.48894
[268]	validation_0-rmse:8855.86106
[269]	validation_0-rmse:8855.17199
[270]	validation_0-rmse:8854.30886
[271]	validation_0-rmse:8853.62754
[272]	validation_0-rmse:8853.00655
[273]	validation_0-rmse:8852.34162
[274]	validation_0-r

[480]	validation_0-rmse:8742.78938
[481]	validation_0-rmse:8742.70883
[482]	validation_0-rmse:8742.16194
[483]	validation_0-rmse:8741.57363
[484]	validation_0-rmse:8741.40829
[485]	validation_0-rmse:8741.06796
[486]	validation_0-rmse:8740.76855
[487]	validation_0-rmse:8740.49162
[488]	validation_0-rmse:8740.18496
[489]	validation_0-rmse:8740.03953
[490]	validation_0-rmse:8739.78528
[491]	validation_0-rmse:8739.34260
[492]	validation_0-rmse:8739.29895
[493]	validation_0-rmse:8739.09457
[494]	validation_0-rmse:8738.56560
[495]	validation_0-rmse:8738.39737
[496]	validation_0-rmse:8738.15312
[497]	validation_0-rmse:8737.96486
[498]	validation_0-rmse:8737.49652
[499]	validation_0-rmse:8737.33942
[500]	validation_0-rmse:8736.92123
[501]	validation_0-rmse:8736.71213
[502]	validation_0-rmse:8736.59608
[503]	validation_0-rmse:8736.13088
[504]	validation_0-rmse:8735.55918
[505]	validation_0-rmse:8735.43507
[506]	validation_0-rmse:8735.27973
[507]	validation_0-rmse:8735.10016
[508]	validation_0-r

[715]	validation_0-rmse:8696.70406
[716]	validation_0-rmse:8696.39230
[717]	validation_0-rmse:8696.07345
[718]	validation_0-rmse:8695.96114
[719]	validation_0-rmse:8696.04177
[720]	validation_0-rmse:8696.34223
[721]	validation_0-rmse:8696.39895
[722]	validation_0-rmse:8696.07263
[723]	validation_0-rmse:8696.13510
[724]	validation_0-rmse:8695.91593
[725]	validation_0-rmse:8695.59127
[726]	validation_0-rmse:8695.88948
[727]	validation_0-rmse:8695.94765
[728]	validation_0-rmse:8695.65654
[729]	validation_0-rmse:8695.95527
[730]	validation_0-rmse:8696.06360
[731]	validation_0-rmse:8695.89946
[732]	validation_0-rmse:8695.57539
[733]	validation_0-rmse:8695.69627
[734]	validation_0-rmse:8695.90891
[735]	validation_0-rmse:8695.59775
[736]	validation_0-rmse:8695.25503
[737]	validation_0-rmse:8695.10824
[738]	validation_0-rmse:8695.21606
[739]	validation_0-rmse:8695.46766
[740]	validation_0-rmse:8695.72335
[741]	validation_0-rmse:8695.91663
[742]	validation_0-rmse:8695.66815
[743]	validation_0-r

[950]	validation_0-rmse:8699.88244
[951]	validation_0-rmse:8699.66919
[952]	validation_0-rmse:8699.44052
[953]	validation_0-rmse:8699.71014
[954]	validation_0-rmse:8699.57341
[955]	validation_0-rmse:8699.82729
[956]	validation_0-rmse:8700.15181
[957]	validation_0-rmse:8700.15309
[958]	validation_0-rmse:8699.98845
[959]	validation_0-rmse:8699.73887
[960]	validation_0-rmse:8699.62170
[961]	validation_0-rmse:8700.24433
[962]	validation_0-rmse:8700.06568
[963]	validation_0-rmse:8700.47493
[964]	validation_0-rmse:8700.56701
[965]	validation_0-rmse:8700.50641
[966]	validation_0-rmse:8700.32579
[967]	validation_0-rmse:8700.33583
[968]	validation_0-rmse:8700.60496
[969]	validation_0-rmse:8701.11288
[970]	validation_0-rmse:8700.89323
[971]	validation_0-rmse:8700.67038
[972]	validation_0-rmse:8700.54519
[973]	validation_0-rmse:8700.95332
[974]	validation_0-rmse:8701.22283
[975]	validation_0-rmse:8701.64937
[976]	validation_0-rmse:8701.45940
[977]	validation_0-rmse:8701.83558
[978]	validation_0-r

[1179]	validation_0-rmse:8750.43135
[1180]	validation_0-rmse:8750.52910
[1181]	validation_0-rmse:8751.00118
[1182]	validation_0-rmse:8750.91763
[1183]	validation_0-rmse:8750.78772
[1184]	validation_0-rmse:8750.66262
[1185]	validation_0-rmse:8750.54528
[1186]	validation_0-rmse:8751.28853
[1187]	validation_0-rmse:8751.82557
[1188]	validation_0-rmse:8752.23776
[1189]	validation_0-rmse:8751.99008
[1190]	validation_0-rmse:8752.03957
[1191]	validation_0-rmse:8752.19066
[1192]	validation_0-rmse:8752.23019
[1193]	validation_0-rmse:8752.18772
[1194]	validation_0-rmse:8752.04799
[1195]	validation_0-rmse:8751.93196
[1196]	validation_0-rmse:8752.34399
[1197]	validation_0-rmse:8753.20809
[1198]	validation_0-rmse:8753.10948
[1199]	validation_0-rmse:8753.65014
[1200]	validation_0-rmse:8753.56454
[1201]	validation_0-rmse:8753.93350
[1202]	validation_0-rmse:8754.36318
[1203]	validation_0-rmse:8754.65477
[1204]	validation_0-rmse:8755.48216
[1205]	validation_0-rmse:8755.35400
[1206]	validation_0-rmse:875

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=1, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.001, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=4, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=5000,
             n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=13,
             reg_alpha=0, reg_lambda=1, ...)

In [17]:
round(np.sqrt(mean_squared_error(y_val, xgb.predict(X_val))), 5)

8692.77417

## 8

Calculate feature importances according to the model trained in task 7.

**q8:** What is the name of the most important feature? Provide it as the answer. Do you understand why it might be important for the model?

Notice that by default, `XGBRegressor` calculates feature importance based on gain (see the `importance_type` parameter).
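
Looking only at the single top feature hides how close the runners-up are. A small sketch of ranking features by importance, using made-up names and scores (the helper `rank_features` and its inputs are hypothetical, not values from the fitted model):

```python
def rank_features(names, importances, top_k=3):
    """Return the top_k (name, importance) pairs, most important first."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Hypothetical names and scores, not the values from the fitted model.
names = ["feat_a", "feat_b", "feat_c", "feat_d"]
importances = [0.10, 0.55, 0.05, 0.30]
top = rank_features(names, importances, top_k=2)
print(top)
```

With a fitted model, the inputs would be the model's feature names and its `feature_importances_` array.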

In [18]:
xgb.feature_names_in_[np.argmax(xgb.feature_importances_)]

'data_channel_is_bus'

## 9

Let's move to LightGBM. We will work with `LGBMRegressor`.

LGBMRegressor parameters: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html#lightgbm.LGBMRegressor

Look through the main list of LightGBM parameters so that you know which parameters the library offers: https://lightgbm.readthedocs.io/en/latest/Parameters.html

Take `LGBMRegressor` with the following parameters, similar to the previous `XGBoost` model:

* `objective='regression'`
* `n_estimators=200`
* `learning_rate=0.01`
* `max_depth=5`
* `random_state=13`
* other default parameter values

Train it on the training data with `eval_set=[(X_val, y_val)]`, `eval_metric='rmse'`, `early_stopping_rounds=50` and all other default parameter values. 

**q9:** Calculate RMSE on the validation set. What is it equal to? Provide the answer rounded to FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Notice the speed of the algorithm and compare it to the speed of the XGBoost model.

In [19]:
lgb = LGBMRegressor(
    objective='regression',
    n_estimators=200,
    learning_rate=0.01,
    max_depth=5,
    random_state=13
)

lgb.fit(X_train, y_train, eval_metric='rmse', early_stopping_rounds=50, eval_set=[(X_val, y_val)])



[1]	valid_0's rmse: 8525.53	valid_0's l2: 7.26847e+07
[2]	valid_0's rmse: 8523.14	valid_0's l2: 7.26439e+07
[3]	valid_0's rmse: 8520.95	valid_0's l2: 7.26065e+07
[4]	valid_0's rmse: 8518.85	valid_0's l2: 7.25708e+07
[5]	valid_0's rmse: 8516.58	valid_0's l2: 7.25321e+07
[6]	valid_0's rmse: 8514.73	valid_0's l2: 7.25007e+07
[7]	valid_0's rmse: 8512.12	valid_0's l2: 7.24563e+07
[8]	valid_0's rmse: 8510.48	valid_0's l2: 7.24282e+07
[9]	valid_0's rmse: 8508.06	valid_0's l2: 7.23871e+07
[10]	valid_0's rmse: 8506.35	valid_0's l2: 7.2358e+07
[11]	valid_0's rmse: 8504.05	valid_0's l2: 7.23188e+07
[12]	valid_0's rmse: 8502.03	valid_0's l2: 7.22844e+07
[13]	valid_0's rmse: 8499.83	valid_0's l2: 7.22471e+07
[14]	valid_0's rmse: 8497.97	valid_0's l2: 7.22156e+07
[15]	valid_0's rmse: 8496.59	valid_0's l2: 7.2192e+07
[16]	valid_0's rmse: 8494.79	valid_0's l2: 7.21615e+07
[17]	valid_0's rmse: 8493.41	valid_0's l2: 7.21379e+07
[18]	valid_0's rmse: 8491.48	valid_0's l2: 7.21053e+07
[19]	valid_0's rmse: 

LGBMRegressor(learning_rate=0.01, max_depth=5, n_estimators=200,
              objective='regression', random_state=13)

In [20]:
rmse_from_estimator(lgb, X_val, y_val)

8451.2859

## 10

Notes on parameter tuning: https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html

Here, we tuned some parameters of the algorithm. Take `LGBMRegressor` with the following parameters:

* `objective='regression'`
* `n_estimators=5000`
* `learning_rate=0.001`
* `max_depth=3`
* `lambda_l2=1.0`
* `boosting_type='goss'`
* `random_state=13`
* all other default parameter values

Train it in the same manner as in task 9, but with `early_stopping_rounds=500`.

**q10:** Calculate RMSE on the validation set. What is it equal to? Provide the answer rounded to FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [80]:
lgb = LGBMRegressor(
    objective='regression',
    n_estimators=5000,
    learning_rate=0.001,
    max_depth=3,
    lambda_l2=1.0,
    boosting_type='goss',
    random_state=13,
    num_threads=24
)

lgb.fit(
    X_train, y_train, 
    eval_metric='rmse', 
    early_stopping_rounds=500, eval_set=[(X_val, y_val)]
)



[1]	valid_0's rmse: 8527.68	valid_0's l2: 7.27214e+07
[2]	valid_0's rmse: 8527.44	valid_0's l2: 7.27172e+07
[3]	valid_0's rmse: 8527.2	valid_0's l2: 7.27131e+07
[4]	valid_0's rmse: 8526.96	valid_0's l2: 7.2709e+07
[5]	valid_0's rmse: 8526.71	valid_0's l2: 7.27049e+07
[6]	valid_0's rmse: 8526.47	valid_0's l2: 7.27008e+07
[7]	valid_0's rmse: 8526.23	valid_0's l2: 7.26967e+07
[8]	valid_0's rmse: 8525.98	valid_0's l2: 7.26924e+07
[9]	valid_0's rmse: 8525.74	valid_0's l2: 7.26883e+07
[10]	valid_0's rmse: 8525.49	valid_0's l2: 7.26841e+07
[11]	valid_0's rmse: 8525.26	valid_0's l2: 7.268e+07
[12]	valid_0's rmse: 8525.01	valid_0's l2: 7.26758e+07
[13]	valid_0's rmse: 8524.77	valid_0's l2: 7.26717e+07
[14]	valid_0's rmse: 8524.52	valid_0's l2: 7.26675e+07
[15]	valid_0's rmse: 8524.29	valid_0's l2: 7.26635e+07
[16]	valid_0's rmse: 8524.04	valid_0's l2: 7.26593e+07
[17]	valid_0's rmse: 8523.81	valid_0's l2: 7.26553e+07
[18]	valid_0's rmse: 8523.56	valid_0's l2: 7.26511e+07
[19]	valid_0's rmse: 85

[200]	valid_0's rmse: 8485.84	valid_0's l2: 7.20095e+07
[201]	valid_0's rmse: 8485.66	valid_0's l2: 7.20065e+07
[202]	valid_0's rmse: 8485.48	valid_0's l2: 7.20034e+07
[203]	valid_0's rmse: 8485.3	valid_0's l2: 7.20003e+07
[204]	valid_0's rmse: 8485.12	valid_0's l2: 7.19973e+07
[205]	valid_0's rmse: 8484.94	valid_0's l2: 7.19942e+07
[206]	valid_0's rmse: 8484.76	valid_0's l2: 7.19911e+07
[207]	valid_0's rmse: 8484.58	valid_0's l2: 7.19881e+07
[208]	valid_0's rmse: 8484.43	valid_0's l2: 7.19855e+07
[209]	valid_0's rmse: 8484.25	valid_0's l2: 7.19825e+07
[210]	valid_0's rmse: 8484.1	valid_0's l2: 7.19799e+07
[211]	valid_0's rmse: 8483.92	valid_0's l2: 7.19769e+07
[212]	valid_0's rmse: 8483.78	valid_0's l2: 7.19746e+07
[213]	valid_0's rmse: 8483.65	valid_0's l2: 7.19723e+07
[214]	valid_0's rmse: 8483.5	valid_0's l2: 7.19697e+07
[215]	valid_0's rmse: 8483.36	valid_0's l2: 7.19674e+07
[216]	valid_0's rmse: 8483.18	valid_0's l2: 7.19644e+07
[217]	valid_0's rmse: 8483.05	valid_0's l2: 7.19621

[471]	valid_0's rmse: 8456.01	valid_0's l2: 7.15041e+07
[472]	valid_0's rmse: 8455.93	valid_0's l2: 7.15028e+07
[473]	valid_0's rmse: 8455.86	valid_0's l2: 7.15016e+07
[474]	valid_0's rmse: 8455.79	valid_0's l2: 7.15004e+07
[475]	valid_0's rmse: 8455.72	valid_0's l2: 7.14992e+07
[476]	valid_0's rmse: 8455.64	valid_0's l2: 7.14979e+07
[477]	valid_0's rmse: 8455.57	valid_0's l2: 7.14966e+07
[478]	valid_0's rmse: 8455.5	valid_0's l2: 7.14954e+07
[479]	valid_0's rmse: 8455.42	valid_0's l2: 7.14941e+07
[480]	valid_0's rmse: 8455.35	valid_0's l2: 7.14929e+07
[481]	valid_0's rmse: 8455.28	valid_0's l2: 7.14918e+07
[482]	valid_0's rmse: 8455.2	valid_0's l2: 7.14904e+07
[483]	valid_0's rmse: 8455.1	valid_0's l2: 7.14888e+07
[484]	valid_0's rmse: 8455.03	valid_0's l2: 7.14875e+07
[485]	valid_0's rmse: 8454.95	valid_0's l2: 7.14862e+07
[486]	valid_0's rmse: 8454.89	valid_0's l2: 7.14851e+07
[487]	valid_0's rmse: 8454.82	valid_0's l2: 7.14839e+07
[488]	valid_0's rmse: 8454.74	valid_0's l2: 7.14826

[755]	valid_0's rmse: 8438.65	valid_0's l2: 7.12108e+07
[756]	valid_0's rmse: 8438.63	valid_0's l2: 7.12105e+07
[757]	valid_0's rmse: 8438.56	valid_0's l2: 7.12093e+07
[758]	valid_0's rmse: 8438.56	valid_0's l2: 7.12094e+07
[759]	valid_0's rmse: 8438.5	valid_0's l2: 7.12082e+07
[760]	valid_0's rmse: 8438.48	valid_0's l2: 7.12079e+07
[761]	valid_0's rmse: 8438.44	valid_0's l2: 7.12073e+07
[762]	valid_0's rmse: 8438.37	valid_0's l2: 7.12061e+07
[763]	valid_0's rmse: 8438.37	valid_0's l2: 7.12062e+07
[764]	valid_0's rmse: 8438.37	valid_0's l2: 7.1206e+07
[765]	valid_0's rmse: 8438.33	valid_0's l2: 7.12055e+07
[766]	valid_0's rmse: 8438.27	valid_0's l2: 7.12043e+07
[767]	valid_0's rmse: 8438.24	valid_0's l2: 7.12038e+07
[768]	valid_0's rmse: 8438.24	valid_0's l2: 7.12039e+07
[769]	valid_0's rmse: 8438.22	valid_0's l2: 7.12036e+07
[770]	valid_0's rmse: 8438.2	valid_0's l2: 7.12033e+07
[771]	valid_0's rmse: 8438.14	valid_0's l2: 7.12022e+07
[772]	valid_0's rmse: 8438.13	valid_0's l2: 7.1202e

[1034]	valid_0's rmse: 8431.76	valid_0's l2: 7.10947e+07
[1035]	valid_0's rmse: 8431.71	valid_0's l2: 7.10938e+07
[1036]	valid_0's rmse: 8431.67	valid_0's l2: 7.10931e+07
[1037]	valid_0's rmse: 8431.62	valid_0's l2: 7.10923e+07
[1038]	valid_0's rmse: 8431.59	valid_0's l2: 7.10917e+07
[1039]	valid_0's rmse: 8431.54	valid_0's l2: 7.10908e+07
[1040]	valid_0's rmse: 8431.47	valid_0's l2: 7.10897e+07
[1041]	valid_0's rmse: 8431.45	valid_0's l2: 7.10893e+07
[1042]	valid_0's rmse: 8431.42	valid_0's l2: 7.10888e+07
[1043]	valid_0's rmse: 8431.4	valid_0's l2: 7.10885e+07
[1044]	valid_0's rmse: 8431.36	valid_0's l2: 7.10879e+07
[1045]	valid_0's rmse: 8431.32	valid_0's l2: 7.10872e+07
[1046]	valid_0's rmse: 8431.36	valid_0's l2: 7.10878e+07
[1047]	valid_0's rmse: 8431.32	valid_0's l2: 7.10871e+07
[1048]	valid_0's rmse: 8431.29	valid_0's l2: 7.10867e+07
[1049]	valid_0's rmse: 8431.24	valid_0's l2: 7.10859e+07
[1050]	valid_0's rmse: 8431.2	valid_0's l2: 7.10851e+07
[1051]	valid_0's rmse: 8431.18	va

[1290]	valid_0's rmse: 8424.57	valid_0's l2: 7.09734e+07
[1291]	valid_0's rmse: 8424.53	valid_0's l2: 7.09727e+07
[1292]	valid_0's rmse: 8424.54	valid_0's l2: 7.09729e+07
[1293]	valid_0's rmse: 8424.54	valid_0's l2: 7.09728e+07
[1294]	valid_0's rmse: 8424.56	valid_0's l2: 7.09732e+07
[1295]	valid_0's rmse: 8424.55	valid_0's l2: 7.09731e+07
[1296]	valid_0's rmse: 8424.56	valid_0's l2: 7.09732e+07
[1297]	valid_0's rmse: 8424.54	valid_0's l2: 7.09728e+07
[1298]	valid_0's rmse: 8424.52	valid_0's l2: 7.09726e+07
[1299]	valid_0's rmse: 8424.5	valid_0's l2: 7.09722e+07
[1300]	valid_0's rmse: 8424.53	valid_0's l2: 7.09726e+07
[1301]	valid_0's rmse: 8424.47	valid_0's l2: 7.09716e+07
[1302]	valid_0's rmse: 8424.48	valid_0's l2: 7.09718e+07
[1303]	valid_0's rmse: 8424.47	valid_0's l2: 7.09716e+07
[1304]	valid_0's rmse: 8424.44	valid_0's l2: 7.09711e+07
[1305]	valid_0's rmse: 8424.4	valid_0's l2: 7.09705e+07
[1306]	valid_0's rmse: 8424.31	valid_0's l2: 7.0969e+07
[1307]	valid_0's rmse: 8424.28	val

[1544]	valid_0's rmse: 8421.9	valid_0's l2: 7.09284e+07
[1545]	valid_0's rmse: 8421.9	valid_0's l2: 7.09284e+07
[1546]	valid_0's rmse: 8421.89	valid_0's l2: 7.09282e+07
[1547]	valid_0's rmse: 8421.89	valid_0's l2: 7.09283e+07
[1548]	valid_0's rmse: 8421.91	valid_0's l2: 7.09285e+07
[1549]	valid_0's rmse: 8421.92	valid_0's l2: 7.09288e+07
[1550]	valid_0's rmse: 8421.91	valid_0's l2: 7.09285e+07
[1551]	valid_0's rmse: 8421.93	valid_0's l2: 7.0929e+07
[1552]	valid_0's rmse: 8421.94	valid_0's l2: 7.09291e+07
[1553]	valid_0's rmse: 8421.95	valid_0's l2: 7.09293e+07
[1554]	valid_0's rmse: 8421.93	valid_0's l2: 7.09288e+07
[1555]	valid_0's rmse: 8421.95	valid_0's l2: 7.09292e+07
[1556]	valid_0's rmse: 8421.97	valid_0's l2: 7.09296e+07
[1557]	valid_0's rmse: 8421.99	valid_0's l2: 7.09298e+07
[1558]	valid_0's rmse: 8421.99	valid_0's l2: 7.09298e+07
[1559]	valid_0's rmse: 8421.98	valid_0's l2: 7.09297e+07
[1560]	valid_0's rmse: 8421.95	valid_0's l2: 7.09293e+07
[1561]	valid_0's rmse: 8421.98	val

[1806]	valid_0's rmse: 8422.42	valid_0's l2: 7.09371e+07
[1807]	valid_0's rmse: 8422.4	valid_0's l2: 7.09368e+07
[1808]	valid_0's rmse: 8422.42	valid_0's l2: 7.09372e+07
[1809]	valid_0's rmse: 8422.38	valid_0's l2: 7.09364e+07
[1810]	valid_0's rmse: 8422.38	valid_0's l2: 7.09366e+07
[1811]	valid_0's rmse: 8422.43	valid_0's l2: 7.09373e+07
[1812]	valid_0's rmse: 8422.45	valid_0's l2: 7.09376e+07
[1813]	valid_0's rmse: 8422.4	valid_0's l2: 7.09368e+07
[1814]	valid_0's rmse: 8422.39	valid_0's l2: 7.09367e+07
[1815]	valid_0's rmse: 8422.45	valid_0's l2: 7.09377e+07
[1816]	valid_0's rmse: 8422.48	valid_0's l2: 7.09381e+07
[1817]	valid_0's rmse: 8422.52	valid_0's l2: 7.09389e+07
[1818]	valid_0's rmse: 8422.57	valid_0's l2: 7.09397e+07
[1819]	valid_0's rmse: 8422.57	valid_0's l2: 7.09397e+07
[1820]	valid_0's rmse: 8422.57	valid_0's l2: 7.09397e+07
[1821]	valid_0's rmse: 8422.53	valid_0's l2: 7.09391e+07
[1822]	valid_0's rmse: 8422.58	valid_0's l2: 7.09398e+07
[1823]	valid_0's rmse: 8422.58	va

[2065]	valid_0's rmse: 8426.38	valid_0's l2: 7.10039e+07
[2066]	valid_0's rmse: 8426.42	valid_0's l2: 7.10046e+07
[2067]	valid_0's rmse: 8426.47	valid_0's l2: 7.10053e+07
[2068]	valid_0's rmse: 8426.49	valid_0's l2: 7.10057e+07
[2069]	valid_0's rmse: 8426.51	valid_0's l2: 7.10061e+07
[2070]	valid_0's rmse: 8426.53	valid_0's l2: 7.10064e+07
[2071]	valid_0's rmse: 8426.53	valid_0's l2: 7.10065e+07
[2072]	valid_0's rmse: 8426.52	valid_0's l2: 7.10063e+07
[2073]	valid_0's rmse: 8426.53	valid_0's l2: 7.10064e+07
[2074]	valid_0's rmse: 8426.54	valid_0's l2: 7.10066e+07
[2075]	valid_0's rmse: 8426.53	valid_0's l2: 7.10065e+07
[2076]	valid_0's rmse: 8426.58	valid_0's l2: 7.10072e+07
[2077]	valid_0's rmse: 8426.64	valid_0's l2: 7.10082e+07
[2078]	valid_0's rmse: 8426.69	valid_0's l2: 7.10091e+07
[2079]	valid_0's rmse: 8426.67	valid_0's l2: 7.10088e+07
[2080]	valid_0's rmse: 8426.69	valid_0's l2: 7.10092e+07
[2081]	valid_0's rmse: 8426.69	valid_0's l2: 7.10092e+07
[2082]	valid_0's rmse: 8426.7	v

LGBMRegressor(boosting_type='goss', lambda_l2=1.0, learning_rate=0.001,
              max_depth=3, n_estimators=5000, num_threads=24,
              objective='regression', random_state=13)

In [81]:
rmse_from_estimator(lgb, X_val, y_val)

8421.39852

## 11

Calculate feature importances according to the model trained in task 10.

**q11:** What is the name of the most important feature? Provide it as the answer. 

Do you understand why it might be important for the model?

Note that by default `LGBMRegressor` computes a feature's importance as the number of times the feature is used for a split in the model (see the `importance_type` parameter).

In [82]:
lgb.feature_name_[np.argmax(lgb.feature_importances_)]

'self_reference_min_shares'
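
The difference between counting splits and summing gains can be illustrated with a small, library-free sketch. LightGBM's `importance_type` accepts `'split'` and `'gain'`; the split records and feature names below are hypothetical and only show why the two rankings may disagree.

```python
from collections import Counter, defaultdict

# Hypothetical split records: (feature used in the split, loss reduction of that split).
splits = [
    ("feature_a", 5.0),
    ("feature_b", 40.0),
    ("feature_a", 3.0),
    ("feature_a", 2.0),
    ("feature_c", 1.0),
]

# 'split'-style importance: how many times each feature is used.
split_importance = Counter(name for name, _ in splits)

# 'gain'-style importance: total loss reduction contributed by each feature.
gain_importance = defaultdict(float)
for name, gain in splits:
    gain_importance[name] += gain

top_by_split = max(split_importance, key=split_importance.get)
top_by_gain = max(gain_importance, key=gain_importance.get)
print(top_by_split, top_by_gain)
```

In this toy case the most frequently used feature is not the one with the largest total gain, so it can be worth checking both views before interpreting the answer to q11.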

## 12

Since some features are not important for the model, we can drop them and try to build a better model that does not consider them at all.

Obtain new train and validation sets without the features whose LightGBM importance is less than 10 (the importances were computed in task 11). Train the same model as in task 10 on the new train set in the same manner.

**q12:** Calculate RMSE on the new validation set. What is it equal to? Provide the answer rounded to FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Note that the new versions of the train and validation sets are used only in this task and in blending.

In [83]:
allow_feats = np.array(lgb.feature_name_)[lgb.feature_importances_ >= 10]
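
The boolean-mask selection in the cell above can be checked on a toy example. The names and importance values here are hypothetical. Note that the comparison is `>=`, so a feature with importance exactly 10 is kept, which matches "less than 10" being dropped.

```python
import numpy as np

# Hypothetical feature names and split-count importances.
feature_names = np.array(["f0", "f1", "f2", "f3"])
importances = np.array([120, 9, 10, 0])

# Keep features whose importance is at least 10; drop the rest.
keep_mask = importances >= 10
kept = feature_names[keep_mask]
dropped = feature_names[~keep_mask]

print("kept:", kept.tolist())
print("dropped:", dropped.tolist())
```

Printing the dropped names on the real data is a quick way to see which columns the reduced model no longer uses.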

In [84]:
lgb = LGBMRegressor(
    objective='regression',
    n_estimators=5000,
    learning_rate=0.001,
    max_depth=3,
    lambda_l2=1.0,
    boosting_type='goss',
    random_state=13
)

lgb.fit(
    X_train[allow_feats],
    y_train,
    eval_metric='rmse',
    early_stopping_rounds=500,
    eval_set=[(X_val[allow_feats], y_val)]
)



[1]	valid_0's rmse: 8527.68	valid_0's l2: 7.27214e+07
[2]	valid_0's rmse: 8527.44	valid_0's l2: 7.27172e+07
[3]	valid_0's rmse: 8527.2	valid_0's l2: 7.27131e+07
[4]	valid_0's rmse: 8526.96	valid_0's l2: 7.2709e+07
[5]	valid_0's rmse: 8526.71	valid_0's l2: 7.27049e+07
[6]	valid_0's rmse: 8526.47	valid_0's l2: 7.27008e+07
[7]	valid_0's rmse: 8526.23	valid_0's l2: 7.26967e+07
[8]	valid_0's rmse: 8525.98	valid_0's l2: 7.26924e+07
[9]	valid_0's rmse: 8525.74	valid_0's l2: 7.26883e+07
[10]	valid_0's rmse: 8525.49	valid_0's l2: 7.26841e+07
[11]	valid_0's rmse: 8525.26	valid_0's l2: 7.268e+07
[12]	valid_0's rmse: 8525.01	valid_0's l2: 7.26758e+07
[13]	valid_0's rmse: 8524.77	valid_0's l2: 7.26717e+07
[14]	valid_0's rmse: 8524.52	valid_0's l2: 7.26675e+07
[15]	valid_0's rmse: 8524.29	valid_0's l2: 7.26635e+07
[16]	valid_0's rmse: 8524.04	valid_0's l2: 7.26593e+07
[17]	valid_0's rmse: 8523.81	valid_0's l2: 7.26553e+07
[18]	valid_0's rmse: 8523.56	valid_0's l2: 7.26511e+07
[19]	valid_0's rmse: 85

[205]	valid_0's rmse: 8484.94	valid_0's l2: 7.19942e+07
[206]	valid_0's rmse: 8484.76	valid_0's l2: 7.19911e+07
[207]	valid_0's rmse: 8484.58	valid_0's l2: 7.19881e+07
[208]	valid_0's rmse: 8484.43	valid_0's l2: 7.19855e+07
[209]	valid_0's rmse: 8484.25	valid_0's l2: 7.19825e+07
[210]	valid_0's rmse: 8484.1	valid_0's l2: 7.19799e+07
[211]	valid_0's rmse: 8483.92	valid_0's l2: 7.19769e+07
[212]	valid_0's rmse: 8483.78	valid_0's l2: 7.19746e+07
[213]	valid_0's rmse: 8483.65	valid_0's l2: 7.19723e+07
[214]	valid_0's rmse: 8483.5	valid_0's l2: 7.19697e+07
[215]	valid_0's rmse: 8483.36	valid_0's l2: 7.19674e+07
[216]	valid_0's rmse: 8483.18	valid_0's l2: 7.19644e+07
[217]	valid_0's rmse: 8483.05	valid_0's l2: 7.19621e+07
[218]	valid_0's rmse: 8482.87	valid_0's l2: 7.19591e+07
[219]	valid_0's rmse: 8482.74	valid_0's l2: 7.19568e+07
[220]	valid_0's rmse: 8482.6	valid_0's l2: 7.19544e+07
[221]	valid_0's rmse: 8482.46	valid_0's l2: 7.19522e+07
[222]	valid_0's rmse: 8482.33	valid_0's l2: 7.19499

[503]	valid_0's rmse: 8453.61	valid_0's l2: 7.14635e+07
[504]	valid_0's rmse: 8453.54	valid_0's l2: 7.14623e+07
[505]	valid_0's rmse: 8453.45	valid_0's l2: 7.14608e+07
[506]	valid_0's rmse: 8453.37	valid_0's l2: 7.14595e+07
[507]	valid_0's rmse: 8453.31	valid_0's l2: 7.14584e+07
[508]	valid_0's rmse: 8453.25	valid_0's l2: 7.14574e+07
[509]	valid_0's rmse: 8453.17	valid_0's l2: 7.14561e+07
[510]	valid_0's rmse: 8453.08	valid_0's l2: 7.14546e+07
[511]	valid_0's rmse: 8453.01	valid_0's l2: 7.14534e+07
[512]	valid_0's rmse: 8452.94	valid_0's l2: 7.14522e+07
[513]	valid_0's rmse: 8452.85	valid_0's l2: 7.14507e+07
[514]	valid_0's rmse: 8452.78	valid_0's l2: 7.14495e+07
[515]	valid_0's rmse: 8452.71	valid_0's l2: 7.14484e+07
[516]	valid_0's rmse: 8452.66	valid_0's l2: 7.14474e+07
[517]	valid_0's rmse: 8452.58	valid_0's l2: 7.14461e+07
[518]	valid_0's rmse: 8452.49	valid_0's l2: 7.14447e+07
[519]	valid_0's rmse: 8452.43	valid_0's l2: 7.14436e+07
[520]	valid_0's rmse: 8452.36	valid_0's l2: 7.14

[667]	valid_0's rmse: 8442.87	valid_0's l2: 7.1282e+07
[668]	valid_0's rmse: 8442.86	valid_0's l2: 7.12818e+07
[669]	valid_0's rmse: 8442.79	valid_0's l2: 7.12807e+07
[670]	valid_0's rmse: 8442.71	valid_0's l2: 7.12793e+07
[671]	valid_0's rmse: 8442.67	valid_0's l2: 7.12788e+07
[672]	valid_0's rmse: 8442.61	valid_0's l2: 7.12777e+07
[673]	valid_0's rmse: 8442.53	valid_0's l2: 7.12763e+07
[674]	valid_0's rmse: 8442.48	valid_0's l2: 7.12755e+07
[675]	valid_0's rmse: 8442.4	valid_0's l2: 7.12741e+07
[676]	valid_0's rmse: 8442.39	valid_0's l2: 7.12739e+07
[677]	valid_0's rmse: 8442.32	valid_0's l2: 7.12729e+07
[678]	valid_0's rmse: 8442.24	valid_0's l2: 7.12715e+07
[679]	valid_0's rmse: 8442.21	valid_0's l2: 7.12709e+07
[680]	valid_0's rmse: 8442.18	valid_0's l2: 7.12703e+07
[681]	valid_0's rmse: 8442.15	valid_0's l2: 7.12699e+07
[682]	valid_0's rmse: 8442.07	valid_0's l2: 7.12686e+07
[683]	valid_0's rmse: 8442.01	valid_0's l2: 7.12675e+07
[684]	valid_0's rmse: 8441.98	valid_0's l2: 7.1267

[815]	valid_0's rmse: 8436.84	valid_0's l2: 7.11802e+07
[816]	valid_0's rmse: 8436.79	valid_0's l2: 7.11794e+07
[817]	valid_0's rmse: 8436.76	valid_0's l2: 7.1179e+07
[818]	valid_0's rmse: 8436.76	valid_0's l2: 7.11789e+07
[819]	valid_0's rmse: 8436.75	valid_0's l2: 7.11788e+07
[820]	valid_0's rmse: 8436.71	valid_0's l2: 7.1178e+07
[821]	valid_0's rmse: 8436.64	valid_0's l2: 7.1177e+07
[822]	valid_0's rmse: 8436.65	valid_0's l2: 7.11771e+07
[823]	valid_0's rmse: 8436.61	valid_0's l2: 7.11763e+07
[824]	valid_0's rmse: 8436.6	valid_0's l2: 7.11763e+07
[825]	valid_0's rmse: 8436.54	valid_0's l2: 7.11752e+07
[826]	valid_0's rmse: 8436.54	valid_0's l2: 7.11751e+07
[827]	valid_0's rmse: 8436.55	valid_0's l2: 7.11753e+07
[828]	valid_0's rmse: 8436.5	valid_0's l2: 7.11745e+07
[829]	valid_0's rmse: 8436.49	valid_0's l2: 7.11743e+07
[830]	valid_0's rmse: 8436.43	valid_0's l2: 7.11733e+07
[831]	valid_0's rmse: 8436.44	valid_0's l2: 7.11735e+07
[832]	valid_0's rmse: 8436.39	valid_0's l2: 7.11727e+

[1103]	valid_0's rmse: 8429.78	valid_0's l2: 7.10612e+07
[1104]	valid_0's rmse: 8429.77	valid_0's l2: 7.10611e+07
[1105]	valid_0's rmse: 8429.64	valid_0's l2: 7.10588e+07
[1106]	valid_0's rmse: 8429.62	valid_0's l2: 7.10585e+07
[1107]	valid_0's rmse: 8429.59	valid_0's l2: 7.10581e+07
[1108]	valid_0's rmse: 8429.58	valid_0's l2: 7.10579e+07
[1109]	valid_0's rmse: 8429.53	valid_0's l2: 7.10569e+07
[1110]	valid_0's rmse: 8429.5	valid_0's l2: 7.10565e+07
[1111]	valid_0's rmse: 8429.47	valid_0's l2: 7.10559e+07
[1112]	valid_0's rmse: 8429.44	valid_0's l2: 7.10554e+07
[1113]	valid_0's rmse: 8429.46	valid_0's l2: 7.10558e+07
[1114]	valid_0's rmse: 8429.42	valid_0's l2: 7.10552e+07
[1115]	valid_0's rmse: 8429.41	valid_0's l2: 7.10549e+07
[1116]	valid_0's rmse: 8429.43	valid_0's l2: 7.10553e+07
[1117]	valid_0's rmse: 8429.35	valid_0's l2: 7.1054e+07
[1118]	valid_0's rmse: 8429.34	valid_0's l2: 7.10538e+07
[1119]	valid_0's rmse: 8429.25	valid_0's l2: 7.10522e+07
[1120]	valid_0's rmse: 8429.22	va

[1385]	valid_0's rmse: 8423.65	valid_0's l2: 7.09579e+07
[1386]	valid_0's rmse: 8423.63	valid_0's l2: 7.09576e+07
[1387]	valid_0's rmse: 8423.66	valid_0's l2: 7.0958e+07
[1388]	valid_0's rmse: 8423.68	valid_0's l2: 7.09584e+07
[1389]	valid_0's rmse: 8423.67	valid_0's l2: 7.09583e+07
[1390]	valid_0's rmse: 8423.65	valid_0's l2: 7.09579e+07
[1391]	valid_0's rmse: 8423.64	valid_0's l2: 7.09577e+07
[1392]	valid_0's rmse: 8423.65	valid_0's l2: 7.09579e+07
[1393]	valid_0's rmse: 8423.63	valid_0's l2: 7.09576e+07
[1394]	valid_0's rmse: 8423.63	valid_0's l2: 7.09576e+07
[1395]	valid_0's rmse: 8423.6	valid_0's l2: 7.09571e+07
[1396]	valid_0's rmse: 8423.58	valid_0's l2: 7.09568e+07
[1397]	valid_0's rmse: 8423.67	valid_0's l2: 7.09582e+07
[1398]	valid_0's rmse: 8423.69	valid_0's l2: 7.09585e+07
[1399]	valid_0's rmse: 8423.65	valid_0's l2: 7.09578e+07
[1400]	valid_0's rmse: 8423.65	valid_0's l2: 7.09579e+07
[1401]	valid_0's rmse: 8423.61	valid_0's l2: 7.09571e+07
[1402]	valid_0's rmse: 8423.58	va

[1668]	valid_0's rmse: 8422.92	valid_0's l2: 7.09456e+07
[1669]	valid_0's rmse: 8422.93	valid_0's l2: 7.09458e+07
[1670]	valid_0's rmse: 8422.95	valid_0's l2: 7.09461e+07
[1671]	valid_0's rmse: 8422.95	valid_0's l2: 7.09461e+07
[1672]	valid_0's rmse: 8422.91	valid_0's l2: 7.09454e+07
[1673]	valid_0's rmse: 8422.93	valid_0's l2: 7.09458e+07
[1674]	valid_0's rmse: 8422.95	valid_0's l2: 7.0946e+07
[1675]	valid_0's rmse: 8422.94	valid_0's l2: 7.09459e+07
[1676]	valid_0's rmse: 8422.95	valid_0's l2: 7.09461e+07
[1677]	valid_0's rmse: 8422.91	valid_0's l2: 7.09455e+07
[1678]	valid_0's rmse: 8422.92	valid_0's l2: 7.09455e+07
[1679]	valid_0's rmse: 8422.96	valid_0's l2: 7.09463e+07
[1680]	valid_0's rmse: 8422.95	valid_0's l2: 7.09462e+07
[1681]	valid_0's rmse: 8422.85	valid_0's l2: 7.09443e+07
[1682]	valid_0's rmse: 8422.83	valid_0's l2: 7.09441e+07
[1683]	valid_0's rmse: 8422.8	valid_0's l2: 7.09436e+07
[1684]	valid_0's rmse: 8422.79	valid_0's l2: 7.09434e+07
[1685]	valid_0's rmse: 8422.77	va

[1956]	valid_0's rmse: 8425.95	valid_0's l2: 7.09967e+07
[1957]	valid_0's rmse: 8425.99	valid_0's l2: 7.09973e+07
[1958]	valid_0's rmse: 8426.01	valid_0's l2: 7.09977e+07
[1959]	valid_0's rmse: 8426.04	valid_0's l2: 7.09982e+07
[1960]	valid_0's rmse: 8426.09	valid_0's l2: 7.0999e+07
[1961]	valid_0's rmse: 8426.14	valid_0's l2: 7.09999e+07
[1962]	valid_0's rmse: 8426.16	valid_0's l2: 7.10001e+07
[1963]	valid_0's rmse: 8426.21	valid_0's l2: 7.1001e+07
[1964]	valid_0's rmse: 8426.19	valid_0's l2: 7.10008e+07
[1965]	valid_0's rmse: 8426.23	valid_0's l2: 7.10014e+07
[1966]	valid_0's rmse: 8426.22	valid_0's l2: 7.10013e+07
[1967]	valid_0's rmse: 8426.26	valid_0's l2: 7.10018e+07
[1968]	valid_0's rmse: 8426.28	valid_0's l2: 7.10022e+07
[1969]	valid_0's rmse: 8426.32	valid_0's l2: 7.10028e+07
[1970]	valid_0's rmse: 8426.34	valid_0's l2: 7.10031e+07
[1971]	valid_0's rmse: 8426.39	valid_0's l2: 7.1004e+07
[1972]	valid_0's rmse: 8426.31	valid_0's l2: 7.10027e+07
[1973]	valid_0's rmse: 8426.32	val

LGBMRegressor(boosting_type='goss', lambda_l2=1.0, learning_rate=0.001,
              max_depth=3, n_estimators=5000, objective='regression',
              random_state=13)

In [85]:
round(np.sqrt(mean_squared_error(y_val, lgb.predict(X_val[allow_feats]))), 5)

8422.73009
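
As a sanity check on the metric, here is a minimal pure-Python sketch of RMSE with the five-decimal rounding used in the answers. The toy targets and predictions are hypothetical. The last lines also check that the `rmse` column of the LightGBM log is the square root of the `l2` column, using the first log line of this task.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical toy data: errors are 1, 0 and -2, so the MSE is 5/3.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
toy_rmse = round(rmse(y_true, y_pred), 5)
print(toy_rmse)

# First log line of this task: rmse 8527.68, l2 7.27214e+07.
rmse_from_l2 = math.sqrt(7.27214e7)
print(round(rmse_from_l2, 2))
```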

## 13

Let's move to CatBoost. We will work with `CatBoostRegressor`.

Info about `CatBoostRegressor`: https://catboost.ai/docs/concepts/python-reference_catboostregressor.html

CatBoost parameters: https://catboost.ai/docs/concepts/python-reference_parameters-list.html. Look through this list to get familiar with the parameters the library provides.

Take `CatBoostRegressor` with the following parameters, similar to the previous models:

* `loss_function='RMSE'`
* `iterations=200`
* `learning_rate=0.01`
* `max_depth=5`
* `random_state=13`
* other default parameter values

Train it on the training data with `eval_set=[(X_val, y_val)]`, `early_stopping_rounds=50` and all other default parameter values. 

**q13:** Calculate RMSE on the validation set. What is it equal to? Provide the answer rounded to FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Notice the speed of the algorithm and compare it to the speed of the XGBoost and LightGBM models.

In [86]:
ctb = CatBoostRegressor(
    loss_function='RMSE',
    iterations=200,
    learning_rate=0.01,
    max_depth=5,
    random_state=13
)

In [87]:
ctb.fit(X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=50)

0:	learn: 13277.2870054	test: 8526.5680566	best: 8526.5680566 (0)	total: 141ms	remaining: 28s
1:	learn: 13272.4810810	test: 8525.0371870	best: 8525.0371870 (1)	total: 151ms	remaining: 14.9s
2:	learn: 13267.5177449	test: 8523.3886374	best: 8523.3886374 (2)	total: 160ms	remaining: 10.5s
3:	learn: 13263.2952731	test: 8522.1582616	best: 8522.1582616 (3)	total: 170ms	remaining: 8.32s
4:	learn: 13257.2202531	test: 8520.3532273	best: 8520.3532273 (4)	total: 180ms	remaining: 7s
5:	learn: 13252.7543910	test: 8518.9990683	best: 8518.9990683 (5)	total: 189ms	remaining: 6.11s
6:	learn: 13249.2045770	test: 8519.1106061	best: 8518.9990683 (5)	total: 197ms	remaining: 5.44s
7:	learn: 13243.4458393	test: 8517.5087457	best: 8517.5087457 (7)	total: 206ms	remaining: 4.93s
8:	learn: 13238.6918845	test: 8516.4245434	best: 8516.4245434 (8)	total: 212ms	remaining: 4.5s
9:	learn: 13234.2011718	test: 8515.2163314	best: 8515.2163314 (9)	total: 218ms	remaining: 4.13s
10:	learn: 13230.5141530	test: 8515.0823939	be

<catboost.core.CatBoostRegressor at 0x1df5b2d2788>

In [88]:
round(np.sqrt(mean_squared_error(y_val, ctb.predict(X_val))), 5)

8485.01086

## 14

Notes on parameter tuning: https://catboost.ai/docs/concepts/parameter-tuning.html

Here, we tuned some parameters of the algorithm. Take `CatBoostRegressor` with the following parameters:

* `loss_function='RMSE'`
* `n_estimators=5000`
* `learning_rate=0.001`
* `max_depth=9`
* `random_state=13`
* all other default parameter values

Train it in the same manner as in task 13, but with `early_stopping_rounds=500`.

**q14:** Calculate RMSE on the validation set. What is it equal to? Provide the answer rounded to FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [89]:
ctb = CatBoostRegressor(
    loss_function='RMSE',
    n_estimators=5000,
    learning_rate=0.001,
    max_depth=9,
    random_state=13
)

ctb.fit(X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=500)

0:	learn: 13280.9705594	test: 8527.7673907	best: 8527.7673907 (0)	total: 31.2ms	remaining: 2m 36s
1:	learn: 13280.1852114	test: 8527.6115169	best: 8527.6115169 (1)	total: 59.4ms	remaining: 2m 28s
2:	learn: 13279.3477587	test: 8527.4365512	best: 8527.4365512 (2)	total: 88.5ms	remaining: 2m 27s
3:	learn: 13278.4673795	test: 8527.1955417	best: 8527.1955417 (3)	total: 115ms	remaining: 2m 23s
4:	learn: 13277.6056592	test: 8527.0534633	best: 8527.0534633 (4)	total: 141ms	remaining: 2m 21s
5:	learn: 13276.7707409	test: 8526.9439818	best: 8526.9439818 (5)	total: 167ms	remaining: 2m 18s
6:	learn: 13276.2226780	test: 8526.8093958	best: 8526.8093958 (6)	total: 194ms	remaining: 2m 18s
7:	learn: 13275.3238861	test: 8526.6313357	best: 8526.6313357 (7)	total: 221ms	remaining: 2m 17s
8:	learn: 13274.8201836	test: 8526.4891028	best: 8526.4891028 (8)	total: 247ms	remaining: 2m 17s
9:	learn: 13273.9471276	test: 8526.3315744	best: 8526.3315744 (9)	total: 275ms	remaining: 2m 17s
10:	learn: 13273.4085069	te

86:	learn: 13218.0842716	test: 8514.7230700	best: 8514.7230700 (86)	total: 2.29s	remaining: 2m 9s
87:	learn: 13217.3918080	test: 8514.5867653	best: 8514.5867653 (87)	total: 2.32s	remaining: 2m 9s
88:	learn: 13216.9939720	test: 8514.5756579	best: 8514.5756579 (88)	total: 2.34s	remaining: 2m 9s
89:	learn: 13216.1292662	test: 8514.4487563	best: 8514.4487563 (89)	total: 2.37s	remaining: 2m 9s
90:	learn: 13215.2722693	test: 8514.2308110	best: 8514.2308110 (90)	total: 2.39s	remaining: 2m 9s
91:	learn: 13214.5167134	test: 8514.0978957	best: 8514.0978957 (91)	total: 2.42s	remaining: 2m 8s
92:	learn: 13213.9720951	test: 8513.9587940	best: 8513.9587940 (92)	total: 2.44s	remaining: 2m 8s
93:	learn: 13213.1492024	test: 8513.8238008	best: 8513.8238008 (93)	total: 2.47s	remaining: 2m 8s
94:	learn: 13212.2289443	test: 8513.6477520	best: 8513.6477520 (94)	total: 2.49s	remaining: 2m 8s
95:	learn: 13211.8478952	test: 8513.6598911	best: 8513.6477520 (94)	total: 2.52s	remaining: 2m 8s
96:	learn: 13210.812

173:	learn: 13159.4196364	test: 8504.4259958	best: 8504.4259958 (173)	total: 4.54s	remaining: 2m 6s
174:	learn: 13158.3531576	test: 8504.3091657	best: 8504.3091657 (174)	total: 4.57s	remaining: 2m 5s
175:	learn: 13157.4562880	test: 8504.1449967	best: 8504.1449967 (175)	total: 4.59s	remaining: 2m 5s
176:	learn: 13156.9733341	test: 8504.0406110	best: 8504.0406110 (176)	total: 4.62s	remaining: 2m 5s
177:	learn: 13156.6043942	test: 8504.0633348	best: 8504.0406110 (176)	total: 4.64s	remaining: 2m 5s
178:	learn: 13155.9280944	test: 8503.9619237	best: 8503.9619237 (178)	total: 4.67s	remaining: 2m 5s
179:	learn: 13154.7662731	test: 8503.8208697	best: 8503.8208697 (179)	total: 4.69s	remaining: 2m 5s
180:	learn: 13153.7989691	test: 8503.7253560	best: 8503.7253560 (180)	total: 4.72s	remaining: 2m 5s
181:	learn: 13153.0231638	test: 8503.5648422	best: 8503.5648422 (181)	total: 4.75s	remaining: 2m 5s
182:	learn: 13151.8839074	test: 8503.4620811	best: 8503.4620811 (182)	total: 4.78s	remaining: 2m 5s


260:	learn: 13099.5657020	test: 8496.0792740	best: 8496.0792740 (260)	total: 6.77s	remaining: 2m 2s
261:	learn: 13098.8040104	test: 8495.9680113	best: 8495.9680113 (261)	total: 6.8s	remaining: 2m 2s
262:	learn: 13098.3328950	test: 8495.8798284	best: 8495.8798284 (262)	total: 6.83s	remaining: 2m 2s
263:	learn: 13097.8732736	test: 8495.7740784	best: 8495.7740784 (263)	total: 6.85s	remaining: 2m 2s
264:	learn: 13097.0228816	test: 8495.6227862	best: 8495.6227862 (264)	total: 6.88s	remaining: 2m 2s
265:	learn: 13095.8143773	test: 8495.4691432	best: 8495.4691432 (265)	total: 6.91s	remaining: 2m 2s
266:	learn: 13095.0252293	test: 8495.3885849	best: 8495.3885849 (266)	total: 6.93s	remaining: 2m 2s
267:	learn: 13093.9422948	test: 8495.2226007	best: 8495.2226007 (267)	total: 6.95s	remaining: 2m 2s
268:	learn: 13093.1882467	test: 8495.1396198	best: 8495.1396198 (268)	total: 6.98s	remaining: 2m 2s
269:	learn: 13092.4275207	test: 8495.0698595	best: 8495.0698595 (269)	total: 7s	remaining: 2m 2s
270:

345:	learn: 13042.5889064	test: 8488.6503193	best: 8488.6164426 (344)	total: 8.94s	remaining: 2m
346:	learn: 13042.1443312	test: 8488.5640510	best: 8488.5640510 (346)	total: 8.97s	remaining: 2m
347:	learn: 13041.4187759	test: 8488.4887619	best: 8488.4887619 (347)	total: 8.99s	remaining: 2m
348:	learn: 13041.1128095	test: 8488.5428455	best: 8488.4887619 (347)	total: 9.02s	remaining: 2m
349:	learn: 13040.3235105	test: 8488.3902863	best: 8488.3902863 (349)	total: 9.04s	remaining: 2m
350:	learn: 13039.5689240	test: 8488.3645074	best: 8488.3645074 (350)	total: 9.07s	remaining: 2m
351:	learn: 13038.4390725	test: 8488.2244695	best: 8488.2244695 (351)	total: 9.09s	remaining: 2m
352:	learn: 13037.9877719	test: 8488.1167460	best: 8488.1167460 (352)	total: 9.12s	remaining: 2m
353:	learn: 13037.2447947	test: 8488.0033122	best: 8488.0033122 (353)	total: 9.15s	remaining: 2m
354:	learn: 13036.2400638	test: 8487.9043122	best: 8487.9043122 (354)	total: 9.17s	remaining: 2m
355:	learn: 13035.7697924	test

433:	learn: 12986.4268557	test: 8482.8559697	best: 8482.8559697 (433)	total: 11.1s	remaining: 1m 57s
434:	learn: 12986.0837514	test: 8482.8981252	best: 8482.8559697 (433)	total: 11.2s	remaining: 1m 57s
435:	learn: 12985.6905617	test: 8482.8144193	best: 8482.8144193 (435)	total: 11.2s	remaining: 1m 57s
436:	learn: 12985.0483837	test: 8482.7329923	best: 8482.7329923 (436)	total: 11.2s	remaining: 1m 57s
437:	learn: 12984.0664077	test: 8482.6001083	best: 8482.6001083 (437)	total: 11.3s	remaining: 1m 57s
438:	learn: 12983.3745914	test: 8482.5062727	best: 8482.5062727 (438)	total: 11.3s	remaining: 1m 57s
439:	learn: 12982.6149523	test: 8482.4064646	best: 8482.4064646 (439)	total: 11.3s	remaining: 1m 57s
440:	learn: 12981.8250735	test: 8482.2557804	best: 8482.2557804 (440)	total: 11.3s	remaining: 1m 57s
441:	learn: 12981.3685800	test: 8482.1444698	best: 8482.1444698 (441)	total: 11.4s	remaining: 1m 57s
442:	learn: 12980.7488212	test: 8482.0428649	best: 8482.0428649 (442)	total: 11.4s	remainin

518:	learn: 12936.4556768	test: 8478.3408107	best: 8478.3321548 (517)	total: 13.3s	remaining: 1m 54s
519:	learn: 12935.3748372	test: 8478.1946340	best: 8478.1946340 (519)	total: 13.3s	remaining: 1m 54s
520:	learn: 12934.4416960	test: 8478.1046694	best: 8478.1046694 (520)	total: 13.4s	remaining: 1m 54s
521:	learn: 12933.7749452	test: 8478.0223014	best: 8478.0223014 (521)	total: 13.4s	remaining: 1m 54s
522:	learn: 12933.1159133	test: 8477.9518224	best: 8477.9518224 (522)	total: 13.4s	remaining: 1m 54s
523:	learn: 12932.3504347	test: 8477.8186854	best: 8477.8186854 (523)	total: 13.4s	remaining: 1m 54s
524:	learn: 12931.6420572	test: 8477.7102325	best: 8477.7102325 (524)	total: 13.5s	remaining: 1m 54s
525:	learn: 12931.2213704	test: 8477.6321149	best: 8477.6321149 (525)	total: 13.5s	remaining: 1m 54s
526:	learn: 12930.2826656	test: 8477.6123535	best: 8477.6123535 (526)	total: 13.5s	remaining: 1m 54s
527:	learn: 12929.7732194	test: 8477.5299940	best: 8477.5299940 (527)	total: 13.5s	remainin

601:	learn: 12888.4602040	test: 8474.5146953	best: 8474.5081925 (600)	total: 15.3s	remaining: 1m 51s
602:	learn: 12888.1198218	test: 8474.4450802	best: 8474.4450802 (602)	total: 15.3s	remaining: 1m 51s
603:	learn: 12887.6922302	test: 8474.4435232	best: 8474.4435232 (603)	total: 15.3s	remaining: 1m 51s
604:	learn: 12887.3996029	test: 8474.5295615	best: 8474.4435232 (603)	total: 15.3s	remaining: 1m 51s
605:	learn: 12887.0243401	test: 8474.4591059	best: 8474.4435232 (603)	total: 15.4s	remaining: 1m 51s
606:	learn: 12886.6865148	test: 8474.5465504	best: 8474.4435232 (603)	total: 15.4s	remaining: 1m 51s
607:	learn: 12886.1150225	test: 8474.4566681	best: 8474.4435232 (603)	total: 15.4s	remaining: 1m 51s
608:	learn: 12884.9502397	test: 8474.4011091	best: 8474.4011091 (608)	total: 15.4s	remaining: 1m 51s
609:	learn: 12884.6513036	test: 8474.3878296	best: 8474.3878296 (609)	total: 15.5s	remaining: 1m 51s
610:	learn: 12883.9351303	test: 8474.2811259	best: 8474.2811259 (610)	total: 15.5s	remainin

691:	learn: 12839.6588376	test: 8470.8205787	best: 8470.6896619 (686)	total: 17.4s	remaining: 1m 48s
692:	learn: 12839.2799546	test: 8470.8379790	best: 8470.6896619 (686)	total: 17.5s	remaining: 1m 48s
693:	learn: 12838.9867118	test: 8470.8224225	best: 8470.6896619 (686)	total: 17.5s	remaining: 1m 48s
694:	learn: 12838.6272552	test: 8470.7703702	best: 8470.6896619 (686)	total: 17.5s	remaining: 1m 48s
695:	learn: 12838.2443556	test: 8470.8270334	best: 8470.6896619 (686)	total: 17.5s	remaining: 1m 48s
696:	learn: 12837.8607620	test: 8470.7997975	best: 8470.6896619 (686)	total: 17.6s	remaining: 1m 48s
697:	learn: 12837.4661091	test: 8470.7202942	best: 8470.6896619 (686)	total: 17.6s	remaining: 1m 48s
698:	learn: 12837.1028450	test: 8470.7319384	best: 8470.6896619 (686)	total: 17.6s	remaining: 1m 48s
699:	learn: 12836.4593836	test: 8470.6847911	best: 8470.6847911 (699)	total: 17.6s	remaining: 1m 48s
700:	learn: 12836.1051720	test: 8470.6127532	best: 8470.6127532 (700)	total: 17.7s	remainin

780:	learn: 12795.3479091	test: 8468.9255506	best: 8468.8624189 (778)	total: 19.5s	remaining: 1m 45s
781:	learn: 12794.8304976	test: 8468.8651633	best: 8468.8624189 (778)	total: 19.6s	remaining: 1m 45s
782:	learn: 12794.3139345	test: 8468.8386087	best: 8468.8386087 (782)	total: 19.6s	remaining: 1m 45s
783:	learn: 12793.6038458	test: 8468.7874190	best: 8468.7874190 (783)	total: 19.6s	remaining: 1m 45s
784:	learn: 12792.9547550	test: 8468.8139474	best: 8468.7874190 (783)	total: 19.6s	remaining: 1m 45s
785:	learn: 12792.3196469	test: 8468.7552938	best: 8468.7552938 (785)	total: 19.7s	remaining: 1m 45s
786:	learn: 12791.7037086	test: 8468.7513835	best: 8468.7513835 (786)	total: 19.7s	remaining: 1m 45s
787:	learn: 12791.3814052	test: 8468.7121351	best: 8468.7121351 (787)	total: 19.7s	remaining: 1m 45s
788:	learn: 12791.0499288	test: 8468.8124938	best: 8468.7121351 (787)	total: 19.7s	remaining: 1m 45s
789:	learn: 12790.6682935	test: 8468.7522928	best: 8468.7121351 (787)	total: 19.8s	remainin

867:	learn: 12748.9240638	test: 8468.0971521	best: 8467.7131490 (856)	total: 21.7s	remaining: 1m 43s
868:	learn: 12748.5391678	test: 8468.0512093	best: 8467.7131490 (856)	total: 21.7s	remaining: 1m 43s
869:	learn: 12747.8714947	test: 8467.9773719	best: 8467.7131490 (856)	total: 21.8s	remaining: 1m 43s
870:	learn: 12747.5291206	test: 8467.9519216	best: 8467.7131490 (856)	total: 21.8s	remaining: 1m 43s
871:	learn: 12746.8527906	test: 8467.9491903	best: 8467.7131490 (856)	total: 21.8s	remaining: 1m 43s
872:	learn: 12746.1320889	test: 8467.8632864	best: 8467.7131490 (856)	total: 21.9s	remaining: 1m 43s
873:	learn: 12745.5739276	test: 8467.9891113	best: 8467.7131490 (856)	total: 21.9s	remaining: 1m 43s
874:	learn: 12745.2232845	test: 8467.9337353	best: 8467.7131490 (856)	total: 21.9s	remaining: 1m 43s
875:	learn: 12744.9350530	test: 8467.9218610	best: 8467.7131490 (856)	total: 21.9s	remaining: 1m 43s
876:	learn: 12744.2602487	test: 8467.8310941	best: 8467.7131490 (856)	total: 22s	remaining:

949:	learn: 12707.1275993	test: 8467.6148537	best: 8467.1755367 (921)	total: 24.2s	remaining: 1m 43s
950:	learn: 12706.5602210	test: 8467.5113288	best: 8467.1755367 (921)	total: 24.2s	remaining: 1m 43s
951:	learn: 12705.9240808	test: 8467.4811051	best: 8467.1755367 (921)	total: 24.2s	remaining: 1m 43s
952:	learn: 12705.2487352	test: 8467.4416699	best: 8467.1755367 (921)	total: 24.3s	remaining: 1m 43s
953:	learn: 12704.8787288	test: 8467.3812183	best: 8467.1755367 (921)	total: 24.3s	remaining: 1m 43s
954:	learn: 12704.5387328	test: 8467.4830804	best: 8467.1755367 (921)	total: 24.3s	remaining: 1m 43s
955:	learn: 12703.8794382	test: 8467.4428408	best: 8467.1755367 (921)	total: 24.4s	remaining: 1m 43s
956:	learn: 12703.5992523	test: 8467.5297261	best: 8467.1755367 (921)	total: 24.4s	remaining: 1m 43s
957:	learn: 12703.3359889	test: 8467.6669094	best: 8467.1755367 (921)	total: 24.4s	remaining: 1m 43s
958:	learn: 12702.9865241	test: 8467.7335757	best: 8467.1755367 (921)	total: 24.4s	remainin

1036:	learn: 12663.4624276	test: 8467.0736585	best: 8466.6420804 (1029)	total: 26.7s	remaining: 1m 41s
1037:	learn: 12662.8331645	test: 8467.0112126	best: 8466.6420804 (1029)	total: 26.7s	remaining: 1m 41s
1038:	learn: 12662.4730336	test: 8466.9770777	best: 8466.6420804 (1029)	total: 26.7s	remaining: 1m 41s
1039:	learn: 12661.8553516	test: 8466.9362116	best: 8466.6420804 (1029)	total: 26.8s	remaining: 1m 41s
1040:	learn: 12660.9880464	test: 8466.8891520	best: 8466.6420804 (1029)	total: 26.8s	remaining: 1m 41s
1041:	learn: 12660.0036843	test: 8466.7996471	best: 8466.6420804 (1029)	total: 26.8s	remaining: 1m 41s
1042:	learn: 12659.5312458	test: 8466.7399147	best: 8466.6420804 (1029)	total: 26.9s	remaining: 1m 41s
1043:	learn: 12659.2118523	test: 8466.6678656	best: 8466.6420804 (1029)	total: 26.9s	remaining: 1m 41s
1044:	learn: 12658.3431478	test: 8466.6295511	best: 8466.6295511 (1044)	total: 26.9s	remaining: 1m 41s
1045:	learn: 12657.3753203	test: 8466.5389558	best: 8466.5389558 (1045)	t

1120:	learn: 12620.6483675	test: 8465.8792158	best: 8465.4403748 (1104)	total: 29.1s	remaining: 1m 40s
1121:	learn: 12620.3887687	test: 8465.8714938	best: 8465.4403748 (1104)	total: 29.2s	remaining: 1m 40s
1122:	learn: 12620.1402835	test: 8465.9694305	best: 8465.4403748 (1104)	total: 29.2s	remaining: 1m 40s
1123:	learn: 12619.7120612	test: 8466.0912760	best: 8465.4403748 (1104)	total: 29.2s	remaining: 1m 40s
1124:	learn: 12619.1590041	test: 8466.0091363	best: 8465.4403748 (1104)	total: 29.3s	remaining: 1m 40s
1125:	learn: 12618.8328734	test: 8466.0520790	best: 8465.4403748 (1104)	total: 29.3s	remaining: 1m 40s
1126:	learn: 12618.6097963	test: 8466.2391967	best: 8465.4403748 (1104)	total: 29.3s	remaining: 1m 40s
1127:	learn: 12618.3772679	test: 8466.2605965	best: 8465.4403748 (1104)	total: 29.3s	remaining: 1m 40s
1128:	learn: 12618.0749819	test: 8466.2248172	best: 8465.4403748 (1104)	total: 29.4s	remaining: 1m 40s
1129:	learn: 12617.4748400	test: 8466.1705287	best: 8465.4403748 (1104)	t

1202:	learn: 12585.2831396	test: 8467.5803779	best: 8465.4403748 (1104)	total: 31.6s	remaining: 1m 39s
1203:	learn: 12585.0236132	test: 8467.5740492	best: 8465.4403748 (1104)	total: 31.7s	remaining: 1m 39s
1204:	learn: 12584.7068736	test: 8467.6428082	best: 8465.4403748 (1104)	total: 31.7s	remaining: 1m 39s
1205:	learn: 12584.0795009	test: 8467.6184647	best: 8465.4403748 (1104)	total: 31.7s	remaining: 1m 39s
1206:	learn: 12583.4632514	test: 8467.7162913	best: 8465.4403748 (1104)	total: 31.7s	remaining: 1m 39s
1207:	learn: 12583.1634516	test: 8467.6875102	best: 8465.4403748 (1104)	total: 31.8s	remaining: 1m 39s
1208:	learn: 12582.8635711	test: 8467.6825985	best: 8465.4403748 (1104)	total: 31.8s	remaining: 1m 39s
1209:	learn: 12582.1963968	test: 8467.6262060	best: 8465.4403748 (1104)	total: 31.8s	remaining: 1m 39s
1210:	learn: 12581.1986757	test: 8467.5870707	best: 8465.4403748 (1104)	total: 31.9s	remaining: 1m 39s
1211:	learn: 12580.6703722	test: 8467.6813547	best: 8465.4403748 (1104)	t

1288:	learn: 12543.8596498	test: 8468.2383001	best: 8465.4403748 (1104)	total: 34.1s	remaining: 1m 38s
1289:	learn: 12543.5038409	test: 8468.1945559	best: 8465.4403748 (1104)	total: 34.2s	remaining: 1m 38s
1290:	learn: 12543.1100076	test: 8468.1141220	best: 8465.4403748 (1104)	total: 34.2s	remaining: 1m 38s
1291:	learn: 12542.7812233	test: 8468.0618666	best: 8465.4403748 (1104)	total: 34.2s	remaining: 1m 38s
1292:	learn: 12542.4960933	test: 8468.0065124	best: 8465.4403748 (1104)	total: 34.3s	remaining: 1m 38s
1293:	learn: 12541.9281849	test: 8467.9724982	best: 8465.4403748 (1104)	total: 34.3s	remaining: 1m 38s
1294:	learn: 12541.2799918	test: 8467.9491386	best: 8465.4403748 (1104)	total: 34.3s	remaining: 1m 38s
1295:	learn: 12540.6595070	test: 8467.9461754	best: 8465.4403748 (1104)	total: 34.4s	remaining: 1m 38s
1296:	learn: 12540.4415451	test: 8468.1195662	best: 8465.4403748 (1104)	total: 34.4s	remaining: 1m 38s
1297:	learn: 12540.1165956	test: 8468.0710353	best: 8465.4403748 (1104)	t

1373:	learn: 12508.4990728	test: 8469.0301534	best: 8465.4403748 (1104)	total: 36.7s	remaining: 1m 36s
1374:	learn: 12508.2144608	test: 8469.1087060	best: 8465.4403748 (1104)	total: 36.7s	remaining: 1m 36s
1375:	learn: 12507.9624615	test: 8469.2416557	best: 8465.4403748 (1104)	total: 36.7s	remaining: 1m 36s
1376:	learn: 12507.4931676	test: 8469.1851742	best: 8465.4403748 (1104)	total: 36.8s	remaining: 1m 36s
1377:	learn: 12507.2378893	test: 8469.1303889	best: 8465.4403748 (1104)	total: 36.8s	remaining: 1m 36s
1378:	learn: 12506.5477249	test: 8469.1054541	best: 8465.4403748 (1104)	total: 36.8s	remaining: 1m 36s
1379:	learn: 12505.9941858	test: 8469.1921369	best: 8465.4403748 (1104)	total: 36.9s	remaining: 1m 36s
1380:	learn: 12505.7310808	test: 8469.1784574	best: 8465.4403748 (1104)	total: 36.9s	remaining: 1m 36s
1381:	learn: 12505.4852728	test: 8469.1742213	best: 8465.4403748 (1104)	total: 36.9s	remaining: 1m 36s
1382:	learn: 12505.0473719	test: 8469.1822687	best: 8465.4403748 (1104)	t

1459:	learn: 12468.7611163	test: 8470.9327882	best: 8465.4403748 (1104)	total: 39.2s	remaining: 1m 34s
1460:	learn: 12468.4556994	test: 8470.9229712	best: 8465.4403748 (1104)	total: 39.2s	remaining: 1m 35s
1461:	learn: 12468.2188765	test: 8471.0419635	best: 8465.4403748 (1104)	total: 39.3s	remaining: 1m 35s
1462:	learn: 12467.6956742	test: 8471.1296184	best: 8465.4403748 (1104)	total: 39.3s	remaining: 1m 34s
1463:	learn: 12467.1821045	test: 8471.1392822	best: 8465.4403748 (1104)	total: 39.3s	remaining: 1m 34s
1464:	learn: 12466.9337212	test: 8471.2606649	best: 8465.4403748 (1104)	total: 39.4s	remaining: 1m 34s
1465:	learn: 12466.6197206	test: 8471.2225798	best: 8465.4403748 (1104)	total: 39.4s	remaining: 1m 34s
1466:	learn: 12466.3500080	test: 8471.3183366	best: 8465.4403748 (1104)	total: 39.4s	remaining: 1m 34s
1467:	learn: 12465.8425274	test: 8471.2697052	best: 8465.4403748 (1104)	total: 39.5s	remaining: 1m 34s
1468:	learn: 12465.4534709	test: 8471.2059981	best: 8465.4403748 (1104)	t

1540:	learn: 12435.5868306	test: 8472.5776311	best: 8465.4403748 (1104)	total: 41.6s	remaining: 1m 33s
1541:	learn: 12435.1083470	test: 8472.5844952	best: 8465.4403748 (1104)	total: 41.6s	remaining: 1m 33s
1542:	learn: 12434.5687803	test: 8472.6268234	best: 8465.4403748 (1104)	total: 41.7s	remaining: 1m 33s
1543:	learn: 12434.3454907	test: 8472.6143873	best: 8465.4403748 (1104)	total: 41.7s	remaining: 1m 33s
1544:	learn: 12433.9525441	test: 8472.5574132	best: 8465.4403748 (1104)	total: 41.7s	remaining: 1m 33s
1545:	learn: 12433.2897587	test: 8472.5312961	best: 8465.4403748 (1104)	total: 41.8s	remaining: 1m 33s
1546:	learn: 12432.9849711	test: 8472.6167150	best: 8465.4403748 (1104)	total: 41.8s	remaining: 1m 33s
1547:	learn: 12432.6830273	test: 8472.7059112	best: 8465.4403748 (1104)	total: 41.8s	remaining: 1m 33s
1548:	learn: 12431.9637166	test: 8472.6806097	best: 8465.4403748 (1104)	total: 41.8s	remaining: 1m 33s
1549:	learn: 12431.6744017	test: 8472.7939895	best: 8465.4403748 (1104)	t

<catboost.core.CatBoostRegressor at 0x1df5e4f9b88>

In [90]:
round(np.sqrt(mean_squared_error(y_val, ctb.predict(X_val))), 5)

8465.44037

## 15

Calculate feature importances according to the model trained in task 14.

**q15:** What is the name of the most important feature? Provide it as the answer. 

Do you understand why it might be important for the model?

Note that for regression, `CatBoostRegressor` calculates feature importance using PredictionValuesChange: https://catboost.ai/docs/concepts/fstr.html#fstr__regular-feature-importance

In [91]:
ctb.feature_names_[np.argmax(ctb.feature_importances_)]

'kw_avg_avg'
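
To look beyond the single top feature, the importances can be ranked. The following is a minimal sketch, assuming `names` and `importances` stand in for `ctb.feature_names_` and `ctb.feature_importances_`; the feature names and values are made up for illustration and are not the notebook's results.

```python
import numpy as np

# Hypothetical stand-ins for ctb.feature_names_ and ctb.feature_importances_.
names = ['feat_a', 'feat_b', 'feat_c', 'feat_d']
importances = np.array([12.5, 48.0, 7.5, 32.0])


def top_features(names, importances, k=3):
    """Return the k (name, importance) pairs with the largest importance."""
    # argsort is ascending, so reverse it to get descending importance.
    order = np.argsort(importances)[::-1][:k]
    return [(names[i], float(importances[i])) for i in order]


ranking = top_features(names, importances, k=3)
print(ranking)
```

The first entry of `ranking` corresponds to what `np.argmax` selects in the cell above.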

## 16

Finally, take a `Lasso` model from `sklearn` with `alpha=10.0`, `random_state=13`, and all other parameters at their default values. Train it on the training set.

**q16:** Calculate RMSE on the validation set. What is it equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [92]:
lr = Lasso(alpha=10, random_state=13)
lr.fit(X_train, y_train)

round(np.sqrt(mean_squared_error(y_val, lr.predict(X_val))), 5)

8426.97894

## 17

Compare the results on the validation set of the trained models:

* XGBoost (task 7)
* LightGBM (task 12)
* CatBoost (task 14)
* Lasso (task 16)

**q17:** Which model has the best RMSE value on the validation set? For the answer, provide the following:

* 1 (if XGBoost was the best)
* 2 (if LightGBM was the best)
* 3 (if CatBoost was the best)
* 4 (if Lasso was the best)
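
One way to pick the answer code is to key the validation RMSE values by that code and take the smallest one. This is only a sketch: the numbers below are placeholders, not results from this notebook, and `rmse_by_answer` and `model_names` are hypothetical names introduced for illustration.

```python
# Placeholder validation RMSEs keyed by the q17 answer code;
# replace them with the values computed in tasks 7, 12, 14 and 16.
rmse_by_answer = {1: 3.0, 2: 1.0, 3: 2.0, 4: 4.0}
model_names = {1: 'XGBoost', 2: 'LightGBM', 3: 'CatBoost', 4: 'Lasso'}

# Lower RMSE is better, so take the key with the smallest value.
best_answer = min(rmse_by_answer, key=rmse_by_answer.get)
print(best_answer, model_names[best_answer])
```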

## 18

Finally, let's move on to blending the models we obtained. First, calculate the predictions of the trained models on the validation set. Remember that the LightGBM model used a slightly different set of columns.

After getting the predictions for the validation set, concatenate them into a single dataframe `X_val_blend`. The dataframe should look like this:

||xgb|lgb|cb|lasso|
|-|-|-|-|-|
|0|2298.947754|3728.088336|3680.924182|4270.039931|
|1|3208.189209|5243.744431|4487.549790|6755.853939|
|...|...|...|...|...|

Here, the `xgb` column holds the XGBoost predictions, `lgb` the LightGBM predictions, `cb` the CatBoost predictions, and `lasso` the Lasso predictions.

**q18:** For the answer, calculate the mean value of all model predictions in the last row of this dataframe (`X_val_blend.iloc[-1]`). Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [93]:
X_val_blend = pd.DataFrame(np.vstack((xgb.predict(X_val), lgb.predict(X_val[allow_feats]), ctb.predict(X_val), lr.predict(X_val))).T,
             columns=['xgb', 'lgb', 'cb',  'lasso'])

In [94]:
X_val_blend

||xgb|lgb|cb|lasso|
|-|-|-|-|-|
|0|2298.947754|3745.689944|3680.924182|4270.039931|
|1|3208.189209|5295.171499|4487.549790|6755.853939|
|2|1171.030029|2362.077406|2899.190806|960.707930|
|3|1715.524292|2979.686550|3102.992450|3280.292136|
|4|1780.428223|3058.667040|3404.989586|1586.863807|
|...|...|...|...|...|
|7924|2203.913086|3428.583531|3608.619347|4982.351696|
|7925|2486.425293|3775.533305|3874.454205|4799.275291|
|7926|1648.837158|3139.468603|3159.619911|3515.011430|
|7927|1368.506592|2549.878300|2971.898824|2246.626186|


In [95]:
round(X_val_blend.iloc[-1].mean(), 5)

2881.74948
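
The blend matrix can also be assembled with `np.column_stack`, which places each model's predictions in its own column without a transpose. A minimal sketch on toy arrays (the values are illustrative only, and `preds`, `blend` and `last_row_mean` are names chosen for this example):

```python
import numpy as np

# Toy per-model predictions for three validation rows (illustrative values only).
preds = {
    'xgb': np.array([1.0, 2.0, 3.0]),
    'lgb': np.array([2.0, 3.0, 4.0]),
    'cb': np.array([3.0, 4.0, 5.0]),
    'lasso': np.array([4.0, 5.0, 8.0]),
}

# One column per model; this is equivalent to np.vstack(...).T.
blend = np.column_stack([preds[name] for name in preds])
assert np.array_equal(blend, np.vstack(list(preds.values())).T)

# Mean over the models for the last row (the q18 quantity, on toy data).
last_row_mean = blend[-1].mean()
print(blend.shape, last_row_mean)
```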

## 19

Obtain a matrix of pairwise Pearson Correlation Coefficient (PCC) values for the columns of the dataframe `X_val_blend`. Find the pair of model predictions with the highest PCC value (ignore the 1.0 correlations of each column with itself).

**q19:** What is this value equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [96]:
round(X_val_blend.corr().max(), 5)

0.48111

In [102]:
round(X_val_blend.corr().loc['lgb', 'cb'], 5)

0.84455
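
Because every column correlates perfectly with itself, the diagonal of the correlation matrix is always 1.0, so a plain maximum over the matrix is not informative. One way to find the strongest pair programmatically is to look only at the strict upper triangle. The sketch below uses NumPy on synthetic data in which the `lgb` and `cb` columns are deliberately built to be related; with pandas, `X_val_blend.corr().to_numpy()` would give the matrix to search in the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic prediction columns: 'lgb' and 'cb' share a common signal.
columns = ['xgb', 'lgb', 'cb', 'lasso']
preds = np.column_stack([
    rng.normal(size=200),
    base + 0.1 * rng.normal(size=200),
    base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

corr = np.corrcoef(preds, rowvar=False)  # 4x4 Pearson correlation matrix

# The strict upper triangle (k=1) skips the diagonal and counts each pair once.
rows, cols = np.triu_indices(len(columns), k=1)
best = np.argmax(corr[rows, cols])
best_pair = (columns[rows[best]], columns[cols[best]])
best_value = round(float(corr[rows[best], cols[best]]), 5)
print(best_pair, best_value)
```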

## 20

Blend the models into an ensemble with the weights 0.25, 0.25, 0.25 and 0.25 (that is, the plain mean of the predictions).

**q20:** Calculate RMSE of such ensemble on the validation set. What is it equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Compare it with RMSE of each model and think whether this is a good ensemble.

In [103]:
round(np.sqrt(mean_squared_error(y_val, X_val_blend.mean(axis=1))), 5)

8439.26508
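
A weighted blend is a matrix-vector product of the prediction matrix with the weight vector, and equal weights reduce it to the row-wise mean. A small sketch on toy numbers (not the notebook's data; `blend_rmse` is a hypothetical helper introduced here):

```python
import numpy as np

# Toy data: 3 rows of predictions from 4 models, plus toy targets.
preds = np.array([[1.0, 2.0, 3.0, 6.0],
                  [2.0, 2.0, 2.0, 2.0],
                  [0.0, 4.0, 4.0, 8.0]])
y_true = np.array([3.0, 2.0, 5.0])


def blend_rmse(preds, y_true, weights):
    """RMSE of the weighted sum of the prediction columns."""
    y_pred = preds @ np.asarray(weights)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


equal = blend_rmse(preds, y_true, [0.25, 0.25, 0.25, 0.25])

# With equal weights the blend is just the row-wise mean.
mean_rmse = float(np.sqrt(np.mean((y_true - preds.mean(axis=1)) ** 2)))
assert np.isclose(equal, mean_rmse)
print(equal)
```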

## 21

Tune the weights of the ensemble. Run each model's weight through `np.linspace(0, 1, 101)`, so that the possible values of each weight are [0.0, 0.01, 0.02, ..., 0.99, 1.0]. Skip any combination of weights whose sum is not equal to 1.0. For each combination that does sum to 1.0, compute the ensemble prediction on the validation set with these weights and calculate its RMSE.

In the end, select the combination of weights with the lowest RMSE value - these are the best weights for the ensemble.

**q21:** What is their corresponding RMSE value equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Compare RMSE value of the ensemble with RMSE values of the models in it. Is the ensemble better?

_Hint. You probably want to save the RMSE together with the corresponding weights for each valid combination in some array. This weight tuning can be implemented as a quadruple nested loop, or you may think of other ways to implement it. You can track tuning progress with the `tqdm` module._
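
One possible alternative to four nested loops is to iterate over integer steps for all but the last model and give the last model whatever is left. This avoids comparing floating-point sums to 1.0 and never visits combinations that cannot sum to 1. The function below is a sketch under these assumptions, not the reference solution; `tune_weights` and its `steps` parameter are hypothetical names, and the demo runs on synthetic data with a coarse grid.

```python
import itertools

import numpy as np


def tune_weights(preds, y_true, steps=100):
    """Grid-search blend weights in increments of 1/steps that sum to 1."""
    n_models = preds.shape[1]
    best_rmse, best_weights = np.inf, None
    # Choose integer parts for all but the last model; the last gets the rest.
    for parts in itertools.product(range(steps + 1), repeat=n_models - 1):
        rest = steps - sum(parts)
        if rest < 0:
            continue  # the chosen parts already exceed a total weight of 1
        weights = np.array(parts + (rest,)) / steps
        rmse = np.sqrt(np.mean((y_true - preds @ weights) ** 2))
        if rmse < best_rmse:
            best_rmse, best_weights = rmse, weights
    return best_weights, round(float(best_rmse), 5)


# Synthetic check: the target is an exact 50/50 blend of columns 1 and 3.
rng = np.random.default_rng(0)
preds = rng.normal(size=(50, 4))
y_true = 0.5 * preds[:, 1] + 0.5 * preds[:, 3]
best_weights, best_rmse = tune_weights(preds, y_true, steps=10)
print(best_weights, best_rmse)
```

With `steps=100` this covers the same grid as `np.linspace(0, 1, 101)`.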

In [104]:
import tqdm

In [105]:
weights = np.linspace(0, 1, 101)
params = set()
for xgb_weight in tqdm.tqdm(weights):
    for lgb_weight in weights:
        for ctb_weight in weights:
            for lr_weight in weights:
                # Keep only combinations whose weights sum to 1 (up to float error).
                if abs(xgb_weight + lgb_weight + ctb_weight + lr_weight - 1) < 1e-8:
                    y_pred = (X_val_blend['xgb'] * xgb_weight + X_val_blend['lgb'] * lgb_weight
                              + X_val_blend['cb'] * ctb_weight + X_val_blend['lasso'] * lr_weight)
                    rmse = round(np.sqrt(mean_squared_error(y_val, y_pred)), 5)
                    # Store (weights, RMSE) so the best combination can be picked later.
                    params.add(((xgb_weight, lgb_weight, ctb_weight, lr_weight), rmse))

100%|████████████████████████████████████████████████████████████████████████████████████| 101/101 [05:06<00:00,  3.03s/it]


In [106]:
sorted(params, key=lambda x: x[1])

[((0.0, 0.53, 0.0, 0.47000000000000003), 8405.68519),
 ((0.0, 0.52, 0.0, 0.48), 8405.68951),
 ((0.0, 0.54, 0.0, 0.46), 8405.69617),
 ((0.0, 0.51, 0.0, 0.49), 8405.70913),
 ((0.0, 0.55, 0.0, 0.45), 8405.72247),
 ((0.0, 0.5, 0.0, 0.5), 8405.74407),
 ((0.0, 0.56, 0.0, 0.44), 8405.76406),
 ((0.0, 0.49, 0.0, 0.51), 8405.7943),
 ((0.0, 0.5700000000000001, 0.0, 0.43), 8405.82097),
 ((0.0, 0.52, 0.01, 0.47000000000000003), 8405.82118),
 ((0.0, 0.51, 0.01, 0.48), 8405.8275),
 ((0.0, 0.53, 0.01, 0.46), 8405.83017),
 ((0.0, 0.5, 0.01, 0.49), 8405.84913),
 ((0.0, 0.54, 0.01, 0.45), 8405.85446),
 ((0.0, 0.48, 0.0, 0.52), 8405.85985),
 ((0.0, 0.49, 0.01, 0.5), 8405.88606),
 ((0.0, 0.58, 0.0, 0.42), 8405.89317),
 ((0.01, 0.52, 0.0, 0.47000000000000003), 8405.89354),
 ((0.0, 0.55, 0.01, 0.44), 8405.89405),
 ((0.01, 0.51, 0.0, 0.48), 8405.8983),
 ((0.01, 0.53, 0.0, 0.46), 8405.90408),
 ((0.01, 0.5, 0.0, 0.49), 8405.91837),
 ((0.01, 0.54, 0.0, 0.45), 8405.92992),
 ((0.0, 0.48, 0.01, 0.51), 8405.9383),
 

## 22

Using the best weights obtained in task 21, run the best ensemble on the test set. To do this, obtain model predictions on the test set (you can store them in a table similar to the one built for the validation set in task 18). Remember that the LightGBM model uses a slightly different set of columns.

**q22:** Calculate RMSE of the final ensemble on the test set. What is it equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [107]:
X_test_blend = pd.DataFrame(np.vstack((xgb.predict(X_test), lgb.predict(X_test[allow_feats]), ctb.predict(X_test), lr.predict(X_test))).T,
             columns=['xgb', 'lgb', 'cb',  'lasso'])

In [110]:
y_pred = X_test_blend['xgb'] * 0 + X_test_blend['lgb'] * 0.53 + X_test_blend['cb'] * 0 + X_test_blend['lasso'] * 0.47

In [112]:
round(np.sqrt(mean_squared_error(y_test, y_pred)), 5)

8445.47489
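
Instead of typing the best weights by hand, they can be read from the collected `(weights, RMSE)` pairs and applied to the test predictions unchanged. A minimal sketch with toy stand-ins (the `toy_` values are made up for illustration and are not the notebook's results):

```python
import numpy as np

# Toy stand-ins: (weights, validation RMSE) pairs as collected in task 21,
# and a small test-set prediction matrix with columns xgb, lgb, cb, lasso.
toy_params = {
    ((0.0, 0.5, 0.0, 0.5), 1.2),
    ((0.25, 0.25, 0.25, 0.25), 1.5),
    ((1.0, 0.0, 0.0, 0.0), 2.0),
}
toy_test_preds = np.array([[1.0, 2.0, 3.0, 4.0],
                           [2.0, 4.0, 6.0, 8.0]])
toy_y_test = np.array([3.0, 5.0])

# Pick the weights with the lowest validation RMSE ...
best_weights, best_val_rmse = min(toy_params, key=lambda item: item[1])

# ... and apply them to the test predictions without re-tuning on the test set.
toy_y_pred = toy_test_preds @ np.array(best_weights)
test_rmse = round(float(np.sqrt(np.mean((toy_y_test - toy_y_pred) ** 2))), 5)
print(best_weights, test_rmse)
```

Selecting the weights on the validation set and only evaluating on the test set keeps the test RMSE an honest estimate.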