# Practice assignment: Advanced ensembling techniques


In this programming assignment, you are going to work with a dataset based on the following data:

https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

_Citation:_

* _K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal._

The dataset contains information about online news articles. In this assignment, you are going to predict the number of shares of a news article (target column: `shares`). Information about the features is available at the link above. You are going to train several machine learning models (XGBoost, LightGBM, CatBoost and Lasso) and blend them into a final ensemble.

In [1]:
def run_boosting(X_train, y_train, X_test, y_test, params, boosting_type='xgbm', evaluate=False, return_rmse=False):
    # boosting_type must be one of 'cat', 'lgbm' or 'xgbm'
    if boosting_type == 'cat':
        model = CatBoostRegressor(**params)
    elif boosting_type == 'lgbm':
        model = LGBMRegressor(**params)
    elif boosting_type == 'xgbm':
        model = XGBRegressor(**params)
    else:
        raise ValueError(f'Unknown boosting type: {boosting_type!r}')

    if evaluate and boosting_type == 'lgbm':
        # LightGBM takes the evaluation metric as a fit argument
        model.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric='rmse')
    elif evaluate:
        # CatBoost and XGBoost report RMSE by default for the objectives used here;
        # recent XGBoost releases expect eval_metric in the constructor rather than in fit
        model.fit(X_train, y_train, eval_set=[(X_test, y_test)])
    else:
        model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # RMSE on the evaluation set passed as X_test / y_test
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f'RMSE: {np.round(rmse, 5)}')
    if return_rmse:
        return model, rmse
    return model

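def create_feature_importance_df(model, feature_names=None):
    # feature_names defaults to the columns of the global X_train;
    # pass the columns the model was trained on if they differ
    if feature_names is None:
        feature_names = X_train.columns
    feature_importance_df = pd.DataFrame({'feature': feature_names,
                                          'importance': model.feature_importances_})
    feature_importance_df = feature_importance_df.sort_values('importance', ascending=False)
    return feature_importance_df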

In [2]:
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from tqdm import tqdm  # progress bar used in the weight search of task 21
from xgboost import XGBRegressor

In [3]:
df = pd.read_csv("https://github.com/rustya5041/applied-machine-learning/raw/main/data.csv")

## 1

**q1:** How many missing values are there in the data? Provide the number of cells in the dataframe that contain NaNs.

In [4]:
# your code here
df.isna().sum().sum()

0

## 2

**q2:** What is the maximum number of shares among all the news articles presented in the data?

In [5]:
# your code here
df.shares.max()

843300

## 3

**q3:** What is the median number of shares for the articles published on Monday?

In [6]:
# your code here
np.median(df[df['weekday_is_monday'] == 1].shares)

1400.0

## 4

First, we separate the target from the dataframe with features (`df` -> `X`, `y`).

Next, let's split the data into train/val/test sets in the ratio 60:20:20. The idea is that we will use the train set to train our models, the val set to validate them and the test set to calculate the final error of the blend. So, the test set will be completely unseen data.

To do this, use the regular `train_test_split` from `sklearn` to split `X` and `y` into train and val/test parts in the ratio 60:40. Then use `train_test_split` again to split the obtained val/test part into validation and test sets in the ratio 50:50. In each `train_test_split` call, use `random_state=13` and default values for all other parameters.

In the end, you should obtain `X_train`, `X_val`, `X_test` with the following shapes, respectively: (23786, 58), (7929, 58), (7929, 58). The same logic is with `y_train`, `y_val`, `y_test`.
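
As a quick sanity check on these shapes, the sizes can be worked out by hand. The sketch below assumes that a fractional `test_size` is rounded up and that the remaining rows go to the other part (this is how we understand `train_test_split` to size its parts, so treat it as an assumption); `split_sizes` is a hypothetical helper, not part of `sklearn`.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size.

    Assumption: the second part is rounded up and the first part gets the rest.
    """
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second

# Total number of rows implied by the three target shapes
n_total = 23786 + 7929 + 7929

# Stage 1: 60% train, 40% val/test; stage 2: split the 40% in half
n_train, n_val_test = split_sizes(n_total, 0.4)
n_val, n_test = split_sizes(n_val_test, 0.5)
print(n_train, n_val, n_test)
```

Under this assumption the two-stage split reproduces the row counts quoted above.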

**q4:** What is the mean value of the target in the test part (`y_test`)? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [7]:
X = df.drop('shares', axis=1)
y = df['shares']

In [8]:
X_train, X_val_test, y_train, y_val_test = train_test_split(X, y, test_size=0.4, random_state=13)
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, random_state=13)

assert (X_train.shape, X_val.shape, X_test.shape) == ((23786, 58), (7929, 58), (7929, 58))

np.round(np.mean(y_test), 5)

3349.74057

## 5

Now let's train our first model - XGBoost. A link to the documentation: https://xgboost.readthedocs.io/en/latest/

We will use the Scikit-Learn wrapper interface for XGBoost (the same logic applies to the LightGBM and CatBoost models that follow). Here, we work on a regression task, hence we will use `XGBRegressor`. Read about the parameters of `XGBRegressor`: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor

The main list of XGBoost parameters: https://xgboost.readthedocs.io/en/latest/parameter.html# Look through this list so that you understand which parameters are available in the library.

Take `XGBRegressor` with MSE objective (`objective='reg:squarederror'`), 200 trees (`n_estimators=200`), `learning_rate=0.01`, `max_depth=5`, `random_state=13` and all other default parameter values. Train it on the train set (`fit` function). 

**q5:** Calculate Root Mean Squared Error (RMSE) on the validation set. What is it equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).
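
For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below is a minimal NumPy version on made-up numbers; the function name `rmse` and the toy values are illustrative only and are not part of the assignment data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Squared errors are 0, 0 and 4, so the result is sqrt(4 / 3)
toy_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(round(toy_rmse, 5))
```

This should agree with `np.sqrt(mean_squared_error(y_true, y_pred))`, which is the form used in the solution cells.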

In [9]:
# params
xgbm_params = {
    'objective': 'reg:squarederror',
    'n_estimators': 200,
    'learning_rate': 0.01,
    'max_depth': 5,
    'random_state': 13
}

xgbm = run_boosting(X_train, y_train, X_val, y_val, xgbm_params, boosting_type='xgbm', evaluate=False)

RMSE: 10337.77015


## 6

In task 5, we decided to build 200 trees in our model. However, it is hard to tell whether this was a good decision: maybe that is too many? Maybe 150 is a better number? Or 100? Or is 50 enough?

During training, it is possible to stop constructing the ensemble once the validation error no longer decreases. Using the same XGBoost model, call the `fit` function (to train it) with `eval_set=[(X_val, y_val)]` (to evaluate the boosting model after each new tree is built) and `early_stopping_rounds=50` (and default values for all other parameters). The `early_stopping_rounds` parameter means that if the validation metric does not improve (for RMSE, does not decrease) for 50 consecutive iterations, training stops.
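
The stopping rule itself can be sketched in a few lines of plain Python. This is only an illustration of the patience logic for a lower-is-better metric, not the libraries' actual implementation; `early_stopping_iteration` and the toy error values are hypothetical.

```python
def early_stopping_iteration(val_errors, patience):
    """Return (best_iteration, stop_iteration) for a lower-is-better metric.

    Training stops at the first iteration that is `patience` rounds past
    the best one without any improvement in between.
    """
    best_iter, best_err = 0, float('inf')
    for i, err in enumerate(val_errors):
        if err < best_err:
            best_iter, best_err = i, err
        elif i - best_iter >= patience:
            return best_iter, i
    # Ran out of iterations before the patience was exhausted
    return best_iter, len(val_errors) - 1

toy_errors = [10.0, 9.0, 8.5, 8.6, 8.7, 8.9, 9.4]
best_it, stop_it = early_stopping_iteration(toy_errors, patience=3)
print(best_it, stop_it)
```

On the toy sequence the best error is at iteration 2 and the loop stops at iteration 5, three rounds later. The model then predicts with the trees up to the best iteration, which is why the reported RMSE matches the lowest value in the training log.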

**q6:** Calculate RMSE on the validation set. What is it equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [10]:
# params
xgbm_params = {
    'objective': 'reg:squarederror',
    'n_estimators': 200,
    'learning_rate': 0.01,
    'max_depth': 5,
    'random_state': 13,
    'early_stopping_rounds': 50
}

xgbm = run_boosting(X_train, y_train, X_val, y_val, xgbm_params, boosting_type='xgbm', evaluate=True)

[0]	validation_0-rmse:8526.14346
[1]	validation_0-rmse:8524.77360
[2]	validation_0-rmse:8523.73110
[3]	validation_0-rmse:8523.06405
[4]	validation_0-rmse:8522.60356
[5]	validation_0-rmse:8522.60392
[6]	validation_0-rmse:8522.82842
[7]	validation_0-rmse:8523.13738
[8]	validation_0-rmse:8524.41203
[9]	validation_0-rmse:8525.71422
[10]	validation_0-rmse:8528.07563
[11]	validation_0-rmse:8530.70579
[12]	validation_0-rmse:8533.88394
[13]	validation_0-rmse:8536.83069
[14]	validation_0-rmse:8538.83907
[15]	validation_0-rmse:8540.42672
[16]	validation_0-rmse:8544.74549
[17]	validation_0-rmse:8547.71976
[18]	validation_0-rmse:8550.29632
[19]	validation_0-rmse:8553.93203
[20]	validation_0-rmse:8557.39221
[21]	validation_0-rmse:8563.71792
[22]	validation_0-rmse:8568.12120
[23]	validation_0-rmse:8574.37509
[24]	validation_0-rmse:8579.72921
[25]	validation_0-rmse:8580.95175
[26]	validation_0-rmse:8587.01026
[27]	validation_0-rmse:8588.08066
[28]	validation_0-rmse:8594.96050
[29]	validation_0-rmse:8



[42]	validation_0-rmse:8664.67698
[43]	validation_0-rmse:8670.40344
[44]	validation_0-rmse:8677.94879
[45]	validation_0-rmse:8683.66393
[46]	validation_0-rmse:8689.18234
[47]	validation_0-rmse:8694.78732
[48]	validation_0-rmse:8701.28443
[49]	validation_0-rmse:8707.21527
[50]	validation_0-rmse:8713.34533
[51]	validation_0-rmse:8719.70769
[52]	validation_0-rmse:8726.32208
[53]	validation_0-rmse:8734.97346
[54]	validation_0-rmse:8741.76910
RMSE: 8522.60356


## 7

Notes on parameter tuning: https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html

Here, we tuned some parameters of the algorithm. Take `XGBRegressor` with the following parameters:

* `objective='reg:squarederror'`
* `n_estimators=5000`
* `learning_rate=0.001`
* `max_depth=4`
* `gamma=1`
* `subsample=0.5`
* `random_state=13`
* all other default parameter values

Train it in the same manner as in task 6, but with `early_stopping_rounds=500`.

**q7:** Calculate RMSE on the validation set. What is it equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Notice the speed of the algorithm.

In [11]:
xgbm_params = {
    'objective': 'reg:squarederror',
    'n_estimators': 5000,
    'learning_rate': 0.001,
    'max_depth': 4,
    'random_state': 13,
    'gamma' : 1,
    'subsample': 0.5,
    'early_stopping_rounds': 500
}

xgbm, rmse_xgbm = run_boosting(X_train, y_train, X_val, y_val, xgbm_params, boosting_type='xgbm', evaluate=True, return_rmse=True)

[0]	validation_0-rmse:8527.59403
[1]	validation_0-rmse:8527.40756
[2]	validation_0-rmse:8527.09291
[3]	validation_0-rmse:8526.79671
[4]	validation_0-rmse:8526.60257
[5]	validation_0-rmse:8526.38331
[6]	validation_0-rmse:8526.05516
[7]	validation_0-rmse:8525.75850
[8]	validation_0-rmse:8525.55023
[9]	validation_0-rmse:8525.26727
[10]	validation_0-rmse:8525.06648
[11]	validation_0-rmse:8524.89645
[12]	validation_0-rmse:8524.71251
[13]	validation_0-rmse:8524.50021
[14]	validation_0-rmse:8524.22673
[15]	validation_0-rmse:8523.93172
[16]	validation_0-rmse:8523.72846
[17]	validation_0-rmse:8523.44191
[18]	validation_0-rmse:8523.07469
[19]	validation_0-rmse:8522.40105
[20]	validation_0-rmse:8521.98181
[21]	validation_0-rmse:8521.79479
[22]	validation_0-rmse:8521.58745
[23]	validation_0-rmse:8521.36070
[24]	validation_0-rmse:8521.09811
[25]	validation_0-rmse:8520.57698
[26]	validation_0-rmse:8520.39283
[27]	validation_0-rmse:8520.14916
[28]	validation_0-rmse:8519.94634
[29]	validation_0-rmse:8



[42]	validation_0-rmse:8516.56544
[43]	validation_0-rmse:8516.39888
[44]	validation_0-rmse:8516.23509
[45]	validation_0-rmse:8516.04416
[46]	validation_0-rmse:8515.73714
[47]	validation_0-rmse:8515.60837
[48]	validation_0-rmse:8515.24080
[49]	validation_0-rmse:8515.02836
[50]	validation_0-rmse:8514.84648
[51]	validation_0-rmse:8514.69535
[52]	validation_0-rmse:8514.34845
[53]	validation_0-rmse:8514.09776
[54]	validation_0-rmse:8513.84994
[55]	validation_0-rmse:8513.60196
[56]	validation_0-rmse:8513.38889
[57]	validation_0-rmse:8513.12227
[58]	validation_0-rmse:8512.82677
[59]	validation_0-rmse:8512.58834
[60]	validation_0-rmse:8512.39177
[61]	validation_0-rmse:8512.22997
[62]	validation_0-rmse:8512.05030
[63]	validation_0-rmse:8511.83596
[64]	validation_0-rmse:8511.55714
[65]	validation_0-rmse:8511.42487
[66]	validation_0-rmse:8511.27370
[67]	validation_0-rmse:8511.15156
[68]	validation_0-rmse:8510.90917
[69]	validation_0-rmse:8510.80150
[70]	validation_0-rmse:8510.58396
[71]	validatio

## 8

Calculate feature importances according to the model trained in task 7.

**q8:** What is the name of the most important feature? Provide it as the answer. Do you understand why it might be important for the model?

Notice that by default, `XGBRegressor` calculates feature importance considering gain (`importance_type` parameter).

In [12]:
xgbm_imp = create_feature_importance_df(xgbm)
xgbm_imp.head(1)

||feature|importance|
|-|-|-|
|13|data_channel_is_bus|0.06989|


## 9

Let's move to LightGBM. We will work with `LGBMRegressor`.

LGBMRegressor parameters: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html#lightgbm.LGBMRegressor

The main list of LightGBM parameters: https://lightgbm.readthedocs.io/en/latest/Parameters.html Look through this list so that you understand which parameters are available in the library.

Take `LGBMRegressor` with the following parameters, similar to the previous `XGBoost` model:

* `objective='regression'`
* `n_estimators=200`
* `learning_rate=0.01`
* `max_depth=5`
* `random_state=13`
* other default parameter values

Train it on the training data with `eval_set=[(X_val, y_val)]`, `eval_metric='rmse'`, `early_stopping_rounds=50` and all other default parameter values. 

**q9:** Calculate RMSE on the validation set. What is it equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Notice the speed of the algorithm and compare it to the speed of XGBoost model.

In [13]:
lgbm_params = {
    'objective': 'regression',
    'n_estimators': 200,
    'learning_rate': 0.01,
    'max_depth': 5,
    'random_state': 13,
    'early_stopping_rounds': 50,
}

lgbm = run_boosting(X_train, y_train, X_val, y_val, lgbm_params, boosting_type='lgbm', evaluate=True)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004233 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8418
[LightGBM] [Info] Number of data points in the train set: 23786, number of used features: 58
[LightGBM] [Info] Start training from score 3438.557933
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[71]	valid_0's rmse: 8451.29	valid_0's l2: 7.14242e+07
RMSE: 8451.2859


## 10

Notes on parameter tuning: https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html

Here, we tuned some parameters of the algorithm. Take `LGBMRegressor` with the following parameters:

* `objective='regression'`
* `n_estimators=5000`
* `learning_rate=0.001`
* `max_depth=3`
* `lambda_l2=1.0`
* `boosting_type='goss'`
* `random_state=13`
* all other default parameter values

Train it in the same manner as in task 9, but with `early_stopping_rounds=500`.

**q10:** Calculate RMSE on the validation set. What is it equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [14]:
lgbm_params = {
    'objective': 'regression',
    'n_estimators' : 5000,
    'learning_rate' : 0.001,
    'max_depth': 3,
    'random_state': 13,
    'lambda_l2' : 1.0,
    'early_stopping_rounds' : 500,
    'boosting_type' : 'goss'
}

lgbm = run_boosting(X_train, y_train, X_val, y_val, lgbm_params, boosting_type='lgbm', evaluate=True)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002029 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8418
[LightGBM] [Info] Number of data points in the train set: 23786, number of used features: 58
[LightGBM] [Info] Using GOSS
[LightGBM] [Info] Start training from score 3438.557933
Training until validation scores don't improve for 500 rounds
Early stopping, best iteration is:
[1637]	valid_0's rmse: 8421.62	valid_0's l2: 7.09237e+07
RMSE: 8421.6219


## 11

Calculate feature importances according to the model trained in task 10.

**q11:** What is the name of the most important feature? Provide it as the answer. 

Do you understand why it might be important for the model?

Notice that by default, `LGBMRegressor` calculates feature importance as the number of times the feature is used in the model (`importance_type` parameter).

In [15]:
lgbm_imps = create_feature_importance_df(lgbm)
lgbm_imps.head(1)

||feature|importance|
|-|-|-|
|26|self_reference_min_shares|1322|


## 12

Since some features are not important for the model, we can drop them and try to build a better model that does not consider them at all.

Obtain new train and validation sets without the features whose LightGBM importance is less than 10 (the importances were computed in task 11). Train the same model as in task 10 on the new train set in the same manner.

**q12:** Calculate RMSE on the new validation set. What is it equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Notice that the new versions of train and validation sets are used only in this task and in blending.
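
One possible way to build the reduced sets is to select the surviving columns directly from the frames that are already split, which keeps the rows identical to the original split. The sketch below uses small made-up frames; all names prefixed with `toy_` and the importance values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: three features, already split into train and validation
toy_train = pd.DataFrame({'f1': [1, 2, 3], 'f2': [0, 0, 1], 'f3': [5, 4, 3]})
toy_val = pd.DataFrame({'f1': [4, 5], 'f2': [1, 0], 'f3': [2, 1]})

# Importance table in the same layout as the one built in task 11
toy_importance = pd.DataFrame({'feature': ['f1', 'f2', 'f3'],
                               'importance': [120, 3, 10]})

# Keep features with importance of at least 10 (drop those below 10)
keep = toy_importance.loc[toy_importance['importance'] >= 10, 'feature'].tolist()
toy_train_imp = toy_train[keep]
toy_val_imp = toy_val[keep]
print(keep)
```

Splitting a filtered copy of `X` again with the same `random_state` is expected to yield the same rows, so either route should lead to the same reduced sets.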

In [16]:
X_imp_only = X[lgbm_imps[lgbm_imps['importance'] >= 10]['feature']]

imp_X_train, imp_X_val_test, imp_y_train, imp_y_val_test = train_test_split(X_imp_only, y, test_size=0.4, random_state=13)
imp_X_val, imp_X_test, imp_y_val, imp_y_test = train_test_split(imp_X_val_test, imp_y_val_test, test_size=0.5, random_state=13)

lgbm_params = {
    'objective': 'regression',
    'n_estimators' : 5000,
    'learning_rate' : 0.001,
    'max_depth': 3,
    'random_state': 13,
    'lambda_l2' : 1.0,
    'early_stopping_rounds' : 500,
    'boosting_type' : 'goss'
}

lgbm, rmse_lgbm = run_boosting(imp_X_train, imp_y_train, imp_X_val, imp_y_val, lgbm_params, boosting_type='lgbm', evaluate=True, return_rmse=True)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001737 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 7224
[LightGBM] [Info] Number of data points in the train set: 23786, number of used features: 40
[LightGBM] [Info] Using GOSS
[LightGBM] [Info] Start training from score 3438.557933
Training until validation scores don't improve for 500 rounds
Early stopping, best iteration is:
[1705]	valid_0's rmse: 8421.34	valid_0's l2: 7.0919e+07
RMSE: 8421.34343


## 13

Let's move to CatBoost. We will work with `CatBoostRegressor`.

Info about `CatBoostRegressor`: https://catboost.ai/docs/concepts/python-reference_catboostregressor.html

CatBoost parameters: https://catboost.ai/docs/concepts/python-reference_parameters-list.html Look through this list so that you understand which parameters are available in the library.

Take `CatBoostRegressor` with the following parameters, similar to the previous models:

* `loss_function='RMSE'`
* `iterations=200`
* `learning_rate=0.01`
* `max_depth=5`
* `random_state=13`
* other default parameter values

Train it on the training data with `eval_set=[(X_val, y_val)]`, `early_stopping_rounds=50` and all other default parameter values. 

**q13:** Calculate RMSE on the validation set. What is it equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Notice the speed of the algorithm and compare it to the speed of XGBoost and LightGBM models.

In [17]:
# your code here
cat_params = {
    'loss_function' : 'RMSE',
    'iterations' : 200,
    'learning_rate' : 0.01,
    'max_depth': 5,
    'random_state': 13,
    'early_stopping_rounds' : 50,
}

cat = run_boosting(X_train, y_train, X_val, y_val, cat_params, boosting_type='cat', evaluate=True)

0:	learn: 13277.2870054	test: 8526.5680566	best: 8526.5680566 (0)	total: 49.3ms	remaining: 9.8s
1:	learn: 13272.4810810	test: 8525.0371870	best: 8525.0371870 (1)	total: 51.7ms	remaining: 5.11s
2:	learn: 13267.5177449	test: 8523.3886374	best: 8523.3886374 (2)	total: 53.9ms	remaining: 3.54s
3:	learn: 13263.2952731	test: 8522.1582616	best: 8522.1582616 (3)	total: 56.1ms	remaining: 2.75s
4:	learn: 13257.2202531	test: 8520.3532273	best: 8520.3532273 (4)	total: 58.2ms	remaining: 2.27s
5:	learn: 13252.7543910	test: 8518.9990683	best: 8518.9990683 (5)	total: 60.4ms	remaining: 1.95s
6:	learn: 13249.2045770	test: 8519.1106061	best: 8518.9990683 (5)	total: 62.5ms	remaining: 1.72s
7:	learn: 13243.4458393	test: 8517.5087457	best: 8517.5087457 (7)	total: 64.6ms	remaining: 1.55s
8:	learn: 13238.6918845	test: 8516.4245434	best: 8516.4245434 (8)	total: 66.7ms	remaining: 1.41s
9:	learn: 13234.2011718	test: 8515.2163314	best: 8515.2163314 (9)	total: 68.7ms	remaining: 1.31s
10:	learn: 13230.5141530	test: 

## 14

Notes on parameter tuning: https://catboost.ai/docs/concepts/parameter-tuning.html

Here, we tuned some parameters of the algorithm. Take `CatBoostRegressor` with the following parameters:

* `loss_function='RMSE'`
* `n_estimators=5000`
* `learning_rate=0.001`
* `max_depth=9`
* `random_state=13`
* all other default parameter values

Train it in the same manner as in task 13, but with `early_stopping_rounds=500`.

**q14:** Calculate RMSE on the validation set. What is it equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [18]:
cat_params = {
    'loss_function' : 'RMSE',
    'iterations' : 5000,
    'learning_rate' : 0.001,
    'max_depth': 9,
    'random_state': 13,
    'early_stopping_rounds' : 500,
}

cat, rmse_cat = run_boosting(X_train, y_train, X_val, y_val, cat_params, boosting_type='cat', evaluate=True, return_rmse=True)

0:	learn: 13280.9705594	test: 8527.7673907	best: 8527.7673907 (0)	total: 16.2ms	remaining: 1m 20s
1:	learn: 13280.1852114	test: 8527.6115169	best: 8527.6115169 (1)	total: 28.1ms	remaining: 1m 10s
2:	learn: 13279.3477587	test: 8527.4365512	best: 8527.4365512 (2)	total: 39.6ms	remaining: 1m 5s
3:	learn: 13278.4673795	test: 8527.1955417	best: 8527.1955417 (3)	total: 50.9ms	remaining: 1m 3s
4:	learn: 13277.6056592	test: 8527.0534633	best: 8527.0534633 (4)	total: 63.1ms	remaining: 1m 3s
5:	learn: 13276.7707409	test: 8526.9439818	best: 8526.9439818 (5)	total: 73.4ms	remaining: 1m 1s
6:	learn: 13276.2226780	test: 8526.8093958	best: 8526.8093958 (6)	total: 84.8ms	remaining: 1m
7:	learn: 13275.3238861	test: 8526.6313357	best: 8526.6313357 (7)	total: 97.1ms	remaining: 1m
8:	learn: 13274.8201836	test: 8526.4891028	best: 8526.4891028 (8)	total: 108ms	remaining: 59.9s
9:	learn: 13273.9471276	test: 8526.3315744	best: 8526.3315744 (9)	total: 120ms	remaining: 59.7s
10:	learn: 13273.4085069	test: 8526.

## 15

Calculate feature importances according to the model trained in task 14.

**q15:** What is the name of the most important feature? Provide it as the answer. 

Do you understand why it might be important for the model?

Notice that in the case of regression, `CatBoostRegressor` calculates feature importance using PredictionValuesChange: https://catboost.ai/docs/concepts/fstr.html#fstr__regular-feature-importance

In [19]:
imps_cat = create_feature_importance_df(cat)
imps_cat.head(1)

||feature|importance|
|-|-|-|
|25|kw_avg_avg|9.646792|


## 16

Finally, take a `Lasso` model from `sklearn` with `alpha=10.0`, `random_state=13` and all other default parameter values. Train it on the train set. 

**q16:** Calculate RMSE on the validation set. What is it equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [20]:
lasso = Lasso(alpha=10, random_state=13)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_val)

rmse_lasso = np.sqrt(mean_squared_error(y_val, y_pred))

np.round(rmse_lasso, 5)

8426.97894

## 17

Compare the results on the validation set of the trained models:

* XGBoost (task 7)
* LightGBM (task 12)
* CatBoost (task 14)
* Lasso (task 16)

**q17:** Which model has the best RMSE value on the validation set? For the answer, provide the following:

* 1 (if XGBoost was the best)
* 2 (if LightGBM was the best)
* 3 (if CatBoost was the best)
* 4 (if Lasso was the best)

In [21]:
models_scores = {
    'xgbm': rmse_xgbm,
    'lgbm': rmse_lgbm,
    'cat': rmse_cat,
    'lasso': rmse_lasso,
}

print(models_scores)

{'xgbm': 8480.201261510541, 'lgbm': 8421.343429919192, 'cat': 8465.440374777618, 'lasso': 8426.978935188357}
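
To turn these scores into the numeric answer, one option is to take the key with the smallest RMSE and map it to the codes listed in the question. The sketch below copies the scores from the output above; `recorded_scores` and `answer_codes` are names introduced here for illustration.

```python
# Validation RMSE values copied from the printed dictionary
recorded_scores = {
    'xgbm': 8480.201261510541,
    'lgbm': 8421.343429919192,
    'cat': 8465.440374777618,
    'lasso': 8426.978935188357,
}

# Answer codes as listed in q17
answer_codes = {'xgbm': 1, 'lgbm': 2, 'cat': 3, 'lasso': 4}

# Lower RMSE is better, so pick the key with the minimum value
best_model = min(recorded_scores, key=recorded_scores.get)
print(best_model, answer_codes[best_model])
```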


## 18

Finally, let's move to blending the models that we obtained. First, calculate the predictions of the trained models on the validation set. Remember that the LightGBM model used a slightly different set of columns.

After getting the predictions for the validation set, concatenate them into a single dataframe `X_val_blend`. The dataframe should look like this:

||xgb|lgb|cb|lasso|
|-|-|-|-|-|
|0|2298.947754|3728.088336|3680.924182|4270.039931|
|1|3208.189209|5243.744431|4487.549790|6755.853939|
|...|...|...|...|...|

Here, `xgb` column represents XGBoost predictions, `lgb` - LightGBM predictions, `cb` - CatBoost predictions, `lasso` - lasso predictions.

**q18:** For the answer, calculate the mean value of all model predictions in the last row of this dataframe (`X_val_blend.iloc[-1]`). Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [22]:
X_val_blend = pd.DataFrame({'xgb': xgbm.predict(X_val),
                            'lgb': lgbm.predict(imp_X_val),
                            'cb': cat.predict(X_val),
                            'lasso': lasso.predict(X_val)})

np.round(X_val_blend.iloc[-1].mean(), 5)



3285.35514

## 19

Obtain a matrix of pairwise Pearson Correlation Coefficient (PCC) values for the columns of the dataframe `X_val_blend`. Find the pair of model predictions with the highest PCC value (don't consider the 1.0 correlations of columns with themselves).

**q19:** What is this value equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [23]:
np.round(X_val_blend.corr(),5)

||xgb|lgb|cb|lasso|
|-|-|-|-|-|
|xgb|1.0|0.76973|0.7956|0.58879|
|lgb|0.76973|1.0|0.84346|0.76123|
|cb|0.7956|0.84346|1.0|0.7007|
|lasso|0.58879|0.76123|0.7007|1.0|
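
Rather than reading the largest off-diagonal value by eye, it can be extracted programmatically. The sketch below copies the rounded matrix from the output above into a NumPy array and masks the diagonal with `np.triu`; it assumes at least one off-diagonal correlation is positive, since the masked entries are set to zero.

```python
import numpy as np

names = ['xgb', 'lgb', 'cb', 'lasso']

# Rounded correlation matrix copied from the output above
corr = np.array([
    [1.0,     0.76973, 0.7956,  0.58879],
    [0.76973, 1.0,     0.84346, 0.76123],
    [0.7956,  0.84346, 1.0,     0.7007],
    [0.58879, 0.76123, 0.7007,  1.0],
])

# Keep only the strict upper triangle so the 1.0 diagonal is ignored
upper = np.triu(corr, k=1)
i, j = np.unravel_index(np.argmax(upper), upper.shape)
best_pair, best_value = (names[i], names[j]), float(corr[i, j])
print(best_pair, best_value)
```

Highly correlated predictions carry similar information, so such a pair tends to add less to a blend than a pair of weakly correlated models.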


## 20

Blend the models into an ensemble with the weights 0.25, 0.25, 0.25 and 0.25 (that is, the plain mean of the predictions).

**q20:** Calculate RMSE of such ensemble on the validation set. What is it equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Compare it with RMSE of each model and think whether this is a good ensemble.
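
In symbols, the blend and its error can be written as follows. The notation is introduced here for illustration and is not taken from the assignment: $\hat{y}^{(m)}_j$ is the prediction of model $m$ for validation object $j$, $w_m$ is the weight of model $m$, and $n$ is the number of validation objects.

$$
\hat{y}^{\text{blend}}_j = \sum_{m=1}^{4} w_m \, \hat{y}^{(m)}_j, \qquad \sum_{m=1}^{4} w_m = 1, \qquad
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left( y_j - \hat{y}^{\text{blend}}_j \right)^2}
$$

With $w_m = 0.25$ for every model, $\hat{y}^{\text{blend}}_j$ is the row-wise mean of the four prediction columns.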

In [24]:
X_val_blend['mean_pred'] = X_val_blend.mean(axis=1)

rmse = np.sqrt(mean_squared_error(y_val, X_val_blend['mean_pred']))

np.round(rmse, 5)

8423.16564

## 21

Tune the weights of the ensemble. Run each model weight through `np.linspace(0, 1, 101)`, so that the possible values of each weight are [0.0, 0.01, 0.02, ..., 0.99, 1.0]. Skip every combination of weights whose sum is not equal to 1.0. For each combination whose sum is equal to 1.0, get the ensemble prediction on the validation set using these weights and calculate the RMSE value.

In the end, select a combination of weights with the best RMSE value - these are the best weights for the ensemble. 

**q21:** What is their corresponding RMSE value equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Compare RMSE value of the ensemble with RMSE values of the models in it. Is the ensemble better?

_Hint. You probably want to save the RMSE with the corresponding weights for each valid combination into some array. This weight tuning can be implemented as a quadruple nested loop, or you may think of other ways to implement it. You can track tuning progress using the `tqdm` module._

In [43]:
weights = np.linspace(0, 1, 101)
rmse_l = []

# init all weights
xgb_weights, lgb_weights, cb_weights, lasso_weights = np.meshgrid(weights, weights, weights, weights)

# to one array
valid_weights = np.vstack((xgb_weights.flatten(), 
                           lgb_weights.flatten(), 
                           cb_weights.flatten(), 
                           lasso_weights.flatten())).T

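# keep only weight combinations that sum to 1
# caution: exact float equality may drop some valid combinations;
# np.isclose(valid_weights.sum(axis=1), 1.0) is a more tolerant test
valid_weights = valid_weights[valid_weights.sum(axis=1) == 1]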

# iterating over weights to calc predictions
for w in tqdm(valid_weights):
    X_val_blend['weighted_pred'] = X_val_blend['xgb'] * w[0] + \
                                   X_val_blend['lgb'] * w[1] + \
                                   X_val_blend['cb'] * w[2] + \
                                   X_val_blend['lasso'] * w[3]
    
    rmse = np.sqrt(mean_squared_error(y_val, X_val_blend['weighted_pred']))
    rmse_l.append([w[0], w[1], w[2], w[3], rmse])

100%|██████████| 167002/167002 [03:52<00:00, 718.83it/s]


In [47]:
best_weights = sorted(rmse_l, key=lambda x: x[4])[0]
print(f"best weights: {best_weights[:-1]}\nbest rmse: {best_weights[-1]}")

best weights: [0.0, 0.54, 0.0, 0.46]
best rmse: 8404.851139799262
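
For four weights on a 0.01 grid there are 176851 non-negative combinations that sum to exactly 1, while the progress bar above shows 167002 iterations, so the exact `== 1` comparison on floats seems to skip some of them. A possible alternative, sketched below under that assumption, enumerates the weights as integers (hundredths) so the sum constraint is exact and the full 101^4 grid is never materialised. The function `search_blend_weights` and the toy data are hypothetical, and the sketch has not been run on the assignment data.

```python
import itertools
import numpy as np

def search_blend_weights(preds, y_true, steps=100):
    """Try every weight vector on a grid with spacing 1/steps that sums to 1."""
    preds = np.asarray(preds, dtype=float)      # shape (n_samples, n_models)
    y_true = np.asarray(y_true, dtype=float)
    n_models = preds.shape[1]
    best_rmse, best_w, n_tried = np.inf, None, 0
    # Enumerate integer parts for all models but the last; the last gets the remainder
    for parts in itertools.product(range(steps + 1), repeat=n_models - 1):
        last = steps - sum(parts)
        if last < 0:
            continue  # the first weights already exceed 1 in total
        w = np.array(parts + (last,)) / steps
        rmse = np.sqrt(np.mean((y_true - preds @ w) ** 2))
        n_tried += 1
        if rmse < best_rmse:
            best_rmse, best_w = rmse, w
    return best_w, float(best_rmse), n_tried

# Toy data: the target is an exact 0.3 / 0.7 mix of two of the four columns
rng = np.random.default_rng(13)
toy_preds = rng.normal(size=(50, 4))
toy_y = 0.3 * toy_preds[:, 1] + 0.7 * toy_preds[:, 3]
toy_w, toy_rmse, toy_n = search_blend_weights(toy_preds, toy_y, steps=10)
print(toy_w, toy_n)
```

On the assignment data the call would presumably look like `search_blend_weights(X_val_blend[['xgb', 'lgb', 'cb', 'lasso']].to_numpy(), y_val)`. Its result may differ from the one recorded above if any of the skipped combinations turns out to be better.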


## 22

Using the best weights obtained in task 21, run the best ensemble on the test set. To do this, obtain model predictions on the test set (you can write them to a table similar to the one for the validation set in task 18). Remember that the LightGBM model uses a slightly different set of columns.

**q22:** Calculate RMSE of the final ensemble on the test set. What is it equal to? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [45]:
X_test_blend = pd.DataFrame(
    {'xgb': xgbm.predict(X_test),
     'lgb': lgbm.predict(imp_X_test),
     'cb': cat.predict(X_test),
     'lasso': lasso.predict(X_test)
    })

X_test_blend['weighted_pred'] = best_weights[0] * X_test_blend['xgb'] + best_weights[1] * X_test_blend['lgb'] + best_weights[2] * X_test_blend['cb'] + best_weights[3] * X_test_blend['lasso']

rmse = np.sqrt(mean_squared_error(y_test, X_test_blend['weighted_pred']))

np.round(rmse, 5)



8446.10361