# Final project for "How to win a data science competition" Coursera course

This is a notebook for the final project of the Coursera course ["How to win a data science competition"](https://www.coursera.org/learn/competitive-data-science).
The course is the second course in the ["Advanced Machine Learning Specialization"](https://www.coursera.org/specializations/aml) organized by HSE University.
This notebook was written in February 2021.
I did not originally intend to make this notebook public.
However, I decided to publish it.

Leaderboard Score.  
(Public): 0.882002 (1089/13338; Top 9%)  
(Private): 0.877179

The competition is still active; therefore, the final score is not yet known.
The private score is not supposed to be disclosed; however, the Coursera grader reports it as well.

# Summary

First, I constructed 18 LightGBM models of 5 types (A, B, C, D, and E) for the first layer of the stacking, as follows:

Type | # of models | Lag features | Blocks used for training | # of leaves in LightGBM models | Mean target encoding |
---- | ----------- | ------------ | ------------------------ | ------------------------------ | -------------------- |
A    |           3 | 1, 2, 3, 12  | 12 ~ 33 (22 blocks)      | $2^3$, $2^4$, $2^5$            | Yes (without interaction terms) |
B    |           3 | 1, 2, 3      | 3 ~ 33 (31 blocks)       | $2^3$, $2^4$, $2^5$            | Yes (without interaction terms) |
C    |           4 | 1, 2, 3, 12  | 12 ~ 33 (22 blocks)      | $2^2$, $2^3$, $2^4$, $2^5$     | Yes (with interaction terms) |
D    |           4 | 1, 2, 3      | 3 ~ 33 (31 blocks)       | $2^2$, $2^3$, $2^4$, $2^5$     | Yes (with interaction terms) |
E    |           4 | 1            | 1 ~ 33 (33 blocks)       | $2^2$, $2^3$, $2^4$, $2^5$     | Yes (with interaction terms) |

For example, type 'A' uses lag features of 1, 2, 3, and 12 months together with mean target encodings of categorical variables (without interaction terms).
There are three models of type 'A', with $2^3$, $2^4$, and $2^5$ leaves, respectively.

Next, I used Lasso and Ridge linear regressions for the second layer of the stacking and chose the one (i.e., Lasso) that gave the better internal score.
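
To illustrate the second layer, here is a minimal sketch on synthetic data, assuming the first-layer predictions are stacked as columns of a matrix. The data, the noise levels, and the names `preds` and `stacker` are made up for illustration; this is not the notebook's actual second layer. It fits a non-negative `LassoCV` with a single predefined train/validation split, where the split is given as positional row indices.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n_tr, n_va = 200, 50

# Synthetic target and three made-up "first-layer predictions" (target plus noise).
y = rng.uniform(0, 20, size=n_tr + n_va)
preds = np.column_stack([y + rng.normal(0, s, size=y.size) for s in (0.5, 1.0, 2.0)])

# One predefined split: rows 0..n_tr-1 for training, the remaining rows for validation.
id_tr = np.arange(n_tr)
id_va = np.arange(n_tr, n_tr + n_va)

# positive=True keeps the stacking weights non-negative.
stacker = LassoCV(cv=[(id_tr, id_va)], positive=True).fit(preds, y)
print(stacker.coef_)
```

The regularization strength is chosen on the validation rows, and the final model is then refit on all rows passed to `fit`.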

For the features, in addition to the lag features, I used:
- TF-IDF for 'item_name',
- Label encodings for 'shop_name' and 'item_category_name',
- Days elapsed from the first day on sale for each item.

In particular, the last feature ('prev_days_on_sale') turned out to be very useful.

I created the following validation scheme:
for blocks $n = \{ \text{start_block}, \text{start_block} + 1, \cdots, \text{target_block} \}$, we
- use the blocks $n = \{ \text{start_block}, \text{start_block} + 1, \cdots, \text{target_block} - 2 \}$ for the training data,
- use the block $n = \{ \text{target_block} - 1 \}$ for the validation (in particular, for early stopping),
- and predict for the block $n = \{ \text{target_block} \}$.

Note that in the `train` function defined later, the argument `start_block` denotes the first block to be predicted, so the earliest block used for training there is `start_block - 2`.

Recall that $n = 34$ is the block for the submission.
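
The scheme above can be sketched as a small helper that lists which blocks play which role for one target block. The function `split_blocks` and its argument names are hypothetical and only serve as an illustration; they are not part of the notebook's code.

```python
def split_blocks(first_train_block, target_block):
    """Return (training blocks, validation block, predicted block) for one target block."""
    # Training uses every block from first_train_block up to target_block - 2.
    train_blocks = list(range(first_train_block, target_block - 1))
    # The block right before the target is held out for validation (early stopping).
    valid_block = target_block - 1
    return train_blocks, valid_block, target_block

train_blocks, valid_block, predicted_block = split_blocks(first_train_block=12, target_block=34)
print(train_blocks[0], train_blocks[-1], valid_block, predicted_block)
```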

I used the [shap](https://shap.readthedocs.io) module to explain the model output.

# Importing Modules

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import datetime
import xgboost as xgb # (eventually not used)
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
np.random.seed(seed = 2021)
from itertools import product
import shap
from tqdm import tqdm
import copy
import gc

In [None]:
!pip install pickle5
import pickle5 as pickle

In [None]:
print(np.__version__)
print(pd.__version__)
print(sns.__version__)
print(lgb.__version__)
print(xgb.__version__)

In [None]:
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 160)

# Loading Data

First, I load and compress the data; it is about 615.5+ MB with 214,199 rows × 13 columns.

```python
sales_train = pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv')
item_categories = pd.read_csv('../input/competitive-data-science-predict-future-sales/item_categories.csv')
items = pd.read_csv('../input/competitive-data-science-predict-future-sales/items.csv')
shops = pd.read_csv('../input/competitive-data-science-predict-future-sales/shops.csv')
test = pd.read_csv('../input/competitive-data-science-predict-future-sales/test.csv')
sample_submission = pd.read_csv('../input/competitive-data-science-predict-future-sales/sample_submission.csv')
```

```python
data_train = sales_train.copy()
data_train['item_sales'] = data_train['item_price'] * data_train['item_cnt_day']
data_train = data_train.groupby(['date_block_num', 'shop_id', 'item_id']).agg(
    item_count = ('item_cnt_day', np.sum),
    item_price = ('item_price', np.mean),
    item_sales = ('item_sales', np.sum)
)
data_train = data_train.reset_index()
```

```python
grid = []
for block_num in np.arange(34):
    cur_shops = sales_train[sales_train['date_block_num']==block_num]['shop_id'].unique()
    cur_items = sales_train[sales_train['date_block_num']==block_num]['item_id'].unique()
    grid.append(np.array(list(product(*[[block_num], cur_shops, cur_items])),dtype='int32'))
grid = pd.DataFrame(np.vstack(grid), columns = ['date_block_num', 'shop_id', 'item_id'],dtype=np.int32)
```
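
For a single block, the grid built above is the Cartesian product of the shops and the items that appear in that block. A toy sketch with made-up IDs (the names `grid_block`, `cur_shops`, and `cur_items` are used here only for illustration):

```python
from itertools import product

block_num = 0
cur_shops = [10, 11]         # shops with at least one sale in this block (made-up IDs)
cur_items = [100, 101, 102]  # items with at least one sale in this block (made-up IDs)

# Every (block, shop, item) combination, including pairs without a recorded sale.
grid_block = list(product([block_num], cur_shops, cur_items))
print(len(grid_block))
print(grid_block[0], grid_block[-1])
```

Pairs without a recorded sale later receive an item count of zero when the grid is merged with the aggregated sales.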

```python
data_train = pd.merge(data_train, grid, on = ['date_block_num', 'shop_id', 'item_id'], how = 'outer').sort_values(['date_block_num', 'shop_id', 'item_id'])
data_train['item_count'] = data_train['item_count'].fillna(0.0).astype(np.int32)
data_train['item_sales'] = data_train['item_sales'].fillna(0.0).astype(np.float32)
data_train['target'] = data_train['item_count'].fillna(0.0).clip(0, 20).astype(np.float32)
data_train['year'] = data_train['date_block_num'] // 12 + 2013
data_train['month'] = data_train['date_block_num'] % 12 + 1
```

```python
data_train = pd.merge(data_train, items, on = 'item_id', how = 'left')
data_train = pd.merge(data_train, item_categories, on = 'item_category_id', how = 'left')
data_train = pd.merge(data_train, shops, on = 'shop_id', how = 'left')
```

```python
data_test = pd.merge(test, items, on = 'item_id', how = 'left')
data_test = pd.merge(data_test, item_categories, on = 'item_category_id', how = 'left')
data_test = pd.merge(data_test, shops, on = 'shop_id', how = 'left')
data_test = data_test.set_index('ID')
data_test['date_block_num'] = 34
data_test['year'] = 2015
data_test['month'] = 11
```

```python
data = pd.concat([data_train, data_test])
```

```python
def compress_data(data):
    for column in data.columns:
        if column.startswith('word'):
            # Note: casting to int8 truncates fractional values (such as TF-IDF weights) toward zero.
            data[column] = data[column].astype('int8')
        if column.startswith('prev_days'):
            data[column] = data[column].astype('int16')
        if column.startswith('prev_blocks'):
            data[column] = data[column].astype('int8')
        if column in ['date_block_num', 'month']:
            data[column] = data[column].astype('int8')
        if column in ['year']:
            data[column] = data[column].astype('int16')
        if column.endswith('id'):
            data[column] = data[column].astype('int16') 
        if data[column].dtype == 'int64':
            data[column] = data[column].astype('int32')
        if data[column].dtype == 'float64':
            data[column] = data[column].astype('float32')
    return data
```

```python
data = compress_data(data)
```
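
As a rough illustration of why downcasting helps, the toy example below compares the memory footprint of a small frame before and after casting its columns to narrower types. The frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'shop_id': np.arange(1000, dtype='int64'),
    'price': np.ones(1000, dtype='float64'),
})
before = df.memory_usage(index=False).sum()

# The values fit into the narrower types, so nothing is lost here.
df['shop_id'] = df['shop_id'].astype('int16')
df['price'] = df['price'].astype('float32')
after = df.memory_usage(index=False).sum()

print(before, after)
```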

# Feature Engineering

I add some features to the dataset.

```python
data2 = data.copy()
```

## Lags

I add lagged monthly statistics (item count, sales, and price) for each (item, shop) pair.

```python
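def shift_feature(data, groupby_columns, lags, features, default_value = None):
    # Adds lagged copies of `features` as new columns named '<feature>_<lag>'.
    # Note: `data` is modified in place, and values are shifted by rows within each group,
    # so the rows of each group must be sorted by block.
    data_tmp = data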
    for lag in lags:
        column_names = [feature + '_' + str(lag) for feature in features]
        data_tmp[column_names] = data.groupby(groupby_columns)[features].shift(periods = lag)
        if default_value is not None:
            data_tmp[column_names] = data_tmp[column_names].fillna(default_value)
    return data_tmp
```

```python
data2 = shift_feature(data2, groupby_columns = ['item_id', 'shop_id'], lags = [1, 2, 3, 12], features = ['item_count', 'item_sales'], default_value = 0.0)
data2 = shift_feature(data2, groupby_columns = ['item_id', 'shop_id'], lags = [1, 2, 3, 12], features = ['item_price'], default_value = 0.0)
```
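
A small, self-contained illustration of how `groupby(...).shift(...)` produces a lag feature. The toy frame is made up for illustration. Note that `shift` moves values by rows within each group, so the result equals a true one-month lag only when each (item, shop) pair has one row per consecutive month and the rows are sorted by month.

```python
import pandas as pd

toy = pd.DataFrame({
    'date_block_num': [0, 1, 2, 0, 1, 2],
    'shop_id':        [5, 5, 5, 6, 6, 6],
    'item_id':        [7, 7, 7, 7, 7, 7],
    'item_count':     [3, 1, 4, 0, 2, 5],
}).sort_values(['shop_id', 'item_id', 'date_block_num'])

# Lag 1: the previous row of the same (item, shop) pair; missing history becomes 0.
toy['item_count_1'] = toy.groupby(['item_id', 'shop_id'])['item_count'].shift(1).fillna(0.0)
print(toy[['date_block_num', 'shop_id', 'item_count', 'item_count_1']])
```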

## TF-IDF for item names

I add 50 TF-IDF features for the item names.

```python
tfidf = TfidfVectorizer(max_features=50)
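# Name the columns by position so that this does not depend on the vectorizer's feature-name API.
tfidf_matrix = tfidf.fit_transform(items['item_name']).toarray()
items2 = pd.DataFrame(tfidf_matrix, columns = [f'word_{i}' for i in range(tfidf_matrix.shape[1])])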
items2.index.name = 'item_id'
items2 = items2.reset_index()
data2 = pd.merge(data2, items2, on = 'item_id', how = 'left')
data2 = compress_data(data2)
```
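
As a toy illustration of what `max_features` does, the sketch below vectorizes three made-up item names and keeps only the two most frequent terms. The names and variables are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

names = ['red ball', 'blue ball', 'red kite']  # made-up item names
vectorizer = TfidfVectorizer(max_features=2)    # keep only the 2 most frequent terms
matrix = vectorizer.fit_transform(names).toarray()

print(sorted(vectorizer.vocabulary_))
print(matrix.shape)
```

Each row of `matrix` holds one weight per kept term, so every item name is mapped to a fixed-length numeric vector.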

## Label encoding for 'shop_name' and 'item_category_name'

I followed this notebook: https://www.kaggle.com/homiarafarhana/predict-future-sales#Feature-engineering-and-data-cleaning.

```python
data2.loc[data2['shop_name'] == 'Сергиев Посад ТЦ "7Я"', 'shop_name'] = 'СергиевПосад ТЦ "7Я"'
data2['city'] = data2['shop_name'].str.split(' ').map(lambda x: x[0])
data2.loc[data2['city'] == '!Якутск', 'city'] = 'Якутск'
data2['city_id'] = LabelEncoder().fit_transform(data2['city'])
```

```python
data2['split'] = data2['item_category_name'].str.split('-')
data2['item_category_name1'] = data2['split'].map(lambda x: x[0].strip())
data2['item_category_id1'] = LabelEncoder().fit_transform(data2['item_category_name1'])
data2['item_category_name2'] = data2['split'].map(lambda x: x[1].strip() if len(x) > 1 else x[0].strip())
data2['item_category_id2'] = LabelEncoder().fit_transform(data2['item_category_name2'])
data2.drop(['split'], axis=1, inplace = True)
```
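
The category split above can be tried on a few made-up category names. This is a minimal sketch; the names are invented and the variables `main` and `sub` are used only for illustration.

```python
import pandas as pd

names = pd.Series(['Games - PS4', 'Books - Comics', 'Delivery'])  # made-up category names
parts = names.str.split('-')

# Main category: the part before the first '-'.
main = parts.map(lambda x: x[0].strip())
# Subcategory: the part after the first '-', falling back to the main part if there is none.
sub = parts.map(lambda x: x[1].strip() if len(x) > 1 else x[0].strip())

print(main.tolist())
print(sub.tolist())
```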

## First day on sale for each item

I followed this notebook: https://www.kaggle.com/sushmaguntupalli/predict-future-sales-light-gbm-top-1#2.-Preparing-Training-Dataset-&-Feature-Engineering.

```python
tmp = data2[['year', 'month']].copy()
tmp['day'] = 1
tmp['day_of_year'] = pd.to_datetime(tmp[['year', 'month', 'day']]).dt.dayofyear
tmp['first_day_of_month'] = 365 * (tmp['year'] - 2013) + tmp['day_of_year']
data2['first_day_of_month'] = tmp['first_day_of_month']
```

```python
sales = sales_train.copy()
sales['date'] = pd.to_datetime(sales.date,format='%d.%m.%Y')
sales['weekday'] = sales.date.dt.dayofweek
sales['day'] = sales.date.dt.dayofyear 
sales['day'] += 365 * (sales.date.dt.year-2013)
```

```python
data2 = data2.reset_index()
data2 = pd.merge(
    data2,
    sales.groupby('item_id').agg(
        first_sale_day__item = ('day', np.min),
        first_sale_block__item = ('date_block_num', np.min),
    ).reset_index(),
    on = ['item_id'],
    how = 'left'
)
data2 = pd.merge(
    data2,
    sales.groupby(['item_id', 'shop_id']).agg(
        first_sale_day__item_shop = ('day', np.min),
        first_sale_block__item_shop = ('date_block_num', np.min),
    ).reset_index(),
    on = ['item_id', 'shop_id'],
    how = 'left'
)
```

```python
data2['prev_days_on_sale__item'] = (data2['first_day_of_month'] - data2['first_sale_day__item']).fillna(0).clip(lower = 0).astype(np.int16)
data2['prev_blocks_on_sale__item'] = (data2['date_block_num'] - data2['first_sale_block__item']).fillna(0).clip(lower = 0).astype(np.int8)
data2['prev_days_on_sale__item_shop'] = (data2['first_day_of_month'] - data2['first_sale_day__item_shop']).fillna(0).clip(lower = 0).astype(np.int16)
data2['prev_blocks_on_sale__item_shop'] = (data2['date_block_num'] - data2['first_sale_block__item_shop']).fillna(0).clip(lower = 0).astype(np.int8)
```
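
The `prev_days_on_sale__*` features computed above all follow the same pattern, which can be illustrated on toy data: subtract the day of an item's first sale from the first day of the row's month and clip negative gaps to zero. The frames and the day numbers below are made up for illustration.

```python
import pandas as pd

# Toy data: day numbers are counted from an arbitrary origin (assumption for illustration).
rows = pd.DataFrame({
    'item_id':            [1, 1, 2],
    'first_day_of_month': [1, 32, 32],  # first day of the month each row belongs to
})
first_sale = pd.DataFrame({
    'item_id':        [1, 2],
    'first_sale_day': [10, 40],         # first day on which each item was sold
})

rows = rows.merge(first_sale, on='item_id', how='left')
# Days the item had already been on sale when the month started; negative gaps become 0.
rows['prev_days_on_sale'] = (rows['first_day_of_month'] - rows['first_sale_day']).clip(lower=0)
print(rows['prev_days_on_sale'].tolist())
```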

```python
data2['first_sale_day__item_city'] = data2.groupby(['item_id', 'city_id'])['first_sale_day__item'].transform(np.min)
data2['first_sale_block__item_city'] = data2.groupby(['item_id', 'city_id'])['first_sale_block__item'].transform(np.min)
data2['prev_days_on_sale__item_city'] = (data2['first_day_of_month'] - data2['first_sale_day__item_city']).fillna(0).clip(lower = 0).astype(np.int16)
data2['prev_blocks_on_sale__item_city'] = (data2['date_block_num'] - data2['first_sale_block__item_city']).fillna(0).clip(lower = 0).astype(np.int8)
```

```python
day_quality = pd.merge(
    sales.groupby(['shop_id','weekday']).agg(
        shop_day_sales = ('item_cnt_day', np.sum)
    ).reset_index(),
    sales.groupby(['shop_id']).agg(
        shop_sales = ('item_cnt_day', np.sum)
    ).reset_index(),
    on='shop_id',
    how='left'
)
day_quality['day_quality'] = day_quality['shop_day_sales'] / day_quality['shop_sales']
```

```python
dates = pd.DataFrame(data = {'date':pd.date_range(start = '2013-01-01', end = '2015-11-30')})
dates['weekday'] = dates.date.dt.dayofweek
dates['month'] = dates.date.dt.month
dates['year'] = dates.date.dt.year - 2013
dates['date_block_num'] = dates['year'] * 12 + dates['month'] - 1
```

```python
shop_month_quality = pd.merge(
    dates,
    day_quality,
    on = ['weekday'],
    how = 'left'
)[['date_block_num', 'shop_id', 'day_quality']].groupby(['date_block_num', 'shop_id']).sum().reset_index()
```
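
The monthly feature built above sums, over all days of a month, the share of a shop's total sales that falls on each day's weekday, so months containing more high-selling weekdays get a higher value. A minimal sketch with made-up weekday shares for one shop (the dictionary `share` and the function `month_quality` are hypothetical names):

```python
import calendar

# Made-up weekday shares for one shop: 0 = Monday ... 6 = Sunday; the shares sum to 1.
share = {0: 0.10, 1: 0.10, 2: 0.10, 3: 0.10, 4: 0.15, 5: 0.25, 6: 0.20}

def month_quality(year, month):
    """Sum the weekday share over every day of the given month."""
    first_weekday, n_days = calendar.monthrange(year, month)
    return sum(share[(first_weekday + i) % 7] for i in range(n_days))

# February 2015 contains each weekday exactly four times.
print(round(month_quality(2015, 2), 2))
# November 2015 contains five Sundays and five Mondays.
print(round(month_quality(2015, 11), 2))
```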

```python
data2 = pd.merge(
    data2,
    shop_month_quality,
    on = ['date_block_num', 'shop_id'],
    how = 'left'
)
```

```python
data2 = data2.set_index('index')
```

```python
data2 = compress_data(data2)
```

```python
gc.collect()
```

## Mean target encodings

```python
data3 = data2.copy()
```

```python
def mean_target_encoding(data, block_column, target_column, groupby_columns, default_value = None):
    index_columns = [block_column] + groupby_columns
    encoded_column = '__'.join(groupby_columns) + '__encoded'
    
    data_grouped = data[[target_column] + index_columns].fillna(default_value).groupby(index_columns).agg(
        target_sum = (target_column, np.sum),
        target_count = (target_column, np.size)
    )
    
    data_grouped['target_cumsum'] = data_grouped.groupby(groupby_columns)['target_sum'].cumsum() - data_grouped['target_sum']
    data_grouped['target_cumcount'] = data_grouped.groupby(groupby_columns)['target_count'].cumsum() - data_grouped['target_count']
    data_grouped[encoded_column] = data_grouped['target_cumsum'] / data_grouped['target_cumcount']
    
    return pd.merge(data.reset_index(), data_grouped[[encoded_column]], left_on = index_columns, right_index = True, how = 'left').fillna(default_value).set_index('index')
```
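
To make the idea concrete, here is a minimal sketch of an expanding mean target encoding on toy data: for each category and block, the encoding is the mean of the target over all strictly earlier blocks, so the target of the current block does not leak into its own feature. The frame and the column names are made up for illustration; the sketch mirrors the cumulative-sum logic of `mean_target_encoding` but is not the function itself.

```python
import pandas as pd

toy = pd.DataFrame({
    'block':  [0, 0, 1, 1, 2, 2],
    'shop':   [1, 1, 1, 1, 1, 1],
    'target': [2.0, 4.0, 6.0, 0.0, 1.0, 1.0],
})

# Per (block, shop): sum and count of the target.
g = toy.groupby(['block', 'shop'])['target'].agg(['sum', 'count'])
# Cumulative totals over earlier blocks only (the current block's contribution is subtracted).
prev_sum = g.groupby(level='shop')['sum'].cumsum() - g['sum']
prev_cnt = g.groupby(level='shop')['count'].cumsum() - g['count']
# The first block has no history (0 / 0 gives NaN), so a default of 0.0 is used.
encoded = (prev_sum / prev_cnt).fillna(0.0)
print(encoded)
```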

```python
%%time
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['prev_days_on_sale__item'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['prev_blocks_on_sale__item'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['prev_days_on_sale__item_shop'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['prev_blocks_on_sale__item_shop'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['prev_days_on_sale__item_city'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['prev_blocks_on_sale__item_city'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['month'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['item_id'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['item_category_id'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['item_category_id1'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['item_category_id2'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['shop_id'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['city_id'], default_value = 0.0)
```

Add interaction terms:

```python
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['month', 'item_id'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['month', 'item_category_id'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['month', 'shop_id'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['month', 'city_id'], default_value = 0.0)

data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['shop_id', 'item_id'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['shop_id', 'item_category_id'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['shop_id', 'item_category_id1'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['shop_id', 'item_category_id2'], default_value = 0.0)

data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['city_id', 'item_id'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['city_id', 'item_category_id'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['city_id', 'item_category_id1'], default_value = 0.0)
data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['city_id', 'item_category_id2'], default_value = 0.0)

data3 = mean_target_encoding(data3, 'date_block_num', 'item_count', ['prev_blocks_on_sale__item', 'month'], default_value = 0.0)
```

```python
data3.to_pickle('data3.pickle')
```

In [None]:
with open("../input/pfs-data3/data3.pickle", "rb") as f:
    data3 = pickle.load(f)

In [None]:
gc.collect()

In [None]:
data3.info()

In [None]:
data3.describe()

# Validation Scheme

## Helper Functions

In [None]:
class Model():
    def train(self, x_tr, y_tr, x_va, y_va):
        pass
    
    def predict(self, x_test):
        pass

In [None]:
class RidgeModel(Model):
    def __init__(self, normalize = False):
        self.normalize = normalize
    
    def train(self, x_tr, y_tr, x_va, y_va):
        x_tr = x_tr.reset_index()
        x_va = x_va.reset_index()
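        # Positional indices into the concatenated frame: training rows first, then validation rows.
        id_tr = np.arange(len(x_tr))
        id_va = np.arange(len(x_tr), len(x_tr) + len(x_va))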
        x = pd.concat([x_tr, x_va])
        y = pd.concat([y_tr, y_va])
        x = x.set_index('index')
        
        regressor = RidgeCV(cv = [(id_tr, id_va)], normalize = self.normalize).fit(x, y)
        self.regressor = regressor
        
    def predict(self, x_test): 
        pred_test = self.regressor.predict(x_test)
        return pred_test

In [None]:
class LassoModel(Model):
    def __init__(self, positive = True):
        self.positive = positive
    
    def train(self, x_tr, y_tr, x_va, y_va):
        x_tr = x_tr.reset_index()
        x_va = x_va.reset_index()
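        # Positional indices into the concatenated frame: training rows first, then validation rows.
        id_tr = np.arange(len(x_tr))
        id_va = np.arange(len(x_tr), len(x_tr) + len(x_va))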
        x = pd.concat([x_tr, x_va])
        y = pd.concat([y_tr, y_va])
        x = x.set_index('index')
        
        regressor = LassoCV(cv = [(id_tr, id_va)], n_jobs = -1, positive = self.positive).fit(x, y)
        self.regressor = regressor
        
    def predict(self, x_test): 
        pred_test = self.regressor.predict(x_test)
        return pred_test

In [None]:
class LgbModel(Model):
    def __init__(self, lgb_params, num_boost_round = 10000, early_stopping_rounds = 1000, verbose_eval = 1000):
        self.lgb_params = lgb_params
        self.num_boost_round = num_boost_round
        self.early_stopping_rounds = early_stopping_rounds
        self.verbose_eval = verbose_eval
    
    def train(self, x_tr, y_tr, x_va, y_va):
        # validation is necessary for early stopping.
        train_set = lgb.Dataset(data = x_tr, label = y_tr)
        valid_set = lgb.Dataset(data = x_va, label = y_va)
        bst = lgb.train(
            params = self.lgb_params,
            train_set = train_set,
            num_boost_round = self.num_boost_round,
            valid_sets = [valid_set, train_set],
            valid_names = ['va', 'tr'],
            early_stopping_rounds = self.early_stopping_rounds,
            verbose_eval = self.verbose_eval
        )
        self.bst = bst
        
    def predict(self, x_test): 
        pred_test = self.bst.predict(x_test, num_iteration = self.bst.best_iteration)
        return pred_test

(I decided not to use XGBoost, so the following class is not used.)

In [None]:
class XgbModel(Model):
    def __init__(self, xgb_params, num_boost_round = 100, early_stopping_rounds = 10, verbose_eval = 10):
        self.xgb_params = xgb_params
        self.num_boost_round = num_boost_round
        self.early_stopping_rounds = early_stopping_rounds
        self.verbose_eval = verbose_eval
    
    def train(self, x_tr, y_tr, x_va, y_va):
        x_tr_dmatrix = xgb.DMatrix(x_tr, label = y_tr)
        x_va_dmatrix = xgb.DMatrix(x_va, label = y_va)
        # validation is necessary for early stopping.
        bst = xgb.train(
            params = self.xgb_params,
            dtrain = x_tr_dmatrix,
            num_boost_round = self.num_boost_round,
            evals = [(x_tr_dmatrix, 'tr'), (x_va_dmatrix, 'va')],
            early_stopping_rounds = self.early_stopping_rounds,
            verbose_eval = self.verbose_eval
        )
        self.bst = bst
        
    def predict(self, x_test): 
        x_test_dmatrix = xgb.DMatrix(x_test)
        pred_test = self.bst.predict(x_test_dmatrix, ntree_limit = self.bst.best_ntree_limit)
            
        return pred_test

In [None]:
def train(data, models = {}, features = [], start_block = 14, target_block = 34, block_column = 'date_block_num', target_column = 'target'):  
    trained_models = {}
    for (model_name, model) in models.items():
        trained_models[model_name] = {}
    
    for block in tqdm(range(start_block, target_block + 1)):
        # train the model for the block with 'n == block' using blocks with 'n < block'.
        # make sure that the model does not use the block with 'n == block'.
        id_tr = (data[block_column] < block - 1) & (data[block_column] >= start_block - 2)
        id_va = (data[block_column] == block - 1)
        x_tr, y_tr = data.loc[id_tr, features], data.loc[id_tr, target_column]
        x_va, y_va = data.loc[id_va, features], data.loc[id_va, target_column]

        for (model_name, model) in models.items():
            print(f'# {block}, {model_name}')
            model.train(x_tr, y_tr, x_va, y_va)
            trained_models[model_name][block] = copy.copy(model)
    
    return trained_models

In [None]:
def test(data, trained_models = {}, features = [], target_block = 34, block_column = 'date_block_num', target_column = 'target'):  
    data_layer = data.loc[:, [block_column, target_column]].copy()
    scores = {}
    submissions = {}
    
    for (model_name, models) in tqdm(trained_models.items()):
        scores[model_name] = []
        for (block, model) in models.items():
            # predict for the block with 'n == block'.
            id_test = (data[block_column] == block)
            x_test, y_test = data.loc[id_test, features], data.loc[id_test, target_column]
            pred_test = model.predict(x_test)
            data_layer.loc[id_test, f'pred_{model_name}'] = pred_test
            if block < target_block:
                score = np.sqrt(metrics.mean_squared_error(y_true = y_test, y_pred = pred_test.clip(0, 20)))
                scores[model_name].append(score)
            else:
                submissions[model_name] = pred_test.clip(0, 20)
        scores[model_name] = pd.Series(scores[model_name], index = list(models.keys())[:-1])
        scores[model_name].name = model_name
            
    score_table = pd.concat(scores, axis = 1)
    corr_matrix = data_layer.drop(columns = block_column).corr()
    submissions = pd.DataFrame(submissions)
    
    return (data_layer, score_table, corr_matrix, submissions)

# 1st Layer

In [None]:
data_layer_1 = data3.copy()
del data3
gc.collect()

# 1st Layer: A

In [None]:
models_1A = {}

for d in [3, 4, 5]:
    lgb_params = {
        'objective': 'regression',
        'metric': ['rmse'],
        'learning_rate': 0.1,
        'num_leaves': 2 ** d,
        'min_data_in_leaf': 20,
        'max_depth': -1,
        'seed': 2021
    }
    models_1A[f'lgb_1A_d{d}'] = LgbModel(lgb_params)
    
features_1A = ['year', 'month', 'shop_id', 'item_id', 'item_category_id'] \
    + [f'item_count_{i}' for i in [1, 2, 3, 12]] \
    + [f'item_price_{i}' for i in [1, 2, 3, 12]] \
    + [f'item_sales_{i}' for i in [1, 2, 3, 12]] \
    + [f'word_{i}' for i in range(50)] \
    + ['city_id', 'item_category_id1', 'item_category_id2'] \
    + ['prev_days_on_sale__item',
       'prev_blocks_on_sale__item',
       'prev_days_on_sale__item_shop',
       'prev_blocks_on_sale__item_shop',
       'prev_days_on_sale__item_city',
       'prev_blocks_on_sale__item_city',
       'day_quality'
    ] \
    + ['prev_days_on_sale__item__encoded',
       'prev_blocks_on_sale__item__encoded',
       'prev_days_on_sale__item_shop__encoded',
       'prev_blocks_on_sale__item_shop__encoded',
       'prev_days_on_sale__item_city__encoded',
       'prev_blocks_on_sale__item_city__encoded',
       'month__encoded',
       'item_id__encoded',
       'item_category_id__encoded',
       'item_category_id1__encoded',
       'item_category_id2__encoded',
       'shop_id__encoded',
       'city_id__encoded'
    ]
len(features_1A)

```python
%%time
trained_models_1A = train(data_layer_1, models_1A, features_1A, start_block = 14, target_block = 34)
```

```python
with open('trained_models_1A.pickle', 'wb') as f:
    pickle.dump(trained_models_1A, f)
```

In [None]:
with open('../input/pfs-trained-models/PFS_trained_models_1A.pickle', 'rb') as f:
    trained_models_1A = pickle.load(f)

In [None]:
data_layer_2A, score_table_1A, corr_matrix_1A, submissions_1A = test(data_layer_1, trained_models_1A, features_1A, target_block = 34)

In [None]:
score_table_1A.boxplot()

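The correlation matrix of the target and the first-layer predictions is stored in `corr_matrix_1A`. The following cells plot the feature importances and the SHAP values of the model `lgb_1A_d4` trained for block 34.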
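
As a toy illustration of what such a correlation matrix contains, the sketch below computes the pairwise correlations between a target column and two prediction columns. The numbers are made up and are not the notebook's results.

```python
import pandas as pd

toy_layer = pd.DataFrame({
    'target': [0.0, 1.0, 2.0, 3.0],
    'pred_a': [0.1, 0.9, 2.2, 2.8],  # close to the target
    'pred_b': [3.0, 2.0, 1.0, 0.0],  # exactly reversed
})

# Pearson correlation between every pair of columns.
corr = toy_layer.corr()
print(corr.round(3))
```

Highly correlated first-layer predictions add little diversity to a stack, which is one reason to inspect this matrix.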

In [None]:
lgb.plot_importance(trained_models_1A[f'lgb_1A_d4'][34].bst, figsize = (12, 12), importance_type = 'split', max_num_features = 50)

In [None]:
lgb.plot_importance(trained_models_1A[f'lgb_1A_d4'][34].bst, figsize = (12, 12), importance_type = 'gain', max_num_features = 50)

In [None]:
x_train = data_layer_1.loc[data_layer_1['date_block_num'].isin(range(12, 34)), features_1A]

In [None]:
x = x_train.sample(10000)
explainer = shap.TreeExplainer(trained_models_1A[f'lgb_1A_d4'][34].bst, data = x)
x_shap = explainer.shap_values(x)

In [None]:
shap.summary_plot(
    shap_values = x_shap,
    features = x,
    feature_names = x.columns
)

In [None]:
shap.summary_plot(
    shap_values = x_shap,
    features = x,
    feature_names = x.columns,
    plot_type='bar',
    max_display = 25
)

In [None]:
del trained_models_1A
gc.collect()

# 1st Layer: B

In [None]:
models_1B = {}

for d in [3, 4, 5]:
    lgb_params = {
        'objective': 'regression',
        'metric': ['rmse'],
        'learning_rate': 0.1,
        'num_leaves': 2 ** d,
        'min_data_in_leaf': 20,
        'max_depth': -1,
        'seed': 2021
    }
    models_1B[f'lgb_1B_d{d}'] = LgbModel(lgb_params)
    
features_1B = ['year', 'month', 'shop_id', 'item_id', 'item_category_id'] \
    + [f'item_count_{i}' for i in [1, 2, 3]] \
    + [f'item_price_{i}' for i in [1, 2, 3]] \
    + [f'item_sales_{i}' for i in [1, 2, 3]] \
    + [f'word_{i}' for i in range(50)] \
    + ['city_id', 'item_category_id1', 'item_category_id2'] \
    + ['prev_days_on_sale__item',
       'prev_blocks_on_sale__item',
       'prev_days_on_sale__item_shop',
       'prev_blocks_on_sale__item_shop',
       'prev_days_on_sale__item_city',
       'prev_blocks_on_sale__item_city',
       'day_quality'
    ] \
    + ['prev_days_on_sale__item__encoded',
       'prev_blocks_on_sale__item__encoded',
       'prev_days_on_sale__item_shop__encoded',
       'prev_blocks_on_sale__item_shop__encoded',
       'prev_days_on_sale__item_city__encoded',
       'prev_blocks_on_sale__item_city__encoded',
       'month__encoded',
       'item_id__encoded',
       'item_category_id__encoded',
       'item_category_id1__encoded',
       'item_category_id2__encoded',
       'shop_id__encoded',
       'city_id__encoded'
    ]
len(features_1B)

```python
%%time
trained_models_1B = train(data_layer_1, models_1B, features_1B, start_block = 5, target_block = 34)
```

```python
with open('trained_models_1B.pickle', 'wb') as f:
    pickle.dump(trained_models_1B, f)
```

In [None]:
with open('../input/pfs-trained-models/PFS_trained_models_1B.pickle', 'rb') as f:
    trained_models_1B = pickle.load(f)

In [None]:
data_layer_2B, score_table_1B, corr_matrix_1B, submissions_1B = test(data_layer_1, trained_models_1B, features_1B, target_block = 34)

In [None]:
score_table_1B.boxplot()

In [None]:
lgb.plot_importance(trained_models_1B[f'lgb_1B_d4'][34].bst, figsize = (12, 12), importance_type = 'split', max_num_features = 50)

In [None]:
lgb.plot_importance(trained_models_1B[f'lgb_1B_d4'][34].bst, figsize = (12, 12), importance_type = 'gain', max_num_features = 50)

In [None]:
x_train = data_layer_1.loc[data_layer_1['date_block_num'].isin(range(5, 34)), features_1B]

In [None]:
x = x_train.sample(10000)
explainer = shap.TreeExplainer(trained_models_1B[f'lgb_1B_d4'][34].bst, data = x)
x_shap = explainer.shap_values(x)

In [None]:
shap.summary_plot(
    shap_values = x_shap,
    features = x,
    feature_names = x.columns
)

In [None]:
shap.summary_plot(
    shap_values = x_shap,
    features = x,
    feature_names = x.columns,
    plot_type='bar',
    max_display = 25
)

In [None]:
del trained_models_1B
gc.collect()

# 1st Layer: C

In [None]:
models_1C = {}

for d in [2, 3, 4, 5]:
    lgb_params = {
        'objective': 'regression',
        'metric': ['rmse'],
        'learning_rate': 0.1,
        'num_leaves': 2 ** d,
        'min_data_in_leaf': 20,
        'max_depth': -1,
        'seed': 2021
    }
    models_1C[f'lgb_1C_d{d}'] = LgbModel(lgb_params)
    
features_1C = ['year', 'month', 'shop_id', 'item_id', 'item_category_id'] \
    + [f'item_count_{i}' for i in [1, 2, 3, 12]] \
    + [f'item_price_{i}' for i in [1, 2, 3, 12]] \
    + [f'item_sales_{i}' for i in [1, 2, 3, 12]] \
    + [f'word_{i}' for i in range(50)] \
    + ['city_id', 'item_category_id1', 'item_category_id2'] \
    + ['prev_days_on_sale__item',
       'prev_blocks_on_sale__item',
       'prev_days_on_sale__item_shop',
       'prev_blocks_on_sale__item_shop',
       'prev_days_on_sale__item_city',
       'prev_blocks_on_sale__item_city',
       'day_quality'
    ] \
    + ['prev_days_on_sale__item__encoded',
       'prev_blocks_on_sale__item__encoded',
       'prev_days_on_sale__item_shop__encoded',
       'prev_blocks_on_sale__item_shop__encoded',
       'prev_days_on_sale__item_city__encoded',
       'prev_blocks_on_sale__item_city__encoded',
       'month__encoded',
       'item_id__encoded',
       'item_category_id__encoded',
       'item_category_id1__encoded',
       'item_category_id2__encoded',
       'shop_id__encoded',
       'city_id__encoded'
    ] \
    + ['month__item_id__encoded',
       'month__item_category_id__encoded',
       'month__shop_id__encoded',
       'month__city_id__encoded',
       'shop_id__item_id__encoded',
       'shop_id__item_category_id__encoded',
       'shop_id__item_category_id1__encoded',
       'shop_id__item_category_id2__encoded',
       'city_id__item_id__encoded',
       'city_id__item_category_id__encoded',
       'city_id__item_category_id1__encoded',
       'city_id__item_category_id2__encoded',
       'prev_blocks_on_sale__item__month__encoded'
      ]
len(features_1C)

```python
%%time
trained_models_1C = train(data_layer_1, models_1C, features_1C, start_block = 14, target_block = 34)
```

```python
with open('trained_models_1C.pickle', 'wb') as f:
    pickle.dump(trained_models_1C, f)
```

In [None]:
with open('../input/pfs-trained-models/PFS_trained_models_1C.pickle', 'rb') as f:
    trained_models_1C = pickle.load(f)

In [None]:
data_layer_2C, score_table_1C, corr_matrix_1C, submissions_1C = test(data_layer_1, trained_models_1C, features_1C, target_block = 34)

In [None]:
score_table_1C.boxplot()

In [None]:
lgb.plot_importance(trained_models_1C['lgb_1C_d4'][34].bst, figsize = (12, 12), importance_type = 'split', max_num_features = 50)

In [None]:
lgb.plot_importance(trained_models_1C['lgb_1C_d4'][34].bst, figsize = (12, 12), importance_type = 'gain', max_num_features = 50)

In [None]:
x_train = data_layer_1.loc[data_layer_1['date_block_num'].isin(range(14, 34)), features_1C]

In [None]:
x = x_train.sample(10000)
explainer = shap.TreeExplainer(trained_models_1C['lgb_1C_d4'][34].bst, data = x)
x_shap = explainer.shap_values(x)

In [None]:
shap.summary_plot(
    shap_values = x_shap,
    features = x,
    feature_names = x.columns
)

In [None]:
shap.summary_plot(
    shap_values = x_shap,
    features = x,
    feature_names = x.columns,
    plot_type='bar',
    max_display = 25
)

In [None]:
del trained_models_1C
gc.collect()

# 1st Layer: D
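
The feature lists for types C, D, and E in this notebook differ only in which lags they include (`[1, 2, 3, 12]`, `[1, 2, 3]`, and `[1]`). A minimal sketch of a helper that builds such a list from a lag set could look like the following. `build_features` and the constant names are hypothetical and not part of the notebook; the column names are copied from the lists in this section.

```python
# Hypothetical helper (not in the original notebook): builds the first-layer
# feature list for types C, D, and E from the lags to include.

# 'prev_days_on_sale__item', 'prev_blocks_on_sale__item', then item_shop, item_city.
ON_SALE = [f'prev_{unit}_on_sale__{key}'
           for key in ['item', 'item_shop', 'item_city']
           for unit in ['days', 'blocks']]

# Categorical columns that receive a mean target encoding.
CATEGORICAL = ['month', 'item_id', 'item_category_id', 'item_category_id1',
               'item_category_id2', 'shop_id', 'city_id']

# Interaction terms that receive a mean target encoding.
ITEM_KEYS = ['item_id', 'item_category_id', 'item_category_id1', 'item_category_id2']
INTERACTIONS = (
    [f'month__{c}' for c in ['item_id', 'item_category_id', 'shop_id', 'city_id']]
    + [f'{p}__{c}' for p in ['shop_id', 'city_id'] for c in ITEM_KEYS]
    + ['prev_blocks_on_sale__item__month']
)


def build_features(lags):
    """Return the feature list for the given lags, in the notebook's column order."""
    return (
        ['year', 'month', 'shop_id', 'item_id', 'item_category_id']
        + [f'{name}_{i}'
           for name in ['item_count', 'item_price', 'item_sales']
           for i in lags]
        + [f'word_{i}' for i in range(50)]
        + ['city_id', 'item_category_id1', 'item_category_id2']
        + ON_SALE + ['day_quality']
        + [f'{c}__encoded' for c in ON_SALE + CATEGORICAL]
        + [f'{c}__encoded' for c in INTERACTIONS]
    )


features_c = build_features([1, 2, 3, 12])
features_d = build_features([1, 2, 3])
features_e = build_features([1])
```

Each list has 91 columns that do not depend on the lags, plus three columns (count, price, sales) per lag.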

In [None]:
models_1D = {}

# One LightGBM model per tree size: num_leaves = 2 ** d gives 4, 8, 16, and 32 leaves.
for d in [2, 3, 4, 5]:
    lgb_params = {
        'objective': 'regression',
        'metric': ['rmse'],
        'learning_rate': 0.1,
        'num_leaves': 2 ** d,
        'min_data_in_leaf': 20,
        'max_depth': -1,
        'seed': 2021
    }
    models_1D[f'lgb_1D_d{d}'] = LgbModel(lgb_params)
    
features_1D = ['year', 'month', 'shop_id', 'item_id', 'item_category_id'] \
    + [f'item_count_{i}' for i in [1, 2, 3]] \
    + [f'item_price_{i}' for i in [1, 2, 3]] \
    + [f'item_sales_{i}' for i in [1, 2, 3]] \
    + [f'word_{i}' for i in range(50)] \
    + ['city_id', 'item_category_id1', 'item_category_id2'] \
    + ['prev_days_on_sale__item',
       'prev_blocks_on_sale__item',
       'prev_days_on_sale__item_shop',
       'prev_blocks_on_sale__item_shop',
       'prev_days_on_sale__item_city',
       'prev_blocks_on_sale__item_city',
       'day_quality'
    ] \
    + ['prev_days_on_sale__item__encoded',
       'prev_blocks_on_sale__item__encoded',
       'prev_days_on_sale__item_shop__encoded',
       'prev_blocks_on_sale__item_shop__encoded',
       'prev_days_on_sale__item_city__encoded',
       'prev_blocks_on_sale__item_city__encoded',
       'month__encoded',
       'item_id__encoded',
       'item_category_id__encoded',
       'item_category_id1__encoded',
       'item_category_id2__encoded',
       'shop_id__encoded',
       'city_id__encoded'
    ] \
    + ['month__item_id__encoded',
       'month__item_category_id__encoded',
       'month__shop_id__encoded',
       'month__city_id__encoded',
       'shop_id__item_id__encoded',
       'shop_id__item_category_id__encoded',
       'shop_id__item_category_id1__encoded',
       'shop_id__item_category_id2__encoded',
       'city_id__item_id__encoded',
       'city_id__item_category_id__encoded',
       'city_id__item_category_id1__encoded',
       'city_id__item_category_id2__encoded',
       'prev_blocks_on_sale__item__month__encoded'
      ]
len(features_1D)

```python
%%time
trained_models_1D = train(data_layer_1, models_1D, features_1D, start_block = 5, target_block = 34)
```

```python
with open('trained_models_1D.pickle', 'wb') as f:
    pickle.dump(trained_models_1D, f)
```

In [None]:
with open('../input/pfs-trained-models/PFS_trained_models_1D.pickle', 'rb') as f:
    trained_models_1D = pickle.load(f)

In [None]:
data_layer_2D, score_table_1D, corr_matrix_1D, submissions_1D = test(data_layer_1, trained_models_1D, features_1D, target_block = 34)

In [None]:
score_table_1D.boxplot()

In [None]:
lgb.plot_importance(trained_models_1D['lgb_1D_d4'][34].bst, figsize = (12, 12), importance_type = 'split', max_num_features = 50)

In [None]:
lgb.plot_importance(trained_models_1D['lgb_1D_d4'][34].bst, figsize = (12, 12), importance_type = 'gain', max_num_features = 50)

In [None]:
x_train = data_layer_1.loc[data_layer_1['date_block_num'].isin(range(5, 34)), features_1D]

In [None]:
x = x_train.sample(10000)
explainer = shap.TreeExplainer(trained_models_1D['lgb_1D_d4'][34].bst, data = x)
x_shap = explainer.shap_values(x)

In [None]:
shap.summary_plot(
    shap_values = x_shap,
    features = x,
    feature_names = x.columns
)

In [None]:
shap.summary_plot(
    shap_values = x_shap,
    features = x,
    feature_names = x.columns,
    plot_type='bar',
    max_display = 25
)

In [None]:
del trained_models_1D
gc.collect()

# 1st Layer: E
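
The `train` helper is defined earlier in the notebook and is not shown in this section. The indexing `trained_models_1E['lgb_1E_d4'][34]` suggests one fitted model per target block, which is the shape an expanding-window (walk-forward) scheme would produce. The sketch below only illustrates that general scheme and is an assumption, not the notebook's implementation; `expanding_window_splits` and `first_block` are hypothetical names.

```python
# Assumed scheme (not confirmed by the notebook): for every block from
# start_block to target_block, train on all earlier usable blocks and
# predict that block.
def expanding_window_splits(first_block, start_block, target_block):
    """Yield (train_blocks, predict_block) pairs for walk-forward validation."""
    for block in range(start_block, target_block + 1):
        yield list(range(first_block, block)), block


# Type E uses only lag 1, so block 1 is the first block with a complete lag.
splits = list(expanding_window_splits(first_block=1, start_block=3, target_block=34))

first_train, first_predict = splits[0]    # earliest fold
last_train, last_predict = splits[-1]     # fold that predicts the test block
```

Under this assumption, the model for block 34 would see blocks 1 to 33, and each earlier fold would give an out-of-fold prediction that a second layer can train on.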

In [None]:
models_1E = {}

# One LightGBM model per tree size: num_leaves = 2 ** d gives 4, 8, 16, and 32 leaves.
for d in [2, 3, 4, 5]:
    lgb_params = {
        'objective': 'regression',
        'metric': ['rmse'],
        'learning_rate': 0.1,
        'num_leaves': 2 ** d,
        'min_data_in_leaf': 20,
        'max_depth': -1,
        'seed': 2021
    }
    models_1E[f'lgb_1E_d{d}'] = LgbModel(lgb_params)
    
features_1E = ['year', 'month', 'shop_id', 'item_id', 'item_category_id'] \
    + [f'item_count_{i}' for i in [1]] \
    + [f'item_price_{i}' for i in [1]] \
    + [f'item_sales_{i}' for i in [1]] \
    + [f'word_{i}' for i in range(50)] \
    + ['city_id', 'item_category_id1', 'item_category_id2'] \
    + ['prev_days_on_sale__item',
       'prev_blocks_on_sale__item',
       'prev_days_on_sale__item_shop',
       'prev_blocks_on_sale__item_shop',
       'prev_days_on_sale__item_city',
       'prev_blocks_on_sale__item_city',
       'day_quality'
    ] \
    + ['prev_days_on_sale__item__encoded',
       'prev_blocks_on_sale__item__encoded',
       'prev_days_on_sale__item_shop__encoded',
       'prev_blocks_on_sale__item_shop__encoded',
       'prev_days_on_sale__item_city__encoded',
       'prev_blocks_on_sale__item_city__encoded',
       'month__encoded',
       'item_id__encoded',
       'item_category_id__encoded',
       'item_category_id1__encoded',
       'item_category_id2__encoded',
       'shop_id__encoded',
       'city_id__encoded'
    ] \
    + ['month__item_id__encoded',
       'month__item_category_id__encoded',
       'month__shop_id__encoded',
       'month__city_id__encoded',
       'shop_id__item_id__encoded',
       'shop_id__item_category_id__encoded',
       'shop_id__item_category_id1__encoded',
       'shop_id__item_category_id2__encoded',
       'city_id__item_id__encoded',
       'city_id__item_category_id__encoded',
       'city_id__item_category_id1__encoded',
       'city_id__item_category_id2__encoded',
       'prev_blocks_on_sale__item__month__encoded'
      ]
len(features_1E)

```python
%%time
trained_models_1E = train(data_layer_1, models_1E, features_1E, start_block = 3, target_block = 34)
```

```python
with open('trained_models_1E.pickle', 'wb') as f:
    pickle.dump(trained_models_1E, f)
```

In [None]:
with open('../input/pfs-trained-models/PFS_trained_models_1E.pickle', 'rb') as f:
    trained_models_1E = pickle.load(f)

In [None]:
data_layer_2E, score_table_1E, corr_matrix_1E, submissions_1E = test(data_layer_1, trained_models_1E, features_1E, target_block = 34)

In [None]:
score_table_1E.boxplot()

In [None]:
lgb.plot_importance(trained_models_1E['lgb_1E_d4'][34].bst, figsize = (12, 12), importance_type = 'split', max_num_features = 50)

In [None]:
lgb.plot_importance(trained_models_1E['lgb_1E_d4'][34].bst, figsize = (12, 12), importance_type = 'gain', max_num_features = 50)

In [None]:
x_train = data_layer_1.loc[data_layer_1['date_block_num'].isin(range(3, 34)), features_1E]

In [None]:
x = x_train.sample(10000)
explainer = shap.TreeExplainer(trained_models_1E['lgb_1E_d4'][34].bst, data = x)
x_shap = explainer.shap_values(x)

In [None]:
shap.summary_plot(
    shap_values = x_shap,
    features = x,
    feature_names = x.columns
)

In [None]:
shap.summary_plot(
    shap_values = x_shap,
    features = x,
    feature_names = x.columns,
    plot_type='bar',
    max_display = 25
)

In [None]:
del trained_models_1E
gc.collect()

# 2nd Layer
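
The second layer fits linear meta-models on the first-layer predictions. `LassoModel` and `RidgeModel` are the notebook's own wrappers and are defined outside this section; the sketch below assumes they wrap scikit-learn's `Lasso` and `Ridge`. The data is synthetic and the `alpha` values are illustrative assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2021)

# Synthetic stand-in for first-layer predictions: three noisy views of one target.
n_rows = 500
target = rng.normal(size=n_rows)
first_layer_preds = np.column_stack([
    target + rng.normal(scale=noise, size=n_rows) for noise in (0.3, 0.5, 0.8)
])

# Time-ordered split: earlier rows fit the meta-models, later rows validate them.
split = 400
X_train, X_valid = first_layer_preds[:split], first_layer_preds[split:]
y_train, y_valid = target[:split], target[split:]

# positive=True constrains the Lasso weights to be non-negative, so each
# first-layer model can only add to the blend, never subtract from it.
lasso = Lasso(alpha=0.001, positive=True).fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

# Average the two meta-models, mirroring submissions_2.mean(axis = 1).
blend = (lasso.predict(X_valid) + ridge.predict(X_valid)) / 2
rmse = float(np.sqrt(np.mean((blend - y_valid) ** 2)))
```

With highly correlated first-layer predictions, an unconstrained linear model can assign large weights of opposite sign; the non-negativity constraint is one way to keep the stacked weights interpretable.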

In [None]:
data_layer_2 = pd.concat([
    data_layer_2A[['date_block_num', 'target']],
    data_layer_2A.drop(columns = ['date_block_num', 'target']),
    data_layer_2B.drop(columns = ['date_block_num', 'target']),
    data_layer_2C.drop(columns = ['date_block_num', 'target']),
    data_layer_2D.drop(columns = ['date_block_num', 'target']),
    data_layer_2E.drop(columns = ['date_block_num', 'target'])
], axis = 1)

In [None]:
del data_layer_1
del data_layer_2A
del data_layer_2B
del data_layer_2C
del data_layer_2D
del data_layer_2E
gc.collect()

In [None]:
score_table = pd.concat([
    score_table_1A,
    score_table_1B,
    score_table_1C,
    score_table_1D,
    score_table_1E
], axis = 1)

In [None]:
score_table

In [None]:
fig = plt.figure(figsize = (16, 10))
score_table[14:].boxplot()

In [None]:
sns.heatmap(data_layer_2[data_layer_2['date_block_num'] == 34][[
    'pred_lgb_1A_d3', 'pred_lgb_1A_d4', 'pred_lgb_1A_d5',
    'pred_lgb_1B_d3', 'pred_lgb_1B_d4', 'pred_lgb_1B_d5',
    'pred_lgb_1C_d2', 'pred_lgb_1C_d3', 'pred_lgb_1C_d4', 'pred_lgb_1C_d5',
    'pred_lgb_1D_d2', 'pred_lgb_1D_d3', 'pred_lgb_1D_d4', 'pred_lgb_1D_d5',
    'pred_lgb_1E_d2', 'pred_lgb_1E_d3', 'pred_lgb_1E_d4', 'pred_lgb_1E_d5',
]].corr(), cmap = sns.diverging_palette(230, 20, as_cmap = True))

In [None]:
models_2 = {}
models_2['lasso_2'] = LassoModel(positive = True)  # boolean flag: non-negative stacking weights
models_2['ridge_2'] = RidgeModel()

features_2 = [
    'pred_lgb_1A_d3', 'pred_lgb_1A_d4', 'pred_lgb_1A_d5',
    'pred_lgb_1B_d3', 'pred_lgb_1B_d4', 'pred_lgb_1B_d5',
    'pred_lgb_1C_d2', 'pred_lgb_1C_d3', 'pred_lgb_1C_d4', 'pred_lgb_1C_d5',
    'pred_lgb_1D_d2', 'pred_lgb_1D_d3', 'pred_lgb_1D_d4', 'pred_lgb_1D_d5',
    'pred_lgb_1E_d2', 'pred_lgb_1E_d3', 'pred_lgb_1E_d4', 'pred_lgb_1E_d5',
]
len(features_2)

In [None]:
%%time
trained_models_2 = train(data_layer_2, models_2, features_2, start_block = 16, target_block = 34)

In [None]:
data_layer_3, score_table_2, corr_matrix_2, submissions_2 = test(data_layer_2, trained_models_2, features_2, target_block = 34)

In [None]:
score_table_2

In [None]:
score_table_2.boxplot()

In [None]:
corr_matrix_2

In [None]:
x_train = data_layer_2.loc[data_layer_2['date_block_num'].isin(range(14, 34)), features_2]
x = x_train.sample(1000)

In [None]:
explainer = shap.Explainer(trained_models_2['lasso_2'][34].regressor.predict, x)
x_shap = explainer(x)

In [None]:
shap.summary_plot(
    shap_values = x_shap,
    features = x,
    feature_names = x.columns
)

In [None]:
shap.summary_plot(
    shap_values = x_shap,
    features = x,
    feature_names = x.columns,
    plot_type='bar',
    max_display = 25
)

In [None]:
submission_lasso = submissions_2[['lasso_2']].reset_index()
submission_lasso.columns = ['ID', 'item_cnt_month']
submission_lasso

In [None]:
submission_lasso.to_csv('submission_lasso.csv', index=False)

In [None]:
submission_ridge = submissions_2[['ridge_2']].reset_index()
submission_ridge.columns = ['ID', 'item_cnt_month']
submission_ridge

In [None]:
submission_ridge.to_csv('submission_ridge.csv', index=False)

In [None]:
submission = submissions_2.mean(axis = 1).reset_index()
submission.columns = ['ID', 'item_cnt_month']
submission

In [None]:
score_table = pd.concat([
    score_table,
    score_table_2,
], axis = 1)

In [None]:
score_table[16:]

In [None]:
fig = plt.figure(figsize = (16, 10))
score_table[16:].boxplot()