Now that we have fixed and generated three feature subsets
1. non-lagged + lagged textual features
2. lagged {target,item,shop} + non-lagged basic categories
3. lagged features within shop

and three first level classifiers types for each
* a.  CatBoost
* b. RidgeCV 
* c. Random Forest (sklearn) 

we search for hyperarameters that are used for predicting a month 
based on twelve month history, with one month gap between training and prediction periods.

This is a compromise of the prediction quality on the other hand, and not having the prediction 
quality and optimal hyperparameters vary too much over the training period when generating the first level predictions as input features of second stacking level.

The search for hyperparameters is problematic in whole because the chosen validation scheme is lacking. There may not be
too much that can be done, because the validation data necessarily has different distribution as the actual testing data.
This is because the temporal nature of the prediction problem. The distributions slowly drift during cause of time. Therefore, 
it is good to have the validation period temporally close to the test period. On the other hand, data analysis shows strong seasonal=(yearly) effects. 
Predicting October sales based on previous year simply is a very different problem to predicting December sales, as sales figures seem to peak strongly in December and have special characteristics.

We decide to search for such hyperparameters that maximise the quality of predictions (with
reasonable computational burden) in the hold-out validation data of Oct 2015. This is despite the fact that we have seen in examples that
such optimal model hyperparameters do not result in optimal prediction quality for Dec 2015.
We specifically do not search for such hyperparameters (via a coross-validation scheme) that would maximise the quality of predictions during
the training period, as the value of temporally distant predictions is questionable after because of the distribution shift throughtime.


The parameters are used for
a) creating submissions for ensembling using simple schemes
b) generating level 2 input features for a stacking algorithm



In [1]:
import numpy as np
import pandas as pd 
import sklearn
import scipy.sparse 
import lightgbm as lgb
from sklearn.metrics import r2_score
import catboost
import gc
import matplotlib.pyplot as plt

from catboost import CatBoostRegressor, Pool

import re
import os

for p in [np, pd, scipy, sklearn, lgb, catboost]:
    print (p.__name__, p.__version__)
    
DATA_FOLDER = 'competitive-data-science-predict-future-sales'
test_spec = pd.read_csv(os.path.join(DATA_FOLDER, 'test.csv'))

index_cols=['item_id','shop_id','date_block_num']
date_block_val = 33
date_block_test = 35 # Dec 2015

test2submission_mapping_generated = False

numpy 1.18.1
pandas 0.25.3
scipy 1.4.1
sklearn 0.22.1
lightgbm 2.3.1
catboost 0.22


In [2]:
def downcast_dtypes(df):
    '''
        Changes column types in the dataframe: 
                
                `float64` type to `float32`
                `int64`   type to `int32`
    '''
    
    # Select columns to downcast
    float_cols = [c for c in df if df[c].dtype == "float64"]
    int_cols =   [c for c in df if df[c].dtype == "int64"]
    
    # Downcast
    df[float_cols] = df[float_cols].astype(np.float32)
    df[int_cols]   = df[int_cols].astype(np.int32)
    
    return df

In [3]:
def write_predictions_by_array(array, filename):
  df=pd.DataFrame(array)
  df.columns=['item_cnt_month']
  df.to_csv(os.path.join(DATA_FOLDER, filename), index_label='ID')

In [4]:
def clipped_rmse(gt, predicted,clip_min=0, clip_max=20):
  target=np.minimum(np.maximum(gt,clip_min), clip_max)
  return np.sqrt((target-predicted)**2).mean()

# 2) Feature set 2: text based features

In [5]:
# load data

all_data = pd.read_csv(os.path.join(DATA_FOLDER, 'feature_set_text.csv'))

dates=all_data['date_block_num']

y_train = all_data.loc[(dates>= date_block_val - 9) & (dates<= date_block_val - 2), 'target']
y_trainval = all_data.loc[(dates>= date_block_test - 9) & (dates<= date_block_test - 2), 'target']
y_val = all_data.loc[dates == date_block_val, 'target']
y_test = all_data.loc[dates == date_block_test, 'target']

to_drop_cols = ['target','date_block_num']

X_train = all_data.loc[(dates>= date_block_val - 9) & (dates<= date_block_val - 2)].drop(to_drop_cols, axis=1)
X_trainval = all_data.loc[(dates>= date_block_test - 9) & (dates<= date_block_test - 2)].drop(to_drop_cols, axis=1)
X_val = all_data.loc[dates == date_block_val].drop(to_drop_cols, axis=1)
X_test = all_data.loc[dates == date_block_test].drop(to_drop_cols, axis=1)

shop_item2submissionid={}
for idx, row in test_spec.iterrows():
    shop_item2submissionid[str(row['shop_id'])+'_'+str(row['item_id'])] = row['ID']
    
test_data=all_data.loc[dates == date_block_test, ['shop_id','item_id']]    
    
testidx2submissionidx=np.zeros(test_data.shape[0], dtype=np.int32)
for idx in range(test_data.shape[0]):
    row =test_data.iloc[idx]
    testidx2submissionidx[idx] = shop_item2submissionid[str(row['shop_id'])+'_'+str(row['item_id'])]
    
 
#invert the mapping
submissionidx2testidx=np.zeros(test_data.shape[0], dtype=np.int32)
for i in range(test_data.shape[0]):
    submissionidx2testidx[testidx2submissionidx[i]]=i
    
del test_data
gc.collect()    



0

In [6]:
from sklearn import linear_model

#model=linear_model.RidgeCV(alphas=np.logspace(-3,13), fit_intercept=True, normalize=True)
model=linear_model.Ridge(alpha=0.04, fit_intercept=True, normalize=True)
model.fit(X_train.to_numpy(), y_train)
pred_val = np.clip(model.predict(X_val.to_numpy()), 0, 20)
#print('Validation R-squared for LightGBM is %f' % r2_score(y_val, pred_lgb_val))
print('Clipped RMSE {}'.format(clipped_rmse(y_val, pred_val)))

model.fit(X_trainval.to_numpy(), y_trainval)
pred_test = np.clip(model.predict(X_test.to_numpy()), 0, 20)
write_predictions_by_array(pred_test[submissionidx2testidx], 'submission-ridge-feature_set_text.csv')
# LB 1.203418 and 1.189612


Clipped RMSE 0.46532662185032936


In [11]:
X_train.columns

Index(['item_name_category_tfidf_bigram_256',
       'target_category_tfidf_unigram_256_lag_6',
       'target_category_frequent_256_lag_4', 'item_name_category_frequent_32',
       'target_category_tfidf_unigram_256_within_shop_lag_12',
       'target_category_tfidf_bigram_256_lag_5',
       'target_category_tfidf_unigram_256_lag_2',
       'target_category_frequent_256_within_shop_lag_4',
       'item_name_category_frequent_256',
       'item_name_category_tfidf_unigram_256',
       'target_category_frequent_256_within_shop_lag_5',
       'target_category_frequent_256_lag_5',
       'item_name_category_tfidf_bigram_32',
       'target_category_frequent_256_within_shop_lag_6',
       'target_category_frequent_256_within_shop_lag_2',
       'target_category_tfidf_unigram_256_lag_5',
       'target_category_tfidf_bigram_256_within_shop_lag_12',
       'target_category_frequent_256_lag_12',
       'target_category_frequent_256_within_shop_lag_12',
       'target_category_tfidf_unigram_25

In [7]:
lr=0.01
reg=CatBoostRegressor(iterations=2100, depth=16, eta=lr,metric_period=20)
#eval_dataset= Pool(X_val,y_val)
#reg.fit(X_train.to_numpy(), y_train, eval_set=eval_dataset)
reg.fit(X_train, y_train)


0:	learn: 3.2430756	total: 4.19s	remaining: 2h 26m 42s


KeyboardInterrupt: 

In [13]:
lr=0.3
reg=CatBoostRegressor(iterations=100, depth=16, eta=lr,early_stopping_rounds=50)
eval_dataset= Pool(X_val,y_val)
#reg.fit(X_train.to_numpy(), y_train, eval_set=eval_dataset)
reg.fit(X_train.to_numpy(), y_train, eval_set=eval_dataset)
pred_val = np.clip(reg.predict(X_val.to_numpy()), 0, 20)
    #print('Validation R-squared for LightGBM is %f' % r2_score(y_val, pred_lgb_val))
print('Clipped RMSE {}'.format(clipped_rmse(y_val, pred_val)))

0:	learn: 2.9938248	test: 5.3098607	best: 5.3098607 (0)	total: 4.11s	remaining: 6m 46s
1:	learn: 2.8194187	test: 5.2276048	best: 5.2276048 (1)	total: 8.23s	remaining: 6m 43s
2:	learn: 2.7197118	test: 5.1790292	best: 5.1790292 (2)	total: 12.4s	remaining: 6m 41s
3:	learn: 2.6275895	test: 5.1653746	best: 5.1653746 (3)	total: 16.5s	remaining: 6m 36s
4:	learn: 2.5644537	test: 5.1456011	best: 5.1456011 (4)	total: 20.8s	remaining: 6m 34s
5:	learn: 2.4976971	test: 5.1274460	best: 5.1274460 (5)	total: 25s	remaining: 6m 31s
6:	learn: 2.4389858	test: 5.1244156	best: 5.1244156 (6)	total: 29.2s	remaining: 6m 28s
7:	learn: 2.3734834	test: 5.1118234	best: 5.1118234 (7)	total: 33.5s	remaining: 6m 24s
8:	learn: 2.3301853	test: 4.9783693	best: 4.9783693 (8)	total: 37.6s	remaining: 6m 20s
9:	learn: 2.3147680	test: 4.9776331	best: 4.9776331 (9)	total: 41.9s	remaining: 6m 16s
10:	learn: 2.3032872	test: 4.9774073	best: 4.9774073 (10)	total: 46s	remaining: 6m 12s
11:	learn: 2.2636778	test: 4.9768361	best: 4.

In [14]:
reg=CatBoostRegressor(iterations=75, depth=16, eta=0.3)
reg.fit(X_trainval.to_numpy(), y_trainval)
pred_test = np.clip(reg.predict(X_test.to_numpy()), 0, 20)
write_predictions_by_array(pred_test[submissionidx2testidx], 'submission-catboost-feature_set_text-lr0.01.csv')
# LB score 1.121911 and 1.11979

0:	learn: 3.8032055	total: 3.97s	remaining: 4m 53s
1:	learn: 3.5934791	total: 8.06s	remaining: 4m 54s
2:	learn: 3.3624003	total: 12.2s	remaining: 4m 52s
3:	learn: 3.2548566	total: 16.3s	remaining: 4m 49s
4:	learn: 3.1810405	total: 20.3s	remaining: 4m 44s
5:	learn: 3.1081750	total: 24.5s	remaining: 4m 42s
6:	learn: 3.0140236	total: 28.8s	remaining: 4m 39s
7:	learn: 2.9076039	total: 32.9s	remaining: 4m 35s
8:	learn: 2.8283148	total: 36.9s	remaining: 4m 30s
9:	learn: 2.7912424	total: 41s	remaining: 4m 26s
10:	learn: 2.7416821	total: 45.2s	remaining: 4m 22s
11:	learn: 2.7171035	total: 49.3s	remaining: 4m 18s
12:	learn: 2.6968418	total: 53.4s	remaining: 4m 14s
13:	learn: 2.6532292	total: 57.5s	remaining: 4m 10s
14:	learn: 2.6021161	total: 1m 1s	remaining: 4m 6s
15:	learn: 2.5572232	total: 1m 5s	remaining: 4m 1s
16:	learn: 2.5168888	total: 1m 9s	remaining: 3m 57s
17:	learn: 2.4807065	total: 1m 13s	remaining: 3m 53s
18:	learn: 2.4585676	total: 1m 17s	remaining: 3m 49s
19:	learn: 2.4393804	tot

# Feature set 3: lags within shop

In [15]:
# load data

all_data = pd.read_csv(os.path.join(DATA_FOLDER, 'feature_set_within.csv'))

dates=all_data['date_block_num']

y_train = all_data.loc[(dates>= date_block_val - 9) & (dates<= date_block_val - 2), 'target']
y_trainval = all_data.loc[(dates>= date_block_test - 9) & (dates<= date_block_test - 2), 'target']
y_val = all_data.loc[dates == date_block_val, 'target']
y_test = all_data.loc[dates == date_block_test, 'target']

to_drop_cols = ['target','date_block_num']

X_train = all_data.loc[(dates>= date_block_val - 9) & (dates<= date_block_val - 2)].drop(to_drop_cols, axis=1)
X_trainval = all_data.loc[(dates>= date_block_test - 9) & (dates<= date_block_test - 2)].drop(to_drop_cols, axis=1)
X_val = all_data.loc[dates == date_block_val].drop(to_drop_cols, axis=1)
X_test = all_data.loc[dates == date_block_test].drop(to_drop_cols, axis=1)

shop_item2submissionid={}
for idx, row in test_spec.iterrows():
    shop_item2submissionid[str(row['shop_id'])+'_'+str(row['item_id'])] = row['ID']
    
test_data=all_data.loc[dates == date_block_test, ['shop_id','item_id']]    
    
testidx2submissionidx=np.zeros(test_data.shape[0], dtype=np.int32)
for idx in range(test_data.shape[0]):
    row =test_data.iloc[idx]
    testidx2submissionidx[idx] = shop_item2submissionid[str(row['shop_id'])+'_'+str(row['item_id'])]
    
 
#invert the mapping
submissionidx2testidx=np.zeros(test_data.shape[0], dtype=np.int32)
for i in range(test_data.shape[0]):
    submissionidx2testidx[testidx2submissionidx[i]]=i
    
del test_data
gc.collect()    



0

In [None]:
from sklearn import linear_model

#model=linear_model.RidgeCV(alphas=np.logspace(-3,13), fit_intercept=False)
model=linear_model.Ridge(alpha=3e7, fit_intercept=False)
model.fit(X_train.to_numpy(), y_train)
pred_val = np.clip(model.predict(X_val.to_numpy()), 0, 20)
#print('Validation R-squared for LightGBM is %f' % r2_score(y_val, pred_lgb_val))
print('Clipped RMSE {}'.format(clipped_rmse(y_val, pred_val)))

model.fit(X_trainval.to_numpy(), y_trainval)
pred_test = np.clip(model.predict(X_test.to_numpy()), 0, 20)
write_predictions_by_array(pred_test[submissionidx2testidx], 'submission-ridge-feature_set_within.csv')
# LB 1.215079 and 1.202396


In [None]:
model.alpha_

In [16]:
lr=0.3
reg=CatBoostRegressor(iterations=100, depth=16, eta=lr,early_stopping_rounds=50)
eval_dataset= Pool(X_val,y_val)
#reg.fit(X_train.to_numpy(), y_train, eval_set=eval_dataset)
reg.fit(X_train.to_numpy(), y_train, eval_set=eval_dataset)
pred_val = np.clip(reg.predict(X_val.to_numpy()), 0, 20)
    #print('Validation R-squared for LightGBM is %f' % r2_score(y_val, pred_lgb_val))
print('Clipped RMSE {}'.format(clipped_rmse(y_val, pred_val)))

0:	learn: 2.9968440	test: 5.3387101	best: 5.3387101 (0)	total: 3.92s	remaining: 6m 27s
1:	learn: 2.8941772	test: 5.3108525	best: 5.3108525 (1)	total: 7.86s	remaining: 6m 25s
2:	learn: 2.7389500	test: 5.2505995	best: 5.2505995 (2)	total: 11.8s	remaining: 6m 20s
3:	learn: 2.6352402	test: 5.1358947	best: 5.1358947 (3)	total: 15.6s	remaining: 6m 15s
4:	learn: 2.6038726	test: 5.1348162	best: 5.1348162 (4)	total: 19.5s	remaining: 6m 10s
5:	learn: 2.5414224	test: 5.1220145	best: 5.1220145 (5)	total: 23.3s	remaining: 6m 5s
6:	learn: 2.4635186	test: 5.0986699	best: 5.0986699 (6)	total: 27.2s	remaining: 6m 1s
7:	learn: 2.4458983	test: 5.0984144	best: 5.0984144 (7)	total: 31s	remaining: 5m 56s
8:	learn: 2.4226695	test: 5.0516756	best: 5.0516756 (8)	total: 34.9s	remaining: 5m 52s
9:	learn: 2.4077369	test: 5.0512090	best: 5.0512090 (9)	total: 38.7s	remaining: 5m 48s
10:	learn: 2.3787095	test: 5.0424099	best: 5.0424099 (10)	total: 42.6s	remaining: 5m 44s
11:	learn: 2.3660573	test: 5.0422021	best: 5.

In [17]:
reg=CatBoostRegressor(iterations=90, depth=16, eta=0.3)
reg.fit(X_trainval.to_numpy(), y_trainval)
pred_test = np.clip(reg.predict(X_test.to_numpy()), 0, 20)
write_predictions_by_array(pred_test[submissionidx2testidx], 'submission-catboost-feature_set_within-lr0.3.csv')
# LB score 1.121911 and 1.11979

0:	learn: 3.8506874	total: 3.83s	remaining: 5m 40s
1:	learn: 3.6068670	total: 7.67s	remaining: 5m 37s
2:	learn: 3.4484906	total: 11.5s	remaining: 5m 34s
3:	learn: 3.3430640	total: 15.4s	remaining: 5m 31s
4:	learn: 3.2802332	total: 19.3s	remaining: 5m 27s
5:	learn: 3.2383940	total: 23s	remaining: 5m 22s
6:	learn: 3.1713835	total: 26.9s	remaining: 5m 19s
7:	learn: 3.1461231	total: 30.8s	remaining: 5m 16s
8:	learn: 3.1309449	total: 34.7s	remaining: 5m 12s
9:	learn: 3.1045289	total: 38.6s	remaining: 5m 8s
10:	learn: 3.0654204	total: 42.4s	remaining: 5m 4s
11:	learn: 3.0368718	total: 46.3s	remaining: 5m
12:	learn: 3.0229326	total: 50.4s	remaining: 4m 58s
13:	learn: 3.0067326	total: 54.5s	remaining: 4m 55s
14:	learn: 2.9953285	total: 58.3s	remaining: 4m 51s
15:	learn: 2.9910267	total: 1m 2s	remaining: 4m 47s
16:	learn: 2.9795751	total: 1m 6s	remaining: 4m 43s
17:	learn: 2.9759359	total: 1m 9s	remaining: 4m 39s
18:	learn: 2.9733146	total: 1m 13s	remaining: 4m 35s
19:	learn: 2.9710599	total: 1