# Introduction

This notebook is based on the superb notebook by Gordon Henderson: https://www.kaggle.com/gordotron85/future-sales-xgboost-top-3 . Instead of engineering features by hand as he does, we will use the automatic feature extraction for time series data provided by tsfresh (https://tsfresh.readthedocs.io/en/latest/).

Due to resource limitations on Kaggle, I will only showcase how to generate the features on a subsample of the data. When I tried to run the roll_time_series function on the complete train set, it did not finish in a reasonable time. Dask dataframes might be an option here, and I plan to explore this in future versions: https://tsfresh.readthedocs.io/en/latest/text/large_data.html

In [None]:
# data imports
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import tsfresh as tsf
sns.set(style="darkgrid")


import os
import random
random.seed(42)
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load Data

In [None]:
# load data
train=pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv")
test=pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/test.csv")

# 1. Data Cleaning

We'll remove outliers, clean up some of the raw data and add some new variables to it.

# Remove outliers

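We'll remove the obvious outliers in the dataset: rows where an item sold 1000 or more units in one day and rows with an item price of 300,000 or more. We also drop rows with a non-positive price and set daily counts below 1 (for example negative values) to 0.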

In [None]:
train = train[(train.item_price < 300000 )& (train.item_cnt_day < 1000)]
train = train[train.item_price > 0].reset_index(drop = True)
train.loc[train.item_cnt_day < 1, "item_cnt_day"] = 0
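
To see what these filters do, here is a minimal sketch on a made-up dataframe. The values are invented for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy data (hypothetical values), chosen so that each filter is hit once.
toy = pd.DataFrame({
    "item_price":   [100.0, 400000.0, -1.0, 50.0, 20.0],
    "item_cnt_day": [1.0, 1.0, 1.0, 2000.0, -1.0],
})

# Same three steps as in the cell above.
toy = toy[(toy.item_price < 300000) & (toy.item_cnt_day < 1000)]
toy = toy[toy.item_price > 0].reset_index(drop=True)
toy.loc[toy.item_cnt_day < 1, "item_cnt_day"] = 0

print(toy)
```

Only the first and the last row survive, and the negative count of the last row is replaced by 0.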

# Preprocessing

Create a dataframe called matrix with every combination of month, shop and item, ordered by month. The daily counts (item_cnt_day) are summed into a monthly count (item_cnt_month).

With tsfresh we will later slide a window over every time series to create features for that window. This means we need one entry for every shop_id and item_id in every date_block_num, so that all windows cover evenly spaced time steps.

In [None]:
from itertools import product
import time
ts = time.time()
matrix = []
cols  = ["date_block_num", "shop_id", "item_id"]
train = train.sample(5000, random_state=42)  # subsample for speed; comment out to run on the whole data set
for i in range(34):
    matrix.append( np.array(list( product( [i], train.shop_id.unique(), train.item_id.unique() ) ), dtype = np.int16) )

matrix = pd.DataFrame( np.vstack(matrix), columns = cols )
matrix["date_block_num"] = matrix["date_block_num"].astype(np.int8)
matrix["shop_id"] = matrix["shop_id"].astype(np.int8)
matrix["item_id"] = matrix["item_id"].astype(np.int16)
matrix.sort_values( cols, inplace = True )
time.time()- ts

In [None]:
ts = time.time()
group = train.groupby( ["date_block_num", "shop_id", "item_id"] ).agg( {"item_cnt_day": ["sum"]} )
group.columns = ["item_cnt_month"]
group.reset_index( inplace = True)
matrix = pd.merge( matrix, group, on = cols, how = "left" )
matrix["item_cnt_month"] = matrix["item_cnt_month"].fillna(0).astype(np.float16)
time.time() - ts
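
The two cells above can be hard to follow on the real data, so here is a small sketch of the same idea on hypothetical sales records: build the full month x shop x item grid, aggregate the daily counts per month, and fill the combinations without sales with 0.

```python
from itertools import product
import pandas as pd

# Hypothetical daily sales: two shops, two items, three months.
sales = pd.DataFrame({
    "date_block_num": [0, 0, 2],
    "shop_id":        [1, 1, 2],
    "item_id":        [10, 10, 20],
    "item_cnt_day":   [1.0, 2.0, 5.0],
})
cols = ["date_block_num", "shop_id", "item_id"]

# Full grid: every month combined with every shop and every item.
grid = pd.DataFrame(
    list(product(range(3), sales.shop_id.unique(), sales.item_id.unique())),
    columns=cols,
)

# Monthly totals, left-joined onto the grid; missing combinations become 0.
monthly = (
    sales.groupby(cols, as_index=False)["item_cnt_day"].sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)
grid = grid.merge(monthly, on=cols, how="left")
grid["item_cnt_month"] = grid["item_cnt_month"].fillna(0)

print(grid)
```

The grid has 3 x 2 x 2 = 12 rows, of which only two carry a non-zero monthly count. This is what gives every time series the same, evenly spaced time axis.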

Create a test set for month 34.

In [None]:
test["date_block_num"] = 34
test["date_block_num"] = test["date_block_num"].astype(np.int8)
test["shop_id"] = test.shop_id.astype(np.int8)
test["item_id"] = test.item_id.astype(np.int16)

Concatenate train and test sets.

In [None]:
ts = time.time()

matrix = pd.concat([matrix, test.drop(["ID"],axis = 1)], ignore_index=True, sort=False, keys=cols)
matrix.fillna( 0, inplace = True )
time.time() - ts

# Feature engineering 

For feature engineering we will use the features generated by the tsfresh package. As this notebook is designed to only show the basic approach of using this package and as we deal with quite a lot of data here, we will sample a subset of the whole data to show how to use the package. 

First we have to bring the data into the format that tsfresh requires. 

That is, we need to have a dataframe with at least three columns:
- id column identifying one time series, in our case the combination of shop id and item id
- time column identifying the time step, in our case date block num
- observation value: in our case the sales of the month

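Let's create an ID column that unambiguously identifies each combination of item_id and shop_id. Both ids are zero-padded to a fixed width so that different combinations cannot produce the same string.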

In [None]:
ts = time.time()


matrix['shop_id_even'] = matrix.shop_id.astype(str).str.zfill(2)
matrix['item_id_even'] = matrix.item_id.astype(str).str.zfill(5)
matrix['id'] = matrix.shop_id_even + matrix.item_id_even

time.time() - ts
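
Why the zero-padding matters can be shown with two made-up shop and item pairs. This is a plain Python sketch; the ids are hypothetical.

```python
# Without padding, two different (shop, item) pairs can collide.
unpadded_a = str(1) + str(23)    # shop 1, item 23
unpadded_b = str(12) + str(3)    # shop 12, item 3
print(unpadded_a == unpadded_b)

# With fixed widths (2 digits for the shop, 5 for the item) the ids differ.
padded_a = str(1).zfill(2) + str(23).zfill(5)
padded_b = str(12).zfill(2) + str(3).zfill(5)
print(padded_a, padded_b)

# A fixed-width id can also be split back by position.
shop_a, item_a = int(padded_a[:2]), int(padded_a[2:])
print(shop_a, item_a)
```

The fixed widths are also what allows the id to be split back into shop_id and item_id later in the notebook.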

Next, let's downsample our data and only look at 15 different shops and 25 items.

In [None]:
shops = random.sample(list(matrix.shop_id.unique()), k = 15)  # comment out if you want to run for the whole data set
items = random.sample(list(matrix.item_id.unique()), k = 25)  # comment out if you want to run for the whole data set

matrix_small = matrix.loc[(matrix.shop_id.isin(shops)) & (matrix.item_id.isin(items))]
df = matrix_small[['id', 'date_block_num', 'item_cnt_month']]
df.info()

Next, we slide a window over every time series in the dataframe (max_timeshift=6). tsfresh provides a very useful function, roll_time_series, that generates the new dataframe for us. It assigns every window a new id that is a tuple of a) the id we created and b) the maximum date_block_num contained in the window.

In [None]:
from tsfresh.utilities.dataframe_functions import roll_time_series
df_rolled = roll_time_series(df, "id", "date_block_num", max_timeshift=6)

In [None]:
df_rolled.head()
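
To build some intuition for what rolling does, here is a hand-written approximation in plain pandas. The helper roll_by_hand and its window parameter are hypothetical and are not tsfresh code; instead of a tuple id, the sketch stores the last month of each window in a separate window_end column.

```python
import pandas as pd

def roll_by_hand(df, id_col, time_col, window):
    """Copy the trailing `window` rows for every time step of every series.

    Hypothetical helper for illustration only; not part of tsfresh.
    """
    pieces = []
    for _, group in df.groupby(id_col):
        group = group.sort_values(time_col)
        for end in range(1, len(group) + 1):
            start = max(end - window, 0)
            piece = group.iloc[start:end].copy()
            # Label the window with the last time step it contains.
            piece["window_end"] = piece[time_col].iloc[-1]
            pieces.append(piece)
    return pd.concat(pieces, ignore_index=True)

toy = pd.DataFrame({
    "id": ["A"] * 4,
    "date_block_num": [0, 1, 2, 3],
    "item_cnt_month": [1.0, 0.0, 2.0, 3.0],
})
rolled = roll_by_hand(toy, "id", "date_block_num", window=2)
print(rolled)
```

Every original row now appears in several windows, which is why the rolled dataframe is much larger than the input and why this step is expensive on the full data set.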

Next, we let tsfresh work its magic and create the features for the time series. In total, 779 features are created for our time series.

In [None]:
from tsfresh import extract_features
df_rolled['item_cnt_month'] = df_rolled.item_cnt_month.astype(float) # somehow tsfresh cannot cope with float16 values
df_features = extract_features(df_rolled, column_id="id", column_sort="date_block_num")
df_features.shape
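
Conceptually, feature extraction computes one row of summary statistics per window. The following sketch does this by hand for three simple statistics on a hypothetical rolled dataframe; tsfresh computes far more features, and the column names here only loosely imitate its naming.

```python
import pandas as pd

# Hypothetical rolled frame: two windows of one series,
# identified by the series id and the last month of the window.
rolled = pd.DataFrame({
    "id": ["A", "A", "A", "A"],
    "window_end": [1, 1, 2, 2],
    "item_cnt_month": [1.0, 3.0, 3.0, 5.0],
})

# One row per window, one column per statistic.
features = (
    rolled.groupby(["id", "window_end"])["item_cnt_month"]
    .agg(["mean", "max", "sum"])
    .add_prefix("item_cnt_month__")
)
print(features)
```

The result is indexed by (id, window_end), which mirrors the two-level index that is reset and split up further below.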

In [None]:
print(*df_features.columns, sep="\n")

Arguably, that is too many columns, and not all features contain values because we are looking at very short time windows.

In [None]:
df_features.isna().sum().sort_values()

In [None]:
# Therefore, we keep only the features with less than 10% missing values
df_features = df_features.loc[:, df_features.isnull().mean() < .1]
df_features.shape
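
The column filter works because the mean of a boolean mask is the share of True values. A small sketch on an invented dataframe with three columns of different sparsity:

```python
import numpy as np
import pandas as pd

# Hypothetical feature table with 20 rows.
toy = pd.DataFrame({
    "dense":  np.arange(20, dtype=float),        # no missing values
    "sparse": [np.nan] * 5 + list(range(15)),    # 25% missing
    "almost": [np.nan] + list(range(19)),        # 5% missing
})

# Share of missing values per column.
nan_share = toy.isnull().mean()
print(nan_share)

# Keep only the columns with less than 10% missing values.
kept = toy.loc[:, nan_share < 0.1]
print(list(kept.columns))
```

The threshold of 10% is the one used in this notebook; a stricter or looser value is a judgment call.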

That leaves us with 208 features. Not bad.

Next, we merge these features back onto our dataframe to create the training dataframe.

In [None]:
df_features.reset_index(inplace=True)
df_features.head()

In [None]:
# let's split the id back into shop id and item id so that we can merge it onto the original dataframe
df_features['shop_id'] = df_features['level_0'].str[:2]
df_features['item_id'] = df_features['level_0'].str[2:]

df_features['item_id'] = df_features.item_id.astype(int)
df_features['shop_id'] = df_features.shop_id.astype(int)

df_features = df_features.rename(columns={'level_1': 'date_block_num'})
df_features.drop(columns=['level_0'], inplace=True)

In [None]:
# shift the features by one month so that they act as lag features:
# the window ending in month t is used to predict month t + 1
df_features['date_block_num'] += 1
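
The shift by one month is what keeps the target out of its own features. Here is a minimal sketch with a single invented series and a hypothetical window_mean feature computed over a two-month window:

```python
import pandas as pd

# Hypothetical monthly targets for one series.
target = pd.DataFrame({
    "date_block_num": [0, 1, 2, 3],
    "item_cnt_month": [1.0, 3.0, 5.0, 7.0],
})

# Hypothetical window features, keyed by the last month inside each window.
feats = pd.DataFrame({
    "date_block_num": [0, 1, 2],
    "window_mean": [1.0, 2.0, 4.0],   # mean of the last two months
})

# Shift by one month: the window ending in month t describes month t + 1.
feats["date_block_num"] += 1
merged = target.merge(feats, on="date_block_num", how="left")
print(merged)
```

After the merge, every row only sees statistics from earlier months, and the first month has no features at all.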

In [None]:
# non-inplace drop avoids modifying a slice of matrix (SettingWithCopyWarning)
matrix_small = matrix_small.drop(columns=['item_id_even', 'id', 'shop_id_even'])
data = matrix_small.merge(df_features, on=["item_id", "shop_id", "date_block_num"], how="left")

# Modelling

In [None]:
import gc
import pickle
from xgboost import XGBRegressor
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

In [None]:
data[data["date_block_num"]==34].shape

Use month 33 as the validation set and month 34 as the test set.

In [None]:
X_train = data[data.date_block_num < 33].drop(['item_cnt_month'], axis=1)
Y_train = data[data.date_block_num < 33]['item_cnt_month']
X_valid = data[data.date_block_num == 33].drop(['item_cnt_month'], axis=1)
Y_valid = data[data.date_block_num == 33]['item_cnt_month']
X_test = data[data.date_block_num == 34].drop(['item_cnt_month'], axis=1)
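
The same time-based split can be written as a small function. The helper split_by_month is hypothetical and the data is invented; it only restates the logic of the cell above.

```python
import pandas as pd

def split_by_month(frame, valid_month, test_month, target="item_cnt_month"):
    """Hypothetical helper: all months before valid_month are training data."""
    train_part = frame[frame.date_block_num < valid_month]
    valid_part = frame[frame.date_block_num == valid_month]
    test_part = frame[frame.date_block_num == test_month]
    return (
        train_part.drop(columns=target), train_part[target],
        valid_part.drop(columns=target), valid_part[target],
        test_part.drop(columns=target),
    )

# Invented data: the target of month 34 is unknown and filled with 0.
data_toy = pd.DataFrame({
    "date_block_num": [31, 32, 33, 34],
    "feature": [0.1, 0.2, 0.3, 0.4],
    "item_cnt_month": [1.0, 2.0, 3.0, 0.0],
})
X_tr, y_tr, X_va, y_va, X_te = split_by_month(data_toy, valid_month=33, test_month=34)
print(X_tr.date_block_num.tolist(), X_va.date_block_num.tolist(), X_te.date_block_num.tolist())
```

Splitting by month rather than at random keeps the validation month strictly after all training months.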

In [None]:
Y_train = Y_train.clip(0, 20)
Y_valid = Y_valid.clip(0, 20)

In [None]:
del data
gc.collect();

In [None]:
ts = time.time()

model = XGBRegressor(
    n_estimators=1000
)

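# Note: recent XGBoost releases expect eval_metric and early_stopping_rounds
# in the XGBRegressor constructor rather than in fit()
model.fit(
    X_train, 
    Y_train, 
    eval_metric="rmse", 
    eval_set=[(X_train, Y_train), (X_valid, Y_valid)], 
    verbose=True, 
    early_stopping_rounds = 20)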

time.time() - ts

In [None]:
Y_pred = model.predict(X_valid).clip(0, 20)
Y_test = model.predict(X_test).clip(0, 20)

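# attach the predictions without modifying a slice of data in place
X_test = X_test.assign(pred=Y_test)
# shops and items outside the subsample get no prediction (NaN) from this left join
test = test.merge(X_test[['shop_id', 'item_id', 'pred']], on=['shop_id', 'item_id'], how="left")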

submission = pd.DataFrame({
    "ID": test.ID, 
    "item_cnt_month": test['pred']
})
submission.to_csv('xgb_submission.csv', index=False)

In [None]:
import shap
shap.initjs()

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid, approximate=True)

shap.summary_plot(shap_values, X_valid)