**Table of contents**  
[1. Introduction](#intro)  
[2. Data Preprocessing and cleaning](#data_preprocessing)  
[3. Temporal Feature Analysis](#temporal_analysis)  
[4. Individual Feature Analysis](#individual_analysis)  
[5. Feature Engineering](#feature_engineering)  
[6. Final Dataset](#finaldataset)  
[7. Regression](#regression)

In [None]:
!pip install -U seaborn # seaborn>=0.11

<a id="intro"></a>
# 1. Introduction
This is a time-series problem, so the data is ordered. Time-based features are probably crucial for achieving good performance. We may also want to investigate seasonality-based features, since this is a sales dataset.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize']=(23,8)
def twosubplots(figsize=(23,8)):
    return plt.subplots(1,2,figsize=figsize)[1]

!ls /kaggle/input/*
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Let's load the dataset. I am using a supplementary dataset in which the item categories are translated to English.

In [None]:
ROOT_DIR = '/kaggle/input/competitive-data-science-predict-future-sales'
ROOT_DIR_EN = '/kaggle/input/predict-future-sales-supplementary'
D = pd.read_csv(ROOT_DIR+ '/sales_train.csv')
Dtest = pd.read_csv(ROOT_DIR+ '/test.csv')
D_itemcategory = pd.read_csv(ROOT_DIR_EN+'/item_category.csv')

print("train dataset has %d rows" % len(D))
D=D.drop('date',axis=1) # dropping date column.
#D = pd.merge(D, D_item[['item_id','item_category_id', 'item_name']], on='item_id')
D = pd.merge(D,D_itemcategory, on='item_id')
D = D.rename(columns={'item_name_translated':'item_name'})

Dtest = pd.merge(Dtest,D_itemcategory, on='item_id')
Dtest = Dtest.rename(columns={'item_name_translated':'item_name'})
Dtest['date_block_num']=34

display(D.head(3))
display(Dtest.head(3))

<a id="data_preprocessing"></a>
# 2. Data Preprocessing, cleaning and Outliers removal
Let's look at the data first.

In [None]:
import pandas_profiling # library for automatic EDA
report = pandas_profiling.ProfileReport(D)

In [None]:
display(report)

* There are no missing values.
* item_price has negative values. These may represent returned items or something similar.
* There are only 60 shop ids.
* item_price and item_cnt_day probably have outliers.

## 2.1 Data cleaning
Let's see which item ids are common to both the train and test datasets. Before that, let's look at items that share the same translated name.

In [None]:
D_category_dup = D_itemcategory[D_itemcategory['item_name_translated'].duplicated(keep=False)]
D_category_dup.sort_values('item_name_translated')

There are about 20 duplicated names. I will use the one with the lowest item_id.

In [None]:
dftmp = D_category_dup.groupby('item_name_translated')['item_id'].min()
item_id_mapping = {itemid:dftmp[iname] for _,itemid,iname in D_category_dup[['item_id','item_name_translated']].itertuples()}
D=D.replace({'item_id':item_id_mapping})
Dtest=Dtest.replace({'item_id':item_id_mapping})
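
The same kind of mapping can be sketched on toy data with `groupby(...).transform('min')`, which avoids the explicit loop. The frame `items` and its values below are made up for illustration; only the column names follow `D_itemcategory`.

```python
import pandas as pd

# Toy stand-in for D_itemcategory; the values are invented for this sketch.
items = pd.DataFrame({
    'item_id': [10, 11, 12, 13],
    'item_name_translated': ['Game A', 'Game B', 'Game A', 'Game C'],
})

# For every row, the lowest item_id that shares the same translated name.
canonical = items.groupby('item_name_translated')['item_id'].transform('min')

# Keep only the ids that actually change, as a plain {old_id: new_id} dict.
changed = canonical != items['item_id']
toy_mapping = dict(zip(items.loc[changed, 'item_id'], canonical[changed]))
print(toy_mapping)
```

Only ids that differ from their canonical id end up in the dict, so passing it to `replace` leaves every other item untouched.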

There may be item_ids present in the test set but not in the train set. Let's see which ones:

In [None]:
Duniq = D['item_id'].unique()
Dtest_uniq = Dtest['item_id'].unique()

item_id_inter = np.intersect1d(Duniq,Dtest_uniq)

len(item_id_inter),len(Duniq),len(Dtest_uniq)

In [None]:
new_items_test = D_itemcategory[D_itemcategory['item_id'].isin(np.setdiff1d(Dtest_uniq,Duniq))]
new_items_test

Most items in the test dataset are also in the train dataset. There are 361 new items in the test dataset that are not present in the train dataset.
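
As a minimal sketch of the set logic used above, on made-up ids: `np.intersect1d` gives the ids shared by both sets, and `np.setdiff1d(test, train)` gives the ids that only appear in the test set. The arrays below are illustrative and not taken from the competition data.

```python
import numpy as np

# Made-up ids, standing in for D['item_id'] and Dtest['item_id'].
train_ids = np.array([1, 2, 3, 3, 5])
test_ids = np.array([2, 3, 7, 8])

shared = np.intersect1d(train_ids, test_ids)      # ids present in both sets
new_in_test = np.setdiff1d(test_ids, train_ids)   # ids that only occur in test

# Fraction of distinct test ids that never occur in the training data.
share_new = len(new_in_test) / len(np.unique(test_ids))
print(shared, new_in_test, share_new)
```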

In [None]:
sns.histplot(new_items_test, x='item_cat1');

Most of them really seem to be new products launched only in the month of the test dataset. I need to remember this later, in the feature engineering phase.

There are no missing values, but the histogram for item_price looks really strange. This is investigated next.

## 2.2 item_price

It seems reasonable to add the average item price as a column.

In [None]:
Dgtmp = D.groupby(['item_id'])['item_price'].mean()
Dgtmp.name='avg_item_price'
D = pd.merge(D,Dgtmp, on='item_id')
display(D.head(3))

In [None]:
sns.boxplot(y=D['item_price']);
D['item_price'].describe()

There is a single row with a negative value for item_price. There is also a huge value for item_price. Let's first remove the negative value and then investigate the huge one.

In [None]:
D=D[D['item_price']>0]
D[D['item_price']>300000]

What is the average price for item id 6066?

In [None]:
D[D['item_id']==6066]['item_price'].describe()

There is only one row for this item. I am getting rid of it.

In [None]:
D=D[D['item_price']<300000];

It seems reasonable to check whether some items are sold far above their average price:

In [None]:
D['item_price_relative'] = D['item_price']/D['avg_item_price']
sns.boxplot(y=D['item_price_relative']);

Wow, there are items being sold for more than 10 times their average price! Let's investigate what those items are.

In [None]:
dftmp = D[D['item_price_relative']>=10].sort_values(by='item_price_relative', ascending=False)
dftmp

## 2.3 Data preparation
The test dataset is concerned with monthly item sales, not daily ones. We will process the train data so that it becomes similar to the test dataset. We will lose the day information, but can reinsert it later somehow, if necessary.

In [None]:
Dgroup = D.groupby(['date_block_num','shop_id','item_id','item_cat1','item_cat2'])
D = Dgroup.agg({'item_cnt_day':'sum',
                 'item_price':'mean'})
D = D.rename({'item_cnt_day':'item_cnt_month'}, axis=1).reset_index()
Dgroup = D.groupby('item_id')['item_price'].mean()
Dgroup.name = 'avg_item_price'
D = D.merge(Dgroup, on='item_id')
D.head(3)
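
The daily-to-monthly aggregation can be illustrated on a tiny made-up frame. This is only a sketch: the `daily` values are invented, and it uses pandas named aggregation instead of the dict form used in the cell above.

```python
import pandas as pd

# Invented daily sales rows; column names follow the train dataset.
daily = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1],
    'shop_id':        [5, 5, 5, 5],
    'item_id':        [7, 7, 8, 7],
    'item_cnt_day':   [1, 2, 1, 4],
    'item_price':     [100.0, 120.0, 50.0, 110.0],
})

# One row per (month, shop, item): counts are summed, prices are averaged.
monthly = (daily
           .groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)
           .agg(item_cnt_month=('item_cnt_day', 'sum'),
                item_price=('item_price', 'mean')))
print(monthly)
```

One thing to keep in mind: `groupby` drops rows whose group key is NaN by default, so a key column with missing values silently loses rows unless `dropna=False` is passed (available from pandas 1.1).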

In [None]:
D['item_price_relative'] = D['item_price']/D['avg_item_price']
D.head(3)

In [None]:
sns.boxplot(y=D['item_price_relative']);

In [None]:
dftmp = D[D['item_price_relative']>=8].sort_values(by='item_price_relative', ascending=False)
dftmp

* item_price: the average price of a specific item in a specific shop for that month.
* avg_item_price: the average price of a specific item, averaged over all months and shops.
* item_price_relative: item_price divided by avg_item_price.
* item_cnt_month: the target variable.

<a id="temporal_analysis"></a>
# 3. Temporal Feature Analysis
## 3.1 item_cnt_month x time

## date_block_num
Looking at the histogram of date_block_num, we can clearly see a seasonality (12-month period) in the number of items sold per month. The peak seems to be in December. Let's investigate:

In [None]:
ax1,ax2 = twosubplots()
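# Select the numeric column before aggregating: D also holds string columns
# (item_cat1, item_cat2), which cannot be averaged.
dftmp=D.groupby(['date_block_num'])[['item_cnt_month']].sum()
sns.barplot(data=dftmp,x=dftmp.index, y='item_cnt_month', ax=ax1);
dftmp=D.groupby(['date_block_num'])[['item_cnt_month']].mean()
sns.barplot(data=dftmp,x=dftmp.index, y='item_cnt_month', ax=ax2);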
ax2.set_ylabel("Average item_cnt_month");

In the graph we see a seasonality and a decreasing trend in the total number of items sold per month. However, there is no trend in the average number of items sold per row.
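
To check which calendar month carries the peak, `date_block_num` can be folded into a month number, assuming block 0 is January 2013 (the same convention used later for the `month` and `year` features). The totals below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly totals for 24 consecutive blocks (two years).
totals = pd.Series([100, 90, 95, 92, 91, 93, 94, 96, 98, 99, 110, 150,
                    97, 88, 90, 89, 87, 86, 88, 90, 91, 93, 104, 140],
                   name='item_cnt_month')
totals.index.name = 'date_block_num'

# 1 = January ... 12 = December, assuming block 0 is a January.
calendar_month = np.asarray(totals.index) % 12 + 1

# Average the totals of the same calendar month across years.
by_month = totals.groupby(calendar_month).mean()
peak_month = by_month.idxmax()
print(by_month)
```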

## item_price x time

In [None]:
ax1,ax2 = twosubplots()
dftmp=D.groupby(['date_block_num'])[['item_price']].sum() # numeric column only
sns.barplot(data=dftmp,x=dftmp.index, y='item_price', ax=ax1);
ax1.set_ylabel('Sum of item_price') # not weighted by item_cnt_month, so only a rough proxy for money spent
dftmp=D.groupby(['date_block_num'])[['item_price']].mean()
sns.barplot(data=dftmp,x=dftmp.index, y='item_price', ax=ax2);
ax2.set_ylabel('Average item_price');

The average item price is increasing with time, while the sum of item prices (a rough proxy for the total money spent, since it is not weighted by item_cnt_month) remains about the same, except in November, December and January. Maybe people are buying more expensive stuff. We will see this later.

## 3.2 shop
Let's see the two shops with the most items sold in each month.

In [None]:
dftmp = D.groupby(['date_block_num','shop_id'])[['item_cnt_month']].sum().sort_values(['date_block_num','item_cnt_month'], ascending=False) #.groupby('date_block_num').max()
#dftmp = dftmp[dftmp.groupby('date_block_num').idxmax()].reset_index()
dftmp = dftmp.reset_index().groupby(['date_block_num']).nth([0,1]).reset_index()
dftmp = D[D['shop_id'].isin(dftmp['shop_id'].unique())]
dftmp = dftmp.groupby(['date_block_num','shop_id'])['item_cnt_month'].sum().reset_index()
sns.barplot(data=dftmp,x='date_block_num',y='item_cnt_month',hue='shop_id');

We can see that shops 25, 28, 31 and 54 sell a lot. Shop 31 is the shop with the highest item counts, except in the last month, where shop 25 sold a little more. Note that shop 25 is almost always the shop with the second most items sold.
Interestingly, shop 54 disappeared after month 27.
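
The "top two shops per month" selection can also be written with a sort followed by `groupby(...).head(2)`. The sketch below uses made-up totals for three shops and is an alternative formulation, not the exact code of the cell above.

```python
import pandas as pd

# Invented monthly totals per shop.
sales = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1, 1, 1],
    'shop_id':        [25, 31, 54, 25, 31, 54],
    'item_cnt_month': [50, 80, 60, 70, 65, 10],
})

# Sort so the best-selling shops come first within each month,
# then keep the first two rows of every month.
top2 = (sales
        .sort_values(['date_block_num', 'item_cnt_month'], ascending=[True, False])
        .groupby('date_block_num')
        .head(2))

# Every shop that was in the top two at least once.
top_shops = sorted(top2['shop_id'].unique())
print(top2)
```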

## 3.3 Item_id


In [None]:
dftmp = D.groupby(['date_block_num','item_id'])[['item_cnt_month']].sum().sort_values(['date_block_num','item_cnt_month'], ascending=False) #.groupby('date_block_num').max()
#dftmp = dftmp[dftmp.groupby('date_block_num').idxmax()].reset_index()
dftmp = dftmp.reset_index().groupby(['date_block_num']).nth(0).reset_index()
most_sold_items = dftmp['item_id'].unique()
dftmp = D[D['item_id'].isin(most_sold_items)]
dftmp = dftmp.groupby(['date_block_num','item_id'])['item_cnt_month'].sum().reset_index()
sns.barplot(data=dftmp,x='date_block_num',y='item_cnt_month',hue='item_id');
#sns.barplot(data=dftmp,x='date_block_num',y='item_cnt_month',hue='item_id');


In [None]:
D_itemcategory[D_itemcategory['item_id'].isin(most_sold_items)]

<a id="individual_analysis"></a>
# 4. Individual feature analysis
Let's start by analysing individual information about each feature, such as the data type, unique values, mean, max, etc.

## 4.1 Item_price

Items priced below their average and/or at a low price are probably sold in larger quantities. Let's investigate:

In [None]:
ax1,ax2 = twosubplots()
sns.scatterplot(data=D, x='item_price_relative', y='item_cnt_month', ax=ax1);
sns.scatterplot(data=D, x='item_price', y='item_cnt_month', ax=ax2);

As can be seen, items with a high item_cnt_month have a low price and a low relative price. However, it is really strange that item_price_relative can reach values of 10 or above.

In [None]:
#sns.histplot(data=D,x='item_price_relative', hue=pd.cut(D['item_cnt_month'], bins=[-24,0,1,2,3,5,128,2000]));

Most items sold, including those with a high item_cnt_month, have item_price_relative close to 1.

<a id="feature_engineering"></a>
# 5. Feature Engineering

In [None]:
#D.to_csv('data_train1.csv') # backup
#Dtest.to_csv('data_test1.csv')

In [None]:
#pd.read_csv('data_train1.csv', index_col=0)

In [None]:
import gc
dftmp=None; Dgroup=None
gc.collect()

Concatenating the training dataset with the test dataset so that feature engineering is done on both at the same time.

In [None]:
D = pd.concat([D, Dtest], ignore_index=True) # DataFrame.append was removed in pandas 2.0
train_idxs = D['date_block_num']<=33
len(D)

In [None]:
D['item_cat2']=D['item_cat2'].fillna('na')

Clip item counts to the range [0,20], since the test dataset targets are already clipped to [0,20].

In [None]:
D['item_cnt_month'] = D['item_cnt_month'].clip(0,20) # assign back instead of using inplace on a column selection

## Date features

Adding month, year and number of days in month as features.

In [None]:
from calendar import monthrange
D['month'] = D['date_block_num'] % 12
D['year'] = D['date_block_num']// 12 + 2013
D['num_days_month'] = D.apply(lambda x:monthrange(x['year'], x['month']+1)[1],axis=1)
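
The row-wise `apply` above can be slow on a large frame. A vectorised alternative, sketched here on a few made-up block numbers, builds the first day of each month and reads `days_in_month` from it. The series `blocks` is illustrative; block 37 lies outside the dataset and is only there to show a leap-year February.

```python
import pandas as pd

# A few illustrative block numbers (0 = January 2013).
blocks = pd.Series([0, 1, 34, 37], name='date_block_num')

month = blocks % 12            # 0-based month, as in the cell above
year = blocks // 12 + 2013

# First day of every month as a datetime, then its number of days.
first_day = pd.to_datetime(pd.DataFrame({'year': year, 'month': month + 1, 'day': 1}))
num_days_month = first_day.dt.days_in_month
print(num_days_month.tolist())
```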

## 5.1 All-time based features
These features are made based on all months.

### 5.1.1 Absolute item count and item price for each group

In [None]:
group_keys =[['shop_id'],
            ['item_id'],
            ['item_cat1'],
            ['item_cat2'],
            ['shop_id','item_cat1'],
            ['shop_id','item_cat2'],
            ['shop_id','item_id']]

new_alltimefeatures_names = ['shop', 'itemid','itemcat1','itemcat2','shop-itemcat1','shop-itemcat2', 'shop-itemid']
new_alltimefeatures_cnt_names = ["alltime_%s_cnt" % name for name in new_alltimefeatures_names] # inserts a prefix and a suffix
new_alltimefeatures_price_names = ["alltime_%s_avgprice" % name for name in new_alltimefeatures_names] # inserts a prefix and a suffix


for name,k in zip(new_alltimefeatures_cnt_names,group_keys):
    dftmp = D.groupby(k)['item_cnt_month'].mean()#sum()
    dftmp.name = name
    D = D.merge(dftmp, on=k, how='left')
    
for name,k in zip(new_alltimefeatures_price_names,group_keys):
    dftmp = D.groupby(k)['item_price'].mean()
    dftmp.name = name
    D = D.merge(dftmp, on=k, how='left')
D[['shop_id','item_id']+new_alltimefeatures_cnt_names+new_alltimefeatures_price_names]
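
For a single key set, the group-then-merge pattern above is equivalent to `groupby(...).transform('mean')`, which returns a value aligned with every original row. A minimal sketch on invented data follows; the row with a missing count mimics a test row, which is skipped when computing the mean.

```python
import pandas as pd

# Invented rows; the last one has no target, like a test row.
toy = pd.DataFrame({
    'shop_id':        [1, 1, 2, 2],
    'item_id':        [7, 8, 7, 7],
    'item_cnt_month': [2.0, 5.0, 6.0, None],
})

# Mean target per shop and per item, broadcast back to every row.
toy['alltime_shop_cnt'] = toy.groupby('shop_id')['item_cnt_month'].transform('mean')
toy['alltime_itemid_cnt'] = toy.groupby('item_id')['item_cnt_month'].transform('mean')
print(toy)
```

Note that these averages use every month, including the one being predicted during validation, so validation scores computed with them can look more optimistic than they should.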

## 5.2 Time-series based features or lag features
### 5.2.1 First time appeared 
Items tend to sell more around their launch date, so we add the number of months since each item first appeared.

In [None]:
Itemfirst = D.groupby(['item_id'])['date_block_num'].min()
Itemfirst.name = 'item_id_first_time'
D = pd.merge(D,Itemfirst,on='item_id')
D['item_id_first_time'] = D['date_block_num']-D['item_id_first_time']
#sns.boxplot(D['item_id_first_time_lag']);
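
What this feature holds can be seen on a toy frame: for each row, the number of months elapsed since the item's first appearance. The sketch uses `transform('min')` in place of the merge, and the data is made up.

```python
import pandas as pd

# Invented rows: item 1 first appears in month 0, item 2 in month 2.
toy = pd.DataFrame({
    'item_id':        [1, 1, 1, 2, 2],
    'date_block_num': [0, 1, 2, 2, 3],
})

# First month in which each item occurs, aligned with every row.
first_seen = toy.groupby('item_id')['date_block_num'].transform('min')

# Months since launch: 0 on the launch month, 1 the month after, and so on.
toy['item_id_first_time'] = toy['date_block_num'] - first_seen
print(toy)
```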

### 5.2.2 Month Lag Features

Features based on the past months.

In [None]:
cnt_keys = [[],
            ['shop_id'],
            ['item_id'],
            ['item_cat1'],
            ['item_cat2'],
            ['shop_id','item_cat1'],
            ['shop_id','item_cat2']]

new_monthfeatures_names = ['','shop', 'itemid','itemcat1','itemcat2','shop-itemcat1','shop-itemcat2']
new_monthfeatures_cnt_names = ["month_%s_cnt" % name for name in new_monthfeatures_names] # inserts a prefix and a suffix
new_monthfeatures_price_names = ["month_%s_avgprice" % name for name in new_monthfeatures_names] # inserts a prefix and a suffix

In [None]:
LAGS = [1,2,3,12]
D.sort_values('date_block_num', inplace=True)

lag_features = []
for feature_name,k in zip(new_monthfeatures_cnt_names,cnt_keys):
    for lag in LAGS:
        dftmp = D.groupby(k+['date_block_num'])['item_cnt_month'].mean()#.sum()
        if(len(k)==0):
            dftmp = dftmp.shift(lag)
        else:
            dftmp = dftmp.groupby(k).shift(lag)
        dftmp.name = '%s-lag%d' % (feature_name,lag)
        lag_features.append(dftmp.name)
        D = D.merge(dftmp,how='left',on=k+['date_block_num'])
        
for feature_name,k in zip(new_monthfeatures_price_names,cnt_keys):
    for lag in LAGS:
        dftmp = D.groupby(k+['date_block_num'])['item_price'].mean()
        if(len(k)==0):
            dftmp = dftmp.shift(lag)
        else:
            dftmp = dftmp.groupby(k).shift(lag)
        dftmp.name = '%s-lag%d' % (feature_name,lag)
        lag_features.append(dftmp.name)
        D = D.merge(dftmp,how='left',on=k+['date_block_num'])
    
D[(D['shop_id']==24) & (D['item_id']==32)]
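
One caveat of `groupby(k).shift(lag)` is that it shifts by rows, not by months: if a group has no rows for some month, the "lag 1" value comes from the previous available month instead. A gap-safe alternative is to add the lag to `date_block_num` and merge. The helper `add_lag` below is a hypothetical sketch, not part of the notebook, and assumes one row per key and month.

```python
import pandas as pd

# Invented monthly values; shop 2 has no row for month 1.
monthly = pd.DataFrame({
    'shop_id':        [1, 1, 1, 2, 2],
    'date_block_num': [0, 1, 2, 0, 2],
    'item_cnt_month': [10.0, 11.0, 12.0, 20.0, 22.0],
})

def add_lag(frame, keys, value, lag):
    """Attach `value` from exactly `lag` months earlier (hypothetical helper)."""
    past = frame[keys + ['date_block_num', value]].copy()
    past['date_block_num'] += lag      # month t becomes visible at month t + lag
    past = past.rename(columns={value: '%s-lag%d' % (value, lag)})
    return frame.merge(past, how='left', on=keys + ['date_block_num'])

lagged = add_lag(monthly, ['shop_id'], 'item_cnt_month', 1)

# Row-based shift for comparison: shop 2, month 2 receives the month-0 value.
row_shift = monthly.groupby('shop_id')['item_cnt_month'].shift(1)
print(lagged.assign(row_shift=row_shift))
```

In the merged version, shop 2 gets a missing lag for month 2 because month 1 does not exist, whereas the row-based shift fills it with the value from month 0.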

### 5.2.3 Relative features
Features that are the proportion/ratio of two features.

In [None]:
prices_features_lag = [fname for fname in lag_features if 'price' in fname and 'item' in fname]
prices_features_lag_relative_names = ['%s_relative' % fname for fname in prices_features_lag]
for fname,new_fname in zip(prices_features_lag,prices_features_lag_relative_names):
    D[new_fname] = D[fname]/D['alltime_itemid_avgprice']
D.sample(3)

### Other features

In [None]:
from sklearn.preprocessing import LabelEncoder

LE = LabelEncoder()
D['item_cat1'] = LE.fit_transform(D['item_cat1'])
D['item_cat2'] = LE.fit_transform(D['item_cat2'])

<a id="finaldataset"></a>
# 6. Final dataset

In [None]:
final_features = ['date_block_num','month','year','num_days_month','item_cnt_month','item_id_first_time','item_cat1','item_cat2','ID']
final_features += lag_features
final_features += new_alltimefeatures_cnt_names + new_alltimefeatures_price_names
final_features += prices_features_lag_relative_names
final_features

Filling NA values and saving the data. The first three months (date_block_num<3) are not kept, since not enough lag history is available for them.

In [None]:
D=D[final_features]
D=D.fillna(0)
Dtrain=D[(D['date_block_num']<=33) & (D['date_block_num']>=3)]
assert(len(D[D['date_block_num']==34]) == len(Dtest))
Dtest = D[D['date_block_num']==34].copy() # copy, so the prediction column can be assigned later without a SettingWithCopyWarning
#Dtrain.to_csv('date_train_final.csv', index=False)
#Dtest.to_csv('date_test_final.csv', index=False)

<a id="regression"></a>
# 7. Regression

Separating features from target label:

In [None]:
#Dtrain=Dtrain[Dtrain['date_block_num']<=7]
X = Dtrain.drop(['item_cnt_month','ID'], axis=1)
Y = Dtrain['item_cnt_month']
X.shape

Splitter to be used for cross-validation on time-ordered datasets.

In [None]:
class TimeSplitter:
    def __init__(self, n_splits=1):
        self.n_splits=n_splits
        
    def split(self,X,y=None,groups=None):
        date_block_num = X['date_block_num']
        max_date = date_block_num.max()
        for i in range(self.n_splits):
            test_date=max_date-i
            train_idxs, = np.where(date_block_num<test_date)
            test_idxs, = np.where(date_block_num==test_date)
            yield train_idxs,test_idxs
        
    def get_n_splits(self,X=None,y=None,groups=None):
        return self.n_splits
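
A small usage sketch shows which indices the splitter yields. The class is repeated here only so that the snippet runs on its own; the frame `X_toy` is made up.

```python
import numpy as np
import pandas as pd

class TimeSplitter:
    """Same logic as the class above, repeated so this snippet is self-contained."""
    def __init__(self, n_splits=1):
        self.n_splits = n_splits

    def split(self, X, y=None, groups=None):
        date_block_num = X['date_block_num']
        max_date = date_block_num.max()
        for i in range(self.n_splits):
            test_date = max_date - i
            train_idxs, = np.where(date_block_num < test_date)
            test_idxs, = np.where(date_block_num == test_date)
            yield train_idxs, test_idxs

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

# Six rows spread over months 0..3.
X_toy = pd.DataFrame({'date_block_num': [0, 0, 1, 1, 2, 3]})

# Each split tests on one month and trains on all strictly earlier months.
splits = [(train.tolist(), test.tolist())
          for train, test in TimeSplitter(n_splits=2).split(X_toy)]
print(splits)
```

The first split validates on the latest month, and each further split steps one month back, so no split ever trains on data from the future.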
        

Stacking algorithm. It did not work quite as well for me as XGBoost.

In [None]:
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import LinearRegression

class TimeStackingRegressor(BaseEstimator):
    def __init__(self, estimators, final_estimator=LinearRegression(), n_splits=1, passthrough=False):
        self.estimators=estimators
        self.final_estimator = final_estimator
        self.n_splits = n_splits
        self.passthrough=passthrough
        
    def fit(self, X,y):
        timesplitter = TimeSplitter(self.n_splits)
        Xlvl2=[]
        Xlvl1=[]
        ylvl2=[]
        for train_idxs, test_idxs in timesplitter.split(X):
            preds = np.empty((len(test_idxs),len(self.estimators)))
            Xtrain,ytrain = X.values[train_idxs], y.values[train_idxs]
            Xtest,ytest = X.values[test_idxs], y.values[test_idxs]
            for i,(name,estimator) in enumerate(self.estimators):
                estimator = clone(estimator)
                estimator.fit(Xtrain,ytrain)
                preds[:,i]=estimator.predict(Xtest)
            Xlvl2.append(preds)
            Xlvl1.append(Xtest)
            ylvl2.append(ytest)
        Xlvl1 = np.vstack(Xlvl1)
        Xlvl2 = np.vstack(Xlvl2)
        ylvl2 = np.hstack(ylvl2)
        if(self.passthrough):
            Xlvl2 = np.hstack((Xlvl1,Xlvl2))
        self.final_estimator.fit(Xlvl2, ylvl2)
        
        for name, estimator in self.estimators:
            estimator.fit(X,y)
        
        return self
    
    def predict(self, X):
        preds = np.empty((len(X),len(self.estimators)))
        for i,(name,estimator) in enumerate(self.estimators):
            preds[:,i] = estimator.predict(X)
        if(self.passthrough):
            Xlvl2 = np.hstack((X,preds))
        else:
            Xlvl2 = preds
        return self.final_estimator.predict(Xlvl2)
            

Builds all the regressors. Various models were tested, but Random Forest and XGBoost performed best.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate, PredefinedSplit, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, StackingRegressor, GradientBoostingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn import svm
from tqdm import tqdm
import time
from sklearn.feature_selection import SelectPercentile, f_regression
import xgboost

RANDOM_STATE=42

def createClassifiers():
    clfs = []
    lr = Pipeline([('feature_selector',SelectPercentile(f_regression, percentile=70)),
                  ('lr',LinearRegression(n_jobs=-1))])
    knn = KNeighborsRegressor(3, n_jobs=-1)
    knn = Pipeline([('feature_selector',SelectPercentile(f_regression, percentile=50)),
                    ('scaler',StandardScaler()),
                    ('clf',knn)])
    dt = DecisionTreeRegressor(min_impurity_decrease=0.001)
    # criterion is left at its default (squared error): the 'mse' alias was renamed
    # to 'squared_error' in scikit-learn 1.0 and removed in 1.2.
    rf = RandomForestRegressor(n_estimators=100, max_features=0.5, min_impurity_decrease=0.001,
                               n_jobs=-1, random_state=RANDOM_STATE) #best_params for random forest{'max_features': 0.5, 'min_impurity_decrease': 0.001}
    rf2 = GridSearchCV(RandomForestRegressor(n_estimators=100, min_impurity_decrease=0.001, n_jobs=-1, random_state=RANDOM_STATE),
                       {'max_features':[0.4,0.5,0.6]}, cv=TimeSplitter())
    gbreg = GradientBoostingRegressor(n_estimators=50, learning_rate=0.15, min_impurity_decrease=0.001)
    
    
    stack1_estimators= [('lr',lr),
                        ('rf',rf)]
    #stack1_reg = TimeStackingRegressor(estimators=stack1_estimators, n_splits=5)
    xgb = xgboost.XGBRegressor(max_depth = 10, n_estimators=100, min_child_weight=200, subsample = 1, eta = 0.5, seed = RANDOM_STATE)

    #clfs.append(('knn',knn))
    #clfs.append(('dt',dt))
    #clfs.append(('stacking1',stack1_reg))
    #clfs.append(('rf',rf))
    #clfs.append(('gbreg',gbreg))
    #clfs.append(('rf2',rf2))
    #clfs.append(('knn',knn))
    clfs.append(('xgb',xgb))
    clfs.append(('lr',lr))
    
    return clfs


clfs = createClassifiers()

### Model Training
Trains a model **if necessary**

In [None]:
import os 
import pickle

best_model = None
if(os.path.isdir('/kaggle/input/my-modelpkl')):
    with open('/kaggle/input/my-modelpkl/best_model.pkl','rb') as f:
        best_model = pickle.load(f)

In [None]:
%%time
if(best_model is None):
    sampler = TimeSplitter()
    Results={}
    pbar = tqdm(clfs)
    for clf_name,clf in pbar:
        pbar.set_description("Training %s" % clf_name)
        Results[clf_name] = cross_validate(clf, X, Y, cv=sampler, scoring='neg_root_mean_squared_error', return_estimator=True, return_train_score=True)
    best_model = Results['xgb']['estimator'][0]
    
    R = []
    for clf_name,res in Results.items():
        R.append([clf_name,'rmse', -res['test_score'][0], -res['train_score'][0]]) # scores are negated RMSE, so flip the sign
    R = pd.DataFrame(R, columns=['classifier name','metric','test_rmse', 'train_rmse'])
    display(R)
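
The metric reported above is the root mean squared error. A minimal sketch of how it is computed, and of how clipping predictions to [0,20] affects it, is shown below. The helper `rmse` and the numbers are illustrative only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Invented targets (already within [0, 20]) and raw model outputs.
y_true = [0, 1, 20]
y_raw = [-1, 1, 25]

# Clipping predictions to the target range can only reduce the error here,
# because the true values never leave [0, 20].
y_clipped = np.clip(y_raw, 0, 20)

raw_score = rmse(y_true, y_raw)
clipped_score = rmse(y_true, y_clipped)
print(raw_score, clipped_score)
```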

### Feature Importance

In [None]:
from xgboost import plot_importance

_, ax = plt.subplots(1,1,figsize=(10,14))
plot_importance(booster=best_model, ax=ax);

## Predicting test dataset

In [None]:
Dtest['item_cnt_month'] = best_model.predict(Dtest.drop(['item_cnt_month','ID'],axis=1))
Dtest = Dtest.sort_values('ID')
Dtest['ID'] = Dtest['ID'].astype(int)
Dtest['item_cnt_month'] = Dtest['item_cnt_month'].clip(0,20)
Dtest.to_csv('submission.csv', columns=['ID','item_cnt_month'], index=False)