# Goal
<b><font size='5'>T</font>HE</b> goal of this kernel is to analyse the sales data of <a href='https://www.kaggle.com/c/rossmann-store-sales'>Rossmann Store Sales</a>, together with static data about its stores and the promotions run during the year, and then to predict future daily sales per store. Rossmann operates stores in 7 European countries, and our task is to predict sales up to 6 weeks in advance. Sales may be influenced by many factors, including store location, season, local conditions, running promotions (their total length, time elapsed, etc.), holidays and other demographic factors. A robust prediction model will enable managers to plan resources accordingly and increase productivity. <br>
The dataset here is divided into **two** kinds of files:
 - Store master data for its 1115 stores (store.csv), and
 - Transaction data (train.csv/test.csv)
 
## Data fields

Most of the fields are self-explanatory. The following are descriptions for those that aren't.

   - Id - an Id that represents a (Store, Date) duple within the test set
   - Store - a unique Id for each store
   - Sales - the turnover for any given day (this is what you are predicting)
   - Customers - the number of customers on a given day
   - Open - an indicator for whether the store was open: 0 = closed, 1 = open
   - StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
   - SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
   - StoreType - differentiates between 4 different store models: a, b, c, d
   - Assortment - describes an assortment level: a = basic, b = extra, c = extended
   - CompetitionDistance - distance in meters to the nearest competitor store
   - CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
   - Promo - indicates whether a store is running a promo on that day
   - Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
   - Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
   - PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew
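
To make the PromoInterval field more concrete, here is a minimal sketch of how one might check whether a date falls in a month in which Promo2 starts anew. The helper `is_promo2_start_month` and the example rows are hypothetical and not part of the dataset or of this kernel; the month abbreviations follow the spelling used in store.csv (note "Sept").

```python
from datetime import date

# Month abbreviations as they are spelled in PromoInterval ("Sept", not "Sep").
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]

def is_promo2_start_month(day, promo_interval):
    """Hypothetical helper: True if `day` lies in a month listed in `promo_interval`."""
    if not isinstance(promo_interval, str):  # missing value: store is not in Promo2
        return False
    return MONTH_ABBR[day.month - 1] in promo_interval.split(",")

# Made-up examples, not rows from the real files.
examples = [
    (date(2015, 2, 10), "Feb,May,Aug,Nov"),
    (date(2015, 3, 10), "Feb,May,Aug,Nov"),
    (date(2015, 9, 1), "Mar,Jun,Sept,Dec"),
    (date(2015, 9, 1), None),
]
flags = [is_promo2_start_month(d, p) for d, p in examples]
print(flags)
```

A flag like this could serve as an extra feature, but the kernel itself only encodes PromoInterval as a categorical code.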

In [None]:
from IPython.display import HTML,YouTubeVideo,display
display(YouTubeVideo('B2CHNrNmM80', width=600, height=300))

# Importing required packages
Here we need various packages, ranging from common data processing (numpy/pandas) and visualization (matplotlib/seaborn) to machine learning and tuning packages such as scikit-learn, scikit-optimize and XGBoost.

In [None]:
# 1.1 Data manipulation libraries
import pandas as pd
import numpy as np

# Dimensionality reduction
from sklearn.decomposition import KernelPCA

# Data transformation classes
from sklearn.preprocessing import OneHotEncoder as ohe
from sklearn.preprocessing import LabelEncoder as le
from sklearn.preprocessing import StandardScaler as ss
 
#Data splitting
from sklearn.model_selection import TimeSeriesSplit

#Model pipelining
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

#Model

from skopt import BayesSearchCV
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import TimeSeriesSplit as tss

#Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Other small utilities
from sklearn.metrics import make_scorer
from pandas.tseries.offsets import MonthEnd

import gc
import datetime

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import warnings
warnings.filterwarnings('ignore')

import os

In [None]:
os.chdir('/kaggle/input/rossmann-store-sales/')
train = pd.read_csv("train.csv",parse_dates=[2])
test = pd.read_csv("test.csv",parse_dates=[3])
store = pd.read_csv("store.csv")

In [None]:
train.info()
display(HTML('<h3>Features in train having null values:</h3>'))
train.columns.values[train.isnull().any()]
display(HTML('<h3>Features wise Minimum-Maximum in training dataset:</h3>'))
pd.DataFrame([train.min(),train.max()])

Here Store, DayOfWeek, Open, Promo and SchoolHoliday are clearly categorical features, but since the last 3 are binary we need not encode them further.

In [None]:
train['Date']=pd.to_datetime(train['Date'])  # already parsed by read_csv; this only guarantees a datetime64 dtype
train['Store']=train.Store.astype('category')
train['DayOfWeek'] = train.DayOfWeek.astype('category')

In [None]:
train.head()

In [None]:
store.info()
display(HTML('<h3>Features wise Minimum-Maximum and NaN in Store dataset:</h3>'))
pd.DataFrame([store.min(),store.max(),store.isnull().sum(),store.nunique()],index=['Min','Max','Nulls','Unique'])

<h3> OBSERVATIONS:</h3>
<ol>
    <li>CompetitionDistance is null in 3 rows: we may impute it with the mean.</li>
    <li>CompetitionOpenSinceMonth and CompetitionOpenSinceYear are null in 354 rows: with this many gaps, dropping the columns may be the better option.</li>
    <li>The Promo2 details are uniformly missing in 544 rows: we should check whether these nulls all fall on the same rows (structural) or are scattered (genuinely missing data).</li>
</ol>
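
The check suggested in the third observation can be sketched as follows. The frame `toy_store` is a made-up stand-in for store.csv, and the variable names are only illustrative; the same expressions should apply to the real `store` frame, assuming it has the columns listed in the data fields section.

```python
import numpy as np
import pandas as pd

# Toy stand-in for store.csv (values are invented for illustration).
toy_store = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "Promo2": [0, 1, 0, 1],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan, 40.0],
    "Promo2SinceYear": [np.nan, 2010.0, np.nan, 2014.0],
    "PromoInterval": [np.nan, "Jan,Apr,Jul,Oct", np.nan, "Feb,May,Aug,Nov"],
})

promo2_cols = ["Promo2SinceWeek", "Promo2SinceYear", "PromoInterval"]

# True where every Promo2 detail column is missing in that row.
all_missing = toy_store[promo2_cols].isnull().all(axis=1)
# True where at least one Promo2 detail column is missing.
any_missing = toy_store[promo2_cols].isnull().any(axis=1)

# The nulls are structural if they always occur together and
# exactly on the stores that do not take part in Promo2.
nulls_are_structural = bool(
    (all_missing == any_missing).all()
    and (all_missing == (toy_store["Promo2"] == 0)).all()
)
print(nulls_are_structural)
```

If the flag is true on the real data, filling these columns with a constant is a reasonable choice; if not, the scattered gaps would need a closer look.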

In [None]:
store['CompetitionDistance']=store['CompetitionDistance'].fillna(store['CompetitionDistance'].mean())
store[store.Promo2SinceWeek.isnull()].describe(include='all',percentiles=[])

In [None]:
display(HTML('<h4>The table shows that the remaining null values occur because Promo2 is not run at those stores, so we can fill them with the constant 0</h4>'))
store.fillna(0,inplace=True)

In [None]:
store.head()

In [None]:
X=train.merge(store,on='Store',copy=False)

In [None]:
_=plt.figure(figsize=(20,5))
_=train.set_index(keys='Date',drop=False).resample('M')['Sales'].sum().plot(fontsize=20)
_=plt.xlabel('Date', fontsize=20)
_=plt.ylabel('Sales', fontsize=20)
_=plt.suptitle('Rossmann Stores Sales over time', fontsize=30)

Now we look into Store-wise monthly sales of first few stores.

In [None]:
_=train.set_index(keys='Date',drop=False).groupby('Store').resample('M')['Sales'].sum().reset_index(level=[0,1])
f,ax=plt.subplots(10,1,sharex=True)
ax=ax.flatten()
for i in range(10):
    __=_[_.Store==(i+1)].plot(x='Date',y='Sales',legend=False,title='Store '+str(i+1),ax=ax[i],figsize=(20,20))
del _
del __
gc.collect()

In [None]:
%%time
display(HTML('<h4>Now we see day-of-week-wise average sales for all stores</h4>'))
_=train.groupby('DayOfWeek').agg({'Sales':'mean'}).plot(kind='bar',color='bgyr',legend=False)
__=_.set_xticklabels(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

In [None]:
display(HTML('The graph above raises 2 big questions:<br>1- Why does Sunday have the lowest sales?<br>2- Is it only the amount spent that drops, or customer footfall as well?'))

_=train.plot.scatter('Customers','Sales',s=20,c='DayOfWeek',cmap='rainbow',figsize=(10,5),alpha=0.6)

In [None]:
train['StateHoliday']=train['StateHoliday'].replace({0:'0'})  # unify the mixed 0 / '0' labels
f,(ax1,ax2)=plt.subplots(1,2,figsize=(20,5))
_=sns.heatmap(train.groupby(['DayOfWeek','StateHoliday']).agg({'Customers':np.mean}).unstack().fillna(0),cmap='GnBu',ax=ax1,annot=True)
_=sns.heatmap(train.groupby(['SchoolHoliday','DayOfWeek']).agg({'Customers':np.mean}).unstack().fillna(0),cmap='GnBu',ax=ax2,annot=True)
_=ax1.set_title('State Holiday Vs Day of Week')
_=ax2.set_title('School Holiday Vs Day of Week')

The three plots above suggest that footfall is considerably lower on Sundays, that state holidays reduce footfall sharply, and that school holidays make only a small dent.<br>
Another question is whether stores with a larger assortment actually attract more customers, or rather more sales per customer... let's see!

In [None]:
_=train.merge(store, on='Store').groupby('Assortment').agg({'Customers':np.mean,'Sales':np.mean})
_["SalesPerCustomer"]=_.Sales/_.Customers
_

This suggests that while "extra" assortment stores serve more customers, their average sales per customer is much lower than even that of basic stores. **It is fair to say that the assortment level plays a big role in both average sales and actual footfall**.<br>
To decide how sales and customers should be fed to the model, we first need to know how they are distributed.

In [None]:
f,ax=plt.subplots(1,2,figsize=(20,5))
_=sns.distplot(train.Customers,ax=ax[0])
_=sns.distplot(train.Sales,ax=ax[1])
ax[0].set_title('Customers Distribution')
ax[1].set_title('Sales Distribution')

Both distributions are skewed, so we use per-store medians rather than means as features for our model.
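
A small illustration of why the median is the safer summary for skewed data: a single extreme day pulls the mean far more than the median. The numbers below are invented and do not come from train.csv.

```python
import numpy as np

# Made-up daily sales for one store: ordinary days plus one large spike.
sales = np.array([4800, 5100, 5000, 5200, 4900, 5050, 21000], dtype=float)

mean_sales = sales.mean()
median_sales = np.median(sales)

print(f"mean   = {mean_sales:.1f}")
print(f"median = {median_sales:.1f}")
```

Here the median stays at a typical day's level, while the mean is dragged upward by the spike.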

In [None]:
train['TicketSize']= train['Sales'] / train['Customers']
med_sales= train.groupby('Store')[['Sales', 'Customers', 'TicketSize']].median()
med_sales.rename(columns=lambda x: x+'_median', inplace=True)
train.drop(columns=['TicketSize'],inplace=True)
#med_sales.sample(5)

def build_features(train):
    X= train.merge(med_sales,on='Store')
    X = X.merge(store,on='Store')
    X['Year'] = X.Date.dt.year
    X['Month'] = X.Date.dt.month
    X['Day'] = X.Date.dt.day
    X['Q_Month'] = (X.Date.dt.month-1)%3+1  # position of the month within its quarter (1-3)
    X['CompOpSinceMonth']=(X.Year-X.CompetitionOpenSinceYear)*12+(X.Month-X.CompetitionOpenSinceMonth)
    X['LeftDaysInMonth'] = ((X.Date+MonthEnd(0))-X.Date).dt.days
    
    #store.PromoInterval.astype('category').cat.categories
    cat1 = pd.CategoricalDtype(categories=[0, 'Jan,Apr,Jul,Oct', 'Feb,May,Aug,Nov', 'Mar,Jun,Sept,Dec'])
    X['PromoInterval'] = X.PromoInterval.astype(cat1).cat.codes
    
    ##Change types
    cat_cols = ['Store', 'DayOfWeek', 'Open','Promo', 'StateHoliday', 'SchoolHoliday', 'StoreType', 'Assortment','Promo2','Q_Month','PromoInterval']
    int_cols = ['CompetitionOpenSinceMonth','CompetitionOpenSinceYear', 'Promo2SinceWeek','Promo2SinceYear', 'Year', 'Day', 'CompOpSinceMonth', 'LeftDaysInMonth']
    for col in cat_cols:
        X[col] = X[col].astype('category')
    for col in int_cols:
        X[col] = X[col].astype('int')
    return X

In [None]:
X= build_features(train)
y=X.pop('Sales')
y = np.log1p(y)

## Evaluation
To evaluate the model we will use the **Root Mean Squared Percentage Error** metric, which is calculated as:<br>
$\mathrm{rmspe}=\sqrt{\frac{1}{n}\sum_{i=1}^{n} {\left (\frac{y_i-\hat{y}_i}{y_i}  \right )}^{2}}$
<br>Here the model is trained on $\log(1+\mathrm{sales})$ instead of the sales itself, so with $y_i$ and $\hat{y}_i$ now denoting the log-transformed values the metric becomes:<br>
$\mathrm{rmspe}_{\mathrm{log1p}}=\sqrt{\frac{1}{n}\sum_{i=1}^{n} {\left (\frac{e^{\hat{y}_i}-1}{e^{y_i}-1}-1  \right )}^{2}}$
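
As a quick sanity check, the two formulas should give the same number when the second one is fed log1p-transformed values. The sketch below uses made-up sales figures, and the function names `rmspe` and `rmspe_from_log1p` are illustrative only (they are not the scorer defined in the next cell).

```python
import numpy as np

def rmspe(y_true, y_pred):
    """RMSPE on the original sales scale (y_true must be non-zero)."""
    return np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))

def rmspe_from_log1p(log_true, log_pred):
    """The same metric, computed from log1p-transformed values."""
    ratio = np.expm1(log_pred) / np.expm1(log_true)
    return np.sqrt(np.mean((ratio - 1.0) ** 2))

# Made-up sales and predictions.
sales = np.array([100.0, 200.0, 400.0])
preds = np.array([110.0, 180.0, 400.0])

direct = rmspe(sales, preds)
via_log = rmspe_from_log1p(np.log1p(sales), np.log1p(preds))
print(direct, via_log)
```

Both routes agree up to floating-point error, because `expm1` exactly undoes `log1p`.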

In [None]:
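def rmspe_log1p(y,yhat):
    # Undo the log1p transform to get back to the sales scale
    y=np.expm1(np.asarray(y,dtype=float))
    yhat=np.expm1(np.asarray(yhat,dtype=float))
    # Weight 1/y for every row, and 0 where sales are zero (avoids division by zero)
    weight=np.zeros_like(y)
    nonzero=y!=0
    weight[nonzero]=1/y[nonzero]
    return np.sqrt(np.mean((weight*(yhat-y))**2))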

rmspe_scorer = make_scorer(rmspe_log1p, greater_is_better = False)

In [None]:
mCat_cols = ['Store','DayOfWeek','StateHoliday','StoreType', 'Assortment','Q_Month','PromoInterval']
bin_cat_cols = ['Open', 'Promo','SchoolHoliday','Promo2']
num_cols = X.select_dtypes('number').columns.to_list()

In [None]:
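ctt = ColumnTransformer(
                        [
                            ('mcat',ohe(),mCat_cols),
                            # binary flags become one 0/1 column each; without this entry they would be dropped
                            ('bin',ohe(drop='if_binary'),bin_cat_cols),
                            ('num',ss(),num_cols)
                        ])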
X_t=ctt.fit_transform(X)

In [None]:
xgboost_tree = XGBRegressor(
    n_jobs = -1,
    n_estimators = 1000,
    eta = 0.1,
    max_depth = 2,
    min_child_weight = 2,
    subsample = 0.8,
    colsample_bytree = 0.8,
    tree_method = 'exact',
    reg_alpha = 0.05,
    random_state = 1023
)
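# Fit without a custom eval_metric: rmspe_log1p does not have the signature XGBoost
# expects for one, and the training RMSPE is computed in the next cell anyway
xgboost_tree.fit(X_t, y)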

In [None]:
rmspe_log1p(y,xgboost_tree.predict(X_t))

In [None]:
import xgboost as xgb
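def rmspe_xg(yhat, y):
    # Custom metric for the native xgb.train API: receives predictions and a DMatrix
    y = np.expm1(y.get_label())
    yhat = np.expm1(yhat)
    
    # Weight 1/y for every row, and 0 where sales are zero
    weight = np.zeros_like(y)
    nonzero = y != 0
    weight[nonzero] = 1/y[nonzero]
    # Returned as a positive error, so that lower is better
    return "rmspe", np.sqrt(np.mean((weight*(yhat-y))**2))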

dtrain = xgb.DMatrix(X_t, y)

params = {
    'n_estimators': (200, 2000),
    'max_depth' :(1,8),
    'eta':(0.01, 0.6, 'log-uniform'),
    'colsample_bytree':(0.1,0.9,'uniform'),
    'gamma':(1,10),
    'alpha':(0,10),
    'lambda':(1,10),
    'subsample':(0.1,1.0),
    'min_child_weight':(0,5)
}

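bayes_cv = BayesSearchCV(
                        estimator = XGBRegressor(objective= 'reg:squarederror',  # current name of the deprecated 'reg:linear'
                                                 booster='gbtree',
                                                 verbosity=2,
                                                 tree_method='hist'
                                                ),
                        search_spaces = params, 
                        scoring = rmspe_scorer,  # select models by (negated) RMSPE instead of the default R^2
                        cv=tss(3),
                        n_jobs=-1,
                        n_iter = 100,
                        verbose=0
                        )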
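
One point worth keeping in mind with `TimeSeriesSplit`: it splits by row position, so the folds only respect chronology if the rows are sorted oldest first. The sketch below shows this on a toy frame with invented data; whether the feature matrix used above needs such a sort depends on the row order of the input files, which is an assumption to verify rather than something shown here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Toy frame deliberately stored newest-first (made-up data).
toy = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=8, freq="D")[::-1],
    "Sales": np.arange(8),
})

# Sort oldest-first so every validation fold lies after its training fold.
toy_sorted = toy.sort_values("Date").reset_index(drop=True)

folds = list(TimeSeriesSplit(n_splits=3).split(toy_sorted))
for train_idx, valid_idx in folds:
    train_end = toy_sorted.loc[train_idx, "Date"].max()
    valid_start = toy_sorted.loc[valid_idx, "Date"].min()
    print(train_end.date(), "<", valid_start.date())
```

Each printed pair shows the last training date preceding the first validation date, which is the property a forecasting validation scheme needs.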

In [None]:
%%time
bayes_cv.fit(X_t,y.values)

In [None]:
bayes_cv.best_score_

In [None]:
rmspe_log1p(y,bayes_cv.predict(X_t))

In [None]:
bayes_cv.best_params_

In [None]:
rand_stores = np.random.randint(1,1116,5)  # store ids run from 1 to 1115
# Build the frame from X so the rows line up with the predictions made on X_t
_ = X[['Date','Store']].copy()
_['Sales'] = np.expm1(y)
_['Prediction'] = np.expm1(bayes_cv.predict(X_t))
_=_.set_index(keys='Date',drop=False).groupby('Store').resample('M')[['Sales','Prediction']].sum().reset_index(level=[0,1])
f,ax=plt.subplots(5,1,sharex=True)
ax=ax.flatten()
for i in range(5):
    __=_[_.Store==rand_stores[i]].plot(x='Date',y='Sales',title='Store '+str(rand_stores[i]),ax=ax[i],figsize=(20,20))
    __=_[_.Store==rand_stores[i]].plot(x='Date',y='Prediction',title='Store '+str(rand_stores[i]),ax=ax[i])
    
del _
del __
gc.collect()