## Rossmann Store - Forecasting Sales

Rossmann Store Sales data is available via [Kaggle](https://www.kaggle.com/c/rossmann-store-sales). The data comes from Rossmann, a drugstore chain that operates over 3,000 stores in 7 European countries.

I have previously presented a notebook with insights on Rossmann Store, [Rossmann Store -Insights](https://www.kaggle.com/virusme/rossmann-store-sales/rossmann-store-insights). In this notebook, I will attempt to forecast or predict sales for each store, six weeks in advance. 

_**Note**: I always look forward to constructive feedback. If you find a bug or think there are fundamental mistakes, please do let me know. If you think this analysis could be improved further, again please let me know. Your input will be highly appreciated. My contact email: info@theportfoliotrader.com_


### Feature Selection

One of the most important things to do before embarking on any sort of modelling task is to understand the features, in order to make sure our model is trained with the best feature set.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
%matplotlib inline
# packages
import matplotlib.pyplot as plt
#import mpld3
import warnings
import seaborn as sns
sns.set(style='darkgrid')

#
warnings.filterwarnings('ignore')
#mpld3.enable_notebook()
#
# sales data, lets load the data
train = pd.read_csv('../input/train.csv')
# sort the dates 
train.sort_values(by='Date', ascending=True, inplace=True)
# stores data
stores = pd.read_csv('../input/store.csv')
#
print('-----Train data ------------------------------------')
print(train.head(10))
print('-----------------------------------------')
print('-----Stores data ------------------------------------')
print(stores.head(10))
print('-----------------------------------------')
#
# lets collate sales and customer data on monthly and yearly charts
# split Year-Month-Date to three different columns
train['Month'] = train['Date'].apply(lambda x : int(str(x)[5:7]))
train['Year'] = train['Date'].apply(lambda x : int(str(x)[:4]))
train['MonthYear'] = train['Date'].apply(lambda x : (str(x)[:7]))

#
train.info()
stores.info()

### Train data

`Customers` numbers will be correlated with `Sales`. However, from a business perspective, future customer numbers are unknown, so we can exclude `Customers` from the features for preliminary forecasting. `Sales` are only possible if the stores are open, so we should retain `Open` as a feature. Let's explore whether the rest of the attributes have any correlation with `Sales`. As the remaining attributes are all categorical, let's explore their relationship with `Sales` using boxplots. Meanwhile, we will also convert `Sales` to log scale for modelling.
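
As a small illustration of the log-scale conversion, the sketch below uses made-up sales values (not taken from `train.csv`) to show that `np.log1p` handles zero sales and that `np.expm1` maps model outputs back to the original scale. The variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy sales values (illustrative only, not taken from train.csv)
sales = pd.Series([0, 5263, 6064, 8314], name='Sales')

# log1p(x) computes log(1 + x), so zero sales map to 0 instead of -inf
log_sales = np.log1p(sales)

# expm1(x) computes exp(x) - 1, the inverse of log1p
recovered = np.expm1(log_sales)

print(log_sales.round(3).tolist())
print(np.allclose(recovered, sales))
```

Any prediction made on the log scale would need the same inverse step before it is compared with actual sales.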

In [None]:
train['LogSales'] = np.log(train['Sales']+1)  # +1 to take care of log(0) condition
train_stores = train[train['Open']!=0]
cols = ['DayOfWeek', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Month','Year']  # interested in these attributes

fig, axis = plt.subplots(3,2, figsize=(15,15))
axis = axis.ravel()
for i, attr in enumerate(cols):
    ax1 = sns.boxplot(x=attr, y='Sales', data=train_stores[['Sales', attr]], palette='husl', ax=axis[i])
    axis[i].set_title('Distributions per ' + attr)

**DayOfWeek**: It does appear that day-1, day-2, day-6 and day-7 have varying distributions, whereas day-3, 4 and 5 appear to have similar distributions. Maybe we can club day-3, 4 and 5 together as day-3, reducing the dimensionality of this feature.

**Promo**: Promo does have varying distributions, hence it will be a good feature to have.

**StateHoliday**: StateHoliday also appears to be a good feature to have. However, `'0'` (string 0) and `0` (numeric 0) appear to be the same category labelled inconsistently, and can be clubbed together as `'0'` (string 0).

**SchoolHoliday**: SchoolHoliday does not appear to carry much interesting information. Maybe we can discard this feature.

**Month**: This is an important feature because of the seasonality factor

**Year**: Year could be discarded as a feature.
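
The two clean-up ideas above (clubbing day-3, 4 and 5, and merging the two zero labels of `StateHoliday`) could look roughly like the sketch below. It runs on a toy frame with made-up values, and the column name `DayOfWeekClubbed` is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Toy frame with the same column names as train (values are made up)
toy = pd.DataFrame({
    'DayOfWeek': [1, 2, 3, 4, 5, 6, 7],
    'StateHoliday': ['0', 0, 'a', 0, 'b', '0', 'c'],
})

# Club day-4 and day-5 into day-3; the other days are left unchanged
toy['DayOfWeekClubbed'] = toy['DayOfWeek'].replace({4: 3, 5: 3})

# Cast to string so numeric 0 and string '0' become the same category
toy['StateHoliday'] = toy['StateHoliday'].astype(str)

print(toy['DayOfWeekClubbed'].tolist())
print(sorted(toy['StateHoliday'].unique()))
```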


### Stores data
Let's explore the stores data to see whether any of its attributes could be used as features. Each store has a single value per attribute, so we will have to expand the data to match the `train` data. First, let's look at `Promo2`. By itself it only conveys whether a store took part in a second promotion, so it may not be very useful to expand. However, we could use `Promo2SinceWeek`, `Promo2SinceYear` and `PromoInterval` to create a step function that switches from `0` to `1` when `Promo2` started, thereby covering the full span of the `train` data. I think similar processing could be applied to `CompetitionOpen` using `CompetitionOpenSinceMonth` and `CompetitionOpenSinceYear`, except that instead of a plain step function we could scale the step by a function of `CompetitionDistance`. `StoreType` and `Assortment` are important features of a store, so let us retain them as is.

I will first prepare data for `CompetitionOpen` by creating `CompOpenDate` and using the inverse of `CompetitionDistance` as `CompImpact`. I use the inverse because, from [Rossmann Store -Insights](https://www.kaggle.com/virusme/rossmann-store-sales/rossmann-store-insights), we have seen that sales and performance improve as competition distance grows. That is, the larger the competition distance, the better the sales and performance, so the impact of a competitor should shrink with distance.

In [None]:
import datetime
# stores data
stores = pd.read_csv('../input/store.csv')
# there are many NANs, remove them
stores_notnull = stores['CompetitionOpenSinceMonth'].notnull() & stores['CompetitionOpenSinceYear'].notnull()
# create CompetitionOpenDate
stores['CompOpen'] = stores[stores_notnull]['CompetitionOpenSinceYear'].astype(int).astype(str).str.cat(stores[stores_notnull]['CompetitionOpenSinceMonth'].astype(int).astype(str).str.zfill(2), sep='-')
stores['CompOpenDate'] = pd.Series([datetime.datetime.strptime(str(ym), '%Y-%m').strftime('%Y-%m-%d') for ym in stores[stores_notnull]['CompOpen'].tolist()], index = stores[stores_notnull].index)    
# fill CompetitionDistance for Nan as high number
stores['CompetitionDistance'] = stores['CompetitionDistance'].fillna(1000000)


# let's update train data
# create a step function based on CompOpenDate for train
print('processing Stores...')
for store in stores['Store']:
    print('\r', 'Store: ', store, end='')
    storedata = train[train['Store'] == store]
    compd = stores[stores['Store']==store]['CompOpenDate']
    dist = stores[stores['Store']==store]['CompetitionDistance']
    train.loc[train['Store']==store, 'CompImpact'] = (storedata['Date'] > compd.values[0]).astype(int).values * (1/dist.values[0])

print('\n','finished')

_Note: using `\r` in the `print` call should overwrite the previously printed line. Oddly enough, that does not seem to be the case in a Kaggle notebook, although it works fine in `jupyter-notebook`._

We have now created a `CompImpact` series for each store: a step function that switches on when the competitor opens, scaled by the inverse of `CompetitionDistance`. Hopefully this series adds a useful dimension to our model building.
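
For reference, the same kind of series can also be built without a per-store loop by merging the store attributes onto the sales rows. The sketch below is not the code used in this notebook; it runs on toy frames with made-up values, and it assumes that a store with no known competitor opening date should get an impact of `0`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train and stores (values are made up)
toy_train = pd.DataFrame({
    'Store': [1, 1, 1, 2, 2],
    'Date': ['2013-01-01', '2013-06-01', '2014-01-01', '2013-01-01', '2014-01-01'],
})
toy_stores = pd.DataFrame({
    'Store': [1, 2],
    'CompOpenDate': ['2013-03-01', None],  # store 2 has no known opening date
    'CompetitionDistance': [500.0, 2000.0],
})

# Attach each store's attributes to every one of its sales rows
merged = toy_train.merge(toy_stores, on='Store', how='left')
dates = pd.to_datetime(merged['Date'])
comp_open = pd.to_datetime(merged['CompOpenDate'])

# Step is 1 once the competitor is open; comparisons against NaT are False
step = (dates > comp_open).astype(int)

# Scale the step by the inverse of the competition distance
merged['CompImpact'] = step / merged['CompetitionDistance']
print(merged[['Store', 'Date', 'CompImpact']])
```

The merge does the store lookup once for the whole frame, which tends to be much faster than filtering `train` separately for each store.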

Now, let's look at generating a similar series for `Promo2`. `Promo2SinceWeek` and `Promo2SinceYear` appear to be straightforward. However, `PromoInterval` is quite confusing. For example, what does `Jan, Apr, Jul, Oct` mean for `PromoInterval`? From the data description, it appears:

> describes the consecutive intervals Promo2 is started, naming the months the
promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in
February, May, August, November of any given year for that store

So, if `Promo2` started in Feb, how long does it run? If it started anew in May, did the `Promo2` that started in Feb end before May, and was there a period with no promotion? If there was no such period, then why should I not consider `Promo2` as being ON from Feb till Jan, i.e. for the entire 52 weeks? More detail on what promotions were held at each of those intervals would have been a lot more informative, for example: Feb - 10$\%$ discount promotion, May - free $\$$10 voucher promotion, etc. Some stores have one or two intervals; again, information on how long the promotions ran would have been really helpful. I will discard `PromoInterval` for now and create a step function using `Promo2SinceWeek` and `Promo2SinceYear`.

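To convert the `year-week` string format to a date, I will use `datetime` with the `%W` and `%w` directives and an appended `-0`, as suggested [here](http://stackoverflow.com/a/17087427/2979010). `%W` directs the parser to read the number as a week of the year, with weeks starting on `Monday`, and `%w` reads the weekday, where `0` is `Sunday`. Assuming businesses would like to start promotions on a `Sunday`, the appended `-0` picks `Sunday` as the start date. Note that because `%W` weeks start on Monday, this is the Sunday that closes the numbered week.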
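
To see what this parsing does in practice, here is a minimal sketch using a hypothetical `year-week` value (week 1 of 2013, chosen only for illustration). It parses the same week once with `-0` (Sunday) and once with `-1` (Monday) so the two dates can be compared.

```python
import datetime

# Hypothetical Promo2 start: week 1 of 2013 (illustrative value)
year_week = '2013-01'

# %W: week number of the year, with Monday as the first day of the week
# %w: weekday as a number, where 0 is Sunday
sunday = datetime.datetime.strptime(year_week + '-0', '%Y-%W-%w')

# For comparison, '-1' selects the Monday of the same numbered week
monday = datetime.datetime.strptime(year_week + '-1', '%Y-%W-%w')

print('Sunday:', sunday.strftime('%Y-%m-%d'))
print('Monday:', monday.strftime('%Y-%m-%d'))
print('Days between:', (sunday - monday).days)
```

With these format codes the Sunday falls six days after the Monday of the same numbered week, so the derived start date sits at the end of that week rather than at its beginning.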

In [None]:
# there are many NANs, remove them
stores_notnull = stores['Promo2SinceWeek'].notnull() & stores['Promo2SinceYear'].notnull()
# create Promo2OpenDate
stores['Promo2Open'] = stores[stores_notnull]['Promo2SinceYear'].astype(int).astype(str).str.cat(stores[stores_notnull]['Promo2SinceWeek'].astype(int).astype(str).str.zfill(2), sep='-')
stores['Promo2OpenDate'] = pd.Series([datetime.datetime.strptime(str(ym)+'-0', '%Y-%W-%w').strftime('%Y-%m-%d') for ym in stores[stores_notnull]['Promo2Open'].tolist()], index = stores[stores_notnull].index)    

# let's update train data
# create a step function based on Promo2OpenDate for train
print('processing Stores...')
for store in stores['Store']:
    print('\r', 'Store: ', store, end='')
    storedata = train[train['Store'] == store]
    p2d = stores[stores['Store']==store]['Promo2OpenDate']
    train.loc[train['Store']==store, 'Promo2'] = (storedata['Date'] > p2d.values[0]).astype(int).values


print('\n','finished')

In [None]:
# plot
fig, (axis1, axis2, axis3) = plt.subplots(3,1, sharex=True, figsize=(10,7))
# We will now plot the generated series for one store to see if it looks alright.
# Pick a random store which has CompOpenDate and Promo2OpenDate  
store = stores[(stores['CompOpenDate'].notnull()) & (stores['Promo2OpenDate'].notnull())]['Store'].sample(n=1).values[0]

# display CompOpenDate and Promo2OpenDate
print('Competition Open Date: ' + stores[stores['Store']==store]['CompOpenDate'].astype(str))
print('Promo2 Start Date: ' + stores[stores['Store']==store]['Promo2OpenDate'].astype(str))
# plot generated series along with sales
storedata = train[train['Store']==store].copy()  # copy so the Date conversion below does not touch train
storedata['Date'] = pd.to_datetime(storedata['Date'])
#
storedata['CompImpact'].plot(marker='o', ax=axis1)
tmp = axis1.set_title('Store-{} :Competition Impact'.format(store))
storedata['Promo2'].plot(marker='o', ax=axis2)
tmp = axis2.set_title('Promo2 Start')
storedata['Sales'].plot(marker='o', ax=axis3)
tmp = axis3.set_title('Sales')
tmp = axis3.set_xticks(storedata['Date'].index[::20])
tmp = axis3.set_xticklabels(storedata['Date'][::20], rotation=90)

_Note: The plots above are generated by selecting a store at random, so they will change every time you run the cell._

The plots above show the step functions for `CompImpact` and `Promo2` alongside the `Sales` data for a random store. In this case (i.e. `Store=115`), we can see that `CompImpact` is constant, suggesting that the competition existed even before the first data point was collected. Also, there is a large gap in the data starting `2014-06-25`. This could be because of renovations or similar work during which the store was closed and no data was collected. We have to be mindful of such data gaps and make sure they do not exist when training or constructing a model.
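
One way to look for such gaps is to measure the difference between consecutive dates for a store. The sketch below uses a toy date series with made-up dates, and the column names `gap_ends` and `days_missing` are hypothetical.

```python
import pandas as pd

# Toy date series for one store with two deliberate gaps (made-up dates)
dates = pd.to_datetime(pd.Series(
    ['2015-01-01', '2015-01-02', '2015-01-04', '2015-01-09', '2015-01-10']
)).sort_values()

# Time elapsed since the previous observation
delta = dates.diff()

# A step longer than one day marks a gap; record where it ends and its size
gaps = pd.DataFrame({'gap_ends': dates, 'days_missing': delta.dt.days - 1})
gaps = gaps[gaps['days_missing'] > 0]
print(gaps)
```

Running the same check per store (for example inside a `groupby('Store')`) would list the stores whose series are interrupted.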

Now, let's add `StoreType` and `Assortment` data for each store.

In [None]:
# StoreType and Assortment
print('processing Stores...')
for store in stores['Store']:
    print('\r', 'Store: ', store, end='')
    storedata = train[train['Store'] == store]
    st = stores[stores['Store']==store]['StoreType']
    train.loc[train['Store']==store, 'StoreType'] = st.values[0]
    asst = stores[stores['Store']==store]['Assortment']
    train.loc[train['Store']==store, 'Assortment'] = asst.values[0]
    
print('\n','finished')

In [None]:
## lets prepare data for training and testing

# make a copy of the data 
train_copy = train.copy()

# StateHoliday, StoreType and Assortment are categorical strings, convert using simple LabelEncoder/DictVectoriser
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction import DictVectorizer
train_copy['StateHoliday'] = train_copy['StateHoliday'].replace(0, '0')
train_copy['StateHoliday'] = LabelEncoder().fit_transform(train_copy['StateHoliday'])
train_copy['StoreType'] = LabelEncoder().fit_transform(train_copy['StoreType'])
train_copy['Assortment'] = LabelEncoder().fit_transform(train_copy['Assortment'])

# Date is already sorted
unique_dates = train_copy['Date'].unique()
train_length = np.round(unique_dates.shape[0] * 0.8).astype(int)
#
train_data = train_copy[train_copy['Date'].isin(unique_dates[0:train_length])]
test_data = train_copy[train_copy['Date'].isin(unique_dates[train_length:])]  # start at train_length so no date is skipped
# feature attributes
feature_attributes = ['Open', 'DayOfWeek', 'Promo', 'StateHoliday', 'SchoolHoliday', 'CompImpact', 'Promo2', 'Month', 'Year', 'StoreType', 'Assortment']
target_attribute = ['LogSales']
#
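
The date-based split can be checked on a small example. The sketch below is an illustration on a toy frame with made-up dates, not the notebook's data: it cuts the sorted unique dates at 80% so that every date lands in exactly one partition and the test period lies strictly after the training period.

```python
import pandas as pd

# Toy frame: 10 distinct dates, two stores per date (made-up data)
dates = ['2015-01-{:02d}'.format(d) for d in range(1, 11)]
toy = pd.DataFrame({
    'Date': [d for d in dates for _ in range(2)],
    'Store': [1, 2] * 10,
})

# ISO-formatted date strings sort chronologically
unique_dates = sorted(toy['Date'].unique())
cut = int(round(len(unique_dates) * 0.8))

# First 80% of dates for training, the remaining dates for testing
train_part = toy[toy['Date'].isin(unique_dates[:cut])]
test_part = toy[toy['Date'].isin(unique_dates[cut:])]

print(len(train_part), len(test_part))
print(train_part['Date'].max(), '<', test_part['Date'].min())
```

Splitting on dates rather than on rows keeps all stores for a given day in the same partition, which mimics forecasting a future period.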