This notebook prepares time-series data for later deep-learning applications.
It is mainly based on the following notebook:

https://github.com/fastai/fastai/blob/master/courses/dl1/lesson3-rossman.ipynb

It adds a few tweaks, such as bridging school/state holiday and promo periods across weekends and adding run-length encoding features.

# Env setup

In [None]:
from fastai.tabular import *
import pandas as pd
import os, tarfile
import random
import matplotlib.pyplot as plt
import re
from datetime import *
from isoweek import Week

%matplotlib inline
%reload_ext autoreload
%autoreload 2

np.set_printoptions(threshold=50, edgeitems=20)
pd.options.mode.chained_assignment = None
PATH = '../input/multiple-data-source-of-rossmann/'
OUTPUT = './'

In [None]:
def rmax(n=50):
  pd.set_option('display.max_rows', n)

def cmax(n=50):
  pd.set_option('display.max_columns',n)

In [None]:
tarfile.open(f'{PATH}rossmann.tgz').extractall(path = OUTPUT)

# Load datasets

In [None]:
table_names = ['train', 'store', 'store_states', 'state_names', 
               'googletrend', 'weather', 'test']
tables = [pd.read_csv(f'{OUTPUT}{fname}.csv', low_memory=False) for fname in table_names]
train, store, store_states, state_names, googletrend, weather, test = tables
train.shape, test.shape

# Quick summary of the data
In addition to the provided data, there are some external datasets put together by participants in the Kaggle competition.
* train: Contains store information on a daily basis, tracks things like sales, customers, whether that day was a holiday, etc.
* store: general info about the store including competition, etc.
* store_states: maps store to state it is in
* state_names: Maps state abbreviations to names
* googletrend: trend data for particular week/state
* weather: weather conditions for each state
* test: Same as training table, w/o sales and customers

In [None]:
from IPython.display import HTML, display
for t in tables: display(t.head())

# Data Cleaning / Add Date-related features

In [None]:
train.StateHoliday.value_counts(), test.StateHoliday.value_counts()

Since the test set contains only one type of state holiday, we simplify this variable to a binary flag.

In [None]:
train.StateHoliday = (train.StateHoliday!='0').astype('int')
test.StateHoliday = (test.StateHoliday!='0').astype('int')
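
As a small illustration of the conversion above, the sketch below applies the same expression to a toy column. It assumes the usual Rossmann coding ('0' for no holiday, letters for the holiday types); the values are made up.

```python
import pandas as pd

# Toy column: '0' = no holiday; 'a', 'b', 'c' = different holiday types (assumed coding).
state_holiday = pd.Series(['0', 'a', '0', 'b', 'c'])

# Any code other than '0' becomes 1, so the holiday type is discarded.
is_holiday = (state_holiday != '0').astype('int')
print(is_holiday.tolist())  # [0, 1, 0, 1, 1]
```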

The google trend data is weekly (each row covers a start-end date range), so we need either the start or the end of that period as an anchor to associate it with the main store sales data (train/test). Using the start date of each period risks leaking future data into the past. For example, the trend attached to 2014/9/30 is actually the google trend result for 9/28-10/4; if we predict sales for 10/3 from data available before 10/3, that trend value would not be known until after 10/4.

Maybe we should consider using the google trend data from the previous week (or using the end date of the period as the anchor to join on). For now, the code below uses the start date.

In [None]:
googletrend['Date'] = googletrend.week.str.split(' - ', expand=True)[0]
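
To make the leakage concern concrete, here is a minimal sketch on a toy frame. The `week` string format is assumed to match the googletrend table, and the `AvailableFrom` column is a hypothetical alternative anchor, not something this notebook uses.

```python
import pandas as pd

# Toy frame; the 'week' format is assumed to mirror the googletrend table.
trend = pd.DataFrame({
    'week': ['2014-09-21 - 2014-09-27', '2014-09-28 - 2014-10-04'],
    'trend': [70, 85],  # made-up values
})

bounds = trend.week.str.split(' - ', expand=True)
trend['StartDate'] = pd.to_datetime(bounds[0])
trend['EndDate'] = pd.to_datetime(bounds[1])

# Hypothetical leak-free anchor: the first day on which the whole week's
# trend value could be known, i.e. the day after the period ends.
trend['AvailableFrom'] = trend.EndDate + pd.Timedelta(days=1)
print(trend[['StartDate', 'EndDate', 'AvailableFrom']])
```

Joining on `AvailableFrom` (or on the previous week's row) would trade some freshness for a feature that is guaranteed to exist at prediction time.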

Here we use fastai's 'add_datepart' function to expand the date information into multiple features.

In [None]:
add_datepart(weather, "Date", drop=False)
add_datepart(googletrend, "Date", drop=False)
add_datepart(train, "Date", drop=False)
add_datepart(test, "Date", drop=False)
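
For readers without fastai at hand, the sketch below shows the kind of columns this step produces. `add_basic_dateparts` is a hypothetical, simplified stand-in written in plain pandas; it is not fastai's implementation and only covers a few of the fields that the later joins rely on (Year, Week, Dayofweek).

```python
import pandas as pd

def add_basic_dateparts(df, col):
    """Hypothetical, simplified stand-in for add_datepart (not fastai's code)."""
    dates = pd.to_datetime(df[col])
    df[col] = dates
    df['Year'] = dates.dt.year
    df['Month'] = dates.dt.month
    # ISO week number, taken from Python's own isocalendar()
    df['Week'] = dates.apply(lambda d: d.isocalendar()[1])
    df['Day'] = dates.dt.day
    df['Dayofweek'] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    return df

demo = add_basic_dateparts(pd.DataFrame({'Date': ['2015-07-31', '2015-08-01']}), 'Date')
print(demo)
```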

# Join multiple tables
join_df is a function for joining tables on specific fields. By default, we'll be doing a left outer join of right on the left argument using the given fields for each table.

In [None]:
def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, 
                      suffixes=("", suffix))
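
A quick usage sketch of join_df on made-up toy tables: unmatched rows on the left are kept with nulls, and a column name that exists on both sides gets the suffix on the right-hand copy. This is why the cells below count nulls after every join.

```python
import pandas as pd

def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on,
                      suffixes=("", suffix))

# Toy tables (made-up values): store 3 has no match on the right.
sales = pd.DataFrame({'Store': [1, 2, 3], 'Sales': [100, 200, 300]})
info = pd.DataFrame({'Store': [1, 2], 'StoreType': ['a', 'c'], 'Sales': [0, 0]})

out = join_df(sales, info, 'Store')
print(out.columns.tolist())          # ['Store', 'Sales', 'StoreType', 'Sales_y']
print(out.StoreType.isnull().sum())  # 1
```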

Join the weather and state name tables

In [None]:
weather = join_df(weather, state_names, "file", "StateName")
weather.State.unique() , googletrend.file.unique()

The state codes in weather and googletrend differ: googletrend uses 'NI' where weather uses 'HB,NI'. The googletrend file also has a distinct category for the whole of Germany.

In [None]:
googletrend['State'] = googletrend.file.str.split('_', expand=True)[2]
googletrend.loc[googletrend.State=='NI', "State"] = 'HB,NI'
train.shape, test.shape
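
The state extraction can be checked on a few toy rows. The file names below are assumed to follow the pattern in the googletrend table; note that the Germany-wide file has no third token, so its State ends up null.

```python
import pandas as pd

# Toy rows; the 'file' naming pattern is assumed to mirror the googletrend table.
gt = pd.DataFrame({'file': ['Rossmann_DE_SN', 'Rossmann_DE_NI', 'Rossmann_DE']})

# The third '_'-separated token is the state; 'Rossmann_DE' has none.
gt['State'] = gt.file.str.split('_', expand=True)[2]
gt.loc[gt.State == 'NI', 'State'] = 'HB,NI'  # align with the weather table
print(gt)
```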

Make a separate table for the Germany-wide google trend

In [None]:
trend_de = googletrend[googletrend.file == 'Rossmann_DE']

Now join the remaining tables

In [None]:
store = join_df(store, store_states, "Store")
len(store[store.State.isnull()])

In [None]:
joined = join_df(train, store, "Store")
joined_test = join_df(test, store, "Store")
len(joined[joined.StoreType.isnull()]),len(joined_test[joined_test.StoreType.isnull()])

In [None]:
joined = join_df(joined, googletrend, ["State","Year", "Week"])
joined_test = join_df(joined_test, googletrend, ["State","Year", "Week"])
len(joined[joined.trend.isnull()]),len(joined_test[joined_test.trend.isnull()])

In [None]:
joined = joined.merge(trend_de, 'left', ["Year", "Week"], suffixes=('', '_DE'))
joined_test = joined_test.merge(trend_de, 'left', ["Year", "Week"], suffixes=('', '_DE'))
len(joined[joined.trend_DE.isnull()]),len(joined_test[joined_test.trend_DE.isnull()])

In [None]:
joined = join_df(joined, weather, ["State","Date"])
joined_test = join_df(joined_test, weather, ["State","Date"])
len(joined[joined.Mean_TemperatureC.isnull()]),len(joined_test[joined_test.Mean_TemperatureC.isnull()])

In [None]:
joined.shape, joined_test.shape

Clean up the duplicated columns

In [None]:
for df in (joined, joined_test):
    # Drop every '_y' duplicate, and every '_DE' duplicate except the trend_DE column
    dup_cols = [c for c in df.columns
                if c.endswith('_y') or (c.endswith('_DE') and not c.startswith('trend'))]
    df.drop(columns=dup_cols, inplace=True)
joined.shape, joined_test.shape

Fill in missing values

In [None]:
for df in (joined,joined_test):
    df['CompetitionOpenSinceYear'] = df.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)
    df['CompetitionOpenSinceMonth'] = df.CompetitionOpenSinceMonth.fillna(1).astype(np.int32)
    df['Promo2SinceYear'] = df.Promo2SinceYear.fillna(1900).astype(np.int32)
    df['Promo2SinceWeek'] = df.Promo2SinceWeek.fillna(1).astype(np.int32)

# Feature engineering
Create a new feature counting how many days a competing store has been open.

In [None]:
for df in (joined,joined_test):
    df["CompetitionOpenSince"] = pd.to_datetime(dict(year=df.CompetitionOpenSinceYear, 
                                                     month=df.CompetitionOpenSinceMonth, day=15))
    df["CompetitionDaysOpen"] = df.Date.subtract(df.CompetitionOpenSince).dt.days
joined.CompetitionDaysOpen.describe()

In [None]:
joined.CompetitionDaysOpen.isna().any()

We'll replace some erroneous / outlying data.

In [None]:
for df in (joined,joined_test):
    df.loc[df.CompetitionDaysOpen<0, "CompetitionDaysOpen"] = 0
    df.loc[df.CompetitionOpenSinceYear<1990, "CompetitionDaysOpen"] = 0
joined.CompetitionDaysOpen.describe() 

We add a "CompetitionMonthsOpen" field, capped at 24 months (2 years) to limit the number of unique categories.

In [None]:
for df in (joined,joined_test):
    df["CompetitionMonthsOpen"] = df["CompetitionDaysOpen"]//30
    df.loc[df.CompetitionMonthsOpen>24, "CompetitionMonthsOpen"] = 24
joined.CompetitionMonthsOpen.unique()
joined.CompetitionMonthsOpen.value_counts()
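
The whole competition feature can be traced on a few made-up rows. This is a compact sketch of the same steps (day 15 as the assumed opening day, negative and placeholder values zeroed, months capped at 24); it uses `clip` for the cap, which is a stylistic substitution for the `.loc` assignment above.

```python
import pandas as pd

# Toy rows (made-up values); 1900 is the fill value used for missing years.
demo = pd.DataFrame({
    'Date': pd.to_datetime(['2015-07-31'] * 4),
    'CompetitionOpenSinceYear': [2014, 2016, 1900, 2010],
    'CompetitionOpenSinceMonth': [7, 1, 1, 1],
})
opened = pd.to_datetime(dict(year=demo.CompetitionOpenSinceYear,
                             month=demo.CompetitionOpenSinceMonth, day=15))
days = (demo.Date - opened).dt.days
days[days < 0] = 0                              # competitor not open yet
days[demo.CompetitionOpenSinceYear < 1990] = 0  # placeholder year
demo['CompetitionMonthsOpen'] = (days // 30).clip(upper=24)
print(demo.CompetitionMonthsOpen.tolist())  # [12, 0, 0, 24]
```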

Apply the same process to the Promo2 start date.

In [None]:
for df in (joined,joined_test):
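    # pd.to_datetime converts the date objects directly; the unit-less
    # astype('datetime64') cast is rejected by newer pandas versions
    df["Promo2Since"] = pd.to_datetime(df.apply(lambda x: Week(
        x.Promo2SinceYear, x.Promo2SinceWeek).monday(), axis=1))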
    df["Promo2Days"] = df.Date.subtract(df["Promo2Since"]).dt.days
for df in (joined,joined_test):
    df.loc[df.Promo2Days<0, "Promo2Days"] = 0
    df.loc[df.Promo2SinceYear<1990, "Promo2Days"] = 0
    df["Promo2Weeks"] = df["Promo2Days"]//7
    df.loc[df.Promo2Weeks<0, "Promo2Weeks"] = 0
    df.loc[df.Promo2Weeks>25, "Promo2Weeks"] = 25
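
The Promo2 logic for a single row can be sketched with the standard library alone. `promo2_weeks` is a hypothetical helper for illustration; it assumes Python 3.8+, where `date.fromisocalendar(year, week, 1)` gives the Monday of an ISO week, which should match what `Week(year, week).monday()` from isoweek returns.

```python
from datetime import date

def promo2_weeks(sale_date, since_year, since_week, cap=25):
    """Weeks since Promo2 started, clamped to [0, cap] (hypothetical helper)."""
    if since_year < 1990:          # placeholder year used for missing values
        return 0
    # Monday of the given ISO week; stdlib counterpart of Week(y, w).monday()
    start = date.fromisocalendar(since_year, since_week, 1)
    days = max((sale_date - start).days, 0)
    return min(days // 7, cap)

print(promo2_weeks(date(2015, 7, 31), 2015, 1))   # 30 full weeks, capped to 25
print(promo2_weeks(date(2015, 7, 31), 2015, 27))  # 4
print(promo2_weeks(date(2015, 1, 1), 2015, 10))   # 0 (promo not started yet)
```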

In [None]:
joined.shape, joined_test.shape

# Fill gaps between events with weekends (if applicable)

Weekends are usually considered part of school/state holidays, so we fill the gaps between holiday periods with the adjacent weekend. The same applies to Promo periods.

First we need to create a weekend variable

In [None]:
columns = ["Date",'Dayofweek', "Store", "Promo", "StateHoliday", "SchoolHoliday"]
df = pd.concat([train[columns], test[columns]])  # DataFrame.append was removed in pandas 2.0
df['StateHoliday'] = df.StateHoliday.astype('int')
df['Weekend'] = 0
df.loc[df.Dayofweek>= 5 ,['Weekend']] = 1

Make new columns by combining weekend and School/state holidays

In [None]:
df['SchoolHoliday2'] = df['SchoolHoliday']
df['StateHoliday2'] = df['StateHoliday']
df['Promo2'] = df['Promo']

df.loc[df.Weekend==1, ['SchoolHoliday2', 'StateHoliday2', 'Promo2']] = 1

Calculate the duration of each event period

In [None]:
columns = ['SchoolHoliday2', 'StateHoliday2', 'Promo2']
sub = df[['Date','Store']+columns]   #make a smaller dataframe to work with. 
sub.sort_values(by=['Store', 'Date'], inplace=True)

daysum = sub.copy()

for c in columns:
  daysum.loc[:,c] = sub.groupby(['Store', sub[c].diff().ne(0).cumsum()])[c].transform('sum')

df = df.merge(daysum, how = 'left', on=['Date', 'Store'], suffixes=['', '_DaySum']) ;

In [None]:
rmax(500)
df[(df.Date>datetime(2015,3,1)) & (df.Store==1)].head(500)

We can see there are separate holiday and promo periods that would link up if weekends were taken into consideration.

A weekend on its own gives a duration of exactly 2 days, so a duration of more than 2 days (DaySum > 2) means the weekend is adjacent to an actual event. In that case, replace the original data with the weekend-filled data.

In [None]:
rmax()

In [None]:
df.loc[df['SchoolHoliday2_DaySum'] > 2, 'SchoolHoliday'] = df['SchoolHoliday2']
df.loc[df['StateHoliday2_DaySum'] > 2, 'StateHoliday'] = df['StateHoliday2'] 
df.loc[df['Promo2_DaySum'] > 2, 'Promo'] = df['Promo2'] 
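
The bridging rule can be traced on a toy calendar. The sketch below uses a single store (so the Store grouping key is omitted) and made-up holiday flags: an isolated weekend stays 0, while a weekend sandwiched between a Friday and a Monday holiday is filled in.

```python
import pandas as pd

# One store, 15 days starting Monday 2015-03-02 (made-up holiday flags).
toy = pd.DataFrame({'Date': pd.date_range('2015-03-02', periods=15)})
toy['Weekend'] = (toy.Date.dt.dayofweek >= 5).astype(int)
toy['Holiday'] = 0
toy.loc[toy.Date.isin(pd.to_datetime(['2015-03-13', '2015-03-16'])), 'Holiday'] = 1

filled = toy.Holiday.where(toy.Weekend == 0, 1)   # weekends forced to 1
run_id = filled.diff().ne(0).cumsum()             # label consecutive runs
run_len = filled.groupby(run_id).transform('sum') # length of each run of 1s
toy.loc[run_len > 2, 'Holiday'] = filled          # keep only bridged weekends

print(toy.Holiday.tolist())  # eleven 0s, then 1 for 13-16 March
```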

Now 'SchoolHoliday', 'StateHoliday', and 'Promo' take weekends into account: event periods that were split only by a weekend are now consecutive.

Clean up the dataframe.

In [None]:
columns = ['SchoolHoliday', 'StateHoliday', 'Promo']
df = df[['Date','Store', 'Weekend']+columns]

(Taken from the fastai course notebook):
We'll define a function get_elapsed for cumulative counting across a sorted dataframe. Given a particular field fld to monitor, this function tracks the time since the last occurrence of that field. When the field is seen again, the counter is reset to zero. Until the field is first encountered, the result is a datetime NA. The counter is also reset every time a new store is seen. We'll see how to use this shortly.
Let's walk through an example. Say we're looking at School Holiday. We first sort by Store, then Date, and then call get_elapsed('SchoolHoliday', 'After'). This will:
* be applied to every row of the dataframe in order of store and date
* add to the dataframe the days since seeing a School Holiday
* count the days until the next holiday instead, if we sort in the other direction

In [None]:
def get_elapsed(fld, pre):
    day1 = np.timedelta64(1, 'D')
    last_date = np.datetime64()
    last_store = 0
    res = []

    for s,v,d in zip(df.Store.values,df[fld].values, df.Date.values):
        if s != last_store:
            last_date = np.datetime64()
            last_store = s
        if v: last_date = d
        res.append(((d-last_date).astype('timedelta64[D]') / day1))
    df[pre+fld] = res
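
To see what the counter produces, here is a self-contained variant run on toy data. `elapsed_since` is an illustrative rewrite that takes the frame as an argument and returns the values instead of writing to the global df; the reset logic is meant to mirror get_elapsed, and the data are made up.

```python
import numpy as np
import pandas as pd

def elapsed_since(df, fld):
    """Days since `fld` was last 1 within each store (NaN before the first event).
    Illustrative variant of get_elapsed that takes the frame as an argument."""
    day1 = np.timedelta64(1, 'D')
    last_date, last_store, res = np.datetime64('NaT', 'ns'), None, []
    for s, v, d in zip(df.Store.values, df[fld].values, df.Date.values):
        if s != last_store:                      # new store: reset the counter
            last_date, last_store = np.datetime64('NaT', 'ns'), s
        if v: last_date = d                      # event seen: counter restarts at 0
        res.append((d - last_date) / day1)
    return res

toy = pd.DataFrame({
    'Store': [1, 1, 1, 1, 2, 2],
    'Date': pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
                            '2015-01-01', '2015-01-02']),
    'Promo': [0, 1, 0, 0, 0, 1],
})
toy['AfterPromo'] = elapsed_since(toy, 'Promo')
print(toy.AfterPromo.tolist())  # [nan, 0.0, 1.0, 2.0, nan, 0.0]
```

Sorting each store by descending date before the call would give the (negative) days until the next event instead.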

In [None]:
for fld in ['SchoolHoliday', 'StateHoliday', 'Promo']:
    # Ascending dates: days since the last event
    df = df.sort_values(['Store', 'Date'])
    get_elapsed(fld, 'After')
    # Descending dates: days until the next event
    df = df.sort_values(['Store', 'Date'], ascending=[True, False])
    get_elapsed(fld, 'Before')

We're going to set the active index to Date. Then set null values from elapsed field calculations to 0.

In [None]:
df = df.set_index("Date")
columns = ['SchoolHoliday', 'StateHoliday', 'Promo']
for o in ['Before', 'After']:
    for p in columns:
        a = o+p
        df[a] = df[a].fillna(0).astype(int)

# Features from run-length encoding
Create new features that capture the total duration of a given event, such as how long the current promo/holiday period lasts. Also add the running day count within that event (e.g. day 1 of a 42-day school holiday).

Any weekend gaps were already filled in above.

In [None]:
df.info()

Make a smaller dataframe to work with

In [None]:
columns = ['SchoolHoliday', 'StateHoliday', 'Promo']
sub = df[['Store']+columns]

In [None]:
# sub.sort_index(inplace=True)
sub.sort_values(by=['Store', 'Date'], inplace=True)
daysum = sub.copy()
daycount = sub.copy()

for c in columns:
  daysum.loc[:,c] = sub.groupby(['Store', sub[c].diff().ne(0).cumsum()])[c].transform('sum')
  daycount[c] = sub.groupby(['Store', sub[c].diff().ne(0).cumsum()])[c].transform('cumsum')

sub2 = sub.merge(daysum, how = 'left', on=['Date', 'Store'], suffixes=['', '_DaySum']) ;
sub2 = sub2.merge(daycount, how = 'left', on=['Date', 'Store'], suffixes=['', '_DayCount']) ; 
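
The run-length trick used above is easier to follow on a single toy series. `diff().ne(0).cumsum()` assigns a new id whenever the value changes; summing within an id gives the event duration (DaySum), and a cumulative sum gives the position inside the event (DayCount). The values below are made up, and the Store key is left out because there is only one series.

```python
import pandas as pd

promo = pd.Series([0, 1, 1, 1, 0, 0, 1, 1], name='Promo')

# A new run starts wherever the value differs from the previous row.
run_id = promo.diff().ne(0).cumsum()

day_sum = promo.groupby(run_id).transform('sum')       # total length of the current event
day_count = promo.groupby(run_id).transform('cumsum')  # position within the event
print(run_id.tolist())     # [1, 2, 2, 2, 3, 3, 4, 4]
print(day_sum.tolist())    # [0, 3, 3, 3, 0, 0, 2, 2]
print(day_count.tolist())  # [0, 1, 2, 3, 0, 0, 1, 2]
```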

## Apply rolling window functions

In [None]:
bwd = df[['Store']+columns].sort_index().groupby("Store").rolling(7, min_periods=1).sum()
fwd = df[['Store']+columns].sort_index(ascending=False
                                      ).groupby("Store").rolling(7, min_periods=1).sum()

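# errors='ignore': newer pandas versions already exclude the grouping column
bwd.drop(columns='Store', inplace=True, errors='ignore')
bwd.reset_index(inplace=True)
fwd.drop(columns='Store', inplace=True, errors='ignore')
fwd.reset_index(inplace=True)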
df.reset_index(inplace=True)
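
A small sketch of the backward and forward windows on made-up data, using a 3-row window instead of 7 to keep the output short. The column is selected explicitly here so the result does not depend on how a given pandas version treats the grouping column. Note that an integer window counts rows, not calendar days.

```python
import pandas as pd

toy = pd.DataFrame({
    'Store': [1] * 4 + [2] * 2,
    'Promo': [1, 0, 1, 1, 1, 1],
}, index=pd.DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
                           '2015-01-01', '2015-01-02'], name='Date'))

# Backward window: promo days among the current and 2 previous rows of each store.
bwd = toy.sort_index().groupby('Store')['Promo'].rolling(3, min_periods=1).sum()
# Forward window: the same computation on the date-reversed frame.
fwd = toy.sort_index(ascending=False).groupby('Store')['Promo'].rolling(3, min_periods=1).sum()

print(bwd.reset_index())
print(fwd.reset_index())
```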

## Merge the engineered features with the main dataframe

In [None]:
df = df.merge(bwd, 'left', ['Date', 'Store'], suffixes=['', '_bw'])
df = df.merge(fwd, 'left', ['Date', 'Store'], suffixes=['', '_fw'])

sub2.reset_index(inplace=True)
sub2.drop(columns=columns, inplace=True)
df = df.merge(sub2, how='left', on=['Date','Store'])
df.drop(columns=columns, inplace=True)

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.to_feather(f'{OUTPUT}df')

In [None]:
joined = join_df(joined, df, ['Store', 'Date'])
joined_test = join_df(joined_test, df, ['Store', 'Date'])

Rows where the store had zero sales (i.e. it was closed) are removed from the training set.

In [None]:
joined = joined[joined.Sales!=0]

Save the joined master dataframes

In [None]:
joined.reset_index(drop=True, inplace=True)
joined_test.reset_index(drop=True, inplace=True)
joined.to_feather(f'{OUTPUT}joined2')
joined_test.to_feather(f'{OUTPUT}joined2_test')

Now we can proceed to the deep learning part:

https://www.kaggle.com/zongtseng/rossmann-time-series-prediction-deep-learning?scriptVersionId=23443126