# Rossmann - Fastai

This notebook is based on the code from [Fast.ai Machine Learning class, lecture 12](https://course18.fast.ai/lessonsml1/lesson12.html). The lecture reviews the third-place solution described [here](https://www.kaggle.com/c/rossmann-store-sales/discussion/17974).

In [None]:
import numpy as np
import pandas as pd
from os.path import splitext, join
from IPython.display import HTML, display
import os
import re
import time
from isoweek import Week
from functools import partial

from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Embedding, concatenate, Flatten, Dropout

pd.set_option('display.max_columns', 1000)

# Load Data

The competition data contains historical sales for 1,115 Rossmann stores.
- The training data covers sales (and the number of customers) from January 1, 2013 to July 31, 2015.
- For the test data, we need to estimate sales for the period from August 1, 2015 to September 17, 2015. Promo information is provided for this "future" period, along with state and school holiday information.

Before reading the data, we need to add a few more data files contributed by the community (all data can be downloaded from [here](http://files.fast.ai/part2/lesson14/rossmann.tgz)). In addition to the files provided by the competition, these include four files:
1. *googletrend.csv*: A Google Trends metric for a particular week and a particular state in Germany.
2. *weather.csv*: Various weather information for the period from the start of the training data to the end of the test data.
3. *store_states.csv*: The state location of each store ID. I would like to know how they got that!
4. *state_names.csv*: A mapping from full state names to their abbreviations.

First, walk through the data folder, read every CSV file into a pandas dataframe, and store them all in a dictionary keyed by file name (without the extension).

In [None]:
data = {}
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        full_name = os.path.join(dirname, filename)
        key = splitext(filename)[0]        
        data[key] = pd.read_csv(full_name, low_memory=False)
    
display(f'Loaded file keys: {list(data.keys())}', )

# Clean and merge tables

Define a function that takes a `datetime` column and extracts date parts from it. By default it extracts a full list of date parts, but we can also pass a subset of them.

In [None]:
def add_date_parts(df, date_column, parts=None, prefix=None):
    """
    Add date information to the dataframe inplace.
    """
    prefix = prefix or ''
    if parts is None:
        parts = [
            'Year',
            'Month',
            'Week',
            'Day',
            'Dayofweek',
            'Dayofyear',
            'Is_month_end',
            'Is_month_start',
            'Is_quarter_end',
            'Is_quarter_start',
            'Is_year_end',
            'Is_year_start',
            'Elapsed'
        ]
    if not np.issubdtype(df[date_column].dtype, np.datetime64):
        df[date_column] = pd.to_datetime(df[date_column], infer_datetime_format=True)
    
    s = df[date_column]
    for part in parts:
        if part == 'Week':
            df[prefix + part] = s.dt.isocalendar()['week']
        elif part == 'Elapsed':
            df[prefix + part] = s.astype(np.int64) // 10 ** 9
        else:
            df[prefix + part] = getattr(s.dt, part.lower())
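
To make the date parts concrete, here is a minimal sketch on a toy frame (the two dates are chosen only for illustration). It computes a few of the parts directly with the pandas `.dt` accessor rather than calling the function above, and derives the elapsed seconds from a timedelta, which is an alternative that does not depend on the column's nanosecond resolution.

```python
import pandas as pd

# Toy frame with two dates; values are illustrative only.
demo = pd.DataFrame({'Date': pd.to_datetime(['2015-07-31', '2015-01-01'])})

s = demo['Date']
demo['Year'] = s.dt.year
demo['Week'] = s.dt.isocalendar()['week']   # ISO week number
demo['Is_month_end'] = s.dt.is_month_end
# Seconds since the Unix epoch, computed from a timedelta
demo['Elapsed'] = (s - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

print(demo)
```

Note that January 1, 2015 falls in ISO week 1, but dates at the start or end of a year can belong to a week of the neighbouring ISO year.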

The Google Trends table encodes the date and the state name inside its `week` and `file` columns, so they need to be extracted from them. In addition, we rename one state code to be consistent with the other tables.

In [None]:
googletrend = data['googletrend']
googletrend['Date'] = pd.to_datetime(googletrend.week.str.split(' - ', expand=True)[0])
googletrend['State'] = googletrend.file.str.split('_', expand=True)[2]
googletrend.loc[googletrend.State=='NI', 'State'] = 'HB,NI'
add_date_parts(googletrend, 'Date', parts=['Year', 'Week'])
googletrend.head()
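
The string handling above can be tried on a couple of rows. This is a small sketch with made-up rows; the column layout and the `Rossmann_DE_SN` naming pattern are assumptions inferred from the code above, and the trend values are invented.

```python
import pandas as pd

# Two made-up rows in the assumed shape of googletrend.csv.
gt = pd.DataFrame({
    'file': ['Rossmann_DE_SN', 'Rossmann_DE'],
    'week': ['2012-12-02 - 2012-12-08', '2012-12-09 - 2012-12-15'],
    'trend': [96, 95],
})

# The week start is the text before ' - '
gt['Date'] = pd.to_datetime(gt['week'].str.split(' - ', expand=True)[0])
# The state code is the third '_'-separated token; it is missing for the
# national series, which has only two tokens
gt['State'] = gt['file'].str.split('_', expand=True)[2]

print(gt[['file', 'Date', 'State']])
```

The national row ends up with a null `State`, which is why it is handled as a separate table in the next step.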

The Google Trends data has a special category for the whole of Germany (rather than a single state). We extract it so it can be joined by year and week to each row in the train/test datasets.

In [None]:
trend_de = googletrend[googletrend.file == 'Rossmann_DE'][['trend', 'Year', 'Week']]
trend_de.head()

Next, we join the weather table with the state name mapping, and add year and week information so that it can be joined with Google Trends.

In [None]:
weather = data['weather'].merge(data['state_names'], how='left', left_on='file', right_on='StateName').drop('file', axis=1)
add_date_parts(weather, 'Date', parts=['Year', 'Week'])
weather.head()

Next we merge the `store` table with the `store_states` table to add state information to that table for merging with train/test datasets.

In [None]:
store = data['store'].merge(data['store_states'], how='left', on='Store')
store.head()

Add all date parts information to the train/test datasets.

In [None]:
add_date_parts(data['train'], 'Date')
add_date_parts(data['test'], 'Date')

Now we can merge all the support tables: `store`, `googletrend`, `trend_de`, and `weather` with train/test into a single table.

In [None]:
def merge_all(df):
    out = (df
        .merge(store, how='left', on='Store')
        .merge(googletrend, how='left', on=['State', 'Year', 'Week'], suffixes=('', '_y'))
        .merge(trend_de, how='left', on=['Year', 'Week'], suffixes=('', '_DE'))
        .merge(weather, how='left', on=['State', 'Date'], suffixes=('', '_y'))
    )
    # Drop replicated columns for right merged tables and a couple unwanted ones
    drop_cols = list(out.columns[out.columns.str.endswith('_y')]) + ['week', 'file']
    out.drop(drop_cols, inplace=True, axis=1)
    
    # Check if the merge resulted in any new nulls
    print('Merge has nulls:', any([
        any(out.StoreType.isnull()),
        any(out.trend.isnull()),
        any(out.trend_DE.isnull()),
        any(out.Mean_TemperatureC.isnull()),
    ]))
    return out

train = merge_all(data['train'])
test = merge_all(data['test'])
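
As a complement to the null check inside `merge_all`, a left merge can also be inspected with the `indicator` option of `DataFrame.merge`. The sketch below uses two tiny hypothetical tables (the values are invented) to show how rows without a match on the right side can be listed explicitly.

```python
import pandas as pd

# Hypothetical miniature tables; store 3 has no state entry.
sales = pd.DataFrame({'Store': [1, 2, 3], 'Sales': [5263, 6064, 8314]})
stores = pd.DataFrame({'Store': [1, 2], 'State': ['HE', 'TH']})

# indicator=True adds a `_merge` column telling where each row came from
merged = sales.merge(stores, how='left', on='Store', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']

print(unmatched[['Store', 'Sales']])
```

Listing the unmatched keys is often more useful than a single boolean, since it shows which stores or dates caused the nulls.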

# Feature engineering

We need to apply the same processing steps to both the train and test data. The following function adds these features:
1. `CompetitionMonthsOpen`: The number of months the nearest competitor had been open as of the current row's date, clipped between 0 and 24 months.
1. `Promo2Weeks`: The number of weeks since the store's continuing promotion (Promo2) started, clipped between 0 and 25 weeks.
1. `{Before, After}{SchoolHoliday, StateHoliday, Promo}`: The number of days from the current row to the next event (`Before`, zero or negative) or since the last event (`After`).
1. `{SchoolHoliday, StateHoliday, Promo}{_bw, _fw}`: The total number of each event within the past week (`_bw`) or the coming week (`_fw`).
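
The `Before`/`After` idea can be sketched on a single toy store. This is an illustration under simplified assumptions (one store, five consecutive days, invented promo flags), using `ffill`/`bfill` within each store to carry the last or next event date onto every row.

```python
import pandas as pd

demo = pd.DataFrame({
    'Store': [1] * 5,
    'Date': pd.date_range('2015-01-01', periods=5),
    'Promo': [0, 1, 0, 0, 1],
}).sort_values(['Store', 'Date'])

# Keep the date only on event days (NaT elsewhere)
event_date = demo['Date'].where(demo['Promo'] == 1)
last_event = event_date.groupby(demo['Store']).ffill()   # most recent event so far
next_event = event_date.groupby(demo['Store']).bfill()   # next upcoming event

# Rows with no previous/next event get 0, as in the function described above
demo['AfterPromo'] = (demo['Date'] - last_event).dt.days.fillna(0).astype(int)
demo['BeforePromo'] = (demo['Date'] - next_event).dt.days.fillna(0).astype(int)

print(demo)
```

`AfterPromo` counts up from each event, while `BeforePromo` is negative and counts up towards zero as the next event approaches.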

In [None]:
%%time

def add_features(df):
    df = df.copy()
    # Convert StateHoliday to a flag (0/1) value instead of a categorical one
    df.StateHoliday = (df.StateHoliday != '0').astype(int)
    
    # Add the number of months a competition has been open
    df['CompetitionOpenSinceYear'] = df.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)
    df['CompetitionOpenSinceMonth'] = df.CompetitionOpenSinceMonth.fillna(1).astype(np.int32)
    df['CompetitionOpenSince'] = pd.to_datetime(dict(year=df.CompetitionOpenSinceYear, month=df.CompetitionOpenSinceMonth, day=15))
    df['CompetitionDaysOpen'] = df.Date.subtract(df.CompetitionOpenSince).dt.days
    df.loc[df.CompetitionDaysOpen < 0, "CompetitionDaysOpen"] = 0
    df.loc[df.CompetitionOpenSinceYear < 1990, "CompetitionDaysOpen"] = 0
    df['CompetitionMonthsOpen'] = (df.CompetitionDaysOpen//30).clip(0, 24)
    
    # Add the number of weeks since promotion 
    df['Promo2SinceYear'] = df.Promo2SinceYear.fillna(1900).astype(np.int32)
    df['Promo2SinceWeek'] = df.Promo2SinceWeek.fillna(1).astype(np.int32)
    df['Promo2Since'] = pd.to_datetime([
        Week(y[0], int(y[1])).monday() for y in df[['Promo2SinceYear', 'Promo2SinceWeek']].values
    ])
    df['Promo2Days'] = df.Date.subtract(df.Promo2Since).dt.days
    df.loc[df.Promo2Days < 0, 'Promo2Days'] = 0
    df['Promo2Weeks'] = (df['Promo2Days']//7).clip(0, 25)
    
    # Add elapsed time since and to next of the following flags
    columns = ['SchoolHoliday', 'StateHoliday', 'Promo']

    df.sort_values(['Store', 'Date'], inplace=True)
    for column in columns:
        mask = df[column] == 1    
        for name, method in zip([f'After{column}', f'Before{column}'], ['ffill', 'bfill']):
            df.loc[mask, name] = df.loc[mask, 'Date']
            # transform('ffill'/'bfill') avoids the deprecated fillna(method=...)
            df[name] = df.groupby('Store')[name].transform(method)
            df[name] = (df.Date - df[name]).dt.days.fillna(0).astype(int)
        
    # Set the active index to Date, so we can do rolling sums
    df.set_index('Date', inplace=True)
    
    # We will sum total number of the following in last/next week
    bw = df.sort_index().groupby('Store')[columns].rolling(7, min_periods=1).sum()    
    fw = df.sort_index(ascending=False).groupby('Store')[columns].rolling(7, min_periods=1).sum()
    df = (
        df
        .merge(bw, how='left', on=['Store', 'Date'], suffixes=['', '_bw'])
        .merge(fw, how='left', on=['Store', 'Date'], suffixes=['', '_fw'])
    )
    return df.reset_index()

train = add_features(train)
test = add_features(test)
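
The backward and forward sums can be checked by hand on a small series. The sketch below assumes a single store with ten consecutive days and invented promo flags; it rolls over the series in both sort orders and relies on index alignment to put the results back in place.

```python
import pandas as pd

demo = pd.DataFrame({
    'Date': pd.date_range('2015-01-01', periods=10),
    'Promo': [1, 0, 0, 0, 0, 0, 0, 1, 1, 0],
}).set_index('Date')

# Backward window: the current row plus the 6 rows before it
demo['Promo_bw'] = demo['Promo'].rolling(7, min_periods=1).sum()
# Forward window: reverse the order, roll, and let pandas align on the Date index
demo['Promo_fw'] = (
    demo['Promo'].sort_index(ascending=False).rolling(7, min_periods=1).sum()
)

print(demo)
```

The window here counts rows, not calendar days, so a store with missing dates would get a window spanning more than one week.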

Now it is time to checkpoint our data work.

In [None]:
train.reset_index().to_feather('/kaggle/working/train')
test.to_feather('/kaggle/working/test')

# Train/Val data preparation

Re-read the data if it is not already in memory.

In [None]:
train = pd.read_feather('/kaggle/working/train')
test = pd.read_feather('/kaggle/working/test')

Now that we have all the required features, it is time to prepare the data for training and evaluation. We manually select the categorical and continuous features to keep.

Also, following the original authors, we remove all rows in the training dataset with Sales equal to 0 (days on which the store was closed, e.g. for renovation). This is probably because the competition metric, RMSPE, divides by the actual sales and is therefore undefined for zero labels.
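
To see why zero labels are a problem, here is a minimal NumPy sketch of the root mean squared percentage error. The helper name and the numbers are illustrative; the TensorFlow metric used for training appears later in the notebook.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean squared percentage error; y_true must not contain zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Each error is measured relative to the true value
    return np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))

# Both predictions are off by 10% of the true value
score = rmspe([100.0, 200.0], [110.0, 180.0])
print(score)
```

Because every error is divided by `y_true`, a single row with zero sales would make the score infinite or undefined.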

In [None]:
categorical_cols = [
    'Store','DayOfWeek', 'Year', 'Month', 
    'Day', 'StateHoliday', 'CompetitionMonthsOpen','Promo2Weeks',
    'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear',
    'Promo2SinceYear', 'State', 'Week', 'Events',
    'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw',
    'SchoolHoliday_fw', 'SchoolHoliday_bw'
]
continuous_cols = [
    'CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
    'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h', 
    'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE',
    'AfterStateHoliday', 'BeforeStateHoliday', 'Promo', 'SchoolHoliday'
]
target = 'Sales'

all_cols = categorical_cols + continuous_cols

train = train[train.Sales > 0]

# Select those specified columns
train = train[all_cols + [target, 'Date']]
test = test[all_cols + ['Date', 'Id']]

For the categorical columns, first convert them to the pandas `category` `dtype`, then use the numerical codes of those categories as their integer values. Note that pandas maps nulls to a code of -1, so we add 1 to make nulls always map to 0. Test values that never appear in the training data are treated the same way.

For the continuous columns, fill nulls with zeros, convert the `dtype` to 32-bit float, and then standardize each column (zero mean, unit variance).

In [None]:
# Convert categorical columns to numbers
for c in categorical_cols:    
    train[c] = train[c].astype('category').cat.as_ordered()
    test[c] = test[c].astype('category').cat.as_ordered()
    # `inplace` was removed from set_categories in pandas 2.0, so reassign instead
    test[c] = test[c].cat.set_categories(train[c].cat.categories, ordered=True)
    
    train[c] = train[c].cat.codes + 1
    test[c] = test[c].cat.codes + 1
    
for c in continuous_cols:
    train[c] = train[c].fillna(0).astype('float32')
    test[c] = test[c].fillna(0).astype('float32')

scaler = StandardScaler()
train[continuous_cols] = scaler.fit_transform(train[continuous_cols])
test[continuous_cols] = scaler.transform(test[continuous_cols])
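
The effect of the `+ 1` shift is easiest to see on a toy column. In this sketch (the values are made up), the test column contains a category that never occurs in training as well as a null, and both end up with code 0.

```python
import pandas as pd

train_col = pd.Series(['a', 'b', None, 'a'])
test_col = pd.Series(['b', 'c', None])   # 'c' is unseen in training

train_cat = train_col.astype('category').cat.as_ordered()
# Reuse the training categories so codes mean the same thing in both sets
test_cat = test_col.astype('category').cat.set_categories(
    train_cat.cat.categories, ordered=True
)

train_codes = train_cat.cat.codes + 1   # nulls: -1 + 1 = 0
test_codes = test_cat.cat.codes + 1     # nulls and unseen values: 0

print(train_codes.tolist(), test_codes.tolist())
```

Reserving index 0 this way gives the embedding layers a dedicated slot for missing or unknown values.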

# Modeling with deep learning

The following cell defines functions that generate the required train/validation data and build the model itself.
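
The validation set is taken from the most recent dates rather than sampled at random, so that validation mimics predicting the future. A minimal sketch of such a date-based split, using a boolean mask on an invented four-row frame (an alternative to slicing a date index, which works regardless of row order):

```python
import pandas as pd

demo = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-14', '2015-06-15', '2015-06-16', '2015-06-17']),
    'Sales': [10, 20, 30, 40],
})
split_date = pd.Timestamp('2015-06-15')

# Everything up to and including the split date is used for training
train_part = demo[demo['Date'] <= split_date]
# Everything after it is held out for validation
val_part = demo[demo['Date'] > split_date]

print(len(train_part), len(val_part))
```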

In [None]:
def get_data(has_validation=True):
    # `train` is a global here; copy it so the function works on a local frame
    data = train.copy().set_index('Date')
    X = data[all_cols]
    y = np.log(data[target])
    y_max = y.max()
    y_min = y.min()
    
    if has_validation:        
        split_date = '2015-06-15'
        val_split_date = '2015-06-16'

        X_train, X_val = X.loc[:split_date], X.loc[val_split_date:]

        # Now, convert the training data into a list 
        X_train = [X_train[continuous_cols].values] + [X_train[c].values[..., None] for c in categorical_cols]
        X_val = [X_val[continuous_cols].values] + [X_val[c].values[..., None] for c in categorical_cols]

        # Get the labels
        y_train, y_val = y.loc[:split_date].values, y.loc[val_split_date:].values
        
        y_train = (y_train - y_min)/(y_max - y_min)
        y_val = (y_val - y_min)/(y_max - y_min)
        return y_max, y_min, X_train, y_train, X_val, y_val
    else:
        X_train = [X[continuous_cols].values] + [X[c].values[..., None] for c in categorical_cols]
        y_train = y.values
        y_train = (y_train - y_min)/(y_max - y_min)
        return y_max, y_min, X_train, y_train


def get_rmspe(y_min, y_max):
    def rmspe(y_true, y_pred):
        y_true = tf.math.exp(y_true * (y_max - y_min) + y_min)
        y_pred = tf.math.exp(y_pred * (y_max - y_min) + y_min)
        return tf.math.sqrt(tf.reduce_mean(tf.square((y_true - y_pred)/y_true)))
    return rmspe


def get_model(y_min, y_max, learning_rate=1e-3, dropout=0.1):
    # get the embedding sizes
    embedding_map = []
    cardinalities = list(train[categorical_cols].nunique().values+1)
    for name, cardinality in zip(categorical_cols, cardinalities):
        embedding_map.append({'cardinality': cardinality, 'size': min(50, (cardinality+1)//2)})

    # Define the neural network
    keras.backend.clear_session()
    # Size the continuous input from the column list instead of the global X_train
    inputs = [keras.Input(shape=(len(continuous_cols),))] + [keras.Input(shape=(1,)) for _ in categorical_cols]
    outputs = [inputs[0]]
    for cat_input, embedding, name in zip(inputs[1:], embedding_map, categorical_cols):
        out = Embedding(embedding['cardinality'], embedding['size'], input_length=1, name=name)(cat_input)
        outputs.append(Flatten()(out))   
    output = concatenate(outputs)
    output = Dense(1024, activation='relu')(output)
    output = Dropout(dropout)(output)
    output = Dense(512, activation='relu')(output)
    output = Dropout(dropout)(output)
    output = Dense(1, activation='sigmoid')(output)

    model = keras.Model(inputs=inputs, outputs=output)
    model.compile(
        loss='mean_absolute_error', 
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate), 
        metrics=[get_rmspe(y_min, y_max)]
    )
    return model
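
The embedding size rule used in `get_model` is roughly half the cardinality, capped at 50. The sketch below applies it to a few illustrative cardinalities; the numbers are assumptions for the example (distinct values plus one slot for nulls), not values read from the data.

```python
def embedding_size(cardinality, max_size=50):
    """Half the cardinality (rounded up), capped at max_size."""
    return min(max_size, (cardinality + 1) // 2)

# Hypothetical cardinalities, each including the extra slot for code 0
example_cardinalities = {'DayOfWeek': 8, 'Store': 1116, 'StateHoliday': 3}
sizes = {name: embedding_size(n) for name, n in example_cardinalities.items()}

print(sizes)
```

Low-cardinality columns get a handful of dimensions, while high-cardinality ones such as `Store` hit the cap.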

Now we train the model with a held-out validation set so that we can select the best hyperparameters.

In [None]:
def schedule(epoch):
    if epoch < 20:
        return 1e-3
    else:
        return 5e-4
    
callbacks = [keras.callbacks.LearningRateScheduler(schedule, verbose=0)]

epochs = 30
batch_size = 256
learning_rate = 2e-3
dropout = 0.05

y_max, y_min, X_train, y_train, X_val, y_val = get_data(has_validation=True)

model = get_model(y_min, y_max, learning_rate, dropout)
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val))

Next we train the model on all the data so that it sees the most recent sales, which might be important for getting a good score on the test dataset.

In [None]:
epochs = 30
batch_size = 256
learning_rate = 1e-3
dropout = 0.05

y_max, y_min, X_train, y_train = get_data(has_validation=False)
model = get_model(y_min, y_max, learning_rate, dropout)
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, callbacks=callbacks)

# Predict and submit

We use the final trained model to predict the test dataset. Note that the model outputs values between 0 and 1, because the training target was the min-max scaled logarithm of Sales. To recover sales we must invert both steps: rescale with the minimum and maximum of the training log-sales (`y_min`, `y_max`), then exponentiate.
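
The round trip between sales and the network's target can be verified with a few numbers. This is a small NumPy sketch with invented sales figures; it mirrors the log and min-max steps used for the training target and their inverse.

```python
import numpy as np

sales = np.array([1000.0, 5000.0, 20000.0])   # illustrative values

# Forward transform: log, then min-max scaling to [0, 1]
y = np.log(sales)
y_min, y_max = y.min(), y.max()
scaled = (y - y_min) / (y_max - y_min)

# Inverse transform: undo the scaling, then exponentiate
restored = np.exp(scaled * (y_max - y_min) + y_min)

print(scaled, restored)
```

Since the sigmoid output stays inside (0, 1), the restored predictions always lie between the smallest and largest sales seen in training.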

In [None]:
test = test.set_index('Date')
X = test[all_cols]
ids = test.Id

# Now, convert the data into a list 
X_test = [X[continuous_cols].values] + [X[c].values[..., None] for c in categorical_cols]

In [None]:
y_pred = np.exp(model.predict(X_test) * (y_max - y_min) + y_min)

In [None]:
y_pred.min()

In [None]:
submit = pd.DataFrame({'Id': ids, 'Sales': y_pred.squeeze()}).sort_values('Id')
submit.to_csv('/kaggle/working/submission.csv', index=False)