# Deep learning using the fastai library

 
This notebook is for practicing building a deep learning model with the fastai library.
The feature engineering is done in a separate notebook:

https://www.kaggle.com/zongtseng/rossmann-time-series-data-engineering

It is mainly based on the fastai course notebooks, with some additional features such as run-length encoding.

https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson6-rossmann.ipynb

https://github.com/fastai/fastai/blob/master/courses/dl1/lesson3-rossman.ipynb


# Env Setup

In [None]:
from fastai.tabular import *
import os, tarfile
import random
import matplotlib.pyplot as plt
import pandas as pd
import re
from datetime import *

%matplotlib inline
%reload_ext autoreload
%autoreload 2

np.random.seed(23)
np.set_printoptions(threshold=50, edgeitems=20)

# Load data

In [None]:
!ls -al /kaggle/input/rossmann-time-series-data-engineering

In [None]:
OUTPUT = '/kaggle/working/'
PATH='/kaggle/input/rossmann-time-series-data-engineering/'
df = pd.read_feather(f'{PATH}df')
train_df = pd.read_feather(f'{PATH}joined2')
test_df = pd.read_feather(f'{PATH}joined2_test')
train_df.shape, test_df.shape

# Setup dataset

Construct the dataset for the deep learning model from the relevant variables.

In [None]:
cat_vars = ['Store', 'DayOfWeek', 'Promo',
       'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Week', 'Day',
       'Is_year_end', 'Is_year_start', 'StoreType', 'Assortment', 
       'Promo2', 'PromoInterval', 'State',   
       'Events',  'CompetitionMonthsOpen', 
       'Promo2Weeks',
       'SchoolHoliday_bw','StateHoliday_bw', 'Promo_bw', 'SchoolHoliday_fw', 'StateHoliday_fw','Promo_fw', 
       'SchoolHoliday_DaySum', 'StateHoliday_DaySum', 'Promo_DaySum', 
       'SchoolHoliday_DayCount', 'StateHoliday_DayCount', 'Promo_DayCount']

cont_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
            'Max_Humidity','Mean_Humidity', 'Min_Humidity',
            'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h','Precipitationmm','CloudCover',
            'trend', 'trend_DE', 'CompetitionDaysOpen', 'Promo2Days',
            'AfterSchoolHoliday', 'BeforeSchoolHoliday', 'AfterStateHoliday',
            'BeforeStateHoliday', 'AfterPromo', 'BeforePromo']

dep_var = 'Sales'
df = train_df[cat_vars + cont_vars + [dep_var,'Date']].copy()

Determine the time frame used for validation: take the most recent dates from the training set, covering about as many rows as the test set.

In [None]:
test_df['Date'].min(), test_df['Date'].max(), len(test_df)

In [None]:
cut = train_df['Date'][(train_df['Date'] == train_df['Date'][len(test_df)])].index.max()
valid_idx = range(cut)
valid_idx

In [None]:
train_df['Date'][0], train_df['Date'][cut] 

We take the `n` most recent samples as the validation set, where `n` is roughly the length of the test set.
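
The index arithmetic above only works if the training frame is ordered with the most recent dates first. That ordering is an assumption about the upstream notebook, consistent with `train_df['Date'][0]` being compared against the cut date. A minimal sketch of the same logic on made-up dates, in plain Python rather than pandas:

```python
from datetime import date, timedelta

# Hypothetical toy data: 3 stores, 10 days, newest date first
# (the real train_df is assumed to be ordered the same way).
n_stores, n_days = 3, 10
last_day = date(2015, 7, 31)
dates = [last_day - timedelta(days=d) for d in range(n_days) for _ in range(n_stores)]

n_test = 7  # stand-in for len(test_df)

# Date of the row that sits n_test rows from the top ...
boundary_date = dates[n_test]
# ... and the last row index that carries that same date.
cut = max(i for i, d in enumerate(dates) if d == boundary_date)

valid_idx = range(cut)               # rows 0 .. cut-1: most recent, used for validation
train_idx = range(cut, len(dates))   # row `cut` and everything older: used for training

print(cut, dates[0], dates[cut])
```

Note that `range(cut)` excludes its endpoint, so the row at index `cut` itself stays in the training part even though it shares the boundary date.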

Now we can construct the dataset using fastai's `databunch` method. We first use a larger batch size so the model converges faster. (A smaller batch size gives noisier gradient updates, while a larger one may risk overfitting; this still needs to be confirmed.)

In [None]:
procs=[FillMissing, Categorify, Normalize]

datalist = (TabularList.from_df(df, path=OUTPUT, cat_names=cat_vars, cont_names=cont_vars, procs=procs,)
                .split_by_idx(valid_idx=valid_idx)
                .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
                .add_test(TabularList.from_df(test_df, path=PATH, cat_names=cat_vars, cont_names=cont_vars)))
data = datalist.databunch(bs=512)
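
Because the labels above are created with `log=True`, the network is trained on log(Sales), and the `exp_rmspe` metric passed to the learner undoes the log before computing the root mean squared percentage error. A minimal sketch of that calculation in plain Python with made-up numbers; `exp_rmspe_sketch` is a hypothetical helper, not the fastai function:

```python
import math

def exp_rmspe_sketch(log_pred, log_targ):
    """RMSPE computed after mapping log-space values back to sales."""
    sq_pct_errors = []
    for lp, lt in zip(log_pred, log_targ):
        pred, targ = math.exp(lp), math.exp(lt)
        sq_pct_errors.append(((targ - pred) / targ) ** 2)
    return math.sqrt(sum(sq_pct_errors) / len(sq_pct_errors))

# Made-up sales and predictions, purely for illustration.
sales = [5000.0, 8000.0, 12000.0]
preds = [5500.0, 7200.0, 12000.0]

log_targ = [math.log(s) for s in sales]   # what log=True feeds the model
log_pred = [math.log(p) for p in preds]   # what the model would output

score = exp_rmspe_sketch(log_pred, log_targ)
print(round(score, 4))
```

Two of the three predictions are off by 10% and one is exact, so the score lands a little below 0.1.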

Check the device. It should be of type `cuda` when a GPU is available.

In [None]:
defaults.device

We define a boundary condition (`y_range`) for the neural network output and build a fastai learner object. The two fully connected layers of size 1000 and 500, their dropout rates of 0.001 and 0.01, and the embedding dropout rate of 0.04 are taken directly from the fastai tutorial notebook.

In [None]:
max_log_y = np.log(np.max(train_df['Sales'])*1.2)  # whether 20% headroom above the max sales is the best choice still needs to be verified
y_range = torch.tensor([0, max_log_y], device=defaults.device)
learn = tabular_learner(data, layers=[1000,500], ps=[0.001,0.01], emb_drop=0.04, 
                        y_range=y_range, metrics=exp_rmspe)
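
A bound like `y_range` is commonly enforced by passing the final activation through a sigmoid and rescaling it to the interval, so that log-space predictions cannot leave [0, max_log_y]. A minimal sketch in plain Python; the sales maximum below is a made-up figure, not the dataset's, and `scale_to_range` is a hypothetical helper:

```python
import math

def scale_to_range(raw, lo, hi):
    """Squash an unbounded activation into the interval [lo, hi] with a sigmoid."""
    return (hi - lo) / (1.0 + math.exp(-raw)) + lo

max_sales = 40000.0                      # hypothetical maximum of train_df['Sales']
max_log_y = math.log(max_sales * 1.2)    # same 20% headroom as in the cell above

for raw in (-10.0, 0.0, 10.0):
    y = scale_to_range(raw, 0.0, max_log_y)
    print(f"raw={raw:+.1f} -> log-sales={y:.3f} -> sales={math.exp(y):.0f}")
```

One plausible reason for the headroom is that a sigmoid only approaches its upper bound asymptotically, so a bound placed exactly at the observed maximum would be hard for the network to reach.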

Convert the learner to fp16 (mixed precision) to increase efficiency.

In [None]:
learn = learn.to_fp16()  # to_fp16 is a method and must be called to take effect

Check the batch size

In [None]:
learn.data.batch_size


# Start training
Use the learning rate finder to estimate a good learning rate to start with.

In [None]:
import fastai
fastai.__version__

In [None]:
learn.lr_find(end_lr=100, wd=0.3)
learn.recorder.plot()

The steepest part of the curve is around 2e-2, so we start about ten times lower (2e-3). The weight decay of 0.3 is chosen to be higher than the more commonly used 0.1 or 0.2 because we have included almost all the variables without any feature selection; a higher `wd` helps avoid overfitting at the beginning.

In [None]:
learn.fit_one_cycle(5, 2e-3, wd=0.3)
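
The rule of thumb used above (find where the loss falls fastest on the finder plot, then start roughly ten times lower) can be sketched in plain Python. The learning rates and losses below are made-up values, not output from this notebook, and `suggest_lr` is a hypothetical helper:

```python
import math

def suggest_lr(lrs, losses, divisor=10.0):
    """Return (steepest_lr, suggested_lr) from a learning-rate sweep.

    The slope is taken against log10(lr), matching the log-scaled x axis
    of the finder plot.
    """
    steepest_i, steepest_slope = None, 0.0
    for i in range(1, len(lrs)):
        slope = (losses[i] - losses[i - 1]) / (math.log10(lrs[i]) - math.log10(lrs[i - 1]))
        if slope < steepest_slope:          # most negative slope so far
            steepest_i, steepest_slope = i, slope
    if steepest_i is None:
        raise ValueError("loss never decreases in this sweep")
    steepest_lr = lrs[steepest_i]
    return steepest_lr, steepest_lr / divisor

# Made-up sweep: loss drops fastest near 2e-2, then diverges.
lrs    = [1e-5, 1e-4, 1e-3, 1e-2, 2e-2, 1e-1, 1.0]
losses = [0.90, 0.88, 0.80, 0.55, 0.40, 0.45, 2.00]

steepest, start = suggest_lr(lrs, losses)
print(steepest, start)
```

Reading the plot by eye is usually good enough; the sketch only makes the reasoning explicit.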

In [None]:
learn.save('bs512_5ep_2e-3_wd0.3')

In [None]:
learn.fit_one_cycle(5, 1e-3, wd=0.3)
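
`fit_one_cycle` does not hold the learning rate fixed: the value passed in is the peak of a warm-up followed by a decay. A rough sketch of such a schedule in plain Python, assuming cosine interpolation, a warm-up over the first 30% of iterations, a start at peak/25, and a final rate far below the peak. These constants are assumptions for illustration and should be checked against the installed fastai version:

```python
import math

def cosine_interp(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3, div=25.0, final_div=25.0e4):
    """Learning rate at a given step of a one-cycle style schedule (sketch)."""
    warm_steps = int(total_steps * pct_start)
    if step < warm_steps:
        # Warm-up: rise from max_lr/div to max_lr.
        return cosine_interp(max_lr / div, max_lr, step / warm_steps)
    # Decay: fall from max_lr to max_lr/final_div over the remaining steps.
    decay_pct = (step - warm_steps) / max(1, total_steps - warm_steps - 1)
    return cosine_interp(max_lr, max_lr / final_div, decay_pct)

total = 100
schedule = [one_cycle_lr(s, total, max_lr=1e-3) for s in range(total)]
print(f"start={schedule[0]:.2e} peak={max(schedule):.2e} end={schedule[-1]:.2e}")
```

This is why each successive call above restarts from a low rate, climbs to the given peak, and then anneals again.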

In [None]:
learn.recorder.plot_losses()

In [None]:
learn.save('bs512_2_5ep_1e-3_wd0.3')

In [None]:
learn.fit_one_cycle(5, 3e-4, wd=0.3)
learn.recorder.plot_losses()

In [None]:
learn.save('bs512_3_5ep_3e-4_wd0.3')

We are not getting much improvement at this point, so we reduce the batch size to 128.

In [None]:
data = datalist.databunch(bs=128)
learn.data = data
learn.data.batch_size
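
Dropping the batch size from 512 to 128 roughly quadruples the number of weight updates per epoch, and each update is computed from fewer rows, so it is noisier. A quick calculation with a hypothetical row count (not the real size of the training split); `updates_per_epoch` is a helper made up for this sketch:

```python
import math

def updates_per_epoch(n_rows, batch_size):
    """Number of batches needed to see every row once (last batch may be partial)."""
    return math.ceil(n_rows / batch_size)

n_train_rows = 800_000   # hypothetical; substitute the real training-set length

for bs in (512, 128):
    print(f"bs={bs:>3}: about {updates_per_epoch(n_train_rows, bs)} updates per epoch")
```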

In [None]:
learn.fit_one_cycle(5, 1e-3, wd=0.2)
learn.recorder.plot_losses()

In [None]:
learn.save('bs128_5ep_1e-3_wd0.2')

In [None]:
learn.fit_one_cycle(5, 1e-3, wd=0.2)
learn.recorder.plot_losses()

In [None]:
learn.fit_one_cycle(5, 1e-3, wd=0.1)
learn.recorder.plot_losses()

In [None]:
learn.fit_one_cycle(20, 1e-3, wd=0.1)
learn.recorder.plot_losses()

In [None]:
# learn.fit_one_cycle(5, 5e-4, wd=0.1)
# learn.recorder.plot_losses()

In [None]:
learn.save('last')

# Predict on test data

In [None]:
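# Predictions are in log space (labels were built with log=True), so exponentiate them
test_preds = learn.get_preds(DatasetType.Test)
test_df["Sales"] = np.exp(test_preds[0].data).numpy().T[0]
# Cast to int for the submission file (this truncates Sales toward zero)
test_df[["Id", "Sales"]] = test_df[["Id", "Sales"]].astype("int")
test_df[["Id", "Sales"]].to_csv("rossmann_submission.csv", index=False)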