# TPS - Jan22 - fastai approach

This notebook documents my first steps with the [fastai](https://docs.fast.ai/) framework for machine learning.
While learning, I also want to give something back to the community, so I have documented every step that is not obvious from the code alone, hoping that others can benefit from this notebook.
It is also a challenge to see how far "basic" neural networks can get me in a competition that deals with time-series data.
I will try to keep updating the notebook with more advanced techniques to improve the model's performance.

If you spot any mistakes in the notebook, please let me know in the comments. Also consider liking the notebook if you find it useful.

In [None]:
# Import libraries and set seed for reproducibility
from fastai.tabular.all import *
import pandas as pd
import numpy as np
import itertools

set_seed(42)

In [None]:
path = Path('../input/tabular-playground-series-jan-2022')
path.ls()

In [None]:
train_df = pd.read_csv(path/'train.csv', index_col='row_id')
test_df = pd.read_csv(path/'test.csv', index_col='row_id')
train_df.head()

In [None]:
# Check for missing values: there are none, so no imputation is needed
train_df.isnull().sum(), test_df.isnull().sum()

In [None]:
dropped = ['Elapsed', 'Dayofyear', 'Day']

# Derive date-part features (Year, Month, Week, ...) from the 'date' column; keep 'date' itself for the merges below
train_df = add_datepart(train_df, 'date', drop=False)
# Drop some columns that don't seem important
train_df = train_df.drop(columns=dropped)
test_df = add_datepart(test_df, 'date', drop=False).drop(columns=dropped)
                                                         
train_df.head()
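
For readers new to fastai, the kind of columns `add_datepart` derives can be approximated with plain pandas. This is only a rough sketch on made-up dates, not fastai's implementation, and it covers just a few of the generated columns.

```python
import pandas as pd

# Made-up dates, purely for illustration
demo = pd.DataFrame({'date': pd.to_datetime(['2015-01-01', '2015-01-31', '2016-12-31'])})

# Approximate a few of the date-part columns with the pandas .dt accessor
demo['Year'] = demo['date'].dt.year
demo['Month'] = demo['date'].dt.month
demo['Dayofweek'] = demo['date'].dt.dayofweek  # Monday=0 ... Sunday=6
demo['Is_month_end'] = demo['date'].dt.is_month_end
demo['Is_year_start'] = demo['date'].dt.is_year_start

print(demo)
```

Each derived column becomes a feature the network can use directly, which is why the raw date can be dropped later on.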

Adding information about GDP per capita is beneficial according to [this](https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/300148) discussion.


In [None]:
# Adding info about GDP per capita
gdp_per_capita = pd.read_csv('../input/gdp-per-capita-finland-norway-sweden-201519/GDP_per_capita_2015_to_2019_Finland_Norway_Sweden.csv')
gdp_per_capita = gdp_per_capita.rename(columns={'year': 'Year'})
gdp_per_capita.head()

In [None]:
# Convert dataframe from wide to long format
gdp_per_capita = gdp_per_capita.melt(id_vars='Year', value_vars=['Finland', 'Norway', 'Sweden'],
                    var_name='country', value_name='gdp')
gdp_per_capita
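
To make the wide-to-long reshape concrete, here is a minimal sketch on a toy frame. The numbers are made up and only two countries are used; the call mirrors the `melt` above.

```python
import pandas as pd

# Made-up numbers, only to illustrate the reshape
wide = pd.DataFrame({'Year': [2015, 2016],
                     'Finland': [1.0, 2.0],
                     'Norway': [3.0, 4.0]})

# One row per (Year, country) pair instead of one column per country
long = wide.melt(id_vars='Year', value_vars=['Finland', 'Norway'],
                 var_name='country', value_name='gdp')
print(long)
```

The long format has one row per year and country, which is the shape needed to merge on `['Year', 'country']`.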

In [None]:
# Merge GDP per capita into both sets: years before 2019 for train, 2019 onward for test
until_2019 = gdp_per_capita['Year'] < 2019
since_2019 = gdp_per_capita['Year'] >= 2019

train_df = train_df.merge(gdp_per_capita[until_2019], on=['Year', 'country'], how='left')
test_df = test_df.merge(gdp_per_capita[since_2019], on=['Year', 'country'], how='left')
train_df

Adding information about festivities in the Nordic countries from [this](https://www.kaggle.com/lucamassaron/festivities-in-finland-norway-sweden-tsp-0122) dataset.

In [None]:
festives_df = pd.read_csv('../input/festivities-in-finland-norway-sweden-tsp-0122/nordic_holidays.csv').drop(columns='Unnamed: 0')
make_date(festives_df, 'date')
festives_df.head()

In [None]:
train_df = train_df.merge(festives_df, on=['date', 'country'], how='left')
test_df = test_df.merge(festives_df, on=['date', 'country'], how='left')
train_df.head(5)

In [None]:
# Days without a holiday have NaN in the 'holiday' column after the left merge, so fill them
print('Missing values before: ', train_df['holiday'].isna().sum())
# Assign the result back instead of using inplace=True on a column, which newer pandas versions warn about
train_df['holiday'] = train_df['holiday'].fillna('no_holiday')
print('Missing values after: ', train_df['holiday'].isna().sum())
test_df['holiday'] = test_df['holiday'].fillna('no_holiday')
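
The reason the NaN values appear can be seen on a toy example. The sketch below uses a hypothetical two-row sales frame and a one-row holiday table; only the column names follow the notebook.

```python
import pandas as pd

sales = pd.DataFrame({'date': pd.to_datetime(['2015-01-01', '2015-01-02']),
                      'country': ['Finland', 'Finland']})
# Hypothetical holiday table with a single entry
holidays = pd.DataFrame({'date': pd.to_datetime(['2015-01-01']),
                         'country': ['Finland'],
                         'holiday': ["New Year's Day"]})

# A left merge keeps every sales row; rows without a match get NaN
merged = sales.merge(holidays, on=['date', 'country'], how='left')
n_missing = merged['holiday'].isna().sum()

# Replace NaN with an explicit category
merged['holiday'] = merged['holiday'].fillna('no_holiday')
print(n_missing, merged['holiday'].tolist())
```

Filling with an explicit label lets `Categorify` treat "no holiday" as its own category rather than as a missing value.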

Adding information about the Consumer Price Index, as it can improve CV/LB according to [this](https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/300963) discussion.

In [None]:
cpi_df = pd.read_csv('../input/consumer-price-index-20152019-nordic-countries/Best_CPI.csv', index_col='Unnamed: 0').rename(columns={'GDP': 'CPI'})
cpi_df.head()

In [None]:
train_df = train_df.merge(cpi_df, left_on=['Year', 'country'], right_on=['year', 'country'], how='left')
test_df = test_df.merge(cpi_df, left_on=['Year', 'country'], right_on=['year', 'country'], how='left')
train_df.head()
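
One side effect of merging with differently named keys (`Year` on the left, `year` on the right) is that both key columns end up in the result. A small sketch with made-up values, assuming the same column names as above:

```python
import pandas as pd

left = pd.DataFrame({'Year': [2015], 'country': ['Norway'], 'num_sold': [100]})
# Made-up CPI value, for illustration only
right = pd.DataFrame({'year': [2015], 'country': ['Norway'], 'CPI': [1.5]})

merged = left.merge(right, left_on=['Year', 'country'],
                    right_on=['year', 'country'], how='left')
print(list(merged.columns))
```

The redundant `year` column is harmless here because only the columns listed in `cat_names` and `cont_names` are passed to the model, but it could be removed with `drop(columns='year')` if a tidier frame is preferred.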

In [None]:
# Since we don't need the 'date' column anymore, we can drop it
train_df = train_df.drop(columns='date')
test_df = test_df.drop(columns='date')

# cont_cat_split proposes continuous and categorical columns automatically; its suggestion is not ideal here, so the lists are set by hand in the next cell
cont_names, cat_names = cont_cat_split(train_df, dep_var='num_sold')
cont_names, cat_names

In [None]:
# 80-20 train-validation split; EndSplitter holds out the last 20% of rows to avoid look-ahead bias
splits = EndSplitter(valid_pct=0.2)(range_of(train_df))

# Continuous and categorical variables
cont_names = ['gdp',
              'Year',
              'CPI']
cat_names = ['country',
             'store',
             'product',
             'Month',
             'Dayofweek',
             'Is_month_end',
             'Is_month_start',
             'Is_quarter_end',
             'Is_quarter_start',
             'Is_year_end',
             'Is_year_start',
             'Week',
             'holiday']

# Build the TabularPandas object, then create the dataloaders from it
to = TabularPandas(train_df,
                   y_names='num_sold', 
                   y_block=RegressionBlock,
                   cat_names=cat_names,
                   cont_names=cont_names,
                   procs=[Categorify, Normalize],
                   splits=splits)

dls = to.dataloaders(bs=128)
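
The idea behind the end split can be sketched in a few lines of plain Python. `end_split` is a hypothetical helper written for illustration, not fastai's `EndSplitter`, and it assumes the rows are already sorted by date.

```python
def end_split(n_rows, valid_pct=0.2):
    """Return (train_idx, valid_idx) with the last valid_pct of rows held out."""
    n_valid = int(n_rows * valid_pct)
    idx = list(range(n_rows))
    return idx[:n_rows - n_valid], idx[n_rows - n_valid:]

train_idx, valid_idx = end_split(10, valid_pct=0.2)
print(train_idx, valid_idx)
```

Because every validation index comes after every training index, the model is validated only on dates it has not seen, which is what avoiding look-ahead bias means here. A random split would mix future and past rows.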

In [None]:
dls.show_batch()

Evaluation Metric: SMAPE as presented in [this](https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/298201) discussion.

In [None]:
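def SMAPE(y_true, y_pred):
    # Symmetric mean absolute percentage error on a 0-200 scale
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 200.0
    diff = np.abs(y_true - y_pred) / denominator
    # Count the error as 0 where both target and prediction are 0
    diff[denominator == 0] = 0.0
    return np.mean(diff)

# invert_arg=True makes fastai call SMAPE(targets, predictions)
smape = AccumMetric(SMAPE, to_np=True, invert_arg=True)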
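
As a sanity check of the metric above, the same formula can be worked through by hand on two made-up predictions. `smape_plain` is a hypothetical pure-Python restatement for illustration, not part of fastai or the competition code.

```python
def smape_plain(y_true, y_pred):
    """SMAPE on a 0-200 scale; a term counts as 0 when both values are 0."""
    terms = []
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2.0
        terms.append(0.0 if denom == 0 else 100.0 * abs(t - p) / denom)
    return sum(terms) / len(terms)

# First term: 100 * 10 / 105, second term: 100 * 20 / 190
score = smape_plain([100, 200], [110, 180])
print(round(score, 2))
```

Lower is better, and a perfect prediction scores 0, which is why the training callback later compares scores with `np.less`.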

In [None]:
# Create a learner object
learn = tabular_learner(dls, metrics=smape)

# Find a learning rate
learn.lr_find()

In [None]:
# Train for 50 epochs and save the model with the best validation SMAPE
learn.fit_one_cycle(50, cbs=[SaveModelCallback(monitor='SMAPE', comp=np.less)])

In [None]:
learn.show_results()

In [None]:
submission_df = pd.read_csv(path/'sample_submission.csv')
submission_df.head()

In [None]:
dl = learn.dls.test_dl(test_df)
y, _ = learn.get_preds(dl=dl)

"It may be worth rounding up ones submission.csv to the nearest integer (for example with something like np.ceil)" - [discussion](https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/298201)

In [None]:
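# get_preds returns a tensor with one column; flatten it to 1-D before rounding up
submission_df['num_sold'] = np.ceil(y.squeeze().numpy())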
submission_df.to_csv('submission.csv', index=False)

submission_df.head()

In [None]:
submission_df['num_sold'].mean()