# The January 2022 competition with Fastai v2

This notebook is a quick demonstration of how to use the Fastai v2 library for a Kaggle tabular competition. Fastai v2 is based on PyTorch and lets you build a decent machine learning application with little code. For more information, please visit the Fastai documentation: https://docs.fast.ai/. I will refer to "Chapter 9, Tabular Modelling Deep Dive" and the notebook "09_tabular.ipynb".

In [None]:
from fastai.tabular.all import * 
from fastai.test_utils import show_install

show_install()

In [None]:
np.random.seed(41)
torch.manual_seed(41)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

The competition data and the holiday data are located in the following directories:

In [None]:
path = Path('../input/tabular-playground-series-jan-2022')
nordic_path = Path('../input/festivities-in-finland-norway-sweden-tsp-0122')
Path.BASE_PATH = path
path.ls(), nordic_path.ls()

I use Pandas to import the files and to check whether any values are null or missing. The result shows that the data set is complete, so no imputation is needed. That's a good result!

In [None]:
train_df = pd.read_csv(os.path.join(path, 'train.csv'))
test_df = pd.read_csv(os.path.join(path, 'test.csv'))
sample_submission = pd.read_csv(os.path.join(path, 'sample_submission.csv'))
nordic_holidays = pd.read_csv(os.path.join(nordic_path, 'nordic_holidays.csv'),
                              index_col=0,
                              header=0,
                              names=['holiday_date', 'holiday', 'holiday_country'])
        
train_df.isna().sum().sum(), test_df.isna().sum().sum(), train_df.isnull().sum().sum(), test_df.isnull().sum().sum()

Information about the Nordic holidays is published in this dataset: https://www.kaggle.com/lucamassaron/festivities-in-finland-norway-sweden-tsp-0122
Thanks to the author!

Let's display our data.

In [None]:
train_df.head()

The column 'num_sold' is our dependent variable. For numerical reasons I train on its logarithm. The value range of the dependent variable is needed later to build the tabular learner: setting the range adds a sigmoid function to the final output, which keeps the predictions inside that range.

In [None]:
dep_var = ['num_sold']
train_df[dep_var] = np.log(train_df[dep_var])
max_dep_value = np.max(train_df[dep_var].max()) * 1.05
min_dep_value = np.min(train_df[dep_var].min()) * 0.95
dep_value_range = torch.tensor([min_dep_value, max_dep_value], device=device)
dep_value_range, dep_var
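To illustrate what such a value range does, here is a minimal sketch in plain Python. It assumes the output layer rescales a sigmoid to the interval given by the range; the bounds `low` and `high` below are hypothetical values on the log scale, not the ones computed from the data.

```python
import math

def sigmoid_range(x, low, high):
    """Squash an unbounded activation x into the open interval (low, high)."""
    return 1.0 / (1.0 + math.exp(-x)) * (high - low) + low

# Hypothetical bounds on the log scale (assumption, not taken from the data).
low, high = 4.0, 8.4

for raw in (-10.0, 0.0, 10.0):
    # Very negative activations approach low, very positive ones approach high.
    print(raw, round(sigmoid_range(raw, low, high), 4))
```

The extra 5 % margin on both ends of the range matters because a sigmoid only reaches its bounds asymptotically.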

Let's find out how the values in the object columns 'country', 'store' and 'product' are distributed. The good news is that they are all equally distributed!

In [None]:
np.unique(train_df['country'], return_counts=True), np.unique(train_df['store'], return_counts=True), np.unique(train_df['product'], return_counts=True)

In [None]:
np.unique(test_df['country'], return_counts=True), np.unique(test_df['store'], return_counts=True), np.unique(test_df['product'], return_counts=True)

The provided data contain a date column, so I check whether the dates in the training data and the test data overlap. They don't!

In [None]:
train_df['date'].min(), train_df['date'].max(), test_df['date'].min(), test_df['date'].max(), len(test_df), len(train_df)- len(test_df)

To train the neural network, I must split the rows of the data frame train_df into a training part and a validation part. But how should I select them? The task of this competition is to predict values that lie in the future, so the validation data should be the most recent rows. I will use as many validation rows as there are test rows. Therefore I use the rows with index 0-19727 as training data and the rows 19728-26297 as validation data.

In [None]:
# The rows are sorted by date, so the last len(test_df) rows form the most recent period.
cut = len(train_df) - len(test_df)
train_idx = range(cut)                  # rows 0 .. cut-1
valid_idx = range(cut, len(train_df))   # rows cut .. last row
splits = (list(train_idx), list(valid_idx))
train_idx, valid_idx
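The idea of holding out the most recent rows can be sketched without any library. The helper name `time_based_split` is hypothetical, and the row counts are the ones implied by the index ranges quoted above; the sketch assumes the rows are sorted by date.

```python
def time_based_split(n_rows, n_valid):
    """Return (train_idx, valid_idx): the last n_valid rows are held out."""
    if not 0 < n_valid < n_rows:
        raise ValueError("n_valid must be between 1 and n_rows - 1")
    cut = n_rows - n_valid
    return list(range(cut)), list(range(cut, n_rows))

# Row counts implied by the index ranges quoted in the text.
train_idx, valid_idx = time_based_split(n_rows=26298, n_valid=6570)
print(train_idx[0], train_idx[-1], valid_idx[0], valid_idx[-1])
```

A random split would leak future information into training, which is why the split is positional here.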

Let's try to add more features about the Nordic holidays to the data frame. Can we improve the overall score with this additional data? I will define a flag to control this feature.

In [None]:
add_nordic_holidays=True

In [None]:
nordic_holidays.head()

The following function calculates, for each row in df, the number of days between the row's date and the most recent holiday seen so far in the same country, and adds these values to the passed data frame. The name of the new column is the prefix followed by the field name.

In [None]:
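def get_elapsed(df, fld, pre):
    # df must already be sorted by date; rows before the first holiday yield NaN.
    day1 = np.timedelta64(1, 'D')
    last_date = np.datetime64('NaT')
    res = []

    # The CSV files store dates as strings, so convert them before subtracting.
    dates = pd.to_datetime(df.date).values
    holiday_dates = pd.to_datetime(df[fld]).values

    for c, c_h, h_d, d in zip(df.country.values, df.holiday_country.values, holiday_dates, dates):
        if c == c_h:
            last_date = h_d
        res.append((d - last_date).astype('timedelta64[D]') / day1)

    df[pre+fld] = res
    return df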
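The running "last holiday seen" logic is easier to follow on a toy example. This is a simplified sketch with the standard library only: it ignores countries and stores, and the calendar below is hypothetical, not taken from the competition data.

```python
from datetime import date

def days_since_last_holiday(dates, holidays):
    """For each date (ascending), days since the most recent holiday on or before it.

    Returns None for dates that precede the first holiday.
    """
    holiday_set = set(holidays)
    last = None
    out = []
    for d in dates:
        if d in holiday_set:
            last = d  # remember the most recent holiday
        out.append(None if last is None else (d - last).days)
    return out

# Hypothetical toy calendar.
dates = [date(2015, 1, day) for day in range(1, 8)]
holidays = [date(2015, 1, 1), date(2015, 1, 6)]
print(days_since_last_holiday(dates, holidays))
```

Running the same loop over the dates in descending order gives the (negative) distance to the next holiday instead.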

Create a new data frame with the number of days since the previous holiday and the number of days until the next holiday.

In [None]:
def get_distance_to_holiday_date(df, field='holiday_date'):
    
    
    columns = ['row_id', 'date', 'country','holiday', 'store', 'holiday_date', 'holiday_country']
    
    dist_df = df[columns].copy()
    dist_df = dist_df.sort_values(['store', 'date'])
    dist_df = get_elapsed(dist_df, field, 'After_')
    
    dist_df = dist_df.sort_values(['store', 'date'], ascending=[True, False])
    dist_df = get_elapsed(dist_df, field, 'Before_')
    
    dist_df['After_holiday_date'] = dist_df['After_holiday_date'].fillna(0).astype(int)
    dist_df['Before_holiday_date'] = dist_df['Before_holiday_date'].fillna(0).astype(int)
    
    dist_df.drop(['country', 'holiday_date', 'holiday', 'holiday_country'],  axis=1,  inplace=True)
    return dist_df

In [None]:
# The function gets its own name so that it does not overwrite the flag add_nordic_holidays.
def merge_nordic_holidays(df):
    df = pd.merge(df, nordic_holidays, left_on=['date', 'country'], right_on=['holiday_date', 'holiday_country'], how='left')
    
    # calculate the distance before and after the holidays (currently disabled)
    # dist_df = get_distance_to_holiday_date(df)
    # df = pd.merge(df, dist_df, left_on=['row_id', 'date', 'store'], right_on=['row_id', 'date', 'store'], how='left')
    
    df['holiday'] = df['holiday'].astype('category')
    
    # the values of 'holiday_date' and 'holiday_country' aren't needed anymore, let's drop the columns
    df.drop(['holiday_date', 'holiday_country'], axis=1, inplace=True)
    return df

In [None]:
if add_nordic_holidays:
    print("We will add info about the nordic holidays to the data frames")
    
    train_df = merge_nordic_holidays(train_df)
    test_df = merge_nordic_holidays(test_df)
    
    print("Done.")
else:
    print("No holiday infos added")
    
# set the index 
train_df.set_index('row_id', inplace=True)
test_df.set_index('row_id', inplace=True)

For a time series like the one in this competition, it is useful to extract more metadata from a date value, such as the week number or the day of the month or year. The Fastai library offers the function 'add_datepart' for this extraction. You specify the column you want to expand. The parameter drop controls whether the original column is dropped afterwards, which is the default. I keep that default.

In [None]:
train_df = add_datepart(train_df, 'date', drop=True)
test_df = add_datepart(test_df, 'date', drop=True)
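To give an idea of the kind of features such an expansion yields, here is a small standard-library sketch. It is not the Fastai implementation; the helper `date_parts` is hypothetical, and the key names are only modelled on the column names I would expect to see in the expanded data frame.

```python
from datetime import date
import calendar

def date_parts(d):
    """Derive a few calendar features from a date (illustrative subset only)."""
    days_in_month = calendar.monthrange(d.year, d.month)[1]
    return {
        "Year": d.year,
        "Month": d.month,
        "Week": d.isocalendar()[1],        # ISO week number
        "Day": d.day,
        "Dayofweek": d.weekday(),          # Monday = 0
        "Dayofyear": d.timetuple().tm_yday,
        "Is_month_end": d.day == days_in_month,
        "Is_month_start": d.day == 1,
    }

print(date_parts(date(2019, 1, 31)))
```

Features like the day of the week let the network pick up weekly sales patterns that a raw date string cannot express.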

Let's look at the modified data frame and the newly added columns.

In [None]:
train_df.head()

I need two lists of column names: the candidates for categorical variables and the remaining columns, which are treated as continuous variables. The Fastai library offers the function 'cont_cat_split' to do this for us.

In [None]:
cont_vars, cat_vars = cont_cat_split(train_df, dep_var= dep_var,  max_card=12)
cat_vars, cont_vars
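The rule behind such a split can be sketched in plain Python. This is my understanding of the heuristic, not the library code: float columns are continuous, integer columns are continuous only if they have more than max_card distinct values, and everything else is categorical. The helper `split_cont_cat` and the miniature table are hypothetical.

```python
def split_cont_cat(columns, max_card=12, dep_var=()):
    """Classify columns (dict: name -> list of values) as continuous or categorical."""
    cont, cat = [], []
    for name, values in columns.items():
        if name in dep_var:
            continue  # the dependent variable belongs to neither list
        is_float = all(isinstance(v, float) for v in values)
        is_int = all(isinstance(v, int) and not isinstance(v, bool) for v in values)
        if is_float or (is_int and len(set(values)) > max_card):
            cont.append(name)
        else:
            cat.append(name)
    return cont, cat

# Hypothetical miniature table.
table = {
    "country": ["Finland", "Norway", "Sweden", "Finland"],
    "Month": [1, 1, 2, 2],
    "Dayofyear": [1, 2, 3, 4],
    "Elapsed": [1.42e9, 1.43e9, 1.44e9, 1.45e9],
    "num_sold": [5.1, 5.3, 5.0, 5.2],
}
cont, cat = split_cont_cat(table, max_card=3, dep_var=("num_sold",))
print(cont, cat)
```

With max_card=12, a column such as the month (12 distinct values) ends up categorical, while the day of the year stays continuous.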

In [None]:
len(cat_vars), len(cont_vars)

The next step is to create a data loader. The Fastai library offers a powerful helper called 'TabularPandas'. It needs the data frame, the lists of categorical and continuous variables, the dependent variable and the splits. The splits divide the data set into two parts: one for training and one for validation, which is evaluated after each epoch. The batch size is set to 128.


In [None]:
procs=[Categorify, FillMissing, Normalize]
to_train = TabularPandas(train_df, 
                         procs=procs, 
                         cat_names=cat_vars, 
                         cont_names=cont_vars, 
                         splits=splits, 
                         device=device,
                         y_names=dep_var,
                         y_block=RegressionBlock())

In [None]:
dls = to_train.dataloaders(bs=128)
len(dls.train),len(dls.valid), type(dls.train), dls.train.device

We must define the SMAPE function ourselves, following the definition on Wikipedia, because neither PyTorch nor Fastai currently provides it.

In [None]:
def smape(y_pred, target):
    return torch.mean(2*torch.abs(y_pred - target)/(torch.abs(target) + torch.abs(y_pred)))
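One point worth keeping in mind: because 'num_sold' was log-transformed, a SMAPE computed during training is measured on log values and is not directly comparable to a SMAPE on the original sales figures. The following sketch uses only the standard library; the sales numbers and predictions are hypothetical.

```python
import math

def smape(y_pred, y_true):
    """Mean of 2*|pred - true| / (|true| + |pred|), as a fraction (not percent)."""
    terms = [2 * abs(p - t) / (abs(t) + abs(p)) for p, t in zip(y_pred, y_true)]
    return sum(terms) / len(terms)

# Hypothetical sales figures and predictions.
true_sales = [100.0, 200.0, 400.0]
pred_sales = [110.0, 180.0, 400.0]

# The same values on the log scale used for training.
log_true = [math.log(v) for v in true_sales]
log_pred = [math.log(v) for v in pred_sales]

print("SMAPE on log values:", smape(log_pred, log_true))
print("SMAPE on raw sales :", smape(pred_sales, true_sales))
```

To estimate a score on the original scale, the predictions and targets would have to be exponentiated before the metric is applied.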

In [None]:
class MyLoss(nn.Module):
    def __init__(self, diff_weight=0.5):
        super().__init__()
        self.diff_weight = diff_weight
        
    def forward(self,y_pred, target):
        num_loss =  (1-self.diff_weight) * smape(y_pred[:,0], target[:,0])
        diff_loss = self.diff_weight * smape(y_pred[:,1], target[:,1])
        return num_loss + diff_loss    

Finally, I create a learner by passing the data loader into it. The default setting is two hidden layers with 200 and 100 units. I use more hidden layers to increase the number of trainable parameters, as you can see in the reported summary. Increasing the number of parameters in the neural network will hopefully improve the accuracy and the score.

In [None]:
my_config = tabular_config(ps=.15, embed_p=0.15, use_bn=True, y_range=dep_value_range)

learn = tabular_learner(dls, 
                        n_out = dls.c,
                        config = my_config,
                        layers = [64,256,1024,1024,256,64,16],
                        metrics = [smape, exp_rmspe]) 
                       
learn.summary()

In [None]:
learn.lr_find()

I will use a maximum learning rate of 3e-3. Starting the training is quite easy: I run 300 epochs, because the data set is small and each epoch takes only a few seconds. I save the best model, that is, the one with the lowest validation loss. The Fastai library offers the SaveModelCallback callback for this; you only need to specify the file name. The option with_opt=True also stores the state of the optimizer. You will find the new file under models/kaggle_tps_jan_2022.pth

In [None]:
learn.fit_one_cycle(300, 3e-3, cbs=SaveModelCallback(fname='kaggle_tps_jan_2022', with_opt=True))
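The term "maximum learning rate" refers to the one-cycle policy: the rate is warmed up to the maximum and then annealed to a very small value. Below is a minimal sketch of such a schedule with cosine annealing. The parameters div, div_final and pct_start are assumptions based on my recollection of the library defaults, and `one_cycle_lr` is a hypothetical helper, not a Fastai function.

```python
import math

def one_cycle_lr(progress, lr_max=3e-3, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress in [0, 1] for a cosine one-cycle schedule."""
    def cos_anneal(start, end, pos):
        # Smooth interpolation from start (pos=0) to end (pos=1).
        return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

    if progress < pct_start:
        # Warm-up phase: lr_max / div -> lr_max
        return cos_anneal(lr_max / div, lr_max, progress / pct_start)
    # Annealing phase: lr_max -> lr_max / div_final
    return cos_anneal(lr_max, lr_max / div_final, (progress - pct_start) / (1 - pct_start))

for p in (0.0, 0.1, 0.25, 0.5, 1.0):
    print(p, one_cycle_lr(p))
```

Under these assumptions the peak is reached after a quarter of the training, so most of the 300 epochs are spent with a decreasing learning rate.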

In [None]:
learn.show_results(shuffle=False)

To calculate the predictions for this competition, I load the best model from the training process, that is, the model with the lowest validation loss.

In [None]:
learn.load('kaggle_tps_jan_2022')

In [None]:
dlt = learn.dls.test_dl(test_df, bs=64) 
nn_preds, _ = learn.get_preds(dl=dlt) 
nn_preds.min(), nn_preds.max()

In [None]:
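# Undo the log transform; flatten the (n, 1) prediction tensor into a 1-D array first.
sample_submission["num_sold"] = np.exp(nn_preds.flatten().numpy())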
sample_submission.to_csv("submission.csv", index=False)
sample_submission.head()

In [None]:
!ls -la

That's it for the beginning.