# ASHRAE with fast.ai, Part 2: Training

This kernel leverages the convenient fast.ai API to prepare the dataset for training in just a few lines of code. It then trains a neural network adapted for tabular data.

To reconcile the large size of the ASHRAE dataset and the memory overhead of fast.ai's objects with the limited memory of Kaggle sessions, the work is split across a series of kernels. The other parts are:

- https://www.kaggle.com/michelezoccali/ashrae-with-fast-ai-part-1 (preprocessing)
- https://www.kaggle.com/michelezoccali/ashrae-with-fast-ai-part-3 (inference)

# Imports

In [None]:
import os
import gc
import sys
import pickle  # used explicitly when saving the trained learner
import psutil

import numpy as np
import pandas as pd
import datetime
import warnings

from tqdm.notebook import tqdm
from sklearn.metrics import mean_squared_error
from fastai.tabular.all import *

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

The pre-processed DataFrames can be accessed in the input folder, in the `ashrae-with-fast-ai-part-1` subdirectory.

In [None]:
path = '../input/ashrae-with-fast-ai-part-1/'

for dirname, _, filenames in os.walk(path):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load data

Let's load the preprocessed training dataset without lag features.

In [None]:
#%%time
df_train = pd.read_hdf(path + 'preprocessing_no_lag.h5', 'train')
df_train.head()

In [None]:
df_train.info(memory_usage='deep')

In [None]:
gc.collect()

In [None]:
# print a per-process snapshot of RAM consumption
psutil.test()

# Modeling

We will use fast.ai's `TabularLearner` class, which wraps a neural network for tabular data. One way to instantiate it is to first define, among other things:

- Categorical and continuous variables
- A training/validation split
- A set of transforms one wishes to apply to the data
- A TabularPandas object
- A DataLoaders object

Let's do so in order below.

In [None]:
dep_var = 'meter_reading'
cont, cat = cont_cat_split(df_train, max_card=25, dep_var=dep_var)
#cont, cat

In [None]:
df_train[cat].nunique()
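
For intuition, the rule behind `cont_cat_split` can be approximated as below. This is a simplified sketch rather than fast.ai's source: the helper name `split_by_cardinality` and the toy columns are made up for illustration, and the rule is assumed to be "floats, and integers with more than `max_card` distinct values, are continuous; everything else is categorical".

```python
import pandas as pd

def split_by_cardinality(df, max_card=25, dep_var=None):
    """Hypothetical, simplified stand-in for cont_cat_split."""
    cont_names, cat_names = [], []
    for col in df.columns:
        if col == dep_var:
            continue  # the target is neither a continuous nor a categorical feature
        is_float = pd.api.types.is_float_dtype(df[col])
        is_big_int = (pd.api.types.is_integer_dtype(df[col])
                      and df[col].nunique() > max_card)
        if is_float or is_big_int:
            cont_names.append(col)
        else:
            cat_names.append(col)
    return cont_names, cat_names

# Toy data (not the ASHRAE frame)
toy = pd.DataFrame({
    'building_id': list(range(100)),                    # integer, 100 distinct values
    'month': [i % 12 + 1 for i in range(100)],          # integer, 12 distinct values
    'air_temperature': [0.5 * i for i in range(100)],   # float
    'primary_use': ['Office', 'Education'] * 50,        # string
    'meter_reading': [1.0] * 100,                       # target
})
cont_toy, cat_toy = split_by_cardinality(toy, max_card=25, dep_var='meter_reading')
print(cont_toy, cat_toy)
```

Under this rule an integer column such as `month`, with only 12 distinct values, ends up among the categorical variables.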

As a starting point, we use a simple 11/1 train/validation split by month: January to November 2016 for training, and December 2016, the last month, for validation.

In [None]:
cond = df_train.month<12
train_idx = np.where( cond)[0]
valid_idx = np.where(~cond)[0]

splits = (list(train_idx), list(valid_idx))

df_train = df_train.drop(columns='month') # drop to avoid overfitting (the validation month never appears in training)
cat.remove('month')
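
One detail of the cell above is worth spelling out: `np.where` returns row positions, not index labels. A minimal sketch on a toy frame (the values and the non-default index are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with a non-default index, to show that positions are returned
toy = pd.DataFrame({'month': [1, 5, 11, 12, 12, 3],
                    'meter_reading': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]},
                   index=[10, 11, 12, 13, 14, 15])

cond = toy.month < 12
train_idx = np.where(cond)[0]   # positions of rows from months 1 to 11
valid_idx = np.where(~cond)[0]  # positions of rows from month 12

print(train_idx, valid_idx)
```

The two position arrays are disjoint and together cover every row, which is what a train/validation split needs.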

Let's tell fast.ai which transforms we wish to apply. Note that here we do not include `FillMissing`, contrary to common practice, as we took care of missing values during preprocessing.

In [None]:
# silence pandas' chained-assignment warnings; set alongside reduce_memory=True
pd.options.mode.chained_assignment = None

procs_nn = [Categorify, Normalize]
df_train = TabularPandas(df_train, procs_nn, cat, cont,
                         splits=splits, y_names=dep_var,
                         inplace=True, reduce_memory=True)

len(df_train.train), len(df_train.valid)
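
To make the two transforms less of a black box, here is a rough plain-pandas sketch of what they do. It is not fast.ai's implementation: the toy columns are made up, and it assumes that statistics and vocabularies are fitted on the training rows only, that code 0 is reserved for values missing from the vocabulary, and that the population standard deviation is used.

```python
import pandas as pd

# Toy data (not the ASHRAE frame); the last row plays the validation set
toy = pd.DataFrame({'primary_use': ['Office', 'Retail', 'Office', 'Lodging'],
                    'square_feet': [100.0, 200.0, 300.0, 400.0]})
train_rows, valid_rows = [0, 1, 2], [3]

# Categorify-like step: build the vocabulary from the training rows,
# map each category to an integer code >= 1, and send unseen values to 0
vocab = sorted(toy.loc[train_rows, 'primary_use'].unique())
codes = {value: i + 1 for i, value in enumerate(vocab)}
toy['primary_use_code'] = toy['primary_use'].map(codes).fillna(0).astype(int)

# Normalize-like step: subtract the training mean, divide by the training std
mean = toy.loc[train_rows, 'square_feet'].mean()
std = toy.loc[train_rows, 'square_feet'].std(ddof=0)
toy['square_feet_norm'] = (toy['square_feet'] - mean) / std

print(toy)
```

In this toy example `'Lodging'` only occurs in the validation row, so it receives code 0.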

`TabularPandas` instances have a handy method to create the `DataLoaders` directly, here with a batch size of 1024:

In [None]:
dls = df_train.dataloaders(1024)
gc.collect()

Let's check the range of the target variable. Passing this range to the learner as `y_range` adds a scaled sigmoid activation as the last layer, so that the outputs of the NN are confined to it.

In [None]:
df_train.train.y.min(), df_train.train.y.max()
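
The rescaling itself is simple arithmetic. The sketch below re-implements it in plain Python for scalars (fast.ai applies the same formula to tensors); the input values are arbitrary and chosen only for illustration.

```python
import math

def sigmoid_range(x, low, high):
    """Squash any real x into the open interval (low, high)."""
    return 1.0 / (1.0 + math.exp(-x)) * (high - low) + low

# Raw network outputs (made-up values) mapped into the interval (0, 17)
for raw in (-5.0, 0.0, 5.0):
    print(raw, sigmoid_range(raw, 0, 17))
```

Since a sigmoid only approaches its bounds asymptotically, the upper bound is usually chosen a little above the largest observed target.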

Instantiate the model and inspect its architecture.

In [None]:
learn = tabular_learner(dls, y_range=(0,17), layers=[500,250], n_out=1, loss_func=F.mse_loss)
learn.model

In [None]:
# find appropriate learning rate
learn.lr_find()

Let us now train the model with the 1-cycle policy.

In [None]:
learn.fit_one_cycle(10, 1e-2)
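
For reference, the learning-rate schedule of the 1-cycle policy can be sketched in a few lines of plain Python. This is an illustration, not fast.ai's code: the function names are hypothetical, and the constants `div=25`, `div_final=1e5` and `pct_start=0.25` are assumed to match the library defaults, which is worth checking against the installed version.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle_lr(pct, lr_max=1e-2, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at fraction pct (0 to 1) of the whole training run."""
    if pct < pct_start:
        # warm-up: from lr_max/div up to lr_max
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # annealing: from lr_max down to lr_max/div_final
    return cos_anneal(lr_max, lr_max / div_final,
                      (pct - pct_start) / (1 - pct_start))

for pct in (0.0, 0.125, 0.25, 0.5, 1.0):
    print(pct, one_cycle_lr(pct))
```

With `lr_max=1e-2`, as in the call above, the rate starts at `4e-4`, peaks at `1e-2` a quarter of the way through, and decays to `1e-7` by the end.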

Let us verify the outcome of the training by hand, computing the RMSE on the validation set from the predictions.

In [None]:
# this gets predictions on the validation set by default
preds, targs = learn.get_preds()

rmse_valid = np.sqrt(mean_squared_error(to_np(targs.squeeze()), to_np(preds.squeeze())))
rmse_valid
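
The metric computed above is just the square root of the mean squared difference between targets and predictions. A dependency-free sketch, with made-up numbers:

```python
import math

def rmse(targets, predictions):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Errors of 3 and 4 give sqrt((9 + 16) / 2)
print(rmse([0.0, 0.0], [3.0, 4.0]))
```

The result does not depend on the order of the two arguments, since the errors are squared.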

In [None]:
del cond, train_idx, valid_idx, splits
del df_train, dls, preds, targs, rmse_valid
gc.collect()

The model seems to perform well on the validation set, about as well as the LightGBM model in [this kernel](https://www.kaggle.com/michelezoccali/ashrae-energy-prediction-single-lgbm). Let's save it.

In [None]:
with open('tabular_nn.pickle', mode='wb') as f:
    pickle.dump(learn, f)
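
Reading the object back follows the mirror-image pattern with `pickle.load`. The sketch below shows the round trip on a plain dictionary standing in for the learner and uses an in-memory buffer instead of a file; restoring a real `Learner` additionally requires fast.ai, and any custom classes it references, to be importable in the loading session.

```python
import io
import pickle

# Stand-in object (not a fast.ai Learner), used only to show the round trip
toy_model = {'layers': [500, 250], 'y_range': (0, 17)}

buffer = io.BytesIO()            # plays the role of the file opened with mode='wb'
pickle.dump(toy_model, buffer)

buffer.seek(0)                   # rewind, as if reopening the file with mode='rb'
restored = pickle.load(buffer)

print(restored)
```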

That's it for training. All that remains is to run inference on the test set. However, creating a `TabularPandas` object for the entire test set in this kernel causes memory peaks that crash the session, so we continue in:

- https://www.kaggle.com/michelezoccali/ashrae-with-fast-ai-part-3