# Using Optuna for Fast.ai on Feb Playground

This notebook shows how to use Optuna to tune the hyperparameters of a neural network trained on the February Playground data from Kaggle.

A first notebook that uses fast.ai without hyperparameter optimization can be found here: https://www.kaggle.com/martinmarenz/first-pred-feb-tabular-playground-with-fast-ai

A note up front to save you some time: I do not expect the resulting neural network to rank high in the competition. If that is what you are looking for, you will have to find it elsewhere.


## Import packages and load data

In [None]:
import random
from pathlib import Path

import joblib
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

import optuna
from optuna.integration import FastAIV2PruningCallback
from fastai import *
from fastai.tabular.all import *
import torch

In [None]:
# autocompletion works better this way
%config Completer.use_jedi = False

In [None]:
# fixing seed
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

torch.cuda.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED) # gpu vars

In [None]:
!ls ../input/tabular-playground-series-feb-2021

In [None]:
dpath = Path('../input/tabular-playground-series-feb-2021')
sample_sub = pd.read_csv(dpath / 'sample_submission.csv')
test_raw = pd.read_csv(dpath / 'test.csv')
train_raw = pd.read_csv(dpath / 'train.csv')

# Prepare data

I use the `apply_all` helper so that further preprocessing steps can easily be added later. Honestly, I just copy this approach from notebook to notebook, since it makes it easy to experiment with different preprocessing steps.

Nevertheless, for now only the `id` column has to be dropped.

In [None]:
def split_data_fastai(df):
    df = df.reset_index(drop=True) # makes the index run from 0 to n-1, independent of any earlier transformation
    id = df['id']
    df = df.drop(columns=['id'])
    
    return (df, id)

In [None]:
def apply_all(df, funs, debug=False):
    """Helper function to apply a series of functions onto a DataFrame"""
    for fun in funs:
        if debug:
            print(f'Apply {fun.__name__}')
        df = fun(df)
    return df

In [None]:
train = train_raw.copy(deep = True)
prep_nn1 = lambda x: apply_all(x, [split_data_fastai])
train, train_ids = prep_nn1(train)
torch.device('cuda') # only creates a device handle; fast.ai uses the GPU by default when one is available

cont_names = [f'cont{i}' for i in range(14)] # set the continuous variables
cat_names = [f'cat{i}' for i in range(10)] # set the categorical variables
procs = [Categorify, Normalize] # different fast.ai preprocessing steps
dep_var = 'target' # our target variable

# Setting up optuna

Optuna's API is stunningly easy: you just wrap your normal training loop in a trial and let Optuna suggest values
for all the hyperparameters you usually set manually. This allows you to tune every aspect of your model. Nevertheless, the following code
only does standard stuff.

First, `num_layers = trial.suggest_int('n_layers', 2, 7)` lets the number of hidden layers vary between 2 and 7; the final layer is added automatically by fast.ai.
For each of these layers the size is drawn from the corresponding level of `pot_layers`. I do this to nudge the network architecture into a funnel.
Out of interest, before this optimization I did a run in which the architecture was sampled randomly. Even there, the best trials were networks with shrinking layer sizes.
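The per-depth sampling described above can be sketched without Optuna. This is only an illustration: Python's `random` module stands in for `trial.suggest_int` and `trial.suggest_categorical`, and the names `POT_LAYERS` and `sample_funnel` are hypothetical helpers that do not appear in the notebook.

```python
import random

# Candidate widths per depth, copied from the notebook's `pot_layers`.
POT_LAYERS = [
    [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 5000],
    [100, 250, 500, 750, 1000, 1500, 2000, 2500, 3000],
    [50, 100, 200, 300, 400, 500, 750, 1000, 1500, 2000],
    [50, 100, 150, 200, 300, 400, 500, 750, 1000],
    [50, 100, 200, 300, 400, 500],
    [50, 100, 200, 300, 400],
    [50, 100, 200, 300],
]


def sample_funnel(n_layers, rng):
    """Draw one width per hidden layer from the candidate list of its depth."""
    return [rng.choice(POT_LAYERS[depth]) for depth in range(n_layers)]


rng = random.Random(42)
n_layers = rng.randint(2, 7)  # same range as trial.suggest_int('n_layers', 2, 7)
layers = sample_funnel(n_layers, rng)
print(n_layers, layers)
```

Because the candidate lists only shift towards smaller values with depth, a sampled architecture is likely, but not guaranteed, to shrink from layer to layer. That is why this is a nudge rather than a hard constraint.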

The rest is a standard training loop for fast.ai, in which the hyperparameter values are suggested by Optuna trials.
The `FastAIV2PruningCallback` allows Optuna to stop trials that are not promising, and thus to sample the search space much faster.
The other callback, `SaveModelCallback`, is used within each training loop to save the best model of that trial.
Later, the model of the best trial is stored with the help of another callback.

When creating the study I set `n_warmup_steps=4` and `interval_steps=4` for two reasons. First, I want to prevent Optuna from pruning trials right at the beginning
only because their loss is very high at that moment; that could just be a random effect.
Second, `interval_steps` is set to 4 to dampen the effect of random fluctuations in the loss a bit. Being able to define something like `patience` would be
better, but for now this approach has to suffice (see: https://github.com/optuna/optuna/issues/1447).
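To make the effect of these two settings concrete, here is a simplified, dependency-free model of a median-based pruning schedule. It is not Optuna's implementation. It assumes that one value is reported per epoch with 0-based step numbers, uses the values 4 and 4 that are passed to `MedianPruner` in the study-creation cell, and ignores Optuna's additional startup-trial logic. The function names `is_check_step` and `should_prune` are hypothetical.

```python
from statistics import median


def is_check_step(step, n_warmup_steps=4, interval_steps=4):
    """True if a pruning check happens at this 0-based step in the simplified model."""
    if step < n_warmup_steps:
        return False  # warm-up: never prune this early
    return (step - n_warmup_steps) % interval_steps == 0


def should_prune(step, own_history, others_at_step,
                 n_warmup_steps=4, interval_steps=4):
    """Prune if the trial's best loss so far is worse than the median of
    the losses that earlier trials reported at the same step."""
    if not is_check_step(step, n_warmup_steps, interval_steps):
        return False
    if not others_at_step:
        return False  # nothing to compare against yet
    return min(own_history) > median(others_at_step)


# Steps at which a 55-epoch trial would be checked in this model.
check_steps = [s for s in range(55) if is_check_step(s)]
print(check_steps)
```

In this model a trial is never judged during its first four epochs, and afterwards only on every fourth epoch, so a single noisy epoch in between cannot trigger pruning.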

In [None]:
dpath = Path('/kaggle/working/Feb2021Playground/OptunaFastAi')
dpath.mkdir(exist_ok=True, parents=True)

In [None]:
learner = None

lpath = Path(dpath/"best_learner.pkl")

if lpath.exists():
    best_learner = joblib.load(lpath)
else:
    best_learner = None

In [None]:
pot_layers = [
    [500, 1000, 1500, 2000, 2500,  3000, 3500, 4000, 5000],
    [100, 250, 500, 750, 1000, 1500, 2000, 2500, 3000],
    [50, 100, 200, 300, 400, 500, 750, 1000, 1500, 2000],
    [50, 100, 150, 200, 300, 400, 500, 750, 1000],
    [50, 100, 200, 300, 400, 500],
    [50, 100, 200, 300, 400],
    [50, 100, 200, 300]
]

In [None]:
def objective(trial: optuna.Trial):
    num_layers = trial.suggest_int('n_layers', 2, 7)
    num_layers += 1
    layers, ps = [], []
    
    # shrinking sizes for deeper layers
    pot_layers = [
        [500, 1000, 1500, 2000, 2500,  3000, 3500, 4000, 5000],
        [100, 250, 500, 750, 1000, 1500, 2000, 2500, 3000],
        [50, 100, 200, 300, 400, 500, 750, 1000, 1500, 2000],
        [50, 100, 150, 200, 300, 400, 500, 750, 1000],
        [50, 100, 200, 300, 400, 500],
        [50, 100, 200, 300, 400],
        [50, 100, 200, 300]
    ]

    # size of last layer is chosen automatically by fast.ai
    for i in range(num_layers - 1):
        num_units = trial.suggest_categorical(f'num_units_{i}', pot_layers[i])
        
        # although my initial intuition would be to reduce the dropout for deeper
        # layers, the optimization showed that this does not lead to the best results
        p = trial.suggest_uniform(f'ps_{i}', 0, 0.5)
        
        layers.append(num_units)
        ps.append(p)
    

    # hold out a random 25% of the training set for validation
    splits = RandomSplitter(valid_pct=0.25, seed=42)(train.index)

    dls = TabularPandas(
        train,
        cont_names=cont_names,
        cat_names=cat_names,
        procs=procs,
        y_names=dep_var,
        splits=splits
    ).dataloaders(bs=8224)


    callbacks = [
        SaveModelCallback(min_delta=0.0005, monitor='_rmse', comp=np.less, fname='model_triv_best'),
        FastAIV2PruningCallback(trial, monitor='_rmse')
    ]

    # I leave the automatic sizing of the embedding layers in place, but tune the embedding dropout
    emb_drop = trial.suggest_uniform('emb_drop', 0, 0.35)
    
    cfg = tabular_config(embed_p=emb_drop, ps=ps)
    global learner
    learner = tabular_learner(dls, layers=layers, metrics=[rmse], config=cfg)
    
    # could be improved with fast.ai's learning rate finder,
    # so that the learning rate fits the size of the network
    
    
    lr_max = trial.suggest_uniform('lr_max', 0.01, 0.2)
    weight_decay = trial.suggest_uniform('weight_decay', 0.02, 0.25)
    learner.fit_one_cycle(55, lr_max=lr_max, wd=weight_decay, cbs=callbacks)

    return learner.validate()[-1]

This callback stores the best model in the global variable `best_learner`, so I can use it later without retraining.

In [None]:
def saveBestModelCallback(study, trial):
    global best_learner
    if study.best_trial == trial:
        best_learner = learner

In [None]:
# reload the study if one already exists; this allows rerunning the notebook to get better results
spath = dpath / "study.pkl"
if spath.exists():
    study = joblib.load(spath)
else:
    study = optuna.create_study(pruner=optuna.pruners.MedianPruner(n_warmup_steps=4, interval_steps=4))

In [None]:
study.optimize(objective, timeout=60*60*4, callbacks=[saveBestModelCallback])

In [None]:
# store the study for further optimization
joblib.dump(study, spath)
joblib.dump(best_learner, lpath)

# Visualization of the Hyperparameter

Maybe the coolest thing Optuna offers is the ability to visualize the hyperparameters.
One can look at the interdependence of the different hyperparameters, see what worked across many trials, and understand what worked for the problem at hand.

In [None]:
df_trials = study.trials_dataframe() \
    .sort_values(by=['value']) \
    .drop(columns=['datetime_start', 'datetime_complete', 'number', 'state']) # drop uninteresting columns


df_trials['duration'] = df_trials['duration'].dt.total_seconds()/60.0
# shorten some column names for a better overview
par_cols = df_trials.columns[df_trials.columns.str.startswith('params')]
df_trials = df_trials.rename(columns={col: col[7:] for col in par_cols})

layer_cols = [f'num_units_{i}' for i in range(5)]
pdrop_cols = [f'ps_{i}' for i in range(5)]

In [None]:
optuna.visualization.plot_optimization_history(study)

In [None]:
optuna.visualization.plot_edf(study)

In [None]:
optuna.visualization.plot_contour(study,
                                  params=['lr_max',
                                          'n_layers',
                                          'weight_decay',
                                          'emb_drop'
                                         ])

In [None]:
optuna.visualization.plot_contour(study,
                                  params=['num_units_0',
                                          'num_units_1',
                                          'num_units_2',
                                          'num_units_3',
                                         ])

In [None]:
best_trials = df_trials.nsmallest(n=20, columns=['value'])

In [None]:
best_trials

In [None]:
# more layers seem to be better :)
best_trials['n_layers'].value_counts()

Conclusions from the best trials (I hope they are still valid after the save run :)):

* None of the trials beat optimized LightGBM, XGBoost or similar approaches
* the best number of layers is consistently 4
* the size of the layers fluctuates quite a lot, but in general a funnel network emerges
* dropout fluctuates quite a bit, but seems to be highest in the second layer and does NOT go down to 0
* rest is quite standard:
    * lr_max [0.10, 0.13]
    * weight_decay [0.08, 0.11]
    * emb_drop [0.15, 0.25]

In [None]:
study.best_params

# Create the Submission

In [None]:
test, test_id = prep_nn1(test_raw.copy(deep=True))

test_dl = best_learner.dls.test_dl(test)

preds, _ = best_learner.get_preds(dl=test_dl)
preds = preds.numpy().T[0]

submission = pd.DataFrame(
    {'id': test_id,
     'target': preds}
)
submission.to_csv('submission_trivial_nn.csv', index=False)

In [None]:
# in addition to the submission, I also store the predictions for the full training set
full_train_dl = best_learner.dls.test_dl(train)

preds, _ = best_learner.get_preds(dl=full_train_dl)
preds = preds.numpy().T[0]

full_train_results = pd.DataFrame(
    {'id': train_ids,
     'target': preds}
)

In [None]:
!mkdir -p '/kaggle/working/Feb2021Playground/OptunaFastAi'

In [None]:
submission.to_csv('/kaggle/working/Feb2021Playground/OptunaFastAi/test_results_fastai.csv', index=False)
full_train_results.to_csv('/kaggle/working/Feb2021Playground/OptunaFastAi/train_results_fastai.csv', index=False)