# Fast.ai Tabular Solver

This is a demo of the Fast.ai Tabular Solver. Relevant documentation:
- https://docs.fast.ai/tutorial.tabular.html
- https://docs.fast.ai/tabular.core.html
- https://docs.fast.ai/tabular.data.html
- https://docs.fast.ai/tabular.learner.html

This approach is quick to set up and gives reasonable plug-and-play results with very little code.

This competition is evaluated on root mean squared error (RMSE); lower is better.
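
For reference, here is a minimal sketch of how RMSE is computed. The helper name `rmse_score` and the numbers are made up for illustration; they are not competition data or part of fast.ai.

```python
import numpy as np

def rmse_score(y_true, y_pred):
    """Root mean squared error between two equal-length 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up targets and predictions, for illustration only
example_true = [8.0, 7.5, 9.0, 6.5]
example_pred = [7.5, 7.5, 8.0, 7.0]
example_rmse = rmse_score(example_true, example_pred)
print(f"RMSE = {example_rmse:.4f}")
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.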

In [None]:
import numpy as np
import pandas as pd 
from fastai.tabular.all import *

In [None]:
train_df = pd.read_csv('../input/tabular-playground-series-jan-2021/train.csv', index_col='id')
test_df  = pd.read_csv('../input/tabular-playground-series-jan-2021/test.csv',  index_col='id')
display('train_df')
display( train_df )
display('test_df')
display( test_df )

# TabularDataLoaders

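First we define a `TabularDataLoaders` object, specifying the target column (`y_names`), the continuous feature columns (`cont_names`), and the preprocessing steps (`procs`). This dataset has no categorical columns, so `cat_names` is empty and `Categorify` is not needed. A random 1% of the rows is held out for validation.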

In [None]:
splits = RandomSplitter(valid_pct=0.01)(range_of(train_df))
dls = TabularDataLoaders.from_csv(
    '../input/tabular-playground-series-jan-2021/train.csv', 
    y_names    = "target",
    cont_names = [ f'cont{n}' for n in range(1,14+1) ],
    cat_names  = [],
    procs = [
        # Categorify, 
        FillMissing, 
        Normalize
    ],
    valid_idx = splits[1]
)
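
To give a feel for the two preprocessing steps, here is a rough pandas-only sketch of the idea: fill gaps with a training-set statistic, then standardize with the training-set mean and standard deviation. It is a simplification, not fast.ai's implementation (for example, it adds no missing-value indicator columns), and the toy frames and the `preprocess` helper are hypothetical.

```python
import pandas as pd

# Toy frames with missing values; column names and numbers are illustrative only
toy_train = pd.DataFrame({"cont1": [1.0, 2.0, None, 4.0], "cont2": [10.0, 20.0, 30.0, 40.0]})
toy_valid = pd.DataFrame({"cont1": [3.0, None], "cont2": [25.0, 35.0]})

# All statistics come from the training rows only
medians = toy_train.median()
filled_train = toy_train.fillna(medians)
means = filled_train.mean()
stds = filled_train.std(ddof=0)

def preprocess(df):
    """Fill gaps with training medians, then standardize with training mean/std."""
    return (df.fillna(medians) - means) / stds

train_ready = preprocess(toy_train)
valid_ready = preprocess(toy_valid)
print(train_ready.round(3))
print(valid_ready.round(3))
```

Reusing the training statistics on the validation rows keeps both sets on the same scale without leaking validation information into the preprocessing.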

# Hyperparameter Tuning

In [None]:
# for layers in [
#     [200,100],         # fast.ai defaults
#     [256,128,64],        
#     [256,128,64,32],  
#     [512,256,128,64],
#     [512,256,128],
#     [1024,512,256,128,64],
#     [2048,1024,512,256,128,64],
# ]:
#     for loss_func in [ 
#         L1LossFlat,   # better with larger  models
#         MSELossFlat,  # better with smaller models
#     ]:  
#         print(f'loss_func = {loss_func.__name__} | layers = {layers}')
#         learn = tabular_learner(
#             dls, 
#             metrics   = [ rmse ],
#             layers    = layers,
#             loss_func = loss_func(),
#         )
#         learn.fit_one_cycle(1)

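Then run the learning rate finder to look for a suitable learning rate. Note that the training cell further down keeps the default learning rate.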

In [None]:
learn = tabular_learner(
    dls, 
    metrics   = [ rmse ],
    layers    = [512, 256, 128],
    loss_func = L1LossFlat(),
)
learn.lr_find(start_lr = 1e-05, end_lr = 1e+05, num_it = 100)
# learn.recorder.plot()
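
Conceptually, the finder trains for a few iterations while raising the learning rate exponentially and recording the loss. The sketch below only reproduces the sweep of learning rates for the arguments used above; the helper `lr_sweep` is hypothetical, and the real finder also stops early once the loss diverges, which is omitted here.

```python
import math

def lr_sweep(start_lr, end_lr, num_it):
    """Exponentially spaced learning rates from start_lr to end_lr (inclusive)."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_it - 1)) for i in range(num_it)]

sweep = lr_sweep(start_lr=1e-05, end_lr=1e+05, num_it=100)
print(f"first = {sweep[0]:.1e}, last = {sweep[-1]:.1e}")

# Number of orders of magnitude covered by the sweep
decades = math.log10(sweep[-1] / sweep[0])
print(f"decades covered = {decades:.1f}")
```

With these arguments the sweep spans ten orders of magnitude in 100 steps, so each step multiplies the learning rate by roughly 1.26.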

# Training

We reuse the learner created above and train it with `learn.fit_one_cycle()`

In [None]:
%%time

# learn.fit_one_cycle(10)               # Score = 0.71423
# learn.fit_one_cycle(100, lr_max=0.01) # Score = 1.56993 | lr_max causes training instability 
learn.fit_one_cycle(20)                 # 
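
For intuition about what `lr_max` controls, here is a small sketch of a one-cycle schedule: a cosine warm-up to `lr_max` followed by a cosine decay to a much smaller value. The values `div=25`, `div_final=1e5`, `pct_start=0.25` and `lr_max=1e-3` are assumed to match the fast.ai defaults and should be checked against the installed version; the function names are hypothetical.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle_lr(pos, lr_max, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress pos in [0, 1] (assumed defaults)."""
    if pos < pct_start:
        # Warm-up phase: lr_max / div -> lr_max
        return cos_anneal(lr_max / div, lr_max, pos / pct_start)
    # Decay phase: lr_max -> lr_max / div_final
    return cos_anneal(lr_max, lr_max / div_final, (pos - pct_start) / (1 - pct_start))

for pos in [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]:
    print(f"progress {pos:.2f} -> lr {one_cycle_lr(pos, lr_max=1e-3):.2e}")
```

Under these assumptions, passing `lr_max=0.01` raises the peak of the whole curve tenfold, which is one plausible reason for the instability noted in the cell above.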

# Results

We can show preliminary results on a sample of the validation set using `learn.show_results()`

In [None]:
learn.show_results()

Or generate predictions for individual rows

In [None]:
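# For a regression target, the second and third values hold the predicted value, not a class and probabilities
row, clas, probs = learn.predict(train_df.loc[1])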
row.show()
print('clas ', clas)
print('probs', probs)

# Submission

This is the very slow way of generating results, using a Python loop that calls `learn.predict()` once per row

In [None]:
# submission_df = pd.read_csv('../input/tabular-playground-series-jan-2021/sample_submission.csv', index_col='id')
# for idx in test_df.index:
#     row, clas, probs = learn.predict(test_df.loc[idx])
#     submission_df.loc[idx, 'target'] = float(probs[0])  # single .loc call avoids chained assignment
# submission_df.to_csv('submission.csv')

A faster method is to run `learn.get_preds()` on the entire test dataframe. It returns a tuple `(predictions, targets)`, where `predictions` is a `torch.Tensor` and `targets` is `None` because the test set has no labels

In [None]:
# Flatten the (n_rows, 1) prediction tensor into a 1-D array, one value per test row
predictions   = learn.get_preds( dl=learn.dls.test_dl(test_df) )[0].numpy().flatten()

submission_df = pd.read_csv('../input/tabular-playground-series-jan-2021/sample_submission.csv', index_col='id')
submission_df['target'] = predictions
submission_df.to_csv('submission.csv')
!head submission.csv
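
Before uploading, it can be worth checking that there is exactly one finite prediction per test row. The sketch below uses random stand-in data rather than real model output; `fake_preds` and `fake_submission` are hypothetical names that mimic the shapes involved.

```python
import numpy as np
import pandas as pd

# Stand-ins for the real objects: random numbers in place of model output
n_rows = 5
fake_preds = np.random.default_rng(0).normal(loc=8.0, scale=0.7, size=(n_rows, 1))
fake_submission = pd.DataFrame({"target": 0.0}, index=pd.Index(range(n_rows), name="id"))

# Collapse the (n_rows, 1) array to 1-D and sanity-check it
flat = fake_preds.reshape(-1)
assert len(flat) == len(fake_submission), "one prediction per test row expected"
assert np.isfinite(flat).all(), "predictions contain NaN or inf"

fake_submission["target"] = flat
print(fake_submission)
```

The same two assertions can be applied to the real `predictions` and `submission_df` before writing the CSV.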

# Further Reading

This notebook is part of a series exploring the [Tabular Playground](https://www.kaggle.com/c/tabular-playground-series-jan-2021). Each entry is listed with its RMSE score (lower is better):
- 0.72935 - [scikit-learn Ensemble](https://www.kaggle.com/jamesmcguigan/tabular-playground-scikit-learn-ensemble)
- 0.71423 - [Fast.ai Tabular Solver](https://www.kaggle.com/jamesmcguigan/fast-ai-tabular-solver)
- 0.70426 - [XGBoost](https://www.kaggle.com/jamesmcguigan/tabular-playground-xgboost)