## Import libraries & data
- `fastai` releases updates frequently, so this notebook is not guaranteed to work with versions later than the ones pinned here
- This notebook demonstrates how to quickly build and train a tabular NN model in `fastai`, and how to use its `TabularPandas` API to prepare data for other ML models such as `xgboost`

In [None]:
!pip install -q fastai==2.2.5 fastcore==1.3.19

In [None]:
from fastai.tabular.all import *

SEED = 42
set_seed(SEED, reproducible=True)

In [None]:
path = Path('/kaggle/input/tabular-playground-series-jan-2021')
path.ls()

## Process data

In [None]:
train_df = pd.read_csv(path/'train.csv')
train_df.head()

In [None]:
y_names = ['target']
cont_names = list(train_df.columns.values)[1:-1]
cat_names = []
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter(seed=SEED)(range_of(train_df))
bs = 256

In [None]:
db = TabularPandas(
    train_df, 
    procs=procs, 
    cat_names=cat_names, 
    cont_names=cont_names, 
    y_names=y_names, 
    y_block=RegressionBlock(),
    splits=splits,
)

In [None]:
dls = db.dataloaders(bs=bs)
dls.show_batch()

## NN Training

In [None]:
model_name = 'nn'

In [None]:
# Save the model with the lowest validation RMSE seen so far (fastai reports the `rmse` metric under the name `_rmse`)
cbs = [SaveModelCallback(monitor='_rmse', comp=np.less, fname=model_name+'_best')]

In [None]:
learn = tabular_learner(dls, layers=[200, 100], metrics=rmse, cbs=cbs)

In [None]:
learn.lr_find()

In [None]:
learn.fit_one_cycle(20, 1e-2)

In [None]:
learn.show_results()

## Evaluate on validation data

In [None]:
learn.load(model_name+'_best')

In [None]:
preds, targs = learn.get_preds()
nn_preds = preds.squeeze(1)

If you are **ensembling** (see the Ensembling section below), set `preds = avg_val_preds` before running the `rmse` cell so that the score reflects the averaged predictions.

| Model    | Min RMSE (Validation) |
|----------|-----------------------|
| nn       | 0.7124                |
| xgb      | 0.7027                |
| nn + xgb | 0.7017                |
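
To make the averaging behind the `nn + xgb` row concrete, here is a minimal pure-Python sketch on toy numbers. The helper names `rmse_score` and `average_preds` and the toy values are assumptions for illustration only; they are not part of `fastai` and are not the competition data.

```python
import math

def rmse_score(preds, targs):
    """Root mean squared error between two equal-length sequences."""
    assert len(preds) == len(targs)
    squared_errors = [(p - t) ** 2 for p, t in zip(preds, targs)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def average_preds(*pred_lists):
    """Element-wise mean of several prediction lists (a simple ensemble)."""
    return [sum(vals) / len(vals) for vals in zip(*pred_lists)]

# Toy targets and predictions (hypothetical values, not from the notebook)
targs = [1.0, 2.0, 3.0, 4.0]
nn_toy = [1.5, 2.5, 2.5, 3.5]
xgb_toy = [0.5, 2.5, 3.5, 3.5]

avg_toy = average_preds(nn_toy, xgb_toy)

print('nn      ', rmse_score(nn_toy, targs))
print('xgb     ', rmse_score(xgb_toy, targs))
print('nn + xgb', rmse_score(avg_toy, targs))
```

In this toy case the two models err in opposite directions on some rows, so averaging cancels part of the error. On real data the gain depends on how correlated the two models' errors are.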

In [None]:
rmse(preds, targs)

## ML Training

In [None]:
X_train, y_train = dls.train.xs, dls.train.ys.values.ravel()
X_valid, y_valid = dls.valid.xs, dls.valid.ys.values.ravel()

In [None]:
!pip install -q xgboost

In [None]:
from xgboost import XGBRegressor
model_name = 'xgb'

In [None]:
model = XGBRegressor(n_estimators=100, max_depth=8, learning_rate=0.1, subsample=0.5)
model.fit(X_train, y_train)

In [None]:
xgb_preds = tensor(model.predict(X_valid))

In [None]:
rmse(xgb_preds, tensor(y_valid))

## Make predictions on test data

In [None]:
test_df = pd.read_csv(path/'test.csv')
test_df.head()

In [None]:
test_dl = dls.test_dl(test_df)

In [None]:
preds = tensor(model.predict(test_dl.xs))
xgb_preds = preds

In [None]:
preds, _ = learn.get_preds(dl=test_dl)
nn_preds = preds.squeeze(1)

In [None]:
submit = pd.read_csv(path/'sample_submission.csv')
submit['target'] = xgb_preds  # or nn_preds, or avg_test_preds if ensembling
submit.head()

## Ensembling
- Remember to use the same seed value for `RandomSplitter` so that both models are evaluated on the same validation set
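
Why the seed matters can be sketched with a stand-in splitter built on the standard library. `seeded_split` below is a hypothetical helper, not fastai's `RandomSplitter`; it only illustrates that a fixed seed reproduces the same validation indices, which is what makes the two models' validation predictions comparable row by row.

```python
import random

def seeded_split(n_rows, valid_pct=0.2, seed=None):
    """Shuffle row indices with a seeded RNG and return (train_idxs, valid_idxs)."""
    rng = random.Random(seed)
    idxs = list(range(n_rows))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_rows)
    return idxs[cut:], idxs[:cut]

# Two independent calls with the same seed give identical splits
train_a, valid_a = seeded_split(100, seed=42)
train_b, valid_b = seeded_split(100, seed=42)

print(valid_a == valid_b)
print(len(train_a), len(valid_a))
```

If the seeds differed, the validation rows would generally differ too, and averaging `nn_preds` with `xgb_preds` would mix predictions for different rows.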

In [None]:
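# Ensembling on the validation set
# nn_preds and xgb_preds must hold the validation predictions here
# (both names are reassigned in the test-data section above).
# Then rerun the `rmse(preds, targs)` cell in the validation section to see the ensemble RMSE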

# avg_val_preds = (nn_preds + xgb_preds) / 2
# preds = avg_val_preds

In [None]:
# Ensembling on the test set

# avg_test_preds = (nn_preds + xgb_preds) / 2
# preds = avg_test_preds

## Submit to Kaggle
- Download the `submission.csv` file and submit

In [None]:
submit.to_csv('submission.csv', index=False)