# Tabular Data competition -- May 2022

My strategy will be to quickly get to a prediction using DL, then use the model to help with EDA.

A couple of main sources:
- [`fastai` tabular data tutorial](https://docs.fast.ai/tutorial.tabular.html)
- ["Iterate Like A Grandmaster" Kaggle kernel](https://www.kaggle.com/code/jhoward/iterate-like-a-grandmaster)

# 0. Get environment set up

Run first:
- Import useful stuff using fastai (pd, np, plt, etc.)
- Check if this notebook runs on kaggle or somewhere else

If you're on Kaggle, do this first:
- If you're doing any actual training, add a GPU via the three-dot menu at right -> "Accelerators"
- If you don't have the data yet, use ["+ Add Data"](https://www.kaggle.com/docs/notebooks#datasets) in the right-hand sidebar


If you're not on kaggle:
- pip install [kaggle Python API](https://www.kaggle.com/code/donkeys/kaggle-python-api/notebook)
- set up kaggle token
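
For the token step, here is a minimal sketch of what "set up" amounts to, as far as I understand it: the Kaggle client looks for a `kaggle.json` file in `~/.kaggle`. The helper name `write_kaggle_token` and the throwaway demo directory are my own placeholders, not part of the Kaggle API.

```python
import json
import tempfile
from pathlib import Path

def write_kaggle_token(username, key, config_dir):
    """Write kaggle.json into config_dir and restrict it to the current user."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    token_path = config_dir / "kaggle.json"
    token_path.write_text(json.dumps({"username": username, "key": key}))
    # keep the token private; the Kaggle client complains about world-readable tokens
    token_path.chmod(0o600)
    return token_path

# Demo with placeholder credentials in a throwaway directory.
# For real use, config_dir would be Path.home() / ".kaggle".
demo_dir = Path(tempfile.mkdtemp())
token_path = write_kaggle_token("xxx", "xxx", demo_dir)
print(token_path.name, json.loads(token_path.read_text()))
```

Writing to a temporary directory in the demo avoids clobbering a real token.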

In [None]:
from fastai.imports import *
import os
iskaggle = len(os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')) > 0

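if not iskaggle:
    print("We're running this notebook somewhere besides kaggle")
    !pip install kaggle
    # if running on a non-kaggle system, replace the empty string below
    # with the token from the Kaggle user-profile section,
    # in JSON format like
    # creds = '{"username":"xxx","key":"xxx"}'
    creds = ''
    # the kaggle client reads its token from ~/.kaggle/kaggle.json
    cred_path = Path('~/.kaggle/kaggle.json').expanduser()
    if creds and not cred_path.exists():
        cred_path.parent.mkdir(exist_ok=True)
        cred_path.write_text(creds)
        cred_path.chmod(0o600)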

In [None]:
# install and/or import some stuff that's specific to tabular data

!pip install -Uqq waterfallcharts treeinterpreter dtreeviz

from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG

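# plot_partial_dependence was removed from newer scikit-learn releases;
# PartialDependenceDisplay is the replacement
from sklearn.inspection import PartialDependenceDisplay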
from treeinterpreter import treeinterpreter
from waterfall_chart import plot as waterfall
from sklearn.metrics import accuracy_score, roc_auc_score

import multiprocessing as mp

In [None]:
has_gpu = torch.cuda.is_available(); print("Has GPU:", has_gpu)
print("Torch version:", torch.__version__)
import fastai
print("Fast-ai version:", fastai.__version__)
import sys
print("Python version:", sys.version)

# 1. Load data

### get path to the data

If not on kaggle, use ["kaggle competitions download"](https://www.kaggle.com/competitions/tabular-playground-series-may-2022/data) to download data

In [None]:
path = (Path('../input/tabular-playground-series-may-2022') if iskaggle
    else Path.home()/'data'/'tabular-playground-series-may-2022')

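if not iskaggle and not path.exists():
    from zipfile import ZipFile
    from kaggle import api
    # download <competition>.zip next to the target folder, then unpack it
    path.parent.mkdir(parents=True, exist_ok=True)
    api.competition_download_cli(path.name, path=str(path.parent))
    ZipFile(f'{path}.zip').extractall(path)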
    
path.ls()

### Load the data!

In [None]:
# low_memory=False makes pandas parse the whole file in one pass instead of in chunks,
# which helps avoid mixed dtypes being inferred within a column
df = pd.read_csv(path/'train.csv', low_memory=False)
df.T

In [None]:
df = df_shrink(df)
df.info()
# numeric_only=True skips the text column f_27
df.corr(numeric_only=True)

# 2. Feature engineering for the text column

My tree-based learners haven't been getting enough out of the text column, `f_27`.

[This kernel by c4rl05/v](https://www.kaggle.com/code/cv13j0/tps-may22-eda-gbdt) leans on the text column a lot (its derived features show high feature importance), so I am going to try two changes:
- convert characters to a numeric format using the `ord` built-in function
- add a feature: the number of unique characters in `f_27`
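
To make the two changes concrete, here is a small sketch on a single string before applying them to the whole column. The example string is made up, and the helper names `encode_chars` and `n_unique_chars` are just for illustration.

```python
def encode_chars(s):
    """Map each character to its offset from 'A' (A -> 0, B -> 1, ...)."""
    return [ord(ch) - ord("A") for ch in s]

def n_unique_chars(s):
    """Count the distinct characters in the string."""
    return len(set(s))

example = "ABABDADBAB"  # made-up value in the style of f_27
print(encode_chars(example))    # [0, 1, 0, 1, 3, 0, 3, 1, 0, 1]
print(n_unique_chars(example))  # 3
```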

In [None]:
# what the heck is column 27?
print("Number of unique values:", df['f_27'].nunique())
print("Some examples:")
df['f_27'].sample(n=5)
## At first glance this looked like a 'red herring' column,
## but the cells below turn it into features rather than ignoring it.

In [None]:
def substr_cols(df, colname):
    '''split a string column into one column per position in string
    also turn characters into numeric values in case these are handled
    better by lightgbm and xgboost
    '''
    maxchars = max(df[colname].str.len())
    for i in range(maxchars):
        subcolname = colname+"_"+str(i).zfill(len(str(maxchars)))
        # get(i) would just get the ith character in the string
        # ord function converts this character into a numeric value
        df[subcolname] = df[colname].str.get(i).apply(ord) - ord('A')
    return df

def prep_text_col(df, colname):
    '''
    splits the text column into sub-string columns (one character per column),
    converts each character into a numeric value,
    counts the unique characters in each string,
    and then drops the original text column
    '''
    df = substr_cols(df,colname)
    df[colname+"_nuniq_chars"] = df[colname].apply(lambda s: len(set(s)))
    return df.drop(labels=[colname], axis=1)

In [None]:
df = prep_text_col(df, "f_27")

## Count-occurrences and count-chars features


I'll try doing this as well.

In [None]:
def count_unique_chars(df, colname):
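    '''
    Add one numeric column per character position (ch_0, ch_1, ...)
    plus a count of unique characters in a text column.
    The original text column is left in place.
    '''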
    maxchars = max(df[colname].str.len())

    for i in range(maxchars):
        df[f'ch_{i}'] = df[colname].str.get(i).apply(ord) - ord('A')
    
    new_colname = colname + "_chars_uniq"
    df[new_colname] = df[colname].apply(lambda s: len(set(s)))
    return df

# 3. Deep-learning-based model using fastai

This follows the [`fastai` tabular learner tutorial](https://docs.fast.ai/tutorial.tabular)

In [None]:
# dep_var keeps the target out of both lists; "id" is a row identifier, not a feature
cont_names, cat_names = cont_cat_split(df, dep_var="target")
cont_names.remove("id")
splits = RandomSplitter(valid_pct=0.2)(range_of(df))

In [None]:
print("Fraction of 1s:", len(df[df['target']==1]) / len(df))
for c in cat_names:
    print(f"{c} uniques:", df[c].nunique())

In [None]:
to = TabularPandas(df, procs = [Categorify, FillMissing, Normalize],
    y_names = "target", y_block = CategoryBlock,
    cat_names = cat_names, cont_names = cont_names,
    splits = splits)

dls = to.dataloaders(bs=2**13)

Note I'm using a pretty big batch size (2**13).

I did some tests using the training code below. Small batch sizes trained slowly and I had tons of GPU memory left.

Larger batch sizes trained faster -- up to a point.

Here are my tests:

| Batch size                 | 65536 | 16384 | 4096 | 1024 |
|----------------------------|-------|-------|------|------|
| iterations to 81% accuracy | 25    | 10    | 6    | 4    |
| seconds to 81% accuracy    | 50    | 20    | 24   | 44   |
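
As a quick sanity check on the table, this sketch works out seconds per iteration and picks the batch size that reached 81% accuracy fastest. The numbers are copied straight from the table above; nothing else is assumed.

```python
batch_sizes = [65536, 16384, 4096, 1024]
iterations = [25, 10, 6, 4]   # iterations to 81% accuracy
seconds = [50, 20, 24, 44]    # seconds to 81% accuracy

sec_per_iter = [secs / its for secs, its in zip(seconds, iterations)]
for bs, spi, secs in zip(batch_sizes, sec_per_iter, seconds):
    print(f"bs={bs:>6}: {spi:.1f} s per iteration, {secs} s total")

# smallest wall-clock time wins, regardless of how many iterations it took
fastest = min(zip(seconds, batch_sizes))[1]
print("fastest to 81% accuracy:", fastest)
```

Bigger batches need more iterations but each iteration is cheaper, which is why the sweet spot sits in the middle of the range.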

In [None]:
dls.show_batch()

In [None]:
learn = tabular_learner(dls, metrics=[accuracy,RocAucBinary()], loss_func=CrossEntropyLossFlat(),
                        layers=[2048,1024,512,250,100,64,64,16], cbs=[ShowGraphCallback()],
#                         opt_func=Adam(params=params, lr=self.lr)
                        opt_func=Adam
                       )

In [None]:
learn.opt_func, learn.loss_func

In [None]:
learn.lr_find()

In [None]:
learn.fit_one_cycle(15, lr_max=0.1, wd=1e-2)

In [None]:
dl_y_valid = learn.get_preds(dl=dls.valid)[0]
print("dl alone:")
# AUC is a ranking metric, so score the positive-class probability rather than hard labels
print(roc_auc_score(dls.valid.y,dl_y_valid[:,1]))

In [None]:
learn.show_results()

In [None]:
learn.fit(10, wd=0.1, lr=10**-1)

Ten more epochs

In [None]:
learn.fit(10, wd=0.1, lr=10**-2)

In [None]:
learn.fit(10, wd=0.1, lr=0.001)

In [None]:
learn.fit_one_cycle(15, lr_max=0.1, wd=1e-2)

In [None]:
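# refresh the validation predictions so the score reflects the extra training
dl_y_valid = learn.get_preds(dl=dls.valid)[0]
print("dl alone:")
print(roc_auc_score(dls.valid.y,dl_y_valid[:,1]))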

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

In [None]:
learn.show_results();

# 4. Random forest model, then some EDA

In [None]:
xs, y = to.train.xs, to.train.y

In [None]:
m = DecisionTreeClassifier(max_leaf_nodes=4)
m.fit(xs, y);

In [None]:
plot_tree(m)

In [None]:
# tree_.feature[0] is the index of the feature used at the root split
print("here's the first column the tree split on:\n", xs.columns[m.tree_.feature[0]])

In [None]:
!pip install -Uqq dtreeviz

import dtreeviz

In [None]:
txs, ty = to.valid.xs, to.valid.y

In [None]:
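# random forest can train on both train and validation data
# (note: any score computed on to.valid for this forest will be optimistic,
# because it has seen those rows)
# still use the to. version of data since it's been pre-processed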
rf_xs = pd.concat([to.train.xs, to.valid.xs]).sort_index()
rf_y = pd.concat([to.train.y, to.valid.y]).sort_index()

In [None]:
# This doesn't work, could try to read docs at
# https://github.com/parrt/dtreeviz
samp_idx = np.random.permutation(len(y))[:500]
# dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, "value")

## Trying this time with RAPIDS ("cuML") Random Forest

Note annoying limitations:
- Doesn't have a way to compute OOB score
- Doesn't record feature importance

However, it is sooo much faster than training with scikit-learn random forest,
so I decided to use it to train a big forest.
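
For context on what is lost without an OOB score: each tree trains on a bootstrap sample, and the rows that tree never drew act as a free validation set. The sketch below estimates how many rows are left out per tree. It assumes rows are drawn with replacement and that `max_samples=0.5` means half as many draws as rows; that is my reading, not something I checked against cuML internals.

```python
import math
import random

def oob_fraction(n_rows, max_samples, seed=0):
    """Simulate one bootstrap draw and return the share of rows never drawn."""
    rng = random.Random(seed)
    n_draws = int(n_rows * max_samples)
    drawn = {rng.randrange(n_rows) for _ in range(n_draws)}
    return 1 - len(drawn) / n_rows

simulated = oob_fraction(n_rows=100_000, max_samples=0.5)
# (1 - 1/n) ** (0.5 * n) tends to exp(-0.5) as n grows
expected = math.exp(-0.5)
print(f"simulated: {simulated:.3f}, expected: {expected:.3f}")
```

Under those assumptions roughly 60% of rows are out-of-bag for any single tree, so an OOB estimate would have had plenty of data to work with.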

In [None]:
from cuml import RandomForestClassifier as cuRF

In [None]:
# def rf(xs, y, n_estimators=100, max_samples=100000,
#        max_features=0.8, min_samples_leaf=1, **kwargs):
#     return RandomForestClassifier(n_jobs=-1, n_estimators=n_estimators,
#         max_samples=max_samples, max_features=max_features,
#         criterion="entropy", n_jobs: mp.cpu_count()
#         min_samples_leaf=min_samples_leaf, oob_score=True,).fit(xs, y)

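def rf(xs, y, n_estimators=1000, max_samples=0.5, n_bins=256,
       max_features=0.8, min_samples_leaf=1, **kwargs):
    # n_bins and kwargs are now passed through instead of being silently dropped
    return cuRF(n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        n_bins=n_bins,
        split_criterion=1,  # 1 selects entropy in cuML (0 is gini)
        min_samples_leaf=min_samples_leaf, **kwargs).fit(xs, y)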

In [None]:
m = rf(rf_xs, rf_y);

In [None]:
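## OOB error degraded when I changed max_features from "0.5" to "sqrt"
# print("OOB score (higher is better):",m.oob_score_)
# scored on the same rows the forest was trained on, so this number is optimistic
print("Training-set AUC, higher is better:",
      roc_auc_score(rf_y, np.asarray(m.predict_proba(rf_xs))[:,1]))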

### Feature importance

In [None]:
# def rf_feat_importance(m, df):
#     return pd.DataFrame({'cols':df.columns, 'imp': m.feature_importances_}
#                        ).sort_values('imp', ascending=False)

In [None]:
# fi = rf_feat_importance(m, rf_xs)

In [None]:
# def plot_fi(fi):
#     return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)

# plot_fi(fi);

In [None]:
## TODO: Replace this with https://github.com/parrt/random-forest-importances
# from sklearn.inspection import plot_partial_dependence
# from sklearn.inspection import PartialDependenceDisplay
# fig,ax = plt.subplots(figsize=(12,4))
# PartialDependenceDisplay.from_estimator(m, xs,
#                                         ["f_26", "f_21", "f_27_08"],
#                                         grid_resolution=20,ax=ax)

# 5. Very quick gradient boosting

In [None]:
import xgboost as xgb

In [None]:
# num_leaves is a LightGBM parameter that XGBoost does not use, so it is dropped here
xgb_m = xgb.XGBClassifier(n_estimators = 5000, learning_rate=0.05,
                          colsample_bytree=0.9, predictor="gpu_predictor",
                          min_child_weight=0.96,
                          subsample=0.8, objective="binary:logistic", eval_metric="auc",
                         enable_categorical=True, tree_method="gpu_hist")
xgb_m_f = xgb_m.fit(to.train.xs, to.train.y)

What we could test for XGBoost: change max_depth

# 6. Very quick lightgbm test

Did some quick parameter testing based on https://neptune.ai/blog/lightgbm-parameters-guide

which is partly based on this amazing post: https://sites.google.com/view/lauraepp/parameters

...and then I just copied parameters used in this nice Kaggle kernel:
https://www.kaggle.com/code/jitensharma597/tsp2022-lgbm-light-gradient-boosting-machine#Modeling:-LGBM
(but without messing with the 5-fold CV)

In [None]:
import lightgbm as lgb

In [None]:
lg_m = lgb.LGBMClassifier(n_estimators = 5000, learning_rate=0.05, num_leaves=50,
                          colsample_bytree=0.9, min_child_samples=96,
                          max_bins=255,
                          subsample=0.8, objective="binary", metric="auc", device="gpu")
lg_m.fit(to.train.xs, to.train.y)

# 7. Test ensembling models

In [None]:
dl_y_valid = learn.get_preds(dl=dls.valid)[0]
rf_y_valid = Tensor(m.predict_proba(txs))
xg_y_valid = Tensor(xgb_m_f.predict_proba(txs))
lg_y_valid = Tensor(lg_m.predict_proba(txs))

In [None]:
# AUC is a ranking metric, so score positive-class probabilities ([:,1]), not argmax labels
print("dl alone:")
print(roc_auc_score(ty,dl_y_valid[:,1]))
print("rf alone:")
print(roc_auc_score(ty,rf_y_valid[:,1]))
print("xgb alone:")
print(roc_auc_score(ty,xg_y_valid[:,1]))
print("lightgbm alone:")
print(roc_auc_score(ty,lg_y_valid[:,1]))
print("ensemble by average of dl and rf:")
print(roc_auc_score(ty,((dl_y_valid+rf_y_valid)/2)[:,1]))

print("two-to-one ensemble by average of dl and rf:")
print(roc_auc_score(ty,((2*dl_y_valid+rf_y_valid)/3)[:,1]))

print("ensemble by average of dl and xg:")
print(roc_auc_score(ty,((dl_y_valid+xg_y_valid)/2)[:,1]))

print("ensemble by average of dl and lg:")
print(roc_auc_score(ty,((dl_y_valid+lg_y_valid)/2)[:,1]))

print("three-to-one-to-one weighted ensemble of dl, rf, and lg:")
print(roc_auc_score(ty,((3*dl_y_valid+rf_y_valid+lg_y_valid)/5)[:,1]))

print("ensemble by average of dl, rf, and xg:")
print(roc_auc_score(ty,((dl_y_valid+rf_y_valid+xg_y_valid)/3)[:,1]))

print("two-two-one-one weighted ensemble of dl, rf, lightgbm, and xgb:")
print(roc_auc_score(ty,((2*dl_y_valid+2*rf_y_valid+lg_y_valid+xg_y_valid)/6)[:,1]))
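
A note on scoring ensembles: AUC measures how well positives are ranked above negatives, so thresholding blended probabilities into hard labels before scoring throws information away. Here is a toy sketch with a hand-rolled pairwise AUC (not scikit-learn's implementation); the labels, probabilities, and the two "models" are made up.

```python
def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def weighted_average(score_lists, weights):
    """Blend per-row scores from several models using the given weights."""
    total = sum(weights)
    return [sum(w * s for w, s in zip(weights, row)) / total
            for row in zip(*score_lists)]

# Made-up labels and positive-class probabilities from two hypothetical models
y = [0, 0, 1, 1]
model_a = [0.1, 0.3, 0.4, 0.9]
model_b = [0.2, 0.1, 0.3, 0.6]

blend = weighted_average([model_a, model_b], weights=[2, 1])
hard = [int(p > 0.5) for p in blend]  # what thresholding (or argmax) would produce

print("AUC on blended probabilities:", auc(y, blend))  # 1.0
print("AUC on hard labels:", auc(y, hard))             # 0.75
```

The blend ranks every positive above every negative, but one positive falls below the 0.5 threshold, so the hard-label score drops even though the ranking is perfect.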

## Now let's try inference and submit some results

[This tutorial](https://fastai1.fast.ai/tutorial.inference.html#Tabular) only works for single items.

We need to do something like `get_preds` instead.

But first, we need to load the test data.

In [None]:
learn.save('mini_train')
learn.export()
df_test = pd.read_csv(path/"test.csv")
df_test = prep_text_col(df_test, "f_27")
test_dl = learn.dls.test_dl(df_test)
# kaggle competitions submit -c tabular-playground-series-may-2022 -f submission.csv -m "Message"

In [None]:
# get probabilities on the test dataset
dlpreds = learn.get_preds(dl=test_dl)[0]
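# wrap in Tensor so these can be averaged with dlpreds, as in the validation cell
rfpreds = Tensor(m.predict_proba(test_dl.xs))
xgpreds = Tensor(xgb_m_f.predict_proba(test_dl.xs))
lgpreds = Tensor(lg_m.predict_proba(test_dl.xs))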

In [None]:
# ensemble the results
ens_three_preds = ((3*dlpreds+lgpreds+rfpreds)/5).argmax(dim=1)
ens_two_preds = ((3*dlpreds+2*rfpreds)/5).argmax(dim=1)

In [None]:
ens_two_preds_probs = ((3*dlpreds+2*rfpreds)/5)[:,1]

In [None]:
# ens_two_preds_probs is already the 1-D positive-class column, so no further indexing
df_preds = pd.DataFrame({"id":df_test["id"].values, "target": np.asarray(ens_two_preds_probs)})
df_preds.to_csv("2022-05-31_1758_submission.csv", index=False)

## Ideas to try next:

- Do data exploration and modeling in some kind of loop
- Is this the right loss function?

### Data exploration
- [Correlations cross-plot](https://towardsdatascience.com/altair-plot-deconstruction-visualizing-the-correlation-structure-of-weather-data-38fb5668c5b1)
- Find top predictive features (how to do this?)
- Plot some stuff!!
- Find top losses
- [MC dropout](https://docs.fast.ai/callback.preds.html#MCDropoutCallback) and find high-uncertainty rows
- [Pair Plots](https://seaborn.pydata.org/examples/scatterplot_matrix.html) of top features?
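
On the "how to do this?" question for top predictive features, one option is permutation importance: shuffle one column and see how much the score drops. Below is a pure-Python sketch of the idea on toy data. The `model` here is a stand-in lambda, not any of the models trained above, and accuracy stands in for whatever metric is actually used.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, col, n_repeats=20, seed=0):
    """Average drop in accuracy after shuffling the values of one column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled_col = [r[col] for r in rows]
        rng.shuffle(shuffled_col)
        shuffled_rows = [r[:col] + [v] + r[col + 1:]
                         for r, v in zip(rows, shuffled_col)]
        drops.append(baseline - accuracy(model, shuffled_rows, labels))
    return sum(drops) / n_repeats

# Toy data: the label equals column 0, column 1 is unrelated noise
rows = [[i % 2, (i * 7) % 5] for i in range(40)]
labels = [r[0] for r in rows]
model = lambda r: r[0]  # hypothetical model that only looks at column 0

imp0 = permutation_importance(model, rows, labels, col=0)
imp1 = permutation_importance(model, rows, labels, col=1)
print("importance of column 0:", imp0)
print("importance of column 1:", imp1)
```

Because it only needs predictions, this approach would also work for models that don't expose feature importances.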

### Modeling
1. Random forests
2. XGBoost
3. Random forests using an embedding from DL model
4. Combine some models (average)

### DL refinements
1. Learning rate finder
2. Schedule learning rate (cosine saw or something)
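
For the learning-rate schedule idea, here is a rough pure-Python sketch of a one-cycle-style schedule: cosine warm-up to a peak, then cosine annealing down. The function name and the defaults (`pct_start`, `div`, `div_final`) are my assumptions modeled loosely on `fit_one_cycle`; this is not fastai's actual implementation.

```python
import math

def one_cycle_lr(step, total_steps, lr_max, pct_start=0.25, div=25.0, div_final=1e5):
    """Cosine warm-up from lr_max/div to lr_max, then cosine anneal to lr_max/div_final."""
    def cos_interp(start, end, pos):
        # pos runs from 0 (returns start) to 1 (returns end)
        return start + (1 - math.cos(math.pi * pos)) / 2 * (end - start)

    warm_steps = pct_start * total_steps
    if step < warm_steps:
        return cos_interp(lr_max / div, lr_max, step / warm_steps)
    return cos_interp(lr_max, lr_max / div_final,
                      (step - warm_steps) / (total_steps - warm_steps))

schedule = [one_cycle_lr(s, total_steps=100, lr_max=0.1) for s in range(101)]
print("start:", schedule[0])
print("peak:", max(schedule), "at step", schedule.index(max(schedule)))
print("end:", schedule[-1])
```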
