# Tabular Playground Series - Mar 2021

For this competition, we will predict a binary target from the feature columns given in the data. The feature columns `cat0` - `cat18` are categorical, and the feature columns `cont0` - `cont10` are continuous.

## Files
- `train.csv` - the training data with the `target` column
- `test.csv` - the test set; you will be predicting the `target` for each row in this file (the probability of the binary target)
- `sample_submission.csv` - a sample submission file in the correct format

## EDA
Alright, let's start by importing libs, reading in and inspecting the train dataset!

In [None]:
# import useful libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

pd.set_option('display.max_colwidth', None)  # None = never truncate cell contents (-1 is deprecated)

import warnings
warnings.simplefilter('ignore')

In [None]:
# set the data path and seed, then load in the data
BASE = '../input/tabular-playground-series-mar-2021/'
SEED = 2021

train = pd.read_csv(f'{BASE}train.csv')
test = pd.read_csv(f'{BASE}test.csv')
ss = pd.read_csv(f'{BASE}sample_submission.csv')

In [None]:
# create vars that group columns by data type
ID_COL, TARGET_COL = 'id', 'target'

features = [c for c in train.columns if c not in [ID_COL, TARGET_COL]]

cat_cols = [f'cat{i}' for i in range(19)]
print(cat_cols)

num_cols = [f'cont{i}' for i in range(11)]
print(num_cols)

In [None]:
train.info() # looks like we have no null values and pandas correctly parsed out all columns! :)

In [None]:
train.describe() # continuous columns already sit on similar scales, so standard scaling probably won't make much of a difference
# (still worth trying for scale-sensitive models, such as Logistic Regression or Neural Nets)
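
If you do want to try scaling for a scale-sensitive model, the idea is to compute the statistics on the training data only and reuse them for the test data. Below is a minimal sketch of that, run on a tiny made-up frame; the helper name `standardize` and the toy values are assumptions for illustration, not part of the notebook's pipeline. On the real data you would pass the pandas `train`, `test` and `num_cols`.

```python
import pandas as pd

def standardize(train_df, test_df, cols):
    """Scale cols to zero mean / unit variance using training statistics only."""
    mean = train_df[cols].mean()
    std = train_df[cols].std(ddof=0)  # population std, one value per column
    train_scaled = train_df.copy()
    test_scaled = test_df.copy()
    train_scaled[cols] = (train_df[cols] - mean) / std
    test_scaled[cols] = (test_df[cols] - mean) / std  # reuse train mean/std
    return train_scaled, test_scaled

# toy stand-in for the continuous columns (hypothetical values)
toy_train = pd.DataFrame({'cont0': [0.0, 0.5, 1.0], 'cont1': [0.2, 0.4, 0.6]})
toy_test = pd.DataFrame({'cont0': [0.5], 'cont1': [0.8]})

toy_train_scaled, toy_test_scaled = standardize(toy_train, toy_test, ['cont0', 'cont1'])
print(toy_train_scaled)
print(toy_test_scaled)
```

Tree-based models are largely insensitive to this kind of monotonic rescaling, which is why it is optional here.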

In [None]:
from pandas_profiling import ProfileReport  # note: newer releases of this package are published as ydata_profiling

report_train = ProfileReport(train)

report_train

In [None]:
report_test = ProfileReport(test)

report_test

### Observations

- `cat11` - `cat18` and `cat0` are relatively low cardinality variables (maximum of 4 unique values)
- the rest of the `cat` variables have 13-295 unique values; dummy encoding them would create a really wide dataset
- continuous variables have all kinds of distributions, mostly with multiple peaks, but they all lie between 0 and 1, so we are not going to rescale them
- `target` binary variable is imbalanced, with 73.5% of values = 0 and 26.5% = 1
- the Interactions and Correlations sections of the report are worth clicking through, as basic models will have a hard time learning those relationships
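
The cardinality and class-balance figures above can also be recomputed directly with pandas rather than read off the profile report. Here is a minimal sketch on a tiny made-up frame; the helper names `summarize_categoricals` and `one_hot_width` and the toy values are assumptions for illustration. On the real data you would pass the pandas `train` and `cat_cols`.

```python
import pandas as pd

def summarize_categoricals(df, cat_cols):
    """Return the number of unique values per categorical column, highest first."""
    return df[cat_cols].nunique().sort_values(ascending=False)

def one_hot_width(df, cat_cols):
    """Number of columns that full dummy encoding of cat_cols would create."""
    return int(df[cat_cols].nunique().sum())

# toy stand-in for the competition data (hypothetical values)
toy = pd.DataFrame({
    'cat0': ['A', 'B', 'A', 'B'],
    'cat1': ['x', 'y', 'z', 'w'],
    'target': [0, 0, 0, 1],
})
toy_cat_cols = ['cat0', 'cat1']

cardinality = summarize_categoricals(toy, toy_cat_cols)
width = one_hot_width(toy, toy_cat_cols)
balance = toy['target'].value_counts(normalize=True)  # share of each class

print(cardinality)
print(f'dummy encoding would create {width} columns')
print(balance)
```

Summing the unique counts gives a quick estimate of how wide the dataset would get before committing to dummy encoding.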

## Baseline Model using h2o AutoML
Alright, after basic EDA of all variables, it's time to introduce h2o AutoML to set a baseline model.

### What is h2o?
h2o is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment. H2O's core code is written in Java.

It also offers an easy to use AutoML (Automatic Machine Learning) pipeline that quickly iterates through various models and then stacks them to provide a reasonable baseline.

<img src="https://www.h2o.ai/wp-content/uploads/2020/05/h2o-automl-logo_.jpg">

Let's apply the h2o library to the Mar 2021 binary classification competition!

In [None]:
# starting H2O
import h2o
print(h2o.__version__)
from h2o.automl import H2OAutoML

h2o.init(max_mem_size='16G')

In [None]:
%%time
# import data using h2o
train = h2o.import_file(f'{BASE}train.csv')
test = h2o.import_file(f'{BASE}test.csv')

In [None]:
# check out h2o auto description
train.describe()

In [None]:
# target column needs to be enum (categorical) type so that h2o treats this as classification rather than regression, so we encode it as factor
train[TARGET_COL] = train[TARGET_COL].asfactor()
train.describe()

In [None]:
# run AutoML until 1000 base models are trained or 31000 seconds (~8.6 hours) have elapsed, whichever comes first
aml = H2OAutoML(max_models=1000, seed=SEED, max_runtime_secs=31000, project_name='TPSMar2021')
aml.train(x=features, y=TARGET_COL, training_frame=train)

In [None]:
# view the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # print all rows instead of default (10 rows)

In [None]:
# the leader model is stored here
aml.leader

In [None]:
# predict and save submission
preds_test = aml.predict(test)
ss[TARGET_COL] = preds_test['p1'].as_data_frame().values.flatten()
ss.to_csv('submission.csv', index=False)
ss.head()
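
Before uploading, it can be worth checking that the submission still matches the layout of the sample file. The following is a minimal sketch of such a check on made-up frames; the helper name `check_submission` and the toy values are assumptions for illustration. On the real data you would compare `ss` against a fresh read of `sample_submission.csv`.

```python
import pandas as pd

def check_submission(sub, sample, id_col='id', target_col='target'):
    """Raise AssertionError if sub does not match the sample file's layout."""
    assert list(sub.columns) == list(sample.columns), 'column mismatch'
    assert len(sub) == len(sample), 'row count mismatch'
    assert sub[id_col].equals(sample[id_col]), 'ids differ or are reordered'
    assert sub[target_col].notna().all(), 'missing predictions'
    assert sub[target_col].between(0, 1).all(), 'probabilities outside [0, 1]'
    return True

# toy stand-ins for the sample file and two candidate submissions (hypothetical values)
toy_sample = pd.DataFrame({'id': [5, 6, 8], 'target': [0.5, 0.5, 0.5]})
toy_good = pd.DataFrame({'id': [5, 6, 8], 'target': [0.12, 0.87, 0.40]})
toy_bad = pd.DataFrame({'id': [5, 6, 8], 'target': [0.12, 1.70, 0.40]})

print(check_submission(toy_good, toy_sample))

try:
    check_submission(toy_bad, toy_sample)
    bad_passed = True
except AssertionError as err:
    bad_passed = False
    print(f'rejected: {err}')
```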

In [None]:
# and we're done! Please upvote! :)
'Done'