## Tabular Playground Series May 2021

<img src="https://i.imgur.com/uHVJtv0.png">
<br>
<p style="text-align:center;"><img src="https://i.imgur.com/4SxnawE.png" align="center"></p>

<br><br>

### Notebook Contents:

Having participated in some of the latest TPS challenges, I noticed that more and more people are using AutoML libraries, not just to achieve strong performance but also to get a good starting point for further analysis.

In the [April TPS challenge](https://www.kaggle.com/c/tabular-playground-series-apr-2021/overview) alone, several great notebooks used AutoML:

<ul>
    <li><a href="https://www.kaggle.com/alexryzhkov/n3-tps-april-21-lightautoml-starter">LightAutoML</a></li>
    <li><a href="https://www.kaggle.com/sureshmecad/tps-apr21-h2oautoml">H2OAutoML</a></li>
    <li><a href="https://www.kaggle.com/mt77pp/mljar-automl-tps-apr-21">MLJAR</a></li>
    <li><a href="https://www.kaggle.com/subinium/how-to-use-pycaret-with-feature-engineering">PyCaret</a></li>
</ul>

There was also [this](https://www.kaggle.com/andreshg/tps-apr-automl-libraries-comparison) awesome notebook comparing all of them. Please upvote them if you find them useful; there is definitely a lot to learn (at least for me) from all of these authors.

Having used PyCaret in some of my own projects, I decided to give it a try here, keeping things as simple and lean as possible.

In short: ***PyCaret*** *is a machine learning library that handles everything from data preprocessing to model search to hyperparameter optimization. You need little more than the input data*.

<div id="toc_container" style="background: #f9f9f9; border: 1px solid #aaa; display: table; font-size: 95%;
                               margin-bottom: 1em; padding: 20px; width: auto;">
<p class="toc_title" style="font-weight: 700; text-align: center">Notebook Contents</p>
<ul class="toc_list">
  <li><a href="#loading">0. Imports, Data Loading and Preprocessing</a></li>
  <li><a href="#pycaret">1. PyCaret </a>
      <br>
      <ul>
    <li><a href="#setup">1.0 Setup</a></li>
    <li><a href="#model_search">1.1 Model Search</a></li>
    <li><a href="#tuning">1.2 Model Tuning</a></li>
  </ul>
</li>
<li><a href="#submission">2. Submission</a></li>
</ul>
</div>

<a id="loading"></a>

##### 0. Imports, Data Loading and Preprocessing

In [None]:
!pip install pycaret
!pip install ngboost

In [None]:
import pycaret
import numpy as np
import pandas as pd
pd.options.display.max_columns = 100
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import QuantileTransformer, StandardScaler, PolynomialFeatures, LabelEncoder
from sklearn.tree import DecisionTreeRegressor
dtr_friedman_3 = DecisionTreeRegressor(criterion='friedman_mse', max_depth=3)
import warnings
warnings.filterwarnings('ignore')
import tqdm
import gc
import os
root_path = '/kaggle/input/tabular-playground-series-may-2021'

In [None]:
#data loading
train = pd.read_csv(os.path.join(root_path, 'train.csv'))
test = pd.read_csv(os.path.join(root_path, 'test.csv'))
sample_submission = pd.read_csv(os.path.join(root_path, 'sample_submission.csv'))

---

<a id="pycaret"></a>

### PyCaret

<p style="text-align:center;"><img src="https://i.imgur.com/4SxnawE.png" width="50%"></p>

_PyCaret is an open-source, low-code machine learning library in Python that aims to reduce the cycle time from hypothesis to insights. It is well suited for seasoned data scientists who want to increase the productivity of their ML experiments by using PyCaret in their workflows or for citizen data scientists and those new to data science with little or no background in coding. PyCaret allows you to go from preparing your data to deploying your model within seconds using your choice of notebook environment._

Look [here](https://pycaret.org/guide/) to start with PyCaret.

[Here](https://pycaret.readthedocs.io/en/latest/modules.html) you can find tutorial notebooks for different tasks, including Classification, Regression and Anomaly Detection.

I would also recommend [this](https://www.learndatasci.com/tutorials/introduction-pycaret-machine-learning/) tutorial and especially [this](https://www.kdnuggets.com/2020/11/5-things-doing-wrong-pycaret.html) article.

In [None]:
from pycaret.classification import setup, compare_models, predict_model
from pycaret.classification import create_model, tune_model, plot_model, pull, models

<a id = 'setup'></a>
*Before doing anything we must call the [**setup**](https://pycaret.org/classification/) function*:

_This function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in pycaret. It takes two mandatory parameters: dataframe {array-like, sparse matrix} and name of the target column. All other parameters are optional._

It has tons of parameters...

In [None]:
#expand to see all parameters
?setup

You can specify how to handle categorical, ordinal, numeric, and high-cardinality features, how to deal with missing values and collinearity, whether to use polynomial features, how to perform cross-validation, and whether to handle outliers or imbalanced data.

But if you don't want to think about any of that and would rather let PyCaret handle everything, **you need just 2 parameters**:

`data`: your pandas DataFrame with the input data, on which the preprocessing will be done and the models will be trained and evaluated;

`target`: the target column, the one which will be predicted

In [None]:
NFOLDS = 5
#SAMPLE_SIZE = 10000
#Just for demonstration purposes I will use these parameters, change them accordingly

In [None]:
preprocessing = setup(data = train, ignore_features= ['id'],
                      fold=NFOLDS, target = 'target', silent = True) 

<a id ="model_search"></a>

---

<h5> Model Search </h5>

Once the preprocessing is done through the `setup` function, we can compare the models, using `compare_models`. 

We just need to pick a metric to rank the models.
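
The choice of metric matters here: this competition is scored on multi-class log loss, which rewards well-calibrated probabilities, whereas `Accuracy` only looks at the top predicted class. Below is a minimal, dependency-free sketch of the difference. The helpers `multiclass_log_loss` and `accuracy` and the toy numbers are made up for illustration and are not part of PyCaret:

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

def accuracy(y_true, probs):
    """Share of rows whose highest-probability class is the true one."""
    hits = sum(max(row, key=row.get) == label for label, row in zip(y_true, probs))
    return hits / len(y_true)

y_true = ["Class_1", "Class_2", "Class_2"]

confident = [
    {"Class_1": 0.90, "Class_2": 0.10},
    {"Class_1": 0.10, "Class_2": 0.90},
    {"Class_1": 0.99, "Class_2": 0.01},  # confidently wrong
]
cautious = [
    {"Class_1": 0.60, "Class_2": 0.40},
    {"Class_1": 0.40, "Class_2": 0.60},
    {"Class_1": 0.60, "Class_2": 0.40},  # wrong, but not overconfident
]

for name, probs in [("confident", confident), ("cautious", cautious)]:
    print(name, round(accuracy(y_true, probs), 3), round(multiclass_log_loss(y_true, probs), 3))
```

Both sets of predictions have the same accuracy, but the single overconfident mistake makes the log loss of the first set much worse, so ranking models by `Accuracy` may not pick the one that scores best on the leaderboard.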

To see a list of available models you just need to call the `models` function:

In [None]:
models()

You can also include custom models, for instance `ngboost`:

In [None]:
from ngboost import NGBClassifier
from ngboost.distns import k_categorical

#ngb parameters from here: https://www.kaggle.com/tomwarrens/ngboost-probabilistic-predictions-tps-may-21?scriptVersionId=61814577
ngb_model = NGBClassifier(**{"random_state": 42, "Dist": k_categorical(4), 
                           "verbose": True, "verbose_eval": 100, "n_estimators": 500,
                           "Base": dtr_friedman_3, "natural_gradient": False,
                           "col_sample": 0.8756820351378953, "minibatch_frac": 0.3791506299009752,
                           "learning_rate": 0.1})

ngb_model = create_model(ngb_model, cross_validation=True, fold=NFOLDS)

model_comparison = compare_models(include = ['lr', 'catboost', 'lightgbm', 'ada', 'ridge', 'gbc', ngb_model], 
                                  n_select = 2,
                                  sort='Accuracy', fold = NFOLDS, verbose = True)

If you don't want to see all the printed output, just set `verbose=False` and use the `pull` function:

In [None]:
pull()

`compare_models` returns a list of the top `n_select` models, already trained. That allows you to predict your test data labels right away through the `predict_model` function:

In [None]:
print(model_comparison[0])
predict_model(model_comparison[0], test.sample(100), raw_score = True)

---


We can also directly create a model using the `create_model` function.

In [None]:
model = create_model('catboost', cross_validation=True, fold=NFOLDS)

In [None]:
plot_model(model, 'error')

In [None]:
plot_model(model, plot='feature')

---

<a id = "tuning"></a>

<h5> Model Tuning </h5>

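PyCaret also lets you tune a model after it has been trained once (as we did with the `create_model` function or the `compare_models` one). By default, `tune_model` runs a randomized search over the hyperparameter grid, evaluating `n_iter` candidates with cross-validation, rather than an exhaustive grid search.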

You can either provide your own parameters in `custom_grid` or let PyCaret choose the search space itself.
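
To make `n_iter` concrete: assuming the search is randomized (which, to my knowledge, is the PyCaret 2.x default through scikit-learn), only up to `n_iter` combinations are drawn from the grid instead of all of them. Here is a rough, standard-library-only sketch of that sampling idea. `sample_grid` is a hypothetical helper, not a PyCaret or scikit-learn function:

```python
import itertools
import random

def sample_grid(grid, n_iter, seed=0):
    """Draw up to n_iter distinct parameter combinations from a grid of lists."""
    keys = sorted(grid)
    combos = [dict(zip(keys, values))
              for values in itertools.product(*(grid[k] for k in keys))]
    rng = random.Random(seed)
    # If the grid is smaller than n_iter, every combination is tried once.
    return rng.sample(combos, min(n_iter, len(combos)))

params = {"n_estimators": [10, 30], "max_depth": [5, 6, 7, 8]}

candidates = sample_grid(params, n_iter=10)
print(len(candidates))  # 8: the grid only holds 2 * 4 combinations
for combo in candidates[:3]:
    print(combo)
```

With a grid this small, `n_iter=10` cannot yield more than 8 distinct candidates; a larger grid would be subsampled, which is where the randomized search saves time.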

In [None]:
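#custom search space; these keys assume the model being tuned exposes n_estimators and max_depth
params = {'n_estimators' : [10, 30], 'max_depth': [5, 6, 7, 8]}

#tune the top-ranked model from compare_models, trying up to n_iter candidates from the grid
tuned_model = tune_model(model_comparison[0], optimize='Accuracy', fold=NFOLDS, custom_grid=params, n_iter=10)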

In [None]:
plot_model(tuned_model, 'error')

In [None]:
plot_model(tuned_model, plot='feature')

<a id = "submission"></a>

### Submission

Predictions are made using the `predict_model` function: once again it is very simple, we just need to provide a trained model and the test data.

In [None]:
test_predictions = (predict_model(model_comparison[0], data = test, raw_score = True)
                    [['id', 'Score_Class_1', 'Score_Class_2', 'Score_Class_3', 'Score_Class_4']]
                    .rename(columns = {'Score_Class_1': 'Class_1', 
                                       'Score_Class_2': 'Class_2',
                                       'Score_Class_3': 'Class_3',
                                       'Score_Class_4': 'Class_4'}))

In [None]:
assert len(test_predictions) == len(test)
test_predictions.to_csv('submission.csv', index = False)
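
Before uploading, it can be worth a quick sanity check of the file. The sketch below uses only the standard library and assumes the layout implied above: an `id` column followed by `Class_1` to `Class_4`, with each row holding probabilities that sum to roughly 1. `check_submission` is a hypothetical helper and the in-memory CSV is toy data:

```python
import csv
import io

EXPECTED_COLUMNS = ["id", "Class_1", "Class_2", "Class_3", "Class_4"]

def check_submission(handle, tol=1e-3):
    """Return the number of rows after validating header and probabilities."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    n_rows = 0
    for row in reader:
        probs = [float(row[c]) for c in EXPECTED_COLUMNS[1:]]
        if any(p < 0 or p > 1 for p in probs):
            raise ValueError(f"id {row['id']}: probability outside [0, 1]")
        if abs(sum(probs) - 1.0) > tol:
            raise ValueError(f"id {row['id']}: probabilities sum to {sum(probs):.4f}")
        n_rows += 1
    return n_rows

# Toy file standing in for submission.csv
toy = io.StringIO(
    "id,Class_1,Class_2,Class_3,Class_4\n"
    "100000,0.1,0.6,0.2,0.1\n"
    "100001,0.25,0.25,0.25,0.25\n"
)
print(check_submission(toy))  # 2
```

On the real file this would be called as `check_submission(open('submission.csv', newline=''))`, and the returned row count can be compared against `len(test)`.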