### AutoGluon - AutoML framework

AutoGluon emphasizes ensembling over hyperparameter tuning. There are typically two ways to improve model performance: tune hyperparameters to find the set that best fits the data, or ensemble models through bagging, boosting and stacking.

However, an exhaustive search over a large hyperparameter space can be highly time-consuming. In addition, if your training data changes, the hyperparameters you found may no longer be the best ones, and you would have to search for them again.

This is why AutoGluon focuses on building multi-layer stacked ensembles, on the premise that you can still achieve strong model performance without tuning hyperparameters at all.

Tutorials: https://auto.gluon.ai/dev/tutorials/tabular_prediction/index.html

GitHub: https://github.com/awslabs/autogluon/

In [None]:
!pip install --upgrade mxnet-cu100
!pip install autogluon

In [None]:
# Load the dataset
from autogluon.tabular import TabularDataset, TabularPredictor
train_data = TabularDataset('../input/tabular-playground-series-jun-2021/train.csv')
test_data = TabularDataset('../input/tabular-playground-series-jun-2021/test.csv')

In [None]:
train_data.head(5)

In [None]:
test_data.head(5)

Some pointers to note about AutoGluon:
1. You can specify the metric that you want to track. In our case it is **log_loss**, set through the <code>eval_metric</code> argument.
2. You can specify which models to fit. If you do not, AutoGluon iterates over all the algorithms it trains by default.
3. You can also specify which models to exclude. Models such as neural networks may take considerably longer to train.
4. It is very important to specify a time limit. A limit of **8 hours** is a sensible choice, since the Kaggle run-time limit is **9 hours** and the kernel still needs time to make predictions after training finishes.
5. Models will run on CPU. **AutoGluon is currently not GPU-compatible**, so don't waste your GPU run-time by keeping it on!
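
To make the metric in point 1 concrete, here is a minimal sketch of multi-class log loss in plain Python. The function name `multiclass_log_loss`, the clipping constant `eps`, and the toy labels and probabilities are illustrative assumptions, not part of AutoGluon's API.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of true class labels, one per row.
    probs:  list of dicts mapping class label -> predicted probability.
    eps:    clipping bound so that log(0) is never evaluated (assumed value).
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip the probability of the true class away from 0 and 1.
        p = min(max(row[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(y_true)

# Toy example with two rows and two classes (made-up values).
y_true = ["Class_1", "Class_2"]
probs = [
    {"Class_1": 0.7, "Class_2": 0.3},
    {"Class_1": 0.4, "Class_2": 0.6},
]
score = multiclass_log_loss(y_true, probs)
print(score)
```

Lower is better: a confident wrong prediction is penalised far more heavily than a hesitant one, which is why well-calibrated probabilities matter for this metric.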
    

**To get the best predictions, we need to train on 100% of the data.** AutoGluon ensures that **later predictions are made with the best model found during fitting**.

To confirm that, we can split the training data 80/20 and track the performance of the various fitted models on the held-out 20%.
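
As a rough sketch of such an 80/20 split, the helper below shuffles row positions and divides them into a training part and a hold-out part. The name `holdout_split_indices`, its parameters, and the seed are hypothetical choices for illustration; the notebook itself does not perform this split.

```python
import random

def holdout_split_indices(n_rows, holdout_frac=0.2, seed=0):
    """Return (train_idx, holdout_idx): disjoint, sorted row positions."""
    if not 0.0 < holdout_frac < 1.0:
        raise ValueError("holdout_frac must be strictly between 0 and 1")
    indices = list(range(n_rows))
    # A seeded shuffle keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_holdout = int(round(n_rows * holdout_frac))
    holdout_idx = sorted(indices[:n_holdout])
    train_idx = sorted(indices[n_holdout:])
    return train_idx, holdout_idx

# Toy example: 10 rows split 80/20.
train_idx, holdout_idx = holdout_split_indices(n_rows=10, holdout_frac=0.2, seed=42)
print(len(train_idx), len(holdout_idx))
```

With a pandas-style frame, these positions could be passed to `train_data.iloc[...]` to fit on the 80% part, and the 20% part could then be given to `predictor.leaderboard(...)` to compare the fitted models on unseen rows.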

In [None]:
# Fit AutoGluon on the data, using the 'target' column as the label.

label = 'target'
fit_args = {}

# If you want to speed up training, exclude neural network models via:
# fit_args['excluded_model_types'] = ['NN', 'FASTAI']

predictor = TabularPredictor(label=label, eval_metric='log_loss').fit(
    train_data,
    time_limit=60 * 60 * 8,  # 8 hours, expressed in seconds
    presets='best_quality',
    **fit_args,
)
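
The `time_limit` passed to `fit` is given in seconds. A small sketch of the arithmetic behind the 8-hour figure follows; the helper name `training_time_limit_seconds` and the one-hour buffer constant are assumptions made for illustration, derived from the 9-hour run-time limit mentioned earlier.

```python
KAGGLE_RUNTIME_LIMIT_HOURS = 9  # run-time limit quoted in this notebook
PREDICTION_BUFFER_HOURS = 1     # assumed head-room for predicting and saving

def training_time_limit_seconds(runtime_limit_hours, buffer_hours):
    """Seconds available for training after reserving a prediction buffer."""
    if buffer_hours >= runtime_limit_hours:
        raise ValueError("buffer must be smaller than the run-time limit")
    return int((runtime_limit_hours - buffer_hours) * 60 * 60)

time_limit = training_time_limit_seconds(
    KAGGLE_RUNTIME_LIMIT_HOURS, PREDICTION_BUFFER_HOURS
)
print(time_limit)
```

Deriving the value from named constants makes it easier to adjust the budget if the platform limit or the expected prediction time changes.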

Making predictions with the best model trained so far. 

In [None]:
# Get predicted class probabilities (one column per class)
probs = predictor.predict_proba(test_data, as_multiclass=True)
probs

In [None]:
import pandas as pd

# Combine the test ids with the per-class probabilities and write the submission file
submit = test_data[['id']]
submit = pd.concat([submit, probs], axis=1)
submit.to_csv('submission.csv', index=False)