### AutoGluon - AutoML framework

AutoGluon emphasizes ensembling over hyperparameter tuning. To improve model performance, we can typically either tune hyperparameters to find the best set for the data, or ensemble models through bagging, boosting, and stacking.

However, an exhaustive search over a large hyperparameter space can be very time-consuming. Moreover, if your training data changes, the hyperparameters you found may no longer be the best, and you would have to search for them again.
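To make the cost concrete, here is a rough sketch of how quickly an exhaustive grid grows. The grid is hypothetical: the parameter names and value lists are illustrative and not taken from this notebook, and the fold count and per-fit time are assumed figures.

```python
from math import prod

# Hypothetical search grid for a gradient-boosting model (illustrative values only).
grid = {
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
    "max_depth": [4, 6, 8, 10],
    "n_estimators": [200, 500, 1000],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
}

n_folds = 5          # assumed cross-validation folds per combination
minutes_per_fit = 2  # assumed average training time for a single fit

# Every combination of values is one candidate configuration.
n_combinations = prod(len(values) for values in grid.values())
n_fits = n_combinations * n_folds
total_hours = n_fits * minutes_per_fit / 60

print(f"{n_combinations} combinations, {n_fits} fits, about {total_hours:.0f} hours")
```

With these assumed numbers, five modest parameter lists already give 432 combinations, 2,160 fits, and roughly 72 hours of compute, all of which would need repeating whenever the data changes.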

This is why AutoGluon focuses on building deeply stacked ensembles, on the premise that you can still achieve strong model performance without tuning hyperparameters at all.
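As a rough illustration of what combining models looks like in practice, the sketch below blends the class probabilities of three models with a weighted average. It is a simplified stand-in and not AutoGluon's actual stacking procedure; the `blend` helper, the probabilities, and the weights are all made up for the example.

```python
def blend(prob_rows, weights):
    """Weighted average of several models' class-probability vectors for one sample."""
    total = sum(weights)
    n_classes = len(prob_rows[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_rows)) / total
        for c in range(n_classes)
    ]

# Made-up predictions from three hypothetical models for a single 3-class sample.
model_probs = [
    [0.70, 0.20, 0.10],
    [0.50, 0.30, 0.20],
    [0.60, 0.10, 0.30],
]
weights = [0.5, 0.3, 0.2]  # hypothetical ensemble weights

blended = blend(model_probs, weights)
print([round(p, 3) for p in blended])
```

Because each input vector sums to 1 and the weights are normalized, the blended vector is again a valid probability distribution.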

Tutorials: https://auto.gluon.ai/dev/tutorials/tabular_prediction/index.html

GitHub: https://github.com/awslabs/autogluon/

In [1]:
!pip install --upgrade mxnet-cu100
!pip install autogluon

Collecting mxnet-cu100
  Downloading mxnet_cu100-1.8.0.post0-py2.py3-none-manylinux2014_x86_64.whl (352.6 MB)
Installing collected packages: mxnet-cu100
Successfully installed mxnet-cu100-1.8.0.post0
Collecting autogluon
  Downloading autogluon-0.2.0-py3-none-any.whl (5.4 kB)
Collecting autogluon.mxnet==0.2.0
  Downloading autogluon.mxnet-0.2.0-py3-none-any.whl (28 kB)
Collecting autogluon.tabular[all]==0.2.0
  Downloading autogluon.tabular-0.2.0-py3-none-any.whl (250 kB)
Collecting autogluon.core==0.2.0
  Downloading autogluon.core-0.2.0-py3-none-any.whl (334 kB)
Collecting autogluon.vision==0.2.0
  Downloading autogluon.vision-0.2.0-py3-none-any.whl (31 kB)
Collecting autogluon.features==0.2.0
  Downloading autogluon.features-0.2.0-py3-none-any.whl (48 kB)

In [2]:
# Load the dataset
from autogluon.tabular import TabularDataset, TabularPredictor
train_data = TabularDataset('../input/tabular-playground-series-jun-2021/train.csv')
test_data = TabularDataset('../input/tabular-playground-series-jun-2021/test.csv')

In [3]:
train_data.head(5)

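| | id | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | ... | feature_66 | feature_67 | feature_68 | feature_69 | feature_70 | feature_71 | feature_72 | feature_73 | feature_74 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 6 | 1 | 0 | 0 | 0 | 0 | 7 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | Class_6 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | Class_6 |
| 2 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | Class_2 |
| 3 | 3 | 0 | 0 | 7 | 0 | 1 | 5 | 2 | 2 | 0 | ... | 0 | 4 | 0 | 2 | 2 | 0 | 4 | 3 | 0 | Class_8 |
| 4 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Class_2 |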


In [4]:
test_data.head(5)

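| | id | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | ... | feature_65 | feature_66 | feature_67 | feature_68 | feature_69 | feature_70 | feature_71 | feature_72 | feature_73 | feature_74 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 200001 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 1 | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| 2 | 200002 | 0 | 1 | 7 | 1 | 0 | 0 | 0 | 0 | 6 | ... | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 2 | 0 | 0 |
| 3 | 200003 | 0 | 0 | 0 | 4 | 3 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 4 | 0 | 0 |
| 4 | 200004 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |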


Some points to note about AutoGluon:
1. You can specify the metric to optimize. In our case it is **log_loss**, passed via the <code>eval_metric</code> argument.
2. You can specify which models to fit. If you don't, AutoGluon trains its default set of model types.
3. You can also specify which models to exclude. Models such as neural networks may take considerably longer to train.
4. It is very important to set a time limit. A limit of **8 hours** is a reasonable choice, since the Kaggle run-time limit is **9 hours** and the kernel needs some time after training to make predictions.
5. Models will run on CPU. **AutoGluon's tabular models, as configured here, do not use the GPU**, so don't waste your GPU run-time by keeping it on!
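The options in the list above can be collected into a single dictionary of keyword arguments. The sketch below uses a hypothetical helper, `build_fit_args`, which is not part of AutoGluon. The `time_limit`, `presets`, and `excluded_model_types` values mirror the ones used in this notebook. Restricting training to chosen models through a `hyperparameters` dict with empty per-model settings is an assumption about how defaults are requested, so check it against the documentation for your AutoGluon version.

```python
def build_fit_args(train_hours=8, exclude_slow_models=False, only_models=None):
    """Collect keyword arguments for TabularPredictor.fit (hypothetical helper)."""
    fit_args = {
        "time_limit": int(train_hours * 60 * 60),  # the notebook passes seconds
        "presets": "best_quality",
    }
    if exclude_slow_models:
        # Same model-type codes as in the notebook's commented-out line.
        fit_args["excluded_model_types"] = ["NN", "FASTAI"]
    if only_models:
        # Assumption: an empty dict per model type requests default hyperparameters.
        fit_args["hyperparameters"] = {name: {} for name in only_models}
    return fit_args

fit_args = build_fit_args(train_hours=8, exclude_slow_models=True)
print(fit_args)

# Intended usage (requires autogluon and the training data):
# predictor = TabularPredictor(label='target', eval_metric='log_loss').fit(train_data, **fit_args)
```

Eight hours is 28,800 seconds, which leaves about one hour of the 9-hour Kaggle limit for inference and writing the submission.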
    

**To get the best predictions, we need to train on 100% of the data.** AutoGluon ensures that **later predictions are made with the best model found during fitting**.

To confirm that, we can split the training data 80/20 and track the performance of the various fitted models.
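A minimal sketch of that 80/20 check is shown below. The helper names `holdout_split` and `fit_and_compare` are hypothetical, and the short `time_limit` is an assumed value chosen only to keep the experiment quick. The sketch fits on the 80% part and asks `predictor.leaderboard` to score every fitted model on the held-out 20%.

```python
import random

def holdout_split(n_rows, holdout_frac=0.2, seed=0):
    """Return (train_idx, holdout_idx): disjoint, sorted lists of row positions."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_holdout = int(round(n_rows * holdout_frac))
    return sorted(idx[n_holdout:]), sorted(idx[:n_holdout])

def fit_and_compare(train_data, label="target", time_limit=600):
    """Fit on 80% of the rows and score each fitted model on the other 20%."""
    # Imported lazily so the split helper can be used without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    train_idx, holdout_idx = holdout_split(len(train_data))
    train_part = train_data.iloc[train_idx]
    holdout_part = train_data.iloc[holdout_idx]

    predictor = TabularPredictor(label=label, eval_metric="log_loss").fit(
        train_part, time_limit=time_limit  # assumed short budget for a quick check
    )
    # One row per fitted model, including its score on the holdout rows.
    return predictor.leaderboard(holdout_part, silent=True)

# Example of the split alone on a toy row count:
train_idx, holdout_idx = holdout_split(10)
print(len(train_idx), len(holdout_idx))
```

Once the leaderboard confirms which models perform best on unseen rows, the final fit can use all of the training data.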

In [5]:
# Fit AutoGluon on the data, using the 'target' column as the label.

label = 'target'
fit_args = {}

# If you want to speed up training, exclude neural network models via:
# fit_args['excluded_model_types'] = ['NN', 'FASTAI']

predictor = TabularPredictor(label=label, eval_metric='log_loss').fit(
    train_data,
    time_limit=60 * 60 * 8,  # 8 hours, in seconds
    presets='best_quality',
    **fit_args,
)

Next, we make predictions with the best model found during training.

In [6]:
# Get predicted class probabilities
probs = predictor.predict_proba(test_data, as_multiclass=True)
probs

| | Class_1 | Class_2 | Class_3 | Class_4 | Class_5 | Class_6 | Class_7 | Class_8 | Class_9 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.057734 | 0.407499 | 0.153324 | 0.025760 | 0.013165 | 0.155570 | 0.023386 | 0.046606 | 0.116956 |
| 1 | 0.043478 | 0.081267 | 0.061593 | 0.022088 | 0.014723 | 0.265655 | 0.085091 | 0.289005 | 0.137100 |
| 2 | 0.020708 | 0.030995 | 0.017041 | 0.012792 | 0.007381 | 0.699861 | 0.030195 | 0.128380 | 0.052647 |
| 3 | 0.046635 | 0.114253 | 0.083972 | 0.032089 | 0.018137 | 0.246149 | 0.079192 | 0.219304 | 0.160270 |
| 4 | 0.043450 | 0.110034 | 0.076845 | 0.024773 | 0.015059 | 0.301321 | 0.067011 | 0.217970 | 0.143539 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99995 | 0.070833 | 0.364333 | 0.149832 | 0.029730 | 0.016357 | 0.109136 | 0.037636 | 0.081638 | 0.140506 |
| 99996 | 0.049903 | 0.232992 | 0.121352 | 0.029223 | 0.016334 | 0.192075 | 0.052516 | 0.147747 | 0.157859 |
| 99997 | 0.068199 | 0.251798 | 0.125259 | 0.030930 | 0.017384 | 0.182502 | 0.048262 | 0.118593 | 0.157074 |
| 99998 | 0.033203 | 0.024842 | 0.020395 | 0.013445 | 0.011014 | 0.376474 | 0.073458 | 0.369144 | 0.078025 |
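Since the competition metric is log loss, it can help to see how these probabilities would be scored. The sketch below is a plain-Python version of multiclass log loss applied to two rows copied from the table above. The true labels are hypothetical, because the real test labels are not available, so the resulting number is illustrative only.

```python
import math

def multiclass_log_loss(true_labels, prob_rows, class_names, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each row."""
    total = 0.0
    for label, probs in zip(true_labels, prob_rows):
        p = probs[class_names.index(label)]
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(true_labels)

class_names = [f"Class_{i}" for i in range(1, 10)]

# Rows with index 0 and 2 from the probability table above.
rows = [
    [0.057734, 0.407499, 0.153324, 0.025760, 0.013165, 0.155570, 0.023386, 0.046606, 0.116956],
    [0.020708, 0.030995, 0.017041, 0.012792, 0.007381, 0.699861, 0.030195, 0.128380, 0.052647],
]
# Hypothetical true labels, chosen only for illustration.
labels = ["Class_2", "Class_6"]

loss = multiclass_log_loss(labels, rows, class_names)
print(round(loss, 4))
```

Only the probability given to the true class matters for each row, so a confident and correct prediction (0.699861 for the second row) contributes much less loss than a hesitant one (0.407499 for the first).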


In [7]:
import pandas as pd

submit = test_data[['id']]
submit = pd.concat([submit, probs], axis=1)
submit.to_csv('submission.csv',index=False)
