### AutoGluon - AutoML framework

AutoGluon emphasizes ensembling over hyperparameter tuning. There are typically two ways to improve model performance: tune hyperparameters to find the set that best fits the data, or ensemble models through bagging, boosting, and stacking.

However, an exhaustive search over a large hyperparameter space can be very time-consuming. Moreover, if your training data changes, the hyperparameters you found may no longer be the best, and you would have to search for them again.

This is why AutoGluon focuses on building heavily stacked ensembles, on the premise that you can still achieve strong model performance without tuning hyperparameters at all.
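
The stacking idea described above can be sketched with scikit-learn. This is only an illustration of the technique, not AutoGluon's internal implementation: the synthetic data, the choice of base models, and the ridge meta-learner are all assumptions made for the example. Every model keeps near-default hyperparameters, and the meta-learner is trained on out-of-fold predictions of the base models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (assumption: stands in for a real tabular dataset)
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

# Base models with near-default hyperparameters, i.e. no tuning
base_models = [
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("ridge", Ridge(alpha=1.0)),
]

# The final estimator is fit on 5-fold out-of-fold predictions of the base models
stack = StackingRegressor(estimators=base_models, final_estimator=Ridge(alpha=1.0), cv=5)
stack.fit(X_tr, y_tr)

val_pred = stack.predict(X_va)
stack_rmse = float(np.sqrt(np.mean((val_pred - y_va) ** 2)))
print(f"stacked validation RMSE: {stack_rmse:.3f}")
```

AutoGluon automates this pattern across many more model families and multiple stack layers, so the sketch should be read as a mental model rather than a replacement.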

Tutorials: https://auto.gluon.ai/dev/tutorials/tabular_prediction/index.html

GitHub: https://github.com/awslabs/autogluon/

In [None]:
%%time

!pip install --upgrade mxnet-cu100
!pip install autogluon

In [None]:
import gc
import os
import shutil
import pandas as pd
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset('../input/tabular-playground-series-jul-2021/train.csv')
test_data = TabularDataset('../input/tabular-playground-series-jul-2021/test.csv')
submit = TabularDataset('../input/tabular-playground-series-jul-2021/sample_submission.csv')

In [None]:
train_data.head(5)

In [None]:
test_data.head(5)

Some points to note about AutoGluon:
1. You can specify the metric to optimize via the <code>eval_metric</code> argument. The competition metric is **RMSLE**, which is not available in the AutoGluon library, so we use **RMSE** instead.
2. You can specify which models to fit. If you do not, AutoGluon iterates over all algorithms in the library.
3. You can also specify which models to exclude. Models such as neural networks may take relatively longer to train.
4. It is very important to specify a time limit. A limit of **~2 hours** for each model (one per target) should work best, since the Kaggle run-time limit is **9 hours** and the kernel still needs time to make predictions after 6 hours of training. The fit cells in this notebook use a shorter limit of 20 minutes per target (<code>time_limit = 60*60/3</code>).
5. Models will run on CPU. **AutoGluon is currently not GPU-compatible**, so don't waste your GPU run-time keeping it on!
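
On point 1: RMSLE is simply RMSE computed on `log1p`-transformed values. The short check below verifies this on made-up numbers. One common workaround that follows from it (an assumption here, not what this notebook does) is to train on `np.log1p(target)` with RMSE and map predictions back with `np.expm1`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Made-up, non-negative targets and predictions
y_true = np.array([1.0, 2.5, 0.4, 7.0])
y_pred = np.array([1.2, 2.0, 0.5, 6.0])

# RMSLE as defined by scikit-learn: sqrt(mean((log1p(y) - log1p(y_hat))^2))
rmsle = mean_squared_log_error(y_true, y_pred) ** 0.5

# The same quantity, computed as RMSE on log1p-transformed values
rmse_on_logs = mean_squared_error(np.log1p(y_true), np.log1p(y_pred)) ** 0.5

# expm1 is the inverse of log1p, used to return to the original scale
roundtrip = np.expm1(np.log1p(y_pred))

print(rmsle, rmse_on_logs)
```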
    

In [None]:
cols = train_data.columns.tolist()
cols.remove('target_carbon_monoxide')
cols.remove('target_benzene')
cols.remove('target_nitrogen_oxides')

X = train_data[cols]
y1 = train_data['target_carbon_monoxide']
y2 = train_data['target_benzene']
y3 = train_data['target_nitrogen_oxides']

train_data1 = pd.concat([X,y1],axis=1)
train_data2 = pd.concat([X,y2],axis=1)
train_data3 = pd.concat([X,y3],axis=1)

train_data1.shape, train_data2.shape, train_data3.shape

**To get the best predictions, we need to train on 100% of the data.** By default, AutoGluon splits your data 80/20 (train/validation), [reference](https://auto.gluon.ai/dev/tutorials/image_prediction/kaggle.html#automatic-training-validation-split). You can therefore choose to refit the best model, selected by validation score, on the complete data (train + validation) using the <code>set_best_to_refit_full=True</code> argument, [reference](https://auto.gluon.ai/api/autogluon.task.html#:~:text=enable%20this%20functionality.-,set_best_to_refit_full,-bool%2C%20default%20%3D%20False).
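
The "select on a validation split, then refit on everything" idea can be sketched outside AutoGluon with scikit-learn. The synthetic data and the two candidate models are assumptions chosen for brevity; the point is only the two-step flow.

```python
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data and an 80/20 train/validation split
X, y = make_regression(n_samples=500, n_features=6, noise=5.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "ridge": Ridge(),
    "gbm": GradientBoostingRegressor(random_state=0),
}

# Step 1: score every candidate on the held-out 20%
val_rmse = {}
for name, model in candidates.items():
    fitted = clone(model).fit(X_tr, y_tr)
    val_rmse[name] = mean_squared_error(y_va, fitted.predict(X_va)) ** 0.5

# Step 2: refit a fresh copy of the winner on 100% of the data
best_name = min(val_rmse, key=val_rmse.get)
final_model = clone(candidates[best_name]).fit(X, y)
print(best_name, val_rmse)
```

The refitted model no longer has a held-out score of its own, which is the trade-off to keep in mind when enabling this behaviour.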

Some pointers about fit arguments:

1. AutoGluon ensures that **later predictions are made with the best model from the fitting history**. Nonetheless, we also explicitly keep only the best model with the <code>keep_only_best</code> argument.
2. We also enable stacking with the <code>auto_stack</code> argument. This takes considerably longer but should give better predictive performance.
3. We also delete all unused models, keeping only the best ones, to save disk space with the <code>save_space</code> argument.

For more information about other arguments, please look at the documentation: https://auto.gluon.ai/api/autogluon.task.html
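
Since the same fit arguments are reused for every target, they can be collected in one place. The helper below is hypothetical (`make_fit_kwargs` is not part of AutoGluon); it only builds a plain dictionary from the argument names and values used in this notebook.

```python
def make_fit_kwargs(time_limit_s, exclude_nn=True):
    """Hypothetical helper: gather this notebook's fit() arguments in one dict."""
    kwargs = {
        "time_limit": time_limit_s,   # seconds per predictor
        "presets": "best_quality",
        "auto_stack": True,           # enable stacking
        "keep_only_best": True,       # keep only the best model
        "save_space": True,           # delete unused artifacts
        "verbosity": 0,
    }
    if exclude_nn:
        # Skip the neural network model types to speed up training
        kwargs["excluded_model_types"] = ["NN", "FASTAI"]
    return kwargs


fit_kwargs = make_fit_kwargs(time_limit_s=60 * 60 / 3)  # 20 minutes
print(fit_kwargs)
```

A fit call would then read `TabularPredictor(label=target, eval_metric='rmse').fit(train_data1, **fit_kwargs)`, with only the label and the training frame changing between targets.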

In [None]:
# Fit AutoGluon on the data, using the 'target' column as the label.

target = 'target_carbon_monoxide'
fit_args = {}

# If you want to speed up training, exclude neural network models via:
fit_args['excluded_model_types'] = ['NN', 'FASTAI']

predictor1 = TabularPredictor(label=target, eval_metric='rmse').fit(train_data1, time_limit = 60*60/3, presets='best_quality', auto_stack=True, 
                                                                   keep_only_best=True, save_space=True, **fit_args, verbosity=0)

predictor1.leaderboard(silent=True, extra_info=False)

In [None]:
# Fit AutoGluon on the data, using the 'target' column as the label.

target = 'target_benzene'
fit_args = {}

# If you want to speed up training, exclude neural network models via:
fit_args['excluded_model_types'] = ['NN', 'FASTAI']

predictor2 = TabularPredictor(label=target, eval_metric='rmse').fit(train_data2, time_limit = 60*60/3, presets='best_quality', auto_stack=True, 
                                                                   keep_only_best=True, save_space=True, **fit_args, verbosity=0)

predictor2.leaderboard(silent=True, extra_info=False)

In [None]:
# Fit AutoGluon on the data, using the 'target' column as the label.

target = 'target_nitrogen_oxides'
fit_args = {}

# If you want to speed up training, exclude neural network models via:
fit_args['excluded_model_types'] = ['NN', 'FASTAI']

predictor3 = TabularPredictor(label=target, eval_metric='rmse').fit(train_data3, time_limit = 60*60/3, presets='best_quality', auto_stack=True, 
                                                                   keep_only_best=True, save_space=True, **fit_args, verbosity=0)

predictor3.leaderboard(silent=True, extra_info=False)

**Making predictions with the best models trained so far.**

In [None]:
submit['target_carbon_monoxide'] = predictor1.predict(test_data)
submit['target_benzene'] = predictor2.predict(test_data)
submit['target_nitrogen_oxides'] = predictor3.predict(test_data)
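
A regressor trained with RMSE can output negative values, while RMSLE is designed for non-negative targets and predictions. The LightAutoML section of this notebook clips its predictions at zero with `np.clip`; the sketch below shows the same guard on made-up numbers. Applying it to the AutoGluon predictions above would be an optional extra step, not something the cell above does.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up values: the first raw prediction is slightly negative
y_true = np.array([0.5, 1.0, 2.0])
raw_pred = np.array([-0.1, 1.1, 1.8])

# Clip at zero so every prediction is valid for a log-based metric
clipped = np.clip(raw_pred, 0, None)

rmsle = mean_squared_log_error(y_true, clipped) ** 0.5
print(clipped, rmsle)
```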

In [None]:
submit.head()

In [None]:
submit.to_csv('submission.csv',index=False)

In [None]:
shutil.rmtree('AutogluonModels')

del predictor1
del predictor2
del predictor3

gc.collect()

### LightAutoML

In [None]:
!pip install -U lightautoml

In [None]:
# Standard python libraries
import os
import time

# Essential DS libraries
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_log_error
import matplotlib.pyplot as plt
import torch

# LightAutoML presets, task and report generation
from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.tasks import Task
from lightautoml.dataset.roles import DatetimeRole
from lightautoml.report.report_deco import ReportDeco

In [None]:
N_THREADS = 4
N_FOLDS = 5
RANDOM_STATE = 42
TIMEOUT = 60*60
TARGET_NAME = 'target'

In [None]:
np.random.seed(RANDOM_STATE)
torch.set_num_threads(N_THREADS)

In [None]:
train_data = pd.read_csv('../input/tabular-playground-series-jul-2021/train.csv')
test_data = pd.read_csv('../input/tabular-playground-series-jul-2021/test.csv')
sample_sub = pd.read_csv('../input/tabular-playground-series-jul-2021/sample_submission.csv')

def rmsle_metric(y_true, y_pred, **kwargs):
    return mean_squared_log_error(y_true, np.clip(y_pred, 0, None), **kwargs) ** 0.5

task = Task('reg', loss = 'rmsle', metric = rmsle_metric)

targets_and_drop = {
    'target_carbon_monoxide': ['target_benzene', 'target_nitrogen_oxides'],
    'target_benzene': ['target_carbon_monoxide', 'target_nitrogen_oxides'],
    'target_nitrogen_oxides': ['target_carbon_monoxide', 'target_benzene']
}

roles = {
    DatetimeRole(base_date=False, base_feats=True, seasonality=('d', 'wd', 'hour')): 'date_time'
}

importances = {}
dt = pd.to_datetime(train_data['date_time'])
for targ in targets_and_drop:
    print('='*50, '='*50, sep = '\n')
    automl = TabularAutoML(task = task, 
                           timeout = TIMEOUT,
                           cpu_limit = N_THREADS,
                           reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
                           general_params={'use_algos': [['lgb', 'lgb_tuned', 'cb', 'cb_tuned']]}
                          )

    roles['target'] = targ
    roles['drop'] = targets_and_drop[targ]
    
    if targ == 'target_nitrogen_oxides':
        oof_pred = automl.fit_predict(train_data[dt >= np.datetime64('2010-09-01')], roles = roles)
    else:
        oof_pred = automl.fit_predict(train_data, roles = roles)
    print('oof_pred:\n{}\nShape = {}'.format(oof_pred, oof_pred.shape))
    
    # Fast feature importances calculation
    fast_fi = automl.get_feature_scores('fast')
    importances[targ] = fast_fi
    
    test_pred = automl.predict(test_data)
    print('Prediction for test_data:\n{}\nShape = {}'.format(test_pred, test_pred.shape))
    
    sample_sub[targ] = np.clip(test_pred.data[:, 0], 0, None)
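
The `seasonality=('d', 'wd', 'hour')` setting in the `DatetimeRole` above asks LightAutoML to derive calendar features from `date_time`. As we read it, the three codes correspond to day of month, weekday, and hour; the pandas sketch below shows what such features look like on two made-up timestamps. It is an illustration only, and LightAutoML's internal encoding may differ.

```python
import pandas as pd

# Two made-up timestamps in the same format as the `date_time` column
df = pd.DataFrame({"date_time": ["2010-03-10 18:00:00", "2010-03-13 02:00:00"]})
ts = pd.to_datetime(df["date_time"])

# Calendar features analogous to the 'd', 'wd' and 'hour' seasonality codes
features = pd.DataFrame({
    "day": ts.dt.day,          # day of month
    "weekday": ts.dt.weekday,  # Monday=0 ... Sunday=6
    "hour": ts.dt.hour,        # hour of day
})
print(features)
```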

In [None]:
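# Blend the two AutoML runs. `submit` still holds the AutoGluon predictions
# from the section above (those cells must have been run first), and
# `sample_sub` holds the LightAutoML predictions.
submit_final = sample_sub.copy()

cols = sample_sub.columns.tolist()
cols.remove('date_time')

for i in cols:
    # Unweighted mean of the AutoGluon and LightAutoML predictions
    submit_final[i] = np.mean((submit[i].values, sample_sub[i].values), axis=0)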

In [None]:
submit_final.head(2)

In [None]:
submit_final.to_csv('submission.csv',index=False)