# AutoGluon and RAPIDS

Recently we integrated RAPIDS with AutoGluon -- big thanks to [Nick Erickson](https://github.com/Innixma) for leading this effort!

[AutoGluon](https://auto.gluon.ai/stable/index.html) automates machine learning tasks, enabling you to easily achieve strong predictive performance in your applications. With just a few lines of code, you can train and deploy high-accuracy machine learning and deep learning models on text, image, and tabular data.

For more on AutoGluon, check out this [AWS Open Source Blog post](https://aws.amazon.com/blogs/opensource/machine-learning-with-autogluon-an-open-source-automl-library/).

# Demo Overview

In the notebook below, we demonstrate how easy it is to use a GPU-accelerated algorithm ensemble powered by RAPIDS and XGBoost to find a high-performing model on a tabular dataset.

Specifically, we'll use the Airline dataset ([Airline On-Time Statistics](https://www.transtats.bts.gov/ONTIME/)) from the US Bureau of Transportation Statistics. Our machine learning objective is to predict whether a flight will arrive at its destination more than 15 minutes late.
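
To make the prediction target concrete, here is a minimal sketch of how a binary "late" label could be derived from an arrival delay in minutes. The hosted file used in this notebook already ships with a `target` column, so this is only an illustration; the function name `is_late` and the sample delays are hypothetical and not taken from the dataset.

```python
from typing import Optional

LATE_THRESHOLD_MINUTES = 15  # threshold stated in the objective above


def is_late(arrival_delay_minutes: Optional[float]) -> Optional[int]:
    """Return 1 if the flight arrived more than 15 minutes late, else 0.

    A missing delay (for example a cancelled flight) is passed through as None.
    """
    if arrival_delay_minutes is None:
        return None
    return int(arrival_delay_minutes > LATE_THRESHOLD_MINUTES)


# Hypothetical arrival delays in minutes (negative means an early arrival)
delays = [-5.0, 0.0, 15.0, 16.0, 120.0]
labels = [is_late(d) for d in delays]
print(labels)  # [0, 0, 0, 1, 1]
```

Note that a delay of exactly 15 minutes is labeled 0, since the objective is "more than 15 minutes late".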

Note that this demo assumes that the following [lifecycle script](https://github.com/rapidsai/cloud-ml-examples/blob/main/aws/environment_setup/lifecycle_script) has already run and added RAPIDS to the set of kernels available in SageMaker Jupyter (it should show up as rapids-18). For more details on how to create and activate a lifecycle script so that it executes when the notebook instance launches, refer to these [instructions](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html).

The flow of the demo is as follows:
1. Download 115 million flights (spanning 1987-2008).
2. Randomly sample just 1 million flights.
3. Run an AutoGluon ensemble of RAPIDS models (RandomForest, K-NearestNeighbors, and a linear model) alongside XGBoost.

# Install AutoGluon into the RAPIDS kernel

Since the lifecycle configuration script needs to finish in 5 minutes or less, there was not enough time to install AutoGluon while the RAPIDS kernel was being added to the SageMaker notebook instance. As a result, we install AutoGluon live in the notebook below.

In [None]:
!source /home/ec2-user/rapids_kernel/bin/activate && pip install --pre autogluon
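
After the install cell finishes, it can be useful to confirm that the running kernel actually sees the package before moving on. The following is a small sketch using only the standard library; the helper name `is_installed` is our own and not part of AutoGluon or SageMaker.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located by the current kernel."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("autogluon"):
    print("AutoGluon is available in this kernel.")
else:
    print("AutoGluon not found; re-run the install cell or restart the kernel.")
```

If the check fails right after a successful `pip install`, restarting the kernel is usually the first thing to try.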

# Import AutoGluon

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
from autogluon.tabular import TabularDataset, TabularPredictor
from autogluon.core.utils import generate_train_test_split

# Download Data and Create TabularDataset Object

In [None]:
path_prefix = 'https://sagemaker-rapids-hpo-us-west-2.s3-us-west-2.amazonaws.com/autogluon/'
path_train = path_prefix + 'train_data.parquet'

data = TabularDataset(path_train)

Let's take a brief look at the dataframe below. Note that the data has 115M rows and 14 columns (13 features and 1 target label).

In [None]:
data
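
To get a feel for why working with the full table is heavy, here is a rough back-of-the-envelope memory estimate. It assumes every value occupies 8 bytes (for example `float64` or `int64`); the real footprint depends on the actual column dtypes, so treat the numbers as an order-of-magnitude guide only.

```python
def estimate_memory_gb(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Approximate in-memory size in GB, assuming a fixed width per value."""
    return n_rows * n_cols * bytes_per_value / 1e9


full = estimate_memory_gb(115_000_000, 14)   # the full dataset
sampled = estimate_memory_gb(1_000_000, 14)  # a 1 million row sample
print(f"full: {full:.2f} GB, sampled: {sampled:.3f} GB")
# full: 12.88 GB, sampled: 0.112 GB
```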

# Randomly Sample 1 Million Flights

Since this is only a demo, we will reduce the size of the dataset to 1 million randomly sampled flights in order to make the runtime fairly short. Feel free to modify the random seed in order to get a different set of flights to train with.

In [None]:
LABEL = 'target'
SAMPLE = 1_000_000

In [None]:
if SAMPLE is not None and SAMPLE < len(data):
    data = data.sample(n=SAMPLE, random_state=0)

In [None]:
data.shape
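
The role of the random seed can be illustrated without pandas. The sketch below uses Python's standard `random` module rather than the sampler behind `data.sample`, so the selected rows will not match the notebook's; it only demonstrates that a fixed seed yields the same selection on every run. The helper name `sample_indices` is hypothetical.

```python
import random


def sample_indices(n_rows: int, n_sample: int, seed: int) -> list:
    """Pick n_sample distinct row indices out of n_rows, reproducibly."""
    rng = random.Random(seed)  # private generator, leaves global state alone
    return rng.sample(range(n_rows), n_sample)


first_run = sample_indices(1_000, 10, seed=0)
second_run = sample_indices(1_000, 10, seed=0)
other_seed = sample_indices(1_000, 10, seed=1)

print(first_run == second_run)  # True: same seed, same rows
print(len(set(first_run)))      # 10: sampling is without replacement
```

Changing the seed (as with `other_seed` above) will generally select a different set of rows, which is what modifying `random_state` does in the cell above.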

# Split Train and Test Data

Next we'll separate the data into a training set and a test set. The train set will be used to update our models' parameters, while the test set will be used to evaluate the model performance on data unseen in training.

In [None]:
train_data, test_data, train_labels, test_labels = generate_train_test_split(
    X=data.drop(LABEL, axis=1),
    y=data[LABEL],
    problem_type='binary',
    test_size=0.1
)
train_data[LABEL] = train_labels
test_data[LABEL] = test_labels
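
For classification problems, `generate_train_test_split` is, to our understanding, stratified on the label, so the share of late flights stays roughly the same in both sets. The pure-Python sketch below illustrates the idea of a stratified split; it is not AutoGluon's implementation, and `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.1, seed=0):
    """Return (train_idx, test_idx), holding out test_size of each class."""
    rng = random.Random(seed)

    # Group row indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 20% late flights (1), 80% on-time flights (0)
labels = [1] * 200 + [0] * 800
train_idx, test_idx = stratified_split(labels, test_size=0.1)

print(len(train_idx), len(test_idx))     # 900 100
print(sum(labels[i] for i in test_idx))  # 20 late flights, i.e. 20% of the test set
```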

# Run AutoGluon with Multiple RAPIDS Models

With our dataset downloaded and split, we can now call the AutoGluon AutoML library to do some automated pre-processing (e.g., label encoding) and to train a stacked ensemble of models that aims for strong performance on our airline delay prediction task.

In [None]:
from autogluon.tabular.models.rf.rf_rapids_model import RFRapidsModel
from autogluon.tabular.models.knn.knn_rapids_model import KNNRapidsModel
from autogluon.tabular.models.lr.lr_rapids_model import LinearRapidsModel

predictor = TabularPredictor(
    label=LABEL,
    verbosity=3,
).fit(
    train_data=train_data,
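    hyperparameters={
        # RAPIDS models with default settings
        KNNRapidsModel: {},
        LinearRapidsModel: {},
        # RAPIDS random forest capped at 100 trees
        RFRapidsModel: {'n_estimators': 100},
        # XGBoost trained on a single GPU
        'XGB': {'ag_args_fit': {'num_gpus': 1}, 'tree_method': 'gpu_hist', 'ag.early_stop': 10000},
    },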
    time_limit=2000,
)

In [None]:
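# Leaderboard with scores on AutoGluon's internal validation data
leaderboard_val = predictor.leaderboard()

# Leaderboard that also scores each model on the held-out test set
leaderboard_test = predictor.leaderboard(test_data)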

# Summary

As we can see from the results, the XGBoost model carries the majority of the accuracy in the ensemble. This is not too surprising given that we allowed the XGBoost model to grow up to 10000 trees, while the RandomForest model was capped at 100. We invite you to experiment with different settings if you are curious to see how they change the performance of the ensemble.
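
As a starting point for such experiments, the sketch below builds a variant of the `hyperparameters` dictionary with a different tree budget. The helper `build_hyperparameters` and the values 300 and 500 are hypothetical choices for illustration, and the KNN and linear models are left out for brevity. The default key `'RF'` is AutoGluon's string key for its standard random forest; pass `RFRapidsModel` as `rf_key` to keep the RAPIDS version used above.

```python
def build_hyperparameters(rf_key="RF", rf_trees=100, xgb_early_stop=10000):
    """Assemble a hyperparameters dict in the shape passed to fit() above."""
    return {
        # Random forest with a configurable number of trees
        rf_key: {"n_estimators": rf_trees},
        # XGBoost on one GPU, same keys as in the training cell above
        "XGB": {
            "ag_args_fit": {"num_gpus": 1},
            "tree_method": "gpu_hist",
            "ag.early_stop": xgb_early_stop,
        },
    }


# Example: give the random forest more trees and stop XGBoost earlier
hyperparameters = build_hyperparameters(rf_trees=300, xgb_early_stop=500)
print(hyperparameters["RF"])  # {'n_estimators': 300}
```

The resulting dictionary can then be passed as the `hyperparameters` argument of `fit`, and the two leaderboards compared across runs.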

Hopefully this example shows how straightforward it is to run AutoML on tabular data with AutoGluon and RAPIDS working together!