https://auto.gluon.ai/0.1.0/tutorials/tabular_prediction/tabular-indepth.html
    
# Predicting Columns in a Table - In Depth: freMTPL2freq (regression)

Tip: If you are new to AutoGluon, review "Predicting Columns in a Table - Quick Start" to learn the basics of the AutoGluon API.

https://auto.gluon.ai/0.1.0/tutorials/tabular_prediction/tabular-quickstart.html#sec-tabularquick

This tutorial describes how you can exert greater control when using AutoGluon’s `fit()` or `predict()`. Recall that to maximize predictive performance, you should always first try `fit()` with all default arguments except `eval_metric` and `presets`, before you experiment with other arguments covered in this in-depth tutorial like `hyperparameter_tune_kwargs`, `hyperparameters`, `num_stack_levels`, `num_bag_folds`, `num_bag_sets`, etc.

Instead of the census data used in the "Predicting Columns in a Table - Quick Start" tutorial, this notebook uses the freMTPL2freq motor insurance dataset and predicts the number of claims on a policy (`ClaimNb`), which we treat as a regression problem. Start by importing AutoGluon’s `TabularPredictor` and `TabularDataset`, and loading the data.

In [1]:
from autogluon.tabular import TabularDataset, TabularPredictor

import pandas as pd
import numpy as np

In [2]:
# train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
train_data = TabularDataset("freMTPL2freq_dataset_train.csv")
test_data = TabularDataset("freMTPL2freq_dataset_test.csv")
# subsample_size = 500  # subsample subset of data for faster demo, try setting this to much larger values
# train_data = train_data.sample(n=subsample_size, random_state=0)

In [3]:
train_data.info()

<class 'autogluon.core.dataset.TabularDataset'>
RangeIndex: 474765 entries, 0 to 474764
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   IDpol       474765 non-null  int64  
 1   ClaimNb     474765 non-null  int64  
 2   Exposure    474765 non-null  float64
 3   VehPower    474765 non-null  int64  
 4   VehAge      474765 non-null  int64  
 5   DrivAge     474765 non-null  int64  
 6   BonusMalus  474765 non-null  int64  
 7   VehBrand    474765 non-null  object 
 8   VehGas      474765 non-null  object 
 9   Area        474765 non-null  object 
 10  Density     474765 non-null  int64  
 11  Region      474765 non-null  object 
dtypes: float64(1), int64(7), object(4)
memory usage: 43.5+ MB


In [4]:
test_data.info()

<class 'autogluon.core.dataset.TabularDataset'>
RangeIndex: 203226 entries, 0 to 203225
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   IDpol       203226 non-null  float64
 1   ClaimNb     203226 non-null  int64  
 2   Exposure    203226 non-null  float64
 3   VehPower    203226 non-null  int64  
 4   VehAge      203226 non-null  int64  
 5   DrivAge     203226 non-null  int64  
 6   BonusMalus  203226 non-null  int64  
 7   VehBrand    203226 non-null  object 
 8   VehGas      203226 non-null  object 
 9   Area        203226 non-null  object 
 10  Density     203226 non-null  int64  
 11  Region      203226 non-null  object 
dtypes: float64(2), int64(6), object(4)
memory usage: 18.6+ MB


In [None]:
train_data.head().T

In [6]:
label = 'ClaimNb'
print("Summary of target column: \n", train_data[label].describe())

Summary of target column: 
 count    474765.000000
mean          0.038583
std           0.205458
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max          11.000000
Name: ClaimNb, dtype: float64


## Data Cleaning

In [7]:
# IDpol: The policy ID, so drop it
train_data = train_data.drop(["IDpol"], axis=1)
test_data = test_data.drop(["IDpol"], axis=1)

In [8]:
train_data.shape

(474765, 11)

In [9]:
test_data.shape

(203226, 11)

In [10]:
# new_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
# test_data = new_data[5000:].copy()  # this should be separate data in your applications
y_test = test_data[label]
test_data_nolabel = test_data.drop(columns=[label])  # delete label column
# val_data = new_data[:5000].copy()

# metric = 'accuracy' # we specify eval-metric just for demo (unnecessary as it's the default)
metric = 'mean_absolute_error' 

## Specifying hyperparameters and tuning them

We first demonstrate hyperparameter-tuning and how you can provide your own validation dataset that AutoGluon internally relies on to: tune hyperparameters, early-stop iterative training, and construct model ensembles. One reason you may specify validation data is when future test data will stem from a different distribution than training data (and your specified validation data is more representative of the future data that will likely be encountered).

If you don’t have a strong reason to provide your own validation dataset, we recommend you omit the `tuning_data` argument. This lets AutoGluon automatically select validation data from your provided training set (it uses smart strategies such as stratified sampling). For greater control, you can specify the `holdout_frac` argument to tell AutoGluon what fraction of the provided training data to hold out for validation.

**Caution:** Since AutoGluon tunes internal knobs based on this validation data, performance estimates reported on this data may be over-optimistic. For unbiased performance estimates, you should always call `predict()` on a separate dataset (that was never passed to `fit()`), as we did in the previous Quick-Start tutorial. We also emphasize that most options specified in this tutorial are chosen to minimize runtime for the purposes of demonstration and you should select more reasonable values in order to obtain high-quality models.
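
To make the caution above concrete, here is a small simulation that does not use AutoGluon at all. It is a toy setup of our own (binary labels, 200 hypothetical "models" that guess at random, so each has a true accuracy of 50%): picking the candidate with the best validation score makes that validation score look better than the same candidate's score on data that played no role in the selection.

```python
import random

random.seed(0)

n_val, n_test, n_models = 100, 10_000, 200

# Toy labels; every "model" below guesses at random, so its true accuracy is 50%.
y_val = [random.randint(0, 1) for _ in range(n_val)]
y_test = [random.randint(0, 1) for _ in range(n_test)]


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


best_val_acc, best_test_acc = -1.0, None
for _ in range(n_models):
    val_pred = [random.randint(0, 1) for _ in range(n_val)]
    test_pred = [random.randint(0, 1) for _ in range(n_test)]
    val_acc = accuracy(y_val, val_pred)
    # Model selection: keep the candidate that looks best on validation data.
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_test_acc = accuracy(y_test, test_pred)

print("validation accuracy of selected model:", best_val_acc)
print("test accuracy of the same model:      ", best_test_acc)
```

The selected model's validation accuracy is inflated by the selection itself, while its accuracy on the untouched test set stays near 50%. The same effect applies, less dramatically, whenever hyperparameters or ensembles are chosen on validation data.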

`fit()` trains neural networks and various types of tree ensembles by default. You can specify various hyperparameter values for each type of model. For each hyperparameter, you can either specify a single fixed value, or a search space of values to consider during hyperparameter optimization. Hyperparameters which you do not specify are left at default settings chosen automatically by AutoGluon, which may be fixed values or search spaces.

In [11]:
import autogluon.core as ag

In [12]:
# specifies non-default hyperparameter values for neural network models
#    num_epochs: number of training epochs (controls training time of NN models)
#    learning_rate: learning rate used in training (real-valued hyperparameter searched on log-scale)
#    activation: activation function used in NN (categorical hyperparameter, default = first entry)
#    layers: each choice for categorical hyperparameter 'layers' corresponds to list of sizes for each NN layer to use
#    dropout_prob: dropout probability (real-valued hyperparameter)
nn_options = {
    'num_epochs': 10,
    'learning_rate': ag.space.Real(1e-4, 1e-2, default=5e-4, log=True),
    'activation': ag.space.Categorical('relu', 'softrelu', 'tanh'),
    'layers': ag.space.Categorical([100], [1000], [200, 100], [300, 200, 100]),
    'dropout_prob': ag.space.Real(0.0, 0.5, default=0.1)
}

nn_options

{'num_epochs': 10,
 'learning_rate': Real: lower=0.0001, upper=0.01,
 'activation': Categorical['relu', 'softrelu', 'tanh'],
 'layers': Categorical[[100], [1000], [200, 100], [300, 200, 100]],
 'dropout_prob': Real: lower=0.0, upper=0.5}
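
As a rough illustration of what a search space means, the sketch below draws a few random configurations from ranges matching `nn_options`. This is plain random search written with the standard library, not AutoGluon's own searcher; the helper `sample_log_uniform` is a hypothetical name introduced here. It shows why `log=True` matters for `learning_rate`: values are spread evenly across orders of magnitude rather than clustering near the upper bound.

```python
import math
import random


def sample_log_uniform(lower, upper, rng):
    """Draw a value whose logarithm is uniform on [log(lower), log(upper)]."""
    return math.exp(rng.uniform(math.log(lower), math.log(upper)))


rng = random.Random(0)
num_trials = 5  # same number of trials as used later in this tutorial

configs = [
    {
        "learning_rate": sample_log_uniform(1e-4, 1e-2, rng),  # log-scale real
        "dropout_prob": rng.uniform(0.0, 0.5),                  # linear-scale real
        "layers": rng.choice([[100], [1000], [200, 100], [300, 200, 100]]),
    }
    for _ in range(num_trials)
]

for config in configs:
    print(config)
```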

In [13]:
# specifies non-default hyperparameter values for lightGBM gradient boosted trees
#    num_boost_round: number of boosting rounds (controls training time of GBM models)
#    num_leaves: number of leaves in trees (integer hyperparameter)
gbm_options = {  
    'num_boost_round': 100,
    'num_leaves': ag.space.Int(lower=26, upper=66, default=36)
}

gbm_options

{'num_boost_round': 100, 'num_leaves': Int: lower=26, upper=66}

In [14]:
# hyperparameters of each model type
# When these keys are missing from hyperparameters dict, no models of that type are trained
#    NN: NOTE: comment this line out if you get errors on Mac OSX
hyperparameters = {  
    'GBM': gbm_options,
    'NN': nn_options
}

hyperparameters

{'GBM': {'num_boost_round': 100, 'num_leaves': Int: lower=26, upper=66},
 'NN': {'num_epochs': 10,
  'learning_rate': Real: lower=0.0001, upper=0.01,
  'activation': Categorical['relu', 'softrelu', 'tanh'],
  'layers': Categorical[[100], [1000], [200, 100], [300, 200, 100]],
  'dropout_prob': Real: lower=0.0, upper=0.5}}

In [15]:
# train various models for ~2 min
# time_limit = 2*60

# https://www.kaggle.com/code/daikikatsuragawa/tps-mar-2021-benchmark-using-autogluon/notebook
# If the value of time_limit is too small, we will get the following error:
# ValueError: AutoGluon did not successfully train any models

# increased time_limit to 1 hour
time_limit = 60*60
time_limit

3600

In [16]:
# try at most 5 different hyperparameter configurations for each type of model
num_trials = 5
num_trials

5

In [17]:
# to tune hyperparameters using Bayesian optimization routine with a local scheduler
search_strategy = 'auto'
search_strategy

'auto'

In [18]:
# HPO is not performed unless hyperparameter_tune_kwargs is specified
hyperparameter_tune_kwargs = {
    'num_trials': num_trials,
    'searcher': search_strategy
}

hyperparameter_tune_kwargs

{'num_trials': 5, 'searcher': 'auto'}

In [19]:
label

'ClaimNb'

In [20]:
metric

'mean_absolute_error'

In [21]:
# NEED TO DEBUG
"""
predictor = TabularPredictor(label=label, eval_metric=metric).fit(
    train_data,
    tuning_data=val_data,
    time_limit=time_limit,
    hyperparameters=hyperparameters,
    hyperparameter_tune_kwargs=hyperparameter_tune_kwargs
)
"""

'\npredictor = TabularPredictor(label=label, eval_metric=metric).fit(\n    train_data,\n    tuning_data=val_data,\n    time_limit=time_limit,\n    hyperparameters=hyperparameters,\n    hyperparameter_tune_kwargs=hyperparameter_tune_kwargs\n)\n'

In [22]:
# specify problem_type explicitly so AutoGluon does not infer a multiclass problem_type, e.g.,
#   problem_type="regression"

predictor = TabularPredictor(
    label=label, 
    eval_metric=metric,
    problem_type="regression"
).fit(
    train_data,
    time_limit=time_limit
)

No path specified. Models will be saved in: "AutogluonModels/ag-20230702_210555/"
Beginning AutoGluon training ... Time limit = 3600s
AutoGluon will save models to "AutogluonModels/ag-20230702_210555/"
AutoGluon Version:  0.7.0
Python Version:     3.9.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu May 4 15:21:22 UTC 2023
Train Data Rows:    474765
Train Data Columns: 10
Label Column: ClaimNb
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    167728.76 MB
	Train Data (Original)  Memory Usage: 141.07 MB (0.1% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatu

We again demonstrate how to use the trained models to predict on the test data.

In [23]:
y_pred = predictor.predict(test_data_nolabel)

In [24]:
# print("Predictions:  ", list(y_pred)[:5])
print("Predictions:\n", y_pred[:5])

Predictions:
 0    8.739490e-15
1    4.045679e-15
2    4.697084e-14
3    2.464720e-15
4    5.489513e-09
Name: ClaimNb, dtype: float32


In [25]:
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=False)

Evaluation: mean_absolute_error on test data: -0.039985119318079465
	Note: Scores are always higher_is_better. This metric score can be multiplied by -1 to get the metric value.
Evaluations on test data:
{
    "mean_absolute_error": -0.039985119318079465
}
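
The reported score is the negated MAE, so that higher is always better. It is also worth comparing it against a trivial baseline. For a non-negative target whose median is 0 (as in the `ClaimNb` summary earlier), always predicting 0 gives an MAE equal to the mean of the target, and no other constant does better under MAE. The sketch below uses made-up claim counts, not the real data, to show both points. Only the first five predictions are printed above, so we cannot tell from this output alone how close the fitted model is to that baseline.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between labels and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical zero-inflated claim counts (illustrative, not from the dataset).
y_true = [0] * 96 + [1] * 3 + [2]
target_mean = sum(y_true) / len(y_true)

y_zero = [0.0] * len(y_true)          # always predict "no claims"
y_mean = [target_mean] * len(y_true)  # always predict the mean

mae_zero = mean_absolute_error(y_true, y_zero)
mae_mean = mean_absolute_error(y_true, y_mean)

# Higher-is-better convention: the score is the MAE multiplied by -1.
score_zero = -mae_zero

print("MAE when predicting 0:   ", mae_zero)
print("MAE when predicting mean:", mae_mean)
print("score for predicting 0:  ", score_zero)
```

Because MAE rewards predicting the median, a model tuned for `mean_absolute_error` on such data can score well while predicting values very close to zero. If expected claim frequency is what you need, a different `eval_metric` may be more appropriate.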


Use the following to view a summary of what happened during fit. When hyperparameter tuning has been performed, this command also shows details of the tuning process for each type of model:

In [26]:
results = predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0        NeuralNetTorch  -0.042334       0.034553  376.387879                0.034553         376.387879            1       True         10
1   WeightedEnsemble_L2  -0.042334       0.035145  376.750426                0.000593           0.362546            2       True         12
2       NeuralNetFastAI  -0.070266       0.048698  316.246186                0.048698         316.246186            1       True          8
3              LightGBM  -0.074476       0.015795    2.769888                0.015795           2.769888            1       True          4
4         LightGBMLarge  -0.074557       0.007816    3.565485                0.007816           3.565485            1       True         11
5               XGBoost  -0.074668       0.022556    3.800750                0.022556           3.

In the above example, the predictive performance may be poor because we specified very little training to ensure quick runtimes. You can call `fit()` multiple times while modifying the above settings to better understand how these choices affect performance outcomes. For example: you can comment out the `train_data.head` command or increase `subsample_size` to train using a larger dataset, increase the `num_epochs` and `num_boost_round` hyperparameters, and increase the `time_limit` (which you should do for all code in these tutorials). To see more detailed output during the execution of `fit()`, you can also pass in the argument: `verbosity=3`.

## Model ensembling with stacking/bagging

Beyond hyperparameter-tuning with a correctly-specified evaluation metric, two other methods to boost predictive performance are bagging and stack-ensembling:

"AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data": https://arxiv.org/abs/2003.06505

You’ll often see performance improve if you specify `num_bag_folds = 5-10, num_stack_levels = 1-3` in the call to `fit()`, but this will increase training times and memory/disk usage.
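
The idea behind k-fold bagging can be sketched without AutoGluon. In the toy example below, the "model" is a stand-in that simply predicts the training mean, and the fold assignment is a simple round-robin of our own choosing. Each of the `n_folds` copies is trained with one fold held out. The held-out (out-of-fold) predictions cover every training row exactly once, which is what a stacker at the next level can train on, and at inference time the copies' predictions are averaged.

```python
def kfold_indices(n_rows, n_folds):
    """Assign row i to fold i % n_folds and return one index list per fold."""
    return [list(range(fold, n_rows, n_folds)) for fold in range(n_folds)]


def fit_mean_model(y_train):
    """Stand-in 'model': predicts the training mean for every requested row."""
    mean = sum(y_train) / len(y_train)
    return lambda n_rows: [mean] * n_rows


y = [0, 0, 1, 0, 0, 0, 2, 0, 0, 1]  # toy labels
n_folds = 5

oof_pred = [None] * len(y)  # out-of-fold predictions, one per training row
bag = []                    # the n_folds fitted copies

for held_out in kfold_indices(len(y), n_folds):
    held = set(held_out)
    y_train = [y[i] for i in range(len(y)) if i not in held]
    model = fit_mean_model(y_train)
    bag.append(model)
    # Each row is predicted only by the copy that never saw it during training.
    for i, pred in zip(held_out, model(len(held_out))):
        oof_pred[i] = pred

# Inference on a new row: average the predictions of all bagged copies.
test_pred = sum(model(1)[0] for model in bag) / len(bag)

print("out-of-fold predictions:", oof_pred)
print("bagged prediction for a new row:", test_pred)
```

This also shows where the extra cost comes from: training and inference both scale with the number of bagged copies.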

In [27]:
# the commented-out `hyperparameters` argument is only for a quick demo; omit it in real applications
predictor = TabularPredictor(
    label=label, 
    eval_metric=metric,
    problem_type="regression"
).fit(
    train_data,
    num_bag_folds=5, num_bag_sets=1, num_stack_levels=1,
    # num_bag_folds=5, num_bag_sets=1, num_stack_levels=3,                                                              
    # hyperparameters = {'NN': {'num_epochs': 2}, 'GBM': {'num_boost_round': 20}},  
)

No path specified. Models will be saved in: "AutogluonModels/ag-20230702_211907/"
	Consider setting `time_limit` to ensure training finishes within an expected duration or experiment with a small portion of `train_data` to identify an ideal `presets` and `hyperparameters` configuration.
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230702_211907/"
AutoGluon Version:  0.7.0
Python Version:     3.9.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu May 4 15:21:22 UTC 2023
Train Data Rows:    474765
Train Data Columns: 10
Label Column: ClaimNb
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    161883.65 MB
	Train Data (Original)  Memory Usage: 141.07 MB (0.1% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features

You should not provide `tuning_data` when stacking/bagging, and instead provide all your available data as `train_data` (which AutoGluon will split in more intelligent ways). `num_bag_sets` controls how many times the k-fold bagging process is repeated to further reduce variance (increasing this may further boost accuracy but will substantially increase training times, inference latency, and memory/disk usage). Rather than manually searching for good bagging/stacking values yourself, AutoGluon will automatically select good values for you if you specify `auto_stack` instead:

In [28]:
save_path = 'agModels-predict_freMTPL2freq_2'  # folder where to store trained models

# last 2 arguments are for quick demo, omit them in real applications
predictor = TabularPredictor(
    label=label, 
    eval_metric=metric, 
    path=save_path,
    problem_type="regression"
).fit(
    train_data, 
    auto_stack=True,
    time_limit=30, 
    hyperparameters={'NN': {'num_epochs': 2}, 'GBM': {'num_boost_round': 20}} 
)

Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 30s
AutoGluon will save models to "agModels-predict_freMTPL2freq_2/"
AutoGluon Version:  0.7.0
Python Version:     3.9.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu May 4 15:21:22 UTC 2023
Train Data Rows:    474765
Train Data Columns: 10
Label Column: ClaimNb
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    160034.71 MB
	Train Data (Original)  Memory Usage: 141.07 MB (0.1% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNa

	-0.0727	 = Validation score   (-mean_absolute_error)
	0.01s	 = Training   runtime
	0.01s	 = Validation runtime
AutoGluon training complete, total runtime = 28.65s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("agModels-predict_freMTPL2freq_2/")


Stacking/bagging will often produce better accuracy than hyperparameter-tuning, but you may try combining both techniques (note: specifying `presets='best_quality'` in `fit()` simply sets `auto_stack=True`).

## Prediction options (inference)

Even if you’ve started a new Python session since last calling `fit()`, you can still load a previously trained predictor from disk:

```
# `predictor.path` is another way to get the relative path needed to later load predictor.
predictor = TabularPredictor.load(save_path)  
```

Above, `save_path` is the same folder previously passed to `TabularPredictor`, in which all the trained models have been saved. You can easily train models on one machine and deploy them on another. Simply copy the `save_path` folder to the new machine and specify its new path in `TabularPredictor.load()`.

We can make a prediction on an individual example rather than a full dataset:

In [29]:
datapoint = test_data_nolabel.iloc[[0]]  # Note: .iloc[0] won't work because it returns pandas Series instead of DataFrame

print(datapoint)
predictor.predict(datapoint)

2023-07-02 21:57:44,621	ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
2023-07-02 21:57:44,622	ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
2023-07-02 21:57:44,627	ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
2023-07-02 21:57:44,628	ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
2023-07-02 21:57:44,629	ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UN

   Exposure  VehPower  VehAge  DrivAge  BonusMalus VehBrand   VehGas Area  \
0       0.1         5       0       55          50      B12  Regular    D   

   Density       Region  
0     1217  Rhone-Alpes  


0    0.021197
Name: ClaimNb, dtype: float32

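For classification problems, `predict_proba()` outputs predicted class probabilities instead of predicted classes. In a regression problem like this one there are no classes, and the call simply returns the same values as `predict()`:

In [30]:
predictor.predict_proba(datapoint)  # for regression this matches predict(); for classification it returns a DataFrame of per-class probabilities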



0    0.021197
Name: ClaimNb, dtype: float32

By default, `predict()` and `predict_proba()` will utilize the model that AutoGluon thinks is most accurate, which is usually an ensemble of many individual models. Here’s how to see which model this is:

In [31]:
predictor.get_model_best()

'WeightedEnsemble_L3'

We can instead specify a particular model to use for predictions (e.g. to reduce inference latency). Note that a ‘model’ in AutoGluon may refer to for example a single Neural Network, a bagged ensemble of many Neural Network copies trained on different training/validation splits, a weighted ensemble that aggregates the predictions of many other models, or a stacker model that operates on predictions output by other models. This is akin to viewing a Random Forest as one ‘model’ when it is in fact an ensemble of many decision trees.

Before deciding which model to use, let’s evaluate all of the models AutoGluon has previously trained on our test data:

In [32]:
# predictor.leaderboard(test_data, silent=True)
predictor.leaderboard(test_data, silent=True).T

|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| model | LightGBM_BAG_L2 | WeightedEnsemble_L3 | LightGBM_BAG_L1 | WeightedEnsemble_L2 |
| score_test | -0.073829 | -0.073829 | -0.073983 | -0.073983 |
| score_val | -0.072728 | -0.072728 | -0.072862 | -0.072862 |
| pred_time_test | 0.606435 | 0.610742 | 0.281258 | 0.287324 |
| pred_time_val | 0.401058 | 0.408151 | 0.20376 | 0.210497 |
| fit_time | 4.485044 | 4.496381 | 2.237072 | 2.248087 |
| pred_time_test_marginal | 0.325177 | 0.004307 | 0.281258 | 0.006066 |
| pred_time_val_marginal | 0.197298 | 0.007093 | 0.20376 | 0.006737 |
| fit_time_marginal | 2.247972 | 0.011337 | 2.237072 | 0.011015 |
| stack_level | 2 | 3 | 1 | 2 |


The leaderboard shows each model’s predictive performance on the test data (`score_test`) and validation data (`score_val`), as well as the time required to: produce predictions for the test data (`pred_time_test`), produce predictions on the validation data (`pred_time_val`), and train only this model (`fit_time`). Below, we show that a leaderboard can be produced without new data (just uses the data previously reserved for validation inside `fit`) and can display extra information about each model:

In [33]:
# predictor.leaderboard(extra_info=True, silent=True)
predictor.leaderboard(extra_info=True, silent=True).T

|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| model | LightGBM_BAG_L2 | WeightedEnsemble_L3 | LightGBM_BAG_L1 | WeightedEnsemble_L2 |
| score_val | -0.072728 | -0.072728 | -0.072862 | -0.072862 |
| pred_time_val | 0.401058 | 0.408151 | 0.20376 | 0.210497 |
| fit_time | 4.485044 | 4.496381 | 2.237072 | 2.248087 |
| pred_time_val_marginal | 0.197298 | 0.007093 | 0.20376 | 0.006737 |
| fit_time_marginal | 2.247972 | 0.011337 | 2.237072 | 0.011015 |
| stack_level | 2 | 3 | 1 | 2 |
| can_infer | True | True | True | True |
| fit_order | 3 | 4 | 1 | 2 |
| num_features | 11 | 1 | 10 | 1 |


The expanded leaderboard shows properties like how many features are used by each model (`num_features`), which other models are ancestors whose predictions are required inputs for each model (`ancestors`), and how much memory each model and all its ancestors would occupy if simultaneously persisted (`memory_size_w_ancestors`). See the leaderboard documentation for full details:

https://auto.gluon.ai/0.1.0/api/autogluon.predictor.html#autogluon.tabular.TabularPredictor.leaderboard
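
The `*_marginal` columns and the ancestor relationship fit together in a simple way: in the numbers printed above, a model's `fit_time` and `pred_time_val` equal its own marginal time plus the marginal times of the models it depends on. The check below copies values from the leaderboard output. The ancestor chain (L1 feeding L2 feeding the L3 ensemble) is our assumption based on the stack levels, since the `ancestors` column is not visible in the output shown.

```python
# Marginal times copied from the leaderboard output above.
marginal = {
    "LightGBM_BAG_L1": {"fit": 2.237072, "pred_val": 0.203760},
    "LightGBM_BAG_L2": {"fit": 2.247972, "pred_val": 0.197298},
    "WeightedEnsemble_L3": {"fit": 0.011337, "pred_val": 0.007093},
}

# ASSUMPTION: ancestor chain inferred from stack_level, not read from AutoGluon.
ancestors = {
    "LightGBM_BAG_L1": [],
    "LightGBM_BAG_L2": ["LightGBM_BAG_L1"],
    "WeightedEnsemble_L3": ["LightGBM_BAG_L1", "LightGBM_BAG_L2"],
}


def cumulative(model, key):
    """Own marginal time plus the marginal times of all ancestors."""
    return marginal[model][key] + sum(marginal[a][key] for a in ancestors[model])


for name in marginal:
    print(name,
          "fit_time ~", round(cumulative(name, "fit"), 6),
          "pred_time_val ~", round(cumulative(name, "pred_val"), 6))
```

This is why a stacker or weighted ensemble can never be faster at inference than the models beneath it, even when its own marginal time is tiny.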

Here’s how to specify a particular model to use for prediction instead of AutoGluon’s default model-choice:

In [34]:
i = 0  # index of model to use
model_to_use = predictor.get_model_names()[i]
model_pred = predictor.predict(datapoint, model=model_to_use)
print("Prediction from %s model: %s" % (model_to_use, model_pred.iloc[0]))

Prediction from LightGBM_BAG_L1 model: 0.021949358


We can easily access various information about the trained predictor or a particular model:

In [35]:
all_models = predictor.get_model_names()
model_to_use = all_models[i]
specific_model = predictor._trainer.load_model(model_to_use)

# Objects defined below are dicts of various information (not printed here as they are quite large):
model_info = specific_model.get_info()
predictor_information = predictor.info()

The `predictor` also remembers what metric predictions should be evaluated with, which can be done with ground truth labels as follows:

In [36]:
y_pred = predictor.predict(test_data_nolabel)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

Evaluation: mean_absolute_error on test data: -0.07382930544418408
	Note: Scores are always higher_is_better. This metric score can be multiplied by -1 to get the metric value.
Evaluations on test data:
{
    "mean_absolute_error": -0.07382930544418408,
    "root_mean_squared_error": -0.20785334362348507,
    "mean_squared_error": -0.043203012455462556,
    "r2": 0.030595522805380138,
    "pearsonr": 0.19263159632652416,
    "median_absolute_error": -0.03478221222758293
}


However, you must be careful here as certain metrics require predicted probabilities rather than classes. Since the label column remains in the `test_data` DataFrame, we can instead use the shorthand:

In [37]:
perf = predictor.evaluate(test_data)

Evaluation: mean_absolute_error on test data: -0.07382930544418408
	Note: Scores are always higher_is_better. This metric score can be multiplied by -1 to get the metric value.
Evaluations on test data:
{
    "mean_absolute_error": -0.07382930544418408,
    "root_mean_squared_error": -0.20785334362348507,
    "mean_squared_error": -0.043203012455462556,
    "r2": 0.030595522805380138,
    "pearsonr": 0.19263159632652416,
    "median_absolute_error": -0.03478221222758293
}


which will correctly select between `predict()` or `predict_proba()` depending on the evaluation metric.

## Interpretability (feature importance)

To better understand our trained predictor, we can estimate the overall importance of each feature:

In [38]:
predictor.feature_importance(test_data)

Computing feature importance via permutation shuffling for 10 features using 5000 rows with 5 shuffle sets...
	8.39s	= Expected runtime (1.68s per shuffle set)
	1.8s	= Actual runtime (Completed 5 of 5 shuffle sets)


|  | importance | stddev | p_value | n | p99_high | p99_low |
|---|---|---|---|---|---|---|
| BonusMalus | 0.004605 | 0.000562 | 2.6e-05 | 5 | 0.005762 | 0.003449 |
| Exposure | 0.002813 | 0.000644 | 0.000308 | 5 | 0.004138 | 0.001487 |
| DrivAge | 0.000838 | 0.000132 | 7.2e-05 | 5 | 0.00111 | 0.000566 |
| VehAge | 0.000185 | 5.1e-05 | 0.000642 | 5 | 0.000291 | 7.9e-05 |
| VehBrand | 0.000174 | 2.8e-05 | 7.5e-05 | 5 | 0.000231 | 0.000117 |
| Region | 0.000144 | 7.1e-05 | 0.005421 | 5 | 0.00029 | -3e-06 |
| Density | 0.000142 | 4.7e-05 | 0.001247 | 5 | 0.000238 | 4.5e-05 |
| VehGas | 4.4e-05 | 2.1e-05 | 0.004603 | 5 | 8.8e-05 | 1e-06 |
| VehPower | 2.6e-05 | 9e-06 | 0.00166 | 5 | 4.6e-05 | 7e-06 |
| Area | -5e-06 | 8e-06 | 0.864395 | 5 | 1.2e-05 | -2.2e-05 |


Computed via permutation-shuffling, these feature importance scores quantify the drop in predictive performance (of the already trained predictor) when one column’s values are randomly shuffled across rows. 

https://explained.ai/rf-importance/
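
The permutation-shuffling procedure itself is short enough to sketch in plain Python. The code below is an illustrative re-implementation on a toy model, not AutoGluon's internal code; the function name `permutation_importance` and the toy features `x` and `noise` are our own. The importance of a feature is the average increase in MAE (equivalently, the drop in the negated-MAE score) after that feature's column is shuffled.

```python
import random


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y_true, feature, n_shuffles=5, seed=0):
    """Average increase in MAE when one feature's values are shuffled across rows."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(y_true, [predict(r) for r in rows])
    increases = []
    for _ in range(n_shuffles):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        # Copy each row with only the chosen feature replaced; the model is not retrained.
        permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
        permuted_mae = mean_absolute_error(y_true, [predict(r) for r in permuted])
        increases.append(permuted_mae - baseline)
    return sum(increases) / len(increases)


def predict(row):
    """Toy 'trained model' that depends on x and ignores noise."""
    return 2.0 * row["x"]


rows = [{"x": float(i), "noise": float((7 * i) % 10)} for i in range(10)]
y_true = [2.0 * r["x"] for r in rows]

imp_x = permutation_importance(predict, rows, y_true, "x")
imp_noise = permutation_importance(predict, rows, y_true, "noise")
print("importance of x:    ", imp_x)
print("importance of noise:", imp_noise)
```

Shuffling the feature the model ignores leaves its predictions unchanged, so its importance is exactly zero, while shuffling the feature it relies on makes the error grow.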

The top features in this list contribute most to AutoGluon’s accuracy (here, for predicting the number of claims, `ClaimNb`). Features with a non-positive importance score hardly contribute to the predictor’s accuracy, or may even be actively harmful to include in the data (consider removing these features from your data and calling `fit` again). These scores facilitate interpretability of the predictor’s global behavior (which features it relies on for all predictions) rather than local explanations that only rationalize one particular prediction.

https://christophm.github.io/interpretable-ml-book/taxonomy-of-interpretability-methods.html
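
The `p99_high` and `p99_low` columns in the table above can be read as a 99% confidence interval around each importance score. The printed numbers are consistent with a Student's t interval over the `n = 5` shuffle sets, i.e. `importance ± t * stddev / sqrt(n)`. That formula is our inference from the values shown, not something stated in the output, and the t quantile below is the standard table value for 4 degrees of freedom.

```python
import math

# 0.995 quantile of Student's t with n - 1 = 4 degrees of freedom (table value).
T_995_DF4 = 4.604


def p99_interval(importance, stddev, n, t_crit=T_995_DF4):
    """Two-sided 99% interval for the mean importance over n shuffle sets."""
    half_width = t_crit * stddev / math.sqrt(n)
    return importance - half_width, importance + half_width


# BonusMalus row from the feature importance table above.
low, high = p99_interval(importance=0.004605, stddev=0.000562, n=5)
print("p99_low  ~", round(low, 6))
print("p99_high ~", round(high, 6))
```

Read this way, a feature such as `Region`, whose `p99_low` is slightly negative, has an interval that includes zero, so its positive importance is less certain than the raw score suggests.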

## Accelerating inference

We describe multiple ways to reduce the time it takes for AutoGluon to produce predictions.

### Keeping models in memory

By default, AutoGluon loads models into memory one at a time and only when they are needed for prediction. This strategy is robust for large stacked/bagged ensembles, but leads to slower prediction times. If you plan to repeatedly make predictions (e.g. on new datapoints one at a time rather than one large test dataset), you can first specify that all models required for inference should be loaded into memory as follows:

In [39]:
predictor.persist_models()

num_test = 20
preds = np.array([''] * num_test, dtype='object')
for i in range(num_test):
    datapoint = test_data_nolabel.iloc[[i]]
    pred_numpy = predictor.predict(datapoint, as_pandas=False)
    preds[i] = pred_numpy[0]

perf = predictor.evaluate_predictions(y_test[:num_test], preds, auxiliary_metrics=True)
print("Predictions: ", preds)

predictor.unpersist_models()  # free memory by clearing models, future predict() calls will load models from disk

Persisting 3 models in memory. Models will require 0.0% of memory.
Evaluation: mean_absolute_error on test data: -0.03614488542079926
	Note: Scores are always higher_is_better. This metric score can be multiplied by -1 to get the metric value.
Evaluations on test data:
{
    "mean_absolute_error": -0.03614488542079926,
    "root_mean_squared_error": -0.04348980311031695,
    "mean_squared_error": -0.0018913629745741342,
    "r2": 0.0,
    "pearsonr": NaN,
    "median_absolute_error": -0.02470323257148266
}
Unpersisted 3 models: ['LightGBM_BAG_L2', 'WeightedEnsemble_L3', 'LightGBM_BAG_L1']


Predictions:  [0.021197435 0.021197435 0.024703233 0.12624344 0.04158317 0.039908644
 0.021197435 0.019564483 0.04604639 0.019564483 0.021087468 0.019564483
 0.06287587 0.04638099 0.039490342 0.024703233 0.021197435 0.04662274
 0.04020452 0.019564483]


['LightGBM_BAG_L2', 'WeightedEnsemble_L3', 'LightGBM_BAG_L1']

You can alternatively specify a particular model to persist via the models argument of `persist_models()`, or simply set `models='all'` to simultaneously load every single model that was trained during `fit`.

### Using smaller ensemble or faster model for prediction

Without having to retrain any models, one can construct alternative ensembles that aggregate individual models’ predictions with different weighting schemes. These ensembles become smaller (and hence faster for prediction) if they assign nonzero weight to fewer models. You can produce a wide variety of ensembles with different accuracy-speed tradeoffs like this:

In [40]:
additional_ensembles = predictor.fit_weighted_ensemble(expand_pareto_frontier=True)
print("Alternative ensembles you can use for prediction:", additional_ensembles)

predictor.leaderboard(only_pareto_frontier=True, silent=True)

Fitting model: WeightedEnsemble_L3Best ...
	-0.0727	 = Validation score   (-mean_absolute_error)
	2.74s	 = Training   runtime
	0.01s	 = Validation runtime


Alternative ensembles you can use for prediction: ['WeightedEnsemble_L3Best']


|  | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LightGBM_BAG_L2 | -0.072728 | 0.401058 | 4.485044 | 0.197298 | 2.247972 | 2 | True | 3 |
| 1 | LightGBM_BAG_L1 | -0.072862 | 0.20376 | 2.237072 | 0.20376 | 2.237072 | 1 | True | 1 |


The resulting leaderboard will contain the most accurate model for a given inference-latency. You can select whichever model exhibits acceptable latency from the leaderboard and use it for prediction.
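
To see what "most accurate model for a given inference-latency" means, the sketch below recomputes a Pareto frontier by hand from the `score_val` and `pred_time_val` values printed in the earlier leaderboards. This is our own illustration of the concept, not AutoGluon's implementation, and it leaves out `WeightedEnsemble_L3Best` because its full leaderboard row is not shown. A model stays on the frontier only if no other model is at least as accurate and at least as fast, with a strict improvement in one of the two.

```python
# (model, score_val, pred_time_val) copied from the leaderboard outputs above.
models = [
    ("LightGBM_BAG_L2", -0.072728, 0.401058),
    ("WeightedEnsemble_L3", -0.072728, 0.408151),
    ("LightGBM_BAG_L1", -0.072862, 0.203760),
    ("WeightedEnsemble_L2", -0.072862, 0.210497),
]


def pareto_frontier(candidates):
    """Keep models that no other model beats on both score and latency."""
    frontier = []
    for name, score, latency in candidates:
        dominated = any(
            other_score >= score
            and other_latency <= latency
            and (other_score > score or other_latency < latency)
            for _, other_score, other_latency in candidates
        )
        if not dominated:
            frontier.append(name)
    return frontier


frontier = pareto_frontier(models)
print(frontier)
```

With these numbers the two weighted ensembles drop out, since each has the same validation score as the bagged LightGBM model beneath it but a slightly higher prediction time. That leaves the same two models as the `only_pareto_frontier=True` leaderboard above.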

In [41]:
model_for_prediction = additional_ensembles[0]
predictions = predictor.predict(test_data, model=model_for_prediction)

# delete these extra models so they don't affect rest of tutorial
predictor.delete_models(models_to_delete=additional_ensembles, dry_run=False)  

Deleting model WeightedEnsemble_L3Best. All files under agModels-predict_freMTPL2freq_2/models/WeightedEnsemble_L3Best/ will be removed.


### Collapsing bagged ensembles via `refit_full`

For an ensemble predictor trained with bagging (as done above), recall there are ~10 bagged copies of each individual model trained on different train/validation folds. We can collapse this bag of ~10 models into a single model that’s fit to the full dataset, which can greatly reduce its memory/latency requirements (but may also reduce accuracy). Below we refit such a model for each original model, but you can alternatively do this for just a particular model by specifying the `model` argument of `refit_full()`.

In [42]:
refit_model_map = predictor.refit_full()
print("Name of each refit-full model corresponding to a previous bagged ensemble:")
print(refit_model_map)
predictor.leaderboard(test_data, silent=True)

Refitting models via `predictor.refit_full` using all of the data (combined train and validation)...
	Models trained in this way will have the suffix "_FULL" and have NaN validation score.
	This process is not bound by time_limit, but should take less time than the original `predictor.fit` call.
	To learn more, refer to the `.refit_full` method docstring which explains how "_FULL" models differ from normal models.
Fitting 1 L1 models ...
Fitting model: LightGBM_BAG_L1_FULL ...
	0.56s	 = Training   runtime
Fitting model: WeightedEnsemble_L2_FULL | Skipping fit via cloning parent ...
	0.01s	 = Training   runtime
Fitting 1 L2 models ...
Fitting model: LightGBM_BAG_L2_FULL ...
	0.61s	 = Training   runtime
Fitting model: WeightedEnsemble_L3_FULL | Skipping fit via cloning parent ...
	0.01s	 = Training   runtime
Updated best model to "WeightedEnsemble_L3_FULL" (Previously "WeightedEnsemble_L3"). AutoGluon will default to using "WeightedEnsemble_L3_FULL" for predict() and predict_proba().

Name of each refit-full model corresponding to a previous bagged ensemble:
{'LightGBM_BAG_L1': 'LightGBM_BAG_L1_FULL', 'WeightedEnsemble_L2': 'WeightedEnsemble_L2_FULL', 'LightGBM_BAG_L2': 'LightGBM_BAG_L2_FULL', 'WeightedEnsemble_L3': 'WeightedEnsemble_L3_FULL'}


Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,LightGBM_BAG_L2_FULL,-0.07381,,0.081527,,1.16834,0.043343,,0.605196,2,True,7
1,WeightedEnsemble_L3_FULL,-0.07381,,0.086516,,1.179677,0.004989,,0.011337,3,True,8
2,LightGBM_BAG_L2,-0.073829,-0.072728,0.593302,0.401058,4.485044,0.303397,0.197298,2.247972,2,True,3
3,WeightedEnsemble_L3,-0.073829,-0.072728,0.598036,0.408151,4.496381,0.004734,0.007093,0.011337,3,True,4
4,LightGBM_BAG_L1_FULL,-0.073976,,0.038184,,0.563145,0.038184,,0.563145,1,True,5
5,WeightedEnsemble_L2_FULL,-0.073976,,0.042882,,0.57416,0.004698,,0.011015,2,True,6
6,LightGBM_BAG_L1,-0.073983,-0.072862,0.289905,0.20376,2.237072,0.289905,0.20376,2.237072,1,True,1
7,WeightedEnsemble_L2,-0.073983,-0.072862,0.294486,0.210497,2.248087,0.004581,0.006737,0.011015,2,True,2


This adds the refit-full models to the leaderboard and we can opt to use any of them for prediction just like any other model. Note `pred_time_test` and `pred_time_val` list the time taken to produce predictions with each model (in seconds) on the test/validation data. Since the refit-full models were trained using all of the data, there is no internal validation score (`score_val`) available for them. You can also call `refit_full()` with non-bagged models to refit the same models to your full dataset (there won’t be memory/latency gains in this case but test accuracy may improve).
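To quantify the latency gain, the sketch below divides each bagged model's `pred_time_test` by that of its refit-full counterpart. The mapping and timings are copied from the outputs above; only the variable names are introduced here for illustration.

```python
# Mapping printed by refit_full() above (bagged name -> refit-full name).
refit_model_map = {
    "LightGBM_BAG_L1": "LightGBM_BAG_L1_FULL",
    "WeightedEnsemble_L2": "WeightedEnsemble_L2_FULL",
    "LightGBM_BAG_L2": "LightGBM_BAG_L2_FULL",
    "WeightedEnsemble_L3": "WeightedEnsemble_L3_FULL",
}

# pred_time_test (seconds) copied from the leaderboard above.
pred_time_test = {
    "LightGBM_BAG_L2_FULL": 0.081527,
    "WeightedEnsemble_L3_FULL": 0.086516,
    "LightGBM_BAG_L2": 0.593302,
    "WeightedEnsemble_L3": 0.598036,
    "LightGBM_BAG_L1_FULL": 0.038184,
    "WeightedEnsemble_L2_FULL": 0.042882,
    "LightGBM_BAG_L1": 0.289905,
    "WeightedEnsemble_L2": 0.294486,
}

# Ratio > 1 means the refit-full model predicts faster than the bagged one.
speedups = {
    bagged: pred_time_test[bagged] / pred_time_test[full]
    for bagged, full in refit_model_map.items()
}
for bagged, ratio in speedups.items():
    print(f"{bagged} -> {refit_model_map[bagged]}: {ratio:.1f}x faster on the test data")
```

In this run every refit-full model predicts several times faster than its bagged counterpart, while the `score_test` values in the leaderboard stay nearly unchanged.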

### Model distillation

Although computationally favorable, individual models usually have lower accuracy than weighted/stacked/bagged ensembles. Model distillation offers one way to retain the computational benefits of a single model while keeping some of the accuracy boost that comes with ensembling.

"Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation": https://arxiv.org/abs/2006.14284

The idea is to train the individual model (which we can call the student) to mimic the predictions of the full stack ensemble (the teacher). Like `refit_full()`, the `distill()` function will produce additional models we can opt to use for prediction.
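The teacher/student idea can be illustrated with a toy sketch. It is not AutoGluon's algorithm: the teacher is a made-up average of two linear models, the student is a one-feature least-squares fit, and uniform random sampling stands in for the real augmentation method.

```python
import random


def teacher_predict(x):
    """Stand-in 'ensemble' teacher: the average of two hypothetical base models."""
    model_a = 2.0 * x + 1.0
    model_b = 2.2 * x + 0.8
    return (model_a + model_b) / 2.0


def fit_linear_student(xs, targets):
    """Closed-form least squares for a single-feature linear student."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(targets) / n
    cov = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, targets))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_t - slope * mean_x


rng = random.Random(0)
train_x = [rng.uniform(0.0, 1.0) for _ in range(50)]
# Augmentation: extra synthetic inputs, labeled by the teacher instead of ground truth.
synthetic_x = [rng.uniform(0.0, 1.0) for _ in range(200)]
distill_x = train_x + synthetic_x
soft_targets = [teacher_predict(x) for x in distill_x]

# The student is trained to reproduce the teacher's predictions.
slope, intercept = fit_linear_student(distill_x, soft_targets)
print(f"student: y = {slope:.3f} * x + {intercept:.3f}")
```

Here the student recovers the teacher exactly because both are linear; with real models the student only approximates the teacher.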

In [43]:
student_models = predictor.distill(time_limit=3600)  # time limit in seconds, raised from 30 to 3600; real applications need a long limit
print(student_models)

preds_student = predictor.predict(test_data_nolabel, model=student_models[0])
print(f"predictions from {student_models[0]}:", list(preds_student)[:5])
predictor.leaderboard(test_data)

Distilling with teacher='WeightedEnsemble_L3_FULL', teacher_preds=soft, augment_method=spunge ...
SPUNGE: Augmenting training data with 100000 synthetic samples for distillation...
Distilling with each of these student models: ['LightGBM_DSTL', 'NeuralNetMXNet_DSTL', 'CatBoost_DSTL', 'RandomForest_DSTL', 'NeuralNetTorch_DSTL']
Fitting 5 L1 models ...
Fitting model: LightGBM_DSTL ... Training model for up to 3600.0s of the 3600.0s of remaining time.


[1000]	valid_set's l1: 0.0745904


	-0.0745	 = Validation score   (-mean_absolute_error)
	5.31s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: NeuralNetMXNet_DSTL ... Training model for up to 3594.47s of the 3594.47s of remaining time.
		Unable to import dependency mxnet. A quick tip is to install via `pip install mxnet --upgrade`, or `pip install mxnet_cu101 --upgrade`
Fitting model: CatBoost_DSTL ... Training model for up to 3594.37s of the 3594.37s of remaining time.
	-0.0747	 = Validation score   (-mean_absolute_error)
	345.8s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: RandomForest_DSTL ... Training model for up to 3248.31s of the 3248.31s of remaining time.
	-0.0804	 = Validation score   (-mean_absolute_error)
	17.91s	 = Training   runtime
	0.12s	 = Validation runtime
Fitting model: NeuralNetTorch_DSTL ... Training model for up to 3228.46s of the 3228.45s of remaining time.
	-0.0423	 = Validation score   (-mean_absolute_error)
	489.26s	 = Training   runtime
	0.03s	 = Valida

['LightGBM_DSTL', 'CatBoost_DSTL', 'RandomForest_DSTL', 'NeuralNetTorch_DSTL', 'WeightedEnsemble_L2_DSTL']
predictions from LightGBM_DSTL: [0.007465063594281673, 0.010630946606397629, 0.011915091425180435, 0.11816341429948807, 0.03702206164598465]

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,NeuralNetTorch_DSTL,-0.039985,-0.042334,0.813594,0.031952,489.264411,0.813594,0.031952,489.264411,1,True,12
1,WeightedEnsemble_L2_DSTL,-0.039985,-0.042334,0.823072,0.032447,489.424435,0.009478,0.000495,0.160024,2,True,13
2,CatBoost_DSTL,-0.072362,-0.074689,0.353899,0.022489,345.797047,0.353899,0.022489,345.797047,1,True,10
3,LightGBM_DSTL,-0.072499,-0.074506,0.390013,0.017576,5.31325,0.390013,0.017576,5.31325,1,True,9
4,LightGBM_BAG_L2_FULL,-0.07381,,0.089174,,1.16834,0.049518,,0.605196,2,True,7
5,WeightedEnsemble_L3_FULL,-0.07381,,0.095431,,1.179677,0.006257,,0.011337,3,True,8
6,LightGBM_BAG_L2,-0.073829,-0.072728,0.66067,0.401058,4.485044,0.34365,0.197298,2.247972,2,True,3
7,WeightedEnsemble_L3,-0.073829,-0.072728,0.667152,0.408151,4.496381,0.006482,0.007093,0.011337,3,True,4
8,LightGBM_BAG_L1_FULL,-0.073976,,0.039655,,0.563145,0.039655,,0.563145,1,True,5
9,WeightedEnsemble_L2_FULL,-0.073976,,0.04585,,0.57416,0.006194,,0.011015,2,True,6


### Faster presets or hyperparameters

Instead of trying to speed up a cumbersome trained model at prediction time, if you know inference latency or memory will be an issue at the outset, then you can adjust the training process accordingly to ensure `fit()` does not produce unwieldy models.

One option is to specify more lightweight presets:

In [44]:
presets = ['good_quality_faster_inference_only_refit', 'optimize_for_deployment']
predictor_light = TabularPredictor(
    label=label, 
    eval_metric=metric,
    problem_type="regression"
).fit(
    train_data, 
    presets=presets, 
    time_limit=3600
)

No path specified. Models will be saved in: "AutogluonModels/ag-20230702_222743/"
Preset alias specified: 'good_quality_faster_inference_only_refit' maps to 'good_quality'.
Presets specified: ['good_quality_faster_inference_only_refit', 'optimize_for_deployment']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 3600s
AutoGluon will save models to "AutogluonModels/ag-20230702_222743/"
AutoGluon Version:  0.7.0
Python Version:     3.9.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu May 4 15:21:22 UTC 2023
Train Data Rows:    474765
Train Data Columns: 10
Label Column: ClaimNb
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    157989.69 MB
	Train Data (Original)  Memory Usage: 141.07 MB (0.1% of available memory)
	Inferring data type of each feature based on co

	10.36s	 = Training   runtime
	1.64s	 = Validation runtime
Fitting model: RandomForestMSE_BAG_L2 ... Training model for up to 1570.59s of the 1570.52s of remaining time.
	-0.0713	 = Validation score   (-mean_absolute_error)
	58.81s	 = Training   runtime
	9.55s	 = Validation runtime
Fitting model: CatBoost_BAG_L2 ... Training model for up to 1501.45s of the 1501.38s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
	-0.0712	 = Validation score   (-mean_absolute_error)
	162.65s	 = Training   runtime
	0.47s	 = Validation runtime
Fitting model: ExtraTreesMSE_BAG_L2 ... Training model for up to 1331.89s of the 1331.82s of remaining time.
	-0.0718	 = Validation score   (-mean_absolute_error)
	10.12s	 = Training   runtime
	9.97s	 = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L2 ... Training model for up to 1310.87s of the 1310.8s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStra

Another option is to specify more lightweight hyperparameters:

In [45]:
# time_limit raised from 30 to 3600 seconds

predictor_light = TabularPredictor(
    label=label, 
    eval_metric=metric,
    problem_type="regression"
).fit(
    train_data, 
    hyperparameters='very_light', 
    time_limit=3600
)

No path specified. Models will be saved in: "AutogluonModels/ag-20230702_232500/"
Beginning AutoGluon training ... Time limit = 3600s
AutoGluon will save models to "AutogluonModels/ag-20230702_232500/"
AutoGluon Version:  0.7.0
Python Version:     3.9.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu May 4 15:21:22 UTC 2023
Train Data Rows:    474765
Train Data Columns: 10
Label Column: ClaimNb
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    155670.03 MB
	Train Data (Original)  Memory Usage: 141.07 MB (0.1% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatu

Here you can set `hyperparameters` to either `'light'`, `'very_light'`, or `'toy'` to obtain progressively smaller (but less accurate) models and predictors. Advanced users may instead try manually specifying particular models’ hyperparameters in order to make them faster/smaller.
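As a rough sketch of the manual route, `hyperparameters` can also be a dictionary keyed by model type. The values below are hypothetical and chosen only for illustration; note that when a dictionary is given, only the listed model types are trained.

```python
# Hypothetical values for illustration; tune them for your own data.
light_hyperparameters = {
    "GBM": {                     # LightGBM
        "num_boost_round": 100,  # fewer boosting rounds -> smaller, faster model
        "num_leaves": 16,        # smaller trees
    },
}

# Would be passed to fit() in place of the string presets, e.g.:
# TabularPredictor(label=label, eval_metric=metric).fit(
#     train_data, hyperparameters=light_hyperparameters, time_limit=3600
# )
print(light_hyperparameters)
```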

Finally, you may also exclude specific unwieldy models from being trained at all. Below we exclude models that tend to be slower (K Nearest Neighbors, Neural Network, models with custom larger-than-default hyperparameters):

In [46]:
# time_limit raised from 30 to 3600 seconds
excluded_model_types = ['KNN', 'NN', 'custom']

predictor_light = TabularPredictor(
    label=label, 
    eval_metric=metric,
    problem_type="regression"
).fit(
    train_data, 
    excluded_model_types=excluded_model_types, 
    time_limit=3600
)

No path specified. Models will be saved in: "AutogluonModels/ag-20230702_233814/"
Beginning AutoGluon training ... Time limit = 3600s
AutoGluon will save models to "AutogluonModels/ag-20230702_233814/"
AutoGluon Version:  0.7.0
Python Version:     3.9.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu May 4 15:21:22 UTC 2023
Train Data Rows:    474765
Train Data Columns: 10
Label Column: ClaimNb
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    146481.64 MB
	Train Data (Original)  Memory Usage: 141.07 MB (0.1% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatu

## If you encounter memory issues

To reduce memory usage during training, you may try each of the following strategies individually or combinations of them (these may harm accuracy):

- In `fit()`, set `num_bag_sets=1` (can also try values greater than 1 to harm accuracy less)

- In `fit()`, set `excluded_model_types=['KNN', 'XT', 'RF']` (or some subset of these models)

- Try different `presets` in `fit()`

- In `fit()`, set `hyperparameters='light'` or `hyperparameters='very_light'`

- Text fields in your table require substantial memory for N-gram featurization. To mitigate this in `fit()`, you can either:
    + (1) add `'ignore_text'` to your `presets` list (to ignore text features), or 
    + (2) specify the `feature_generator` argument, built for example as follows:

```
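import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from autogluon.features.generators import AutoMLPipelineFeatureGenerator

MAX_NGRAM = 1000  # cap on the number of N-gram features per text field

feature_generator = AutoMLPipelineFeatureGenerator(
    vectorizer=CountVectorizer(
        min_df=30,
        ngram_range=(1, 3),
        max_features=MAX_NGRAM,
        dtype=np.uint8
    )
)
# Then pass feature_generator=feature_generator to fit()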
```

where `MAX_NGRAM = 1000`, say (try various values under 10000 to reduce the number of N-gram features used to represent each text field).

In addition to reducing memory usage, many of the above strategies can also be used to reduce training times.

To reduce memory usage during inference:

- If trying to produce predictions for a large test dataset, break the test data into smaller chunks as demonstrated in the FAQ.
    + https://auto.gluon.ai/0.1.0/tutorials/tabular_prediction/tabular-faq.html#sec-faq

- If models have been previously persisted in memory but inference-speed is not a major concern, call `predictor.unpersist_models()`

- If models have been previously persisted in memory, bagging was used in `fit()`, and inference-speed is a concern: call `predictor.refit_full()` and use one of the refit-full models for prediction (ensure this is the only model persisted in memory)
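The chunking strategy from the first bullet can be sketched as follows. This is a generic illustration rather than the FAQ's code: `predict_in_chunks` and `dummy_predict` are hypothetical names, and a plain list stands in for the test data.

```python
def predict_in_chunks(predict_fn, rows, chunk_size):
    """Call predict_fn on consecutive slices of rows and concatenate the results."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    predictions = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        # Only one chunk's worth of inputs is processed at a time.
        predictions.extend(predict_fn(chunk))
    return predictions


def dummy_predict(chunk):
    """Stand-in for a trained model's predict function."""
    return [2 * value for value in chunk]


rows = list(range(10))
print(predict_in_chunks(dummy_predict, rows, chunk_size=4))
# -> [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

With a pandas DataFrame, the slice would be taken with `.iloc[start:start + chunk_size]` and `predictor.predict` would play the role of `predict_fn`.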

## If you encounter disk space issues

To reduce disk usage, you may try each of the following strategies individually or combinations of them:

- Make sure to delete all `predictor.path` folders from previous `fit()` runs
    + These can eat up your free space if you call `fit()` many times
    + If you didn’t specify `path`, AutoGluon still automatically saved its models to a folder called `"AutogluonModels/ag-[TIMESTAMP]"`, where TIMESTAMP records when `fit()` was called, so make sure to also delete these folders if you run low on free space

- Call `predictor.save_space()` to delete auxiliary files produced during `fit()`

- Call `predictor.delete_models(models_to_keep='best', dry_run=False)` if you only intend to use this predictor for inference going forward (will delete files required for non-prediction-related functionality like `fit_summary`)

- In `fit()`, you can add `'optimize_for_deployment'` to the presets list, which will automatically invoke the previous two strategies after training

Most of the above strategies to reduce memory usage will also reduce disk usage (but may harm accuracy).
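Before deleting old run folders, it can help to see how much space each one occupies. The sketch below only reports sizes and deletes nothing; `run_folder_sizes` is a hypothetical helper, and it assumes the default `AutogluonModels/ag-[TIMESTAMP]` layout described above.

```python
from pathlib import Path


def run_folder_sizes(root):
    """Return {folder_name: size_in_bytes} for each ag-* run folder under root."""
    sizes = {}
    root = Path(root)
    if not root.is_dir():
        return sizes  # nothing to report if the folder does not exist
    for run_dir in sorted(root.glob("ag-*")):
        if run_dir.is_dir():
            sizes[run_dir.name] = sum(
                f.stat().st_size for f in run_dir.rglob("*") if f.is_file()
            )
    return sizes


# Report only; remove folders yourself once you are sure they are unneeded.
for name, size in run_folder_sizes("AutogluonModels").items():
    print(f"{name}: {size / 1e6:.1f} MB")
```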

## References

The following paper describes how AutoGluon internally operates on tabular data:

Erickson et al. "AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data". arXiv, 2020: https://arxiv.org/abs/2003.06505