https://auto.gluon.ai/0.1.0/tutorials/tabular_prediction/tabular-quickstart.html

# Predicting Columns in a Table - Quick Start: porto_seguro & freMTPL2freq

Via a simple `fit()` call, AutoGluon can produce highly accurate models to predict the values in one column of a data table based on the rest of the columns’ values. Use AutoGluon with tabular data for both classification and regression problems. This tutorial demonstrates how to use AutoGluon to produce a classification model that predicts the binary `target` column of the Porto Seguro dataset, followed by a regression model for the `ClaimNb` column of the freMTPL2freq dataset.

AutoGluon was installed from a terminal using:

```
> pip3 install --user autogluon
```

Without bokeh installed, AutoGluon reports that summary plots cannot be created and suggests `pip install bokeh==2.0.1`, so it was installed as well:

```
> pip3 install --user bokeh==2.0.1
```

To start, import AutoGluon’s `TabularPredictor` and `TabularDataset` classes:

In [1]:
from autogluon.tabular import TabularDataset, TabularPredictor

import pandas as pd
import numpy as np

Load the training and test data from CSV files into AutoGluon `TabularDataset` objects. A `TabularDataset` is essentially equivalent to a pandas DataFrame, and the same methods can be applied to both.

In [2]:
# train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
train_data = TabularDataset("porto_train.csv")
test_data = TabularDataset("porto_test.csv")
# subsample_size = 500  # subsample subset of data for faster demo, try setting this to much larger values
# train_data = train_data.sample(n=subsample_size, random_state=0)

In [3]:
# train_data.shape

In [4]:
# test_data.shape

In [5]:
train_data.info()

<class 'autogluon.core.dataset.TabularDataset'>
RangeIndex: 476170 entries, 0 to 476169
Data columns (total 60 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   id              476170 non-null  int64  
 1   target          476170 non-null  int64  
 2   ps_ind_01       476170 non-null  int64  
 3   ps_ind_02_cat   476170 non-null  int64  
 4   ps_ind_03       476170 non-null  int64  
 5   ps_ind_04_cat   476170 non-null  int64  
 6   ps_ind_05_cat   476170 non-null  int64  
 7   ps_ind_06_bin   476170 non-null  int64  
 8   ps_ind_07_bin   476170 non-null  int64  
 9   ps_ind_08_bin   476170 non-null  int64  
 10  ps_ind_09_bin   476170 non-null  int64  
 11  ps_ind_10_bin   476170 non-null  int64  
 12  ps_ind_11_bin   476170 non-null  int64  
 13  ps_ind_12_bin   476170 non-null  int64  
 14  ps_ind_13_bin   476170 non-null  int64  
 15  ps_ind_14       476170 non-null  int64  
 16  ps_ind_15       476170 non-null  int64  
 17  

In [6]:
test_data.info()

<class 'autogluon.core.dataset.TabularDataset'>
RangeIndex: 119042 entries, 0 to 119041
Data columns (total 59 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   id              119042 non-null  int64  
 1   target          119042 non-null  int64  
 2   ps_ind_01       119042 non-null  int64  
 3   ps_ind_02_cat   119042 non-null  int64  
 4   ps_ind_03       119042 non-null  int64  
 5   ps_ind_04_cat   119042 non-null  int64  
 6   ps_ind_05_cat   119042 non-null  int64  
 7   ps_ind_06_bin   119042 non-null  int64  
 8   ps_ind_07_bin   119042 non-null  int64  
 9   ps_ind_08_bin   119042 non-null  int64  
 10  ps_ind_09_bin   119042 non-null  int64  
 11  ps_ind_10_bin   119042 non-null  int64  
 12  ps_ind_11_bin   119042 non-null  int64  
 13  ps_ind_12_bin   119042 non-null  int64  
 14  ps_ind_13_bin   119042 non-null  int64  
 15  ps_ind_14       119042 non-null  int64  
 16  ps_ind_15       119042 non-null  int64  
 17  

In [7]:
train_data.head().T

                 0     1     2     3     4
id             9.0  13.0  16.0  17.0  20.0
target         0.0   0.0   0.0   0.0   0.0
ps_ind_01      1.0   5.0   0.0   0.0   2.0
ps_ind_02_cat  1.0   4.0   1.0   2.0   1.0
ps_ind_03      7.0   9.0   2.0   0.0   3.0
ps_ind_04_cat  0.0   1.0   0.0   1.0   1.0
ps_ind_05_cat  0.0   0.0   0.0   0.0   0.0
ps_ind_06_bin  0.0   0.0   1.0   1.0   0.0
ps_ind_07_bin  0.0   0.0   0.0   0.0   1.0
ps_ind_08_bin  1.0   1.0   0.0   0.0   0.0


## Data Cleaning

In [8]:
train_data = train_data.replace(-1, np.nan)
test_data = test_data.replace(-1, np.nan)

In [9]:
# test_data.columns

In [10]:
train_data = train_data.drop(["id", "fold"], axis=1)
test_data = test_data.drop(["id"], axis=1)

In [11]:
train_data.shape

(476170, 58)

In [12]:
test_data.shape

(119042, 58)

In [13]:
cat_vars = [col for col in train_data.columns if 'cat' in col]
cat_vars

['ps_ind_02_cat',
 'ps_ind_04_cat',
 'ps_ind_05_cat',
 'ps_car_01_cat',
 'ps_car_02_cat',
 'ps_car_03_cat',
 'ps_car_04_cat',
 'ps_car_05_cat',
 'ps_car_06_cat',
 'ps_car_07_cat',
 'ps_car_08_cat',
 'ps_car_09_cat',
 'ps_car_10_cat',
 'ps_car_11_cat']

In [14]:
for col in cat_vars:
    test_data[col] = test_data[col].astype('category')
    
cat_vars = cat_vars + ["target"]

for col in cat_vars:
    train_data[col] = train_data[col].astype('category')

Note that we loaded the data from local CSV files, but `TabularDataset` also accepts a URL (as in the commented-out line above) if the file is stored in the cloud, e.g. in an AWS S3 bucket. Each row in the table `train_data` corresponds to a single training example. In this particular dataset, the columns are anonymized features whose prefixes (`ps_ind`, `ps_reg`, `ps_car`, `ps_calc`) group related variables, and values of -1 were treated as missing in the cleaning step above.

Let’s first use these features to predict the binary label recorded in the `target` column of this table.

In [15]:
label = "target"
print("Summary of class variable: \n", train_data[label].describe())

Summary of class variable: 
 count     476170
unique         2
top            0
freq      458811
Name: target, dtype: int64
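
The summary above implies a heavily imbalanced label. As a quick sanity check, the sketch below derives the class shares from the two counts printed in that output (`count` and `freq`); the variable names are ours and not part of AutoGluon.

```python
# Counts copied from the describe() output above.
total_rows = 476170      # `count`
majority_rows = 458811   # `freq` of the most common label (0)

majority_share = majority_rows / total_rows
minority_share = 1 - majority_share

print(f"label 0: {majority_share:.2%}, label 1: {minority_share:.2%}")
```

A model that always predicts 0 would therefore already be roughly 96% accurate on the training data, which is one reason to prefer a ranking metric such as `roc_auc` over accuracy here.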


Now use AutoGluon to train multiple models:

In [16]:
save_path = 'agModels-predictClass_porto_1'  # specifies folder to store trained models

predictor = TabularPredictor(label=label, path=save_path, eval_metric="roc_auc").fit(train_data)

	Consider setting `time_limit` to ensure training finishes within an expected duration or experiment with a small portion of `train_data` to identify an ideal `presets` and `hyperparameters` configuration.
Beginning AutoGluon training ...
AutoGluon will save models to "agModels-predictClass_porto_1/"
AutoGluon Version:  0.7.0
Python Version:     3.9.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu May 4 15:21:22 UTC 2023
Train Data Rows:    476170
Train Data Columns: 57
Label Column: target
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data

Next, prepare the separate test data (already loaded above) to demonstrate how to make predictions on new examples at inference time:

In [17]:
# test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')

y_test = test_data[label]  # values to predict

test_data_nolab = test_data.drop(columns=[label])  # delete label column to prove we're not cheating
test_data_nolab.head()

   ps_ind_01  ps_ind_02_cat  ps_ind_03  ps_ind_04_cat  ps_ind_05_cat  ps_ind_06_bin  ps_ind_07_bin  ps_ind_08_bin  ps_ind_09_bin  ps_ind_10_bin  ...  ps_calc_11  ps_calc_12  ps_calc_13  ps_calc_14  ps_calc_15_bin  ps_calc_16_bin  ps_calc_17_bin  ps_calc_18_bin  ps_calc_19_bin  ps_calc_20_bin
0          2            2.0          5            1.0            0.0              0              1              0              0              0  ...           9           1           5           8               0               1               1               0               0               1
1          5            1.0          4            0.0            0.0              0              0              0              1              0  ...           4           2           0           9               0               1               0               1               1               1
2          5            1.0         11            0.0            0.0              0              0              0              1              0  ...           4           1           3           9               0               0               0               0               1               0
3          5            1.0          8            0.0            0.0              1              0              0              0              0  ...           3           1           6           5               0               0               0               1               0               0
4          0            1.0          2            0.0            0.0              1              0              0              0              0  ...           7           2           2           4               0               1               0               0               1               0


We use our trained models to make predictions on the new data and then evaluate performance:

In [19]:
predictor = TabularPredictor.load(save_path)  # unnecessary, just demonstrates how to load previously-trained predictor from file

# y_pred = predictor.predict(test_data_nolab)
y_pred = predictor.predict_proba(test_data_nolab)
print("Predictions:  \n", y_pred)

perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

Predictions:  
                0         1
0       0.951413  0.048587
1       0.953420  0.046580
2       0.972430  0.027570
3       0.982235  0.017765
4       0.978142  0.021858
...          ...       ...
119037  0.962012  0.037988
119038  0.979733  0.020267
119039  0.966375  0.033625
119040  0.971194  0.028806
119041  0.975856  0.024144

[119042 rows x 2 columns]


  _warn_prf(average, modifier, msg_start, len(result))
Evaluation: roc_auc on test data: 0.6314907429408757
Evaluations on test data:
{
    "roc_auc": 0.6314907429408757,
    "accuracy": 0.9635842811780716,
    "balanced_accuracy": 0.5,
    "mcc": 0.0,
    "f1": 0.0,
    "precision": 0.0,
    "recall": 0.0
}
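
A `balanced_accuracy` of 0.5 together with zero `f1`, `precision`, and `recall` is what one would see if every row were assigned to class 0. Here is a minimal sketch of why, assuming hard labels are obtained by picking the more probable class (a 0.5 cutoff in the binary case). It uses only the ten class-1 probabilities printed above; `to_labels` and the lower cutoff of 0.03 are hypothetical illustrations.

```python
# Class-1 probabilities copied from the printed predictions above.
class1_probs = [0.048587, 0.046580, 0.027570, 0.017765, 0.021858,
                0.037988, 0.020267, 0.033625, 0.028806, 0.024144]

def to_labels(probs, threshold=0.5):
    """Turn class-1 probabilities into hard 0/1 labels at a cutoff."""
    return [int(p >= threshold) for p in probs]

default_labels = to_labels(class1_probs)                  # 0.5 cutoff
lowered_labels = to_labels(class1_probs, threshold=0.03)  # arbitrary lower cutoff

print(sum(default_labels), "positives at 0.5")
print(sum(lowered_labels), "positives at 0.03")
```

Whether a lower cutoff is appropriate depends on the application. The point is only that threshold-based metrics can collapse even when the ranking of the rows, which `roc_auc` measures, still carries signal.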


Now you’re ready to try AutoGluon on your own tabular datasets. As long as they’re stored in a popular format like CSV, you should be able to achieve strong predictive performance with just 2 lines of code:

```
from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(label=<variable-name>).fit(train_data=<file-name>)
```

Note: This simple call to `fit()` is intended for your first prototype model. In a subsequent section, we’ll demonstrate how to maximize predictive performance by additionally specifying the `presets` argument of `fit()` and the `eval_metric` argument of `TabularPredictor`.

## Description of `fit()`:

Here we discuss what happened during `fit()`.

Since there are only two possible values of the `target` variable, this was a binary classification problem. The default performance metric for such problems is accuracy, but here we specified `eval_metric="roc_auc"`. AutoGluon automatically infers the problem type as well as the type of each feature (i.e., which columns contain continuous numbers vs. discrete categories). AutoGluon can also automatically handle common issues like missing data and rescaling feature values.

We did not specify separate validation data, so AutoGluon automatically chooses a random training/validation split of the data. The data used for validation is separated from the training data and is used to determine the models and hyperparameter values that produce the best results. Rather than just a single model, AutoGluon trains multiple models and ensembles them together to ensure superior predictive performance.
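
To make the idea of a random training/validation split concrete, here is a minimal pure-Python sketch. The helper `holdout_split`, the 10% holdout fraction, and the seed are assumptions chosen for illustration; AutoGluon picks its own split size and strategy internally.

```python
import random

def holdout_split(n_rows, holdout_frac=0.1, seed=0):
    """Return (train_idx, val_idx): a random, disjoint partition of row indices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # reproducible shuffle
    n_val = int(n_rows * holdout_frac)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = holdout_split(n_rows=1000)
print(len(train_idx), len(val_idx))
```

Models are fit on the rows in `train_idx` only, and the rows in `val_idx` are used to compare models and settings.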

By default, AutoGluon tries to fit various types of models including neural networks and tree ensembles. Each type of model has various hyperparameters, which the user would traditionally have to specify. AutoGluon automates this process.

AutoGluon automatically and iteratively tests values for hyperparameters to produce the best performance on the validation data. This involves repeatedly training models under different hyperparameter settings and evaluating their performance. This process can be computationally intensive, so `fit()` can parallelize it across multiple threads (and machines if distributed resources are available). To control runtimes, you can specify various arguments in `fit()` as demonstrated in the subsequent In-Depth tutorial.

For tabular problems, `fit()` returns a Predictor object. For classification, you can easily output predicted class probabilities instead of predicted classes:

In [20]:
pred_probs = predictor.predict_proba(test_data_nolab)
pred_probs.head(5)

          0         1
0  0.951413  0.048587
1  0.953420  0.046580
2  0.972430  0.027570
3  0.982235  0.017765
4  0.978142  0.021858


Besides inference, this object can also summarize what happened during fit.

In [21]:
results = predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2   0.611518       0.311911  349.463983                0.001177           1.372172            2       True         14
1              CatBoost   0.602004       0.026258   47.678899                0.026258          47.678899            1       True          7
2        ExtraTreesEntr   0.601840       0.170448    9.796823                0.170448           9.796823            1       True          9
3               XGBoost   0.601470       0.042089    7.551451                0.042089           7.551451            1       True         11
4        NeuralNetTorch   0.594228       0.071939  283.064638                0.071939         283.064638            1       True         12
5            LightGBMXT   0.591881       0.012810    3.881337                0.012810           3.

From this summary, we can see that AutoGluon trained many different types of models as well as an ensemble of the best-performing models. The summary also describes the actual models that were trained during fit and how well each model performed on the held-out validation data. We can view what properties AutoGluon automatically inferred about our prediction task:

In [22]:
print("AutoGluon infers problem type is: ", predictor.problem_type)
print("AutoGluon identified the following types of features:")
print(predictor.feature_metadata)

AutoGluon infers problem type is:  binary
AutoGluon identified the following types of features:
('category', [])  : 13 | ['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_02_cat', ...]
('float', [])     : 11 | ['ps_reg_01', 'ps_reg_02', 'ps_reg_03', 'ps_car_11', 'ps_car_12', ...]
('int', [])       : 15 | ['ps_ind_01', 'ps_ind_03', 'ps_ind_14', 'ps_ind_15', 'ps_calc_04', ...]
('int', ['bool']) : 18 | ['ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin', 'ps_ind_09_bin', 'ps_ind_10_bin', ...]


AutoGluon correctly recognized our prediction problem to be a binary classification task and decided that variables such as `ps_ind_01` should be represented as integers, whereas variables such as `ps_ind_02_cat` should be represented as categorical objects. The `feature_metadata` attribute allows you to see the inferred data type of each predictive variable after preprocessing (this is its raw dtype; some features may also be associated with additional special dtypes if produced via feature-engineering, e.g. numerical representations of a datetime/text column).
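
One way to cross-check the inferred types is against the dataset's own naming convention, where the `_cat` and `_bin` suffixes mark categorical and binary columns (the same convention the `cat_vars` list relied on earlier). A small sketch; `kind_from_name` is a hypothetical helper, not an AutoGluon function:

```python
def kind_from_name(column):
    """Guess a feature kind from the dataset's column-name suffix."""
    if column.endswith("_cat"):
        return "categorical"
    if column.endswith("_bin"):
        return "binary"
    return "numeric"

# A few column names taken from the outputs above.
columns = ["ps_ind_01", "ps_ind_02_cat", "ps_ind_06_bin", "ps_reg_01", "ps_car_01_cat"]
kinds = {col: kind_from_name(col) for col in columns}
print(kinds)
```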

We can evaluate the performance of each individual trained model on our (labeled) test data:

In [23]:
predictor.leaderboard(test_data, silent=True)

                 model  score_test  score_val  pred_time_test  pred_time_val    fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0             CatBoost    0.635058   0.602004        0.274688       0.026258   47.678899                 0.274688                0.026258          47.678899            1       True          7
1              XGBoost    0.634824   0.601470        0.951324       0.042089    7.551451                 0.951324                0.042089           7.551451            1       True         11
2  WeightedEnsemble_L2    0.631491   0.611518        3.624071       0.311911  349.463983                 0.008312                0.001177           1.372172            2       True         14
3      NeuralNetFastAI    0.625603   0.581631        1.454859       0.071112  468.257171                 1.454859                0.071112         468.257171            1       True         10
4           LightGBMXT    0.621983   0.591881        0.122447       0.012810    3.881337                 0.122447                0.012810           3.881337            1       True          3
5        LightGBMLarge    0.621455   0.587142        0.119742       0.014340    3.021576                 0.119742                0.014340           3.021576            1       True         13
6       NeuralNetTorch    0.620740   0.594228        1.273045       0.071939  283.064638                 1.273045                0.071939         283.064638            1       True         12
7             LightGBM    0.617411   0.581836        0.105527       0.011281    1.789302                 0.105527                0.011281           1.789302            1       True          4
8       ExtraTreesEntr    0.612614   0.601840        1.116702       0.170448    9.796823                 1.116702                0.170448           9.796823            1       True          9
9       ExtraTreesGini    0.609756   0.585093        1.302161       0.177928    9.282799                 1.302161                0.177928           9.282799            1       True          8
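
`WeightedEnsemble_L2` sits at stack level 2 and combines the predictions of the level-1 models. Conceptually this is a weighted average of their predicted probabilities. The sketch below illustrates that idea with made-up probabilities and weights; `weighted_average` is a hypothetical helper and not AutoGluon's actual ensemble-selection code.

```python
def weighted_average(model_probs, weights):
    """Blend per-model class-1 probabilities row by row using model weights."""
    total = sum(weights.values())
    n_rows = len(next(iter(model_probs.values())))
    return [
        sum(weights[name] * model_probs[name][i] for name in weights) / total
        for i in range(n_rows)
    ]

# Made-up probabilities for three rows and made-up weights, for illustration only.
model_probs = {
    "CatBoost": [0.04, 0.10, 0.02],
    "XGBoost":  [0.06, 0.08, 0.02],
}
weights = {"CatBoost": 0.75, "XGBoost": 0.25}

blended = weighted_average(model_probs, weights)
print(blended)
```

Note that in the leaderboard above the ensemble has the best validation score (0.611518) but ranks third on the test data (0.631491), behind CatBoost and XGBoost.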


When we call `predict()`, AutoGluon automatically predicts with the model that displayed the best performance on validation data (i.e. the weighted-ensemble). We can instead specify which model to use for predictions like this:

In [24]:
predictor.predict(test_data, model='LightGBM')

0         0
1         0
2         0
3         0
4         0
         ..
119037    0
119038    0
119039    0
119040    0
119041    0
Name: target, Length: 119042, dtype: object

The scores of predictive performance above were based on the `roc_auc` metric that we passed as `eval_metric` (without it, AutoGluon defaults to accuracy for binary classification). Performance in certain applications may be measured by different metrics than the ones AutoGluon optimizes for by default. If you know the metric that counts in your application, you should specify it as demonstrated in the next section.
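
`roc_auc` depends only on how the predicted probabilities rank the rows, not on any cutoff, which is how a model can score above 0.5 on `roc_auc` while never predicting the positive class. Below is a minimal pure-Python sketch of the pairwise definition: the share of positive/negative pairs in which the positive row gets the higher score, with ties counting half. The data are made up and `pairwise_auc` is a hypothetical helper, not an AutoGluon or scikit-learn function.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and class-1 probabilities; every score is far below 0.5.
labels = [0, 0, 1, 0, 1, 0]
scores = [0.02, 0.03, 0.05, 0.04, 0.03, 0.01]

auc = pairwise_auc(labels, scores)
print(auc)
```

This quadratic-time version is only meant to show the definition; library implementations compute the same quantity more efficiently.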

## Maximizing predictive performance

Note: You should not call `fit()` with entirely default arguments if you are benchmarking AutoGluon-Tabular or hoping to maximize its accuracy. To get the best predictive accuracy with AutoGluon, you should generally use it like this:

In [None]:
time_limit = 60*60  # for quick demonstration only, you should set this to longest time you are willing to wait (in seconds)
metric = "roc_auc"  # specify your evaluation metric here
predictor = TabularPredictor(label, eval_metric=metric).fit(train_data, time_limit=time_limit, presets='best_quality')
predictor.leaderboard(test_data, silent=True)

No path specified. Models will be saved in: "AutogluonModels/ag-20230703_223843/"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=8, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 3600s
AutoGluon will save models to "AutogluonModels/ag-20230703_223843/"
AutoGluon Version:  0.7.0
Python Version:     3.9.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu May 4 15:21:22 UTC 2023
Train Data Rows:    476170
Train Data Columns: 57
Label Column: target
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Gener

This command implements the following strategy to maximize accuracy:

- Specify the argument `presets='best_quality'`, which allows AutoGluon to automatically construct powerful model ensembles based on stacking/bagging, and will greatly improve the resulting predictions if granted sufficient training time
    + For details, see https://arxiv.org/abs/2003.06505
    + The default value of `presets` is `'medium_quality_faster_train'`, which produces less accurate models but facilitates faster prototyping
    + With `presets`, you can flexibly prioritize predictive accuracy vs. training/inference speed
    + For example, if you care less about predictive performance and want to quickly deploy a basic model, consider using: `presets=['good_quality_faster_inference_only_refit', 'optimize_for_deployment']`.

- Provide the `eval_metric` if you know what metric will be used to evaluate predictions in your application
    + Some other non-default metrics you might use include things like: `'f1'` (for binary classification), `'roc_auc'` (for binary classification), `'log_loss'` (for classification), `'mean_absolute_error'` (for regression), `'median_absolute_error'` (for regression)
    + You can also define your own custom metric function, see examples in the folder: `autogluon/core/metrics/`

- Include all your data in `train_data` and do not provide `tuning_data`
    + AutoGluon will split the data more intelligently to fit its needs

- Do not specify the `hyperparameter_tune_kwargs` argument (counterintuitively, hyperparameter tuning is not the best way to spend a limited training time budget, as model ensembling is often superior)
    + We recommend you only use `hyperparameter_tune_kwargs` if your goal is to deploy a single model rather than an ensemble

- Do not specify the `hyperparameters` argument (allow AutoGluon to adaptively select which models/hyperparameters to use)

- Set `time_limit` to the longest amount of time (in seconds) that you are willing to wait
    + AutoGluon’s predictive performance improves the longer `fit()` is allowed to run
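
The training log above reports `num_bag_folds=8` for the `best_quality` preset. As a rough sketch of the k-fold partitioning behind bagging, the code below splits row indices so that every row is held out exactly once. `kfold_indices` is a hypothetical helper that uses plain interleaved folds; AutoGluon's own splitting may differ (for example by shuffling or stratifying).

```python
def kfold_indices(n_rows, n_folds=8):
    """Yield (train_idx, holdout_idx) pairs so every row is held out exactly once."""
    indices = list(range(n_rows))
    for fold in range(n_folds):
        holdout = indices[fold::n_folds]   # every n_folds-th row, offset by fold
        holdout_set = set(holdout)
        train = [i for i in indices if i not in holdout_set]
        yield train, holdout

folds = list(kfold_indices(n_rows=20, n_folds=8))
for train, holdout in folds:
    print(len(train), holdout)
```

Each fold trains one copy of a model on `train` and predicts on `holdout`, so the combined holdout predictions cover the whole training set without any model predicting on rows it was fit on.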

## Regression (predicting numeric table columns):

To demonstrate that `fit()` can also automatically handle regression tasks, we now try to predict the numeric `ClaimNb` variable of the freMTPL2freq dataset based on the other features:

In [None]:
train_data = TabularDataset("freMTPL2freq_dataset_train.csv")
test_data = TabularDataset("freMTPL2freq_dataset_test.csv")

In [None]:
train_data.info()

In [None]:
test_data.info()

In [None]:
target_column = 'ClaimNb'
print("Summary of target variable: \n", train_data[target_column].describe())

In [None]:
# IDpol: The policy ID, so drop it
train_data = train_data.drop(["IDpol"], axis=1)
test_data = test_data.drop(["IDpol"], axis=1)

We again call `fit()`, imposing a time-limit this time (in seconds), and also demonstrate a shorthand method to evaluate the resulting model on the test data (which contain labels):

In [None]:
# problem_type="regression" is set explicitly so that AutoGluon does not infer
#   a multiclass problem from the count-valued label;
# time_limit is set to 1 hour (3600 seconds)
predictor_ClaimNb = TabularPredictor(
    label=target_column, 
    path="agModels-predict_ClaimNb_1", 
    problem_type="regression",
    eval_metric="mean_absolute_error",
).fit(train_data, time_limit=3600) 

In [None]:
predictor_ClaimNb.evaluate(test_data)

In [None]:
predictor_ClaimNb.leaderboard(test_data, silent=True)

Note that we explicitly set `problem_type="regression"` here: without it, a count-valued label such as `ClaimNb` could be inferred as a multiclass problem. When AutoGluon infers a regression problem on its own, it reports RMSE as the default performance metric. To specify a particular evaluation metric other than the default, set the `eval_metric` argument of `TabularPredictor` and AutoGluon will tailor its models to optimize your metric, e.g.,

```
eval_metric='mean_absolute_error'
```

For evaluation metrics where higher values are worse (like RMSE), AutoGluon may sometimes flip their sign and print them as negative values during training (as it internally assumes higher values are better).
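
The "higher is better" convention can be sketched in a few lines: compute the error, then negate it so that a larger score always means a better model. The data below are made up, and `rmse`, `mae`, and `as_score` are hypothetical helpers written for this illustration rather than AutoGluon functions.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def as_score(error):
    """Flip the sign of an error so that higher always means better."""
    return -error

y_true = [0, 1, 0, 2]          # made-up claim counts
y_pred = [0.0, 0.5, 0.0, 1.0]  # made-up predictions

rmse_score = as_score(rmse(y_true, y_pred))
mae_score = as_score(mae(y_true, y_pred))
print(rmse_score, mae_score)
```

With this convention, a score of -0.3 is better than a score of -0.5, which is how negative values in a training log or leaderboard should be read.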

**Data Formats:** AutoGluon can currently operate on data tables already loaded into Python as pandas DataFrames, or those stored in files of CSV format or Parquet format. If your data live in multiple tables, you will first need to join them into a single table whose rows correspond to statistically independent observations (datapoints) and columns correspond to different features (aka. variables/covariates).

Refer to the `TabularPredictor` documentation to see all of the available methods and options:

https://auto.gluon.ai/0.1.0/api/autogluon.predictor.html