https://auto.gluon.ai/0.1.0/tutorials/tabular_prediction/tabular-quickstart.html

# Predicting Columns in a Table - Quick Start: porto_seguro & freMTPL2freq

Via a simple `fit()` call, AutoGluon can produce highly accurate models that predict the values in one column of a data table from the values in the remaining columns. AutoGluon handles both classification and regression problems on tabular data. This notebook adapts the quick-start tutorial to a regression task: predicting the number of claims (`ClaimNb`) per policy in the freMTPL2freq dataset.

AutoGluon was installed from the terminal with:

```
> pip3 install --user autogluon
```

AutoGluon warns that its summary plots cannot be created unless bokeh is installed (`To see plots, please do: "pip install bokeh==2.0.1"`), so that version was installed as well:

```
> pip3 install --user bokeh==2.0.1
```

To start, import AutoGluon’s `TabularPredictor` and `TabularDataset` classes:

In [1]:
from autogluon.tabular import TabularDataset, TabularPredictor

import pandas as pd
import numpy as np

## Regression (predicting numeric table columns):

To demonstrate that `fit()` handles regression tasks, we predict the numeric `ClaimNb` column (the number of claims per policy) from the other features in the table.

Load the training and test data from CSV files into AutoGluon `TabularDataset` objects. A `TabularDataset` is essentially equivalent to a pandas DataFrame, and the same methods can be applied to both:

In [2]:
train_data = TabularDataset("freMTPL2freq_dataset_train.csv")
test_data = TabularDataset("freMTPL2freq_dataset_test.csv")

In [3]:
train_data.info()

<class 'autogluon.core.dataset.TabularDataset'>
RangeIndex: 474765 entries, 0 to 474764
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   IDpol       474765 non-null  int64  
 1   ClaimNb     474765 non-null  int64  
 2   Exposure    474765 non-null  float64
 3   VehPower    474765 non-null  int64  
 4   VehAge      474765 non-null  int64  
 5   DrivAge     474765 non-null  int64  
 6   BonusMalus  474765 non-null  int64  
 7   VehBrand    474765 non-null  object 
 8   VehGas      474765 non-null  object 
 9   Area        474765 non-null  object 
 10  Density     474765 non-null  int64  
 11  Region      474765 non-null  object 
dtypes: float64(1), int64(7), object(4)
memory usage: 43.5+ MB


In [4]:
test_data.info()

<class 'autogluon.core.dataset.TabularDataset'>
RangeIndex: 203226 entries, 0 to 203225
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   IDpol       203226 non-null  float64
 1   ClaimNb     203226 non-null  int64  
 2   Exposure    203226 non-null  float64
 3   VehPower    203226 non-null  int64  
 4   VehAge      203226 non-null  int64  
 5   DrivAge     203226 non-null  int64  
 6   BonusMalus  203226 non-null  int64  
 7   VehBrand    203226 non-null  object 
 8   VehGas      203226 non-null  object 
 9   Area        203226 non-null  object 
 10  Density     203226 non-null  int64  
 11  Region      203226 non-null  object 
dtypes: float64(2), int64(6), object(4)
memory usage: 18.6+ MB


In [5]:
target_column = 'ClaimNb'
print("Summary of target variable: \n", train_data[target_column].describe())

Summary of target variable: 
 count    474765.000000
mean          0.038583
std           0.205458
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max          11.000000
Name: ClaimNb, dtype: float64


In [6]:
# IDpol is the policy ID, not a feature, so drop it from both tables
train_data = train_data.drop(["IDpol"], axis=1)
test_data = test_data.drop(["IDpol"], axis=1)

We call `fit()` with a time limit (in seconds), then use `evaluate()` as a shorthand for scoring the resulting model on the test data (which contain labels):

In [7]:
# Specify problem_type="regression" explicitly so that AutoGluon does not
# infer a multi-class problem from the integer-valued target,
# and raise time_limit to 1 hour (3600 seconds).
predictor_ClaimNb = TabularPredictor(
    label=target_column,
    path="agModels-predict_ClaimNb_1",
    problem_type="regression",
    eval_metric="mean_absolute_error",
).fit(train_data, time_limit=3600)

Beginning AutoGluon training ... Time limit = 3600s
AutoGluon will save models to "agModels-predict_ClaimNb_1/"
AutoGluon Version:  0.7.0
Python Version:     3.9.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu May 4 15:21:22 UTC 2023
Train Data Rows:    474765
Train Data Columns: 10
Label Column: ClaimNb
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    222563.59 MB
	Train Data (Original)  Memory Usage: 141.07 MB (0.1% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting Catego

In [8]:
predictor_ClaimNb.evaluate(test_data)

Evaluation: mean_absolute_error on test data: -0.039985119318079465
	Note: Scores are always higher_is_better. This metric score can be multiplied by -1 to get the metric value.
Evaluations on test data:
{
    "mean_absolute_error": -0.039985119318079465,
    "root_mean_squared_error": -0.21486121610324158,
    "mean_squared_error": -0.046165342185363875,
    "r2": -0.03587427964308243,
    "pearsonr": 0.010003043367767372,
    "median_absolute_error": -1.6990729027714646e-12
}


{'mean_absolute_error': -0.039985119318079465,
 'root_mean_squared_error': -0.21486121610324158,
 'mean_squared_error': -0.046165342185363875,
 'r2': -0.03587427964308243,
 'pearsonr': 0.010003043367767372,
 'median_absolute_error': -1.6990729027714646e-12}
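One way to read these numbers, offered as a sketch rather than a diagnosis: the target is zero for at least 75% of the training rows, so a model that predicts 0 everywhere would already have a median absolute error of 0 and an MAE equal to the target mean. The code below computes that all-zeros baseline on a made-up toy sample, then applies the closed form R² = -(mean/std)² to the training summary printed earlier. It assumes the test labels are distributed like the training labels and ignores the small difference between sample and population standard deviation.

```python
from statistics import fmean


def zero_baseline(y):
    """Return (MAE, R^2) of a model that predicts 0 for every row of y."""
    mean = fmean(y)
    mae = fmean([abs(v) for v in y])            # equals the mean when all y >= 0
    ss_res = sum(v * v for v in y)              # squared residuals of predicting 0
    ss_tot = sum((v - mean) ** 2 for v in y)    # squared residuals of predicting the mean
    return mae, 1.0 - ss_res / ss_tot


# Toy claim counts (made up for illustration): mostly zeros, a few claims.
toy = [0] * 96 + [1, 1, 1, 2]
toy_mae, toy_r2 = zero_baseline(toy)

# Closed form applied to the training summary of ClaimNb shown earlier.
train_mean, train_std = 0.038583, 0.205458
approx_mae = train_mean
approx_r2 = -(train_mean / train_std) ** 2
print(f"zero-baseline MAE ~ {approx_mae:.4f}, R^2 ~ {approx_r2:.4f}")
```

Under those assumptions the baseline lands at an MAE of about 0.039 and an R² of about -0.035, close to the reported test scores. That suggests the MAE-optimized model predicts values near zero for most policies.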

In [9]:
predictor_ClaimNb.leaderboard(test_data, silent=True)

| | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NeuralNetTorch | -0.039985 | -0.042334 | 0.835173 | 0.042401 | 408.320539 | 0.835173 | 0.042401 | 408.320539 | 1 | True | 10 |
| 1 | WeightedEnsemble_L2 | -0.039985 | -0.042334 | 0.843122 | 0.043046 | 408.731905 | 0.007949 | 0.000645 | 0.411366 | 2 | True | 12 |
| 2 | NeuralNetFastAI | -0.067926 | -0.070266 | 1.844203 | 0.054441 | 335.601196 | 1.844203 | 0.054441 | 335.601196 | 1 | True | 8 |
| 3 | KNeighborsDist | -0.072297 | -0.076406 | 8.450431 | 0.124074 | 1.004383 | 8.450431 | 0.124074 | 1.004383 | 1 | True | 2 |
| 4 | LightGBM | -0.072425 | -0.074476 | 0.217564 | 0.011429 | 2.683523 | 0.217564 | 0.011429 | 2.683523 | 1 | True | 4 |
| 5 | LightGBMLarge | -0.072524 | -0.074557 | 0.167122 | 0.010945 | 3.549958 | 0.167122 | 0.010945 | 3.549958 | 1 | True | 11 |
| 6 | CatBoost | -0.072578 | -0.074781 | 0.185585 | 0.012086 | 44.571321 | 0.185585 | 0.012086 | 44.571321 | 1 | True | 6 |
| 7 | XGBoost | -0.07258 | -0.074668 | 0.729292 | 0.025859 | 4.361375 | 0.729292 | 0.025859 | 4.361375 | 1 | True | 9 |
| 8 | LightGBMXT | -0.072947 | -0.075217 | 0.343346 | 0.022357 | 5.776668 | 0.343346 | 0.022357 | 5.776668 | 1 | True | 3 |
| 9 | KNeighborsUnif | -0.073036 | -0.076874 | 5.428477 | 0.168389 | 4.019855 | 5.428477 | 0.168389 | 4.019855 | 1 | True | 1 |
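The leaderboard reports each timing twice: a total that includes the models a stacked model depends on, and a `_marginal` value for that model alone. The sketch below checks this on the top two rows, copied from the leaderboard above. It assumes that `WeightedEnsemble_L2` uses only `NeuralNetTorch` as its base model, which the identical scores suggest but the output does not state.

```python
import csv
import io

# Two rows copied from the leaderboard above (subset of columns).
leaderboard_csv = """model,pred_time_test,fit_time,pred_time_test_marginal,fit_time_marginal,stack_level
NeuralNetTorch,0.835173,408.320539,0.835173,408.320539,1
WeightedEnsemble_L2,0.843122,408.731905,0.007949,0.411366,2
"""

rows = {row["model"]: row for row in csv.DictReader(io.StringIO(leaderboard_csv))}
base = rows["NeuralNetTorch"]
ensemble = rows["WeightedEnsemble_L2"]

# Total time of the ensemble = total time of its base model + its own marginal time.
fit_total = float(base["fit_time"]) + float(ensemble["fit_time_marginal"])
pred_total = float(base["pred_time_test"]) + float(ensemble["pred_time_test_marginal"])

print(f"reconstructed fit_time:       {fit_total:.6f}")
print(f"reconstructed pred_time_test: {pred_total:.6f}")
```

Both reconstructed totals match the ensemble's `fit_time` and `pred_time_test` columns, so here the ensemble adds well under a second on top of the neural network.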


Here we told AutoGluon explicitly that this is a regression problem. When `problem_type` is omitted, AutoGluon infers it from the label column and reports the appropriate default performance metric (RMSE for regression). To optimize an evaluation metric other than the default, set the `eval_metric` argument of `TabularPredictor()` and AutoGluon will tailor its models to that metric, e.g.,

```
eval_metric='mean_absolute_error'
```

For evaluation metrics where lower values are better (like RMSE or MAE), AutoGluon flips their sign and reports them as negative values, because it internally assumes that higher values are better.
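As a small illustration of this sign convention, the sketch below converts the scores returned by `evaluate()` above back into ordinary metric values. The set `ERROR_METRICS` is a hand-written assumption for this example, not part of the AutoGluon API; it lists the metrics in this output where lower is better.

```python
import math

# Scores copied from the evaluate() output above (higher is better).
scores = {
    "mean_absolute_error": -0.039985119318079465,
    "root_mean_squared_error": -0.21486121610324158,
    "mean_squared_error": -0.046165342185363875,
    "r2": -0.03587427964308243,
    "pearsonr": 0.010003043367767372,
    "median_absolute_error": -1.6990729027714646e-12,
}

# Assumption for this example: the error metrics whose sign was flipped.
ERROR_METRICS = {
    "mean_absolute_error",
    "root_mean_squared_error",
    "mean_squared_error",
    "median_absolute_error",
}

# Multiply error metrics by -1; leave r2 and pearsonr untouched.
metric_values = {
    name: (-score if name in ERROR_METRICS else score)
    for name, score in scores.items()
}

# Sanity check: the recovered RMSE squared should equal the recovered MSE.
rmse = metric_values["root_mean_squared_error"]
mse = metric_values["mean_squared_error"]
consistent = math.isclose(rmse ** 2, mse, rel_tol=1e-9)

for name, value in metric_values.items():
    print(f"{name}: {value}")
```

Note that `r2` stays negative after the conversion: it is a genuine negative R², not a flipped error.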

**Data Formats:** AutoGluon can currently operate on data tables already loaded into Python as pandas DataFrames, or on tables stored in CSV or Parquet files. If your data live in multiple tables, you will first need to join them into a single table whose rows correspond to statistically independent observations (datapoints) and whose columns correspond to different features (also known as variables or covariates).
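The join step can be sketched with pandas. The two small tables below are hypothetical: the values are made up, and only the column names are borrowed from this dataset. The sketch assumes one row per policy in each table, which `validate="one_to_one"` enforces by raising an error if a key is duplicated.

```python
import pandas as pd

# Hypothetical policy-level table: one row per policy, including the label.
policies = pd.DataFrame(
    {"IDpol": [1, 2, 3], "ClaimNb": [0, 1, 0], "DrivAge": [45, 31, 60]}
)

# Hypothetical vehicle-level table keyed by the same policy ID.
vehicles = pd.DataFrame(
    {"IDpol": [1, 2, 3], "VehPower": [5, 7, 6], "VehAge": [2, 10, 4]}
)

# Join into a single table with one row per observation (policy).
table = policies.merge(vehicles, on="IDpol", how="left", validate="one_to_one")

# The join key is only an identifier, so drop it before training.
table = table.drop(columns=["IDpol"])
print(table)
```

The resulting DataFrame has one row per policy and one column per feature plus the label, which is the shape `fit()` expects.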

Refer to the TabularPredictor documentation for all of the available methods and options:

https://auto.gluon.ai/0.1.0/api/autogluon.predictor.html