# Configuring AutoML (Model Selection and Hyperparameter Optimization)

---

This notebook is part of https://github.com/risc-mi/catabra.

This short example demonstrates how to configure model selection and hyperparameter optimization when training prediction models in CaTabRa's main data analysis workflow (in particular, the function `analyze()`). The following topics are covered:
* [inspecting the default configuration](#Inspect-Default-Configuration),
* [changing the configuration](#Change-Configuration), and
* [grouped splitting](#Grouped-Splitting).

Familiarity with CaTabRa's main data analysis workflow is assumed. A step-by-step introduction can be found in [Workflow.ipynb](https://github.com/risc-mi/catabra/examples/Workflow.ipynb).

## Inspect Default Configuration

Let's start by having a look at CaTabRa's default configuration:

In [1]:
from catabra.core import config
config.DEFAULT_CONFIG

{'automl': 'auto-sklearn',
 'ensemble_size': 10,
 'ensemble_nbest': 10,
 'memory_limit': 3072,
 'time_limit': 1,
 'jobs': 1,
 'copy_analysis_data': False,
 'copy_evaluation_data': False,
 'static_plots': True,
 'interactive_plots': False,
 'bootstrapping_repetitions': 0,
 'explainer': 'shap',
 'binary_classification_metrics': ['roc_auc', 'accuracy', 'balanced_accuracy'],
 'multiclass_classification_metrics': ['accuracy', 'balanced_accuracy'],
 'multilabel_classification_metrics': ['f1_macro'],
 'regression_metrics': ['r2', 'mean_absolute_error', 'mean_squared_error'],
 'ood_class': 'autoencoder',
 'ood_source': 'internal',
 'ood_kwargs': {},
 'auto-sklearn_include': None,
 'auto-sklearn_exclude': None,
 'auto-sklearn_resampling_strategy': None,
 'auto-sklearn_resampling_strategy_arguments': None}

A detailed explanation of the individual config parameters can be found in [config.md](https://github.com/risc-mi/catabra/doc/config.md). The parameters that control model selection and hyperparameter optimization in general appear at the top of the list:
* `"automl"`: Selected AutoML backend. By default, CaTabRa uses [auto-sklearn](https://automl.github.io/auto-sklearn/master/index.html).
* `"ensemble_size"`: Size of the final ensemble, i.e., the number of individual models to include. Combining models into an ensemble typically improves overall performance and is therefore activated by default. It can be disabled by setting this parameter to `1`.
* `"ensemble_nbest"`: Number of individual models to consider for ensemble building.
* `"memory_limit"`: Memory limit for individual prediction models, in MB.
* `"time_limit"`: Time limit for overall model training, in minutes; negative means no time limit.
* `"jobs"`: Number of parallel jobs to use; negative means all available processors.

In addition, there are parameters specifically controlling the behavior of the auto-sklearn backend; they are described in detail [here](https://automl.github.io/auto-sklearn/master/api.html). Each of them is prefixed with `"auto-sklearn_"`:
* `"auto-sklearn_include"`: Components that are included in hyperparameter optimization, for each step of the modeling pipeline. Useful for restricting the search space to a clearly defined subset, e.g., involving only a single model class.
* `"auto-sklearn_exclude"`: Components that are excluded from hyperparameter optimization, for each step of the modeling pipeline. If both `"auto-sklearn_include"` and `"auto-sklearn_exclude"` are given, precisely those components appearing in the former and not appearing in the latter are included.
* `"auto-sklearn_resampling_strategy"`: The resampling strategy to use for internal validation, i.e., for estimating how well a model generalizes to unseen data. Most frequently used values are strings like `"holdout"` and `"cv"`, but in principle any subclass of `sklearn.model_selection.BaseCrossValidator` can be provided.
* `"auto-sklearn_resampling_strategy_arguments"`:  Additional arguments for the resampling strategy, like the number of folds in *k*-fold cross validation (`"cv"`).
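The interplay of `"auto-sklearn_include"` and `"auto-sklearn_exclude"` described above amounts to a set difference per pipeline step. The following is a minimal sketch of that rule only: `effective_components` is a hypothetical helper written for illustration, not part of CaTabRa or auto-sklearn, and the list of available components is made up.

```python
def effective_components(available, include=None, exclude=None):
    """Return, per pipeline step, the components left in the search space.

    Hypothetical helper: illustrates the include/exclude rule, nothing more.
    """
    result = {}
    for step, candidates in available.items():
        kept = list(candidates)
        if include and step in include:
            # keep only components that are explicitly included
            kept = [c for c in kept if c in include[step]]
        if exclude and step in exclude:
            # then drop everything that is explicitly excluded
            kept = [c for c in kept if c not in exclude[step]]
        result[step] = kept
    return result


# illustrative component names, assumed for this example
available = {"classifier": ["random_forest", "gradient_boosting", "mlp"]}

search_space = effective_components(
    available,
    include={"classifier": ["random_forest", "mlp"]},
    exclude={"classifier": ["mlp"]},
)
print(search_space)
```

Here only `random_forest` remains for the `classifier` step, because `mlp` appears in both dicts and `gradient_boosting` was never included.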

## Change Configuration

Changing the AutoML configuration is easy: simply update the config dict when calling `catabra.analysis.analyze()`, as demonstrated below. We focus on a binary classification problem here, but everything applies equally to other prediction tasks.

In [2]:
# load dataset
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(as_frame=True, return_X_y=True)

In [3]:
# add target labels to DataFrame
X['diagnosis'] = y

In [4]:
# split into train- and test set by adding column with corresponding values
# the name of the column is arbitrary; CaTabRa tries to "guess" which samples belong to which set based on the column name and -values
X['train'] = X.index <= 0.8 * len(X)

The keyword argument `config` of function `analyze()` allows updating the default config dict (`catabra.core.config.DEFAULT_CONFIG`). The value passed to `config` can be either a dict or the path to a JSON file containing such a dict. The latter is especially useful on the command line.
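For the JSON variant, such a file can be produced with Python's standard `json` module. The sketch below writes a config dict to a temporary location and reads it back; the file name `automl_config.json` is arbitrary, and the keys are the ones used elsewhere in this notebook.

```python
import json
import os
import tempfile

# config overrides; keys not listed here keep their default values
config = {
    "ensemble_size": 1,
    "auto-sklearn_include": {"classifier": ["random_forest"]},
    "auto-sklearn_resampling_strategy": "cv",
    "auto-sklearn_resampling_strategy_arguments": {"folds": 5},
}

# write the dict to a JSON file (arbitrary file name in a temporary directory)
path = os.path.join(tempfile.mkdtemp(), "automl_config.json")
with open(path, "w") as f:
    json.dump(config, f, indent=2)

# read it back to confirm that the round trip preserves the dict
with open(path) as f:
    loaded = json.load(f)

print(loaded == config)
```

The resulting `path` can then be passed as `config=path` instead of the dict itself.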

**NOTE**<br>
The time limit (`"time_limit"`) and number of parallel jobs (`"jobs"`) can also be passed to `analyze()` directly, as keyword arguments `time` and `jobs`, respectively. If they are specified in both ways, the keyword arguments take precedence.
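The precedence rule in the note can be pictured as a three-layer merge: defaults first, then the `config` dict, then explicit keyword arguments. The sketch below only illustrates that order; `resolve_settings` is a hypothetical function, not CaTabRa's actual implementation, and `DEFAULTS` is a small excerpt of the default config shown earlier.

```python
# excerpt of the default config shown above
DEFAULTS = {"time_limit": 1, "jobs": 1, "ensemble_size": 10}


def resolve_settings(config=None, time=None, jobs=None):
    """Hypothetical illustration of the precedence order, not CaTabRa code."""
    settings = dict(DEFAULTS)        # 1. start from the defaults
    settings.update(config or {})    # 2. apply the config dict
    if time is not None:             # 3. keyword arguments win
        settings["time_limit"] = time
    if jobs is not None:
        settings["jobs"] = jobs
    return settings


# "time_limit" is given twice; the keyword argument takes precedence
resolved = resolve_settings(config={"time_limit": 5, "jobs": 4}, time=3)
print(resolved)
```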

We now analyze data and train a classifier. Deviating from CaTabRa's default setting, we set the time budget for AutoML to 3 minutes, use 2 parallel jobs, disable ensembling, restrict the model class to random forests, and employ 5-fold cross validation for internal validation.

In [5]:
from catabra.analysis import analyze

analyze(
    X,
    classify='diagnosis',     # name of column containing classification target
    split='train',            # name of column containing information about the train-test split (optional)
    time=3,                   # time budget for hyperparameter tuning, in minutes (optional)
    jobs=2,                   # number of parallel jobs to use for model training (optional)
    out='automl_example',
    config={
        'ensemble_size': 1,
        'auto-sklearn_include': {
            'classifier': ['random_forest']
        },
        'auto-sklearn_resampling_strategy': 'cv',
        'auto-sklearn_resampling_strategy_arguments': {
            'folds': 5
        },
    }
)

[CaTabRa] ### Analysis started at 2023-02-08 08:48:57.530544
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Using AutoML-backend auto-sklearn for binary_classification
[CaTabRa] Successfully loaded the following auto-sklearn add-on module(s): xgb






[CaTabRa] New model #1 trained:
    val_roc_auc: 0.990578
    val_accuracy: 0.945175
    val_balanced_accuracy: 0.941965
    train_roc_auc: 1.000000
    type: random_forest
    total_elapsed_time: 00:15
[CaTabRa] New model #2 trained:
    val_roc_auc: 0.981656
    val_accuracy: 0.940789
    val_balanced_accuracy: 0.938213
    train_roc_auc: 0.999876
    type: random_forest
    total_elapsed_time: 00:16
[CaTabRa] New model #3 trained:
    val_roc_auc: 0.983148
    val_accuracy: 0.942982
    val_balanced_accuracy: 0.941788
    train_roc_auc: 0.998510
    type: random_forest
    total_elapsed_time: 00:22
[CaTabRa] New model #4 trained:
    val_roc_auc: 0.991122
    val_accuracy: 0.953947
    val_balanced_accuracy: 0.952317
    train_roc_auc: 1.000000
    type: random_forest
    total_elapsed_time: 00:26
[CaTabRa] New model #5 trained:
    val_roc_auc: 0.976644
    val_accuracy: 0.940789
    val_balanced_accuracy: 0.934482
    train_roc_auc: 1.000000
    type: random_forest
    total_elaps

Iteration 31, loss = 0.00610047
Iteration 32, loss = 0.00592693
Iteration 33, loss = 0.00584019
Iteration 34, loss = 0.00580886
Iteration 35, loss = 0.00565513
Iteration 36, loss = 0.00553230
Iteration 37, loss = 0.00552978
Iteration 38, loss = 0.00542836
Iteration 39, loss = 0.00535562
Iteration 40, loss = 0.00531400
Iteration 41, loss = 0.00528915
Iteration 42, loss = 0.00525538
Iteration 43, loss = 0.00526631
Iteration 44, loss = 0.00523167
Iteration 45, loss = 0.00521995
Iteration 46, loss = 0.00520238
Iteration 47, loss = 0.00520256
Iteration 48, loss = 0.00518122
Iteration 49, loss = 0.00516981
Iteration 50, loss = 0.00518312
Iteration 51, loss = 0.00517175
Iteration 52, loss = 0.00515201
Iteration 53, loss = 0.00514509
Iteration 54, loss = 0.00513804
Iteration 55, loss = 0.00514761
Iteration 56, loss = 0.00523686
Iteration 57, loss = 0.00522094
Iteration 58, loss = 0.00517241
Iteration 59, loss = 0.00517111
Iteration 60, loss = 0.00515946
Iteration 61, loss = 0.00515752
Iteratio

## Grouped Splitting

CaTabRa natively supports grouped splitting/resampling. This means that every sample is assigned to a group, and when the data are split or resampled for internal validation, all samples belonging to the same group are guaranteed to end up together in either the training set or the validation set, never in both. Refer to the [scikit-learn user guide](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-for-grouped-data) for details.
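To make the idea concrete, the following sketch uses scikit-learn's `GroupKFold` on synthetic data and checks that no group appears on both sides of any fold. It illustrates the general concept from the scikit-learn user guide; which splitter CaTabRa or auto-sklearn uses internally is not assumed here.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# synthetic data: 100 samples randomly assigned to at most 20 groups
rng = np.random.default_rng(0)
n_samples = 100
features = rng.normal(size=(n_samples, 3))
labels = rng.integers(0, 2, size=n_samples)
groups = rng.integers(0, 20, size=n_samples)

# for every fold, count the groups shared by training and validation part
overlaps = []
for train_idx, val_idx in GroupKFold(n_splits=5).split(features, labels, groups):
    shared = set(groups[train_idx]) & set(groups[val_idx])
    overlaps.append(len(shared))

print(overlaps)  # every entry is 0: groups are never split across the two parts
```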

To activate grouped splitting, all one needs to do is add a column with the corresponding grouping information to the data table and inform CaTabRa about it. There is no need to adjust the resampling strategy; this is taken care of automatically if the resampling strategy is given as a string, like `"holdout"` or `"cv"`.

In [8]:
import numpy as np
X['group'] = np.random.randint(50, size=len(X))

In [9]:
analyze(
    X,
    classify='diagnosis',
    group='group',              # name of the column to use for grouping
    split='train',
    time=1,
    jobs=1,
    out='automl_grouping_example',
    config={
        'ensemble_size': 1,
        'auto-sklearn_include': {
            'classifier': ['random_forest']
        },
        'auto-sklearn_resampling_strategy': 'cv',
        'auto-sklearn_resampling_strategy_arguments': {
            'folds': 5
        },
    }
)

[CaTabRa] ### Analysis started at 2023-02-08 09:18:22.258738




[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Using AutoML-backend auto-sklearn for binary_classification
[CaTabRa] Successfully loaded the following auto-sklearn add-on module(s): xgb


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.


[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.989207
    n_constituent_models: 1
    total_elapsed_time: 00:06
[CaTabRa] New model #1 trained:
    val_roc_auc: 0.990527
    val_accuracy: 0.956140
    val_balanced_accuracy: 0.954456
    train_roc_auc: 1.000000
    type: random_forest
    total_elapsed_time: 00:06
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.989207
    n_constituent_models: 1
    total_elapsed_time: 00:11
[CaTabRa] New model #2 trained:
    val_roc_auc: 0.984413
    val_accuracy: 0.951754
    val_balanced_accuracy: 0.951015
    train_roc_auc: 0.999938
    type: random_forest
    total_elapsed_time: 00:11
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.989207
    n_constituent_models: 1
    total_elapsed_time: 00:24
[CaTabRa] New model #3 trained:
    val_roc_auc: 0.983311
    val_accuracy: 0.945175
    val_balanced_accuracy: 0.945531
    train_roc_auc: 0.998575
    type: random_forest
    total_elapsed_time: 00:24
[CaTabRa] New ensem

    total_elapsed_time: 00:29
[CaTabRa] New model #4 trained:
    val_roc_auc: 0.985170
    val_accuracy: 0.934211
    val_balanced_accuracy: 0.930524
    train_roc_auc: 1.000000
    type: random_forest
    total_elapsed_time: 00:29
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.989207
    n_constituent_models: 1
    total_elapsed_time: 00:35
[CaTabRa] New model #5 trained:
    val_roc_auc: 0.987843
    val_accuracy: 0.958333
    val_balanced_accuracy: 0.956501
    train_roc_auc: 0.999981
    type: random_forest
    total_elapsed_time: 00:34
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.989765
    n_constituent_models: 1
    total_elapsed_time: 00:40
[CaTabRa] New model #6 trained:
    val_roc_auc: 0.991570
    val_accuracy: 0.960526
    val_balanced_accuracy: 0.959362
    train_roc_auc: 1.000000
    type: random_forest
    total_elapsed_time: 00:40
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.989765
    n_constituent_models: 1
    total_elapsed_t

In the output above, note the warning that some groups in `"not_train"` overlap with the training set. This warning is shown because we randomly assigned samples to groups, ignoring the existing train-test split. In a real use-case the train-test split should respect the given grouping.
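One way to obtain a train-test split that respects the grouping is to assign whole groups, rather than individual samples, to the training set. The following is a minimal sketch with synthetic data; the DataFrame `df`, the seed and the 80/20 ratio are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# synthetic table: 500 samples randomly assigned to at most 50 groups
rng = np.random.default_rng(0)
df = pd.DataFrame({"group": rng.integers(0, 50, size=500)})

# shuffle the distinct group labels and put roughly 80% of the groups into training
shuffled_groups = rng.permutation(df["group"].unique())
n_train_groups = int(0.8 * len(shuffled_groups))
train_groups = shuffled_groups[:n_train_groups]

# a sample belongs to the training set iff its whole group does
df["train"] = df["group"].isin(train_groups)

# sanity check: no group occurs in both the training and the test set
shared = set(df.loc[df["train"], "group"]) & set(df.loc[~df["train"], "group"])
print(len(shared))
```

Note that splitting by groups yields only approximately 80% of the samples in the training set, since groups differ in size.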

**NOTE**<br>
If no grouping is specified when calling `analyze()`, samples are implicitly grouped by the row index of the data table. Hence, if you do not want to group samples, ensure a unique row index.
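A quick way to check for, and remove, duplicate row index values with plain pandas is sketched below; the toy DataFrame is purely illustrative.

```python
import pandas as pd

# toy table whose row index contains the value 0 twice
df = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=[0, 0, 1])

had_duplicates = not df.index.is_unique
if had_duplicates:
    # replace the index by a fresh 0..n-1 range so that every row is its own group
    df = df.reset_index(drop=True)

print(had_duplicates, df.index.is_unique)
```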