# CaTabRa Workflow

---

This notebook is part of https://github.com/risc-mi/catabra.

This tutorial demonstrates CaTabRa's main workflow, in particular how it can be used to
* [analyze data with a binary target](#Step-1:-Analyze-Data-and-Train-Classifier),
* [calibrate the classifier on dedicated calibration data](#Step-2:-Calibrate-Classifier),
* [evaluate the classifier on held-out test data](#Step-3:-Evaluate-Classifier),
* [explain the classifier by computing SHAP feature importance scores](#Step-4:-Explain-Classifier),
* [apply the classifier to new data](#Step-5:-Apply-Classifier-to-New-Data), and
* [load the classifier into Python](#Load-Classifier-into-Python).

## Prerequisites

In [1]:
# generic package imports
from catabra.util import io

In [2]:
# output directory (where all generated artifacts, like statistics, models, etc. are saved)
output_dir = 'workflow'

## Step 0: Prepare Data

We are going to work with the [breast cancer](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) dataset, a well-known binary classification dataset.

CaTabRa assumes a table in the usual $samples \times attributes$ format as input, where the attributes encompass features, target labels, and possibly additional information like a predefined train-test split.

In [3]:
# load dataset
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(as_frame=True, return_X_y=True)

In [4]:
# add target labels to DataFrame
X['diagnosis'] = y

In [5]:
# split into training and test set by adding a column with corresponding values
# the name of the column is arbitrary; CaTabRa tries to "guess" which samples belong to which set based on the column name and its values
X['train'] = X.index <= 0.8 * len(X)

In [6]:
X.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,diagnosis,train
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0,True
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0,True
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0,True
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0,True
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0,True


**NOTE**<br>
The column specifying the train-test split may contain more than two values. For instance, values `"train"`, `"val"` and `"test"` would yield a three-way split with one training set and two test sets. Just make sure that the column name and its values clearly indicate which set is meant to be the *training* set; the names of the remaining sets are arbitrary. Prediction models are evaluated on each set (including the training set) separately.
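
A three-way split column of this kind could be built as sketched below. This is plain Python for illustration only: the helper name `make_split_labels` and the 60/20/20 proportions are assumptions, not part of CaTabRa.

```python
from collections import Counter


def make_split_labels(n_samples, train_frac=0.6, val_frac=0.2):
    """Assign each sample position to 'train', 'val' or 'test' (hypothetical helper)."""
    n_train = int(train_frac * n_samples)
    n_val = int(val_frac * n_samples)
    labels = []
    for i in range(n_samples):
        if i < n_train:
            labels.append("train")
        elif i < n_train + n_val:
            labels.append("val")
        else:
            labels.append("test")
    return labels


# 569 is the number of samples in the breast cancer dataset
labels = make_split_labels(569)
print(Counter(labels))
```

Assigning the list to a column, e.g. `X['split'] = labels`, and passing `split='split'` to `analyze()` would then request the three-way split; the column name `'split'` is again only an example.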

## Step 1: Analyze Data and Train Classifier

Analyze the prepared data `X`. Only one simple function call is required to produce descriptive statistics, a high-quality classifier with automatically tuned hyperparameters, and an Out-of-Distribution detector.

The corresponding command in CaTabRa's command-line interface is called `catabra analyze ...`.

In [12]:
from catabra.analysis import analyze

analyze(
    X,                        # table to analyze; can also be the path to a CSV/Excel/HDF5 file
    classify='diagnosis',     # name of column containing classification target
    split='train',            # name of column containing information about the train-test split (optional)
    time=3,                   # time budget for hyperparameter tuning, in minutes (optional)
    out=output_dir
)

[CaTabRa] ### Analysis started at 2023-02-02 11:02:08.734257
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Using AutoML-backend auto-sklearn for binary_classification
[CaTabRa] Successfully loaded the following auto-sklearn add-on module(s): xgb




[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.980337
    n_constituent_models: 1
    total_elapsed_time: 00:02
[CaTabRa] New model #1 trained:
    val_roc_auc: 0.980337
    val_accuracy: 0.927152
    val_balanced_accuracy: 0.928416
    train_roc_auc: 1.000000
    type: random_forest
    total_elapsed_time: 00:02
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.994744
    n_constituent_models: 1
    total_elapsed_time: 00:03
[CaTabRa] New model #2 trained:
    val_roc_auc: 0.994744
    val_accuracy: 0.947020
    val_balanced_accuracy: 0.947717
    train_roc_auc: 0.996970
    type: passive_aggressive
    total_elapsed_time: 00:03
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.994744
    n_constituent_models: 1
    total_elapsed_time: 00:04
[CaTabRa] New model #3 trained:
    val_roc_auc: 0.970098
    val_accuracy: 0.920530
    val_balanced_accuracy: 0.915458
    train_roc_auc: 1.000000
    type: gradient_boosting
    total_elapsed_time: 00:04
[CaTabRa] 

[CaTabRa] New model #28 trained:
    val_roc_auc: 0.965024
    val_accuracy: 0.894040
    val_balanced_accuracy: 0.897880
    train_roc_auc: 0.988416
    type: random_forest
    total_elapsed_time: 00:51
[CaTabRa] New model #29 trained:
    val_roc_auc: 0.860729
    val_accuracy: 0.860927
    val_balanced_accuracy: 0.845324
    train_roc_auc: 0.987970
    type: passive_aggressive
    total_elapsed_time: 00:54
[CaTabRa] New model #30 trained:
    val_roc_auc: 0.979340
    val_accuracy: 0.927152
    val_balanced_accuracy: 0.930863
    train_roc_auc: 0.999421
    type: random_forest
    total_elapsed_time: 00:56
[CaTabRa] New model #31 trained:
    val_roc_auc: 0.838347
    val_accuracy: 0.821192
    val_balanced_accuracy: 0.831189
    train_roc_auc: 1.000000
    type: libsvm_svc
    total_elapsed_time: 00:57
[CaTabRa] New model #32 trained:
    val_roc_auc: 0.969011
    val_accuracy: 0.900662
    val_balanced_accuracy: 0.893711
    train_roc_auc: 0.973556
    type: sgd
    total_elapsed_

[CaTabRa] New model #63 trained:
    val_roc_auc: 0.994744
    val_accuracy: 0.940397
    val_balanced_accuracy: 0.942099
    train_roc_auc: 0.996302
    type: lda
    total_elapsed_time: 02:13
[CaTabRa] New model #64 trained:
    val_roc_auc: 0.966564
    val_accuracy: 0.907285
    val_balanced_accuracy: 0.909116
    train_roc_auc: 1.000000
    type: gradient_boosting
    total_elapsed_time: 02:14
[CaTabRa] New model #65 trained:
    val_roc_auc: 0.919627
    val_accuracy: 0.867550
    val_balanced_accuracy: 0.858282
    train_roc_auc: 0.968678
    type: gaussian_nb
    total_elapsed_time: 02:18
[CaTabRa] New model #66 trained:
    val_roc_auc: 0.989670
    val_accuracy: 0.947020
    val_balanced_accuracy: 0.942823
    train_roc_auc: 0.998752
    type: liblinear_svc
    total_elapsed_time: 02:22
[CaTabRa] New model #67 trained:
    val_roc_auc: 0.471910
    val_accuracy: 0.589404
    val_balanced_accuracy: 0.500000
    train_roc_auc: 0.494475
    type: bernoulli_nb
    total_elapsed_t

Because we specified a train-test split, CaTabRa not only trains a classifier (on the training set) but also evaluates it (on both sets). The last few lines of the logging output above report the classifier's performance on "train" and "not_train". More detailed results are available as well, as we will see in [Step 3](#Step-3:-Evaluate-Classifier).

The newly created directory specified by `output_dir` contains all results generated during data analysis, including
* a copy of the used configuration: `config.json`,
* the arguments passed to function `analyze()`: `invocation.json`,
* [descriptive statistics of the analyzed data](#Descriptive-Statistics): `statistics/`,
* the trained prediction model: `model.joblib`,
* [information about the constituents of the prediction model and their hyperparameters](#Model-Summary): `model_summary.json`,
* [the training history](#Training-History): `training_history.xlsx` and `training_history.pdf`,
* the OOD-detector: `ood.joblib`, and
* evaluation results (because we specified a train-test split): `eval/`.
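
To check quickly that an output directory is complete, the artifact names listed above can be compared against the directory contents. The following is a small sketch using only the standard library; `missing_artifacts` is a hypothetical helper, not a CaTabRa function.

```python
from pathlib import Path

# Artifact names as listed above; "eval" only exists if a train-test split was specified.
EXPECTED_ARTIFACTS = [
    "config.json",
    "invocation.json",
    "statistics",
    "model.joblib",
    "model_summary.json",
    "training_history.xlsx",
    "training_history.pdf",
    "ood.joblib",
    "eval",
]


def missing_artifacts(directory):
    """Return the expected artifact names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).exists()]


# an empty list means that all listed artifacts were found
print(missing_artifacts("workflow"))
```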

### Descriptive Statistics

Descriptive statistics are calculated for numeric and non-numeric (categorical) features separately and saved in `statistics/statistics_numeric.xlsx` and `statistics/statistics_non_numeric.xlsx`. It is easiest to view these files in Excel, but they can also be loaded as pandas DataFrames.

CaTabRa provides convenience functions for loading tables in arbitrary formats, implemented in module `catabra.util.io`: `read_df()` for loading a single table and `read_dfs()` for loading all tables stored in a file. In classification tasks, descriptive statistics are computed both for the entire dataset and for each class individually and written to two different tables, so we use `read_dfs()` to load both of them:

In [3]:
stats = io.read_dfs(output_dir + '/statistics/statistics_numeric.xlsx')

In [6]:
# overall statistics
stats['overall'].head()

Unnamed: 0.1,Unnamed: 0,count,mean,std,min,25%,50%,75%,max
0,mean radius,569,14.127292,3.524049,6.981,11.7,13.37,15.78,28.11
1,mean texture,569,19.289649,4.301036,9.71,16.17,18.84,21.8,39.28
2,mean perimeter,569,91.969033,24.298981,43.79,75.17,86.24,104.1,188.5
3,mean area,569,654.889104,351.914129,143.5,420.3,551.1,782.7,2501.0
4,mean smoothness,569,0.09636,0.014064,0.05263,0.08637,0.09587,0.1053,0.1634


In [15]:
# statistics per class
# forward-fill the feature name, which is only stored in the first row of each feature's group
stats['diagnosis']['Feature'] = stats['diagnosis']['Feature'].ffill()
stats['diagnosis'].head()

Unnamed: 0,Feature,diagnosis,count,mean,std,min,25%,50%,75%,max,mann_whitney_u
0,mean radius,0,212,17.46283,3.203971,10.95,15.075,17.325,19.59,28.11,2.6929430000000002e-68
1,mean radius,1,357,12.146524,1.780512,6.981,11.08,12.2,13.37,17.85,2.6929430000000002e-68
2,mean texture,0,212,21.604906,3.77947,10.38,19.3275,21.46,23.765,39.28,3.4286270000000002e-28
3,mean texture,1,357,17.914762,3.995125,9.71,15.15,17.39,19.76,33.81,3.4286270000000002e-28
4,mean perimeter,0,212,115.365377,21.854653,71.9,98.745,114.2,129.925,188.5,3.55387e-71


In the above per-class statistics, a [Mann-Whitney *U* test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) is performed to detect statistically significant differences in the distribution of a feature between the different classes, and the resulting *p*-values are reported in column `mann_whitney_u`.
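
To make the idea behind the test concrete, here is a minimal sketch of a two-sided Mann-Whitney *U* test in plain Python. It uses the normal approximation without tie or continuity correction, so it is a simplification and not necessarily how CaTabRa computes the values in `mann_whitney_u`; the function name and the toy data are assumptions.

```python
import math


def mann_whitney_u(a, b):
    """Return (U, two-sided p-value) using the normal approximation."""
    n1, n2 = len(a), len(b)
    # U1 counts the pairs (x, y) with x > y; ties contribute 0.5
    u1 = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    u = min(u1, n1 * n2 - u1)
    mean = n1 * n2 / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / std
    # two-sided tail probability of the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p


# toy data: two clearly separated groups
u, p = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u, p)
```

A small *p*-value indicates that the feature values of the two classes are unlikely to come from the same distribution.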

For more information about the descriptive statistics computed by CaTabRa by default, refer to `/doc/statistics.md`.

Descriptive statistics can be computed manually as well, see module `catabra.util.statistics` for details.

### Model Summary

The final prediction model is summarized in `model_summary.json`. This file contains a dict with information about the individual constituent models (if the model is an ensemble), the used preprocessing steps, and the selected hyperparameter values. The exact format depends on the used AutoML backend, but for the default auto-sklearn backend the main information is contained in the list under the `"models"` key, as can be seen below:

In [16]:
io.load(output_dir + '/model_summary.json')

{'automl': 'auto-sklearn',
 'task': 'binary_classification',
 'models': [{'model_id': 55,
   'rank': 1,
   'cost': 0.002174700978615496,
   'ensemble_weight': 0.2,
   'data_preprocessor': "FeatTypeSplit(column_transformer=ColumnTransformer(sparse_threshold=0.0, transformers=[('numerical_transformer', NumericalPreprocessingPipeline(config=Configuration: imputation:strategy, Value: 'most_frequent' rescaling:__choice__, Value: 'robust_scaler' rescaling:robust_scaler:q_max, Value: 0.880381628929301 rescaling:robust_scaler:q_min, Value: 0.1341131843923... 'symmetry error': 'numerical', 'texture error': 'numerical', 'worst area': 'numerical', 'worst compactness': 'numerical', 'worst concave points': 'numerical', 'worst concavity': 'numerical', 'worst fractal dimension': 'numerical', 'worst perimeter': 'numerical', 'worst radius': 'numerical', 'worst smoothness': 'numerical', 'worst symmetry': 'numerical', 'worst texture': 'numerical'}, init_params={})",
   'balancing': "Balancing(random_stat

### Training History

Information about each model trained during hyperparameter optimization is contained in `training_history.xlsx` and visualized in `training_history.pdf`:

In [19]:
io.read_df(output_dir + '/training_history.xlsx').drop('Unnamed: 0', axis=1, errors='ignore').head()

Unnamed: 0,model_id,timestamp,total_elapsed_time,type,val_roc_auc,val_accuracy,val_balanced_accuracy,train_roc_auc,duration,ensemble_weight,ensemble_val_roc_auc
0,2,2023-02-02 11:02:12.170,0 days 00:00:02.362576485,random_forest,0.980337,0.927152,0.928416,1.0,1.134844,0.0,0.980337
1,3,2023-02-02 11:02:12.896,0 days 00:00:03.087956190,passive_aggressive,0.994744,0.94702,0.947717,0.99697,0.609202,0.0,0.994744
2,4,2023-02-02 11:02:13.851,0 days 00:00:04.043154955,gradient_boosting,0.970098,0.92053,0.915458,1.0,0.818402,0.0,0.994744
3,5,2023-02-02 11:02:15.418,0 days 00:00:05.609847546,random_forest,0.975535,0.933775,0.934034,0.999866,1.42174,0.0,0.994744
4,6,2023-02-02 11:02:17.481,0 days 00:00:07.673421621,mlp,0.969192,0.913907,0.914734,1.0,1.909986,0.0,0.994744
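
One way to read the training history is to track the best validation ROC-AUC seen so far. The sketch below does this for the five rows shown above, with values copied by hand. For these rows the result coincides with column `ensemble_val_roc_auc`, which is plausible as long as the ensemble consists of a single model, but this should not be expected in general.

```python
# (type, val_roc_auc) of the first five models, copied from the table above
history = [
    ("random_forest", 0.980337),
    ("passive_aggressive", 0.994744),
    ("gradient_boosting", 0.970098),
    ("random_forest", 0.975535),
    ("mlp", 0.969192),
]

best_so_far = []
best = float("-inf")
for _, val_roc_auc in history:
    best = max(best, val_roc_auc)
    best_so_far.append(best)

print(best_so_far)
```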


## Step 2: Calibrate Classifier

Classifiers can be calibrated to ensure that the probability estimates they return correspond to the "true" confidence of the model. As in the initial data analysis and model construction, one simple function call suffices to calibrate a classifier in CaTabRa.

Worth noting is the `from_invocation` keyword argument, which automatically sets all unspecified arguments to the values stored in the given JSON file; this applies to `split`, for example. Setting `subset` to `True` means that the classifier is calibrated only on those samples whose value in the train-test-split column `"train"` is `True` (i.e., the training set). Normally, though, classifiers should not be calibrated on the training set. After calibration, `model.joblib` is replaced by the new, calibrated model.

The corresponding command in CaTabRa's command-line interface is `catabra calibrate ...`.

In [28]:
from catabra.calibration import calibrate

calibrate(
    X,
    folder=output_dir,    # directory containing trained classifier (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    subset=True,
    out=output_dir + '/calib'
)

[CaTabRa] ### Calibration started at 2023-02-02 14:11:34.968525
[CaTabRa] Restricting table to calibration subset train = True (456 entries)
[CaTabRa] ### Calibration finished at 2023-02-02 14:11:36.697882
[CaTabRa] ### Elapsed time: 0 days 00:00:01.729357
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/binary_classification/calib


## Step 3: Evaluate Classifier

Prediction models can be evaluated on (labeled) data that have the same format as the data they were initially trained on, as passed to function `analyze()`. Again, one simple function call is sufficient. If the data is split into two or more disjoint subsets via argument `split` (implicit in `from_invocation` below), the model is evaluated on each of these subsets separately.

[Bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) can be used to estimate the variance, confidence intervals, etc. of the performance of our classifier. We activate it by setting `bootstrapping_repetitions` to the desired number of repetitions.
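
The principle behind bootstrapping can be sketched in a few lines of plain Python: resample the evaluated samples with replacement, recompute the metric on each resample, and summarize the resulting scores. The function name, the toy labels and the percentile interval are assumptions for illustration; this is not CaTabRa's implementation.

```python
import random
import statistics


def bootstrap_accuracy(y_true, y_pred, repetitions=1000, seed=0):
    """Accuracy on `repetitions` resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(repetitions):
        idx = [rng.randrange(n) for _ in range(n)]
        correct = sum(y_true[i] == y_pred[i] for i in idx)
        scores.append(correct / n)
    return scores


# toy labels and predictions (hypothetical values)
y_true = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]

scores = bootstrap_accuracy(y_true, y_pred, repetitions=200)
ordered = sorted(scores)
lower = ordered[int(0.025 * len(ordered))]      # approx. 2.5th percentile
upper = ordered[int(0.975 * len(ordered)) - 1]  # approx. 97.5th percentile
print(statistics.mean(scores), statistics.stdev(scores), (lower, upper))
```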

Since the desired output directory has been created by function `analyze()` already, we are asked whether it should be replaced.

The corresponding command in CaTabRa's command-line interface is `catabra evaluate ...`.

In [29]:
from catabra.evaluation import evaluate

evaluate(
    X,
    folder=output_dir,    # directory containing trained classifier (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    bootstrapping_repetitions=1000,   # number of bootstrapping repetitions to perform; set to 0 to disable bootstrapping
    out=output_dir + '/eval'
)

Evaluation folder "/mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/binary_classification/eval" already exists. Delete? [y/n] y
[CaTabRa] ### Evaluation started at 2023-02-02 16:14:11.106915
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Evaluation results for train:
    roc_auc: 0.9971923536439665
    accuracy @ 0.5: 0.9736842105263158
    balanced_accuracy @ 0.5: 0.9702508960573477
[CaTabRa] Evaluation results for not_train:
    roc_auc: 0.9973474801061007
    accuracy @ 0.5: 0.9734513274336283
    balanced_accuracy @ 0.5: 0.9692749778956675
[CaTabRa] ### Evaluation finished at 2023-02-02 16:14:27.249299
[CaTabRa] ### Elapsed time: 0 days 00:00:16.142384
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/binary_classification/eval


Note how accuracy and balanced accuracy changed compared to the initial data analysis. This is because of model calibration, which potentially affects thresholded metrics (like accuracy and balanced accuracy) but leaves threshold-independent metrics, like ROC-AUC, unchanged.
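
The following toy example illustrates this effect in plain Python. A strictly increasing map (squaring, chosen arbitrarily as a stand-in for a calibration function) changes which samples lie above the threshold 0.5 but preserves their ranking, so accuracy changes while ROC-AUC does not. All names and numbers are assumptions made for illustration.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    """Accuracy when predicting the positive class for scores >= threshold."""
    hits = sum(int(s >= threshold) == y for y, s in zip(y_true, scores))
    return hits / len(y_true)


y_true = [0, 0, 1, 1, 0, 1]
raw = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]
recalibrated = [p ** 2 for p in raw]  # hypothetical strictly increasing map

print(roc_auc(y_true, raw), roc_auc(y_true, recalibrated))
print(accuracy(y_true, raw), accuracy(y_true, recalibrated))
```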

### Performance Metrics (Non-Bootstrapped)

Among the main evaluation results produced by CaTabRa are tables with detailed information on model performance, together with corresponding visualizations. In our case, they are contained in subdirectories `eval/train/` and `eval/not_train/`.

Non-bootstrapped performance metrics are saved in `metrics.xlsx`. In binary classification, this file consists of the three tables `"overall"`, `"thresholded"` and `"calibration"`.

In [8]:
metrics = io.read_dfs(output_dir + '/eval/not_train/metrics.xlsx')

Table `"overall"` contains non-thresholded performance metrics, like ROC-AUC, average precision, etc.:

In [9]:
metrics['overall']

Unnamed: 0.1,Unnamed: 0,pos_label,n,n_pos,roc_auc,average_precision,pr_auc,brier_loss,hinge_loss,log_loss
0,diagnosis,1,113,87,0.997347,0.999216,0.999211,0.023292,0.293878,0.086603


Table `"thresholded"` contains all performance metrics that depend on a specific decision threshold (a.k.a. cut-off point), like accuracy, balanced accuracy, F1-score, etc. These metrics are evaluated at different decision thresholds.

In [17]:
metrics['thresholded'].drop('Unnamed: 0', axis=1).head()

Unnamed: 0,threshold,accuracy,balanced_accuracy,f1,sensitivity,specificity,positive_predictive_value,negative_predictive_value,cohen_kappa,hamming_loss,jaccard,true_positive,true_negative,false_positive,false_negative
0,5.159968e-08,0.769912,0.5,0.87,1.0,0.0,0.769912,1.0,0.0,0.230088,0.769912,87,0,26,0
1,1.94259e-06,0.778761,0.519231,0.874372,1.0,0.038462,0.776786,1.0,0.058019,0.221239,0.776786,87,1,25,0
2,2.656417e-06,0.787611,0.538462,0.878788,1.0,0.076923,0.783784,1.0,0.113725,0.212389,0.783784,87,2,24,0
3,8.222603e-06,0.79646,0.557692,0.883249,1.0,0.115385,0.790909,1.0,0.167254,0.20354,0.790909,87,3,23,0
4,2.910255e-05,0.814159,0.596154,0.892308,1.0,0.192308,0.805556,1.0,0.26827,0.185841,0.805556,87,5,21,0
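
Most of the thresholded metrics follow directly from the four confusion-matrix counts in the last columns. As a sanity check, the sketch below recomputes a few of them for rows 0 and 4 of the table above, using the standard definitions. The function name is an assumption, and the sketch assumes that both classes are present.

```python
def thresholded_metrics(tp, tn, fp, fn):
    """Compute a few standard metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# counts (true_positive, true_negative, false_positive, false_negative) from rows 0 and 4
row0 = thresholded_metrics(87, 0, 26, 0)
row4 = thresholded_metrics(87, 5, 21, 0)

for name, value in row4.items():
    print(name, round(value, 6))
```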


Table `"calibration"` contains the fraction of positive samples for different threshold intervals. The intervals are constructed such that each of them contains roughly the same number of samples.

In [16]:
metrics['calibration'].drop('Unnamed: 0', axis=1).head()

Unnamed: 0,threshold_lower,threshold_upper,pos_fraction
0,5.159968e-08,8e-06,0.0
1,8.222603e-06,3.2e-05,0.0
2,3.207743e-05,0.000108,0.0
3,0.0001075891,0.000149,0.0
4,0.0001485546,0.00079,0.0
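
A table of this kind can be sketched in plain Python by sorting the samples by predicted probability, cutting them into bins of roughly equal size, and computing the fraction of positive samples per bin. The exact binning rule used by CaTabRa is not shown here, so the function below (a hypothetical helper) is only an approximation of the idea.

```python
def calibration_table(y_true, proba, n_bins=5):
    """Return (lowest proba, highest proba, fraction of positives) per equal-frequency bin."""
    pairs = sorted(zip(proba, y_true))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # can happen if there are fewer samples than bins
        pos_fraction = sum(y for _, y in chunk) / len(chunk)
        rows.append((chunk[0][0], chunk[-1][0], pos_fraction))
    return rows


# toy labels and predicted probabilities of the positive class (hypothetical values)
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
proba = [0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 0.9]

for row in calibration_table(y_true, proba, n_bins=4):
    print(row)
```

For a well-calibrated classifier, the fraction of positives in each bin should lie close to the predicted probabilities in that bin.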


### Bootstrapped Performance

Since we activated bootstrapping by setting `bootstrapping_repetitions` to a positive number, file `bootstrapping.xlsx` was generated. It contains two tables `"summary"` and `"details"` with summary statistics over all bootstrapping runs and the runs themselves, respectively.

In [18]:
bootstrapping = io.read_dfs(output_dir + '/eval/not_train/bootstrapping.xlsx')

In [19]:
bootstrapping['summary']

Unnamed: 0.1,Unnamed: 0,roc_auc,accuracy,balanced_accuracy,__threshold
0,count,1000.0,1000.0,1000.0,1000.0
1,mean,0.997224,0.973504,0.969157,0.5
2,std,0.002679,0.014629,0.020406,0.0
3,min,0.982759,0.920354,0.890215,0.5
4,25%,0.996055,0.964602,0.958152,0.5
5,50%,0.997841,0.973451,0.971778,0.5
6,75%,0.99916,0.982301,0.983516,0.5
7,max,1.0,1.0,1.0,0.5


Table `"details"` reports the performance metrics for each single run, together with the random seed used for resampling the data.

In [22]:
bootstrapping['details'].drop('Unnamed: 0', axis=1, errors='ignore').head()

Unnamed: 0,roc_auc,accuracy,balanced_accuracy,__seed
0,0.998864,0.982301,0.978598,3937181777
1,0.998739,0.964602,0.976471,3822484485
2,0.997537,0.955752,0.970238,192133578
3,0.999139,0.982301,0.975668,3230327960
4,0.999482,0.99115,0.994565,1867172167


### Sample-Wise Predictions

Finally, the model output for each individual sample is saved in `predictions.xlsx`.

In [24]:
predictions = io.read_df(output_dir + '/eval/not_train/predictions.xlsx')

The table contains the true label (column `"diagnosis"`) and the predicted probabilities of the negative and positive class, respectively. Note that in our case the two classes are simply called `0` and `1`, which is why the corresponding columns are called `"0_proba"` and `"1_proba"`.

In [25]:
predictions.head()

Unnamed: 0.1,Unnamed: 0,diagnosis,0_proba,1_proba
0,456,1,0.162035,0.837965
1,457,1,0.034214,0.965786
2,458,1,0.017535,0.982465
3,459,1,0.000579,0.999421
4,460,0,0.999971,2.9e-05
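
Hard class labels can be derived from these probabilities by thresholding. The sketch below applies the threshold 0.5, which is the one reported in the evaluation log (`accuracy @ 0.5`), to the five rows shown above, with values copied by hand.

```python
# (diagnosis, 1_proba) of the five samples shown above
rows = [
    (1, 0.837965),
    (1, 0.965786),
    (1, 0.982465),
    (1, 0.999421),
    (0, 2.9e-05),
]

threshold = 0.5
predicted = [int(p >= threshold) for _, p in rows]
n_correct = sum(pred == true for pred, (true, _) in zip(predicted, rows))

print(predicted, n_correct)
```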


## Step 4: Explain Classifier

Prediction models can be explained on data that have the same format as the data they were initially trained on, as passed to function `analyze()`. As before, one simple function call is sufficient. If the data is split into two or more disjoint subsets via argument `split` (implicit in `from_invocation` below), the model is explained on each of these subsets separately.

If the final model is an ensemble of several base models, each of them is explained separately.

The corresponding command in CaTabRa's command-line interface is `catabra explain ...`.

In [28]:
from catabra.explanation import explain

explain(
    X,
    folder=output_dir,    # directory containing trained classifier (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    out=output_dir + '/explain'
)

Explanation folder "/mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/binary_classification/explain" already exists. Delete? [y/n] y
[CaTabRa] ### Explanation started at 2023-02-03 15:23:47.444648
[CaTabRa] *** Split train
Model 55 (1 of 7):
Sample batches: 100%|########################################| 15/15 [00:00<00:00, 542.90it/s]
Model 70 (2 of 7):
Sample batches: 100%|########################################| 15/15 [00:00<00:00, 639.03it/s]
Model 74 (3 of 7):
Sample batches: 100%|########################################| 15/15 [00:00<00:00, 663.65it/s]
Model 76 (4 of 7):
Sample batches: 100%|########################################| 15/15 [00:00<00:00, 670.85it/s]
Model 48 (5 of 7):
Sample batches: 100%|########################################| 15/15 [00:00<00:00, 621.31it/s]
Model 59 (6 of 7):
Sample batches: 100%|########################################| 15/15 [00:00<00:00, 540.28it/s]
Model 12 (7 of 7):
Sample batches: 100%|########################################| 15/1

The resulting SHAP feature importance scores are saved as HDF5 tables and visualized in so-called *beeswarm* plots, and can be found in the specified output directory.

## Step 5: Apply Classifier to New Data

Finally, the trained classifier can be applied to new data of the same format as the data it was initially trained on, possibly without the label column. For demonstration purposes we apply the classifier to the same data `X` we are using throughout, although in a real-world use-case this would not make sense.

The corresponding command in CaTabRa's command-line interface is `catabra apply ...`.

In [30]:
from catabra.application import apply

apply(
    X.drop('diagnosis', axis=1),   # data to apply the model to; column containing ground-truth labels is not needed (but would not harm either)
    folder=output_dir,    # directory containing trained classifier (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    out=output_dir + '/apply'
)

Application folder "/mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/binary_classification/apply" already exists. Delete? [y/n] y
[CaTabRa] ### Application started at 2023-02-03 15:40:10.451366
[CaTabRa] ### Application finished at 2023-02-03 15:40:12.484966
[CaTabRa] ### Elapsed time: 0 days 00:00:02.033600
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/binary_classification/apply


The results are saved in `predictions.xlsx` and contain the predicted probabilities of the two classes, for every sample.

In [31]:
predictions = io.read_df(output_dir + '/apply/predictions.xlsx')

In [32]:
predictions.head()

Unnamed: 0.1,Unnamed: 0,0_proba,1_proba
0,0,0.999903,9.7e-05
1,1,0.998371,0.001629
2,2,0.999981,1.9e-05
3,3,0.999989,1.1e-05
4,4,0.998161,0.001839


## Load Classifier into Python

Prediction models generated with CaTabRa can be easily loaded into a Python session. The most straightforward way to do this is through the `catabra.util.io.CaTabRaLoader` class, which only needs to be instantiated with the directory containing the model:

In [7]:
loader = io.CaTabRaLoader(output_dir)

The resulting class instance provides easy access to all sorts of artifacts generated by the functions above, in particular the trained classifier:

In [8]:
model = loader.get_model()

### Investigating the Model

The type of the loaded `model` object depends on the AutoML backend used for training it, in this case auto-sklearn:

In [9]:
type(model)

catabra.automl.askl.backend.AutoSklearnBackend

If we want a uniform representation of the model independent of the AutoML backend, we can convert it into a `FittedEnsemble`:

In [10]:
fe = model.fitted_ensemble()

A `FittedEnsemble` is, as its name suggests, an ensemble consisting of individual base models and a meta-estimator that combines the predictions of the base models into a single output. These base models can be accessed via the `models_` attribute, which is a dict mapping model IDs to instances of class `FittedModel`:

In [11]:
fe.models_

{55: FittedModel(
     preprocessing=[ColumnTransformer(sparse_threshold=0.0,
                   transformers=[('numerical_transformer',
                                  Pipeline(steps=[('imputation',
                                                   SimpleImputer(copy=False,
                                                                 strategy='most_frequent')),
                                                  ('variance_threshold',
                                                   VarianceThreshold()),
                                                  ('rescaling',
                                                   RobustScaler(copy=False,
                                                                quantile_range=(0.13411318439233166,
                                                                                0.880381628929301))),
                                                  ('dummy', 'passthrough')]),
                                  [True, True, True, True, Tr

In [12]:
fe.models_[55]

FittedModel(
    preprocessing=[ColumnTransformer(sparse_threshold=0.0,
                  transformers=[('numerical_transformer',
                                 Pipeline(steps=[('imputation',
                                                  SimpleImputer(copy=False,
                                                                strategy='most_frequent')),
                                                 ('variance_threshold',
                                                  VarianceThreshold()),
                                                 ('rescaling',
                                                  RobustScaler(copy=False,
                                                               quantile_range=(0.13411318439233166,
                                                                               0.880381628929301))),
                                                 ('dummy', 'passthrough')]),
                                 [True, True, True, True, True, True, True,
  

**NOTE**<br>
Predictions returned by `fe` may deviate slightly from those of `model` due to a [known bug in auto-sklearn](https://github.com/automl/auto-sklearn/issues/1483).
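
To illustrate how base-model predictions might be combined into a single output, the following plain-Python sketch averages the positive-class probabilities of several base models using per-model weights. Weighted averaging is an assumption here, suggested by the `ensemble_weight` entries in the model summary; the actual meta-estimator of a `FittedEnsemble` may work differently, and all names and numbers below are hypothetical.

```python
def combine_probabilities(base_probas, weights):
    """Weighted average of per-model probabilities; weights are normalized to sum to 1."""
    total = sum(weights[model_id] for model_id in base_probas)
    n_samples = len(next(iter(base_probas.values())))
    combined = [0.0] * n_samples
    for model_id, probas in base_probas.items():
        w = weights[model_id] / total
        for i, p in enumerate(probas):
            combined[i] += w * p
    return combined


# positive-class probabilities of two hypothetical base models for three samples
base_probas = {1: [0.2, 0.8, 0.5], 2: [0.6, 0.4, 0.5]}
weights = {1: 0.5, 2: 0.5}

print(combine_probabilities(base_probas, weights))
```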

### Applying the Model

If we want to apply the model to new data, we first need to load the encoder that was constructed jointly with the model. Again, the `loader` object comes in handy:

In [35]:
encoder = loader.get_encoder()

In [38]:
model.predict_proba(encoder.transform(x=X))

array([[9.99903057e-01, 9.69427072e-05],
       [9.98370982e-01, 1.62901758e-03],
       [9.99981249e-01, 1.87510020e-05],
       ...,
       [9.67988655e-01, 3.20113446e-02],
       [9.99999948e-01, 5.15996838e-08],
       [2.80621887e-05, 9.99971938e-01]])