# CaTabRa Workflow

---

This notebook is part of the [CaTabRa GitHub repository](https://github.com/risc-mi/catabra).

This tutorial demonstrates CaTabRa's main workflow, in particular how it can be used to

* [analyze data with a binary target](#Step-1:-Analyze-Data-and-Train-Classifier),
* [calibrate the classifier on dedicated calibration data](#Step-2:-Calibrate-Classifier),
* [evaluate the classifier on held-out test data](#Step-3:-Evaluate-Classifier),
* [explain the classifier by computing SHAP- and permutation importance scores](#Step-4:-Explain-Classifier),
* [apply the classifier to new data](#Step-5:-Apply-Classifier-to-New-Data), and
* [load the classifier into Python](#Load-Classifier-into-Python).

## Prerequisites

In [15]:
# generic package imports
from catabra.util import io

In [16]:
# output directory (where all generated artifacts, like statistics, models, etc. are saved)
output_dir = 'workflow'

## Step 0: Prepare Data

We are going to work with the [breast cancer](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) dataset, a well-known binary classification dataset.

CaTabRa assumes a table in the usual $\text{samples} \times \text{attributes}$ format as input, where the attributes encompass features, target labels, and possibly additional information like a predefined train-test split.

In [17]:
# load dataset
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(as_frame=True, return_X_y=True)

In [18]:
# add target labels to DataFrame
X['diagnosis'] = y

In [19]:
# split into train- and test set by adding column with corresponding values
# the name of the column is arbitrary; CaTabRa tries to "guess" which samples belong to which set based on the column name and -values
X['train'] = X.index <= 0.8 * len(X)

In [20]:
X.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,diagnosis,train
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0,True
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0,True
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0,True
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0,True
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0,True


**NOTE**<br>
The column specifying the train-test split may contain more than two values. For instance, values `"train"`, `"val"` and `"test"` would yield a three-way split with one training set and two test sets. Just make sure that the column name and its values clearly indicate which set is the *training* set; the names of the remaining sets are arbitrary. Prediction models are evaluated on each set (including the training set) separately.
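As a minimal sketch of such a multi-valued split column, the helper below assigns the labels `"train"`, `"val"` and `"test"` by position, in the same spirit as the index-based split above. The function name `assign_split`, the fractions, and the column name `split` are illustrative assumptions, not part of CaTabRa.

```python
def assign_split(n_samples, train_frac=0.6, val_frac=0.2):
    """Return one label per sample: "train", "val" or "test" (assigned by position)."""
    n_train = int(train_frac * n_samples)
    n_val = int(val_frac * n_samples)
    labels = []
    for i in range(n_samples):
        if i < n_train:
            labels.append("train")
        elif i < n_train + n_val:
            labels.append("val")
        else:
            labels.append("test")
    return labels


labels = assign_split(10)
print(labels)

# With the DataFrame from above, the column could then be added like this
# (hypothetical column name; it would be passed as split='split' to analyze()):
# X['split'] = assign_split(len(X))
```

A positional split is only sensible if the row order carries no information; otherwise the labels should be shuffled before they are assigned.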

## Step 1: Analyze Data and Train Classifier

Analyze the prepared data `X`. Only one simple function call is required to produce descriptive statistics, a high-quality classifier with automatically tuned hyperparameters, and an Out-of-Distribution detector.

The corresponding command in CaTabRa's command-line interface is called `catabra analyze ...`.

In [21]:
from catabra.analysis import analyze

analyze(
    X,                        # table to analyze; can also be the path to a CSV/Excel/HDF5 file
    classify='diagnosis',     # name of column containing classification target
    split='train',            # name of column containing information about the train-test split (optional)
    time=3,                   # time budget for hyperparameter tuning, in minutes (optional)
    out=output_dir
)

Output folder "/mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow" already exists. Delete?
[CaTabRa] ### Analysis started at 2023-04-13 15:00:41.878108
[CaTabRa] Saving descriptive statistics completed


[CaTabRa] Using AutoML-backend auto-sklearn for binary_classification
[CaTabRa] Successfully loaded the following auto-sklearn add-on module(s): xgb
[CaTabRa] Using auto-sklearn 2.0.


[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.986260
    n_constituent_models: 1
    total_elapsed_time: 00:03
[CaTabRa] New model #1 trained:
    val_roc_auc: 0.989845
    val_accuracy: 0.947368
    val_balanced_accuracy: 0.946356
    train_roc_auc: 1.000000
    type: gradient_boosting
    total_elapsed_time: 00:03
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.986260
    n_constituent_models: 1
    total_elapsed_time: 00:06
[CaTabRa] New model #2 trained:
    val_roc_auc: 0.945430
    val_accuracy: 0.921053
    val_balanced_accuracy: 0.924134
    train_roc_auc: 1.000000
    type: gradient_boosting
    total_elapsed_time: 00:05
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.986260
    n_constituent_models: 1
    total_elapsed_time: 00:08
[CaTabRa] New model #3 trained:
    val_roc_auc: 0.971416
    val_accuracy: 0.921053
    val_balanced_accuracy: 0.919952
    train_roc_auc: 0.993877
    type: gradient_boosting
    total_elapsed_time: 00:08
[CaTabR


[CaTabRa] Final training statistics:
    n_models_trained: 51
    ensemble_val_roc_auc: 0.9972122660294704
[CaTabRa] Creating shap explainer
[CaTabRa] Initialized out-of-distribution detector of type BinsDetector
[CaTabRa] Fitting out-of-distribution detector...
[CaTabRa] Out-of-distribution detector fitted.
[CaTabRa] ### Analysis finished at 2023-04-13 15:03:41.403394
[CaTabRa] ### Elapsed time: 0 days 00:02:59.525286
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow
[CaTabRa] ### Evaluation started at 2023-04-13 15:03:41.421730
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.




[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Evaluation results for train:
    roc_auc: 0.9994623655913979
    accuracy @ 0.5: 0.9868421052631579
    balanced_accuracy @ 0.5: 0.9863799283154122




[CaTabRa] Evaluation results for not_train:
    roc_auc: 0.9991158267020337
    accuracy @ 0.5: 0.9469026548672567
    balanced_accuracy @ 0.5: 0.9655172413793103
[CaTabRa] ### Evaluation finished at 2023-04-13 15:03:44.241915
[CaTabRa] ### Elapsed time: 0 days 00:00:02.820185
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/eval


Because we specified a train-test split, CaTabRa not only trains a classifier (on the training set) but also evaluates it (on both sets). The last few lines of the above logging output report the performance of the classifier on "train" and "not_train". More detailed results are available as well, as we will see in [Step 3](#Step-3:-Evaluate-Classifier).

The newly created directory specified by `output_dir` contains all results generated during data analysis, including

* a copy of the used configuration: `config.json`,
* the arguments passed to function `analyze()`: `invocation.json`,
* [descriptive statistics of the analyzed data](#Descriptive-Statistics): `statistics/`,
* the trained prediction model: `model.joblib`,
* [information about the constituents of the prediction model and their hyperparameters](#Model-Summary): `model_summary.json`,
* [the training history](#Training-History): `training_history.xlsx` and `training_history.pdf`,
* the OOD-detector: `ood.joblib`, and
* evaluation results (because we specified a train-test split): `eval/`.

### Descriptive Statistics

Descriptive statistics are calculated for numeric and non-numeric (categorical) features separately and saved in `statistics/statistics_numeric.xlsx` and `statistics/statistics_non_numeric.xlsx`. The easiest way to inspect these files is to open them in Excel, but they can also be loaded as pandas DataFrames.

CaTabRa provides convenience functions for loading tables in arbitrary formats, implemented in module [`catabra.util.io`](https://github.com/risc-mi/catabra/tree/main/catabra/util/io.py): `read_df()` for loading a single table and `read_dfs()` for loading all tables stored in a file. In classification tasks, descriptive statistics are computed both for the entire dataset and for each class individually and written to two different tables, so we use `read_dfs()` to load both of them:

In [22]:
stats = io.read_dfs(output_dir + '/statistics/statistics_numeric.xlsx')

In [23]:
# overall statistics
stats['overall'].head()

Unnamed: 0.1,Unnamed: 0,count,mean,std,min,25%,50%,75%,max
0,mean radius,569,14.127292,3.524049,6.981,11.7,13.37,15.78,28.11
1,mean texture,569,19.289649,4.301036,9.71,16.17,18.84,21.8,39.28
2,mean perimeter,569,91.969033,24.298981,43.79,75.17,86.24,104.1,188.5
3,mean area,569,654.889104,351.914129,143.5,420.3,551.1,782.7,2501.0
4,mean smoothness,569,0.09636,0.014064,0.05263,0.08637,0.09587,0.1053,0.1634


In [24]:
# statistics per class
# forward-fill the feature names, which are only given in the first row of each group
stats['diagnosis']['Feature'] = stats['diagnosis']['Feature'].ffill()
stats['diagnosis'].head()

Unnamed: 0,Feature,diagnosis,count,mean,std,min,25%,50%,75%,max,mann_whitney_u
0,mean radius,0,212,17.46283,3.203971,10.95,15.075,17.325,19.59,28.11,2.6929430000000002e-68
1,mean radius,1,357,12.146524,1.780512,6.981,11.08,12.2,13.37,17.85,2.6929430000000002e-68
2,mean texture,0,212,21.604906,3.77947,10.38,19.3275,21.46,23.765,39.28,3.4286270000000002e-28
3,mean texture,1,357,17.914762,3.995125,9.71,15.15,17.39,19.76,33.81,3.4286270000000002e-28
4,mean perimeter,0,212,115.365377,21.854653,71.9,98.745,114.2,129.925,188.5,3.55387e-71


In the above per-class statistics, a [Mann-Whitney *U* test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) is performed to detect statistically significant differences in the distribution of a feature between the different classes, and the resulting *p*-values are reported in column `mann_whitney_u`.
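To make the test more tangible, here is a minimal plain-Python sketch of the Mann-Whitney *U* statistic with a two-sided *p*-value from the normal approximation. It is not CaTabRa's implementation: the function names are hypothetical, the feature values are made up, and no tie or continuity correction is applied to the *p*-value, so results can differ from those of a statistics library.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def mann_whitney_u(a, b):
    """U statistic of sample `a` and a two-sided p-value (normal approximation,
    without tie or continuity correction)."""
    n_a, n_b = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    std_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / std_u
    return u_a, math.erfc(abs(z) / math.sqrt(2))


# made-up values of one feature in two classes
class_1 = [12.1, 11.4, 13.0, 12.2]
class_0 = [17.9, 20.5, 19.6, 15.3]
u, p = mann_whitney_u(class_1, class_0)
print(u, p)
```

A small *p*-value indicates that the feature tends to take systematically larger values in one class than in the other.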

For more information about the descriptive statistics computed by CaTabRa by default, refer to [Statistics](https://catabra.readthedocs.io/en/latest/app_docs/statistics_link.html).

Descriptive statistics can be computed manually as well, see module [`catabra.util.statistics`](https://github.com/risc-mi/catabra/tree/main/catabra/util/statistics.py) for details.

### Model Summary

The final prediction model is summarized in `model_summary.json`. This file contains a dict with information about the individual constituent models (if the model is an ensemble), the used preprocessing steps, and the selected hyperparameter values. The exact format depends on the used AutoML backend, but for the default auto-sklearn backend the main information is contained in the list under the `"models"` key, as can be seen below:

In [25]:
io.load(output_dir + '/model_summary.json')

{'automl': 'auto-sklearn',
 'task': 'binary_classification',
 'models': [{'model_id': 6,
   'rank': 1,
   'cost': 0.00561529271206688,
   'ensemble_weight': 0.0,
   'data_preprocessor': "FeatTypeSplit(column_transformer=ColumnTransformer(sparse_threshold=0.0, transformers=[('numerical_transformer', NumericalPreprocessingPipeline(config=Configuration(values={ 'imputation:strategy': 'median', 'rescaling:__choice__': 'power_transformer', }) , dataset_properties={'signed': False, 'sparse': False}, exclude={}, include={}, init_params={}, steps=[('imput... 'symmetry error': 'numerical', 'texture error': 'numerical', 'worst area': 'numerical', 'worst compactness': 'numerical', 'worst concave points': 'numerical', 'worst concavity': 'numerical', 'worst fractal dimension': 'numerical', 'worst perimeter': 'numerical', 'worst radius': 'numerical', 'worst smoothness': 'numerical', 'worst symmetry': 'numerical', 'worst texture': 'numerical'}, init_params={})",
   'balancing': 'Balancing(random_stat

### Training History

Information about each model trained during hyperparameter optimization is contained in `training_history.xlsx` and visualized in `training_history.pdf`:

In [26]:
io.read_df(output_dir + '/training_history.xlsx').drop('Unnamed: 0', axis=1, errors='ignore').head()

Unnamed: 0,model_id,timestamp,total_elapsed_time,type,val_roc_auc,val_accuracy,val_balanced_accuracy,train_roc_auc,duration,ensemble_weight,ensemble_val_roc_auc
0,2,2023-04-13 15:00:46.003,0 days 00:00:03.044101953,gradient_boosting,0.989845,0.947368,0.946356,1.0,2.39181,0.0,0.98626
1,3,2023-04-13 15:00:48.393,0 days 00:00:05.434068441,gradient_boosting,0.94543,0.921053,0.924134,1.0,2.265227,0.0,0.98626
2,4,2023-04-13 15:00:51.003,0 days 00:00:08.043761730,gradient_boosting,0.971416,0.921053,0.919952,0.993877,2.473614,0.0,0.98626
3,5,2023-04-13 15:00:53.396,0 days 00:00:10.437424898,gradient_boosting,0.96825,0.929825,0.926523,0.995034,2.244282,0.0,0.987834
4,6,2023-04-13 15:00:56.929,0 days 00:00:13.970171690,mlp,0.997073,0.971491,0.970072,0.999985,3.378461,0.0,0.996953


## Step 2: Calibrate Classifier

Classifiers can be calibrated to ensure that the probability estimates they return correspond to the "true" confidence of the model. As in the initial data analysis and model construction, one simple function call suffices to calibrate a classifier in CaTabRa.

Worth noting is the use of the `from_invocation` keyword argument, which automatically sets all unspecified arguments to the values stored in the given JSON file; this, for example, applies to `split`. The effect of setting `subset` to `True` is that the classifier is only calibrated on those samples whose value in the train-test-split column `"train"` is `True` (i.e., the training set). Normally, classifiers should not be calibrated on the training set, though. After calibration, `model.joblib` is replaced by the new, calibrated model.

The corresponding command in CaTabRa's command-line interface is `catabra calibrate ...`.

In [27]:
from catabra.calibration import calibrate

calibrate(
    X,
    folder=output_dir,    # directory containing trained classifier (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    subset=True,
    out=output_dir + '/calib'
)

[CaTabRa] ### Calibration started at 2023-04-13 15:03:44.475144
[CaTabRa] Restricting table to calibration subset train = True (456 entries)
[CaTabRa] ### Calibration finished at 2023-04-13 15:03:45.766419
[CaTabRa] ### Elapsed time: 0 days 00:00:01.291275
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/calib


## Step 3: Evaluate Classifier

Prediction models can be evaluated on (labeled) data that have the same format as the data they were initially trained on, as passed to function [`catabra.analysis.analyze()`](https://github.com/risc-mi/catabra/tree/main/catabra/analysis/main.py). Again, one simple function call is sufficient. If the data is split into two or more disjoint subsets via argument `split` (implicit in `from_invocation` below), the model is evaluated on each of these subsets separately.

[Bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) can be used to obtain estimates of the variance, confidence interval, etc. of the performance of our classifier. We activate it by simply setting `bootstrapping_repetitions` to the desired number of repetitions.
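The idea behind bootstrapping can be sketched in a few lines of plain Python: the evaluation set is resampled with replacement many times, and the metric is recomputed on every resample. This is only an illustration under simple assumptions (accuracy as the metric, made-up labels, a hypothetical function name), not CaTabRa's own implementation.

```python
import random
import statistics


def bootstrap_accuracy(y_true, y_pred, repetitions=1000, seed=0):
    """Accuracy on `repetitions` resamples (with replacement) of the evaluation set."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(repetitions):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    return scores


# made-up labels and predictions: 18 of 20 predictions are correct
y_true = [1] * 12 + [0] * 8
y_pred = [1] * 11 + [0] + [0] * 7 + [1]

scores = sorted(bootstrap_accuracy(y_true, y_pred))
print("mean:", statistics.mean(scores))
print("std:", statistics.stdev(scores))
# simple percentile interval covering the central 95% of the resampled scores
lower = scores[len(scores) * 25 // 1000]
upper = scores[len(scores) * 975 // 1000 - 1]
print("95% interval:", lower, upper)
```

The spread of the resampled scores indicates how strongly the metric depends on the particular samples that ended up in the evaluation set.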

Since the desired output directory has been created by function `analyze()` already, we are asked whether it should be replaced.

The corresponding command in CaTabRa's command-line interface is `catabra evaluate ...`.

In [28]:
from catabra.evaluation import evaluate

evaluate(
    X,
    folder=output_dir,    # directory containing trained classifier (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    bootstrapping_repetitions=1000,   # number of bootstrapping repetitions to perform; set to 0 to disable bootstrapping
    out=output_dir + '/eval'
)

Evaluation folder "/mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/eval" already exists. Delete?
[CaTabRa] ### Evaluation started at 2023-04-13 15:03:45.781536
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.




[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.




[CaTabRa] Evaluation results for train:
    roc_auc: 0.9994623655913979
    accuracy @ 0.5: 0.9868421052631579
    balanced_accuracy @ 0.5: 0.9863799283154122
[CaTabRa] Evaluation results for not_train:
    roc_auc: 0.9991158267020337
    accuracy @ 0.5: 0.9469026548672567
    balanced_accuracy @ 0.5: 0.9655172413793103
[CaTabRa] ### Evaluation finished at 2023-04-13 15:04:01.642427
[CaTabRa] ### Elapsed time: 0 days 00:00:15.860891
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/eval


Model calibration can change thresholded metrics (like accuracy and balanced accuracy) compared to the initial data analysis, but leaves threshold-independent metrics, like ROC-AUC, unchanged. In the run shown here, the logged accuracy and balanced accuracy at threshold 0.5 happen to be identical to the values reported by `analyze()`.
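Why calibration can move thresholded metrics without touching ROC-AUC is easiest to see on a toy example: a strictly increasing rescaling of the scores preserves their ranking (and hence ROC-AUC), but it can push scores across a fixed threshold. The sketch below uses made-up scores and a hypothetical square-root map as a stand-in for a calibration function; it does not reproduce CaTabRa's calibration method.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_at(y_true, scores, threshold=0.5):
    """Accuracy when samples with score >= threshold are predicted positive."""
    return sum((s >= threshold) == (y == 1) for y, s in zip(y_true, scores)) / len(y_true)


y_true = [0, 0, 1, 0, 1, 1]
raw = [0.05, 0.10, 0.12, 0.15, 0.30, 0.90]
# hypothetical calibration map: strictly increasing, so the ranking is preserved
calibrated = [s ** 0.5 for s in raw]

print("ROC-AUC: ", roc_auc(y_true, raw), roc_auc(y_true, calibrated))
print("accuracy:", accuracy_at(y_true, raw), accuracy_at(y_true, calibrated))
```

In this toy data the ROC-AUC is the same before and after the rescaling, while the accuracy at threshold 0.5 differs because one positive sample moves above the threshold.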

### Performance Metrics (Non-Bootstrapped)

The main evaluation results produced by CaTabRa are tables with detailed information on model performance, together with corresponding visualizations. In our case, they are contained in subdirectories `eval/train/` and `eval/not_train/`.

Non-bootstrapped performance metrics are saved in `metrics.xlsx`. In binary classification, this file consists of the three tables `"overall"`, `"thresholded"` and `"calibration"`.

In [29]:
metrics = io.read_dfs(output_dir + '/eval/not_train/metrics.xlsx')

Table `"overall"` contains non-thresholded performance metrics, like ROC-AUC, average precision, etc.:

In [30]:
metrics['overall']

Unnamed: 0.1,Unnamed: 0,pos_label,n,n_pos,roc_auc,average_precision,pr_auc,brier_loss,hinge_loss,log_loss
0,diagnosis,1,113,87,0.999116,0.999742,0.99974,0.03369,0.283123,0.125328


Table `"thresholded"` contains all performance metrics that depend on a specific decision threshold (a.k.a. cut-off point), like accuracy, balanced accuracy, F1-score, etc. These metrics are evaluated at different decision thresholds.

In [31]:
metrics['thresholded'].drop('Unnamed: 0', axis=1).head()

Unnamed: 0,threshold,accuracy,balanced_accuracy,f1,sensitivity,specificity,positive_predictive_value,negative_predictive_value,cohen_kappa,hamming_loss,jaccard,true_positive,true_negative,false_positive,false_negative
0,0.012333,0.769912,0.5,0.87,1.0,0.0,0.769912,1.0,0.0,0.230088,0.769912,87,0,26,0
1,0.012333,0.79646,0.557692,0.883249,1.0,0.115385,0.790909,1.0,0.167254,0.20354,0.790909,87,3,23,0
2,0.012333,0.814159,0.596154,0.892308,1.0,0.192308,0.805556,1.0,0.26827,0.185841,0.805556,87,5,21,0
3,0.012333,0.823009,0.615385,0.896907,1.0,0.230769,0.813084,1.0,0.315981,0.176991,0.813084,87,6,20,0
4,0.012333,0.831858,0.634615,0.901554,1.0,0.269231,0.820755,1.0,0.361961,0.168142,0.820755,87,7,19,0
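Several columns of the table above follow directly from the four confusion-matrix counts in its last columns. As a small worked example, the sketch below recomputes a handful of them for rows 0 and 1; the function name is hypothetical, and the formulas are the standard textbook definitions rather than code taken from CaTabRa.

```python
def thresholded_metrics(tp, tn, fp, fn):
    """A few threshold-dependent metrics computed from confusion-matrix counts.

    Assumes at least one positive and one negative sample (no zero denominators).
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# (true_positive, true_negative, false_positive, false_negative) of rows 0 and 1 above
for counts in [(87, 0, 26, 0), (87, 3, 23, 0)]:
    print({name: round(value, 6) for name, value in thresholded_metrics(*counts).items()})
```

Balanced accuracy is the mean of sensitivity and specificity, which is why it stays at 0.5 in row 0, where every sample is predicted positive.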


Table `"calibration"` contains the fraction of positive samples for different threshold intervals. The intervals are constructed such that each of them contains roughly the same number of samples.

In [32]:
metrics['calibration'].drop('Unnamed: 0', axis=1).head()

Unnamed: 0,threshold_lower,threshold_upper,pos_fraction
0,0.012333,0.012333,0.0
1,0.012333,0.012333,0.0
2,0.012333,0.012333,0.0
3,0.012333,0.012333,0.0
4,0.012333,0.012333,0.0
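A table of this kind can be approximated by sorting the samples by predicted probability, cutting them into groups of roughly equal size, and computing the fraction of positives per group. The sketch below is one simple way to do this, with made-up data and a hypothetical function name; CaTabRa's exact binning rule may differ.

```python
def calibration_table(y_true, proba, n_bins=4):
    """Sort samples by predicted probability, split them into `n_bins` groups of
    roughly equal size, and report probability range and positive fraction per group."""
    pairs = sorted(zip(proba, y_true))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:  # skip empty groups if there are fewer samples than bins
            rows.append({
                "threshold_lower": chunk[0][0],
                "threshold_upper": chunk[-1][0],
                "pos_fraction": sum(y for _, y in chunk) / len(chunk),
            })
    return rows


# made-up labels and predicted probabilities of the positive class
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
proba = [0.02, 0.05, 0.10, 0.30, 0.45, 0.60, 0.85, 0.97]
table = calibration_table(y_true, proba)
for row in table:
    print(row)
```

With equal-size groups, many samples sharing the same predicted probability lead to groups whose lower and upper bounds coincide. For a well-calibrated model, `pos_fraction` should lie close to the probabilities spanned by the respective group.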


### Bootstrapped Performance

Since we activated bootstrapping by setting `bootstrapping_repetitions` to a positive number, file `bootstrapping.xlsx` was generated. It contains two tables `"summary"` and `"details"` with summary statistics over all bootstrapping runs and the runs themselves, respectively.

In [33]:
bootstrapping = io.read_dfs(output_dir + '/eval/not_train/bootstrapping.xlsx')

In [34]:
bootstrapping['summary']

Unnamed: 0.1,Unnamed: 0,roc_auc,accuracy,balanced_accuracy,__threshold
0,count,1000.0,1000.0,1000.0,1000.0
1,mean,0.999107,0.946177,0.965097,0.5
2,std,0.001177,0.021432,0.013757,0.0
3,min,0.990909,0.876106,0.922222,0.5
4,25%,0.998557,0.929204,0.956044,0.5
5,50%,0.999532,0.946903,0.966292,0.5
6,75%,1.0,0.964602,0.975904,0.5
7,max,1.0,1.0,1.0,0.5


Table `"details"` reports the performance metrics for each single run, together with the random seed used for resampling the data.

In [35]:
bootstrapping['details'].drop('Unnamed: 0', axis=1, errors='ignore').head()

Unnamed: 0,roc_auc,accuracy,balanced_accuracy,__seed
0,1.0,0.955752,0.970238,2854880344
1,1.0,0.938053,0.963158,1506600952
2,1.0,0.893805,0.931034,3277809138
3,0.997895,0.946903,0.96,3141104837
4,1.0,0.964602,0.977011,2847344748


### Sample-Wise Predictions

Finally, the model output for each individual sample is saved in `predictions.xlsx`.

In [36]:
predictions = io.read_df(output_dir + '/eval/not_train/predictions.xlsx')

The table contains the true label (column `"diagnosis"`) and the predicted probabilities of the negative and positive class, respectively. Note that in our case the two classes are simply called `0` and `1`, which is why the corresponding columns are called `"0_proba"` and `"1_proba"`.

In [37]:
predictions.head()

Unnamed: 0.1,Unnamed: 0,diagnosis,0_proba,1_proba
0,456,1,0.007257,0.992743
1,457,1,0.224164,0.775836
2,458,1,0.727156,0.272844
3,459,1,0.005909,0.994091
4,460,0,0.987667,0.012333


### Out-of-Distribution Detection

In addition to the output of the prediction model, we can also inspect the likelihood of samples (or the whole training or test set) being out-of-distribution (OOD). Predictions for samples with high OOD likelihood should be treated with care, as they might differ significantly from all samples the model has seen during training.

In [38]:
ood = io.read_df(output_dir + '/eval/not_train/ood.xlsx')

In [39]:
ood.head()

Unnamed: 0.1,Unnamed: 0,proba,decision
0,0,0,False
1,1,0,False
2,2,0,False
3,3,0,False
4,4,0,False


## Step 4: Explain Classifier

Prediction models can be explained on data that have the same format as the data they were initially trained on, as passed to function `analyze()`. As before, one simple function call is sufficient. If the data is split into two or more disjoint subsets via argument `split` (implicit in `from_invocation` below), the model is explained on each of these subsets separately.

If the final model is an ensemble of several base models, each of them is explained separately.

The corresponding command in CaTabRa's command-line interface is `catabra explain ...`.

In [40]:
from catabra.explanation import explain

explain(
    X,
    folder=output_dir,    # directory containing trained classifier (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    out=output_dir + '/explain'
)

[CaTabRa] ### Explanation started at 2023-04-13 15:04:01.895024
[CaTabRa] *** Split train
Model 6 (1 of 2):
Sample batches: 100%|########################################| 15/15 [09:38<00:00, 38.53s/it]
Model 60 (2 of 2):
Sample batches: 100%|########################################| 15/15 [03:10<00:00, 12.70s/it]
[CaTabRa] *** Split not_train
Model 6 (1 of 2):
Sample batches: 100%|########################################| 4/4 [02:38<00:00, 39.42s/it]
Model 60 (2 of 2):
Sample batches: 100%|########################################| 4/4 [00:58<00:00, 14.50s/it]
[CaTabRa] ### Explanation finished at 2023-04-13 15:20:32.104575
[CaTabRa] ### Elapsed time: 0 days 00:16:30.209551
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/explain


By default, [SHAP](https://github.com/slundberg/shap) is used for generating *local (i.e., sample-wise) explanations* in terms of *feature importance scores*. These scores are saved as HDF5 tables and visualized in so-called *beeswarm* plots, and can be found in the specified output directory.
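SHAP scores are based on Shapley values, which distribute the difference between a model's output for a sample and its output for a baseline among the features. For a handful of features they can be computed exactly by enumerating all feature subsets, as in the sketch below. This brute-force approach is exponential in the number of features and is not how the SHAP library works internally; the toy model, the all-zero baseline, and the function names are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x`; features outside a coalition
    are replaced by their `baseline` value."""
    n = len(x)

    def value(coalition):
        return predict([x[i] if i in coalition else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                total += weight * (value(set(subset) | {i}) - value(set(subset)))
        phi.append(total)
    return phi


# hypothetical toy model: features 0 and 1 interact, feature 2 is ignored
def toy_model(v):
    return 2.0 * v[0] + v[0] * v[1]


x = [1.0, 3.0, 5.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))
```

The scores add up to the difference between the model output at `x` and at the baseline, and the ignored feature receives a score of zero.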

In addition to SHAP, CaTabRa also provides a ready-to-use implementation of *permutation importance*, which can be activated by setting `explainer="permutation"`:

In [41]:
explain(
    X,
    folder=output_dir,    # directory containing trained classifier (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    out=output_dir + '/explain_permutation',
    explainer='permutation'
)

[CaTabRa] ### Explanation started at 2023-04-13 15:20:32.120711
[CaTabRa] *** Split train
Features: 100%|########################################| 30/30 [00:06<00:00, 4.93it/s]
[CaTabRa] *** Split not_train
Features: 100%|########################################| 30/30 [00:03<00:00, 11.82it/s]
[CaTabRa] ### Explanation finished at 2023-04-13 15:20:43.127964
[CaTabRa] ### Elapsed time: 0 days 00:00:11.007253
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/explain_permutation


Permutation importance generates *global (i.e., feature-wise) explanations*. The corresponding importance scores are saved as HDF5 tables and visualized in bar plots.
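The principle of permutation importance can be sketched as follows: the values of one feature are shuffled across samples, and the resulting drop in a performance metric is taken as that feature's importance. The example uses plain Python, accuracy as the metric, and a made-up classifier and dataset; these are assumptions for illustration and do not mirror CaTabRa's implementation.

```python
import random


def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when one feature column is shuffled, for every feature."""
    rng = random.Random(seed)
    base = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # rebuild the rows with feature j replaced by its shuffled values
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(base - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# hypothetical toy classifier: uses only feature 0 and ignores feature 1
def toy_classifier(row):
    return int(row[0] > 0.5)


data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]

importances = permutation_importance(toy_classifier, rows, labels)
print(importances)
```

Shuffling the feature the classifier relies on lowers accuracy considerably, whereas shuffling the ignored feature leaves all predictions unchanged, so its importance is zero.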

Refer to [Explanations](https://catabra.readthedocs.io/en/latest/app_docs/explanations_link.html) for more information about model explanations.

## Step 5: Apply Classifier to New Data

Finally, the trained classifier can be applied to new data of the same format as the data it was initially trained on, possibly without the label column. For demonstration purposes we apply the classifier to the same data `X` we are using throughout, although in a real-world use-case this would not make sense.

The corresponding command in CaTabRa's command-line interface is `catabra apply ...`.

In [42]:
from catabra.application import apply

apply(
    X.drop('diagnosis', axis=1),   # data to apply the model to; column containing ground-truth labels is not needed (but would not harm either)
    folder=output_dir,    # directory containing trained classifier (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    out=output_dir + '/apply'
)

[CaTabRa] ### Application started at 2023-04-13 15:20:43.169793
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] ### Application finished at 2023-04-13 15:20:43.760727
[CaTabRa] ### Elapsed time: 0 days 00:00:00.590934
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/apply


The results are saved in `predictions.xlsx` and contain the predicted probabilities of the two classes, for every sample. OOD scores are saved again in `ood.xlsx`.

In [43]:
predictions = io.read_df(output_dir + '/apply/predictions.xlsx')

In [44]:
predictions.head()

Unnamed: 0.1,Unnamed: 0,0_proba,1_proba
0,0,0.987667,0.012333
1,1,0.987667,0.012333
2,2,0.987667,0.012333
3,3,0.987667,0.012333
4,4,0.987667,0.012333


## Load Classifier into Python

Prediction models generated with CaTabRa can be easily loaded into a Python session. The easiest and most straightforward way to do this is through the [`catabra.util.io.CaTabRaLoader`](https://github.com/risc-mi/catabra/tree/main/catabra/util/io.py) class, which only needs to be instantiated with the directory containing the model:

In [45]:
loader = io.CaTabRaLoader(output_dir)

The resulting class instance provides easy access to all sorts of artifacts generated by the functions above, in particular the trained classifier:

In [46]:
model = loader.get_model()

### Investigating the Model

The type of the loaded `model` object depends on the AutoML backend used for training it, in this case auto-sklearn:

In [47]:
type(model)

catabra.automl.askl.backend.AutoSklearnBackend

If we want a uniform representation of the model independent of the AutoML backend, we can convert it into a [`catabra.automl.fitted_ensemble.FittedEnsemble`](https://github.com/risc-mi/catabra/tree/main/catabra/automl/fitted_ensemble.py):

In [48]:
fe = model.fitted_ensemble()

A `FittedEnsemble` is, as its name suggests, an ensemble consisting of individual base models and a meta-estimator combining the predictions of the base models into a single output. These base models can be accessed via the `models_` attribute, which is a dict mapping model IDs to instances of class `FittedModel`:

In [49]:
fe.models_

{6: FittedModel(
     preprocessing=[ColumnTransformer(sparse_threshold=0.0,
                   transformers=[('numerical_transformer',
                                  Pipeline(steps=[('imputation',
                                                   SimpleImputer(copy=False,
                                                                 strategy='median')),
                                                  ('variance_threshold',
                                                   VarianceThreshold()),
                                                  ('rescaling',
                                                   PowerTransformer(copy=False)),
                                                  ('dummy', 'passthrough')]),
                                  [True, True, True, True, True, True, True,
                                   True, True, True, True, True, True, True,
                                   True, True, True, True, True, True, True,
                                   

In [50]:
list(fe.models_.values())[0]

FittedModel(
    preprocessing=[ColumnTransformer(sparse_threshold=0.0,
                  transformers=[('numerical_transformer',
                                 Pipeline(steps=[('imputation',
                                                  SimpleImputer(copy=False,
                                                                strategy='median')),
                                                 ('variance_threshold',
                                                  VarianceThreshold()),
                                                 ('rescaling',
                                                  PowerTransformer(copy=False)),
                                                 ('dummy', 'passthrough')]),
                                 [True, True, True, True, True, True, True,
                                  True, True, True, True, True, True, True,
                                  True, True, True, True, True, True, True,
                                  True, True, True, 

**NOTE**<br>
Predictions returned by `fe` may deviate slightly from those of `model` due to a [known bug in auto-sklearn](https://github.com/automl/auto-sklearn/issues/1483).
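To illustrate what combining the predictions of base models can look like, the sketch below averages per-model class probabilities with weights (soft voting). It is a generic illustration with made-up numbers and a hypothetical function name; how a concrete `FittedEnsemble` combines its members depends on the AutoML backend.

```python
def weighted_soft_vote(probas, weights):
    """Weighted average of per-model class-probability vectors for one sample."""
    total = sum(weights)
    n_classes = len(probas[0])
    return [sum(w * p[c] for w, p in zip(weights, probas)) / total for c in range(n_classes)]


# made-up outputs of two base models for a single sample: [P(class 0), P(class 1)]
probas = [[0.20, 0.80], [0.40, 0.60]]
weights = [0.75, 0.25]
combined = weighted_soft_vote(probas, weights)
print(combined)
```

Because the weights are normalized by their sum, the combined vector is again a probability distribution over the classes.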

### Applying the Model

If we want to apply the model to new data, we first need to load the encoder that was constructed jointly with the model. Again, the `loader` object comes in handy:

In [51]:
encoder = loader.get_encoder()

In [52]:
model.predict_proba(encoder.transform(x=X))

array([[0.98766681, 0.01233319],
       [0.98766681, 0.01233319],
       [0.98766681, 0.01233319],
       ...,
       [0.98766655, 0.01233345],
       [0.98766681, 0.01233319],
       [0.00594657, 0.99405343]])