# Specify Fixed ML Pipelines

---

This notebook is part of the [CaTabRa GitHub repository](https://github.com/risc-mi/catabra).

This short example illustrates how a fixed ML pipeline can be specified in CaTabRa, i.e.,
* [how it can be composed](#Compose-Pipeline),
* [how it can be utilized in CaTabRa's data analysis workflow](#Utilize-Pipeline), and
* [how it can be configured](#Configure-Pipeline).

Fixed pipelines (without hyperparameter optimization) can be useful for quickly training and evaluating *baseline models*, like simple logistic regression.

For the related question of how to add a new full-fledged AutoML backend (with hyperparameter optimization), or extend the default auto-sklearn backend, refer to [this example](https://github.com/risc-mi/catabra/tree/main/examples/AutoML-Extension.ipynb).

## Compose Pipeline

We compose a simple pipeline, consisting of elementary preprocessing steps (scaling, imputation) followed by a logistic regression.

In [2]:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

In [3]:
# preprocessing pipeline
preprocessing = make_pipeline(
    MinMaxScaler(),                                      # min-max scale all features to [0, 1] interval
    SimpleImputer(strategy='constant', fill_value=-1),   # impute missing values with -1
    'passthrough'                                        # no estimator in preprocessing pipeline
)

In [4]:
# final estimator
estimator = LogisticRegression()

**NOTE**: `catabra.automl.fixed_pipeline.standard_preprocessing()` is a convenient built-in implementation of the above preprocessing pipeline. In addition, it also one-hot encodes categorical features.
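
As a quick sanity check outside CaTabRa, the following sketch applies an identical scikit-learn pipeline to a tiny made-up array. The names `demo_preprocessing` and `X_demo` and all values are purely illustrative. It relies on `MinMaxScaler` disregarding `NaN` entries when fitting and keeping them when transforming, so that the subsequent `SimpleImputer` can replace them with -1:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# same steps as the preprocessing pipeline composed above
demo_preprocessing = make_pipeline(
    MinMaxScaler(),
    SimpleImputer(strategy='constant', fill_value=-1),
    'passthrough'
)

# illustrative data: two features, one missing value in the second feature
X_demo = np.array([
    [0.0, 10.0],
    [5.0, np.nan],
    [10.0, 30.0],
])

X_demo_transformed = demo_preprocessing.fit_transform(X_demo)
print(X_demo_transformed)
# rows: [0.0, 0.0], [0.5, -1.0], [1.0, 1.0]
```

Because scaling happens before imputation, the fill value -1 lies outside the [0, 1] range of the observed values, so imputed entries remain distinguishable from observed ones.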

We can now register the fixed pipeline as a new AutoML backend (strictly speaking, the term "AutoML" is a misnomer here, because no hyperparameters are optimized):

In [5]:
from catabra.automl import fixed_pipeline

fixed_pipeline.register_backend(
    'logreg',
    preprocessing=preprocessing,
    estimator=estimator
)

**NOTE**: The `preprocessing` object must implement `fit_transform()` and `transform()`, and the `estimator` object must implement `fit()`, `predict()` and, if used for classification, `predict_proba()`. Both should subclass [`sklearn.base.BaseEstimator`](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html) to be able to get/set hyperparameters with `get_params()` and `set_params()`, respectively. `preprocessing` is optional and can be set to `None`.
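
To make these interface requirements concrete, here is a minimal sketch of a hand-written estimator that provides `fit()`, `predict()` and `predict_proba()` and inherits `get_params()` and `set_params()` from `BaseEstimator`. The class `PriorClassifier` and its `smoothing` hyperparameter are hypothetical and only serve as an illustration; whether CaTabRa expects anything beyond the methods listed above is not verified here.

```python
import numpy as np
from sklearn.base import BaseEstimator


class PriorClassifier(BaseEstimator):
    """Hypothetical baseline: always predicts the (smoothed) class prior."""

    def __init__(self, smoothing=1.0):
        # hyperparameters must be stored under the same name as the argument
        self.smoothing = smoothing

    def fit(self, X, y):
        # estimate class frequencies with additive smoothing
        self.classes_, counts = np.unique(y, return_counts=True)
        smoothed = counts + self.smoothing
        self.prior_ = smoothed / smoothed.sum()
        return self

    def predict_proba(self, X):
        # same probability vector for every sample
        return np.tile(self.prior_, (len(X), 1))

    def predict(self, X):
        # most probable class for every sample
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


X_toy = np.zeros((4, 2))
y_toy = np.array([0, 1, 1, 1])

clf = PriorClassifier().fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)
labels = clf.predict(X_toy)
params = clf.get_params()          # {'smoothing': 1.0}
clf.set_params(smoothing=0.0)      # hyperparameters can be changed after construction
```

An object like this could, in principle, be passed as `estimator` to `register_backend()` in place of `LogisticRegression()`.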

## Utilize Pipeline

`"logreg"` can be used in [CaTabRa's data analysis workflow](https://github.com/risc-mi/catabra/tree/main/examples/Workflow.ipynb) just as any other AutoML backend.

In [6]:
# load dataset
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(as_frame=True, return_X_y=True)

In [7]:
# add target labels to DataFrame
X['diagnosis'] = y

In [8]:
# split into training and test sets by adding a column with corresponding values
# the name of the column is arbitrary; CaTabRa tries to "guess" which samples belong to which set based on the column name and its values
X['train'] = X.index <= 0.8 * len(X)
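
The effect of this rule can be checked with pandas alone. The sketch below assumes a default integer index and the 569 samples of the breast cancer dataset; the `train_random` column is a hypothetical alternative that is not used anywhere else in this example.

```python
import numpy as np
import pandas as pd

n_samples = 569    # number of samples in the breast cancer dataset
df = pd.DataFrame(index=pd.RangeIndex(n_samples))

# same rule as above: rows with index <= 0.8 * n_samples form the training set
df['train'] = df.index <= 0.8 * n_samples

n_train = int(df['train'].sum())
n_test = int((~df['train']).sum())
print(n_train, n_test)    # 456 113

# hypothetical alternative: assign rows to the training set at random (fixed seed)
rng = np.random.default_rng(0)
df['train_random'] = rng.random(n_samples) < 0.8
```

The index-based rule simply takes the first rows as training data. If the rows of a dataset are sorted in some meaningful way, a random assignment like the one sketched at the end may be the safer choice.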

When analyzing the data, we inform CaTabRa that we want to use the `"logreg"` backend by adjusting the config dict:

In [9]:
from catabra.analysis import analyze

analyze(
    X,
    classify='diagnosis',     # name of column containing classification target
    split='train',            # name of column containing information about the train-test split (optional)
    time=None,                # specifying a time budget has no effect on fixed pipelines
    out='logreg_example',
    config={
        'automl': 'logreg',   # name of the "AutoML" backend (in this case it's a fixed pipeline)
        'binary_classification_metrics': ['accuracy', 'roc_auc'],
    }
)

[CaTabRa] ### Analysis started at 2023-03-08 14:34:44.562167
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Using AutoML-backend logreg for binary_classification




[CaTabRa] Final training statistics:
    n_models_trained: 1
[CaTabRa] Creating shap explainer
[CaTabRa] Initialized out-of-distribution detector of type BinsDetector
[CaTabRa] Fitting out-of-distribution detector...
[CaTabRa] Out-of-distribution detector fitted.
[CaTabRa] ### Analysis finished at 2023-03-08 14:34:45.339168
[CaTabRa] ### Elapsed time: 0 days 00:00:00.777001
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/logreg_example
[CaTabRa] ### Evaluation started at 2023-03-08 14:34:45.385179
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Evaluation results for train:
    accuracy @ 0.5: 0.9758771929824561
    roc_auc: 0.9944444444444445
[CaTabRa] Evaluation results for not_train:
    accuracy @ 0.5: 0.9734513274336283
    roc_auc: 0.9991158267020337
[CaTabRa] ### Evaluation finishe

Once the fixed pipeline has been implemented in a few lines of code, CaTabRa takes care of everything else: calculating descriptive statistics, splitting the data into training and test sets, training a classifier and an out-of-distribution (OOD) detector, and evaluating the classifier on both sets (including visualizations).

The classifier can also be explained without further ado:

In [10]:
from catabra.explanation import explain

explain(
    X,
    folder='logreg_example',
    from_invocation='logreg_example/invocation.json',
    out='logreg_example/explain'
)

[CaTabRa] ### Explanation started at 2023-03-08 14:39:12.430560
[CaTabRa] *** Split train
Sample batches: 100%|########################################| 15/15 [00:00<00:00, 570.05it/s]
[CaTabRa] *** Split not_train
Sample batches: 100%|########################################| 4/4 [00:00<00:00, 455.20it/s]
[CaTabRa] ### Explanation finished at 2023-03-08 14:39:14.921295
[CaTabRa] ### Elapsed time: 0 days 00:00:02.490735
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/logreg_example/explain


## Configure Pipeline

Although fixed pipelines are, well, *fixed* in the sense that hyperparameters are not automatically optimized, it is still possible to configure hyperparameters through the config dict.

To find out which hyperparameters are available, call `get_params()` on both objects:

In [12]:
preprocessing.get_params()

{'memory': None,
 'steps': [('minmaxscaler', MinMaxScaler()),
  ('simpleimputer', SimpleImputer(fill_value=-1, strategy='constant')),
  ('passthrough', 'passthrough')],
 'verbose': False,
 'minmaxscaler': MinMaxScaler(),
 'simpleimputer': SimpleImputer(fill_value=-1, strategy='constant'),
 'passthrough': 'passthrough',
 'minmaxscaler__clip': False,
 'minmaxscaler__copy': True,
 'minmaxscaler__feature_range': (0, 1),
 'simpleimputer__add_indicator': False,
 'simpleimputer__copy': True,
 'simpleimputer__fill_value': -1,
 'simpleimputer__missing_values': nan,
 'simpleimputer__strategy': 'constant',
 'simpleimputer__verbose': 0}

In [13]:
estimator.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

Hyperparameters can be configured by adding corresponding entries to the config dict. Keys must be prefixed by `"logreg_preprocessing__"` for hyperparameters of the preprocessing object and by `"logreg_estimator__"` for hyperparameters of the estimator:
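
The naming convention can be illustrated with plain scikit-learn. The sketch below is not CaTabRa's actual implementation: `extract_params` is a hypothetical helper that strips a prefix from matching config keys, and the remaining names are passed to `set_params()`.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def extract_params(config, prefix):
    """Hypothetical helper: select entries whose key starts with `prefix` and strip it."""
    return {key[len(prefix):]: value for key, value in config.items() if key.startswith(prefix)}


demo_config = {
    'automl': 'logreg',
    'logreg_preprocessing__simpleimputer__strategy': 'mean',
    'logreg_estimator__max_iter': 500,
}

demo_preprocessing = make_pipeline(
    MinMaxScaler(),
    SimpleImputer(strategy='constant', fill_value=-1),
    'passthrough'
)
demo_estimator = LogisticRegression()

preprocessing_params = extract_params(demo_config, 'logreg_preprocessing__')
estimator_params = extract_params(demo_config, 'logreg_estimator__')

# what remains after stripping the prefix is an ordinary scikit-learn parameter name
demo_preprocessing.set_params(**preprocessing_params)
demo_estimator.set_params(**estimator_params)

print(preprocessing_params)   # {'simpleimputer__strategy': 'mean'}
print(estimator_params)       # {'max_iter': 500}
```

The part of each key after the prefix is exactly a name returned by `get_params()`, including the `step__parameter` notation for nested pipeline steps.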

In [16]:
analyze(
    X,
    classify='diagnosis',     # name of column containing classification target
    split='train',            # name of column containing information about the train-test split (optional)
    time=None,                # specifying a time budget has no effect on fixed pipelines
    out='logreg_example_configured',
    config={
        'automl': 'logreg',   # name of the "AutoML" backend (in this case it's a fixed pipeline)
        'binary_classification_metrics': ['accuracy', 'roc_auc'],
        
        'logreg_preprocessing__simpleimputer__strategy': 'mean',    # impute missing values with feature-wise mean
        'logreg_estimator__penalty': 'none',                        # don't regularize (scikit-learn 1.2 deprecated 'none' in favor of None)
        'logreg_estimator__max_iter': 500                           # increase number of iterations
    }
)

Output folder "/mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/logreg_example_configured" already exists. Delete? [y/n] y
[CaTabRa] ### Analysis started at 2023-03-08 15:00:22.444121
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Using AutoML-backend logreg for binary_classification




[CaTabRa] Final training statistics:
    n_models_trained: 1
[CaTabRa] Creating shap explainer
[CaTabRa] Initialized out-of-distribution detector of type BinsDetector
[CaTabRa] Fitting out-of-distribution detector...
[CaTabRa] Out-of-distribution detector fitted.
[CaTabRa] ### Analysis finished at 2023-03-08 15:00:24.329333
[CaTabRa] ### Elapsed time: 0 days 00:00:01.885212
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/logreg_example_configured
[CaTabRa] ### Evaluation started at 2023-03-08 15:00:24.334280
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Evaluation results for train:
    accuracy @ 0.5: 1.0
    roc_auc: 1.0
[CaTabRa] Evaluation results for not_train:
    accuracy @ 0.5: 0.9557522123893806
    roc_auc: 0.9712643678160919
[CaTabRa] ### Evaluation finished at 2023-03-08 15:

## Bottom Line

Although it would be technically possible to incorporate hyperparameter optimization into fixed pipelines by utilizing `sklearn.model_selection.GridSearchCV` and related concepts, we **strongly recommend implementing a proper AutoML backend** instead. Refer to [AutoML-Extension.ipynb](https://github.com/risc-mi/catabra/tree/main/examples/AutoML-Extension.ipynb) for information on how this works.