# Add New AutoML Backend

---

This notebook is part of the [CaTabRa GitHub repository](https://github.com/risc-mi/catabra).

This short example demonstrates how a new AutoML backend can be added to CaTabRa, i.e.,

* [how it can be implemented](#Implement-Random-Search), and
* [how it can be utilized in CaTabRa's data analysis workflow](#Utilize-Random-Search).

It also briefly explains [how the existing auto-sklearn backend can be extended](#Extend-Existing-Auto-Sklearn-Backend) without having to add a new backend from scratch.

For the related question of how to conveniently utilize a fixed ML pipeline (without hyperparameter optimization), refer to [this example](https://catabra.readthedocs.io/en/latest/jupyter/fixed_pipeline.html).

## Implement Random Search

We implement a simple random search over a fixed, non-configurable parameter grid.

**ATTENTION!** This is an extremely reduced example that only serves demonstration purposes.
It lacks many capabilities normally expected from CaTabRa AutoML backends, like

* supporting different prediction tasks (not just binary and multiclass classification),
* handling numerical and categorical features,
* handling unlabeled samples,
* supporting grouped splitting for internal validation,
* taking time and memory constraints into account,
* taking different optimization objectives into account,
* logging the training process,
* building ensembles,
* etc.

If you intend to actually add a new AutoML backend, have a look at the implementation of the default auto-sklearn backend in [`catabra.automl.askl.backend`](https://github.com/risc-mi/catabra/tree/main/catabra/automl/askl/backend.py).
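
As an illustration of one capability from the list above, grouped splitting for internal validation could be approached with scikit-learn's `GroupKFold`. This is only a minimal sketch on synthetic data, not how the auto-sklearn backend does it; the names `X_demo`, `y_demo` and `groups_demo` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

# synthetic data (assumption for this sketch): 30 groups with 4 samples each
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)
groups_demo = np.repeat(np.arange(30), 4)

# GroupKFold keeps all samples of a group in the same fold
cv = GroupKFold(n_splits=3)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={'n_estimators': [10, 20], 'max_depth': [None, 4]},
    n_iter=3,
    cv=cv,
    random_state=0,
)
# `groups` is forwarded to the splitter during the internal cross-validation
search.fit(X_demo, y_demo, groups=groups_demo)

# check that no group occurs on both sides of any split
leaks = [
    set(groups_demo[train_idx]) & set(groups_demo[test_idx])
    for train_idx, test_idx in cv.split(X_demo, y_demo, groups=groups_demo)
]
print(all(len(leak) == 0 for leak in leaks))
```

A backend supporting the `groups` argument of `fit()` would then pass it on in a similar way instead of asserting that it is `None`.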

In [1]:
from typing import Optional
import numpy as np
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

from catabra.automl.base import FittedEnsemble, AutoMLBackend

AutoML backends need to implement the abstract base class [`catabra.automl.base.AutoMLBackend`](https://github.com/risc-mi/catabra/tree/main/catabra/automl/base.py). The main methods of interest are `fit()`, `predict()` and `predict_proba()`.

In [2]:
class RandomSearchBackend(AutoMLBackend):
    
    @property
    def name(self) -> str:
        return 'random_search'
    
    @property
    def model_ids_(self) -> list:
        return [0]   # there is only one model, with ID 0
    
    def summary(self) -> dict:
        return {0: [' '.join(repr(s[1]).replace('\n', ' ').split()) for s in self.random_search_.best_estimator_.steps]}
    
    def training_history(self) -> pd.DataFrame:
        hist = pd.DataFrame(self.random_search_.cv_results_)
        hist.rename({'mean_test_score': 'val_score'}, axis=1, inplace=True)   # for plotting
        return hist
    
    def fitted_ensemble(self, ensemble_only: bool = True) -> FittedEnsemble:
        pip = self.random_search_.best_estimator_
        return FittedEnsemble(
            task=self.task,
            models={
                0: dict(preprocessing=pip.steps[0][1], estimator=pip.steps[1][1])
            }
        )
    
    def fit(self, x_train: pd.DataFrame, y_train: pd.DataFrame, groups: Optional[np.ndarray] = None,
            sample_weights: Optional[np.ndarray] = None, time: Optional[int] = None, jobs: Optional[int] = None,
            dataset_name: Optional[str] = None, monitor=None) -> 'RandomSearchBackend':
        
        assert self.task in ('binary_classification', 'multiclass_classification')
        assert y_train.notna().all().all()
        assert groups is None
        assert sample_weights is None
        
        metrics = self.config.get(self.task + '_metrics', [])
        assert len(metrics) == 0 or metrics[0] == 'accuracy'
        
        pip = Pipeline(
            [
                ('imputer', SimpleImputer()),
                ('classifier', RandomForestClassifier())
            ]
        )
        
        param_dist = {
            'imputer__strategy': ['mean', 'median', 'most_frequent', 'constant'],
            'imputer__add_indicator': [True, False],
            'classifier__n_estimators': [10, 20, 50, 80, 100, 150, 200],
            'classifier__criterion': ['gini', 'entropy'],
            'classifier__max_depth': [None, 4, 10],
            'classifier__class_weight': [None, 'balanced', 'balanced_subsample'],
        }
        
        if time is None:
            time = 1
        
        # abuse `time` as number of iterations
        n_iter = time
        
        self.random_search_ = RandomizedSearchCV(pip, param_distributions=param_dist, n_iter=n_iter, refit=True)
        self.random_search_.fit(x_train.values, y_train.values[:, 0])
        
        return self
    
    def predict(self, x: pd.DataFrame, jobs: Optional[int] = None, batch_size: Optional[int] = None,
                model_id=None, calibrated: bool = 'auto') -> np.ndarray:
        return self.random_search_.predict(x)
    
    def predict_proba(self, x: pd.DataFrame, jobs: Optional[int] = None, batch_size: Optional[int] = None,
                      model_id=None, calibrated: bool = 'auto') -> np.ndarray:
        return self.random_search_.predict_proba(x)
    
    def predict_all(self, x: pd.DataFrame, jobs: Optional[int] = None, batch_size: Optional[int] = None) -> dict:
        return {0: self.predict(x, jobs=jobs, batch_size=batch_size)}
    
    def predict_proba_all(self, x: pd.DataFrame, jobs: Optional[int] = None, batch_size: Optional[int] = None) -> dict:
        return {0: self.predict_proba(x, jobs=jobs, batch_size=batch_size)}
    
    def get_versions(self) -> dict:
        return {}
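
To see what `fit()`, `training_history()` and `summary()` boil down to, the same steps can be tried with plain scikit-learn, outside of CaTabRa. The following is a sketch under a few assumptions: the parameter grid is shortened to keep it fast, `random_state` is fixed for reproducibility, and the names `X_demo` and `y_demo` are chosen for this example only.

```python
import json

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

X_demo, y_demo = load_breast_cancer(return_X_y=True)

pip = Pipeline([
    ('imputer', SimpleImputer()),
    ('classifier', RandomForestClassifier(random_state=0)),
])
# shortened grid (assumption of this sketch, not the grid used by the backend)
param_dist = {
    'imputer__strategy': ['mean', 'median'],
    'classifier__n_estimators': [10, 20, 50],
    'classifier__max_depth': [None, 4, 10],
}

# fit(): sample 4 configurations, evaluate each by cross-validation, refit the best
search = RandomizedSearchCV(pip, param_distributions=param_dist, n_iter=4,
                            refit=True, random_state=0)
search.fit(X_demo, y_demo)

# training_history(): one row per sampled configuration
hist = pd.DataFrame(search.cv_results_).rename(columns={'mean_test_score': 'val_score'})

# summary(): one whitespace-normalized repr per pipeline step of the best pipeline
summary = {0: [' '.join(repr(est).replace('\n', ' ').split())
               for _, est in search.best_estimator_.steps]}

# JSON object keys are always strings: the integer model ID 0 comes back as '0'
summary_from_json = json.loads(json.dumps(summary))

print(hist[['val_score', 'rank_test_score']])
print(summary_from_json)
```

The round trip through JSON in the last step is the likely reason why the model ID shows up as the string `'0'` when the model summary is loaded from `model_summary.json` further below.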

In [3]:
AutoMLBackend.register('random_search', RandomSearchBackend)
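
Registering the class under a name is what later allows selecting it through the config entry `'automl': 'random_search'`. The general idea of such a name-based registry can be sketched as follows. Note that `BackendRegistry`, its `get()` method and `DummyBackend` are hypothetical names invented for this illustration; they do not reproduce CaTabRa's actual implementation.

```python
from typing import Callable, Dict, Optional


class BackendRegistry:
    """Hypothetical name-to-factory registry (not CaTabRa's actual code)."""

    _factories: Dict[str, Callable] = {}

    @classmethod
    def register(cls, name: str, factory: Callable) -> None:
        # remember the factory (usually a class) under the given name
        cls._factories[name] = factory

    @classmethod
    def get(cls, name: str, **kwargs) -> Optional[object]:
        # look up the factory by name and instantiate it, or return None if unknown
        factory = cls._factories.get(name)
        return None if factory is None else factory(**kwargs)


class DummyBackend:
    """Hypothetical stand-in for a backend class."""

    def __init__(self, task: str, config: dict):
        self.task = task
        self.config = config


BackendRegistry.register('dummy', DummyBackend)

# the backend name typically comes from a config dict
config = {'automl': 'dummy'}
backend = BackendRegistry.get(config['automl'], task='binary_classification', config=config)
print(type(backend).__name__)   # DummyBackend
```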

## Utilize Random Search

In [4]:
# load dataset
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(as_frame=True, return_X_y=True)

In [5]:
# add target labels to DataFrame
X['diagnosis'] = y

In [6]:
# split into training and test set by adding a column with corresponding values
# the name of the column is arbitrary; CaTabRa tries to "guess" which samples belong to which set
# based on the column's name and values
X['train'] = X.index <= 0.8 * len(X)
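
The resulting split sizes can be checked quickly. The following stand-alone sketch reloads the dataset under the name `demo` (chosen for this example only) and counts the rows on either side of the split.

```python
from sklearn.datasets import load_breast_cancer

demo, _ = load_breast_cancer(as_frame=True, return_X_y=True)

# same rule as above: rows with index <= 0.8 * 569 = 455.2 belong to the training set
demo['train'] = demo.index <= 0.8 * len(demo)

n_train = int(demo['train'].sum())
n_test = int((~demo['train']).sum())
print(n_train, n_test)   # 456 113
```

Since the index is a plain integer range, the first 456 samples form the training set and the remaining 113 samples form the test set; the data are not shuffled. In the explanation log further below, the two sets appear as `train` and `not_train`.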

When analyzing the data, we inform CaTabRa that we want to use the `"random_search"` backend by adjusting the config dict:

In [7]:
from catabra.analysis import analyze

analyze(
    X,
    classify='diagnosis',     # name of column containing classification target
    split='train',            # name of column containing information about the train-test split (optional)
    time=20,                  # ONLY IN THIS CASE: number of random search iterations
    out='random_search_example',
    config={
        'automl': 'random_search',     # name of the AutoML backend
        'binary_classification_metrics': ['accuracy', 'roc_auc'],
    }
)

[CaTabRa] ### Analysis started at 2023-02-09 09:30:27.137817
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Using AutoML-backend random_search for binary_classification
[CaTabRa] Final training statistics:
    n_models_trained: 20
[CaTabRa] Creating shap explainer
[CaTabRa] Initialized out-of-distribution detector of type Autoencoder
[CaTabRa] Fitting out-of-distribution detector...
Iteration 1, loss = 0.06783438
Iteration 2, loss = 0.03997528
Iteration 3, loss = 0.02633058
Iteration 4, loss = 0.01948884
Iteration 5, loss = 0.01487442
Iteration 6, loss = 0.01228704
Iteration 7, loss = 0.01144362
Iteration 8, loss = 0.01063012
Iteration 9, loss = 0.00981005
Iteration 10, loss = 0.00913160
Iteration 11, loss = 0.00833614
Iteration 12, loss = 0.00764720
Iteration 13, loss = 0.00714880
Iteration 14, loss = 0.00660951
Iteration 15, loss = 0.00632128
Iteration 16, loss = 0.00613749
Iteration 17, loss = 0.00583286
Iteration 18, loss = 0.00577213
Iteration 19, loss = 0.00582528

After implementing the (simplistic) new AutoML backend in a few lines of code, CaTabRa takes care of everything else: calculating descriptive statistics, splitting the data into a training and a test set, training a classifier and an OOD detector, and evaluating the classifier on both the training and the test set (including visualizations).

We can inspect the training history and the model summary:

In [8]:
from catabra.util import io
training_history = io.read_df('random_search_example/training_history.xlsx')
model_summary = io.load('random_search_example/model_summary.json')

In [15]:
training_history.drop('Unnamed: 0', axis=1).sort_values('rank_test_score').head()

| | `mean_fit_time` | `std_fit_time` | `mean_score_time` | `std_score_time` | `param_imputer__strategy` | `param_imputer__add_indicator` | `param_classifier__n_estimators` | `param_classifier__max_depth` | `param_classifier__criterion` | `param_classifier__class_weight` | `params` | `split0_test_score` | `split1_test_score` | `split2_test_score` | `split3_test_score` | `split4_test_score` | `val_score` | `std_test_score` | `rank_test_score` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.102886 | 0.002096 | 0.007032 | 0.000338 | median | False | 100 | | entropy | | `{'imputer__strategy': 'median', 'imputer__add_...` | 0.945652 | 0.956044 | 0.967033 | 0.956044 | 1.0 | 0.964955 | 0.018782 | 1 |
| 18 | 0.124079 | 0.002483 | 0.006806 | 0.000152 | median | False | 100 | | entropy | balanced_subsample | `{'imputer__strategy': 'median', 'imputer__add_...` | 0.945652 | 0.956044 | 0.978022 | 0.967033 | 0.967033 | 0.962757 | 0.01102 | 2 |
| 5 | 0.082439 | 0.000397 | 0.005633 | 6.3e-05 | most_frequent | False | 80 | | gini | balanced | `{'imputer__strategy': 'most_frequent', 'impute...` | 0.945652 | 0.956044 | 0.967033 | 0.967033 | 0.967033 | 0.960559 | 0.008583 | 3 |
| 14 | 0.0255 | 0.000564 | 0.001812 | 5.5e-05 | median | True | 20 | 10.0 | gini | balanced_subsample | `{'imputer__strategy': 'median', 'imputer__add_...` | 0.956522 | 0.945055 | 0.967033 | 0.956044 | 0.978022 | 0.960535 | 0.011171 | 4 |
| 9 | 0.082382 | 0.001896 | 0.005584 | 0.000116 | median | True | 80 | | entropy | balanced | `{'imputer__strategy': 'median', 'imputer__add_...` | 0.923913 | 0.956044 | 0.967033 | 0.978022 | 0.967033 | 0.958409 | 0.018596 | 5 |


In [10]:
model_summary

{'0': ["SimpleImputer(strategy='median')",
  "RandomForestClassifier(criterion='entropy')"]}

The classifier can be explained without further ado:

In [14]:
from catabra.explanation import explain

explain(
    X,
    folder='random_search_example',
    from_invocation='random_search_example/invocation.json',
    out='random_search_example/explain'
)

[CaTabRa] ### Explanation started at 2023-02-09 09:40:19.095028
[CaTabRa] *** Split train
Sample batches: 100%|########################################| 15/15 [00:00<00:00, 275.13it/s]
[CaTabRa] *** Split not_train
Sample batches: 100%|########################################| 4/4 [00:00<00:00, 152.09it/s]
[CaTabRa] ### Explanation finished at 2023-02-09 09:40:21.726130
[CaTabRa] ### Elapsed time: 0 days 00:00:02.631102
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_search_example/explain


## Extend Existing Auto-Sklearn Backend

The existing auto-sklearn backend can be easily extended with new components, for instance, for data preprocessing, feature engineering, and predictive modeling. This is independent of CaTabRa and [documented on the official auto-sklearn website](https://automl.github.io/auto-sklearn/master/extending.html), with [examples](https://automl.github.io/auto-sklearn/master/examples/index.html#extension-examples). Additionally, you can check out [`catabra.automl.askl.addons.xgb`](https://github.com/risc-mi/catabra/tree/main/catabra/automl/askl/addons/xgb.py) for details about how CaTabRa adds XGBoost classifiers and regressors to auto-sklearn.