# Add New Out-of-Distribution Detector

---

This notebook is part of the [CaTabRa GitHub repository](https://github.com/risc-mi/catabra).

This notebook demonstrates how a new Out-of-Distribution (OOD) detector can be added to CaTabRa, i.e.,
* [how it can be implemented](#Implement-Random-OOD-Detector), and
* [how it can be utilized in CaTabRa's data analysis workflow](#Utilize-Random-OOD-Detector).

## Implement Random OOD-Detector

We implement a dummy OOD-detector that assigns a random OOD probability to each sample.

If you intend to actually add a proper new OOD detector, have a look at the implementation of one of the default detectors, like [`catabra.ood.pyod`](https://github.com/risc-mi/catabra/tree/main/catabra/ood/pyod.py).

In [1]:
import numpy as np
import pandas as pd

from catabra.ood.base import OODDetector

OOD-detectors must subclass the abstract base class [`catabra.ood.base.OODDetector`](https://github.com/risc-mi/catabra/tree/main/catabra/ood/base.py). The main methods of interest are `_fit_transformed()`, which fits the detector on training data, and `_predict_proba_transformed()`, which applies it to unseen samples.

In [2]:
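class RandomOODDetector(OODDetector):
    """Dummy detector that ignores the data and returns random OOD probabilities."""

    def _fit_transformer(self, X: pd.DataFrame):
        # no preprocessing transformer to fit
        pass

    def _transform(self, X: pd.DataFrame):
        # pass the data through unchanged
        return X

    def _fit_transformed(self, X: pd.DataFrame, y: pd.Series):
        # nothing to learn from the training data
        pass

    def _predict_transformed(self, X):
        # binary OOD decision; note that this call draws new random numbers
        return self._predict_proba_transformed(X) >= 0.5

    def _predict_proba_transformed(self, X):
        # one uniformly distributed value in [0, 1) per sample
        return np.random.uniform(0, 1, size=len(X))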
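
The detector above draws from NumPy's global random state, so its output changes from run to run. As a minimal sketch (not part of CaTabRa), the same idea can be made reproducible with a seeded `numpy.random.Generator`. The class `SeededRandomScores` and its `seed` parameter are hypothetical names; the class deliberately does not subclass `OODDetector`, so the snippet runs without CaTabRa installed.

```python
import numpy as np
import pandas as pd


class SeededRandomScores:
    """Hypothetical stand-in for the scoring logic of a random OOD detector."""

    def __init__(self, seed: int = 0):
        # a dedicated generator makes the random scores reproducible
        self._rng = np.random.default_rng(seed)

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        # one score in [0, 1) per row of X
        return self._rng.uniform(0, 1, size=len(X))


X_demo = pd.DataFrame({'a': [1.0, 2.0, 3.0]})
p1 = SeededRandomScores(seed=42).predict_proba(X_demo)
p2 = SeededRandomScores(seed=42).predict_proba(X_demo)
print(np.array_equal(p1, p2))  # → True
```

In an actual detector, such a generator could presumably be created once and then used inside `_predict_proba_transformed()`; how the base-class constructor has to be called is not covered here.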

## Utilize Random OOD-Detector

In [3]:
# load dataset
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(as_frame=True, return_X_y=True)

In [4]:
# add target labels to DataFrame
X['diagnosis'] = y

In [5]:
# split into training and test set by adding a Boolean column
# the column name is arbitrary; CaTabRa tries to infer which samples belong to which set from the column name and its values
X['train'] = X.index <= 0.8 * len(X)
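
To see what this comparison produces, the following sketch reproduces it on a placeholder DataFrame with the same number of rows as the breast cancer dataset (569) and a default integer index. The names `demo`, `n_train`, `n_test` and `first_test_index` are only for illustration.

```python
import pandas as pd

# placeholder with 569 rows and the default RangeIndex 0..568
demo = pd.DataFrame({'x': range(569)})

# same rule as above: rows whose index is at most 0.8 * 569 = 455.2 are training samples
demo['train'] = demo.index <= 0.8 * len(demo)

n_train = int(demo['train'].sum())        # rows with index 0..455
n_test = int((~demo['train']).sum())      # rows with index 456..568
first_test_index = int(demo.loc[~demo['train']].index[0])
print(n_train, n_test, first_test_index)  # → 456 113 456
```

Because the split is purely positional, it is only meaningful if the row order carries no information; a shuffled or stratified split may be preferable for real analyses.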

When analyzing the data, we inform CaTabRa that we want to use the new dummy OOD-detector by adjusting the config dict:

In [6]:
from catabra.analysis import analyze

analyze(
    X,
    classify='diagnosis',     # name of column containing classification target
    split='train',            # name of column containing information about the train-test split (optional)
    out='random_ood_example',
    config={
        'automl': None,                     # deactivate model building
        'ood_source': 'external',           # set to "external" for custom detectors
        'ood_class': '__main__.RandomOODDetector'    # name (and module) of the OODDetector subclass
    }
)

Output folder "/mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_ood_example" already exists. Delete? [y/n] y
[CaTabRa] ### Analysis started at 2023-02-13 08:52:47.970930
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] ### Analysis finished at 2023-02-13 08:52:50.348169
[CaTabRa] ### Elapsed time: 0 days 00:00:02.377239
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_ood_example
[CaTabRa] ### Evaluation started at 2023-02-13 08:52:50.400599
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] ### Evaluation finished at 2023-02-13 08:52:51.023008
[CaTabRa] ### Elapsed time: 0 days 00:00:00.622409
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_ood_example/eval
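
The value of `'ood_class'` is a dotted path consisting of a module name followed by a class name. How CaTabRa resolves it internally is not shown in this notebook; the following is only a generic sketch of how such a path can be turned into a class with `importlib`. The helper `resolve_class` is a hypothetical name and not part of CaTabRa's API.

```python
import importlib


def resolve_class(dotted_path: str) -> type:
    """Import the module part of `dotted_path` and return the named attribute."""
    module_name, _, class_name = dotted_path.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# a standard-library class stands in for a custom detector here
cls = resolve_class('collections.OrderedDict')
print(cls.__name__)  # → OrderedDict
```

By the same logic, `'__main__.RandomOODDetector'` refers to a class defined in the running notebook or script; a detector defined in an importable module would presumably be referenced by that module's name instead.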


Although we deactivated model building by setting `"automl"` to `None`, there is still an `eval/` directory containing descriptive statistics of the training and test sets, as well as OOD probabilities:

In [6]:
from catabra.util import io

In [7]:
io.read_df('random_ood_example/eval/not_train/ood.xlsx').set_index('Unnamed: 0').head()

| index | proba | decision |
|---|---|---|
| 456 | 0.14955 | True |
| 457 | 0.300978 | True |
| 458 | 0.6767 | True |
| 459 | 0.118917 | False |
| 460 | 0.641352 | True |
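
Note that `decision` does not follow from `proba >= 0.5` in the output above: sample 456 has a probability of about 0.15 but is still flagged. This is most likely because `_predict_transformed()` calls `_predict_proba_transformed()` again and thereby draws fresh random numbers. The sketch below imitates this with two independent draws; the variable names and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# first call: the probabilities that get reported
proba = rng.uniform(0, 1, size=n)
# second, independent call: the values the decisions are based on
decision = rng.uniform(0, 1, size=n) >= 0.5

# fraction of samples where the decision agrees with thresholding `proba`
agreement = float(np.mean((proba >= 0.5) == decision))
print(round(agreement, 2))
```

An agreement close to 0.5 means the two columns are unrelated. A real detector would normally derive its decisions from the same scores it reports.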


The OOD detector can be applied to unseen samples using the `apply()` function, as usual:

In [8]:
from catabra.application import apply

apply(
    X,
    folder='random_ood_example',
    from_invocation='random_ood_example/invocation.json',
    out='random_ood_example/apply'
)

Application folder "/mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_ood_example/apply" already exists. Delete? [y/n] y
[CaTabRa] ### Application started at 2023-02-13 09:07:09.017188
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] ### Application finished at 2023-02-13 09:07:10.864471
[CaTabRa] ### Elapsed time: 0 days 00:00:01.847283
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_ood_example/apply
