# Preprocessing Plugins

Many machine learning estimators require their input data to be preprocessed. Common techniques include:
 - dimensionality reduction: reducing the number of features in the dataset.
 - feature scaling: normalizing the range or the distribution of the features in the dataset.
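
As a rough illustration of feature scaling, the sketch below standardizes a single feature column to zero mean and unit standard deviation using only the standard library. The helper name `standardize` is hypothetical and is not part of AutoPrognosis; it is only meant to show the arithmetic behind one common scaling technique.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature has no spread; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


scaled = standardize([2.0, 4.0, 6.0, 8.0])
print([round(v, 4) for v in scaled])
```

After scaling, the column has a mean of 0 and a standard deviation of 1, which puts features measured in different units on a comparable footing.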

**AutoPrognosis** provides a set of default preprocessing plugins and can be extended with any number of other plugins.

### Plugins 101

Every **AutoPrognosis plugin** must implement the **`Plugin`** interface provided by `autoprognosis/plugins/core/base_plugin.py`.

Each **AutoPrognosis preprocessing plugin** must implement the **`PreprocessorPlugin`** interface provided by `autoprognosis/plugins/preprocessors/base.py`.

__Warning__ : If a plugin doesn't override all the abstract methods, it won't be loaded by the library.




__API__ : Every preprocessing plugin must implement the following methods:
- `name()` - a static method that returns the name of the plugin, e.g., `linear_svm` or `pca`.

- `subtype()` - a static method that returns the plugin's subtype, e.g., "dimensionality_reduction" or "feature_scaling". It is used to filter plugins during the optimization process.
    
- `hyperparameter_space()` - a static method that returns the hyperparameters that can be tuned during the optimization. The method returns a list of objects derived from `skopt.space.Dimension`.
    
- `_fit()` - internal implementation, called by the `fit` method.

- `_transform()` - internal implementation, called by the `transform` method.
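
The mechanism behind the warning and the method list above can be sketched with Python's `abc` module. This is a minimal illustration, not AutoPrognosis code: `ToyPreprocessorPlugin`, `IncompletePlugin`, and `IdentityPlugin` are hypothetical stand-ins, and the filtering step only shows one way a loader could detect classes that leave abstract methods unimplemented.

```python
import inspect
from abc import ABC, abstractmethod


class ToyPreprocessorPlugin(ABC):
    """Hypothetical stand-in for the PreprocessorPlugin interface."""

    @staticmethod
    @abstractmethod
    def name(): ...

    @staticmethod
    @abstractmethod
    def subtype(): ...

    @staticmethod
    @abstractmethod
    def hyperparameter_space(*args, **kwargs): ...

    @abstractmethod
    def _fit(self, X, *args, **kwargs): ...

    @abstractmethod
    def _transform(self, X): ...

    def fit(self, X, *args, **kwargs):
        # The public entry point delegates to the internal implementation.
        return self._fit(X, *args, **kwargs)

    def transform(self, X):
        return self._transform(X)


class IncompletePlugin(ToyPreprocessorPlugin):
    # Only one of the five abstract methods is overridden.
    @staticmethod
    def name():
        return "incomplete"


class IdentityPlugin(ToyPreprocessorPlugin):
    @staticmethod
    def name():
        return "identity"

    @staticmethod
    def subtype():
        return "feature_scaling"

    @staticmethod
    def hyperparameter_space(*args, **kwargs):
        return []

    def _fit(self, X, *args, **kwargs):
        return self

    def _transform(self, X):
        return X


# A loader could skip any class that still has abstract methods.
candidates = (IncompletePlugin, IdentityPlugin)
loadable = [cls for cls in candidates if not inspect.isabstract(cls)]
print([cls.name() for cls in loadable])
```

Trying to instantiate `IncompletePlugin` directly raises a `TypeError`, which is why a plugin has to override every abstract method before it can be used.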

### Setup

In [None]:
# stdlib
import sys
import time
import warnings

# third party
import numpy as np
import pandas as pd
import tabulate
import xgboost as xgb
from IPython.display import HTML, display
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tqdm import tqdm

if not sys.warnoptions:
    warnings.simplefilter("ignore")

### Loading the Preprocessing plugins

Make sure that you have installed AutoPrognosis in your workspace.

You can do that by running `pip install .` in the root of the repository.

In [None]:
# autoprognosis absolute
from autoprognosis.plugins.preprocessors import PreprocessorPlugin, Preprocessors

preprocessors = Preprocessors()

### List the existing plugins

In [None]:
preprocessors.list_available()

### Adding a new Preprocessing plugin

By default, AutoPrognosis automatically loads the preprocessing plugins whose paths match the pattern `autoprognosis/plugins/preprocessors/plugin_*`.

Alternatively, you can call `Preprocessors().add(<name>, <PreprocessorPlugin derived class>)` at runtime.

Next, we show how to add a custom preprocessing plugin:

In [None]:
# third party
from sklearn.feature_selection import SelectFpr, chi2

custom_select_fpr = "custom_select_fpr"


class NewPlugin(PreprocessorPlugin):
    def __init__(self):
        super().__init__()
        self._model = SelectFpr(chi2)

    @staticmethod
    def name():
        return custom_select_fpr

    @staticmethod
    def hyperparameter_space(*args, **kwargs):
        return []

    @staticmethod
    def subtype() -> str:
        return "dimensionality_reduction"

    def _fit(self, *args, **kwargs):
        self._model.fit(*args, **kwargs)

        return self

    def _transform(self, *args, **kwargs):
        return self._model.transform(*args, **kwargs)

    def save(self) -> bytes:
        # NotImplementedError is the exception; NotImplemented is a non-callable constant.
        raise NotImplementedError("placeholder")

    @classmethod
    def load(cls, buff: bytes) -> "NewPlugin":
        raise NotImplementedError("placeholder")


preprocessors.add(custom_select_fpr, NewPlugin)

assert preprocessors.get(custom_select_fpr) is not None
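
The `save` and `load` methods in the plugin above are left as placeholders. One possible way to fill them in is sketched below with `pickle` on a hypothetical `ToyScaler` class that is not an AutoPrognosis type. Whether AutoPrognosis expects this particular byte format is an assumption; the sketch only shows the round trip that the two signatures imply.

```python
import pickle


class ToyScaler:
    """Hypothetical plugin-like class used only to illustrate save/load."""

    def __init__(self, factor=1.0):
        self.factor = factor

    def transform(self, values):
        return [v * self.factor for v in values]

    def save(self) -> bytes:
        # Serialize only the state needed to rebuild the object.
        return pickle.dumps({"factor": self.factor})

    @classmethod
    def load(cls, buff: bytes) -> "ToyScaler":
        state = pickle.loads(buff)
        return cls(**state)


restored = ToyScaler.load(ToyScaler(factor=2.0).save())
print(restored.transform([1.0, 2.5]))
```

Note that `pickle.loads` can execute arbitrary code, so buffers should only be loaded from trusted sources.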

### List the existing plugins

The new plugin should now appear in the list of loaded plugins.

In [None]:
preprocessors.list()

## Benchmarks

We test the preprocessing plugins using the [Wisconsin Breast Cancer dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)).

### Loading the data

In [None]:
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### Duration benchmarks

__About__ : This step measures the preprocessing duration for each plugin on the dataset. The times are reported in milliseconds.

In [None]:
duration = []

plugins = preprocessors.list()

for plugin in tqdm(plugins):
    plugin_duration = [plugin]
    ctx = preprocessors.get(plugin)

    # perf_counter is a monotonic, high-resolution clock intended for timing.
    start = time.perf_counter()
    ctx.fit_transform(X, y)

    plugin_duration.append(round((time.perf_counter() - start) * 1000, 4))

    duration.append(plugin_duration)
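
A single timed run can be noisy. As a minimal sketch of a more robust measurement, the hypothetical helper `time_ms` below repeats a call several times and reports the median duration in milliseconds. The workload here is a stand-in; it is an assumption that a plugin call such as `fit_transform` could be wrapped in the same way.

```python
import statistics
import time


def time_ms(fn, repeats=5):
    """Return the median wall-clock duration of fn() in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


# Stand-in workload; a plugin call could be passed as the callable instead.
elapsed = time_ms(lambda: sum(i * i for i in range(10_000)))
print(f"{elapsed:.4f} ms")
```

The median is less sensitive than the mean to a single slow run caused by caching or background load.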

### Duration(ms) results

In [None]:
display(
    HTML(
        tabulate.tabulate(duration, headers=["Plugin", "Duration(ms)"], tablefmt="html")
    )
)

### Prediction performance using feature processing and XGBoost

__Steps__
 - We run each preprocessing plugin on the dataset.
 - We train an XGBoost classifier using the processed dataset and report the accuracy, AUROC, and AUPRC metrics on the test set.

In [None]:
def get_metrics(X_train, y_train, X_test, y_test):
    xgb_clf = xgb.XGBClassifier(verbosity=0)
    xgb_clf = xgb_clf.fit(X_train, y_train)

    # Accuracy on the held-out set.
    score = xgb_clf.score(X_test, y_test)

    # Curve-based metrics need continuous scores, so use the
    # positive-class probability instead of hard label predictions.
    y_score = xgb_clf.predict_proba(X_test)[:, 1]

    fpr, tpr, _ = metrics.roc_curve(y_test, y_score)
    auroc = metrics.auc(fpr, tpr)

    prec, recall, _ = metrics.precision_recall_curve(y_test, y_score)
    auprc = metrics.auc(recall, prec)

    return round(score, 4), round(auroc, 4), round(auprc, 4)


metrics_headers = ["Plugin", "Acc score", "AUROC", "AUPRC"]
xgboost_test_score = []

xgboost_test_score.append(
    ["original dataset", *get_metrics(X_train, y_train, X_test, y_test)]
)

for plugin in plugins:
    fproc = preprocessors.get(plugin)

    X_train_preprocessed = fproc.fit_transform(X_train.copy(), y_train.copy())
    X_test_preprocessed = fproc.transform(X_test.copy())

    score, auroc, aurpc = get_metrics(
        X_train_preprocessed, y_train, X_test_preprocessed, y_test
    )

    xgboost_test_score.append([plugin, score, auroc, aurpc])
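
AUROC is a ranking metric, so it is generally computed from continuous scores such as predicted probabilities rather than from thresholded labels. The sketch below uses the pairwise definition of AUROC (the fraction of positive/negative pairs ranked correctly, with ties counted as half) to show how thresholding can discard ranking information. The function name `auroc` and the toy data are hypothetical; in practice a library routine such as `sklearn.metrics.roc_auc_score` would be used.

```python
def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
probabilities = [0.6, 0.7, 0.8, 0.9]
hard_labels = [int(p >= 0.5) for p in probabilities]

print(auroc(y_true, probabilities))  # 1.0: every positive outranks every negative
print(auroc(y_true, hard_labels))    # 0.5: all labels are 1, so every pair is a tie
```

In this toy case the probabilities separate the classes perfectly, but the 0.5 threshold maps every sample to the same label and the ranking is lost.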

In [None]:
display(
    HTML(
        tabulate.tabulate(xgboost_test_score, headers=metrics_headers, tablefmt="html")
    )
)

# Congratulations!

Congratulations on completing this notebook tutorial! If you enjoyed it and would like to join the movement towards machine learning and AI for medicine, you can do so in the following ways.

### Star AutoPrognosis on GitHub

The easiest way to help our community is to star the repositories. This helps raise awareness of the tools we're building.

- [Star AutoPrognosis](https://github.com/vanderschaarlab/autoprognosis)
- [Star HyperImpute](https://github.com/vanderschaarlab/hyperimpute)
