# Imputation Plugins

Missing data is a crucial issue when applying machine learning algorithms to real-world datasets.

**AutoPrognosis** provides a set of default imputation plugins using [HyperImpute](https://github.com/vanderschaarlab/hyperimpute) and can be extended with any number of other plugins.

### Plugins 101

Every **AutoPrognosis plugin** must implement the **`Plugin`** interface provided by `autoprognosis/plugins/core/base_plugin.py`.

Each **AutoPrognosis imputation plugin** must also implement the **`ImputerPlugin`** interface provided by `autoprognosis/plugins/imputers/base.py`.

__Warning__ : If a plugin doesn't override all the abstract methods, it won't be loaded by the library.




__API__ : Every imputation plugin must implement the following methods:
- `name()` - a static method that returns the name of the plugin, e.g. `EM` or `mice`.
    
- `hyperparameter_space()` - a static method that returns the hyperparameters that can be tuned during the optimization. The method will return a list of `skopt.space.Dimension` derived objects.
    
- `_fit()` - internal implementation, called by the `fit` method.

- `_transform()` - internal implementation, called by the `transform` method.
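The shape of this API can be sketched without installing anything. The example below is only an illustration: `SketchImputerBase` is a hypothetical stand-in written for this tutorial, not the real `ImputerPlugin` class, and the real method signatures and hyperparameter types may differ. It shows the four required methods on a toy column-mean imputer, and why a class that leaves abstract methods unimplemented cannot be instantiated.

```python
import math
from abc import ABC, abstractmethod
from typing import Any, List


class SketchImputerBase(ABC):
    """Hypothetical stand-in for the imputer interface (NOT the real ImputerPlugin)."""

    @staticmethod
    @abstractmethod
    def name() -> str:
        ...

    @staticmethod
    @abstractmethod
    def hyperparameter_space() -> List[Any]:
        ...

    @abstractmethod
    def _fit(self, X):
        ...

    @abstractmethod
    def _transform(self, X):
        ...

    # Public entry points delegate to the internal implementations.
    def fit(self, X):
        self._fit(X)
        return self

    def transform(self, X):
        return self._transform(X)

    def fit_transform(self, X):
        return self.fit(X).transform(X)


class ColumnMeanImputer(SketchImputerBase):
    """Toy imputer: replaces NaN with the mean of the observed values in that column."""

    def __init__(self):
        self._means = []

    @staticmethod
    def name():
        return "sketch_mean"

    @staticmethod
    def hyperparameter_space():
        return []  # nothing to tune in this toy example

    def _fit(self, X):
        self._means = []
        for j in range(len(X[0])):
            observed = [row[j] for row in X if not math.isnan(row[j])]
            self._means.append(sum(observed) / len(observed) if observed else 0.0)

    def _transform(self, X):
        return [
            [self._means[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in X
        ]


class IncompleteImputer(SketchImputerBase):
    """Overrides only name(); Python refuses to instantiate it (TypeError)."""

    @staticmethod
    def name():
        return "incomplete"


imputed = ColumnMeanImputer().fit_transform([[1.0, math.nan], [3.0, 4.0]])
```

Calling `IncompleteImputer()` raises `TypeError` because `_fit`, `_transform`, and `hyperparameter_space` are still abstract, which mirrors the warning above about plugins that do not override every abstract method.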

### Setup

In [None]:
# stdlib
import sys
import time
import warnings
from math import sqrt

# third party
import numpy as np
import pandas as pd
import tabulate
import xgboost as xgb
from IPython.display import HTML, display
from sklearn.datasets import load_breast_cancer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from tqdm import tqdm

# autoprognosis absolute
from autoprognosis.plugins.utils.metrics import RMSE
from autoprognosis.plugins.utils.simulate import simulate_nan

if not sys.warnoptions:
    warnings.simplefilter("ignore")

### Loading the Imputation plugins

Make sure that you have installed AutoPrognosis in your workspace.

You can do that by running `pip install .` in the root of the repository.

In [None]:
# autoprognosis absolute
from autoprognosis.plugins.imputers import ImputerPlugin, Imputers

imputers = Imputers()

### List the existing plugins

In [None]:
imputers.list_available()

### Testing the performance

We simulate test datasets using three amputation (missingness) mechanisms:
- **Missing Completely At Random** (MCAR): the probability of being missing is the same for all observations.
- **Missing At Random** (MAR): the probability of being missing depends only on observed values.
- **Missing Not At Random** (MNAR): the probability of being missing depends on unobserved data, such as the missing value itself, and possibly on observed data as well.

#### Load the dataset

In [None]:
# third party
import pandas as pd

# autoprognosis absolute
from autoprognosis.plugins.preprocessors.feature_scaling.plugin_minmax_scaler import (
    plugin as minmax,
)

preproc = minmax()


def dataset():
    X, y = load_breast_cancer(return_X_y=True)
    X = preproc.fit_transform(X, y).to_numpy()
    return train_test_split(X, y, test_size=0.2)


def ampute(x, mechanism, p_miss):
    x_simulated = simulate_nan(x, p_miss, mechanism)

    mask = x_simulated["mask"]
    x_miss = x_simulated["X_incomp"]

    return x, x_miss, mask

In [None]:
datasets = {}
headers = ["Plugin"]

pct = 0.3

mechanisms = ["MAR", "MNAR", "MCAR"]
percentages = [pct]

plugins = ["mean", "median"]
## Uncomment this for full imputation tests
# plugins = imputers.list() #default plugins


X_train, X_test, y_train, y_test = dataset()

for ampute_mechanism in mechanisms:
    for p_miss in percentages:
        if ampute_mechanism not in datasets:
            datasets[ampute_mechanism] = {}

        headers.append(ampute_mechanism + "-" + str(p_miss))
        datasets[ampute_mechanism][p_miss] = ampute(X_train, ampute_mechanism, p_miss)

#### Evaluation

We compare the methods by the root mean squared error (RMSE) between the imputed values and the original, complete dataset.
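As a rough illustration of this metric, the sketch below computes the RMSE over the masked entries only, that is, the entries that were removed and then imputed. The function name `masked_rmse` is hypothetical, and the assumption that the library's `RMSE(x_imp, x, mask)` helper restricts the error to masked entries in the same way has not been checked against its source.

```python
import math


def masked_rmse(x_imputed, x_true, mask):
    """RMSE between imputed and true values, restricted to entries where mask is True."""
    squared_error = 0.0
    count = 0
    for row_imp, row_true, row_mask in zip(x_imputed, x_true, mask):
        for v_imp, v_true, is_missing in zip(row_imp, row_true, row_mask):
            if is_missing:
                squared_error += (v_imp - v_true) ** 2
                count += 1
    if count == 0:
        return 0.0  # nothing was imputed, so there is no error to report
    return math.sqrt(squared_error / count)


x_true = [[1.0, 2.0], [3.0, 4.0]]
mask = [[False, True], [True, False]]  # True marks an entry that was removed
x_imputed = [[1.0, 5.0], [6.0, 4.0]]   # both imputed entries are off by 3

error = masked_rmse(x_imputed, x_true, mask)
```

Restricting the error to masked entries keeps the untouched, observed values from diluting the score.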

In [None]:
results = []
duration = []

for plugin in tqdm(plugins):
    plugin_results = [plugin]
    plugin_duration = [plugin]

    for ampute_mechanism in mechanisms:
        for p_miss in percentages:
            ctx = imputers.get(plugin)
            x, x_miss, mask = datasets[ampute_mechanism][p_miss]

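            # perf_counter is a monotonic clock intended for measuring intervals
            start = time.perf_counter()
            x_imp = ctx.fit_transform(pd.DataFrame(x_miss))

            # elapsed wall-clock time in milliseconds
            plugin_duration.append(round((time.perf_counter() - start) * 1000, 4))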
            plugin_results.append(RMSE(x_imp.to_numpy(), x, mask))

    results.append(plugin_results)
    duration.append(plugin_duration)

### Reconstruction error (RMSE)

__Interpretation__ : The following table shows the reconstruction error, measured as the __Root Mean Square Error (RMSE)__ between the original complete dataset and the dataset imputed by each method. Lower is better.

In [None]:
display(HTML(tabulate.tabulate(results, headers=headers, tablefmt="html")))

### XGBoost test score after imputation

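__Interpretation__ : The following table shows test-set metrics for an XGBoost classifier trained on the data imputed by each method. The first row is the baseline trained on the original, complete dataset.
Metrics:
 - accuracy
 - area under the ROC curve (the `AUROC` column)
 - area under the precision-recall curve (the `AURPC` column)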

In [None]:
# third party
from sklearn import metrics


def get_metrics(X_train, y_train, X_test, y_test):
    xgb_clf = xgb.XGBClassifier(verbosity=0)
    xgb_clf = xgb_clf.fit(X_train, y_train)

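    # Accuracy is computed from hard label predictions.
    score = xgb_clf.score(X_test, y_test)

    # Curve-based metrics need continuous scores, so use the
    # predicted probability of the positive class, not the labels.
    y_prob = xgb_clf.predict_proba(X_test)[:, 1]

    fpr, tpr, _ = metrics.roc_curve(y_test, y_prob)
    auroc = metrics.auc(fpr, tpr)

    prec, recall, _ = metrics.precision_recall_curve(y_test, y_prob)
    aurpc = metrics.auc(recall, prec)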

    return score, auroc, aurpc


metrics_headers = ["Plugin", "Accuracy", "AUROC", "AURPC"]
xgboost_test_score = []


x, x_miss, mask = datasets["MAR"][pct]

xgboost_test_score.append(
    ["original dataset", *get_metrics(X_train, y_train, X_test, y_test)]
)

for plugin in plugins:
    # Convert back to NumPy so the training data has the same type as X_test.
    X_train_imp = (
        imputers.get(plugin).fit_transform(pd.DataFrame(x_miss.copy())).to_numpy()
    )

    score, auroc, aurpc = get_metrics(X_train_imp, y_train, X_test, y_test)

    xgboost_test_score.append([plugin, score, auroc, aurpc])

In [None]:
display(
    HTML(
        tabulate.tabulate(xgboost_test_score, headers=metrics_headers, tablefmt="html")
    )
)

### Duration (ms) results

__Info__ : The following table shows how long each method took to impute the dataset, in milliseconds.
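A minimal sketch of this kind of measurement is shown below. The helper `timed_ms` is hypothetical and not part of AutoPrognosis. It wraps any callable and returns its result together with the elapsed wall-clock time in milliseconds, using `time.perf_counter`, which is designed for measuring short intervals.

```python
import time


def timed_ms(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


# Stand-in workload; in the notebook this would be an imputer's fit_transform call.
total, elapsed = timed_ms(sum, range(100_000))
```

A single run like this is noisy. For a fairer comparison between plugins, the call can be repeated several times and the median reported.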

In [None]:
display(HTML(tabulate.tabulate(duration, headers=headers, tablefmt="html")))

## Debugging

AutoPrognosis supports **debug** logging.

__Warning__ : Don't enable it in release builds.

In [None]:
# autoprognosis absolute
from autoprognosis import logger

imputers = Imputers()

logger.add(sink=sys.stderr, level="DEBUG")

x, x_miss, mask = datasets["MAR"][pct]

# Impute the amputated data; the complete `x` has no missing values to fill in.
x_imp = imputers.get("EM").fit_transform(pd.DataFrame(x_miss))

imputers.get("softimpute").fit_transform(pd.DataFrame(x_miss))

# Congratulations!

Congratulations on completing this notebook tutorial! If you enjoyed it and would like to join the movement towards machine learning and AI for medicine, here is how you can help.

### Star AutoPrognosis on GitHub

The easiest way to help our community is to star the repos. This helps raise awareness of the tools we're building.

- [Star AutoPrognosis](https://github.com/vanderschaarlab/autoprognosis)
- [Star HyperImpute](https://github.com/vanderschaarlab/hyperimpute)
