# Imputation Plugins

Missing data is a crucial issue when applying machine learning algorithms to real-world datasets.

**HyperImpute** provides a set of default imputation plugins and can be extended with any number of other plugins.

### Setup

In [1]:
import sys
import warnings
import time
from tqdm import tqdm
from math import sqrt

import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

from hyperimpute.plugins.utils.metrics import RMSE
from hyperimpute.plugins.utils.simulate import simulate_nan

import xgboost as xgb

from IPython.display import HTML, display
import tabulate

if not sys.warnoptions:
    warnings.simplefilter("ignore")

### Loading the Imputation plugins

Make sure that you have installed HyperImpute in your workspace.

You can do that by running `pip install .` in the root of the repository.

In [2]:
from hyperimpute.plugins.imputers import Imputers, ImputerPlugin

imputers = Imputers()

### List the existing plugins

In [3]:
imputers.list()

['softimpute',
 'hyperimpute',
 'mice',
 'nop',
 'most_frequent',
 'missforest',
 'sinkhorn',
 'EM',
 'mean',
 'median',
 'ice',
 'gain']

### Adding a new Imputation plugin

By default, HyperImpute automatically loads the imputation plugins whose files match the pattern `hyperimpute/plugins/imputers/plugin_*`.

Alternatively, you can call `Imputers().add(<name>, <ImputerPlugin derived class>)` at runtime.

Next, we show an example of a custom imputation plugin.

In [4]:
custom_ice_plugin = "custom_ice"


class NewPlugin(ImputerPlugin):
    def __init__(self):
        super().__init__()
        lr = LinearRegression()
        self._model = IterativeImputer(
            estimator=lr, max_iter=500, tol=1e-10, imputation_order="roman"
        )

    @staticmethod
    def name():
        return custom_ice_plugin

    @staticmethod
    def hyperparameter_space():
        return []

    def _fit(self, *args, **kwargs) -> "NewPlugin":
        self._model.fit(*args, **kwargs)
        return self

    def _transform(self, *args, **kwargs):
        return self._model.transform(*args, **kwargs)

    def save(self) -> bytes:
        raise NotImplementedError("placeholder")

    @classmethod
    def load(cls, buff: bytes) -> "NewPlugin":
        raise NotImplementedError("placeholder")


imputers.add(custom_ice_plugin, NewPlugin)

assert imputers.get(custom_ice_plugin) is not None

### List the existing plugins

Now we should see the new plugin, `custom_ice`, in the list.

In [5]:
imputers.list()

['softimpute',
 'hyperimpute',
 'mice',
 'nop',
 'most_frequent',
 'missforest',
 'sinkhorn',
 'EM',
 'mean',
 'median',
 'ice',
 'gain',
 'custom_ice']

### Testing the performance

We simulate test datasets using three amputation (missingness) mechanisms:
- **Missing Completely At Random** (MCAR): the probability of being missing is the same for all observations.
- **Missing At Random** (MAR): the probability of being missing depends only on observed values.
- **Missing Not At Random** (MNAR): the probability of being missing depends on unobserved data as well, such as the missing value itself.
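To make the three mechanisms concrete, here is a minimal, dependency-free sketch of how such masks could be generated. It is not how `simulate_nan` works internally: the helper names (`mcar_mask`, `mar_mask`, `mnar_mask`, `apply_mask`) and the logistic link between values and missingness are assumptions chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mcar_mask(n_rows, n_cols, p_miss, rng):
    """MCAR: every cell is dropped with the same probability p_miss."""
    return [[rng.random() < p_miss for _ in range(n_cols)] for _ in range(n_rows)]


def mar_mask(X, p_miss, rng, driver=0):
    """MAR: column `driver` is always observed; the chance of dropping any
    other cell in a row grows with that row's value in the driver column."""
    mean = sum(row[driver] for row in X) / len(X)
    mask = []
    for row in X:
        # Assumed logistic link: scale p_miss by the observed driver value.
        p_row = min(1.0, 2.0 * p_miss * sigmoid(4.0 * (row[driver] - mean)))
        mask.append([j != driver and rng.random() < p_row for j in range(len(row))])
    return mask


def mnar_mask(X, p_miss, rng):
    """MNAR (self-masking): a cell's own value drives its chance of being dropped."""
    n_cols = len(X[0])
    means = [sum(row[j] for row in X) / len(X) for j in range(n_cols)]
    return [
        [
            rng.random() < min(1.0, 2.0 * p_miss * sigmoid(4.0 * (row[j] - means[j])))
            for j in range(n_cols)
        ]
        for row in X
    ]


def apply_mask(X, mask):
    """Return a copy of X with masked cells replaced by NaN."""
    return [[math.nan if m else v for v, m in zip(r, mr)] for r, mr in zip(X, mask)]


rng = random.Random(0)
X = [[rng.random() for _ in range(4)] for _ in range(200)]
masks = {
    "MCAR": mcar_mask(len(X), 4, 0.3, rng),
    "MAR": mar_mask(X, 0.3, rng),
    "MNAR": mnar_mask(X, 0.3, rng),
}
X_miss = {name: apply_mask(X, m) for name, m in masks.items()}
```

In the MAR case the driver column is never masked, so the missingness can be explained entirely by observed data. In the MNAR case the explanation lies in the very values that were removed, which is why it is generally the hardest setting for an imputer.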

#### Load the dataset

In [6]:
from sklearn.preprocessing import MinMaxScaler

preproc = MinMaxScaler()


def dataset():
    X, y = load_breast_cancer(return_X_y=True)
    X = np.asarray(preproc.fit_transform(X, y))
    return train_test_split(X, y, test_size=0.2)


def ampute(x, mechanism, p_miss):
    x_simulated = simulate_nan(x, p_miss, mechanism)

    mask = x_simulated["mask"]
    x_miss = x_simulated["X_incomp"]

    return x, x_miss, mask

In [7]:
datasets = {}
headers = ["Plugin"]

pct = 0.3

mechanisms = ["MAR", "MNAR", "MCAR"]
percentages = [pct]

plugins = imputers.list()  # all registered plugins, including custom_ice

X_train, X_test, y_train, y_test = dataset()

for ampute_mechanism in mechanisms:
    for p_miss in percentages:
        if ampute_mechanism not in datasets:
            datasets[ampute_mechanism] = {}

        headers.append(ampute_mechanism + "-" + str(p_miss))
        datasets[ampute_mechanism][p_miss] = ampute(X_train, ampute_mechanism, p_miss)

#### Evaluation

We compare the methods in terms of root mean squared error (RMSE) to the initial dataset.
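As a minimal sketch of what a masked RMSE measures, the snippet below averages squared errors only over the cells flagged as missing. It is written in plain Python for illustration; `masked_rmse` is a hypothetical helper, and HyperImpute's own `RMSE` may differ in its details.

```python
import math


def masked_rmse(x_imputed, x_true, mask):
    """RMSE restricted to the cells flagged as missing in `mask`."""
    squared_errors = [
        (imp - true) ** 2
        for row_imp, row_true, row_mask in zip(x_imputed, x_true, mask)
        for imp, true, missing in zip(row_imp, row_true, row_mask)
        if missing
    ]
    if not squared_errors:
        raise ValueError("mask selects no cells")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


x_true = [[1.0, 2.0], [3.0, 4.0]]
x_imputed = [[1.0, 2.5], [2.0, 4.0]]
mask = [[False, True], [True, False]]  # True marks an imputed cell

score = masked_rmse(x_imputed, x_true, mask)  # sqrt((0.25 + 1.0) / 2)
```

Restricting the error to masked cells keeps the observed entries, which an imputer normally leaves unchanged, from diluting the score.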

In [8]:
results = []
duration = []

for plugin in tqdm(plugins):
    plugin_results = [plugin]
    plugin_duration = [plugin]

    for ampute_mechanism in mechanisms:
        for p_miss in percentages:
            ctx = imputers.get(plugin)
            x, x_miss, mask = datasets[ampute_mechanism][p_miss]

            start = time.time() * 1000
            x_imp = ctx.fit_transform(x_miss)

            plugin_duration.append(round(time.time() * 1000 - start, 4))
            plugin_results.append(RMSE(x_imp.to_numpy(), x, mask))

    results.append(plugin_results)
    duration.append(plugin_duration)

100%|██████████| 13/13 [23:58<00:00, 110.69s/it]


### Reconstruction error (RMSE)

__Interpretation__: The following table shows the reconstruction error, measured as the __root mean squared error (RMSE)__ between the original complete dataset and the imputed dataset, for each method and missingness mechanism. Lower is better.

In [9]:
display(HTML(tabulate.tabulate(results, headers=headers, tablefmt="html")))

| Plugin | MAR-0.3 | MNAR-0.3 | MCAR-0.3 |
|---|---|---|---|
| softimpute | 0.112766 | 0.109049 | 0.0961995 |
| hyperimpute | 0.0818665 | 0.0776259 | 0.0619054 |
| mice | 0.0703866 | 0.0850556 | 0.0733156 |
| nop | | | |
| most_frequent | 0.242866 | 0.231878 | 0.209099 |
| missforest | 0.0859716 | 0.0881004 | 0.0744402 |
| sinkhorn | 0.093691 | 0.0885909 | 0.0762017 |
| EM | 0.0567481 | 0.0653695 | 0.0524116 |
| mean | 0.184329 | 0.164505 | 0.14423 |
| median | 0.197736 | 0.173701 | 0.149625 |


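### XGBoost test score after imputation

__Interpretation__: The following table shows test-set metrics for an XGBoost classifier trained on the training data imputed by each method (MAR mechanism, 30% missing values). The first row, trained on the original complete training data, serves as a reference.
Metrics:
 - accuracy
 - area under the ROC curve (AUROC)
 - area under the precision-recall curve (reported as AURPC)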

In [10]:
from sklearn import metrics


def get_metrics(X_train, y_train, X_test, y_test):
    xgb_clf = xgb.XGBClassifier(verbosity=0)
    xgb_clf = xgb_clf.fit(X_train, y_train)

    y_pred = xgb_clf.predict(X_test)

    score = xgb_clf.score(X_test, y_test)

    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)
    auroc = metrics.auc(fpr, tpr)

    prec, recall, thresholds = metrics.precision_recall_curve(y_test, y_pred)
    aurpc = metrics.auc(recall, prec)

    return score, auroc, aurpc


metrics_headers = ["Plugin", "Accuracy", "AUROC", "AURPC"]
xgboost_test_score = []


x, x_miss, mask = datasets["MAR"][pct]

xgboost_test_score.append(
    ["original dataset", *get_metrics(X_train, y_train, X_test, y_test)]
)

for plugin in plugins:
    X_train_imp = imputers.get(plugin).fit_transform(x_miss.copy())

    score, auroc, aurpc = get_metrics(X_train_imp, y_train, X_test, y_test)

    xgboost_test_score.append([plugin, score, auroc, aurpc])
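Note that `get_metrics` above passes the hard class predictions from `predict` to `roc_curve` and `precision_recall_curve`. Ranking metrics such as AUROC are usually computed from continuous scores (for example, the positive-class column of `predict_proba`), because thresholded labels collapse the ranking into ties. The dependency-free sketch below, with a hypothetical `auroc` helper and made-up toy scores, illustrates the difference; it is not part of the notebook's evaluation.

```python
def auroc(y_true, scores):
    """AUROC as the probability that a random positive outranks a random
    negative, counting ties as half a win (Mann-Whitney formulation)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
proba = [0.2, 0.3, 0.4, 0.9]                  # toy positive-class probabilities
hard = [1 if p >= 0.5 else 0 for p in proba]  # thresholded labels: [0, 0, 0, 1]

auroc_from_proba = auroc(y_true, proba)  # 1.0: every positive outranks every negative
auroc_from_hard = auroc(y_true, hard)    # 0.75: thresholding introduces ties
```

With hard labels the ROC curve has a single interior point, so the reported AUROC depends on the 0.5 threshold rather than on how well the classifier ranks the samples.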

In [11]:
display(
    HTML(
        tabulate.tabulate(xgboost_test_score, headers=metrics_headers, tablefmt="html")
    )
)

| Plugin | Accuracy | AUROC | AURPC |
|---|---|---|---|
| original dataset | 0.991228 | 0.989796 | 0.992424 |
| softimpute | 0.991228 | 0.989796 | 0.992424 |
| hyperimpute | 0.982456 | 0.982104 | 0.989001 |
| mice | 0.991228 | 0.989796 | 0.992424 |
| nop | 0.991228 | 0.989796 | 0.992424 |
| most_frequent | 0.991228 | 0.989796 | 0.992424 |
| missforest | 0.991228 | 0.989796 | 0.992424 |
| sinkhorn | 0.991228 | 0.989796 | 0.992424 |
| EM | 0.982456 | 0.982104 | 0.989001 |
| mean | 0.973684 | 0.9719 | 0.981542 |


### Duration (ms) results

__Info__: Here we measure the time, in milliseconds, taken to impute the dataset with each method.
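The evaluation loop in this notebook times each call with `time.time()`. For interval measurements, `time.perf_counter()` is generally a better fit because it is monotonic and has higher resolution. A minimal sketch, where `timed_ms` is a hypothetical helper and `sorted` stands in for an imputer's `fit_transform`:

```python
import time


def timed_ms(fn, *args, **kwargs):
    """Call fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, round(elapsed_ms, 4)


# Stand-in workload; in the notebook this would be ctx.fit_transform(x_miss).
result, elapsed_ms = timed_ms(sorted, [3, 1, 2])
```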

In [12]:
display(HTML(tabulate.tabulate(duration, headers=headers, tablefmt="html")))

| Plugin | MAR-0.3 | MNAR-0.3 | MCAR-0.3 |
|---|---|---|---|
| softimpute | 27735.8 | 16999.7 | 16842.0 |
| hyperimpute | 104270.0 | 108181.0 | 70727.8 |
| mice | 87594.6 | 155285.0 | 149379.0 |
| nop | 0.3501 | 0.1931 | 0.1396 |
| most_frequent | 7.144 | 6.1777 | 6.4756 |
| missforest | 56601.3 | 65379.5 | 65258.5 |
| sinkhorn | 23047.8 | 22940.0 | 23160.6 |
| EM | 76708.2 | 75736.8 | 74979.7 |
| mean | 12.6116 | 5.5164 | 6.0649 |
| median | 7.5566 | 7.6375 | 7.6201 |


## Debugging

HyperImpute supports **debug** logging. __WARNING__: Don't use it for release builds. 

In [13]:
from hyperimpute import logger

imputers = Imputers()

logger.add(sink=sys.stderr, level="DEBUG")

x, x_miss, mask = datasets["MAR"][pct]

x_imp = imputers.get("EM").fit_transform(x)

imputers.get("softimpute").fit_transform(x_miss)

[2021-12-28T13:37:41.494917+0200][325578][DEBUG] EMPlugin._fit took 0.0 seconds
[2021-12-28T13:37:41.516348+0200][325578][DEBUG] EM converged after 1 iterations.
[2021-12-28T13:37:41.517241+0200][325578][DEBUG] EMPlugin._transform took 0.0215 seconds
[2021-12-28T13:37:58.960419+0200][325578][DEBUG] SoftImputePlugin._fit took 17.4421 seconds
[2021-12-28T13:38:05.198815+0200][325578][DEBUG] SoftImputePlugin._transform took 6.235 seconds


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,0.552274,0.471933,0.552208,0.419871,0.496919,0.457702,0.360506,0.501491,0.427778,0.221778,...,0.529705,0.247868,0.498979,0.339609,0.638777,0.467357,0.384144,0.817182,0.323082,0.216479
1,0.248471,0.390260,0.317877,0.195080,0.343685,0.153580,0.034255,0.094235,0.230808,0.176706,...,0.263252,0.486674,0.238358,0.130333,0.379912,0.120315,0.049768,0.273643,0.130298,0.138594
2,0.416442,0.446398,0.427821,0.347622,0.567572,0.386529,0.499766,0.471123,0.523232,0.491786,...,0.436144,0.492537,0.397878,0.267106,0.588644,0.451349,0.587540,0.698969,0.368858,0.267273
3,0.352075,0.374621,0.350287,0.211665,0.405254,0.290534,0.219963,0.290209,0.413636,0.293597,...,0.298826,0.502132,0.294288,0.157589,0.475005,0.267107,0.249416,0.537801,0.227282,0.252460
4,0.171281,0.312479,0.176145,0.101191,0.399476,0.292375,0.149649,0.131312,0.435354,0.361038,...,0.155112,0.291045,0.138802,0.058887,0.331044,0.217530,0.155045,0.272371,0.271043,0.212379
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
450,0.277770,0.394319,0.268399,0.157370,0.394079,0.195632,0.143533,0.092793,0.262626,0.235468,...,0.230167,0.399520,0.205289,0.113203,0.400683,0.161355,0.146805,0.192474,0.181944,0.173619
451,0.222869,0.498140,0.225140,0.133331,0.640697,0.416600,0.105787,0.225199,0.540909,0.507372,...,0.179651,0.537580,0.177848,0.074346,0.697550,0.288937,0.214834,0.449485,0.244037,0.292929
452,0.116191,0.291173,0.110773,0.057306,0.435768,0.177971,0.063496,0.069881,0.225253,0.413437,...,0.145500,0.346482,0.126401,0.062525,0.410289,0.075298,0.091374,0.173608,0.175241,0.172635
453,0.232335,0.387555,0.225278,0.123139,0.407150,0.189620,0.059864,0.108300,0.484343,0.272536,...,0.182142,0.404851,0.172718,0.082997,0.471703,0.185707,0.092971,0.283952,0.297654,0.121147


# Congratulations!

Congratulations on completing this notebook tutorial! If you enjoyed it and would like to join the movement towards machine learning and AI for medicine, you can do so in the following way.

### Star HyperImpute on GitHub

The easiest way to help our community is to star the repository. This helps raise awareness of the tools we're building.

- [Star HyperImpute](https://github.com/vanderschaarlab/hyperimpute)