# Tutorial 9: Dealing with missing data

Synthcity does not handle missing data on its own. It assumes that you have already imputed any missing values. In this tutorial, we will learn how to do this with another of the van der Schaar Lab's modules, `HyperImpute`, and then show that synthetic data can be generated from the imputed dataset.

Before we start, you will need to install the `hyperimpute` library. This can be done with the command `pip install hyperimpute`.

In [None]:
!pip install synthcity
!pip install hyperimpute
!pip uninstall -y torchaudio torchdata

### Imports

In [None]:
import sys
import warnings

import numpy as np
import pandas as pd

from sklearn.datasets import load_diabetes

from hyperimpute.plugins.utils.simulate import simulate_nan

from IPython.display import display

if not sys.warnoptions:
    warnings.simplefilter("ignore")

from synthcity.plugins.core.dataloader import GenericDataLoader

### Load the data

We will use the diabetes dataset from scikit-learn, but we need to introduce some NaN values in order to simulate the missingness that we will then fix with HyperImpute.

In [None]:

# Load baseline dataset
X, y = load_diabetes(as_frame=True, return_X_y=True)
df_ = pd.concat([X, y], axis=1)

# Simulate missing data
percentage_missing = 0.2
mechanism = "MCAR"
x_miss = simulate_nan(np.asarray(df_), percentage_missing, mechanism)["X_incomp"]


df = pd.DataFrame(x_miss, columns = df_.columns)
print("The diabetes dataset with 20% missing values.\n")
print(df.head())
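
To make the `"MCAR"` (missing completely at random) mechanism concrete, here is a minimal sketch using only NumPy and pandas. It is not HyperImpute's `simulate_nan` implementation; the helper `mask_mcar` and the toy frame are hypothetical names introduced purely for illustration. The key idea is that each cell is hidden with the same probability, independently of the values in the data.

```python
import numpy as np
import pandas as pd


def mask_mcar(frame: pd.DataFrame, p_miss: float, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: set each cell to NaN with probability `p_miss`."""
    rng = np.random.default_rng(seed)
    # Under MCAR, the mask is drawn independently of the data values.
    mask = rng.random(frame.shape) < p_miss
    # DataFrame.mask replaces entries where the condition is True with NaN.
    return frame.astype(float).mask(mask)


toy = pd.DataFrame({"a": np.arange(1000.0), "b": np.arange(1000.0) * 2})
toy_miss = mask_mcar(toy, p_miss=0.2)

fraction_missing = toy_miss.isna().to_numpy().mean()
print(f"Fraction missing: {fraction_missing:.3f}")
```

The observed fraction of missing cells should be close to, but not exactly, `p_miss`, because the mask is random.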


### Try to generate synthetic data

In [None]:
loader = GenericDataLoader(
    df,
    target_column="target",
    sensitive_columns=["sex"],
)

In [None]:
# synthcity absolute
from synthcity.plugins import Plugins

syn_model = Plugins().get("ctgan")

syn_model.fit(loader)

Whoops! Here we see that Synthcity cannot handle NaN values. Proof that we need to impute first!
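
Rather than waiting for the fit to fail, you can check for missing values up front. The sketch below uses standard pandas calls on a small toy frame (the values are made up for illustration); the same two lines would work on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (illustrative values only).
toy = pd.DataFrame(
    {
        "age": [0.03, np.nan, 0.08],
        "bmi": [0.06, -0.05, np.nan],
        "target": [151.0, 75.0, 141.0],
    }
)

# Count missing values per column, and check whether any cell is missing.
missing_per_column = toy.isna().sum()
has_missing = bool(toy.isna().to_numpy().any())

print(missing_per_column)
print("Needs imputation:", has_missing)
```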

### Let's impute
First, we can view the available imputation plugins.

In [None]:
from hyperimpute.plugins.imputers import Imputers, ImputerPlugin

imputers = Imputers()
imputers.list()

### Load an imputer
Imputers are loaded with `Imputers().get(method_name)`. The comments in the cell below explain each parameter we use.

In [None]:
imputer = Imputers().get(
    "hyperimpute",  # the name of the imputation method.
    # The rest of the kwargs are specific to the method
    # optimizer: str. The optimizer to use: simple, hyperband, bayesian
    optimizer="hyperband",
    # classifier_seed: list. Model search pool for categorical columns.
    classifier_seed=["logistic_regression", "catboost", "xgboost", "random_forest"],
    # regression_seed: list. Model search pool for continuous columns.
    regression_seed=[
        "linear_regression",
        "catboost_regressor",
        "xgboost_regressor",
        "random_forest_regressor",
    ],
    # class_threshold: int. Maximum number of unique values a column can have to be treated as categorical.
    class_threshold=5,
    # imputation_order: int. 0 - ascending, 1 - descending, 2 - random
    imputation_order=2,
    # n_inner_iter: int. number of imputation iterations
    n_inner_iter=10,
    # select_model_by_column: bool. If true, select a different model for each column. Else, it reuses the model chosen for the first column.
    select_model_by_column=True,
    # select_model_by_iteration: bool. If true, selects new models for each iteration. Else, it reuses the models chosen in the first iteration.
    select_model_by_iteration=True,
    # select_lazy: bool. If false, starts the optimizer on every column unless other restrictions apply. Else, if for the current iteration there is a trend (at least two columns of the same type got the same model from the optimizer), it reuses the same model class for all the columns without starting the optimizer.
    select_lazy=True,
    # select_patience: int. How many iterations without objective function improvement to wait.
    select_patience=5,
)
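
As a rough illustration of what `class_threshold` controls, the sketch below labels columns by their number of distinct values. This is an assumption about the rule based on the parameter description above, not HyperImpute's actual code; in particular, whether the comparison is `<=` or `<` is a guess. The toy frame is made up for illustration.

```python
import pandas as pd

class_threshold = 5  # same value as passed to the imputer

toy = pd.DataFrame(
    {
        "sex": [0, 1, 1, 0, 1, 0, 1, 0],
        "bmi": [21.5, 30.1, 25.3, 27.8, 22.0, 33.4, 29.9, 24.6],
    }
)

# Assumption: a column is treated as categorical when it has at most
# `class_threshold` distinct non-missing values, and as continuous otherwise.
n_unique = toy.nunique(dropna=True)
column_kind = {
    col: ("categorical" if n <= class_threshold else "continuous")
    for col, n in n_unique.items()
}
print(column_kind)
```

Under this reading, categorical columns would draw from the `classifier_seed` pool and continuous columns from the `regression_seed` pool.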


### Impute the missing values
We impute the missing values with `imputer.fit_transform()`, a pattern you may be familiar with from `sklearn`.

In [None]:
x_imputed = imputer.fit_transform(df.copy())
display(x_imputed)
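
If the `fit_transform()` convention is new to you, the toy class below shows the idea: `fit` learns something from the data, `transform` applies it, and `fit_transform` does both in one call. `MeanImputer` is a hypothetical, deliberately simple stand-in and is not part of HyperImpute, which uses far more capable models. The final check is the kind of sanity test you could also run on `x_imputed` before passing it to synthcity.

```python
import numpy as np
import pandas as pd


class MeanImputer:
    """Hypothetical toy imputer illustrating the fit/transform convention."""

    def fit(self, frame: pd.DataFrame) -> "MeanImputer":
        # Learn one fill value per column from the observed entries.
        self.means_ = frame.mean(skipna=True)
        return self

    def transform(self, frame: pd.DataFrame) -> pd.DataFrame:
        # Replace each NaN with the column mean learned in fit().
        return frame.fillna(self.means_)

    def fit_transform(self, frame: pd.DataFrame) -> pd.DataFrame:
        return self.fit(frame).transform(frame)


toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 6.0]})
toy_imputed = MeanImputer().fit_transform(toy)
print(toy_imputed)

# Sanity check: no missing values should remain after imputation.
no_missing_left = not toy_imputed.isna().to_numpy().any()
print("No missing values left:", no_missing_left)
```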

In [None]:
loader = GenericDataLoader(
    x_imputed,
    target_column="target",
    sensitive_columns=["sex"],
)

In [None]:
# synthcity absolute
from synthcity.plugins import Plugins

syn_model = Plugins().get("ctgan")

syn_model.fit(loader)

In [None]:
X_syn = syn_model.generate(count=df_.shape[0]).dataframe()
display(X_syn.head())

In [None]:
# third party
import matplotlib.pyplot as plt

syn_model.plot(plt, loader, plots=["tsne"])

plt.show()

### Done!
We have now generated synthetic data with synthcity, starting from a dataset with missing values, thanks to [HyperImpute](https://github.com/vanderschaarlab/hyperimpute). Visit the [HyperImpute tutorials](https://github.com/vanderschaarlab/hyperimpute/tree/main/tutorials) for more details on how to impute data using the other available methods.

## Congratulations!

Congratulations on completing this notebook tutorial! If you enjoyed it and would like to join the movement towards machine learning and AI for medicine, you can do so in the following ways!

### Star [Synthcity](https://github.com/vanderschaarlab/synthcity) on GitHub

- The easiest way to help our community is just by starring the repos! This helps raise awareness of the tools we're building.


### Check out other projects from vanderschaarlab
- [HyperImpute](https://github.com/vanderschaarlab/hyperimpute) - the module you have just learnt to use with synthcity!
- [AutoPrognosis](https://github.com/vanderschaarlab/autoprognosis)