# Close Gaps in Your Data with Smart Imputation <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/smart-imputation/smart-imputation.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

Dealing with datasets that contain missing values can be a challenge. This is especially true when the remaining non-missing values are not representative and thus provide a distorted, biased picture of the overall population.

In this tutorial, we demonstrate how MOSTLY AI can help close such gaps in your data via "Smart Imputation". By generating a synthetic dataset that does not contain any missing values, it is possible to create a complete and sound representation of the underlying population. With this smartly imputed synthetic dataset, it is then straightforward to analyze the population as if all values had been present in the first place.

For this tutorial, we will be using a modified version of the UCI Adult Income dataset, which itself stems from the 1994 Census database of the US Census Bureau. This reduced dataset consists of 48,842 records and 10 mixed-type features. We will replace ~30% of the values for attribute `age` with missing values. We will do this randomly, but with a specified bias, so that the age information ends up missing particularly for the older segments.

## Data Preparation for this Tutorial

We start by artificially injecting missing values into the original data via the following code.

In [None]:
%pip install mostlyai  # or: pip install 'mostlyai[local]'

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv("https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz")


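def mask(prob, col=None, values=None):
    """Set `age` to missing with probability `prob`, optionally only for rows where `col` is in `values`."""
    # draw one uniform random number per row; rows below `prob` are candidates for masking
    is_masked = np.random.uniform(size=df.shape[0]) < prob
    if col is not None:
        # restrict masking to rows whose `col` value is in `values`
        is_masked = is_masked & df[col].isin(values)
    # modifies the global `df` in place
    df["age"] = df["age"].mask(is_masked)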


np.random.seed(123)
mask(0.1, "age", [51 + i for i in range(20)])
mask(0.2, "age", [71 + i for i in range(20)])
mask(0.6, "income", [">50K"])
mask(0.6, "education", ["Doctorate", "Prof-school", "Masters"])
mask(0.6, "marital_status", ["Widowed", "Divorced"])
mask(0.6, "occupation", ["Exec-managerial"])
mask(0.6, "workclass", ["Self-emp-inc"])
tgt = df
print(f"Created original data with missing values with {tgt.shape[0]:,} records and {tgt.shape[1]} attributes")
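
The masking above is deliberately not random across the population: values are dropped more often for certain groups. To illustrate why this matters, here is a minimal, self-contained sketch on toy data. The uniform age range, the cut-off at 60, and the column name `age_observed` are assumptions chosen for illustration and are not part of the census data. It uses the same `Series.mask` technique and shows how biased missingness shifts a simple statistic such as the mean.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the census file): ages drawn uniformly from 18..90
rng = np.random.default_rng(0)
toy = pd.DataFrame({"age": rng.integers(18, 91, size=10_000).astype(float)})

# Drop `age` with 60% probability, but only for records aged 60 or above
is_masked = (rng.uniform(size=len(toy)) < 0.6) & (toy["age"] >= 60)
toy["age_observed"] = toy["age"].mask(is_masked)

true_mean = toy["age"].mean()
observed_mean = toy["age_observed"].mean()  # NaN values are skipped by default
missing_share = toy["age_observed"].isna().mean()
print(f"missing: {missing_share:.1%}, true mean: {true_mean:.1f}, observed mean: {observed_mean:.1f}")
```

Because only older records lose their value, any analysis that simply ignores the missing entries underestimates the mean age of the toy population.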

In [None]:
# let's show some samples
tgt[["workclass", "education", "marital_status", "age"]].sample(n=10, random_state=42)

In [None]:
# report share of missing values for column `age`
print(f"{tgt['age'].isna().mean():.1%} of values for column `age` are missing")

In [None]:
# plot distribution of column `age`
import matplotlib.pyplot as plt

tgt.age.plot(kind="kde", label="Original Data (with missings)", color="black")
_ = plt.legend(loc="upper right")
_ = plt.title("Age Distribution")
_ = plt.xlim(13, 90)
_ = plt.ylim(0, None)

## Synthesize Data via MOSTLY AI

The code below will automatically create a Generator using the MOSTLY AI Synthetic Data SDK. Then we will use that Generator to create a synthetic dataset with Smart Imputation turned on for the `age` column.

In [None]:
from mostlyai.sdk import MostlyAI

# initialize SDK
mostly = MostlyAI()

In [None]:
# train a generator on the original training data
g = mostly.train(data=tgt, name="Smart Imputation Tutorial - Census")

In [None]:
# generate synthetic data with imputed age column
config = {
    "name": "Smart Imputation Tutorial - Census",
    "tables": [{"name": "data", "configuration": {"imputation": {"columns": ["age"]}}}],
}
syn = mostly.generate(g, config=config).data()
print(f"Created synthetic data with {syn.shape[0]:,} records and {syn.shape[1]:,} attributes")
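
The config above is a plain Python dictionary, so it can be assembled programmatically. The following is a small sketch of a hypothetical helper, `imputation_config`, which is not part of the SDK. It only reproduces the dictionary layout used in this tutorial. Whether listing several columns behaves as expected should be checked against the SDK documentation.

```python
def imputation_config(name, columns, table="data"):
    """Build a generation config that requests imputation for `columns`.

    Hypothetical helper; it only mirrors the dictionary layout shown in this tutorial.
    """
    return {
        "name": name,
        "tables": [{"name": table, "configuration": {"imputation": {"columns": list(columns)}}}],
    }


config = imputation_config("Smart Imputation Tutorial - Census", ["age"])
print(config)
```

The resulting dictionary is identical to the hand-written `config` and can be passed to `mostly.generate(g, config=config)` in the same way.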

If you want to, you can now check the distributions in the Model QA and Data QA reports. Download these via `sd.reports()`, where `sd` stands for the synthetic dataset object returned by `mostly.generate(...)` before `.data()` is called on it. The Model QA report covers the accuracy and privacy of the trained generative model. It shows that the distributions are faithfully learned, including the right share of missing values. The Data QA report then visualizes the distributions of the delivered synthetic dataset. There, the share of missing values (`N/A`) has dropped to 0%, and the distribution has shifted towards older age buckets.

## Analyze the Results

We can now explore the imputed synthetic data.

In [None]:
# show some synthetic samples
syn[["workclass", "education", "marital_status", "age"]].sample(n=10, random_state=42)

In [None]:
# report share of missing values for column `age`
print(f"{syn['age'].isna().mean():.1%} of values for column `age` are missing")

In [None]:
# plot side-by-side
import matplotlib.pyplot as plt

tgt.age.plot(kind="kde", label="Original Data (with missings)", color="black")
syn.age.plot(kind="kde", label="Synthetic Data (imputed)", color="green")
_ = plt.title("Age Distribution")
_ = plt.legend(loc="upper right")
_ = plt.xlim(13, 90)
_ = plt.ylim(0, None)

As one can see, the imputed synthetic data no longer contains any missing values. It is also apparent that the synthetic age distribution differs clearly from the distribution of the non-missing values that were provided.

So let's check whether that new distribution is more representative of the ground truth, i.e. the underlying original age distribution.

In [None]:
raw = pd.read_csv("https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz")

# plot side-by-side
import matplotlib.pyplot as plt

tgt.age.plot(kind="kde", label="Original Data (with missings)", color="black")
raw.age.plot(kind="kde", label="Original Data (ground truth)", color="red")
syn.age.plot(kind="kde", label="Synthetic Data (imputed)", color="green")
_ = plt.title("Age Distribution")
_ = plt.legend(loc="upper right")
_ = plt.xlim(13, 90)
_ = plt.ylim(0, None)
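
The plot gives a visual impression. To put a number on how close two age distributions are, one option is the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between the two empirical cumulative distribution functions. The sketch below is an addition to the tutorial, not part of the MOSTLY AI SDK. The helper name `ks_distance` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the empirical CDFs."""
    # missing values are dropped before comparing
    a = np.sort(pd.Series(a).dropna().to_numpy(dtype=float))
    b = np.sort(pd.Series(b).dropna().to_numpy(dtype=float))
    # evaluate both empirical CDFs at every observed value
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Toy data (hypothetical): uniform ages, with values aged 60+ dropped 60% of the time
rng = np.random.default_rng(0)
truth = pd.Series(rng.integers(18, 91, size=10_000).astype(float))
observed = truth.mask((rng.uniform(size=len(truth)) < 0.6) & (truth >= 60))

print(f"KS(observed, truth) = {ks_distance(observed, truth):.3f}")
print(f"KS(truth, truth)    = {ks_distance(truth, truth):.3f}")
```

In the notebook, the same helper could be applied as `ks_distance(tgt["age"], raw["age"])` and `ks_distance(syn["age"], raw["age"])`. A smaller value means the distribution is closer to the ground truth.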

## Conclusion

As we can see, the smartly imputed synthetic data closely recovers the original, suppressed distribution. As an analyst, you can proceed with exploratory and descriptive analysis as if the values had been present in the first place.

## Further Reading

For a benchmark of Smart Imputation against other commonly used imputation techniques, see https://mostly.ai/blog/smart-imputation-with-synthetic-data