# **Artificial Intelligence and Society 2024/2025**
**Miriam Seoane Santos** (FCUP/DCC & LIAAD)

## **T07: Missing Data Generation and Imputation**

In this tutorial, we will explore the basic concepts related to **missing data**. 

We will simulate the missing data problem in the **iris dataset** and mitigate it using data imputation (i.e., replacing the missing values with plausible estimates). To that end, we will apply three different imputation techniques and assess their impact on classification performance. 

In terms of software, we will essentially rely on the [scikit-learn](https://scikit-learn.org/stable/modules/impute.html) standard stack, and the Missing Completely At Random (MCAR) mechanism, which we will also implement. You may further explore other missing data mechanisms implemented in [mdatagen](https://arthurmangussi.github.io/pymdatagen/).
 
1. Basic Setup and Util Functions
2. Exploring Mean Imputation
3. Exploring KNN Imputation
4. Exploring MICE Imputation

*Let's get started!*

## **Basic Setup and Util Functions**

We start by importing the necessary modules, loading the iris dataset, artificially inject it with missing values following a MCAR mechanisms, and create an auxiliary function that trains and evaluates a logistic regression model.

In [1]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

We can predefine a missing rate to be used across the tutorial:

In [5]:
# Try 0.1, 0.3, and 0.5.
MISSING_RATE = 0.3

Loads the **iris dataset** into the variables `X` (containing the data features) and `y` (containing the target labels).

In [2]:
X, y = load_iris(return_X_y=True)

Now, we simulate MCAR by creating a random number generator and defining a mask to introduce missing values into the dataset `X` based on the predefined missing rate. The selected positions in `X` are replace with **NaN** values to simulate missing data.

In [6]:
rng = np.random.default_rng()
missing_mask = rng.choice([0, 1], X.shape,
                          p=[1. - MISSING_RATE, MISSING_RATE])
missing_mask = np.argwhere(missing_mask == 1)
X[missing_mask[:, 0], missing_mask[:, 1]] = np.nan

In [17]:
missing_rate = np.isnan(X).sum() / np.size(X)
print(f"Global Missing Rate = {missing_rate * 100:.0f}%")

Global Missing Rate = 28%


We then split the dataset `X` and labels `y` into training and test sets (`X_train`, `X_test`, `y_train`, `y_test`) while preserving the original class distribution using stratified sampling.

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

Now, our helper `run_logistic_regression` function trains and evaluates a logistic regression model by imputing missing values in the training set and testing data using a specified imputer, fitting the model on the training data, predicting the labels for the testing data, and returning the predictions along with a classification report.

In [None]:
def run_logistic_regression(X_train_lr, X_test_lr, y_train_lr, y_test_lr, imputer):
    X_train_lr = imputer.fit_transform(X_train_lr)
    X_test_lr = imputer.fit_transform(X_test_lr)        # Maybe change to fit on training?

    clf = LogisticRegression(max_iter=200)
    clf.fit(X_train_lr, y_train_lr)
    y_pred = clf.predict(X_test_lr)

    report = classification_report(y_test_lr, y_pred)
    return y_pred, report

## **Exploring Mean Imputation**

We start with a `SimpleImputer` to fill missing values with the mean, and then train a logistic regression model using the imputed data, and printing the classification report.

In [21]:
simple_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
y_pred, clf_report = run_logistic_regression(X_train, X_test, y_train, y_test, simple_imputer)
print("Results with Mean Imputation")
print(clf_report)

Results with Mean Imputation
              precision    recall  f1-score   support

           0       1.00      0.92      0.96        12
           1       0.76      1.00      0.87        13
           2       1.00      0.77      0.87        13

    accuracy                           0.89        38
   macro avg       0.92      0.90      0.90        38
weighted avg       0.92      0.89      0.90        38



## **Exploring kNN Imputation**

Rather than the mean, we can decide on a `KNNImputer` with `k=5` to fill missing values. Similarly, we train a logistic regression model using the imputed data, and print the classification report.

In [22]:
knn_imputer = KNNImputer(n_neighbors=5)
y_pred, clf_report = run_logistic_regression(X_train, X_test, y_train, y_test, knn_imputer)
print("Results with kNN Imputation")
print(clf_report)

Results with kNN Imputation
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       0.81      1.00      0.90        13
           2       1.00      0.77      0.87        13

    accuracy                           0.92        38
   macro avg       0.94      0.92      0.92        38
weighted avg       0.94      0.92      0.92        38



## **Exploring MICE Imputation**

Finally, we explore MICE by creating an `IterativeImputer` with 100 iterations to fill missing values. The `IterativeImputer` implements the MICE behavior, automatically aggregating the results into a single imputation.

In [23]:
mice_imputer = IterativeImputer(max_iter=100)
y_pred, clf_report = run_logistic_regression(X_train, X_test, y_train, y_test, mice_imputer)
print("Results with MICE Imputation")
print(clf_report)

Results with MICE Imputation
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       0.87      1.00      0.93        13
           2       1.00      0.85      0.92        13

    accuracy                           0.95        38
   macro avg       0.96      0.95      0.95        38
weighted avg       0.95      0.95      0.95        38



## **Bibliography**
- To consolidate the concepts discussed herein, please refer to the following:
    - Van Buuren, Stef. [Flexible imputation of missing data](https://stefvanbuuren.name/fimd/). CRC press, 2018.
    - [Imputation of Missing Values](https://scikit-learn.org/stable/modules/impute.html), scikit-learn documentation.