# Improving MICE for Data Imputation: A Methodological and Practical Exploration

## Evaluation
This section evaluates how well our model imputes missing values.
We compare it with the following models:
1. MICE - Multiple Imputation by Chained Equations (the original model we are trying to improve).
2. KNNI - K-Nearest Neighbors Imputation.
3. SICE - Single Imputation with Chained Equations.

For the ablation study, we compare several variants of our improvements:
1. Ordered only - MICE where the imputation order is computed from the Bayesian network structure.
2. Correlated variables only - MICE where only the correlated variables are used as features for the linear regression.
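
The two ablation variants can be sketched as follows. This is a minimal illustration, not the implementation evaluated here. It assumes the Bayesian network is available as a mapping from each column to its parents, that parents should be imputed before their children, and that "correlated" means an absolute Pearson correlation above a cutoff. The names `imputation_order` and `correlated_features`, the toy DAG, the 0.3 threshold, and the synthetic data are all hypothetical.

```python
from graphlib import TopologicalSorter

import numpy as np
import pandas as pd


def imputation_order(parents):
    """Order columns so that each one comes after its parents in the DAG."""
    return list(TopologicalSorter(parents).static_order())


def correlated_features(df, target, threshold=0.3):
    """Columns whose absolute Pearson correlation with `target` is >= threshold."""
    corr = df.corr()[target].drop(target).abs()
    return corr[corr >= threshold].index.tolist()


# Hypothetical DAG over three columns: {column: set of parent columns}
toy_dag = {"age": set(), "sysBP": {"age"}, "diaBP": {"sysBP"}}
order = imputation_order(toy_dag)

# Synthetic data, generated only to exercise the helper (assumption)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy_df = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.1, size=500),
    "noise": rng.normal(size=500),
})
features = correlated_features(toy_df, "y")

print(order)
print(features)
```

In a chained-equations loop, `order` would decide which column is imputed first, and `features` would restrict the regressors used for each target column.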

### STEP 0 - Imports and constants

In [1]:
# General
import numpy as np
import pandas as pd
from datetime import datetime

# Models
from reparo import MICE, SICE, KNNImputer

# Metrics
from sklearn.metrics import (
    root_mean_squared_error as RMSE,
    mean_squared_error as MSE,
    mean_absolute_error as MAE,
)

# Utils
from missing_values_injections import inject_missing_values

# Constants
DATA_FOLDER = "data/"
FRAMINGHAM = {"name": "framingham.csv",
              "numeric_columns": ["age", "education", "cigsPerDay", "BPMeds", "totChol", "sysBP", "diaBP", "heartRate", "glucose"], # TODO - check about education
              }

metrics_dict = {
    "RMSE": RMSE,
    "MSE": MSE,
    "MAE": MAE
}

models_dict = {
    # "OrderedMICE": OrderedMICE,
    # "CorrelatedMICE": CorrelatedMICE,
    # "OrderedCorrelatedMICE": OrderedCorrelatedMICE,
    "MICE": MICE,
    "SICE": SICE,
    "KNN": KNNImputer
}

### STEP 1 - Load data

In [2]:
framingham_df = pd.read_csv(DATA_FOLDER + FRAMINGHAM["name"])

In [3]:
print(framingham_df.shape)
framingham_df.head()

(4240, 16)


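|   | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |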


### STEP 2 - Remove irrelevant rows and columns
- Keep only the numeric columns listed in `FRAMINGHAM["numeric_columns"]`.
- Remove rows with missing values.

In [4]:
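# Select the numeric columns and drop incomplete rows in one chained step;
# calling dropna(inplace=True) on a column slice can trigger SettingWithCopyWarning
framingham_df = framingham_df[FRAMINGHAM["numeric_columns"]].dropna()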

# Verify that there are no missing values
print("Total amount of missing values:", framingham_df.isnull().sum().sum())
print("New shape:", framingham_df.shape)

Total amount of missing values: 0
New shape: (3671, 9)


### Start the comparison
For each injection severity level:
1. Inject missing values.
2. For each missing-value imputation algorithm:
    1. Impute the missing values.
    2. Calculate the metrics.
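
The steps above can be sketched end to end on synthetic data. The `inject_missing_values` helper used in this notebook is not shown, so `inject_missing_rows` below is a hypothetical stand-in that blanks one random cell in a given fraction of rows; the real helper may behave differently. Column-mean imputation stands in for the actual models. The sketch also contrasts a metric computed over the whole array (as this notebook does) with one restricted to the injected cells: assuming the imputer leaves observed cells unchanged, those cells contribute zero error and dilute the full-array score.

```python
import numpy as np


def inject_missing_rows(data, rows_severity, rng):
    """Hypothetical helper: set one random cell to NaN in a fraction of rows."""
    corrupted = data.astype(float).copy()
    n_rows, n_cols = corrupted.shape
    n_hit = int(round(rows_severity * n_rows))
    rows = rng.choice(n_rows, size=n_hit, replace=False)
    cols = rng.integers(0, n_cols, size=n_hit)
    corrupted[rows, cols] = np.nan
    return corrupted


def masked_rmse(original, imputed, mask):
    """RMSE over the cells flagged in `mask` only."""
    diff = original[mask] - imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


rng = np.random.default_rng(0)
# Synthetic stand-in for the cleaned dataset (assumption)
original = rng.normal(loc=100.0, scale=15.0, size=(200, 4))

with_missing = inject_missing_rows(original, rows_severity=0.2, rng=rng)
mask = np.isnan(with_missing)

# Column-mean imputation stands in for MICE / SICE / KNN
column_means = np.nanmean(with_missing, axis=0)
imputed = np.where(mask, column_means, with_missing)

rmse_all = float(np.sqrt(np.mean((original - imputed) ** 2)))
rmse_injected = masked_rmse(original, imputed, mask)

print("RMSE over all cells:     ", rmse_all)
print("RMSE over injected cells:", rmse_injected)
```

Because only the injected cells carry error, the full-array RMSE shrinks with the share of cells that were never missing, so scores at different severity levels are easier to compare when restricted to the mask.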

In [5]:
import pickle

original_data = framingham_df.to_numpy()
# Severity levels
severity_levels = [0.1, 0.2, 0.3, 0.4]
injected_columns = list(framingham_df.columns)
results = {}

for current_severity_level in severity_levels:
    print("Severity level:", current_severity_level)
    current_severity_results = {}
    # Inject missing values
    data_with_missing = inject_missing_values(framingham_df, columns=injected_columns, rows_severity=current_severity_level).to_numpy()

    for current_model_name, current_model in models_dict.items():
        print("    Model:", current_model_name)
        # Impute missing values
        start_time = datetime.now()
        data_with_imputations = current_model().fit_transform(data_with_missing)
        end_time = datetime.now()
        
        # Calculate metrics
        current_severity_results[current_model_name] = {metric_name: metric_func(original_data, data_with_imputations) 
                                                        for metric_name, metric_func in metrics_dict.items()}
        current_severity_results[current_model_name]["time"] = (end_time - start_time).total_seconds() * 1000
    print(current_severity_results)
    results[current_severity_level] = current_severity_results
    with open("results.pkl", "wb") as f:
        pickle.dump(results, f)

results_df = pd.DataFrame(results)
print(results_df)

Severity level: 0.1
    Model: MICE
    Model: SICE
    Model: KNN
{'MICE': {'RMSE': 4.308558076810392, 'MSE': 34.89766170694665, 'MAE': 0.9743387978843785, 'time': 588.463}, 'SICE': {'RMSE': 24.12231435210674, 'MSE': 1087.4109839886196, 'MAE': 7.407548654620298, 'time': 93731.35}, 'KNN': {'RMSE': 5.14836742119827, 'MSE': 46.09236215486541, 'MAE': 1.1703230442480743, 'time': 5718.277999999999}}
Severity level: 0.2
    Model: MICE
    Model: SICE
    Model: KNN
{'MICE': {'RMSE': 6.099777051334195, 'MSE': 67.88282654369326, 'MAE': 1.9831574479528749, 'time': 577.506}, 'SICE': {'RMSE': 34.11999469840435, 'MSE': 2180.91881564212, 'MAE': 14.852310905293743, 'time': 375419.778}, 'KNN': {'RMSE': 7.544385530570079, 'MSE': 99.4515831004285, 'MAE': 2.4625968433339605, 'time': 13106.047999999999}}
Severity level: 0.3
    Model: MICE
    Model: SICE
    Model: KNN
{'MICE': {'RMSE': 7.566520242286542, 'MSE': 102.35326547331654, 'MAE': 3.028848519948139, 'time': 1156.216}, 'SICE': {'RMSE': 41.938935

MemoryError: Unable to allocate 2.27 MiB for an array with shape (16515, 2, 9) and data type float64

In [7]:
results

{0.1: {'MICE': {'RMSE': 2.327087556107419,
   'MSE': 17.785909979167933,
   'MAE': 0.5242930092935538},
  'KNN': {'RMSE': 2.9105594771411596,
   'MSE': 25.441002663923186,
   'MAE': 0.6629902287020697}},
 0.2: {'MICE': {'RMSE': 3.3340082811231806,
   'MSE': 35.7163372416136,
   'MAE': 1.0757643837076727},
  'KNN': {'RMSE': 4.476690404126753,
   'MSE': 59.32145744282983,
   'MAE': 1.4755357488082572}},
 0.3: {'MICE': {'RMSE': 4.074866923273656,
   'MSE': 52.09306411981672,
   'MAE': 1.644564896840846},
  'KNN': {'RMSE': 5.371500273277924,
   'MSE': 86.35813200798657,
   'MAE': 2.155940486613982}}}
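
A nested dictionary of this shape is easier to compare once flattened into a table. The sketch below shows one possible way to do that with pandas. It uses only the RMSE values from the output above, rounded to three decimals; the variable names `rows`, `tidy`, and `rmse_table` are illustrative.

```python
import pandas as pd

# RMSE values rounded from the output above; other metrics follow the same shape
results = {
    0.1: {"MICE": {"RMSE": 2.327}, "KNN": {"RMSE": 2.911}},
    0.2: {"MICE": {"RMSE": 3.334}, "KNN": {"RMSE": 4.477}},
    0.3: {"MICE": {"RMSE": 4.075}, "KNN": {"RMSE": 5.372}},
}

# One row per (severity, model) pair, with each metric as its own column
rows = [
    {"severity": severity, "model": model, **metrics}
    for severity, per_model in results.items()
    for model, metrics in per_model.items()
]
tidy = pd.DataFrame(rows)

# Severity levels as rows, models as columns
rmse_table = tidy.pivot(index="severity", columns="model", values="RMSE")
print(rmse_table)
```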