# Improving MICE for Data Imputation: A Methodological and Practical Exploration
## Evaluation
Comparing the performance of the our model to impute missing values.
Our model is compared with the following models:
1. MICE - Multiple Imputation by Chained Equations (The original model we're trying to improve).
2. KNNI - K-Nearest Neighbors Imputation.
3. SICE - Single Imputation with Chained Equations.

For ablation study, we compared several versions of our improvements:
1. Ordered only - MICE where the imputation order is computed using the Bayesian Network structure.
2. correlated variables in regression only - MICE where only the correlated variables are used as features to the linear regression. 

### STEP 0 - Imports and constants

In [1]:
# General
import numpy as np
import pandas as pd
from datetime import datetime
import pickle

# Disable warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Models
from BIMICE import BIMICE
from reparo import \
    MICE, \
    SICE, \
    KNNImputer as KNN

# Metrics
from sklearn.metrics import \
    root_mean_squared_error as RMSE, \
    mean_squared_error as MSE, \
    mean_absolute_error as MAE

metrics_dict = {
    "RMSE": RMSE,
    "MSE": MSE,
    "MAE": MAE
}

models_dict = {
    "OrderBIMICE": BIMICE(order_imputations=True, filter_predicators=False),
    "FilterBIMICE": BIMICE(order_imputations=False, filter_predicators=True),
    "FullBIMICE": BIMICE(order_imputations=True, filter_predicators=True),
    "MICE": MICE(),
    # "SICE": SICE(),
    "KNN": KNN()
}

baselines_algorithms = ["MICE", "SICE", "KNN"]

### STEP 1 - Load data

#### Define the datasets

In [2]:
DATA_FOLDER = "data/"
FILE_SUFFIX = ".csv"
FRAMINGHAM = {"name": "framingham",
              "numeric_columns": ["age", "education", "cigsPerDay", "BPMeds", "totChol", "sysBP", "diaBP", "heartRate", "glucose"]
}
FINANCIAL_RISK = {"name": "financial-risk",
             "numeric_columns": ["Age", "Income", "Credit Score", "Loan Amount", "Years at Current Job", "Debt-to-Income Ratio", "Assets Value"]
}

#### Choose dataset to work with
Options are FRAMINGHAM or FINANCIAL_RISK

In [3]:
dataset = FINANCIAL_RISK

In [4]:
df = pd.read_csv(DATA_FOLDER + dataset["name"] + FILE_SUFFIX)

In [5]:
print(df.shape)
df.head()

(15000, 20)


Unnamed: 0,Age,Gender,Education Level,Marital Status,Income,Credit Score,Loan Amount,Loan Purpose,Employment Status,Years at Current Job,Payment History,Debt-to-Income Ratio,Assets Value,Number of Dependents,City,State,Country,Previous Defaults,Marital Status Change,Risk Rating
0,49,Male,PhD,Divorced,72799.0,688.0,45713.0,Business,Unemployed,19,Poor,0.154313,120228.0,0.0,Port Elizabeth,AS,Cyprus,2.0,2,Low
1,57,Female,Bachelor's,Widowed,,690.0,33835.0,Auto,Employed,6,Fair,0.14892,55849.0,0.0,North Catherine,OH,Turkmenistan,3.0,2,Medium
2,21,Non-binary,Master's,Single,55687.0,600.0,36623.0,Home,Employed,8,Fair,0.362398,180700.0,3.0,South Scott,OK,Luxembourg,3.0,2,Medium
3,59,Male,Bachelor's,Single,26508.0,622.0,26541.0,Personal,Unemployed,2,Excellent,0.454964,157319.0,3.0,Robinhaven,PR,Uganda,4.0,2,Medium
4,25,Non-binary,Bachelor's,Widowed,49427.0,766.0,36528.0,Personal,Unemployed,10,Fair,0.143242,287140.0,,New Heather,IL,Namibia,3.0,1,Low


### STEP 2 - Remove not relevant rows and columns
- Remove rows with missing values.
- Remove columns with non-numeric features.

In [6]:
df = df[dataset["numeric_columns"]]
df.dropna(inplace=True)

# Verify that there are no missing values
print("Total amount of missing values:", df.isnull().sum().sum())
print("New shape:", df.shape)

Total amount of missing values: 0
New shape: (7839, 7)


### STEP 3 - Missing values injection functions
We are using two types of missing values injection functions:
1. inject_missing_completely_at_random - Injection a portion of missing values in random cells in the dataset.
2. inject_missing_per_feature - Injecting a different portion of missing values for each column. 

In [7]:
# Set seed
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

#### Method 1: Injecting completely at random

In [8]:
def inject_missing_completely_at_random(
        df: pd.DataFrame, 
        missing_rate: float
) -> pd.DataFrame:
    """
    Inject missing values completely at random
    :param df: DataFrame
    :param missing_rate: float
    :return: DataFrame
    """
    df = df.copy()
    mask = np.random.rand(*df.shape) < missing_rate
    df[mask] = np.nan
    return df

#### Method 2: Injecting at random to given columns

In [9]:
def inject_missing_per_feature(df: pd.DataFrame,
                               missing_rate: float,
                               features: list
 ) -> pd.DataFrame:
    """
    Inject missing values per feature
    :param df: DataFrame
    :param missing_rate: float
    :param features: list
    :return: DataFrame
    """
    df = df.copy()
    for feature in features:
        mask = np.random.rand(df.shape[0]) < missing_rate
        df.loc[mask, feature] = np.nan
    return df

### STEP 4 - Evaluations
We evaluate the results using two tests:
1. Over different severities of missing completely at random (ranging from 10% to 40%).
2. Over different number of features with missing values.

#### Hyperparameter tuning:

In [10]:
iterations_amounts = [5, 10, 15]

#### Test 1 - Different severities of missing completely at random

In [11]:
original_data = df.to_numpy()
# Severity levels
severity_levels = [0.1, 0.2, 0.3, 0.4]

results = {}

for iterations_amount in iterations_amounts:
    print("Iterations amount:", iterations_amount)
    current_iterations_results = {}

    for current_severity_level in severity_levels:
        print("    Severity level:", current_severity_level)
        current_severity_results = {}
        # Inject missing values
        data_with_missing = inject_missing_completely_at_random(df, current_severity_level).to_numpy()
        missing_features_count = int(np.sum(np.any(np.isnan(data_with_missing), axis=0)))

        for current_model_name, current_model in models_dict.items():
            print("        Model:", current_model_name)
            # Impute missing values
            start_time = datetime.now()
            data_with_imputations = current_model.fit_transform(data_with_missing, n_iterations=iterations_amount)
            end_time = datetime.now()
            
            max_predictors_count = len(df.columns) - 1
            predicators_counts = [max_predictors_count] * len(df.columns)
            if current_model_name not in baselines_algorithms:
                data_with_imputations, predicators_counts = data_with_imputations
            
            # Calculate metrics
            current_severity_results[current_model_name] = {metric_name: metric_func(original_data, data_with_imputations) 
                                                            for metric_name, metric_func in metrics_dict.items()}
            # Calculate MAPE
            features_mape_sum = 0
            for column_index, predicators_count in enumerate(predicators_counts):
                features_mape_sum += metrics_dict["MAE"](original_data[:, column_index], data_with_imputations[:, column_index]) / max(max_predictors_count - predicators_count, 1)
            current_severity_results[current_model_name]["MAPE"] = features_mape_sum / missing_features_count
            current_severity_results[current_model_name]["time"] = (end_time - start_time).total_seconds() * 1000
            
        print(current_severity_results)
        current_iterations_results[current_severity_level] = current_severity_results

    results[iterations_amount] = current_iterations_results
    with open(f"{dataset['name']}__MCAR__results.pkl", "wb") as f:
        pickle.dump(results, f)

Iterations amount: 5
    Severity level: 0.1
        Model: OrderBIMICE
        Model: FilterBIMICE
        Model: FullBIMICE
        Model: MICE
        Model: KNN
{'OrderBIMICE': {'RMSE': 5597.814923031198, 'MSE': 108767600.90992944, 'MAE': 1528.6089746241262, 'MAPE': 1528.608974624127, 'time': 92.53999999999999}, 'FilterBIMICE': {'RMSE': 5600.6271464118345, 'MSE': 108831779.99689178, 'MAE': 1529.3120071046487, 'MAPE': 1167.0263758332164, 'time': 20.823999999999998}, 'FullBIMICE': {'RMSE': 5600.627180470726, 'MSE': 108831780.89685977, 'MAE': 1529.312015664474, 'MAPE': 1167.0263815655612, 'time': 17.947000000000003}, 'MICE': {'RMSE': 5602.920620400672, 'MSE': 108917728.25211693, 'MAE': 1530.2132847067132, 'MAPE': 1530.2132847067132, 'time': 297.277}, 'KNN': {'RMSE': 6051.884421407784, 'MSE': 124893755.67146477, 'MAE': 1616.0855176674177, 'MAPE': 1616.0855176674183, 'time': 13329.931999999999}}
    Severity level: 0.2
        Model: OrderBIMICE
        Model: FilterBIMICE
        Model

#### Test 2 - Different number of features

In [12]:
original_data = df.to_numpy()
random_runs = 10
# Severity levels
column_severity_range = (0.05, 0.4)
missing_features_counts = range(1, len(df.columns) + 1, 2)

results = {}

for iterations_amount in iterations_amounts:
    print("Iterations amount:", iterations_amount)
    current_iterations_results = {}

    for missing_features_count in missing_features_counts:
        current_results = {}
        print("Amount of columns:", missing_features_count)

        for random_run in range(random_runs):
            print("    Random run:", random_run)
            # Draw columns and severity level
            columns_to_inject_missing = np.random.choice(df.columns, missing_features_count, replace=False)
            severity_level = np.random.uniform(*column_severity_range)
            data_with_missing = inject_missing_per_feature(df, severity_level, columns_to_inject_missing).to_numpy()

            for current_model_name, current_model in models_dict.items():
                print("        Model:", current_model_name)
                # Impute missing values
                start_time = datetime.now()
                data_with_imputations = current_model.fit_transform(data_with_missing, n_iterations=iterations_amount)
                end_time = datetime.now()
                
                max_predictors_count = len(df.columns) - 1
                predicators_counts = [max_predictors_count] * len(df.columns)
                if current_model_name not in baselines_algorithms:
                    data_with_imputations, predicators_counts = data_with_imputations
                
                # Calculate metrics
                current_results[current_model_name] = {metric_name: 0 for metric_name in (list(metrics_dict.keys()) + ["time", "MAPE"])}
                for metric_name, metric_func in metrics_dict.items():
                    current_results[current_model_name][metric_name] += (metric_func(original_data, data_with_imputations) / random_runs) # Average on the fly
                # Calculate MAPE
                features_mape_sum = 0
                for column_index, predicators_count in enumerate(predicators_counts):
                    features_mape_sum += metrics_dict["MAE"](original_data[:, column_index], data_with_imputations[:, column_index]) / max(max_predictors_count - predicators_count, 1)
                current_results[current_model_name]["time"] = ((end_time - start_time).total_seconds() * 1000) / random_runs
                current_results[current_model_name]["MAPE"] = (features_mape_sum / missing_features_count) / random_runs
        
        current_iterations_results[missing_features_count] = current_results
    
    results[iterations_amount] = current_iterations_results

    with open(f"{dataset['name']}_per_features__results.pkl", "wb") as f:
        pickle.dump(results, f)

Iterations amount: 5
Amount of columns: 1
    Random run: 0
        Model: OrderBIMICE
        Model: FilterBIMICE
        Model: FullBIMICE
        Model: MICE
        Model: KNN
    Random run: 1
        Model: OrderBIMICE
        Model: FilterBIMICE
        Model: FullBIMICE
        Model: MICE
        Model: KNN
    Random run: 2
        Model: OrderBIMICE
        Model: FilterBIMICE
        Model: FullBIMICE
        Model: MICE
        Model: KNN
    Random run: 3
        Model: OrderBIMICE
        Model: FilterBIMICE
        Model: FullBIMICE
        Model: MICE
        Model: KNN
    Random run: 4
        Model: OrderBIMICE
        Model: FilterBIMICE
        Model: FullBIMICE
        Model: MICE
        Model: KNN
    Random run: 5
        Model: OrderBIMICE
        Model: FilterBIMICE
        Model: FullBIMICE
        Model: MICE
        Model: KNN
    Random run: 6
        Model: OrderBIMICE
        Model: FilterBIMICE
        Model: FullBIMICE
        Model: MICE
        Mod

In [13]:
results

{5: {1: {'OrderBIMICE': {'RMSE': 0.26945559804143715,
    'MSE': 5.0824423521107995,
    'MAE': 0.07760192564314546,
    'time': 21.1982,
    'MAPE': 0.5432134795020183},
   'FilterBIMICE': {'RMSE': 0.2694555980414363,
    'MSE': 5.082442352110767,
    'MAE': 0.07760192564314423,
    'time': 2.2152000000000003,
    'MAPE': 0.0905355799170016},
   'FullBIMICE': {'RMSE': 0.2694555980414363,
    'MSE': 5.082442352110767,
    'MAE': 0.07760192564314423,
    'time': 1.8212,
    'MAPE': 0.0905355799170016},
   'MICE': {'RMSE': 0.2695938747260888,
    'MSE': 5.087660010287825,
    'MAE': 0.07767046048235654,
    'time': 4.526000000000001,
    'MAPE': 0.5436932233764958},
   'KNN': {'RMSE': 0.30220455103832516,
    'MSE': 6.3929313467792985,
    'MAE': 0.08377708874859337,
    'time': 197.89249999999998,
    'MAPE': 0.5864396212401541}},
  3: {'OrderBIMICE': {'RMSE': 410.6814064896612,
    'MSE': 8983104.267846396,
    'MAE': 111.24548172817497,
    'time': 14.2286,
    'MAPE': 259.57279069907