In [None]:
import numpy as np
import pandas as pd

import xgboost

import matplotlib.pyplot as plt
import matplotlib.colors
import seaborn as sns

from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

![](https://psychology.osu.edu/sites/default/files/mouse.jpg)

<div style="color:white;
            display:fill;
            border-radius:8px;
            background-color:#4D4C7D;
            font-size:120%;
            letter-spacing:0.5px">
    <p style="padding: 8px;
              color:white;">
        <b>1 | INTRODUCTION</b></p>
</div>

Iterative imputation, also known as **Multivariate Imputation by Chained Equations (MICE)**, is a multiple imputation method in which each missing value is replaced with m plausible values, obtained by running the imputation m times (where m > 1, normally between **3 and 10**). So why use it over the simple and blunt `SimpleImputer`?
- It can give less biased estimates, because it takes the uncertainty of the missing data into account
- It makes better use of the available data, since some of the available features may be correlated with one another
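
To make the second point concrete, here is a minimal sketch on synthetic data (the data, the 20% missing rate, and the helper name `rmse_on_hidden` are my own assumptions, not part of the competition data). It hides some values of one column that is strongly correlated with another, then compares how well mean imputation and iterative imputation recover them.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)

# Synthetic stand-in data: two strongly correlated columns.
n = 500
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(scale=0.1, size=n)
X_true = np.column_stack([x0, x1])

# Hide roughly 20% of the second column completely at random.
X_missing = X_true.copy()
hidden = rng.random(n) < 0.2
X_missing[hidden, 1] = np.nan


def rmse_on_hidden(imputer):
    """Hypothetical helper: RMSE between imputed and true values on the hidden cells."""
    X_filled = imputer.fit_transform(X_missing)
    errors = X_filled[hidden, 1] - X_true[hidden, 1]
    return float(np.sqrt(np.mean(errors ** 2)))


rmse_mean = rmse_on_hidden(SimpleImputer(strategy="mean"))
rmse_iter = rmse_on_hidden(IterativeImputer(random_state=0))
print(f"mean imputation RMSE:      {rmse_mean:.3f}")
print(f"iterative imputation RMSE: {rmse_iter:.3f}")
```

Because the second column is almost a linear function of the first, the iterative imputer should recover the hidden values far more closely than the column mean does. On weakly correlated features the gap would be much smaller.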

<div style="color:white;
            display:fill;
            border-radius:8px;
            background-color:#4D4C7D;
            font-size:120%;
            letter-spacing:0.5px">
    <p style="padding: 8px;
              color:white;">
        <b>2 | METHODOLOGY</b></p>
</div>

[Khan and Hoque (2020)](https://journalofbigdata.springeropen.com/articles/10.1186/s40537-020-00313-w/figures/2) clearly capture the methodology of MICE in the flowchart shown below.

Other useful resources are in the form of video tutorials.
- https://www.youtube.com/watch?v=WPiYOS3qK70&t=132s
- https://www.youtube.com/watch?v=1n7ld38PjEc&t=8s

![](https://media.springernature.com/lw685/springer-static/image/art%3A10.1186%2Fs40537-020-00313-w/MediaObjects/40537_2020_313_Fig2_HTML.png)
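
The chained-equations idea can also be sketched by hand. The function below is a simplified illustration under my own assumptions (mean initialisation, plain linear regression, a hypothetical helper name `chained_imputation`, and a simple absolute-change stopping rule); it is not the exact procedure from the paper, nor what scikit-learn runs internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def chained_imputation(X, n_rounds=10, tol=1e-3):
    """Hypothetical helper: mean-initialised chained regression imputation."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: placeholder fill using the column means of the observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(n_rounds):
        previous = filled.copy()
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue  # nothing to impute in this column
            others = np.delete(filled, j, axis=1)
            observed = ~missing[:, j]
            # Step 2: regress column j on the other columns, using rows where j is observed.
            model = LinearRegression().fit(others[observed], filled[observed, j])
            # Step 3: overwrite only the originally missing entries with predictions.
            filled[missing[:, j], j] = model.predict(others[missing[:, j]])
        # Step 4: stop once the imputed values barely change between rounds.
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled


# Small height/weight example with one missing value per column.
toy = np.array([
    [180, 65],
    [174, 62],
    [172, 61],
    [175, np.nan],
    [162, 56],
    [np.nan, 59],
])
completed = chained_imputation(toy)
print(completed.round(2))
```

Each round reuses the imputed values from the previous column's regression, which is where the "chained" in the name comes from.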

---
To demonstrate the **MICE** method in action, I use a dummy dataset containing the heights and weights of a few people. Fortunately, the workflow is already available in the `sklearn.impute` module as `IterativeImputer`.

In [None]:
df = pd.DataFrame(data = {
                          'Height(cm)': [180, 174, 172, 175, 162, np.nan],
                          'Weight(kg)': [65, 62, 61, np.nan, 56, 59],
                    }
)

df

Here are some parameters worth knowing when using `IterativeImputer`. More information can be found in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html):

1. `initial_strategy` - the simple imputation performed first on every missing value to create the initial data set / placeholder.
> The default is mean. `initial_strategy` also accepts **median, most_frequent, constant**.
2. `estimator` - the regressor used to predict each feature's missing values from the other columns.
 > In our case, since we know that height and weight are linearly correlated with one another, a simple `LinearRegression()` can be assigned to this parameter. More sophisticated regressors like `XGBRegressor()`, `SVR()`, etc. can also be used. Classifiers such as `SVC()` are not suitable here, because the imputed values are continuous.
3. `max_iter` - the maximum number of imputation rounds to perform.
 > The number usually lies between 3 and 10 (the default is 10).
4. `tol` - the tolerance of the stopping criterion.
 > With the default value of 0.001, the iteration stops early once the largest absolute change in the imputed values between two consecutive rounds, divided by the largest absolute observed value, falls below 0.001. The dataset with the filled missing values is then returned.
5. `imputation_order` - order in which the features will be imputed. Possible values include:
 - **ascending:** From features with fewest missing values to most.
 - **descending:** From features with most missing values to fewest.
 - **roman:** Left to right.
 - **arabic:** Right to left.
 - **random:** A random order for each round.
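
As a quick sketch of how these parameters show up in practice, the snippet below fits an imputer on the same height/weight values with non-default settings and then inspects two fitted attributes, `n_iter_` and `imputation_sequence_`. The specific settings are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

toy = pd.DataFrame({
    "Height(cm)": [180, 174, 172, 175, 162, np.nan],
    "Weight(kg)": [65, 62, 61, np.nan, 56, 59],
})

imputer = IterativeImputer(
    estimator=LinearRegression(),
    initial_strategy="median",   # placeholder fill before the first round
    imputation_order="arabic",   # right to left: Weight(kg) first, then Height(cm)
    max_iter=10,
    tol=1e-3,
)
filled = imputer.fit_transform(toy)

# n_iter_ is the number of rounds actually run (never more than max_iter).
print("rounds run:", imputer.n_iter_)

# imputation_sequence_ stores one entry per single-feature fit, in the order
# the fits were performed; the first element of each entry is the feature index.
first_round = [toy.columns[step[0]] for step in imputer.imputation_sequence_[:2]]
print("order in the first round:", first_round)
```

If `n_iter_` comes out smaller than `max_iter`, the `tol` criterion was met and the imputer stopped early.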

Keeping these parameters in mind, I can now apply the IterativeImputer on the working example, thus arriving at the final imputed dataset.

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
imp = IterativeImputer(
    estimator = lr,
    max_iter = 10,
    tol = 1e-10,
    imputation_order = 'roman',
)

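# fit_transform returns a NumPy array, so wrap it to keep the column names
df = pd.DataFrame(imp.fit_transform(df), columns = df.columns)
df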
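
Note that a single `IterativeImputer` call returns one completed dataset. To get the m imputed datasets mentioned in the introduction, one option is to set `sample_posterior=True` and repeat the fit with different seeds. The sketch below assumes m = 5 and the default `BayesianRidge` estimator (which can return the predictive uncertainty that `sample_posterior` needs); averaging the results at the end is a simplification for display only.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

toy = pd.DataFrame({
    "Height(cm)": [180, 174, 172, 175, 162, np.nan],
    "Weight(kg)": [65, 62, 61, np.nan, 56, 59],
})

m = 5  # number of imputed datasets; an illustrative choice in the 3-10 range
imputations = []
for seed in range(m):
    # sample_posterior=True draws each imputed value from the estimator's
    # predictive distribution instead of using the point prediction.
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    imputations.append(imputer.fit_transform(toy))

stacked = np.stack(imputations)  # shape: (m, n_rows, n_columns)
pooled = pd.DataFrame(stacked.mean(axis=0), columns=toy.columns)
spread = pd.DataFrame(stacked.std(axis=0), columns=toy.columns)

print(pooled.round(2))  # average over the m imputations
print(spread.round(2))  # zero for observed cells, positive for imputed cells
```

The spread across the m datasets is what carries the uncertainty of the missing values. In a full analysis, the downstream model would be fitted on each imputed dataset separately and the resulting estimates pooled, rather than averaging the imputed values themselves.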

<div style="color:white;
            display:fill;
            border-radius:8px;
            background-color:#4D4C7D;
            font-size:120%;
            letter-spacing:0.5px">
    <p style="padding: 8px;
              color:white;">
        <b>3 | APPLICATION</b></p>
</div>

In [None]:
data = pd.read_csv("../input/tabular-playground-series-jun-2022/data.csv")
sub = pd.read_csv("/kaggle/input/tabular-playground-series-jun-2022/sample_submission.csv", index_col='row-col')

MICE works best when the features are highly correlated with one another. With this in mind, let's plot the correlation matrix of the F_4 features.

In [None]:
# select the 15 columns F_4_0 ... F_4_14
f4 = data[[f'F_4_{i}' for i in range(15)]]

plt.subplots(figsize = (12, 12))
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("",
                                                           ['#363062',
                                                            '#E9D5CA',
                                                            '#363062',
                                                           ])

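# compute the correlation matrix once and hide the upper triangle (including the diagonal)
corr = f4.corr()
mask = np.triu(np.ones_like(corr, dtype = bool))
sns.heatmap(corr,
            mask = mask,
            cmap = cmap,
            cbar = False,
            square = True,
            annot = True,
            linewidths = 3,
           )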

I am particularly interested in F_4_11, as it has one of the strongest correlations in the matrix (-0.58 with F_4_8). I will simply select the 5 features most highly correlated with F_4_11 as candidates for `IterativeImputer`. To be specific, they are
- F_4_1
- F_4_3
- F_4_4
- F_4_5
- F_4_8
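
Rather than reading the candidates off the heatmap, the same selection can be done programmatically. The sketch below runs on a synthetic stand-in frame (the values and the relationship between columns are made up for illustration; only the column-name pattern mirrors the real data) and ranks features by absolute Pearson correlation with the target.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the F_4 block; the real values come from data.csv.
demo = pd.DataFrame(
    rng.normal(size=(1000, 6)),
    columns=[f"F_4_{i}" for i in range(6)],
)
# Make the target depend on two of the columns so the ranking is non-trivial.
demo["F_4_11"] = (
    -0.8 * demo["F_4_3"] + 0.5 * demo["F_4_1"] + rng.normal(scale=0.5, size=1000)
)

target = "F_4_11"
k = 2  # would be 5 for the selection described above

# Rank the other columns by absolute correlation with the target.
ranking = demo.corr()[target].drop(target).abs().sort_values(ascending=False)
top_k = ranking.head(k).index.tolist()

print(ranking.round(2))
print("selected:", top_k)
```

Taking the absolute value matters here, because a strong negative correlation such as -0.58 is just as useful to the imputer as a positive one.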

In [None]:
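# the five selected predictors plus F_4_11 itself, so that its missing values can be imputed from them
f4_selectbest = f4[['F_4_1', 'F_4_3', 'F_4_4', 'F_4_5', 'F_4_8', 'F_4_11']]
f4_selectbest = f4_selectbest.head(10000)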

Only the first 10,000 rows are used in the interest of time, because the purpose here is only to demonstrate the imputation process.

In [None]:
f4_selectbest

In [None]:
iimp = IterativeImputer(
    estimator = xgboost.XGBRegressor(),
    random_state = 42,
    verbose = 2,
)

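# fit_transform returns a NumPy array; wrap it to keep the index and column names
final = pd.DataFrame(iimp.fit_transform(f4_selectbest),
                     index = f4_selectbest.index,
                     columns = f4_selectbest.columns)
final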