# MICE Algorithm (Multivariate Imputation by Chained Equations)

MICE is a technique for imputing missing values in a dataset. Each column with missing values is modeled in turn as a function of the other columns, and the fitted model is used to predict that column's missing entries. The cycle is repeated until the imputed values stabilize.
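
To make the chaining idea concrete, here is a minimal sketch in plain Python for two columns. It is an illustration only, not how statsmodels implements MICE: the helper names (`fit_line`, `chained_imputation`) and the toy data are hypothetical, and each column is imputed by simple linear regression plus Gaussian noise, whereas statsmodels' `MICEData` uses predictive mean matching by default.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares slope and intercept for y ~ x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def chained_imputation(rows, n_cycles=10, seed=0):
    """Fill None entries in two-column rows [(x, y), ...] by chained regressions."""
    rng = random.Random(seed)
    cols = [[r[0] for r in rows], [r[1] for r in rows]]
    missing = [[v is None for v in col] for col in cols]

    # Step 1: start from a simple placeholder (the observed column mean).
    for col, miss in zip(cols, missing):
        mean = statistics.fmean(v for v in col if v is not None)
        for i, is_missing in enumerate(miss):
            if is_missing:
                col[i] = mean

    # Step 2: cycle through the columns, re-predicting each column's
    # missing entries from the current state of the other column.
    for _ in range(n_cycles):
        for target in (0, 1):
            predictor = 1 - target
            observed = [i for i, m in enumerate(missing[target]) if not m]
            xs = [cols[predictor][i] for i in observed]
            ys = [cols[target][i] for i in observed]
            slope, intercept = fit_line(xs, ys)
            # Residual spread, used to keep some uncertainty in the imputations.
            resid_sd = statistics.pstdev(
                [y - (intercept + slope * x) for x, y in zip(xs, ys)]
            )
            for i, is_missing in enumerate(missing[target]):
                if is_missing:
                    noise = rng.gauss(0.0, resid_sd)
                    cols[target][i] = intercept + slope * cols[predictor][i] + noise
    return list(zip(cols[0], cols[1]))


# Hypothetical toy data: None marks a missing value.
data = [(1.0, 2.1), (2.0, None), (3.0, 6.2), (None, 8.1),
        (5.0, 9.8), (6.0, None), (7.0, 14.1), (None, 15.9)]
completed = chained_imputation(data)
for row in completed:
    print(row)
```

Observed values are never overwritten; only the entries that started as `None` change from cycle to cycle.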

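**Multiple imputation** has three main steps: imputation (create several completed copies of the dataset), analysis (fit the model of interest to each completed copy), and pooling (combine the per-copy results into a single estimate).
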
## The **imputation model** should:

1. account for the process that created the missing data
2. preserve the relations in the data
3. preserve the uncertainty about these relations
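
The pooling step is usually done with Rubin's rules, which can be sketched in a few lines of plain Python. The function name `pool_rubin` and the numbers below are hypothetical; they stand in for one coefficient estimated on five imputed datasets.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)        # average sampling variance
    between = statistics.variance(estimates)    # spread across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between      # total variance of the pooled estimate
    return pooled, math.sqrt(total)


# Hypothetical results from m = 5 imputed datasets.
estimates = [1.9, 2.1, 2.0, 2.2, 1.8]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]
pooled_estimate, pooled_se = pool_rubin(estimates, variances)
print(pooled_estimate, pooled_se)
```

The between-imputation term is what carries the uncertainty about the missing values into the final standard error; a single imputed dataset has no such term and tends to understate it.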

<img src="img1.png">

<img src="img2.png">

In [None]:
import re

import numpy as np
import pandas as pd
from statsmodels.imputation import mice

In [None]:
df = pd.read_csv("test_dataset/train.csv")

In [None]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [None]:
df_mice = df.copy()
# mapping Embarked using numeric values
embarked_mapping = {"S": 1, "C": 2, "Q": 3}
df_mice['Embarked'] = df_mice['Embarked'].map(embarked_mapping)
# mapping Cabin to its deck letter, then to numeric values
# (missing cabins and any deck letter not listed here become NaN and are imputed)
deck = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4, "F": 5, "G": 6}
df_mice['Cabin'] = df_mice['Cabin'].str.extract(r"([a-zA-Z]+)", expand=False).map(deck)

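# MICEData works on numeric columns only
numeric_features = [column for column in df_mice.columns if df_mice[column].dtype != 'object']
imp = mice.MICEData(df_mice[numeric_features])
# run 100 cycles; each cycle re-imputes every column that has missing values,
# using the default imputer (OLS with predictive mean matching)
for _ in range(100):
    imp.update_all()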
operated_cols = [column for column in numeric_features if df[column].isnull().sum()]
print(f'Operating on following features : {operated_cols}')
# copying the imputed values to the original df
for i in operated_cols:
    df_mice[i] = imp.data[i]

# reverse mapping the values
embarked_mapping = {1:"S", 2:"C", 3:"Q"}
df_mice['Embarked'] = df_mice['Embarked'].map(embarked_mapping)
deck_mapping = {0 : "A", 1 : "B", 2 : "C", 3 : "D", 4 : "E", 5 : "F", 6 : "G"}
df_mice['Cabin'] = df_mice['Cabin'].map(deck_mapping)

Operating on following features : ['Age', 'Cabin', 'Embarked']


In [None]:
df_mice.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

In [None]:
df_mice

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | F | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | F | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | F | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | F | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 29.0 | 1 | 2 | W./C. 6607 | 23.4500 | F | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C | C |


In [None]:
# index=False keeps the row index from being written as an extra unnamed column
df_mice.to_csv("after_mice_train.csv", index=False)