# Expectation Maximization Algorithm Imputation

Expectation Maximization (EM) is an iterative algorithm for finding maximum-likelihood estimates when some values or data points in a data set are missing. The algorithm is popular in 2 applications:  
1. Missing Value Imputation
2. Gaussian Mixture Model (GMM): used in clustering and classification applications  
  
The EM algorithm is separated into 2 major steps:  
1. <b>Expectation step (E Step)</b>
2. <b>Maximization (optimization) step (M Step)</b>

## Formulas 

As mentioned above, the EM algorithm has 2 major steps, and each step has its own formula. Here $x$ is the observed data, $z$ is the missing data, $\mu$ is the parameter to estimate, $\mu_0$ is the current estimate and $\mu_1$ is the updated estimate:  
  
- Log-Likelihood Function  
  \begin{equation}
  LL(\mu) = \log \sum_{z}^{} p(x,z|\mu)
  \end{equation}
- Expectation Step  
  \begin{equation}
  E(LL|\mu_0) = \sum_{z}^{} p(z|x,\mu_0) \log p(x,z|\mu)
  \end{equation}
- Maximization (Optimization) Step
  \begin{equation}
  \mu_1 = \arg\max_{\mu} E(LL|\mu_0)
  \end{equation}
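
As a worked illustration of these formulas, assume $n$ values drawn from a normal distribution with mean $\mu$, of which $m$ are missing (the symbols $n$ and $m$ are introduced here only for this sketch). The E step replaces each missing value by its expected value under the current estimate $\mu_0$, and the M step takes the mean of the completed data:

\begin{equation}
\mu_1 = \frac{1}{n}\left(\sum_{i \in \text{observed}} x_i + m\,\mu_0\right)
\end{equation}

Setting $\mu_1 = \mu_0 = \mu^*$ gives the fixed point of the iteration:

\begin{equation}
\mu^* = \frac{1}{n-m}\sum_{i \in \text{observed}} x_i
\end{equation}

Under these assumptions the iteration settles on the mean of the observed values.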

## Intuition 

The EM algorithm is hard to understand from the mathematical equations alone. Here is the intuition behind it:  
  
  Suppose we have a set of values [1, 3, 4, _ , 4]  
  A first estimate of the missing value is the sum of the known values divided by the size of the data set (the missing entry counts as 0 in the sum):
      
  \begin{equation}
  x = \frac{1+3+4+4}{5} = 2.4
  \end{equation}
  This is called the <b>expectation step (E Step)</b>
  
  Next, suppose the data set is [1, 3, 4, 2.4, 4]  
  The mean $\mu$ of the data set is  
  
  \begin{equation}
  \mu = \frac{1+3+4+2.4+4}{5} = 2.88
  \end{equation}

  This is called the <b>Maximization (optimization) step (M Step)</b>  
    
  Then we use the value <b>2.88</b> to fill the missing entry and repeat the two steps over and over until the value converges at  
  
  \begin{equation}
  x^* = \mu^* 
  \end{equation}
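
The iteration described above can be sketched in a few lines of plain Python. This is a minimal sketch, not a full EM implementation: the stopping tolerance and the iteration cap are assumptions chosen for illustration, and the starting guess is the 2.4 computed in the text.

```python
# Toy data from the example above; None marks the missing entry.
values = [1, 3, 4, None, 4]
observed = [v for v in values if v is not None]
n = len(values)

# Initial guess used in the text: sum of the known values divided by n.
estimate = sum(observed) / n  # 2.4

tolerance = 1e-9  # assumed stopping threshold
for _ in range(1000):  # assumed safety cap on the number of iterations
    # M step: mean of the data set completed with the current estimate.
    mu = (sum(observed) + estimate) / n
    converged = abs(mu - estimate) < tolerance
    # E step: the new mean becomes the estimate of the missing value.
    estimate = mu
    if converged:
        break

print(round(estimate, 6))
```

For this toy data set the estimate approaches 3, which is the mean of the four known values.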

## Assumption 

The EM algorithm as used here assumes that the data is drawn from a normal distribution.   
  
    
<img title="Normal Distribution" src="https://www.w3schools.com/statistics/img_normal_distribution.svg">

## Implement EM Algorithm for Missing Value Imputation 

In [31]:
# Import libraries
import numpy as np
import pandas as pd

# Load the loan data set
data = pd.read_csv('loan_data.csv')
data

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
1,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
2,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
3,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
4,LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
376,LP002953,Male,Yes,3+,Graduate,No,5703,0.0,128.0,360.0,1.0,Urban,Y
377,LP002974,Male,Yes,0,Graduate,No,3232,1950.0,108.0,360.0,1.0,Rural,Y
378,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
379,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y


In [32]:
print(data.info())
print(data.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381 entries, 0 to 380
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            381 non-null    object 
 1   Gender             376 non-null    object 
 2   Married            381 non-null    object 
 3   Dependents         373 non-null    object 
 4   Education          381 non-null    object 
 5   Self_Employed      360 non-null    object 
 6   ApplicantIncome    381 non-null    int64  
 7   CoapplicantIncome  381 non-null    float64
 8   LoanAmount         381 non-null    float64
 9   Loan_Amount_Term   370 non-null    float64
 10  Credit_History     351 non-null    float64
 11  Property_Area      381 non-null    object 
 12  Loan_Status        381 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 38.8+ KB
None
(381, 13)


In [33]:
data_to_impute = data.iloc[:,6:11]
print(data_to_impute.columns)
data_to_impute.nunique()

Index(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History'],
      dtype='object')


ApplicantIncome      322
CoapplicantIncome    182
LoanAmount           101
Loan_Amount_Term      10
Credit_History         2
dtype: int64

Since <b>Credit History</b> and <b>Loan Amount Term</b> are effectively categorical (2 and 10 distinct values), and <i>Applicant Income and Loan Amount</i> have no missing values, <b>Co-applicant Income</b> is the column left to impute.
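
The same kind of screening can be done in one table. The sketch below uses a small hypothetical frame (`toy`, with made-up rows that only borrow the column names) rather than the loan data itself, and counts distinct values, NaN entries and zeros per column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame for illustration; not the real loan data.
toy = pd.DataFrame({
    "ApplicantIncome": [4583, 3000, 2583, 6000],
    "CoapplicantIncome": [1508.0, 0.0, 2358.0, 0.0],
    "Credit_History": [1.0, np.nan, 1.0, 0.0],
})

summary = pd.DataFrame({
    "n_unique": toy.nunique(),       # distinct non-NaN values per column
    "n_missing": toy.isna().sum(),   # NaN entries per column
    "n_zero": (toy == 0).sum(),      # zeros, which may encode missing values
})
print(summary)
```

A column with few distinct values is probably categorical, and a continuous column with many zeros is a candidate for the zero-as-missing treatment used in this notebook.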

In [34]:
# Take a copy so the assignment below does not act on a slice of the original frame
data_to_impute1 = data_to_impute.iloc[:,1].copy()

In this column the missing values are recorded as <b>0</b>. Change every 0 to NaN so that the missing entries are easy to fill.

In [35]:
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
data_to_impute1[data_to_impute1==0] = np.nan
data_to_impute1


0      1508.0
1         NaN
2      2358.0
3         NaN
4      1516.0
        ...  
376       NaN
377    1950.0
378       NaN
379       NaN
380       NaN
Name: CoapplicantIncome, Length: 381, dtype: float64
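
An alternative that avoids in-place assignment altogether is `Series.mask`, which returns a new Series. This is a minimal sketch on a hypothetical five-row Series (`income`), not on the loan data.

```python
import pandas as pd

# Hypothetical sample values for illustration only.
income = pd.Series([1508.0, 0.0, 2358.0, 0.0, 1516.0], name="CoapplicantIncome")

# mask() puts NaN wherever the condition is True and leaves `income` untouched.
income_nan = income.mask(income == 0)

print(income_nan.isna().sum())
```

Because the original Series is not modified, the zeros remain available if they later turn out to be genuine values rather than missing ones.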

In [41]:
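# Start imputing
thres = 0.005  # convergence threshold on the change in the mean
while True:
    # E step: estimate the missing entries with the current mean (mean() skips NaN)
    old_mean = data_to_impute1.mean()
    data_to_impute1 = data_to_impute1.fillna(old_mean)
    # M step: re-estimate the mean from the completed data
    new_mean = data_to_impute1.mean()
    mean_diff = np.abs(old_mean - new_mean)
    # Filling with the observed mean leaves the mean unchanged,
    # so the loop converges on its first pass
    if mean_diff < thres:
        conv_mean = new_mean
        break
print('The mean to impute by EM algorithm is ', conv_mean)
print(data_to_impute1.mean())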


The mean to impute by EM algorithm is  2362.3394174205823
2362.3394174205823
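
Because filling with the observed mean leaves the mean unchanged, the loop above stops on its first pass. A variant that mirrors the Intuition section more closely keeps track of which entries were missing and updates only those on every pass. The sketch below is one way to do that; the helper name `iterative_mean_impute`, the starting guess of 0 and the default tolerance are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def iterative_mean_impute(series, tol=1e-6, max_iter=1000):
    """Hypothetical helper: fill NaN entries by iterating E and M steps."""
    missing = series.isna()
    filled = series.fillna(0.0)  # assumed starting guess for the missing entries
    for _ in range(max_iter):
        mu = filled.mean()  # M step: mean of the completed data
        # Size of the update about to be applied to the missing entries.
        change = np.abs(filled[missing] - mu).max() if missing.any() else 0.0
        filled[missing] = mu  # E step: refresh the missing entries
        if change < tol:
            break
    return filled

toy = pd.Series([1.0, 3.0, 4.0, np.nan, 4.0])
result = iterative_mean_impute(toy)
print(result[3])
```

On the toy series from the Intuition section the filled value approaches 3, the mean of the four known values, and the input series itself is left unchanged.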
