This cheat sheet shows how to perform a basic imputation using the Cheat_Sheet_Missing_Data_V2_0 dataset.  
Missing observations are a common problem in data sets. As we discussed in class, there are several ways to handle them, and the right choice depends on the process that gives rise to the missing observations.

Thankfully, statsmodels has a useful module which we can use to perform Multiple Imputation by Chained Equations (MICE). 

# Importing Our Data and Running a Regression

In [None]:
#First we import our data from the appropriate file
import pandas as pd
import os.path as osp
import numpy as np
data_path = osp.join(osp.curdir,'Data','Cheat_Sheet_Missing_Data_V2_0.xlsx')
data = pd.read_excel(data_path,sheet_name=0)
# Drop Obs column
data = data.drop(columns = 'Obs')
#Note that Family_Income has only 977 values where all the rest have 1000
data.info()

In [None]:
# View the rows where Family_Income is missing
data[data['Family_Income'].isnull()]

In [None]:
# We will run a regression with all fields except Family_Size
frmla = 'Grocery_Bill ~ N_Adults + Family_Income + N_Vehicles + ' + \
    'Distance_to_Store + Vegetarian + N_Children + Family_Pet'

In [None]:
#Train and view our model
from statsmodels.formula.api import ols
results = ols(frmla,data).fit()
results.summary()

# Performing MICE

*Depending on your version of Anaconda, you may get a warning that a function used by MICE is deprecated. This means that the function may be removed or stop working in a future release.*

The following snippet of code performs the imputation on the dataset.

In [None]:
#Import the required libraries
import statsmodels.imputation.mice as mice
import statsmodels.regression.linear_model as sm

#Impute the data set with MICE Data
imp = mice.MICEData(data)
imputed_data = imp.next_sample()

#Merge the imputed data with the original data (which still has the blanks)
merge_data = pd.merge(imputed_data,data,left_index=True,right_index=True)
merge_data = merge_data[['Family_Income_x','Family_Income_y']]
#Family_Income_x holds the imputed values; Family_Income_y holds the original values
#Show only the rows where the original value was missing
merge_data[(pd.isnull(merge_data.Family_Income_y))]

Selecting a single imputation is not the best way to use imputed data for predictive modelling. Instead, you want to fit the regression on several imputed data sets and pool the results. The `n_imputations` argument sets how many imputed data sets are used; here we use 5.

In [None]:
#Use a new name so the imported mice module is not overwritten
mice_model = mice.MICE(frmla, sm.OLS, imp)
results = mice_model.fit(n_imputations=5)
print(results.summary())
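
For intuition about what "pooling" means, the sketch below applies Rubin's rules to a single coefficient by hand. The estimates and standard errors are made-up numbers, not output from this dataset, and `pool_rubin` is a hypothetical helper written for illustration, not a statsmodels function. statsmodels does its own combining inside `fit`, so this is only meant to show the idea.

```python
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)                      # pooled point estimate
    within = mean(se ** 2 for se in std_errors)   # average within-imputation variance
    between = variance(estimates)                 # between-imputation (sample) variance
    total = within + (1 + 1 / m) * between        # total variance
    return pooled, total ** 0.5

# Made-up estimates of one coefficient from 5 imputed data sets
estimates = [2.10, 2.25, 1.95, 2.30, 2.05]
std_errors = [0.40, 0.42, 0.39, 0.41, 0.40]

pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
print(f"pooled estimate = {pooled_estimate:.3f}, pooled SE = {pooled_se:.3f}")
```

The pooled standard error is larger than the average of the individual standard errors because it also accounts for how much the estimates vary from one imputed data set to the next.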

## Interpreting Results

You can interpret these regression results as you normally would. 
If you are interested in learning more about MICE and multiple imputation, this is an excellent paper:
https://www.jstatsoft.org/article/view/v045i03


## Summary

Conceptually, multiple imputation works as follows:

  1) We set the number of imputations we want. If we set m = 5, we end up with 5 completed versions of the original dataset
  
  2) Then, in the background, a regression model is built to predict the missing data (in this case, Family_Income) using all the other variables available
  
  3) For each of the five datasets, the model predicts the missing values, and a random error term is added to each prediction. This means that each of the 5 datasets will have different imputed values!
  
  4) We then run the same analysis (here, the regression) on all 5 datasets, and pool the results together for a more robust answer
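
The steps above can be sketched on a toy example with one predictor. This is a simplified illustration under stated assumptions, not the statsmodels procedure: the data are simulated, `fit_line` is a hypothetical helper, only one variable has missing values, and the regression coefficients are held fixed across imputations. statsmodels' MICE differs in detail (for instance, it cycles through every variable with missing values).

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Toy data (made up): x is fully observed, y is missing for about 20% of rows
n = 200
x = [random.uniform(0, 10) for _ in range(n)]
y = [3.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]
y_obs = [None if random.random() < 0.2 else yi for yi in y]

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual standard deviation."""
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(xs, ys))
    return intercept, slope, (rss / (k - 2)) ** 0.5

# Step 2: build the imputation model from the rows where y is observed
x_seen = [a for a, b in zip(x, y_obs) if b is not None]
y_seen = [b for b in y_obs if b is not None]
b0, b1, sigma = fit_line(x_seen, y_seen)

# Steps 1 and 3: make m completed copies, each with its own random error
m = 5
completed = [
    [b if b is not None else b0 + b1 * a + random.gauss(0, sigma)
     for a, b in zip(x, y_obs)]
    for _ in range(m)
]

# Step 4: run the same analysis on every copy, then pool by averaging
slopes = [fit_line(x, ys)[1] for ys in completed]
pooled_slope = sum(slopes) / m
print("slopes per imputed data set:", [round(s, 3) for s in slopes])
print("pooled slope:", round(pooled_slope, 3))
```

Because each copy draws its own random errors, the five slopes differ slightly, and the spread between them reflects the uncertainty introduced by the missing values.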


## Little's Test

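There is a formal test of whether data are missing completely at random (MCAR) – Little’s test. Its null hypothesis is that the data are MCAR, so rejecting it suggests the missingness follows some other mechanism, such as MAR. You can use the test by installing the impyute package.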
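
To make the MCAR/MAR distinction concrete, here is a small simulation. It is not Little's test itself, only an informal illustration of the kind of difference such a test looks for. All values are made up, the variable names merely echo the dataset's columns, and `mean_gap` is a hypothetical helper.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

n = 5000
# Made-up, fully observed variable: number of adults in each household
n_adults = [random.choice([1, 2, 3, 4]) for _ in range(n)]

# MCAR: every income value has the same 20% chance of being missing
mcar_missing = [random.random() < 0.2 for _ in range(n)]
# MAR: the chance that income is missing depends on an observed variable
mar_missing = [random.random() < 0.08 * a for a in n_adults]

def mean_gap(values, missing):
    """Mean of `values` in rows with missing income minus mean in the other rows."""
    miss = [v for v, flag in zip(values, missing) if flag]
    seen = [v for v, flag in zip(values, missing) if not flag]
    return sum(miss) / len(miss) - sum(seen) / len(seen)

mcar_gap = mean_gap(n_adults, mcar_missing)
mar_gap = mean_gap(n_adults, mar_missing)
print(f"gap in mean number of adults under MCAR: {mcar_gap:.3f}")
print(f"gap in mean number of adults under MAR:  {mar_gap:.3f}")
```

Under MCAR the rows with and without a missing income look alike on the observed variable, so the gap is close to zero. Under MAR the two groups differ systematically, which is the pattern a formal test is designed to detect.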