# **Imputing Missing Data**

As noted previously, many real-world datasets have missing values. For many analysts, the natural response is to apply some imputation technique, but whether imputation is appropriate depends on the individual problem. Imputation can preserve information that would otherwise be lost when a single value is missing from a long row. This matters because a dataset can shrink drastically if every row with one missing value is removed from the analysis. Many statistical and machine learning packages drop such rows by default, and generally you won't be told. Imputation raises two questions:

* When do I use it?
* What methods do I use?
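
To get a feel for how quickly row deletion shrinks a dataset, here is a minimal sketch on simulated data. The sizes (1,000 rows, 20 columns) and the 5% missing rate are illustrative assumptions rather than figures from a real dataset, and the missingness is generated completely at random.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Illustrative assumptions: 1,000 rows, 20 columns, 5% of cells missing
n_rows, n_cols, p_missing = 1000, 20, 0.05

data = rng.normal(size=(n_rows, n_cols))
mask = rng.random(size=(n_rows, n_cols)) < p_missing  # each cell missing independently
data[mask] = np.nan
df_sim = pd.DataFrame(data, columns=[f"x{i}" for i in range(n_cols)])

# Listwise deletion: keep only rows with no missing value at all
complete_rows = df_sim.dropna()

share_cells_missing = df_sim.isna().to_numpy().mean()
share_rows_kept = len(complete_rows) / len(df_sim)
expected_rows_kept = (1 - p_missing) ** n_cols  # roughly 0.36 for these settings

print(f"Share of cells missing: {share_cells_missing:.3f}")
print(f"Share of rows kept:     {share_rows_kept:.3f}")
print(f"Expected share kept:    {expected_rows_kept:.3f}")
```

Even though only about one cell in twenty is missing, well over half of the rows would be discarded under these assumptions.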

There are a number of packages, such as Amelia in R, and statsmodels.imputation.mice.MICEData or sklearn.impute in Python, that will carry out the imputation for you. There is also a substantial number of techniques, ranging from replacing missing values with the mean of that variable to data simulators or propensity score modelling. The technique you use depends very much on the cause of your missing values. My view is that **you should not impute** in the following situations:

* When the variable in question is NMAR (unless you have evidence that it is a censored variable, or you have a priori evidence on the distribution of the variable in question).
* When there are missing values in the outcome variable.

The reason in both situations is that I generally follow the guiding principle that your imputation technique should not bias your results. I therefore use missing value imputation as little as possible.

I regularly start an analysis by modelling the missingness of a variable against a range of other variables in my dataset. I code a new variable, Missing (Y/N), from the variable with a large number of missing values. I then run a simple logistic regression of the new Missing (Y/N) variable on the remaining variables and examine whether there are any relationships between Missing (Y/N) and those variables.

Let's look at a simple example I have drafted below, where one variable has a number of missing values. We have three variables that are being proposed to predict "happiness": the first is a measure of educational attainment, the second is work experience and the third is salary. If you follow my analysis below, you will see there is a strong relationship between salary, educational attainment and work experience. In fact, the correlation scores are all around the 60% mark. We are now faced with three possibilities:

* Work experience is correlated with some of the other variables, so you could conclude that MAR is the most likely missing value mechanism.

* Or are the values missing due to MCAR?

* Or is it due to NMAR?

We cannot be sure here. However, I would initially conclude that the mechanism in this situation is MCAR or NMAR, because when I run a logistic regression on the Missing (Y/N) variable that I created, it is not significantly related to either of the other two input variables.

My choice here would be MCAR, because there is no intuitive reason why people with one experience level or another would choose to leave it out.




In [1]:
import pandas as pd
import numpy as np

from statsmodels.api import add_constant
import statsmodels.discrete.discrete_model as sml
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
plt.rc("font", size=14)
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)

candidates = {'Ed_standard': [780,750,690,710,680,730,690,720,740,900,950,975,995,1000,1010,1020],
              'Work_experience':[5.1,4.5,None,3.3,3.6,9.3,6.7,2.8,5.4,None,7.8,None,None,10.1,6.7,None],
              'Salary': [78000,75000,100000,71000,68000,70000,69000,72000,74000,69000,102000,101000,79000,114000,101000,95000],
              'happiness': [0.5,0.55,0.1,0.6,0.7,0.45,0.56,0.73,0.45,0.67,0.43,0.23,0.78,0.42,0.36,0.23]
              }
df = pd.DataFrame(candidates,columns= ['Ed_standard','Salary','Work_experience','happiness'])

# Flag rows where Work_experience is missing (1 = missing, 0 = observed)
df['Missing'] = df['Work_experience'].isnull().astype(int)

print(df.describe())
corr = df[['Ed_standard','Salary','Work_experience','happiness']].corr()
corr.style.background_gradient(cmap='coolwarm')
df.to_csv('candidates.csv')


       Ed_standard         Salary  Work_experience  happiness    Missing
count    16.000000      16.000000        11.000000   16.00000  16.000000
mean    833.750000   83625.000000         5.936364    0.48500   0.312500
std     136.607711   15564.382416         2.420443    0.19225   0.478714
min     680.000000   68000.000000         2.800000    0.10000   0.000000
25%     717.500000   70750.000000         4.050000    0.40500   0.000000
50%     765.000000   76500.000000         5.400000    0.47500   0.000000
75%     980.000000  100250.000000         7.250000    0.61750   1.000000
max    1020.000000  114000.000000        10.100000    0.78000   1.000000


In [4]:
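# Predictors: the two fully observed variables, plus an intercept term
X=df[['Ed_standard','Salary']]
X = add_constant(X)

# Outcome: the missingness indicator for Work_experience
y=df['Missing']

# Logistic regression of missingness on the observed predictors
logit = sml.Logit(y, X).fit()
print(logit.summary())
#print(logit.predict())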

Optimization terminated successfully.
         Current function value: 0.527167
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                Missing   No. Observations:                   16
Model:                          Logit   Df Residuals:                       13
Method:                           MLE   Df Model:                            2
Date:                Tue, 11 Feb 2025   Pseudo R-squ.:                  0.1512
Time:                        14:27:11   Log-Likelihood:                -8.4347
converged:                       True   LL-Null:                       -9.9374
Covariance Type:            nonrobust   LLR p-value:                    0.2225
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
const          -6.8208      4.189     -1.628      0.103     -15.031       1.390
Ed_standard     0.0089    
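
As a rough complement to the logistic regression, you can simply compare the observed variables between the rows where Work_experience is missing and the rows where it is present. The sketch below rebuilds the example data so it runs on its own; the names `df_check`, `group_means` and `group_sizes` are my own for illustration. With only 16 rows, treat any difference as a prompt for further thought rather than as evidence.

```python
import pandas as pd

# Same example data as in the cells above
candidates = {
    'Ed_standard': [780, 750, 690, 710, 680, 730, 690, 720, 740, 900, 950, 975, 995, 1000, 1010, 1020],
    'Work_experience': [5.1, 4.5, None, 3.3, 3.6, 9.3, 6.7, 2.8, 5.4, None, 7.8, None, None, 10.1, 6.7, None],
    'Salary': [78000, 75000, 100000, 71000, 68000, 70000, 69000, 72000, 74000, 69000,
               102000, 101000, 79000, 114000, 101000, 95000],
}
df_check = pd.DataFrame(candidates)

# 1 = Work_experience missing, 0 = observed
df_check['Missing'] = df_check['Work_experience'].isna().astype(int)

# Mean of each fully observed variable within the two groups
group_means = df_check.groupby('Missing')[['Ed_standard', 'Salary']].mean()
group_sizes = df_check['Missing'].value_counts().sort_index()

print(group_sizes)
print(group_means)
```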

If there were a relationship, we would probably be dealing with a MAR missing data process, and imputation would probably be acceptable. You could then use, for example, either imputation by regression or a propensity score model.
If you don't have MAR, then you have either an MCAR or an NMAR process. Discounting an NMAR process is difficult and generally requires a lot of domain knowledge and research. In this situation you may have to make the following decisions:

* Do I impute using a random number generator, mean or median? This assumes the data is MCAR.

* Do I leave the variable out? What are the consequences of this? If they are too costly, do I impute after all, or do I just leave it out?
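
The first option can be sketched with plain pandas. This is only an illustration on the Work_experience values from the example, and it assumes the data is MCAR. The random fill here draws from the observed values with replacement, which is one of several ways a "random number generator" imputation could be done.

```python
import numpy as np
import pandas as pd

# Work_experience values from the example above (None = missing)
work_experience = pd.Series(
    [5.1, 4.5, None, 3.3, 3.6, 9.3, 6.7, 2.8, 5.4, None, 7.8, None, None, 10.1, 6.7, None],
    name='Work_experience', dtype=float,
)
missing = work_experience.isna()
observed = work_experience.dropna()

# Mean and median imputation: every gap gets the same value
mean_filled = work_experience.fillna(observed.mean())
median_filled = work_experience.fillna(observed.median())

# Random imputation (one possible variant): draw from the observed values
rng = np.random.default_rng(42)  # fixed seed for repeatability
random_filled = work_experience.copy()
random_filled[missing] = rng.choice(observed.to_numpy(), size=int(missing.sum()), replace=True)

summary = pd.DataFrame({
    'observed_only': observed.describe(),
    'mean_filled': mean_filled.describe(),
    'median_filled': median_filled.describe(),
    'random_filled': random_filled.describe(),
})
print(summary.loc[['count', 'mean', 'std']])
```

Filling every gap with a single value keeps the mean (or median) but shrinks the standard deviation, which is one reason to use these methods sparingly.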

If a variable with missing values is highly correlated with other variables in your dataset, then the $R^2$ from regressing it on those variables will generally be high (> 80%). If more than 50% of the values of the variable in question are missing, I would recommend ignoring this variable or reframing your question to focus only on the subjects with near-complete data.
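
Both checks are easy to compute. The sketch below uses the example data, fits an ordinary least squares regression of Work_experience on the two fully observed variables using the complete rows only, and reports the share of missing values. The 50% and 80% thresholds are the rules of thumb from the paragraph above, not universal cut-offs, and the variable names are my own.

```python
import numpy as np
import pandas as pd

df_r2 = pd.DataFrame({
    'Ed_standard': [780, 750, 690, 710, 680, 730, 690, 720, 740, 900, 950, 975, 995, 1000, 1010, 1020],
    'Work_experience': [5.1, 4.5, None, 3.3, 3.6, 9.3, 6.7, 2.8, 5.4, None, 7.8, None, None, 10.1, 6.7, None],
    'Salary': [78000, 75000, 100000, 71000, 68000, 70000, 69000, 72000, 74000, 69000,
               102000, 101000, 79000, 114000, 101000, 95000],
})

# Share of missing values in the variable of interest
missing_fraction = df_r2['Work_experience'].isna().mean()

# Regress Work_experience on the other variables, complete rows only
complete = df_r2.dropna(subset=['Work_experience'])
X = np.column_stack([
    np.ones(len(complete)),
    complete[['Ed_standard', 'Salary']].to_numpy(dtype=float),
])
y = complete['Work_experience'].to_numpy(dtype=float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
r_squared = 1 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()

print(f"Missing fraction: {missing_fraction:.2%}")
print(f"R^2 on complete rows: {r_squared:.3f}")
print("Too many missing values (> 50%)?", missing_fraction > 0.5)
print("Well predicted by other variables (R^2 > 0.8)?", r_squared > 0.8)
```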

Finally, if I impute, I always run my final analysis both with and without the imputed values. If the results are similar, I am happy, as the imputation will probably improve the power of the analysis. If they are not, I would reframe the problem. Be warned: there is no hard and fast rule for dealing with missing values, so be careful.
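
A minimal sketch of this with/without comparison on the example data follows. It assumes an ordinary least squares model of happiness on the three predictors and uses mean imputation purely for illustration; the helper `ols_coefficients` is a hypothetical name of my own, and a real analysis would use whichever model and imputation method you actually chose.

```python
import numpy as np
import pandas as pd

df_sens = pd.DataFrame({
    'Ed_standard': [780, 750, 690, 710, 680, 730, 690, 720, 740, 900, 950, 975, 995, 1000, 1010, 1020],
    'Work_experience': [5.1, 4.5, None, 3.3, 3.6, 9.3, 6.7, 2.8, 5.4, None, 7.8, None, None, 10.1, 6.7, None],
    'Salary': [78000, 75000, 100000, 71000, 68000, 70000, 69000, 72000, 74000, 69000,
               102000, 101000, 79000, 114000, 101000, 95000],
    'happiness': [0.5, 0.55, 0.1, 0.6, 0.7, 0.45, 0.56, 0.73, 0.45, 0.67, 0.43, 0.23, 0.78, 0.42, 0.36, 0.23],
})
predictors = ['Ed_standard', 'Salary', 'Work_experience']


def ols_coefficients(frame, predictors, outcome):
    """Least squares coefficients (with intercept) for outcome ~ predictors."""
    X = np.column_stack([np.ones(len(frame)), frame[predictors].to_numpy(dtype=float)])
    y = frame[outcome].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(beta, index=['const'] + predictors)


# Analysis 1: complete rows only (no imputation)
complete_case = df_sens.dropna(subset=['Work_experience'])

# Analysis 2: all rows, with Work_experience mean-imputed (illustrative choice)
imputed = df_sens.copy()
imputed['Work_experience'] = imputed['Work_experience'].fillna(imputed['Work_experience'].mean())

comparison = pd.DataFrame({
    'complete_case': ols_coefficients(complete_case, predictors, 'happiness'),
    'mean_imputed': ols_coefficients(imputed, predictors, 'happiness'),
})
comparison['difference'] = comparison['mean_imputed'] - comparison['complete_case']
print(comparison)
```

If the two columns tell the same story (same signs, similar magnitudes), the imputation is probably not driving the conclusions. If they diverge, that is the signal to reframe the problem.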

If you would like to read more on this topic, go to this [link](http://www.stat.columbia.edu/~gelman/arm/missing.pdf) from Columbia University statistical education. It's an interesting read.

We will now discuss three techniques that are typically used to impute missing data:

* Univariate imputation
* Multivariate feature imputation
* Nearest neighbors imputation
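
As a preview, here is a hedged sketch of how these three families map onto scikit-learn's `sklearn.impute` module, applied to the example data: `SimpleImputer` (univariate), `IterativeImputer` (multivariate, still marked experimental, so it has to be enabled with an explicit import) and `KNNImputer` (nearest neighbours). The settings, such as `n_neighbors=3` and `random_state=0`, are arbitrary choices for illustration. The features are left unscaled, so the nearest-neighbour distances are dominated by Salary.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer

df_imp = pd.DataFrame({
    'Ed_standard': [780, 750, 690, 710, 680, 730, 690, 720, 740, 900, 950, 975, 995, 1000, 1010, 1020],
    'Salary': [78000, 75000, 100000, 71000, 68000, 70000, 69000, 72000, 74000, 69000,
               102000, 101000, 79000, 114000, 101000, 95000],
    'Work_experience': [5.1, 4.5, None, 3.3, 3.6, 9.3, 6.7, 2.8, 5.4, None, 7.8, None, None, 10.1, 6.7, None],
})
X = df_imp.to_numpy(dtype=float)  # Work_experience is column 2

imputers = {
    'univariate_mean': SimpleImputer(strategy='mean'),   # uses only the column itself
    'multivariate': IterativeImputer(random_state=0),    # models the column from the other features
    'knn': KNNImputer(n_neighbors=3),                    # averages the 3 most similar rows
}

# Keep only the Work_experience column from each imputed matrix
imputed = pd.DataFrame({
    name: imputer.fit_transform(X)[:, 2] for name, imputer in imputers.items()
})

# Show the rows that were originally missing, side by side
was_missing = np.isnan(X[:, 2])
print(imputed[was_missing].round(2))
```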
