# **CIS 520: Machine Learning, Fall 2020**
# **Week 9, Worksheet 2**
## **Missing Data** 


- **Content Creator:** Kenneth Shinn
- **Content Reviewers:** Hanwen Zhang, Mohit Kumaraian


This worksheet works through an example of missing-data imputation and compares the performance of the two imputation techniques from lecture: mean imputation and regression imputation.


# **Missing Data**

We will split a data set into training and testing data. The training data will have values randomly dropped, and we will use each of the two strategies to impute the missing values. NOTE: in this worksheet, values are dropped at random, independently of the data values themselves (often called missing completely at random). Then, we will train a regression model on each of the imputed training sets and see how it performs on the held-out testing set!

In [None]:
import numpy as np
# IterativeImputer is experimental, so this enabling import must come before it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from random import random, seed

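Let's create a dataset using sklearn's make_regression function. This function randomly generates a data set for a regression problem. With the default settings used here, no noise is added, so the target is an exact linear function of the features.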

In [None]:
X, y = make_regression(n_samples = 100000, n_features = 2)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
X_train_missing = X_train.copy()
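
To get a feel for what make_regression produces, here is a minimal sketch on a smaller, separate data set. The names X_demo, y_demo, and coef, as well as the choices random_state=0 and n_samples=1000, are assumptions made for illustration and are not part of the worksheet's pipeline. It checks that the target is a linear function of the features and looks at how correlated the two features are.

```python
import numpy as np
from sklearn.datasets import make_regression

# Illustrative data set (not the worksheet's X, y): coef=True also returns
# the true coefficients of the underlying linear model.
X_demo, y_demo, coef = make_regression(
    n_samples=1000, n_features=2, coef=True, random_state=0
)

# With the default noise=0.0, y should equal X @ coef up to floating point error.
is_exactly_linear = np.allclose(y_demo, X_demo @ coef)

# Sample correlation between the two feature columns.
feature_corr = np.corrcoef(X_demo, rowvar=False)[0, 1]

print("True coefficients:", coef)
print("y is an exact linear function of X:", is_exactly_linear)
print("Correlation between the two features:", feature_corr)
```

Keep the feature correlation in mind when you get to the followup questions at the end of the worksheet.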

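Now, let's randomly drop values from the X_train data set. Each row has a 30% chance of losing one of its two feature values, and the feature to drop is chosen with equal probability. The dropped values will be replaced with np.nan.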

In [None]:
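seed(1)  # fix the seed so the same values are dropped on every run
for i in range(len(X_train_missing)):
    # each row has a 30% chance of losing one value
    if random() < .3:
        # choose which of the two features to drop with equal probability
        if random() < .5:
            X_train_missing[i, 0] = np.nan
        else:
            X_train_missing[i, 1] = np.nan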

In [None]:
# make sure everything looks good!

print(X_train_missing)
print(len(X_train))

[[ 0.48293426         nan]
 [-0.35371807  1.54279003]
 [        nan  1.35282604]
 ...
 [ 0.55075279  1.2841265 ]
 [-0.53599387         nan]
 [-0.06317035         nan]]
70000
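
Beyond eyeballing the printed array, we can also check the missingness pattern numerically. The sketch below is self-contained and uses a small hypothetical array X_demo with a vectorized version of the same dropping scheme (the NumPy random generator and its seed are assumptions for illustration, so the exact rows dropped differ from the loop above). The same np.isnan checks can be applied to X_train_missing.

```python
import numpy as np

# Hypothetical stand-in for X_train_missing: 1,000 rows, 2 features.
rng = np.random.default_rng(0)
n_rows = 1000
X_demo = rng.normal(size=(n_rows, 2))

# Select each row with probability 0.3, then pick one of its two columns
# with equal probability and set that entry to NaN.
drop_row = rng.random(n_rows) < 0.3
drop_col = rng.integers(0, 2, size=n_rows)
X_demo[drop_row, drop_col[drop_row]] = np.nan

missing_mask = np.isnan(X_demo)
frac_missing_per_column = missing_mask.mean(axis=0)     # expected near 0.15 each
frac_rows_with_missing = missing_mask.any(axis=1).mean()  # expected near 0.30
max_missing_in_a_row = missing_mask.sum(axis=1).max()   # at most 1 by construction

print("Fraction missing per column:", frac_missing_per_column)
print("Fraction of rows with a missing value:", frac_rows_with_missing)
print("Max missing values in any row:", max_missing_in_a_row)
```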


**Simple Mean Based Imputation**

Let's perform a mean-based imputation on the dataset with missing values. Remember that mean-based imputation replaces each missing value with the mean of all non-missing values in the same column.

In [None]:
mean_imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_train_mean_imp = mean_imp.fit_transform(X_train_missing)
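
To make the definition concrete, here is a minimal sketch that performs mean imputation by hand with NumPy and compares it against SimpleImputer. The tiny array X_demo is a made-up example, not the worksheet's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Small hypothetical array with one missing value in each column.
X_demo = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan],
                   [5.0, 30.0]])

# Column means over the non-missing entries only:
# column 0 -> (1 + 3 + 5) / 3, column 1 -> (10 + 20 + 30) / 3.
col_means = np.nanmean(X_demo, axis=0)

# Replace each NaN with the mean of its own column (col_means broadcasts across rows).
X_manual = np.where(np.isnan(X_demo), col_means, X_demo)

# SimpleImputer with strategy="mean" should produce the same array.
X_sklearn = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(X_demo)

print(X_manual)
print("Matches SimpleImputer:", np.allclose(X_manual, X_sklearn))
```

Notice that every missing entry in a column receives the same value, no matter what the other feature in that row looks like.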

**Regression Based Imputation**

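Now we will perform a regression-based imputation on the dataset with missing values. Remember that regression-based imputation uses the non-missing values in the other columns to predict the missing values of a given column. We use sklearn's IterativeImputer for this, which by default fits a BayesianRidge regressor for each column.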

In [None]:
reg_imp = IterativeImputer(missing_values=np.nan)
X_train_reg_imp = reg_imp.fit_transform(X_train_missing)
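
The idea behind regression imputation can be sketched by hand for the two-column case. The helper regression_impute_column below is hypothetical and written only for illustration: it does a single pass with ordinary least squares, whereas IterativeImputer starts from an initial fill and cycles through the columns over several rounds. The example data X_demo is made up so that column 1 is exactly 2 * column 0 + 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def regression_impute_column(X, target_col, predictor_col):
    """Fill NaNs in X[:, target_col] with a linear fit on X[:, predictor_col].

    Hypothetical helper: assumes at least one target value is missing and
    that rows missing the target still have the predictor observed.
    """
    X_filled = X.copy()
    target_missing = np.isnan(X[:, target_col])
    complete = ~np.isnan(X).any(axis=1)

    # Fit target ~ predictor using only the rows where both values are observed.
    model = LinearRegression().fit(
        X[complete][:, [predictor_col]], X[complete, target_col]
    )

    # Predict the target in the rows where it is missing.
    X_filled[target_missing, target_col] = model.predict(
        X[target_missing][:, [predictor_col]]
    )
    return X_filled

# Hypothetical data where column 1 = 2 * column 0 + 1 exactly.
X_demo = np.array([[0.0, 1.0],
                   [1.0, 3.0],
                   [2.0, np.nan],
                   [3.0, 7.0],
                   [np.nan, 9.0]])

# Impute column 1 from column 0, then column 0 from column 1.
X_imputed = regression_impute_column(X_demo, target_col=1, predictor_col=0)
X_imputed = regression_impute_column(X_imputed, target_col=0, predictor_col=1)
print(X_imputed)
```

Because the two columns in this toy example are perfectly linearly related, the fitted lines recover the missing entries. How well this works in general depends on how much one column actually tells you about the other.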

**Training and Testing on the Imputed Datasets**

Now, let's train an OLS regression model on each of the imputed data sets and compare their test MSE.

Which method do you expect to have the lower test MSE? Why?

In [None]:
# training on mean imputed data
mean_imp_lm = LinearRegression().fit(X_train_mean_imp, y_train)
y_pred_mean_imp = mean_imp_lm.predict(X_test)
mse_mean_imp = mean_squared_error(y_test, y_pred_mean_imp)

print("Mean Imputed Data MSE: " + str(mse_mean_imp))

# training on regression imputed data
reg_imp_lm = LinearRegression().fit(X_train_reg_imp, y_train)
y_pred_reg_imp = reg_imp_lm.predict(X_test)
mse_reg_imp = mean_squared_error(y_test, y_pred_reg_imp)

print("Regression Imputed Data MSE: " + str(mse_reg_imp))

# training on full X_train data
lm = LinearRegression().fit(X_train, y_train)
y_pred = lm.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("True X_train MSE: " + str(mse))

Mean Imputed Data MSE: 0.011240654108897059
Regression Imputed Data MSE: 0.011989879045445921
True X_train MSE: 2.9998313648798715e-28


**Observations and Followup Questions**

Which imputation technique produced the lower MSE? Why do you think that this is the case? Is this what you expected?

Think about why regression imputation didn't work significantly better here despite having "more information". (Hint: the x variables of the data-generating model are independent.)

How might these MSE results change if there were a slight correlation between the x variables?
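
One way to explore this question yourself is sketched below. It is not part of the original worksheet pipeline: the function imputation_rmse, the correlation parameter rho, the sample size, and the seeds are all assumptions made for illustration. It draws two features with a chosen correlation, drops values with the same scheme as above, and measures how far each method's imputed values are from the true values that were dropped.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

def imputation_rmse(rho, n_samples=5000, seed=0):
    """Return (mean-imputation RMSE, regression-imputation RMSE) on the dropped entries."""
    rng = np.random.default_rng(seed)

    # Two standard normal features with correlation rho (hypothetical setup).
    cov = [[1.0, rho], [rho, 1.0]]
    X_true = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n_samples)

    # Same dropping scheme as the worksheet: 30% of rows lose one value.
    X_missing = X_true.copy()
    drop_row = rng.random(n_samples) < 0.3
    drop_col = rng.integers(0, 2, size=n_samples)
    X_missing[drop_row, drop_col[drop_row]] = np.nan
    mask = np.isnan(X_missing)

    def rmse(X_imputed):
        # Error is measured only on the entries that were actually dropped.
        return np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2))

    X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)
    X_reg = IterativeImputer(random_state=0).fit_transform(X_missing)
    return rmse(X_mean), rmse(X_reg)

for rho in (0.0, 0.2, 0.8):
    rmse_mean, rmse_reg = imputation_rmse(rho)
    print(f"rho={rho}: mean RMSE={rmse_mean:.3f}, regression RMSE={rmse_reg:.3f}")
```

Try a few values of rho and compare the two columns of output before answering.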

Which imputation technique do you think would work better in the real world? Why?