# Lab 5: Advanced Missing Data Imputation

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn import preprocessing
import math
import numpy as np

# Enable inline plotting
%matplotlib inline

## Advanced Regression Models

In the previous lecture, we discussed advanced regression models including neural networks and k-nearest neighbours (kNN). But in the previous lab, we only looked at using linear regression in sklearn. Now we'll try the more advanced models. Fortunately, sklearn makes it very easy to swap in different machine learning models.

Let's start with the same college dataset we used previously.

In [None]:
df = pd.read_csv('College-MISSING.csv')
print(df.head())

Let's predict graduation rate (Grad.Rate) based on the other variables. So Grad.Rate will be our outcome (y) and the other variables will be our features (X). 

But first we will remove missing data.

## Removing Missing Data

We will remove any rows with missing (NA) data, in order to fit our advanced regression models.
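
Before dropping rows, it can help to check how much data is missing. The sketch below uses a small made-up DataFrame (the values are illustrative and not taken from the college dataset) to show how isnull().sum() counts missing values per column and how dropna with how='any' keeps only the complete rows.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not the college dataset
toy = pd.DataFrame({
    'Apps': [1660, 2186, np.nan, 417],
    'Grad.Rate': [60, np.nan, 54, 59],
})

# Number of missing values in each column
n_missing = toy.isnull().sum()
print(n_missing)

# Keep only the rows where every column is present
toy_complete = toy.dropna(axis=0, how='any')
print(toy_complete)
```

Note that dropna keeps the original row labels, so the surviving rows are still labelled with their positions in the full table.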

In [None]:
df_complete = df.dropna(axis=0, how='any')
print(df_complete.head())

Then we will divide that into features (X) and outcomes (y), as before. The College and Private columns are left out of X because they are not numeric. We also add a line of code that scales the features (preprocessing.scale(X)) so that each one has mean 0 and standard deviation 1, as neural networks usually perform better when all the features are on roughly the same scale. If you remove this, you will likely see performance go down.
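
To see what the scaling step does, here is a minimal sketch on a made-up array with two columns on very different scales. The variable names toy_X and toy_scaled are only for this illustration.

```python
import numpy as np
from sklearn import preprocessing

# Toy features on very different scales (illustrative values only)
toy_X = np.array([[1000.0, 0.1],
                  [2000.0, 0.2],
                  [3000.0, 0.6]])

toy_scaled = preprocessing.scale(toy_X)

# After scaling, each column has mean 0 and standard deviation 1
print(toy_scaled.mean(axis=0))
print(toy_scaled.std(axis=0))
```

The printed means should be zero (up to floating-point rounding) and the standard deviations should be one, regardless of the original units of each column.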

In [None]:
X = df_complete.drop(['College', 'Private', 'Grad.Rate'], axis=1)
X = pd.DataFrame(preprocessing.scale(X), columns=X.columns)
print("Here are the features (X):")
print(X.head())

print("\n\nHere is the outcome variable (y):")
y = df_complete['Grad.Rate']
print(y)

## Advanced Regression Models in sklearn

In the previous lecture, we described a simple Perceptron as being like a single neuron, and a neural network being an extension of the Perceptron where we have multiple neurons arranged in layers, including hidden layers between the inputs and outputs. So multi-layer perceptron (or MLP) is just another term for neural network.

We will fit a neural network regression model to predict Grad.Rate, the same outcome we predicted in the previous lab.


We create an instance of the MLPRegressor class in sklearn. We need to specify how many hidden layers it should have and how many neurons in each of those layers. In this case, it's three hidden layers, with sizes of 100, 100, and 50, respectively. The other arguments specify which training algorithm it should use for learning the neural network weights (solver), the maximum number of iterations it should try (max_iter), and a seed for the random weight initialisation so that the results are reproducible (random_state).
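
One way to check how hidden_layer_sizes translates into an actual network is to fit a model and look at the shapes of its weight matrices. The sketch below does this on synthetic data with 4 features (the data, the toy_ names and the small max_iter are assumptions made to keep the example quick; sklearn may print a convergence warning, which does not matter here).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic data for illustration: 200 rows, 4 features
rng = np.random.RandomState(0)
toy_X = rng.normal(size=(200, 4))
toy_y = toy_X[:, 0] - 2 * toy_X[:, 1] + rng.normal(scale=0.1, size=200)

toy_mlp = MLPRegressor(hidden_layer_sizes=(100, 100, 50), solver='lbfgs',
                       max_iter=50, random_state=1)
toy_mlp.fit(toy_X, toy_y)

# One weight matrix per pair of adjacent layers
layer_shapes = [w.shape for w in toy_mlp.coefs_]
print(layer_shapes)
```

There should be four weight matrices: inputs to the first hidden layer, two between the hidden layers, and one from the last hidden layer to the single output.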

In [None]:
lm1 = MLPRegressor(hidden_layer_sizes=(100,100,50), solver='lbfgs', max_iter=500, random_state=1)

We then fit the model to the training data. Thankfully, sklearn provides a consistent API for the various machine learning models, so training and prediction work just as they did before.

In [None]:
lm1.fit(X, y)

Let's also create a kNN regression model. We can specify the value of _k_ (the number of neighbours) or use the default of 5. Here we set _k_ to 10.
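
As a reminder of what kNN regression does, the following sketch (on made-up one-feature data, with _k_ = 3 chosen only for the illustration) shows that the prediction for a new point is simply the average outcome of its _k_ nearest training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy one-feature data (illustrative values only)
toy_X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
toy_y = np.array([10.0, 20.0, 30.0, 100.0, 110.0])

toy_knn = KNeighborsRegressor(n_neighbors=3)
toy_knn.fit(toy_X, toy_y)

# Prediction for a new point at x = 2.5
query = np.array([[2.5]])
pred = toy_knn.predict(query)[0]

# The same value by hand: average the outcomes of the 3 nearest points
_, neighbour_idx = toy_knn.kneighbors(query)
by_hand = toy_y[neighbour_idx[0]].mean()
print(pred, by_hand)
```

The three nearest points to 2.5 are at 1, 2 and 3, so both values are the average of 10, 20 and 30. A larger _k_ smooths the predictions more; a smaller _k_ follows the training data more closely.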

In [None]:
lm2 = KNeighborsRegressor(n_neighbors=10)

Then we fit that model to the data as well.

In [None]:
lm2.fit(X,y)

## Imputation

Then we want to get predictions on the full dataset, including the rows that had missing (NA) data, so that we can impute the missing values using our trained regression model. 

So we need to get the features (X) for the entire dataset, not just the complete cases.

In [None]:
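X_all = df.drop(['College', 'Private', 'Grad.Rate'], axis=1)
# Reuse the complete-case mean and standard deviation so X_all matches the training scale
scaler = preprocessing.StandardScaler().fit(df_complete[X_all.columns])
X_all = pd.DataFrame(scaler.transform(X_all), columns=X_all.columns)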

preds1 = lm1.predict(X_all)

preds2 = lm2.predict(X_all)


Or we could get predictions for just the records where Grad.Rate was missing:

In [None]:
missing = df['Grad.Rate'].isnull()

preds_missing1 = lm1.predict(X_all.loc[missing, :])

preds_missing2 = lm2.predict(X_all.loc[missing, :])
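
Once we have predictions for the missing rows, the imputation itself is just writing them back into the table. The sketch below shows one way to do that on a made-up DataFrame; toy_preds is a stand-in for the output of a model's predict call, and the values are invented for the illustration.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({'Grad.Rate': [60.0, np.nan, 54.0, np.nan]})
toy_missing = toy['Grad.Rate'].isnull()

# Stand-in for model predictions on the rows with missing outcomes
toy_preds = np.array([71.5, 48.0])

# Work on a copy so the original data stays untouched
toy_imputed = toy.copy()
toy_imputed.loc[toy_missing, 'Grad.Rate'] = toy_preds
print(toy_imputed)
```

The predictions are assigned in row order, so the array must contain exactly one value per missing row.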



## Evaluation

Let's repeat a step from the first part of last week's lab assignment. We will look at the gold-standard Grad.Rate values in _College.csv_ and see how our predictions compare. But this time we will only look at the subset of the data that was missing. Note that this relies on _College.csv_ having its rows in the same order as _College-MISSING.csv_.

In [None]:
gs_df = pd.read_csv('College.csv')
gs_grad = gs_df.loc[missing, 'Grad.Rate']

plt.figure()
plt.scatter(preds_missing1, gs_grad)
plt.xlabel("Predicted Grad Rate (NN)")
plt.ylabel("Actual Grad Rate")
plt.show()

plt.figure()
plt.scatter(preds_missing2, gs_grad)
plt.xlabel("Predicted Grad Rate (kNN)")
plt.ylabel("Actual Grad Rate")
plt.show()

It can be difficult to assess which approach is doing better just by looking at the scatter plots. 

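Let's calculate the mean squared error (MSE) for each, that is, the average of the squared differences between the actual and predicted values. Lower is better.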
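
As a quick check of the formula, the sketch below computes the MSE by hand on three made-up values and compares it with mean_squared_error from sklearn.metrics. The numbers are invented for the illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values for illustration only
actual = np.array([60.0, 70.0, 80.0])
predicted = np.array([62.0, 66.0, 80.0])

# MSE by hand: average of the squared differences
mse_by_hand = np.mean((actual - predicted) ** 2)

# The same quantity from sklearn
mse_sklearn = mean_squared_error(actual, predicted)
print(mse_by_hand, mse_sklearn)
```

The squared errors are 4, 16 and 0, so both approaches give 20 / 3. Either approach can be used in the cells that follow.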

In [None]:
mse = sum((gs_grad - preds_missing1)**2) / len(preds_missing1)
print("MSE with MLP regressor: ", mse)

In [None]:
mse = sum((gs_grad - preds_missing2)**2) / len(preds_missing2)
print("MSE with kNN regressor: ", mse)

Your results may differ slightly, but you should see that kNN achieves a lower MSE than the neural network on this task. However, the neural network might improve significantly if you try different numbers of hidden layers and neurons. Neural networks also tend to perform better on larger datasets. If you remove the lines of code that do feature scaling, you should see the neural network perform much worse.
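
One possible way to try different architectures is sketched below, under some assumptions: the data is synthetic, the candidate layer sizes are arbitrary, and a held-out validation split stands in for the gold-standard file, which would not be available in a real imputation problem.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic data for illustration only
rng = np.random.RandomState(0)
toy_X = rng.normal(size=(300, 5))
toy_y = toy_X[:, 0] ** 2 + toy_X[:, 1] + rng.normal(scale=0.1, size=300)

# Hold out 30% of the rows to compare the candidates
X_train, X_val, y_train, y_val = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0)

# Candidate architectures (arbitrary choices for this sketch)
candidates = [(10,), (50,), (50, 25)]
results = {}
for sizes in candidates:
    model = MLPRegressor(hidden_layer_sizes=sizes, solver='lbfgs',
                         max_iter=200, random_state=1)
    model.fit(X_train, y_train)
    results[sizes] = mean_squared_error(y_val, model.predict(X_val))

for sizes, val_mse in results.items():
    print(sizes, round(val_mse, 3))
```

With the college data, the same loop could be run on a split of the complete cases, and the same idea applies to choosing _k_ for kNN.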

__Lab Assignment__: You are provided with a dataset _winequality-MISSING.csv_ that has some data about different wines, and how each wine was rated. Some of the wine ratings are missing. Try fitting advanced regression models using neural networks and kNN to predict the missing rating values. 

Then load in the gold-standard data in _winequality.csv_ and compare your predicted values for the missing ratings with the actual ratings. Show scatterplots of the predicted vs. actual ratings for the missing data, for both models. Also show the MSE for each predictive model.