# Lab 9: Neural Networks


In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn releases
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.metrics import confusion_matrix, mean_squared_error, precision_recall_fscore_support, accuracy_score

%matplotlib inline



We will first work with some data about Carseat sales. One of the variables is 'Sales', and we are going to create a new binary variable 'High' that is True if 'Sales' is greater than 8 and False otherwise.

In [2]:
df = pd.read_csv('Carseats.csv')
df['High'] = df['Sales'] > 8 
df.ShelveLoc = pd.factorize(df.ShelveLoc)[0]
df.Urban = df.Urban.map({'No':0, 'Yes':1})
df.US = df.US.map({'No':0, 'Yes':1})
df.head()

Unnamed: 0,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban,US,High
0,9.5,138,73,11,276,120,0,42,17,1,1,True
1,11.22,111,48,16,260,83,1,65,10,1,1,True
2,10.06,113,35,10,269,80,2,59,12,1,1,True
3,7.4,117,100,4,466,97,2,55,14,1,1,False
4,4.15,141,64,3,340,128,0,38,13,1,0,False


Now we will predict 'High' using the other features. This is a binary classification task. The predictor features (X) will include everything except 'Sales' and 'High.'

It is probably obvious to you why we need to exclude 'High' (because it is the outcome variable we are trying to predict), but make sure you also understand why we need to remove 'Sales' from the features as well.
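To see why a column that defines the outcome has to be dropped, here is a minimal sketch on synthetic data (not the Carseats file). The toy `Price` and `Sales` columns, the noise levels, the helper `holdout_accuracy`, and the choice of a decision tree are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data (assumption): Sales depends weakly on Price plus noise.
price = rng.normal(115, 24, n)
sales = 7.5 - 0.05 * (price - 115) + rng.normal(0, 2.5, n)
toy = pd.DataFrame({'Price': price, 'Sales': sales})
toy['High'] = toy['Sales'] > 8  # the label is computed directly from Sales


def holdout_accuracy(feature_cols):
    """Fit a tree on the given columns and return accuracy on a held-out half."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy[feature_cols], toy['High'], train_size=0.5, random_state=1)
    clf = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))


acc_leaky = holdout_accuracy(['Price', 'Sales'])  # Sales leaks the label
acc_clean = holdout_accuracy(['Price'])           # label-free feature set

print('With Sales:    %.3f' % acc_leaky)
print('Without Sales: %.3f' % acc_clean)
```

When 'Sales' is left in, the model only has to rediscover the threshold of 8, so its score says nothing about how well the other features predict high sales.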

We randomly select 50% of the data as training data.



In [3]:
X = df.drop(['Sales', 'High'], axis=1)
y = df.High

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=1)

In this lab, we are just using a multi-layer perceptron as a drop-in substitute for the other types of classifiers we have been using. 

Let's compare an MLP, a Perceptron, Logistic Regression, and a Gradient Boosting classifier (a classifier based on decision trees).
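"Drop-in substitute" works because scikit-learn classifiers share the same `fit` and `predict` methods. As a rough sketch of that idea, the comparison can be written as one loop over a dictionary of models. The data here is synthetic (generated with `make_classification`), and the names `X_toy`, `y_toy`, `models`, and `scores` are placeholders, not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (assumption): 400 rows, 10 numeric features.
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.5, random_state=1)

# Every estimator below exposes the same fit/predict interface.
models = {
    'Neural Network': MLPClassifier(hidden_layer_sizes=(10, 10),
                                    max_iter=1000, random_state=1),
    'Perceptron': Perceptron(random_state=1),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Gradient Boosting': GradientBoostingClassifier(random_state=1),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                        # same call for every model
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print('%s: %.3f' % (name, scores[name]))
```

Looping like this also avoids copy-pasting a cell per model, at the cost of less room for per-model commentary.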

In [4]:
nn = MLPClassifier(hidden_layer_sizes=(10,10), max_iter=1000, activation='relu', random_state=1)
perc = Perceptron(penalty='l1', random_state=1, max_iter=1000, shuffle=True)
logreg = LogisticRegression()
gb = GradientBoostingClassifier(random_state=1)

In [5]:
nn.fit(X_train, y_train)
perc.fit(X_train, y_train)
logreg.fit(X_train, y_train)
gb.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=1, subsample=1.0, verbose=0,
              warm_start=False)

Now let's make predictions on the test set using each model, and show a confusion matrix for each.

In [6]:
preds_nn = nn.predict(X_test)
preds_perc = perc.predict(X_test)
preds_logreg = logreg.predict(X_test)
preds_gb = gb.predict(X_test)

In [7]:
# confusion_matrix puts true labels in rows; after .T, rows are predictions and columns are true labels
cm_nn = pd.DataFrame(confusion_matrix(y_test, preds_nn).T, index=['No', 'Yes'], columns=['No', 'Yes'])
print("Using Neural Network:\n")
print(cm_nn)

acc_nn = accuracy_score(y_test, preds_nn)
print('\nAccuracy is: %s' % acc_nn)

Using Neural Network:

     No  Yes
No   96   26
Yes  23   55

Accuracy is: 0.755
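Because the matrix is transposed before printing, it helps to be explicit about which axis is which. The following small sketch uses six made-up labels (an assumption, not lab data) to show the orientation and how accuracy falls out of the diagonal.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up labels for illustration: 0 = 'No', 1 = 'Yes'.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 0, 1])

# scikit-learn convention: rows are true labels, columns are predictions.
cm = confusion_matrix(y_true, y_pred)

# After transposing, rows are predictions and columns are true labels,
# which matches the tables printed in this lab.
cm_t = pd.DataFrame(cm.T, index=['No', 'Yes'], columns=['No', 'Yes'])
cm_t.index.name = 'Predicted'
cm_t.columns.name = 'True'
print(cm_t)

# Accuracy is the diagonal (correct predictions) over the total count.
acc_from_cm = np.trace(cm) / cm.sum()
print('Accuracy from matrix: %s' % acc_from_cm)
print('accuracy_score:       %s' % accuracy_score(y_true, y_pred))
```

Read this way, the neural network table above has 96 + 55 correct predictions out of 200, which gives the reported 0.755.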


In [8]:
cm_perc = pd.DataFrame(confusion_matrix(y_test, preds_perc).T, index=['No', 'Yes'], columns=['No', 'Yes'])
print("Using the Perceptron:\n")
print(cm_perc)

acc_perc = accuracy_score(y_test, preds_perc)
print('\nAccuracy is: %s' % acc_perc)

Using the Perceptron:

      No  Yes
No   118   61
Yes    1   20

Accuracy is: 0.69


In [9]:
cm_logreg = pd.DataFrame(confusion_matrix(y_test, preds_logreg).T, index=['No', 'Yes'], columns=['No', 'Yes'])
print("Using Logistic Regression:\n")
print(cm_logreg)

acc_logreg = accuracy_score(y_test, preds_logreg)
print('\nAccuracy is: %s' % acc_logreg)

Using Logistic Regression:

      No  Yes
No   103   26
Yes   16   55

Accuracy is: 0.79


In [10]:
cm_gb = pd.DataFrame(confusion_matrix(y_test, preds_gb).T, index=['No', 'Yes'], columns=['No', 'Yes'])
print("Using Gradient Boosting:\n")
print(cm_gb)

acc_gb = accuracy_score(y_test, preds_gb)
print('\nAccuracy is: %s' % acc_gb)

Using Gradient Boosting:

      No  Yes
No   103   23
Yes   16   58

Accuracy is: 0.805


In this case, Logistic Regression (0.79) and Gradient Boosting (0.805) perform best, ahead of the neural network (0.755) and the Perceptron (0.69). This isn't surprising, given that the dataset is small (200 training rows) and that we haven't tried to tune the structure of the neural network.

Tree-based models tend to work well on small tabular datasets like this one.

In the lab assignment, you will work with a different dataset and spend more time trying out different neural architectures. 

# Lab Assignment

* Read in the Boston dataset. 
* This includes some socioeconomic data about neighbourhoods in Boston.
* Create a new variable 'HighVal' that is True when 'medv' is greater than 20, and False otherwise.
* You will be doing classification, to predict 'HighVal', i.e. whether a neighbourhood has high median home values.
* The outcome variable is 'HighVal'.
* The features are all of the other variables, except 'medv' and 'HighVal'. 
* Select 50% of the data as training data (using the train_test_split() function, as above). 
* Train three different Multi-Layer Perceptron classifiers on the training set.
  * Try varying numbers of hidden layers.
  * Try varying numbers of neurons in the hidden layers.
  * Try a couple of activation functions.
* Make predictions on the test set using each model; calculate accuracy and show a confusion matrix for each.
* Compare these predictions with predictions from at least two other models.
  * e.g. the Perceptron or Logistic Regression.
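One possible way to organise the architecture experiments is sketched below. It runs on synthetic data from `make_classification` rather than the Boston dataset, and the three configurations in `configs` are arbitrary examples chosen for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in data (assumption); replace with your own X and y.
X_toy, y_toy = make_classification(n_samples=500, n_features=13, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.5, random_state=1)

# Example configurations: vary depth, width, and activation function.
configs = [
    {'hidden_layer_sizes': (10,), 'activation': 'relu'},
    {'hidden_layer_sizes': (20, 20), 'activation': 'tanh'},
    {'hidden_layer_sizes': (30, 20, 10), 'activation': 'logistic'},
]

results = []
for cfg in configs:
    mlp = MLPClassifier(max_iter=1000, random_state=1, **cfg)
    mlp.fit(X_tr, y_tr)
    preds = mlp.predict(X_te)
    results.append((cfg, accuracy_score(y_te, preds),
                    confusion_matrix(y_te, preds)))

for cfg, acc, cm in results:
    print(cfg)
    print('Accuracy: %.3f' % acc)
    print(cm)  # rows are true labels, columns are predictions
    print()
```

Keeping the configurations in a list makes it easy to add a fourth or fifth architecture without duplicating the training code.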

__Submit your completed notebook via Blackboard.__