# Neural Network Classifier

In this notebook we implement a neural network classifier that labels a survey participant as either a fan of Star Wars (1) or not a fan of Star Wars (-1).
We use the scikit-learn class `MLPClassifier`.

In [2]:
import pandas as pd

### Read CSV
We first read the CSV of numerical survey data.

In [3]:
filepath = r"C:\Users\User\AppData\Local\Packages\CanonicalGroupLimited.UbuntuonWindows_79rhkp1fndgsc\LocalState\rootfs\home\mitchel\StarWarsSurvey-ClassificationAnalysis\survey_numeric.csv"

data = pd.read_csv(filepath)
data.head(15)
data.shape

(1186, 37)

### Record Cleaning/Feature Engineering

Because we are trying to predict whether a respondent is a fan of Star Wars, we remove all records where the respondent answered "no" to the question "Seen a Star Wars film", as the remainder of their answers will be null or carry very little identifying information.

We then drop the features "Location (Census Region)" and "Seen a Star Wars film" entirely. After the model was trained, the weights from these two input nodes to the first hidden layer were found to be almost entirely zero. Weights of large magnitude, positive or negative, indicate that a feature influences the output, so near-zero weights imply that these features had very little effect on the model's classification.

In [4]:
#drop all records where respondent has not seen a star wars film
data.drop(data[data["Seen a Star Wars film"] == 0].index,inplace=True)

#drop location column and "Seen a star wars film"
data = data.drop(["Location (Census Region)"], axis=1)
data = data.drop(["Seen a Star Wars film"], axis=1)

### Data Split
Here we split the data on the given label.

We create a function that takes the label as an argument and returns features (data_X) and labels (data_Y). This function provides the flexibility to train the model on different labels (Fan of Star Wars, Star Trek Fan, Gender, Age, Household Income, and Education) without changing much of the code.

We remove the records in which the label value is zero. For the labels mentioned above, a value of zero represents an unknown or null response, which is not a response that we should expect our model to predict accurately.

In [5]:
#print all column names
for col in data.columns:
    print(col)

#this function splits the data into features and labels, given the label argument
#it returns two objects: data_X and data_Y
def splitData(label):
    #make a copy of the data to keep the original data unaffected
    #(plain assignment would only create a second name for the same DataFrame)
    data_copy = data.copy()

    #drop records with an unknown or null response (label value 0)
    data_copy.drop(data_copy[data_copy[label] == 0].index, inplace=True)

    #split and return data
    data_X = data_copy.loc[:,data_copy.columns != label]
    data_Y = data_copy[label]
    return data_X, data_Y

Fan of Star Wars
Seen The Phantom Menace
Seen Attack of the Clones
Seen Revenge of the Sith
Seen A New Hope
Seen The Empire Strikes Back
Seen Return of the Jedi
Rank for The Phantom Menace
Rank for Attack of the Clones
Rank for Revenge of the Sith
Rank for A New Hope
Rank for The Empire Strikes Back
Rank for Return of the Jedi
View of Han Solo
View of Luke Skywalker
View of Princess Leia Organa
View of Anakin Skywalker
View of Obi Wan Kenobi
View of Emperor Palpatine
View of Darth Vader
View of Lando Calrissian
View of Boba Fett
View of C-3P0
View of R2 D2
View of Jar Jar Binks
View of Padme Amidala
View of Yoda
Which character shot first?
Familiar with the Expanded Universe?
Fan of the Expanded Universe?
Star Trek Fan
Gender
Age
Household Income
Education
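
The difference between aliasing and copying a DataFrame matters for a helper like `splitData`. The sketch below uses a small made-up frame (the column names `label` and `feature` and the helper `split_data` are hypothetical, not part of the survey data) to show that filtering without touching the original leaves it intact, so the same frame could be split on several labels in turn.

```python
import pandas as pd

# Toy frame, made up for illustration; 0 stands for an unknown response.
toy = pd.DataFrame({
    "label":   [1, -1, 0, 1],
    "feature": [3, 5, 2, 4],
})

def split_data(frame, label):
    """Return (features, labels) without modifying `frame`."""
    # Boolean indexing returns a new object, so `frame` keeps all of its rows.
    kept = frame[frame[label] != 0]
    return kept.drop(columns=[label]), kept[label]

X, y = split_data(toy, "label")
print(X.shape, y.shape)   # rows with label 0 are excluded from the split
print(toy.shape)          # the original frame is unchanged
```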


### Model Set Up

##### GridSearchCV hyperparameters
To tune the hyperparameters of the neural network, we set up a pipeline that scales the data and pass it into a GridSearchCV object with varying ranges of hyperparameters.
We ran this grid search over all four activation functions that `MLPClassifier` supports. We initially set the hidden layer sizes to a single layer, ranging from 10 to 60 nodes in steps of 10.
After fitting many grid searches, we consistently found that the best parameters were the largest hidden layer sizes. To keep searching for the optimal hidden layer geometry, we iteratively adjusted the `hidden_layer_list` to hold larger and larger values. We also tested two and three hidden layers of similar sizes.

##### Converging on optimal hidden layers
We saw that the best-performing geometry was consistently a single hidden layer with 110 nodes. We then iteratively made our step size smaller and narrowed the range around the best-performing number of nodes. Finally, we found that our model's optimal hidden layer geometry was 106 nodes in a single layer, and we left the grid search configured to search between 100 and 110 nodes (with a step size of 2).

In [6]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

#build the scaling + classifier pipeline that is passed to GridSearchCV

scaler = StandardScaler()
MLPC = MLPClassifier(solver = 'lbfgs')

pipe = Pipeline([("Scaler",scaler),("MLPC",MLPC)])

#create list of hidden_layer sizes for param_grid: (100,), (102,), ..., (110,)
hidden_layer_list = [(i,) for i in range(100, 111, 2)]

#parameter grid for different hyperparameters in GridSearchCV
param_grid = {'MLPC__hidden_layer_sizes':hidden_layer_list,
             'MLPC__activation':['identity','logistic','tanh','relu']
             }
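
To get a feel for the cost of this search, the candidate combinations can be enumerated with scikit-learn's `ParameterGrid`. This is a rough sketch that rebuilds the same grid as the cell above; the fit count assumes 5-fold cross-validation inside the search, as used later in the notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above: sizes 100, 102, ..., 110 and four activations.
hidden_layer_list = [(n,) for n in range(100, 111, 2)]
param_grid = {
    "MLPC__hidden_layer_sizes": hidden_layer_list,
    "MLPC__activation": ["identity", "logistic", "tanh", "relu"],
}

n_candidates = len(ParameterGrid(param_grid))  # 6 sizes x 4 activations
inner_folds = 5                                # assumed cv=5 in GridSearchCV
print(n_candidates, "candidates,", n_candidates * inner_folds, "fits per search")
```

Wrapping the search in an outer cross-validation loop repeats this work once per outer fold, which is why narrowing the grid before the final run keeps the runtime manageable.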


### Training Model/ Reporting Performance
##### Help Me, 'Fan of Star Wars' label: You're My Only Hope
Earlier we discussed training the model to predict several labels, but here we decided to train it only on the label "Fan of Star Wars". This is for two reasons. First, "Fan of Star Wars" and "Star Trek Fan" were the only two of the mentioned labels that are binary. The others have 5 or 6 possible survey responses, which means that a classifier guessing at random would reach roughly 16-20% accuracy. Our trained classifiers reached around 40% on these labels, which is well above chance but still poor in absolute terms.
Second, the neural network was unable to surpass 68% accuracy for the label "Star Trek Fan", so we decided to train and predict only the label "Fan of Star Wars".
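
As a reference point for these accuracy figures, a guess drawn uniformly from k classes is right with probability 1/k. The short sketch below works that out for the class counts mentioned above; it is plain arithmetic and the helper name `chance_accuracy` is made up for illustration.

```python
# Expected accuracy of a guess drawn uniformly at random from k classes.
def chance_accuracy(k):
    return 1.0 / k

for k in (2, 5, 6):
    print(f"{k} classes: {chance_accuracy(k):.1%}")
# prints:
# 2 classes: 50.0%
# 5 classes: 20.0%
# 6 classes: 16.7%
```

A classifier that always predicts the most common class can beat this figure when the classes are imbalanced, so the majority-class rate is a stricter baseline to compare against.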
##### GridSearchCV and Reporting
We ran a 5-fold cross-validation loop around the 5-fold grid search (nested cross-validation) to estimate the accuracy of a trained GridSearchCV. We report accuracy, precision, recall, and F-score.
This model had an accuracy of 82.3%, an F-score of 87% for class 1, and an F-score of 72% for class -1. The model therefore identifies fans (1) more reliably than non-fans (-1): recall is 89% for class 1 but only 68% for class -1. This is in part due to the class imbalance between the two (552 fans versus 284 non-fans). Nevertheless, precision and recall for the -1 class were 76% and 68% respectively, which is not terrible.

In [13]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report

#Create GridSearchCV Object
MLPSearch= GridSearchCV(pipe,
                  param_grid,
                  scoring = 'accuracy',
                  cv=5)

#display accuracy and confusion matrix for the GridSearchCV
data_X, data_Y = splitData('Fan of Star Wars')

#Print Accuracy
acc = cross_val_score(MLPSearch, data_X, data_Y, cv=5)
print("Average Accuracy: {:.2f}%\n".format(acc.mean()*100))

# cross_val_predict
y_pred = cross_val_predict(MLPSearch, data_X, data_Y)

# print confusion matrix
conf_mat = confusion_matrix(data_Y, y_pred)
print('Confusion Matrix:\n', conf_mat)

# display classification report
print(classification_report(data_Y, y_pred))

#fit the final model
MLPSearch.fit(data_X,data_Y)
print(MLPSearch.best_params_)





Average Accuracy: 82.30%

Confusion Matrix:
 [[194  90]
 [ 60 492]]
              precision    recall  f1-score   support

          -1       0.76      0.68      0.72       284
           1       0.85      0.89      0.87       552

    accuracy                           0.82       836
   macro avg       0.80      0.79      0.79       836
weighted avg       0.82      0.82      0.82       836

{'MLPC__activation': 'identity', 'MLPC__hidden_layer_sizes': (100,)}
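
The per-class figures in the report can be recomputed by hand from the confusion matrix, which is a useful check on how precision, recall, and F-score relate. A minimal sketch, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, ordered (-1, 1):

```python
import numpy as np

# Confusion matrix from the output above: rows = true, columns = predicted.
conf_mat = np.array([[194, 90],
                     [60, 492]])
labels = [-1, 1]

for i, label in enumerate(labels):
    tp = conf_mat[i, i]
    precision = tp / conf_mat[:, i].sum()  # correct / all predicted as this class
    recall = tp / conf_mat[i, :].sum()     # correct / all truly in this class
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{label:>2}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Overall accuracy is the diagonal (correct predictions) over all records.
accuracy = np.trace(conf_mat) / conf_mat.sum()
print(f"accuracy={accuracy:.2f}")
```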


### Putting It to the Test
##### New Records
To roughly test this neural network model's ability to predict new records, I had two of my friends take the survey. Both are Star Wars fans (it's hard to get a variety during the quarantine), and the model predicted both correctly based on their responses.
##### Checking Weights
As mentioned before, the features "Location (Census Region)" and "Seen a Star Wars film" were dropped from the dataset. That decision came from the results of this cell block, where, for each input node, we summed the absolute values of its weights to every node in the hidden layer (a sum of 106 weights for each feature). We printed the sums that fell below the average, to get an idea of how widely these values ranged. Most values were close to the mean of ~46; however, the two features mentioned above each had a total of less than 6. This means that the trained network had very low weights connecting these input nodes to each node in the first hidden layer.
This implies that they had little to no effect on the classification, and for this reason, they were dropped at the beginning.

In [16]:
#sample record
record = [[1,1,1,1,1,1,5,6,4,2,3,1,3,4,5,5,4,3,2,0,3,4,4,1,5,4,0,1,1,-1,1,1,5,3],
         [1,1,1,1,1,1,5,6,4,2,3,1,5,4,4,3,4,1,4,3,0,5,5,1,4,5,0,-1,0,0,-1,1,4,3]]

print(MLPSearch.predict(record))
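#weight matrices of the best model; weights[0] connects the inputs to the first hidden layer
weights = MLPSearch.best_estimator_.named_steps["MLPC"].coefs_

#for each input node, sum the absolute values of its weights to the hidden layer
sum_of_weights = [sum(abs(weight) for weight in node) for node in weights[0]]

#list the sums that fall below the average
avg = sum(sum_of_weights)/len(sum_of_weights)
[weight for weight in sum_of_weights if weight < avg]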


[1 1]


[10.207444480294264,
 10.291860931078203,
 10.029060369661481,
 10.532192132306383,
 9.797763645073752,
 9.081594855196876,
 10.42113109736299,
 10.342573963853411,
 8.883889304290838,
 10.555974673364211,
 9.45201475131237,
 9.992752820487874,
 10.114170110107299,
 9.765772498114119,
 10.407174265321636,
 9.932844690230018,
 9.5344082429577]
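
The same per-feature weight summary can be written with NumPy instead of nested loops. The sketch below is an illustration on synthetic data (the feature names, data, and small network are made up), relying on the fact that `coefs_[0]` has one row per input feature and one column per first-layer hidden node.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic data, made up for illustration: 4 features, only the first is informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.where(X[:, 0] > 0, 1, -1)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

clf = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=500, random_state=0).fit(X, y)

# coefs_[0] has shape (n_features, n_hidden); sum |weight| across hidden nodes.
first_layer = clf.coefs_[0]
abs_sums = np.abs(first_layer).sum(axis=1)

for name, total in zip(feature_names, abs_sums):
    print(f"{name}: {total:.3f}")
```

Pairing each sum with its column name makes it easier to see which features the low totals belong to. Small totals on standardized inputs suggest, but do not prove, that a feature contributes little to the prediction.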