# Data Analysis of Star Wars Survey Dataset

#### Chance Mason, Nicolas Arrieche Villegas, Mitchell Walker, Tyler Wittig

## Part 4. Data Analysis - Naive Bayes

We will attempt to classify the following labels using Gaussian Naive Bayes with 10-fold cross-validation:
* 'Fan of Star Wars'
* 'Which character shot first?'
* 'Star Trek Fan'
* 'Gender'
* 'Age'
* 'Household Income'
* 'Education'
* 'Location (Census Region)'  

---

In [32]:
import warnings
warnings.simplefilter("ignore")

import pandas as pd
import numpy as np

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, KFold

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report

In [33]:
%matplotlib inline

In [34]:
# Read the data from csv file
with open('column_names.txt', 'r') as cn:
    col_names = [line.strip() for line in cn]
    col_names.remove('RespondentID')
    
data = pd.read_csv('survey_numeric.csv')
print("Shape = ", data.shape)
data.head()

Shape =  (1186, 37)


Unnamed: 0,Seen a Star Wars film,Fan of Star Wars,Seen The Phantom Menace,Seen Attack of the Clones,Seen Revenge of the Sith,Seen A New Hope,Seen The Empire Strikes Back,Seen Return of the Jedi,Rank for The Phantom Menace,Rank for Attack of the Clones,...,View of Yoda,Which character shot first?,Familiar with the Expanded Universe?,Fan of the Expanded Universe?,Star Trek Fan,Gender,Age,Household Income,Education,Location (Census Region)
0,1,1,1,1,1,1,1,1,3.0,2.0,...,2,0,1,-1,-1,-1,1,0,2,1
1,0,0,0,0,0,0,0,0,0.0,0.0,...,0,0,0,0,1,-1,1,1,4,2
2,1,-1,1,1,1,0,0,0,1.0,2.0,...,0,0,-1,0,-1,-1,1,1,2,3
3,1,1,1,1,1,1,1,1,5.0,6.0,...,2,0,-1,0,1,-1,1,4,3,3
4,1,1,1,1,1,1,1,1,5.0,4.0,...,1,1,1,-1,-1,-1,1,4,3,3


### 4.1 Separate Features Columns from Labels

In [43]:
labels = ['Fan of Star Wars', 'Which character shot first?', 'Star Trek Fan', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)']
features = [col for col in col_names if col not in labels]

print('Features Columns:')
data[features]

Features Columns:


Unnamed: 0,Seen a Star Wars film,Seen The Phantom Menace,Seen Attack of the Clones,Seen Revenge of the Sith,Seen A New Hope,Seen The Empire Strikes Back,Seen Return of the Jedi,Rank for The Phantom Menace,Rank for Attack of the Clones,Rank for Revenge of the Sith,...,View of Darth Vader,View of Lando Calrissian,View of Boba Fett,View of C-3P0,View of R2 D2,View of Jar Jar Binks,View of Padme Amidala,View of Yoda,Familiar with the Expanded Universe?,Fan of the Expanded Universe?
0,1,1,1,1,1,1,1,3.0,2.0,1.0,...,2,0,0,2,2,2,2,2,1,-1
1,0,0,0,0,0,0,0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,0,0,0,1.0,2.0,3.0,...,0,0,0,0,0,0,0,0,-1,0
3,1,1,1,1,1,1,1,5.0,6.0,1.0,...,2,1,-1,2,2,2,2,2,-1,0
4,1,1,1,1,1,1,1,5.0,4.0,6.0,...,1,0,2,1,1,-2,1,1,1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1181,1,1,1,1,1,1,1,5.0,4.0,6.0,...,1,1,1,2,2,1,1,2,-1,0
1182,1,1,1,1,1,1,1,4.0,5.0,6.0,...,-2,1,0,1,2,-1,-1,2,-1,0
1183,0,0,0,0,0,0,0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
1184,1,1,1,1,1,1,1,4.0,3.0,6.0,...,2,1,2,1,1,2,1,2,-1,0


### 4.2 Naive Bayes Classifications
Below is a function that automates Gaussian Naive Bayes classification with 10-fold cross-validation. It predicts whichever label name is passed in as a parameter, then displays the mean accuracy, confusion matrix, and classification report of the resulting classification.

In [40]:
def scoreNB(label):
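    # separate the features from the class label
    X = data.loc[:, features].values
    y = data[label].values  # 1-D label array, the shape scikit-learn expects for y
    
    # initialize classifier; no separate fit is needed here because
    # cross_val_score and cross_val_predict fit a clone on each fold
    clf = GaussianNB()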
    
    # 10-fold cross validation
    k_fold = KFold(n_splits=10, shuffle=False, random_state=None)

    # display accuracy
    print('Accuracy:', cross_val_score(clf, X, y, cv=k_fold).mean())
    
    # cross_val_predict
    y_pred = cross_val_predict(clf, X, y, cv=k_fold)

    # print confusion matrix
    conf_mat = confusion_matrix(y, y_pred)
    print('Confusion Matrix:\n', conf_mat)

    # display classification report
    print(classification_report(y, y_pred))
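
The function reports two kinds of accuracy: the mean of the ten per-fold scores from `cross_val_score`, and the pooled figure in the classification report, which comes from the out-of-fold predictions of `cross_val_predict`. The sketch below illustrates the relationship on synthetic data. The `make_classification` dataset and the `*_demo` names are assumptions made for illustration; they are not the survey data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic three-class data standing in for the survey features (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# Same splitter settings as scoreNB: 10 consecutive, unshuffled folds.
k_fold = KFold(n_splits=10, shuffle=False)
clf = GaussianNB()

# Mean of the ten per-fold accuracies.
fold_scores = cross_val_score(clf, X_demo, y_demo, cv=k_fold)
mean_fold_acc = fold_scores.mean()

# Out-of-fold predictions: every row is predicted exactly once,
# by a model that did not see it during training.
y_pred = cross_val_predict(clf, X_demo, y_demo, cv=k_fold)
pooled_acc = accuracy_score(y_demo, y_pred)
conf_mat = confusion_matrix(y_demo, y_pred)

print('Mean of per-fold accuracies:', mean_fold_acc)
print('Pooled out-of-fold accuracy:', pooled_acc)
print('Confusion Matrix:\n', conf_mat)
```

When all folds have the same size, as in this sketch, the two accuracies agree. With 1186 records the folds differ by one record, so the two figures can differ very slightly.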

We will now pass each label into the above function.

In [44]:
# run NB on all labels to see which works best
for l in labels:
    print('\n' + l + '\n')
    scoreNB(l)


Fan of Star Wars

Accuracy: 0.8769192422731804
Confusion Matrix:
 [[212   0  72]
 [  0 350   0]
 [ 74   0 478]]
              precision    recall  f1-score   support

          -1       0.74      0.75      0.74       284
           0       1.00      1.00      1.00       350
           1       0.87      0.87      0.87       552

    accuracy                           0.88      1186
   macro avg       0.87      0.87      0.87      1186
weighted avg       0.88      0.88      0.88      1186


Which character shot first?

Accuracy: 0.6036889332003988
Confusion Matrix:
 [[258  10  57]
 [125 414 125]
 [140  13  44]]
              precision    recall  f1-score   support

          -1       0.49      0.79      0.61       325
           0       0.95      0.62      0.75       664
           1       0.19      0.22      0.21       197

    accuracy                           0.60      1186
   macro avg       0.55      0.55      0.52      1186
weighted avg       0.70      0.60      0.62      1186
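
As a cross-check on the report above, the per-class and weighted-average figures can be recomputed directly from the confusion matrix. The sketch below hard-codes the matrix printed for 'Which character shot first?' and assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in the order -1, 0, 1.

```python
import numpy as np

# Confusion matrix copied from the 'Which character shot first?' output
# (rows = true class, columns = predicted class; class order -1, 0, 1).
conf_mat = np.array([
    [258,  10,  57],
    [125, 414, 125],
    [140,  13,  44],
])

true_pos = np.diag(conf_mat)      # correctly predicted records per class
support = conf_mat.sum(axis=1)    # records that truly belong to each class
predicted = conf_mat.sum(axis=0)  # records predicted to be in each class

precision = true_pos / predicted  # share of each class's predictions that were right
recall = true_pos / support       # share of each class's records that were found
accuracy = true_pos.sum() / conf_mat.sum()

# 'weighted avg' precision: per-class precision weighted by class support.
weighted_precision = (precision * support).sum() / support.sum()

print('Precision per class:', precision.round(2))
print('Recall per class:   ', recall.round(2))
print('Accuracy:           ', round(accuracy, 2))
print('Weighted precision: ', round(weighted_precision, 2))
```

Rounded to two decimals, these values match the classification report. The weighted precision is an average of per-class precisions, so it can exceed the accuracy when a large class has high precision.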




### 4.3 Interpretation of Results
Many of our label columns encode missing values and neutral answers as class 0. This class 0 turned out to be the biggest obstacle for most of our classifiers.

* The classifier for 'Fan of Star Wars' yielded an accuracy of roughly 88%, with weighted-average precision and recall also at 88%. Per-class precision and recall are also all about 75% or higher.  

* The classifier for 'Which character shot first?' yielded a weighted-average precision of 70%, but its accuracy and weighted-average recall were both 60%. That is, only 60% of all records were correctly predicted; the 70% figure is the average of the per-class precisions, weighted by the number of records in each class. Moreover, the classification report shows that only 49% of records predicted to be in class -1 were actually in class -1, and only 19% of records predicted to be in class 1 were actually in class 1. The weighted-average precision of 70% therefore comes from class 0, the largest class, where 95% of the records predicted to be in class 0 were in class 0. In other words, the classifier's success here was in predicting who answered that they did not understand the question, not in predicting whether they thought Han or Greedo shot first. Thus, our classifier did not perform well on this label.  

* The classifier for 'Star Trek Fan' yielded an accuracy of only about 52%. The classification report shows that one of the issues was with the 0 class. Of the 350 records in class 0, only 110 were predicted to be in class 0 (recall of 31%). However, of the 118 records predicted to be in class 0, 110 were correct (precision of about 93%). Since the 0 class represents respondents who did not answer 'Yes' or 'No' to being a fan of Star Trek, we see the classifier doesn't perform well with this middle option, and perhaps records with a value of 0 should have been excluded.  

* Having a class 0 to represent an unknown answer for 'Gender' and 'Age' also hurt their classifiers' predictions. The confusion matrix for 'Gender' shows that many records which were 'Female' (class 1) were predicted as 'Unknown' (class 0), and the confusion matrix for 'Age' shows that the majority of records in every class except class 0 were incorrectly predicted. The same was true of the classifiers for 'Household Income', 'Education', and 'Location (Census Region)'.  

Thus, our classifier for 'Fan of Star Wars' yielded the best results, even with a neutral 0 class.
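
One possible follow-up to the point about class 0 is to drop records whose label is 0 before classifying. The sketch below shows the idea on a small synthetic table. The `demo` DataFrame, its column names, and the helper `score_nb_without_zero` are all hypothetical and are not part of the analysis above, so the printed accuracy says nothing about the survey data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the numeric survey table (NOT the real dataset).
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    'feature_a': rng.integers(-2, 3, size=n),
    'feature_b': rng.integers(0, 7, size=n).astype(float),
    'label': rng.choice([-1, 0, 1], size=n),
})
demo_features = ['feature_a', 'feature_b']


def score_nb_without_zero(df, feature_cols, label_col):
    """Hypothetical helper: drop rows whose label is 0, then run 10-fold Gaussian NB."""
    subset = df[df[label_col] != 0]
    X = subset[feature_cols].values
    y = subset[label_col].values  # 1-D label array
    k_fold = KFold(n_splits=10, shuffle=False)
    scores = cross_val_score(GaussianNB(), X, y, cv=k_fold)
    return subset, scores.mean()


subset, mean_acc = score_nb_without_zero(demo, demo_features, 'label')
print('Rows kept:', len(subset), 'of', len(demo))
print('Mean accuracy without class 0:', mean_acc)
```

Filtering this way changes the question being asked: the classifier then only separates the remaining answers, so its scores are not directly comparable with the three-class results reported above.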