# Data Analysis of Star Wars Survey Dataset

**Chance Mason, Nicolas Arrieche Villegas, Mitchell Walker, Tyler Wittig**

## Part 4. Data Analysis - K-Nearest Neighbors

We will attempt to classify the following labels using the K-Nearest Neighbors method, with 5-fold cross validation:

 - 'Fan of Star Wars'
 - 'Which character shot first?'
 - 'Star Trek Fan'
 - 'Gender'
 - 'Age'
 - 'Household Income'
 - 'Education'
 - 'Location (Census Region)'

In [39]:
#initial imports
import warnings
warnings.simplefilter("ignore")
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import time
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV, cross_val_score, cross_val_predict, KFold
from sklearn.metrics import classification_report

In [40]:
%matplotlib inline

In [41]:
#read in numeric version of survey data

with open('column_names.txt', 'r') as cn:
    col_names = [line.strip() for line in cn]
    col_names.remove('RespondentID')
    
data = pd.read_csv('survey_numeric.csv')
print("Shape = ", data.shape)
data.head(10)

Shape =  (1186, 37)


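| Unnamed: 0 | Seen a Star Wars film | Fan of Star Wars | Seen The Phantom Menace | Seen Attack of the Clones | Seen Revenge of the Sith | Seen A New Hope | Seen The Empire Strikes Back | Seen Return of the Jedi | Rank for The Phantom Menace | Rank for Attack of the Clones | ... | View of Yoda | Which character shot first? | Familiar with the Expanded Universe? | Fan of the Expanded Universe? | Star Trek Fan | Gender | Age | Household Income | Education | Location (Census Region) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3.0 | 2.0 | ... | 2 | 0 | 1 | -1 | -1 | -1 | 1 | 0 | 2 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | ... | -100 | 0 | 0 | 0 | 1 | -1 | 1 | 1 | 4 | 2 |
| 2 | 1 | -1 | 1 | 1 | 1 | 0 | 0 | 0 | 1.0 | 2.0 | ... | -100 | 0 | -1 | 0 | -1 | -1 | 1 | 1 | 2 | 3 |
| 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5.0 | 6.0 | ... | 2 | 0 | -1 | 0 | 1 | -1 | 1 | 4 | 3 | 3 |
| 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5.0 | 4.0 | ... | 1 | 1 | 1 | -1 | -1 | -1 | 1 | 4 | 3 | 3 |
| 5 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1.0 | 4.0 | ... | 2 | -1 | 1 | -1 | 1 | -1 | 1 | 2 | 4 | 4 |
| 6 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 6.0 | 5.0 | ... | 2 | -1 | 1 | -1 | -1 | -1 | 1 | 0 | 2 | 5 |
| 7 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4.0 | 5.0 | ... | 2 | -1 | -1 | 0 | 1 | -1 | 1 | 0 | 2 | 1 |
| 8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5.0 | 4.0 | ... | 1 | -1 | -1 | 0 | -1 | -1 | 1 | 1 | 3 | 1 |
| 9 | 1 | -1 | 0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 2.0 | ... | 2 | 0 | -1 | 0 | -1 | -1 | 1 | 2 | 3 | 6 |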
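
The `Unnamed: 0` column in the output above looks like a row index that was saved along with the CSV. This is an assumption, since the file itself is not shown here. If it is a saved index, any model trained on `data` would treat the row number as a feature. A minimal sketch with a toy CSV (hypothetical values, not the survey data) shows how `index_col=0` keeps such a column out of the feature set:

```python
import io
import pandas as pd

# Toy CSV (hypothetical values) with an unnamed first column,
# which is what DataFrame.to_csv() writes when the index is kept.
toy_csv = io.StringIO(",Fan of Star Wars,Gender\n0,1,-1\n1,0,1\n2,-1,1\n")

# Default read: the unnamed first column becomes a regular column.
with_index_column = pd.read_csv(toy_csv)

# Rewind the buffer and read again, using the first column as the index.
toy_csv.seek(0)
without_index_column = pd.read_csv(toy_csv, index_col=0)

print(list(with_index_column.columns))
print(list(without_index_column.columns))
```

Whether to apply this to `survey_numeric.csv` depends on what that first column actually holds, so it is worth checking before rerunning the analysis.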


## 4.1 Separate Dataset into Features and Labels

In [42]:
#labels = ['Fan of Star Wars', 'Which character shot first?', 'Star Trek Fan', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)']
#features = [col for col in col_names if col not in labels]

#print('Features Columns:')
#data[features]

## 4.2 KNN Classifications
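The function below builds a KNN classifier for a given label. A pipeline scales the features, applies PCA to reduce dimensionality, and then fits the classifier. A grid search with 5-fold cross validation selects the number of principal components and the value of k, and the tuned model is itself evaluated with 5-fold cross validation.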

In [51]:
def scoreKNN(label):
    #split into features and label
    features = data.drop(label, axis=1)
    labels = data[[label]]
   
    
    #create a PCA
    pca = PCA()

    #create a KNN classifier
    knn = KNeighborsClassifier()
    
    #create a scaler that maps each feature to the [0, 1] range
    scaler = sk.preprocessing.MinMaxScaler()

    #create a pipeline that does scaling, then PCA, then KNN
    pipe = Pipeline(steps=[('scaler', scaler), ('pca', pca), ('knn', knn)])

    #Set up the parameters we want to test
    param_grid = {
       'pca__n_components': list(range(1, 19)), #find how many principal components to keep
       'knn__n_neighbors': list(range(1, 30)),  #find the best value of k
    }

    #nested CV: pass the same GridSearchCV (inner 5-fold) to cross_val_score and cross_val_predict (outer 5-fold), then print the accuracy
    search = GridSearchCV(pipe, param_grid, cv=5)
    scores = cross_val_score(estimator=search, X=features, y=labels.values.ravel(), cv=5, scoring = 'accuracy')
    predictions = cross_val_predict(estimator=search, X=features, y=labels.values.ravel(), cv=5)
    print('Accuracy:', scores.mean())
    print(classification_report(labels, predictions))
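
Because `scoreKNN` passes the grid search itself to `cross_val_score`, it never reports which hyperparameters were chosen. The sketch below repeats the same scaler, PCA, and KNN pipeline on synthetic data (an assumption: `make_classification` stands in for the survey, and the small grid is only for speed) and shows one way to inspect the selected values by fitting the search once on the full data. The names `X_demo`, `y_demo`, `demo_pipe`, `demo_grid`, and `demo_search` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic three-class data standing in for the survey (assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# Same three steps as scoreKNN: scaling, then PCA, then KNN.
demo_pipe = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('pca', PCA()),
    ('knn', KNeighborsClassifier()),
])

# Deliberately small grid so the sketch runs quickly.
demo_grid = {'pca__n_components': [2, 5], 'knn__n_neighbors': [3, 7]}

# Inner loop: 5-fold grid search. Outer loop: 5-fold accuracy estimate.
demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5)
outer_scores = cross_val_score(demo_search, X_demo, y_demo, cv=5, scoring='accuracy')
print('Outer-fold accuracies:', outer_scores)

# Fit the search once on all of the data to see which values it selects.
demo_search.fit(X_demo, y_demo)
print('Best parameters:', demo_search.best_params_)
```

The parameters selected on the full data can differ from those picked inside each outer fold, so they are best read as a summary rather than as the exact settings behind the reported accuracy.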

In [52]:
test_labels = ['Fan of Star Wars','Which character shot first?','Gender', 'Age','Household Income','Education','Location (Census Region)', 'Star Trek Fan']
for l in test_labels:
    print('\n'+l+'\n')
    scoreKNN(l)


Fan of Star Wars

Accuracy: 0.8633786404499582
              precision    recall  f1-score   support

          -1       0.75      0.67      0.71       284
           0       0.99      1.00      1.00       350
           1       0.84      0.88      0.86       552

    accuracy                           0.87      1186
   macro avg       0.86      0.85      0.86      1186
weighted avg       0.86      0.87      0.86      1186


Which character shot first?

Accuracy: 0.6812907269751924
              precision    recall  f1-score   support

          -1       0.51      0.79      0.62       325
           0       0.83      0.83      0.83       664
           1       0.15      0.02      0.03       197

    accuracy                           0.68      1186
   macro avg       0.50      0.54      0.49      1186
weighted avg       0.63      0.68      0.64      1186


Gender

Accuracy: 0.6213602411571795
              precision    recall  f1-score   support

          -1       0.59      0.53     

## 4.3 Interpretation of Results

Several of our label columns encode a neutral or missing answer as 0. Unlike in our Naive Bayes analysis, the 0 class does not seem to hurt classification: in several cases its precision and recall are higher than those of the other classes.

 - The classifier for 'Fan of Star Wars' performed quite well, with an overall accuracy of 87% and precision of at least 75% for every class.
 - The classifier for 'Which character shot first?' had an overall accuracy of 68%. On closer examination, it did a very poor job of identifying the 1 class: only 2% of records that chose 1 (Greedo) were correctly identified, and only 15% of those identified as picking 1 actually did. The 68% accuracy is driven largely by the relatively high precision and recall (both 83%) for the 0 class, which is also the largest class (664 of 1,186 records).
 - The classifier for 'Star Trek Fan' yielded our second-highest accuracy, 77%. Most precision and recall values were above 75%, and well above that for the 0 class. The lowest score was the precision of the 1 class: only 68% of records identified as fans of Star Trek were actually fans.
 - The classifier for 'Gender' performed slightly worse than the one for 'Which character shot first?', with an accuracy of 64%. Both precision and recall for the -1 class (male) were lower than 60%, and only 61% of records identified as 1 (female) were correctly labeled.
 - The remaining classifiers, for 'Age', 'Household Income', 'Education', and 'Location (Census Region)', all performed poorly, with accuracies below 40%. They performed best at identifying the 0 class, which is a blank answer. The other classes had very low precision and recall, so the overall accuracies would likely be even lower if the blank-answer class were removed.
 
These results show that, of the eight labels, the K-Nearest Neighbors classifier for 'Fan of Star Wars' performed best.
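
The per-class precision and recall figures discussed above come from `classification_report`. As a rough illustration of how they are derived, the sketch below computes them from a confusion matrix for a toy set of labels (hypothetical values using the same -1/0/1 coding, not the survey data). It also shows one simple way to check how accuracy changes when records whose true answer is blank (0) are left out of the evaluation. This only restricts scoring to non-blank records, which is an assumption about what "removing" the class would mean, and it does not retrain the model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical, not survey data) with the -1/0/1 coding.
y_true = np.array([-1, -1, -1, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([-1, -1, 0, 0, 0, 0, 0, -1, -1, 1])

classes = [-1, 0, 1]
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=classes)

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / all predicted as the class
recall = true_positives / cm.sum(axis=1)     # correct / all truly in the class

# Cross-check against scikit-learn's own per-class scores.
assert np.allclose(precision, precision_score(y_true, y_pred, labels=classes, average=None))
assert np.allclose(recall, recall_score(y_true, y_pred, labels=classes, average=None))

# Accuracy over all records, and over records whose true answer is not blank.
accuracy_all = np.mean(y_true == y_pred)
non_blank = y_true != 0
accuracy_non_blank = np.mean(y_true[non_blank] == y_pred[non_blank])

for cls, p, r in zip(classes, precision, recall):
    print(f'class {cls:>2}: precision={p:.2f} recall={r:.2f}')
print(f'accuracy (all)={accuracy_all:.2f}, accuracy (non-blank)={accuracy_non_blank:.2f}')
```

In this toy case the accuracy drops once the blank class is excluded from scoring, which is the pattern suspected for the demographic labels.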