# Data Analysis of Star Wars Survey Dataset

#### Chance Mason, Nicolas Arrieche Villegas, Mitchell Walker, Tyler Wittig

---  
# Part 4. Data Analysis  

In the notebooks of Part 4, we will analyze how well different classification methods work on the following **labels**: 
* 'Fan of Star Wars'
* 'Which character shot first?'
* 'Star Trek Fan'
* 'Gender'
* 'Age'
* 'Household Income'
* 'Education'
* 'Location (Census Region)'   

The **classification methods** we will use in the Part 4 notebooks are:
* 4.1 Decision Trees
* 4.2 K-Nearest Neighbors
* 4.3 Gaussian Naive Bayes
* 4.4 Neural Networks

We will then interpret the **results** of each of our classification methods to determine how well each classifier performed on the data.  

---  

# 4.1 Decision Tree Classifier

We first evaluate a decision tree classifier on each of the labels above, using 10-fold cross-validation.
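As a quick illustration of what 10-fold cross-validation measures, here is a minimal sketch on synthetic data. The dataset from `make_classification`, the `max_depth=5` setting, and the `*_demo` names are assumptions made for this example only; they are not the survey data or the tuned model used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

clf_demo = DecisionTreeClassifier(max_depth=5, random_state=0)

# cv=10 splits the data into 10 folds; each fold is held out once for testing
# while the tree is trained on the other nine
fold_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=10, scoring="accuracy")

print("Per-fold accuracy:", fold_scores.round(2))
print("Mean accuracy: %.1f%%" % (fold_scores.mean() * 100))
```

The mean of the ten per-fold scores is the kind of figure reported for each label in this notebook.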

In [1]:
import warnings
warnings.simplefilter("ignore")

import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

In [2]:
%matplotlib inline

### Read CSV

In [3]:
# Read the data from csv file
with open('column_names.txt', 'r') as cn:
    col_names = [line.strip() for line in cn]
    col_names.remove('RespondentID')
    
data = pd.read_csv('survey_numeric.csv')
print("Shape = ", data.shape)

Shape =  (1186, 37)


### Separate Features Columns from Labels  

Here we split the columns into features and labels.

The `labels` list holds each column we will attempt to classify, and the `features` list holds every remaining column, which we use as input to the classifiers.

In [6]:
labels = ['Fan of Star Wars', 'Which character shot first?', 'Star Trek Fan', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)']
features = [col for col in col_names if col not in labels]

print('Features Columns:')
data[features]

Features Columns:


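| Unnamed: 0 | Seen a Star Wars film | Seen The Phantom Menace | Seen Attack of the Clones | Seen Revenge of the Sith | Seen A New Hope | Seen The Empire Strikes Back | Seen Return of the Jedi | Rank for The Phantom Menace | Rank for Attack of the Clones | Rank for Revenge of the Sith | ... | View of Darth Vader | View of Lando Calrissian | View of Boba Fett | View of C-3P0 | View of R2 D2 | View of Jar Jar Binks | View of Padme Amidala | View of Yoda | Familiar with the Expanded Universe? | Fan of the Expanded Universe? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3.0 | 2.0 | 1.0 | ... | 2 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 1 | -1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1.0 | 2.0 | 3.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 | 0 |
| 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5.0 | 6.0 | 1.0 | ... | 2 | 1 | -1 | 2 | 2 | 2 | 2 | 2 | -1 | 0 |
| 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5.0 | 4.0 | 6.0 | ... | 1 | 0 | 2 | 1 | 1 | -2 | 1 | 1 | 1 | -1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1181 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5.0 | 4.0 | 6.0 | ... | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 2 | -1 | 0 |
| 1182 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4.0 | 5.0 | 6.0 | ... | -2 | 1 | 0 | 1 | 2 | -1 | -1 | 2 | -1 | 0 |
| 1183 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1184 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4.0 | 3.0 | 6.0 | ... | 2 | 1 | 2 | 1 | 1 | 2 | 1 | 2 | -1 | 0 |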
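The feature/label split can be sketched on a toy frame. The column values and the `toy*` names below are hypothetical and only mirror the shape of the survey data. The sketch also shows the array shapes scikit-learn expects: a 2-D feature matrix and a 1-D target.

```python
import pandas as pd

# Toy frame with made-up values; the real survey has 37 columns
toy = pd.DataFrame({
    "RespondentID": [101, 102, 103],
    "Seen a Star Wars film": [1, 0, 1],
    "View of Yoda": [2, 0, 1],
    "Gender": [0, 1, 1],
    "Age": [2, 3, 1],
})

toy_labels = ["Gender", "Age"]
# Everything that is neither an identifier nor a label is treated as a feature
toy_features = [c for c in toy.columns
                if c not in toy_labels and c != "RespondentID"]

X_toy = toy[toy_features].values  # 2-D array, shape (n_samples, n_features)
y_toy = toy["Gender"].values      # 1-D array, shape (n_samples,)

print(toy_features)
print(X_toy.shape, y_toy.shape)
```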


### Decision Tree Classifications

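The function below tunes a decision tree for a given label with a grid search over `max_depth`, `max_features`, and `min_samples_leaf`, and returns the mean cross-validated accuracy as a percentage.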

In [17]:
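def gridSearch(label):
    # Separate the features from the class label
    X = data.loc[:, features].values
    # Use a 1-D target array; a column vector of shape (n, 1) raises a DataConversionWarning
    y = data.loc[:, label].values

    # Hyperparameter grid searched within each training split
    params = {"max_depth": [5, 10, 15, 20], "max_features": [5, 15, 25], "min_samples_leaf": [5, 10, 15]}
    clf = tree.DecisionTreeClassifier()

    # Inner 10-fold CV selects the best parameters by accuracy
    grid_search = GridSearchCV(clf, params, cv=10, scoring='accuracy')

    # Outer 10-fold CV (nested): cross_val_score clones and refits the grid search
    # on each training split, so a separate grid_search.fit(X, y) call is not needed
    score = cross_val_score(grid_search, X, y, cv=10)
    return score.mean() * 100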
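The function above returns only an accuracy figure, so the selected hyperparameters are never shown. The following sketch, on synthetic data with a deliberately small grid (both are assumptions for illustration, as are the `*_demo`, `small_grid`, and `search` names), shows how a fitted `GridSearchCV` exposes its choice through `best_params_`, and how wrapping the search in `cross_val_score` yields a nested estimate.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

small_grid = {"max_depth": [3, 5], "min_samples_leaf": [5, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), small_grid, cv=5, scoring="accuracy"
)

# Fitting the search directly exposes the selected hyperparameters
search.fit(X_demo, y_demo)
print("Best parameters:", search.best_params_)
print("Inner CV accuracy: %.3f" % search.best_score_)

# Wrapping the search in cross_val_score gives a nested estimate:
# each outer fold reruns the whole grid search on its training portion
nested_scores = cross_val_score(search, X_demo, y_demo, cv=5)
print("Nested CV accuracy: %.3f" % nested_scores.mean())
```

The nested figure is usually the more conservative of the two, because the data used to score each outer fold played no part in choosing the parameters for that fold.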


We will now pass each label into the above function.

In [19]:
# run the decision tree grid search on all labels to see which works best
for l in labels:
    print("\nThe average accuracy for the '" + l + "' label is:")
    score = gridSearch(l)
    print(score)


The average accuracy for the 'Fan of Star Wars' label is:
83.22389972938329

The average accuracy for the 'Which character shot first?' label is:
65.34111949864692

The average accuracy for the 'Star Trek Fan' label is:
71.00128186867967

The average accuracy for the 'Gender' label is:
62.80729240848883

The average accuracy for the 'Age' label is:
39.1190713573565

The average accuracy for the 'Household Income' label is:
31.02834354080615

The average accuracy for the 'Education' label is:
38.1035465033471

The average accuracy for the 'Location (Census Region)' label is:
23.01808859136875


### Interpretation of Results

The two best-performing classifiers were those for the 'Fan of Star Wars' label, with an accuracy of about 83%, and the 'Star Trek Fan' label, with about 71%. 
The classifiers for the 'Which character shot first?' and 'Gender' labels were somewhat less accurate, at about 65% and 63% respectively.
The remaining four classifiers all fall below 40% accuracy: 'Age' (39%), 'Education' (38%), 'Household Income' (31%), and, worst of all, 'Location (Census Region)' (23%).
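Raw accuracy is easier to interpret next to a trivial baseline, since labels with more classes are harder to guess. A minimal sketch of such a comparison, using scikit-learn's `DummyClassifier` on a made-up four-class label (the class proportions, data, and `*_demo` names are assumptions, not survey results), might look like this:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical 4-class label with uneven class frequencies (not the survey data)
y_demo = rng.choice(4, size=400, p=[0.4, 0.3, 0.2, 0.1])
X_demo = rng.normal(size=(400, 5))  # features are ignored by the dummy model

# Always predicts the most frequent class seen in the training split
baseline = DummyClassifier(strategy="most_frequent")
baseline_acc = cross_val_score(
    baseline, X_demo, y_demo, cv=10, scoring="accuracy"
).mean()

print("Majority-class baseline: %.1f%%" % (baseline_acc * 100))
```

Running the same baseline on each survey label would show how much of each reported accuracy comes from the features and how much from class imbalance alone.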