## Part 4. Data Analysis - Decision Trees using K-Fold Cross Validation

We train a decision tree classifier to predict each of the following labels from the remaining survey responses:
* 'Fan of Star Wars'
* 'Which character shot first?'
* 'Star Trek Fan'
* 'Gender'
* 'Age'
* 'Household Income'
* 'Education'
* 'Location (Census Region)'  
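
As a reminder of what K-fold cross-validation does, the sketch below splits a small synthetic data set into 5 folds and prints how many rows are used for training and testing in each round. The arrays `X_demo` and `y_demo` are made up for illustration and are not the survey data. When an integer `cv` is passed together with a classifier, scikit-learn uses stratified folds, which is why `StratifiedKFold` is used here.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption: not the survey data)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# Each fold serves exactly once as the test set; the rest is used for training
skf = StratifiedKFold(n_splits=5)
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X_demo, y_demo)):
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"fold {fold}: train={len(train_idx)} rows, test={len(test_idx)} rows")
```

The reported accuracy for a label is then the mean of the per-fold test accuracies.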

---

In [199]:
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

In [200]:
%matplotlib inline

In [201]:
# Read the column names (one per line) and drop the respondent ID
with open('column_names.txt', 'r') as cn:
    col_names = [line.strip() for line in cn]
col_names.remove('RespondentID')

# Load the numerically encoded survey responses
data = pd.read_csv('survey_numeric.csv')
print("Shape = ", data.shape)
data.head()

Shape =  (1186, 37)


Unnamed: 0,Seen a Star Wars film,Fan of Star Wars,Seen The Phantom Menace,Seen Attack of the Clones,Seen Revenge of the Sith,Seen A New Hope,Seen The Empire Strikes Back,Seen Return of the Jedi,Rank for The Phantom Menace,Rank for Attack of the Clones,...,View of Yoda,Which character shot first?,Familiar with the Expanded Universe?,Fan of the Expanded Universe?,Star Trek Fan,Gender,Age,Household Income,Education,Location (Census Region)
0,1,1,1,1,1,1,1,1,3.0,2.0,...,2,0,1,-1,-1,-1,1,0,2,1
1,0,0,0,0,0,0,0,0,0.0,0.0,...,0,0,0,0,1,-1,1,1,4,2
2,1,-1,1,1,1,0,0,0,1.0,2.0,...,0,0,-1,0,-1,-1,1,1,2,3
3,1,1,1,1,1,1,1,1,5.0,6.0,...,2,0,-1,0,1,-1,1,4,3,3
4,1,1,1,1,1,1,1,1,5.0,4.0,...,1,1,1,-1,-1,-1,1,4,3,3


### Separating labels from data frame and isolating features.

In [202]:
# Columns we will try to predict
label_names = ['Fan of Star Wars', 'Which character shot first?', 'Star Trek Fan', 'Gender',
               'Age', 'Household Income', 'Education', 'Location (Census Region)']

# One Series per label, in the same order as label_names
labels = [data[name] for name in label_names]

# Every column that is not a label is used as a feature
features = data.drop(columns=label_names)
features.head()

Unnamed: 0,Seen a Star Wars film,Seen The Phantom Menace,Seen Attack of the Clones,Seen Revenge of the Sith,Seen A New Hope,Seen The Empire Strikes Back,Seen Return of the Jedi,Rank for The Phantom Menace,Rank for Attack of the Clones,Rank for Revenge of the Sith,...,View of Darth Vader,View of Lando Calrissian,View of Boba Fett,View of C-3P0,View of R2 D2,View of Jar Jar Binks,View of Padme Amidala,View of Yoda,Familiar with the Expanded Universe?,Fan of the Expanded Universe?
0,1,1,1,1,1,1,1,3.0,2.0,1.0,...,2,0,0,2,2,2,2,2,1,-1
1,0,0,0,0,0,0,0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,0,0,0,1.0,2.0,3.0,...,0,0,0,0,0,0,0,0,-1,0
3,1,1,1,1,1,1,1,5.0,6.0,1.0,...,2,1,-1,2,2,2,2,2,-1,0
4,1,1,1,1,1,1,1,5.0,4.0,6.0,...,1,0,2,1,1,-2,1,1,1,-1
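
The `Unnamed: 0` column in the output above looks like a row index that was written into the CSV by an earlier export. This is an assumption, since the step that created `survey_numeric.csv` is not shown here. If that is the case, the column carries no survey information, yet it is still part of `features` and the tree is free to split on it. The sketch below uses a tiny made-up CSV string to show two ways such a column could be kept out of the feature set.

```python
import io
import pandas as pd

# Tiny synthetic CSV standing in for survey_numeric.csv (assumption: made-up values)
csv_text = ",Seen a Star Wars film,Gender\n0,1,-1\n1,0,1\n2,1,-1\n"

# Default read: the unnamed first column shows up as 'Unnamed: 0'
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# Option 1: tell pandas the first column is the index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))

# Option 2: drop the column after reading
dropped = raw.drop(columns=["Unnamed: 0"])
print(list(dropped.columns))
```

Either option would change the shape and possibly the accuracies reported in this notebook, so the results here would need to be rerun after applying it.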


### Function that tunes the tree with a grid search and scores it with cross-validation.

In [203]:
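def gridSearch(label):
    """Return the mean nested cross-validation accuracy (in %) for one label."""
    # Hyperparameter grid that is searched inside every outer fold
    params = {"max_depth": [5, 10, 15, 20], "max_features": [5, 15, 25], "min_samples_leaf": [5, 10, 15]}
    clf = tree.DecisionTreeClassifier()

    # Inner loop: a 10-fold grid search picks the best parameters by accuracy
    grid_search = GridSearchCV(clf, params, cv=10, scoring='accuracy')

    # Outer loop: cross_val_score clones and refits grid_search on each of the
    # 10 training splits, so a separate grid_search.fit() call is not needed
    score = cross_val_score(grid_search, features, label, cv=10)
    return score.mean() * 100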
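
The `gridSearch` function above returns only the averaged score, so the parameters chosen by the search are never displayed. The following is a minimal sketch of how they could be inspected with a single `GridSearchCV` fit. The helper name `best_tree_params`, the synthetic arrays `X_demo` and `y_demo`, and the fixed `random_state=0` are assumptions added for illustration; they are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

def best_tree_params(X, y, cv=10):
    """Hypothetical helper: fit one grid search and report what it selected."""
    params = {
        "max_depth": [5, 10, 15, 20],
        "max_features": [5, 15, 25],
        "min_samples_leaf": [5, 10, 15],
    }
    # random_state is fixed only so the demo is repeatable (assumption)
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), params,
                          cv=cv, scoring="accuracy")
    search.fit(X, y)
    # best_score_ is the mean cross-validated accuracy of the best setting
    return search.best_params_, search.best_score_ * 100

# Synthetic stand-in data with 28 columns (assumption: not the survey data)
rng = np.random.default_rng(0)
X_demo = rng.integers(-2, 3, size=(200, 28))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

demo_params, demo_score = best_tree_params(X_demo, y_demo, cv=5)
print(demo_params, round(demo_score, 2))
```

Note that `best_score_` comes from the same folds that were used to pick the parameters, so it tends to be optimistic. The nested score returned by `gridSearch` is the more conservative estimate.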


### Printing accuracy of the tuned, cross-validated model.

In [204]:
for name, label in zip(label_names, labels):
    print("\nThe average accuracy for the '" + name + "' label is:")
    score = gridSearch(label)
    print(score)


The average accuracy for the 'Fan of Star Wars' label is:
83.88294639889855

The average accuracy for the 'Which character shot first?' label is:
65.26937137306031

The average accuracy for the 'Star Trek Fan' label is:
70.26557659134927

The average accuracy for the 'Gender' label is:
63.805756113831876

The average accuracy for the 'Age' label is:
38.10833661492891

The average accuracy for the 'Household Income' label is:
31.020565879658562

The average accuracy for the 'Education' label is:
37.65908489198963

The average accuracy for the 'Location (Census Region)' label is:
23.494771034540936
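
These accuracies are hard to judge on their own because the labels have different numbers of classes and different class balances. One possible reference point is the accuracy of always predicting the most frequent class. The sketch below computes that baseline with scikit-learn's `DummyClassifier`. The helper name `baseline_accuracy` and the synthetic data are assumptions for illustration; no baseline figures for the survey data are claimed here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

def baseline_accuracy(X, y, cv=10):
    """Hypothetical helper: mean CV accuracy (%) of predicting the most frequent class."""
    dummy = DummyClassifier(strategy="most_frequent")
    return cross_val_score(dummy, X, y, cv=cv, scoring="accuracy").mean() * 100

# Synthetic stand-in data: three classes with a 70/20/10 split (assumption)
rng = np.random.default_rng(0)
X_demo = rng.integers(-2, 3, size=(300, 5))
y_demo = np.array([0] * 210 + [1] * 60 + [2] * 30)

demo_baseline = baseline_accuracy(X_demo, y_demo)
print(demo_baseline)
```

In the notebook, calling `baseline_accuracy(features, label)` for each entry of `labels` would give a per-label figure to compare against the tuned tree.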
