# Data Analysis of Star Wars Survey Dataset

Chance Mason, Nicolas Arrieche Villegas, Mitchell Walker, Tyler Wittig

### Project Description

The Star Wars Survey dataset is a labeled dataset with 1186 records. It covers 15 survey questions (spread over 38 columns in the raw file) about respondents' opinions on the Star Wars franchise and about the respondents themselves, for example: "Do you consider yourself a Star Wars fan?", "Which is your favorite movie?", "Which character shot first?", "Gender", and "Income".

The dataset can be found [here](https://github.com/fivethirtyeight/data/tree/master/star-wars-survey).

Our aim is to train a classifier that either predicts a person's answers to some questions about the franchise from their answers to the other questions, or predicts a person's personal characteristics from their answers to questions about the franchise.

# Data Cleaning

In [1]:
import pandas as pd
pd.__version__  # version of pandas being used

'1.0.0'

In [2]:
# Enable inline mode for matplotlib so that Jupyter displays graphs
%matplotlib inline

### Read in the Raw Dataset
Note: The dataset file has a two-line header, and it contains non-ASCII characters that make `read_csv` raise an error under the default 'utf-8' encoding. Hence, the `header` and `encoding` parameters are set explicitly below.

In [3]:
raw_data = pd.read_csv('star_wars_survey.csv', header=[0,1], encoding='unicode_escape')
print("Shape = ", raw_data.shape)
raw_data.head()

Shape =  (1186, 38)


Unnamed: 0_level_0,RespondentID,Have you seen any of the 6 films in the Star Wars franchise?,Do you consider yourself to be a fan of the Star Wars film franchise?,Which of the following Star Wars films have you seen? Please select all that apply.,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,Unnamed: 8_level_0,Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.,...,Unnamed: 28_level_0,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?æ,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
Unnamed: 0_level_1,Unnamed: 0_level_1,Response,Response,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,Star Wars: Episode I The Phantom Menace,...,Yoda,Response,Response,Response,Response,Response,Response,Response,Response,Response
0,3292879998,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,3.0,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
1,3292879538,No,,,,,,,,,...,,,,,Yes,Male,18-29,"$0 - $24,999",Bachelor degree,West South Central
2,3292765271,Yes,No,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,,,,1.0,...,Unfamiliar (N/A),I don't understand this question,No,,No,Male,18-29,"$0 - $24,999",High school degree,West North Central
3,3292763116,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5.0,...,Very favorably,I don't understand this question,No,,Yes,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central
4,3292731220,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5.0,...,Somewhat favorably,Greedo,Yes,No,No,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central
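
To make the effect of `header=[0,1]` concrete, here is a minimal sketch on a tiny in-memory CSV. The toy file contents and the choice of `latin-1` as the encoding are assumptions for illustration only; the cell above uses `unicode_escape` on the real file.

```python
import io
import pandas as pd

# Hypothetical miniature of the survey file: two header rows, then data.
# The trailing "\u00e6" mimics the stray non-ASCII character in the real header.
csv_text = (
    "RespondentID,Which films have you seen?,,Fan of the Expanded Universe?\u00e6\n"
    ",Episode I,Episode II,Response\n"
    "1,Episode I,,Yes\n"
    "2,,Episode II,No\n"
)
raw_bytes = csv_text.encode("latin-1")  # assumed single-byte encoding

# header=[0, 1] tells pandas that the first two lines form the column labels.
toy = pd.read_csv(io.BytesIO(raw_bytes), header=[0, 1], encoding="latin-1")

print(toy.columns.nlevels)   # number of header levels
print(toy.shape)             # (rows, columns), header rows excluded
print(toy.columns.tolist())  # blank header cells show up as "Unnamed: ..." labels
```

Each column label becomes a pair (question, sub-label), which is why the output above shows two header rows and several `Unnamed: ..._level_0` entries.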


### Clean the Dataset Header
<p>To simplify the raw dataset header, we will combine its two rows into a single row and shorten the labels.</p>

<p>Note that the non-ASCII characters appeared at the end of the column label "Do you consider yourself to be a fan of the Expanded Universe?". Since the cleaned header no longer uses the original labels, the file can now be read with the default encoding.</p>

<p>We will also exclude the 'RespondentID' column from the new dataframe, as the respondent ID is only an identifier and carries no information useful for the analysis.</p>

In [4]:
# new labels
with open('column_names.txt', 'r') as cn:
    col_names = [line.strip() for line in cn]

# read in dataset with new header
data = pd.read_csv('star_wars_survey.csv',
    skiprows=1,  # skip the first raw header row
    header=0,  # treat the next line (second raw header row) as the header...
    names=col_names  # ...and replace it with the new labels
)

# remove RespondentID column
data.drop(['RespondentID'], axis=1, inplace=True)
col_names.remove('RespondentID')


print("Shape = ", data.shape)
data.head()

Shape =  (1186, 37)


Unnamed: 0,Has seen a Star Wars film,Star Wars Fan,Watched The Phantom Menace,Watched Attack of the Clones,Watched Revenge of the Sith,Watched A New Hope,Watched The Empire Strikes Back,Watched Return of the Jedi,Rank for The Phantom Menace,Rank for Attack of the Clones,...,View of Yoda,Which character shot first?,Familiar with the Expanded Universe?,Fan of the Expanded Universe?,Star Trek Fan,Gender,Age,Household Income,Education,Location (Census Region)
0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,3.0,2.0,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
1,No,,,,,,,,,,...,,,,,Yes,Male,18-29,"$0 - $24,999",Bachelor degree,West South Central
2,Yes,No,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,,,,1.0,2.0,...,Unfamiliar (N/A),I don't understand this question,No,,No,Male,18-29,"$0 - $24,999",High school degree,West North Central
3,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5.0,6.0,...,Very favorably,I don't understand this question,No,,Yes,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central
4,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5.0,4.0,...,Somewhat favorably,Greedo,Yes,No,No,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central
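
The new labels above come from a hand-written `column_names.txt`. As an alternative, a two-row header could also be collapsed programmatically. The following is only a sketch on a hypothetical toy header, not the method used in this notebook; the rule for merging labels (carry the last question forward, drop "Response" and "Unnamed" placeholders) is an assumption.

```python
import pandas as pd

# Hypothetical two-level header resembling the raw survey columns.
columns = pd.MultiIndex.from_tuples([
    ("RespondentID", "Unnamed: 0_level_1"),
    ("Which films have you seen?", "Episode I"),
    ("Unnamed: 2_level_0", "Episode II"),
    ("Gender", "Response"),
])
toy = pd.DataFrame([[1, "Episode I", None, "Male"]], columns=columns)

flat_labels = []
question = None
for top, bottom in toy.columns:
    if not top.startswith("Unnamed"):
        question = top  # a new survey question starts at this column
    if bottom.startswith("Unnamed") or bottom == "Response":
        flat_labels.append(question)  # single-answer question: keep the question text
    else:
        flat_labels.append(f"{question}: {bottom}")  # multi-part question: add the sub-label

toy.columns = flat_labels
print(toy.columns.tolist())
```

The generated labels are longer than hand-written ones, so a manual list remains a reasonable choice when short, readable names matter.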


### Check for Missing Value Errors
To double-check that no missing values are encoded as something other than NaN, we will compare the total number of missing entries in the dataset before and after running `replace` on common placeholder strings ('?', ' ', and the empty string). If the two totals are the same, no additional null values were found.

In [13]:
import numpy as np

# Check if any missing values are not formatted as NaN.

# compute the total number of NaN values before and after the replace calls
init_sum = data.isnull().sum().sum()
data.replace('?', np.nan, inplace=True)
data.replace(' ', np.nan, inplace=True)
data.replace('', np.nan, inplace=True)
final_sum = data.isnull().sum().sum()

if init_sum < final_sum:
    print('Found', final_sum - init_sum, 'new null values.')
else:
    print('Found no new null values.')

Found no new null values.
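
For comparison, here is a small sketch of the same before/after check on a toy dataframe that does contain placeholder strings. The toy data and the `placeholders` list are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey fragment with one real NaN and three placeholder strings.
toy = pd.DataFrame({
    "Star Wars Fan": ["Yes", "?", np.nan, "No"],
    "Gender": ["Male", " ", "", "Female"],
})
placeholders = ["?", " ", ""]  # assumed encodings of a missing answer

before = int(toy.isna().sum().sum())          # NaN count in the original frame
cleaned = toy.replace(placeholders, np.nan)   # returns a new frame; toy is unchanged
after = int(cleaned.isna().sum().sum())       # NaN count after the replacement

print(f"Found {after - before} new null values.")
```

Here the count rises because three cells held placeholders, whereas the survey data above showed no change.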


### Reformat Data!

<p>There are three survey questions which need closer examination. The first asked which of the 6 Star Wars films the respondent had watched, the second asked respondents to rank the films from best to worst (from 1 to 6), and the third asked respondents for their opinion on numerous characters.</p>

<p>Above, we split each of these questions into several columns. However, this greatly increases the dimensionality of the dataset, so we must either find a way to combine these features or remove some of these columns from the dataset.</p>
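
As one possible illustration of combining features, the "Watched ..." columns could be collapsed into a single count of films seen. This is a sketch on hypothetical toy rows, and the new column name `Films watched` is an assumption; it is not necessarily the approach taken later in this analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical fragment: a film title marks "watched", NaN marks "not watched".
toy = pd.DataFrame({
    "Watched The Phantom Menace": [
        "Star Wars: Episode I The Phantom Menace", np.nan,
        "Star Wars: Episode I The Phantom Menace",
    ],
    "Watched A New Hope": ["Star Wars: Episode IV A New Hope", np.nan, np.nan],
    "Watched Return of the Jedi": [
        "Star Wars: Episode VI Return of the Jedi", np.nan, np.nan,
    ],
    "Gender": ["Male", "Male", "Female"],
})

watched_cols = [c for c in toy.columns if c.startswith("Watched ")]

# notna() yields True where a title is present; summing across columns counts films.
toy["Films watched"] = toy[watched_cols].notna().sum(axis=1)

# Drop the per-film columns now that they are summarized by one feature.
reduced = toy.drop(columns=watched_cols)
print(reduced)
```

A count discards which films were watched, so it trades information for lower dimensionality; whether that trade is acceptable depends on the prediction target.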
