# Data Analysis of Star Wars Survey Dataset

#### Chance Mason, Nicolas Arrieche Villegas, Mitchell Walker, Tyler Wittig

The Star Wars Survey dataset is a labeled dataset with 1186 records. It has 15 features, which load as 38 columns because several questions span multiple columns. The features are survey questions about people's opinions on the Star Wars franchise, plus some personal information. Examples include "Do you consider yourself a Star Wars fan?", "Which is your favorite movie?", "Which character shot first?", "Gender", and "Income".

More information about the dataset can be found [here](https://github.com/fivethirtyeight/data/tree/master/star-wars-survey).

---

## Part 1. Data Preparation

In [34]:
import pandas as pd
import numpy as np

# Enable inline mode for matplotlib so that Jupyter displays graphs
%matplotlib inline

pd.__version__  # version of pandas being used

'1.0.0'

### 1.1 Read in the Raw Dataset
For this dataset, the header spans two lines, and the file contains non-ASCII characters that cause read_csv to raise an error under the default encoding (UTF-8).

To read in and display the raw dataset, we pass the header and encoding parameters.

In [21]:
raw_data = pd.read_csv(
    'raw_survey.csv', 
    header=[0,1], 
    encoding='unicode_escape'
)

print("Shape = ", raw_data.shape)
raw_data.head()

Shape =  (1186, 38)


Unnamed: 0_level_0,RespondentID,Have you seen any of the 6 films in the Star Wars franchise?,Do you consider yourself to be a fan of the Star Wars film franchise?,Which of the following Star Wars films have you seen? Please select all that apply.,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,Unnamed: 8_level_0,Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.,...,Unnamed: 28_level_0,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?æ,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
Unnamed: 0_level_1,Unnamed: 0_level_1,Response,Response,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,Star Wars: Episode I The Phantom Menace,...,Yoda,Response,Response,Response,Response,Response,Response,Response,Response,Response
0,3292879998,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,3.0,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
1,3292879538,No,,,,,,,,,...,,,,,Yes,Male,18-29,"$0 - $24,999",Bachelor degree,West South Central
2,3292765271,Yes,No,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,,,,1.0,...,Unfamiliar (N/A),I don't understand this question,No,,No,Male,18-29,"$0 - $24,999",High school degree,West North Central
3,3292763116,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5.0,...,Very favorably,I don't understand this question,No,,Yes,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central
4,3292731220,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5.0,...,Somewhat favorably,Greedo,Yes,No,No,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central
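
The effect of the two parameters can be reproduced on a small in-memory file. The sketch below is only an illustration: the bytes, the shortened labels, and the name toy are made up, and it uses the latin-1 codec instead of the unicode_escape codec used above, on the assumption that a codec which maps every byte to a character is enough to get past the stray byte.

```python
import io

import pandas as pd

# Hypothetical two-row header with a stray non-ASCII byte (0xE6 is "æ" in Latin-1)
raw_bytes = (
    b"RespondentID,Fan of the Expanded Universe?\xe6,Gender\n"
    b",Response,Response\n"
    b"1,Yes,Male\n"
    b"2,No,Female\n"
)

# Under the default UTF-8 decoding, the stray byte is not valid input
try:
    pd.read_csv(io.BytesIO(raw_bytes), header=[0, 1])
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte to a character, so the same bytes load
toy = pd.read_csv(io.BytesIO(raw_bytes), header=[0, 1], encoding="latin-1")

print("UTF-8 read succeeded:", utf8_ok)
print("Header levels:", toy.columns.nlevels)
print("Shape = ", toy.shape)
```

With header=[0, 1], each column label becomes a pair (top row, second row), which is why blank header cells appear as "Unnamed: ..." placeholders in the table above.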


### 1.2 Non-ASCII Characters
Note that non-ASCII characters were mistakenly entered at the end of the column label "Do you consider yourself to be a fan of the Expanded Universe?" (they appear as "æ" in the output above).

These characters are removed as part of the following step. Once they are gone, the dataset can be read with the default encoding, which tells us that no other non-ASCII characters remain in the data.
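
As a rough sketch of how such labels could be spotted programmatically, the snippet below flags any label containing a non-ASCII character and shows one way to strip it. The list labels is a made-up two-item stand-in for the real header, and dropping characters with errors="ignore" is an assumption about what "removing" should mean here.

```python
# Hypothetical labels; the second one ends with a stray non-ASCII character
labels = [
    "Gender",
    "Do you consider yourself to be a fan of the Expanded Universe?æ",
]

# str.isascii() (Python 3.7+) is True only if every character is ASCII
flagged = [label for label in labels if not label.isascii()]

# Encoding to ASCII with errors="ignore" drops the offending characters
cleaned = [
    label.encode("ascii", errors="ignore").decode("ascii") for label in labels
]

print(flagged)
print(cleaned)
```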

### 1.3 Header Labels
To make the dataset easier to work with, we will simplify the header by combining its two rows into a single row of shorter labels.

In [98]:
# new labels
with open('column_names.txt', 'r') as cn:
    col_names = [line.strip() for line in cn]

# read in dataset with new header
data = pd.read_csv(
    'raw_survey.csv',
    skiprows=1,  # skip the first raw header row
    header=0,  # treat the second raw header row as the header...
    names=col_names  # ...and replace it with the new labels
)

print("Shape = ", data.shape)
data.head()

Shape =  (1186, 38)


Unnamed: 0,RespondentID,Seen a Star Wars film,Fan of Star Wars,Seen The Phantom Menace,Seen Attack of the Clones,Seen Revenge of the Sith,Seen A New Hope,Seen The Empire Strikes Back,Seen Return of the Jedi,Rank for The Phantom Menace,...,View of Yoda,Which character shot first?,Familiar with the Expanded Universe?,Fan of the Expanded Universe?,Star Trek Fan,Gender,Age,Household Income,Education,Location (Census Region)
0,3292879998,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,3.0,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
1,3292879538,No,,,,,,,,,...,,,,,Yes,Male,18-29,"$0 - $24,999",Bachelor degree,West South Central
2,3292765271,Yes,No,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,,,,1.0,...,Unfamiliar (N/A),I don't understand this question,No,,No,Male,18-29,"$0 - $24,999",High school degree,West North Central
3,3292763116,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5.0,...,Very favorably,I don't understand this question,No,,Yes,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central
4,3292731220,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5.0,...,Somewhat favorably,Greedo,Yes,No,No,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central
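
The cell above takes its labels from a hand-written file. For comparison, here is a minimal sketch of a different approach that derives single-row labels from the two header levels directly. It is not the method used in this notebook, and the miniature header, the joining rule, and the names toy and flat_names are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature of the two-row header
columns = pd.MultiIndex.from_tuples([
    ("Which films have you seen?", "The Phantom Menace"),
    ("Unnamed: 1_level_0", "Attack of the Clones"),
    ("Gender", "Response"),
])
toy = pd.DataFrame([["Yes", "No", "Male"]], columns=columns)

flat_names = []
last_question = None
for top, bottom in toy.columns:
    # "Unnamed: ..." marks a blank top-row cell, so reuse the previous question
    if not top.startswith("Unnamed"):
        last_question = top
    # Single-answer questions carry the generic sub-label "Response"
    if bottom == "Response":
        flat_names.append(last_question)
    else:
        flat_names.append(f"{last_question}: {bottom}")

toy.columns = flat_names
print(toy.columns.tolist())
```

Generated labels like these stay long, which is one reason to prefer a curated list of short names as done above.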


### 1.4 Missing Values (Unfinished)
First, we will check whether any missing values are encoded as something other than NaN:
* Take an initial sum total of NaN values
* Use the replace() function to convert the placeholder entries '?', ' ', and '' to NaN
* Take a final sum total of NaN values
* Compare the initial and final sums. If the numbers are the same, then no new null values were found.

In [91]:
# check if any missing values are not formatted as NaN
init_sum = data.isnull().sum().sum()
data.replace(['?', ' ', ''], np.nan, inplace=True)
final_sum = data.isnull().sum().sum()

# compare sum of NaN values before and after the replace function
if init_sum < final_sum:
    print('Found ' + str(final_sum - init_sum) + ' new null values.')
else:
    print('Found no new null values.')

Found no new null values.
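
To see the check report something, it can be run on a toy frame that does contain placeholders. The data below is invented, and the regular expression that also catches strings of several spaces goes beyond the three literal values checked above, so treat it as an optional extension.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one real NaN and four disguised missing values
toy = pd.DataFrame({
    "Gender": ["Male", "?", " ", None],
    "Age": ["18-29", "", "30-44", "  "],
})

init_sum = int(toy.isnull().sum().sum())

# Replace the '?' placeholder, then any empty or whitespace-only string
cleaned = toy.replace("?", np.nan)
cleaned = cleaned.replace(r"^\s*$", np.nan, regex=True)

final_sum = int(cleaned.isnull().sum().sum())
print(f"Found {final_sum - init_sum} new null values.")
```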


Next, we will display the total number of missing values in each column.

In [73]:
print("\nTotal Missing Values:\n" + str(data.isnull().sum()))


Total Missing Values:
RespondentID                              0
Seen a Star Wars film                     0
Fan of Star Wars                        350
Seen The Phantom Menace                 513
Seen Attack of the Clones               615
Seen Revenge of the Sith                636
Seen A New Hope                         579
Seen The Empire Strikes Back            428
Seen Return of the Jedi                 448
Rank for The Phantom Menace             351
Rank for Attack of the Clones           350
Rank for Revenge of the Sith            351
Rank for A New Hope                     350
Rank for The Empire Strikes Back        350
Rank for Return of the Jedi             350
View of Han Solo                        357
View of Luke Skywalker                  355
View of Princess Leia Organa            355
View of Anakin Skywalker                363
View of Obi Wan Kenobi                  361
View of Emperor Palpatine               372
View of Darth Vader                     360
View of L

In [74]:
# Will we choose to handle these missing values up front?
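
One input to that decision is whether the gaps are structural, for example follow-up questions left blank by respondents who have not seen any film. The sketch below shows how this could be checked on invented data. The pattern it finds is built into the toy frame and is only a hypothesis about the real survey.

```python
import pandas as pd

# Hypothetical frame: respondents who answered "No" skipped the follow-up
toy = pd.DataFrame({
    "Seen a Star Wars film": ["Yes", "No", "Yes", "No"],
    "Fan of Star Wars": ["Yes", None, "No", None],
})

# Share of missing values in each column
missing_share = toy.isnull().mean()

# Share of missing follow-up answers, split by the screening question
missing_by_seen = (
    toy["Fan of Star Wars"].isnull()
    .groupby(toy["Seen a Star Wars film"])
    .mean()
)

print(missing_share)
print(missing_by_seen)
```

If the real data showed the same split, dropping or imputing these values up front would discard information about who skipped the question and why.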

### 1.5 Inconsistent Values (Unfinished)
First, we will analyze the unique values in each column, using the output of the following procedure:
* Exclude the RespondentID column, since all its values are unique
* Write all unique values to a text file

In [117]:
with open('unique_values.txt', 'w') as uv:
    cols = [c for c in col_names if c != 'RespondentID']
    for name in cols:
        uv.write(name + '\n')
        for val in data[name].unique():
            uv.write(str(val) + '\n')
        uv.write('\n')

In [None]:
# Next, we will change any values that we see need changing.
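
As an illustration of what such a change might look like, the sketch below normalizes stray whitespace and capitalization in a single column. The inconsistent values are invented for the example, and the real columns may not contain any of them.

```python
import pandas as pd

# Hypothetical column with inconsistent spellings of the same answers
toy = pd.Series(["Yes", " yes", "No ", None], name="Fan of Star Wars")

# Strip surrounding whitespace, then capitalize; missing values stay missing
normalized = toy.str.strip().str.capitalize()

print(normalized.unique())
```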

### 1.6 Duplicate Values (Unfinished)
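This step is not yet implemented. A minimal sketch of one possible check, shown here on invented data, is to count fully duplicated rows and repeated RespondentID values separately. Whether duplicates should simply be dropped is an open question and not something this notebook has decided.

```python
import pandas as pd

# Hypothetical frame in which respondent 2 appears twice
toy = pd.DataFrame({
    "RespondentID": [1, 2, 2, 3],
    "Gender": ["Male", "Female", "Female", "Male"],
})

# Rows that repeat an earlier row in every column
n_full_duplicates = int(toy.duplicated().sum())

# IDs that repeat an earlier ID, even if other answers differ
n_duplicate_ids = int(toy["RespondentID"].duplicated().sum())

# One possible resolution: keep the first occurrence of each row
deduped = toy.drop_duplicates()

print(n_full_duplicates, n_duplicate_ids, deduped.shape)
```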

### 1.7 Write Cleaned Dataset to CSV

In [5]:
data.to_csv('survey_data.csv')
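
By default, to_csv also writes the row index as an unlabeled first column, which read_csv later loads as "Unnamed: 0". The sketch below demonstrates this on an invented two-row frame, using in-memory buffers instead of files. Whether the index should be kept in survey_data.csv is left open here.

```python
import io

import pandas as pd

toy = pd.DataFrame({"RespondentID": [1, 2], "Gender": ["Male", "Female"]})

# Default behavior: the index is written as an extra, unlabeled column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```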