## Introduction

This analysis set out to answer two questions. First, are the characters with the most lines in the Star Wars movies also the most polarizing? That is, do people who have seen the movies tend to view them either very favorably or very unfavorably? Second, do the most polarizing characters share a physical trait? To answer both questions, we compare each character's polarization against the script data frame and the character description data frame.

In [1]:
#Import all needed libraries
import re
import pandas as pd

### FiveThirtyEight and List of Lists

We took the following data set from FiveThirtyEight. It contains the results of a Star Wars survey, with 1000 survey participants recorded. The csv was read in as a list of lists, which is a row-centric approach to storing data from a csv, and we then created a pandas data frame from that list of lists. From there we cleaned the data and changed each favorability response to a score that measures only polarization. Every character was rated with one of six responses: "Very favorably", "Somewhat favorably", "Neither favorably nor unfavorably (neutral)", "Somewhat unfavorably", "Very unfavorably", or "Unfamiliar (N/A)". Because we wanted to measure polarization, the two extreme responses, "Very favorably" and "Very unfavorably", are weighted the most and were assigned a value of 2. The two "Somewhat" responses were assigned 1, and neutral, unfamiliar, and blank responses were assigned 0. We then summed the polarization values for each character, stored the sums in a dictionary, and loaded that dictionary into a pandas data frame.
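
The scoring rule can be sketched in plain Python. This is a minimal illustration rather than the notebook's pandas code: the names `POLARIZATION_WEIGHTS` and `polarization_score` are hypothetical, and the sample responses are made up.

```python
# Hypothetical mapping from survey response to polarization weight,
# mirroring the weights described above.
POLARIZATION_WEIGHTS = {
    "Very favorably": 2,
    "Somewhat favorably": 1,
    "Neither favorably nor unfavorably (neutral)": 0,
    "Somewhat unfavorably": 1,
    "Unfamiliar (N/A)": 0,
    "Very unfavorably": 2,
    "": 0,  # unanswered question
}


def polarization_score(responses):
    """Sum the polarization weights for one character's responses."""
    return sum(POLARIZATION_WEIGHTS[response] for response in responses)


# Made-up responses, for illustration only
sample = [
    "Very favorably",
    "Very unfavorably",
    "Somewhat favorably",
    "",
    "Unfamiliar (N/A)",
]
print(polarization_score(sample))  # 2 + 2 + 1 + 0 + 0 = 5
```

Under this rule, a character rated "Very favorably" by half of the respondents and "Very unfavorably" by the other half gets the same score as one rated "Very favorably" by everyone, because the score tracks strength of opinion rather than its direction.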

In [None]:
filename = 'star-wars-survey_StarWars.csv'   # Opens file that is in the same working directory
file = open(filename, 'r')

def getColumns(file):
    """
    Given a file, this function returns the column headers in a spreadsheet
    
    Parameters: file
    Return value: list containing the column headers
    """
    file.seek(0)
    columns = file.readline().strip().split(',')
        
    return columns


def buildLol(file):
    """
    Given a file, this function creates a list of lists representation of the file
    
    Parameters: file
    Return value: a list of lists storing the data of the file
    """       
    lol = []
    
    for line in file: #Iterate through the lines of a file
        curRow = line.strip().split(',') #creates a list of values in a row, split up by commas
        lol.append(curRow)
        
    return lol

#Creation of the LoL and subsequent creation of a pandas data frame using the lists of lists
columns = getColumns(file)
StarWarsLol = buildLol(file)
five38df = pd.DataFrame(StarWarsLol)
five38df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,29,30,31,32,33,34,35,36,37,38
0,2,3292879998,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
1,3,3292879538,No,,,,,,,,...,,,,,Yes,Male,18-29,$0 - $24999,Bachelor degree,West South Central
2,4,3292765271,Yes,No,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,,,,...,Unfamiliar (N/A),I don't understand this question,No,,No,Male,18-29,$0 - $24999,High school degree,West North Central
3,5,3292763116,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,...,Very favorably,I don't understand this question,No,,Yes,Male,18-29,$100000 - $149999,Some college or Associate degree,West North Central
4,6,3292731220,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,...,Somewhat favorably,Greedo,Yes,No,No,Male,18-29,$100000 - $149999,Some college or Associate degree,West North Central
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
994,996,3288682656,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,...,Very favorably,Han,No,,Yes,Female,30-44,$50000 - $99999,Bachelor degree,New England
995,997,3288678886,Yes,No,,,,,Star Wars: Episode V The Empire Strikes Back,,...,Very favorably,I don't understand this question,No,,No,Female,18-29,$100000 - $149999,Some college or Associate degree,East North Central
996,998,3288675923,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,...,Very favorably,I don't understand this question,No,,Yes,Female,18-29,,Graduate degree,West South Central
997,999,3288672490,Yes,No,Star Wars: Episode I The Phantom Menace,,,Star Wars: Episode IV A New Hope,,Star Wars: Episode VI Return of the Jedi,...,Very favorably,Han,No,,No,Female,18-29,$50000 - $99999,Bachelor degree,Pacific
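
Splitting each line on commas works as long as no field contains a comma itself. As a hedged alternative, the standard library's `csv` module also handles quoted fields. The sketch below uses a made-up two-line sample (the `Favorite line` column is an assumption and is not part of the survey file) to show the difference.

```python
import csv
import io

# Made-up two-line sample; the quoted field contains a comma
sample_text = 'RespondentID,Favorite line\n1,"No, I am your father"\n'

# Row-centric list of lists built by splitting on commas
naive_rows = [line.strip().split(",") for line in sample_text.splitlines()]

# The same list of lists built with csv.reader, which respects the quotes
with io.StringIO(sample_text) as handle:
    csv_rows = list(csv.reader(handle))

print(naive_rows[1])  # ['1', '"No', ' I am your father"']
print(csv_rows[1])    # ['1', 'No, I am your father']
```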


In [None]:
#Name the columns of the pandas data frame using the list of columns created above, then set the index of the data frame
five38df.columns = columns
five38dfv2 = five38df.set_index("rowid")

#Removing all un-needed columns
five38dfv3 = five38dfv2.drop(["RespondentID", "Which character shot first?","Are you familiar with the Expanded Universe?","Do you consider yourself to be a fan of the Expanded Universe?","Do you consider yourself to be a fan of the Star Trek franchise?","Gender", "Age","Household Income", "Education", "Location (Census Region)"], axis=1)

In [None]:
#List of the columns that are going to be in the cleaned data set
cleanedfive38df = five38dfv3[['Have you ever seen Episode IV?','Have you ever seen Episode V?','Have you ever seen Episode VI?','Please state whether you view Han Solo favorably or unfavorably or are unfamiliar with him/her.',
 'Please state whether you view Luke Skywalker favorably or unfavorably or are unfamiliar with him/her.',
 'Please state whether you view Princess Leia Organa favorably or unfavorably or are unfamiliar with him/her.',
 'Please state whether you view Anakin Skywalker favorably or unfavorably or are unfamiliar with him/her.',
 'Please state whether you view Obiwan Kenobi favorably or unfavorably or are unfamiliar with him/her.',
 'Please state whether you view Emperor Palpatine favorably or unfavorably or are unfamiliar with him/her.',
 'Please state whether you view Darth Vader favorably or unfavorably or are unfamiliar with him/her.',
 'Please state whether you view Lando Calrissian favorably or unfavorably or are unfamiliar with him/her.',
 'Please state whether you view Boba Fett favorably or unfavorably or are unfamiliar with him/her.',
 'Please state whether you view C-3PO favorably or unfavorably or are unfamiliar with him/her.',
 'Please state whether you view R2 D2 favorably or unfavorably or are unfamiliar with him/her.',
 'Please state whether you view Yoda favorably or unfavorably or are unfamiliar with him/her.']]

In [None]:
#Renaming the column headers in the cleaned file
cleanedfive38df = cleanedfive38df.rename(columns={"Have you ever seen Episode IV?": "Episode IV", 
"Have you ever seen Episode V?": "Episode V", 
"Have you ever seen Episode VI?": "Episode VI",
'Please state whether you view Han Solo favorably or unfavorably or are unfamiliar with him/her.': 'Han Solo',
'Please state whether you view Luke Skywalker favorably or unfavorably or are unfamiliar with him/her.': "Luke Skywalker",
'Please state whether you view Princess Leia Organa favorably or unfavorably or are unfamiliar with him/her.': "Leia Organa",
'Please state whether you view Anakin Skywalker favorably or unfavorably or are unfamiliar with him/her.': "Anakin Skywalker",
'Please state whether you view Obiwan Kenobi favorably or unfavorably or are unfamiliar with him/her.': "Obiwan Kenobi",
'Please state whether you view Emperor Palpatine favorably or unfavorably or are unfamiliar with him/her.': "Emporer Palpatine",
'Please state whether you view Darth Vader favorably or unfavorably or are unfamiliar with him/her.': "Darth Vader", 
'Please state whether you view Lando Calrissian favorably or unfavorably or are unfamiliar with him/her.': "Lando Calrissian",
'Please state whether you view Boba Fett favorably or unfavorably or are unfamiliar with him/her.': "Boba Fett", 
'Please state whether you view C-3PO favorably or unfavorably or are unfamiliar with him/her.': "C-3PO",
'Please state whether you view R2 D2 favorably or unfavorably or are unfamiliar with him/her.': "R2 D2",
'Please state whether you view Yoda favorably or unfavorably or are unfamiliar with him/her.': "Yoda"
})

In [None]:
#Changing the values of the columns which check which movies a person has seen
cleanedfive38df.iloc[:, 0:3] = cleanedfive38df.iloc[:, 0:3].replace(["Star Wars: Episode IV  A New Hope","Star Wars: Episode V The Empire Strikes Back",
"Star Wars: Episode VI Return of the Jedi",''], ["Yes","Yes","Yes","No"])

#Changing current measuring system for one that measures polarization rather than favorability. 
cleanedfive38df = cleanedfive38df.replace(["Very favorably","Somewhat favorably","Neither favorably nor unfavorably (neutral)",
'Somewhat unfavorably', 'Unfamiliar (N/A)', 'Very unfavorably',''], [2,1,0,1,0,2,0])

In [None]:
#Create list of columns that we hope to sum
columnList = cleanedfive38df.columns.tolist()
cleanedList = columnList[3:]

characterSum = {}

#Sum values for each character
for column in cleanedList:
    characterSum[column] = sum(cleanedfive38df[column])

#Sort dictionary
polarization = dict(sorted(characterSum.items(), key= lambda x: x[1], reverse = True))

In [None]:
#Create data frame and subsequent formatting of that data frame
polarizationDf = pd.DataFrame(polarization, index = [0])
polarizationDf2 = polarizationDf.transpose()

#Add column headers
polarizationDf2.columns = ["Polarization Count"]

polarizationDf2

Unnamed: 0,Polarization Count
Han Solo,1192
Yoda,1183
Obiwan Kenobi,1164
Luke Skywalker,1155
R2 D2,1142
Leia Organa,1139
C-3PO,1038
Darth Vader,1023
Anakin Skywalker,787
Emporer Palpatine,598


### Regular Expressions and Dictionary of Lists

We start by using regular expressions to read through the scripts of the original trilogy (A New Hope, The Empire Strikes Back, Return of the Jedi). The regex patterns that we created capture the name of the speaking character for each speaking line in a script. We store the results in a dictionary of lists: each key is a speaking character, and each value is a list holding that character's number of speaking lines followed by a flag for whether the character speaks in each movie.
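
To illustrate the idea, here is a minimal sketch that applies the Return of the Jedi pattern from this section to a made-up excerpt. The excerpt layout (a blank line, the speaker in capitals on its own line, then the dialogue) is an assumption about the script format, and `collections.Counter` is used for brevity in place of the dictionary of lists built in this section.

```python
import re
from collections import Counter

# Made-up excerpt, assumed to be laid out like the script files
sample_script = (
    "\n\nLUKE\nI can't see a thing.\n"
    "\n\nVADER\nThe Force is with you.\n"
    "\n\nLUKE\nYou'll find I'm full of surprises.\n"
)

# Blank line, then at least three capitals/spaces/apostrophes, then a newline
speaker_pattern = r"\n\n(\b[A-Z ']{3,}\b)\n"

# findall returns the captured speaker name once per speaking line
speakers = re.findall(speaker_pattern, sample_script)
line_counts = Counter(speakers)

print(speakers)             # ['LUKE', 'VADER', 'LUKE']
print(line_counts["LUKE"])  # 2
```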

In [2]:
#creates a capture group for each Star Wars movie script which will capture speaking characters
newHopePattern = r"(\b[A-Z ']+\b)\n[^\n]+[A-Z(]"  
newHopeScript = open('Star-Wars-A-New-Hope.txt', 'r')
newHopeMatches = re.findall(newHopePattern, newHopeScript.read())

empireStrikesBackPattern = r"(\b[A-Z ']+\b)\n[^\n]+[A-Z(]" 
empireStrikesBackScript = open('Star-Wars-The-Empire-Strikes-Back.txt','r')
empireStrikesBackMatches = re.findall(empireStrikesBackPattern, empireStrikesBackScript.read())

returnOfTheJediPattern = r"\n\n(\b[A-Z ']{3,}\b)\n" 
returnOfTheJediScript = open('Star-Wars-Return-of-the-Jedi.txt','r') 
returnOfTheJediMatches = re.findall(returnOfTheJediPattern, returnOfTheJediScript.read())

In [3]:
def createDictionary(matches):
    """
    Given a list of matches, this function returns the dictionary of lists with keys as matches and values as frequency of 
    those characters in the matches list
    
    Parameters: list of matches
    Return value: dictionary of lists with characters and frequencies
    """
    dic = {}
    for item in matches:
        if item in dic:
            dic[item] = [x+1 for x in dic[item]]
        else:
            dic[item] = [1]
    return dic

#create dictionaries for all matches in each movie
newHopeDic = createDictionary(newHopeMatches)
empireStrikesBackDic = createDictionary(empireStrikesBackMatches)
returnOfTheJediDic = createDictionary(returnOfTheJediMatches[:-1])


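#create a single dictionary aggregating all dictionaries and their characters and frequencies 
#membership is checked against newHopeDic itself, so a character first added from
#The Empire Strikes Back keeps its running total when Return of the Jedi is merged in
for item in empireStrikesBackDic:
    if item in newHopeDic:
        newHopeDic[item] = [x+empireStrikesBackDic[item][0] for x in newHopeDic[item]]
    else:
        newHopeDic[item] = list(empireStrikesBackDic[item]) #copy so the two dictionaries do not share a list

for item in returnOfTheJediDic:
    if item in newHopeDic:
        newHopeDic[item] = [x+returnOfTheJediDic[item][0] for x in newHopeDic[item]]
    else:
        newHopeDic[item] = list(returnOfTheJediDic[item]) #copy so the two dictionaries do not share a list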
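
As a hedged aside, the same per-film aggregation can be sketched with `collections.Counter`, which adds counts key by key. This is a standard-library alternative and not the approach used in this notebook; the counts below are made up for illustration.

```python
from collections import Counter

# Made-up per-film line counts, for illustration only
new_hope = Counter({"LUKE": 3, "HAN": 2})
empire = Counter({"LUKE": 1, "YODA": 4})
jedi = Counter({"LUKE": 2, "YODA": 1})

# Adding Counters sums the counts for every character across the films
total = new_hope + empire + jedi

# A character absent from the first film keeps the counts from both later films
print(total["YODA"])  # 5

# most_common() lists characters from most to fewest lines
ranked = total.most_common()
print(ranked)  # [('LUKE', 6), ('YODA', 5), ('HAN', 2)]
```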

In [4]:
completeDic = dict(sorted(newHopeDic.items(), key=lambda x: x[1],reverse=True)) #create the complete dictionary sorted by highest 
#frequency of speaking lines

In [5]:
newHopeDic = createDictionary(newHopeMatches)
empireStrikesBackDic = createDictionary(empireStrikesBackMatches)
returnOfTheJediDic = createDictionary(returnOfTheJediMatches[:-1])

# append Y/N values to the dictionary of lists for each key which allow us to determine whether a character was in the 
# first, second, or third movie
for key in completeDic:
    if key in newHopeDic:
        completeDic[key].append("Y")
    else:
        completeDic[key].append("N")

for key in completeDic:
    if key in empireStrikesBackDic:
        completeDic[key].append("Y")
    else:
        completeDic[key].append("N")

for key in completeDic:
    if key in returnOfTheJediDic:
        completeDic[key].append("Y")
    else:
        completeDic[key].append("N")

In [6]:
df = pd.DataFrame(completeDic) # create a data frame of our dictionary
df2 = df.transpose()
df2.columns = ["Number of lines", "In movie 4?", "In movie 5?", "In movie 6?"] # rename the columns so they describe the values 
# they hold
df2

Unnamed: 0,Number of lines,In movie 4?,In movie 5?,In movie 6?
LUKE,476,Y,Y,Y
HAN,433,Y,Y,Y
THREEPIO,291,Y,Y,Y
LEIA,224,Y,Y,Y
VADER,132,Y,Y,Y
...,...,...,...,...
RED THREE,1,N,N,Y
NAVIGATOR,1,N,N,Y
CONTROL ROOM COMMANDER,1,N,N,Y
SECOND COMMANDER,1,N,N,Y


### Character Descriptions

We took a data set from GitHub that lists the physical attributes of most of the characters in Star Wars. This file was read directly into pandas as a csv. We cleaned the data frame by dropping the variables we did not need and by removing every character who does not appear in at least one of the original trilogy movies. This lets us check whether any physical trait is associated with a character being more popular or polarizing.
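
The original-trilogy check can be sketched as a small helper. This is an illustration under the assumption that each character's films are stored as one comma-separated string, as in the `films` column; the names `ORIGINAL_TRILOGY` and `in_original_trilogy` are hypothetical.

```python
# Titles of the three original-trilogy films
ORIGINAL_TRILOGY = {"A New Hope", "The Empire Strikes Back", "Return of the Jedi"}


def in_original_trilogy(films):
    """Return True if a comma-separated film list names any original-trilogy film."""
    titles = {title.strip() for title in films.split(",")}
    return bool(titles & ORIGINAL_TRILOGY)


print(in_original_trilogy("Attack of the Clones, Revenge of the Sith, A New Hope"))  # True
print(in_original_trilogy("The Phantom Menace, Attack of the Clones"))               # False
```

A helper like this could be applied to the `films` column with `Series.apply` to build a boolean mask, which avoids dropping rows from a data frame while iterating over it.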

In [14]:
#Create pandas data frame and set index
descriptionsDf = pd.read_csv("starwars_descriptions.csv")
descriptionsDf.set_index(["name"], inplace = True)

#Drop unwanted columns using drop method for pandas
descriptionsDfv2 = descriptionsDf.drop(["vehicles", "starships", "homeworld"], axis = 1)

#Keep only the characters that appear in the original trilogy
for index,row in descriptionsDfv2.iterrows():
    movies = [item.strip() for item in row["films"].split(",")] #list comprehension to create list of movies each character was in ("films" is the column at position 9)
    if ("A New Hope" not in movies) and ("The Empire Strikes Back" not in movies) and ("Return of the Jedi" not in movies): #True when the character is in none of the original trilogy movies
        descriptionsDfv2.drop([index], axis = 0, inplace = True) #Drop characters if they do not appear in 4, 5, or 6
        
descriptionsDfv2

name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,species,films
Luke Skywalker,172.0,77.0,blond,fair,blue,19.0,male,masculine,Human,"The Empire Strikes Back, Revenge of the Sith, ..."
C-3PO,167.0,75.0,,gold,yellow,112.0,none,masculine,Droid,"The Empire Strikes Back, Attack of the Clones,..."
R2-D2,96.0,32.0,,"white, blue",red,33.0,none,masculine,Droid,"The Empire Strikes Back, Attack of the Clones,..."
Darth Vader,202.0,136.0,none,white,yellow,41.9,male,masculine,Human,"The Empire Strikes Back, Revenge of the Sith, ..."
Leia Organa,150.0,49.0,brown,light,brown,19.0,female,feminine,Human,"The Empire Strikes Back, Revenge of the Sith, ..."
Owen Lars,178.0,120.0,"brown, grey",light,blue,52.0,male,masculine,Human,"Attack of the Clones, Revenge of the Sith, A N..."
Beru Whitesun lars,165.0,75.0,brown,light,blue,47.0,female,feminine,Human,"Attack of the Clones, Revenge of the Sith, A N..."
R5-D4,97.0,32.0,,"white, red",red,,none,masculine,Droid,A New Hope
Biggs Darklighter,183.0,84.0,black,light,brown,24.0,male,masculine,Human,A New Hope
Obi-Wan Kenobi,182.0,77.0,"auburn, white",fair,blue-gray,57.0,male,masculine,Human,"The Empire Strikes Back, Attack of the Clones,..."
