## Unsupervised Learning Capstone
### Star Wars IV - can you predict the character dialogue?

For my Unsupervised Learning Capstone, I wanted a practical and fun way to explore unsupervised learning and natural language processing. On the practical side, these techniques apply to website text, reviews, customer surveys, and similar sources. On the fun side, I thought about movie scripts: what if you took a script you know well and used natural language techniques to pick out the significant words that tie the dialogue to key characters?

In this vein I looked at the script from one of my favorite movie franchises: Star Wars IV, a classic feel-good movie with one of the best protagonists in movie history. I loaded the script's text file into a data frame and applied a TF-IDF vectorizer to see whether a reader who knows the characters' personalities could tell, from each character's most significant words, who said what.

In [18]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/jasonpaik9/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jasonpaik9/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [19]:
## Load the data of the Star Wars movie script
data_1 = ('https://storage.googleapis.com/kaggle-datasets/25491/32521/SW_EpisodeIV.txt?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1556147369&Signature=QXyLP28aLa4ckV0aK85Vu3WWZoXByl3Zsk9Wrf25Utxx38nCEIWLK1MEau34ladxXejeGKiL2qVHks9QI1qfRXQy6s%2Fkf3Dj6IZ6HB2akETEk4Xs4It0E9m9SDZyqfECLf6kc2J41h3B549Hhw0D7LrEdasYsYfEUxQXggi%2B9lHvXjy7ehWaZYoO80YyRhse6fDNo%2BvkdP2FfH%2Bhm2QX447NNAKO4dltTIO6m8BEy%2FIBOUV96x1JxvUFjISFM%2FwgtLLaCf8X34DqK5OBR6TFZ8ANwXzzDh4Vy90R3a0I%2F3GO0Efeif9RgBgoyCIJt%2BHbuuBIrcP6BcNvR21nQGRcnA%3D%3D')
#data_2 = ('https://storage.googleapis.com/kaggle-datasets/25491/32521/SW_EpisodeV.txt?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1555545245&Signature=o%2FZOeaRXXG5Us8hEAtgR%2F%2Bw4QvLgi9%2BLwAubWEDW%2FlBaIxG%2F36EIPiDFuvMT8fBKQFPv%2FJO2to%2FyUqkC5Gw42gjfu4xyJaJScE2W%2BqkCULniWpPbtX7tsAFkN3auiMwX4AIkyEA64S95YuSuMM7US3ZosbE6jQGqlww0Tpig2lXZ9i9TURJ9713%2BSYFS%2BsoTYZWjvgfxZQpZWUNOq0ofiM78Grm3YYieY8OUCwCWmiAzpO05CkdavY3qP5%2B9vaoEXwciwuuhiIW8ExSHE1lXrVeacOA8U4yvE6u7%2FNfTL5NhTWDr%2FxQuK8UkX9cFuGlaU5uWTRypUItqgfkfgPpjdA%3D%3D')
#data_3 = ('https://storage.googleapis.com/kaggle-datasets/25491/32521/SW_EpisodeVI.txt?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1555545268&Signature=Fj0xe4mBczpcnnVuTkh6kgCVN5M0CBBLxC4vE5Au87Cu6dEnjLtfYpSiVHO%2BTdnMV75yfWLkPxIh2xxlBFmTknT%2BF9PM6BDXY6UeX2NT4Vf%2FeHi0FzxwyAv%2Fj1viZGBrvuhO%2FQ7GeAh%2BBg3kxG7FXk0X2zQBhiPH7ChJyV9PNLBTwDs%2Fu8VxoLT2NsdFLE5FfBXnBbcvfKfC%2BTcRMrtaPTyH%2FYuzzKilOaunIEkBI4zjxI%2BN3JyuSROGlC6D9AAs5dfNp10rcWmsCGkLd4hq2C3tYtkhkEuGH10uLOFqdHMk51DrFDStACdzk0Y5IIPQxNMro5hyN%2FeRxC94mQ0eZg%3D%3D')

In [20]:
df1 = pd.read_csv(data_1,delim_whitespace = True,header = 0,escapechar='\\')
#df2 = pd.read_csv(data_2,delim_whitespace = True,header = 0,escapechar='\\')
#df3 = pd.read_csv(data_3,delim_whitespace = True,header = 0,escapechar='\\')
#SWdata = pd.concat([df1,df2,df3],axis = 0)
#SWdata.head()

In [21]:
## Isolate the unique characters from Star Wars IV
print(df1['character'].unique())
print(len(df1['character'].unique()))

['THREEPIO' 'LUKE' 'IMPERIAL OFFICER' 'VADER' 'REBEL OFFICER' 'TROOPER'
 'CHIEF PILOT' 'CAPTAIN' 'WOMAN' 'FIXER' 'CAMIE' 'BIGGS' 'DEAK' 'LEIA'
 'COMMANDER' 'SECOND OFFICER' 'FIRST TROOPER' 'SECOND TROOPER' 'BERU'
 'OWEN' 'AUNT BERU' 'BEN' 'TAGGE' 'MOTTI' 'TARKIN' 'BARTENDER' 'CREATURE'
 'HUMAN' 'HAN' 'GREEDO' 'JABBA' 'OFFICER CASS'
 'VOICE OVER DEATH STAR INTERCOM' 'OFFICER' 'VOICE' 'GANTRY OFFICER'
 'INTERCOM VOICE' 'TROOPER VOICE' 'FIRST OFFICER' 'WILLARD'
 'DEATH STAR INTERCOM VOICE' 'DODONNA' 'GOLD LEADER' 'WEDGE' 'MAN'
 'RED LEADER' 'CHIEF' 'MASSASSI INTERCOM VOICE' 'RED TEN' 'RED SEVEN'
 'PORKINS' 'RED NINE' 'RED ELEVEN' 'ASTRO-OFFICER' 'CONTROL OFFICER'
 'GOLD FIVE' 'GOLD TWO' 'WINGMAN' 'BASE VOICE' 'TECHNICIAN']
60


In [22]:
## Count the lines of dialogue per character to find the three characters with the most lines.
## Restricting the data to those three should help the algorithm draw clearer
## distinctions between the characters
df1.groupby('character').count().sort_values(by = ['dialogue'], ascending=False)

character,dialogue
LUKE,254
HAN,153
THREEPIO,119
BEN,82
LEIA,57
VADER,41
RED LEADER,37
BIGGS,34
TARKIN,28
OWEN,25


In [23]:
## Keep only the top three characters by number of lines in the script - this shortens
## the data to a much more manageable set that I can work with
main_chars_df = df1[df1.character.isin(['LUKE','HAN','THREEPIO'])]

In [24]:
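## Split each line of dialogue into tokens on whitespace and drop NLTK's English stop words
## (the comparison is case-sensitive, so capitalized words such as "Did" and "I" are kept)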

stop_words = stopwords.words('english')

tokenized_script = main_chars_df['dialogue'].apply(lambda x: x.split())

tokenized_script = tokenized_script.apply(lambda x: [item for item in x if item not in stop_words])

## De-tokenization of the words from the script itself
detokenized_script = []
for i in range(len(main_chars_df)):
    t = ' '.join(tokenized_script.reset_index().iloc[i]['dialogue'])
    detokenized_script.append(t)

## Add the cleaned dialogue to the dataframe as a new column
main_chars_df['clean_dialogue'] = detokenized_script

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
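
The warning above appears because `main_chars_df` was created by filtering `df1`, so pandas cannot tell whether the new column is meant for the filtered rows only or for the original frame. A minimal sketch of one common way to avoid it, assuming a small made-up frame (the rows and the names `toy_df` and `subset` are illustrative, not taken from the script): take an explicit copy of the filtered rows before adding columns.

```python
import pandas as pd

# Toy stand-in for df1; these rows are made up for illustration.
toy_df = pd.DataFrame({
    "character": ["LUKE", "HAN", "VADER", "THREEPIO"],
    "dialogue": ["line one", "line two", "line three", "line four"],
})

# .copy() makes the filtered result an independent DataFrame,
# so adding a column to it is unambiguous.
subset = toy_df[toy_df["character"].isin(["LUKE", "HAN", "THREEPIO"])].copy()
subset["clean_dialogue"] = subset["dialogue"].str.upper()

print(subset)
```

The original frame is left untouched, and the new column exists only on the copy.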


In [25]:
## Explore the new dataset, with dialogue and clean_dialogue in separate columns
main_chars_df.head()

,character,dialogue,clean_dialogue
1,THREEPIO,Did you hear that? They've shut down the main...,Did hear that? They've shut main reactor. We'l...
2,THREEPIO,We're doomed!,We're doomed!
3,THREEPIO,There'll be no escape for the Princess this time.,There'll escape Princess time.
4,THREEPIO,What's that?,What's that?
5,THREEPIO,I should have known better than to trust the l...,I known better trust logic half-sized thermoca...
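
The `clean_dialogue` column still contains words like "Did" and "I" because NLTK's stop-word list is lowercase and the filter compares tokens case-sensitively. A rough sketch of a case-insensitive filter follows. The stop list here is a tiny hand-picked set used only for illustration (the notebook uses NLTK's full English list), and `remove_stop_words` is a hypothetical helper.

```python
# Tiny illustrative stop list (an assumption, not NLTK's actual list).
STOP_WORDS = {"i", "did", "you", "have", "should", "than", "to", "the"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in stop_words; keep the casing of the rest."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stop_words)

print(remove_stop_words("Did you hear that?"))  # prints: hear that?
```

Tokens with attached punctuation (such as "that?") still slip through a whitespace split. In this notebook the practical effect is limited, because `TfidfVectorizer` lowercases the text and applies its own English stop list afterwards.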


In [26]:
## A term frequency-inverse document frequency (TF-IDF) vectorizer tokenizes the documents, learns the vocabulary
## and the inverse document frequency weightings, and can then encode new documents. An inverse document frequency
## is computed for each word in the vocabulary.

from sklearn.feature_extraction.text import TfidfVectorizer

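## Configure the vectorizer: drop English stop words, cap the vocabulary at 5000 terms,
## and ignore terms that appear in more than half of the lines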
vectorizer = TfidfVectorizer(stop_words='english', 
                    max_features= 5000,
                    max_df = 0.5, 
                    smooth_idf=True)

## Vectorize the clean_dialogue column
X = vectorizer.fit_transform(main_chars_df['clean_dialogue'])

## Understand the shape of the matrix
X.shape 

(526, 892)
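
To make the weighting concrete, here is a small hand-rolled sketch of the calculation on three made-up "documents" (not lines from the script). It assumes the formula scikit-learn documents for `smooth_idf=True`, namely idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. It leaves out stop-word removal and `max_df`.

```python
import math
from collections import Counter

# Three made-up "documents" for illustration only.
docs = ["right sir", "right kid", "sand people"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Raw count times idf for every vocabulary term, scaled to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
for term, weight in zip(vocab, matrix[0]):
    print(f"{term:>7}: {weight:.3f}")
```

In the first document, "sir" gets a higher weight than "right" because "right" also appears in another document, which is the behavior that lets rarer, more distinctive words stand out.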

## Unsupervised Learning

In [27]:
## TruncatedSVD performs linear dimensionality reduction by means of truncated singular value decomposition (SVD).
## Contrary to PCA, this estimator does not center the data before computing the decomposition.

from sklearn.decomposition import TruncatedSVD

# Each SVD component is a vector of weights over the vocabulary terms
svd_model = TruncatedSVD(n_components=3, algorithm='randomized', n_iter=1000, random_state=101)

svd_model.fit(X)

## Number of components, one for each character we hope to separate
len(svd_model.components_)

3
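
To illustrate what "does not center" can mean in practice, the sketch below runs a plain SVD on a tiny made-up document-term matrix. It uses `np.linalg.svd` as a stand-in for `TruncatedSVD`, and the terms and counts are assumptions chosen for illustration. With nonnegative, uncentered data, the loadings of the first component all share one sign and lean toward terms that are common across documents, which may partly explain why everyday words recur across the per-component term lists.

```python
import numpy as np

terms = ["right", "ship", "sand", "sir"]

# Made-up document-term counts: every document contains "right".
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
])

# SVD of the raw (uncentered) matrix; rows of Vt play the role of components_.
_, singular_values, Vt = np.linalg.svd(X, full_matrices=False)

first = Vt[0]
if first.sum() < 0:  # the sign of a singular vector is arbitrary
    first = -first

# Print terms from highest to lowest loading on the first component.
for term, loading in sorted(zip(terms, first), key=lambda p: p[1], reverse=True):
    print(f"{term:>5}: {loading:.3f}")
```

Here the term shared by every document ends up with the largest loading on the first component.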

In [29]:
## Pair each vocabulary term with its weight in each SVD component and print the 20 highest-weighted terms
## per component. Each component is then read as a candidate "character" based on its most significant words.

## Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out().
terms = vectorizer.get_feature_names()

for i, comp in enumerate(svd_model.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:20]
    print("Character "+str(i)+": ")
    for t in sorted_terms:
        print(t[0])

Character 0: 
going
come
right
artoo
threepio
got
ll
look
think
okay
oh
way
know
ve
red
princess
like
long
try
kid
Character 1: 
come
threepio
right
ll
artoo
oh
got
chewie
sir
ve
ship
hang
sand
people
look
copy
great
stand
detoo
wrong
Character 2: 
right
ll
oh
sir
got
ve
think
know
ship
luke
like
kid
better
great
sure
way
gonna
wrong
help
stand


When you look into the clusters of words associated with each component, you start to see (with the bias of having watched Star Wars) that certain words associate with particular characters. This plays into the curriculum's emphasis on sentiment analysis by finding similar patterns within clusters.

Character 0: Han -- I identified this character as Han because of clustered words like "princess", "kid", "come", and "going". Homing in on a word like "princess" alone wouldn't immediately signify Han (since Luke also says it), but its association with "kid" and with words about going from one place to another, something we see a lot of in Star Wars IV, does. How these words cluster together plays a big part in tying them to a character.

Character 1: Luke -- There were distinctive words like "sand" that figure heavily in Luke's character at the beginning of the film. Words like "ship", "sir", and "people" also play a large part in Luke's character, which helped me identify this cluster as Luke.

Character 2: Threepio -- I identified this as Threepio because of the clustered associations between "right", "oh", "help", and "sure", all words that deeply reflect Threepio's character and his helpful nature. In many ways, sentiment analysis like this does an interesting job of matching clusters of words to the characters that best portray them.
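
One way to sanity-check these hand-assigned labels is to average each component's score over the lines actually spoken by each character. The sketch below shows only the mechanics, on made-up numbers that are not results from the script. In the notebook the scores would come from `svd_model.transform(X)` and the speakers from `main_chars_df['character']`. The helper name `mean_scores_by_speaker` is hypothetical.

```python
import numpy as np

# Made-up speakers and component scores (rows = dialogue lines, columns = components).
speakers = np.array(["LUKE", "HAN", "HAN", "THREEPIO", "LUKE", "THREEPIO"])
scores = np.array([
    [0.1, 0.6, 0.2],
    [0.7, 0.2, 0.1],
    [0.5, 0.1, 0.3],
    [0.2, 0.1, 0.8],
    [0.3, 0.5, 0.1],
    [0.1, 0.3, 0.6],
])

def mean_scores_by_speaker(scores, speakers):
    """Average the component scores over the lines of each speaker."""
    return {str(name): scores[speakers == name].mean(axis=0)
            for name in np.unique(speakers)}

means = mean_scores_by_speaker(scores, speakers)
for name, avg in means.items():
    print(name, "-> strongest component:", int(np.argmax(avg)))
```

If a character's lines score highest, on average, on the component labeled with that character's name, the label has some support. If not, the component may be capturing something other than who is speaking.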