# Topic Modeling
## Introduction
This notebook applies topic modeling to the corpus (i.e. the five movie screenplays) in order to extract topics and surface hidden patterns, such as the topics themselves or broader themes that can be inferred from them.

In this project, a TF-IDF Vectorizer creates the document-term matrix that serves as input to a Non-Negative Matrix Factorization (NMF) model. Once the model is fit, we'll interpret the results and check whether the mix of words in each topic makes sense. If it doesn't, we'll tune the model by adding stop words, changing the number of topics, and/or cleaning the data further.

In [1]:
# data manipulation
import pandas as pd
import numpy as np

# files
import pickle

# topic modeling
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# visualization
from matplotlib import pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
import matplotlib.colors as mcolors

## Topic Modeling with NMF
### Document-Term Matrix
The TF-IDF Vectorizer converts text data into numeric vectors for topic modeling. It weights each term's frequency within a document (here, a scene) by its inverse document frequency, so terms that appear in many documents across the dataset are down-weighted.
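
To make the weighting concrete, the sketch below computes TF-IDF by hand for three toy sentences. It assumes the smoothed IDF and L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The toy documents and the `tfidf` helper are illustrative and not part of this notebook.

```python
import math
from collections import Counter

# toy corpus (hypothetical, not taken from the screenplays)
docs = ["gotham needs batman", "batman protects gotham", "joker laughs"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# document frequency: number of documents containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
           for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = [tfidf(tokens) for tokens in tokenized]

# "needs" occurs in one document and "gotham" in two, so "needs" scores higher
print(weights[0])
```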

We'll first read in the DataFrame containing the corpus.

In [2]:
# read in the original dataframe
df = pd.read_pickle('df.pkl')

### TF-IDF Vectorizer
Create a document-term matrix using a TF-IDF Vectorizer that excludes common English stop words and, through `max_df=0.6`, ignores terms that appear in more than 60% of the documents.
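
As a small illustration of the two filters, the sketch below fits a vectorizer with the same `stop_words` and `max_df` settings on a hypothetical five-sentence corpus and inspects which terms survive. The sentences are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus (hypothetical): "batman" appears in 4 of 5 documents
toy_docs = [
    "batman fights the joker",
    "batman saves gotham",
    "batman meets alfred",
    "batman trusts gordon",
    "the joker robs a bank",
]

toy_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.6)
toy_vectorizer.fit(toy_docs)
vocab = toy_vectorizer.vocabulary_

print('batman' in vocab)  # dropped: document frequency 0.8 exceeds max_df
print('the' in vocab)     # dropped: built-in English stop word
print('joker' in vocab)   # kept: document frequency 0.4
```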

In [3]:
# instantiate the vectorizer with common words filtered
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.6)

# fit the text and create vectors
X = vectorizer.fit_transform(df.dialogue)

# create a document-term matrix
dtm = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
dtm.index = df.index
dtm

Unnamed: 0,abandoned,abducted,abduction,able,abominable,abridging,abruptly,absolutely,ac,accept,...,youre,yous,youve,yummier,yummy,yup,zachary,zeroing,zone,zones
Batman Scene 2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
Batman Scene 3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
Batman Scene 5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.081476,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
Batman Scene 6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
Batman Scene 7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
The Dark Knight Rises Scene 302,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
The Dark Knight Rises Scene 303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
The Dark Knight Rises Scene 304,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.230598,0.0,0.0,0.0,0.0
The Dark Knight Rises Scene 305,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0


### Model without Additional Stop Words
Let's try a model with 4 topics.
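
For orientation, NMF approximates the non-negative document-term matrix as the product of a document-topic matrix and a topic-term matrix. The sketch below shows the resulting shapes on a small random matrix. The data, `init='nndsvda'`, and `random_state=0` are assumptions chosen for a repeatable illustration, not settings from this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# hypothetical stand-in for a document-term matrix: 6 documents x 5 terms
rng = np.random.default_rng(0)
V = rng.random((6, 5))

toy_nmf = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=500)
W = toy_nmf.fit_transform(V)   # document-topic weights, shape (6, 2)
H = toy_nmf.components_        # topic-term weights, shape (2, 5)

print(W.shape, H.shape)
# W @ H has the same shape as V and approximates it
print(np.linalg.norm(V - W @ H))
```

In the cells of this notebook, `doc_topic` plays the role of `W` and `nmf_model.components_` plays the role of `H`.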

In [4]:
# create the nmf model with 4 topics
nmf_model = NMF(n_components=4)
doc_word = vectorizer.fit_transform(df.dialogue_lemmatized)
doc_topic = nmf_model.fit_transform(doc_word)

# create a dataframe
df_nmf = pd.DataFrame(nmf_model.components_, columns=vectorizer.get_feature_names_out())
df_nmf



Unnamed: 0,abandoned,abducted,abduction,able,abominable,abridging,abruptly,absolutely,ac,accept,...,young,youre,yous,youve,yummier,yummy,yup,zachary,zeroing,zone
0,0.011519,0.002807,0.00416,0.019643,0.003477,0.002757,0.000839,0.003126,0.009415,0.003293,...,0.013266,0.112303,0.002681,0.064324,0.0,0.004407,0.002526,0.0,0.008329,0.007685
1,0.003911,0.0,0.000278,0.0,0.0,0.005489,0.0,0.0,0.0,0.000268,...,0.0,0.009081,0.0,0.065541,0.0,2e-05,0.0,0.0,0.0,0.005142
2,0.0,0.0,0.0,0.0,0.0,0.0,0.011007,0.00061,0.0,0.0,...,0.0,0.04975,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.001882,0.002882,0.0,0.006017,0.0,0.000118,0.007087,0.001214,0.001106,0.000322,...,0.006071,0.352983,0.001842,0.088601,0.026005,0.00145,0.00069,0.027529,0.0,0.0


In [5]:
# look at top 10 highest ranked words for each topic
for topic in range(df_nmf.shape[0]):
    tmp = df_nmf.iloc[topic]
    print(f'For topic {topic} the words with the highest value are:')
    print(tmp.nlargest(10))
    print('\n')

For topic 0 the words with the highest value are:
im        0.871092
dent      0.430189
batman    0.418998
harvey    0.402921
need      0.382577
gotham    0.335543
wa        0.307883
sir       0.302286
sorry     0.299945
wayne     0.290145
Name: 0, dtype: float64


For topic 1 the words with the highest value are:
hell        1.501286
doing       0.354634
think       0.197150
hey         0.148369
fine        0.136472
batman      0.119675
fit         0.116293
shouldnt    0.107837
forgive     0.107164
problem     0.101416
Name: 1, dtype: float64


For topic 2 the words with the highest value are:
bruce      1.437238
rachel     0.538222
wayne      0.327159
master     0.215413
defines    0.174404
right      0.127607
coming     0.105503
alfred     0.098342
chill      0.080293
whatre     0.079712
Name: 2, dtype: float64


For topic 3 the words with the highest value are:
know      0.704803
dont      0.679377
want      0.385698
got       0.369285
youre     0.352983
tell      0.299132
just    

Let's try again with 10 topics.
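
One rough way to compare topic counts is the reconstruction error that scikit-learn stores on a fitted model as `reconstruction_err_`. The sketch below uses random data as a stand-in for the document-term matrix; the candidate counts, `init`, and `random_state` are assumptions. The error tends to shrink as topics are added, so it is at best a complement to reading the topics themselves.

```python
import numpy as np
from sklearn.decomposition import NMF

# hypothetical stand-in for doc_word: 30 documents x 12 terms
rng = np.random.default_rng(42)
toy_doc_word = rng.random((30, 12))

errors = {}
for k in (2, 4, 8):
    model = NMF(n_components=k, init='nndsvda', random_state=0, max_iter=1000)
    model.fit(toy_doc_word)
    # Frobenius norm of (toy_doc_word - W @ H) for the fitted model
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(f'{k} topics: reconstruction error {err:.3f}')
```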

In [6]:
# create the nmf model with 10 topics
nmf_model = NMF(n_components=10)
doc_word = vectorizer.fit_transform(df.dialogue_lemmatized)
doc_topic = nmf_model.fit_transform(doc_word)

# create a dataframe again
df_nmf = pd.DataFrame(nmf_model.components_, columns=vectorizer.get_feature_names_out())
df_nmf



Unnamed: 0,abandoned,abducted,abduction,able,abominable,abridging,abruptly,absolutely,ac,accept,...,young,youre,yous,youve,yummier,yummy,yup,zachary,zeroing,zone
0,0.014491,0.005011,0.005032,0.015902,0.0,0.0,0.0,0.0,0.015799,0.0,...,0.0,0.104124,0.002156,0.016154,0.0,0.004278,0.002512,0.0,0.020969,0.00015
1,0.004586,0.0,7.8e-05,0.0,0.0,0.005909,0.0,0.0,0.0,0.0,...,0.0,0.018752,0.0,0.022243,0.0,0.000547,0.0,0.0,0.0,0.002736
2,0.0,0.0,0.0,0.0,0.0,0.0,0.005786,0.0,0.0,0.0,...,0.0,0.037781,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.001903,0.003041,0.0,0.002023,0.0,0.0,0.004185,0.007705,0.0,0.0,...,0.0,0.077878,0.0,0.0,0.029725,0.001856,0.0,0.029533,0.0,0.0
4,0.0,0.00173,0.0,0.005691,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.024147,0.0,0.0,0.0,0.000325,0.000784,0.0,0.0,0.0
5,0.0,0.0,0.000346,0.009472,0.0,0.00392,0.024118,0.007156,0.0,0.002317,...,0.0,0.0,0.000401,0.0,0.0,0.001422,0.000532,0.0,0.0,0.0
6,0.006103,0.0,0.001886,0.0,0.008701,0.000398,0.0,0.0,0.0,0.005617,...,0.017328,0.001239,0.0,0.0,0.0,0.000574,0.0,0.0,0.0,0.014073
7,0.000821,0.0,0.001282,0.0,0.0,0.0,0.0,0.0,0.0,0.002854,...,0.0,0.007855,0.001382,0.390557,0.0,0.000657,0.0,0.0,0.0,0.008532
8,0.00094,0.001587,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.108211,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.001163,0.000714,0.0,0.011943,0.0,0.001742,0.003688,0.0,0.00461,0.000523,...,0.018439,0.438271,0.003122,0.0,0.015531,0.000668,0.002098,0.015545,0.0,0.0


In [7]:
# look again at top 10 highest ranked words for each topic
for topic in range(df_nmf.shape[0]):
    tmp = df_nmf.iloc[topic]
    print(f'For topic {topic} the words with the highest value are:')
    print(tmp.nlargest(10))
    print('\n')

For topic 0 the words with the highest value are:
im        1.670852
like      0.541368
sorry     0.521172
come      0.376415
need      0.324851
sure      0.322850
little    0.264163
friend    0.255196
look      0.233140
thats     0.224886
Name: 0, dtype: float64


For topic 1 the words with the highest value are:
hell        1.596823
doing       0.370065
think       0.189601
hey         0.160150
fine        0.147878
fit         0.125382
forgive     0.114931
shouldnt    0.113177
paying      0.099978
probably    0.097698
Name: 1, dtype: float64


For topic 2 the words with the highest value are:
bruce      1.661972
right      0.157789
defines    0.153474
coming     0.136710
chill      0.086749
explain    0.080150
wayne      0.077927
dear       0.072621
fall       0.070986
falcone    0.067708
Name: 2, dtype: float64


For topic 3 the words with the highest value are:
know        1.157007
alfred      0.645586
tell        0.293531
joker       0.194030
sleeping    0.124963
awake       0.098

### Attempt with Additional Stop Words
The models so far contain many words that are either uninformative (e.g. I'm, like, shouldn't) or possibly too common across documents (e.g. Bruce, Alfred, Harvey). We'll try another model with those words added as stop words. This requires a new vectorizer object, whose output becomes the input for the NMF model.

In [8]:
# create a new document-term matrix with added stop_words 

# add additional stop words identified from earlier models
add_stop_words = ['im', 'know', 'dont', 'think', 'thought', 'got', 'ready', 'sir', 'hell', 'ill',
                  'oh', 'tell', 'youre', 'going', 'want', 'like', 'yes', 'just', 'hes', 'shes',
                  'took', 'theyre', 'wanna', 'looks', 'need', 'does', 'yeah', 'thats', 'come',
                  'gonna', 'gon', 'whered', 'didnt', 'did', 'coming', 'told', 'aint', 'little',
                  'okay', 'youve', 'trying', 'lets', 'ive', 'hed', 'mr', 'doing', 'let', 'came',
                  'whats', 'sure', 'stay', 'theres', 'said', 'knows', 'ah', 'gotta', 'hey',
                  'weve', 'theyve', 'wheres', 'em', 'whatre', 'batman', 'gotham', 'dent', 'rachel',
                  'harvey', 'wayne', 'bruce', 'alfred', 'youll', 'yous', 'yup', 'ac', 'shouldnt',
                  'yknow', 'youd', 'youits', 'say', 'hi', 'ya', 'lot', 'gordon', 'isnt', 'wa']

# convert to a list, since recent scikit-learn versions reject a frozenset for stop_words
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# recreate a document-term matrix with common words filtered
vectorizer_stop = TfidfVectorizer(stop_words=stop_words, max_df=0.6)
X_stop = vectorizer_stop.fit_transform(df.dialogue_lemmatized)
dtm_stop = pd.DataFrame(X_stop.toarray(), columns=vectorizer_stop.get_feature_names_out())
dtm_stop.index = df.index
dtm_stop

Unnamed: 0,abandoned,abducted,abduction,able,abominable,abridging,abruptly,absolutely,accept,accepted,...,yearend,yearn,yesterday,yield,young,yummier,yummy,zachary,zeroing,zone
Batman Scene 2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Batman Scene 3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Batman Scene 5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Batman Scene 6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Batman Scene 7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
The Dark Knight Rises Scene 302,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Dark Knight Rises Scene 303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Dark Knight Rises Scene 304,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Dark Knight Rises Scene 305,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
# create the nmf model
nmf_model_stop = NMF(n_components=10)
doc_word_stop = vectorizer_stop.fit_transform(df.dialogue_lemmatized)
doc_topic_stop = nmf_model_stop.fit_transform(doc_word_stop)



In [10]:
# create a dataframe with the model
df_nmf_stop = pd.DataFrame(nmf_model_stop.components_, columns=vectorizer_stop.get_feature_names_out())
df_nmf_stop

Unnamed: 0,abandoned,abducted,abduction,able,abominable,abridging,abruptly,absolutely,accept,accepted,...,yearend,yearn,yesterday,yield,young,yummier,yummy,zachary,zeroing,zone
0,0.01923,0.007724,0.001669,0.019875,0.003189,0.00264,0.031756,0.0,0.027006,0.005338,...,0.009907,0.006143,0.029334,0.007462,0.032865,0.0,0.009721,0.004993,0.0,0.010588
1,0.0,0.0,0.0,0.010265,0.0,0.00611,0.0,0.0,0.0,0.0,...,0.0,0.0,0.008022,0.0,0.0,0.004074,0.0,0.0,0.0,0.0
2,0.0,0.0,0.000381,0.0,0.000134,0.0,0.0,0.0,0.0,0.0,...,0.007882,0.0,0.0,0.0,0.000174,0.0,0.0,0.0,0.0,0.010871
3,0.0,0.001727,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.003316,0.0,0.0,0.0,0.0,0.004471,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,9.3e-05,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.000843,0.0,0.009493,7.4e-05,0.00305,0.0,0.0,0.0,0.005854,...,0.002743,0.0,0.0,0.006109,0.0,0.0,0.0,0.000312,0.012105,0.0
6,0.0,0.0,0.007865,0.0,0.0,0.0,0.0,0.016485,0.0,0.0,...,0.0,0.0,0.0,0.001042,0.0,0.0,0.0,0.000457,0.017676,0.0
7,0.0,0.0,0.0,0.005552,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002357,0.0,0.006869
8,0.024584,0.0,0.0,0.003957,0.0,0.002426,0.0,0.0,0.0,0.0,...,0.0,0.0,0.010061,0.0,0.019397,0.0025,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.02103,0.0,0.0,0.0,0.0,0.0,0.0,...,0.013029,0.0,0.0,0.0,0.012261,0.063051,0.0,0.0,0.0,0.0


In [11]:
# look at top 5 highest ranked words for each topic
for topic in range(df_nmf_stop.shape[0]):
    tmp = df_nmf_stop.iloc[topic]
    print(f'For topic {topic} the words with the highest value are:')
    print(tmp.nlargest(5))
    print('\n')

For topic 0 the words with the highest value are:
time    0.451149
make    0.439594
good    0.396692
day     0.354874
look    0.353639
Name: 0, dtype: float64


For topic 1 the words with the highest value are:
sorry      1.259587
friend     0.464614
trust      0.238830
listen     0.156708
brought    0.141079
Name: 1, dtype: float64


For topic 2 the words with the highest value are:
god         1.563906
charge      0.181627
thank       0.152078
eckhardt    0.143350
talk        0.139375
Name: 2, dtype: float64


For topic 3 the words with the highest value are:
master    1.454466
pushup    0.179590
point     0.171812
fall      0.163709
swing     0.127419
Name: 3, dtype: float64


For topic 4 the words with the highest value are:
joker     1.271441
guy       0.135376
thanks    0.114952
makeup    0.104763
mcu       0.099107
Name: 4, dtype: float64


For topic 5 the words with the highest value are:
die        0.790775
city       0.689439
stop       0.419654
control    0.206032
men       

The above topics make much more sense with stop words added, so we'll stick with this model for now and save the relevant DataFrames for later EDA.

In [12]:
# create a dataframe with the document-topic matrix
df_doc_topic = pd.DataFrame(doc_topic_stop)
df_doc_topic

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.042625,0.000000,0.133992,0.000000,0.000000,0.000000,0.003735,0.022431,0.000000,0.000000
1,0.007028,0.012812,0.000000,0.000000,0.000000,0.000000,0.002163,0.000000,0.000733,0.000000
2,0.048414,0.051317,0.000000,0.000000,0.000000,0.003908,0.000000,0.112453,0.004447,0.000000
3,0.062038,0.000000,0.000000,0.000000,0.001938,0.092610,0.000000,0.007242,0.000000,0.000000
4,0.062213,0.013685,0.014296,0.014837,0.001457,0.000000,0.007281,0.000000,0.000000,0.001799
...,...,...,...,...,...,...,...,...,...,...
820,0.000000,0.000000,0.000000,0.000000,0.001064,0.134341,0.000000,0.000000,0.000000,0.000000
821,0.008347,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.001674,0.004143,0.000687
822,0.032664,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.004287,0.000000,0.140773
823,0.024001,0.000000,0.000000,0.007807,0.000000,0.000000,0.000562,0.000051,0.000000,0.000000
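
A common next step, sketched here under assumptions, is to label each document with its highest-weighted topic. The small `toy_doc_topic` frame and its scene labels are hypothetical stand-ins for `df_doc_topic`.

```python
import pandas as pd

# hypothetical document-topic weights for three scenes and three topics
toy_doc_topic = pd.DataFrame(
    [[0.04, 0.00, 0.13],
     [0.01, 0.02, 0.00],
     [0.06, 0.00, 0.00]],
    index=['Scene A', 'Scene B', 'Scene C'],
)

summary = pd.DataFrame({
    'dominant_topic': toy_doc_topic.idxmax(axis=1),  # column label of the largest weight
    'weight': toy_doc_topic.max(axis=1),
})
print(summary)
```

A row of all zeros would still be assigned topic 0 by `idxmax`, so such rows may need to be filtered out first.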


In [13]:
# save both dataframes for later use
df_nmf_stop.to_pickle('df_nmf.pkl')
df_doc_topic.to_pickle('df_doc_topic.pkl')

### Look at Topics Between Directors
Now that we can see the topics characteristic of a Batman movie, let's explore what topics might be unique depending on who directed the film. We already have our DataFrame that contains information on the director, so we can create two new DataFrames and repeat the topic modeling process for each:
1. Split the data by director
2. Create a TF-IDF Vectorizer object
3. Create an NMF model

#### Split the Data

In [14]:
df_burton = df[df.director == 'Tim Burton']
df_nolan = df[df.director == 'Christopher Nolan']

#### Create TF-IDF Vectorizer for Both Corpora

In [15]:
# create a document-term matrix for Burton corpus

vect_burton = TfidfVectorizer(stop_words=stop_words, max_df=0.6)
X_burton = vect_burton.fit_transform(df_burton.dialogue_lemmatized)
dtm_burton = pd.DataFrame(X_burton.toarray(), columns=vect_burton.get_feature_names_out())
dtm_burton.index = df_burton.index
dtm_burton

Unnamed: 0,abandoned,abduction,abominable,abridging,abruptly,absolutely,access,accidentally,accomplished,acid,...,wretched,wrong,yawn,year,yearn,yesterday,young,yummier,yummy,zone
Batman Scene 2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.196382,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Batman Scene 3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Batman Scene 5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Batman Scene 6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Batman Scene 7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Batman Returns Scene 181,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Batman Returns Scene 183,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.128366,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Batman Returns Scene 185,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Batman Returns Scene 187,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
# repeat the process for the Nolan corpus

vect_nolan = TfidfVectorizer(stop_words=stop_words, max_df=0.6)
X_nolan = vect_nolan.fit_transform(df_nolan.dialogue_lemmatized)
dtm_nolan = pd.DataFrame(X_nolan.toarray(), columns=vect_nolan.get_feature_names_out())
dtm_nolan.index = df_nolan.index
dtm_nolan

Unnamed: 0,abandoned,abducted,able,accept,accepted,access,accessed,accident,accidentally,accomplice,...,written,wrong,wuertz,xrays,year,yearend,yesterday,yield,zachary,zeroing
Batman Begins Scene 1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Batman Begins Scene 2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Batman Begins Scene 4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Batman Begins Scene 6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Batman Begins Scene 8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
The Dark Knight Rises Scene 302,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Dark Knight Rises Scene 303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Dark Knight Rises Scene 304,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Dark Knight Rises Scene 305,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Create an NMF Model for Both Corpora

In [17]:
# create the nmf model for Burton movies
nmf_model_burton = NMF(n_components=10)
doc_word_burton = vect_burton.fit_transform(df_burton.dialogue_lemmatized)
doc_topic_burton = nmf_model_burton.fit_transform(doc_word_burton)

# create a dataframe
df_nmf_burton = pd.DataFrame(nmf_model_burton.components_, columns=vect_burton.get_feature_names_out())
df_nmf_burton



Unnamed: 0,abandoned,abduction,abominable,abridging,abruptly,absolutely,access,accidentally,accomplished,acid,...,wretched,wrong,yawn,year,yearn,yesterday,young,yummier,yummy,zone
0,0.05247,0.0,0.015729,0.02613,0.06166,0.0,0.02457,0.004578,0.0,0.0,...,0.0,0.134922,0.025704,0.006073,0.006073,0.070877,0.108912,0.0,0.019365,0.033009
1,0.0,0.002888,0.0,0.0,0.001799,0.0,0.0,0.0,0.0,0.04328,...,0.0,0.017859,0.0,0.000428,0.000428,0.0,0.0,0.150845,0.000707,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00183,0.0,0.0,0.0
3,0.0,0.016806,0.0,0.0,0.0,0.033659,0.0,0.000272,0.0,0.000847,...,0.0,0.0,0.0,0.0,0.0,0.000686,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007043
5,0.0,0.0,0.0,0.004241,0.0,0.0,0.000236,0.0,0.143915,0.002479,...,0.273552,0.0,0.010025,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006457,...,0.0,0.026698,0.000973,0.007867,0.007867,0.0,0.0,0.0,0.00111,0.0
7,0.0,0.0,0.0,0.003423,0.0,0.004562,0.0,0.0,0.0,0.026598,...,0.0,0.0,0.0,0.0,0.0,0.0,0.028997,0.0,0.0,0.009969
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.001006,0.0,0.0,0.0,0.0
9,0.0,0.001364,0.004584,0.0,0.00224,0.0,0.0,0.0,0.0,0.0,...,0.0,0.02381,0.0,0.0,0.0,0.045085,0.0,0.0,0.0,0.0


In [18]:
# look at top 10 highest ranked words for each topic
for topic in range(df_nmf_burton.shape[0]):
    tmp = df_nmf_burton.iloc[topic]
    print(f'For topic {topic} the words with the highest value are:')
    print(tmp.nlargest(10))
    print('\n')

For topic 0 the words with the highest value are:
time       0.453617
love       0.372329
man        0.369325
mayor      0.356023
penguin    0.331736
maybe      0.308018
oswald     0.304265
night      0.273867
make       0.269430
city       0.265795
Name: 0, dtype: float64


For topic 1 the words with the highest value are:
miss       0.633652
vale       0.509142
kitty      0.367147
arrived    0.313196
help       0.243761
vicki      0.180603
meeting    0.175758
message    0.154602
yummier    0.150845
late       0.127708
Name: 1, dtype: float64


For topic 2 the words with the highest value are:
shield      0.914404
open        0.586354
ignition    0.509489
people      0.198208
somebody    0.192370
car         0.150411
enjoy       0.075212
hungry      0.075212
wallet      0.070341
welcome     0.069363
Name: 2, dtype: float64


For topic 3 the words with the highest value are:
right           0.909596
kick            0.283989
great           0.119727
announcement    0.113614
member      

In [19]:
# repeat the process for Nolan movies
nmf_model_nolan = NMF(n_components=10)
doc_word_nolan = vect_nolan.fit_transform(df_nolan.dialogue_lemmatized)
doc_topic_nolan = nmf_model_nolan.fit_transform(doc_word_nolan)

# create a dataframe
df_nmf_nolan = pd.DataFrame(nmf_model_nolan.components_, columns=vect_nolan.get_feature_names_out())
df_nmf_nolan



Unnamed: 0,abandoned,abducted,able,accept,accepted,access,accessed,accident,accidentally,accomplice,...,written,wrong,wuertz,xrays,year,yearend,yesterday,yield,zachary,zeroing
0,0.020647,0.006102,0.048429,0.0,0.0,0.013855,0.012719,0.013871,0.010041,0.019174,...,0.011708,0.163663,0.021396,0.0,0.140276,0.025709,0.000384,0.007714,0.005865,0.0
1,0.0,0.001395,0.0,0.0,0.0,0.0,0.0,0.026337,0.0,0.026863,...,0.0,0.0,0.0,0.050606,0.0,0.001247,0.0,0.0,0.0,0.0
2,0.0,0.0,0.009724,0.0,0.0,0.011721,0.0,0.0,0.003217,0.0,...,0.0,0.0,0.0,0.0,0.008288,0.0,0.000494,0.0,0.001769,0.0
3,0.0,0.0,0.001974,0.0,0.0,0.0,0.0,0.001261,0.0,0.0,...,0.0,0.0,0.015527,0.0,0.0,0.012135,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.013496,0.0,0.000157,0.0,0.0,...,0.0,0.0,0.04666,0.0,0.0,0.0,0.000766,0.0,0.0,0.0
5,0.000479,0.001256,0.003148,0.001796,0.000463,0.008411,0.001611,0.0,0.0,0.0,...,0.0,0.0,0.009564,0.0,0.0,0.001824,0.003753,0.002684,0.0,0.029434
6,0.0,0.000137,0.0,0.065933,0.003948,0.009188,0.0,0.0,0.0,0.0,...,0.0,0.017969,0.000256,0.0,0.058757,0.0,0.0,0.005896,0.0,0.0
7,0.0,0.0,0.0,0.0,0.020281,0.0,0.0,0.0,0.0,0.0,...,5.8e-05,0.0,0.0,0.0,0.006732,0.0,0.004594,0.00193,0.0,0.0
8,0.0,0.0,0.0,0.0,0.000867,0.003698,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001522,0.0,0.005908
9,0.0,0.0,0.005911,0.0,0.0,0.0,0.0,0.0,0.0,0.000764,...,0.0,0.0,0.009494,0.0,0.0,0.0,0.005594,0.0,0.0,0.0


In [20]:
# look at top 10 highest ranked words for each topic
for topic in range(df_nmf_nolan.shape[0]):
    tmp = df_nmf_nolan.iloc[topic]
    print(f'For topic {topic} the words with the highest value are:')
    print(tmp.nlargest(10))
    print('\n')

For topic 0 the words with the highest value are:
time      0.476453
right     0.459313
people    0.377516
city      0.330305
day       0.311522
way       0.309943
friend    0.309742
thing     0.308653
look      0.301278
cop       0.293577
Name: 0, dtype: float64


For topic 1 the words with the highest value are:
master     1.290295
pushup     0.162815
point      0.149107
fall       0.140733
swing      0.111822
thing      0.099602
worry      0.096390
time       0.086738
feel       0.084680
prepare    0.075895
Name: 1, dtype: float64


For topic 2 the words with the highest value are:
sorry        0.897162
money        0.897028
friend       0.195329
wait         0.143091
wallet       0.123748
cab          0.113169
trust        0.112608
listen       0.106216
goddammit    0.098086
men          0.086892
Name: 2, dtype: float64


For topic 3 the words with the highest value are:
god        1.400050
thank      0.174254
talk       0.173810
wait       0.149859
help       0.130558
kidding    0

Again, these topics look fairly meaningful with the model set to 10 topics, along with the text pre-processing from earlier and the added stop words. The DataFrames for both corpora and the models for each director can be saved for later use in EDA.

In [21]:
# save both corpora and both topic-word DataFrames for later
df_burton.to_pickle('df_burton.pkl')
df_nolan.to_pickle('df_nolan.pkl')
df_nmf_burton.to_pickle('df_nmf_burton.pkl')
df_nmf_nolan.to_pickle('df_nmf_nolan.pkl')
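
In a later notebook, the saved files can be read back with `pd.read_pickle`, the same call used to load `df.pkl` earlier. The sketch below shows the round trip on a hypothetical frame written to a temporary directory; the frame contents and file name are invented for the example.

```python
import os
import tempfile

import pandas as pd

# hypothetical topic-word frame standing in for df_nmf_burton
toy_topics = pd.DataFrame({'penguin': [0.33, 0.0], 'vale': [0.0, 0.51]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_topics.pkl')
    toy_topics.to_pickle(path)
    restored = pd.read_pickle(path)

# check that the round trip preserved values, dtypes, and labels
print(restored.equals(toy_topics))
```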