# Dimensionality reduction with LDA

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sekhansen/text_algorithms_econ/blob/main/notebooks/3_LDA.ipynb)


This notebook introduces how to use the [```lda```](https://lda.readthedocs.io/en/latest/autoapi/lda/index.html) Python library for estimating Latent Dirichlet Allocation using the collapsed Gibbs sampling algorithm of Griffiths and Steyvers (2004).

To illustrate LDA, the tutorial uses text data from State of the Union Addresses at the paragraph level. These are available for download from [this link](http://www.presidency.ucsb.edu/sou.php).

In [128]:
# install required libraries
!pip3 install flashtext                  # easy phrase replacing methods
!pip3 install contractions               # expand English contractions
!pip3 install --upgrade spacy==2.2.4     # functions for lemmatizing
!pip3 install gdown                      # download files from Google Drive
!pip3 install nltk                       # NLP library
!pip3 install lda

# install Spacy's language model
# for more languages and models check: https://spacy.io/models
!python3 -m spacy download en_core_web_sm

In [2]:
# TODO: Eventually the optimal way to bring our custom modules is to clone our
# public repository here

# (COLAB) clone the GitHub repository
!git clone https://github.com/yabramuvdi/pymodules-test.git
#!git clone https://github.com/sekhansen/text_algorithms_econ

Cloning into 'pymodules-test'...
remote: Enumerating objects: 30, done.[K
remote: Counting objects: 100% (30/30), done.[K
remote: Compressing objects: 100% (19/19), done.[K
remote: Total 30 (delta 8), reused 22 (delta 5), pack-reused 0[K
Unpacking objects: 100% (30/30), done.


In [3]:
# define dictionary with paths to data and Python scripts in Google Drive
urls_dict = {"speech_data_extend":             ("https://drive.google.com/uc?id=1O8wy8vbOY8TXYze27b_dAkB9UOq8BNy_", "txt"),
            }

In [4]:
# download all files
import gdown
for file_name, attributes in urls_dict.items():
    url = attributes[0]
    extension = attributes[1]
    gdown.download(url, f"./{file_name}.{extension}", quiet=False) 

Downloading...
From: https://drive.google.com/uc?id=1O8wy8vbOY8TXYze27b_dAkB9UOq8BNy_
To: /content/speech_data_extend.txt
100%|██████████| 12.1M/12.1M [00:00<00:00, 49.5MB/s]


In [41]:
# import required libraries
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sys
import string
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from gensim.models.ldamodel import LdaModel
import lda

import spacy
nlp_standard = spacy.load('en_core_web_sm')

# TODO: UPDATE THIS LATER 
# import our own modules
sys.path.append('./pymodules-test/pymodules')
import preprocessing_class as pc

To begin, we read in the data, specifying the encoding of the text data.

In [6]:
data = pd.read_table("speech_data_extend.txt", encoding="utf-8")
data

Unnamed: 0,president,speech,year
0,Washington,Fellow-Citizens of the Senate and House of Rep...,1790
1,Washington,I embrace with great satisfaction the opportun...,1790
2,Washington,In resuming your consultations for the general...,1790
3,Washington,Among the many interesting objects which will ...,1790
4,Washington,"A free people ought not only to be armed, but ...",1790
...,...,...,...
23150,Obama,Let me tell you about one of those families I'...,2014
23151,Obama,"A few months later, on his 10th deployment, Co...",2014
23152,Obama,"Even now, Cory is still blind in one eye, stil...",2014
23153,Obama,"My fellow Americans, men and women like Cory r...",2014


For the tutorial we focus on State of the Union addresses made since the television era, which began in 1947.

In [7]:
data = data[data.year >= 1947]
data.reset_index(drop=True, inplace=True)
len(data) # The number of documents (paragraphs of State of the Union Addresses) in the dataset

9488

# Cleaning Raw Text Data

Before implementing a topic model, it is important to pre-process the data. To do this, we will use a custom-built preprocessing Python library. For a more detailed discussion on preprocessing check the *2_preprocessing.ipynb* notebook within this same repository.

We will start by creating a function that applies all the preprocessing steps in the correct order. We should provide the documents that we want to preprocess along side some key parameters. This function will return a document-term matrix.


In [8]:
def apply_preprocessing(data, item_type, stopwords_type, replacing_dict, pattern, punctuation):
    """ Function to apply the steps from the preprocessing class in the correct
        order to generate a term frequency matrix
    """

    # initialize the class with the text data and some parameters
    prep = pc.RawDocs(data, stopwords=stopwords_type, lower_case=True, contraction_split=True, tokenization_pattern=pattern)
    # replace some specific phrases of interest
    prep.phrase_replace(replace_dict=replacing_dict, case_sensitive_replacing=False)
    # lower-case text, expand contractions and initialize stopwords list
    prep.basic_cleaning()
    # split the documents into tokens
    prep.tokenize_text()
    # clean tokens
    prep.token_clean(length=2, punctuation=punctuation, numbers=True)
    # remove stopwords
    prep.stopword_remove("tokens")
    
    if item_type == "lemmas":
        prep.lemmatize()
    elif item_type == "stems":
        prep.stem()

    # create document-term matrix
    prep.dt_matrix_create(items=item_type, min_df=10, score_type='df')
    # get the vocabulary and the appropriate dictionaries to map from indices to words
    word2idx = prep.vocabulary[item_type]
    idx2word = {i:word for word,i in word2idx.items()}
    vocab = list(word2idx.keys())
    
    return prep, word2idx, idx2word, vocab

In [9]:
# define tokenization pattern and punctuation symbols
pattern = r'''
          (?x)                # set flag to allow verbose regexps (to separate logical sections of pattern and add comments)
          \w+(?:-\w+)*        # word characters with internal hyphens
          | [][.,;"'?():-_`]  # preserve punctuation as separate tokens
          '''
punctuation = string.punctuation.replace("-", "")

In [10]:
# use preprocessing class
prep, word2idx, idx2word, vocab = apply_preprocessing(data["speech"],       # our documents
                                                      "stems",              # tokens, stems or lemmas
                                                      "long",               # long or short
                                                      {},                   # dictionary with expressions we want to preserver 
                                                      pattern,              # tokenization pattern
                                                      punctuation           # string with punctuation symbols to remove
                                                      )

In [11]:
# inspect a particular tokenized document and compare it to its original form
i = 10
print(data["speech"][i])
print("\n ------------------------------- \n")
print(prep.stems[i])

As the year 1947 begins, the state of our national economy presents great opportunities for all. We have virtually full employment. Our national production of goods and services is 50 percent higher than in any year prior to the war emergency. The national income in 1946 was higher than in any peacetime year. Our food production is greater than it has ever been. During the last 5 years our productive facilities have been expanded in almost every field. The American standard of living is higher now than ever before, and when the housing shortage can be overcome it will be even higher. 

 ------------------------------- 

['year', 'begin', 'state', 'nation', 'economi', 'present', 'great', 'opportun', 'virtual', 'full', 'employ', 'nation', 'product', 'good', 'servic', 'percent', 'higher', 'year', 'prior', 'war', 'emerg', 'nation', 'incom', 'higher', 'peacetim', 'year', 'food', 'product', 'greater', 'last', 'year', 'product', 'facil', 'expand', 'almost', 'field', 'american', 'standard', 'l

Notice that the outcome of stemming need not be an English word.  These stems are the data on which we will run the topic model below.  We make an additional call to remove stopwords from stems, since the stemmed forms of tokens not in the stopword list may themselves be in the stopword list.

The final step in pre-processing is to drop remaining words that are not useful for identifying content.  We have already dropped standard stopwords, but there may also be data-dependent common words.  For example, in data from Supreme Court proceedings, "justice" might be treated as a stopword.  Also, words that appear just once or twice in the collection are not informative of content either.  Ideally, one would like a measure of informativeness that both punishes common words in the data, and rare words.  One such option is to give each stem a tf-idf (term frequency - inverse document frequency) score.  This is standard in the language processing literature, so we omit details here.

In [12]:
# create document-term matrix using stems
prep.dt_matrix_create(items='stems', min_df=10, score_type='df')

In [13]:
dt_matrix_stems = prep.df_matrix["stems"].astype(int)
dt_matrix_stems.shape

(9488, 2541)

After pre-processing, we have 2,541 unique stems We now proceed to estimate a topic model on them.

# Estimating a Topic Model

The first step in estimation is to initialize a model using the LDA class.  We will pass the list of stems we just created as the set of documents.

There are three main parameters in LDA, the number of topics, and the two hyperparameters of the Dirichlet priors.  We will follow the advice of Griffiths and Steyvers (2004) and set the hyperparameter of the Dirichlet prior on topics to $200/V$, where $V$ is the number of unique vocabulary elements, and the hyperparameter of the Dirichlet prior on document-topic distributions to $50/K$, where $K$ is the number of topics.

In [98]:
# create LDA object with our parameters
num_topics = 30
vocab_size = dt_matrix_stems.shape[1]
lda_model = lda.LDA(n_topics=num_topics,         # Number of topics
                    alpha=50/num_topics,         # Dirichlet parameter for distribution over topics
                    eta=200/vocab_size,          # Dirichlet parameter for distribution over words
                    n_iter=4000,                 # Number of sampling iterations
                    random_state=92              # The generator used for the initial topics
                    )

lda_model

<lda.lda.LDA at 0x7ff718198f10>

In [99]:
# check some paramters
print(lda_model.alpha)
print(lda_model.eta)
print(lda_model.n_topics)

1.6666666666666667
0.07870916961826052
30


In [100]:
# estimate model (takes a couple of minutes)
lda_model.fit(dt_matrix_stems)



<lda.lda.LDA at 0x7ff718198f10>

The estimated topics are represented by a $30 \times 2541$ matrix whose rows sum to one, while the estimated distributions of topics within each document are represented by $9488 \times 30$ matrices whose rows sum to one as well.  To get an idea of the topics that have been estimated, and whether they make sense, we will explore the stems with the highest probability for each topic. It's a good idea to check the topics are "reasonable" before proceeding with any analysis.

In [101]:
# topics: probability distribution over the vocabulary (num_topics x vocab_size)
lda_model.topic_word_.shape

(30, 2541)

In [102]:
# explore the words with highest probability per topic
topics_dist = lda_model.topic_word_
n_top_words = 10
for i, topic_dist in enumerate(topics_dist):
     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
     print('Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: dream electr nutrit minist dictatorship implic overwhelm profession leas
Topic 1: deserv satisfi analysi allevi involv recov shop fruit incent
Topic 2: schedul mandat scarc represent exercis proceed lowest sacrific invas
Topic 3: minist electr born unnecessari alli fail child end issu
Topic 4: rank result heritag math bring teacher like racial coupl
Topic 5: princip light deepest stimulus distribut earth terrorist basic conflict
Topic 6: giant hard-working strength convers petroleum truli advanc strain conclud
Topic 7: slow elimin manner lobbi led rank vietnames agricultur finish
Topic 8: discrimin expir dead scene conscienc clean protect compet leas
Topic 9: conting offens narcot appoint educ year mechan easier superior
Topic 10: invas instant narrow overhaul terrorist tuition hardship touch embrac
Topic 11: coven abl terrorist outer observ energi premium interrupt state
Topic 12: agenda fair reduct afford summer compass chao famin price
Topic 13: freez method stimulus assur 

Most economics researchers will probably be most interested initially in the distributions of topics within each document.

In [103]:
# matrix of proportion of topics per document (num_doc x num_topics)
lda_model.doc_topic_.shape

(9488, 30)

One might also want to replace the speech field in the original dataset with the estimated topics in order to have a ready-to-go dataset for regression or other econometric analysis.  The following code builds this dataset, and also writes it to file.

In [104]:
# add the topic proportions to each document
for i in range(lda_model.doc_topic_.shape[1]): data['T' + str(i)] = lda_model.doc_topic_[:,i]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [105]:
# explore our new data
data.head(2)

Unnamed: 0,president,speech,year,T0,T1,T2,T3,T4,T5,T6,...,T21,T22,T23,T24,T25,T26,T27,T28,T29,speech_stems
0,Truman,"Mr. President, Mr. Speaker, Members of the Con...",1947,0.065476,0.029762,0.029762,0.029762,0.029762,0.029762,0.029762,...,0.029762,0.029762,0.029762,0.029762,0.029762,0.047619,0.029762,0.029762,0.029762,presid speaker member congress unit state
1,Truman,It looks like a good many of you have moved ov...,1947,0.030303,0.030303,0.030303,0.030303,0.030303,0.048485,0.030303,...,0.030303,0.030303,0.030303,0.048485,0.030303,0.030303,0.048485,0.030303,0.030303,look good move left last


In [106]:
# save data
data.to_csv("final_output.csv", index=False)

# Querying Using Estimated Topics

After estimating a topic model, one is often interested in estimating the distribution of topics for documents not included in estimation.  In this case, one option is to $\textit{query}$ those documents by holding fixed the topics estimated from LDA, and only estimating the distribution of topics for the out-of-sample documents.  The ```lda``` library also provides a way to do this through the ```transform()``` function.

We will apply querying to the corpus of entire State of the Union Addresses since 1947 (recall that we estimated topics on the level of the paragraph within each speech).  In terms of estimating topics, the paragraph level is preferable to the speech level since individual paragraphs are more likely to be based around a single theme.  But, in terms of econometric work, the entire speech is a more natural unit of analysis.  At the same time, there is no general way of "adding up" probability distribution at the paragraph level in order to arrive at a speech-level distribution.  Hence the need for querying, which allows us to estimate the speech-level distributions.

In [107]:
# # aggregate up to the speech level
# agg_speeches = data.groupby(['year','president'], as_index = False)['speech'].apply(lambda x: ' '.join(x)) 
# agg_speeches

# aggregate up to the speech level using the stemmed version of the speech
# replace the speech field in the original data with its cleaned version from docsobj
data['speech_stems'] = [' '.join(s) for s in prep.stems] 
agg_speeches = data.groupby(['year','president'], as_index = False)['speech_stems'].apply(lambda x: ' '.join(x)) 
agg_speeches

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys


Unnamed: 0,year,president,speech_stems
0,1947,Truman,presid speaker member congress unit state look...
1,1948,Truman,presid speaker member congress today consid st...
2,1949,Truman,presid speaker member congress happi report co...
3,1950,Truman,presid speaker member congress year ago report...
4,1951,Truman,presid speaker member congress congress face g...
...,...,...,...
66,2010,Obama,speaker vice presid member congress distinguis...
67,2011,Obama,speaker vice presid member congress distinguis...
68,2012,Obama,speaker vice presid member congress distinguis...
69,2013,Obama,pleas everybodi speaker vice presid member con...


In [108]:
# simple auxiliary function to override the preprocessing done by sklearn
def do_nothing(doc):
    return doc

In [109]:
count_vectorizer = CountVectorizer(preprocessor=do_nothing,              # apply no additional preprocessing
                                   #tokenizer=pattern,                    # use same tokenization pattern
                                   lowercase=False,                      # convert all characters to lower-case
                                   strip_accents=None,                   # remove non-ascii characters
                                   stop_words=None,                      # remove stopwords from a built-in list. We can also provide our own list
                                   ngram_range=(1, 1),                   # generate only unigrams
                                   analyzer='word',                      # analysis at the word-level
                                   vocabulary=word2idx                   # impose same vocabulary
                                   )                       
count_vectorizer

CountVectorizer(lowercase=False,
                preprocessor=<function do_nothing at 0x7ff70b87f3b0>,
                vocabulary={'-percent': 0, '-year': 1, '-year-old': 2,
                            'abandon': 3, 'abil': 4, 'abl': 5, 'abraham': 6,
                            'abroad': 7, 'absenc': 8, 'absolut': 9,
                            'absorb': 10, 'abund': 11, 'abus': 12,
                            'acceler': 13, 'accept': 14, 'access': 15,
                            'accid': 16, 'accommod': 17, 'accompani': 18,
                            'accomplish': 19, 'accord': 20, 'account': 21,
                            'accumul': 22, 'accur': 23, 'achiev': 24,
                            'acknowledg': 25, 'acquir': 26, 'acr': 27,
                            'across': 28, 'act': 29, ...})

In [110]:
# create the document term matrix using stems
dt_matrix_stems_agg = count_vectorizer.fit_transform(agg_speeches["speech_stems"].values).toarray()
print(dt_matrix_stems_agg.shape)
vocab_agg = count_vectorizer.vocabulary_

(71, 2541)


In [111]:
len(word2idx), len(vocab_agg)

(2541, 2541)

In [112]:
vocab_agg["loss"], word2idx["loss"]

(1353, 1353)

In [113]:
# # use preprocessing class to clean text
# prep_agg, word2idx_agg, idx2word_agg, vocab_agg = apply_preprocessing(agg_speeches["speech"],
#                                                                       "stems",
#                                                                       "long",
#                                                                       {}, 
#                                                                       pattern, 
#                                                                       punctuation)

In [114]:
# # inspect a particular tokenized document and compare it to its original form
# i = 10
# print(agg_speeches.loc[i, "speech"])
# print("\n ------------------------------- \n")
# print(prep_agg.stems[i])

In [115]:
# # create document-term matrix using stems
# prep_agg.dt_matrix_create(items='stems')

# dt_matrix_stems_agg = prep_agg.df_matrix["stems"].astype(int)
# dt_matrix_stems_agg.shape

Before continuing, suppose that we instead wanted to query a document whose constitutent parts had not been included in estimation, for example a State of the Union Address from the 1930s.  How to proceed?  First, create a RawDocs object with the text to be queried (recall that RawDocs can take a basic text file, which each new line treated as a separate documents).  Second, perform the same cleaning steps as were done for the documents that went into the estimated model.  However, there is no need to do any stopword removal.  When you initialize a Query object, tokens in the documents to be queried that are not present in the estimated model are automatically stripped out.

Since we don't need to estimate topics when querying, we can use far fewer iterations.  Let's start with 10.

In [116]:
# transform documents into their topic proportions
topic_proportions = lda_model.transform(dt_matrix_stems_agg,
                                        max_iter=20
                                        )
topic_proportions.shape



(71, 30)

Finally, we follow similar steps as for LDA to output the estimated distribution of topics for entire speeches.

In [117]:
# add the topic proportions to each document
for i in range(topic_proportions.shape[1]): agg_speeches['T' + str(i)] = topic_proportions[:,i]

In [118]:
# explore new data
agg_speeches.head(5)

Unnamed: 0,year,president,speech_stems,T0,T1,T2,T3,T4,T5,T6,...,T20,T21,T22,T23,T24,T25,T26,T27,T28,T29
0,1947,Truman,presid speaker member congress unit state look...,0.030033,0.047402,0.035108,0.037038,0.035825,0.006205,0.011201,...,0.034261,0.018875,0.052881,0.022378,0.014876,0.063992,0.036384,0.029237,0.025814,0.036483
1,1948,Truman,presid speaker member congress today consid st...,0.039979,0.026865,0.026731,0.004013,0.02783,0.018406,0.014934,...,0.052044,0.014041,0.077509,0.038354,0.010191,0.05175,0.016826,0.02958,0.021698,0.037097
2,1949,Truman,presid speaker member congress happi report co...,0.043467,0.013939,0.03246,0.01965,0.030004,0.02366,0.004843,...,0.042597,0.004129,0.063795,0.046154,0.011091,0.071829,0.035553,0.014831,0.040067,0.051456
3,1950,Truman,presid speaker member congress year ago report...,0.028925,0.021794,0.031504,0.014452,0.043081,0.017516,0.016719,...,0.059671,0.009722,0.06334,0.052878,0.01178,0.043817,0.030803,0.035901,0.03068,0.039167
4,1951,Truman,presid speaker member congress congress face g...,0.01856,0.016232,0.020382,0.001069,0.049226,0.018618,0.072563,...,0.044753,0.043258,0.022279,0.020399,0.0484,0.015769,0.014014,0.008185,0.03474,0.045952


In [119]:
# save data
agg_speeches.to_csv("final_output_agg.csv", index=False)

# Assessing Output

At this point, you can use all of the csv files this tutorial has generated with your statistical software of choice (should this not be Python!) to analyze the topics.  Before finishing, though, we can perform an initial test of whether our output makes sense intuitively.  The following code determines each President's top topics, as measured in terms of deviations from the sample average.

In [120]:
temp = agg_speeches.reset_index(drop=True).copy()
ranking = temp.set_index('president')
ranking = ranking - ranking.mean()
ranking = ranking.groupby(level='president').mean()
ranking

  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0_level_0,T0,T1,T10,T11,T12,T13,T14,T15,T16,T17,...,T28,T29,T3,T4,T5,T6,T7,T8,T9,year
president,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
BushI,-0.008091,-0.003979,-0.009666,0.002509,-0.002207,0.007335,0.029179,0.012763,0.00135,0.027664,...,-0.008927,0.005983,-0.007612,0.003029,0.012909,0.00019,-0.009414,-0.001215,-0.011233,10.65493
BushII,-0.004971,-0.008153,0.012847,0.004566,-0.005615,-0.005463,0.023276,-0.004132,0.013968,0.007171,...,-0.002246,-0.000869,-0.013647,-0.010571,0.006981,-0.000107,-0.005518,-0.0118,-0.019569,24.65493
Carter,0.000921,0.007424,0.003671,0.007321,-0.013839,-0.001022,-0.034781,-0.029911,0.002218,-0.031218,...,0.0209,-0.015295,0.052141,-0.022238,-0.025594,0.024158,-0.004596,0.009803,0.002719,-0.34507
Clinton,-0.010539,-0.011225,0.00492,-0.004295,0.004643,0.014715,0.013352,-0.020828,0.017525,0.006589,...,-0.013549,0.007763,-0.015968,0.00518,0.05906,-0.009559,-0.011386,-0.018566,-0.022268,16.65493
Eisenhower,0.007433,0.013662,0.003462,-0.009199,-0.006761,-0.014055,-0.028947,0.018989,-0.005673,-0.020769,...,0.004095,0.003139,0.018297,-0.007204,-0.024612,-0.000234,0.003894,0.030208,0.0372,-22.84507
Ford,0.021275,0.004322,-0.009627,-0.000513,0.024272,-0.010666,-0.000396,-0.010946,-0.009942,-0.005019,...,0.039078,-0.004112,-0.001176,0.002573,-0.033719,0.006054,-0.004118,0.011445,0.006154,-3.84507
JohnsonII,-0.006634,-0.004149,-0.012593,0.004648,0.00149,-0.006802,0.009067,-0.005349,-0.004013,-0.001632,...,-0.0147,0.006012,-0.003089,0.002574,-0.007542,0.009866,0.00851,-0.00869,0.010763,-13.34507
Kennedy,-0.004872,0.002744,-0.008335,-0.00925,0.007097,-0.007019,-0.005995,0.026924,-0.00338,-0.015977,...,0.002378,0.00182,-0.00339,0.003871,-0.012587,0.02333,-0.001036,0.019365,0.005188,-17.84507
Nixon,0.027599,0.018656,-0.007016,-0.002848,-0.009059,-0.013223,-0.021208,-0.01607,-0.006159,0.010088,...,0.000807,-0.006079,0.010156,-0.000216,-0.019188,-0.016708,0.013678,-0.018758,0.011294,-7.84507
Obama,-0.009156,-0.004847,-0.000908,0.009998,0.004112,0.04983,0.005487,-0.027488,0.006213,0.012525,...,0.009783,-0.009574,-0.010601,0.013469,0.032429,-0.007735,-0.007073,-0.0191,-0.022951,31.65493


In [121]:
# sort presidents chronologically
ranking = ranking.sort_values('year')
ranking = ranking.drop('year', 1)
ranking

  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0_level_0,T0,T1,T10,T11,T12,T13,T14,T15,T16,T17,...,T27,T28,T29,T3,T4,T5,T6,T7,T8,T9
president,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Truman,-0.000866,0.000568,0.00577,-0.008878,-0.024171,-0.012927,-0.020605,0.047035,-0.005608,-0.018205,...,-0.007634,0.002262,0.006059,-0.00952,0.005501,-0.02541,0.002888,0.015876,0.017241,0.012355
Eisenhower,0.007433,0.013662,0.003462,-0.009199,-0.006761,-0.014055,-0.028947,0.018989,-0.005673,-0.020769,...,-0.009628,0.004095,0.003139,0.018297,-0.007204,-0.024612,-0.000234,0.003894,0.030208,0.0372
Kennedy,-0.004872,0.002744,-0.008335,-0.00925,0.007097,-0.007019,-0.005995,0.026924,-0.00338,-0.015977,...,-0.000876,0.002378,0.00182,-0.00339,0.003871,-0.012587,0.02333,-0.001036,0.019365,0.005188
JohnsonII,-0.006634,-0.004149,-0.012593,0.004648,0.00149,-0.006802,0.009067,-0.005349,-0.004013,-0.001632,...,0.003786,-0.0147,0.006012,-0.003089,0.002574,-0.007542,0.009866,0.00851,-0.00869,0.010763
Nixon,0.027599,0.018656,-0.007016,-0.002848,-0.009059,-0.013223,-0.021208,-0.01607,-0.006159,0.010088,...,0.004931,0.000807,-0.006079,0.010156,-0.000216,-0.019188,-0.016708,0.013678,-0.018758,0.011294
Ford,0.021275,0.004322,-0.009627,-0.000513,0.024272,-0.010666,-0.000396,-0.010946,-0.009942,-0.005019,...,0.002284,0.039078,-0.004112,-0.001176,0.002573,-0.033719,0.006054,-0.004118,0.011445,0.006154
Carter,0.000921,0.007424,0.003671,0.007321,-0.013839,-0.001022,-0.034781,-0.029911,0.002218,-0.031218,...,0.002087,0.0209,-0.015295,0.052141,-0.022238,-0.025594,0.024158,-0.004596,0.009803,0.002719
Reagan,-6.8e-05,-0.005774,-0.002468,0.007388,0.027447,-0.002659,0.021505,-0.000304,-0.014794,0.02071,...,0.006634,-0.009141,-0.003741,0.000728,0.003973,0.000916,-0.006945,-0.002056,-0.001986,-0.008737
BushI,-0.008091,-0.003979,-0.009666,0.002509,-0.002207,0.007335,0.029179,0.012763,0.00135,0.027664,...,-0.006092,-0.008927,0.005983,-0.007612,0.003029,0.012909,0.00019,-0.009414,-0.001215,-0.011233
Clinton,-0.010539,-0.011225,0.00492,-0.004295,0.004643,0.014715,0.013352,-0.020828,0.017525,0.006589,...,0.008487,-0.013549,0.007763,-0.015968,0.00518,0.05906,-0.009559,-0.011386,-0.018566,-0.022268


In [122]:
def top_topics(x, num_top, col_names):
    top_idxs = x.values.argsort()[-num_top:][::-1]
    top = [int(col_names[i].replace("T", "")) for i in top_idxs]
    return pd.Series(top, index=range(1,num_top+1))

In [123]:
# find the most prevalent topics for each president
ranking_top = ranking.apply(top_topics, num_top=10, col_names=ranking.columns, axis=1)
ranking_top

Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10
president,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Truman,15,8,20,7,22,25,9,18,29,10
Eisenhower,9,8,20,15,3,25,1,2,18,0
Kennedy,15,25,6,8,22,20,12,9,4,1
JohnsonII,23,9,6,26,14,7,20,2,21,29
Nixon,19,0,2,1,7,9,3,17,23,26
Ford,28,12,0,22,8,9,6,1,2,4
Carter,3,2,20,26,6,28,25,19,8,18
Reagan,12,14,17,24,22,11,27,4,25,5
BushI,24,14,17,5,15,13,29,4,11,16
Clinton,5,24,16,13,14,27,29,17,4,10


In [124]:
# save data
ranking_top.to_csv("president_top_topics.csv")

In [125]:
# take a look at the top words for Bush II (Geroge W.) and Obama
n_top_words = 15
president = "BushII"
for topic in ranking_top.loc[president].values:
    topic_dist = topics_dist[topic]
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
    print('{} Topic {}: {}'.format(president, topic, ' '.join(topic_words)))

BushII Topic 21: barrel reflect justic posit financi giant tonight trap ship channel gentlemen sit trigger bridg
BushII Topic 14: investor lincoln open forget hostil excit outlook impact giant punish modest spirit wall seri
BushII Topic 16: tariff mere custom threat represent bracket earth thought child unnecessari featur glad stimulus patient
BushII Topic 10: invas instant narrow overhaul terrorist tuition hardship touch embrac arrang chronic emiss represent machineri
BushII Topic 23: thought fought corrupt hazard sound crippl describ trust ton comput distribut construct pocket talk
BushII Topic 17: spirit shop noth depress soil preserv accompani machineri destroy tempt trust ceil bodi earmark
BushII Topic 5: princip light deepest stimulus distribut earth terrorist basic conflict cuba thorough often extremist dealt
BushII Topic 18: gratifi stagger offset hurt ton prospect quota appreci entrepreneur employe came stand tough vital
BushII Topic 11: coven abl terrorist outer observ energi

In [127]:
# take a look at the top words for Bush II (Geroge W.) and Obama
n_top_words = 10
president = "Kennedy"
for topic in ranking_top.loc[president].values:
    topic_dist = topics_dist[topic]
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
    print('{} Topic {}: {}'.format(president, topic, ' '.join(topic_words)))

Kennedy Topic 15: folk hard-working qualiti decent yield brighter avoid clean abroad
Kennedy Topic 25: rose coverag bureaucraci wonder patienc curb offic term wisdom
Kennedy Topic 6: giant hard-working strength convers petroleum truli advanc strain conclud
Kennedy Topic 8: discrimin expir dead scene conscienc clean protect compet leas
Kennedy Topic 22: regard consumpt ratif twice white intoler meantim stabl pennsylvania
Kennedy Topic 20: qualiti grave abl ratif pray sacrific ban giant earliest
Kennedy Topic 12: agenda fair reduct afford summer compass chao famin price
Kennedy Topic 9: conting offens narcot appoint educ year mechan easier superior
Kennedy Topic 4: rank result heritag math bring teacher like racial coupl
Kennedy Topic 1: deserv satisfi analysi allevi involv recov shop fruit incent


For this particular topic model, for example, George W. Bush's top topic contains words relating to military force, and Obama's employment and economic activity.  The topic model you estimate will of course vary, so I encourage you to open president_top_policy_topics.csv and topic_description.csv to have a look for yourself.  Note too that some topics probably relate to policy, while some others relate to pure rhetoric.  Depending on the nature of the analysis you want to do with the data, it may make sense to restrict attention to some subset of the estimated topics.

That's all for now, I hope you enjoyed the tutorial, and begin to use topic modelling in your own work!