## Automatic Learning of Key Phrases and Topics in Document Collections

## Part 3: Topic Modeling Training and Summarization

### Overview

This notebook is Part 3 of a 4-part series that gives a step-by-step description of how to process and analyze the contents of a large collection of text documents in an unsupervised manner. Using Python packages and custom code examples, we have implemented the basic framework that combines key phrase learning and latent topic modeling as described in the paper ["Modeling Multiword Phrases with Constrained Phrases Tree for Improved Topic Modeling of Conversational Speech"](http://people.csail.mit.edu/hazen/publications/Hazen-SLT-2012.pdf), which was originally presented at the 2012 IEEE Workshop on Spoken Language Technology.

Although the paper examines the use of the technology for analyzing human-to-human conversations, the techniques are quite general and can be applied to a wide range of natural language data including news stories, legal documents, research publications, social media forum discussions, customer feedback forms, product reviews, and many more.

Part 3 of the series shows how to train a topic model on a collection of text documents and how to use the topic model to summarize the contents of the corpus. The training is applied to text generated from the preprocessing and phrase learning stages presented in Parts 1 and 2.  


### Import Relevant Python Packages

Most significantly, Part 3 relies on the use of the [Gensim Python library](http://radimrehurek.com/gensim/)  for generating a sparse bag-of-words representation of each document and then training a [Latent Dirichlet Allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) model on the data. LDA produces a collection of latent topics learned in a completely unsupervised fashion from the text data. Each document can then be represented with a distribution of the learned topics.

In [1]:
# __future__ imports must come before any other statement in the cell
from __future__ import print_function

import gc
import math
import re
import sys
import time
from collections import namedtuple
from operator import itemgetter

import numpy
import pandas
from gensim import corpora, models

### Load Text Data

In [2]:
# Load full TSV file including a column of text
frame = pandas.read_csv("../Data/CongressionalDocsProcessed.tsv", sep='\t')

In [3]:
print ("Total docs in corpus: %d\n" % len(frame))

# Show the first five rows of the data in the frame
frame[0:5]

Total docs in corpus: 189088



| Unnamed: 0 | DocID | ProcessedText |
|---|---|---|
| 0 | hconres1-100 | provides_for_a_joint session_of_the_congress o... |
| 1 | hconres1-101 | salvadoran foreign_assistance reform resolutio... |
| 2 | hconres1-102 | supports the president's actions to defend sau... |
| 3 | hconres1-103 | declares that it is the sense_of_the_congress ... |
| 4 | hconres1-104 | recognizes the sacrifice of army chief warrant... |


### Load the Stop Word Lists
Latent topic models attempt to restrict the topic learning process to content-bearing words by excluding non-content-bearing <b><i>stop words</i></b>. Stop word lists are typically manually crafted and include common function words such as articles, conjunctions, prepositions, and pronouns.
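
As a rough illustration of what a stop list does, the sketch below filters a toy token sequence. The stop list and the sentence are made up for this example and are not taken from `function_words.txt`; `remove_stopwords` is a hypothetical helper, not part of this notebook's pipeline.

```python
# Hypothetical stop list; the real list is loaded from function_words.txt in the cells below
toy_stopwords = {"the", "of", "and", "to", "a"}

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stop list."""
    return [token for token in tokens if token not in stopwords]

toy_tokens = "the secretary of labor shall report to the congress".split()
content_tokens = remove_stopwords(toy_tokens, toy_stopwords)
print(content_tokens)
```

Only the content-bearing words survive, which is the input the topic model should see.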

In [4]:
# Define a function for loading lists into dictionary hash tables
def LoadListAsHash(filename):
    listHash = {}
    fp = open(filename)

    # Read in lines one by one stripping away extra spaces, 
    # leading spaces, and trailing spaces and inserting each
    # cleaned up line into a hash table
    re1 = re.compile(' +')
    re2 = re.compile('^ +| +$')
    for stringIn in fp.readlines():
        term = re2.sub("",re1.sub(" ",stringIn.strip('\n')))
        if term != '':
            listHash[term] = 1

    fp.close()
    return listHash 

In [5]:
# Load the stop-list of non-content bearing function words
stopwordHash = LoadListAsHash("../Data/function_words.txt")

# Additional words can also be manually added to the stop word list as needed
stopwordHash["foo"] = 1

### Load the Mapping of Lower-Cased Vocabulary Items to Their Most Common Surface Form

In [6]:
# Load surface form mappings here
fp = open("../Data/Vocab2SurfaceFormMapping.tsv")

vocabToSurfaceFormHash = {}

# Each line in the file has two tab separated fields;
# the first is the vocabulary item used during modeling
# and the second is its most common surface form in the 
# original data
for stringIn in fp.readlines():
    fields = stringIn.strip().split("\t")
    if len(fields) != 2:
        print ("Warning: Bad line in surface form mapping file: %s" % stringIn)
    elif fields[0] == "" or fields[1] == "":
        print ("Warning: Bad line in surface form mapping file: %s" % stringIn)
    else:
        vocabToSurfaceFormHash[fields[0]] = fields[1]
fp.close()


### Do Topic Modeling on Corpus using Latent Dirichlet Allocation (LDA)

#### Create the Vocabulary Used for Topic Modeling

In [7]:
def CreateVocabForTopicModeling(textData,stopwordHash):

    print ("Counting words")
    numDocs = len(textData) 
    globalWordCountHash = {} 
    globalDocCountHash = {} 
    for textLine in textData:
        docWordCountHash = {}
        for word in textLine.split():
            if word in globalWordCountHash:
                globalWordCountHash[word] += 1
            else:
                globalWordCountHash[word] = 1
            if word not in docWordCountHash: 
                docWordCountHash[word] = 1
                if word in globalDocCountHash:
                    globalDocCountHash[word] += 1
                else:
                    globalDocCountHash[word] = 1

    minWordCount = 5;
    minDocCount = 2;
    maxDocFreq = .25;
    vocabCount = 0;
    vocabHash = {}

    excStopword = 0
    excNonalphabetic = 0
    excMinwordcount = 0
    excNotindochash = 0
    excMindoccount = 0
    excMaxdocfreq =0

    print ("Building vocab")
    for word in globalWordCountHash.keys():
        # Test vocabulary exclusion criteria for each word
        if ( word in stopwordHash ):
            excStopword += 1
        elif ( not re.search(r'[a-zA-Z]', word, 0) ):
            excNonalphabetic += 1
        elif ( globalWordCountHash[word] < minWordCount ):
            excMinwordcount += 1
        elif ( word not in globalDocCountHash ):
            print ("Warning: Word '%s' not in doc count hash" % (word))
            excNotindochash += 1
        elif ( globalDocCountHash[word] < minDocCount ):
            excMindoccount += 1
        elif ( float(globalDocCountHash[word])/float(numDocs) > maxDocFreq ):
            excMaxdocfreq += 1
        else:
            # Add word to vocab
            vocabHash[word]= globalWordCountHash[word];
            vocabCount += 1 
    print ("Excluded %d stop words" % (excStopword))       
    print ("Excluded %d non-alphabetic words" % (excNonalphabetic))  
    print ("Excluded %d words below word count threshold" % (excMinwordcount)) 
    print ("Excluded %d words below doc count threshold" % (excMindoccount))
    print ("Excluded %d words above max doc frequency" % (excMaxdocfreq)) 
    print ("Final Vocab Size: %d words" % vocabCount)
            
    return vocabHash
                    
vocabHash = CreateVocabForTopicModeling(frame['ProcessedText'],stopwordHash)

Counting words
Building vocab
Excluded 293 stop words
Excluded 9337 non-alphabetic words
Excluded 64323 words below word count threshold
Excluded 285 words below doc count threshold
Excluded 1 words above max doc frequency
Final Vocab Size: 73672 words
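
The exclusion criteria are easier to follow on a corpus small enough to check by hand. The sketch below is a compact stand-in for `CreateVocabForTopicModeling`, not a replacement for it: the function name `build_vocab`, the toy documents, and the lowered thresholds are all assumptions chosen so that every rule fires at least once.

```python
from collections import Counter
import re

def build_vocab(docs, stopwords, min_word_count=2, min_doc_count=2, max_doc_freq=0.75):
    """Apply the same kinds of exclusion rules as above, with toy thresholds."""
    word_counts = Counter()   # total occurrences of each word
    doc_counts = Counter()    # number of documents containing each word
    for doc in docs:
        words = doc.split()
        word_counts.update(words)
        doc_counts.update(set(words))

    vocab = {}
    for word, count in word_counts.items():
        if word in stopwords:
            continue                                  # stop word
        if not re.search(r"[a-zA-Z]", word):
            continue                                  # no alphabetic character
        if count < min_word_count or doc_counts[word] < min_doc_count:
            continue                                  # too rare
        if doc_counts[word] / float(len(docs)) > max_doc_freq:
            continue                                  # appears in too many documents
        vocab[word] = count
    return vocab

toy_docs = [
    "amends the tax code 1986",
    "amends the tax credit for energy",
    "amends the energy policy 1986",
    "amends the trade policy",
]
toy_vocab = build_vocab(toy_docs, stopwords={"the"})
print(toy_vocab)
```

Here `amends` is dropped because it occurs in every document, `1986` because it has no letters, and the singletons because they fall below the count thresholds.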


In [8]:
# Show that the stop word "and" is not in the vocabulary
'and' in vocabHash

False

In [9]:
# Show a learned phrase is in the vocabulary
'department_of_labor' in vocabHash

True

In [10]:
# The vocabulary hash table contains the total count of the vocabulary item in the data set
vocabHash["department_of_labor"]

1139

In [11]:
# Print the 10 most frequent non-excluded words in the vocabulary
sorted(vocabHash.items(), key=lambda x: -x[1])[0:10]

[('requires', 91172),
 ('program', 69610),
 ('state', 66911),
 ('including', 56521),
 ('provides', 55203),
 ('certain', 53096),
 ('provide', 48731),
 ('programs', 41210),
 ('united_states', 40047),
 ('prohibits', 39121)]

#### Convert the Text Data Into a Sparse Vector Format

In [12]:
# Start by tokenizing the full text string for each document into a list of tokens
# Any token that is not in the pre-defined set of acceptable vocabulary words is excluded
def TokenizeText(textData,vocabHash):
    tokenizedText = []
        
    for textLine in textData:
        tokenizedText.append([token for token in textLine.split() if token in vocabHash])    
    return tokenizedText
    
tokenizedDocs = TokenizeText(frame['ProcessedText'], vocabHash)

In [13]:
# Examine the tokenization of the first two documents
tokenizedDocs[0:2]

[['provides_for_a_joint',
  'session_of_the_congress',
  'january_27',
  'message_from_the_president',
  'state_of_the_union'],
 ['salvadoran',
  'foreign_assistance',
  'reform',
  'resolution',
  'expresses_the_sense_of_the_congress',
  'u.s._foreign_assistance_program',
  'el_salvador',
  'revised',
  'promote',
  'negotiated_settlement',
  'reduction',
  'human',
  'suffering',
  'ratio',
  'assistance',
  'reversed',
  'fy',
  'spent',
  'war',
  'effort',
  'one-third',
  'spent',
  'reform',
  'development_activities',
  'assistance',
  'distributed',
  'manner',
  'promote',
  'interests',
  'particular',
  'political_party',
  'assistance',
  'distributed',
  'church-related',
  'nongovernmental_organizations',
  'international_organizations',
  'selected',
  'agency_for_international_development',
  'president',
  'report_quarterly',
  'congress',
  'restructuring',
  'assistance',
  'economic',
  'results',
  'restructuring',
  'reports',
  'corruption',
  'distribution']]

In [14]:
# Count the total number of vocabulary tokens used over the entire corpus 
numTokens = 0
for i in range(0,len(tokenizedDocs)):
    numTokens += len(tokenizedDocs[i])
print("Total number of retained tokens: %d" % numTokens)

Total number of retained tokens: 15903854


In [15]:
from gensim import corpora

# Create a dictionary mapping string tokens in the text to unique token IDs
dictionary = corpora.Dictionary(tokenizedDocs)

# If the reverse mapping for token ids back to string doesn't exist then create the mapping
# in the form of a list where the list index is the tokenID and the list value is the token
if len(dictionary.id2token) == 0:
    numTokens = len(dictionary.token2id);
    id2token = numTokens * [""];
    for token in dictionary.token2id:
        tokenID = dictionary.token2id[token]
        if tokenID < numTokens:
            id2token[tokenID] = token
        else: 
            print ("Warning: token id %d for token '%s' exceeds max index of %d" % (tokenID,token,numTokens-1))
    for i in range(0,numTokens):
        if id2token[i] == "":
            print ("Warning: token id %d has an empty token" % i)
else:
    id2token = dictionary.id2token
    

In [16]:
# The mapping from unique token ids to strings uses the id2token element of the dictionary
for i in range(0,10):
    print ("Token ID %d --> %s" % (i, id2token[i]))

# The mapping from strings to unique token ids uses the token2id element of the dictionary   
print ("")
print ("%s --> Token ID %d" % ('spent', dictionary.token2id['spent']))
print ("%s --> Token ID %d" % ('provides_for_a_joint', dictionary.token2id['provides_for_a_joint']))

Token ID 0 --> january_27
Token ID 1 --> state_of_the_union
Token ID 2 --> message_from_the_president
Token ID 3 --> session_of_the_congress
Token ID 4 --> provides_for_a_joint
Token ID 5 --> revised
Token ID 6 --> one-third
Token ID 7 --> spent
Token ID 8 --> economic
Token ID 9 --> manner

spent --> Token ID 7
provides_for_a_joint --> Token ID 4


In [17]:
# Create a Gensim corpus structure
corpus =[dictionary.doc2bow(tokens) for tokens in tokenizedDocs]

In [18]:
# Show that the corpus structure models the tokenized text as a sparse list of the tokens 
# in the document where each list item is represented by the unique ID for the token along 
# with the count of how often that token appeared in the document
print(tokenizedDocs[1])
print("spent:", dictionary.token2id['spent'])
print (corpus[1])

['salvadoran', 'foreign_assistance', 'reform', 'resolution', 'expresses_the_sense_of_the_congress', 'u.s._foreign_assistance_program', 'el_salvador', 'revised', 'promote', 'negotiated_settlement', 'reduction', 'human', 'suffering', 'ratio', 'assistance', 'reversed', 'fy', 'spent', 'war', 'effort', 'one-third', 'spent', 'reform', 'development_activities', 'assistance', 'distributed', 'manner', 'promote', 'interests', 'particular', 'political_party', 'assistance', 'distributed', 'church-related', 'nongovernmental_organizations', 'international_organizations', 'selected', 'agency_for_international_development', 'president', 'report_quarterly', 'congress', 'restructuring', 'assistance', 'economic', 'results', 'restructuring', 'reports', 'corruption', 'distribution']
spent: 7
[(5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 4), (20, 1), (21, 2), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1
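
To make the sparse format concrete, the sketch below mimics the shape of the `doc2bow` output using only the standard library. It is not Gensim's implementation; `toy_doc2bow`, the three-word dictionary, and the token list are hypothetical and only illustrate how repeated tokens collapse into `(token_id, count)` pairs.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for the tokens found in token2id."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())

# Hypothetical three-word dictionary; "war" is deliberately left out of it
toy_token2id = {"assistance": 0, "reform": 1, "spent": 2}
toy_tokens = ["reform", "assistance", "spent", "war", "spent", "assistance", "assistance"]

toy_bow = toy_doc2bow(toy_tokens, toy_token2id)
print(toy_bow)
```

Tokens outside the dictionary are silently skipped in this sketch, and each remaining token ID appears once with its count.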

#### Train an LDA Topic Model Using the Gensim Package 

In [19]:
import gensim
from gensim import models

In [20]:
# Set the number of topics to be learned to 200
numTopics=200

# Train LDA model 
if False:
    lda = gensim.models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=numTopics, passes=50)

In [21]:
# Saving and loading a trained LDA model
ldaFile = "../Data/CongressionalDocsLDA.pickle"

if False:
    #Save a trained LDA model
    lda.save(ldaFile)
    
else: 
    # Load a previously trained model
    lda = gensim.models.ldamodel.LdaModel.load(ldaFile)

#### Accessing the Contents of the LDA Model

In [22]:
# To see the accessible variables in the LDA model structure
lda.__dict__

{'__ignoreds': ['state', 'dispatcher'],
 '__numpys': ['expElogbeta'],
 '__recursive_saveloads': ['id2word'],
 '__scipys': [],
 'alpha': array([ 0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
 

To print out the internal help document for the LDA model class you can use the help() function.

In [23]:
# To print the help document for the LDA model class
help(lda)


Help on LdaModel in module gensim.models.ldamodel object:

class LdaModel(gensim.interfaces.TransformationABC)
 |  The constructor estimates Latent Dirichlet Allocation model parameters based
 |  on a training corpus:
 |  
 |  >>> lda = LdaModel(corpus, num_topics=10)
 |  
 |  You can then infer topic distributions on new, unseen documents, with
 |  
 |  >>> doc_lda = lda[doc_bow]
 |  
 |  The model can be updated (trained) with new documents via
 |  
 |  >>> lda.update(other_corpus)
 |  
 |  Model persistency is achieved through its `load`/`save` methods.
 |  
 |  Method resolution order:
 |      LdaModel
 |      gensim.interfaces.TransformationABC
 |      gensim.utils.SaveLoad
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  __getitem__(self, bow, eps=None)
 |      Return topic distribution for the given document `bow`, as a list of
 |      (topic_id, topic_probability) 2-tuples.
 |      
 |      Ignore topics with very low probability (below `eps`).
 |  
 |  __init

The print_topics(n) method of the LDA model object prints out a random sampling of n different learned topics as represented by the most likely terms in the topic's language model, i.e. the terms that maximize the topic language model P(term|topic).

In [24]:
lda.print_topics(10)

[(99,
  u'0.132*fund + 0.107*pay + 0.029*settlement + 0.028*resulting + 0.021*disclose + 0.020*requires + 0.020*losses + 0.019*tobacco + 0.017*result + 0.015*caused'),
 (36,
  u'0.279*senate + 0.068*time + 0.058*house + 0.038*house_of_representatives + 0.027*declaration + 0.021*designate + 0.020*interpretation + 0.018*cloture + 0.017*south + 0.015*east'),
 (137,
  u'0.106*secretary + 0.065*program + 0.050*technology + 0.033*activities + 0.020*research + 0.019*development + 0.017*conduct + 0.016*carry + 0.014*transfer + 0.013*identify'),
 (85,
  u'0.080*imposes + 0.078*article + 0.073*hours + 0.058*authorizing + 0.047*long-term_care + 0.046*awareness_week + 0.029*applying + 0.025*william + 0.022*hispanic + 0.022*leader'),
 (145,
  u'0.075*purchase + 0.039*program + 0.035*importation + 0.034*finance + 0.030*purchases + 0.027*commodities + 0.021*purchased + 0.018*crops + 0.018*donations + 0.017*courses'),
 (45,
  u'0.467*national + 0.154*day + 0.062*encourages + 0.026*americans + 0.025*ho

#### Infer the Document Probability Score P(topic|doc) using the LDA Model

In this section, each document from the corpus is passed into the LDA model which then infers the topic distribution for each document. The topic distributions are collected into a single numpy array.

In [25]:
# To retrieve all topics and their probabilities we must set the LDA minimum probability setting to zero
lda.minimum_probability = 0

# This function generates the topic probabilities for each doc from the trained LDA model
# The probabilities go into a single matrix where the rows are documents and columns are topics
def ExtractDocTopicProbsMatrix(corpus,lda):
    # Initialize the matrix
    docTopicProbs = numpy.zeros((len(corpus),lda.num_topics))
    for docID in range(0,len(corpus)):
        for topicProb in lda[corpus[docID]]:
            docTopicProbs[docID,topicProb[0]]=topicProb[1]
    return docTopicProbs

# docTopicProbs[docID,TopicID] --> P(topic|doc)
if False:
    docTopicProbs = ExtractDocTopicProbsMatrix(corpus,lda)
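
As the help text above notes, indexing the model with a bag-of-words document returns a list of `(topic_id, topic_probability)` pairs. The sketch below shows, for a single made-up document, how such a sparse list is spread into one dense row of the matrix. `densify_topic_row` and the numbers are hypothetical; no LDA model is involved.

```python
def densify_topic_row(sparse_topics, num_topics):
    """Spread sparse (topic_id, probability) pairs into a dense row of length num_topics."""
    row = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        row[topic_id] = prob
    return row

# Made-up inference result for one document in a 5-topic model
toy_sparse = [(1, 0.7), (4, 0.3)]
toy_row = densify_topic_row(toy_sparse, 5)
print(toy_row)
```

Each row is a distribution over topics, so a useful sanity check on the full matrix is that every row sums to approximately 1.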

In [26]:
if False:
    numpy.save("../Data/CongressionalDocTopicProbs.npy",docTopicProbs)
    
if True:
    docTopicProbs = numpy.load("../Data/CongressionalDocTopicProbs.npy")

#### Compute the Global Topic Likelihood Scores P(topic)

In [27]:
# Computing the global topic likelihoods by aggregating topic probabilities over all documents
# topicProbs[topicID] --> P(topic)
def ComputeTopicProbs(docTopicProbs):
    topicProbs = docTopicProbs.sum(axis=0) 
    topicProbs = topicProbs/sum(topicProbs)
    return topicProbs

topicProbs = ComputeTopicProbs(docTopicProbs)

#### Convert the Topic Language Model Information P(term|topic) from the LDA Model into a NumPy Representation

In [28]:
def ExtractTopicLMMatrix(lda):
    # Initialize the matrix: one row per topic, one column per vocabulary term
    topicTermProbs = numpy.zeros((lda.num_topics,lda.num_terms))
    for topicID in range(0,lda.num_topics):
        # Request all terms so that every column of the row is filled in
        termProbsList = lda.get_topic_terms(topicID,lda.num_terms)
        for termProb in termProbsList:
            topicTermProbs[topicID,termProb[0]]=termProb[1]
    return topicTermProbs
    
# topicTermProbs[topicID,termID] --> P(term|topic)
topicTermProbs = ExtractTopicLMMatrix(lda)

In [29]:
if False:
    numpy.save("../Data/CongressionalDocTopicLM.npy",topicTermProbs)

#### Compute P(topic,term), P(term), and P(topic|term)

In [30]:
# Compute the joint likelihoods of topics and terms
# jointTopicTermProbs[topicID,termID] --> P(topic,term) = P(term|topic)*P(topic)
jointTopicTermProbs = numpy.diag(topicProbs).dot(topicTermProbs) 

# termProbs[termID] --> P(term)
termProbs = jointTopicTermProbs.sum(axis=0)

# topicProbsPerTerm[topicID,termID] --> P(topic|term)
topicProbsPerTerm = jointTopicTermProbs / termProbs
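
The three quantities can be verified by hand on a tiny model. The sketch below uses plain Python lists so that each arithmetic step is visible; the two-topic, three-term probabilities are made up and the variable names are hypothetical.

```python
# Toy model with 2 topics and 3 terms (all numbers are made up)
toy_topic_probs = [0.6, 0.4]                      # P(topic)
toy_topic_term_probs = [[0.5, 0.3, 0.2],          # P(term|topic=0)
                        [0.1, 0.1, 0.8]]          # P(term|topic=1)

# P(topic,term) = P(term|topic) * P(topic)
toy_joint = [[p_topic * p for p in row]
             for p_topic, row in zip(toy_topic_probs, toy_topic_term_probs)]

# P(term) = sum over topics of P(topic,term)
toy_term_probs = [sum(column) for column in zip(*toy_joint)]

# P(topic|term) = P(topic,term) / P(term)
toy_topic_given_term = [[joint / p_term for joint, p_term in zip(row, toy_term_probs)]
                        for row in toy_joint]

print(toy_term_probs)
print(toy_topic_given_term)
```

The term marginals sum to 1, and for each term the posteriors over topics also sum to 1. In NumPy, the row scaling done with `numpy.diag(topicProbs).dot(topicTermProbs)` can equivalently be written as `topicProbs[:, numpy.newaxis] * topicTermProbs`, which avoids building the diagonal matrix.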

In [31]:
# Print most frequent words in LDA vocab
mostFrequentTermIDs = (-termProbs).argsort()
for i in range(0,25):
    print ("%d: %s --> %f" % (i+1, id2token[mostFrequentTermIDs[i]], termProbs[mostFrequentTermIDs[i]]))

1: requires --> 0.004203
2: effective --> 0.004191
3: amends --> 0.004003
4: benefit --> 0.003621
5: certain --> 0.002886
6: including --> 0.002823
7: program --> 0.002773
8: harmonized_tariff_schedule_of_the_united_states --> 0.002734
9: provide --> 0.002722
10: united_states --> 0.002686
11: july_7 --> 0.002637
12: december_31 --> 0.002452
13: all --> 0.002003
14: prohibits --> 0.001951
15: specified --> 0.001835
16: individual --> 0.001796
17: service --> 0.001726
18: authorizes --> 0.001716
19: programs --> 0.001673
20: respect --> 0.001658
21: states --> 0.001643
22: services --> 0.001610
23: congress --> 0.001530
24: federal --> 0.001420
25: senate --> 0.001419


#### Compute WPMI

To determine which vocabulary terms are most representative of a topic, systems typically just choose a set of terms that are most likely for the topic, i.e., terms that maximize the language model expression <i>P(term|topic)</i> for the given topic. This approach is adequate for many data sets. However, for some data sets there may be common words in the corpus that are frequent terms within multiple topics, and hence not a distinguishing term for any of these topics. In this case, selecting words which have the largest weighted pointwise mutual information (WPMI) with a given topic is more appropriate. 

The expression for WPMI between a term and a topic is given as:


$WPMI(term,topic) = P(term,topic)\log\frac{P(term,topic)}{P(term)P(topic)} = P(term,topic)\log\frac{P(topic|term)}{P(topic)}$
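
A small worked example may help show why WPMI demotes generic terms. In the sketch below, all probabilities are made up and `wpmi` is a hypothetical scalar helper: the "common" term is more likely under the topic than the "distinct" term, yet it scores lower because it is just as likely everywhere else.

```python
import math

def wpmi(p_joint, p_term, p_topic):
    """P(term,topic) * log( P(term,topic) / (P(term) * P(topic)) )."""
    if p_joint == 0.0:
        return 0.0  # assumed convention: 0 * log(0) is treated as 0
    return p_joint * math.log(p_joint / (p_term * p_topic))

p_topic = 0.5

# Common term: P(term|topic) = 0.10 / 0.5 = 0.20, the same as its overall P(term)
common = wpmi(p_joint=0.10, p_term=0.20, p_topic=p_topic)

# Distinctive term: P(term|topic) = 0.05 / 0.5 = 0.10, but overall P(term) is only 0.06
distinct = wpmi(p_joint=0.05, p_term=0.06, p_topic=p_topic)

print("common:   %f" % common)
print("distinct: %f" % distinct)
```

Ranking by P(term|topic) would put the common term first, while ranking by WPMI puts the distinctive term first.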


In [32]:
topicTermWPMI =(jointTopicTermProbs.transpose()*numpy.log(topicProbsPerTerm.transpose()/topicProbs)).transpose()
topicTermWPMI.shape

(200, 73672)

#### Compute Topic to Document Purity measure for Each Topic

One measure of the importance or quality of a topic is its topic-to-document purity measure. This purity measure assumes that latent topics that dominate the documents in which they appear are more semantically important than latent topics that are weakly spread across many documents. This concept was introduced in the paper ["Latent Topic Modeling for Audio Corpus Summarization"](http://people.csail.mit.edu/hazen/publications/Hazen-Interspeech11.pdf). The purity measure is expressed by the following equation:

$Purity(topic) = \exp\left (
                 \frac{\sum_{\forall doc}P(topic|doc)\log P(topic|doc)}{\sum_{\forall doc}P(topic|doc)}
                \right )$

In [33]:
topicPurity = numpy.exp(((docTopicProbs * numpy.log(docTopicProbs)).sum(axis=0))/(docTopicProbs).sum(axis=0))
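
The purity formula can be checked on two made-up topics that carry the same total weight but distribute it differently. The sketch below is a scalar, pure-Python version written for illustration; `topic_purity` is a hypothetical helper, and skipping zero entries is an assumed convention (0 log 0 = 0) rather than something the vectorized expression above does, since `0 * numpy.log(0)` evaluates to `nan`.

```python
import math

def topic_purity(doc_probs):
    """exp( sum_d p_d * log(p_d) / sum_d p_d ), skipping zero entries."""
    numerator = sum(p * math.log(p) for p in doc_probs if p > 0.0)
    denominator = sum(doc_probs)
    return math.exp(numerator / denominator)

# Made-up P(topic|doc) values over four documents, both summing to 1.8
concentrated = [0.9, 0.9, 0.0, 0.0]    # dominates the documents it appears in
diffuse = [0.45, 0.45, 0.45, 0.45]     # same total weight, spread thinly

purity_concentrated = topic_purity(concentrated)
purity_diffuse = topic_purity(diffuse)
print("concentrated: %.3f" % purity_concentrated)
print("diffuse:      %.3f" % purity_diffuse)
```

The concentrated topic has twice the purity of the diffuse one even though both have identical total weight in the corpus.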

#### Create Topic Summaries 

In [34]:
def CreateTermIDToSurfaceFormMapping(id2token,token2surfaceform):
    termIDToSurfaceFormMap = []
    for i in range(0,len(id2token)):
        termIDToSurfaceFormMap.append(token2surfaceform[id2token[i]])
    return termIDToSurfaceFormMap;

termIDToSurfaceFormMap = CreateTermIDToSurfaceFormMapping(id2token, vocabToSurfaceFormHash);

In [35]:
i = 100
print(id2token[i], termIDToSurfaceFormMap[i])

bobby Bobby


-----
In the code snippet below we demonstrate how the WPMI measure lowers the score of some common tokens that do not provide value in a topic summary, in comparison to the standard word likelihood measure P(token|topic). For topic 11 below, notice how the generic words <i>certain</i>, <i>require</i>, <i>respect</i> and <i>provide</i> have their position in the summaries lowered by the WPMI measure.

In [36]:
topicID = 11
highestWPMITermIDs = (-topicTermWPMI[topicID]).argsort()
highestProbTermIDs = (-topicTermProbs[topicID]).argsort()
print ("                                        WPMI                                                 Prob")
for i in range(0,25):
    print ("%2d: %35s ---> %8.6f    %35s ---> %8.6f" % (i+1, 
                                                        termIDToSurfaceFormMap[highestWPMITermIDs[i]], 
                                                        topicTermWPMI[topicID,highestWPMITermIDs[i]],
                                                        termIDToSurfaceFormMap[highestProbTermIDs[i]], 
                                                        topicTermProbs[topicID,highestProbTermIDs[i]]))                

                                        WPMI                                                 Prob
 1:                              Amends ---> 0.017044                                 Amends ---> 0.282902
 2:                            employer ---> 0.002314                               employer ---> 0.040485
 3:                             section ---> 0.001934                                section ---> 0.032095
 4:                                   B ---> 0.001661                                      B ---> 0.027566
 5:            Authorizes the President ---> 0.001564               Authorizes the President ---> 0.025961
 6:                              Alaska ---> 0.001473                                 Alaska ---> 0.024452
 7:                              revise ---> 0.001079                                certain ---> 0.022691
 8:                       Miscellaneous ---> 0.000923                                 revise ---> 0.019850
 9:                             Extends ---> 0

In [37]:
def CreateTopicSummaries(topicTermScores, id2token, tokenid2surfaceform, maxStringLen):
    topicSummaries = []
    for topicID in range(0,len(topicTermScores)):
        rankedTermIDs = (-topicTermScores[topicID]).argsort()
        maxNumTerms = len(rankedTermIDs)
        termIndex = 0
        stop = 0
        outputTokens = []
        topicSummary = ""
        while not stop:
            # If we've run out of tokens then stop...
            if (termIndex>=maxNumTerms):
                stop=1
            # ...otherwise consider adding next token to summary
            else:
                nextToken = id2token[rankedTermIDs[termIndex]]
                nextTokenOut = tokenid2surfaceform[rankedTermIDs[termIndex]]
                keepToken = 1
                # See if we should ignore this token
                if len(outputTokens) > 0:
                    for prevToken in outputTokens:
                        # Ignore token if it is a substring of a previous token
                        if nextToken in prevToken:
                            keepToken = 0
                            break
                        # Ignore token if it is a superstring of a previous token
                        elif prevToken in nextToken:
                            keepToken = 0
                            break
                if keepToken:
                    # Always add at least one token to the summary
                    if len(topicSummary) == 0:
                        topicSummary = nextTokenOut
                        outputTokens.append(nextToken)
                    # Add additional tokens if the summary string length is less than maxStringLen
                    elif ( len(topicSummary) + len(nextTokenOut) + 2 < maxStringLen):
                        topicSummary += ", " + nextTokenOut
                        outputTokens.append(nextToken)
                    else:
                        stop=1
                termIndex += 1         
        topicSummaries.append(topicSummary)
    return topicSummaries   
    
topicSummaries = CreateTopicSummaries(topicTermWPMI, id2token, termIDToSurfaceFormMap, 85)
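
The overlap rule inside `CreateTopicSummaries` is the part most likely to surprise, so it is isolated below. This is a simplified sketch that keeps only the substring and superstring test and ignores the length budget; `drop_overlapping` and the ranked token list are hypothetical.

```python
def drop_overlapping(ranked_tokens):
    """Keep a token only if it neither contains nor is contained in an earlier kept token."""
    kept = []
    for token in ranked_tokens:
        if any(token in prev or prev in token for prev in kept):
            continue
        kept.append(token)
    return kept

# Made-up ranking, best-scoring token first
toy_ranked = ["social_security", "social_security_act", "medicare", "security", "physician"]
toy_kept = drop_overlapping(toy_ranked)
print(toy_kept)
```

Because the test is a plain character-level substring match, an unrelated short token such as `act` would also be dropped if `contract` had already been kept.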

In [38]:
# Rank the topics by their prominence score in the corpus
# The topic score combines the total weight of each topic in the corpus 
# with the topic-to-document purity score for that topic 
# Topics with topicScore > 1 are generally very strong topics

topicScore =  (numTopics * topicProbs) * (2 * topicPurity)
topicRanking = (-topicScore).argsort()
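
One way to read the score is against a baseline: a topic with exactly average weight (1/numTopics) and a purity of 0.5 scores 1. The sketch below, with the hypothetical helper `topic_score`, checks that baseline and then recomputes the score of topic 180 from the rounded probability (2.031%) and purity (0.282) shown in this notebook's ranking output.

```python
def topic_score(num_topics, topic_prob, purity):
    """Same arithmetic as topicScore above, for a single topic."""
    return (num_topics * topic_prob) * (2 * purity)

# Baseline: average weight in a 200-topic model and a purity of 0.5
baseline = topic_score(200, 1.0 / 200, 0.5)

# Topic 180, using the rounded values printed in the ranking
top_topic = topic_score(200, 0.02031, 0.282)

print("baseline:  %.3f" % baseline)
print("topic 180: %.3f" % top_topic)
```

The recomputed value differs from the printed score only in the last digits because the inputs here are rounded.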

#### Save LDA Topic Summaries

In [39]:
print ("     ID  Score  Prob  Purity  Summary")
for i in range(0,numTopics):
    topicID= topicRanking[i]
    print ("%3d %3d %6.3f (%5.3f, %4.3f) %s" 
           % (i, topicID, topicScore[topicID], 100*topicProbs[topicID], topicPurity[topicID], topicSummaries[topicID]))

     ID  Score  Prob  Purity  Summary
  0 180  2.293 (2.031, 0.282) Harmonized Tariff Schedule of the United States, December 31, pilot program
  1 135  0.699 (1.228, 0.142) interest, deduction, Pine, Amends the Internal Revenue Code, distributions
  2   1  0.688 (0.741, 0.232) regard, minor, Postal Service, United States for permanent, lawfully admitted
  3  11  0.491 (1.415, 0.087) Amends, employer, section, B, Authorizes the President, Alaska, revise
  4 178  0.467 (1.311, 0.089) insurance, percent, Trust Fund, SSA, accounts, increase, amounts, adjustment
  5  45  0.408 (0.844, 0.121) effective, day, Encourages, Americans, honor, Kansas, celebration
  6 170  0.400 (0.800, 0.125) covered, State, Applies, formula, Social Security, Old Age
  7 111  0.387 (0.803, 0.120) Medicare, Social Security Act, physician, services, individuals, Medicaid
  8  90  0.339 (0.933, 0.091) Recognizes, people, Calls, Urges, efforts, human rights, supports, political
  9  67  0.326 (0.835, 0.098) emissions

In [40]:
if False:
    fp = open("../Data/CongressionalDocTopicSummaries.tsv", "w")
    i = 0
    fp.write("TopicID\tTopicSummary\n")
    for line in topicSummaries:
        fp.write("%d\t%s\n" % (i, line))
        i += 1
    fp.close()