## Automatic Learning of Key Phrases and Topics in Document Collections

## Part 2: Phrase Learning

### Overview

This notebook is Part 2 of a four-part series that describes, step by step, how to process and analyze the contents of a large collection of text documents in an unsupervised manner. Using Python packages and custom code examples, we have implemented the basic framework that combines key phrase learning and latent topic modeling as described in the paper ["Modeling Multiword Phrases with Constrained Phrases Tree for Improved Topic Modeling of Conversational Speech"](http://people.csail.mit.edu/hazen/publications/Hazen-SLT-2012.pdf), which was originally presented at the 2012 IEEE Workshop on Spoken Language Technology.

Although the paper examines the use of the technology for analyzing human-to-human conversations, the techniques are quite general and can be applied to a wide range of natural language data including news stories, legal documents, research publications, social media forum discussions, customer feedback forms, product reviews, and many more.

Part 2 of the series shows how to learn the most salient phrases present in a large collection of documents. These phrases can then be treated as single compound word units in downstream processes such as topic modeling.

### Import Relevant Python Packages

In [1]:
# __future__ imports must come before any other statement in the cell
from __future__ import print_function
import pandas
import re
import math
from operator import itemgetter
from collections import namedtuple
import time
import gc
import sys

### Load Text Data

In [2]:
textFrame = pandas.read_csv('../Data/CongressionalDocsCleaned.tsv', sep='\t')

In [3]:
print ("Total lines in cleaned text: %d\n" % len(textFrame))

# Show the first 25 rows of the data in the frame
textFrame[0:25]

Total lines in cleaned text: 4050598



Unnamed: 0,DocID,DocLine,CleanedText
0,hconres1-100,0,Provides for a joint session of the Congress o...
1,hconres1-100,1,1987
2,hconres1-100,2,for a message from the President on the State ...
3,hconres1-101,0,Salvadoran Foreign Assistance Reform Resolution
4,hconres1-101,1,Expresses the sense of the Congress that
5,hconres1-101,2,the U.S. foreign assistance program for El Sal...
6,hconres1-101,3,the ratio of assistance should be reversed in ...
7,hconres1-101,4,such assistance should not be distributed in a...
8,hconres1-101,5,such assistance should be distributed through ...
9,hconres1-101,6,and


### Create Lowercase Version of the Text Data

Before learning phrases we lowercase the entire text corpus to ensure all casing variants for each word are collapsed into a single uniform variant used during the learning process. 

In [4]:
# Create a lowercased version of the data and add it into the data frame
lowercaseText = []
for textLine in textFrame['CleanedText']:
    lowercaseText.append(str(textLine).lower())
textFrame['LowercaseText'] = lowercaseText

textFrame[0:25]

Unnamed: 0,DocID,DocLine,CleanedText,LowercaseText
0,hconres1-100,0,Provides for a joint session of the Congress o...,provides for a joint session of the congress o...
1,hconres1-100,1,1987,1987
2,hconres1-100,2,for a message from the President on the State ...,for a message from the president on the state ...
3,hconres1-101,0,Salvadoran Foreign Assistance Reform Resolution,salvadoran foreign assistance reform resolution
4,hconres1-101,1,Expresses the sense of the Congress that,expresses the sense of the congress that
5,hconres1-101,2,the U.S. foreign assistance program for El Sal...,the u.s. foreign assistance program for el sal...
6,hconres1-101,3,the ratio of assistance should be reversed in ...,the ratio of assistance should be reversed in ...
7,hconres1-101,4,such assistance should not be distributed in a...,such assistance should not be distributed in a...
8,hconres1-101,5,such assistance should be distributed through ...,such assistance should be distributed through ...
9,hconres1-101,6,and,and
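
As a small illustration of why the lowercasing step matters, the sketch below counts word forms in a few made-up lines before and after lowercasing. The sample lines and the `count_words` helper are assumptions for illustration only; they are not part of the notebook's data or code.

```python
from collections import Counter

def count_words(lines):
    """Count whitespace-separated tokens across a list of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Hypothetical sample lines (not taken from the data set)
sample_lines = ["The Congress finds that", "the congress shall report", "THE CONGRESS"]

raw_counts = count_words(sample_lines)
lower_counts = count_words([line.lower() for line in sample_lines])

# Before lowercasing, each casing variant is a separate entry
variants = sorted(w for w in raw_counts if w.lower() == "congress")
print(variants)

# After lowercasing, all variants share one entry and one count
print(lower_counts["congress"])
```

With the variants collapsed, the n-gram statistics gathered later are pooled over a single form of each word rather than split across casings.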


### Load the Supplemental Word Lists

Words in the black list are completely ignored by the process and cannot be used in the creation of phrases. Words in the function word list can only be used in between content words in the creation of phrases.

In [5]:
# Define a function for loading lists into dictionary hash tables
def LoadListAsHash(filename):
    listHash = {}
    fp = open(filename)

    # Read in lines one by one stripping away extra spaces, 
    # leading spaces, and trailing spaces and inserting each
    # cleaned up line into a hash table
    re1 = re.compile(' +')
    re2 = re.compile('^ +| +$')
    for stringIn in fp.readlines():
        term = re2.sub("",re1.sub(" ",stringIn.strip('\n')))
        if term != '':
            listHash[term] = 1

    fp.close()
    return listHash 

In [6]:
# Load the black list of words to ignore 
blacklistHash = LoadListAsHash('../Data/black_list.txt')

# Load the list of non-content bearing function words
functionwordHash = LoadListAsHash('../Data/function_words.txt')

# Add more terms to the function word list
functionwordHash["foo"] = 1
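
The loader above can also be sketched with a context manager and a `set`, which may be convenient if only membership tests are needed. This is a hedged alternative, not the notebook's implementation: `load_word_set` and the temporary file contents are hypothetical, and unlike `LoadListAsHash` it collapses any whitespace (including tabs), not just spaces.

```python
import os
import tempfile

def load_word_set(filename):
    """Read one term per line, collapse runs of whitespace, skip blank lines."""
    terms = set()
    with open(filename) as fp:
        for line in fp:
            term = " ".join(line.split())
            if term:
                terms.add(term)
    return terms

# Write a tiny hypothetical word list to a temporary file and load it back
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("  of \nthe\n\nin   order  to\n")
    tmp_path = tmp.name

function_words = load_word_set(tmp_path)
os.remove(tmp_path)
print(sorted(function_words))
```

Because the rest of the notebook only uses `in` tests against these tables, a set like this would behave the same way as the dictionary returned by `LoadListAsHash` in those checks.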

### Compute N-gram Statistics for Phrase Learning

In [7]:
# This is Step 1 for each iteration of phrase learning
# We count the number of occurrences of all 2-gram, 3-gram, and 4-gram
# word sequences
def ComputeNgramStats(textData,functionwordHash,blacklistHash):
    
    # Create an array to store the total count of all ngrams up to 4-grams
    # Array element 0 is unused, element 1 is unigrams, element 2 is bigrams, etc.
    ngramCounts = [0]*5
       
    # Create a list of structures to tabulate ngram count statistics
    # Array element 0 is the array of total ngram counts,
    # Array element 1 is a hash table of individual unigram counts
    # Array element 2 is a hash table of individual bigram counts
    # Array element 3 is a hash table of individual trigram counts
    # Array element 4 is a hash table of individual 4-gram counts
    ngramStats = [ngramCounts, {}, {}, {}, {}]
          
    # Create a regular expression for assessing validity of words
    # for phrase modeling. Because it is applied with match(), which
    # anchors at the start of the word, words in phrases must either:
    # (1) begin with an alphabetic character, or 
    # (2) be the single character '&', or
    # (3) be a one or two digit number
    reWordIsValid = re.compile(r'[A-Za-z]|^&$|^\d\d?$')
    
    # Go through the text data line by line collecting count statistics
    # for all valid n-grams that could appear in a potential phrase
    numLines = len(textData)
    for i in range(0,numLines):

        # Split the text line into an array of words
        wordArray = textData[i].split()
        numWords = len(wordArray)
        
        # Create an array marking each word as valid or invalid
        validArray = []
        for word in wordArray:
            validArray.append(reWordIsValid.match(word) is not None)
            
        # Tabulate total raw ngrams for this line into counts for each ngram bin
        # The total ngrams counts include the counts of all ngrams including those
        # that we won't consider as parts of phrases
        for j in range(1,5):
            if j<=numWords:
                ngramCounts[j] += numWords - j + 1 
        
        # Collect counts for viable phrase ngrams and left context sub-phrases
        for j in range(0,numWords):
            word = wordArray[j]

            # Only bother counting the ngrams that start with a valid content word
            # i.e., valid words not in the function word list or the black list
            if ( ( word not in functionwordHash ) and ( word not in blacklistHash ) and validArray[j] ):

                # Initialize ngram string with first content word and add it to unigram counts
                ngramSeq = word 
                if ngramSeq in ngramStats[1]:
                    ngramStats[1][ngramSeq] += 1
                else:
                    ngramStats[1][ngramSeq] = 1

                # Count valid ngrams from bigrams up to 4-grams
                stop = 0
                k = 1
                while (k<4) and (j+k<numWords) and not stop:
                    n = k + 1
                    nextNgramWord = wordArray[j+k]
                    # Only count ngrams with valid words not in the blacklist
                    if ( validArray[j+k] and nextNgramWord not in blacklistHash ):
                        ngramSeq += " " + nextNgramWord
                        if ngramSeq in ngramStats[n]:
                            ngramStats[n][ngramSeq] += 1
                        else:
                            ngramStats[n][ngramSeq] = 1 
                        k += 1
                        if nextNgramWord not in functionwordHash:
                            # Stop counting new ngrams after second content word in 
                            # ngram is reached and ngram is a viable full phrase
                            stop = 1
                    else:
                        stop = 1
    return ngramStats
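
To make the counting rules concrete, the following sketch lists the n-grams that would be tabulated for a single line. It is a simplified illustration rather than a replacement for `ComputeNgramStats`: the helper name `candidate_ngrams`, the example sentence, and the tiny function word list are all assumptions.

```python
import re

# Same validity pattern as the notebook, applied with match()
VALID_WORD = re.compile(r'[A-Za-z]|^&$|^\d\d?$')

def candidate_ngrams(line, function_words, black_list, max_n=4):
    """List the n-grams that the counting loop would tabulate for one line."""
    words = line.split()
    valid = [VALID_WORD.match(w) is not None for w in words]
    found = []
    for j, word in enumerate(words):
        # N-grams must start with a valid content word
        if word in function_words or word in black_list or not valid[j]:
            continue
        ngram = word
        found.append(ngram)
        k = 1
        while k < max_n and j + k < len(words):
            next_word = words[j + k]
            # An invalid or black-listed word ends the n-gram
            if not valid[j + k] or next_word in black_list:
                break
            ngram += " " + next_word
            found.append(ngram)
            k += 1
            # Reaching a second content word completes a viable phrase
            if next_word not in function_words:
                break
    return found

# Hypothetical inputs for illustration
ngrams = candidate_ngrams("secretary of the interior shall report",
                          function_words={"of", "the", "shall"},
                          black_list=set())
for ngram in ngrams:
    print(ngram)
```

Note that intermediate sequences ending in a function word, such as `secretary of`, are still counted. They are never scored as phrases themselves, but they supply the left-context counts needed when scoring the longer phrase.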

### Rank Potential Phrases by the Weighted Pointwise Mutual Information of their Constituent Words

In [8]:
def RankNgrams(ngramStats,functionwordHash,minCount):
    # Create a hash table to store weighted pointwise mutual 
    # information scores for each viable phrase
    ngramWPMIHash = {}
        
    # Go through each of the ngram tables and compute the phrase scores
    # for the viable phrases
    for n in range(2,5):
        i = n-1
        for ngram in ngramStats[n].keys():
            ngramCount = ngramStats[n][ngram]
            if ngramCount >= minCount:
                wordArray = ngram.split()
                # If the final word in the ngram is not a function word then
                # the ngram is a valid phrase candidate we want to score
                if wordArray[i] not in functionwordHash: 
                    leftNgram = wordArray[0]
                    for j in range(1,i):
                        leftNgram += ' ' + wordArray[j]
                    rightWord = wordArray[i]
                    
                    # Compute the weighted pointwise mutual information (WPMI) for the phrase
                    probNgram = float(ngramStats[n][ngram])/float(ngramStats[0][n])
                    probLeftNgram = float(ngramStats[n-1][leftNgram])/float(ngramStats[0][n-1])
                    probRightWord = float(ngramStats[1][rightWord])/float(ngramStats[0][1])
                    WPMI = probNgram * math.log(probNgram/(probLeftNgram*probRightWord))

                    # Add the phrase into the list of scored phrases only if WPMI is positive
                    if WPMI > 0:
                        ngramWPMIHash[ngram] = WPMI  
    
    # Create a sorted list of the phrase candidates
    rankedNgrams = sorted(ngramWPMIHash, key=ngramWPMIHash.__getitem__, reverse=True)

    # Force a memory clean-up
    ngramWPMIHash = None
    gc.collect()

    return rankedNgrams
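
The score computed in `RankNgrams` is the n-gram probability multiplied by the log of the ratio between that probability and the product of the left sub-phrase and right word probabilities. The sketch below isolates that calculation so it can be tried on small numbers. The function name `wpmi` and all of the counts are hypothetical values chosen for illustration.

```python
import math

def wpmi(ngram_count, ngram_total, left_count, left_total, right_count, unigram_total):
    """Weighted pointwise mutual information of a left sub-phrase and a right word."""
    p_ngram = ngram_count / float(ngram_total)
    p_left = left_count / float(left_total)
    p_right = right_count / float(unigram_total)
    return p_ngram * math.log(p_ngram / (p_left * p_right))

# Hypothetical counts: the pair co-occurs far more often than chance predicts
strong = wpmi(50, 1000, 60, 1100, 55, 1100)

# Hypothetical counts: the pair co-occurs about as often as chance predicts
chance = wpmi(10, 1000, 100, 1000, 100, 1000)

# Hypothetical counts: the pair co-occurs less often than chance predicts
weak = wpmi(1, 1000, 100, 1000, 100, 1000)

print(strong > 0, abs(chance) < 1e-9, weak < 0)
```

Weighting the log ratio by the n-gram probability favors phrases that are both strongly associated and frequent, and the `WPMI > 0` filter in `RankNgrams` drops candidates like the last case.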

### Apply Phrase Rewrites to Text Data

In [9]:
def ApplyPhraseRewrites(rankedNgrams,textData,learnedPhrases,                 
                        maxPhrasesToAdd,maxPhraseLength,verbose):

    # This function will consider at most maxRewrite 
    # new phrases to be added into the learned phrase 
    # list as specified by the calling function
    maxRewrite=maxPhrasesToAdd

    # If the remaining number of proposed ngram phrases is less 
    # than the max allowed, then reset maxRewrite to the size of 
    # the proposed ngram phrases list
    numNgrams = len(rankedNgrams)
    if numNgrams < maxRewrite:
        maxRewrite = numNgrams
    
    # Create empty hash tables to keep track of phrase overlap conflicts
    leftConflictHash = {}
    rightConflictHash = {}
    
    # Create an empty hash table collecting the set of rewrite rules
    # to be applied during this iteration of phrase learning
    ngramRewriteHash = {}
    
    # Precompile the regex for finding spaces in ngram phrases
    regexSpace = re.compile(' ')

    # Initialize some bookkeeping variables
    numLines = len(textData)
    numPhrasesAdded = 0
    numConsidered = 0
    lastSkippedNgram = ""
    lastAddedNgram = ""
  
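    # Collect up to maxRewrite ngram phrase rewrites
    # (skip the loop entirely when there is nothing to consider, which
    # also avoids indexing into an empty ranked ngram list)
    stop = (maxRewrite <= 0)
    index = 0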
    while not stop:

        # Get the next phrase to consider adding to the phrase list
        inputNgram = rankedNgrams[index]

        # Create the output compound word version of the phrase
        # The extra space is added to make the regex rewrite easier
        outputNgram = " " + regexSpace.sub("_",inputNgram)

        # Count the total number of words in the proposed phrase
        numWords = len(outputNgram.split("_"))

        # Only add phrases that don't exceed the max phrase length
        if (numWords <= maxPhraseLength):
    
            # Keep count of phrases considered for inclusion during this iteration
            numConsidered += 1

            # Extract the left and right words in the phrase to use
            # in checks for phrase overlap conflicts
            ngramArray = inputNgram.split()
            leftWord = ngramArray[0]
            rightWord = ngramArray[len(ngramArray)-1]

            # Skip any ngram phrases that conflict with earlier phrases added
            # These ngram phrases will be reconsidered in the next iteration
            if (leftWord in leftConflictHash) or (rightWord in rightConflictHash): 
                if verbose: 
                    print ("(%d) Skipping (context conflict): %s" % (numConsidered,inputNgram))
                lastSkippedNgram = inputNgram
                
            # If no conflict exists then add this phrase into the list of phrase rewrites     
            else: 
                if verbose:
                    print ("(%d) Adding: %s" % (numConsidered,inputNgram))
                ngramRewriteHash[" " + inputNgram] = outputNgram
                learnedPhrases.append(inputNgram) 
                lastAddedNgram = inputNgram
                numPhrasesAdded += 1
            
            # Keep track of all context words that might conflict with upcoming
            # proposed phrases (even when phrases are skipped instead of added)
            leftConflictHash[rightWord] = 1
            rightConflictHash[leftWord] = 1

            # Stop when we've considered the maximum number of phrases per iteration
            if ( numConsidered >= maxRewrite ):
                stop = True
            
        # Increment to next phrase
        index += 1
    
        # Stop if we've reached the end of the ranked ngram list
        if index >= len(rankedNgrams):
            stop = True

    # Now do the phrase rewrites over the entire set of text data
    if numPhrasesAdded == 1:
        # If only one phrase to add use a single regex rule to do this phrase rewrite        
        inputNgram = " " + lastAddedNgram
        outputNgram = ngramRewriteHash[inputNgram]
        regexNgram = re.compile (r'%s(?= )' % re.escape(inputNgram)) 
        # Apply the regex over the full data set
        for j in range(0,numLines):
            textData[j] = regexNgram.sub(outputNgram, textData[j])
    elif numPhrasesAdded > 1:
        # Compile a single regex rule from the collected set of phrase rewrites for this iteration
        # The alternatives are wrapped in a non-capturing group so that the trailing-space
        # lookahead applies to every phrase and not just to the last alternative
        ngramRegex = re.compile(r'(?:%s)(?= )' % "|".join(map(re.escape, ngramRewriteHash.keys())))
        # Apply the regex over the full data set
        for i in range(0,len(textData)):
            # The regex substitution looks up the output string rewrite  
            # in the hash table for each matched input phrase regex
            textData[i] = ngramRegex.sub(lambda mo: ngramRewriteHash[mo.group(0)], textData[i])
      
    return
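
Two ideas in `ApplyPhraseRewrites` can be tried in isolation: skipping phrases whose edge words overlap with a higher-ranked phrase, and rewriting space-padded text with a single compiled regex. The sketch below is a minimal illustration under assumed inputs. The helper names `select_non_conflicting` and `rewrite_line` and the example phrases are hypothetical, and the sketch omits the length limit and bookkeeping of the full function.

```python
import re

def select_non_conflicting(ranked_phrases):
    """Keep phrases whose edge words do not overlap with earlier phrases."""
    left_conflicts = set()   # right-edge words of phrases seen so far
    right_conflicts = set()  # left-edge words of phrases seen so far
    selected = []
    for phrase in ranked_phrases:
        words = phrase.split()
        left_word, right_word = words[0], words[-1]
        if left_word not in left_conflicts and right_word not in right_conflicts:
            selected.append(phrase)
        # Skipped phrases still register their edge words
        left_conflicts.add(right_word)
        right_conflicts.add(left_word)
    return selected

def rewrite_line(line, phrases):
    """Replace each phrase in a space-padded line with its underscore-joined form."""
    if not phrases:
        return line
    rewrite = {" " + p: " " + p.replace(" ", "_") for p in phrases}
    # The non-capturing group makes the lookahead apply to every alternative
    pattern = re.compile(r'(?:%s)(?= )' % "|".join(map(re.escape, rewrite)))
    return pattern.sub(lambda mo: rewrite[mo.group(0)], line)

# Hypothetical ranked candidates: "states code" overlaps with "united states"
selected = select_non_conflicting(["united states", "states code", "social security"])
print(selected)

rewritten = rewrite_line(" the united states social security act ", selected)
print(repr(rewritten))

# The trailing-space lookahead prevents partial-word matches
print(repr(rewrite_line(" a united statesman ", selected)))
```

Deferring the overlapping candidate to a later iteration means its statistics are recomputed after `united_states` has become a single token, so the counts reflect the rewritten text.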

### Run the Full Iterative Phrase Learning Process

In [10]:
def ApplyPhraseLearning(textData,learnedPhrases,learningSettings):
    
    stop = 0
    iterNum = 0

    # Get the learning parameters from the structure passed in by the calling function
    maxNumPhrases = learningSettings.maxNumPhrases
    maxPhraseLength = learningSettings.maxPhraseLength
    functionwordHash = learningSettings.functionwordHash
    blacklistHash = learningSettings.blacklistHash
    verbose = learningSettings.verbose
    minCount = learningSettings.minInstanceCount
    
    # Start timing the process
    # (time.clock() was removed in Python 3.8; time.time() works in Python 2 and 3)
    functionStartTime = time.time()
    
    numPhrasesLearned = len(learnedPhrases)
    print ("Start phrase learning with %d of %d phrases learned" % (numPhrasesLearned,maxNumPhrases))

    while not stop:
        iterNum += 1
                
        # Start timing this iteration
        startTime = time.time()
 
        # Collect ngram stats
        ngramStats = ComputeNgramStats(textData,functionwordHash,blacklistHash)

        # Rank ngrams
        rankedNgrams = RankNgrams(ngramStats,functionwordHash,minCount)
        
        # Incorporate top ranked phrases into phrase list
        # and rewrite the text to use these phrases
        maxPhrasesToAdd = maxNumPhrases - numPhrasesLearned
        if maxPhrasesToAdd > learningSettings.maxPhrasesPerIter:
            maxPhrasesToAdd = learningSettings.maxPhrasesPerIter
        ApplyPhraseRewrites(rankedNgrams,textData,learnedPhrases,maxPhrasesToAdd,maxPhraseLength,verbose)
        numPhrasesAdded = len(learnedPhrases) - numPhrasesLearned

        # Garbage collect
        ngramStats = None
        rankedNgrams = None
        gc.collect()
               
        elapsedTime = time.time() - startTime

        numPhrasesLearned = len(learnedPhrases)
        print ("Iteration %d: Added %d new phrases in %.2f seconds (Learned %d of max %d)" % 
               (iterNum,numPhrasesAdded,elapsedTime,numPhrasesLearned,maxNumPhrases))
        
        # Stop once the overall phrase budget is used up or no new phrases were added
        # (comparing against the per-iteration cap would end learning early whenever
        # a full batch happened to be added without conflicts)
        if numPhrasesLearned >= maxNumPhrases or numPhrasesAdded == 0:
            stop = 1
        
    # Remove the space padding at the start and end of each line
    regexSpacePadding = re.compile('^ +| +$')
    for i in range(0,len(textData)):
        textData[i] = regexSpacePadding.sub("",textData[i])
    
    gc.collect()
 
    elapsedTime = time.time() - functionStartTime
    elapsedTimeHours = elapsedTime/3600.0
    print ("*** Phrase learning completed in %.2f hours ***" % elapsedTimeHours) 

    return

-------
### Main Top-Level Execution of Phrase Learning Functionality


In [11]:
# Create a structure defining the settings and word lists used during the phrase learning
learningSettings = namedtuple('learningSettings',['maxNumPhrases','maxPhrasesPerIter',
                                                  'maxPhraseLength','minInstanceCount',
                                                  'functionwordHash','blacklistHash','verbose'])

# If True, the learned phrases are printed to stdout while
# learning is in progress. This generates a lot of output,
# so it is best to turn this off except for testing and debugging
learningSettings.verbose = False

# Maximum number of phrases to learn
# If you want to test the code out quickly then set this to a small
# value (e.g. 100) and set verbose to True when running the quick test
learningSettings.maxNumPhrases = 25000

# Maximum number of phrases to learn per iteration 
# Increasing this number may speed up processing but will affect the ordering of the phrases 
# learned and good phrases could be by-passed if the maxNumPhrases is set to a small number
learningSettings.maxPhrasesPerIter = 200

# Maximum number of words allowed in the learned phrases 
learningSettings.maxPhraseLength = 7

# Minimum number of times a phrase must occur in the data to 
# be considered during the phrase learning process
learningSettings.minInstanceCount = 5

# This is a precreated hash table containing the list 
# of function words used during phrase learning
learningSettings.functionwordHash = functionwordHash

# This is a precreated hash table containing the list 
# of black list words to be ignored during phrase learning
learningSettings.blacklistHash = blacklistHash

# Initialize an empty list of learned phrases
# If you have completed a partial run of phrase learning
# and want to add more phrases, you can use the pre-learned 
# phrases as a starting point instead and the new phrases
# will be appended to the list
learnedPhrases = []

# Create a copy of the original text data that will be used during learning
# The copy is needed because the algorithm does in-place replacement of learned
# phrases directly on the text data structure it is provided
phraseTextData = []
for textLine in textFrame['LowercaseText']:
    phraseTextData.append(' ' + textLine + ' ')

# Run the phrase learning algorithm
if False:
    ApplyPhraseLearning(phraseTextData,learnedPhrases,learningSettings)
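
The cell above stores the settings as attributes on the namedtuple class itself. Since `ApplyPhraseLearning` only reads attributes from the object it is given, an immutable instance could be passed instead. The sketch below shows that alternative for a quick test run. The name `LearningSettings`, the value of `maxPhrasesPerIter`, and the tiny word lists are assumptions for illustration.

```python
from collections import namedtuple

# Same field names as the notebook's settings structure
LearningSettings = namedtuple('LearningSettings', [
    'maxNumPhrases', 'maxPhrasesPerIter', 'maxPhraseLength',
    'minInstanceCount', 'functionwordHash', 'blacklistHash', 'verbose'])

# Hypothetical quick-test configuration: few phrases, verbose output on
quick_settings = LearningSettings(
    maxNumPhrases=100,
    maxPhrasesPerIter=20,
    maxPhraseLength=7,
    minInstanceCount=5,
    functionwordHash={'of': 1, 'the': 1},
    blacklistHash={},
    verbose=True)

print(quick_settings.maxNumPhrases, quick_settings.verbose)

# Instances are immutable; _replace returns a modified copy
full_settings = quick_settings._replace(maxNumPhrases=25000, verbose=False)
print(full_settings.maxNumPhrases, full_settings.verbose)
```

With an instance, a missing or misspelled field name raises an error at construction time rather than going unnoticed.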

In [12]:
learnedPhrasesFile = "../Data/CongressionalDocsLearnedPhrases.txt"
phraseTextDataFile = "../Data/CongressionalDocsPhraseTextData.txt"

if False:
    # Write out the learned phrases to a text file
    fp = open(learnedPhrasesFile, 'w')
    for phrase in learnedPhrases:
        fp.write("%s\n" % phrase)
    fp.close()

    # Write out the text data containing the learned phrases to a text file
    fp = open(phraseTextDataFile, 'w')
    for line in phraseTextData:
        fp.write("%s\n" % line)
    fp.close()


if True:
    # Read in the learned phrases from a text file
    learnedPhrases = []
    fp = open(learnedPhrasesFile, 'r')
    for line in fp:
        learnedPhrases.append(line.strip())
    fp.close()

    # Read in the text data containing the learned phrases from a text file
    phraseTextData = []
    fp = open(phraseTextDataFile, 'r')
    for line in fp:
        phraseTextData.append(line.strip())
    fp.close()

In [13]:
learnedPhrases[0:10]

['united states',
 'directs the secretary',
 'sets forth',
 'internal revenue',
 'authorizes appropriations',
 'authorizes the secretary',
 'requires the secretary',
 'social security',
 'expresses the sense',
 'fiscal year']

In [14]:
learnedPhrases[5000:5010]

['strike and sell',
 'generic drug',
 'crime control_act',
 'address the needs',
 'homeland_security and governmental_affairs',
 'committed by an adult',
 'subgrants to leas',
 'party in interest',
 'general education',
 'special nuclear_material']

In [15]:
phraseTextData[0:15]

['provides_for_a_joint session_of_the_congress on january_27',
 '1987',
 'for a message_from_the_president on the state_of_the_union',
 'salvadoran foreign_assistance reform resolution',
 'expresses_the_sense_of_the_congress that',
 'the u.s._foreign_assistance_program for el_salvador should be revised to promote a negotiated_settlement and a reduction of human suffering',
 'the ratio of assistance should be reversed in fy 1990 so that the amount spent on the war effort is only one-third of the amount spent for reform and development_activities',
 'such assistance should not be distributed in a manner which would promote the interests of any particular political_party',
 'such assistance should be distributed through church-related and other nongovernmental_organizations and international_organizations selected by the agency_for_international_development',
 'and',
 'the president should report_quarterly to the congress on the restructuring of such assistance',
 'the economic results of

In [16]:
# Add text with learned phrases back into data frame
textFrame['TextWithPhrases'] = phraseTextData

In [17]:
textFrame[0:10]

Unnamed: 0,DocID,DocLine,CleanedText,LowercaseText,TextWithPhrases
0,hconres1-100,0,Provides for a joint session of the Congress o...,provides for a joint session of the congress o...,provides_for_a_joint session_of_the_congress o...
1,hconres1-100,1,1987,1987,1987
2,hconres1-100,2,for a message from the President on the State ...,for a message from the president on the state ...,for a message_from_the_president on the state_...
3,hconres1-101,0,Salvadoran Foreign Assistance Reform Resolution,salvadoran foreign assistance reform resolution,salvadoran foreign_assistance reform resolution
4,hconres1-101,1,Expresses the sense of the Congress that,expresses the sense of the congress that,expresses_the_sense_of_the_congress that
5,hconres1-101,2,the U.S. foreign assistance program for El Sal...,the u.s. foreign assistance program for el sal...,the u.s._foreign_assistance_program for el_sal...
6,hconres1-101,3,the ratio of assistance should be reversed in ...,the ratio of assistance should be reversed in ...,the ratio of assistance should be reversed in ...
7,hconres1-101,4,such assistance should not be distributed in a...,such assistance should not be distributed in a...,such assistance should not be distributed in a...
8,hconres1-101,5,such assistance should be distributed through ...,such assistance should be distributed through ...,such assistance should be distributed through ...
9,hconres1-101,6,and,and,and


In [18]:
textFrame['TextWithPhrases'][2]

'for a message_from_the_president on the state_of_the_union'

### Find Most Common Surface Form of Each Lower-Cased Word and Phrase

The text data is lower cased in order to merge differently cased versions of the same word prior to doing topic modeling. In order to generate summaries of topics that will be learned, we would like to present the most likely surface form of a word to the user. For example, if a proper noun is converted to all lowercase characters for latent topic modeling, we want the user to see this proper name with its proper capitalization within summaries. The MapVocabToSurfaceForms() function achieves this by mapping every lowercased word and phrase used during latent topic modeling to its most common surface form in the text collection.



In [19]:
def MapVocabToSurfaceForms(textData):
    surfaceFormCountHash = {}
    vocabToSurfaceFormHash = {}
    regexUnderBar = re.compile('_')
    regexSpace = re.compile(' +')
    regexClean = re.compile('^ +| +$')
    
    # First go through every line of text, align each word/phrase with
    # its surface form and count the number of times each surface form occurs
    for i in range(0,len(textData)):    
        origWords = regexSpace.split(regexClean.sub("",str(textData['CleanedText'][i])))
        numOrigWords = len(origWords)
        newWords = regexSpace.split(regexClean.sub("",str(textData['TextWithPhrases'][i])))
        numNewWords = len(newWords)
        origIndex = 0
        newIndex = 0
        while newIndex < numNewWords:
            # Get the next word or phrase in the lower-cased text with phrases and
            # match it to the original form of the same n-gram in the original text
            newWord = newWords[newIndex]
            phraseWords = regexUnderBar.split(newWord)
            numPhraseWords = len(phraseWords)
            matchedWords = origWords[origIndex]
            origIndex += 1
            for j in range(1,numPhraseWords):
                matchedWords += " " + origWords[origIndex]
                origIndex += 1
                
            # Now do the bookkeeping for collecting the different surface form 
            # variations present for each lowercased word or phrase
            if newWord in vocabToSurfaceFormHash:
                if matchedWords not in vocabToSurfaceFormHash[newWord]:
                    vocabToSurfaceFormHash[newWord].add(matchedWords)
            else:
                vocabToSurfaceFormHash[newWord] = set([matchedWords])

            # Increment the counter for this surface form
            if matchedWords not in surfaceFormCountHash:
                surfaceFormCountHash[matchedWords] = 1
            else:
                surfaceFormCountHash[matchedWords] += 1
   
            if ( len(newWord) != len(matchedWords)):
                print ("##### Error #####")
                print ("Bad Match: %s ==> %s " % (newWord,matchedWords))
                print ("From line: %s" % textData['TextWithPhrases'][i])
                print ("Orig text: %s" % textData['CleanedText'][i])
                
                return False

            newIndex += 1

    # After aligning and counting, select the most common surface form for each
    # word/phrase to be the canonical example shown to the user for that word/phrase
    for ngram in vocabToSurfaceFormHash.keys():
        maxCount = 0
        bestSurfaceForm = ""
        for surfaceForm in vocabToSurfaceFormHash[ngram]:
            if surfaceFormCountHash[surfaceForm] > maxCount:
                maxCount = surfaceFormCountHash[surfaceForm]
                bestSurfaceForm = surfaceForm
        if ngram != "":
            if bestSurfaceForm == "":
                print ("Warning: NULL surface form for ngram '%s'" % ngram)
            else:
                vocabToSurfaceFormHash[ngram] = bestSurfaceForm
    
    return vocabToSurfaceFormHash
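
The selection step at the heart of this mapping can be sketched on toy data. The example below assumes the alignment has already produced (vocabulary item, surface form) pairs and simply picks the most frequent surface form for each item. The helper `most_common_surface_forms` and the sample pairs are hypothetical, and the sketch counts surface forms per vocabulary item rather than in one shared table as `MapVocabToSurfaceForms` does.

```python
from collections import Counter, defaultdict

def most_common_surface_forms(pairs):
    """Map each vocabulary item to its most frequently observed surface form."""
    counts = defaultdict(Counter)
    for vocab_item, surface_form in pairs:
        counts[vocab_item][surface_form] += 1
    return {item: forms.most_common(1)[0][0] for item, forms in counts.items()}

# Hypothetical aligned observations (vocabulary item, original surface form)
observations = [
    ("el_salvador", "El Salvador"),
    ("el_salvador", "El Salvador"),
    ("el_salvador", "el salvador"),
    ("declares", "Declares"),
    ("declares", "Declares"),
    ("declares", "Declares"),
    ("declares", "declares"),
]

mapping = most_common_surface_forms(observations)
print(mapping["el_salvador"])
print(mapping["declares"])
```

A word such as `declares` maps to the capitalized form here because, in this toy data as in line-initial positions of the cleaned text, the capitalized variant is the more frequent one.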



In [20]:
if False:
    vocabToSurfaceFormHash = MapVocabToSurfaceForms(textFrame)

In [21]:
# Save the mapping between model vocabulary items and their surface forms
tsvFile = "../Data/Vocab2SurfaceFormMapping.tsv"

if False:
    fp = open(tsvFile, 'w')
    for vocabItem in vocabToSurfaceFormHash:
        if vocabItem != "":
            strOut = "%s\t%s\n" % (vocabItem, vocabToSurfaceFormHash[vocabItem])
            fp.write(strOut)
    fp.close()
    
if True:
    # Load surface form mappings here
    vocabToSurfaceFormHash = {}
    fp = open(tsvFile)

    # Each line in the file has two tab separated fields;
    # the first is the vocabulary item used during modeling
    # and the second is its most common surface form in the 
    # original data
    for stringIn in fp.readlines():
        fields = stringIn.strip().split("\t")
        if len(fields) != 2:
            print ("Warning: Bad line in surface form mapping file: %s" % stringIn)
        elif fields[0] == "" or fields[1] == "":
            print ("Warning: Bad line in surface form mapping file: %s" % stringIn)
        else:
            vocabToSurfaceFormHash[fields[0]] = fields[1]
    fp.close()


In [22]:
print (vocabToSurfaceFormHash['security'])
print (vocabToSurfaceFormHash['declares'])
print (vocabToSurfaceFormHash['mental_health'])
print (vocabToSurfaceFormHash['el_salvador'])
print (vocabToSurfaceFormHash['department_of_the_interior'])

security
Declares
mental health
El Salvador
Department of the Interior


### Reconstruct the Full Processed Text of Each Document and Put it into a New Frame 

In [23]:
def ReconstituteDocsFromChunks(textData, idColumnName, textColumnName):
    dataOut = []
    
    currentDoc = ""
    currentDocID = ""
    
    for i in range(0,len(textData)):
        textChunk = textData[textColumnName][i]
        docID = textData[idColumnName][i]
        if docID != currentDocID:
            if currentDocID != "":
                dataOut.append([currentDocID, currentDoc])
            currentDoc = textChunk
            currentDocID = docID
        else:
            currentDoc += " " + textChunk
    dataOut.append([currentDocID,currentDoc])
    
    frameOut = pandas.DataFrame(dataOut, columns=['DocID','ProcessedText'])
    
    return frameOut
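
The same grouping of consecutive chunks can be sketched without pandas using `itertools.groupby`. This is an illustrative alternative under the assumption, which also holds for the function above, that all chunks of a document appear on adjacent rows. The helper `join_chunks` and the sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def join_chunks(rows):
    """Join the text of consecutive rows that share a document ID."""
    docs = []
    for doc_id, group in groupby(rows, key=itemgetter(0)):
        docs.append((doc_id, " ".join(text for _, text in group)))
    return docs

# Hypothetical (DocID, text chunk) rows in document order
rows = [
    ("doc-1", "provides_for_a_joint session"),
    ("doc-1", "1987"),
    ("doc-2", "salvadoran foreign_assistance"),
]

docs = join_chunks(rows)
for doc_id, text in docs:
    print(doc_id, "=>", text)
```

Like `ReconstituteDocsFromChunks`, this only merges adjacent rows, so a document ID that reappears later in the input would produce a second entry.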

In [24]:
if False:
    docsFrame = ReconstituteDocsFromChunks(textFrame, 'DocID', 'TextWithPhrases')

In [25]:
# Save processed text for each document back out to a TSV file
if False:
    docsFrame.to_csv('../Data/CongressionalDocsProcessed.tsv', sep='\t', index=False)
    
if True: 
    docsFrame = pandas.read_csv('../Data/CongressionalDocsProcessed.tsv', sep='\t')

In [26]:
docsFrame[0:5]

Unnamed: 0,DocID,ProcessedText
0,hconres1-100,provides_for_a_joint session_of_the_congress o...
1,hconres1-101,salvadoran foreign_assistance reform resolutio...
2,hconres1-102,supports the president's actions to defend sau...
3,hconres1-103,declares that it is the sense_of_the_congress ...
4,hconres1-104,recognizes the sacrifice of army chief warrant...


In [27]:
docsFrame['ProcessedText'][1]

'salvadoran foreign_assistance reform resolution expresses_the_sense_of_the_congress that the u.s._foreign_assistance_program for el_salvador should be revised to promote a negotiated_settlement and a reduction of human suffering the ratio of assistance should be reversed in fy 1990 so that the amount spent on the war effort is only one-third of the amount spent for reform and development_activities such assistance should not be distributed in a manner which would promote the interests of any particular political_party such assistance should be distributed through church-related and other nongovernmental_organizations and international_organizations selected by the agency_for_international_development and the president should report_quarterly to the congress on the restructuring of such assistance the economic results of such restructuring and any reports of corruption in its distribution'