## Automatic Learning of Key Phrases and Topics in Document Collections

## Part 1: Text Preprocessing

### Overview

This notebook is Part 1 of a four-part series that describes, step by step, how to process and analyze the contents of a large collection of text documents in an unsupervised manner. Using Python packages and custom code examples, we have implemented the basic framework that combines key phrase learning and latent topic modeling as described in the paper ["Modeling Multiword Phrases with Constrained Phrases Tree for Improved Topic Modeling of Conversational Speech"](http://people.csail.mit.edu/hazen/publications/Hazen-SLT-2012.pdf), which was originally presented at the 2012 IEEE Workshop on Spoken Language Technology.

This notebook demonstrates how to preprocess the raw text from a collection of documents as a precursor to applying the natural language processing techniques of unsupervised phrase learning and latent topic modeling.



### Import Relevant Python Packages

#### Importing NLTK Model for Sentence Tokenization


NLTK is a collection of Python modules, prebuilt models and corpora that provides tools for complex natural language processing tasks. Because the toolkit is large, the base installation of NLTK only installs the core skeleton of the toolkit. Specific modules, corpora and prebuilt models can then be installed from within Python using the download functionality provided by NLTK.

In this notebook, we use the NLTK sentence tokenizer, which takes a long string of text and splits it into sentence units. The tokenizer requires the 'punkt' tokenizer models. After importing nltk, the nltk.download() function can be used to download specific packages such as 'punkt'.

For more information on NLTK see http://www.nltk.org/

In [1]:
import nltk
# The first time you run NLTK you will need to download the 'punkt' models 
# for breaking text strings into individual sentences
#nltk.download('punkt')
from nltk import tokenize

[nltk_data] Downloading package punkt to
[nltk_data]     /home/tutorialuser/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


#### Import Other Required Packages
The 'pandas' package is used for handling and manipulating data frames. The 're' package is used for applying regular expressions.

In [2]:
# The __future__ import must come before any other statement in the cell
from __future__ import print_function
import pandas
import re

### Load Text Data

In [3]:
# Load full TSV file including a column of text
frame = pandas.read_csv('../Data/CongressionalDocsData.tsv', sep='\t')

In [4]:
print("Total documents in corpus: %d\n" % len(frame))

# Show the first five rows of the data in the frame
frame[0:5]

Total documents in corpus: 189088



|   | ID | Text | Date |
|---|---|---|---|
| 0 | hconres1-100 | Provides for a joint session of the Congress o... | 1987-01-06 |
| 1 | hconres1-101 | Salvadoran Foreign Assistance Reform Resolutio... | 1989-01-03 |
| 2 | hconres1-102 | Supports the President's actions to defend Sau... | 1991-01-03 |
| 3 | hconres1-103 | Declares that it is the sense of the Congress ... | 1993-01-05 |
| 4 | hconres1-104 | Recognizes the sacrifice of Army Chief Warrant... | 1995-01-04 |


In [5]:
# Print the full text of the first three documents
print(frame['Text'][0])
print('---')
print(frame['Text'][1])
print('---')
print(frame['Text'][2])

Provides for a joint session of the Congress on January 27, 1987, for a message from the President on the State of the Union.
---
Salvadoran Foreign Assistance Reform Resolution - Expresses the sense of the Congress that: (1) the U.S. foreign assistance program for El Salvador should be revised to promote a negotiated settlement and a reduction of human suffering; (2) the ratio of assistance should be reversed in FY 1990 so that the amount spent on the war effort is only one-third of the amount spent for reform and development activities; (3) such assistance should not be distributed in a manner which would promote the interests of any particular political party; (4) such assistance should be distributed through church-related and other nongovernmental organizations and international organizations selected by the Agency for International Development; and (5) the President should report quarterly to the Congress on the restructuring of such assistance, the economic results of such restr

### Preprocess Text Data

The CleanAndSplitText function below takes as input a data frame in which each row holds a single cohesive long string of text, i.e. a "document". The function removes section headers, splits each document into sentences, and then splits each sentence on various forms of punctuation into chunks of text that are likely sentences, phrases or sub-phrases. The splitting is designed to prevent the phrase learning process from using cross-sentence or cross-phrase word strings when learning phrases.
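
To make the effect of the splitting concrete, here is a minimal sketch that breaks one sentence from the corpus into chunks. It uses a deliberately simplified pattern, not the full expression defined in CleanAndSplitText; the names `simple_breaks` and `split_into_chunks` are hypothetical and exist only for this illustration.

```python
import re

# Simplified, hypothetical stand-in for the full phrase-break expression:
# split on runs of , ; : ! ? that are followed by whitespace,
# and on a single dash surrounded by whitespace.
simple_breaks = re.compile(r"[,;:!?]+\s+|\s+-\s+")

def split_into_chunks(sentence):
    """Return the non-empty chunks left after splitting on punctuation."""
    return [chunk for chunk in simple_breaks.split(sentence) if chunk]

chunks = split_into_chunks(
    "Provides for a joint session of the Congress on January 27, 1987, "
    "for a message from the President on the State of the Union."
)
for chunk in chunks:
    print(chunk)
```

Because "January 27" and "1987" land in separate chunks, a phrase learner that works chunk by chunk cannot propose a phrase that spans the comma. The sentence-final period is still attached to the last chunk here; the full function handles it in a separate step.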

The function creates a table where each row represents a chunk of text from the original documents. The DocID column holds the ID of the document from which the chunk of text originated, and the DocLine column holds the position of the chunk within that document, counting from 0. The CleanedText column contains the original text with the section headers and punctuation marks removed during the cleaning process.


In [6]:
def CleanAndSplitText(textDataFrame):

    textDataOut = [] 
   
    # This regular expression is for section headers in the bill summaries that we wish to ignore
    reHeaders = re.compile(r" *TABLE OF CONTENTS:? *"
                           r"| *Title [IVXLC]+:? *"
                           r"| *Subtitle [A-Z]+:? *"
                           r"| *\(Sec\. \d+\) *")

    # This regular expression is for punctuation that we wish to clean out
    # We will also split sentences into smaller phrase-like units using this expression
    # Raw strings keep the backslashes intact for the regex engine
    rePhraseBreaks = re.compile(r'["\!\?\)\]\}\,\:\;\*\-]*\s+\([0-9]+\)\s+[\(\[\{"\*\-]*'
                                r'|["\!\?\)\]\}\,\:\;\*\-]+\s+[\(\[\{"\*\-]*'
                                r'|\.\.+'
                                r'|\s*\-\-+\s*'
                                r'|\s+\-\s+'
                                r'|\:\:+'
                                r'|\s+[\/\(\[\{"\-\*]+\s*'
                                r'|[\,!\?"\)\(\]\[\}\{\:\;\*](?=[a-zA-Z])'
                                r'|["\!\?\)\]\}\,\:\;]+[\.]*$'
                                )
    
    # Regex for underbars
    regexUnderbar = re.compile('_')
    
    # Regex for space
    regexSpace = re.compile(' +')
 
    # Regex for sentence final period
    regexPeriod = re.compile(r"\.$")

    # Iterate through each document and do:
    #    (1) Split documents into sections based on section headers and remove section headers
    #    (2) Split the sections into sentences using NLTK sentence tokenizer
    #    (3) Further split sentences into phrasal units based on punctuation and remove punctuation
    #    (4) Remove sentence final periods when not part of an abbreviation 

    for i in range(0,len(textDataFrame)):
        
        # Extract one document from the input frame
        docID = textDataFrame['ID'][i]
        docText = textDataFrame['Text'][i] 

        # Set counter for output line count for this document
        lineIndex = 0

        # Split document into sections by finding sections headers and splitting on them 
        sections = reHeaders.split(docText)
        
        for section in sections:
            # Split section into sentence using NLTK tokenizer 
            sentences = tokenize.sent_tokenize(section)
            
            for sentence in sentences:
                       
                # Split each sentence into phrase level chunks based on punctuation
                textSegs = rePhraseBreaks.split(sentence)
                numSegs = len(textSegs)
                
                for j in range(0,numSegs):
                    if len(textSegs[j])>0:
                        # Convert underbars to spaces 
                        # Underbars are reserved for building the compound word phrases                   
                        textSegs[j] = regexUnderbar.sub(" ",textSegs[j])
                    
                        # Split out the words so we can specially handle the last word
                        words = regexSpace.split(textSegs[j])
                        last = len(words) - 1
                        # Rejoin all words except the last, each followed by a space
                        phraseOut = ""
                        for k in range(0, last):
                            phraseOut += words[k] + " "
                        # If the last word ends in a period then remove the period
                        lastWord = regexPeriod.sub("", words[last])
                        # If the last word is an abbreviation like "U.S."
                        # then add the word final period back on
                        if "." in lastWord:
                            lastWord += "."
                        phraseOut += lastWord    

                        textDataOut.append([docID,lineIndex,phraseOut])
                        lineIndex += 1
                        
    # Convert to pandas frame 
    frameOut = pandas.DataFrame(textDataOut, columns=['DocID','DocLine','CleanedText'])                      
    
    return frameOut
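
Steps (1) and (4) of the function can be tried in isolation on a short string. The following is a minimal sketch rather than the notebook's own code: `header_pattern` is a reduced version of the header expression, and `strip_final_period` and the sample text are hypothetical, introduced only for illustration.

```python
import re

# Hypothetical, reduced header pattern: only "Title <roman numeral>"
# and "(Sec. <number>)" markers are handled here.
header_pattern = re.compile(r" *Title [IVXLC]+:? *| *\(Sec\. \d+\) *")

def strip_final_period(chunk):
    """Drop a chunk-final period unless the last word looks like an abbreviation."""
    words = chunk.split(" ")
    last_word = re.sub(r"\.$", "", words[-1])
    # A period still inside the word (as in "U.S") suggests an abbreviation,
    # so the final period is put back.
    if "." in last_word:
        last_word += "."
    return " ".join(words[:-1] + [last_word])

# Made-up sample text in the style of a bill summary
sample = "Title I: General Provisions (Sec. 101) Amends the tax code."

# Splitting on the headers removes them and leaves the text between them
sections = [s for s in header_pattern.split(sample) if s]
for section in sections:
    print(strip_final_period(section))

print(strip_final_period("assistance from the U.S."))
```

The abbreviation test is only a heuristic: any last word that still contains a period after the final one is removed is treated as an abbreviation.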

In [7]:
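# Change False to True to run the cleaning step here;
# otherwise the cleaned data is read from file in the next cell
if False:
    cleanedDataFrame = CleanAndSplitText(frame)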

#### Writing and reading text data to and from a file 

In [8]:
# Writing the text data to file and reading it back in

if False:
    # Write frame with preprocessed text out to TSV file 
    cleanedDataFrame.to_csv('../Data/CongressionalDocsCleaned.tsv', sep='\t',index=False)

else:
    # Read a cleaned data frame in from a TSV file
    cleanedDataFrame = pandas.read_csv('../Data/CongressionalDocsCleaned.tsv', sep='\t')
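
The write and read calls can be checked on a small scale without touching the disk. The sketch below is an illustration under assumptions: the toy rows and the names `toy`, `buffer` and `restored` are hypothetical, and an in-memory `io.StringIO` buffer stands in for the TSV file.

```python
import io
import pandas

# Toy frame with the same three columns that CleanAndSplitText produces
toy = pandas.DataFrame(
    [["hconres1-100", 0, "Provides for a joint session"],
     ["hconres1-100", 1, "1987"]],
    columns=["DocID", "DocLine", "CleanedText"],
)

# Write to an in-memory buffer instead of a file on disk;
# index=False keeps the row index out of the output
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False)

# Rewind and read the data back, forcing the text column to stay a string
buffer.seek(0)
restored = pandas.read_csv(buffer, sep="\t", dtype={"CleanedText": str})

print(list(restored.columns))
print(restored["CleanedText"].tolist())
```

One thing to watch for in a round trip like this: by default `read_csv` treats strings such as "NA" or "null" as missing values, so a chunk consisting only of such a token would come back as NaN. Passing `keep_default_na=False` is one way to avoid that, if it matters for the corpus at hand.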


#### Examining the processed text data

In [9]:
cleanedDataFrame[0:25]

|   | DocID | DocLine | CleanedText |
|---|---|---|---|
| 0 | hconres1-100 | 0 | Provides for a joint session of the Congress o... |
| 1 | hconres1-100 | 1 | 1987 |
| 2 | hconres1-100 | 2 | for a message from the President on the State ... |
| 3 | hconres1-101 | 0 | Salvadoran Foreign Assistance Reform Resolution |
| 4 | hconres1-101 | 1 | Expresses the sense of the Congress that |
| 5 | hconres1-101 | 2 | the U.S. foreign assistance program for El Sal... |
| 6 | hconres1-101 | 3 | the ratio of assistance should be reversed in ... |
| 7 | hconres1-101 | 4 | such assistance should not be distributed in a... |
| 8 | hconres1-101 | 5 | such assistance should be distributed through ... |
| 9 | hconres1-101 | 6 | and |


In [10]:
print(cleanedDataFrame['CleanedText'][0])
print(cleanedDataFrame['CleanedText'][1])
print(cleanedDataFrame['CleanedText'][2])

Provides for a joint session of the Congress on January 27
1987
for a message from the President on the State of the Union
