In [1]:
from os import path
from collections import Counter
from nltk import word_tokenize, bigrams, trigrams
import pandas as pd
from numpy import log

In [2]:
# Defining input file paths
INPUT_FILE = './data/lincoln_speeches_000.txt'

# Defining output file paths
TOKEN_OUTPUT = './outputs/tokens_percent.csv'
BIGRAM_OUTPUT = './outputs/bigrams_pmi.csv'
TRIGRAM_OUTPUT = './outputs/trigrams_pmi.csv'

### Text Corpus Details

The text file is an extract from the Presidential Speech Corpus:

**Speech details:**

_Speaker: Abraham Lincoln_

_Location: Peoria, Illinois_

_Date: October 16, 1854_

------
#### Process

To process the speech, the file is first read into memory, where its text is held as a single string. The following steps are then applied, in order, to study the word distribution, bigrams and bigram probabilities, and trigrams and trigram probabilities:
1. Word Tokenization, Token Probability and Logarithmic Token Probability
2. Bigram Construction and Association Measures
3. Trigram Construction and Association Measures
------

In [3]:
# Reading data from text file
with open(INPUT_FILE) as t:
    text_data = t.readlines()[2]

print("Total number of characters in speech, including punctuation: {0:,}".format(len(text_data)))

Total number of characters in speech, including punctuation: 187,150


------
### Word Tokenization

**Tokenization** or **Word Segmentation** is the task of separating out words from running text.

A **token** is a sequence of characters in a particular language or document that can be grouped together as a semantic unit for processing. Tokenization splits a document into such individual words or tokens for the purpose of lexical analysis. Tokens have identifiable characteristics (part-of-speech tags, singular/plural nouns, common/proper nouns, etc.) that make it easier to analyse a document and identify context within it.
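To make the idea concrete, here is a minimal sketch of tokenization using only the standard library. The helper `simple_tokenize` and the sample sentence are hypothetical and used for illustration only; the regular expression is a rough stand-in for `word_tokenize`, which applies more elaborate rules (for contractions, abbreviations and so on) and will not always produce the same tokens.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration; a rough stand-in for
    nltk.word_tokenize, not a reimplementation of it.
    """
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


sample = "The cat sat on the mat, and the dog barked."
print(simple_tokenize(sample))
```

Note that punctuation marks come out as tokens of their own, which is why they also appear in the token counts later on.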

In [4]:
# Tokenizing text
speech_tokens = word_tokenize(text_data)

# Creating a token distribution dataframe for easier analysis
# Counting once up front avoids rebuilding the Counter for every token
token_counts = Counter(speech_tokens)
total_tokens = len(speech_tokens)

# One row per unique token, so no duplicate rows need to be dropped
token_dist_df = pd.DataFrame(list(token_counts.items()), columns=['token', 'token_count'])

# Relative frequency is taken against the number of tokens, not the number of characters
token_dist_df['percent_dist'] = token_dist_df['token_count'] / total_tokens
token_dist_df['log_percent_dist'] = log(token_dist_df['percent_dist'])

# Saving token distribution dataframe to a csv file
token_dist_df.to_csv(TOKEN_OUTPUT)

print("Number of unique tokens: {0:,}".format(len(token_dist_df)))

Number of unique tokens: 2,746
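The quantities stored in the token distribution can be checked by hand on a toy input. The sketch below assumes that a token's probability is its count divided by the total number of tokens, and that the logarithm is the natural one (as with `numpy.log`). The function name `token_distribution` and the toy tokens are hypothetical.

```python
from collections import Counter
from math import log


def token_distribution(tokens):
    """Return {token: (count, relative_frequency, log_relative_frequency)}."""
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: (n, n / total, log(n / total)) for tok, n in counts.items()}


toy_tokens = ["the", "cat", "saw", "the", "dog"]
dist = token_distribution(toy_tokens)
for tok, (n, p, log_p) in dist.items():
    print(f"{tok}: count={n}, p={p:.2f}, log p={log_p:.3f}")
```

Because the relative frequencies are all taken against the same total, they sum to 1 over the unique tokens, which is a quick sanity check on the saved CSV as well.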


------
### N-grams

An **n-gram** is a sequence of _n_ adjacent tokens in a given document. Depending on the application, the tokens may be phonemes, syllables, letters, words or base pairs.

N-grams may be used to identify the most common token sequences in a given document, or to generate text from a chosen corpus.

The _n_ in n-grams corresponds to the number of adjacent tokens that are being analysed or generated. 

_n=2_ gives us **bigrams**, _n=3_ gives us **trigrams** and so on.
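As a minimal sketch of how n-grams are built from a token list, the hypothetical helper `ngrams` below slides a window of width _n_ over the tokens. It is written with the standard library only and is meant to illustrate the definition, not to replace NLTK's own n-gram utilities.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of adjacent tokens."""
    # zip over n staggered views of the token list; zip stops at the
    # shortest view, so only complete windows are produced.
    return list(zip(*(tokens[i:] for i in range(n))))


toy_tokens = ["we", "hold", "these", "truths"]
print(ngrams(toy_tokens, 2))  # bigrams
print(ngrams(toy_tokens, 3))  # trigrams
```

A list of _k_ tokens yields _k - n + 1_ n-grams, and none at all when it is shorter than _n_.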

### Pointwise Mutual Information

**Pointwise mutual information** is a measure of association used in statistical analysis and information theory.

The Pointwise Mutual Information, or PMI, of a pair of outcomes x and y quantifies the discrepancy between the probability of their co-occurrence under their joint distribution and the probability that would be expected from their individual distributions if they were independent.

$$PMI(x,y) = \log\frac{P(x,y)}{P(x)P(y)} = \log\frac{P(x \mid y)}{P(x)} = \log\frac{P(y \mid x)}{P(y)}$$

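PMI compares how often an n-gram actually occurs with how often it would be expected to occur if its tokens were independent. A positive score means the tokens co-occur more often than chance would predict, a score of zero corresponds to independence, and a negative score means they co-occur less often than expected.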
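The formula can be worked through on a toy token list. The sketch below is written under two assumptions that the text does not state: all probabilities are estimated as counts divided by the total number of tokens, and the logarithm is taken in base 2. The function name `bigram_pmi` and the toy tokens are hypothetical.

```python
from collections import Counter
from math import log2


def bigram_pmi(tokens):
    """Score each adjacent token pair with PMI (base-2 logarithm)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (x, y), n_xy in bigram_counts.items():
        # Assumption: P(x, y), P(x) and P(y) are all estimated against n.
        p_xy = n_xy / n
        p_x = unigram_counts[x] / n
        p_y = unigram_counts[y] / n
        scores[(x, y)] = log2(p_xy / (p_x * p_y))
    return scores


toy_tokens = ["new", "york", "is", "new", "york", "and", "is", "big"]
scores = bigram_pmi(toy_tokens)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

For the pair ("new", "york"), P(x, y) = 2/8 and P(x) = P(y) = 2/8, so the ratio is 4 and the PMI is 2 bits. Pairs made of tokens that each occur only once score higher still, which illustrates a known property of PMI: it tends to favour rare events.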

------

#### Bigrams, Pointwise Mutual Information and Bigram Probabilities

In [5]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
bigram_collocation_finder = BigramCollocationFinder.from_words(speech_tokens)

# Storing bigrams sorted by pmi to a dataframe
bigrams_df = pd.DataFrame(bigram_collocation_finder.score_ngrams(bigram_measures.pmi), columns=['bigram', 'PMI'])

# Saving bigram dataframe to a csv file
bigrams_df.to_csv(BIGRAM_OUTPUT)

------
#### Trigrams, Pointwise Mutual Information and Trigram Probabilities

In [6]:
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

trigram_measures = TrigramAssocMeasures()
trigram_collocation_finder = TrigramCollocationFinder.from_words(speech_tokens)

# Storing trigrams sorted by pmi to a dataframe
trigram_df = pd.DataFrame(trigram_collocation_finder.score_ngrams(trigram_measures.pmi), columns=['trigram', 'score'])

# Saving trigram dataframe to a csv file
trigram_df.to_csv(TRIGRAM_OUTPUT)
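The PMI formula given earlier is defined for pairs of outcomes. One common way to extend it to three tokens, assumed here for illustration, is to divide the joint probability of the trigram by the product of the three individual token probabilities. Whether this matches the scores written to the CSV depends on NLTK's implementation and should be checked against its documentation. The function name `trigram_pmi` and the toy tokens are hypothetical.

```python
from collections import Counter
from math import log2


def trigram_pmi(tokens):
    """Score each adjacent token triple with a three-way PMI (base 2)."""
    unigram_counts = Counter(tokens)
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    n = len(tokens)
    scores = {}
    for (x, y, z), n_xyz in trigram_counts.items():
        # Assumption: every probability is estimated against the token total n.
        p_xyz = n_xyz / n
        p_independent = (
            (unigram_counts[x] / n)
            * (unigram_counts[y] / n)
            * (unigram_counts[z] / n)
        )
        scores[(x, y, z)] = log2(p_xyz / p_independent)
    return scores


toy_tokens = ["a", "b", "c", "d", "a", "b", "c", "e"]
scores = trigram_pmi(toy_tokens)
print(scores[("a", "b", "c")])
```

Here the trigram ("a", "b", "c") occurs twice in eight tokens and each of its tokens occurs twice, so the ratio is (2/8) / (2/8)^3 = 16.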

------
------

### Citations

Brown, D. W. (2016). "Corpus of Presidential Speeches". Retrieved from [The Grammar Lab](http://www.thegrammarlab.com).

Jurafsky, D., & Martin, J. H. (2007). "Speech and Language Processing".

Bouma, G. (2009). "Normalized (Pointwise) Mutual Information in Collocation Extraction". Proceedings of the Biennial GSCL Conference.