# Week 2 Lab

As demonstrated in the lab session:
<br><br>
Choose a file that you want to work on—either one of the files from the book corpus or one from the Gutenberg corpus.
<br><br>
Build a bigram finder and experiment with applying the filters or leaving them off. Score the bigrams with both the raw frequency scorer and the PMI (pointwise mutual information) scorer, and compare the results.
<br><br>
To complete the exercise, choose one of your top-20 frequency lists to show to the class. Write an introductory sentence or paragraph stating which text you chose and which bigram filters and scorer you used. Include this and the frequency list in your response. You may also look at the frequency lists that other students produced for other corpora.

# Read In Data

In [5]:
import nltk
from nltk import FreqDist

In [6]:
# Week 2:  Bigram Frequencies and Mutual Information
# This file has small examples that are meant to be run individually
#   in the Python interpreter or jupyter notebook cells

nltk.download('punkt')
nltk.download('gutenberg')

# You can then view some books obtained from the Gutenberg on-line book project:
nltk.corpus.gutenberg.fileids()

# For this lab, we work with the second book in the list, Jane Austen's "Persuasion".  We save its fileid (index 1 in the list) into a variable named file0 so that we can reuse it.  The variable names below keep the "emma" prefix from the original lab example, which used "Emma" (index 0).

file0 = nltk.corpus.gutenberg.fileids()[1]
print(file0)

# We can get the original text using the raw() function.  It returns the text as a single string, and len() tells us how many characters it contains.
emmatext = nltk.corpus.gutenberg.raw(file0)
print(len(emmatext))

# Since this is quite long, we can view part of it, e.g. the first 120 characters
print(emmatext[:120])

# Processing Text

# NLTK has several tokenizers for breaking raw text into tokens. We use word_tokenize, which was trained on news articles; it splits on white space and punctuation but keeps some tokens together, such as "Mr.":
emmatokens = nltk.word_tokenize(emmatext)

# Lowercase every token so that counts are case-insensitive
emmawords = [w.lower() for w in emmatokens]


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/erm1000255241/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/erm1000255241/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


austen-persuasion.txt
466292
[Persuasion by Jane Austen 1818]


Chapter 1


Sir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who,
for


In [8]:
# Creating a frequency distribution of words
ndist = FreqDist(emmawords)

In [10]:
# print the top 5 tokens by frequency
nitems = ndist.most_common(5)
for item in nitems:
    print(item[0], '\t', item[1])

, 	 7024
the 	 3328
. 	 3119
and 	 2786
to 	 2782


In [11]:
# Bigrams and Bigram frequency distribution
emmabigrams = list(nltk.bigrams(emmawords))
print(emmawords[:5])
print(emmabigrams[:5])

['[', 'persuasion', 'by', 'jane', 'austen']
[('[', 'persuasion'), ('persuasion', 'by'), ('by', 'jane'), ('jane', 'austen'), ('austen', '1818')]


In [12]:
# setup for bigrams and bigram measures (explicit imports instead of import *)
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
bigram_measures = BigramAssocMeasures()

## Finder 1

In [13]:
# create the bigram finder and score the bigrams by frequency
finder = BigramCollocationFinder.from_words(emmawords)
scored = finder.score_ngrams(bigram_measures.raw_freq)

In [14]:
# scored is a list of bigram pairs with their score
print(type(scored))
first = scored[0]
print(type(first))
print(first)

<class 'list'>
<class 'tuple'>
((',', 'and'), 0.012939133986928104)


In [15]:
# scored is sorted by decreasing score, so the most frequent bigrams come first
for bscore in scored[:5]:
    print(bscore)

((',', 'and'), 0.012939133986928104)
((';', 'and'), 0.0048304738562091505)
(('of', 'the'), 0.004340277777777778)
(('to', 'be'), 0.003860294117647059)
(('.', "''"), 0.0037683823529411765)
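
What `raw_freq` reports can be reproduced by hand. The following is a minimal sketch in plain Python, not NLTK's implementation. It assumes the score is the bigram count divided by the total number of tokens, which is consistent with the printed values (for example, 0.004340277777777778 is what 425/97920 evaluates to). The function name `raw_freq_scores` and the toy sentence are made up for illustration.

```python
from collections import Counter

def raw_freq_scores(words):
    """Score each adjacent word pair by count / total number of tokens."""
    bigram_counts = Counter(zip(words, words[1:]))
    total = len(words)  # assumption: the denominator is the token count
    scored = [(pair, count / total) for pair, count in bigram_counts.items()]
    # Highest score first; ties are broken alphabetically for a stable order
    return sorted(scored, key=lambda item: (-item[1], item[0]))

toy_words = "the cat sat on the mat and the cat slept".split()
toy_scored = raw_freq_scores(toy_words)
print(toy_scored[0])  # the most frequent bigram and its relative frequency
```

Because the score is only a relative count, function words and punctuation dominate the top of the list, which is what the unfiltered output above shows.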


In [17]:
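# apply a filter to remove non-alphabetical tokens from the bigram finder
# (alpha_filter is not defined in the cells shown; this definition is an
# assumption: it flags tokens that contain no lowercase letters)
import re
def alpha_filter(w):
    return re.match(r'^[^a-z]+$', w) is not None
finder.apply_word_filter(alpha_filter)
scored = finder.score_ngrams(bigram_measures.raw_freq)
for bscore in scored[:5]:
    print(bscore)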

(('of', 'the'), 0.004340277777777778)
(('to', 'be'), 0.003860294117647059)
(('in', 'the'), 0.003278186274509804)
(('had', 'been'), 0.002593954248366013)
(('she', 'had'), 0.002318218954248366)
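
The outputs above show that filtering removes candidates without changing the scores of the survivors: `('of', 'the')` has the same score before and after. Below is a minimal sketch of that behaviour in plain Python, assuming a word filter drops a bigram when either of its words is flagged and leaves the denominator untouched. `drop_by_word` and `not_alpha` are hypothetical stand-ins, not the NLTK functions.

```python
from collections import Counter

def drop_by_word(bigram_counts, should_drop):
    """Keep only bigrams in which no word is flagged by should_drop."""
    return Counter({
        pair: count
        for pair, count in bigram_counts.items()
        if not any(should_drop(word) for word in pair)
    })

def not_alpha(word):
    # Hypothetical stand-in for the lab's alpha_filter
    return not word.isalpha()

toy_words = [",", "and", "of", "the", ",", "and", "of", "the", "sea", "."]
counts = Counter(zip(toy_words, toy_words[1:]))
kept = drop_by_word(counts, not_alpha)

total = len(toy_words)  # unchanged by filtering, so surviving scores stay the same
for pair, count in kept.most_common():
    print(pair, count / total)
```

Only the candidate list shrinks, so the ranking after filtering is the original ranking with the flagged bigrams removed.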


In [18]:
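# apply a filter to remove stop words
# (stopwords is not defined in the cells shown; NLTK's English stop word
# list is assumed here)
nltk.download('stopwords')  # one-time download
stopwords = nltk.corpus.stopwords.words('english')
finder.apply_word_filter(lambda w: w in stopwords)
scored = finder.score_ngrams(bigram_measures.raw_freq)
for bscore in scored[:5]:
    print(bscore)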

(('captain', 'wentworth'), 0.001991421568627451)
(('mr', 'elliot'), 0.0017565359477124183)
(('lady', 'russell'), 0.0015012254901960785)
(('sir', 'walter'), 0.0013174019607843138)
(('mrs', 'clay'), 0.0006638071895424837)


## Finder 2

In [24]:
# create a new finder and keep only bigrams that occur at least 2 times
finder2 = BigramCollocationFinder.from_words(emmawords)
finder2.apply_freq_filter(2)
scored = finder2.score_ngrams(bigram_measures.raw_freq)
for bscore in scored[:5]:
    print (bscore)

((',', 'and'), 0.012939133986928104)
((';', 'and'), 0.0048304738562091505)
(('of', 'the'), 0.004340277777777778)
(('to', 'be'), 0.003860294117647059)
(('.', "''"), 0.0037683823529411765)


In [25]:
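# apply an ngram filter: the function receives both words of the bigram, but this one tests only the first word, dropping bigrams whose first word is shorter than 2 characters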
finder2.apply_ngram_filter(lambda w1, w2: len(w1) < 2)
scored = finder2.score_ngrams(bigram_measures.raw_freq)
for bscore in scored[:5]:
    print (bscore)

(('of', 'the'), 0.004340277777777778)
(('to', 'be'), 0.003860294117647059)
(('in', 'the'), 0.003278186274509804)
(('had', 'been'), 0.002593954248366013)
(("''", '``'), 0.002573529411764706)
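
The filter in this cell looks only at `w1`, which is why a bigram of two quotation-mark tokens (each two characters long) survives while `(',', 'and')` is dropped. The following sketch contrasts a first-word test with a test on both words. It uses plain Python and a hypothetical helper `drop_by_ngram`, under the assumption that an ngram filter removes a bigram whenever the function returns True for the pair; the counts are invented.

```python
def drop_by_ngram(bigram_counts, should_drop):
    """Keep only bigrams for which should_drop(w1, w2) is False."""
    return {
        pair: count
        for pair, count in bigram_counts.items()
        if not should_drop(*pair)
    }

# Invented counts, for illustration only
counts = {(",", "and"): 3, ("and", ","): 2, ("of", "the"): 4, ("to", "a"): 1}

# Tests only the first word, as in the cell above
first_only = drop_by_ngram(counts, lambda w1, w2: len(w1) < 2)

# Tests both words
either_word = drop_by_ngram(counts, lambda w1, w2: len(w1) < 2 or len(w2) < 2)

print(sorted(first_only))
print(sorted(either_word))
```

A length test alone does not remove two-character punctuation tokens; combining it with an alphabetic check, as the word filter in Finder 1 does, handles those.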


## Finder 3

In [26]:
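# score with pointwise mutual information (PMI) on a new, unfiltered finder;
# bigrams whose words each occur only once get the highest scores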
finder3 = BigramCollocationFinder.from_words(emmawords)
scored = finder3.score_ngrams(bigram_measures.pmi)
for bscore in scored[:5]:
    print (bscore)

(('1818', ']'), 16.579315937580013)
(('a.', 'e.'), 16.579315937580013)
(('accustomary', 'intervention'), 16.579315937580013)
(('anyone', 'intending'), 16.579315937580013)
(('apples', 'stolen'), 16.579315937580013)
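
The tie at 16.579... has a simple explanation if PMI is computed as log2 of P(w1, w2) / (P(w1) P(w2)) from raw counts: a bigram whose two words each occur exactly once scores log2(N), where N is the number of tokens. The sketch below is plain Python written under that assumption, not NLTK's code; the `pmi` helper and the toy sentence are invented for illustration.

```python
import math
from collections import Counter

def pmi(pair_count, first_count, second_count, total):
    """log2( P(w1, w2) / (P(w1) * P(w2)) ), computed from raw counts."""
    return math.log2(pair_count * total) - math.log2(first_count * second_count)

toy_words = "new york is big and new york is old and rare gem".split()
word_counts = Counter(toy_words)
bigram_counts = Counter(zip(toy_words, toy_words[1:]))
total = len(toy_words)

def score(pair):
    return pmi(bigram_counts[pair], word_counts[pair[0]], word_counts[pair[1]], total)

# "rare gem" occurs once and so do both of its words, so it outranks
# "new york", which occurs twice
print(score(("new", "york")), score(("rare", "gem")))

# A once-only pair in a text of 97,920 tokens would score log2(97920)
print(math.log2(97920))
```

This is why the unfiltered PMI list is filled with one-off pairs: the measure rewards exclusivity, not frequency, so rare bigrams need to be removed before the ranking becomes informative.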


## Final Example

In [28]:
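# PMI gives useful results only after a frequency filter is applied;
# finder is the Finder 1 object, so the alphabetic and stop word filters applied above are still in effect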
finder.apply_freq_filter(5)
scored = finder.score_ngrams(bigram_measures.pmi)
for bscore in scored[:10]:
    print (bscore)

(('west', 'indies'), 13.508926609688618)
(('dr', 'shirley'), 12.935459747805288)
(('marlborough', 'buildings'), 12.672425341971497)
(('westgate', 'buildings'), 12.672425341971497)
(('milsom', 'street'), 11.878876219438924)
(('colonel', 'wallis'), 11.491853096329676)
(('eldest', 'son'), 11.316281531746222)
(('five', 'minutes'), 11.293913718717766)
(('poor', 'richard'), 10.797956224055355)
(('ten', 'minutes'), 10.61584181360513)


# Optimal Solution by ME

In [51]:
# on a new finder, remove non-alphabetical tokens and stop words, then score by raw frequency
finder_opt = BigramCollocationFinder.from_words(emmawords)
finder_opt.apply_word_filter(alpha_filter)
finder_opt.apply_word_filter(lambda w: w in stopwords)
scored = finder_opt.score_ngrams(bigram_measures.raw_freq)
for bscore in scored[:20]:
    print (bscore)

(('captain', 'wentworth'), 0.001991421568627451)
(('mr', 'elliot'), 0.0017565359477124183)
(('lady', 'russell'), 0.0015012254901960785)
(('sir', 'walter'), 0.0013174019607843138)
(('mrs', 'clay'), 0.0006638071895424837)
(('mrs', 'musgrove'), 0.0006638071895424837)
(('mrs', 'smith'), 0.00065359477124183)
(('captain', 'benwick'), 0.0005616830065359477)
(('miss', 'elliot'), 0.0004901960784313725)
(('mrs', 'croft'), 0.00041870915032679736)
(('captain', 'harville'), 0.000377859477124183)
(('great', 'deal'), 0.00034722222222222224)
(('charles', 'hayter'), 0.0003370098039215686)
(('camden', 'place'), 0.0002961601307189542)
(('mr', 'shepherd'), 0.00026552287581699344)
(('kellynch', 'hall'), 0.0002553104575163399)
(('lady', 'dalrymple'), 0.0002553104575163399)
(('mrs', 'harville'), 0.00024509803921568627)
(('anne', 'elliot'), 0.00023488562091503269)
(('colonel', 'wallis'), 0.00023488562091503269)


In [50]:
# on a new finder, keep bigrams that occur at least 5 times, remove non-alphabetical tokens and stop words, then score by PMI
finder_opt2 = BigramCollocationFinder.from_words(emmawords)
finder_opt2.apply_freq_filter(5)
finder_opt2.apply_word_filter(alpha_filter)
finder_opt2.apply_word_filter(lambda w: w in stopwords)
scored = finder_opt2.score_ngrams(bigram_measures.pmi)
for bscore in scored[:20]:
    print (bscore)

(('west', 'indies'), 13.508926609688618)
(('dr', 'shirley'), 12.935459747805288)
(('marlborough', 'buildings'), 12.672425341971497)
(('westgate', 'buildings'), 12.672425341971497)
(('milsom', 'street'), 11.878876219438924)
(('colonel', 'wallis'), 11.491853096329676)
(('eldest', 'son'), 11.316281531746222)
(('five', 'minutes'), 11.293913718717766)
(('poor', 'richard'), 10.797956224055355)
(('ten', 'minutes'), 10.61584181360513)
(('eight', 'years'), 10.399406847565078)
(('kellynch', 'hall'), 10.225992646417119)
(('camden', 'place'), 10.169925001442312)
(('laura', 'place'), 10.169925001442312)
(('depend', 'upon'), 9.949959317500403)
(('dare', 'say'), 9.79914902696819)
(('anybody', 'else'), 9.75403910752515)
(('miss', 'carteret'), 9.613531652917928)
(('years', 'ago'), 9.306297443173598)
(('sir', 'walter'), 9.2524335998176)
