## Find the top <i>k</i> bigrams and trigrams in Jane Austen's <i>Emma</i>:
- I compare two approaches:
    - NLTK's `collocations` module
    - A manual version without NLTK, using `collections.Counter`

In [1]:
# Imports:
import nltk
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from collections import Counter

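# Load the corpus from NLTK (run nltk.download('gutenberg') once if it is missing):
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
# Keep only tokens that start with a letter. This drops punctuation and numeric
# tokens, so the n-grams below can span sentence boundaries.
clean_emma = [w for w in emma if w[0].isalpha()]
print(clean_emma[:25]) # preview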
        
# Set the value of k:
k = 10

['Emma', 'by', 'Jane', 'Austen', 'VOLUME', 'I', 'CHAPTER', 'I', 'Emma', 'Woodhouse', 'handsome', 'clever', 'and', 'rich', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', 'seemed', 'to', 'unite', 'some']


### Using NLTK:

In [2]:
bigram_measures = nltk.collocations.BigramAssocMeasures() # creates an instance
trigram_measures = nltk.collocations.TrigramAssocMeasures()

In [3]:
# Finds top k bigrams
finder = BigramCollocationFinder.from_words(clean_emma)
top_bigrams = finder.nbest(bigram_measures.raw_freq, k) # Ranks by raw frequency (other association measures, such as PMI, are available)
print("Top {} bigrams:\n{}".format(k, top_bigrams))

Top 10 bigrams:
[('to', 'be'), ('of', 'the'), ('in', 'the'), ('I', 'am'), ('had', 'been'), ('Mr', 'Knightley'), ('it', 'was'), ('I', 'have'), ('could', 'not'), ('of', 'her')]
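
Raw frequency mostly surfaces function-word pairs such as ('of', 'the'). To make the PMI alternative mentioned above concrete, here is a minimal sketch that scores bigrams with the textbook definition, PMI = log2(p(x, y) / (p(x) p(y))), using only the standard library. The function name `bigram_pmi`, the `min_count` cutoff and the toy sentence are assumptions for illustration, and the probability estimates here may differ slightly from the ones NLTK uses internally.

```python
import math
from collections import Counter

def bigram_pmi(words, min_count=2):
    """Score each adjacent word pair by PMI = log2(p(x, y) / (p(x) * p(y)))."""
    unigram_counts = Counter(words)
    bigram_counts = Counter(zip(words, words[1:]))
    n_words = len(words)
    n_bigrams = max(len(words) - 1, 1)  # avoid division by zero on tiny inputs
    scores = {}
    for (x, y), count in bigram_counts.items():
        if count < min_count:  # rare pairs get inflated PMI, so skip them
            continue
        p_xy = count / n_bigrams
        p_x = unigram_counts[x] / n_words
        p_y = unigram_counts[y] / n_words
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy input (hypothetical); on the notebook's data this would be clean_emma.
toy_words = ['new', 'york', 'is', 'big', 'new', 'york', 'is', 'old']
toy_scores = bigram_pmi(toy_words)
for pair in sorted(toy_scores, key=toy_scores.get, reverse=True):
    print(pair, round(toy_scores[pair], 3))
```

With NLTK itself, the equivalent would be passing `bigram_measures.pmi` to `finder.nbest`, usually after `finder.apply_freq_filter(...)` to drop rare pairs.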


In [4]:
# Finds top k trigrams
finder = TrigramCollocationFinder.from_words(clean_emma)
top_trigrams = finder.nbest(trigram_measures.raw_freq, k)
print("Top {} trigrams:\n{}".format(k, top_trigrams))

Top 10 trigrams:
[('I', 'do', 'not'), ('I', 'am', 'sure'), ('a', 'great', 'deal'), ('would', 'have', 'been'), ('do', 'not', 'know'), ('she', 'could', 'not'), ('I', 'dare', 'say'), ('Mr', 'Frank', 'Churchill'), ('in', 'the', 'world'), ('I', 'assure', 'you')]


### Without using NLTK:

In [5]:
# Helper functions:

def find_bigrams(words):
    """
    Takes in a list of words as input; Returns a list of bigrams.
    """
    bigram_list = []
    for i in range(len(words)-1): # loop till len - 1 to avoid IndexError
        bigram_list.append((words[i], words[i+1]))
    return bigram_list

def find_trigrams(words):
    """
    Takes in a list of words as input; Returns a list of trigrams.
    """
    trigram_list = []
    for i in range(len(words)-2): # loop till len - 2 to avoid IndexError
        trigram_list.append((words[i], words[i+1], words[i+2]))
    return trigram_list
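
The two helpers above differ only in the window size, so they can be folded into one function. The following is a sketch of that idea, not part of the original notebook; the name `find_ngrams` and the sample sentence are my own.

```python
from collections import Counter

def find_ngrams(words, n):
    """Return the list of n-grams (as tuples) in a list of words."""
    # Build n shifted views of the list and zip them together. zip stops at
    # the shortest view, so no manual bounds handling is needed.
    return list(zip(*(words[i:] for i in range(n))))

sample = ['to', 'be', 'or', 'not', 'to', 'be']
sample_bigrams = find_ngrams(sample, 2)
sample_trigrams = find_ngrams(sample, 3)
print(sample_bigrams)
print(Counter(sample_bigrams).most_common(1))
```

Each slice copies the list, which is fine for a single novel; for much larger corpora an iterator-based version would use less memory.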

In [6]:
# Finds top k bigrams:
bigrams = find_bigrams(clean_emma)
top_bigrams = Counter(bigrams).most_common(k) # calling most_common() of the Counter object
print("Top {} bigrams with counts:\n{}".format(k, top_bigrams))

Top 10 bigrams with counts:
[(('to', 'be'), 595), (('of', 'the'), 557), (('in', 'the'), 434), (('I', 'am'), 395), (('had', 'been'), 308), (('Mr', 'Knightley'), 299), (('it', 'was'), 288), (('I', 'have'), 281), (('could', 'not'), 277), (('of', 'her'), 262)]


In [7]:
# Finds top k trigrams:
trigrams = find_trigrams(clean_emma)
top_trigrams = Counter(trigrams).most_common(k)
print("Top {} trigrams with counts:\n{}".format(k, top_trigrams))

Top 10 trigrams with counts:
[(('I', 'do', 'not'), 135), (('I', 'am', 'sure'), 109), (('a', 'great', 'deal'), 63), (('would', 'have', 'been'), 60), (('do', 'not', 'know'), 55), (('she', 'could', 'not'), 52), (('I', 'dare', 'say'), 50), (('Mr', 'Frank', 'Churchill'), 50), (('in', 'the', 'world'), 49), (('I', 'assure', 'you'), 47)]


- Phew! Both approaches return the same n-grams in the same order (nltk.collocations works!). The `Counter` version also reports the counts.
- An interesting next step would be to use these bigrams/trigrams to train a language model and, with that model trained on this text and perhaps additional texts, have the computer write Emma 2.0!
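
As a rough idea of what that next step could look like, here is a minimal sketch of a bigram model that picks each next word in proportion to how often it followed the previous one. The function names `build_bigram_model` and `generate`, the fixed seed and the toy word list are all assumptions for illustration; a real language model would also need smoothing and sentence boundaries.

```python
import random
from collections import Counter, defaultdict

def build_bigram_model(words):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        model[current][following] += 1
    return model

def generate(model, start, length=10, seed=0):
    """Sample up to `length` words, each drawn in proportion to bigram counts."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    output = [start]
    while len(output) < length:
        followers = model.get(output[-1])
        if not followers:  # dead end: this word never had a successor
            break
        candidates = list(followers.keys())
        weights = list(followers.values())
        output.append(rng.choices(candidates, weights=weights, k=1)[0])
    return output

# Toy input (hypothetical); on the notebook's data this would be clean_emma.
toy_words = ['a', 'b', 'a', 'b', 'a', 'c']
toy_model = build_bigram_model(toy_words)
toy_text = generate(toy_model, 'b', length=5)
print(' '.join(toy_text))
```

On the notebook's data, something like `generate(build_bigram_model(clean_emma), 'Emma', length=20)` would produce a short, locally plausible word sequence.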

## fin.