# NLP Lab Session Week 1
# Word Frequencies and Frequency Distributions from text in NLTK

### Demonstration of NLTK's word_tokenize function, FreqDist class and ngrams function.

## Part 3: Processing Text using NLTK

In [1]:
# import packages
import nltk
from nltk import FreqDist
from nltk.util import ngrams
from collections import Counter

## Working with Text from the NLTK Gutenberg corpus

In [2]:
# view some books obtained from the Gutenberg online book project:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

For the purposes of this lab, we'll work with the first book, Jane Austen's "Emma".

In [3]:
# get ID of book we want
file0 = nltk.corpus.gutenberg.fileids()[0]
file0

'austen-emma.txt'

In [4]:
# get original text as a string of characters given file ID
emmatext = nltk.corpus.gutenberg.raw(file0)
len(emmatext)

887071

In [5]:
type(emmatext)

str

The text contains 887071 characters. Next, we can view the first 120 of them.

In [6]:
emmatext[:120] # first 120 characters in text

'[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\nan'

## Processing Text

NLTK has several tokenizers available to break the raw text into tokens. We will use word_tokenize, which was trained on news articles. It splits the text at white space and at punctuation, but keeps some abbreviations, such as "Mr.", together as a single token:

In [7]:
# get tokens from text
emmatokens = nltk.word_tokenize(emmatext)
len(emmatokens) # count tokens

191776

The text contains 191776 tokens, each split off at white space or at punctuation.

In [8]:
# view first 50 tokens
print(emmatokens[:50])

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER', 'I', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly', 'twenty-one', 'years', 'in', 'the', 'world', 'with']
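
To see what the tokenizer is doing for us, it can help to compare it with two naive approaches from the Python standard library. This is only a sketch and not how `word_tokenize` works internally; the sentence `sample` and the regular expression are illustrative assumptions.

```python
import re

# Illustrative sentence (made up, not taken from the corpus)
sample = "Mr. Knightley smiled, and said nothing."

# Naive approach 1: split on white space only; punctuation stays glued to words
whitespace_tokens = sample.split()
print(whitespace_tokens)

# Naive approach 2: runs of word characters, or single punctuation marks;
# this separates punctuation but also breaks the abbreviation "Mr." apart
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(regex_tokens)
```

The first approach leaves tokens such as 'smiled,' and the second splits 'Mr.' into 'Mr' and '.', which is the kind of case a trained tokenizer is meant to handle.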


We usually want the lowercase versions of the words so that, for example, "The" and "the" are counted as the same word. To do this, we can use the string method lower(), which converts all characters in a string to lowercase.

In [9]:
emmawords = [w.lower() for w in emmatokens] # convert tokens to lower
print(emmawords[:50]) # first 50 lower tokens
print(len(emmawords)) # count of lower tokens

['[', 'emma', 'by', 'jane', 'austen', '1816', ']', 'volume', 'i', 'chapter', 'i', 'emma', 'woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly', 'twenty-one', 'years', 'in', 'the', 'world', 'with']
191776


The function set() converts the list of words to a set, which keeps only the unique ones. The function sorted() then returns them as a list in sorted order, with punctuation and digits ahead of letters. The collection of unique words is called the **vocabulary**.

In [10]:
emmavocab = sorted(set(emmawords)) # sorted list of unique tokens (i.e. vocabulary)
print(emmavocab[:50]) # first 50 of sorted list

['!', '&', "'", "''", "'d", "'s", "'t", "'ye", '(', ')', ',', '--', '.', '10,000', '1816', '23rd', '24th', '26th', '28th', '7th', '8th', ':', ';', '?', '[', ']', '_______', '_a_', '_accepted_', '_adair_', '_addition_', '_all_', '_almost_', '_alone_', '_amor_', '_and_', '_answer_', '_any_', '_appropriation_', '_as_', '_assistance_', '_at_', '_bath_', '_be_', '_been_', '_blunder_', '_boiled_', '_both_', '_bride_', '_broke_']
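
The first entries of the sorted vocabulary are punctuation, numbers and underscore-wrapped words because strings are sorted by character code. If only ordinary words are wanted, one possible filter is `str.isalpha()`. The sketch below uses a small made-up list, `toy_words`, as a stand-in for `emmawords`.

```python
# Toy stand-in for emmawords (hypothetical data, not the real token list)
toy_words = ['[', 'emma', 'by', 'jane', 'austen', '1816', ']', '_all_', 'emma', ',', 'by']

# Full vocabulary: unique tokens in sorted order (punctuation and digits sort first)
toy_vocab = sorted(set(toy_words))
print(toy_vocab)

# Alphabetic-only vocabulary: keep tokens made up entirely of letters
alpha_vocab = sorted({w for w in toy_words if w.isalpha()})
print(alpha_vocab)
```

Note that this filter is strict: it would also drop tokens such as 'twenty-one' and "'s", so whether it is appropriate depends on the task.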


## Ngrams

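After word tokenization, you can extract bigrams (sequences of two adjacent tokens) and trigrams (sequences of three adjacent tokens). Punctuation marks count as tokens here, so they show up in the n-grams too.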

In [11]:
# count the frequency of each bigram
bigrams = ngrams(emmatokens,2)
print(Counter(bigrams).most_common(30))
print(type(bigrams))

[((',', 'and'), 1880), (('.', "''"), 1158), (("''", '``'), 958), ((';', 'and'), 867), (('to', 'be'), 593), ((',', "''"), 584), (('.', 'I'), 570), ((',', 'I'), 569), (('of', 'the'), 556), (('in', 'the'), 434), ((';', 'but'), 427), (('.', '``'), 415), (('.', 'She'), 413), (('I', 'am'), 394), ((',', 'that'), 360), (('!', '--'), 344), (('had', 'been'), 307), ((',', 'she'), 304), ((',', 'but'), 303), (('.', 'He'), 303), (('--', 'and'), 302), ((',', 'as'), 292), (('it', 'was'), 282), (('I', 'have'), 281), (('could', 'not'), 277), (('Mr.', 'Knightley'), 273), (('.', 'It'), 266), (("''", 'said'), 265), ((',', 'to'), 264), (('``', 'I'), 261)]
<class 'zip'>
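
The printed type shows that `ngrams` returned a lazy iterator rather than a list (the exact type may differ between NLTK versions). An iterator can be traversed only once, so after `Counter` has consumed it there is nothing left to loop over. The sketch below mimics this with the built-in `zip` on a made-up token list; it is an illustration, not NLTK's implementation.

```python
from collections import Counter

# Toy tokens (hypothetical example, not the Emma tokens)
toy_tokens = ['to', 'be', 'or', 'not', 'to', 'be']

# Pair each token with its successor; zip returns a one-shot iterator
toy_bigrams = zip(toy_tokens, toy_tokens[1:])

# Counting consumes the iterator
bigram_counts = Counter(toy_bigrams)
print(bigram_counts.most_common(1))

# A second pass finds nothing left
leftover = list(toy_bigrams)
print(leftover)
```

If the bigrams are needed more than once, wrapping the call in `list(...)`, for example `list(ngrams(emmatokens, 2))`, keeps them in memory.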


In [12]:
# count the frequency of each trigram
trigrams = ngrams(emmatokens,3)
print(Counter(trigrams).most_common(30))

[(('.', "''", '``'), 757), ((',', "''", 'said'), 225), (('?', "''", '``'), 147), (("''", '``', 'I'), 136), (('I', 'do', 'not'), 135), (('.', 'It', 'was'), 117), (('I', 'am', 'sure'), 107), ((',', 'however', ','), 89), ((',', 'and', 'the'), 89), ((',', 'my', 'dear'), 87), (('Miss', 'Woodhouse', ','), 86), (('.', 'She', 'was'), 85), (('.', "''", 'Emma'), 79), (('.', '``', 'I'), 78), ((',', 'I', 'am'), 78), (('``', 'Oh', '!'), 76), (('.', 'I', 'am'), 75), (('Mr.', 'Knightley', ','), 75), (('Mrs.', 'Weston', ','), 73), (("''", '``', 'Oh'), 71), ((',', 'you', 'know'), 70), (('I', 'can', 'not'), 66), (('a', 'great', 'deal'), 63), (('.', 'She', 'had'), 61), ((';', 'and', 'the'), 60), (('would', 'have', 'been'), 59), (('.', 'He', 'had'), 58), (('.', 'It', 'is'), 58), (("''", 'said', 'he'), 58), ((',', 'it', 'was'), 57)]


In [13]:
# ad hoc example
sentence = "I went to the park today and played with my son."
tokens = nltk.word_tokenize(sentence)

In [14]:
n = 6
sixgrams = ngrams(tokens, n)
for grams in sixgrams:
    print(grams)

('I', 'went', 'to', 'the', 'park', 'today')
('went', 'to', 'the', 'park', 'today', 'and')
('to', 'the', 'park', 'today', 'and', 'played')
('the', 'park', 'today', 'and', 'played', 'with')
('park', 'today', 'and', 'played', 'with', 'my')
('today', 'and', 'played', 'with', 'my', 'son')
('and', 'played', 'with', 'my', 'son', '.')
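
The output has seven 6-grams because a sequence of N tokens contains N - n + 1 runs of n consecutive tokens, and the sentence has 12 tokens. A minimal sliding-window sketch makes this explicit; `sliding_ngrams` is a hypothetical helper written for illustration, and the token list is typed out by hand so the block runs without NLTK.

```python
def sliding_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples (hypothetical helper)."""
    # One window starts at each index i for which i + n <= len(tokens)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokens of the example sentence, written out by hand
sentence_tokens = ['I', 'went', 'to', 'the', 'park', 'today', 'and',
                   'played', 'with', 'my', 'son', '.']

window_grams = sliding_ngrams(sentence_tokens, 6)
print(len(sentence_tokens), len(window_grams))
print(window_grams[0])
print(window_grams[-1])
```

When n is larger than the number of tokens, the range is empty and the helper returns an empty list.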


On a side note, regular expressions can help us get rid of special characters...
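
As a minimal sketch of that idea, the standard `re` module can drop tokens that are pure punctuation and strip the underscores that wrap some words in this text. The patterns below are assumptions about what counts as a "special character", and `toy_tokens` is made-up data.

```python
import re

# Toy lower-cased tokens (hypothetical; similar to tokens seen in the vocabulary)
toy_tokens = ['emma', ',', '_all_', '--', "'s", 'twenty-one', '1816', '.']

# Matches tokens that contain no letter or digit at all (pure punctuation)
no_alnum = re.compile(r'^[^a-z0-9]+$')

# Drop pure-punctuation tokens
kept = [t for t in toy_tokens if not no_alnum.match(t)]
print(kept)

# Strip leading and trailing underscores from the remaining tokens
cleaned = [re.sub(r'^_+|_+$', '', t) for t in kept]
print(cleaned)
```

The pattern assumes the tokens have already been lowercased; for mixed-case tokens the character class would need to include A-Z as well.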

## Find the counts of different words in the text.

In [15]:
# get specific word frequencies
print('The word "food" appears {:d} times in the text'.format(emmawords.count('food')))
print('The word "love" appears {:d} times in the text'.format(emmawords.count('love')))
print('The word "she" appears {:d} times in the text'.format(emmawords.count('she')))
print('The word "the" appears {:d} times in the text'.format(emmawords.count('the')))

The word "food" appears 3 times in the text
The word "love" appears 117 times in the text
The word "she" appears 2336 times in the text
The word "the" appears 5201 times in the text


## Part 4: Word Frequency Distributions
## Using NLTK Frequency Distributions to Count Words

We will create a frequency distribution, a dictionary-like object of (key, value) pairs in which each key is a word and each value is its frequency, that is, the number of times the word occurs in the text.

In [16]:
fdist = FreqDist(emmawords) # get frequencies of all words; FreqDist behaves like a dictionary
fdistkeys = list(fdist.keys()) # get all keys in dictionary
print(fdistkeys[:50]) # print first 50 keys in dictionary

['[', 'emma', 'by', 'jane', 'austen', '1816', ']', 'volume', 'i', 'chapter', 'woodhouse', ',', 'handsome', 'clever', 'and', 'rich', 'with', 'a', 'comfortable', 'home', 'happy', 'disposition', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'existence', ';', 'had', 'lived', 'nearly', 'twenty-one', 'years', 'in', 'world', 'very', 'little', 'distress', 'or', 'vex', 'her', '.', 'she', 'was', 'youngest', 'two']


We can treat the frequency distribution just like a Python dictionary and we can look at the frequencies of individual words:

In [17]:
# print the frequencies of specific words
print('The word "emma" appears {:d} times in the text'.format(fdist['emma']))
print('The word "the" appears {:d} times in the text'.format(fdist['the']))

The word "emma" appears 855 times in the text
The word "the" appears 5201 times in the text
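
`FreqDist` is built on Python's `collections.Counter`, so the same dictionary-style operations can be tried on a small list without NLTK. The sketch below uses made-up tokens in `toy_words`.

```python
from collections import Counter

# Toy lower-cased tokens (hypothetical data)
toy_words = ['emma', 'was', 'happy', ',', 'and', 'emma', 'was', 'rich', '.']

# One pass over the list builds all counts at once
toy_counts = Counter(toy_words)

print(toy_counts['emma'])         # look up a word like a dictionary key
print(toy_counts['food'])         # a word that never occurs counts as 0, no KeyError
print(sum(toy_counts.values()))   # total number of tokens counted
print(len(toy_counts))            # number of distinct tokens
```

Compared with calling `count()` on the list for each word, which rescans the whole list every time, the counts here are computed once and then looked up.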


The FreqDist class has a method, most_common(), that returns the most frequent words in the text. The result is a list of (word, frequency) pairs ordered from most to least frequent, and we can use this list to print the top words.

In [18]:
topkeys = fdist.most_common(10) # 10 most common words (or punctuation marks)
for pair in topkeys:
    print(pair)

(',', 12016)
('.', 6346)
('the', 5201)
('to', 5181)
('and', 4877)
('of', 4284)
('i', 3177)
('a', 3124)
('--', 3100)
('it', 2503)


The frequencies listed here are raw counts of words in the document. Sometimes we want normalized frequencies instead, for example when comparing two or more documents of different lengths. A normalized frequency (sometimes confusingly just called a frequency) is the count of a word divided by the total number of words in the document. Here is one way to normalize our top frequency words.

In [19]:
# total number of words in document
numwords = len(emmawords)
# divide each frequency by the total number of words
topkeysnormalized = [(word, freq/numwords) for word, freq in topkeys]
for pair in topkeysnormalized:
    print(pair)

(',', 0.06265643250458869)
('.', 0.03309068913732688)
('the', 0.02712018187885867)
('to', 0.02701589354246621)
('and', 0.025430710829300852)
('of', 0.022338561655264474)
('i', 0.01656620223594193)
('a', 0.01628983814450192)
('--', 0.01616469214083097)
('it', 0.013051685299516102)


In [20]:
for word, val in topkeysnormalized:
    print(word + ' appears in {:.1f} percent of the text'.format(round(val*100,1)))

, appears in 6.3 percent of the text
. appears in 3.3 percent of the text
the appears in 2.7 percent of the text
to appears in 2.7 percent of the text
and appears in 2.5 percent of the text
of appears in 2.2 percent of the text
i appears in 1.7 percent of the text
a appears in 1.6 percent of the text
-- appears in 1.6 percent of the text
it appears in 1.3 percent of the text
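
To illustrate why normalization matters when documents differ in length, here is a small sketch with two made-up documents in which a word has the same raw count but a different normalized frequency. `relative_freq` is a hypothetical helper; `FreqDist` also offers a `freq()` method that returns this ratio for a given word.

```python
from collections import Counter

def relative_freq(tokens, word):
    """Share of the tokens equal to `word` (hypothetical helper)."""
    counts = Counter(tokens)
    return counts[word] / len(tokens) if tokens else 0.0

# Two toy documents of different lengths (made-up data)
short_doc = ['the', 'cat', 'sat', 'on', 'the', 'mat']
long_doc = ['the', 'dog', 'the'] + ['barked'] * 17

# Same raw count of 'the' in both documents
print(short_doc.count('the'), long_doc.count('the'))

# Different normalized frequencies: 2 out of 6 versus 2 out of 20
print(relative_freq(short_doc, 'the'))
print(relative_freq(long_doc, 'the'))
```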


### Exercise: pick a different book from the Gutenberg list by changing the index passed to fileids() (the example below uses index 4), then rerun the steps to build a FreqDist and list the 30 most frequent tokens for that book.

In [21]:
#nltk.corpus.gutenberg.fileids()
file4 = nltk.corpus.gutenberg.fileids()[4] # specify file id we want
print(file4)
# get text for that file
myText = nltk.corpus.gutenberg.raw(file4)
print('The total number of characters in {:s} is {:d}'.format(file4, len(myText)))

blake-poems.txt
The total number of characters in blake-poems.txt is 38153


In [22]:
# tokenize text
myTokens = nltk.word_tokenize(myText)
# convert to lower
myTokens = [w.lower() for w in myTokens]
print(myTokens[:10])

['[', 'poems', 'by', 'william', 'blake', '1789', ']', 'songs', 'of', 'innocence']


In [23]:
# frequency distribution of words in my tokens
myFreqDist = FreqDist(myTokens)
# top 30 most common key value pairs
myTop30 = myFreqDist.most_common(30)
print(myTop30)

[(',', 685), ('the', 439), ('and', 348), ('.', 221), ('of', 146), ('in', 141), ('i', 130), ('a', 127), ('to', 111), (';', 99), ('my', 83), (':', 76), ('!', 68), ('with', 66), ('?', 65), ('his', 57), ('he', 56), ('is', 52), ('``', 50), ("'s", 48), ('little', 45), ('on', 44), ('they', 44), ('not', 43), ('thee', 42), ('that', 39), ('all', 39), ('but', 38), ("''", 35), ('like', 35)]
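
Since the same steps are repeated for every book, they can be wrapped in one function. The sketch below is one possible way to do it: `top_tokens` is a hypothetical helper, and its `tokenize` parameter defaults to a plain white-space split so that the block runs without the NLTK data.

```python
from collections import Counter

def top_tokens(text, n=30, tokenize=str.split):
    """Tokenize, lowercase and count `text`; return the n most common (token, count) pairs.

    Hypothetical helper: `tokenize` is any function mapping a string to a list of tokens.
    """
    tokens = [t.lower() for t in tokenize(text)]
    return Counter(tokens).most_common(n)

# Toy text so the sketch runs on its own
toy_text = "The tiger burns bright and the lamb sleeps and the child sings"
toy_top = top_tokens(toy_text, n=2)
print(toy_top)
```

In the notebook, passing `nltk.word_tokenize` as the `tokenize` argument together with the raw text of a book should give a list comparable to the one above, assuming the corpus and tokenizer data are installed.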
