# Intro to NLTK

NLTK was created in 2001 as part of a computational linguistics course at the University of Pennsylvania. Today it is an open-source project that packages common language processing tasks into a single framework.

Other natural language frameworks exist today, and NLTK's algorithms may not be as advanced or as highly optimized as those found in other toolkits.

NLTK defines a basic infrastructure that can be used to build NLP programs in Python. It provides: 
* Basic classes for representing data relevant to natural language processing 
* Standard interfaces for performing tasks, such as tokenization, tagging, and parsing 
* Demonstrations (parsers, chunkers, chatbots) 
* Extensive documentation, including tutorials and reference documentation 

## Installing NLTK

We first need to install NLTK onto our system. Once it's installed, the following import should work.

See http://www.nltk.org/install.html, which covers a Windows binary installation and a Mac/Unix command-line installation.

On my Mac, I used the command `pip3 install nltk`.

In [1]:
import nltk

ModuleNotFoundError: No module named 'nltk'
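If the import fails with `ModuleNotFoundError`, as in the output above, NLTK is probably not installed for the interpreter that the notebook is running. The following is a minimal sketch for checking this without raising an error. It uses only the standard library, and the variable name `nltk_available` is just for illustration.

```python
import importlib.util
import sys

# find_spec returns None when the package cannot be found on sys.path
nltk_available = importlib.util.find_spec("nltk") is not None

if nltk_available:
    print("nltk is importable")
else:
    # Install into the same interpreter that is running this notebook
    print(f"nltk not found; try: {sys.executable} -m pip install nltk")
```

Installing through `sys.executable -m pip` avoids the common case where `pip3` belongs to a different Python than the notebook kernel.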

This notebook follows Chapter 1 in the NLTK book, which explores some basic text processing on books that are pre-loaded into NLTK. 

The default NLTK installation is only the bare minimum. Other parts of the toolkit, such as corpora, taggers, and parsers, can be installed separately.

The NLTK book collection must first be downloaded onto your system. <u>Download the book collection.</u> This only has to be done once.

In [3]:
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> book
    Downloading collection 'book'
       | 
       | Downloading package abc to /home/nbuser/nltk_data...
       |   Unzipping corpora/abc.zip.
       | Downloading package brown to /home/nbuser/nltk_data...
       |   Unzipping corpora/brown.zip.
       | Downloading package chat80 to /home/nbuser/nltk_data...
       |   Unzipping corpora/chat80.zip.
       | Downloading package cmudict to /home/nbuser/nltk_data...
       |   Unzipping corpora/cmudict.zip.
       | Downloading package conll2000 to /home/nbuser/nltk_data...
       |   Unzipping corpora/conll2000.zip.
       | Downloading package conll2002 to /home/nbuser/nltk_data...
       |   Unzipping corpora/conll2002.zip

True
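The interactive prompt can also be skipped. As a sketch, assuming network access and a working NLTK install, `nltk.download` accepts a package or collection identifier directly. The wrapper function name below is hypothetical and exists only to keep the example self-contained.

```python
def download_book_collection(quiet=False):
    """Download the 'book' collection without the interactive prompt.

    Returns the boolean that nltk.download reports (True on success).
    """
    import nltk  # imported here so defining the function does not require NLTK

    # 'book' is the same identifier typed at the Identifier> prompt
    return nltk.download("book", quiet=quiet)
```

Calling `download_book_collection()` once should be enough; the data is stored in an `nltk_data` directory and reused by later sessions.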

## Loading the `book` collection

From NLTK's `book` module, load all items.

In [7]:
from nltk.book import *

### More information about a specific NLTK text

In [8]:
text3

<Text: The Book of Genesis>

In [9]:
text5

<Text: Chat Corpus>

### Searching a text: `concordance`

`concordance` shows every occurrence of a given word, together with some context.

In [10]:
text1.concordance('monstrous')

Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
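To make the idea concrete, here is a rough pure-Python sketch of a concordance over a toy token list. It is not NLTK's implementation: the function name `simple_concordance` is made up, and the context window is measured in tokens, whereas the NLTK output above is trimmed to a character width.

```python
def simple_concordance(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():  # case-insensitive match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

tokens = "of a most monstrous size . Touching that Monstrous bulk of the whale".split()
for line in simple_concordance(tokens, "monstrous"):
    print(line)
```

The match ignores case, which is consistent with the NLTK output above listing both `monstrous` and `Monstrous`.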


### Dispersion Plot

A dispersion plot shows where each word occurs within a text, so it can be used to investigate changes in language use over time:
* each stripe represents an instance of a word 
* each row represents the entire text 

In this example, the text4 book is an artificial text constructed by joining the texts of the Inaugural Address Corpus end-to-end. **What do you expect this graph to look like?**

(In order for this to work, the Python library `matplotlib` also needs to be installed. On my Mac, I had to do `pip3 install matplotlib`.)

In [11]:
%matplotlib notebook

text4.dispersion_plot(["citizens","democracy","freedom","duties","America"])

[Figure: lexical dispersion plot of "citizens", "democracy", "freedom", "duties", and "America" across text4]

### Counting Tokens

Calculating length of a text from start to finish.

In [12]:
len(text3)

44764

_What does this mean?_

The book of Genesis has 44,764 **tokens**. A token is:
* a sequence of characters that is treated as a group
* usually a word or a punctuation symbol (spaces are not tokens)
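The texts in `nltk.book` are already tokenized, so `len` simply counts list items. To illustrate what a tokenizer does, here is a deliberately simple regex sketch. It is an assumption for illustration only, not the tokenizer NLTK used to prepare these texts.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("In the beginning, God created the heaven and the earth.")
print(tokens)
print(len(tokens))
```

The comma and the full stop each become their own token, and whitespace never appears in the result.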

### Vocabulary

The vocabulary of a text is the set of tokens that it uses. (Recall that there are no duplicates in a set.) 

How many **distinct words** does the book of Genesis contain? 

In [13]:
sorted(set(text3))

['!',
 "'",
 '(',
 ')',
 ',',
 ',)',
 '.',
 '.)',
 ':',
 ';',
 ';)',
 '?',
 '?)',
 'A',
 'Abel',
 'Abelmizraim',
 'Abidah',
 'Abide',
 'Abimael',
 'Abimelech',
 'Abr',
 'Abrah',
 'Abraham',
 'Abram',
 'Accad',
 'Achbor',
 'Adah',
 'Adam',
 'Adbeel',
 'Admah',
 'Adullamite',
 'After',
 'Aholibamah',
 'Ahuzzath',
 'Ajah',
 'Akan',
 'All',
 'Allonbachuth',
 'Almighty',
 'Almodad',
 'Also',
 'Alvah',
 'Alvan',
 'Am',
 'Amal',
 'Amalek',
 'Amalekites',
 'Ammon',
 'Amorite',
 'Amorites',
 'Amraphel',
 'An',
 'Anah',
 'Anamim',
 'And',
 'Aner',
 'Angel',
 'Appoint',
 'Aram',
 'Aran',
 'Ararat',
 'Arbah',
 'Ard',
 'Are',
 'Areli',
 'Arioch',
 'Arise',
 'Arkite',
 'Arodi',
 'Arphaxad',
 'Art',
 'Arvadite',
 'As',
 'Asenath',
 'Ashbel',
 'Asher',
 'Ashkenaz',
 'Ashteroth',
 'Ask',
 'Asshur',
 'Asshurim',
 'Assyr',
 'Assyria',
 'At',
 'Atad',
 'Avith',
 'Baalhanan',
 'Babel',
 'Bashemath',
 'Be',
 'Because',
 'Becher',
 'Bedad',
 'Beeri',
 'Beerlahairoi',
 'Beersheba',
 'Behold',
 'Bela',
 'Belah

In [14]:
len(set(text3))

2789

Note that the number of unique types includes punctuation symbols, so it’s not completely accurate to say that there are 2,789 different words. 
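One way to get closer to a count of distinct words is to drop non-alphabetic tokens and fold case before building the set. The sketch below uses a small hand-written token list rather than `text3`, so the numbers are only illustrative.

```python
tokens = ["In", "the", "beginning", ",", "God", "created",
          "the", "heaven", "and", "the", "earth", "."]

all_types = set(tokens)
# Keep alphabetic tokens only and fold case so 'The' and 'the' would count once
word_types = {t.lower() for t in tokens if t.isalpha()}

print(len(all_types))
print(len(word_types))
```

With the book collection loaded, the same set comprehension can be applied to `text3` in place of `tokens`.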

### Lexical Diversity (TTR: Type-Token Ratio)

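Equivalent to a measure of lexical richness. Note that the ratio used here is tokens per type, the reciprocal of the conventional type-token ratio (types divided by tokens), so a lower value indicates a more varied vocabulary.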

$$ \text{Lexical Diversity} = \frac{\text{Text Length}}{\text{Number of Unique Types}} $$

In [15]:
def lexical_diversity(text):
    return len(text) / len(set(text))

In [16]:
lexical_diversity(text3)

16.050197203298673

In [17]:
lexical_diversity(text5)

7.420046158918563

In [18]:
len(text3)

44764

In [19]:
len(text5)

45010

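Writers have to keep re-using the same function words as a text grows, so this measure depends on text length. Lexical diversity is therefore better used for comparing texts of roughly equal length, such as `text3` and `text5` here.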
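The length effect can be seen even on a toy example. The sketch below reuses the notebook's tokens-per-type definition on a made-up sentence and on its first six tokens.

```python
def lexical_diversity(tokens):
    """Tokens per type, the same definition used in the cells above."""
    return len(tokens) / len(set(tokens))

tokens = "the cat sat on the mat and the dog sat on the rug".split()
short = tokens[:6]  # 'the cat sat on the mat'

print(lexical_diversity(short))   # 6 tokens / 5 types
print(lexical_diversity(tokens))  # 13 tokens / 8 types
```

The longer sample repeats `the`, `sat`, and `on`, so its tokens-per-type value rises even though the writing style is unchanged.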

### Lexical Diversity in the Brown Corpus

**Corpus:** a collection of “real-world” text (plural is corpora)

Brown Corpus - famous corpus compiled in the 1960s at Brown University. 
* a general corpus of 500 samples of English-language text, totaling approximately one million words, compiled from works published in the United States in 1961 

<u>Lexical Diversity of various genres in the Brown corpus:</u>

```
Genre              Tokens    Types   Lexical Diversity
Skill and Hobbies  82345     11935   6.9
Humor              21695     5017    4.3
Fiction: science   14470     3233    4.5
Press: reportage   100554    14394   7.0
Fiction: romance   70022     8452    8.3
Religion           39399     6373    6.2
```

### Frequency Distributions

A frequency distribution contains the frequency of each vocabulary item in the text. 

In other words, it contains the count of every word. 

_What Python data structure would be used to represent a Frequency Distribution?_

### NLTK's `FreqDist`

NLTK has built-in support for maintaining the data in a frequency distribution. 

* If you look at the NLTK code (since it’s open-source), `FreqDist` is built on a Python dictionary: it subclasses `collections.Counter`, which is itself a `dict` subclass. 
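One plausible answer to the question above is a dictionary that maps each token to its count. The sketch below uses the standard library's `collections.Counter` on a toy token list (not NLTK data) to show the idea.

```python
from collections import Counter

tokens = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]

counts = Counter(tokens)       # maps each distinct token to its count
print(counts["the"])           # count for one token
print(counts["harpoon"])       # a missing token counts as 0, with no KeyError
print(counts.most_common(2))   # most frequent tokens first
```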

Finding the 50 most frequent words of Moby Dick:

In [20]:
fdist1 = FreqDist(text1)

In [21]:
print(fdist1)

<FreqDist with 19317 samples and 260819 outcomes>


In [22]:
fdist1.most_common(50)

[(',', 18713),
 ('the', 13721),
 ('.', 6862),
 ('of', 6536),
 ('and', 6024),
 ('a', 4569),
 ('to', 4542),
 (';', 4072),
 ('in', 3916),
 ('that', 2982),
 ("'", 2684),
 ('-', 2552),
 ('his', 2459),
 ('it', 2209),
 ('I', 2124),
 ('s', 1739),
 ('is', 1695),
 ('he', 1661),
 ('with', 1659),
 ('was', 1632),
 ('as', 1620),
 ('"', 1478),
 ('all', 1462),
 ('for', 1414),
 ('this', 1280),
 ('!', 1269),
 ('at', 1231),
 ('by', 1137),
 ('but', 1113),
 ('not', 1103),
 ('--', 1070),
 ('him', 1058),
 ('from', 1052),
 ('be', 1030),
 ('on', 1005),
 ('so', 918),
 ('whale', 906),
 ('one', 889),
 ('you', 841),
 ('had', 767),
 ('have', 760),
 ('there', 715),
 ('But', 705),
 ('or', 697),
 ('were', 680),
 ('now', 646),
 ('which', 640),
 ('?', 637),
 ('me', 627),
 ('like', 624)]

In [23]:
fdist1['whale']

906

In [24]:
fdist1['isn']

1

In [25]:
type(fdist1)

nltk.probability.FreqDist

### Common Words and Hapaxes

Notice anything?

In [None]:
text1

In [None]:
fdist1.most_common(50)

Notice that the most frequent words of Moby Dick don’t describe the topic or genre of the text. This is a common finding! 
* whale is the exception
* **hapaxes** (plural of “hapax”) are words that appear only once in a language or a written work 
  * Seeing a hapax in its context sometimes helps to work out what it means
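Finding hapaxes amounts to selecting the items whose count is exactly one. Here is a sketch with `collections.Counter` on a toy token list; it is an illustration of the idea, not NLTK's own code.

```python
from collections import Counter

tokens = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]
counts = Counter(tokens)

# A hapax is any token whose count is exactly 1
hapaxes = [w for w, c in counts.items() if c == 1]
print(sorted(hapaxes))
```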

In [26]:
fdist1.hapaxes()

['tornadoed',
 'cursorily',
 'unmatched',
 'slatternly',
 'vividness',
 'adornment',
 'slapping',
 'ravening',
 'tearingly',
 'HORRID',
 'inanimate',
 'vero',
 'doored',
 'emptying',
 'fresher',
 'howdah',
 'Scales',
 'ROSE',
 '42',
 'aleak',
 'lesser',
 'symmetry',
 'Potters',
 'Pizarro',
 'Salem',
 'melodious',
 'continuation',
 'offence',
 'antelope',
 '1779',
 'enjoins',
 'patrolled',
 'SWEDISH',
 'questionings',
 'watcher',
 'funereal',
 'Hardicanutes',
 'sultanically',
 'unequal',
 'tendon',
 'stoopingly',
 'finite',
 'Meshach',
 'homewardbound',
 'alpine',
 'Mab',
 'cachalot',
 'unfulfilments',
 'succeeds',
 'apricot',
 'turnstile',
 'taunts',
 'scrutinized',
 'MIRABILIS',
 'Socratic',
 'rug',
 'Midwifery',
 'misfortune',
 'flambeaux',
 'Fife',
 'EZEKIEL',
 'felled',
 'inferentially',
 'HEIGHT',
 'hugely',
 'OATHS',
 'Lookee',
 'Snatching',
 'Zoology',
 'KINROSS',
 'trials',
 'pang',
 'wits',
 '74',
 'crashing',
 'harsh',
 'Drop',
 'unoutgrown',
 'entablatures',
 'arrant',
 'cap

### Fine-Grained Selection of Words

List comprehensions will come in very handy.

Let's build a list comprehension that finds _frequently occurring long words_.

General syntax: `[w for w in V if p(w)]`
* iterates over a collection and returns a list 
* (remember: duplicates are possible in a list) 
* read as: “for each word w in collection V, if p(w) is true, then add w to the returned list” 
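As a small worked instance of the general syntax, the sketch below plugs in a concrete collection `V` and a made-up predicate `p` that accepts capitalized words.

```python
def p(w):
    """Example predicate: true for words that start with a capital letter."""
    return w.istitle()

V = {"Ishmael", "whale", "Ahab", "sea", "Pequod"}

# "for each word w in V, if p(w) is true, add w to the returned list"
selected = [w for w in V if p(w)]

# Sets are unordered, so sort for a stable display
print(sorted(selected))
```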

#### Find words from the vocabulary of a text that are more than 15 characters long. 

In [27]:
V = set(text1)
long_words = [w for w in V if len(w) > 15]
sorted(long_words)

['CIRCUMNAVIGATION',
 'Physiognomically',
 'apprehensiveness',
 'cannibalistically',
 'characteristically',
 'circumnavigating',
 'circumnavigation',
 'circumnavigations',
 'comprehensiveness',
 'hermaphroditical',
 'indiscriminately',
 'indispensableness',
 'irresistibleness',
 'physiognomically',
 'preternaturalness',
 'responsibilities',
 'simultaneousness',
 'subterraneousness',
 'supernaturalness',
 'superstitiousness',
 'uncomfortableness',
 'uncompromisedness',
 'undiscriminating',
 'uninterpenetratingly']

#### Find commonly occurring long words in a text.

In [28]:
fdist5 = FreqDist(text5)
sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])

['#14-19teens',
 '#talkcity_adults',
 '((((((((((',
 '........',
 'Question',
 'actually',
 'anything',
 'computer',
 'cute.-ass',
 'everyone',
 'football',
 'innocent',
 'listening',
 'remember',
 'seriously',
 'something',
 'together',
 'tomorrow',
 'watching']

#### Find the most frequent word length in a text.

In [29]:
[len(w) for w in text1]

[1,
 4,
 4,
 2,
 6,
 8,
 4,
 1,
 9,
 1,
 1,
 8,
 2,
 1,
 4,
 11,
 5,
 2,
 1,
 7,
 6,
 1,
 3,
 4,
 5,
 2,
 10,
 2,
 4,
 1,
 5,
 1,
 4,
 1,
 3,
 5,
 1,
 1,
 3,
 3,
 3,
 1,
 2,
 3,
 4,
 7,
 3,
 3,
 8,
 3,
 8,
 1,
 4,
 1,
 5,
 12,
 1,
 9,
 11,
 4,
 3,
 3,
 3,
 5,
 2,
 3,
 3,
 5,
 7,
 2,
 3,
 5,
 1,
 2,
 5,
 2,
 4,
 3,
 3,
 8,
 1,
 2,
 7,
 6,
 8,
 3,
 2,
 3,
 9,
 1,
 1,
 5,
 3,
 4,
 2,
 4,
 2,
 6,
 6,
 1,
 3,
 2,
 5,
 4,
 2,
 4,
 4,
 1,
 5,
 1,
 4,
 2,
 2,
 2,
 6,
 2,
 3,
 6,
 7,
 3,
 1,
 7,
 9,
 1,
 3,
 6,
 1,
 1,
 5,
 6,
 5,
 6,
 3,
 13,
 2,
 3,
 4,
 1,
 3,
 7,
 4,
 5,
 2,
 3,
 4,
 2,
 2,
 8,
 1,
 5,
 1,
 3,
 2,
 1,
 3,
 3,
 1,
 4,
 1,
 4,
 6,
 2,
 5,
 4,
 9,
 2,
 7,
 1,
 3,
 2,
 3,
 1,
 5,
 2,
 6,
 2,
 7,
 2,
 2,
 7,
 1,
 1,
 10,
 1,
 5,
 1,
 3,
 2,
 2,
 4,
 11,
 4,
 3,
 3,
 1,
 3,
 3,
 1,
 6,
 1,
 1,
 1,
 1,
 1,
 4,
 1,
 3,
 1,
 2,
 4,
 1,
 2,
 6,
 2,
 2,
 10,
 1,
 1,
 10,
 5,
 1,
 5,
 1,
 5,
 1,
 5,
 1,
 5,
 1,
 5,
 1,
 5,
 1,
 5,
 1,
 6,
 1,
 3,
 1,
 5,
 1,
 4,
 1,
 7,
 1,
 5,
 1,
 9,

In [30]:
fdist = FreqDist(len(w) for w in text1)

In [31]:
print(fdist)

<FreqDist with 19 samples and 260819 outcomes>


In [32]:
fdist

FreqDist({1: 47933,
          2: 38513,
          3: 50223,
          4: 42345,
          5: 26597,
          6: 17111,
          7: 14399,
          8: 9966,
          9: 6428,
          10: 3528,
          11: 1873,
          12: 1053,
          13: 567,
          14: 177,
          15: 70,
          16: 22,
          17: 12,
          18: 1,
          20: 1})

In [33]:
fdist.most_common()

[(3, 50223),
 (1, 47933),
 (4, 42345),
 (2, 38513),
 (5, 26597),
 (6, 17111),
 (7, 14399),
 (8, 9966),
 (9, 6428),
 (10, 3528),
 (11, 1873),
 (12, 1053),
 (13, 567),
 (14, 177),
 (15, 70),
 (16, 22),
 (17, 12),
 (18, 1),
 (20, 1)]

In [34]:
fdist.max()

3

In [35]:
fdist[3]

50223

In [36]:
fdist.freq(3)

0.19255882431878046
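The relative frequency reported by `freq` is the count divided by the total number of outcomes; for Moby Dick that is 50223 / 260819. The same steps can be sketched with `collections.Counter` on a toy token list (not taken from `text1`).

```python
from collections import Counter

tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago", "-", "never", "mind"]

# Count how often each word length occurs
length_counts = Counter(len(w) for w in tokens)
most_common_length, count = length_counts.most_common(1)[0]

# Relative frequency: count divided by the total number of tokens
relative = count / len(tokens)
print(most_common_length, count, relative)
```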

### Functions defined for NLTK's `FreqDist`

```
Example                           Description
-------                           -----------
fdist = FreqDist(samples)         create a frequency distribution containing the given samples
fdist[sample] += 1                increment the count for this sample
fdist['monstrous']                count of the number of times a given sample occurred
fdist.freq('monstrous')           frequency of a given sample
fdist.N()                         total number of sample outcomes (tokens counted)
fdist.most_common(n)              the n most common samples and their frequencies
for sample in fdist:              iterate over the samples
fdist.max()                       sample with the greatest count
fdist.tabulate()                  tabulate the frequency distribution
fdist.plot()                      graphical plot of the frequency distribution
fdist.plot(cumulative=True)       cumulative plot of the frequency distribution
fdist1 |= fdist2                  update fdist1 with counts from fdist2
fdist1 < fdist2                   test if samples in fdist1 occur less frequently than in fdist2
```
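The difference between combining two distributions with `|` and with `+` is easy to miss. The sketch below shows the behaviour on plain `collections.Counter` objects; it assumes `FreqDist` follows the same semantics as `Counter`, which is worth verifying against your installed NLTK version.

```python
from collections import Counter

a = Counter({"whale": 3, "sea": 1})
b = Counter({"whale": 2, "ship": 4})

union = a | b   # per-key maximum of the two counts
total = a + b   # per-key sum of the two counts

print(union)
print(total)
```

`a |= b` is the in-place form of the union, so it keeps the larger count for each sample rather than adding the counts together.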

### Example: Counting Word Occurrences <u>without</u> `FreqDist`

In [37]:
nltk.corpus.gutenberg.words('shakespeare-macbeth.txt')

['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', ...]

In [38]:
count = {}
for word in nltk.corpus.gutenberg.words('shakespeare-macbeth.txt'):
    word = word.lower()
    if word not in count:
        count[word]=0
    count[word] += 1

Now inspect the dictionary:

In [39]:
count['scotland']

12

In [40]:
# Pair each count with its word so that sorting orders by frequency first
frequencies = sorted(((freq, word) for (word, freq) in count.items()), reverse=True)
frequencies[:20]

[(1962, ','),
 (1235, '.'),
 (650, 'the'),
 (637, "'"),
 (546, 'and'),
 (477, ':'),
 (384, 'to'),
 (348, 'i'),
 (338, 'of'),
 (241, 'a'),
 (241, '?'),
 (238, 'that'),
 (224, 'd'),
 (206, 'you'),
 (203, 'my'),
 (201, 'in'),
 (188, 'is'),
 (165, 'not'),
 (161, 'it'),
 (153, 'with')]
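The manual dictionary loop above does the same bookkeeping that `collections.Counter` provides out of the box. The sketch below compares the two on a short made-up word list instead of the Macbeth corpus.

```python
from collections import Counter

words = ["The", "Tragedie", "of", "Macbeth", "the", "of", "THE"]

# Manual dictionary counting, in the style of the loop above
count = {}
for word in words:
    word = word.lower()
    count[word] = count.get(word, 0) + 1

# Counter does the same bookkeeping in one call
counter = Counter(w.lower() for w in words)

print(count == dict(counter))
print(counter.most_common(2))
```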

## Alternatives to NLTK

* http://en.wikipedia.org/wiki/Outline_of_natural_language_processing#Natural_language_processing_toolkits
* StanfordNLP, LingPipe, Mallet are popular
  * for Java, not Python 