In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
# Make the graphs a bit prettier, and bigger
plt.rcParams['figure.figsize'] = (15, 5)

### Installing NLTK toolkit

Before starting, let's install the NLTK library (http://www.nltk.org/) by typing the following commands in the Vagrant terminal.

* Install Numpy: `sudo pip install -U numpy`
* Install NLTK: `sudo pip install -U nltk`

Test that the library is installed properly by executing the following command:

In [None]:
import nltk

Once NLTK is installed, you may also need to download the NLTK data (see [www.nltk.org/data.html](http://www.nltk.org/data.html)):

In [None]:
# select 'all' packages
nltk.download()

#### Extra NLTK resources

NLTK also comes with some of the files from Project Gutenberg already included:

In [None]:
print(nltk.corpus.gutenberg.fileids())

In [None]:
alice  = nltk.corpus.gutenberg.words('carroll-alice.txt')
print(alice[:30])
print(len(alice))

In [None]:
from nltk.book import *

In [None]:
len(text4)

In [None]:
text4

It is one thing to automatically detect that a particular word occurs in a text, and to display some words that appear in the same context. However, we can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot. Each stripe represents an instance of a word, and each row represents the entire text. You can produce this plot as shown below. You might like to try more words (e.g., liberty, constitution), and different texts. Can you predict the dispersion of a word before you view it? 

In [None]:
# text4 is the Inaugural Address Corpus
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America", "world"])
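
The positional information behind a dispersion plot can be sketched without NLTK. The snippet below is a minimal illustration, not NLTK's implementation: `word_offsets` and `sample_tokens` are hypothetical names, and the sentence is made up. Each offset in the result would correspond to one stripe in the plot.

```python
def word_offsets(tokens, target):
    """Return the 0-based token positions at which `target` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == target]

# Made-up sample text, split on whitespace for simplicity
sample_tokens = "we the citizens of the world ask the citizens to act".split()

offsets = word_offsets(sample_tokens, "citizens")
print(offsets)
```

Words that cluster in one part of a text produce offsets that are close together, while evenly used words produce offsets spread over the whole range.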

In [None]:
alice = nltk.corpus.gutenberg.words('carroll-alice.txt')
t_alice = nltk.Text(alice)
t_alice.dispersion_plot(["rabbit", "Alice", "hole", "Queen"])

### Exercise

* Pick your own book (from those loaded above) and create a dispersion plot for keywords of your choice.

In [None]:
# Your code here


### Processing Text: Introduction 

Let's start by fetching a piece of text. We will go to [Project Gutenberg](https://www.gutenberg.org/) and fetch the text of "On the Origin of Species".

In [None]:
import requests

# The origin of species
# Original at http://www.gutenberg.org/cache/epub/1228/pg1228.txt 
url = "http://www.gutenberg.org/cache/epub/1228/pg1228.txt"

# Fetch the URL
resp = requests.get(url)

# Get the text
content = resp.text


In [None]:
# The text contains template stuff at the beginning and end. Let's get rid of these
start_phrase = "*** START OF THIS PROJECT GUTENBERG EBOOK ON THE ORIGIN OF SPECIES ***"
end_phrase = "*** END OF THIS PROJECT GUTENBERG EBOOK ON THE ORIGIN OF SPECIES ***"
s = content.index(start_phrase)
e = content.index(end_phrase)
true_content = content[s+len(start_phrase):e]

# Approximate size of the text, in characters
print(len(true_content))
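
Note that `str.index` raises a `ValueError` when a marker is not found, which can happen if the wording of the Gutenberg header differs from what we expect. A more forgiving variant might look like the sketch below; `strip_boilerplate` is a hypothetical helper, and the sample string is invented for illustration.

```python
def strip_boilerplate(text, start_marker, end_marker):
    """Return the text between the two markers, or the text unchanged
    if either marker is missing or they appear in the wrong order."""
    start = text.find(start_marker)
    end = text.find(end_marker)
    if start == -1 or end == -1 or end < start:
        return text
    return text[start + len(start_marker):end].strip()

# Invented sample with the same shape as a Gutenberg file
sample = "license header *** START *** the actual book *** END *** license footer"
body = strip_boilerplate(sample, "*** START ***", "*** END ***")
print(body)
```

Falling back to the full text is only one possible design choice; raising a clearer error would be equally reasonable.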

### Frequency distributions, Zipf's law

Zipf's law says that the frequencies of words in a text follow a power law: a few words account for a big fraction of the text (the very frequent ones, usually just the "plumbing" of English), while a large fraction of the unique vocabulary (the "hapaxes") appears very infrequently.

Now, we have our first text ready to be analyzed. Let's first do some analysis of the words that appear in this classic text:

In [None]:
tokens = true_content.split()

# The nltk.Text object will offer us many useful functions for text analysis
text = nltk.Text(tokens)

# Frequency analysis for words of interest
fdist = text.vocab()

# Number of unique and total words in the text
print(fdist)
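
NLTK's `FreqDist` is a subclass of Python's `collections.Counter`, so the idea of "samples" (unique words) and "outcomes" (total words) can be sketched with the standard library alone. The names `toy_tokens` and `toy_counts` below are illustrative, and the sentence is a made-up stand-in for the book.

```python
from collections import Counter

# Tiny stand-in for the tokenized book
toy_tokens = "the origin of species by means of natural selection".split()
toy_counts = Counter(toy_tokens)

n_samples = len(toy_counts)            # number of unique words
n_outcomes = sum(toy_counts.values())  # total number of words

print(f"{n_samples} samples and {n_outcomes} outcomes")
print(toy_counts["of"])      # frequency of a word that occurs
print(toy_counts["darwin"])  # a missing word simply has count 0
```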

Let's take a look at the frequencies of some words in the text:

In [None]:
print(fdist["species"])
print(fdist["nature"])
print(fdist["sexual"])
print(fdist["origin"])

In [None]:
text.dispersion_plot(["species", "nature", "sexual", "origin"])

Let's take a look at the actual words of the text:

In [None]:
fdist

OK, let's see a few more words

In [None]:
print(fdist.most_common(50))

Hm, that is not very useful. These are words that nearly every English text needs. Only the word `"species"` seems to carry some meaning. The rest of the words tell us nothing about the text; they're just English "plumbing."

What proportion of the text is taken up with such words? We can generate a cumulative frequency plot for these words:


In [None]:
fdist.plot(50, cumulative=True)

These 50 words account for nearly half the book! (If you remember, we had 155443 words in the book.)
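
The share of the text covered by the most frequent words can also be computed directly rather than read off the plot. The sketch below uses a hypothetical helper, `top_k_share`, on invented counts; since `FreqDist` is a `Counter` subclass, calling it as `top_k_share(fdist, 50)` should work as well.

```python
from collections import Counter

def top_k_share(counts, k):
    """Fraction of all tokens accounted for by the k most frequent words."""
    total = sum(counts.values())
    top = sum(freq for _, freq in counts.most_common(k))
    return top / total if total else 0.0

# Invented counts, chosen so the arithmetic is easy to follow
toy_counts = Counter({"the": 50, "of": 30, "species": 12, "selection": 5, "pigeon": 3})

share = top_k_share(toy_counts, 2)
print(f"Top 2 words cover {share:.0%} of the tokens")
```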

If the frequent words don't help us, how about the words that occur once only, the so-called hapaxes? View them by typing `fdist.hapaxes()`: 

In [None]:
fdist.hapaxes()

In [None]:
print(len(fdist.hapaxes()))

OK, we have a problem. We generated the words of the text by doing a simple `split()`. So our "words" also contain punctuation, and words with different capitalization are considered different.
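
A small made-up sentence makes the problem concrete. In the sketch below (all names are illustrative), whitespace splitting treats `species` and `species,` as different words, and likewise `The` and `the`. Stripping periods and commas and lowercasing is a crude fix that merges them.

```python
sentence = "The species vary. Many species, the naturalist says, vary widely."

raw_tokens = sentence.split()
print(sorted(set(raw_tokens)))

# Crude normalization: strip surrounding periods/commas and lowercase
cleaned = [t.strip(".,").lower() for t in raw_tokens]
print(sorted(set(cleaned)))

print(len(set(raw_tokens)), "distinct raw tokens vs",
      len(set(cleaned)), "after cleaning")
```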

### Normalization and Tokenization

So, in order to do proper analysis, we need to remove all the punctuation from the document. However, keeping only alphanumeric characters will break things like `B.Sc.`, `N.Y.U.`, and so on. The process of properly splitting the document into appropriate basic elements is called **tokenization**.
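
To see why keeping only alphanumeric characters is too blunt, consider the following sketch. It uses a plain regular expression (not an NLTK function) on an invented sentence; the abbreviations and the price are torn into fragments.

```python
import re

text = "She has a Ph.D. from N.Y.U. and paid $2.88."

# Keep only runs of letters and digits; everything else acts as a separator
alnum_only = re.findall(r"[A-Za-z0-9]+", text)
print(alnum_only)
```

`Ph.D.` becomes the two fragments `Ph` and `D`, and `$2.88` becomes `2` and `88`, which is rarely what we want.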

NLTK provides a set of functions that perform tokenization (see also http://www.nltk.org/_modules/nltk/tokenize.html):

In [None]:
example = '''Good bagels cost $2.88 in Philadelphia.  
    Hey Prof. Bauman, please buy me two of them.
    
    Thanks.
    
    PS: You have a Ph.D., you can handle this, right?'''

print(nltk.word_tokenize(example))

In [None]:
s1 = "On a $50,000 mortgage of 30 years at 8 percent, the monthly payment would be $366.88."
print(nltk.word_tokenize(s1))

In [None]:
s2 = "\"We beat some pretty good teams to get here,\" Slocum said."
print(nltk.word_tokenize(s2))

In [None]:
s3 = "Well, we couldn't have this predictable, cliche-ridden, \"Touched by an Angel\" (a show creator John Masius worked on) wanna-be if she didn't."
print(nltk.word_tokenize(s3))

In [None]:
s4 = "I cannot cannot work under these conditions!"
print(nltk.word_tokenize(s4))

In [None]:
s5 = "The company spent $30,000,000 last year."
print(nltk.word_tokenize(s5))

In [None]:
s6 = "The company spent 40.75% of its income last year."
print(nltk.word_tokenize(s6))

In [None]:
s7 = "He arrived at 3:00 pm."
print(nltk.word_tokenize(s7))

In [None]:
s8 = "I bought these items: books, pencils, and pens."
print(nltk.word_tokenize(s8))

In [None]:
s9 = "Though there were 150, 100 of them were old."
print(nltk.word_tokenize(s9))

In [None]:
s10 = "There were 300,000, but that wasn't enough."
print(nltk.word_tokenize(s10))

So, let's repeat the process now for our original text:

In [None]:
# We tokenize and we also convert to lowercase for further normalization
tokens = nltk.word_tokenize(true_content.lower())
text = nltk.Text(tokens)

# Frequency analysis for words of interest
fdist = text.vocab()

# Number of unique and total words in the text
print(fdist)

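We went from `13908 samples and 155443 outcomes` to `7687 samples and 175682 outcomes`. In other words, we now have 7687 unique tokens and 175682 tokens in total. The vocabulary shrank because lowercasing and proper tokenization merge variants of the same word, while the total grew because punctuation characters are now separate tokens.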

In [None]:
print(fdist.most_common(50))

In [None]:
print(fdist.hapaxes())

In [None]:
print(len(fdist.hapaxes()))

So out of the 7687 unique words, 2666 of them appear only once in the text. But these are only 2666 out of the total of 175682 words in the text. This is ~1.5% of the text.
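
The two proportions can be checked with a few lines of arithmetic. The numbers are the ones quoted above; the variable names are just for illustration.

```python
unique_tokens = 7687    # distinct tokens after tokenization
total_tokens = 175682   # all tokens in the text
hapax_count = 2666      # tokens that occur exactly once

share_of_vocabulary = hapax_count / unique_tokens
share_of_text = hapax_count / total_tokens

print(f"{share_of_vocabulary:.1%} of the vocabulary")
print(f"{share_of_text:.1%} of the running text")
```

So hapaxes make up roughly a third of the vocabulary, yet only a tiny fraction of the running text.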

### Sentence splitting

Tokenization can also be used to split a text into sentences:

In [None]:
example = '''Good bagels cost $2.88 in N.Y.C. Hey Prof. Bauman, please buy me two of them.
    
    Thanks.
    
    PS: You have a Ph.D. you can handle this, right?'''

print(nltk.sent_tokenize(example))
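
For contrast, here is what a naive splitter does with abbreviations. This is a sketch using only the `re` module, not what `sent_tokenize` does internally: it splits after every `.`, `!` or `?` that is followed by whitespace, so `Prof.` is wrongly treated as the end of a sentence.

```python
import re

text = "Good bagels cost $2.88 in N.Y.C. Hey Prof. Bauman, please buy me two of them."

# Naive rule: a sentence ends at . ! or ? followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for s in naive_sentences:
    print(repr(s))
```

The price `$2.88` survives because its period is not followed by whitespace, but `Hey Prof.` ends up as a "sentence" of its own.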

### Zipf's Law

Zipf's law says that the frequencies of words in a text follow a power law: a few words account for a big fraction of the text (the very frequent ones, usually just the "plumbing" of English), while a large fraction of the unique vocabulary (the "hapaxes") appears very infrequently.

In [None]:
fdist.plot(50, cumulative=False)

In [None]:
fdist.plot(50, cumulative=True)
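
In its simplest form, Zipf's law says that the frequency of the word at rank `r` is roughly `C / r` for some constant `C`, so `rank * frequency` should stay roughly constant. The sketch below checks this on idealized, invented counts; `rank_frequency` is a hypothetical helper, and real text will only follow the pattern approximately.

```python
from collections import Counter

def rank_frequency(counts, top=5):
    """Return (rank, word, frequency, rank * frequency) for the top words."""
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

# Idealized counts that follow frequency = 1200 / rank exactly
ideal = Counter({"w1": 1200, "w2": 600, "w3": 400, "w4": 300, "w5": 240})

rows = rank_frequency(ideal)
for rank, word, freq, product in rows:
    print(rank, word, freq, product)
```

Because `FreqDist` is a `Counter` subclass, the same helper can be tried on `fdist` to see how closely the book follows the idealized pattern.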

### Normalization: Stopwords

NLTK contains a corpus of stopwords, that is, high-frequency words like `the`, `to` and `also` that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts.

In [None]:
from nltk.corpus import stopwords

print(stopwords.words('english'))

Let's define a function that removes from a text the words that are in the stopwords list:

In [None]:
# Extra, corpus-specific stopwords on top of NLTK's English list
mystops = ['one', 'may', 'would', 'many']

def remove_stopwords(text, hapaxes):
    # Sets make the membership tests below fast
    stop = set(nltk.corpus.stopwords.words('english')) | set(mystops)
    rare = set(hapaxes)
    content = [w.lower() for w in text
               if w.lower() not in stop    # w is neither an NLTK nor a custom stopword
               and w.isalpha()             # w consists of letters only: no numbers, no punctuation
               and w.lower() not in rare]  # w has frequency > 1
    return nltk.Text(content)

text_nostopwords = remove_stopwords(text, fdist.hapaxes())
fdist_nostopwords = text_nostopwords.vocab()
print(fdist_nostopwords)

In [None]:
print(fdist_nostopwords.most_common(50))

In [None]:
fdist_nostopwords.plot(50, cumulative=True)

### Exercise:
    
Build a similar plot for the 'Myths and Legends of Ancient Greece and Rome' book.  
url = `http://www.gutenberg.org/cache/epub/22381/pg22381.txt`  
Can you guess which words will be on top of the list?

In [None]:
url = "http://www.gutenberg.org/cache/epub/22381/pg22381.txt"
# Your code is here


### Normalization: Stemming 

So far, the only normalization that we did is to convert text to lowercase before doing anything with its words, e.g. `set(w.lower() for w in text)`. By using lower(), we have normalized the text to lowercase so that the distinction between `The` and `the` is ignored. Often we want to go further than this, and strip off any affixes, a task known as **stemming**. 

NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer you should use one of these. The Porter stemmer is a very well-known stemmer, and should suffice for most of our applications. 


In [None]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""

tok = nltk.word_tokenize(raw)

porter = nltk.PorterStemmer()
stemmed =  [porter.stem(t) for t in tok]

# This idiom concatenates all the words in a list.
# The call is s.join(list), where we join the elements
# of the list using the string s as the separator
print(" ".join(stemmed))
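
To get a feel for what "stripping affixes" means, here is a deliberately naive suffix stripper. It is a toy sketch and not the Porter algorithm, which applies a much richer set of rules; the suffix list and the `naive_stem` name are assumptions made for illustration.

```python
SUFFIXES = ("ing", "es", "s", "ed")  # tried in this order

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[:-len(suffix)]
    return w

words = ["lying", "ponds", "distributing", "swords", "masses", "basis"]
stems = [naive_stem(w) for w in words]
print(stems)
```

Notice that a stem does not have to be a real word: `basis` is reduced to `basi` and `distributing` to `distribut`.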

### Normalization: Lemmatization

A further step in the normalization process is to make sure that the resulting form is a known word in a dictionary, a task known as **lemmatization**. The WordNet lemmatizer only removes affixes if the resulting word is in its dictionary. This additional checking process makes the lemmatizer slower than the above stemmers. Notice that it doesn't handle lying, but it converts women to woman.

In [None]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tok = nltk.word_tokenize(raw)

wnl = nltk.WordNetLemmatizer()
lemmatized =  [wnl.lemmatize(t) for t in tok]
print(" ".join(lemmatized)) 
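
The "only if the result is in the dictionary" idea can be sketched in a few lines. The example below is a toy, loosely inspired by how dictionary-based lemmatizers work: it combines a list of irregular forms with suffix rules, and accepts a candidate only when it appears in a small word list. `KNOWN_WORDS`, `IRREGULAR`, `SUFFIX_RULES` and `toy_lemmatize` are all hypothetical and far smaller than anything WordNet uses.

```python
KNOWN_WORDS = {"woman", "pond", "sword", "mass", "basis", "ceremony"}
IRREGULAR = {"women": "woman"}                         # forms no suffix rule can handle
SUFFIX_RULES = [("ses", "s"), ("es", ""), ("s", "")]   # (suffix, replacement)

def toy_lemmatize(word):
    """Return a dictionary form of `word` if one can be found, else the word itself."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix, replacement in SUFFIX_RULES:
        if w.endswith(suffix):
            candidate = w[:-len(suffix)] + replacement
            if candidate in KNOWN_WORDS:  # accept only real dictionary words
                return candidate
    return w

lemmas = [toy_lemmatize(w) for w in ["women", "ponds", "masses", "basis", "lying"]]
print(lemmas)
```

Because `basi` is not in the word list, `basis` is left alone, which is exactly the kind of mistake a pure suffix stripper tends to make.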

But what is this `WordNet`?  
It is one of the most useful resources for anyone interested in analyzing text at a more semantic level than simply frequency counts.

### WordNet

WordNet ([wordnet.princeton.edu](https://wordnet.princeton.edu)) is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets. We'll begin by looking at synonyms and how they are accessed in WordNet.

####  Senses and Synonyms

Consider the first sentence below. If we replace the word motorcar with automobile, we get the second sentence, and the meaning stays pretty much the same:

* Benz is credited with the invention of the motorcar.
* Benz is credited with the invention of the automobile.

Since everything else in the sentence has remained unchanged, we can conclude that the words motorcar and automobile have the same meaning, i.e. they are **synonyms**. We can explore these words with the help of WordNet:

In [None]:
from nltk.corpus import wordnet as wn
wn.synsets('motorcar')

Thus, motorcar has just one possible meaning and it is identified as car.n.01, the first noun sense of car. The entity car.n.01 is called a **synset**, or **"synonym set"**, a collection of synonymous words (or "lemmas"):

In [None]:
wn.synset('car.n.01').lemma_names()

Each word of a synset can have several meanings, e.g., car can also signify a train carriage, a gondola, or an elevator car. However, we are only interested in the single meaning that is common to all words of the above synset. Synsets also come with a prose definition and some example sentences:

In [None]:
wn.synset('car.n.01').definition()

In [None]:
wn.synset('car.n.01').examples()

Although definitions help humans to understand the intended meaning of a synset, the words of the synset are often more useful for our programs. To eliminate ambiguity, we will identify these words as car.n.01.automobile, car.n.01.motorcar, and so on. This pairing of a synset with a word is called a **lemma**. We can get all the lemmas for a given synset, look up a particular lemma, get the synset corresponding to a lemma, and get the "name" of a lemma:

In [None]:
wn.synset('car.n.01').lemmas() 

In [None]:
wn.lemma('car.n.01.automobile')

In [None]:
wn.lemma('car.n.01.automobile').synset()

In [None]:
wn.lemma('car.n.01.automobile').name()
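
The relationship between words, synsets and lemmas can be pictured as a simple mapping. The sketch below is a toy model, not the NLTK API: `TOY_WORDNET` is a hand-copied excerpt of two synsets and should be treated as illustrative, and `toy_synsets` and `toy_lemmas` are hypothetical helpers.

```python
# Synset name -> the words (lemma names) that share that meaning
TOY_WORDNET = {
    "car.n.01": ["car", "auto", "automobile", "machine", "motorcar"],
    "car.n.02": ["car", "railcar", "railway_car", "railroad_car"],
}

def toy_synsets(word):
    """All synsets (senses) in which the word appears."""
    return [name for name, words in TOY_WORDNET.items() if word in words]

def toy_lemmas(synset_name):
    """A lemma pairs a synset with one of its words, e.g. car.n.01.automobile."""
    return [f"{synset_name}.{word}" for word in TOY_WORDNET[synset_name]]

print(toy_synsets("motorcar"))  # one sense only
print(toy_synsets("car"))       # ambiguous: several senses
print(toy_lemmas("car.n.01")[:3])
```

A word with one entry is unambiguous in this toy model, while a word that appears under several synsets has several senses.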

Now let's analyze the word `car`, which has multiple **senses** (i.e., meanings of the word):

In [None]:
wn.synsets('car')

In [None]:
senses = [(s.lemma_names(), s.definition(), s.examples()) for s in wn.synsets('car')]
for s in senses:
    print("Lemma names:", s[0])
    print("Definition:", s[1])
    print("Examples  :", s[2])
    print("=======================")

### Exercise:
Analyze the word `"bank"`

In [None]:
## Your code is here


### Summary

* A frequency distribution is a collection of items along with their frequency counts (e.g., the words of a text and their frequency of appearance).
* Tokenization is the segmentation of a text into basic units, or tokens, such as words and punctuation. Tokenization based on whitespace is inadequate for many applications because it bundles punctuation together with words. NLTK provides an off-the-shelf tokenizer, `nltk.word_tokenize()`.
* Stemming is the process of removing the affixes of a word to create a "normalized" representation of the token.
* Lemmatization is a process that maps the various forms of a word (such as appeared, appears) to the canonical or citation form of the word, also known as the lexeme or lemma (e.g. appear). 
* WordNet is a semantically-oriented dictionary of English, consisting of synonym sets — or synsets — and organized into a network.

