In [24]:
%matplotlib inline

# Introduction to NLP with NLTK

Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world - unstructured data that by its very nature has latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that with Python and the Natural Language Toolkit (NLTK).

NLTK is an excellent library for machine-learning based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. The combination of Python + NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications. 

## Quick Overview of NLTK
NLTK stands for the Natural Language Toolkit and is written by two eminent computational linguists, Steven Bird (Senior Research Associate of the LDC and professor at the University of Melbourne) and Ewan Klein (Professor of Linguistics at Edinburgh University). NLTK provides a combination of natural language corpora, lexical resources, and example grammars with language processing algorithms, methodologies and demonstrations for a very pythonic "batteries included" view of Natural Language Processing.

As such, NLTK is perfect for research-driven (hypothesis-driven) workflows for agile data science. Its suite of libraries includes:

- tokenization, stemming, and tagging
- chunking and parsing
- language modeling
- classification and clustering
- logical semantics

NLTK is a useful pedagogical resource for learning NLP with Python and serves as a starting place for producing production grade code that requires natural language analysis. It is also important to understand what NLTK is _not_:

- Production ready out of the box
- Lightweight
- Generally applicable
- Magic

NLTK provides a variety of tools that can be used to explore the linguistic domain but is not a lightweight dependency that can be easily included in other workflows, especially those that require unit and integration testing or other build processes. This stems from the fact that NLTK includes a lot of added code but also a rich and complete library of corpora that power the built-in algorithms. 

### The Good parts of NLTK

- Preprocessing
    - segmentation
    - tokenization
    - PoS tagging
- Word level processing
    - WordNet
    - Lemmatization
    - Stemming
    - NGrams
- Utilities
    - Tree
    - FreqDist
    - ConditionalFreqDist
    - Streaming CorpusReaders
- Classification
    - Maximum Entropy
    - Naive Bayes
    - Decision Tree
- Chunking
- Named Entity Recognition
- Parsers Galore!

### The Bad parts of NLTK

- Syntactic Parsing

    - No included grammar (not a black box)
    - No Feature/Dependency Parsing
    - No included feature grammar

- The sem package
    
    - Toy only (lambda-calculus & first order logic)

- Lots of extra stuff (heavyweight dependency)

    - papers, chat programs, alignments, etc.

Knowing the good and the bad parts will help you explore NLTK further - looking into the source code to extract the material you need, then moving that code to production. We will explore NLTK in more detail in the rest of this notebook. 

## Installing NLTK

This notebook has a few dependencies, most of which can be installed via the Python package manager, `pip`.

1. Python 2.7 or later (anaconda is ok)
2. NLTK
3. The NLTK corpora 
4. The BeautifulSoup library
5. The gensim library

Once you have Python and pip installed you can install NLTK and the other dependencies as follows:

    ~$ pip install nltk
    
    ~$ pip install matplotlib
    
    ~$ pip install beautifulsoup4
    
    ~$ pip install gensim

Note that these will also install Numpy and Scipy if they aren't already installed. 

To download the corpora, open a Python interpreter:


In [25]:
!pip install nltk



In [None]:
!pip install beautifulsoup4 

In [2]:
import nltk

In [None]:
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> l
Packages:
Error connecting to server: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------



In [3]:
nltk.download()
nltk.download('gutenberg')

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


[nltk_data] Downloading package gutenberg to
[nltk_data]     /home/synerzip/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

This will open up a window with which you can download the various corpora and models to a specified location. For now, go ahead and download it all as we will be exploring as much of NLTK as we can. Also take note of the `download_directory` - you're going to want to know where that is so you can get a detailed look at the corpora that are included. I usually export an environment variable to track this:

    ~$ export NLTK_DATA=/path/to/nltk_data
    
_Take a moment to explore what is in this directory_
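One way to do that exploration from Python is sketched below, using only the standard library. The function name `summarize_nltk_data` is hypothetical, and falling back to `~/nltk_data` when `NLTK_DATA` is unset is an assumption that matches the download location shown in the output above - adjust it for your own machine.

```python
import os

def summarize_nltk_data(root=None):
    """Map each top-level folder under the data directory to its sorted entries."""
    # Assumption: fall back to ~/nltk_data when NLTK_DATA is not set.
    if root is None:
        root = os.environ.get("NLTK_DATA", os.path.expanduser("~/nltk_data"))
    summary = {}
    if not os.path.isdir(root):
        return summary  # nothing has been downloaded to this location
    for name in sorted(os.listdir(root)):
        folder = os.path.join(root, name)
        if os.path.isdir(folder):
            summary[name] = sorted(os.listdir(folder))
    return summary

# Print one line per top-level folder (e.g. corpora, tokenizers)
for folder, entries in summarize_nltk_data().items():
    print("%s: %d entries" % (folder, len(entries)))
```

An empty result simply means nothing was found at that location, in which case check the directory reported by the downloader.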

## Working with Example Corpora

NLTK ships with a variety of corpora; let's use a few of them to do some work. Get access to the text from _Moby Dick_ as follows:

In [4]:
moby = nltk.text.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))

The `nltk.text.Text` class is a wrapper around a sequence of simple (string) tokens - intended only for _the initial exploration of text_ usually via the Python REPL. It has the following methods:

- common_contexts
- concordance
- collocations
- count
- plot
- findall
- index

You shouldn't use this class in production level systems, but it is useful to explore (small) snippets of text in a meaningful fashion.

The concordance function performs a search for the given token and also provides the surrounding context:


In [5]:
moby.concordance("monstrous", 55, lines=10)

Displaying 10 of 11 matches:
er , one was of a most monstrous size . ... This came t
ALMS . " Touching that monstrous bulk of the whale or o
 a heathenish array of monstrous clubs and spears . Som
ed , and wondered what monstrous cannibal and savage co
vived the flood ; most monstrous and most mountainous !
cout at Moby Dick as a monstrous fable , or still worse
 .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I
 In connexion with the monstrous pictures of whales , I
 upon those still more monstrous stories of them which 
n rummaged out of this monstrous cabinet there is no te
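To make the idea concrete, here is a minimal plain-Python sketch of what a concordance computes. It is not NLTK's implementation: the function name `concordance`, the `window` parameter (counted in tokens rather than characters), and the toy sentence are all assumptions made for illustration. Matching is case-insensitive, as the output above suggests.

```python
def concordance(tokens, target, window=4):
    """Return (left, match, right) context triples for each occurrence of target."""
    target = target.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            # Join up to `window` tokens on each side of the match
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

tokens = "one was of a most monstrous size and the monstrous bulk of the whale".split()
for left, match, right in concordance(tokens, "monstrous", window=3):
    print("%25s  %s  %s" % (left, match, right))
```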


Given some context surrounding a word, we can discover similar words, i.e. words that occur frequently in the same context and with a similar distribution (distributional similarity):

In [6]:
moby.similar("ahab")  # similar() prints its results itself and returns None

austen = nltk.text.Text(nltk.corpus.gutenberg.words('austen-sense.txt'))
austen.similar("monstrous")

it he that queequeg stubb him there starbuck i what man and all then
this me peleg which here you
very so exceedingly heartily a as good great extremely remarkably
sweet vast amazingly
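A rough plain-Python sketch of distributional similarity follows. It treats the pair (previous word, next word) as a word's context and ranks other words by how many distinct contexts they share with the target. This is a simplification for illustration only - NLTK's own scoring may rank words differently - and the function names and toy sentence are hypothetical.

```python
from collections import Counter, defaultdict

def contexts_by_word(tokens):
    """Map each lowercased word to a Counter of its (previous, next) contexts."""
    tokens = [t.lower() for t in tokens]
    index = defaultdict(Counter)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        index[word][(prev, nxt)] += 1
    return index

def similar(tokens, word, n=5):
    """Rank other words by the number of distinct contexts shared with `word`."""
    word = word.lower()
    index = contexts_by_word(tokens)
    target = set(index[word])
    scores = Counter()
    for other, contexts in index.items():
        if other != word:
            shared = len(target & set(contexts))
            if shared:
                scores[other] = shared
    return [w for w, _ in scores.most_common(n)]

tokens = "the cat sat on the mat . the dog sat on the rug . the cat ran".split()
print(similar(tokens, "cat"))
```

Here "cat" and "dog" both appear between "the" and "sat", so they share a context.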


As you can see, it takes a bit of time to build the index in memory, which is one reason this class is not recommended for production code. Now that we can do searching and similarity, let's find the common contexts of a set of words:

In [7]:
moby.common_contexts(["ahab", "starbuck"])

of_s to_the cried_give with_the murmured_gazing at_s to_s and_s
cried_let cried_look by_s of_and in_the but_s cried_to


_Your turn: go ahead and explore similar words and contexts - what does the common context mean?_

NLTK also uses matplotlib and pylab to display graphs and charts that can show dispersions and frequency. This is especially interesting for the corpus of inaugural addresses given by U.S. presidents.

In [8]:
nltk.download('inaugural')
inaugural = nltk.text.Text(nltk.corpus.inaugural.words())
inaugural.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

[nltk_data] Downloading package inaugural to
[nltk_data]     /home/synerzip/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!


<matplotlib.figure.Figure at 0x7f31b077ea90>
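The data behind a dispersion plot is just the list of token offsets at which each word occurs. Here is a minimal sketch of that computation without any plotting. The function name `dispersion_offsets` and the toy sentence are hypothetical, and matching is case-sensitive here by assumption.

```python
def dispersion_offsets(tokens, targets):
    """Map each target word to the token offsets where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

tokens = "we the citizens value freedom and the duties of citizens".split()
offsets = dispersion_offsets(tokens, ["citizens", "freedom", "democracy"])
for word in sorted(offsets):
    print("%-10s %s" % (word, offsets[word]))
```

Plotting each offset as a tick on a horizontal line, one line per word, gives the familiar dispersion chart.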

To explore the built-in corpora, use the following methods:

In [None]:
# Lists the various corpora and CorpusReader classes in the nltk.corpus module
for name in dir(nltk.corpus):
    if name.islower() and not name.startswith('_'):
        print(name)

In [None]:
# For a specific corpus, list the fileids that are available:
print(nltk.corpus.shakespeare.fileids())

In [None]:
print(nltk.corpus.gutenberg.fileids())

In [None]:
import string

print(nltk.corpus.stopwords.fileids())
print(nltk.corpus.stopwords.words('english'))
print(string.punctuation)

These corpora export several vital methods:

- paras (iterate through each paragraph)
- sents (iterate through each sentence)
- words (iterate through each word)
- raw   (get access to the raw text)

In [None]:
corpus = nltk.corpus.brown
print(corpus.paras())

In [None]:
print(corpus.sents())

In [None]:
print(corpus.words())

In [None]:
print(corpus.raw()[:200])  # Be careful! The full raw text is very large

_Your turn! Explore some of the text in the available corpora_

## Frequency Analyses

In statistical machine learning approaches to NLP, the very first thing we need to do is count things - especially the unigrams that appear in the text and their relationships to each other. NLTK provides two excellent classes to enable these frequency analyses:

- `FreqDist`
- `ConditionalFreqDist` 

And these two classes serve as the foundation for most of the probability and statistical analyses that we will conduct.

First we will compute the following:

- The count of words
- The vocabulary (unique words)
- The lexical diversity (the ratio of word count to vocabulary)

In [None]:
nltk.download('reuters')
reuters = nltk.corpus.reuters # Corpus of news articles
counts  = nltk.FreqDist(reuters.words())
vocab   = len(counts.keys())
words   = sum(counts.values())
lexdiv  = float(words) / float(vocab)

print("Corpus has %i types and %i tokens for a lexical diversity of %0.3f" % (vocab, words, lexdiv))

In [None]:
counts.B()

In [None]:
print(counts.most_common(40))  # The n most common tokens in the corpus

In [None]:
print(counts.max())  # The most frequent token in the corpus

In [None]:
print(counts.hapaxes()[0:10])  # The first ten hapax legomena (tokens that occur only once)

In [None]:
counts.freq('stipulate') * 100 # percentage of the corpus for this token

In [None]:
counts.plot(200, cumulative=False)
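If the `FreqDist` methods feel opaque, it may help to see the same quantities computed with the standard library's `collections.Counter`, which `FreqDist` closely resembles. The sketch below is an illustration on a toy token list, not a replacement for the NLTK class; the variable names are arbitrary.

```python
from collections import Counter

tokens = "the whale and the ship and the sea".split()
counts = Counter(tokens)

total = sum(counts.values())                 # number of tokens
vocab = len(counts)                          # number of distinct types (bins)
hapaxes = sorted(w for w, c in counts.items() if c == 1)  # words seen exactly once
relative = counts["the"] / float(total)      # share of all tokens that are "the"

print("tokens=%d types=%d" % (total, vocab))
print("hapaxes: %s" % hapaxes)
print("relative frequency of 'the': %0.3f" % relative)
```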

In [None]:
from itertools import chain 

brown = nltk.corpus.brown
categories = brown.categories()

counts = nltk.ConditionalFreqDist(chain(*[[(cat, word) for word in brown.words(categories=cat)] for cat in categories]))

for category, dist in counts.items():
    vocab  = len(dist.keys())
    tokens = sum(dist.values())
    lexdiv = float(tokens) / float(vocab)
    print("%s: %i types with %i tokens and lexical diversity of %0.3f" % (category, vocab, tokens, lexdiv))
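Conceptually, a conditional frequency distribution is one frequency distribution per condition, built from `(condition, sample)` pairs. The following standard-library sketch shows that structure on invented data. The function name `conditional_counts` and the toy pairs are hypothetical and only stand in for the Brown categories used above.

```python
from collections import Counter, defaultdict

def conditional_counts(pairs):
    """Group (condition, sample) pairs into one Counter per condition."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table

pairs = [("news", "the"), ("news", "market"), ("news", "the"),
         ("romance", "the"), ("romance", "heart")]
table = conditional_counts(pairs)

for condition in sorted(table):
    dist = table[condition]
    print("%s: %d types, %d tokens" % (condition, len(dist), sum(dist.values())))
```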

_Your turn: compute the conditional frequency distribution of bigrams in a corpus_

Hint:


In [None]:
for ngram in nltk.ngrams(["The", "bear", "walked", "in", "the", "woods", "at", "midnight"], 5):
    print(ngram)
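An n-gram generator is just a sliding window over the tokens. A minimal plain-Python sketch of that window is below; the name `sliding_ngrams` is hypothetical and this is not how NLTK implements `ngrams`, only an illustration of the idea.

```python
def sliding_ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens, as tuples."""
    tokens = list(tokens)
    # Zip the sequence against copies of itself shifted by 1, 2, ..., n-1 positions
    return zip(*(tokens[i:] for i in range(n)))

words = ["The", "bear", "walked", "in", "the", "woods"]
print(list(sliding_ngrams(words, 2)))
```

With `n=2` each tuple is a bigram, so the first element can serve as the condition and the second as the sample.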

    

## Preprocessing Text

NLTK is great at preprocessing raw text - it provides the following tools for dividing text into its constituent parts:

- `sent_tokenize`: a Punkt sentence tokenizer:

    This tokenizer divides a text into a list of sentences, by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences.  It must be trained on a large collection of plaintext in the target language before it can be used.
    
    However, Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use PunktSentenceTokenizer(text) to learn parameters from the given text.
    
    
- `word_tokenize`: a Treebank tokenizer 

    The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank. This is the method that is invoked by ``word_tokenize()``.  It assumes that the text has already been segmented into sentences, e.g. using ``sent_tokenize()``.
    

- `pos_tag`: a maximum entropy tagger trained on the Penn Treebank (recent NLTK 3 releases default to an averaged perceptron tagger instead)

    There are several other taggers including (notably) the `BrillTagger`, as well as the `BrillTaggerTrainer` to train your own tagger or tagset.
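To see why sentence segmentation needs a model of abbreviations at all, here is a small standard-library sketch. It is not Punkt: the abbreviation set is a hypothetical hand-written list (Punkt learns its own from a corpus), and both function names are invented for illustration.

```python
import re

# Hypothetical hand-written list; Punkt learns abbreviations from text instead.
ABBREVIATIONS = {"gov.", "mr.", "dr."}

def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def split_with_abbreviations(text, abbreviations=ABBREVIATIONS):
    """Like naive_split, but glue a piece to the previous one after an abbreviation."""
    sentences = []
    carry = False
    for piece in naive_split(text):
        if carry:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
        words = piece.split()
        carry = bool(words) and words[-1].lower() in abbreviations
    return sentences

sample = "New York Gov. Andrew Cuomo spoke. He left."
print(naive_split(sample))
print(split_with_abbreviations(sample))
```

The naive splitter breaks the sentence after "Gov.", while the abbreviation-aware version keeps it whole.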

In [None]:
text = u"Medical personnel returning to New York and New Jersey from the Ebola-riddled countries in West Africa will be automatically quarantined if they had direct contact with an infected person, officials announced Friday. New York Gov. Andrew Cuomo (D) and New Jersey Gov. Chris Christie (R) announced the decision at a joint news conference Friday at 7 World Trade Center. “We have to do more,” Cuomo said. “It’s too serious of a situation to leave it to the honor system of compliance.” They said that public-health officials at John F. Kennedy and Newark Liberty international airports, where enhanced screening for Ebola is taking place, would make the determination on who would be quarantined. Anyone who had direct contact with an Ebola patient in Liberia, Sierra Leone or Guinea will be quarantined. In addition, anyone who traveled there but had no such contact would be actively monitored and possibly quarantined, authorities said. This news came a day after a doctor who had treated Ebola patients in Guinea was diagnosed in Manhattan, becoming the fourth person diagnosed with the virus in the United States and the first outside of Dallas. And the decision came not long after a health-care worker who had treated Ebola patients arrived at Newark, one of five airports where people traveling from West Africa to the United States are encountering the stricter screening rules."

for sent in nltk.sent_tokenize(text): 
    print(sent)
    print()


In [None]:
for sent in nltk.sent_tokenize(text):
    print(nltk.wordpunct_tokenize(sent))
    print()

In [None]:
for sent in nltk.sent_tokenize(text):
    print(nltk.pos_tag(nltk.word_tokenize(sent)))
    print()

All of these taggers work _pretty_ well - but you can (and should) train them on your own corpora.

## Stemming and Lemmatization

We have an immense number of word forms, as you can see from our various counts in the `FreqDist` above. For many applications (especially search) it is helpful to normalize these word forms into some canonical word for further exploration. In English (and many other languages), morphological context indicates gender, tense, quantity, etc., but these subtleties might not be necessary:

Stemming = chop off affixes to get the root stem of the word:

    running --> run
    flowers --> flower
    geese   --> geese 
    
Lemmatization = look up word form in a lexicon to get canonical lemma

    women   --> woman
    foxes   --> fox
    sheep   --> sheep
    
There are several stemmers available:

    - Lancaster (English, newer and aggressive)
    - Porter (English, original stemmer)
    - Snowball (Many languages, newest)
    
The lemmatizer uses the WordNet lexicon.
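The difference between the two approaches can be sketched in a few lines of plain Python. Everything here is a toy: the suffix rules and the lexicon are hypothetical and far smaller than what the real stemmers and WordNet use, but they show rule-based chopping versus dictionary lookup.

```python
# Hypothetical rules and lexicon, invented purely for illustration.
SUFFIX_RULES = [("nning", "n"), ("ing", ""), ("s", "")]
LEXICON = {"women": "woman", "foxes": "fox", "geese": "goose"}

def toy_stem(word):
    """Apply the first matching suffix rule; no dictionary is consulted."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    """Look the word up in the lexicon; unknown words pass through unchanged."""
    return LEXICON.get(word, word)

for word in ["running", "flowers", "geese", "women", "foxes", "sheep"]:
    print("%-8s stem=%-7s lemma=%s" % (word, toy_stem(word), toy_lemmatize(word)))
```

Note how the toy stemmer turns "foxes" into the non-word "foxe", while the lookup handles irregular forms but leaves "running" untouched because it is missing from the tiny lexicon.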
    

In [None]:
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer

text = list(nltk.word_tokenize("The women running in the fog passed bunnies working as computer scientists."))

snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()
porter = PorterStemmer()

for stemmer in (snowball, lancaster, porter):
    stemmed_text = [stemmer.stem(t) for t in text]
    print(" ".join(stemmed_text))

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in text]
print(" ".join(lemmas))

Note that the lemmatizer has to load the WordNet corpus, which takes a moment.

Typical normalization of text for use as features in machine learning models looks something like this:

In [None]:
import string

## Module constants
lemmatizer  = WordNetLemmatizer()
stopwords   = set(nltk.corpus.stopwords.words('english'))
punctuation = string.punctuation

def normalize(text):
    for token in nltk.word_tokenize(text):
        token = token.lower()
        token = lemmatizer.lemmatize(token)
        if token not in stopwords and token not in punctuation:
            yield token

print(list(normalize("The eagle flies at midnight.")))

## Named Entity Recognition

NLTK has an excellent MaxEnt-backed Named Entity Recognizer that is trained on the ACE corpus. You can also retrain the chunker if you'd like - the code is very readable, so it is easy to extend with a gazetteer or otherwise.

In [None]:
nltk.download('maxent_ne_chunker')
nltk.download('words')
print(nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize("John Smith is from the United States of America and works at Microsoft Research Labs"))))

You can also wrap the Stanford NER system, which many of you are also probably used to using.