## Introduction

Have you ever wondered what are some writing habits of Herman Melville, the author of the Americal national novel, *Moby Dick*? What are his favorite verbs and adjectives in this great novel? In this toturial, we will show some basic natural language processing techniques with python. 

[<img src="https://images-na.ssl-images-amazon.com/images/I/71Q4R237BZL.jpg" style="width: 400px;">](https://images-na.ssl-images-amazon.com/images/I/71Q4R237BZL.jpg)

### Tutorial content

This tutorial will mainly show how to do basic natural language processing with python with [`nltk`](http://www.nltk.org/) package.

`nltk` is short for Natural Language Toolkit. It is a package that provides easy-to-use interfaces for python programs to process natural language data. `nltk` is a massive library that can't all be covered within the realm of this tutorial. This tutorial would discuss nltk about its ability to tokenize, stem, tag and semantic reasoning.

We will cover the following topics in this tutorial:
- [Tokenizing](#Tokenizing)
- [Stemming](#Stemming)
- [Natural Language](#Natural Language)
- [Putting things together: basic NLP on *Moby Dick*](#Putting things together)


## Installing the libraries

Install `nltk` using `pip`:

    $ pip install --upgrade nltk
    


In [1]:
import nltk
from collections import defaultdict
import operator

Download necessary `nltk` packages manually. For this tutorial, please install the following packages.

  ```python
  >>>nltk.download('treebank')
  >>>nltk.download('wordnet')
  >>>nltk.download('punkt')
  >>>nltk.download('gutenberg')
  ```

## Tokenizing

### Regex Tokenizer

Regular expression is powerful and handy to use. `nltk` tokenizer supports regular expression to control how to tokenize text. However, regular expression can be costly and over complicated, so it's better to use it do simple tasks.

In [2]:
from nltk.tokenize import RegexpTokenizer
regex_tokenizer = RegexpTokenizer('[\w]+')
print regex_tokenizer.tokenize('This is a sentence.')

['This', 'is', 'a', 'sentence']


### Treebank Tokenizer
`TreebankWordTokenizer` is the underlying instance for `word_tokenizer()`, which creates a list of words from a string using spaces and punctuation.

In [3]:
from nltk.tokenize import TreebankWordTokenizer
treebank_tokenizer = TreebankWordTokenizer()
print treebank_tokenizer.tokenize('This is a sentence.')

['This', 'is', 'a', 'sentence', '.']


Notice that `TreebankWordTokenizer` keeps the punctuation. This gives to more freedom to decide on how to deal with text.

### Tokenizer for Other Languages
nltk is built with human language in mind, and there are many languages in the world. It can tokenize languages other than English. The following code tokenizes spanish.

In [4]:
import nltk.data
spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
print spanish_tokenizer.tokenize('Buenos dias! Como estas?')

['Buenos dias!', 'Como estas?']


## Stemming

Stemming removes suffixes from a word and only keeps the 'stem' of the word. Some frequent suffix to be removed are "es", "ing". "y" is replaced with "i". Here are 2 examples of stemming.

In [5]:
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()
print porter_stemmer.stem('preffixes')
print porter_stemmer.stem('looking')
print porter_stemmer.stem('lucky')

preffix
look
lucki


The stemming algorithm used above is Porter Stemming. You can find more information about the algorithm [here](http://snowball.tartarus.org/algorithms/porter/stemmer.html). There are many stemming algorithms, but they have similar processes and outputs.

### Stemming vs. Lemmatizing

Lemmatizing find the root word of a word. It converts a variation of a word back to its original form. The biggest difference between lemmatization and stemming is that lemmatization always produces a valid word, while stemming simply get rid of unwanted parts of a word.

In [6]:
from nltk.stem import WordNetLemmatizer
word_net_lemmatizer = WordNetLemmatizer()
print word_net_lemmatizer.lemmatize('themselves')
print porter_stemmer.stem('themselves')

themselves
themselv


Some applications choose a stemmer rather than a lemmatizer because stemmed words are more uniform and are usually more suitable for further processes. Also, a precise lemmatizer requires comprehensive knowledge of all variations of all words from a text. It's a much costly option than a stemmer.

### Tagging
`nltk` provides powerful tagging tools to tag words with part-of-speech. We uses `POS-tagger` to tag sentences.

In [7]:
nltk.pos_tag(regex_tokenizer.tokenize('This is a sentence.'))

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sentence', 'NN')]

 We will be using **singular noun: 'NN', verb: 'VB', adjective: 'JJ', and adverb: 'RB'**. 
 
 Here's the full [list](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) of part-of-speech tags.

## Natural Language

The previous sections are mainly about how to preprocess texts. Natural language processing is not just about strings, but about natural languages. `nltk` provides libraries that help working with semantic part of natural language processing.

`WordNet` is a lexical database with which you can play with means of English words, synonyms, hypernyms, antonyms, meronyms and more.

First, create a `wordnet.synsets` object. The example simply uses 'computer'. `synsets` is a set of synonyms of the word 'computer'

In [8]:
from nltk.corpus import wordnet as wn
synsets = wn.synsets('computer')

### Synonyms

Each object in 'synsets' has a name. 

In [9]:
print[syns.name() for syns in synsets]

[u'computer.n.01', u'calculator.n.01']


For each name, there are 3 parts. For example, 'computer.n.01', 'telephone' is the synonyms itself; 'n' means for this word, it uses the noun meaning; '01' means it uses the 1st definition of the noun form of this word.

### Definition of words

For each object in `synsets`, we can get the definition of the word.

In [10]:
print [syns.definition() for syns in synsets]

[u'a machine for performing calculations automatically', u'an expert at calculation (or at operating calculating machines)']


### Lemmas of the words

To get the word itself instead of the name of an object in synsets, we need to use the lemma of the word.

In [11]:
lemma_names = [syns.lemmas()[0].name() for syns in synsets]
print lemma_names

[u'computer', u'calculator']


### Antonyms

With words themselves, we can get antonyms of them by getting antonyms of each lemmas object. We use another example word 'good' because there's no antonym for computer

In [12]:
lemmas = [syns.lemmas()[0] for syns in wn.synsets('good')]
print set(None if len(l.antonyms()) == 0 else l.antonyms()[0].name() for l in lemmas)

set([u'bad', u'ill', u'evil', None])


### Similarity
The `wordnet` database comes with functions to get semantic related-ness of words. `wordnet.synset.wu_similarity` uses the [Wu and Palmer method](http://search.cpan.org/dist/WordNet-Similarity/lib/WordNet/Similarity/wup.pm).

Here are some fun words to play with.

A programmer has lower similarity with human than an ape:

In [13]:
w1 = wn.synset('human.n.01')
w2 = wn.synset('ape.n.01')
print 'ape: ' + w2.definition()
print `w1.wup_similarity(w2)` + '\n'
w3 = wn.synset('programmer.n.01')
print 'programmer: ' + w3.definition()
print (w1.wup_similarity(w3))

ape: any of various primates with short tails or no tail at all
0.8888888888888888

programmer: a person who designs and writes and tests computer programs
0.521739130435


It seems that programmers are more related to liquor than coffee:

In [14]:
w1 = wn.synset('programmer.n.01')
w2 = wn.synset('coffee.n.01')
print 'coffee: ' + w2.definition()
print `w1.wup_similarity(w2)` + '\n'
w3 = wn.synset('liquor.n.01')
print 'liquor: ' + w3.definition()
print w1.wup_similarity(w3)

coffee: a beverage consisting of an infusion of ground coffee beans
0.3076923076923077

liquor: an alcoholic beverage that is distilled rather than fermented
0.428571428571


Programmers are not that anti-social. After all, they love partying as much as they love coding XD

In [15]:
w1 = wn.synset('programmer.n.01')
w2 = wn.synset('code.n.03')
print 'code: ' + w2.definition()
print `w1.wup_similarity(w2)` + '\n'
w3 = wn.synset('party.n.02')
print 'party: ' + w2.definition()
print w1.wup_similarity(w3)

code: (computer science) the symbolic arrangement of data or instructions in a computer program or the set of such instructions
0.15384615384615385

party: (computer science) the symbolic arrangement of data or instructions in a computer program or the set of such instructions
0.153846153846


## Putting things together
`nltk.corpus` not only has functions that manipulates natural languages, but also contains actual text from books, web chat rooms, and many other sources. In this section, we are going to apply some of the techniques presented above to *Moby Dick* from `nltk.corpus`

### Gutenberg Corpus - *Moby Dick*
Project Gutenberg has more than 20,000 free electronic books. `nltk.corpus` has a small number of the collection. We are going to use a classic American novel, *Moby Dick*, as our example.

In [16]:
from nltk.corpus import gutenberg
gutenberg.fileids()

[u'austen-emma.txt',
 u'austen-persuasion.txt',
 u'austen-sense.txt',
 u'bible-kjv.txt',
 u'blake-poems.txt',
 u'bryant-stories.txt',
 u'burgess-busterbrown.txt',
 u'carroll-alice.txt',
 u'chesterton-ball.txt',
 u'chesterton-brown.txt',
 u'chesterton-thursday.txt',
 u'edgeworth-parents.txt',
 u'melville-moby_dick.txt',
 u'milton-paradise.txt',
 u'shakespeare-caesar.txt',
 u'shakespeare-hamlet.txt',
 u'shakespeare-macbeth.txt',
 u'whitman-leaves.txt']

Above are the books in `nltk.corpus.gutenberg`. We will be using 'melville-moby_dick.txt'. Get the book and some basic metrics of the book.

In [17]:
moby_dick_sents = gutenberg.sents('melville-moby_dick.txt')

Since the text is from Project Gutenberg, there are some sentences that's about the project but not *Moby Dick*. Let's start with getting then content of the book.

*Moby Dick* starts with "Call me Ishmael." and ends with "It was the devious-cruising Rachel, that in her retracing search after her missing children, only found another orphan."

In [18]:
first_sent = treebank_tokenizer.tokenize('Call me Ishmael.');
last_sent = regex_tokenizer.tokenize('It was the devious-cruising Rachel, that in her retracing search after her missing children, only found another orphan.')
for i in range(len(moby_dick_sents)):
    if first_sent == moby_dick_sents[i]:
        first_idx = i
    if set(last_sent).issubset(moby_dick_sents[i]):
        last_idx = i

moby_dick_sents = moby_dick_sents[first_idx:last_idx+2]

We need to do more cleaning on words in *Moby Dick*. Remove stopwords and puncuations so that we only focus on meaningful words.

In [19]:
import string
punctuations = set(string.punctuation)
stopwords=nltk.corpus.stopwords.words('english')
stopwords.extend([''])
moby_dick_words = []
for sent in moby_dick_sents:
    for word in sent:
        word = word.lower()
        if word not in punctuations\
            and word not in stopwords\
            and len(word) != 1:
            moby_dick_words.append(word)
print 'Moby Dick has ' + `len(moby_dick_words)` + ' words, and '\
                       + `len(moby_dick_sents)` + ' sentences.'

Moby Dick has 111868 words, and 9767 sentences.


### Stemming vs. Lemmatizing on *Moby Dick*

Let's get the number of unique words in *Moby Dick*.

In [20]:
len(set(moby_dick_words))

16768

However, as we know from previous sections, many words are variations of same words. For example, apples is the plural form of apple and they are essentially the same word. So the result above doesn't reflect the real number of unique words in *Moby Dick*. 

Let's compare the two ways to normalize words: `lemmatize` and `stem`.

In [21]:
lemmatized_words = [word_net_lemmatizer.lemmatize(w) for w in moby_dick_words]
stemmed_words = [porter_stemmer.stem(w) for w in moby_dick_words]

print 'There are ' + `len(set(lemmatized_words))` + ' unique words after lemmatizing.'
print 'There are ' + `len(set(stemmed_words))` + ' unique words after stemming.'

There are 14751 unique words after lemmatizing.
There are 10568 unique words after stemming.


There's a big difference between the number of unique words processed with lemmatizing and stemming. There are 50% more lemmatized words than stemmed words. The reason might be that the lemmatizing requires the knowledge of words: it needs to know all variations of the same word in order to lemmatize them. If a variation of a word is not in the library of lemmatizer, it wouldn't be able to lemmatize it. On the contrary, stemming only checks suffix of words. This difference caused large difference between the 2 results.

### Tag *Moby Dick* Words
Now let's tag all words with `nltk`'s POS-tagger. Although there are fewer stemmed words, they might not be valid English words. We should choose lemmatized words instead of stemmed ones for tagging.
Again, we are only interested in nouns, verbs, adjectives and adverbs. (The following cell takes about 20 seconds to run as it's tagging all words.)

In [22]:
tagged_words = nltk.pos_tag(lemmatized_words)
pos_cat = defaultdict(list)
for tag in tagged_words:
    pos_cat[tag[1]].append(tag[0])
nouns = pos_cat['NN']
verbs = pos_cat['VB']
adjs = pos_cat['JJ']
advs = pos_cat['RB']

In [23]:
print 'In Moby Dick, \
there are {0} nouns, {1} verbs, {2} adjectives and {3} adverbs '.format(len(nouns), 
                                                                       len(verbs), 
                                                                       len(adjs), 
                                                                       len(advs))

In Moby Dick, there are 43594 nouns, 2545 verbs, 24337 adjectives and 8867 adverbs 


### Top words used by Herman Melville
With tagging information and nicely stemmed words, we can get top words with counting and sorting.

First, top nouns used by Herman Melville in *Moby Dick*.

In [24]:
noun_counts = defaultdict(int)
for noun in nouns:
    noun_counts[noun] += 1
sorted_noun_counts = sorted(noun_counts.items(), key=operator.itemgetter(1))
sorted_noun_counts.reverse()

From top ten nouns, we can easily tell that this novel is about whale, man and sea. Adventures is rolling out.

In [25]:
for i in range(10):
    print 'noun: {0}\tcount:{1}'.format(sorted_noun_counts[i][0], sorted_noun_counts[i][1])

noun: whale	count:689
noun: man	count:525
noun: sea	count:504
noun: boat	count:457
noun: time	count:444
noun: ship	count:441
noun: head	count:391
noun: hand	count:342
noun: thing	count:316
noun: way	count:289


Repeat the same process for verbs, adjectives and adverbs.

In [26]:
verb_counts = defaultdict(int)
for verb in verbs:
    verb_counts[verb] += 1
sorted_verb_counts = sorted(verb_counts.items(), key=operator.itemgetter(1))
sorted_verb_counts.reverse()
for i in range(10):
    print 'verb: {0}\tcount:{1}'.format(sorted_verb_counts[i][0], sorted_verb_counts[i][1])

verb: go	count:106
verb: take	count:84
verb: see	count:63
verb: keep	count:63
verb: get	count:61
verb: come	count:42
verb: say	count:35
verb: give	count:34
verb: let	count:34
verb: make	count:31


In [27]:
adj_counts = defaultdict(int)
for adj in adjs:
    adj_counts[adj] += 1
sorted_adj_counts = sorted(adj_counts.items(), key=operator.itemgetter(1))
sorted_adj_counts.reverse()
for i in range(10):
    print 'adj: {0}\tcount:{1}'.format(sorted_adj_counts[i][0], sorted_adj_counts[i][1])

adj: whale	count:681
adj: old	count:446
adj: great	count:292
adj: white	count:281
adj: last	count:278
adj: little	count:235
adj: good	count:216
adj: u	count:186
adj: ahab	count:185
adj: sperm	count:173


In [28]:
adv_counts = defaultdict(int)
for adv in advs:
    adv_counts[adv] += 1
sorted_adv_counts = sorted(adv_counts.items(), key=operator.itemgetter(1))
sorted_adv_counts.reverse()
for i in range(10):
    print 'adv: {0}\tcount:{1}'.format(sorted_adv_counts[i][0], sorted_adv_counts[i][1])

adv: yet	count:344
adv: still	count:312
adv: well	count:224
adv: never	count:203
adv: ever	count:202
adv: almost	count:194
adv: long	count:192
adv: even	count:188
adv: far	count:163
adv: away	count:159


Words that Herman Melville often used are not much different from what we use daily. The greatness of *Moby Dick* not lies in the words it uses, but in the story and spirit that drives Americans to persue their dreams.

## Summary and References
This tutorial only covers a few functions that one can use with `nltk`. Deeper natural language analysis is possible with more knowledge on both language theories and tools.

### Useful Resources

- [Porter Stemming Algorithm](http://snowball.tartarus.org/algorithms/porter/stemmer.html)

- [Full List of Part-of-Speech Tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

- [Wu and Palmer Similarity](http://search.cpan.org/dist/WordNet-Similarity/lib/WordNet/Similarity/wup.pm)

### References

Bird, Steven, Edward Loper and Ewan Klein (2009), *Natural Language Processing with Python.* O’Reilly Media Inc.