# NLTK - Getting Started

### Gabriele Pergola - gabriele.pergola@warwick.ac.uk

NLTK, the Natural Language Toolkit, is a Python library that contains a number of resources for building natural language processing applications.

The toolkit comes with a number of existing text corpora that you can use for building and training models out of the box. It also comes with useful "lexical resource" datasets that can be used to augment and supplement your application.

More information about the following exercises is available in [Chapter 1](http://www.nltk.org/book/ch01.html#sec-computing-with-language-texts-and-words) of the NLTK book.

## NLTK text resources

NLTK comes with a number of resources. It is very easy to import them and use them to build NLP tools. <br>
Let's start by listing NLTK resources available to us.


In [None]:
# First, let's download NLTK corpora
import nltk

nltk.download('book')

In [None]:
from nltk.book import *
# Print the list of the available books
texts()

## The NLTK Text object

The Text object is a wrapper for a list of tokens representing the documents:

In [None]:
print(type(text1))

In [None]:
print(text1.tokens[:50])

NLTK Text objects are stored in a way that makes some common NLP tasks very easy to run on them.

## Concordance and similarity

The NLTK `concordance()` method shows every occurrence of a particular word together with its surrounding context, which lets you see how the word is being used. Let's try this on the Moby Dick text.


In [None]:
text1.concordance("monstrous")

We can see that "monstrous" is often used in the context of size and whales. I guess this is no surprise given the book we're reading.
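To make the idea concrete, here is a minimal plain-Python sketch of a concordance. It is not NLTK's implementation: the function name `concordance_lines`, the `width` parameter and the toy sentence are all assumptions made for illustration.

```python
def concordance_lines(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context on both sides."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        # Match case-insensitively so "Monstrous" and "monstrous" both count
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Toy token list (made up for this example, not taken from the corpus)
sample_tokens = "a most monstrous size and the monstrous pictures of whales".split()
for line in concordance_lines(sample_tokens, "monstrous"):
    print(line)
```

Each printed line places the matched word in brackets with a few tokens on either side, which is the same kind of view `concordance()` gives over the full text.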

Another method we can use here is `similar()`. It uses the context of a word to find other words used in similar contexts: it looks at the words surrounding "monstrous", such as <i>"most _ size"</i> or <i>"the _ pictures"</i>, and tries to find other words that appear in the same slots.

In [None]:
text1.similar("monstrous")

Although perhaps a little tenuously related, these are all adjectives that do roughly fit the contexts described above.
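The context-matching idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not how NLTK scores similarity: `similar_words` is a hypothetical helper that treats the (previous word, next word) pair as the context and ranks other words by how many contexts they share with the target.

```python
from collections import Counter, defaultdict


def similar_words(tokens, word, top_n=5):
    """Rank words by how many (left, right) contexts they share with `word`."""
    contexts = defaultdict(set)  # word -> set of (left neighbour, right neighbour)
    for left, middle, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[middle].add((left, right))

    target_contexts = contexts.get(word, set())
    scores = Counter()
    for other, other_contexts in contexts.items():
        if other == word:
            continue
        shared = len(other_contexts & target_contexts)
        if shared:
            scores[other] = shared
    return [w for w, _ in scores.most_common(top_n)]


# Toy text (made up): "enormous" and "curious" fill the same slots as "monstrous"
toy_tokens = ("the monstrous size . the enormous size . "
              "a monstrous whale . a curious whale . the small boat").split()
print(similar_words(toy_tokens, "monstrous"))
```

On this toy text only the two adjectives that occur between the same neighbours as "monstrous" are returned, while "small" is left out because it never shares a context with it.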

We can also look at the other words used in the book and how frequently they are used.

## Frequency Distributions and NLTK

NLTK provides `FreqDist`, a dictionary-like class that counts occurrences of items in a list (it subclasses Python's `collections.Counter`) and can also plot the counts.
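Because the counting behaviour comes from `collections.Counter`, the core of it can be tried with the standard library alone. The sketch below uses a made-up token list and plain `Counter`; plotting is the part that `FreqDist` adds on top.

```python
from collections import Counter

# Toy token list (made up for illustration)
tokens = "the whale and the sea and the ship".split()
counts = Counter(tokens)

print(counts["the"])          # how often a token occurs
print(counts["monstrous"])    # a token that never occurs counts as 0
print(counts.most_common(2))  # the two most frequent tokens with their counts
```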

Let's examine the words in Moby Dick with a frequency distribution.

In [None]:
f = FreqDist(text1)

print("--- Sample of word frequencies ---")
print("'the': ", f["the"])
print("'whale': ", f["whale"])
print("'monstrous': ", f["monstrous"])

In [None]:
%matplotlib inline
f.plot(20, cumulative=False)

This is interesting, but unfortunately many of the most frequent words are common ones like 'the', 'of', 'and' and 'to'. These are what we call <b>stopwords</b>: words common to almost all documents that therefore tell an analyst very little. We want to filter these out if we can.

Thankfully NLTK comes with a stopwords list too. All we need to do is filter Moby Dick using this list.

In [None]:
from nltk.corpus import stopwords as StopwordsLoader

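# words() with no argument returns the stopword lists for every language NLTK ships.
# Storing them in a set keeps the membership tests below fast.
stopwords = set(StopwordsLoader.words()) | {':','?','!','"','--','-', "'", '."', ';','.',','}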

f = FreqDist([ x for x in text1 if x not in stopwords ])

In [None]:
# Print and plot the most frequent words except stopwords
print(f.most_common(20))

f.plot(20, cumulative=False)

This is much more interesting and informative. The plot really helps paint a picture of the plot and themes of the book. We are still seeing a number of words that are not descriptive, so let's introduce a rule that filters out words shorter than 5 characters.

In [None]:
f = FreqDist([ x for x in text1 if (x not in stopwords and len(x) > 4)])

In [None]:
# Print and plot the most frequent words longer than 4 characters except stopwords
print(f.most_common(20))

f.plot(20, cumulative=False)
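The two filtering steps above can be checked on a toy example. This is only a sketch: the tiny stopword set and the sentence are made up, and `Counter` stands in for `FreqDist`.

```python
from collections import Counter

# Hypothetical mini stopword set and toy sentence, for illustration only
toy_stopwords = {"the", "of", "and", "to", "a", ",", "."}
sample = "the whale , the ship and the captain of the whale".split()

# Step 1: drop stopwords and punctuation
content_words = [t for t in sample if t not in toy_stopwords]
# Step 2: additionally drop words of 4 characters or fewer
long_words = [t for t in content_words if len(t) > 4]

print(Counter(content_words).most_common(3))
print(Counter(long_words).most_common(3))
```

Note that the length rule is blunt: it removes "ship" here even though it is a meaningful word, so the threshold is worth tuning for your own corpus.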

## Collocations

Collocations are sequences of words that occur together more frequently than normal in a passage of text. For example "red wine" or "single mum" or "slim build". Often collocations describe well known phrases or idioms or are compound nouns. The NLTK book describes collocations as being resistant to substitution with words that have similar senses - e.g. maroon wine just doesn't seem the same as red wine.

We find collocations by identifying bigrams that occur unusually often in the text. Bigrams are just pairs of words that occur next to each other, like the following:

In [None]:
from nltk import bigrams

print(list(bigrams("Moby Dick is about whales and human beings!".split(" "))))
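For intuition, the same pairs can be produced without NLTK by zipping a token list with itself shifted by one position. The helper name `adjacent_pairs` is an assumption made for this sketch.

```python
def adjacent_pairs(tokens):
    """Pair every token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


words = "Moby Dick is about whales".split()
print(adjacent_pairs(words))
```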

NLTK has a built-in `collocations()` method that can be run on Moby Dick like so. It finds bigrams that occur together more often than their individual word frequencies would suggest.

In [None]:
text1.collocations()

The collocations here are very specific to the book - Moby Dick. This gives us a great idea of the sorts of concepts and ideas that are important in Moby Dick.

## Using your own text with NLTK

It's great that NLTK comes with so many resources, but how do you go about using your own corpus? If you have a series of plain text files, like a movie review dataset, this is very simple. We use a `PlaintextCorpusReader` to enable NLTK to ingest and preprocess the corpus and allow us to do exercises like the ones above.

It is possible to load text files from your filesystem into a corpus reader:

In [None]:
from nltk.corpus import PlaintextCorpusReader

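# Read every file in the directory whose name matches the pattern into a corpus reader
# (raw string so the backslashes reach the regex engine unchanged)
my_local_corpus = PlaintextCorpusReader("Datasets/movie_reviews", r"\w+\.txt")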
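Roughly speaking, the reader selects files by matching their names against the pattern and then splits their contents into tokens. The sketch below imitates that with the standard library only. It is an approximation under stated assumptions, not the reader's actual code: `read_corpus_words` is a hypothetical helper, the tokenising regex is a simple word-or-punctuation split, and the files are created in a temporary directory just for the demonstration.

```python
import re
import tempfile
from pathlib import Path


def read_corpus_words(root, fileid_pattern):
    """Collect tokens from every file in `root` whose name matches the pattern."""
    words = []
    for path in sorted(Path(root).iterdir()):
        if re.fullmatch(fileid_pattern, path.name):
            text = path.read_text(encoding="utf-8")
            # Split into runs of word characters or runs of punctuation
            words.extend(re.findall(r"\w+|[^\w\s]+", text))
    return words


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "review_1.txt").write_text("A great film!", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")  # name does not match
    corpus_words = read_corpus_words(tmp, r"\w+\.txt")

print(corpus_words)
```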

We have now loaded the movie review corpus into NLTK. It can be split into words and sentences automatically for us. Let's examine the overall word frequency across the movie reviews.

In [None]:
f = FreqDist([ x for x in my_local_corpus.words() if (x not in stopwords and len(x) > 4)])

In [None]:
print(f.most_common(20))

f.plot(20, cumulative=False)

Not really any surprises here. Lots of words that make sense in a movie review context. Let's try doing collocations again.

As this is a custom corpus we will have to do a bit of set up this time.

In [None]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()

finder = BigramCollocationFinder.from_words(my_local_corpus.words())

# Filter out collocations appearing fewer than 3 times
finder.apply_freq_filter(3)

# Rank the remaining bigrams by pointwise mutual information (PMI)
print(finder.nbest(bigram_measures.pmi, 20))
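The ranking above uses pointwise mutual information (PMI), which compares how often two words occur together with how often they would be expected to if they were independent. A minimal sketch of that calculation follows. It assumes the textbook formula log2(count(x, y) * N / (count(x) * count(y))) with N taken as the number of tokens, and NLTK's own computation may differ in details. The names `pmi_scores` and `min_count` are hypothetical.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=1):
    """Score each adjacent word pair by PMI, skipping pairs seen fewer than min_count times."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scores = {}
    for (w1, w2), n_pair in pair_counts.items():
        if n_pair < min_count:
            continue
        scores[(w1, w2)] = math.log2(n_pair * total / (word_counts[w1] * word_counts[w2]))
    return scores


# Toy text (made up for illustration)
toy = "new york is big . new york is old . the big old city".split()

all_scores = pmi_scores(toy)
filtered_scores = pmi_scores(toy, min_count=2)
print(sorted(filtered_scores, key=filtered_scores.get, reverse=True))
```

Without the frequency filter, a pair seen only once such as ("old", "city") scores as highly as ("new", "york"), because PMI rewards rare words. That is why a frequency filter is usually applied before ranking.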

An NLTK Text object already implements collocation extraction, so we can also wrap the corpus words in one and call `collocations()`:

In [None]:
# This is the implementation of "collocations()" in a Text object, compare the differences:
# http://www.nltk.org/_modules/nltk/text.html

# Convert to NLTK text object
my_local_corpus_asText = nltk.Text(my_local_corpus.words())

# collocations() prints its results itself and returns None, so no print() is needed
my_local_corpus_asText.collocations()

This is much more interesting. What we start to see are names of actors and other crew members from movies under review.

## Further reading and more activities

NLTK provides a huge amount of scope for NLP experiments and text mining. For more ideas and guidance it is worth reading the [NLTK book](http://www.nltk.org/book/) online.