Introduction to NLP course (2017-2018).

Notebook 1: Introduction to NLTK.
by Venelin Kovatchev, University of Barcelona

In [1]:
# Import the main nltk library
import nltk

In [None]:
# Download all the NLTK data
# This step downloads a large amount of data.
# Do this at home on a stable network connection.
# DO NOT DO THIS IN CLASS!

nltk.download() 

Project Gutenberg contains over 25,000 books in digital format. A small selection of them is available in NLTK.

In [2]:
# Load the gutenberg corpus
from nltk.corpus import gutenberg

# Check the different files available in the gutenberg corpus
print(gutenberg.fileids())

[u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']


The corpus is available in three different formats: raw, tokenized and sentence.

In [3]:
# Assign three different versions of "Emma" to variables
emma_raw = gutenberg.raw('austen-emma.txt')
emma_words = gutenberg.words('austen-emma.txt')
emma_sents = gutenberg.sents('austen-emma.txt')

Inspect the different corpus formats.
- most corpora are available in "raw" format: a single string
- corpora in "tokenized" format contain linguistic units (tokens): a list of strings
- corpora in "sentence" format contain two linguistic levels (sentences and tokens): a list of lists of strings
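
The three formats can be illustrated on a toy text without downloading any data. This is only a sketch: the sample string and the regular expressions are assumptions made for illustration, and they are a crude stand-in for the tokenization NLTK applies to its corpora.

```python
import re

# Toy text standing in for a corpus file (not taken from NLTK).
raw = "Emma was happy. She lived at Hartfield."

# Crude sentence split: break on whitespace that follows a full stop.
sentences = re.split(r"(?<=\.)\s+", raw)

# Crude tokenizer: runs of word characters, or single punctuation marks.
sents = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

# Flattening the "sentence" format gives the "tokenized" format.
words = [tok for sent in sents for tok in sent]

print("raw:   {0!r}".format(raw))
print("words: {0}".format(words))
print("sents: {0}".format(sents))
```

Note that the raw string cannot be rebuilt exactly from the token list, because the original whitespace is lost during tokenization.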

In [4]:
print("Raw corpus\n----------------")
print("Raw corpus format: {0}".format(emma_raw[:10]))
print("Raw corpus length: {0}".format(len(emma_raw)))
print("\nTokenized corpus\n----------------")
print("Tokenized corpus format: {0}".format(emma_words[:10]))
print("Tokenized corpus length: {0}".format(len(emma_words)))
print("\nSentence corpus\n----------------")
print("Sentence corpus format: {0}".format(emma_sents[:3]))
print("Sentence corpus length: {0}".format(len(emma_sents)))

Raw corpus
----------------
Raw corpus format: [Emma by J
Raw corpus length: 887071

Tokenized corpus
----------------
Tokenized corpus format: [u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', u']', u'VOLUME', u'I', u'CHAPTER']
Tokenized corpus length: 192427

Sentence corpus
----------------
Sentence corpus format: [[u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', u']'], [u'VOLUME', u'I'], [u'CHAPTER', u'I']]
Sentence corpus length: 7752
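
The three lengths printed above measure different units (characters, tokens, sentences), so they can be combined into rough corpus statistics. The sketch below simply reuses the numbers from the output; the variable names are chosen here for illustration.

```python
# Sizes copied from the output above.
n_chars = 887071   # length of the raw string
n_tokens = 192427  # length of the token list
n_sents = 7752     # length of the sentence list

# float() keeps the division exact under both Python 2 and Python 3.
chars_per_token = float(n_chars) / n_tokens
tokens_per_sent = float(n_tokens) / n_sents

print("Raw characters per token: {0:.2f}".format(chars_per_token))
print("Tokens per sentence: {0:.2f}".format(tokens_per_sent))
```

The first figure counts whitespace as well, so it is only a loose indication of token length. Punctuation marks count as tokens in both figures.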


Linguistic processing is a standard step in many tasks. It adds linguistic information to the corpus, which can be used in later steps of a pipeline or as features in a machine learning (ML) system.

The Brown corpus in NLTK is annotated with part-of-speech tags (morphological categories).

In [5]:
# Import the Brown corpus
from nltk.corpus import brown

# Print the available categories
print(brown.categories())

# Load the tagged version of "news"
brown_twords = brown.tagged_words(categories='news')

print("\nPart of speech tagged tokenized corpus: {0}".format(brown_twords[:10]))

[u'adventure', u'belles_lettres', u'editorial', u'fiction', u'government', u'hobbies', u'humor', u'learned', u'lore', u'mystery', u'news', u'religion', u'reviews', u'romance', u'science_fiction']

Part of speech tagged tokenized corpus: [(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN')]
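
Each tagged token is a `(word, tag)` tuple, so ordinary Python tools are enough to work with the annotation. The sketch below copies the ten tokens from the output above into a list (so that it runs without the corpus) and then counts the tags and filters by tag prefix. Treating every tag that starts with `NN` as a common noun is a simplifying assumption about the Brown tagset.

```python
from collections import Counter

# The ten tagged tokens printed above, copied by hand.
tagged = [
    ("The", "AT"), ("Fulton", "NP-TL"), ("County", "NN-TL"),
    ("Grand", "JJ-TL"), ("Jury", "NN-TL"), ("said", "VBD"),
    ("Friday", "NR"), ("an", "AT"), ("investigation", "NN"),
    ("of", "IN"),
]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# Keep the words whose tag starts with "NN" (assumed here: common nouns).
common_nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(tag_counts)
print(common_nouns)
```

The same two lines should work on the full `brown_twords` object, since it behaves like a sequence of such tuples.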


The Penn Treebank corpus in NLTK is annotated with syntactic structure (parse trees).

In [6]:
# Load the Penn treebank
from nltk.corpus import treebank

# Get the id of the first file
first_id = treebank.fileids()[0]

# Look at the format
print("Syntactically parsed sentence: {}".format(treebank.parsed_sents(first_id)))

Syntactically parsed sentence: [Tree('S', [Tree('NP-SBJ', [Tree('NP', [Tree('NNP', ['Pierre']), Tree('NNP', ['Vinken'])]), Tree(',', [',']), Tree('ADJP', [Tree('NP', [Tree('CD', ['61']), Tree('NNS', ['years'])]), Tree('JJ', ['old'])]), Tree(',', [','])]), Tree('VP', [Tree('MD', ['will']), Tree('VP', [Tree('VB', ['join']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['board'])]), Tree('PP-CLR', [Tree('IN', ['as']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['nonexecutive']), Tree('NN', ['director'])])]), Tree('NP-TMP', [Tree('NNP', ['Nov.']), Tree('CD', ['29'])])])]), Tree('.', ['.'])]), Tree('S', [Tree('NP-SBJ', [Tree('NNP', ['Mr.']), Tree('NNP', ['Vinken'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP-PRD', [Tree('NP', [Tree('NN', ['chairman'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NP', [Tree('NNP', ['Elsevier']), Tree('NNP', ['N.V.'])]), Tree(',', [',']), Tree('NP', [Tree('DT', ['the']), Tree('NNP', ['Dutch']), Tree('VBG', ['publishing']), Tree('NN', ['group'])])])])])

In [7]:
# Draw the first sentence as a tree (opens a separate window)
treebank.parsed_sents(first_id)[0].draw()
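
A parse tree is a recursive structure: every node has a label and a list of children, and the children are either further nodes or words. To make that concrete, the sketch below rewrites the first tree from the output above as plain `(label, children)` tuples and walks it recursively. The tuple encoding and the two helper functions are assumptions made for illustration, not NLTK's `Tree` class, although NLTK's trees expose comparable `leaves()` and `pos()` methods.

```python
# First parse tree from the output above, as (label, children) tuples.
# Plain-Python stand-in for illustration; this is not nltk's Tree class.
tree = ("S", [
    ("NP-SBJ", [
        ("NP", [("NNP", ["Pierre"]), ("NNP", ["Vinken"])]),
        (",", [","]),
        ("ADJP", [("NP", [("CD", ["61"]), ("NNS", ["years"])]),
                  ("JJ", ["old"])]),
        (",", [","]),
    ]),
    ("VP", [
        ("MD", ["will"]),
        ("VP", [
            ("VB", ["join"]),
            ("NP", [("DT", ["the"]), ("NN", ["board"])]),
            ("PP-CLR", [("IN", ["as"]),
                        ("NP", [("DT", ["a"]), ("JJ", ["nonexecutive"]),
                                ("NN", ["director"])])]),
            ("NP-TMP", [("NNP", ["Nov."]), ("CD", ["29"])]),
        ]),
    ]),
    (".", ["."]),
])


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


def pos_pairs(node):
    """Return (word, tag) pairs taken from the lowest labeled nodes."""
    label, children = node
    if all(isinstance(child, str) for child in children):
        return [(child, label) for child in children]
    pairs = []
    for child in children:
        pairs.extend(pos_pairs(child))
    return pairs


print(" ".join(leaves(tree)))
print(pos_pairs(tree)[:4])
```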

Some corpora are built around a specific task. 
They have additional information (such as a label).

An example of such a corpus is the movie review corpus, in which each review is labeled as positive or negative.
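
A labeled corpus is often handled as a list of `(tokens, label)` pairs, one pair per document. The sketch below shows that shape on two toy documents and turns each into a simple bag-of-words feature dictionary. The token lists, the `bag_of_words` helper and the feature encoding are hypothetical choices made for illustration, not something the corpus prescribes.

```python
from collections import Counter

# Toy labeled documents (illustrative stand-ins for real reviews).
labeled_docs = [
    (["films", "adapted", "from", "comic", "books"], "pos"),
    (["a", "dull", "and", "predictable", "plot"], "neg"),
]


def bag_of_words(tokens):
    """Hypothetical feature extractor: mark each token as present."""
    return {token: True for token in tokens}


# Pair every feature dictionary with the label of its document.
featuresets = [(bag_of_words(tokens), label) for tokens, label in labeled_docs]

# Check how many documents each label has.
label_counts = Counter(label for _, label in labeled_docs)

print(featuresets[0])
print(label_counts)
```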

In [8]:
# Load the corpus
from nltk.corpus import movie_reviews
# Check the categories
print("Categories in the movie review corpus: {0}".format(movie_reviews.categories()))
# Load the first file of each category
movie_pos = movie_reviews.sents(movie_reviews.fileids("pos")[0])
movie_neg = movie_reviews.sents(movie_reviews.fileids("neg")[0])

# Look at the first three sentences of each review
print("\nPositive review: {0}".format(movie_pos[:3]))
print("\nNegative review: {0}".format(movie_neg[:3]))

Categories in the movie review corpus: [u'neg', u'pos']

Positive review: [[u'films', u'adapted', u'from', u'comic', u'books', u'have', u'had', u'plenty', u'of', u'success', u',', u'whether', u'they', u"'", u're', u'about', u'superheroes', u'(', u'batman', u',', u'superman', u',', u'spawn', u')', u',', u'or', u'geared', u'toward', u'kids', u'(', u'casper', u')', u'or', u'the', u'arthouse', u'crowd', u'(', u'ghost', u'world', u')', u',', u'but', u'there', u"'", u's', u'never', u'really', u'been', u'a', u'comic', u'book', u'like', u'from', u'hell', u'before', u'.'], [u'for', u'starters', u',', u'it', u'was', u'created', u'by', u'alan', u'moore', u'(', u'and', u'eddie', u'campbell', u')', u',', u'who', u'brought', u'the', u'medium', u'to', u'a', u'whole', u'new', u'level', u'in', u'the', u'mid', u"'", u'80s', u'with', u'a', u'12', u'-', u'part', u'series', u'called', u'the', u'watchmen', u'.'], [u'to', u'say', u'moore', u'and', u'campbell', u'thoroughly', u'researched', u'the', u'subject'