# Part of Speech Tagging

Several corpora with manual part of speech (POS) tagging are included in NLTK. For this exercise, we'll use a sample of the Penn Treebank corpus, a collection of Wall Street Journal articles. We can access the part-of-speech information for either the Penn Treebank or the Brown corpus as follows. We work with sentences here because the sentence is the preferred unit for POS tagging.

In [1]:
from nltk.corpus import treebank, brown

print treebank.tagged_sents()[0]
print brown.tagged_sents()[0]

[(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), (u',', u','), (u'61', u'CD'), (u'years', u'NNS'), (u'old', u'JJ'), (u',', u','), (u'will', u'MD'), (u'join', u'VB'), (u'the', u'DT'), (u'board', u'NN'), (u'as', u'IN'), (u'a', u'DT'), (u'nonexecutive', u'JJ'), (u'director', u'NN'), (u'Nov.', u'NNP'), (u'29', u'CD'), (u'.', u'.')]
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]


In NLTK, word/tag pairs are stored as tuples; the corpus reader handles the transformation from the plain-text "word/tag" representation to Python data types.
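To make the tuple representation concrete, here is a minimal sketch of turning a "word/tag" string into (word, tag) pairs. The helper name `parse_tagged` and the sample string are made up for illustration; this is not how NLTK's corpus readers are implemented, and they handle many more cases.

```python
def parse_tagged(text, sep="/"):
    """Split whitespace-separated 'word/tag' tokens into (word, tag) tuples."""
    pairs = []
    for token in text.split():
        if sep not in token:
            # No tag attached: keep the word and mark the tag as missing
            pairs.append((token, None))
            continue
        # rpartition splits on the LAST separator, so a word such as "1/2"
        # keeps its own slash and only the final piece is read as the tag
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs

sample = "Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ"
print(parse_tagged(sample))
```

Splitting on the last separator is a design choice for this sketch: it keeps tokens that contain a slash intact, at the cost of assuming that tags themselves never contain one.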

The two corpora do not share a tagset: the Brown corpus was tagged with a more fine-grained one. For instance, instead of "DT" (determiner) as in the Penn Treebank, the word "the" is tagged as "AT" (article, which is a kind of determiner). We can convert both to the Universal tagset.

In [2]:
print treebank.tagged_sents(tagset="universal")[0]
print brown.tagged_sents(tagset="universal")[0]

[(u'Pierre', u'NOUN'), (u'Vinken', u'NOUN'), (u',', u'.'), (u'61', u'NUM'), (u'years', u'NOUN'), (u'old', u'ADJ'), (u',', u'.'), (u'will', u'VERB'), (u'join', u'VERB'), (u'the', u'DET'), (u'board', u'NOUN'), (u'as', u'ADP'), (u'a', u'DET'), (u'nonexecutive', u'ADJ'), (u'director', u'NOUN'), (u'Nov.', u'NOUN'), (u'29', u'NUM'), (u'.', u'.')]
[(u'The', u'DET'), (u'Fulton', u'NOUN'), (u'County', u'NOUN'), (u'Grand', u'ADJ'), (u'Jury', u'NOUN'), (u'said', u'VERB'), (u'Friday', u'NOUN'), (u'an', u'DET'), (u'investigation', u'NOUN'), (u'of', u'ADP'), (u"Atlanta's", u'NOUN'), (u'recent', u'ADJ'), (u'primary', u'NOUN'), (u'election', u'NOUN'), (u'produced', u'VERB'), (u'``', u'.'), (u'no', u'DET'), (u'evidence', u'NOUN'), (u"''", u'.'), (u'that', u'ADP'), (u'any', u'DET'), (u'irregularities', u'NOUN'), (u'took', u'VERB'), (u'place', u'NOUN'), (u'.', u'.')]
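Conceptually, the conversion is a lookup from each fine-grained tag to a coarse one. The sketch below illustrates the idea with a partial, hand-written table containing only Penn Treebank pairs visible in the output above. The names `PTB_TO_UNIVERSAL` and `to_universal` are hypothetical, and falling back to "X" (the Universal tagset's "other" category) for unlisted tags is an assumption of this sketch, not a description of NLTK's behaviour.

```python
# Partial, illustrative mapping: only tag pairs that appear in the output above
PTB_TO_UNIVERSAL = {
    "NNP": "NOUN", "NNS": "NOUN", "NN": "NOUN",
    "CD": "NUM", "JJ": "ADJ",
    "MD": "VERB", "VB": "VERB",
    "DT": "DET", "IN": "ADP",
    ",": ".", ".": ".",
}

def to_universal(tagged_sent, mapping=PTB_TO_UNIVERSAL, unknown="X"):
    """Map each (word, fine_tag) pair to (word, coarse_tag)."""
    # Tags missing from the table fall back to `unknown` (an assumption here)
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_sent]

print(to_universal([("the", "DT"), ("board", "NN"), ("as", "IN")]))
```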


Now, let's create a basic unigram POS tagger. First, we need to collect POS distributions for each word. We'll do this (somewhat inefficiently) using a dictionary of dictionaries.

In [3]:
from collections import defaultdict

POS_dict = defaultdict(dict)
for word_pos_pair in treebank.tagged_words():
    word = word_pos_pair[0].lower()
    POS = word_pos_pair[1]
    POS_dict[word][POS] = POS_dict[word].get(POS,0) + 1
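As an aside, the same counts can be collected a little more compactly with `collections.Counter` as the inner dictionary. The sketch below uses a small made-up word list (`toy_tagged_words`) in place of the Treebank so that it runs without the corpus; it is an alternative formulation, not a replacement for the cell above.

```python
from collections import Counter, defaultdict

# Made-up sample standing in for treebank.tagged_words()
toy_tagged_words = [
    ("The", "DT"), ("increase", "NN"), ("will", "MD"),
    ("increase", "VB"), ("increase", "NN"),
]

pos_counts = defaultdict(Counter)
for word, pos in toy_tagged_words:
    # A Counter returns 0 for unseen keys, so no .get(pos, 0) is needed
    pos_counts[word.lower()][pos] += 1

print(dict(pos_counts["increase"]))
# most_common(1) returns a list holding the single most frequent (tag, count) pair
print(pos_counts["increase"].most_common(1)[0][0])
```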

Let's look at some words which appear with multiple POS, and their POS counts:

In [4]:
for word in POS_dict.keys()[:100]:
    if len(POS_dict[word]) > 1:
        print word
        print POS_dict[word]

increase
{u'VB': 11, u'NN': 20}
refunding
{u'VBG': 1, u'NN': 3}
straight
{u'RB': 2, u'JJ': 3}
second
{u'JJ': 16, u'NNP': 2}
contributed
{u'VBN': 1, u'VBD': 6}
reported
{u'VBN': 12, u'VBD': 25}
elaborate
{u'VB': 7, u'JJ': 2}
climbed
{u'VBN': 1, u'VBD': 4}
reports
{u'VBZ': 2, u'NNS': 13}
sino-u.s.
{u'JJ': 1, u'NNP': 1}
criticism
{u'NN': 1, u'NNP': 1}
brought
{u'VBN': 3, u'VBD': 3}


Common ambiguities that we see here are between nouns and verbs (<i>increase</i>, <i>refunding</i>, <i>reports</i>), and, among verbs, between past tense and past participles (<i>contributed</i>, <i>reported</i>, <i>climbed</i>).

To create an actual tagger, we just need to pick the most common tag for each word.

In [None]:
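# Pick the most frequent tag for each (lowercased) word
tagger_dict = {}
for word in POS_dict:
    tagger_dict[word] = max(POS_dict[word], key=POS_dict[word].get)

def tag(sentence):
    # Words not found in tagger_dict default to "NN"; the lookup is case-sensitive
    return [(word, tagger_dict.get(word, "NN")) for word in sentence]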

print tag(brown.sents()[0])

[(u'The', 'NN'), (u'Fulton', 'NN'), (u'County', 'NN'), (u'Grand', 'NN'), (u'Jury', 'NN'), (u'said', u'VBD'), (u'Friday', 'NN'), (u'an', u'DT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", 'NN'), (u'recent', u'JJ'), (u'primary', u'JJ'), (u'election', u'NN'), (u'produced', u'VBN'), (u'``', u'``'), (u'no', u'DT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'IN'), (u'any', u'DT'), (u'irregularities', 'NN'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]


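Though we'd probably want better handling of capitalized words (backing off to NNP rather than NN when a word is capitalized), and there are a few other obvious errors, it's generally not too bad. Note that the dictionary keys were lowercased while the lookup in tag is not, which is why even "The" falls through to the NN default.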
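The capitalization fallback suggested above could look roughly like the following sketch. It lowercases each word before lookup (to match lowercased dictionary keys) and falls back to NNP for unknown capitalized words. The function name `tag_with_fallback` and the tiny `toy_tagger_dict` are made up for illustration and stand in for the dictionary built from the Treebank.

```python
# Made-up stand-in for the word -> most-common-tag dictionary
toy_tagger_dict = {"the": "DT", "said": "VBD", "investigation": "NN"}

def tag_with_fallback(sentence, tagger_dict):
    """Unigram lookup with a capitalization-aware default tag."""
    tagged = []
    for word in sentence:
        key = word.lower()  # dictionary keys are assumed to be lowercased
        if key in tagger_dict:
            tagged.append((word, tagger_dict[key]))
        elif word[:1].isupper():
            # Unknown capitalized word: guess proper noun
            tagged.append((word, "NNP"))
        else:
            tagged.append((word, "NN"))
    return tagged

print(tag_with_fallback(["The", "Fulton", "County", "said", "irregularities"],
                        toy_tagger_dict))
```

One limitation of this heuristic is that an unknown word at the start of a sentence is capitalized regardless of its part of speech, so it would be guessed as NNP too.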

NLTK has built-in support for n-gram taggers. Let's build unigram and bigram taggers and test their performance. First we need to split our corpus into training and test sets.

In [None]:
size = int(len(treebank.tagged_sents()) * 0.9)
train_sents = treebank.tagged_sents()[:size]
test_sents = treebank.tagged_sents()[size:]


Let's first compare a unigram and a bigram tagger. All NLTK taggers have an evaluate method, which returns the accuracy on a test set.

In [None]:
from nltk import UnigramTagger, BigramTagger

unigram_tagger = UnigramTagger(train_sents)
bigram_tagger = BigramTagger(train_sents)
print unigram_tagger.evaluate(test_sents)
print unigram_tagger.tag(brown.sents()[1])
print bigram_tagger.evaluate(test_sents)
print bigram_tagger.tag(brown.sents()[1])

The unigram tagger does much better. The reason is sparsity: the bigram tagger doesn't have counts for many of the word/tag context pairs. What's worse, once it can't tag a word, it fails catastrophically for the rest of the sentence, because the untagged word leaves it with a tag context for which it has no counts at all. We can fix this by adding backoffs, including a default tagger which just tags everything as NN.

In [None]:
from nltk import DefaultTagger

default_tagger = DefaultTagger("NN")
unigram_tagger = UnigramTagger(train_sents,backoff=default_tagger)
bigram_tagger = BigramTagger(train_sents,backoff=unigram_tagger)

print bigram_tagger.evaluate(test_sents)
print bigram_tagger.tag(brown.sents()[1])

We see a 3% increase in performance from adding the bigram information on top of the unigram information.
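To see why backoff matters, the following is a simplified plain-Python sketch of a bigram tagger with an optional unigram and default fallback, together with token-level accuracy. It is not NLTK's implementation: all function names, the `START` marker, and the toy training data are assumptions made for illustration.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker for "no previous tag yet"

def train_unigram(tagged_sents):
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def train_bigram(tagged_sents):
    # Context is (previous tag, current word)
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = START
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag_sentence(words, bigram, unigram=None, default=None):
    tagged, prev_tag = [], START
    for word in words:
        tag = bigram.get((prev_tag, word))
        if tag is None and unigram is not None:
            tag = unigram.get(word)   # back off to the unigram table
        if tag is None:
            tag = default             # last resort (None means "untagged")
        tagged.append((word, tag))
        prev_tag = tag                # a None here poisons the next context
    return tagged

def accuracy(tagged, gold):
    correct = sum(1 for (_, t), (_, g) in zip(tagged, gold) if t == g)
    return correct / float(len(gold))

train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
gold = [("the", "DT"), ("bird", "NN"), ("sleeps", "VBZ")]
words = [w for w, _ in gold]

bigram, unigram = train_bigram(train), train_unigram(train)
no_backoff = tag_sentence(words, bigram)
with_backoff = tag_sentence(words, bigram, unigram, default="NN")
print(no_backoff)
print(accuracy(no_backoff, gold))
print(with_backoff)
print(accuracy(with_backoff, gold))
```

In the toy run, the unseen word "bird" leaves the plain bigram tagger without a tag, and the following word then fails too because its context contains a missing tag. With backoff, "bird" receives the default tag and the rest of the sentence is tagged normally.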

NLTK has interfaces to the Brill tagger (nltk.tag.brill) and also to pre-built, state-of-the-art sequential POS tagging models, for instance the Stanford POS tagger (StanfordPOSTagger), which is what you should use if you actually need high-quality POS tagging for some application. If you are working on a computer with the Stanford CoreNLP tools installed and NLTK set up to use them (this is the case for the lab computers where workshops are held), the code below should work. If not, see the documentation <a href="https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software">here</a>.

In [None]:
from nltk import StanfordPOSTagger

stanford_tagger = StanfordPOSTagger('english-bidirectional-distsim.tagger')
print stanford_tagger.tag(brown.sents()[1])