
### NLP Lab Session Week 4
### POS Taggers in the NLTK
### Part 1:  Session Setup and Tagged Corpora



Getting Started

In this lab session, we will work together through a series of small examples using the Python interpreter in a Jupyter notebook.  As before, you may use the text file with the Python examples.

Download LabWeek4.POStags.txt

Save it in a folder where you keep materials for this class.   Open your command prompt or terminal window and use the cd command to change directory to your class materials folder.  Type at the prompt:

$ jupyter notebook

As usual, start your NLP session by importing NLTK:

import nltk

Reading Tagged Corpora

The NLTK corpus readers for tagged corpora have additional methods (aka functions) that return the tag information along with the text.  Both the Brown corpus and the Penn Treebank corpus have text in which each token has been tagged with a POS tag.  (These were manually assigned by annotators.)

The tagged_sents function gives a list of sentences, where each sentence is a list of (word, tag) tuples. We’ll first look at the Brown corpus, which is described in Chapter 2 of the NLTK book.


In [1]:
import nltk
from nltk.corpus import brown
brown.tagged_sents()[:2]


[[('The', 'AT'),
  ('Fulton', 'NP-TL'),
  ('County', 'NN-TL'),
  ('Grand', 'JJ-TL'),
  ('Jury', 'NN-TL'),
  ('said', 'VBD'),
  ('Friday', 'NR'),
  ('an', 'AT'),
  ('investigation', 'NN'),
  ('of', 'IN'),
  ("Atlanta's", 'NP$'),
  ('recent', 'JJ'),
  ('primary', 'NN'),
  ('election', 'NN'),
  ('produced', 'VBD'),
  ('``', '``'),
  ('no', 'AT'),
  ('evidence', 'NN'),
  ("''", "''"),
  ('that', 'CS'),
  ('any', 'DTI'),
  ('irregularities', 'NNS'),
  ('took', 'VBD'),
  ('place', 'NN'),
  ('.', '.')],
 [('The', 'AT'),
  ('jury', 'NN'),
  ('further', 'RBR'),
  ('said', 'VBD'),
  ('in', 'IN'),
  ('term-end', 'NN'),
  ('presentments', 'NNS'),
  ('that', 'CS'),
  ('the', 'AT'),
  ('City', 'NN-TL'),
  ('Executive', 'JJ-TL'),
  ('Committee', 'NN-TL'),
  (',', ','),
  ('which', 'WDT'),
  ('had', 'HVD'),
  ('over-all', 'JJ'),
  ('charge', 'NN'),
  ('of', 'IN'),
  ('the', 'AT'),
  ('election', 'NN'),
  (',', ','),
  ('``', '``'),
  ('deserves', 'VBZ'),
  ('the', 'AT'),
  ('praise', 'NN'),
  ('and', 

The tagged_words function just gives a list of all the (word, tag) tuples, ignoring the sentence structure.

In [2]:
brown.tagged_words()[:50]

[('The', 'AT'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('Grand', 'JJ-TL'),
 ('Jury', 'NN-TL'),
 ('said', 'VBD'),
 ('Friday', 'NR'),
 ('an', 'AT'),
 ('investigation', 'NN'),
 ('of', 'IN'),
 ("Atlanta's", 'NP$'),
 ('recent', 'JJ'),
 ('primary', 'NN'),
 ('election', 'NN'),
 ('produced', 'VBD'),
 ('``', '``'),
 ('no', 'AT'),
 ('evidence', 'NN'),
 ("''", "''"),
 ('that', 'CS'),
 ('any', 'DTI'),
 ('irregularities', 'NNS'),
 ('took', 'VBD'),
 ('place', 'NN'),
 ('.', '.'),
 ('The', 'AT'),
 ('jury', 'NN'),
 ('further', 'RBR'),
 ('said', 'VBD'),
 ('in', 'IN'),
 ('term-end', 'NN'),
 ('presentments', 'NNS'),
 ('that', 'CS'),
 ('the', 'AT'),
 ('City', 'NN-TL'),
 ('Executive', 'JJ-TL'),
 ('Committee', 'NN-TL'),
 (',', ','),
 ('which', 'WDT'),
 ('had', 'HVD'),
 ('over-all', 'JJ'),
 ('charge', 'NN'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('election', 'NN'),
 (',', ','),
 ('``', '``'),
 ('deserves', 'VBZ'),
 ('the', 'AT'),
 ('praise', 'NN')]

We said that each of these is what Python calls a tuple, which is a pair (or triple, etc.) in which you can’t change the elements.
  


In [3]:
wordtag = brown.tagged_words()[0]

In [4]:
print(wordtag)

print(wordtag[0])

print(wordtag[1])


('The', 'AT')
The
AT
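
As a minimal sketch of what "can’t change the elements" means in practice, the example below uses a (word, tag) pair typed in by hand rather than read from the corpus.  It shows tuple unpacking and the error raised on an attempted modification; the name retagged is only for illustration.

```python
# A (word, tag) pair written by hand; in the lab it comes from brown.tagged_words().
wordtag = ('The', 'AT')

# Tuple unpacking assigns both elements in one step.
word, tag = wordtag
print(word, tag)

# Tuples cannot be modified in place: item assignment raises TypeError.
try:
    wordtag[1] = 'DT'
except TypeError as err:
    print('cannot modify a tuple:', err)

# To "change" a tag, build a new tuple instead.
retagged = (wordtag[0], 'DT')
print(retagged)
```

The original pair is left untouched, which is why tagged corpora can safely hand out the same tuples to many callers.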


The Brown corpus is organized into different types of text (categories), which can be selected with the categories argument.  The tagged_words and tagged_sents functions also accept a tagset argument that maps the tags to a simplified tag set, here the universal tag set, described in table 5.1 in the NLTK book.

In [5]:
brown.categories()
brown_humor_tagged = brown.tagged_words(categories='humor', tagset='universal')
brown_humor_tagged[:50]


[('It', 'PRON'),
 ('was', 'VERB'),
 ('among', 'ADP'),
 ('these', 'DET'),
 ('that', 'ADP'),
 ('Hinkle', 'NOUN'),
 ('identified', 'VERB'),
 ('a', 'DET'),
 ('photograph', 'NOUN'),
 ('of', 'ADP'),
 ('Barco', 'NOUN'),
 ('!', '.'),
 ('!', '.'),
 ('For', 'ADP'),
 ('it', 'PRON'),
 ('seems', 'VERB'),
 ('that', 'ADP'),
 ('Barco', 'NOUN'),
 (',', '.'),
 ('fancying', 'VERB'),
 ('himself', 'PRON'),
 ('a', 'DET'),
 ("ladies'", 'NOUN'),
 ('man', 'NOUN'),
 ('(', '.'),
 ('and', 'CONJ'),
 ('why', 'ADV'),
 ('not', 'ADV'),
 (',', '.'),
 ('after', 'ADP'),
 ('seven', 'NUM'),
 ('marriages', 'NOUN'),
 ('?', '.'),
 ('?', '.'),
 (')', '.'),
 (',', '.'),
 ('had', 'VERB'),
 ('listed', 'VERB'),
 ('himself', 'PRON'),
 ('for', 'ADP'),
 ('Mormon', 'NOUN'),
 ('Beard', 'NOUN'),
 ('roles', 'NOUN'),
 ('at', 'ADP'),
 ('the', 'DET'),
 ('instigation', 'NOUN'),
 ('of', 'ADP'),
 ('his', 'DET'),
 ('fourth', 'ADJ'),
 ('murder', 'NOUN')]
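
To make the idea of mapping to a simplified tag set concrete, here is a rough sketch in plain Python.  The dictionary is a hand-picked subset assumed for illustration; the real mapping used by tagset='universal' ships with NLTK and covers every Brown tag.  The names BROWN_TO_UNIVERSAL and simplify are hypothetical.

```python
# Hand-picked subset of Brown-to-universal correspondences, for illustration only.
BROWN_TO_UNIVERSAL = {
    'AT': 'DET',
    'NN': 'NOUN',
    'NNS': 'NOUN',
    'JJ': 'ADJ',
    'VBD': 'VERB',
    'IN': 'ADP',
    '.': '.',
}

def simplify(tagged_words, mapping, unknown='X'):
    """Replace each fine-grained tag with its coarse tag; unmapped tags become `unknown`."""
    return [(word, mapping.get(tag, unknown)) for (word, tag) in tagged_words]

sample = [('The', 'AT'), ('jury', 'NN'), ('said', 'VBD'), ('.', '.')]
print(simplify(sample, BROWN_TO_UNIVERSAL))
```

Several fine-grained tags collapse onto one coarse tag (NN and NNS both become NOUN), which is the point of a simplified tag set.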

Other tagged corpora also come with the tagged_words method.  Note that the chat corpus is tagged with Penn Treebank POS tags.



In [6]:
nltk.corpus.nps_chat.tagged_words()[:50]

[('now', 'RB'),
 ('im', 'PRP'),
 ('left', 'VBD'),
 ('with', 'IN'),
 ('this', 'DT'),
 ('gay', 'JJ'),
 ('name', 'NN'),
 (':P', 'UH'),
 ('PART', 'VB'),
 ('hey', 'UH'),
 ('everyone', 'NN'),
 ('ah', 'UH'),
 ('well', 'UH'),
 ('NICK', 'NN'),
 (':', ':'),
 ('U7', 'NNP'),
 ('U7', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('gay', 'JJ'),
 ('name', 'NN'),
 ('.', '.'),
 ('.', 'SYM'),
 ('ACTION', 'NN'),
 ('gives', 'VBZ'),
 ('U121', 'NNP'),
 ('a', 'DT'),
 ('golf', 'NN'),
 ('clap', 'NN'),
 ('.', '.'),
 (':)', 'UH'),
 ('JOIN', 'VB'),
 ('hi', 'UH'),
 ('U59', 'NNP'),
 ('26', 'CD'),
 ('/', 'CC'),
 ('m', 'NN'),
 ('/', 'CC'),
 ('ky', 'NNP'),
 ('women', 'NNS'),
 ('that', 'WDT'),
 ('are', 'VBP'),
 ('nice', 'JJ'),
 ('please', 'VB'),
 ('pm', 'VB'),
 ('me', 'PRP'),
 ('JOIN', 'VB'),
 ('PART', 'VB'),
 ('there', 'RB'),
 ('ya', 'PRP')]

### Penn Treebank

In this class, we will mostly use the Penn Treebank tag set, as it is the most widely used.  The Treebank has the tagged_words and tagged_sents methods, as well as the words method that we used before to get the tokens.


In [8]:
from nltk.corpus import treebank
# the .raw() and .words() functions still get the text as strings and as tokens
treebank_text = treebank.raw()
print(treebank_text[:150], '\n')
treebank_tokens = treebank.words()
print(treebank_tokens[:20])



( (S 
    (NP-SBJ 
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,) 
      (ADJP 
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) ) 

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.', 'Mr.', 'Vinken']


In [9]:
# but we also have functions to get words with tags and sentences with tagged words
# note: the [:50] slice keeps only the first 50 (word, tag) pairs
treebank_tagged_words = treebank.tagged_words()[:50]
len(treebank.tagged_words())
treebank_tagged_words[:50]

# note: the [:2] slice keeps only the first 2 tagged sentences
treebank_tagged = treebank.tagged_sents()[:2]
len(treebank.tagged_sents())
treebank_tagged[:2]



[[('Pierre', 'NNP'),
  ('Vinken', 'NNP'),
  (',', ','),
  ('61', 'CD'),
  ('years', 'NNS'),
  ('old', 'JJ'),
  (',', ','),
  ('will', 'MD'),
  ('join', 'VB'),
  ('the', 'DT'),
  ('board', 'NN'),
  ('as', 'IN'),
  ('a', 'DT'),
  ('nonexecutive', 'JJ'),
  ('director', 'NN'),
  ('Nov.', 'NNP'),
  ('29', 'CD'),
  ('.', '.')],
 [('Mr.', 'NNP'),
  ('Vinken', 'NNP'),
  ('is', 'VBZ'),
  ('chairman', 'NN'),
  ('of', 'IN'),
  ('Elsevier', 'NNP'),
  ('N.V.', 'NNP'),
  (',', ','),
  ('the', 'DT'),
  ('Dutch', 'NNP'),
  ('publishing', 'VBG'),
  ('group', 'NN'),
  ('.', '.')]]

The NLTK has almost 4,000 sentences of tagged data from the Penn Treebank, while the full Treebank has much more.  This limits the accuracy of the POS taggers (and later parsers) that we can train in lab, but it also keeps the running times short enough for labs.
Let’s look at the frequencies of the tags in this portion of the Penn Treebank.  To do that, we build an NLTK Frequency Distribution (FreqDist) over all the tags from the (word, tag) pairs in the Treebank.

In [10]:
tag_fd = nltk.FreqDist(tag for (word, tag) in treebank_tagged_words)
print(tag_fd.keys(), '\n')
for tag,freq in tag_fd.most_common():
    print(tag, freq)


dict_keys(['NNP', ',', 'CD', 'NNS', 'JJ', 'MD', 'VB', 'DT', 'NN', 'IN', '.', 'VBZ', 'VBG', 'CC', 'VBD', 'VBN', '-NONE-']) 

NNP 14
, 5
NN 5
JJ 4
DT 4
CD 3
IN 3
NNS 2
. 2
MD 1
VB 1
VBZ 1
VBG 1
CC 1
VBD 1
VBN 1
-NONE- 1


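On the full NLTK Treebank sample, NN, the tag for singular nouns, is the most frequent tag;  it has 13,166 occurrences among the 100,676 words, or about 13%.  The tags IN, for prepositions except to, NNP, for singular proper nouns, and DT, for determiners, are close behind at 10%, 9% and 8%, respectively.  The next tag in the list is -NONE-, which marks empty elements that come from the Treebank’s syntactic annotation rather than from words in the text.  The output shown above covers only the first 50 tagged words, because treebank_tagged_words was sliced with [:50] in In [9];  in that small slice NNP is the most frequent tag, with 14 of the 50 tokens.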
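
The percentages quoted here are relative frequencies: a tag’s count divided by the total number of tokens.  As a small sketch, the code below applies that arithmetic to the counts printed in the 50-word output above and to the full-sample figures quoted in the text.  It uses a plain dictionary rather than an NLTK FreqDist so that it runs without the corpus.

```python
# Tag counts copied from the 50-word output shown above.
counts = {'NNP': 14, ',': 5, 'NN': 5, 'JJ': 4, 'DT': 4, 'CD': 3, 'IN': 3,
          'NNS': 2, '.': 2, 'MD': 1, 'VB': 1, 'VBZ': 1, 'VBG': 1,
          'CC': 1, 'VBD': 1, 'VBN': 1, '-NONE-': 1}

total = sum(counts.values())
shares = {tag: n / total for tag, n in counts.items()}

# Print tags from most to least frequent with their share of all tokens.
for tag, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f'{tag:7s} {share:.0%}')

# The same arithmetic applied to the full-sample figures quoted in the text.
nn_share_full = 13166 / 100676
print(f'NN share of the full sample: {nn_share_full:.1%}')
```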

This is a very detailed look at the POS tags.  We could also approximate classes of tags by using the first letter of the POS tag as the key in a frequency distribution.  For example, this will group all the nouns, which all start with “N”, all the verbs, “V”, adjectives “J”, and adverbs “R”.  The prepositions will be split between “I” for the ones that are tagged “IN” and “T” for the ones that are tagged “TO”.



In [11]:
# use the first letter of the POS tag to get classes of tags
tag_classes_fd = nltk.FreqDist(tag[0] for (word, tag) in treebank_tagged_words)
print(tag_classes_fd.keys(), '\n')
for tag,freq in tag_classes_fd.most_common():
    print(tag, freq)




dict_keys(['N', ',', 'C', 'J', 'M', 'V', 'D', 'I', '.', '-']) 

N 21
, 5
V 5
C 4
J 4
D 4
I 3
. 2
M 1
- 1



### Part 2:  Tagger Training Setup

### The POS Tagging Task and Training a default tagger

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset. 

We will use the tagged sentences and words from the Penn Treebank that we defined in the previous section.

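We separate our tagged data into a training set, from which we’ll learn the frequencies of the words and their tags, and a test set to evaluate how our taggers perform.  This allows us to test the tagger’s accuracy on data that is similar to, but not the same as, the data it was trained on.  The training set is the first 90% of the sentences and the test set is the remaining 10%.  Note that treebank_tagged, as defined in the previous section, holds only the first two sentences, so here the split leaves one sentence for training and one for testing.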


In [25]:
treebank_tagged

[[('Pierre', 'NNP'),
  ('Vinken', 'NNP'),
  (',', ','),
  ('61', 'CD'),
  ('years', 'NNS'),
  ('old', 'JJ'),
  (',', ','),
  ('will', 'MD'),
  ('join', 'VB'),
  ('the', 'DT'),
  ('board', 'NN'),
  ('as', 'IN'),
  ('a', 'DT'),
  ('nonexecutive', 'JJ'),
  ('director', 'NN'),
  ('Nov.', 'NNP'),
  ('29', 'CD'),
  ('.', '.')],
 [('Mr.', 'NNP'),
  ('Vinken', 'NNP'),
  ('is', 'VBZ'),
  ('chairman', 'NN'),
  ('of', 'IN'),
  ('Elsevier', 'NNP'),
  ('N.V.', 'NNP'),
  (',', ','),
  ('the', 'DT'),
  ('Dutch', 'NNP'),
  ('publishing', 'VBG'),
  ('group', 'NN'),
  ('.', '.')]]

In [12]:
size = int(len(treebank_tagged) * 0.9)
treebank_train = treebank_tagged[:size]
treebank_test = treebank_tagged[size:]
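
The split above can be sketched as a small helper to see how it behaves for different input sizes.  The function name split_train_test and the placeholder sentences are hypothetical; only the slicing logic is taken from the cell above.

```python
def split_train_test(sents, train_fraction=0.9):
    """Split a list of sentences into a leading training part and a trailing test part."""
    size = int(len(sents) * train_fraction)
    return sents[:size], sents[size:]

# With only two sentences (as in treebank_tagged above), each part gets one sentence.
two = [['s1'], ['s2']]
train, test = split_train_test(two)
print(len(train), len(test))

# With a larger list the 90/10 proportion becomes visible.
many = [[f's{i}'] for i in range(100)]
train, test = split_train_test(many)
print(len(train), len(test))

# Assumption: to train on the whole NLTK sample, pass list(treebank.tagged_sents())
# instead of the two-sentence slice.
```

Because int() truncates, very small inputs give lopsided splits, which is worth keeping in mind when reading the accuracy figures later in this lab.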


In the NLTK, a number of POS taggers are included in the tag module, including one that we can use that has been trained on all of Penn Treebank.  But for instructional purposes, we will develop a sequence of N-gram taggers whose performance improves.

To introduce the N-gram taggers in NLTK, we start with a default tagger that just tags everything with the most frequent tag:  NN.   We create the tagger and run it on text.  Note that this simple tagger doesn’t actually use the training set.


In [14]:
# creates the tagger
t0 = nltk.DefaultTagger('NN')
# show the effect of the tagger by tagging the first 50 words
t0.tag(treebank_tokens[:50])


[('Pierre', 'NN'),
 ('Vinken', 'NN'),
 (',', 'NN'),
 ('61', 'NN'),
 ('years', 'NN'),
 ('old', 'NN'),
 (',', 'NN'),
 ('will', 'NN'),
 ('join', 'NN'),
 ('the', 'NN'),
 ('board', 'NN'),
 ('as', 'NN'),
 ('a', 'NN'),
 ('nonexecutive', 'NN'),
 ('director', 'NN'),
 ('Nov.', 'NN'),
 ('29', 'NN'),
 ('.', 'NN'),
 ('Mr.', 'NN'),
 ('Vinken', 'NN'),
 ('is', 'NN'),
 ('chairman', 'NN'),
 ('of', 'NN'),
 ('Elsevier', 'NN'),
 ('N.V.', 'NN'),
 (',', 'NN'),
 ('the', 'NN'),
 ('Dutch', 'NN'),
 ('publishing', 'NN'),
 ('group', 'NN'),
 ('.', 'NN'),
 ('Rudolph', 'NN'),
 ('Agnew', 'NN'),
 (',', 'NN'),
 ('55', 'NN'),
 ('years', 'NN'),
 ('old', 'NN'),
 ('and', 'NN'),
 ('former', 'NN'),
 ('chairman', 'NN'),
 ('of', 'NN'),
 ('Consolidated', 'NN'),
 ('Gold', 'NN'),
 ('Fields', 'NN'),
 ('PLC', 'NN'),
 (',', 'NN'),
 ('was', 'NN'),
 ('named', 'NN'),
 ('*-1', 'NN'),
 ('a', 'NN')]

The NLTK includes a function for taggers that computes tagging accuracy by comparing the result of a tagger with the original “gold standard” tagged text.  Here we use the NLTK function “evaluate” to apply the default tagger (to the untagged text) and compare it with the gold standard tagged text in the test set.  (In recent NLTK releases, evaluate is deprecated in favor of the equivalent function “accuracy”.)

In [15]:
t0.evaluate(treebank_test)

0.15384615384615385

The evaluate function first takes the tagged text and removes the tags, so that only tokens are left.  Then it runs the tagger, in this case t0, to tag all the text.  Then it compares the tags predicted by the tagger with the “gold standard” tags already given.  It reports the accuracy, which is the proportion of words with correct tags (a value between 0 and 1).
	Given a trained tagger, the evaluate function:
		Takes a test set consisting of text words with “correct” tags
		Creates a predicted test set by
			removing tags from the test set
			running the tagger to get predicted tags
		Compares the correct tag test set with the predicted tag test set
			and reports accuracy
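
The steps above can be sketched in plain Python.  This is an illustration of the procedure, not NLTK’s own implementation; the names tagging_accuracy and tag_all_nn are hypothetical, and the gold data is the second Treebank sentence shown earlier, which is the whole test set in this session.

```python
def tagging_accuracy(tag_tokens, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag.

    tag_tokens: function mapping a list of words to a list of (word, tag) pairs.
    gold_sents: list of sentences, each a list of (word, tag) pairs.
    """
    correct = total = 0
    for gold in gold_sents:
        words = [word for (word, _) in gold]          # strip the gold tags
        predicted = tag_tokens(words)                 # run the tagger
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted):
            correct += (gold_tag == pred_tag)         # compare tag by tag
            total += 1
    return correct / total if total else 0.0

def tag_all_nn(words):
    """Stand-in for the default tagger: every word gets 'NN'."""
    return [(word, 'NN') for word in words]

# The second Treebank sentence shown earlier.
gold = [[('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'),
         ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','),
         ('the', 'DT'), ('Dutch', 'NNP'), ('publishing', 'VBG'),
         ('group', 'NN'), ('.', '.')]]

print(tagging_accuracy(tag_all_nn, gold))   # 2 of the 13 tokens are NN
```

Two correct tags out of thirteen is the same ratio that t0.evaluate reports above.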


Other simple taggers described in the NLTK book are the Regular Expression Tagger and the Lookup Tagger.




### Part 3:  Training the N-Gram Tagger

We continue our development of taggers by training a Unigram tagger.  It tags each word with the most frequent tag that the word has in the training corpus.  For example, if the word “bank” occurs 30 times with the tag “NN” and 10 times with the tag “VB”, we’ll just tag it with “NN”.
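
The “bank” example can be reproduced with a small frequency table.  This is a sketch of the idea only, using made-up training data; NLTK’s UnigramTagger has its own implementation, and train_unigram_table is a hypothetical name.

```python
from collections import Counter, defaultdict

def train_unigram_table(tagged_sents):
    """Map each word to the tag it carries most often in the training sentences."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

# Made-up training data matching the "bank" example: 30 NN occurrences, 10 VB.
toy_train = [[('bank', 'NN')]] * 30 + [[('bank', 'VB')]] * 10

table = train_unigram_table(toy_train)
print(table['bank'])

# Words never seen in training are missing from the table, mirroring the None tags
# that a unigram tagger without a backoff returns.
print(table.get('Rudolph'))
```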

In [16]:
t1 = nltk.UnigramTagger(treebank_tagged)
t1.tag(treebank_tokens[:50])


[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.'),
 ('Mr.', 'NNP'),
 ('Vinken', 'NNP'),
 ('is', 'VBZ'),
 ('chairman', 'NN'),
 ('of', 'IN'),
 ('Elsevier', 'NNP'),
 ('N.V.', 'NNP'),
 (',', ','),
 ('the', 'DT'),
 ('Dutch', 'NNP'),
 ('publishing', 'VBG'),
 ('group', 'NN'),
 ('.', '.'),
 ('Rudolph', None),
 ('Agnew', None),
 (',', ','),
 ('55', None),
 ('years', 'NNS'),
 ('old', 'JJ'),
 ('and', None),
 ('former', None),
 ('chairman', 'NN'),
 ('of', 'IN'),
 ('Consolidated', None),
 ('Gold', None),
 ('Fields', None),
 ('PLC', None),
 (',', ','),
 ('was', None),
 ('named', None),
 ('*-1', None),
 ('a', 'DT')]

Train the tagger on the training set and evaluate on the test set.

In [17]:
t1 = nltk.UnigramTagger(treebank_train)
t1.evaluate(treebank_test)


0.3076923076923077

In the lecture slides, this Unigram Tagger is what Chris Manning called the baseline tagger, and it reached about 90% accuracy.  Why is ours so much lower?  (Hint: check how many sentences treebank_train and treebank_test contain.)

Finally, NLTK has a Bigram tagger, which chooses the tag for each word from the word itself together with the tag of the preceding word.
But the test data will contain (previous tag, word) contexts that the bigram tagger never saw in training, and words that the unigram tagger never saw, so we can use the backoff tagger capability of NLTK to create a combined tagger.  This tagger uses bigram frequencies to tag as much as possible.  If the bigram tagger has no entry for a word in its context, it backs off to the unigram tagger to tag that word.  If the word is unknown to the unigram tagger, then we use the default tagger to tag it as ‘NN’.
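
The backoff order described above can be sketched with two lookup tables and a default.  This is a simplified illustration with hand-made tables, not NLTK’s internal logic; tag_with_backoff and both tables are hypothetical.

```python
def tag_with_backoff(words, bigram_table, unigram_table, default_tag='NN'):
    """Tag words left to right, falling back from bigram to unigram to default.

    bigram_table maps (previous_tag, word) -> tag; unigram_table maps word -> tag.
    """
    tagged = []
    prev_tag = None                      # no previous tag at the start of a sentence
    for word in words:
        tag = bigram_table.get((prev_tag, word))
        if tag is None:                  # context unseen: back off to the unigram table
            tag = unigram_table.get(word)
        if tag is None:                  # word unseen: back off to the default tag
            tag = default_tag
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

# Hand-made tables for illustration only.
bigram_table = {(None, 'the'): 'DT', ('DT', 'board'): 'NN'}
unigram_table = {'the': 'DT', 'board': 'NN', 'will': 'MD'}

print(tag_with_backoff(['the', 'board', 'will', 'meet'], bigram_table, unigram_table))
```

In this toy run the first two words are tagged by the bigram table, “will” by the unigram table, and “meet” by the default.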


In [18]:
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(treebank_train, backoff=t0)
t2 = nltk.BigramTagger(treebank_train, backoff=t1)
t2.evaluate(treebank_test)


0.46153846153846156

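The 0.46 shown here corresponds to 6 of the 13 tokens in the single test sentence, after training on a single sentence, so it mostly reflects how little data the tagger has seen.  Trained on 90% of the NLTK Treebank sample, the same backoff chain gives accuracy that is not bad, especially on only part of Penn Treebank!  We know that HMM and other feature-based techniques can raise the accuracy to between 95 and 98%.

But note that such performance is measured on a test set that is also taken from the Penn Treebank, where there may not be very many unknown words.  More modern text, or text on different topics than the Wall Street Journal, will be harder to tag because it contains more unknown words.  We know from the lectures that combining this with a regular expression tagger or a classifier tagger can improve performance on unknown words.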

Save your tagger t2 for the next section.


### Part 4:  Using a Tagger to Tag Text

### Applying the N-Gram tagger to text


Let’s use the N-gram tagger that we trained in the previous section to tag some example text.  We will define some example text, tokenize it, and apply the tagger.

In [19]:
text = "Three Calgarians have found a rather unusual way of leaving snow and ice behind. They set off this week on foot and by camels on a grueling trek across the burning Arabian desert."


In previous labs, we applied the function “nltk.word_tokenize” directly to multi-sentence text for simplicity.  But the taggers work on one sentence at a time, and word tokenization is designed around individual sentences, so it is better to first use the sentence splitter, aka sentence tokenizer, to produce a list of text strings for individual sentences.


In [20]:
textsplit = nltk.sent_tokenize(text)
textsplit

['Three Calgarians have found a rather unusual way of leaving snow and ice behind.',
 'They set off this week on foot and by camels on a grueling trek across the burning Arabian desert.']

After producing the list of sentence texts, apply the word tokenizer to each sentence.

In [21]:
tokentext = [nltk.word_tokenize(sent) for sent in textsplit]
tokentext

[['Three',
  'Calgarians',
  'have',
  'found',
  'a',
  'rather',
  'unusual',
  'way',
  'of',
  'leaving',
  'snow',
  'and',
  'ice',
  'behind',
  '.'],
 ['They',
  'set',
  'off',
  'this',
  'week',
  'on',
  'foot',
  'and',
  'by',
  'camels',
  'on',
  'a',
  'grueling',
  'trek',
  'across',
  'the',
  'burning',
  'Arabian',
  'desert',
  '.']]

Now apply the t2 bigram POS tagger to each sentence of tokens in the list.

In [22]:
taggedtext = [t2.tag(tokens) for tokens in tokentext]
taggedtext

[[('Three', 'NN'),
  ('Calgarians', 'NN'),
  ('have', 'NN'),
  ('found', 'NN'),
  ('a', 'DT'),
  ('rather', 'NN'),
  ('unusual', 'NN'),
  ('way', 'NN'),
  ('of', 'NN'),
  ('leaving', 'NN'),
  ('snow', 'NN'),
  ('and', 'NN'),
  ('ice', 'NN'),
  ('behind', 'NN'),
  ('.', '.')],
 [('They', 'NN'),
  ('set', 'NN'),
  ('off', 'NN'),
  ('this', 'NN'),
  ('week', 'NN'),
  ('on', 'NN'),
  ('foot', 'NN'),
  ('and', 'NN'),
  ('by', 'NN'),
  ('camels', 'NN'),
  ('on', 'NN'),
  ('a', 'DT'),
  ('grueling', 'NN'),
  ('trek', 'NN'),
  ('across', 'NN'),
  ('the', 'DT'),
  ('burning', 'NN'),
  ('Arabian', 'NN'),
  ('desert', 'NN'),
  ('.', '.')]]

We observe that almost every word in this text appears to be unknown to this tagger from the data it was trained on:  apart from “a”, “the” and the final period, everything falls through to the default tag NN.  That is expected here, since treebank_train holds only one sentence, but unknown words remain a problem even with much more training data.  Examples of this are “Calgarians” and “camels”, which are tagged as NN instead of the correct tags of NNPS and NNS, respectively.  This points out the benefit of adding sequence information, such as an HMM tagger would use, and lexical information, such as a Maximum Entropy tagger could use if you defined such features.  In the NLTK, another strategy would be to use a Regular Expression tagger as a backoff tagger that could take word features into account.
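
To give a feel for what a regular expression backoff could look like, here is a sketch using Python’s re module.  The rules are hypothetical and deliberately crude (for example, they would tag a sentence-initial “This” as NNPS); NLTK’s RegexpTagger takes a similar list of (pattern, tag) pairs, but this stand-in does not use it.

```python
import re

# Hypothetical (pattern, tag) rules, tried in order; the first match wins.
SUFFIX_RULES = [
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # numbers
    (r'^[A-Z].*s$', 'NNPS'),            # capitalized and ending in s
    (r'.*ing$', 'VBG'),                 # gerunds
    (r'.*s$', 'NNS'),                   # plural nouns
    (r'.*', 'NN'),                      # everything else
]

def guess_tag(word, rules=SUFFIX_RULES):
    """Return the tag of the first rule whose pattern matches the word."""
    for pattern, tag in rules:
        if re.match(pattern, word):
            return tag
    return None

for word in ['Calgarians', 'camels', 'burning', 'trek', '29']:
    print(word, guess_tag(word))
```

Rule order matters: the capitalized-plural rule has to come before the general plural rule, or “Calgarians” would be tagged NNS.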

### Stanford POS Tagger

One of the problems with training our own POS tagger is that we don’t have all the Penn Treebank data.  But NLTK also provides taggers that come pre-trained on a larger amount of data, and the standard nltk.pos_tag function uses one of them.  Although this lab calls it the Stanford POS tagger, nltk.pos_tag does not run the Stanford tagger, which is a separate Java program that NLTK can only call through a wrapper.  The taggers are described in the nltk.tag module:

http://www.nltk.org/_modules/nltk/tag.html

In older NLTK releases, nltk.pos_tag loaded a maximum entropy tagger from 'taggers/maxent_treebank_pos_tagger/english.pickle';  newer releases load an averaged perceptron tagger instead.

We use the standard nltk pos tagger on the same example text.


In [23]:
taggedtextStanford = [nltk.pos_tag(tokens) for tokens in tokentext]
taggedtextStanford


[[('Three', 'CD'),
  ('Calgarians', 'NNPS'),
  ('have', 'VBP'),
  ('found', 'VBN'),
  ('a', 'DT'),
  ('rather', 'RB'),
  ('unusual', 'JJ'),
  ('way', 'NN'),
  ('of', 'IN'),
  ('leaving', 'VBG'),
  ('snow', 'NN'),
  ('and', 'CC'),
  ('ice', 'NN'),
  ('behind', 'NN'),
  ('.', '.')],
 [('They', 'PRP'),
  ('set', 'VBD'),
  ('off', 'RP'),
  ('this', 'DT'),
  ('week', 'NN'),
  ('on', 'IN'),
  ('foot', 'NN'),
  ('and', 'CC'),
  ('by', 'IN'),
  ('camels', 'NNS'),
  ('on', 'IN'),
  ('a', 'DT'),
  ('grueling', 'NN'),
  ('trek', 'NN'),
  ('across', 'IN'),
  ('the', 'DT'),
  ('burning', 'NN'),
  ('Arabian', 'JJ'),
  ('desert', 'NN'),
  ('.', '.')]]

Since we first split our text into a list of sentences and then each sentence into a list of tokens, our tagged text has the structure of a list of lists.  Suppose that instead we just want one long list of tagged tokens.  We can use a list comprehension to define the new list as all of the tagged tokens in each of the sentences.

In [24]:
taggedtext_flat = [pair for sent in taggedtext for pair in sent]
print(taggedtext_flat)

taggedtextStanford_flat = [pair for sent in taggedtextStanford for pair in sent]
print(taggedtextStanford_flat)


[('Three', 'NN'), ('Calgarians', 'NN'), ('have', 'NN'), ('found', 'NN'), ('a', 'DT'), ('rather', 'NN'), ('unusual', 'NN'), ('way', 'NN'), ('of', 'NN'), ('leaving', 'NN'), ('snow', 'NN'), ('and', 'NN'), ('ice', 'NN'), ('behind', 'NN'), ('.', '.'), ('They', 'NN'), ('set', 'NN'), ('off', 'NN'), ('this', 'NN'), ('week', 'NN'), ('on', 'NN'), ('foot', 'NN'), ('and', 'NN'), ('by', 'NN'), ('camels', 'NN'), ('on', 'NN'), ('a', 'DT'), ('grueling', 'NN'), ('trek', 'NN'), ('across', 'NN'), ('the', 'DT'), ('burning', 'NN'), ('Arabian', 'NN'), ('desert', 'NN'), ('.', '.')]
[('Three', 'CD'), ('Calgarians', 'NNPS'), ('have', 'VBP'), ('found', 'VBN'), ('a', 'DT'), ('rather', 'RB'), ('unusual', 'JJ'), ('way', 'NN'), ('of', 'IN'), ('leaving', 'VBG'), ('snow', 'NN'), ('and', 'CC'), ('ice', 'NN'), ('behind', 'NN'), ('.', '.'), ('They', 'PRP'), ('set', 'VBD'), ('off', 'RP'), ('this', 'DT'), ('week', 'NN'), ('on', 'IN'), ('foot', 'NN'), ('and', 'CC'), ('by', 'IN'), ('camels', 'NNS'), ('on', 'IN'), ('a', 'DT'
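
The flattening comprehension can be checked on a small hand-made list.  The sketch below compares it with the equivalent nested loops and with itertools.chain.from_iterable from the standard library; the list nested is made up for illustration.

```python
from itertools import chain

# A small nested list shaped like taggedtext: one inner list per sentence.
nested = [[('a', 'DT'), ('trek', 'NN')], [('They', 'PRP'), ('set', 'VBD')]]

# The comprehension reads left to right like two nested for loops.
flat = [pair for sent in nested for pair in sent]

# Equivalent explicit loops.
flat_loops = []
for sent in nested:
    for pair in sent:
        flat_loops.append(pair)

# Equivalent standard-library call.
flat_chain = list(chain.from_iterable(nested))

print(flat == flat_loops == flat_chain)
print(flat)
```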