# Categorizing and Tagging Words

In [1]:
import nltk

## Using a Tagger

In [3]:
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

In [3]:
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]

Lexical categories like "noun" and part-of-speech tags like NN seem to have their uses, but the details will be obscure to many readers. You might wonder what justification there is for introducing this extra level of information. Many of these categories arise from a superficial analysis of the distribution of words in text. Consider the following analysis involving *woman* (a noun), *bought* (a verb), *over* (a preposition), and *the* (a determiner). The text.similar() method takes a word w, finds all contexts w1 w w2 in which it occurs, then finds all words w' that appear in the same contexts, i.e. w1 w' w2.
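To make the idea of "same context" concrete, here is a minimal sketch that uses only the standard library. It is a simplification for illustration, not NLTK's implementation (which may score and rank candidates differently); `similar_words` and `toy_tokens` are hypothetical names introduced here.

```python
from collections import Counter, defaultdict

def similar_words(tokens, target, top_n=5):
    """Rank words by how many (left, right) contexts they share with `target`."""
    # Map each word to the set of (previous word, next word) contexts it occurs in.
    contexts = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((left, right))

    target_contexts = contexts[target]
    scores = Counter()
    for word, ctxs in contexts.items():
        if word == target:
            continue
        shared = len(ctxs & target_contexts)
        if shared:
            scores[word] = shared
    return [word for word, _ in scores.most_common(top_n)]

# Toy corpus (an assumption for illustration, not taken from the Brown corpus).
toy_tokens = "the man ate . the woman ate . the boy slept . the woman slept .".split()
print(similar_words(toy_tokens, "woman"))
```

In the toy corpus, *man* and *boy* each share one context with *woman* ("the _ ate" and "the _ slept" respectively), so both are returned.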

## Text Similar

In [7]:
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())

In [9]:
# similar() prints its results itself and returns None, so no print() is needed
text.similar('woman')

man time day year car moment world house family child country boy
state job place way war girl work word


In [10]:
text.similar('bought')

made said done put had seen found given left heard was been brought
set got that took in told felt


In [11]:
text.similar('over')

in on to of and for with from at by that into as up out down through
is all about


In [12]:
text.similar('the')

a his this their its her an that our any all one these my in your no
some other and


## Tagged Corpora
* Representing Tagged Tokens
* Reading Tagged Corpora
* A Universal Part-of-Speech Tag-set
* Nouns
* Verbs
* Adjectives and Adverbs
* Unsimplified Tags
* Exploring Tagged Corpora

## Representing Tagged Tokens
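By convention in NLTK, a tagged token is represented as a tuple consisting of the token and its tag. The function str2tuple() builds such a tuple from the standard string representation word/TAG.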

In [16]:
tagged_token = nltk.tag.str2tuple('fly/NN')
print(tagged_token)
print(tagged_token[0])
print(tagged_token[1])

('fly', 'NN')
fly
NN


In [17]:
sent = ''' The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT interest/NN of/IN both/ABX governments/NNS ''/'' ./.'''
print([nltk.tag.str2tuple(t) for t in sent.split()])

[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ('of', 'IN'), ('other', 'AP'), ('topics', 'NNS'), (',', ','), ('AMONG', 'IN'), ('them', 'PPO'), ('the', 'AT'), ('Atlanta', 'NP'), ('and', 'CC'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('purchasing', 'VBG'), ('departments', 'NNS'), ('which', 'WDT'), ('it', 'PPS'), ('said', 'VBD'), ('``', '``'), ('ARE', 'BER'), ('well', 'QL'), ('operated', 'VBN'), ('and', 'CC'), ('follow', 'VB'), ('generally', 'RB'), ('accepted', 'VBN'), ('practices', 'NNS'), ('which', 'WDT'), ('inure', 'VB'), ('to', 'IN'), ('the', 'AT'), ('best', 'JJT'), ('interest', 'NN'), ('of', 'IN'), ('both', 'ABX'), ('governments', 'NNS'), ("''", "''"), ('.', '.')]
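The conversion shown above can be sketched in plain Python. The helper below is hypothetical (the name `parse_tagged_token` is introduced here) and only mirrors the behaviour visible in the output, namely splitting on the last separator and upper-casing the tag; it is not NLTK's source code.

```python
def parse_tagged_token(text, sep="/"):
    """Split 'word/TAG' on the last separator and upper-case the tag."""
    word, found_sep, tag = text.rpartition(sep)
    if not found_sep:
        # No separator present: treat the whole string as an untagged word.
        return (text, None)
    return (word, tag.upper())

print(parse_tagged_token("fly/NN"))        # ('fly', 'NN')
print(parse_tagged_token("Fulton/NP-tl"))  # ('Fulton', 'NP-TL')
print(parse_tagged_token("1/2/CD"))        # ('1/2', 'CD')
```

Splitting on the last separator rather than the first keeps tokens that themselves contain a slash intact.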


## Reading Tagged Corpora
Several of the corpora included with NLTK have been tagged for part of speech. Passing tagset='universal' maps each corpus's own tags onto the universal tagset.

In [23]:
print(nltk.corpus.brown.tagged_words())
print(nltk.corpus.brown.tagged_words(tagset='universal'))

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
[('The', 'DET'), ('Fulton', 'NOUN'), ...]


In [24]:
print(nltk.corpus.nps_chat.tagged_words())
print(nltk.corpus.nps_chat.tagged_words(tagset = 'universal'))

[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
[('now', 'ADV'), ('im', 'PRON'), ('left', 'VERB'), ...]


In [25]:
print(nltk.corpus.conll2000.tagged_words())
print(nltk.corpus.conll2000.tagged_words(tagset = 'universal'))

[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]
[('Confidence', 'NOUN'), ('in', 'ADP'), ('the', 'DET'), ...]


In [26]:
print(nltk.corpus.treebank.tagged_words())
print(nltk.corpus.treebank.tagged_words(tagset = 'universal'))

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]
[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]


## A Universal Part-of-Speech Tag-set

### Tag frequency distribution

In [30]:
brown_news_tagged = nltk.corpus.brown.tagged_words(categories='news', tagset='universal')
# tag frequency distribution
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()

[('NOUN', 30654),
 ('VERB', 14399),
 ('ADP', 12355),
 ('.', 11928),
 ('DET', 11389),
 ('ADJ', 6706),
 ('ADV', 3349),
 ('CONJ', 2717),
 ('PRON', 2535),
 ('PRT', 2264),
 ('NUM', 2166),
 ('X', 92)]

## Nouns

In the universal tagset, both common nouns like *book* and proper nouns like *Scotland* are tagged NOUN.

Let's inspect some tagged text to see what parts of speech occur before a noun, with the most frequent ones first. To begin with, we construct a list of bigrams whose members are themselves word-tag pairs, such as (('The', 'DET'), ('Fulton', 'NOUN')) and (('Fulton', 'NOUN'), ('County', 'NOUN')). Then we construct a FreqDist from the tag of the first member of every bigram whose second member is a noun.

In [37]:
# get bigrams
word_tag_pairs = nltk.bigrams(brown_news_tagged)
# what tags precede nouns
noun_preceders = [a[1] for (a,b) in word_tag_pairs if b[1] == 'NOUN']
fdist = nltk.FreqDist(noun_preceders)
print([tag for (tag, _) in fdist.most_common()])
print(fdist.most_common(20))

['NOUN', 'DET', 'ADJ', 'ADP', '.', 'VERB', 'CONJ', 'NUM', 'ADV', 'PRT', 'PRON', 'X']
[('NOUN', 7959), ('DET', 7373), ('ADJ', 4761), ('ADP', 3781), ('.', 2796), ('VERB', 1842), ('CONJ', 938), ('NUM', 894), ('ADV', 186), ('PRT', 94), ('PRON', 19), ('X', 11)]
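The same counting pattern can be tried without downloading a corpus. The sketch below uses a small hand-tagged sentence (an assumption for illustration, not Brown corpus data) and the standard library's Counter in place of FreqDist.

```python
from collections import Counter

# Hand-tagged toy sentence using universal-style tags.
toy_tagged = [
    ("The", "DET"), ("grand", "ADJ"), ("jury", "NOUN"),
    ("praised", "VERB"), ("the", "DET"), ("city", "NOUN"),
    ("council", "NOUN"), ("and", "CONJ"), ("the", "DET"),
    ("mayor", "NOUN"), (".", "."),
]

# Pair each token with its successor, the same shape that nltk.bigrams() yields.
pairs = zip(toy_tagged, toy_tagged[1:])

# Count the tag of the first member whenever the second member is a noun.
preceder_counts = Counter(
    prev_tag for (_, prev_tag), (_, tag) in pairs if tag == "NOUN"
)
print(preceder_counts.most_common())
```

Here determiners precede nouns twice, while an adjective and another noun each precede a noun once.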


## Verbs
What are the most common verbs in news text?

In [45]:
# sort all the tagged verbs by frequency
wsj = nltk.corpus.treebank.tagged_words(tagset='universal')
word_tag_fd = nltk.FreqDist(wsj)
# words in frequency distribution whose tag is VERB
print([wt[0] for (wt,_) in word_tag_fd.most_common() if wt[1] == 'VERB'][:30])

['is', 'said', 'was', 'are', 'be', 'has', 'have', 'will', 'says', 'would', 'were', 'had', 'been', 'could', "'s", 'can', 'do', 'say', 'make', 'may', 'did', 'rose', 'made', 'does', 'expected', 'buy', 'take', 'get', 'might', 'sell']


Note that the items being counted in the frequency distribution are word-tag pairs. Since words and tags are paired, we can treat the word as a condition and the tag as an event, and initialize a conditional frequency distribution with a list of condition-event pairs. This lets us see a frequency-ordered list of tags given a word:

In [47]:
# conditional tag frequency distribution
cfd1 = nltk.ConditionalFreqDist(wsj)

In [None]:
# tags observed for the word 'yield', most frequent first
print(cfd1['yield'].most_common())

### The word 'yield' is more commonly used as a verb than a noun

In [58]:
# tags observed for the word 'cut', most frequent first
print(cfd1['cut'].most_common())

[('VERB', 25), ('NOUN', 3)]


### The word 'cut' is more commonly used as a verb than a noun by far.
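As a rough mental model, a conditional frequency distribution behaves like a dictionary that maps each condition to its own counter of events. The sketch below imitates that with the standard library on toy data; the names `toy_tagged` and `tags_by_word` are assumptions introduced here, and the counts are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy (word, tag) pairs; not taken from the treebank corpus.
toy_tagged = [
    ("cut", "VERB"), ("the", "DET"), ("cut", "NOUN"),
    ("cut", "VERB"), ("the", "DET"), ("yield", "NOUN"),
]

# word -> Counter of tags: each word is a condition, each tag an event.
tags_by_word = defaultdict(Counter)
for word, tag in toy_tagged:
    tags_by_word[word][tag] += 1

print(tags_by_word["cut"].most_common())  # [('VERB', 2), ('NOUN', 1)]
```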

We can reverse the order of the pairs, so that the tags are the conditions, and the words are the events. Now we can see likely words for a given tag. We will do this for the WSJ tagset rather than the universal tagset:

In [52]:
# given a tag (e.g. VBN verb past participle) what are likely words
wsj = nltk.corpus.treebank.tagged_words()
cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
print(list(cfd2['VBN'])[:30])

['named', 'used', 'caused', 'exposed', 'reported', 'replaced', 'sold', 'died', 'expected', 'diagnosed', 'studied', 'industrialized', 'owned', 'found', 'classified', 'rejected', 'outlawed', 'imported', 'tracked', 'thought', 'considered', 'elected', 'based', 'lifted', 'ensnarled', 'voted', 'been', 'held', 'banned', 'renovated']


## Adjectives and Adverbs
Two other important word classes are adjectives and adverbs. Adjectives describe nouns, and can be used as modifiers (e.g. large in the large pizza), or in predicates (e.g. the pizza is large). English adjectives can have internal structure (e.g. fall+ing in the falling stocks). Adverbs modify verbs to specify the time, manner, place or direction of the event described by the verb (e.g. quickly in the stocks fell quickly). Adverbs may also modify adjectives (e.g. really in Mary's teacher was really nice).

English has several categories of closed class words in addition to prepositions, such as articles (also often called determiners) (e.g., the, a), modals (e.g., should, may), and personal pronouns (e.g., she, they). Each dictionary and grammar classifies these words differently.

## Unsimplified Tags

Let's find the most frequent nouns for each noun tag in the full (unsimplified) Brown tagset:

In [63]:
# function to get the top 5 most frequent words for a tag
# param tag_prefix: string of the first couple characters in tag-set e.g. 'NN'
# param tagged_text: tagged words from a corpus
# returns a dictionary of the actual tag with top 5 word frequencies
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())

In [64]:
tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories = 'news'))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])

NN [('year', 137), ('time', 97), ('state', 88), ('week', 85), ('man', 72)]
NN$ [("year's", 13), ("world's", 8), ("state's", 7), ("nation's", 6), ("city's", 6)]
NN$-HL [("Golf's", 1), ("Navy's", 1)]
NN$-TL [("President's", 11), ("Administration's", 3), ("Army's", 3), ("League's", 3), ("University's", 3)]
NN-HL [('sp.', 2), ('problem', 2), ('Question', 2), ('cut', 2), ('party', 2)]
NN-NC [('ova', 1), ('eva', 1), ('aya', 1)]
NN-TL [('President', 88), ('House', 68), ('State', 59), ('University', 42), ('City', 41)]
NN-TL-HL [('Fort', 2), ('Mayor', 1), ('Commissioner', 1), ('City', 1), ('Oak', 1)]
NNS [('years', 101), ('members', 69), ('people', 52), ('sales', 51), ('men', 46)]
NNS$ [("children's", 7), ("women's", 5), ("men's", 3), ("janitors'", 3), ("taxpayers'", 2)]
NNS$-HL [("Dealers'", 1), ("Idols'", 1)]
NNS$-TL [("Women's", 4), ("States'", 3), ("Giants'", 2), ("Princes'", 1), ("Bombers'", 1)]
NNS-HL [('Wards', 1), ('deputies', 1), ('bonds', 1), ('aspects', 1), ('Decisions', 1)]
NNS-TL [

## Exploring Tagged Corpora

Suppose we're studying the word *often* and want to see how it is used in text. We could ask to see the words that follow *often*:

In [67]:
brown_learned_text = nltk.corpus.brown.words(categories = 'learned')
print(sorted(set(b for (a,b) in nltk.bigrams(brown_learned_text) if a == 'often')))

[',', '.', 'accomplished', 'analytically', 'appear', 'apt', 'associated', 'assuming', 'became', 'become', 'been', 'began', 'call', 'called', 'carefully', 'chose', 'classified', 'colorful', 'composed', 'contain', 'differed', 'difficult', 'encountered', 'enough', 'equate', 'extremely', 'found', 'happens', 'have', 'ignored', 'in', 'involved', 'more', 'needed', 'nightly', 'observed', 'of', 'on', 'out', 'quite', 'represent', 'responsible', 'revamped', 'seclude', 'set', 'shortened', 'sing', 'sounded', 'stated', 'still', 'sung', 'supported', 'than', 'to', 'when', 'work']


However, it's probably more instructive to use the tagged_words() method to look at the part-of-speech tag of the following words:

In [68]:
brown_learned_tagged = nltk.corpus.brown.tagged_words(categories = 'learned', tagset = 'universal')
tags = [b[1] for (a, b) in nltk.bigrams(brown_learned_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()

VERB  ADV  ADP  ADJ    .  PRT 
  37    8    7    6    4    2 


Verbs are by far the most common part of speech following *often*.

## Words with Ambiguous POS Tags

Finally, let's look for words that are highly ambiguous as to their part of speech tag. Understanding why such words are tagged as they are in each context can help us clarify the distinctions between the tags.

In [87]:
brown_news_tagged = nltk.corpus.brown.tagged_words(categories='news', tagset='universal')
data = nltk.ConditionalFreqDist((word.lower(), tag) for (word, tag) in brown_news_tagged)
for word in sorted(data.conditions()):
    if len(data[word]) > 3:
        tags = [tag for (tag, _) in data[word].most_common()]
        print(word, ' '.join(tags))

best ADJ ADV VERB NOUN
close ADV ADJ VERB NOUN
open ADJ VERB NOUN ADV
present ADJ ADV NOUN VERB
that ADP DET PRON ADV
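The filtering step can be sketched on toy data with a plain dictionary of sets. The sentence fragments and the threshold of more than one tag are assumptions chosen to suit the tiny example; the corpus code above uses a threshold of more than three.

```python
from collections import defaultdict

# Toy (word, tag) pairs in which "close" appears with three different tags.
toy_tagged = [
    ("Close", "VERB"), ("the", "DET"), ("door", "NOUN"),
    ("a", "DET"), ("close", "ADJ"), ("call", "NOUN"),
    ("stay", "VERB"), ("close", "ADV"),
]

# Collect the distinct tags observed for each lower-cased word.
tags_seen = defaultdict(set)
for word, tag in toy_tagged:
    tags_seen[word.lower()].add(tag)

# Keep only words seen with more than one tag.
ambiguous = {word: sorted(tags) for word, tags in tags_seen.items() if len(tags) > 1}
print(ambiguous)  # {'close': ['ADJ', 'ADV', 'VERB']}
```

Lower-casing before grouping matters here: without it, "Close" and "close" would be counted as separate words.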


## Mapping Words to Properties Using Python Dictionaries

In [93]:
pos = {}
pos['colorless'] = 'ADJ'
pos['ideas'] = 'N'
pos['sleep'] = 'V'
pos['furiously'] = 'ADV'
for word in sorted(pos):
    print(word + ":", pos[word])

colorless: ADJ
furiously: ADV
ideas: N
sleep: V


In [103]:
# index words by their last two letters
from collections import defaultdict

last_letters = defaultdict(list)
words = nltk.corpus.words.words('en')
for word in words:
    key = word[-2:] # last two letters of the word
    last_letters[key].append(word) # group the word with others sharing this ending

print(last_letters['ly'][:30])

['abactinally', 'abandonedly', 'abasedly', 'abashedly', 'abashlessly', 'abbreviately', 'abdominally', 'abhorrently', 'abidingly', 'abiogenetically', 'abiologically', 'abjectly', 'ableptically', 'ably', 'abnormally', 'abominably', 'aborally', 'aboriginally', 'abortively', 'aboundingly', 'abridgedly', 'abruptedly', 'abruptly', 'abscondedly', 'absently', 'absentmindedly', 'absolutely', 'absolutistically', 'absorbedly', 'absorbingly']


## Anagrams

In [102]:
# group anagrams: words made of the same letters share the same sorted-letter key
anagrams = nltk.Index((''.join(sorted(w)), w) for w in words)
anagrams['aeilnrt']

['entrail', 'latrine', 'ratline', 'reliant', 'retinal', 'trenail']
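The sorted-letter key is what makes this work: every anagram of a word sorts to the same string. The sketch below reproduces the grouping with a defaultdict(list) on a toy word list, which is an assumption for illustration and not the NLTK words corpus.

```python
from collections import defaultdict

toy_words = ["listen", "silent", "enlist", "google", "tinsel", "banana"]

# Words with identical sorted letters land in the same bucket.
anagram_index = defaultdict(list)
for w in toy_words:
    signature = "".join(sorted(w))
    anagram_index[signature].append(w)

print(anagram_index["eilnst"])  # ['listen', 'silent', 'enlist', 'tinsel']
```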

## Inverting a Dictionary
Dictionaries support efficient lookup, so long as you want to get the value for any key. If d is a dictionary and k is a key, we type d[k] and immediately obtain the value. Finding a key given a value is slower and more cumbersome:

In [106]:
counts = defaultdict(int)
for word in nltk.corpus.gutenberg.words('milton-paradise.txt'):
    counts[word] += 1

print([key for (key, value) in counts.items() if value == 32])

['mortal', 'Against', 'Him', 'There', 'brought', 'King', 'virtue', 'every', 'been', 'thine']


In [107]:
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'} # dictionary of word and POS tag
pos2 = dict((value, key) for (key, value) in pos.items()) # inverted dictionary
pos2

{'ADJ': 'colorless', 'N': 'ideas', 'V': 'sleep', 'ADV': 'furiously'}
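This simple inversion only works when every value is unique. If two keys share a value, the later one silently overwrites the earlier one. A minimal sketch of one way around this, collecting keys in lists with defaultdict, follows; the extra entry 'cats' is an assumption added to create a duplicate value.

```python
from collections import defaultdict

pos = {"colorless": "ADJ", "ideas": "N", "sleep": "V", "furiously": "ADV", "cats": "N"}

# A plain inversion keeps only one word per tag; later entries overwrite earlier ones.
lossy = {tag: word for word, tag in pos.items()}

# Collecting words in a list keeps every word for each tag.
words_by_tag = defaultdict(list)
for word, tag in pos.items():
    words_by_tag[tag].append(word)

print(lossy["N"])         # cats
print(words_by_tag["N"])  # ['ideas', 'cats']
```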