## Categorizing and Tagging Words

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset. Our emphasis in this chapter is on exploiting tags and on tagging text automatically.

In [24]:
import nltk
%matplotlib inline

In [2]:
text = nltk.word_tokenize("And now for something completely different")

In [3]:
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

In this output, "and" is tagged CC, a coordinating conjunction; "now" and "completely" are RB, adverbs; "for" is IN, a preposition; "something" is NN, a noun; and "different" is JJ, an adjective.

In [4]:
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")

In [5]:
nltk.pos_tag(text)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]

### Representing Tagged Tokens

In [6]:
tagged_token = nltk.tag.str2tuple('fly/NN')

In [7]:
tagged_token

('fly', 'NN')

In [8]:
tagged_token[0]

'fly'
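
To make the conversion explicit, here is a minimal sketch of what a function like `str2tuple` does, written in plain Python. The name `split_tagged_token` is hypothetical, and its behaviour (split on the last separator, upper-case the tag, return `None` when no tag is present) is an assumption modelled on the outputs shown in this section, not a copy of NLTK's source.

```python
def split_tagged_token(token, sep="/"):
    """Split 'word/tag' into (word, TAG); a hypothetical stand-in for str2tuple."""
    # Split on the LAST separator so words that themselves contain '/' stay intact.
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator at all: treat the whole string as an untagged word.
        return (token, None)
    # Upper-case the tag so 'NP-tl' and 'NP-TL' are treated as the same tag.
    return (word, tag.upper())


pairs = [split_tagged_token(t) for t in "Fulton/NP-tl County/NN-tl ,/, 1/2/CD".split()]
```

Splitting on the last separator matters for tokens such as `,/,` or `1/2/CD`, where the word part may itself contain a slash.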

In [11]:
sent = '''
The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
interest/NN of/IN both/ABX governments/NNS ''/'' ./.
'''

In [12]:
[nltk.tag.str2tuple(t) for t in sent.split()]

[('The', 'AT'),
 ('grand', 'JJ'),
 ('jury', 'NN'),
 ('commented', 'VBD'),
 ('on', 'IN'),
 ('a', 'AT'),
 ('number', 'NN'),
 ('of', 'IN'),
 ('other', 'AP'),
 ('topics', 'NNS'),
 (',', ','),
 ('AMONG', 'IN'),
 ('them', 'PPO'),
 ('the', 'AT'),
 ('Atlanta', 'NP'),
 ('and', 'CC'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('purchasing', 'VBG'),
 ('departments', 'NNS'),
 ('which', 'WDT'),
 ('it', 'PPS'),
 ('said', 'VBD'),
 ('``', '``'),
 ('ARE', 'BER'),
 ('well', 'QL'),
 ('operated', 'VBN'),
 ('and', 'CC'),
 ('follow', 'VB'),
 ('generally', 'RB'),
 ('accepted', 'VBN'),
 ('practices', 'NNS'),
 ('which', 'WDT'),
 ('inure', 'VB'),
 ('to', 'IN'),
 ('the', 'AT'),
 ('best', 'JJT'),
 ('interest', 'NN'),
 ('of', 'IN'),
 ('both', 'ABX'),
 ('governments', 'NNS'),
 ("''", "''"),
 ('.', '.')]

In [13]:
nltk.corpus.brown.tagged_words()

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

In [16]:
from nltk.corpus import brown

In [18]:
brown_news_tagged = brown.tagged_words(categories='news')

In [19]:
brown_news_tagged

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

In [20]:
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)

In [21]:
tag_fd.keys()

dict_keys(['AT', 'NP-TL', 'NN-TL', 'JJ-TL', 'VBD', 'NR', 'NN', 'IN', 'NP$', 'JJ', '``', "''", 'CS', 'DTI', 'NNS', '.', 'RBR', ',', 'WDT', 'HVD', 'VBZ', 'CC', 'IN-TL', 'BEDZ', 'VBN', 'NP', 'BEN', 'TO', 'VB', 'RB', 'DT', 'PPS', 'DOD', 'AP', 'BER', 'HV', 'DTS', 'VBG', 'PPO', 'QL', 'JJT', 'ABX', 'NN-HL', 'VBN-HL', 'WRB', 'CD', 'MD', 'BE', 'JJR', 'VBG-TL', 'BEZ', 'NN$-TL', 'HVZ', 'ABN', 'PN', 'PPSS', 'PP$', 'DO', 'NN$', 'NNS-HL', 'WPS', '*', 'EX', 'VB-HL', ':', '(', ')', 'NNS-TL', 'NPS', 'JJS', 'RP', '--', 'BED', 'OD', 'BEG', 'AT-HL', 'VBG-HL', 'AT-TL', 'PPL', 'DOZ', 'NP-HL', 'NR$', 'DOD*', 'BEDZ*', ',-HL', 'CC-TL', 'MD*', 'NNS$', 'PPSS+BER', "'", 'PPSS+BEM', 'CD-TL', 'RBT', '(-HL', ')-HL', 'MD-HL', 'VBZ-HL', 'IN-HL', 'JJ-HL', 'PPLS', 'CD-HL', 'WPO', 'JJS-TL', 'ABL', 'BER-HL', 'PPS+HVZ', 'VBD-HL', 'RP-HL', 'MD*-HL', 'AP-HL', 'CS-HL', 'DT$', 'HVN', 'FW-IN', 'FW-DT', 'VBN-TL', 'NR-TL', 'NNS$-TL', 'FW-NN', 'HVG', 'DTX', 'OD-TL', 'BEM', 'RB-HL', 'PPSS+MD', 'NPS-HL', 'NPS$', 'WP$', 'NN-TL-HL', '
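
Since `FreqDist` in NLTK 3 subclasses `collections.Counter`, the most frequent tags can be read off with `most_common` instead of scanning the full key list. The sketch below shows the same idea with a plain `Counter` on a small sample copied by hand from the Brown-tagged sentence above; the variable names are illustrative.

```python
from collections import Counter

# First few (word, tag) pairs of the Brown-tagged sentence shown earlier.
sample = [
    ("The", "AT"), ("grand", "JJ"), ("jury", "NN"), ("commented", "VBD"),
    ("on", "IN"), ("a", "AT"), ("number", "NN"), ("of", "IN"),
    ("other", "AP"), ("topics", "NNS"),
]

# Count tags only, mirroring FreqDist(tag for (word, tag) in ...).
tag_counts = Counter(tag for (word, tag) in sample)

# most_common(n) returns (tag, count) pairs, highest count first.
top_tags = tag_counts.most_common(3)
```

On the full corpus, the equivalent call would be `tag_fd.most_common(3)`.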

In [28]:
word_tag_pairs = nltk.bigrams(brown_news_tagged)

In [29]:
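list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N'))

[]

The list is empty because brown_news_tagged holds the unsimplified Brown tags, and no token carries the tag 'N'. Comparing against a Brown noun tag such as 'NN', or loading the corpus with tagset='universal' and comparing against 'NOUN', returns the tags that precede nouns. Note also that nltk.bigrams returns a generator, so word_tag_pairs must be rebuilt before it is iterated a second time.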
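
The bigram idea can be checked on a small sample without loading the corpus. This sketch pairs each token with its successor and counts the tags that come immediately before a noun, matching on the Brown tag prefix `NN` instead of the tag `N`. The sample and the helper name `tags_before` are illustrative, not part of NLTK.

```python
from collections import Counter

# A few (word, tag) pairs copied from the Brown-tagged sentence shown earlier.
sample = [
    ("The", "AT"), ("grand", "JJ"), ("jury", "NN"), ("commented", "VBD"),
    ("on", "IN"), ("a", "AT"), ("number", "NN"), ("of", "IN"),
    ("other", "AP"), ("topics", "NNS"),
]


def tags_before(tagged_words, target_prefix):
    """Count the tags of tokens that immediately precede a tag starting with target_prefix."""
    # zip(seq, seq[1:]) yields consecutive pairs, like nltk.bigrams applied to a list.
    pairs = zip(tagged_words, tagged_words[1:])
    return Counter(a[1] for (a, b) in pairs if b[1].startswith(target_prefix))


noun_preceders = tags_before(sample, "NN")
```

Matching on a prefix catches `NN`, `NNS`, and variants such as `NN-TL` in one pass, which an exact comparison against a single tag would miss.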

In [30]:
wsj = nltk.corpus.treebank.tagged_words()

In [31]:
word_tag_fd = nltk.FreqDist(wsj)

In [32]:
[word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]

['join/VB',
 'is/VBZ',
 'publishing/VBG',
 'was/VBD',
 'named/VBN',
 'used/VBN',
 'make/VB',
 'has/VBZ',
 'caused/VBN',
 'exposed/VBN',
 'reported/VBD',
 'enters/VBZ',
 'causing/VBG',
 'show/VBP',
 'said/VBD',
 'makes/VBZ',
 'stopped/VBD',
 'using/VBG',
 'were/VBD',
 'reported/VBN',
 'appear/VBP',
 'bring/VB',
 "'re/VBP",
 'talking/VBG',
 'heard/VBD',
 'having/VBG',
 'studied/VBD',
 'have/VBP',
 'are/VBP',
 'led/VBD',
 'making/VBG',
 'replaced/VBN',
 'sold/VBN',
 'worked/VBD',
 'died/VBN',
 'expected/VBN',
 'surviving/VBG',
 'including/VBG',
 'diagnosed/VBN',
 'study/VBP',
 'appears/VBZ',
 'be/VB',
 'studied/VBN',
 'industrialized/VBN',
 'owned/VBN',
 'support/VB',
 'argue/VBP',
 'regulate/VB',
 'found/VBN',
 'does/VBZ',
 'have/VB',
 'classified/VBN',
 'according/VBG',
 'rejected/VBN',
 'explained/VBD',
 'imposed/VBD',
 'remaining/VBG',
 'outlawed/VBN',
 'made/VBD',
 'dumped/VBD',
 'imported/VBN',
 'poured/VBD',
 'mixed/VBD',
 'described/VBD',
 'hung/VBD',
 'ventilated/VBD',
 "'s/VBZ",

In [33]:
cfd1 = nltk.ConditionalFreqDist(wsj)

In [34]:
cfd1['yield'].keys()

dict_keys(['NN', 'VB'])

In [36]:
cfd1['cut'].keys()

dict_keys(['VBD', 'VB', 'VBN', 'NN'])

In [37]:
cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)

In [45]:
cfd2['NNP'].keys()

dict_keys(['Pierre', 'Vinken', 'Nov.', 'Mr.', 'Elsevier', 'N.V.', 'Dutch', 'Rudolph', 'Agnew', 'Consolidated', 'Gold', 'Fields', 'PLC', 'Kent', 'Lorillard', 'Inc.', 'Loews', 'Corp.', 'New', 'England', 'Journal', 'Medicine', 'James', 'A.', 'Talcott', 'Boston', 'Dana-Farber', 'Cancer', 'Institute', 'Dr.', 'National', 'Harvard', 'University', 'West', 'Groton', 'Mass.', 'Hollingsworth', 'Vose', 'Co.', 'U.S.', 'Brooke', 'T.', 'Mossman', 'Vermont', 'College', 'July', 'Environmental', 'Protection', 'Agency', 'Darrell', 'Phillips', 'IBC', 'Money', 'Fund', 'Report', 'Tuesday', 'August', 'Donoghue', 'Brenda', 'Malizia', 'Negus', 'Treasury', 'Monday', 'Dreyfus', 'World-Wide', 'Dollar', 'J.P.', 'Bolduc', 'W.R.', 'Grace', 'Terrence', 'D.', 'Daniels', 'Energy', 'Pacific', 'First', 'Financial', 'Royal', 'Trustco', 'Ltd.', 'Toronto', 'McDermott', 'International', 'Babcock', 'Wilcox', 'Bailey', 'Controls', 'Operations', 'Finmeccanica', 'S.p', 'Wickliffe', 'Ohio', 'Congress', 'House', 'Senate', 'Clark',

In [46]:
cfd2.keys()

dict_keys(['NNP', ',', 'CD', 'NNS', 'JJ', 'MD', 'VB', 'DT', 'NN', 'IN', '.', 'VBZ', 'VBG', 'CC', 'VBD', 'VBN', '-NONE-', 'RB', 'TO', 'PRP', 'RBR', 'WDT', 'VBP', 'RP', 'PRP$', 'JJS', 'POS', '``', 'EX', "''", 'WP', ':', 'JJR', 'WRB', '$', 'NNPS', 'WP$', '-LRB-', '-RRB-', 'PDT', 'RBS', 'FW', 'UH', 'SYM', 'LS', '#', 'VN', 'V', 'N'])
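
A conditional frequency distribution can be thought of as a dictionary of counters, one counter per condition. The sketch below builds both directions (word to tags, and tag to words) with the standard library, using the tagged "refuse permit" sentence from the start of this section. The names `word_to_tags`, `tag_to_words`, and `ambiguous` are illustrative and are not NLTK objects.

```python
from collections import Counter, defaultdict

# Tagger output for "They refuse to permit us to obtain the refuse permit".
tagged = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
    ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"),
    ("refuse", "NN"), ("permit", "NN"),
]

word_to_tags = defaultdict(Counter)  # condition = word, like cfd1
tag_to_words = defaultdict(Counter)  # condition = tag, like cfd2
for word, tag in tagged:
    word_to_tags[word][tag] += 1
    tag_to_words[tag][word] += 1

# Words that were seen with more than one tag are ambiguous in this sample.
ambiguous = sorted(w for w, tags in word_to_tags.items() if len(tags) > 1)
```

Swapping the order of the pair is all that changes between the two views: the first element is the condition, and the second is the event being counted.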

### Unsimplified Tags

In [49]:
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text if tag.startswith(tag_prefix))
    # keys() returns an unsliceable view in Python 3; most_common(5) gives the five most frequent words per tag.
    return dict((tag, [word for (word, _) in cfd[tag].most_common(5)]) for tag in cfd.conditions())

In [50]:
tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))