Back in elementary school you learnt the difference between nouns, verbs, adjectives, and adverbs. These "word classes" are not just the idle invention of grammarians, but are useful categories for many language processing tasks. As we will see, they arise from simple analysis of the distribution of words in text. The goal of this chapter is to answer the following questions:

    1. What are lexical categories and how are they used in natural language processing?
    2. What is a good Python data structure for storing words and their categories?
    3. How can we automatically tag each word of a text with its word class?

Along the way, we'll cover some fundamental techniques in NLP, including sequence labeling, n-gram models, backoff, and evaluation. These techniques are useful in many areas, and tagging gives us a simple context in which to present them. We will also see how tagging is the second step in the typical NLP pipeline, following tokenization.

The process of classifying words into their parts of speech and labeling them accordingly is known as **part-of-speech tagging**, **POS-tagging**, or simply **tagging**. Parts of speech are also known as **word classes** or **lexical categories**. The collection of tags used for a particular task is known as a **tagset**. Our emphasis in this chapter is on exploiting tags, and tagging text automatically.

## Using a Tagger
A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of speech tag to each word (don't forget to import nltk):

In [18]:
import nltk, re

text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

In [6]:
# Look up the documentation for a given tag
nltk.help.upenn_tagset('NN')

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...


Let's look at another example, this time including some homonyms:

In [4]:
text = nltk.word_tokenize("They refuse to permit us obtain the refuse permit")
nltk.pos_tag(text)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]
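
To make the ambiguity explicit, we can collect the set of tags each word received. The following is a minimal sketch that hard-codes the tagger output shown above so that it runs without NLTK; the names `tags_by_word` and `ambiguous` are illustrative and not part of NLTK.

```python
from collections import defaultdict

# Tagger output copied from the cell above
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'),
          ('permit', 'NN')]

# Map each word to the set of tags it was given
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word].add(tag)

# Keep only the words that received more than one tag
ambiguous = {word: sorted(tags) for word, tags in tags_by_word.items() if len(tags) > 1}
print(ambiguous)  # {'refuse': ['NN', 'VBP'], 'permit': ['NN', 'VB']}
```

Both refuse and permit appear once as a verb and once as a noun, so the tagger must rely on the surrounding words to choose between them.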

Lexical categories like "noun" and part-of-speech tags like NN seem to have their uses, but the details will be obscure to many readers. You might wonder what justification there is for introducing this extra level of information. Many of these categories arise from superficial analysis of the distribution of words in text. Consider the following analysis involving woman (a noun), bought (a verb), over (a preposition), and the (a determiner). The text.similar() method takes a word w, finds all contexts w1 w w2, then finds all words w' that appear in the same context, i.e. w1 w' w2.

In [120]:
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar("woman")

man time day year car moment world house family child country boy
state job place way war girl work word


In [8]:
text.similar("bought")

made said done put had seen found given left heard was been brought
set got that took in told felt


In [121]:
text.similar("over")

in on to of and for with from at by that into as up out down through
is all about


Observe that searching for woman finds nouns; searching for bought mostly finds verbs; searching for over generally finds prepositions; searching for the finds several determiners. A tagger can correctly identify the tags on these words in the context of a sentence, e.g. The woman bought over $150,000 worth of clothes.
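
The context-matching idea behind text.similar() can be sketched in plain Python. This is only an illustration under simplifying assumptions: the function `similar_words` and the toy sentence are made up for this example, and words are ranked simply by the number of shared contexts, which is not necessarily how NLTK scores them.

```python
from collections import defaultdict

def similar_words(tokens, target):
    """Rank words by how many (previous, next) contexts they share with target."""
    contexts = defaultdict(set)  # word -> set of (previous word, next word)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((prev, nxt))

    target_contexts = contexts[target]
    # Count shared contexts for every other word that shares at least one
    scores = {
        word: len(ctxs & target_contexts)
        for word, ctxs in contexts.items()
        if word != target and ctxs & target_contexts
    }
    # Most shared contexts first; ties broken alphabetically
    return sorted(scores, key=lambda w: (-scores[w], w))

# Toy corpus (an assumption for illustration, not taken from the Brown Corpus)
tokens = "the woman bought a car . the man bought a house . the girl saw a car .".split()
print(similar_words(tokens, "woman"))  # ['man']
```

Here man is returned because it also occurs between the and bought, while girl is not, since its only context is the ... saw.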

A tagger can also model our knowledge of unknown words, e.g. we can guess that scrobbling is probably a verb, with the root scrobble, and likely to occur in contexts like he was scrobbling.
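
A guess like this can be approximated with a few suffix rules. The sketch below is a hypothetical rule table, not the method used by nltk.pos_tag(); the rule list `SUFFIX_RULES` and the function `guess_tag` are assumptions made for illustration, and the tags are taken from the Penn Treebank tagset used above.

```python
import re

# Hypothetical suffix rules, checked in order; the first match wins
SUFFIX_RULES = [
    (r".*ing$", "VBG"),  # gerunds and present participles
    (r".*ed$", "VBD"),   # simple past
    (r".*ly$", "RB"),    # many adverbs
    (r".*s$", "NNS"),    # plural nouns
]

def guess_tag(word, default="NN"):
    """Guess a tag from the word's suffix alone, falling back to a default."""
    for pattern, tag in SUFFIX_RULES:
        if re.match(pattern, word):
            return tag
    return default

print(guess_tag("scrobbling"))  # VBG
print(guess_tag("scrobble"))    # NN
```

Rules like these ignore context entirely, so they are best seen as a fallback for words the tagger has never encountered.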

In [40]:
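# Read the file as UTF-8 so that non-ASCII text (e.g. Greek words) decodes correctly
with open("ekonomi_wiki.txt", "rt", encoding="utf-8") as file:
    ekonomi = file.read()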

# Tokenization
token = re.split("[ \t\n]+", ekonomi)
token

['Ekonomi',
 'adalah',
 'ilmu',
 'sosial',
 'yang',
 'mempelajari',
 'aktivitas',
 'manusia',
 'yang',
 'berhubungan',
 'dengan',
 'produksi,',
 'distribusi,',
 'dan',
 'konsumsi',
 'terhadap',
 'barang',
 'dan',
 'jasa.',
 'Istilah',
 '"ekonomi"',
 'sendiri',
 'berasal',
 'dari',
 'bahasa',
 'Yunani,',
 'yaitu',
 'οἶκος',
 '(oikos)',
 'yang',
 'berarti',
 '"keluarga,',
 'rumah',
 'tangga"',
 'dan',
 'νόμος',
 '(nomos)',
 'yang',
 'berarti',
 '"peraturan,',
 'aturan,',
 'hukum".',
 'Secara',
 'garis',
 'besar,',
 'ekonomi',
 'diartikan',
 'sebagai',
 '"aturan',
 'rumah',
 'tangga"',
 'atau',
 '"manajemen',
 'rumah',
 'tangga".',
 'Sementara',
 'yang',
 'dimaksud',
 'dengan',
 'ahli',
 'ekonomi',
 'atau',
 'ekonom',
 'adalah',
 'orang',
 'menggunakan',
 'konsep',
 'ekonomi',
 'dan',
 'data',
 'dalam',
 'bekerja.',
 'Kata',
 '"ekonomi"',
 'merupakan',
 'kata',
 'serapan',
 'dari',
 'bahasa',
 'Yunani',
 'Kuno',
 'οἰκονόμος',
 'yang',
 'bermakna',
 '"pengelolaan',
 'r

### Side Project: Ekonomi (Wikipedia Indonesia)

In [110]:
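# Match contractions (word'word), word-number tokens (word-123), or plain words
patterns = r"\w+'\w+|\w+-\d+|\w+"
compiler = re.compile(patterns)

def extract_match(text):
    """Return the first pattern match in the lowercased text, or None if there is none."""
    result = compiler.search(text.lower())
    return result.group() if result else None


# Normalize each token, dropping tokens with no match (e.g. bare punctuation)
converted = [word for word in map(extract_match, token) if word is not None]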
converted

['ekonomi',
 'adalah',
 'ilmu',
 'sosial',
 'yang',
 'mempelajari',
 'aktivitas',
 'manusia',
 'yang',
 'berhubungan',
 'dengan',
 'produksi',
 'distribusi',
 'dan',
 'konsumsi',
 'terhadap',
 'barang',
 'dan',
 'jasa',
 'istilah',
 'ekonomi',
 'sendiri',
 'berasal',
 'dari',
 'bahasa',
 'yunani',
 'yaitu',
 'οἶκος',
 'oikos',
 'yang',
 'berarti',
 'keluarga',
 'rumah',
 'tangga',
 'dan',
 'νόμος',
 'nomos',
 'yang',
 'berarti',
 'peraturan',
 'aturan',
 'hukum',
 'secara',
 'garis',
 'besar',
 'ekonomi',
 'diartikan',
 'sebagai',
 'aturan',
 'rumah',
 'tangga',
 'atau',
 'manajemen',
 'rumah',
 'tangga',
 'sementara',
 'yang',
 'dimaksud',
 'dengan',
 'ahli',
 'ekonomi',
 'atau',
 'ekonom',
 'adalah',
 'orang',
 'menggunakan',
 'konsep',
 'ekonomi',
 'dan',
 'data',
 'dalam',
 'bekerja',
 'kata',
 'ekonomi',
 'merupakan',
 'kata',
 'serapan',
 'dari',
 'bahasa',
 'yunani',
 'kuno',
 'οἰκονόμος',
 'yang',
 'bermakna',
 'pengelolaan',
 'rumah',
 'tangga',
 'kata',
 'ini',
 'merupakan',
 'gabungan

In [118]:
ekonomi = nltk.Text(converted)
ekonomi.collocation_list() # pairs of words that frequently occur together

['rumah tangga',
 'antara lain',
 'daya alam',
 'menciptakan produk',
 'sektor tersier',
 'amerika serikat',
 'bahasa yunani',
 'tenaga kerja',
 'ini biasanya',
 'sektor ini',
 'sektor primer',
 'disebut sebagai',
 'produksi distribusi',
 'sumber daya',
 'dari bahasa',
 'dibahas dalam',
 'sektor quaterner',
 'sektor sekunder',
 'negara berkembang',
 'dan konsumsi']
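
To get a feel for what a collocation list contains, here is a rough sketch that simply counts adjacent word pairs. It is a simplification: NLTK's collocation finder ranks pairs with a statistical association measure and filters out some words, whereas this version uses raw counts only. The function `frequent_bigrams` and its parameters are hypothetical, and the sample words are a short hand-picked list drawn from the output above.

```python
from collections import Counter

def frequent_bigrams(words, min_count=2):
    """Return adjacent word pairs that occur at least min_count times."""
    pair_counts = Counter(zip(words, words[1:]))
    return [" ".join(pair) for pair, count in pair_counts.items() if count >= min_count]

# Small hand-picked sample (an assumption; the real text is much longer)
words = ["rumah", "tangga", "atau", "manajemen", "rumah", "tangga",
         "dari", "bahasa", "yunani", "kuno", "bahasa", "yunani"]
print(frequent_bigrams(words))
```

With this sample only rumah tangga and bahasa yunani occur twice, so they are the only pairs returned.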

## Tagged Corpora

### Representing Tagged Tokens
By convention in NLTK, a tagged token is represented using a tuple consisting of the token and the tag. We can create one of these special tuples from the standard string representation of a tagged token, using the function str2tuple():

In [122]:
tagged_token = nltk.tag.str2tuple("fly/NN")
tagged_token

('fly', 'NN')

In [123]:
sent = '''
The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
interest/NN of/IN both/ABX governments/NNS ''/'' ./.
'''

[nltk.tag.str2tuple(t) for t in sent.split()]

[('The', 'AT'),
 ('grand', 'JJ'),
 ('jury', 'NN'),
 ('commented', 'VBD'),
 ('on', 'IN'),
 ('a', 'AT'),
 ('number', 'NN'),
 ('of', 'IN'),
 ('other', 'AP'),
 ('topics', 'NNS'),
 (',', ','),
 ('AMONG', 'IN'),
 ('them', 'PPO'),
 ('the', 'AT'),
 ('Atlanta', 'NP'),
 ('and', 'CC'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('purchasing', 'VBG'),
 ('departments', 'NNS'),
 ('which', 'WDT'),
 ('it', 'PPS'),
 ('said', 'VBD'),
 ('``', '``'),
 ('ARE', 'BER'),
 ('well', 'QL'),
 ('operated', 'VBN'),
 ('and', 'CC'),
 ('follow', 'VB'),
 ('generally', 'RB'),
 ('accepted', 'VBN'),
 ('practices', 'NNS'),
 ('which', 'WDT'),
 ('inure', 'VB'),
 ('to', 'IN'),
 ('the', 'AT'),
 ('best', 'JJT'),
 ('interest', 'NN'),
 ('of', 'IN'),
 ('both', 'ABX'),
 ('governments', 'NNS'),
 ("''", "''"),
 ('.', '.')]
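
Notice that the tags NP-tl and NN-tl in the input come out as NP-TL and NN-TL: str2tuple() uppercases the tag. The conversion can be sketched in plain Python as follows. This is an approximation for illustration; `split_tagged` and `join_tagged` are hypothetical helper names, not NLTK functions.

```python
def split_tagged(token, sep="/"):
    """Split 'word/TAG' at the last separator and uppercase the tag."""
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator present: return the token with no tag
        return (token, None)
    return (word, tag.upper())

def join_tagged(pair, sep="/"):
    """Inverse operation: turn a (word, tag) tuple back into 'word/TAG'."""
    word, tag = pair
    return word if tag is None else f"{word}{sep}{tag}"

print(split_tagged("fly/NN"))             # ('fly', 'NN')
print(split_tagged("Fulton/NP-tl"))       # ('Fulton', 'NP-TL')
print(join_tagged(("County", "NN-TL")))   # County/NN-TL
```

Splitting at the last separator means a token that itself contains a slash, such as 1/2/CD, still yields the word 1/2 and the tag CD.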