# Part-of-Speech Tagging

In [1]:
import nltk
from nltk import word_tokenize

In [2]:
nltk.download('tagsets')

[nltk_data] Downloading package tagsets to /home/manuj/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


True

In [3]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/manuj/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [4]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [5]:
text = "I refuse to let this refuse get me down"

tokenized_words = word_tokenize(text)
tagged_words = nltk.pos_tag(tokenized_words)
tagged_words

[('I', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('let', 'VB'),
 ('this', 'DT'),
 ('refuse', 'NN'),
 ('get', 'VB'),
 ('me', 'PRP'),
 ('down', 'RP')]
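The output above tags the two occurrences of "refuse" differently (`VBP` and `NN`). As a rough sketch, the snippet below collects the set of tags seen for each word so that such ambiguous words stand out. The `tagged` list is copied by hand from the output above rather than recomputed, and the names `tags_by_word` and `ambiguous` are illustrative only.

```python
from collections import defaultdict

# (word, tag) pairs copied from the pos_tag output above (not recomputed here)
tagged = [('I', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('let', 'VB'),
          ('this', 'DT'), ('refuse', 'NN'), ('get', 'VB'), ('me', 'PRP'),
          ('down', 'RP')]

# Map each lowercased word to the set of tags it received
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word.lower()].add(tag)

# Keep only the words that received more than one distinct tag
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)  # {'refuse': ['NN', 'VBP']}
```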

In [6]:
text = """Bear with me, this effort with soon bear fruit, 
          otherwise we'll have to run from the bear"""

tokenized_words = word_tokenize(text)
tagged_words = nltk.pos_tag(tokenized_words)
tagged_words

[('Bear', 'NNP'),
 ('with', 'IN'),
 ('me', 'PRP'),
 (',', ','),
 ('this', 'DT'),
 ('effort', 'NN'),
 ('with', 'IN'),
 ('soon', 'RB'),
 ('bear', 'JJ'),
 ('fruit', 'NN'),
 (',', ','),
 ('otherwise', 'RB'),
 ('we', 'PRP'),
 ("'ll", 'MD'),
 ('have', 'VB'),
 ('to', 'TO'),
 ('run', 'VB'),
 ('from', 'IN'),
 ('the', 'DT'),
 ('bear', 'NN')]

In [7]:
text = ("A bird in hand is worth two in the bush. "
        "Good things come to those who wait. "
        "There are other fish in the sea. "
        "The ball is in your court.")

tokenized_words = word_tokenize(text)
tagged_words = nltk.pos_tag(tokenized_words)
tagged_words

[('A', 'DT'),
 ('bird', 'NN'),
 ('in', 'IN'),
 ('hand', 'NN'),
 ('is', 'VBZ'),
 ('worth', 'JJ'),
 ('two', 'CD'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('bush', 'NN'),
 ('.', '.'),
 ('Good', 'JJ'),
 ('things', 'NNS'),
 ('come', 'VBP'),
 ('to', 'TO'),
 ('those', 'DT'),
 ('who', 'WP'),
 ('wait', 'VBP'),
 ('.', '.'),
 ('There', 'EX'),
 ('are', 'VBP'),
 ('other', 'JJ'),
 ('fish', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('sea', 'NN'),
 ('.', '.'),
 ('The', 'DT'),
 ('ball', 'NN'),
 ('is', 'VBZ'),
 ('in', 'IN'),
 ('your', 'PRP$'),
 ('court', 'NN'),
 ('.', '.')]

In [8]:
from nltk.probability import FreqDist

# Frequency of each (word, tag) pair
fd = FreqDist(tagged_words)
# Frequency of each tag on its own, ignoring the words
fd_tagged = FreqDist(tag for (word, tag) in tagged_words)
fd_tagged.most_common(10)

[('NN', 7),
 ('DT', 5),
 ('IN', 4),
 ('.', 4),
 ('JJ', 3),
 ('VBP', 3),
 ('VBZ', 2),
 ('CD', 1),
 ('NNS', 1),
 ('TO', 1)]
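In NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so the same tally can be sketched with the standard library alone. The example below also turns the counts into relative frequencies. The `tags` list is copied by hand from the tagged output of cell [7], and the variable names are illustrative.

```python
from collections import Counter

# Tag sequence copied from the pos_tag output of cell [7] (not recomputed here)
tags = ['DT', 'NN', 'IN', 'NN', 'VBZ', 'JJ', 'CD', 'IN', 'DT', 'NN', '.',
        'JJ', 'NNS', 'VBP', 'TO', 'DT', 'WP', 'VBP', '.',
        'EX', 'VBP', 'JJ', 'NN', 'IN', 'DT', 'NN', '.',
        'DT', 'NN', 'VBZ', 'IN', 'PRP$', 'NN', '.']

tag_counts = Counter(tags)
total = sum(tag_counts.values())

# Share of all tokens carried by each of the three most frequent tags
for tag, count in tag_counts.most_common(3):
    print(f"{tag}: {count} of {total} tokens ({count / total:.1%})")
```

Nouns (`NN`) account for 7 of the 34 tokens, which matches the first row of the `most_common` output above.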

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. It contains text from 500 sources, categorized by genre, such as news, editorial, and so on. Table 1.1 in Chapter 2 of the NLTK book gives an example of each genre (for a complete list, see http://icame.uib.no/brown/bcm-los.html).

In [9]:
nltk.download('brown')

[nltk_data] Downloading package brown to /home/manuj/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [10]:
nltk.corpus.brown.words()[:10]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of']

In [11]:
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())

Lexical categories like "noun" and part-of-speech tags like NN seem to have their uses, but the details will be obscure to many readers. You might wonder what justification there is for introducing this extra level of information. Many of these categories arise from a superficial analysis of the distribution of words in text. Consider the following analysis involving boy (a noun), run (a verb), and over (a preposition). The text.similar() method takes a word w, finds all contexts w1 w w2, then finds all words w' that appear in the same context, i.e. w1 w' w2.

#### Similar words here mostly belong to the same part of speech

In [12]:
text.similar('boy')

man time day way girl year house people world city family state room
country car woman program church government job


In [13]:
text.similar('run')

get be do in see work go have take make put and find time look day say
use come show


In [14]:
text.similar('over')

in on to of and for with from at by that into as up out down through
is all about
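
The context-matching idea described earlier for `text.similar()` (find every context w1 w w2 of a word w, then find other words w' that occur in the same context) can be sketched in plain Python. This is a simplified illustration, not NLTK's actual implementation: the function name `similar_words`, the ranking by number of shared contexts, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def similar_words(tokens, target, num=5):
    """Rank words by how many (left, right) contexts they share with target."""
    tokens = [t.lower() for t in tokens]

    # Map each word w to the set of contexts (w1, w2) in which it occurs
    contexts = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((left, right))

    target_contexts = contexts.get(target, set())

    # Score every other word by the number of contexts it shares with target
    scores = Counter()
    for word, ctxs in contexts.items():
        if word != target:
            shared = len(ctxs & target_contexts)
            if shared:
                scores[word] = shared
    return [w for w, _ in scores.most_common(num)]

# Hypothetical toy corpus, invented for illustration
toy = ("the boy ran home . the girl ran home . the dog ran away . "
       "a boy sat down . a girl sat down").split()

print(similar_words(toy, "boy"))  # ['girl', 'dog']
```

Here "girl" shares two contexts with "boy" (`the _ ran` and `a _ sat`) and "dog" shares one, so both nouns surface while verbs and determiners do not. On a corpus the size of Brown, the same distributional signal produces the word lists shown above.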
