# POS Tagging

- POS tagging is the process of classifying each word by its part of speech (noun, verb, adjective, and so on) and labeling it accordingly.

# POS Tagger
- A POS tagger processes a sequence of words and assigns a part-of-speech tag to each word.

- pos_tag is the simplest way to tag text in nltk; it applies a pre-trained tagger, so no training is required.

- The example below shows how to use pos_tag.



In [2]:
import nltk
text = 'Python is awesome.'
words = nltk.word_tokenize(text)
nltk.pos_tag(words)

[('Python', 'NNP'), ('is', 'VBZ'), ('awesome', 'JJ'), ('.', '.')]

- The words Python, is, and awesome are tagged as proper noun (NNP), present-tense verb in the third person singular (VBZ), and adjective (JJ), respectively.

- You can read more about the POS tags with the help command below.

In [3]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [4]:
nltk.help.upenn_tagset('JJ')

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...


# Tagging Text
- You can construct a list of tagged words from a string.

- A tagged word or token is represented as a tuple of the word and its tag.

- In the input text, each word is separated from its tag by /.

In [5]:
text = 'Python/NN is/VB awesome/JJ ./.'
[ nltk.tag.str2tuple(word) for word in text.split() ]

[('Python', 'NN'), ('is', 'VB'), ('awesome', 'JJ'), ('.', '.')]
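
- To make the word/tag format concrete, here is a minimal plain-Python sketch of the splitting step. The helper name split_tagged_token is hypothetical and is not part of nltk; it splits at the last / and, as an assumption of this sketch, returns None as the tag when no separator is present.

```python
def split_tagged_token(token, sep='/'):
    """Split 'word/TAG' at the last separator (hypothetical helper, not an NLTK function)."""
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator in the token: treat it as an untagged word.
        return (token, None)
    return (word, tag)

text = 'Python/NN is/VB awesome/JJ ./.'
tagged = [split_tagged_token(tok) for tok in text.split()]
print(tagged)
```

- Splitting at the last separator keeps words that themselves contain a slash intact, for example 'and/or/CC'.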

# Tagged Corpora
- Many of the text corpora available in nltk are already tagged with parts of speech.

- The tagged_words method returns the tagged words of a corpus.

- The following example fetches the tagged words of the Brown corpus and displays the first three.

In [6]:
from nltk.corpus import brown

In [7]:
brown_tagged = brown.tagged_words()

In [8]:
brown_tagged[:3]

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]
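
- Since a tagged corpus is just a sequence of (word, tag) tuples, ordinary Python tools work on it. The sketch below counts tags in a small sample. The ten pairs are hand-copied for illustration and are assumed to match the start of brown.tagged_words(); collections.Counter stands in for any frequency counter.

```python
from collections import Counter

# Hand-copied sample, assumed to equal brown.tagged_words()[:10].
sample = [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'),
          ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'),
          ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'),
          ('of', 'IN')]

# Keep only the tag of each pair and count how often it occurs.
tag_counts = Counter(tag for _, tag in sample)
print(tag_counts['AT'], tag_counts['NN-TL'], tag_counts['VBD'])
```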

# Lookup Tagger
- You can define a custom tagger and use it to tag the words of any text.

- The example below defines a dictionary, defined_tags, with three words and their tags, and passes it to UnigramTagger as its model.

In [9]:
import nltk
text = 'Python is awesome.'
words = nltk.word_tokenize(text)
defined_tags = {'is':'BEZ', 'over':'IN', 'who': 'WPS'}

In [10]:
baseline_tagger = nltk.UnigramTagger(model=defined_tags)

In [11]:
baseline_tagger.tag(words)

[('Python', None), ('is', 'BEZ'), ('awesome', None), ('.', None)]

- Because Python, awesome, and the period are not found in the defined_tags dictionary, they are tagged None.
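
- Conceptually, a lookup tagger is little more than a dictionary lookup per token. The following plain-Python sketch reproduces the result above without nltk. The helper lookup_tag is hypothetical, and the token list is written out by hand rather than produced by word_tokenize.

```python
defined_tags = {'is': 'BEZ', 'over': 'IN', 'who': 'WPS'}
words = ['Python', 'is', 'awesome', '.']  # tokens of 'Python is awesome.'

def lookup_tag(words, table):
    """Tag each word by dictionary lookup; unknown words get None (hypothetical helper)."""
    return [(w, table.get(w)) for w in words]

tagged = lookup_tag(words, defined_tags)
# Words the table does not cover.
uncovered = [w for w, t in tagged if t is None]
print(tagged)
print(uncovered)
```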

# Unigram Tagger
- UnigramTagger gives you the flexibility to create your own taggers.

- Unigram taggers are statistical: they assign each word or token the tag that is most likely for that particular word, without looking at the surrounding words.

- You build a unigram tagger through a process known as training.

- You then use the tagger to tag the words of a test set and evaluate its accuracy.



- Let's consider the tagged sentences of the Brown corpus that belong to the government genre.

- Let's also compute the training set size as 80% of those sentences.



In [14]:
from nltk.corpus import brown

In [15]:
brown_tagged_sents = brown.tagged_sents(categories='government')

In [16]:
brown_sents = brown.sents(categories='government')

In [17]:
len(brown_sents)

3032

In [18]:
train_size = int(len(brown_sents)*0.8)

In [19]:
train_size

2425
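
- The split arithmetic can be checked in isolation. This is a small sketch that uses a list of integers as a stand-in for the tagged sentences; only the sentence count 3032 comes from the output above.

```python
total = 3032                     # number of 'government' sentences reported above
train_size = int(total * 0.8)    # int() truncates 2425.6 down to 2425
test_size = total - train_size

sentences = list(range(total))   # stand-in for the tagged sentences
train, test = sentences[:train_size], sentences[train_size:]
print(train_size, test_size, len(train), len(test))
```

- Slicing at the same index for both halves guarantees that no sentence lands in both the training and the test set.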

In [20]:
train_sents = brown_tagged_sents[:train_size]
test_sents = brown_tagged_sents[train_size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)

0.7799495586380832

- unigram_tagger is built by passing the tagged training sentences as an argument to UnigramTagger.

- The resulting unigram_tagger is then evaluated on the test sentences; it tags about 78% of the test tokens correctly.
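
- To show what training and evaluation amount to, here is a minimal plain-Python sketch of the same idea. It is not NLTK's implementation: train_unigram and accuracy are hypothetical helpers, and the tiny tagged sentences are made up for illustration.

```python
from collections import Counter, defaultdict

# Hand-made tagged sentences (illustrative data, not from the Brown corpus).
train_sents = [
    [('the', 'AT'), ('dog', 'NN'), ('runs', 'VBZ')],
    [('the', 'AT'), ('run', 'NN'), ('ends', 'VBZ')],
    [('dogs', 'NNS'), ('run', 'VB')],
    [('they', 'PPSS'), ('run', 'VB')],
]
test_sents = [[('the', 'AT'), ('dog', 'NN'), ('ends', 'VBZ')],
              [('cats', 'NNS'), ('run', 'VB')]]

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def accuracy(model, tagged_sents):
    """Fraction of test tokens whose predicted tag equals the gold tag."""
    pairs = [(w, t) for sent in tagged_sents for w, t in sent]
    correct = sum(model.get(w) == t for w, t in pairs)
    return correct / len(pairs)

model = train_unigram(train_sents)
print(model['run'])                 # 'run' is VB twice and NN once in training
print(accuracy(model, test_sents))  # 'cats' is unseen, so one of five tokens is missed
```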

In [21]:
unigram_tagger.tag(brown_sents[3000])

[('The', 'AT'),
 ('first', 'OD'),
 ('step', 'NN'),
 ('is', 'BEZ'),
 ('a', 'AT'),
 ('comprehensive', 'JJ'),
 ('self', None),
 ('study', 'NN'),
 ('made', 'VBN'),
 ('by', 'IN'),
 ('faculty', None),
 (',', ','),
 ('by', 'IN'),
 ('outside', 'IN'),
 ('consultants', 'NNS'),
 (',', ','),
 ('or', 'CC'),
 ('by', 'IN'),
 ('a', 'AT'),
 ('combination', 'NN'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('two', 'CD'),
 ('.', '.')]
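
- The None tags for self and faculty mark words the tagger never saw during training. In NLTK the usual remedy is a backoff tagger, for example passing backoff=nltk.DefaultTagger('NN') when constructing UnigramTagger. The plain-Python sketch below only imitates that effect on part of the output above; fill_unknown is a hypothetical helper and the default tag NN is an assumption.

```python
# Part of the tagger output shown above; None marks words unseen in training.
tagged = [('a', 'AT'), ('comprehensive', 'JJ'), ('self', None),
          ('study', 'NN'), ('made', 'VBN'), ('by', 'IN'), ('faculty', None)]

def fill_unknown(tagged_words, default_tag='NN'):
    """Replace missing tags with a default tag (hypothetical helper)."""
    return [(w, default_tag if t is None else t) for w, t in tagged_words]

filled = fill_unknown(tagged)
print([w for w, t in tagged if t is None])
print(filled)
```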

In [25]:
import nltk
from nltk.corpus import brown

In [26]:
brown_tagged_words=brown.tagged_words()

In [28]:
brown_tagged_trigrams=list(nltk.trigrams(brown_tagged_words))

In [30]:
brown_trigram_pos_tags = [(a[1],b[1],c[1]) for a,b,c in brown_tagged_trigrams]

In [32]:
brown_trigram_pos_tags

[('AT', 'NP-TL', 'NN-TL'),
 ('NP-TL', 'NN-TL', 'JJ-TL'),
 ('NN-TL', 'JJ-TL', 'NN-TL'),
 ('JJ-TL', 'NN-TL', 'VBD'),
 ('NN-TL', 'VBD', 'NR'),
 ('VBD', 'NR', 'AT'),
 ('NR', 'AT', 'NN'),
 ('AT', 'NN', 'IN'),
 ('NN', 'IN', 'NP$'),
 ('IN', 'NP$', 'JJ'),
 ('NP$', 'JJ', 'NN'),
 ('JJ', 'NN', 'NN'),
 ('NN', 'NN', 'VBD'),
 ('NN', 'VBD', '``'),
 ('VBD', '``', 'AT'),
 ('``', 'AT', 'NN'),
 ('AT', 'NN', "''"),
 ('NN', "''", 'CS'),
 ("''", 'CS', 'DTI'),
 ('CS', 'DTI', 'NNS'),
 ('DTI', 'NNS', 'VBD'),
 ('NNS', 'VBD', 'NN'),
 ('VBD', 'NN', '.'),
 ('NN', '.', 'AT'),
 ('.', 'AT', 'NN'),
 ('AT', 'NN', 'RBR'),
 ('NN', 'RBR', 'VBD'),
 ('RBR', 'VBD', 'IN'),
 ('VBD', 'IN', 'NN'),
 ('IN', 'NN', 'NNS'),
 ('NN', 'NNS', 'CS'),
 ('NNS', 'CS', 'AT'),
 ('CS', 'AT', 'NN-TL'),
 ('AT', 'NN-TL', 'JJ-TL'),
 ('NN-TL', 'JJ-TL', 'NN-TL'),
 ('JJ-TL', 'NN-TL', ','),
 ('NN-TL', ',', 'WDT'),
 (',', 'WDT', 'HVD'),
 ('WDT', 'HVD', 'JJ'),
 ('HVD', 'JJ', 'NN'),
 ('JJ', 'NN', 'IN'),
 ('NN', 'IN', 'AT'),
 ('IN', 'AT', 'NN'),
 ('AT', 'N

In [33]:
brown_trigram_pos_tags_freq=nltk.FreqDist(brown_trigram_pos_tags)

In [34]:
brown_trigram_pos_tags_freq

FreqDist({('IN', 'AT', 'NN'): 21116, ('AT', 'NN', 'IN'): 17423, ('NN', 'IN', 'AT'): 14699, ('AT', 'JJ', 'NN'): 13480, ('JJ', 'NN', 'IN'): 8424, ('IN', 'AT', 'JJ'): 8015, ('NN', 'IN', 'NN'): 7401, ('NNS', 'IN', 'AT'): 4829, ('AT', 'NN', 'NN'): 4667, ('AT', 'NN', '.'): 4567, ...})

In [35]:
brown_trigram_pos_tags_freq[('JJ','NN','IN')]

8424
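
- The trigram counting above can be sketched without nltk: zip over three shifted views of the tag sequence yields the same sliding window, and collections.Counter plays the role of FreqDist. The tag list is copied from the first trigrams printed above and covers only the first sentence. Because tagged_words() is one flat stream, the trigrams in the notebook also span sentence boundaries, for example ('NN', '.', 'AT').

```python
from collections import Counter

# Brown tags of the corpus's first sentence, copied from the trigram output above.
tags = ['AT', 'NP-TL', 'NN-TL', 'JJ-TL', 'NN-TL', 'VBD', 'NR', 'AT', 'NN',
        'IN', 'NP$', 'JJ', 'NN', 'NN', 'VBD', '``', 'AT', 'NN', "''", 'CS',
        'DTI', 'NNS', 'VBD', 'NN', '.']

# Slide a window of width three over the tag sequence.
tag_trigrams = list(zip(tags, tags[1:], tags[2:]))
trigram_freq = Counter(tag_trigrams)

print(tag_trigrams[:2])
print(trigram_freq[('AT', 'NN', 'IN')])
```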

In [37]:
brown_tagged_sents=brown.tagged_sents()

In [38]:
total_size = len(brown_tagged_sents)

In [40]:
train_size=int(total_size*0.8)

In [41]:
train_sents = brown_tagged_sents[:train_size]

In [42]:
test_sents = brown_tagged_sents[train_size:]

In [44]:
unigram_tagger=nltk.UnigramTagger(train_sents)

In [46]:
tag_performance=unigram_tagger.evaluate(test_sents)

In [47]:
tag_performance

0.8773754310202373

In [71]:
news_words = brown.words()  # without categories=..., this is the whole Brown corpus, not only the news genre

In [72]:
nltk.pos_tag(news_words)

[('The', 'DT'),
 ('Fulton', 'NNP'),
 ('County', 'NNP'),
 ('Grand', 'NNP'),
 ('Jury', 'NNP'),
 ('said', 'VBD'),
 ('Friday', 'NNP'),
 ('an', 'DT'),
 ('investigation', 'NN'),
 ('of', 'IN'),
 ("Atlanta's", 'NNP'),
 ('recent', 'JJ'),
 ('primary', 'JJ'),
 ('election', 'NN'),
 ('produced', 'VBD'),
 ('``', '``'),
 ('no', 'DT'),
 ('evidence', 'NN'),
 ("''", "''"),
 ('that', 'IN'),
 ('any', 'DT'),
 ('irregularities', 'NNS'),
 ('took', 'VBD'),
 ('place', 'NN'),
 ('.', '.'),
 ('The', 'DT'),
 ('jury', 'NN'),
 ('further', 'RB'),
 ('said', 'VBD'),
 ('in', 'IN'),
 ('term-end', 'JJ'),
 ('presentments', 'NNS'),
 ('that', 'IN'),
 ('the', 'DT'),
 ('City', 'NNP'),
 ('Executive', 'NNP'),
 ('Committee', 'NNP'),
 (',', ','),
 ('which', 'WDT'),
 ('had', 'VBD'),
 ('over-all', 'JJ'),
 ('charge', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('election', 'NN'),
 (',', ','),
 ('``', '``'),
 ('deserves', 'VBZ'),
 ('the', 'DT'),
 ('praise', 'NN'),
 ('and', 'CC'),
 ('thanks', 'NNS'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('City',

In [75]:
cfdd=nltk.FreqDist(nltk.pos_tag(news_words))

In [77]:
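# pos_tag returns Penn Treebank tags (e.g. 'DT'), so a Brown-style pair such as ('The', 'AT') is never counted
cfdd[('The','AT')]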

0
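
- The zero count is a tagset mismatch rather than a missing word: pos_tag emits Penn Treebank tags such as DT, while AT is the Brown corpus tag for articles. The sketch below shows the effect on the first few pairs of the pos_tag output above, with collections.Counter standing in for FreqDist.

```python
from collections import Counter

# First pairs of the nltk.pos_tag output shown above (Penn Treebank tags).
penn_tagged = [('The', 'DT'), ('Fulton', 'NNP'), ('County', 'NNP'),
               ('Grand', 'NNP'), ('Jury', 'NNP'), ('said', 'VBD')]

freq = Counter(penn_tagged)
print(freq[('The', 'AT')])  # Brown-style tag: this pair never occurs
print(freq[('The', 'DT')])  # Penn Treebank tag: counted once in this sample
```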

In [57]:
s = 'Python is awesome'
print(nltk.pos_tag(nltk.word_tokenize(s)))

[('Python', 'NNP'), ('is', 'VBZ'), ('awesome', 'JJ')]


In [53]:
import nltk
tagged_token = nltk.tag.str2tuple('fly/NN')
print(tagged_token)

('fly', 'NN')


In [78]:
news_words=brown.words(categories='news')

In [79]:
nltk.pos_tag(news_words)

[('The', 'DT'),
 ('Fulton', 'NNP'),
 ('County', 'NNP'),
 ('Grand', 'NNP'),
 ('Jury', 'NNP'),
 ('said', 'VBD'),
 ('Friday', 'NNP'),
 ('an', 'DT'),
 ('investigation', 'NN'),
 ('of', 'IN'),
 ("Atlanta's", 'NNP'),
 ('recent', 'JJ'),
 ('primary', 'JJ'),
 ('election', 'NN'),
 ('produced', 'VBD'),
 ('``', '``'),
 ('no', 'DT'),
 ('evidence', 'NN'),
 ("''", "''"),
 ('that', 'IN'),
 ('any', 'DT'),
 ('irregularities', 'NNS'),
 ('took', 'VBD'),
 ('place', 'NN'),
 ('.', '.'),
 ('The', 'DT'),
 ('jury', 'NN'),
 ('further', 'RB'),
 ('said', 'VBD'),
 ('in', 'IN'),
 ('term-end', 'JJ'),
 ('presentments', 'NNS'),
 ('that', 'IN'),
 ('the', 'DT'),
 ('City', 'NNP'),
 ('Executive', 'NNP'),
 ('Committee', 'NNP'),
 (',', ','),
 ('which', 'WDT'),
 ('had', 'VBD'),
 ('over-all', 'JJ'),
 ('charge', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('election', 'NN'),
 (',', ','),
 ('``', '``'),
 ('deserves', 'VBZ'),
 ('the', 'DT'),
 ('praise', 'NN'),
 ('and', 'CC'),
 ('thanks', 'NNS'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('City',