# Chapter 5  Categorizing and Tagging Words

Read through the chapter, completing the "Your Turn" activities and exercises below for each section.  

## Section 1 - Using a Tagger

**Your Turn:**
Many words, like ski and race, can be used as nouns or verbs with no difference in pronunciation. Can you think of others? Hint: think of a commonplace object and try to put the word to before it to see if it can also be a verb, or think of an action and try to put the before it to see if it can also be a noun. Now make up a sentence with both uses of this word, and run the POS-tagger on this sentence.

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
import nltk
text1 = nltk.word_tokenize("I need some space to garage my cars")
text2 = nltk.word_tokenize("I require a place to build the garage")
text3 = nltk.word_tokenize("I will garage my car in the garage")
nltk.pos_tag(text1)
nltk.pos_tag(text2)
nltk.pos_tag(text3)

[('I', 'PRP'),
 ('need', 'VBP'),
 ('some', 'DT'),
 ('space', 'NN'),
 ('to', 'TO'),
 ('garage', 'VB'),
 ('my', 'PRP$'),
 ('cars', 'NNS')]

[('I', 'PRP'),
 ('require', 'VBP'),
 ('a', 'DT'),
 ('place', 'NN'),
 ('to', 'TO'),
 ('build', 'VB'),
 ('the', 'DT'),
 ('garage', 'NN')]

[('I', 'PRP'),
 ('will', 'MD'),
 ('garage', 'VB'),
 ('my', 'PRP$'),
 ('car', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('garage', 'NN')]

## Section 2 - Tagged Corpora

**Your Turn:**
Given the list of past participles produced by `list(cfd2['VBN'])`, try to collect a list of all the word-tag pairs that immediately precede items in that list.

*Note that there is a typo in the book exercise: it should be `VBN`, not `VN`. Also, this is a challenging exercise. In the code below, I've tried to make it more understandable by using a loop instead of a list comprehension. Feel free to use the starter code or replace it with your own solution.*

In [None]:
'''import nltk
wsj = nltk.corpus.treebank.tagged_words()
cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
participles = list(cfd2['VBN'])
prior_list = [] # list to hold the word-tag pairs that immediately precede items in that list
for w in participles: # for each word in the list
    idx1=wsj.index((w, 'VBN'))  # find the integer location of the word/tag tuple in the corpus
    # prior_wt = wsj[??] retrieve the prior word by replacing ?? with the index of the prior word/tag pair
    # append the word/tag pair to the list declared above
print(prior_list)'''


In [3]:
# using wsj corpus
wsj = nltk.corpus.treebank.tagged_words()
cfd = nltk.ConditionalFreqDist((tag,word) for (word,tag) in wsj)
participles = list(cfd['VBN'])
# wsj.index() returns only the first occurrence of each (word, 'VBN') pair,
# so exactly one preceding pair is collected per participle
prior = []
for w in participles:
    idx1 = wsj.index((w,'VBN'))
    word_tags = wsj[idx1-1]
    prior.append((word_tags, wsj[idx1]))
prior

# Con: this takes a long time to run (about 10 minutes), because every wsj.index() call scans the corpus from the start

[(('was', 'VBD'), ('named', 'VBN')),
 (('once', 'RB'), ('used', 'VBN')),
 (('has', 'VBZ'), ('caused', 'VBN')),
 (('workers', 'NNS'), ('exposed', 'VBN')),
 (('were', 'VBD'), ('reported', 'VBN')),
 (('and', 'CC'), ('replaced', 'VBN')),
 (('were', 'VBD'), ('sold', 'VBN')),
 (('have', 'VBP'), ('died', 'VBN')),
 (('the', 'DT'), ('expected', 'VBN')),
 (('recently', 'RB'), ('diagnosed', 'VBN')),
 (('workers', 'NNS'), ('studied', 'VBN')),
 (('Western', 'JJ'), ('industrialized', 'VBN')),
 (('is', 'VBZ'), ('owned', 'VBN')),
 ((',', ','), ('found', 'VBN')),
 (('are', 'VBP'), ('classified', 'VBN')),
 (('easily', 'RB'), ('rejected', 'VBN')),
 (('be', 'VB'), ('outlawed', 'VBN')),
 (('the', 'DT'), ('imported', 'VBN')),
 (('funds', 'NNS'), ('tracked', 'VBN')),
 (('are', 'VBP'), ('thought', 'VBN')),
 (('are', 'VBP'), ('considered', 'VBN')),
 (('was', 'VBD'), ('elected', 'VBN')),
 ((',', ','), ('based', 'VBN')),
 (("n't", 'RB'), ('lifted', 'VBN')),
 (('is', 'VBZ'), ('ensnarled', 'VBN')),
 (('has', 'VBZ'
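
The loop above calls `wsj.index()` once per participle, and each call rescans the corpus and reports only the first occurrence. A single pass over adjacent pairs avoids both limits. The sketch below is one possible alternative, not the book's solution. The helper name `pairs_before_tag` and the hand-made list `toy_tagged` are hypothetical, chosen so the example runs without downloading the corpus; the same function is expected to work on `nltk.corpus.treebank.tagged_words()`.

```python
def pairs_before_tag(tagged_words, target_tag="VBN"):
    """Collect (previous pair, current pair) wherever the current tag is target_tag."""
    tagged = list(tagged_words)
    result = []
    # zip pairs each token with its successor, so one pass over the data is enough
    for prev, curr in zip(tagged, tagged[1:]):
        if curr[1] == target_tag:
            result.append((prev, curr))
    return result


# Hypothetical stand-in for the tagged corpus
toy_tagged = [
    ("He", "PRP"), ("was", "VBD"), ("named", "VBN"),
    ("and", "CC"), ("has", "VBZ"), ("been", "VBN"), ("elected", "VBN"),
]

for prev, curr in pairs_before_tag(toy_tagged):
    print(prev, curr)
```

Unlike the loop version, this collects every occurrence of a participle, so the result can be much longer than `participles`.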

## Section 3 -  Mapping Words to Properties Using Python Dictionaries

**Exercise:**
Create two dictionaries, d1 and d2, and add some entries to each. Now issue the command `d1.update(d2)`. What did this do? What might it be useful for?

*Be sure to include both the code and answers to both questions.  Add additional cells if necessary*

In [4]:
d1 = dict({1:'This',2:'is',3:'a',4:'sentence'})
d2 = dict({1:'Add',2:'Russian',3:'Collusion'})
d1.update(d2)

print(d1)

{1: 'Add', 2: 'Russian', 3: 'Collusion', 4: 'sentence'}


#### Explanation:
`d1.update(d2)` merges `d2` into `d1` in place. Keys that exist in both dictionaries take the value from `d2`, and keys that exist only in `d2` are added to `d1`. In the example above, keys 1-3 are overwritten and key 4 keeps its original value. This is useful for adding new entries to a dictionary, or for overriding some of its values in bulk, by building a second dictionary with the changes. If d1 has keys 1-4 and d2 has keys 5-8, then after the update d1 holds keys 1-8 and the values accompanying them.
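
A short sketch of both behaviours, overwriting and adding, may make this concrete. The dictionaries `defaults` and `overrides` are hypothetical names used only for illustration.

```python
d1 = {1: "This", 2: "is", 3: "a", 4: "sentence"}
d2 = {1: "Add", 5: "more"}

# update() changes d1 in place and returns None
result = d1.update(d2)
print(result)  # None
print(d1)      # {1: 'Add', 2: 'is', 3: 'a', 4: 'sentence', 5: 'more'}

# To leave the originals untouched, build a new dictionary instead
defaults = {"tagger": "unigram", "backoff": "NN"}
overrides = {"tagger": "bigram"}
merged = {**defaults, **overrides}
print(merged)  # {'tagger': 'bigram', 'backoff': 'NN'}
```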

## Section 4 - Automatic Tagging

**Your Turn:**
Come up with at least two patterns to improve the performance of the regular expression tagger presented in this chapter (and duplicated below). *Note that section 1 of Chapter 6 describes a way to partially automate such work.*

**Your new rules had no impact on tagger performance**

In [2]:
import re
import nltk
from nltk.corpus import brown
patterns = [
        (r'.*ing$', 'VBG'),               # gerunds
        (r'.*ed$', 'VBD'),                # simple past
        (r'.*es$', 'VBZ'),                # 3rd singular present
        (r'.*ould$', 'MD'),               # modals
        (r'.*\'s$', 'NN$'),               # possessive nouns
        (r'.*s$', 'NNS'),                 # plural nouns
        (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
        (r'.*', 'NN'),                    # nouns (default)
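        # Note: RegexpTagger uses the first pattern that matches, so the three rules
        # below, which follow the catch-all '.*', are never reached. The surrounding
        # '/' characters are also matched literally in Python.
        (r'/\b([A-Z][a-z]+)\b/','NNP'),   # proper nouns
        (r'[Tt]o', 'TO'),                 # TO
        (r'/[\W\S]/', 'SYM')]             # symbols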
brown_tagged_sents = brown.tagged_sents(categories='news')
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.evaluate(brown_tagged_sents)

0.20326391789486245
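
One likely reason the added rules changed nothing is pattern order: `nltk.RegexpTagger` applies the patterns in sequence and uses the first one that matches, so a rule placed after `.*` cannot fire. The sketch below imitates that first-match behaviour with plain `re`. The helper `first_match_tag` is hypothetical and is not NLTK code, and the anchored versions of the new rules are assumptions about what was intended.

```python
import re


def first_match_tag(word, patterns):
    """Return the tag of the first pattern that matches from the start of word."""
    for regex, tag in patterns:
        if re.match(regex, word):
            return tag
    return None


# Same ordering problem as in the cell above: new rules come after the catch-all
original_order = [
    (r'.*ing$', 'VBG'),
    (r'.*', 'NN'),                     # matches every word
    (r'/\b([A-Z][a-z]+)\b/', 'NNP'),   # never reached; '/' is a literal character
    (r'[Tt]o', 'TO'),                  # never reached
]

# Specific, anchored rules first; catch-all last
reordered = [
    (r'^[Tt]o$', 'TO'),
    (r'^[A-Z][a-z]+$', 'NNP'),
    (r'.*ing$', 'VBG'),
    (r'.*', 'NN'),
]

for w in ['to', 'Brown', 'running', 'table']:
    print(w, first_match_tag(w, original_order), first_match_tag(w, reordered))
```

Whether the reordered rules raise the score on the Brown news data is not tested here. The Brown corpus also uses its own tagset (for example `NP` rather than `NNP` for proper nouns), so the tag names would need to match the corpus for a rule to count as correct.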

## Section 5  -  N-Gram Tagging

**Exercise:**
Train a unigram tagger and run it on some new text. Observe that some words are not assigned a tag. Why not?


In [7]:
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.tag(brown_sents[2007])
unigram_tagger.evaluate(brown_tagged_sents)

size = int(len(brown_tagged_sents)*0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)

# input humor category and attempt to test on news category. 
brown_tagged_sents2 = brown.tagged_sents(categories='humor')
brown_sents2 = brown.sents(categories='humor')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents2)
unigram_tagger.tag(brown_sents[2007])

[('Various', 'JJ'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('apartments', 'NNS'),
 ('are', 'BER'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('terrace', 'NN'),
 ('type', 'NN'),
 (',', ','),
 ('being', 'BEG'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('ground', 'NN'),
 ('floor', 'NN'),
 ('so', 'QL'),
 ('that', 'CS'),
 ('entrance', 'NN'),
 ('is', 'BEZ'),
 ('direct', 'JJ'),
 ('.', '.')]

0.9349006503968017

0.8121200039868434

[('Various', None),
 ('of', 'IN'),
 ('the', 'AT'),
 ('apartments', None),
 ('are', 'BER'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('terrace', None),
 ('type', 'NN'),
 (',', ','),
 ('being', 'BEG'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('ground', 'NN'),
 ('floor', 'NN'),
 ('so', 'QL'),
 ('that', 'CS'),
 ('entrance', None),
 ('is', 'BEZ'),
 ('direct', 'JJ'),
 ('.', '.')]

#### Conclusion:
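Training the unigram tagger on the humor category and running it on a sentence from the news category leaves some words without a tag (`None`). A unigram tagger only looks up the tag each word carried most often in the training data. Words that never appeared in training, here 'Various', 'apartments', 'terrace' and 'entrance', have no entry, and since no backoff tagger was given, the tagger assigns `None` to them.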
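
The lookup behaviour can be imitated in a few lines. This is a simplified stand-in for `nltk.UnigramTagger`, not its actual implementation; `train_unigram`, `tag_with` and `toy_train` are hypothetical names.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_with(model, words):
    # dict.get returns None for words that never appeared in training
    return [(w, model.get(w)) for w in words]


toy_train = [[("the", "AT"), ("floor", "NN"), ("is", "BEZ"), ("direct", "JJ")]]
model = train_unigram(toy_train)
print(tag_with(model, ["the", "terrace", "is", "direct"]))
# [('the', 'AT'), ('terrace', None), ('is', 'BEZ'), ('direct', 'JJ')]
```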

In [None]:
# testing pos tagger - unrelated
from nltk import word_tokenize, pos_tag
text = "The development of the T-72 was a direct result of the introduction of the T-64 tank. The T-64 (Object 432) was a very ambitious project to build a competitive well-armoured tank with a weight of not more than 36 tons. Under the direction of Alexander Morozov in Kharkiv a new design emerged with the hull reduced to the minimum size possible. To do this, the crew was reduced to three soldiers, removing the loader by introducing an automated loading system"
print(pos_tag(word_tokenize(text)))

**Your Turn:**
Extend the example below by defining a TrigramTagger called t3, which backs off to t2.

In [9]:
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.evaluate(test_sents)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)
t3.evaluate(test_sents)

0.8452108043456593

0.843317053722715

**Exercise:**
Preprocess the Brown News data by replacing low frequency words with UNK, but leaving the tags untouched. Now train and evaluate a bigram tagger on this data. How much does this help? What is the contribution of the unigram tagger and default tagger now?

In [6]:
from nltk.corpus import brown
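def modify_words(corpus, a):
    # Rebuild each sentence, applying a(word, tag) to every token.
    # Caution: the result of a() is stored in the tag position, so the words stay
    # unchanged and the tags are replaced. The exercise asks for the opposite,
    # i.e. (a(w, t), t).
    nc = []
    for sent in corpus:
        nc.append([(w,a(w,t)) for (w,t) in sent])
    return nc
news_sents = brown.tagged_sents(categories = 'news')
# Frequencies are counted over the whole Brown corpus, not only the news category
freqD = nltk.FreqDist(brown.words())
# The 100 most frequent words
most_common = [word for (word, __) in freqD.most_common(100)]

def unknown(word,tag):
    # Keep frequent words; map everything else to 'UNK'
    if word in most_common:
        return word
    else:
        return 'UNK'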
news_sents = modify_words(news_sents, unknown)

size = int(len(news_sents)*0.9)
train = news_sents[:size]
test = news_sents[size:]

defTag = nltk.DefaultTagger('NN')
UniTag = nltk.UnigramTagger(train, backoff=defTag)
UniTag2 = nltk.UnigramTagger(train)
BiTag = nltk.BigramTagger(train, backoff = UniTag)
BiTag2 = nltk.BigramTagger(train)

print('Default tagger performance, alone:', defTag.evaluate(test))
print('Unigram tagger, no backoff:', UniTag2.evaluate(test))
print('Unigram tagger, with backoff to default tagger:', UniTag.evaluate(test))
print('Bigram tagger, no backoff:', BiTag2.evaluate(test))
print('Bigram tagger, with backoff:', BiTag.evaluate(test))

Default tagger performance, alone: 0.0
Unigram tagger, no backoff: 0.8857769361108343
Unigram tagger, with backoff to default tagger: 0.8857769361108343
Bigram tagger, no backoff: 0.11412339280374763
Bigram tagger, with backoff: 0.8857769361108343


### Analysis:
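The bigram tagger on its own scores about 11%, while the bigram tagger with backoff scores about 89%, exactly the same as the unigram tagger. Almost all of the performance therefore comes from the unigram tagger; in this run the bigram tagger adds nothing on top of it. The default tagger also contributes nothing here: it scores 0.0 alone, and the unigram score is identical with and without it. That is because the preprocessing above stored the word-or-`UNK` value in the tag position, so the reference tags are never `NN`.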

Note: Reference and credit for modify_words function: [GitHub](https://github.com/walshbr/nltk/blob/db88a9887751fd65333472a3a15519862a066a08/ch_five/modify.py#L8)
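
The exercise asks for rare words to be replaced while the tags stay untouched. A minimal sketch of that preprocessing step follows. The function `replace_rare_words`, the threshold `min_count` and the data `toy_sents` are all assumptions made for illustration; the exercise does not say what counts as "low frequency".

```python
from collections import Counter


def replace_rare_words(tagged_sents, min_count=2, unk="UNK"):
    """Replace words seen fewer than min_count times with unk; tags are kept."""
    freq = Counter(w for sent in tagged_sents for (w, _) in sent)
    return [
        [(w if freq[w] >= min_count else unk, t) for (w, t) in sent]
        for sent in tagged_sents
    ]


# Hypothetical toy data in the shape of brown.tagged_sents()
toy_sents = [
    [("the", "AT"), ("jury", "NN"), ("said", "VBD")],
    [("the", "AT"), ("mayor", "NN"), ("said", "VBD")],
]
print(replace_rare_words(toy_sents))
# [[('the', 'AT'), ('UNK', 'NN'), ('said', 'VBD')], [('the', 'AT'), ('UNK', 'NN'), ('said', 'VBD')]]
```

With data in this form the default tagger's `NN` can match reference tags again, so its contribution could be measured separately; no scores are claimed here because the sketch was not run on the Brown data.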

## Section 6 - Transformation-Based Tagging

**Exercise:**
Run the Brill Tagger demo using the code below.  Select 5 of the useful rules and rewrite them using an English sentence.  For example, if the rule is `NN->VB if Pos:TO@[-1]` the related sentence would be something like, change a noun to a verb if the preceding word is tagged TO.

In [29]:
from nltk.tbl import demo as brill_tagger
brill_tagger.demo()

Loading tagged data from treebank... 
Read testing data (200 sents/5251 wds)
Read training data (800 sents/19933 wds)
Read baseline data (800 sents/19933 wds) [reused the training set]
Trained baseline tagger
    Accuracy on test set: 0.8366
Training tbl tagger...
TBL train (fast) (seqs: 800; tokens: 19933; tpls: 24; min score: 3; min acc: None)
Finding initial useful rules...
    Found 12799 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  23  23   0   0  | POS->VBZ if Pos:PRP@[-2,-1]
  18  19   1   0  | NN->VB if Pos:-NONE-@[-2] & Pos:TO@[-1]
  14  14   0   0  | VBP->VB if Pos:MD@[-2,-1]
  12  12   0   0  | VBP->VB if Pos:TO@[-1]
  

### Solution here.
- Change a noun (NN) to a base-form verb (VB) if the preceding word is tagged MD (modal).
- Change a noun (NN) to a base-form verb (VB) if the word two positions back is tagged -NONE- and the preceding word is tagged TO.
- Change a possessive ending (POS) to a third-person singular present verb (VBZ) if either of the two preceding words is a personal pronoun (PRP).
- Change a non-third-person present verb (VBP) to a base-form verb (VB) if the preceding word is tagged TO.
- Change a past-tense verb (VBD) to a past participle (VBN) if the preceding word is tagged VBD (past-tense verb).
- Change a preposition (IN) to a wh-determiner (WDT) if the preceding word is tagged -NONE- and the word two positions ahead is a non-third-person present verb (VBP).
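
To show what applying one such rule means, here is a small sketch of `NN->VB if Pos:TO@[-1]` on a hand-tagged sentence. The function `apply_rule` and the list `baseline` are hypothetical; this is an illustration of the idea, not the code NLTK's Brill tagger uses.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Apply 'from_tag->to_tag if Pos:prev_tag@[-1]' to a list of (word, tag) pairs."""
    result = list(tagged)
    for i in range(1, len(tagged)):
        word, tag = tagged[i]
        # The condition is checked against the tags as they were before this rule ran
        if tag == from_tag and tagged[i - 1][1] == prev_tag:
            result[i] = (word, to_tag)
    return result


# Hypothetical baseline tagging in which 'garage' was wrongly tagged as a noun
baseline = [("I", "PRP"), ("want", "VBP"), ("to", "TO"),
            ("garage", "NN"), ("the", "DT"), ("car", "NN")]
print(apply_rule(baseline, "NN", "VB", "TO"))
# [('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('garage', 'VB'), ('the', 'DT'), ('car', 'NN')]
```

Only 'garage' changes, because it is the only NN directly preceded by a TO tag; 'car' follows a DT and is left alone.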