# Introduction

In this exercise, we will:
- Run a POS tagger in Python
    - On a sentence longer than 10 words
    - On a sentence shorter than 10 words
    - Explain why the tagger may have been less than perfect with this sentence.
- Run a different POS tagger in Python to tag the above 2 sentences
    - Check if it provides different output
    - Explain any difference in the outputs
- Take a random sentence from a news article
    - Using the Penn tag set, manually POS tag the sentence yourself
    - Run the same sentence through both taggers implemented in the previous questions and check whether the results differ.
    - Explain the differences between the 2 taggers and the manual tagging.

#### Preparation Steps

- Import the necessary packages
- Select lines from a 5th grade text (McGuffey's Fifth Eclectic Reader) rather than made-up sentences.
    - Sentence 1 is longer than 10 words
    - Sentence 2 is shorter than 10 words

In [1]:
from IPython.display import Image
import nltk, re, pprint
import pandas as pd
import numpy as np
from urllib import request
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.preprocessing import minmax_scale
from nltk.stem import WordNetLemmatizer

In [2]:
url_grade5_text = "http://www.gutenberg.org/cache/epub/15040/pg15040.txt"
grade5_text_url_open = request.urlopen(url_grade5_text)
text_grade5_raw = grade5_text_url_open.read().decode('utf-8-sig')
text_grade5_raw_start = text_grade5_raw.find("McGuffey\'s Fifth Reader")
text_grade5_raw_end = text_grade5_raw.rfind("End of the Project Gutenberg EBook of McGuffey\'s Fifth Eclectic Reader")

In [3]:
text_sub = text_grade5_raw[text_grade5_raw_start:text_grade5_raw_end]
text_sent_tokenize = nltk.sent_tokenize(text_sub)
text_sent_tokenize[1:20]

['THE GOOD READER.',
 '1.',
 'It is told of Frederick the Great, King of Prussia, that, as he was\r\nseated one day in his private room, a written petition was brought to him\r\nwith the request that it should be immediately read.',
 'The King had just\r\nreturned from hunting, and the glare of the sun, or some other cause, had\r\nso dazzled his eyes that he found it difficult to make out a single word\r\nof the writing.',
 '2.',
 'His private secretary happened to be absent; and the soldier who\r\nbrought the petition could not read.',
 'There was a page, or favorite boy\r\nservant, waiting in the hall, and upon him the King called.',
 'The page was a\r\nson of one of the noblemen of the court, but proved to be a very poor\r\nreader.',
 '3.',
 'In the first place, he did not articulate distinctly.',
 'He huddled his\r\nwords together in the utterance, as if they were syllables of one long\r\nword, which he must get through with as speedily as possible.',
 'His\r\npronunciation was bad

In [4]:
sent_1 = "Is it an auctioneer's list of goods to be sold that you are hurrying over?"
sent_1_word_tokenize = word_tokenize(sent_1)

In [5]:
sent_2 = "Send your companion to me."
sent_2_word_tokenize = word_tokenize(sent_2)

### 1. POS tagging using Python (Unigram Tagger)

A unigram tagger uses only the word itself, with no surrounding context, to determine its POS tag. Other N-gram taggers, such as bigram and trigram taggers, also take the tags of the preceding one or two tokens into account.
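
To make the idea concrete, here is a minimal pure-Python sketch of what a unigram tagger learns: for each word, the tag it carried most often in the training data. The toy training set and the helper names `train_unigram` and `tag_tokens` are assumptions made for illustration; they are not part of NLTK and this is not how `UnigramTagger` is implemented internally.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_tokens(model, tokens):
    """Look each token up on its own; words never seen in training get None."""
    return [(tok, model.get(tok)) for tok in tokens]

# Hypothetical toy training data, for illustration only.
toy_train = [
    [("I", "PRP"), ("run", "VBP"), ("daily", "RB")],
    [("They", "PRP"), ("run", "VBP"), ("fast", "RB")],
    [("a", "DT"), ("run", "NN")],
]
toy_model = train_unigram(toy_train)
toy_tagged = tag_tokens(toy_model, ["a", "run", "tomorrow"])
print(toy_tagged)
```

In this toy example "run" is tagged VBP even after "a", because the lookup ignores context, and "tomorrow" gets None because it never appeared in the training data.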

In [6]:
from nltk.tag import UnigramTagger
from nltk.corpus import treebank

In [7]:
train_sents = treebank.tagged_sents()[:10000]
tagger = UnigramTagger(train_sents)

In [8]:
tagger.tag(sent_1_word_tokenize)

[('Is', 'VBZ'),
 ('it', 'PRP'),
 ('an', 'DT'),
 ('auctioneer', None),
 ("'s", 'POS'),
 ('list', 'NN'),
 ('of', 'IN'),
 ('goods', 'NNS'),
 ('to', 'TO'),
 ('be', 'VB'),
 ('sold', 'VBN'),
 ('that', 'IN'),
 ('you', 'PRP'),
 ('are', 'VBP'),
 ('hurrying', None),
 ('over', 'IN'),
 ('?', '.')]

#### 1a. Interpretation

The words "auctioneer" and "hurrying" were assigned a tag of None because they do not occur in the treebank sentences used to train the tagger. Note that the slice `[:10000]` selects up to the first 10000 tagged sentences, not words. However, the other words have been tagged appropriately.
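
A quick way to spot the untagged words without reading the output by eye is to filter for `None`. The sketch below hard-codes the output shown above so that it runs on its own; the variable names `sent_1_unigram`, `untagged`, and `coverage` are assumptions. In the notebook, `sent_1_unigram` could instead be assigned from `tagger.tag(sent_1_word_tokenize)`.

```python
# Output of the Unigram tagger for sentence 1, copied from the cell above.
sent_1_unigram = [
    ('Is', 'VBZ'), ('it', 'PRP'), ('an', 'DT'), ('auctioneer', None),
    ("'s", 'POS'), ('list', 'NN'), ('of', 'IN'), ('goods', 'NNS'),
    ('to', 'TO'), ('be', 'VB'), ('sold', 'VBN'), ('that', 'IN'),
    ('you', 'PRP'), ('are', 'VBP'), ('hurrying', None), ('over', 'IN'),
    ('?', '.'),
]

# Words the tagger could not tag, and the share of tokens that did get a tag.
untagged = [word for word, tag in sent_1_unigram if tag is None]
coverage = 1 - len(untagged) / len(sent_1_unigram)

print(untagged)
print(f"{coverage:.0%} of tokens received a tag")
```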

In [9]:
tagger.tag(sent_2_word_tokenize)

[('Send', 'VB'),
 ('your', 'PRP$'),
 ('companion', 'NN'),
 ('to', 'TO'),
 ('me', 'PRP'),
 ('.', '.')]

#### 1b. Interpretation

All the words were assigned a tag because each of them occurs in the training sentences.

### 2. POS tagging using Python (nltk.pos_tag Tagger)

In [10]:
nltk.pos_tag(sent_1_word_tokenize)

[('Is', 'VBZ'),
 ('it', 'PRP'),
 ('an', 'DT'),
 ('auctioneer', 'NN'),
 ("'s", 'POS'),
 ('list', 'NN'),
 ('of', 'IN'),
 ('goods', 'NNS'),
 ('to', 'TO'),
 ('be', 'VB'),
 ('sold', 'VBN'),
 ('that', 'IN'),
 ('you', 'PRP'),
 ('are', 'VBP'),
 ('hurrying', 'VBG'),
 ('over', 'IN'),
 ('?', '.')]

In [11]:
nltk.pos_tag(sent_2_word_tokenize)

[('Send', 'VB'),
 ('your', 'PRP$'),
 ('companion', 'NN'),
 ('to', 'TO'),
 ('me', 'PRP'),
 ('.', '.')]

#### 2a. Compare the outputs

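The nltk.pos_tag function has tagged all the words in both sentences, including "auctioneer" (NN) and "hurrying" (VBG), which the Unigram tagger left untagged. For sentence 2, both taggers produced identical output.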
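
The comparison can also be done programmatically. The following is a small sketch, assuming a hypothetical helper `diff_tags` that lines up two tag sequences for the same tokens and reports the positions where they disagree. The tags are copied from the two outputs for sentence 1 above so that the block runs on its own.

```python
def diff_tags(tokens, tags_a, tags_b):
    """Return (token, tag_a, tag_b) for every position where the tags differ."""
    if not (len(tokens) == len(tags_a) == len(tags_b)):
        raise ValueError("tokens and both tag lists must have the same length")
    return [(tok, a, b) for tok, a, b in zip(tokens, tags_a, tags_b) if a != b]

sent_1_tokens = ["Is", "it", "an", "auctioneer", "'s", "list", "of", "goods", "to",
                 "be", "sold", "that", "you", "are", "hurrying", "over", "?"]
# Tags copied from the Unigram tagger output (In [8]).
unigram_tags = ["VBZ", "PRP", "DT", None, "POS", "NN", "IN", "NNS", "TO",
                "VB", "VBN", "IN", "PRP", "VBP", None, "IN", "."]
# Tags copied from the nltk.pos_tag output (In [10]).
pos_tag_tags = ["VBZ", "PRP", "DT", "NN", "POS", "NN", "IN", "NNS", "TO",
                "VB", "VBN", "IN", "PRP", "VBP", "VBG", "IN", "."]

differences = diff_tags(sent_1_tokens, unigram_tags, pos_tag_tags)
print(differences)
```

In the notebook, the tag lists could be built from live output instead, for example with `[tag for _, tag in tagger.tag(sent_1_word_tokenize)]`.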

#### 2b. Explaining the differences

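nltk.pos_tag uses a pre-trained PerceptronTagger model, whereas the Unigram tagger comes with no pre-trained model. We trained the Unigram tagger on up to the first 10000 sentences in treebank.tagged_sents(). Any word in our sentences that does not occur in those training sentences gets tagged as None, because the Unigram tagger as configured here has no fallback for unseen words. The perceptron model, in contrast, also uses features such as word suffixes and neighboring words, so it can still predict a tag for a word it has never seen.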
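
One common way to avoid None tags is to give the Unigram tagger a backoff tagger. The sketch below is an illustration under stated assumptions rather than part of the original exercise: it trains on a tiny hand-written sentence (`toy_train`, hypothetical) instead of the treebank so that it needs no corpus download, and it falls back to a `DefaultTagger` that labels every unseen word NN.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical toy training data, standing in for treebank.tagged_sents().
toy_train = [[("Send", "VB"), ("your", "PRP$"), ("companion", "NN"),
              ("to", "TO"), ("me", "PRP"), (".", ".")]]

# Same training data, with and without a fallback for unseen words.
plain_tagger = UnigramTagger(toy_train)
backoff_tagger = UnigramTagger(toy_train, backoff=DefaultTagger("NN"))

tokens = ["Send", "your", "auctioneer", "to", "me", "."]
plain_result = plain_tagger.tag(tokens)      # 'auctioneer' is unseen
backoff_result = backoff_tagger.tag(tokens)  # 'auctioneer' falls back to 'NN'
print(plain_result)
print(backoff_result)
```

A blanket NN default happens to suit "auctioneer" but would be wrong for "hurrying", so this removes the None tags without guaranteeing correct ones.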

### 3. POS Tag on this week's news article

We take a random sentence from cnn.com (10 words, or 13 tokens once punctuation is counted). The sentence used for tagging is:
- Anyone, including WH aides, could be fired depending on coverage.

#### 3a. Manual Tagging using Penn Treebank

The first step was to tag the sentence manually using the Penn Treebank tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

The manual POS tags are listed below.

| Word | Tag | Description |
| --- | --- | --- |
| Anyone | NN | Noun, singular or mass |
| , | , | , |
| including | VBG | Verb, gerund or present participle |
| WH | NNP | Proper noun, singular |
| aides | NNS | Noun, plural |
| , | , | , |
| could | MD | Modal |
| be | VB | Verb, base form |
| fired | VBN | Verb, past participle |
| depending | VBG | Verb, gerund or present participle |
| on | IN | Preposition or subordinating conjunction |
| coverage | NN | Noun, singular or mass |
| . | . | . |

In [12]:
sent_news = "Anyone, including WH aides, could be fired depending on coverage."
sent_news_word_tokenize = word_tokenize(sent_news)

In [13]:
tagger.tag(sent_news_word_tokenize)

[('Anyone', None),
 (',', ','),
 ('including', 'VBG'),
 ('WH', None),
 ('aides', 'NNS'),
 (',', ','),
 ('could', 'MD'),
 ('be', 'VB'),
 ('fired', 'VBD'),
 ('depending', 'VBG'),
 ('on', 'IN'),
 ('coverage', 'NN'),
 ('.', '.')]

In [14]:
nltk.pos_tag(sent_news_word_tokenize)

[('Anyone', 'NN'),
 (',', ','),
 ('including', 'VBG'),
 ('WH', 'NNP'),
 ('aides', 'NNS'),
 (',', ','),
 ('could', 'MD'),
 ('be', 'VB'),
 ('fired', 'VBN'),
 ('depending', 'VBG'),
 ('on', 'IN'),
 ('coverage', 'NN'),
 ('.', '.')]

#### 3b. Compare the outputs

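nltk.pos_tag produced the same tags as the manual Penn Treebank tagging. The Unigram tagger differed on three tokens: "Anyone" and "WH" were left untagged (None), and "fired" was tagged VBD instead of VBN.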

#### 3c. Explain differences
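Similar to the first question, some of the words were not tagged by the Unigram tagger. This can be attributed to those words not showing up in the training set. The VBD tag (past tense) on "fired", where the manual tagging has VBN (past participle), has a different cause: the Unigram tagger ignores context, so it assigns each word the tag it most often carried in training and cannot use the preceding "could be" to recognize the passive construction.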
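
To put a number on the comparison, each tagger can be scored against the manual tags. This is a minimal sketch that treats the manual tagging as the reference; the helper name `accuracy` and the variable names are assumptions, and the tag lists are copied from the table and outputs above.

```python
news_tokens = ["Anyone", ",", "including", "WH", "aides", ",", "could", "be",
               "fired", "depending", "on", "coverage", "."]
manual_tags = ["NN", ",", "VBG", "NNP", "NNS", ",", "MD", "VB",
               "VBN", "VBG", "IN", "NN", "."]
unigram_news_tags = [None, ",", "VBG", None, "NNS", ",", "MD", "VB",
                     "VBD", "VBG", "IN", "NN", "."]
pos_tag_news_tags = ["NN", ",", "VBG", "NNP", "NNS", ",", "MD", "VB",
                     "VBN", "VBG", "IN", "NN", "."]

def accuracy(predicted, reference):
    """Fraction of positions where the predicted tag equals the reference tag."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)

unigram_acc = accuracy(unigram_news_tags, manual_tags)
pos_tag_acc = accuracy(pos_tag_news_tags, manual_tags)

# Tokens where the Unigram tagger disagrees with the manual tagging.
unigram_misses = [(tok, ref, pred) for tok, ref, pred
                  in zip(news_tokens, manual_tags, unigram_news_tags) if ref != pred]

print(f"Unigram: {unigram_acc:.1%}, nltk.pos_tag: {pos_tag_acc:.1%}")
print(unigram_misses)
```

With a single 13-token sentence these figures are only indicative; a fair comparison of the two taggers would need a much larger hand-tagged sample.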

### Conclusion

In this homework, we started by using the Unigram and nltk.pos_tag taggers in Python to tag a longer and a shorter sentence. We compared the output of the 2 taggers and identified untagged words. The words weren't tagged because they did not occur in the training set of the Unigram tagger.

We then took a random sentence from this week's news articles and tagged it manually. We compared the manual tags with the output of the 2 taggers used in questions 1 and 2 and explained the differences.