# Part of Speech (POS) Tagging #

In our previous class we learned about the part-of-speech (POS) tagging problem, which we formulated as a supervised learning problem. We have a set of training examples $(x_1,y_1),(x_2,y_2),(x_3,y_3), ..., (x_n,y_n)$ where each example is a pair of an input $x_i$ and a label $y_i$. In POS tagging, the task is to learn a function $f$ that maps sentences to tag sequences, $f(x)=y$.  
We defined $f(x)$ through a generative model, in which we model the joint probability:  

$$ \Large p(x,y)=p(y)p(x|y)$$  

We use the training examples $(x_1,y_1),(x_2,y_2),(x_3,y_3), ..., (x_n,y_n)$ to estimate the model parameters. In this model the two components are defined as:  

* $ \Large p(y) $ - prior probability of the labels
* $ \Large p(x|y) $ - probability of generating the word $x$ given that the POS tag is $y$

Model parameters are typically estimated using maximum likelihood.  
Given the generative model we use Bayes' rule to assign tags to test sentences:  

$$ \Large f(x) = arg\: max_{y}p(y|x)\\
            \Large \qquad\qquad\qquad=arg\: max_{y}\frac{p(y)p(x|y)}{p(x)}\\
            \Large \qquad\qquad\qquad=arg\: max_{y}p(y)p(x|y)\\
            $$
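The decision rule above can be sketched in a few lines of Python. This is a minimal illustration for a single word, not the tagger used later in the lab: the tags, words and probability values are made up, and `predict` is a hypothetical helper name.

```python
# Toy generative classifier: choose the tag y that maximizes p(y) * p(x|y).
# All probabilities are invented illustration values, not estimates from data.
prior = {"NN": 0.6, "VB": 0.4}            # p(y)
emission = {                               # p(x|y)
    "NN": {"case": 0.05, "fits": 0.001},
    "VB": {"case": 0.002, "fits": 0.03},
}

def predict(word):
    """Return arg max_y p(y) * p(word|y).

    p(x) does not depend on y, so it can be dropped from the arg max.
    """
    return max(prior, key=lambda y: prior[y] * emission[y].get(word, 0.0))

print(predict("case"))
print(predict("fits"))
```
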
We covered in detail the trigram Hidden Markov Model (HMM) for POS tagging. This model has the following two types of parameters:
* $ \Large q(t_j|t_{j-2},t_{j-1}) $ - probability of seeing tag $t_j$ immediately after tags $t_{j-2}$ and $t_{j-1}$
* $ \Large q(w_j|t_j) $ - probability of generating the word $w_j$ given that the POS tag is $t_j$

Given a sentence $S$ with words $w_1,w_2,w_3,...,w_n$, the joint probability of the sentence and a tag sequence $t_1,t_2,t_3,...,t_n$, with $t_{-1}=t_0=*$ as start symbols and $t_{n+1}=STOP$ as the end marker, is defined as:
$$ \Large p(w_1,...,w_n,t_1,...,t_{n+1})=\prod_{j=1}^{n+1}q(t_j|t_{j-2},t_{j-1})\prod_{j=1}^{n}q(w_j|t_j)
            $$
The maximum likelihood estimates of the model parameters are computed from counts: the number of times a sequence of three tags, $count(t_{j-2},t_{j-1},t_j)$, and a sequence of two tags, $count(t_{j-2},t_{j-1})$, are seen, as well as the number of times tag $t_j$ is seen alone, $count(t_j)$, and paired with word $w_j$, $count(w_j,t_j)$:
$$ \Large q(t_j|t_{j-2},t_{j-1}) =\frac{count(t_{j-2},t_{j-1},t_j)}{count(t_{j-2},t_{j-1})}$$  
$$ \Large q(w_j|t_j)=\frac{count(w_j,t_j)}{count(t_j)}$$ 
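
As a rough illustration of these count-based estimates, the sketch below computes the transition estimate as trigram count over context bigram count, and the emission estimate as word-tag count over tag count. The three-sentence corpus is hand-made toy data, and the names `q_transition` and `q_emission` are hypothetical helpers, not NLTK functions.

```python
from collections import Counter

# Tiny hand-tagged corpus (hypothetical data, for illustration only).
corpus = [
    [("the", "DT"), ("case", "NN"), ("fits", "VBZ")],
    [("the", "DT"), ("phone", "NN"), ("fits", "VBZ")],
    [("the", "DT"), ("case", "NN"), ("cracked", "VBD")],
]

trigram_counts, context_counts = Counter(), Counter()
emission_counts, tag_counts = Counter(), Counter()

for sentence in corpus:
    # Pad with two start symbols "*" and one end symbol "STOP".
    tags = ["*", "*"] + [tag for _, tag in sentence] + ["STOP"]
    for j in range(2, len(tags)):
        trigram_counts[(tags[j - 2], tags[j - 1], tags[j])] += 1
        context_counts[(tags[j - 2], tags[j - 1])] += 1
    for word, tag in sentence:
        emission_counts[(word, tag)] += 1
        tag_counts[tag] += 1

def q_transition(t, t_prev2, t_prev1):
    """count(t_prev2, t_prev1, t) / count(t_prev2, t_prev1); 0.0 for unseen contexts."""
    context = context_counts[(t_prev2, t_prev1)]
    return trigram_counts[(t_prev2, t_prev1, t)] / context if context else 0.0

def q_emission(word, tag):
    """count(word, tag) / count(tag); 0.0 for unseen tags."""
    return emission_counts[(word, tag)] / tag_counts[tag] if tag_counts[tag] else 0.0

print(q_transition("VBZ", "DT", "NN"))  # two of the three (DT, NN) contexts are followed by VBZ
print(q_emission("case", "NN"))         # two of the three NN tokens are "case"
```

Unsmoothed estimates like these assign probability zero to anything unseen in training, which is one reason practical taggers add smoothing or handle rare words separately.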

Finally, we covered the Viterbi algorithm, which finds the most likely tag sequence for an input sentence $w_1,...,w_n$.  
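
For reference, here is a compact sketch of Viterbi decoding for a trigram HMM. It assumes the parameters are given as plain dictionaries (`trans` keyed by tag triples, `emit` keyed by word-tag pairs); those names, the toy tag set and the probability values are illustrative assumptions, and the sketch works with raw probabilities rather than log probabilities for readability.

```python
def viterbi(words, tags, trans, emit):
    """Return the most likely tag sequence for `words` under a trigram HMM.

    trans[(t_prev2, t_prev1, t)] is q(t | t_prev2, t_prev1);
    emit[(word, t)] is q(word | t). Missing entries count as probability 0.
    """
    n = len(words)
    if n == 0:
        return []

    def allowed(k):
        # Positions 0 and -1 hold the start symbol "*"; real tags begin at 1.
        return ["*"] if k <= 0 else tags

    pi = {(0, "*", "*"): 1.0}   # pi[(k, u, v)]: best score ending in tags u, v at k
    backptr = {}
    for k in range(1, n + 1):
        for u in allowed(k - 1):
            for v in allowed(k):
                best_w, best_p = None, -1.0
                for w in allowed(k - 2):
                    p = (pi[(k - 1, w, u)]
                         * trans.get((w, u, v), 0.0)
                         * emit.get((words[k - 1], v), 0.0))
                    if p > best_p:
                        best_w, best_p = w, p
                pi[(k, u, v)] = best_p
                backptr[(k, u, v)] = best_w

    # Pick the best final tag pair, including the transition to STOP.
    best_u, best_v, best_p = None, None, -1.0
    for u in allowed(n - 1):
        for v in allowed(n):
            p = pi[(n, u, v)] * trans.get((u, v, "STOP"), 0.0)
            if p > best_p:
                best_u, best_v, best_p = u, v, p

    # Follow the back-pointers from the end of the sentence (y is 1-indexed).
    y = [None] * (n + 1)
    y[n - 1], y[n] = best_u, best_v
    for k in range(n - 2, 0, -1):
        y[k] = backptr[(k + 2, y[k + 1], y[k + 2])]
    return y[1:]

# Toy parameters (invented values): "fits" is ambiguous between NN and VBZ.
tags = ["DT", "NN", "VBZ"]
trans = {("*", "*", "DT"): 1.0, ("*", "DT", "NN"): 1.0,
         ("DT", "NN", "VBZ"): 0.7, ("DT", "NN", "NN"): 0.3,
         ("NN", "VBZ", "STOP"): 1.0, ("NN", "NN", "STOP"): 1.0}
emit = {("the", "DT"): 1.0, ("case", "NN"): 0.5,
        ("fits", "VBZ"): 0.5, ("fits", "NN"): 0.1}

print(viterbi(["the", "case", "fits"], tags, trans, emit))
```
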


In this lab session we are going to use the NLTK POS tagger to label sentences in our collection of Amazon reviews.  

First we are going to download several NLTK POS taggers. 

In [1]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_treebank_pos_tagger')
nltk.download('hmm_treebank_pos_tagger')
nltk.download('tagsets')

[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify
[nltk_data]     failed (_ssl.c:720)>
[nltk_data] Error loading maxent_treebank_pos_tagger: <urlopen error
[nltk_data]     [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify
[nltk_data]     failed (_ssl.c:720)>
[nltk_data] Error loading hmm_treebank_pos_tagger: <urlopen error
[nltk_data]     [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify
[nltk_data]     failed (_ssl.c:720)>
[nltk_data] Error loading tagsets: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed
[nltk_data]     (_ssl.c:720)>


False

Let's load the Amazon product reviews data. As a reminder, the reviews data is semi-structured and stored in JSON format. Below is a preview of the data, showing the entry for one review:  
```json
{
  "reviewerID": "A3HVRXV0LVJN7",
  "asin": "0110400550",
  "reviewerName": "BiancaNicole",
  "helpful": [
    4,
    4
  ],
  "reviewText": "Best phone case ever . Everywhere I go I get a ton of compliments on it. It was in perfect condition as well.",
  "overall": 5.0,
  "summary": "A++++",
  "unixReviewTime": 1358035200,
  "reviewTime": "01 13, 2013"
}
```
This dataset comes with a set of Python functions that help us convert the reviews from JSON format to pandas DataFrames. 

In [2]:
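import ast
import gzip

import pandas as pd

def parse(path):
  # Yield one review (a dict) per line of the gzipped file.
  with gzip.open(path, 'rt', encoding='utf-8') as g:
    for l in g:
      # literal_eval accepts only Python literals, unlike eval, which runs arbitrary code.
      yield ast.literal_eval(l)

def getDF(path):
  # Collect the reviews into a dict keyed by row number, then build the DataFrame.
  df = {}
  for i, d in enumerate(parse(path)):
    df[i] = d
  return pd.DataFrame.from_dict(df, orient='index')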

With these helper functions we'll extract the "reviewText" field from each review:

In [3]:
import pandas as pd
import gzip

review_file = "../../../data/amazon_reviews/cp/reviews_Cell_Phones_and_Accessories_h1k.json.gz"

df = getDF(review_file)
print (df['reviewText'])

0      The case pictured is a soft violet color, but ...
1      Saw this same case at a theme park store for 2...
2      case fits perfectly and I always gets complime...
3      Best phone case ever . Everywhere I go I get a...
4      It may look cute. This case started off pretty...
5      ITEM NOT SENT from Blue Top Company in Hong Ko...
6      this is a cute case, I bought it for my wife b...
7      it came in ok but there was a crack on the lef...
8      The case is good, but the two pieces do not fi...
9      I got this for my 14 year old sister.  She lov...
10     The case is super cute, durable, and a convers...
11     I had to super glue the two parts together bec...
12     This case is extremely durabl;e. I've dropped ...
13     I ordered this as a birthday present for my si...
14     I ordered this and received it within two week...
15     I like the case for its colors, but the lower ...
16     As excited as I was to purchase this item, the...
17     I got the case very quic

Now that we've extracted the reviews we'll proceed by tokenizing them. In this next step we'll perform the following:  
* Extract sentences
* Tokenize words
* Remove stopwords
* Remove punctuation marks

We'll feed each sentence through the NLTK POS tagger using the __nltk.pos_tag__ method. This method takes a tokenized sentence as input and returns a list of tuples, each pairing an original word with its assigned POS tag. The code below tags every sentence twice: once with the stop words removed (`pos_sentences`) and once with the stop words kept (`pos_sentences_wsw`).

In [48]:
import nltk

# English stop words (requires the NLTK 'stopwords' corpus):
stopwords_list = nltk.corpus.stopwords.words('english')
# Tokenized sentences (stop words and non-alphabetic tokens removed):
tok_sentences = list()
# POS-tagged sentences without stop words:
pos_sentences = list()
# POS-tagged sentences with stop words kept:
pos_sentences_wsw = list()
# Flat list of all (word, tag) tuples from the stop-word-free sentences:
all_tags = list()

r_count=0
for (review, rating) in zip(df['reviewText'], df['overall']):
    # Only process the 5-star reviews.
    if (rating!=5.0):
        continue
    r_count+=1
    # Print progress every 10 reviews.
    if (r_count%10==0):
        print (r_count)
    sentences = nltk.sent_tokenize(review)
    for sentence in sentences:
        sent_words = nltk.word_tokenize(sentence)
        # word.isalpha() drops punctuation marks and numbers.
        # The stop word check is case-sensitive, so capitalized words such as 'This' are kept.
        sent_words_tok = [word for word in sent_words if word not in stopwords_list and word.isalpha()]
        sent_words_tok_wsw = [word for word in sent_words if word.isalpha()]
        tok_sentences.append(sent_words_tok)

        sent_tags = nltk.pos_tag(sent_words_tok)
        pos_sentences.append(sent_tags)

        sent_tags_wsw = nltk.pos_tag(sent_words_tok_wsw)
        pos_sentences_wsw.append(sent_tags_wsw)

        all_tags.extend(sent_tags)

10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360
370
380
390
400
410
420


Let's print the tagged version of the sentences:

In [49]:
for tagged_sent in pos_sentences:
    print (tagged_sent)

[('Saw', 'NNP'), ('case', 'NN'), ('theme', 'NN'), ('park', 'NN'), ('store', 'NN'), ('dollars', 'NNS')]
[('This', 'DT'), ('good', 'JJ'), ('quality', 'NN'), ('great', 'JJ'), ('price', 'NN')]
[('case', 'NN'), ('fits', 'VBZ'), ('perfectly', 'RB'), ('I', 'PRP'), ('always', 'RB'), ('gets', 'VBZ'), ('compliments', 'NNS'), ('cracked', 'VBD'), ('I', 'PRP'), ('dropped', 'VBD')]
[('wonderful', 'JJ'), ('protective', 'NN')]
[('Best', 'NNP'), ('phone', 'NN'), ('case', 'NN'), ('ever', 'RB')]
[('Everywhere', 'RB'), ('I', 'PRP'), ('go', 'VBP'), ('I', 'PRP'), ('get', 'VBP'), ('ton', 'JJ'), ('compliments', 'NNS')]
[('It', 'PRP'), ('perfect', 'JJ'), ('condition', 'NN'), ('well', 'RB')]
[('This', 'DT'), ('case', 'NN'), ('extremely', 'RB'), ('durabl', 'JJ'), ('I', 'PRP'), ('dropped', 'VBD'), ('phone', 'NN'), ('atleast', 'NN'), ('ten', 'IN'), ('times', 'NNS'), ('I', 'PRP'), ('clumsy', 'VBP'), ('since', 'IN'), ('I', 'PRP'), ('got', 'VBD'), ('case', 'NN')]
[('Still', 'RB'), ('cracks', 'VBZ'), ('case', 'NN')]
[

In [50]:
for tagged_sent in pos_sentences_wsw:
    print (tagged_sent)

[('Saw', 'NN'), ('this', 'DT'), ('same', 'JJ'), ('case', 'NN'), ('at', 'IN'), ('a', 'DT'), ('theme', 'NN'), ('park', 'NN'), ('store', 'NN'), ('for', 'IN'), ('dollars', 'NNS')]
[('This', 'DT'), ('is', 'VBZ'), ('very', 'RB'), ('good', 'JJ'), ('quality', 'NN'), ('for', 'IN'), ('a', 'DT'), ('great', 'JJ'), ('price', 'NN')]
[('case', 'NN'), ('fits', 'VBZ'), ('perfectly', 'RB'), ('and', 'CC'), ('I', 'PRP'), ('always', 'RB'), ('gets', 'VBZ'), ('compliments', 'NNS'), ('on', 'IN'), ('it', 'PRP'), ('its', 'PRP$'), ('has', 'VBZ'), ('cracked', 'VBN'), ('when', 'WRB'), ('I', 'PRP'), ('dropped', 'VBD'), ('it', 'PRP')]
[('wonderful', 'JJ'), ('and', 'CC'), ('protective', 'JJ')]
[('Best', 'NNP'), ('phone', 'NN'), ('case', 'NN'), ('ever', 'RB')]
[('Everywhere', 'RB'), ('I', 'PRP'), ('go', 'VBP'), ('I', 'PRP'), ('get', 'VBP'), ('a', 'DT'), ('ton', 'NN'), ('of', 'IN'), ('compliments', 'NNS'), ('on', 'IN'), ('it', 'PRP')]
[('It', 'PRP'), ('was', 'VBD'), ('in', 'IN'), ('perfect', 'JJ'), ('condition', 'NN'),

NLTK provides documentation for each tag in the Penn Treebank tagset. You can print it using the following code:

In [51]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

**[Assignment 1]**
Use this method to familiarize yourself with the various POS tags. 

**[Assignment 2]**
In our tokenization step we removed stop words, which affects the POS tags assigned to the remaining words. In this part of the lab session, redo the POS tagging over a tokenized version of the reviews that includes the stop words, then compare the POS tags obtained from the two tokenized versions. Pay particular attention to adjectives, nouns and verbs, and analyze how the stop words affect their tags. To compare the two outputs, create two tokenized versions of the reviews and run the POS tagger over both of them.   
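
One way to make the comparison systematic is to align the two taggings of a sentence and list the words whose tags differ. The sketch below is only a starting point: `tag_differences` is a hypothetical helper, and it assumes the stop-word-free sentence is the full sentence with some words removed, so its words appear in the same order. The sample inputs are copied from the tagger outputs shown earlier in this notebook.

```python
def tag_differences(tagged_without, tagged_with):
    """Return (word, tag_without_stopwords, tag_with_stopwords) for each disagreement.

    Assumes the words of tagged_without appear in tagged_with in the same order.
    """
    diffs = []
    remaining = iter(tagged_with)
    for word, tag_a in tagged_without:
        # Skip ahead in the full sentence until the same word is found.
        for other_word, tag_b in remaining:
            if other_word == word:
                if tag_a != tag_b:
                    diffs.append((word, tag_a, tag_b))
                break
    return diffs

# Two taggings of the same sentence, taken from the outputs above.
without_sw = [('Everywhere', 'RB'), ('I', 'PRP'), ('go', 'VBP'), ('I', 'PRP'),
              ('get', 'VBP'), ('ton', 'JJ'), ('compliments', 'NNS')]
with_sw = [('Everywhere', 'RB'), ('I', 'PRP'), ('go', 'VBP'), ('I', 'PRP'),
           ('get', 'VBP'), ('a', 'DT'), ('ton', 'NN'), ('of', 'IN'),
           ('compliments', 'NNS'), ('on', 'IN'), ('it', 'PRP')]

print(tag_differences(without_sw, with_sw))
```

In the notebook, the same function could be applied pairwise with `zip(pos_sentences, pos_sentences_wsw)`, since both lists are built sentence by sentence in the same loop.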

**[Assignment 3]** Write code that finds the most frequent adjectives and nouns in the collection. As we saw yesterday, one useful method for generating a sorted list of frequency counts is nltk.FreqDist(). The output of the POS tagger is a list of tuples. One way to filter the tuples and keep only the words with a particular tag is shown in the code below. Also note that during the POS tagging process we accumulated all (word, tag) tuples into a single list named __all_tags__. 

In [52]:
[word for (word, tag) in all_tags if tag == 'JJ']

['good',
 'great',
 'wonderful',
 'ton',
 'perfect',
 'durabl',
 'favorite',
 'many',
 'drops',
 'bottom',
 'top',
 'whole',
 'great',
 'quick',
 'perfect',
 'easy',
 'many',
 'concrete',
 'item',
 'right',
 'extra',
 'impressed',
 'much',
 'enoyed',
 'great',
 'minimal',
 'awesome',
 'whole',
 'good',
 'convenient',
 'durable',
 'easy',
 'open',
 'much',
 'easel',
 'much',
 'free',
 'many',
 'many',
 'easy',
 'free',
 'sturdy',
 'good',
 'laptop',
 'happy',
 'new',
 'little',
 'sister',
 'nice',
 'quick',
 'bit',
 'fit',
 'new',
 'great',
 'new',
 'review',
 'useful',
 'original',
 'noticeable',
 'easy',
 'necessary',
 'internal',
 'able',
 'great',
 'handset',
 'pretty',
 'sweet',
 'piece',
 'revamp',
 'little',
 'great',
 'easy',
 'great',
 'slow',
 'extra',
 'excellent',
 'new',
 'great',
 'long',
 'great',
 'sure',
 'dead',
 'favorite',
 'USB',
 'videos',
 'myriad',
 'different',
 'sure',
 'compatible',
 'tech',
 'mobile',
 'full',
 'potential',
 'able',
 'handy',
 'flat',
 'analo

Use the above code to obtain the ranked list of the most frequent adjectives and nouns in the collection.

**[Solution 3]**

In [33]:
adj_list = nltk.FreqDist([word for (word, tag) in all_tags if tag == 'JJ'])
adj_list.most_common()

[('worth', 20),
 ('good', 12),
 ('cheap', 12),
 ('bad', 11),
 ('great', 11),
 ('much', 11),
 ('first', 11),
 ('little', 10),
 ('last', 9),
 ('light', 8),
 ('able', 7),
 ('top', 7),
 ('different', 7),
 ('nice', 7),
 ('new', 7),
 ('current', 6),
 ('hot', 6),
 ('big', 6),
 ('fit', 6),
 ('usb', 6),
 ('durable', 6),
 ('extra', 6),
 ('fine', 6),
 ('multiple', 5),
 ('many', 5),
 ('full', 5),
 ('ear', 5),
 ('screen', 5),
 ('disappointed', 5),
 ('low', 5),
 ('second', 5),
 ('hard', 5),
 ('enough', 5),
 ('signal', 4),
 ('right', 4),
 ('short', 4),
 ('terrible', 4),
 ('clear', 4),
 ('loose', 4),
 ('useless', 4),
 ('wrong', 4),
 ('small', 4),
 ('poor', 4),
 ('high', 4),
 ('couple', 4),
 ('easy', 3),
 ('connect', 3),
 ('whole', 3),
 ('available', 3),
 ('sure', 3),
 ('dead', 3),
 ('blue', 3),
 ('several', 3),
 ('happy', 3),
 ('instant', 3),
 ('long', 3),
 ('slow', 3),
 ('receive', 3),
 ('bottom', 3),
 ('soft', 3),
 ('battery', 3),
 ('stuck', 3),
 ('defective', 3),
 ('expensive', 3),
 ('true', 3),
 (

**[Assignment 4 (Optional)]** In the homework assignment we provided you with code that lets you traverse the reviews with a particular rating. Use this code to obtain the most frequent adjectives, nouns and verbs for the reviews with ratings 1 and 5. Below is the code snippet:

In [None]:
for (review, rating) in zip(df['reviewText'], df['overall']):
    pass  # process each (review, rating) pair here

**[Solution 4]**
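
One possible shape for a solution, sketched on hand-tagged toy data rather than on the real reviews. The function name `top_words_by_rating` and the sample reviews are hypothetical. In the notebook, the `(rating, tagged_words)` pairs would instead be built by running `nltk.pos_tag` on each tokenized review inside the loop above. Matching on a tag prefix relies on the Penn Treebank convention that adjective tags start with `JJ`, noun tags with `NN` and verb tags with `VB`.

```python
from collections import Counter, defaultdict

def top_words_by_rating(tagged_reviews, tag_prefix, ratings=(1.0, 5.0), n=3):
    """Return the n most frequent words per rating whose tag starts with tag_prefix."""
    counts = defaultdict(Counter)
    for rating, tagged_words in tagged_reviews:
        if rating in ratings:
            counts[rating].update(
                word.lower() for word, tag in tagged_words if tag.startswith(tag_prefix)
            )
    return {rating: counts[rating].most_common(n) for rating in ratings}

# Hypothetical, hand-tagged stand-ins for tagged reviews.
tagged_reviews = [
    (5.0, [("great", "JJ"), ("case", "NN"), ("fits", "VBZ")]),
    (5.0, [("great", "JJ"), ("sturdy", "JJ"), ("phone", "NN")]),
    (1.0, [("cheap", "JJ"), ("case", "NN"), ("cracked", "VBD")]),
    (1.0, [("cheap", "JJ"), ("plastic", "NN"), ("broke", "VBD")]),
    (3.0, [("okay", "JJ"), ("case", "NN")]),
]

for prefix in ("JJ", "NN", "VB"):
    print(prefix, top_words_by_rating(tagged_reviews, prefix))
```
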