# Exercise Sheet 5 - POS Tagging

## Learning Objectives

In this lab we are going to:

- Explore POS tagging using NLTK
- Review Hidden Markov Models (HMMs)
- Learn POS tagging with an HMM

----------------
## POS Tagging 

### Approaches

In POS tagging, we are given a sentence X and want to predict the sequence Y of part-of-speech tags, one tag for each word in the sentence. This can be done in different ways:
 
1- Pointwise prediction: a classifier, such as a perceptron, that predicts the tag of each word individually. <br>
2- Generative sequence models: a probabilistic model, such as a Hidden Markov Model, that assigns a probability to a sequence of words and tags. **[the focus of this lab]** <br>
3- Discriminative sequence models: a classifier, such as a conditional random field (CRF), that predicts the whole tag sequence at once. <br>

### Tag Sets
The most common tag sets are:

1- <a href= "http://ucrel.lancs.ac.uk/claws5tags.html">Claws5</a>: 62 different tags <br>
2- <a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">Penn Treebank</a>: 45 different tags (Most widely used currently) <br>
3- <a href = "http://www.comp.leeds.ac.uk/ccalas/tagsets/brown.html">The Brown Corpus tagset</a>: 87 different tags




### NLTK POS Tagging

The NLTK tagger can be used as follows:


In [None]:
#setting the stage ;)
# if you encounter some errors related to missing nltk packages run the following commands

import nltk
nltk.download('all')

In [None]:

from nltk.tokenize import word_tokenize

text = word_tokenize("And now for something completely different")
nltk.pos_tag(text)

The Brown corpus has been manually tagged with part-of-speech tags, which makes it useful for testing taggers and for training statistical taggers. To read a tagged corpus we can use:

In [None]:
from nltk.corpus import brown

print(brown.tagged_words())
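
`brown.tagged_words()` yields `(word, tag)` pairs, and a list of such pairs can be counted with the standard library. The sketch below uses a small hand-made list (`toy_tagged` is hypothetical data, not taken from the corpus) so that it runs without any NLTK downloads. The same pattern should carry over to the real corpus, for example to the pairs returned by `brown.tagged_words(categories='news')`.

```python
from collections import Counter

# Hypothetical toy data in the same (word, tag) format as brown.tagged_words().
toy_tagged = [
    ("The", "AT"), ("World", "NN-TL"), ("is", "BEZ"),
    ("a", "AT"), ("small", "JJ"), ("world", "NN"),
    ("the", "AT"), ("world", "NN"),
]

# Count how often each tag occurs in the sample.
tag_counts = Counter(tag for _, tag in toy_tagged)

# Count the tags assigned to one word, ignoring case.
world_tags = Counter(tag for word, tag in toy_tagged if word.lower() == "world")

print(tag_counts.most_common(2))
print(dict(world_tags))
```

`Counter.most_common(n)` returns the `n` most frequent items in descending order of count, which is the shape of output the exercises below ask for.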

**Exercise 1:**
Count how often each POS tag is assigned to the word "world" **(ignore case)** in the **news** category of the Brown corpus.

In [None]:
#your code goes here; output should be: NN: 37, NN-TL: 9

**Exercise 2:**
Can you get the frequency distribution of the tags in the Brown corpus?

In [None]:
#your code goes here; output should be 
#[('NN', 152470),('IN', 120557),('AT', 97959),....]


**Exercise 3:**
What are the most common verbs in the **fiction** category of the Brown corpus?

In [None]:
#your code goes here; output should be 
#['came': 'VBD', 'curled': 'VBD', 'ki-yi-ing': 'VBG',....]

-----------------------
## Hidden Markov Model

The sequence of tags can be viewed as a Markov chain, so let us explore the construction and solution of a Hidden Markov Model. Consider an HMM with hidden states Noun, Verb and Adj, where $p(Y_{i+1}|Y_i)$ is the probability of state $Y_{i+1}$ occurring after state $Y_i$. The transition probabilities are:

| $p(Y_{i+1} \mid Y_i)$ | $Y_{i+1}$=Noun | $Y_{i+1}$=Verb | $Y_{i+1}$=Adj |
|:-----------------|:--------------:|:--------------:|:-------------:|
| $Y_i$=Start      |  0.5           |  0.4           | 0.1           |
| $Y_i$=Noun       |  0.3           |  0.5           | 0.2           |
| $Y_i$=Verb       |  0.7           |  0.2           | 0.1           |
| $Y_i$=Adj        |  0.8           |  0.1           | 0.1           |

Furthermore, the model has the following vocabulary, with emission probabilities $p(X_i|Y_i)$:

| $p(X_i \mid Y_i)$ | cats | dogs | drink | water | milk | fresh |
|:-------------|:----:|:----:|:-----:|:-----:|:----:|:-----:|
| $Y_i$=Noun   | 0.2  | 0.2  |  0.2  | 0.2   | 0.1  | 0.0   |
| $Y_i$=Verb   | 0.1  | 0.1  | 0.4   | 0.2   | 0.1  | 0.1   |
| $Y_i$=Adj    | 0.0  | 0.0  | 0.2   | 0.0   | 0.2  | 0.8   |
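
One possible way to hold the two tables in Python is as nested dictionaries, which matches the `{tag: {…}}` layout used by the `hmm_learn` skeleton later in this sheet. This is a sketch, not the required representation; the lower-case keys and the `transitions[previous][next]` / `emissions[tag][word]` indexing order are assumptions.

```python
# Transition probabilities p(Y_{i+1} | Y_i), indexed as transitions[previous][next].
transitions = {
    "start": {"noun": 0.5, "verb": 0.4, "adj": 0.1},
    "noun":  {"noun": 0.3, "verb": 0.5, "adj": 0.2},
    "verb":  {"noun": 0.7, "verb": 0.2, "adj": 0.1},
    "adj":   {"noun": 0.8, "verb": 0.1, "adj": 0.1},
}

# Emission probabilities p(X_i | Y_i), indexed as emissions[tag][word].
emissions = {
    "noun": {"cats": 0.2, "dogs": 0.2, "drink": 0.2, "water": 0.2, "milk": 0.1, "fresh": 0.0},
    "verb": {"cats": 0.1, "dogs": 0.1, "drink": 0.4, "water": 0.2, "milk": 0.1, "fresh": 0.1},
    "adj":  {"cats": 0.0, "dogs": 0.0, "drink": 0.2, "water": 0.0, "milk": 0.2, "fresh": 0.8},
}

# Look up single entries of each table.
print(transitions["start"]["noun"])
print(emissions["adj"]["fresh"])
```

The values are copied from the tables unchanged. Note that, as printed, the Noun and Adj emission rows sum to 0.9 and 1.2 rather than 1; the expected outputs in the exercises below agree with the values as printed.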


**Exercise 4:**

Implement the above tables and write a function that takes a sequence of words and a sequence of part-of-speech tags and returns their joint probability under the above model. Calculate the probability of the sentence "cats drink fresh milk" together with the tags "noun verb adj noun".
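
For reference, the quantity asked for here can be read as the joint probability of words and tags under the HMM, written with the convention (an assumption of this note) that $Y_0$ is the Start state:

$$p(X, Y) = \prod_{i=1}^{n} p(Y_i \mid Y_{i-1}) \, p(X_i \mid Y_i), \qquad Y_0 = \text{Start}$$

Reading the factors off the two tables for "cats drink fresh milk" with tags Noun, Verb, Adj, Noun, in the order transition, emission, transition, emission, and so on:

$$0.5 \cdot 0.2 \cdot 0.5 \cdot 0.4 \cdot 0.1 \cdot 0.8 \cdot 0.8 \cdot 0.1 = 0.000128$$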

In [None]:
tags = ["start","noun","verb","adj"]
words = ["cats","dogs","drink","water","milk","fresh"]

In [None]:
def hmm_prob_with_state(words, tags):
    prob = 1.0
    # TODO
    return prob

print(hmm_prob_with_state(["cats","drink","fresh","milk"],
                          ["noun","verb","adj","noun"]))
#expected output should be 0.000128
    

**Exercise 5:**

Using the Forward (dynamic programming) algorithm, write a function that calculates the likelihood of a sequence of words, summed over all possible tag sequences. Find the probability of the sentence "cats drink fresh milk".
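
As a reminder of how the forward recursion works, here is a minimal sketch on a deliberately different model. The two-state "hot"/"cold" HMM, the function name `forward_likelihood` and its parameters are all hypothetical and not part of this sheet; the function also assumes a non-empty observation sequence. At each step it keeps, for every state, the total probability of all paths that end in that state.

```python
def forward_likelihood(observations, states, start_p, trans_p, emit_p):
    """Return p(observations), summed over all hidden state sequences."""
    # alpha[s] = probability of the observations so far with the last state being s.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        # Sum over every possible previous state, then emit the current observation.
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical two-state model, unrelated to the tables in this sheet.
states = ["hot", "cold"]
start_p = {"hot": 0.6, "cold": 0.4}
trans_p = {"hot": {"hot": 0.7, "cold": 0.3}, "cold": {"hot": 0.4, "cold": 0.6}}
emit_p = {"hot": {"small": 0.2, "large": 0.8}, "cold": {"small": 0.9, "large": 0.1}}

print(forward_likelihood(["large", "small"], states, start_p, trans_p, emit_p))
```

The cost grows linearly with the sentence length and quadratically with the number of states, instead of enumerating every tag sequence.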

In [None]:
def hmm_lm(words):
    prob = 1.0
    # TODO
    return prob

print(hmm_lm(["cats","drink","fresh","milk"]))
#expected output should be 0.00057068

**Exercise 6:**

Write a function that finds the most likely sequence of part-of-speech tags for a given sequence of words using the Viterbi algorithm.
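
The Viterbi recursion has the same shape as the forward recursion, with the sum over previous states replaced by a maximum. The sketch below illustrates this on a hypothetical two-state "hot"/"cold" model that is not part of this sheet; `viterbi_path` and its parameters are assumed names, and the input is assumed to be non-empty. For simplicity it stores the full best path per state instead of backpointers, which is easier to read but uses more memory.

```python
def viterbi_path(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence and its joint probability."""
    # best[s] = (probability, path) of the best path so far that ends in state s.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Choose the predecessor that maximises the probability of reaching s.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s][obs], path + [s])
        best = new_best
    prob, path = max(best.values(), key=lambda item: item[0])
    return path, prob


# Hypothetical two-state model, unrelated to the tables in this sheet.
states = ["hot", "cold"]
start_p = {"hot": 0.6, "cold": 0.4}
trans_p = {"hot": {"hot": 0.7, "cold": 0.3}, "cold": {"hot": 0.4, "cold": 0.6}}
emit_p = {"hot": {"small": 0.2, "large": 0.8}, "cold": {"small": 0.9, "large": 0.1}}

path, prob = viterbi_path(["large", "small", "small"], states, start_p, trans_p, emit_p)
print(path, prob)
```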

In [None]:
def hmm_map(words):
    seq = ["noun","noun","noun","noun"]
    # TODO
    return seq

print(hmm_map(["cats","drink","fresh","milk"]))
#expected output should be ['noun', 'verb', 'adj', 'noun']

**Exercise 7:**

Consider the following corpus:

In [None]:
sentences = [
    ["cats","drink","milk"],
    ["dogs","drink","water"],
    ["fresh","milk"],
    ["dogs","drink","fresh","milk"],
    ["cats","milk"]
]

tagged = [
    ["noun","verb","noun"],
    ["noun","verb","noun"],
    ["adj","noun"],
    ["noun","verb","adj","noun"],
    ["noun","noun"]
]

Write a function that learns the emission and transition probabilities of the Hidden Markov Model from this corpus.
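
One common choice, assumed here because the sheet does not name an estimator, is the relative-frequency (maximum-likelihood) estimate. Let $c(t, t')$ be the number of times tag $t'$ directly follows tag $t$ (with Start placed before the first tag of every sentence), and $c(t, w)$ the number of times word $w$ is tagged $t$:

$$p(t' \mid t) = \frac{c(t, t')}{\sum_{t''} c(t, t'')}, \qquad p(w \mid t) = \frac{c(t, w)}{\sum_{w'} c(t, w')}$$

For example, four of the five sentences above begin with a noun and one begins with an adjective, so this estimate gives $p(\text{noun} \mid \text{start}) = 4/5 = 0.8$ and $p(\text{adj} \mid \text{start}) = 1/5 = 0.2$.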

In [None]:
def hmm_learn(sentences, tagged):
    transitions = {t:{t2:0.0 for t2 in tags} for t in tags}
    emissions    = {t:{w:0.0 for w in words} for t in tags}
    # TODO
    return transitions, emissions

print(hmm_learn(sentences, tagged))

**Exercise 8:**

Using the probability matrices you calculated in Exercise 7, show that the probability of the sentence "fresh fresh milk" is zero. Suggest how you could change your calculation in Exercise 7 to ensure that no sentence has zero probability.
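
One standard remedy for zero probabilities is additive (add-k) smoothing: add a small constant to every count before normalising. The sketch below shows the idea for a single row of counts. The helper `smoothed_probs`, its parameter `k` and the toy counts are hypothetical, and other smoothing schemes would also be valid answers.

```python
def smoothed_probs(counts, outcomes, k=1.0):
    """Turn raw counts into probabilities using add-k smoothing."""
    # Every outcome receives k extra pseudo-counts, so none ends up with probability 0.
    total = sum(counts.get(o, 0) for o in outcomes) + k * len(outcomes)
    return {o: (counts.get(o, 0) + k) / total for o in outcomes}


# Illustrative toy counts: only one outcome was ever observed.
observed = {"noun": 2}
probs = smoothed_probs(observed, ["noun", "verb", "adj"])
print(probs)
```

With `k=1.0` this is Laplace smoothing; smaller values of `k` stay closer to the unsmoothed relative frequencies.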

In [None]:
transitions, emissions = hmm_learn(sentences, tagged)

print(hmm_lm(["fresh","fresh","milk"]))

def hmm_learn2(sentences, tagged):
    transitions = {t:{t2:0.0 for t2 in tags} for t in tags}
    emissions    = {t:{w:0.0 for w in words} for t in tags}
    # TODO
    return transitions, emissions

transitions, emissions = hmm_learn2(sentences, tagged)

print(hmm_lm(["fresh","fresh","milk"]))

