# Exercise Sheet 5 - POS Tagging

## Learning Objectives

In this lab we are going to:

- Explore POS tagging using NLTK <br>
- Review Hidden Markov Models (HMMs) <br>
- Learn POS tagging with an HMM

----------------
## POS Tagging 

### Approaches

In POS tagging, we are given a sentence X and want to predict the sequence Y of part-of-speech tags, one tag per word. This can be done in different ways:
 
1- Pointwise prediction: a classifier, such as a perceptron, predicts the tag of each word individually. <br>
2- Generative sequence models: a probabilistic model, such as a Hidden Markov Model, assigns a joint probability to the sequence of words and the sequence of tags. **[the focus of this lab]** <br>
3- Discriminative sequence models: a classifier, such as a conditional random field (CRF), predicts the whole tag sequence at once. <br>

### Tag Sets
The most common tag sets are:

1- <a href= "http://ucrel.lancs.ac.uk/claws5tags.html">CLAWS5</a>: 62 different tags <br>
2- <a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">Penn Treebank</a>: 45 different tags (most widely used currently) <br>
3- <a href = "http://www.comp.leeds.ac.uk/ccalas/tagsets/brown.html">The Brown Corpus tagset</a>: 87 different tags




### NLTK POS Tagging

The NLTK tagger can be used as follows:


In [1]:
#setting the stage ;)
# if you encounter some errors related to missing nltk packages run the following commands

import nltk
nltk.download('all')


True

In [2]:
from nltk.tokenize import word_tokenize

text = word_tokenize("And now for something completely different")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

The Brown corpus has been manually tagged with part-of-speech tags, which makes it useful for evaluating taggers and for training statistical taggers. To read a tagged corpus we can use:

In [3]:
from nltk.corpus import brown

print (brown.tagged_words())

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]


**Exercise 1:**
Count how often each POS tag is assigned to the word "world" **(ignoring case)** in the **news** category of the Brown corpus.

In [4]:
#your code goes here; output should be: NN: 37, NN-TL: 9
tagged_words = brown.tagged_words(categories=['news'])
counts = dict()

for word, tag in tagged_words:
  if word.lower() == 'world':
    if tag in counts:
      counts[tag] += 1
    else:
      counts[tag] = 1

print(counts)

{'NN': 37, 'NN-TL': 9}


In [5]:
# one-liner implementation using Counter
from collections import Counter
Counter([tag for (word, tag) in brown.tagged_words(categories=['news']) if word.lower() == 'world'])

Counter({'NN': 37, 'NN-TL': 9})

In [6]:
# using ConditionalFreqDist in nltk
tagged_words = brown.tagged_words(categories=['news'])
# lower case 'world'
tagged_words = [(word.lower(), tag) for (word, tag) in tagged_words]
# use cond. freq. dist. given the tag
cfd = nltk.ConditionalFreqDist(tagged_words)
cfd['world']

FreqDist({'NN': 37, 'NN-TL': 9})
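
To make explicit what `ConditionalFreqDist` does with the (condition, sample) pairs, here is a minimal standard-library sketch on a small hand-made list. The `toy_tagged` data is hypothetical and is not drawn from the Brown corpus, and `defaultdict(Counter)` only approximates the behaviour of the NLTK class; it is not its implementation.

```python
from collections import Counter, defaultdict

# Hypothetical (word, tag) pairs, made up for illustration only
toy_tagged = [("World", "NN-TL"), ("world", "NN"), ("the", "AT"),
              ("world", "NN"), ("The", "AT"), ("run", "VB"), ("run", "NN")]

# condition = lower-cased word, sample = tag
cfd = defaultdict(Counter)
for word, tag in toy_tagged:
    cfd[word.lower()][tag] += 1

print(dict(cfd["world"]))   # tag counts for the word "world"
print(dict(cfd["the"]))     # tag counts for the word "the"
```

Each word maps to its own counter of tags, which is why `cfd['world']` in the cell above returns the tag counts for that one word.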

**Exercise 2:**
Can you get the frequency distribution of the tags in the Brown corpus?

In [7]:
#your code goes here; output should be 
#[('NN', 152470),('IN', 120557),('AT', 97959),....]
tagged_words = brown.tagged_words()
counts = dict()

for word, tag in tagged_words:
  if tag in counts:
    counts[tag] += 1
  else:
    counts[tag] = 1

sorted_counts = sorted(counts.items(), key=lambda c: c[1], reverse=True)
sorted_counts[:5]

[('NN', 152470), ('IN', 120557), ('AT', 97959), ('JJ', 64028), ('.', 60638)]

In [8]:
# using Counter and map
Counter(map(lambda x: x[1], brown.tagged_words())).most_common(5)

[('NN', 152470), ('IN', 120557), ('AT', 97959), ('JJ', 64028), ('.', 60638)]

In [9]:
# using FreqDist in nltk
tagged_words = brown.tagged_words()
words, tags = zip(*tagged_words)
freq_dist = nltk.FreqDist(tags)
freq_dist.most_common(5)

[('NN', 152470), ('IN', 120557), ('AT', 97959), ('JJ', 64028), ('.', 60638)]

**Exercise 3:**
What are the most common verbs in the **fiction** category of the Brown corpus?

In [10]:
#your code goes here; output should be 
#[(('said', 'VBD'), 177), (('came', 'VBD'), 91), (('went', 'VBD'), 79),....]

tagged_words = brown.tagged_words(categories=['fiction'])
counts = dict()

for word, tag in tagged_words:
  if tag.startswith('VB'):
    if (word, tag) in counts:
      counts[(word, tag)] += 1
    else:
      counts[(word, tag)] = 1

sorted_counts = sorted(counts.items(), key=lambda c: c[1], reverse=True)
sorted_counts[:5]

[(('said', 'VBD'), 177),
 (('came', 'VBD'), 91),
 (('went', 'VBD'), 79),
 (('get', 'VB'), 78),
 (('know', 'VB'), 74)]

In [11]:
# one-liner using Counter and list comprehension
Counter([(word, tag) for (word, tag) in brown.tagged_words(categories=['fiction']) if tag.startswith('VB')]).most_common(5)

[(('said', 'VBD'), 177),
 (('came', 'VBD'), 91),
 (('went', 'VBD'), 79),
 (('get', 'VB'), 78),
 (('know', 'VB'), 74)]

In [12]:
# using FreqDist in nltk
tagged_words = brown.tagged_words(categories='fiction')
tagged_words = [(word, tag) for (word, tag) in tagged_words if tag.startswith('VB')]
freq_dist = nltk.FreqDist(tagged_words)
freq_dist.most_common(5)

[(('said', 'VBD'), 177),
 (('came', 'VBD'), 91),
 (('went', 'VBD'), 79),
 (('get', 'VB'), 78),
 (('know', 'VB'), 74)]

-----------------------
## Hidden Markov Model

The sequence of tags can be viewed as a Markov chain, so let us explore the construction and solution of a Hidden Markov Model. Consider an HMM with hidden states Noun, Verb and Adj, where $p(Y_{i+1}|Y_i)$ is the probability of state $Y_{i+1}$ occurring after $Y_i$. The transition probabilities are:



| $p(Y_{i+1}|Y_i)$ | $Y_{i+1}$=Noun | $Y_{i+1}$=Verb | $Y_{i+1}$=Adj |
|:-----------------|:--------------:|:--------------:|:-------------:|
| $Y_i$=Start      |  0.5           |  0.4           | 0.1           |
| $Y_i$=Noun       |  0.3           |  0.5           | 0.2           |
| $Y_i$=Verb       |  0.7           |  0.2           | 0.1           |
| $Y_i$=Adj        |  0.8           |  0.1           | 0.1           |

Furthermore, the model has the following vocabulary, with emission probabilities $p(X_i|Y_i)$:

| $p(X_i|Y_i)$ | cats | dogs | drink | water | milk | fresh |
|:-------------|:----:|:----:|:-----:|:-----:|:----:|:-----:|
| $Y_i$=Noun   | 0.2  | 0.2  |  0.2  | 0.2   | 0.1  | 0.0   |
| $Y_i$=Verb   | 0.1  | 0.1  | 0.4   | 0.2   | 0.1  | 0.1   |
| $Y_i$=Adj    | 0.0  | 0.0  | 0.2   | 0.0   | 0.2  | 0.8   |


In [13]:
all_tags = ["start","noun","verb","adj"]
all_words = ["cats","dogs","drink","water","milk","fresh"]

In [14]:
transitions = {
  'start': {'noun': 0.5, 'verb': 0.4, 'adj': 0.1, 'start': 0.0},
  'noun': {'noun': 0.3, 'verb': 0.5, 'adj': 0.2, 'start': 0.0},
  'verb': {'noun': 0.7, 'verb': 0.2, 'adj': 0.1, 'start': 0.0},
  'adj': {'noun': 0.8, 'verb': 0.1, 'adj': 0.1, 'start': 0.0},
}

emissions = {
  'noun': {'cats': 0.2, 'dogs': 0.2, 'drink': 0.2, 'fresh': 0.0, 'milk': 0.1, 'water': 0.2},
  'verb': {'cats': 0.1, 'dogs': 0.1, 'drink': 0.4, 'fresh': 0.1, 'milk': 0.1, 'water': 0.2},
  'adj': {'cats': 0.0, 'dogs': 0.0, 'drink': 0.2, 'fresh': 0.8, 'milk': 0.2, 'water': 0.0},
  'start': {'cats': 0.0, 'dogs': 0.0, 'drink': 0.0, 'fresh': 0.0, 'milk': 0.0, 'water': 0.0},
}
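
Before using the model, it can be worth checking whether each row of the two tables forms a probability distribution. The following is a small sketch of such a check; `TRANS`, `EMIT` and `row_sums` are hypothetical names, and the values are copied from the tables above so that the block runs on its own.

```python
# Copies of the tables above (without the all-zero 'start' column and row)
TRANS = {
    "start": {"noun": 0.5, "verb": 0.4, "adj": 0.1},
    "noun":  {"noun": 0.3, "verb": 0.5, "adj": 0.2},
    "verb":  {"noun": 0.7, "verb": 0.2, "adj": 0.1},
    "adj":   {"noun": 0.8, "verb": 0.1, "adj": 0.1},
}
EMIT = {
    "noun": {"cats": 0.2, "dogs": 0.2, "drink": 0.2, "water": 0.2, "milk": 0.1, "fresh": 0.0},
    "verb": {"cats": 0.1, "dogs": 0.1, "drink": 0.4, "water": 0.2, "milk": 0.1, "fresh": 0.1},
    "adj":  {"cats": 0.0, "dogs": 0.0, "drink": 0.2, "water": 0.0, "milk": 0.2, "fresh": 0.8},
}

def row_sums(table):
    """Sum each row, rounded to six decimals to hide floating-point noise."""
    return {state: round(sum(row.values()), 6) for state, row in table.items()}

print(row_sums(TRANS))
print(row_sums(EMIT))
```

With the values as given, every transition row sums to 1, while the Noun and Adj emission rows sum to 0.9 and 1.2. The expected outputs quoted in the exercises are computed from the tables exactly as written.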


**Exercise 4:**

Implement the above table and write a function that takes a sequence of words and a sequence of part-of-speech tags and returns the probability using the above model. Calculate the probability of the sentence "cats drink fresh milk" given the tags "noun verb adj noun"

In [15]:
# refer to slides 12-17 in Week 5 Lecture

def hmm_prob_with_state(words, tags):
    prob = 1.0
    
    prev_tag = 'start'   # same as t0 on slide 14

    for tag, word in zip(tags, words):
      prob = prob * transitions[prev_tag][tag] * emissions[tag][word]
      prev_tag = tag

    return prob

print(hmm_prob_with_state(["cats","drink","fresh","milk"],
                          ["noun","verb","adj","noun"]))
#expected output should be 0.000128

0.00012800000000000005
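
Written out, the function multiplies one transition factor and one emission factor per position, with $Y_0$ = Start. Plugging in the table values for this sentence and tag sequence gives the expected result:

```latex
\begin{aligned}
p(X, Y) &= \prod_{i=1}^{4} p(Y_i \mid Y_{i-1})\, p(X_i \mid Y_i) \\
        &= (0.5 \cdot 0.2)\,(0.5 \cdot 0.4)\,(0.1 \cdot 0.8)\,(0.8 \cdot 0.1) \\
        &= 0.1 \cdot 0.2 \cdot 0.08 \cdot 0.08 = 0.000128
\end{aligned}
```

The four bracketed factors correspond to (Start→Noun, cats), (Noun→Verb, drink), (Verb→Adj, fresh) and (Adj→Noun, milk).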


**Exercise 5:**

Using the Forward (dynamic programming) algorithm, write a function that calculates the likelihood of a sequence of words. Find the probability of the sentence "cats drink fresh milk".

In [16]:
# refer to slides 36-40
def hmm_lm(words):
    probs = {tag:[0] for tag in all_tags}

    # probs[t][i] = joint probability of the first i words and of tag t at position i (the forward probability)
    probs['start'][0] = 1

    for i, word in enumerate(words):
      for tag in all_tags:  # compute the forward probability of each tag for the current word
        probs[tag].append(0)
        if tag != "start":
          for prev_tag in all_tags: # sum over every possible previous tag (law of total probability)
            probs[tag][i+1] += probs[prev_tag][i] * transitions[prev_tag][tag]
          probs[tag][i+1] *= emissions[tag][word]

    # sum over the tag of the last word to get the likelihood of the whole sentence
    return sum(probs[tag][-1] for tag in all_tags)
  
print(hmm_lm(["cats","drink","fresh","milk"]))
#expected output should be 0.00057068

0.0005706800000000003
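
The likelihood of a sentence is the sum of the joint probability over every possible tag sequence, so the forward result can be cross-checked by brute force on a short sentence. The sketch below does that; `TRANS`, `EMIT`, `joint_prob` and `brute_force_likelihood` are hypothetical names, and the tables are copied from the model above so the block runs on its own.

```python
from itertools import product

TAGS = ["noun", "verb", "adj"]
TRANS = {
    "start": {"noun": 0.5, "verb": 0.4, "adj": 0.1},
    "noun":  {"noun": 0.3, "verb": 0.5, "adj": 0.2},
    "verb":  {"noun": 0.7, "verb": 0.2, "adj": 0.1},
    "adj":   {"noun": 0.8, "verb": 0.1, "adj": 0.1},
}
EMIT = {
    "noun": {"cats": 0.2, "dogs": 0.2, "drink": 0.2, "water": 0.2, "milk": 0.1, "fresh": 0.0},
    "verb": {"cats": 0.1, "dogs": 0.1, "drink": 0.4, "water": 0.2, "milk": 0.1, "fresh": 0.1},
    "adj":  {"cats": 0.0, "dogs": 0.0, "drink": 0.2, "water": 0.0, "milk": 0.2, "fresh": 0.8},
}

def joint_prob(words, tags):
    """p(words, tags): one transition and one emission factor per position."""
    prob, prev = 1.0, "start"
    for word, tag in zip(words, tags):
        prob *= TRANS[prev][tag] * EMIT[tag][word]
        prev = tag
    return prob

def brute_force_likelihood(words):
    """Sum the joint probability over all len(TAGS) ** len(words) tag sequences."""
    return sum(joint_prob(words, tags) for tags in product(TAGS, repeat=len(words)))

total = brute_force_likelihood(["cats", "drink", "fresh", "milk"])
print(round(total, 8))  # agrees with the forward algorithm: 0.00057068
```

This enumerates 3^4 = 81 sequences, and the count grows exponentially with sentence length. The forward algorithm reaches the same number with work proportional to the sentence length times the square of the number of tags.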


**Exercise 6:**

Write a function that finds the most likely sequence of part-of-speech tags for a given sequence of words using the Viterbi algorithm.

In [17]:
# viterbi algorithm implemented as modified version of the forward algorithm
def hmm_map(words):
    probs = {tag:[0] for tag in all_tags}

    # probs[t][i] stores the probability of the best tag sequence for the first i words that ends in tag t; same as mem table in the slides
    probs['start'][0] = 1

    # viterbi[i][tag] stores the best tag sequence till position i; same as y table in the slides
    viterbi = [{tag: [] for tag in all_tags}]

    for i, word in enumerate(words):
      viterbi.append({}) # state in dp-table

      for tag in all_tags:  # find the probability of each tag given current word
        probs[tag].append(0)
        # modify the forward algorithm here; instead of calculating the sum, find the max. probability so far
        if tag != "start":
          best_tag = ''
          best_prob = 0.0
          for prev_tag in all_tags: # find the best probability so far
            curr_prob = probs[prev_tag][i] * transitions[prev_tag][tag]
            if curr_prob > best_prob:
              best_prob = curr_prob
              best_tag = prev_tag
          # best_prob already contains the best path probability up to prev_tag times the transition
          probs[tag][i+1] = best_prob * emissions[tag][word]
          
          # save the best path to the dp-table; a tag with no reachable predecessor gets an empty path
          viterbi[i+1][tag] = viterbi[i][best_tag] + [tag] if best_tag else []
    
    # return the best tag sequence based on the best last word
    best_prob = 0.0
    best_tag = ''

    for tag in all_tags:
      prob = probs[tag][-1]
      if prob > best_prob:
        best_prob = prob
        best_tag = tag
    
    return viterbi[-1][best_tag] if best_tag else []   # empty list when every tag sequence has zero probability

print(hmm_map(["cats","drink","fresh","milk"]))

['noun', 'verb', 'adj', 'noun']
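
Multiplying many probabilities underflows for long sentences, so Viterbi is commonly run on log-probabilities, where products become sums. The following is a minimal sketch of that variant, which also stores back-pointers instead of whole paths. It is not the version from the slides: `viterbi_log`, `safe_log`, `TRANS` and `EMIT` are hypothetical names, the tables are copied from the model above, and the function assumes a non-empty sentence.

```python
import math

TAGS = ["noun", "verb", "adj"]
TRANS = {
    "start": {"noun": 0.5, "verb": 0.4, "adj": 0.1},
    "noun":  {"noun": 0.3, "verb": 0.5, "adj": 0.2},
    "verb":  {"noun": 0.7, "verb": 0.2, "adj": 0.1},
    "adj":   {"noun": 0.8, "verb": 0.1, "adj": 0.1},
}
EMIT = {
    "noun": {"cats": 0.2, "dogs": 0.2, "drink": 0.2, "water": 0.2, "milk": 0.1, "fresh": 0.0},
    "verb": {"cats": 0.1, "dogs": 0.1, "drink": 0.4, "water": 0.2, "milk": 0.1, "fresh": 0.1},
    "adj":  {"cats": 0.0, "dogs": 0.0, "drink": 0.2, "water": 0.0, "milk": 0.2, "fresh": 0.8},
}

def safe_log(p):
    """log(p), with log(0) mapped to -inf instead of raising ValueError."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi_log(words):
    # score[t] = best log-probability of any tag sequence ending in tag t
    score = {t: safe_log(TRANS["start"][t]) + safe_log(EMIT[t][words[0]]) for t in TAGS}
    backpointers = []  # one dict per later position: tag -> best previous tag

    for word in words[1:]:
        new_score, pointer = {}, {}
        for t in TAGS:
            best_prev = max(TAGS, key=lambda p: score[p] + safe_log(TRANS[p][t]))
            pointer[t] = best_prev
            new_score[t] = (score[best_prev] + safe_log(TRANS[best_prev][t])
                            + safe_log(EMIT[t][word]))
        backpointers.append(pointer)
        score = new_score

    # follow the back-pointers from the best final tag to recover the path
    last = max(TAGS, key=lambda t: score[t])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, score[last]

path, log_prob = viterbi_log(["cats", "drink", "fresh", "milk"])
print(path)                 # ['noun', 'verb', 'adj', 'noun']
print(math.exp(log_prob))   # close to 0.000128
```

Storing one back-pointer per tag and position keeps memory linear in the sentence length, whereas copying the whole path at every step, as in the cell above, is simpler to read but copies lists repeatedly.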


In [18]:
# recursive algorithm for hmm (without dp)
def pi(words, s, i):
  if i == 0 and s == "start":
    return 1.0
  
  if i == 0:
    return 0.0

  best_prob = 0.0
  best_tag = ''

  for t in all_tags:
    prob = transitions[t][s] * emissions[s][words[i]] * pi(words, t, i-1)
    if prob > best_prob:
      best_prob = prob
      best_tag = t
  
  return best_prob

words = ["", "cats","drink","fresh","milk"]   # dummy string at index 0 lines up with the 'start' tag; without it "cats" is never scored

best_prob = 0.0
for tag in all_tags:
  best_prob = max(best_prob, pi(words, tag, len(words)-1))

print(best_prob)

0.00012800000000000008


In [19]:
# recursive algorithm with dp-table mem
mem = dict()

def pi(words, s, i):
  if i == 0 and s == "start":
    return 1.0
  
  if i == 0:
    return 0.0

  if (s, i) in mem:     # memoization / caching / dynamic programming
    return mem[(s, i)]

  best_prob = 0.0
  best_tag = ''

  for t in all_tags:
    prob = transitions[t][s] * emissions[s][words[i]] * pi(words, t, i-1)
    if prob > best_prob:
      best_prob = prob
      best_tag = t
  
  mem[(s, i)] = best_prob  # save result to dp table

  return mem[(s, i)]

words = ["", "cats","drink","fresh","milk"]   # dummy string at index 0 lines up with the 'start' tag; without it "cats" is never scored

best_prob = 0.0
for tag in all_tags:
  best_prob = max(best_prob, pi(words, tag, len(words)-1))

print(best_prob)

0.00012800000000000008


The two solutions above only print the maximum probability of the best tag sequence, not the tags themselves. To recover the tags, we need to save the best path for each tag at each position i.

In [20]:
# recursive algorithm with dp-table mem and viterbi table

def pi(words, s, i):
  if i == 0 and s == "start":
    return 1.0
  
  if i == 0:
    return 0.0

  if (s, i) in mem:     # memoization / caching / dynamic programming
    return mem[(s, i)]

  best_prob = 0.0
  best_tag = ''

  for t in all_tags:
    prob = transitions[t][s] * emissions[s][words[i]] * pi(words, t, i-1)
    if prob > best_prob:
      best_prob = prob
      best_tag = t
  
  mem[(s, i)] = best_prob  # save the best probability to the dp table

  if best_tag != '':
    viterbi[i][s] = viterbi[i-1][best_tag] + [s]   # save the best tag state to the dp-table

  return mem[(s, i)]



words = ["", "cats","drink","fresh","milk"]   # add a dummy string at the start to match with the 'start' tag

mem = dict()
viterbi = [{tag: [] for tag in all_tags} for i in range(len(words))]

best_prob = 0.0
best_seq = []

for tag in all_tags:
  prob = pi(words, tag, len(words)-1)
  if best_prob < prob:
    best_prob = prob
    best_seq = viterbi[-1][tag]

print(best_prob)
print(best_seq)

0.00012800000000000008
['noun', 'verb', 'adj', 'noun']


**Exercise 7:**

Consider the following corpus:

In [21]:
sentences = [
    ["cats","drink","milk"],
    ["dogs","drink","water"],
    ["fresh","milk"],
    ["dogs","drink","fresh","milk"],
    ["cats","milk"]
]

tagged = [
    ["noun","verb","noun"],
    ["noun","verb","noun"],
    ["adj","noun"],
    ["noun","verb","adj","noun"],
    ["noun","noun"]
]

Write a function that learns the emission and transition probabilities of the Hidden Markov Model from this corpus.

In [22]:
# refer to slides 44-47
def hmm_learn(sentences, tagged):
    transitions = {t:{t2:0.0 for t2 in all_tags} for t in all_tags}
    emissions    = {t:{w:0.0 for w in all_words} for t in all_tags}

    transition_counts = {t:{t2:0 for t2 in all_tags} for t in all_tags}
    emission_counts    = {t:{w:0 for w in all_words} for t in all_tags}
    
    # count all tag transitions and word emissions
    for sentence, tags in zip(sentences, tagged):
      prev = 'start'
      
      for word, tag in zip(sentence, tags):
        transition_counts[prev][tag] += 1
        emission_counts[tag][word] += 1
        prev = tag

    # normalize counts to get transition and emission probabilities
    for tag in all_tags:
      transition_from = sum(transition_counts[tag].values())

      if transition_from > 0:  # avoid dividing by zero
        for next_tag in all_tags:
          transitions[tag][next_tag] = transition_counts[tag][next_tag] / transition_from
      
      emission_from = sum(emission_counts[tag].values())
      if emission_from > 0:   # avoid dividing by zero
        for word in all_words:
          emissions[tag][word] = emission_counts[tag][word] / emission_from
      
    return transitions, emissions

transitions, emissions = hmm_learn(sentences, tagged)

import pprint
print('Transitions')
pprint.pprint(transitions)
print('\n')
print('Emissions')
pprint.pprint(emissions)

Transitions
{'adj': {'adj': 0.0, 'noun': 1.0, 'start': 0.0, 'verb': 0.0},
 'noun': {'adj': 0.0, 'noun': 0.25, 'start': 0.0, 'verb': 0.75},
 'start': {'adj': 0.2, 'noun': 0.8, 'start': 0.0, 'verb': 0.0},
 'verb': {'adj': 0.3333333333333333,
          'noun': 0.6666666666666666,
          'start': 0.0,
          'verb': 0.0}}


Emissions
{'adj': {'cats': 0.0,
         'dogs': 0.0,
         'drink': 0.0,
         'fresh': 1.0,
         'milk': 0.0,
         'water': 0.0},
 'noun': {'cats': 0.2222222222222222,
          'dogs': 0.2222222222222222,
          'drink': 0.0,
          'fresh': 0.0,
          'milk': 0.4444444444444444,
          'water': 0.1111111111111111},
 'start': {'cats': 0.0,
           'dogs': 0.0,
           'drink': 0.0,
           'fresh': 0.0,
           'milk': 0.0,
           'water': 0.0},
 'verb': {'cats': 0.0,
          'dogs': 0.0,
          'drink': 1.0,
          'fresh': 0.0,
          'milk': 0.0,
          'water': 0.0}}
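
The normalization step in `hmm_learn` is the maximum-likelihood estimate: each count is divided by the total count of its row. In the notation below, $c(t, t')$ is the number of times tag $t'$ follows tag $t$ and $c(t, w)$ is the number of times tag $t$ emits word $w$; the symbol $c$ is introduced here only for illustration.

```latex
\hat{p}(Y_{i+1}=t' \mid Y_i=t) = \frac{c(t, t')}{\sum_{t''} c(t, t'')},
\qquad
\hat{p}(X_i=w \mid Y_i=t) = \frac{c(t, w)}{\sum_{w'} c(t, w')}
```

For example, Noun is followed by Verb three times and by Noun once in the corpus, so $\hat{p}(\text{Verb} \mid \text{Noun}) = 3/4 = 0.75$. Noun emits nine words in total, four of which are "milk", so $\hat{p}(\text{milk} \mid \text{Noun}) = 4/9 \approx 0.444$.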


**Exercise 8:**

Using the probability matrices you calculated in Exercise 7, show that the probability of the sentence "fresh fresh milk" is zero. How could you change your calculation in Exercise 7 so that no sentence gets zero probability?

In [23]:
# hmm_lm reads the global transitions and emissions, so rebind them to the learned (unsmoothed) model
transitions, emissions = hmm_learn(sentences, tagged)
print(hmm_lm(["fresh","fresh","milk"]))

0.0


In [24]:
def hmm_learn2(sentences, tagged):
    transitions = {t:{t2:0.0 for t2 in all_tags} for t in all_tags}
    emissions    = {t:{w:0.0 for w in all_words} for t in all_tags}

    transition_counts = {t:{t2:1 for t2 in all_tags} for t in all_tags} # add-one to avoid zero probability
    emission_counts    = {t:{w:1 for w in all_words} for t in all_tags} # add-one to avoid zero probability
    
    # count all tag transitions and word emissions
    for sentence, tags in zip(sentences, tagged):
      prev = 'start'
      
      for word, tag in zip(sentence, tags):
        transition_counts[prev][tag] += 1
        emission_counts[tag][word] += 1
        prev = tag

    for tag in all_tags:
      transition_from = sum(transition_counts[tag].values())
      if transition_from > 0: # avoid dividing by zero
        for next_tag in all_tags:
          transitions[tag][next_tag] = transition_counts[tag][next_tag] / transition_from
      
      emission_from = sum(emission_counts[tag].values())
      if emission_from > 0: # avoid dividing by zero
        for word in all_words:
          emissions[tag][word] = emission_counts[tag][word] / emission_from
      
    return transitions, emissions

transitions, emissions = hmm_learn2(sentences, tagged)
print(hmm_lm(["fresh","fresh","milk"]))

0.003020545852657478
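
Starting every count at 1, as `hmm_learn2` does, is add-one (Laplace) smoothing. A minimal sketch of the same idea, generalized to an adjustable pseudo-count `k`, is shown below; `add_k_normalize` and `k` are hypothetical names that do not appear in the lab code, and `adj_counts` holds the raw emission counts of the Adj tag in the Exercise 7 corpus.

```python
def add_k_normalize(counts, k=1.0):
    """Turn a dict of raw counts into add-k smoothed probabilities."""
    total = sum(counts.values()) + k * len(counts)
    return {key: (count + k) / total for key, count in counts.items()}

# Raw emission counts for 'adj': "fresh" occurs twice as an adjective, nothing else does
adj_counts = {"cats": 0, "dogs": 0, "drink": 0, "water": 0, "milk": 0, "fresh": 2}

smoothed = add_k_normalize(adj_counts, k=1.0)
print(smoothed["fresh"])  # (2 + 1) / (2 + 6) = 0.375
print(smoothed["milk"])   # (0 + 1) / (2 + 6) = 0.125
```

A smaller `k` moves less probability mass from observed events to unseen ones, while any `k` greater than zero still guarantees that no event has zero probability.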
