## Assignment 4 (Part 1) (CPSC 436N): Count-based Sequence Labeling with the Viterbi Algorithm

In Part 1 of this assignment, we use the text file __cmpt-hw2-3.txt__ to build a Hidden Markov Model (HMM) that predicts the part-of-speech (POS) tags for the words in a given sentence using the *Viterbi Algorithm* (see textbook sec 8.4.5). In class we learned about the *Forward Algorithm*; Viterbi is a slight variation of it.
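To make the "slight variation" concrete, the two recurrences can be written side by side. The notation here is an assumption borrowed from the usual HMM convention rather than from this notebook: $a_{ij}$ is the transition probability from state $i$ to state $j$, $b_j(o_t)$ is the probability of state $j$ emitting observation $o_t$, and $N$ is the number of states.

$$
\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t) \qquad \text{(Forward)}
$$

$$
v_t(j) = \max_{1 \le i \le N} v_{t-1}(i)\, a_{ij}\, b_j(o_t) \qquad \text{(Viterbi)}
$$

The only change is that the sum over predecessor states becomes a max. Viterbi additionally remembers which predecessor achieved the max, so that the best tag sequence can be read back at the end.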

## 1. Loading the dataset (cmpt-hw2-3.txt)

To ease later steps, we load the cmpt-hw2-3.txt dataset (i.e., our corpus annotated with POS tags) and store it in a format different from its original one. In cmpt-hw2-3.txt, each word is followed by an underscore and a tag that represents the word’s correct part of speech in that context. For instance:

<br>

*(1) There_EX are_VBP also_RB plant_NN and_CC gift_NN shops_NNS ._.*

*(2) Tosco_NNP said_VBD it_PRP expects_VBZ BP_NNP to_TO shut_VB the_DT plant_NN ._.*

<br>

We will instead reformat and load the same info as (token, pos_tag) lists. For example:

<br>

[
  [(there, EX), (are, VBP), (also, RB), (plant, NN), (and, CC), (gift, NN), (shops, NNS), (., .)],

  [(tosco, NNP), (said, VBD), (it, PRP), (expects, VBZ), (bp, NNP), (to, TO), (shut, VB), (the, DT), (plant, NN), (., .)]
]

<br>

**Please note that we convert each token into its lower case.**

Also in this step, we obtain the set of states for the HMM (the POS tags) and the set of observations (the tokens).
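As a small illustration of the reformatting described above, the sketch below parses a single annotated line into (token, pos_tag) pairs. The helper name `parse_tagged_line` is hypothetical and not part of the assignment code; it only assumes the `word_TAG` format shown in example (1).

```python
def parse_tagged_line(line):
    """Turn 'There_EX are_VBP ._.' into [('there', 'EX'), ('are', 'VBP'), ('.', '.')]."""
    pairs = []
    for item in line.split():
        # Split on the LAST underscore, so a token that itself contains '_' stays intact.
        token, tag = item.rsplit('_', 1)
        # Tokens are lowercased; tags keep their original case.
        pairs.append((token.lower(), tag))
    return pairs


example = "There_EX are_VBP also_RB plant_NN and_CC gift_NN shops_NNS ._."
parsed = parse_tagged_line(example)
print(parsed)
```

Note that the final item `._.` splits into the token `.` and the tag `.`, which is why the period shows up as its own (token, tag) pair.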


In [1]:
data_path = 'cmpt-hw2-3.txt'

dataset = []; states = []; observations = [] #initialize dataset, state and observation list.

with open(data_path) as f: #load data from the text file; the file is closed automatically afterwards.
  for line in f:
    text_pos_list = []
    l = line.split() #split each line in the file by whitespace.
    if not l: #skip blank lines, which contain no (token, pos) pairs.
      continue
    for w in l:
      w = w.rsplit('_', 1) #each token and its pos tag are connected by an underscore; split on the last one so tokens containing '_' stay intact.
      states.append(w[1]) #add pos tag to the initial state list.
      observations.append(w[0].lower()) #lowercase the token and add it to the initial observation list.
      text_pos_list.append((w[0].lower(), w[1])) #add the (token, pos) pair into the list for the sentence.
    dataset.append(text_pos_list) #add the processed sentence list into the dataset list.

## 2. Further processing the dataset, state and observation list

In this step, we need to consider two issues:

**Q1(a)** What is the problem with the state list we just obtained? Report the problem and fix it in the code below.


**Q1(b)** The first pass over the dataset generates a fixed list of vocabulary tokens. What would happen if the input sentence we want to tag contained unknown tokens, i.e., tokens that are not covered by the vocabulary list?
The code below implements one possible way to address this problem. Please add comments to the lines marked "COMMENT NEEDED", and describe the implemented solution in your own words.
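For intuition only, here is a toy sketch of the general idea of reserving a placeholder token for rare words. It is not the assignment's code: the token list, the threshold `min_count`, and the variable names are all made up for illustration, and it uses `collections.Counter` so that each token is counted in a single pass.

```python
from collections import Counter

toy_tokens = ['the', 'plant', 'the', 'shop', 'plant', 'gift', '.', '.']
min_count = 2  # assumed threshold: tokens seen fewer than 2 times count as rare

counts = Counter(toy_tokens)
# Tokens below the threshold are collected in a set for fast membership tests.
rare = {tok for tok, c in counts.items() if c < min_count}
# Every rare token is replaced by the shared placeholder '_unk_'.
replaced = ['_unk_' if tok in rare else tok for tok in toy_tokens]
vocabulary = sorted(set(replaced))

print(replaced)
print(vocabulary)
```

Because some training tokens now appear as `_unk_`, the model ends up with counts for that placeholder, which is what lets it assign a probability to a word it has never seen.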



In [None]:
#SOLUTION Q1(a)

#*************

unk_list = []; #COMMENT NEEDED Q1(b)

for o in tuple(set(observations)):
  if observations.count(o) < 2:
     unk_list.append(o) #COMMENT NEEDED Q1(b)

for i in range(len(observations)):
  if observations[i] in unk_list:
     observations[i] = '_unk_' #COMMENT NEEDED Q1(b)

observations = tuple(set(observations)) #COMMENT NEEDED Q1(b)

for text_pos_i in range(len(dataset)): #COMMENT NEEDED Q1(b)
  for w_i in range(len(dataset[text_pos_i])):
    if dataset[text_pos_i][w_i][0] in unk_list:
       dataset[text_pos_i][w_i] = ('_unk_', dataset[text_pos_i][w_i][1]) 


## 3. Computing the Emission Probabilities

The emission probability matrix can be represented as a dictionary. **Q2:** Please try to understand the structure of this dictionary and complete the following code to obtain the emission probability matrix, with a final normalization + smoothing step.
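The nested-dictionary structure and the effect of add-one smoothing are easier to see on toy numbers. The sketch below is an illustration under assumptions: `add_one_normalize`, the toy vocabulary, and the toy counts are hypothetical, and the helper expects `all_outcomes` to contain no duplicates.

```python
def add_one_normalize(counts, all_outcomes):
    """Add-one (Laplace) smoothing over a fixed outcome set; returns a new dict.

    Assumes all_outcomes has no duplicates.
    """
    # Each outcome gets one extra pseudo-count, so the denominator grows by len(all_outcomes).
    total = sum(counts.get(o, 0) for o in all_outcomes) + len(all_outcomes)
    return {o: (counts.get(o, 0) + 1) / total for o in all_outcomes}


toy_vocab = ['plant', 'shop', '_unk_']
# Outer key: state (POS tag). Inner key: observation (token). Value: raw count.
toy_emission_counts = {'NN': {'plant': 2, 'shop': 1}, 'VB': {'plant': 1}}

toy_emission = {tag: add_one_normalize(c, toy_vocab)
                for tag, c in toy_emission_counts.items()}
print(toy_emission['NN'])
print(toy_emission['VB'])
```

After smoothing, every tag has an entry for every token in the vocabulary, so a lookup such as `toy_emission['VB']['shop']` returns a small non-zero probability instead of raising a `KeyError`.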



In [None]:
def normalize_and_smoothing(d, all_elements):
   for e in all_elements:
     if e not in d.keys():
       d[e] = 0
   raw = sum(d.values())+len(d)
   factor = float(1)/raw
   output = {key:(value+1)*factor for key,value in d.items()}
   return output

emission_probability = {} #initialize the emission probability matrix.
for sent in dataset: #traverse the dataset to obtain the emission probability matrix.
  for i in range(len(sent)):
    if sent[i][1] not in emission_probability.keys(): #if the state (pos) is not in the key list of emission probability dict yet, initialize a subdict for it and add the corresponding observation (token).
       emission_probability[sent[i][1]] = {}
       emission_probability[sent[i][1]][sent[i][0]] = 1
    else: #if the state (pos) is in the key list of emission probability dict, update the frequency of the corresponding observation (token).
      if sent[i][0] not in emission_probability[sent[i][1]].keys():
         emission_probability[sent[i][1]][sent[i][0]] = 1
      else:
         emission_probability[sent[i][1]][sent[i][0]] += 1
            
#****** SOLUTION Q2 *******

#*************

## 4. Computing the Start and Transition Probabilities

The transition and start probability matrices can similarly be represented as dictionaries. **Q2 (cont'd):** Please implement the normalization also for the transition probabilities.
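As a compact illustration of what is being counted, the sketch below collects start counts and tag-bigram (transition) counts from two made-up tag sequences. The data and variable names are hypothetical, and `defaultdict(Counter)` is just one convenient way to build the same nested-dictionary shape.

```python
from collections import Counter, defaultdict

toy_tag_sequences = [['DT', 'NN', 'VBZ'], ['DT', 'JJ', 'NN']]

transition_counts = defaultdict(Counter)  # transition_counts[prev_tag][next_tag] = count
start_counts = Counter()                  # start_counts[tag] = times tag opens a sentence

for tags in toy_tag_sequences:
    start_counts[tags[0]] += 1
    # zip pairs each tag with its successor: (DT, NN), (NN, VBZ), ...
    for prev_tag, next_tag in zip(tags, tags[1:]):
        transition_counts[prev_tag][next_tag] += 1

print(dict(start_counts))
print({tag: dict(c) for tag, c in transition_counts.items()})
```

In this toy data `VBZ` only ever occurs at the end of a sequence, so it never gets a row of its own. The same can happen in a real corpus for tags that only occur sentence-finally, which is worth keeping in mind when normalizing, since every state needs a row before the probabilities can be looked up.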

In [None]:
transition_probability = {} #initialize the transition probability matrix.

for sent in dataset: #traverse the dataset to obtain the transition probability matrix.
  for i in range(len(sent)-1):
    if sent[i][1] not in transition_probability.keys(): #if the current state (pos) is not in the key list of transition probability dict yet, initialize a subdict for it and add its subsequent state.
       transition_probability[sent[i][1]] = {}
       transition_probability[sent[i][1]][sent[i+1][1]] = 1
    else: #if the state (pos) is in the key list of transition probability dict, update the frequency of its subsequent state.
      if sent[i+1][1] not in transition_probability[sent[i][1]].keys():
         transition_probability[sent[i][1]][sent[i+1][1]] = 1
      else:
         transition_probability[sent[i][1]][sent[i+1][1]] += 1

#********* SOLUTION Q2 *******


#*******************************


start_probability = {} #initialize the start probability dictionary.
for sent in dataset: #construct the start probability dictionary which contains the frequency of each state appearing at the start position of a sentence.
  if sent[0][1] not in start_probability.keys():
     start_probability[sent[0][1]] = 1
  else:
     start_probability[sent[0][1]] += 1
start_probability = normalize_and_smoothing(start_probability, states) #normalize the start probability dict.

## 5. Implementing the Viterbi Algorithm 

The code below is implementing the Viterbi algorithm. <br>
**Q3**: Please add comments to the lines with "COMMENT NEEDED". 
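For comparison, here is a separate, self-contained sketch of Viterbi decoding on a two-state toy HMM. It is not the function used in this notebook: the name `viterbi_log`, the toy probabilities, and the choice to work with log probabilities and an explicit backpointer table are all assumptions made for illustration. It also assumes every probability it looks up is strictly positive, since `math.log(0)` raises an error.

```python
import math


def viterbi_log(obs, states, start_p, trans_p, emit_p):
    """Most probable state sequence for obs, using log probabilities and backpointers."""
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    backptr = [{}]
    for t in range(1, len(obs)):
        best.append({})
        backptr.append({})
        for s in states:
            # Pick the predecessor that maximizes the score of arriving in s.
            prev = max(states, key=lambda p: best[t - 1][p] + math.log(trans_p[p][s]))
            best[t][s] = (best[t - 1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][obs[t]]))
            backptr[t][s] = prev
    # Start from the best final state and follow the backpointers to the beginning.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return path


toy_states = ['N', 'V']
toy_start = {'N': 0.8, 'V': 0.2}
toy_trans = {'N': {'N': 0.3, 'V': 0.7}, 'V': {'N': 0.6, 'V': 0.4}}
toy_emit = {'N': {'dogs': 0.7, 'bark': 0.3}, 'V': {'dogs': 0.1, 'bark': 0.9}}

toy_path = viterbi_log(['dogs', 'bark'], toy_states, toy_start, toy_trans, toy_emit)
print(toy_path)  # ['N', 'V']
```

Summing log probabilities instead of multiplying raw probabilities gives the same best path, and it avoids numerical underflow on long sentences, where a product of many small numbers can round to zero.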

In [None]:
def Viterbi(input_observations, states, start_probability, transition_probability, emission_probability):
    path = {s:[] for s in states} #this dictionary saves the most probable path leading to each state.
    curr_probability = {}
    for s in states:
        curr_probability[s] = start_probability[s]*emission_probability[s][input_observations[0]] #"COMMENT NEEDED" (Q3)
    for i in range(1, len(input_observations)):
        last_probability = curr_probability #update the current state probabilities to the ones obtained in the last step.
        curr_probability = {} #initialize a new dict to save the state probabilities for this step. 
        new_path = {} #paths for this step are built from the previous step's paths, so they are kept separate until the step is done.
        for curr_state in states: #for each possible state in this step, compute all possible paths and their probabilities towards this state.
            possible_paths = [(last_probability[last_state]*transition_probability[last_state][curr_state]*emission_probability[curr_state][input_observations[i]], last_state) for last_state in states]
            max_probability, last_state = max(possible_paths) #"COMMENT NEEDED" (Q3)
            curr_probability[curr_state] = max_probability #update the current state probabilities.
            new_path[curr_state] = path[last_state] + [last_state] #record the optimal path towards this state by extending the best path that ends in last_state.
        path = new_path #replace the previous step's paths with the ones for this step.
    for s in states:
        path[s].append(s) #complete the path with each state as the end.

    #"COMMENT NEEDED" (Q3)
    sorted_probability = sorted(curr_probability.items(), key=lambda x: x[1], reverse=True)
    the_last_state = sorted_probability[0][0]
    optimal_path = path[the_last_state]

    return optimal_path

## 6. Applying the Viterbi Algorithm

**Q4:** Apply the Viterbi Algorithm to the following sentences (remember to deal with unknown words): <br>
(a) "there are also many plants in the square." <br>
(b) "we will move the table into cs123." <br>

Please report the tagging results.
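Before a raw sentence can be tagged, it has to look like the training data: lowercased, with the final period as its own token, and with out-of-vocabulary words mapped to the placeholder. The sketch below shows one possible way to do that. The helper `prepare_input` and the toy vocabulary are hypothetical; in the notebook the vocabulary would be the processed `observations`.

```python
def prepare_input(sentence, vocabulary, unk_token='_unk_'):
    """Lowercase, split off a sentence-final period, and map unknown tokens to unk_token."""
    tokens = []
    for raw in sentence.lower().split():
        if len(raw) > 1 and raw.endswith('.'):
            tokens.extend([raw[:-1], '.'])  # 'cs123.' -> 'cs123', '.'
        else:
            tokens.append(raw)
    # Anything the model has never seen is replaced by the placeholder.
    return [tok if tok in vocabulary else unk_token for tok in tokens]


# Toy vocabulary for illustration only.
toy_vocabulary = {'we', 'will', 'move', 'the', 'table', 'into', '.', '_unk_'}
prepared = prepare_input("we will move the table into cs123.", toy_vocabulary)
print(prepared)
```

This simple splitting rule treats any trailing period as sentence punctuation, so it would mishandle abbreviations that end in a period. That is acceptable for the two sentences here but is a simplification.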

In [None]:
# SOLUTION Q4(a)


In [None]:
# SOLUTION Q4(b)
