
##Part-of-speech tagging with HMMs##
Implement a bigram part-of-speech (POS) tagger based on Hidden Markov Models from scratch. Using NLTK is disallowed, except for the modules explicitly listed below. For this, you will need to develop and/or utilize the following modules:

1. Corpus reader and writer (10 points)
2. Training procedure (30 points)
3. Viterbi tagging, including unknown word handling (50 points) 
4. Evaluation (10 points)

The task is mostly very straightforward, but each step requires careful design. Thus, we suggest you proceed in the following way.


##Viterbi algorithm.##

First, implement the Viterbi algorithm for finding the optimal state (tag) sequence given the sequence of observations (words). We suggest you test your implementation on a small example for which you know the correct tag sequence, such as Eisner’s Ice Cream HMM from the lecture.
Make sure your Viterbi algorithm runs properly on the example before you proceed to the next step. Submit the best state sequence x that your Viterbi implementation finds for y = 3, 1, 3 and its joint probability P(x, y).
There are plenty of other detailed illustrations for the Viterbi algorithm on the Web from which you can take example HMMs. Please resist the temptation to copy Python code from those websites; that would be plagiarism.

In [8]:
from typing import List, Any, Dict, Tuple
import numpy as np
from collections import deque

In [9]:
#@title Basic HMM
class HMM:
    def __init__(
            self,
            states: List[Any],
            state_transition_probs: List[List[float]],
            initial_probs: List[float],
            emission_probs: List[Dict[Any, float]],
    ):
        """Initializes Hidden Markov Model."""

        assert len(states) == len(initial_probs)
        assert len(states) == len(state_transition_probs)
        assert len(states) == len(emission_probs)

        self.states = states
        self.state_transition_probs = state_transition_probs
        self.initial_probs = initial_probs
        self.emission_probs = emission_probs

    def viterbi(self, observations: List[Any]) -> Tuple[List[Any], float]:
        """Accepts list of observations (words) and returns the optimal state (tag) sequence and its probability."""
        if not observations:
            return [], 0

        viterbi_matrix = [[0 for j in range(len(observations))] for i in range(len(self.states))]
        backpointers_matrix = [[0 for j in range(len(observations))] for i in range(len(self.states))]

        for state_i in range(len(self.states)):
            viterbi_matrix[state_i][0] = self.initial_probs[state_i] * self.emission_probs[state_i][observations[0]]

        for t in range(1, len(observations)):
            for state_j in range(len(self.states)):
                cur_token_prob = self.emission_probs[state_j][observations[t]]
                state_i_to_j_probs = [
                    viterbi_matrix[state_i][t - 1] * cur_token_prob * self.state_transition_probs[state_i][state_j]
                    for state_i in range(len(self.states))
                ]

                max_index = np.argmax(state_i_to_j_probs)
                backpointers_matrix[state_j][t] = max_index
                viterbi_matrix[state_j][t] = state_i_to_j_probs[max_index]

        last_col_viterbi = [viterbi_matrix[i][len(observations) - 1] for i in range(len(self.states))]
        max_prob_index = np.argmax(last_col_viterbi)
        max_prob_value = last_col_viterbi[max_prob_index]

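        # Backtrack from the best final state. backpointers_matrix[j][t] holds the best
        # predecessor (at time t - 1) of state j at time t, so column t is read for step t.
        # A deque with appendleft avoids reversing the list afterwards.
        path = deque([max_prob_index, ])
        for t in range(len(observations) - 1, 0, -1):
            path.appendleft(backpointers_matrix[path[0]][t])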

        return [self.states[state_i] for state_i in path], np.round(max_prob_value, 4)

In [10]:
model = HMM(states=["H", "C"],
            state_transition_probs=[[0.7, 0.3], [0.4, 0.6]],
            initial_probs=[0.8, 0.2],
            emission_probs=[
                {1: 0.2, 2: 0.4, 3: 0.4},
                {1: 0.5, 2: 0.4, 3: 0.1},
            ])

print(model.viterbi(observations=[3, 1, 3]))

(['H', 'H', 'H'], 0.0125)
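
One way to sanity-check a Viterbi implementation on an HMM this small is to enumerate every possible state sequence and pick the one with the highest joint probability. The following is a minimal sketch under that assumption; it reuses the Ice Cream parameters from the cell above, and `joint_prob` and `brute_force_best_path` are hypothetical helper names, not part of the assignment.

```python
from itertools import product

# Ice Cream HMM parameters copied from the cell above.
states = ["H", "C"]
transition = [[0.7, 0.3], [0.4, 0.6]]
initial = [0.8, 0.2]
emission = [
    {1: 0.2, 2: 0.4, 3: 0.4},
    {1: 0.5, 2: 0.4, 3: 0.1},
]


def joint_prob(path, observations):
    """Joint probability P(x, y) of a state-index path x and observations y."""
    prob = initial[path[0]] * emission[path[0]][observations[0]]
    for t in range(1, len(observations)):
        prob *= transition[path[t - 1]][path[t]] * emission[path[t]][observations[t]]
    return prob


def brute_force_best_path(observations):
    """Enumerates all len(states) ** T paths; only feasible for tiny examples."""
    best = max(
        product(range(len(states)), repeat=len(observations)),
        key=lambda path: joint_prob(path, observations),
    )
    return [states[i] for i in best], joint_prob(best, observations)


path, prob = brute_force_best_path([3, 1, 3])
print(path, round(prob, 6))  # ['H', 'H', 'H'] 0.012544
```

Comparing this against the Viterbi result on several observation sequences is more informative than a single check. A sequence such as `[3, 1, 1]`, whose best path switches states, exercises the backtracking step more than `[3, 1, 3]` does.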


In [11]:
#@title Refactored HMM with numpy

# Here is a small experiment with numpy arrays instead of standard Python lists.
# The code ends up more concise thanks to native vector operations.
class HMM_np:
    def __init__(
            self,
            states: List[Any],
            state_transition_probs: List[List[float]],
            initial_probs: List[float],
            emission_probs: List[Dict[Any, float]],
    ):
        """Initializes Hidden Markov Model."""

        assert len(states) == len(initial_probs)
        assert len(states) == len(state_transition_probs)
        assert len(states) == len(emission_probs)

        self.states = np.asarray(states)
        self.state_transition_probs = np.asarray(state_transition_probs)
        self.initial_probs = np.asarray(initial_probs)
        self.emission_probs = np.asarray(emission_probs)

    def viterbi(self, observations: List[Any]) -> Tuple[List[Any], float]:
        """Accepts list of observations (words) and returns the optimal state (tag) sequence and its probability."""
        if not observations:
            return [], 0

        n, m = len(self.states), len(observations)

        backpointers_matrix = np.zeros((n, m), 'int')
        viterbi_matrix = np.zeros((n, m))
        initial_emissions_vector = [self.emission_probs[state_i][observations[0]] for state_i in range(n)]
        viterbi_matrix[:, 0] = self.initial_probs * initial_emissions_vector

        for t in range(1, m):
            for state_j in range(n):
                cur_token_prob = self.emission_probs[state_j][observations[t]]
                state_i_to_j_probs = cur_token_prob * viterbi_matrix[:, t - 1] * self.state_transition_probs[:, state_j]

                max_index = np.argmax(state_i_to_j_probs)
                backpointers_matrix[state_j, t] = max_index
                viterbi_matrix[state_j, t] = state_i_to_j_probs[max_index]

        last_col_viterbi = viterbi_matrix[:, m - 1]
        max_prob_index = np.argmax(last_col_viterbi)
        max_prob_value = last_col_viterbi[max_prob_index]

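        # Backtrack from the best final state. backpointers_matrix[j, t] holds the best
        # predecessor (at time t - 1) of state j at time t, so column t is read for step t.
        # A deque with appendleft avoids reversing the list afterwards.
        path = deque([max_prob_index, ])
        for t in range(m - 1, 0, -1):
            path.appendleft(backpointers_matrix[path[0], t])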

        return [self.states[state_i] for state_i in path], np.round(max_prob_value, 4)

In [12]:
model_np = HMM_np(states=["H", "C"],
            state_transition_probs=[[0.7, 0.3], [0.4, 0.6]],
            initial_probs=[0.8, 0.2],
            emission_probs=[
                {1: 0.2, 2: 0.4, 3: 0.4},
                {1: 0.5, 2: 0.4, 3: 0.1},
            ])

print(model_np.viterbi(observations=[3, 1, 3]))

(['H', 'H', 'H'], 0.0125)


##Training.##

Second, learn the parameters of your HMM from data, i.e. the initial, transition, and emission probabilities. Implement a maximum likelihood training procedure for supervised learning of HMMs.
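
For supervised data, maximum likelihood estimation reduces to relative frequencies of counts. The sketch below illustrates this on a made-up toy corpus. It assumes sentences are lists of `(word, tag)` pairs, applies no smoothing, and does not model an end-of-sentence transition. The function name `train_hmm` and the dict-of-dicts layout are illustrative choices and would need converting to fit the list-based `HMM` class above.

```python
from collections import Counter, defaultdict


def train_hmm(tagged_sentences):
    """Estimates initial, transition, and emission probabilities by relative frequency.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    """
    initial_counts = Counter()
    transition_counts = defaultdict(Counter)
    emission_counts = defaultdict(Counter)

    for sentence in tagged_sentences:
        if not sentence:
            continue
        # The tag of the first token contributes to the initial distribution.
        initial_counts[sentence[0][1]] += 1
        for word, tag in sentence:
            emission_counts[tag][word] += 1
        # Count tag bigrams within the sentence.
        for (_, prev_tag), (_, tag) in zip(sentence, sentence[1:]):
            transition_counts[prev_tag][tag] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    initial = normalize(initial_counts)
    transitions = {tag: normalize(c) for tag, c in transition_counts.items()}
    emissions = {tag: normalize(c) for tag, c in emission_counts.items()}
    return initial, transitions, emissions


# Made-up toy corpus, for illustration only.
toy_corpus = [
    [("der", "DET"), ("Hund", "NOUN"), ("bellt", "VERB")],
    [("die", "DET"), ("Katze", "NOUN"), ("schnurrt", "VERB")],
    [("Hunde", "NOUN"), ("bellen", "VERB")],
]
initial, transitions, emissions = train_hmm(toy_corpus)
print(initial)
print(transitions["DET"])
print(emissions["NOUN"])
```

Tags or words that never occur in the training data simply have no entry, which corresponds to probability zero.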
You can get a corpus at http://www.coli.uni-saarland.de/~koller/materials/anlp/de-utb.zip. It contains a training set, a test set, and an evaluation set. The training set (de-train.tt) and the evaluation set (de-eval.tt) are written in the commonly used CoNLL format. They are text files with two columns: the first column contains the words, the second column contains the POS tags, and empty lines delimit sentences. The test set (de-test.t) is a copy of the evaluation set with the tags stripped; you should tag the test set using your tagger and then compare your results with the gold-standard tags in the evaluation set. The corpus uses the 12-tag universal POS tagset by Petrov et al. (2012). Feel free to use the NLTK module nltk.corpus.reader (and submodules) for reading the corpus.
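
If you prefer not to use NLTK for reading, the two-column format described above can be parsed in a few lines. This is a minimal sketch, assuming that columns are separated by whitespace and that words contain no whitespace themselves; `read_conll` is a hypothetical helper name, and the sample text is made up.

```python
def read_conll(lines):
    """Yields sentences from two-column CoNLL-style lines.

    Each sentence is a list of (word, tag) pairs. For untagged input with a
    single column, the tag is None. Empty lines delimit sentences.
    """
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        word = columns[0]
        tag = columns[1] if len(columns) > 1 else None
        sentence.append((word, tag))
    if sentence:  # the input may not end with an empty line
        yield sentence


# Made-up sample in the two-column format.
sample = """Der\tDET
Hund\tNOUN
bellt\tVERB

Katzen\tNOUN
schlafen\tVERB
"""
sentences = list(read_conll(sample.splitlines()))
print(sentences)
```

Because the function accepts any iterable of lines, an open file handle for de-train.tt or de-test.t can be passed directly; the file encoding is not stated in the assignment and would need to be checked.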

##Evaluation.##

Once you have trained a model, evaluate it on the unseen data from the test set. Run the Viterbi algorithm with each of your models, and output a tagged corpus in the two-column CoNLL format (*.tt). We will provide an evaluation script on Classroom. Run it on the output of your tagger and the evaluation set and report your results.
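
Writing the tagged output and getting a quick accuracy figure can be sketched as follows. This assumes a tab as the column separator and plain token-level accuracy as the metric; the provided evaluation script may report different or additional numbers. `write_conll` and `token_accuracy` are hypothetical helper names, and the example sentences are made up.

```python
import io


def write_conll(tagged_sentences, out):
    """Writes (word, tag) sentences to a file-like object in two-column format."""
    for sentence in tagged_sentences:
        for word, tag in sentence:
            out.write(f"{word}\t{tag}\n")
        out.write("\n")  # an empty line delimits sentences


def token_accuracy(gold_sentences, predicted_sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        assert len(gold) == len(predicted), "sentence lengths must match"
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


gold = [[("Der", "DET"), ("Hund", "NOUN"), ("bellt", "VERB")]]
predicted = [[("Der", "DET"), ("Hund", "NOUN"), ("bellt", "NOUN")]]

buffer = io.StringIO()  # stands in for an output file such as a *.tt file
write_conll(predicted, buffer)
print(buffer.getvalue())
print(token_accuracy(gold, predicted))
```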
Note that your tagger will initially fail to produce output for sentences that contain words you haven’t seen in training. If you have such a word $w$ appear at sentence position $t$, you will have $b_j(w) = 0$ for all states/tags $j$, and therefore $V_t(j) = 0$ for all $j$. Adapt your tagger by implementing the following crude approach to unknown words. Whenever you get $V_t(j) = 0$ for all $j$ because of an unknown word $w$ at position $t$, pretend that $b_j(w) = 1$ for all $j$. This will basically set $V_t(j) = \max_i V_{t-1}(i) \cdot a_{ij}$, and allow you to interpolate the missing POS tag based on the transition probabilities alone.
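
The fallback described above can be isolated in a single column update of the Viterbi recursion. The following is a minimal numpy sketch, assuming the previous column, the transition matrix, and the emission probabilities of the current word are available as arrays; `viterbi_step` is a hypothetical helper name, and the numbers are taken from the Ice Cream example for illustration.

```python
import numpy as np


def viterbi_step(prev_scores, transition, emission_for_word):
    """One Viterbi column update with the crude unknown-word fallback.

    prev_scores: V_{t-1}(i), shape (n,)
    transition: a_ij, shape (n, n)
    emission_for_word: b_j(w) for the current word w, shape (n,)
    """
    # candidates[i, j] = V_{t-1}(i) * a_ij
    candidates = prev_scores[:, None] * transition
    backpointers = candidates.argmax(axis=0)
    best_prev = candidates.max(axis=0)

    scores = best_prev * emission_for_word
    if not scores.any():
        # V_t(j) = 0 for all j: pretend b_j(w) = 1 for every tag j,
        # so the transition probabilities alone decide.
        scores = best_prev
    return scores, backpointers


prev_scores = np.array([0.32, 0.02])
transition = np.array([[0.7, 0.3], [0.4, 0.6]])
unknown_emission = np.zeros(2)  # word never seen with any tag

scores, backpointers = viterbi_step(prev_scores, transition, unknown_emission)
print(scores, backpointers)
```

Note that the check fires for any all-zero column, so it also covers a known word whose seen tags are all unreachable from the previous column.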

##Extra credit.##

The task is challenging as it stands. However, feel free to go further for extra credit, e.g. by doing one of the following: implement better unknown word handling, use a trigram tagger, plot a learning curve for your tagger (accuracy as a function of training data size), plot a speed vs. sentence length curve.

Please submit your code, instructions for running your tagger and tagging output(s). Document any additional data you submit. With this, you will have implemented your first POS tagger! Well done!