Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your collaborators below:

In [1]:
COLLABORATORS = ""

---

In [2]:
import numpy as np

# special import for running the viterbi algorithm
from hmm import viterbi

<div class='alert alert-success'>In this problem, we will use a hidden Markov model (HMM) as a model of human language. You can consult AIMA3, pages 578-581 and 892-896, or AIMA2, pages 549-551, 834-840 for help.</div>

![](images/hmm.png)

Recall that in an HMM a sequence of $T$ observations $(w_1,w_2,\ldots,w_T)$ is assumed to be generated by an underlying sequence of $T$ hidden (i.e., unobserved) *state variables*, $(s_1,s_2,\ldots,s_T)$. The *Markov* in the model name comes from the assumption that the next state variable, $s_{t+1}$, depends only on the current state, $s_{t}$. In this way, the state variables $(s_1,s_2,\ldots,s_T)$ form a **Markov chain**. It is assumed that each observation variable $w_t$ depends only on the corresponding state variable, $s_t$ (this is reflected in the diagram above—at step $t$ an observation node/variable $w_t$ is connected to a single state node/variable $s_t$).

Using the HMM as a model of human language, the variables $(w_1,w_2,\ldots,w_T)$ correspond to the sequence of words in an observed sentence of length $T$, and the hidden state variables $(s_1,s_2,\ldots,s_T)$ correspond to the (unobserved) parts of speech of the words in the sentence, such as *noun*, *verb*, and *adjective*.

Recall that an HMM has three sets of parameters: (1) the initial state probability distribution $P(s_1)$, which specifies the probabilities with which the first state takes on each of its possible values, (2) the transition probabilities $P(s_t | s_{t-1})$, which specify the probability of the next hidden state given the previous one, and (3) the emission probabilities $P(w_t | s_t)$, which specify the probability of the observation at time $t$ given the hidden state at time $t$.
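Taken together, these three parameter sets define the joint probability of a state sequence and an observation sequence: $P(s_1,\ldots,s_T,w_1,\ldots,w_T) = P(s_1)\,P(w_1|s_1)\prod_{t=2}^{T} P(s_t|s_{t-1})\,P(w_t|s_t)$. The following is a minimal sketch of that product on a hypothetical two-state, two-symbol toy model. The names `init`, `trans`, `emit`, and `joint_prob` and all the numbers are illustrative assumptions, not part of the assignment.

```python
import numpy as np

# Hypothetical toy model: 2 hidden states, 2 observation symbols.
init = np.array([0.6, 0.4])      # init[i]     = P(s_1 = i)
trans = np.array([[0.7, 0.2],    # trans[i, j] = P(s_t = i | s_{t-1} = j)
                  [0.3, 0.8]])
emit = np.array([[0.9, 0.1],     # emit[k, j]  = P(w_t = k | s_t = j)
                 [0.1, 0.9]])

def joint_prob(states, obs, init, trans, emit):
    """Joint probability P(s_1..s_T, w_1..w_T) of one state/observation sequence."""
    # First step: initial state probability times its emission probability.
    p = init[states[0]] * emit[obs[0], states[0]]
    # Later steps: one transition factor and one emission factor each.
    for t in range(1, len(states)):
        p *= trans[states[t], states[t - 1]] * emit[obs[t], states[t]]
    return p

print(joint_prob([0, 0, 1], [0, 0, 1], init, trans, emit))
```

Summing `joint_prob` over every possible state and observation sequence of a fixed length gives 1, which is a quick sanity check that the parameters are valid.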

There are two primary applications of HMMs to language processing. The first is *state estimation*, where the model parameters are all known, and we are trying to recover the sequence of hidden states $(s_1,s_2,\ldots,s_T)$ from the sequence of observations $(w_1,w_2,\ldots,w_T)$. In our language modeling context, this would be like assuming we already have a good model of English grammar that specifies the HMM model parameters, and we want to take an observed English sentence and perform part-of-speech tagging to determine the part of speech (e.g., adjective, noun, or preposition) of each word in the sentence. The second application of HMMs is *parameter estimation*, where the model parameters are unknown, and we are trying to recover them from only a set of observation sequences of the form $(w_1,w_2,\ldots,w_T)$. Because the hidden state variables are also unknown when performing parameter estimation, they must also be estimated simultaneously. In this situation, the expectation-maximization (EM) algorithm is used to perform simultaneous state estimation and parameter estimation.


---
### Part A (2 points)

We will use a simplistic English grammar consisting of only three
parts of speech (*noun*, *verb*, and *adjective*) and ten words (*john*, *sally*, *reddit*, *love*,
*parks*, *dogs*, *exhausted*, *marbled*,
*big*, and *inappropriate*). Some of the words can be used as multiple parts of speech, as demonstrated in the following table:

<table>
<tr>
<td> Word </td><td> Noun </td><td> Verb </td><td> Adjective </td>
</tr><tr>
<td> john </td><td> x </td><td> </td><td> </td>
</tr><tr>
<td> sally </td><td> x </td><td> </td><td> </td>
</tr><tr>
<td> reddit </td><td> x </td><td> </td><td> </td>
</tr><tr>
<td> love </td><td> x </td><td> x </td><td> </td>
</tr><tr>
<td> parks </td><td> x </td><td> x </td><td> </td>
</tr><tr>
<td> dogs </td><td> x </td><td> x </td><td> </td>
</tr><tr>
<td> exhausted </td><td> </td><td> x </td><td> x </td>
</tr><tr>
<td> marbled </td><td> </td><td> x </td><td> x </td>
</tr><tr>
<td> big </td><td> </td><td> </td><td> x </td>
</tr><tr>
<td> inappropriate </td><td> </td><td> </td><td> x </td>
</tr>
</table>

Furthermore, we will assume that all sentences consist of exactly five
words, so that $T = 5$ in all cases.

In [3]:
#possible values of the hidden state: S_t=j means the part of speech is partsOfSpeech[j]
partsOfSpeech = ['noun','verb','adjective']

#initProbs[j] is P(S_0=j)
initProbs = np.array([ .59, .01, .4])

#the possible emissions: W_t=j means the word is words[j]
words = ['john','sally','reddit','love','parks','dogs',
         'exhausted','marbled','big','inappropriate']

#transitionProbs[i,j] is P(S_{t+1}=i| S_t=j)
transitionProbs = np.array([[.02, .3, .69], 
                            [.97, .01, .01], 
                            [.01, .69, .3]])

#emissionProbs[i,j] is p(W_t=i | S_t=j)
emissionProbs = np.array([[1/6, 0, 0], [1/6, 0, 0], [1/6, 0, 0], 
                          [1/6, .2, 0], [1/6, .2, 0], [1/6, .2, 0],
                          [0, .2, .25], [0, .2, .25], [0, 0, .25], 
                          [0, 0, .25]])

First we will define several variables:

`partsOfSpeech`: a list of the three possible values of the hidden state variables (corresponding to the three parts of speech). Each part of speech (*noun*, *verb*, and *adjective*) corresponds to a numerical index according to its position in this list.

`words`: a list containing each of the ten words in our vocabulary. Each item in `words` is given a numerical index according to its position in this list.

`initProbs`: A NumPy array containing the initial state probabilities $P(s_1)$, where entry $j$ gives the probability $P(s_1 = j)$ that the first hidden state is $j$.

`transitionProbs`: A NumPy array containing the transition probabilities, where the entry in row $i$ and column $j$ gives the probability $P(s_t = i | s_{t-1} = j)$ of transitioning from state $j$ to state $i$.

`emissionProbs`: A NumPy array containing the emission probabilities, where the entry in row $i$ and column $j$ gives the probability $P(w_t = i | s_t = j)$ of observing word $i$ when in state $j$.
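Because the column index is the state being conditioned on, each column of the transition and emission matrices should be a probability distribution on its own. A short check along these lines (a sketch that simply repeats the parameter values from the cell above so that it runs by itself) can catch a transposed or mistyped matrix early:

```python
import numpy as np

# Parameter values repeated from the cell above.
initProbs = np.array([.59, .01, .4])
transitionProbs = np.array([[.02, .3, .69],
                            [.97, .01, .01],
                            [.01, .69, .3]])
emissionProbs = np.array([[1/6, 0, 0], [1/6, 0, 0], [1/6, 0, 0],
                          [1/6, .2, 0], [1/6, .2, 0], [1/6, .2, 0],
                          [0, .2, .25], [0, .2, .25], [0, 0, .25],
                          [0, 0, .25]])

# initProbs is a single distribution over the three states.
init_ok = np.isclose(initProbs.sum(), 1.0)
# Each column conditions on one state, so every column must sum to 1.
trans_ok = np.allclose(transitionProbs.sum(axis=0), 1.0)
emit_ok = np.allclose(emissionProbs.sum(axis=0), 1.0)

print(init_ok, trans_ok, emit_ok)
```

Summing over `axis=0` adds up the rows within each column, which matches the convention that the entry in row $i$ and column $j$ is conditioned on state $j$.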


To produce a sentence we will choose a part of speech (POS) according to the initial probabilities. After choosing a POS, we choose a corresponding word given the emission probabilities for that POS. Thereafter we transition to the next POS according to the transition matrix, and choose another word according to the emission probabilities. This process of transitioning to a new hidden state (POS) and emitting a word repeats until we complete a five-word sentence.

<div class="alert alert-success">Complete the function generateSentence so that it generates a random sentence (each sentence contains 5 words) following the probabilities in the grammar. You may wish to use `np.random.choice` to choose among a set of options weighted by a vector of probabilities.</div>
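As a small illustration of the hint, the sketch below draws many samples with `np.random.choice` and compares the empirical frequencies with the weights. The names `probs`, `labels`, `draws`, and `freqs` are illustrative, and the seed is set only to make this one example repeatable.

```python
import numpy as np

probs = np.array([0.59, 0.01, 0.4])        # same values as initProbs
labels = ['noun', 'verb', 'adjective']

np.random.seed(0)  # only to make this illustration repeatable
# Passing an integer n samples from range(n); p gives the weight of each index.
draws = np.random.choice(len(labels), size=10000, p=probs)

# Empirical frequency of each index among the draws.
freqs = np.bincount(draws, minlength=len(labels)) / draws.size
print(dict(zip(labels, freqs)))  # should be close to probs
```

The same call works with a column of `transitionProbs` or `emissionProbs` as `p`, since each column is itself a distribution over next states or words.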



In [4]:
def generateSentence(initProbs, words, transitionProbs, emissionProbs, partsOfSpeech):        
    """
    Constructs a sentence according to the probabilities in the HMM model
    
    Parameters
    ----------
    initProbs: 1*n array
        initProbs encodes the probability of each of n hidden states
        
    words : list of length 10
        A list containing each of the ten words in our vocabulary. 

    transitionProbs: n*n array
        transitionProbs encodes the probability of transitioning from
        any of n hidden states to any other state
    
    emissionProbs: m*n array
        emissionProbs encodes the probability of emitting a particular 
        word given the current hidden state, where m is the number of words
        and n is the number of hidden states
    
    partsOfSpeech: 1*n array
        partsOfSpeech contains the names of the parts of speech that
        correspond to the indices for initProbs, transitionProbs, etc.
            
    Returns
    -------
    a dict with two keys, pos and sentence
    pos: the parts of speech of each word in the sentence
    sentence: the words in the sentence, as a list
    """
    
    # YOUR CODE HERE
    n_states = len(partsOfSpeech)
    n_words = len(words)
    word_list, pos_list = [], []

    # sample the first hidden state from the initial distribution
    # (taking the most probable state instead would always start with a noun)
    state = np.random.choice(n_states, p=initProbs)
    for t in range(5):
        pos_list.append(partsOfSpeech[state])

        # emit a word: column `state` of emissionProbs is P(w_t | s_t = state)
        word_index = np.random.choice(n_words, p=emissionProbs[:, state])
        word_list.append(words[word_index])

        # move on: column `state` of transitionProbs is P(s_{t+1} | s_t = state)
        state = np.random.choice(n_states, p=transitionProbs[:, state])

    return {"pos": pos_list, "sentence": word_list}

In [5]:
generateSentence(initProbs, words, transitionProbs, emissionProbs, partsOfSpeech)

{'pos': ['noun', 'verb', 'adjective', 'noun', 'verb'],
 'sentence': ['love', 'dogs', 'exhausted', 'parks', 'parks']}

In [6]:
# add your own test cases here!

In [7]:
from nose.tools import assert_equal

# basic length and type checking
test = generateSentence(initProbs, words, transitionProbs, emissionProbs, \
                        partsOfSpeech)
assert_equal(len(test['pos']),5)
assert_equal(len(test['sentence']),5)
assert(all([isinstance(x, str) for x in test['pos']]))
assert(all([isinstance(x, str) for x in test['sentence']]))

# check that every word has a nonzero emission probability for 
# the corresponding hidden state. Try 20 sentences
for i in range(20):
    test = generateSentence(initProbs, words, transitionProbs, emissionProbs, \
                        partsOfSpeech)
    for j in range(5):
        word = test['sentence'][j]
        singlePos = test['pos'][j]  
        assert(emissionProbs[words.index(word), 
                         partsOfSpeech.index(singlePos)] > 0) 

#Test the hidden-state sequence: Check that nouns are almost always followed by verbs. Try 100 sentences.
nrNouns=0;
nrNounVerbTransitions=0;
for i in range(100):
    test = generateSentence(initProbs, words, transitionProbs, emissionProbs, \
                        partsOfSpeech)
    for j in range(4):
        fromPos = test['pos'][j]  
        toPos = test['pos'][j+1]
        if fromPos=='noun':
            nrNouns+=1
            if toPos=='verb':
                nrNounVerbTransitions+=1
assert(abs(nrNounVerbTransitions/nrNouns - transitionProbs[1,0]) < 0.1)

print("Success!")    

Success!


<div class="alert alert-success">Judging from the randomly-sampled sentences, does the HMM seem to be a good model of our limited subset of the English language? Why or why not?</div>

No, I would say that it does not. Although the HMM captures some probabilistic syntax (which parts of speech tend to follow one another), the sentences it samples are still rarely coherent English. The sample above, "love dogs exhausted parks parks", follows a plausible noun, verb, adjective, noun, verb pattern but is not a meaningful sentence. Even within this limited subset of English, the model cannot reliably put together basic sentences by random sampling.

---
### Part B (1 point)

We will now tag parts of speech in our sentence by estimating the hidden state variables in our hidden Markov model. As described above, state estimation refers to estimating the sequence of hidden states of an HMM corresponding to a sequence of observations. This can only be performed when the model parameters (i.e., the initial hidden state probabilities, the transition probabilities, and the emission probabilities) are already known; otherwise, simultaneous _parameter estimation_ must also be done using the **EM algorithm**. 

For now, we will assume that all the parameters for our HMM are known, which makes the problem easier. We can perform state estimation in Python using the **Viterbi algorithm**. The Viterbi algorithm is a popular algorithm for HMMs that returns the sequence of hidden states which maximizes the total joint probability of all the hidden states and observations in the graphical model for the HMM.
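To make the idea concrete, here is a minimal sketch of the Viterbi recursion under the conventions used in this notebook. It is not the implementation inside the provided `hmm` module; the name `viterbi_sketch` is hypothetical, and the sketch works with raw probabilities, which is fine for five-word sentences (longer sequences usually use log probabilities to avoid underflow).

```python
import numpy as np

def viterbi_sketch(obs, init, trans, emit):
    """Most likely hidden-state path for a list of observation indices.

    Conventions: trans[i, j] = P(s_t = i | s_{t-1} = j),
                 emit[k, j]  = P(w_t = k | s_t = j).
    """
    T, n = len(obs), len(init)
    delta = np.zeros((T, n))            # delta[t, i]: best joint prob. of a path ending in state i
    back = np.zeros((T, n), dtype=int)  # back[t, i]: best predecessor of state i at step t

    delta[0] = init * emit[obs[0]]
    for t in range(1, T):
        # scores[i, j]: best path ending in state j at t-1, then moving to state i
        scores = trans * delta[t - 1]
        back[t] = scores.argmax(axis=1)
        delta[t] = scores.max(axis=1) * emit[obs[t]]

    # Trace the best path backwards from the best final state.
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Parameter values repeated from Part A so that the sketch runs by itself.
initProbs = np.array([.59, .01, .4])
transitionProbs = np.array([[.02, .3, .69],
                            [.97, .01, .01],
                            [.01, .69, .3]])
emissionProbs = np.array([[1/6, 0, 0], [1/6, 0, 0], [1/6, 0, 0],
                          [1/6, .2, 0], [1/6, .2, 0], [1/6, .2, 0],
                          [0, .2, .25], [0, .2, .25], [0, 0, .25],
                          [0, 0, .25]])

# "love dogs exhausted big exhausted" as indices into `words`.
print(viterbi_sketch([3, 5, 6, 8, 6], initProbs, transitionProbs, emissionProbs))
```

The table `delta` keeps only the best path into each state at each step, so the work grows linearly with sentence length instead of enumerating all $3^T$ tag sequences.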

We have created a wrapper function `viterbi` which takes as inputs a sentence and the HMM model parameters described above and calls the Viterbi algorithm to return the most likely sequence of hidden state variables for the given sentence. `viterbi` returns a set of indices; use `partsOfSpeech` to interpret these indices. For example, if we represent the sentence "love dogs exhausted big exhausted" in terms of the corresponding indices in `words`, `[3, 5, 6, 8, 6]`, we can call `viterbi` to get the indices of the most likely hidden states:

In [8]:
viterbi([3, 5, 6, 8, 6], initProbs, transitionProbs, emissionProbs)

array([0, 1, 2, 2, 2])

<div class="alert alert-success">Complete the function `part_of_speech_tagging` to produce the highest probability tags for a set of input sentences. Inside your function, call the `viterbi` function to find the best set of tags (hidden states) for each sentence. Make sure your function returns the names of the parts of speech, *not* their numerical indices.</div>

In [9]:
def part_of_speech_tagging(sentences, words, initProbs, partsOfSpeech,
                           transitionProbs, emissionProbs):
    """
    Identifies the parts of speech for each of the words in a collection
    of sentences.
    
    Parameters
    ----------
    sentences : numpy array of shape (n,)
        An array of n lists of strings. Each sentence corresponds to
        one list. The number of strings (words) in a sentence (list)
        may vary.
    
    words : list of length 10
        A list containing each of the ten words in our vocabulary. 
        
    partsOfSpeech: list of length 3
        A list containing the three possible values of the hidden 
        state variables (corresponding to the three parts of speech).

    initProbs: numpy array of shape (3,)
        An array containing the initial state probabilities P(s1), 
        where entry i gives the probability P(s1=i) that the first
        hidden state is i.

    transitionProbs: numpy array of shape (3,3)
        A matrix containing the transition probabilities, where the
        entry in row i and column j gives the probability 
        P(st=i|st−1=j) of transitioning from state j to state i.
        
    emissionProbs: numpy array of shape (10,3)
        A matrix containing the emission probabilities, where the 
        entry in row i and column j gives the probability P(wt=i|st=j)
        of observing word i when in state j.
    
    Outputs
    -------
    speech_tags : numpy array of shape (n, T)
        An array of lists, where each list contains the parts of speech
        of the T words in the sentence.    
    """
    # YOUR CODE HERE
    pos = []
    for s in sentences:
        # map words to vocabulary indices, then decode the best tag sequence
        indices = viterbi([words.index(x) for x in s], initProbs,
                          transitionProbs, emissionProbs)
        # translate state indices back into part-of-speech names
        pos.append([partsOfSpeech[i] for i in indices])
    return np.array(pos)

Once you have implemented the `part_of_speech_tagging` function, you can test it on the provided sentences with the code below:

In [10]:
sentences = np.array([
             ['exhausted', 'dogs', 'love', 'marbled', 'parks'],
             ['inappropriate', 'sally', 'love', 'inappropriate', 'reddit'],
             ['exhausted', 'exhausted', 'sally', 'parks', 'dogs'],
             ['sally', 'dogs', 'big', 'exhausted', 'john'],
             ['big', 'john', 'exhausted', 'exhausted', 'dogs'],
            ])

speech_tags = part_of_speech_tagging(sentences, words, initProbs, 
                                     partsOfSpeech, transitionProbs, 
                                     emissionProbs)

for idx,s in enumerate(sentences):
    print('sentence:    ' + str(s))
    print('speech tags: ' + str(speech_tags[idx])+'\n')

sentence:    ['exhausted' 'dogs' 'love' 'marbled' 'parks']
speech tags: ['adjective' 'noun' 'verb' 'adjective' 'noun']

sentence:    ['inappropriate' 'sally' 'love' 'inappropriate' 'reddit']
speech tags: ['adjective' 'noun' 'verb' 'adjective' 'noun']

sentence:    ['exhausted' 'exhausted' 'sally' 'parks' 'dogs']
speech tags: ['adjective' 'adjective' 'noun' 'verb' 'noun']

sentence:    ['sally' 'dogs' 'big' 'exhausted' 'john']
speech tags: ['noun' 'verb' 'adjective' 'adjective' 'noun']

sentence:    ['big' 'john' 'exhausted' 'exhausted' 'dogs']
speech tags: ['adjective' 'noun' 'verb' 'adjective' 'noun']



In [11]:
# add your own test cases here!


In [12]:
"""Is the part_of_speech_tagging function correctly implemented?"""
from numpy.testing import assert_array_equal
from nose.tools import assert_equal

# create some new sentences for testing
test_sentences = np.array([
                    ['sally', 'reddit', 'john', 'big', 'parks'],
                    ['reddit', 'parks', 'john', 'big', 'sally'],
                    ['john', 'big', 'dogs', 'love', 'parks']
                    ])

test_tags = np.array([
                ['noun', 'noun', 'noun', 'adjective', 'noun'],
                ['noun', 'verb', 'noun', 'adjective', 'noun'],
                ['noun', 'adjective', 'noun', 'verb', 'noun'],
                ])

p_o_s = part_of_speech_tagging(test_sentences, words, initProbs, partsOfSpeech, 
                               transitionProbs, emissionProbs)

# is the output of the correct shape?
assert_equal(p_o_s.shape, test_tags.shape)

# is the output an array of strings?
assert(all([isinstance(x, str) for x in p_o_s.flatten()]))

# are the correct speech tags being returned?
for idx, t in enumerate(test_tags):
    assert_array_equal(p_o_s[idx], t, "Incorrect speech tags generated for one of test_sentences")

print("Success!")

Success!


---

### Part C (1 point)

<div class="alert alert-success">Does the HMM model do a good job of recovering the parts of speech of the words in our limited subset of the English language? Which words in the above sentences are used as more than one part of speech? (**0.5 points**)</div>

I would say the HMM does a good job of recovering the parts of speech of the words in our limited subset. In the sentences above, 'dogs' is tagged as both a noun and a verb, 'parks' is tagged as both a noun and a verb, and 'exhausted' is tagged as both an adjective and a verb. ('love' can be a noun or a verb according to the table, but it is tagged only as a verb in these sentences.) In all of these cases, the HMM picks the most probable part of speech for the word in its context, even though the word could be used as more than one part of speech.

<div class="alert alert-success">Was the HMM able to determine the correct part of speech for each occurrence of these words? Give a short explanation of why the HMM model was or was not able to accomplish this disambiguation task. (**0.5 points**)</div>

Yes, the HMM was able to determine the correct part of speech for each occurrence of these words. I think it can accomplish this disambiguation task because the part of speech of each word is tied directly to the part of speech of the word before it: the model takes the previous hidden state into account, through the transition probabilities, when deciding on the next one. For example, in 'sally' 'dogs' 'big' 'exhausted' 'john', where 'dogs' was classified as a verb, the probability of a noun following the first noun (0.02) is much lower than the probability of a verb following it (0.97), so the model correctly tags 'dogs' as a verb. The same reasoning applies to the other ambiguous words.
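The 'sally' 'dogs' example can be made numeric with the parameters from Part A. The sketch below compares only the local contribution of tagging 'dogs' as a noun versus a verb, given that 'sally' is a noun. The variable names are illustrative, and the full Viterbi computation also accounts for the words that follow.

```python
# Values copied from the parameters defined in Part A.
p_noun_after_noun = 0.02   # transitionProbs[0, 0]
p_verb_after_noun = 0.97   # transitionProbs[1, 0]
p_dogs_given_noun = 1 / 6  # emissionProbs[5, 0]
p_dogs_given_verb = 0.2    # emissionProbs[5, 1]

# Transition into the tag times emission of 'dogs' from that tag.
score_noun = p_noun_after_noun * p_dogs_given_noun
score_verb = p_verb_after_noun * p_dogs_given_verb

print(score_noun, score_verb, score_verb / score_noun)
```

Although 'dogs' is almost equally likely to be emitted by either tag, the transition probabilities make the verb reading far more probable after a noun.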

---

Before turning this problem in remember to do the following steps:

1. **Restart the kernel** (Kernel$\rightarrow$Restart)
2. **Run all cells** (Cell$\rightarrow$Run All)
3. **Save** (File$\rightarrow$Save and Checkpoint)

<div class="alert alert-danger">After you have completed these three steps, ensure that the following cell has printed "No errors". If it has <b>not</b> printed "No errors", then your code has a bug in it and has thrown an error! Make sure you fix this error before turning in your problem set.</div>

In [13]:
print("No errors!")

No errors!
