<div style="text-align: right">INFO 6105 Data Science Eng Methods and Tools, Lecture 6 Day 2</div>
<div style="text-align: right">Dino Konstantopoulos, 12 February 2020, with material by Ankur Ankan and Abinash Panda</div>

Some people have this to say about advantages of the **German** language:

<br />
<center>
<img src="ipynb.images/german-flag.jpg" width=300 />
</center>

- It is better to keep the most important piece of information at the end, to keep people’s attention. In German, the main verb in conjugation is at the end of a sentence: *Sie (You) haben (have) bestimmt (definitely) noch (still) nicht (not) viele (many) anständige (respectable) Zauberer (wizard) **kennen gelernt** (met)*. In Spanish and English, we say all the important information first and, for this reason, we tend to interrupt each other in the middle of a sentence.


- The purpose of words is to transmit **knowledge**, so they should be easily understood. Some people seem to use words no one knows just to look smart. In German, it is almost impossible to do this as names for objects describe those objects. I really love this about German: Glutenunverträglichkeit means gluten-not-compatible (celiac). It helped a lot while reading Harry Potter: Zauberer (wizard), Zauberwort (Magic word), Zaubererschule (School of Magic), Zauberstab (magic wand), Zaubererwelt (wizarding world), etc

# POS tagging with Hidden Markov Models

In this notebook you will witness how you can *cheat* Science by relying on data probabilities instead of trying to figure out the rules or laws of Science. I don't know the internals of [Universal Dependencies](https://universaldependencies.org/), but I suspect they do not worry about analyzing the structure of the German language, figuring out that the verb is at the end of a sentence, and accomodating for this in the German Tree Bank. Instead, their algorithms probably read in a lot of german text, and just by looking at the probabilities of where the verb lands in a sentence they can correctly figure out that it's at the end. Probabilities powers **statistics**, and having lots of data means your probabilities can be very *exact*.

Not too many weeks ago, you called R libraries to do POS tagging for you. Now that you know everything about probabilities, *you* can do the same thing *on your own*!

We'll use the [Brown]() corpus to build a [POS tagger](https://en.wikipedia.org/wiki/Part-of-speech_tagging), first using a simple [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) model (***most probable POS by count***), then using a **Hidden Markov Model** (HMM) that gets *transition* and *emission* probabilities from [POS bigrams](https://en.wikipedia.org/wiki/Bigram) (given a POS, what's the most probable ***next*** POS in the sentence?).

We'll divide the Brown corpus into training and test sets, and compare accuraces for BOW and HMM models.

We'll use some advanced python structures that are often used in Natural Language Processing (NLP).

# Reading in the Brown corpus efficiently

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from IPython.core.display import HTML
from itertools import chain
from collections import Counter, defaultdict, namedtuple, OrderedDict
from pomegranate import State, HiddenMarkovModel, DiscreteDistribution
import os
from io import BytesIO
from itertools import chain
import random

Some advanced python-fu:

Library `itertools` is a library of efficient iterators. `chain` makes an iterator that returns elements from the first iterable until it is exhausted, then proceeds to the next iterable, until all of the iterables are exhausted. It is used for treating consecutive sequences as a single sequence

In python, a single star `*` unpacks the sequence/collection into positional arguments, so you can do this:
```(python)
def sum(a, b):
    return a + b

values = (1, 2)

s = sum(*values)
```

This will unpack the tuple so that it actually executes as:
```(python)
s = sum(1, 2)
```

The double star `**` does the same, only using a dictionary and thus named arguments:
```(python)
values = { 'a': 1, 'b': 2 }
s = sum(**values)
```

A python `frozenset` is just an immutable version of a Python set object. 

While elements of a set can be modified at any time, elements of frozen set remains the same after creation. 

So, frozen sets can be used as keys in a sictionary or as element of another set.

`read_data` below reads files page by page (`\n\n`), then line by line (`\n`), uses the first line of a page as a key to an ordered dictionary, with the values being a zipper made out of words and POS tags. It accomodates the syntax of the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus), pictured here below.

<br />
<center>
<img src="ipynb.images/brown.png" width=300 />
    Header of Brown corpus
</center>

In [2]:
def read_data(filename):
    """Read tagged sentence data"""
    with open(filename, 'r', encoding = 'UTF-8') as f:
        sentence_lines = [l.split("\n") for l in f.read().split("\n\n")]
    return OrderedDict(((s[0], Sentence(*zip(*[l.strip().split("\t")
                        for l in s[1:]]))) for s in sentence_lines if s[0]))

def read_tags(filename):
    """Read a list of word tag classes"""
    with open(filename, 'r', encoding = 'UTF-8') as f:
        tags = f.read().split("\n")
    return frozenset(tags)

Sentence = namedtuple("data", "tags")

Let's read in the Brown corpus to see if our python code works out:

In [3]:
tagset = read_tags("tags.txt")
sentences = read_data("test.txt")
print(sentences)
print(tagset)

FileNotFoundError: [Errno 2] No such file or directory: 'test.txt'

The class `Dataset` below incorporates our function above, reads in the Brown corpus and creates a collection of `keys`, a set of (unique) words, a sequence of words and a mirror sequence of tags as tuples, with `N` being the number of words in the Brown corpus.

Then it splits all this nice data into a training and test decomposition by using the `Subset` class defined further below, which mirrors the `Dataset` class.

In [21]:
class Dataset(namedtuple("_Dataset", "sentences keys vocab X tagset Y training_set testing_set N stream")):
    def __new__(cls, tagfile, datafile, train_test_split=0.8, seed=112890):
        tagset = read_tags(tagfile)
       # sentences = read_data(datafile)
        keys = tuple(sentences.keys())
        wordset = frozenset(chain(*[s.words for s in sentences.values()]))
        word_sequences = tuple([sentences[k].words for k in keys])
        tag_sequences = tuple([sentences[k].tags for k in keys])
        N = sum(1 for _ in chain(*(s.words for s in sentences.values())))
        
        # split data into train/test sets
        _keys = list(keys)
        if seed is not None: random.seed(seed)
        random.shuffle(_keys)
        split = int(train_test_split * len(_keys))
        training_data = Subset(sentences, _keys[:split])
        testing_data = Subset(sentences, _keys[split:])
        stream = tuple(zip(chain(*word_sequences), chain(*tag_sequences)))
        return super().__new__(cls, dict(sentences), keys, wordset, word_sequences, tagset,
                               tag_sequences, training_data, testing_data, N, stream.__iter__)

    def __len__(self):
        return len(self.sentences)

    def __iter__(self):
        return iter(self.sentences.items())
    
    
class Subset(namedtuple("BaseSet", "sentences keys vocab X tagset Y N stream")):
    def __new__(cls, sentences, keys):
        word_sequences = tuple([sentences[k].words for k in keys])
        tag_sequences = tuple([sentences[k].tags for k in keys])
        wordset = frozenset(chain(*word_sequences))
        tagset = frozenset(chain(*tag_sequences))
        N = sum(1 for _ in chain(*(sentences[k].words for k in keys)))
        stream = tuple(zip(chain(*word_sequences), chain(*tag_sequences)))
        return super().__new__(cls, {k: sentences[k] for k in keys}, keys, wordset, word_sequences,
                               tagset, tag_sequences, N, stream.__iter__)

    def __len__(self):
        return len(self.sentences)

    def __iter__(self):
        return iter(self.sentences.items())

Let's read in the Brown corpus *again*, leveraging our classes above now, which order the corpus into efficiently navigable structures:

In [4]:
data = Dataset("tags.txt", "test.txt", train_test_split=0.8)

#print("There are {} sentences in the corpus.".format(len(data)))
#print("There are {} sentences in the training set.".format(len(data.training_set)))
#print("There are {} sentences in the testing set.".format(len(data.testing_set)))

#assert len(data) == len(data.training_set) + len(data.testing_set), \
 #      "The number of sentences in the training set + testing set should sum to the number of sentences in the corpus"

NameError: name 'Dataset' is not defined

In [5]:
print("There are a total of {} samples of {} unique words in the corpus."
      .format(data.N, len(data.vocab)))
print("There are {} samples of {} unique words in the training set."
      .format(data.training_set.N, len(data.training_set.vocab)))
print("There are {} samples of {} unique words in the testing set."
      .format(data.testing_set.N, len(data.testing_set.vocab)))
print("There are {} words in the test set that are missing in the training set."
      .format(len(data.testing_set.vocab - data.training_set.vocab)))

assert data.N == data.training_set.N + data.testing_set.N, \
       "The number of training + test samples should sum to the total number of samples"

NameError: name 'data' is not defined

Let's look at an example POS tagging:

In [49]:
key = 'b100-5507'
print("Sentence: {}".format(key))
print("words:\n\t{!s}".format(data.sentences[key].words))
print("tags:\n\t{!s}".format(data.sentences[key].tags))

Sentence: b100-5507
words:
	('Mr.', 'Podger', 'had', 'thanked', 'him', 'gravely', ',', 'and', 'now', 'he', 'made', 'use', 'of', 'the', 'advice', '.')
tags:
	('NOUN', 'NOUN', 'VERB', 'VERB', 'PRON', 'ADV', '.', 'CONJ', 'ADV', 'PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN', '.')


This is how easy it is, now, to access words and associated tags, using the vocabulary of Machine Learning: `X` is the independent variable, and `Y` the dependent variable!

In [50]:
# accessing words with Dataset.X and tags with Dataset.Y 
for i in range(2):    
    print("Sentence {}:".format(i + 1), data.X[i])
    print()
    print("Labels {}:".format(i + 1), data.Y[i])
    print()

Sentence 1: ('Mr.', 'Podger', 'had', 'thanked', 'him', 'gravely', ',', 'and', 'now', 'he', 'made', 'use', 'of', 'the', 'advice', '.')

Labels 1: ('NOUN', 'NOUN', 'VERB', 'VERB', 'PRON', 'ADV', '.', 'CONJ', 'ADV', 'PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN', '.')

Sentence 2: ('But', 'there', 'seemed', 'to', 'be', 'some', 'difference', 'of', 'opinion', 'as', 'to', 'how', 'far', 'the', 'board', 'should', 'go', ',', 'and', 'whose', 'advice', 'it', 'should', 'follow', '.')

Labels 2: ('CONJ', 'PRT', 'VERB', 'PRT', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'ADP', 'ADV', 'ADV', 'DET', 'NOUN', 'VERB', 'VERB', '.', 'CONJ', 'DET', 'NOUN', 'PRON', 'VERB', 'VERB', '.')



Use `Dataset.stream()` to enumerate (word, tag) samples for the entire corpus. Let's enumerate the first 4:

In [51]:
print("\nStream (word, tag) pairs:\n")
for i, pair in enumerate(data.stream()):
    print("\t", pair)
    if i > 3: break


Stream (word, tag) pairs:

	 ('Mr.', 'NOUN')
	 ('Podger', 'NOUN')
	 ('had', 'VERB')
	 ('thanked', 'VERB')
	 ('him', 'PRON')


These are all words and tags in our **training set**. Let's uncover the first 4:

In [25]:
words = [word for i, (word, tag) in enumerate(data.training_set.stream())]
tags = [tag for i, (word, tag) in enumerate(data.training_set.stream())]
words[0:4], tags[0:4]

NameError: name 'data' is not defined

# POS Tagger using BOW Model

Let's create a dictionary of word + tag pairs where the values are just counts. Note that some words may be associated with different POS tags, in which case they will produce *distinct* pairs: 

In [24]:
def pair_counts(tags, words):
    d = defaultdict(lambda: defaultdict(int))
    for tag, word in zip(tags, words):
        d[tag][word] += 1
    return d
        
word_counts = pair_counts(words, tags)

NameError: name 'words' is not defined

Let's produce a dictionary where words (keys) are associated with their ***most frequent*** POS tag:

In [54]:
mfc_table = dict((word, max(tags.keys(), key=lambda key: tags[key])) for word, tags in word_counts.items())

In [55]:
i = 0
for key, value in mfc_table.items():
    print(key, value)
    i += 1
    if i > 3: break

Whenever NOUN
artists NOUN
, NOUN
indeed VERB


Python `namedtuple` supports a type of container-like dictionary that, like dictionaries, contains keys that are hashed to particular values. But it supports *both* access from key values as well as *iteration*, the functionality that dictionaries lack.

Let's write a class that takes in a table in its constructor and adds `<MISSING>` POS tags if the word is missing from the training set (possible that a word is in the test set but missing from the training set). It also has a `viterbi` method that takes in the table and builds a sequence of states that we will use in our Hidden Markov Model.

In [56]:
FakeState = namedtuple('FakeState', 'name')

class MFCTagger:
    missing = FakeState(name = '<MISSING>')
    
    def __init__(self, table):
        self.table = defaultdict(lambda: MFCTagger.missing)
        self.table.update({word: FakeState(name=tag) for word, tag in table.items()})
        
    def viterbi(self, seq):
        """This method simplifies predictions by matching the Pomegranate viterbi() interface"""
        return 0., list(enumerate(["<start>"] + [self.table[w] for w in seq] + ["<end>"]))

In [57]:
mfc_model = MFCTagger(mfc_table)

So essentially we built a table that associates words with their most frequent POS tag. This is a simplistic **bag of words** (BOW) model. Let's see, given a sentence, if we *guess the hidden states* (POS tags) right!

In [58]:
def replace_unknown(sequence):
    return [w if w in data.training_set.vocab else 'nan' for w in sequence]

def simplify_decoding(X, model):    
    _, state_path = model.viterbi(replace_unknown(X))
    return [state[1].name for state in state_path[1:-1]]

In [59]:
for key in data.testing_set.keys[:3]:
    print("Sentence Key: {}\n".format(key))
    print("Sentence: {}\n".format(data.sentences[key].words))
    print("Predicted labels:\n-----------------")
    print(simplify_decoding(data.sentences[key].words, mfc_model))
    print()
    print("Actual labels:\n--------------")
    print(data.sentences[key].tags)
    print("\n")

Sentence Key: b100-28144

Sentence: ('and', 'August', '15', ',', 'November', '15', ',', 'February', '17', ',', 'and', 'May', '15', ',', '(', 'Cranston', ')', '.')

Predicted labels:
-----------------
['NOUN', 'VERB', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'NOUN', '.', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'NOUN', '.', 'NOUN', 'NOUN']

Actual labels:
--------------
('CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.')


Sentence Key: b100-23146

Sentence: ('She', 'had', 'the', 'opportunity', 'that', 'few', 'clever', 'women', 'can', 'resist', ',', 'of', 'showing', 'her', 'superiority', 'in', 'argument', 'over', 'a', 'man', '.')

Predicted labels:
-----------------
['NOUN', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'NOUN', '.', 'NOUN', 'NOUN', 'ADP', 'NOUN', 'VERB', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'NOUN']

Actual labels:
--------------
('PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', 'NOUN', 'VE

Pretty good! Let's evaluate the accuracy of our most-frequent-tag tagger:

In [60]:
def accuracy(X, Y, model):
    
    correct = total_predictions = 0
    for observations, actual_tags in zip(X, Y):
        
        # The model.viterbi call in simplify_decoding will return None if the HMM
        # raises an error (for example, if a test sentence contains a word that
        # is out of vocabulary for the training set). Any exception counts the
        # full sentence as an error (which makes this a conservative estimate).
        try:
            most_likely_tags = simplify_decoding(observations, model)
            correct += sum(p == t for p, t in zip(most_likely_tags, actual_tags))
        except:
            pass
        total_predictions += len(observations)
    return correct / total_predictions

In [61]:
mfc_training_acc = accuracy(data.training_set.X, data.training_set.Y, mfc_model)
print("training accuracy mfc_model: {:.2f}%".format(100 * mfc_training_acc))

mfc_testing_acc = accuracy(data.testing_set.X, data.testing_set.Y, mfc_model)
print("testing accuracy mfc_model: {:.2f}%".format(100 * mfc_testing_acc))

training accuracy mfc_model: 16.39%
testing accuracy mfc_model: 15.65%


# Hidden Markov Model

Let's build a POS tagger using a Hidden Markov Model.

First, let's see how many POS tags we have in our corpus, using the python `Counter` structure that we used last week to count instances.

In [62]:
def unigram_counts(sequences):
    return Counter(sequences)

tags = [tag for i, (word, tag) in enumerate(data.training_set.stream())]
tag_unigrams = unigram_counts(tags)
tag_unigrams

Counter({'ADV': 44877,
         'NOUN': 220632,
         '.': 117757,
         'VERB': 146161,
         'ADP': 115808,
         'ADJ': 66754,
         'CONJ': 30537,
         'DET': 109671,
         'PRT': 23906,
         'NUM': 11878,
         'PRON': 39383,
         'X': 1094})

We'll *slightly* modify the code above to get our POS bigrams, from *both* training and test subsets, to uncover which POS tags follow which other POS tags. So, instead of a simple list of POS tags, `Counter` will count *neighboring* POS tuples! 

In [63]:
def bigram_counts(sequences):
    return Counter(sequences)

tags = [tag for i, (word, tag) in enumerate(data.stream())]
o = [(tags[i],tags[i+1]) for i in range(0,len(tags)-2,2)]
tag_bigrams = bigram_counts(o)
tag_bigrams 

Counter({('NOUN', 'NOUN'): 20633,
         ('VERB', 'VERB'): 16877,
         ('PRON', 'ADV'): 1335,
         ('.', 'CONJ'): 6477,
         ('ADV', 'PRON'): 1392,
         ('VERB', 'NOUN'): 8869,
         ('ADP', 'DET'): 33105,
         ('NOUN', '.'): 39180,
         ('CONJ', 'PRT'): 468,
         ('VERB', 'PRT'): 6001,
         ('VERB', 'DET'): 15046,
         ('NOUN', 'ADP'): 33955,
         ('ADP', 'ADV'): 1174,
         ('ADV', 'DET'): 2068,
         ('NOUN', 'VERB'): 21839,
         ('VERB', '.'): 7405,
         ('CONJ', 'DET'): 2900,
         ('NOUN', 'PRON'): 2840,
         ('.', 'PRT'): 2391,
         ('DET', 'NOUN'): 42818,
         ('PRT', 'VERB'): 9304,
         ('ADV', 'ADJ'): 3780,
         ('ADP', 'PRON'): 5018,
         ('ADJ', 'NOUN'): 27380,
         ('ADJ', 'ADP'): 3676,
         ('.', 'NOUN'): 10104,
         ('.', 'ADJ'): 3097,
         ('CONJ', 'NOUN'): 4682,
         ('VERB', 'ADP'): 15460,
         ('DET', 'ADJ'): 16357,
         ('.', '.'): 10396,
         ('ADV'

What tags do our sentences *begin* with?

In [25]:
def starting_counts(sequences):
    return Counter(sequences)

tags = [tag for i, (word, tag) in enumerate(data.stream())]
starts_tag = [i[0] for i in data.Y]
tag_starts = starting_counts(starts_tag)
tag_starts

Counter({'NOUN': 8093,
         'CONJ': 2817,
         'PRT': 2103,
         'DET': 12238,
         'PRON': 9157,
         'ADP': 7044,
         'ADJ': 1969,
         'ADV': 5238,
         'VERB': 2588,
         'NUM': 964,
         '.': 5099,
         'X': 30})

What tags do our sentences *end* with?

In [26]:
def ending_counts(sequences):    
    return Counter(sequences)

end_tag = [i[len(i)-1] for i in data.Y]
tag_ends = ending_counts(end_tag)
tag_ends

Counter({'.': 56149,
         'ADJ': 31,
         'ADV': 20,
         'NOUN': 914,
         'NUM': 80,
         'PRON': 5,
         'ADP': 8,
         'VERB': 102,
         'DET': 18,
         'PRT': 9,
         'CONJ': 2,
         'X': 2})

Not surprising that most end with a period `.`! Ideally, we should end with the previous-to-last tag! 

In [27]:
end_tag = [i[len(i)-2] for i in data.Y]
tag_ends = ending_counts(end_tag)
tag_ends

Counter({'NOUN': 30751,
         'VERB': 5748,
         '.': 10002,
         'NUM': 1409,
         'PRT': 719,
         'PRON': 2125,
         'ADV': 2860,
         'ADJ': 2947,
         'CONJ': 54,
         'ADP': 302,
         'X': 85,
         'DET': 338})

Let's create our Hidden Markov Model and peek into most popular words per POS tag.

`tag_words_count` contains words associated to each POS tag, arranged by frequency so that we can eventually evaluate *emission* probabilities, which are probabilities of observable states (wrods) given hidden states (POS tags).

In [52]:
hmm_model = HiddenMarkovModel(name="base-hmm-tagger")

tags = [tag for i, (word, tag) in enumerate(data.stream())]
words = [word for i, (word, tag) in enumerate(data.stream())]

tags_count = unigram_counts(tags)
tag_words_count = pair_counts(tags, words)

starting_tag_list = [i[0] for i in data.Y]
#ending_tag_list = [i[-1] if len(i)==1 else i[-2] for i in data.Y]
#ending_tag_list = [i[-1] for i in data.Y]
ending_tag_list = [i[len(i)-2] for i in data.Y]

starting_tag_count = starting_counts(starting_tag_list) #the number of times a tag occured at the start
ending_tag_count = ending_counts(ending_tag_list)       #the number of times a tag occured at the end

tag_words_count

<generator object <genexpr> at 0x0000020E7C9ACC48>


defaultdict(<function __main__.pair_counts.<locals>.<lambda>()>,
            {'NOUN': defaultdict(int,
                         {'Mr.': 844,
                          'Podger': 21,
                          'use': 352,
                          'advice': 50,
                          'difference': 148,
                          'opinion': 94,
                          'board': 165,
                          'instrument': 44,
                          'elasticity': 5,
                          'pastes': 1,
                          'inks': 1,
                          'paints': 6,
                          'adhesives': 2,
                          'plastics': 28,
                          'bread': 41,
                          'dough': 13,
                          'properties': 61,
                          'length': 115,
                          'shortness': 1,
                          'spinnability': 1,
                          'methods': 138,
                          'present': 

In [29]:
ending_tag_list

['NOUN',
 'VERB',
 'NOUN',
 'NOUN',
 'NOUN',
 'NOUN',
 'NOUN',
 'NOUN',
 'NOUN',
 'NOUN',
 'NOUN',
 '.',
 'NOUN',
 'NUM',
 'NOUN',
 'NOUN',
 '.',
 'NOUN',
 'VERB',
 '.',
 'VERB',
 'NOUN',
 '.',
 '.',
 '.',
 'NOUN',
 '.',
 'NOUN',
 'NOUN',
 'PRT',
 'PRON',
 'NOUN',
 'NOUN',
 'VERB',
 '.',
 'PRON',
 'ADV',
 'NOUN',
 '.',
 'NOUN',
 'NOUN',
 'NOUN',
 'NOUN',
 'NOUN',
 '.',
 '.',
 'NOUN',
 'VERB',
 'NOUN',
 'NOUN',
 'NOUN',
 '.',
 'ADJ',
 'ADJ',
 'NUM',
 'ADJ',
 'NOUN',
 'NOUN',
 'PRON',
 'ADV',
 'NOUN',
 'ADV',
 '.',
 'NOUN',
 'NOUN',
 'NOUN',
 'NOUN',
 'NOUN',
 'NOUN',
 'ADJ',
 'NOUN',
 'NOUN',
 'PRON',
 'NOUN',
 'NOUN',
 '.',
 'NOUN',
 'NOUN',
 'VERB',
 'NOUN',
 '.',
 '.',
 'NUM',
 '.',
 'VERB',
 '.',
 'VERB',
 '.',
 '.',
 'VERB',
 'NOUN',
 'NOUN',
 'NOUN',
 'NOUN',
 'ADV',
 'NOUN',
 '.',
 'NOUN',
 '.',
 '.',
 'NOUN',
 'NOUN',
 'NOUN',
 'NOUN',
 'NOUN',
 '.',
 'NOUN',
 'VERB',
 'NOUN',
 'ADJ',
 'NOUN',
 'NOUN',
 'CONJ',
 'ADJ',
 'NOUN',
 'VERB',
 'NOUN',
 'NOUN',
 'NOUN',
 'ADJ',
 'NOUN'

Let's convert word frequencies by POS tag to probabilities by dividing by the total number of words per POS tag, yielding the `distribution` of words.

We'll define HMM emission probabilities using that `distribution`.

In [30]:
to_pass_states = []
for tag, words_dict in tag_words_count.items():
    total = float(sum(words_dict.values()))
    distribution = {word: count/total for word, count in words_dict.items()}
    tag_emissions = DiscreteDistribution(distribution)
    tag_state = State(tag_emissions, name=tag)
    to_pass_states.append(tag_state)

In [31]:
distribution

{'1': 0.03536372193088611,
 'One': 0.023060373806642463,
 '1957': 0.003227107704719645,
 'Six': 0.00154632244184483,
 '2:30.3-:36.1': 0.0001344628210299852,
 'two': 0.0881403791851553,
 'one': 0.16061583972031734,
 '26': 0.002151405136479763,
 '2000': 0.000336157052574963,
 'twenty-one': 0.00047061987360494823,
 '3': 0.019160951996772892,
 '4': 0.013849670566088476,
 '1721': 0.0001344628210299852,
 '6': 0.007664380798709157,
 '500': 0.0010757025682398816,
 '2100': 0.0002016942315449778,
 'six': 0.01317735646093855,
 'three': 0.03717897001479091,
 '1883': 0.0004033884630899556,
 'fifteen': 0.003227107704719645,
 'forty': 0.002151405136479763,
 'twenty': 0.00490789296759446,
 'Twenty-four': 0.0002016942315449778,
 '20': 0.00618528976737932,
 '1932': 0.0002016942315449778,
 '707': 0.0002689256420599704,
 'five': 0.01627000134462821,
 '25': 0.0055129756622293935,
 'thirty': 0.003764958988839586,
 'sixty': 0.0012101653892698668,
 '17': 0.0027564878311146968,
 'nine': 0.004975124378109453,
 

In [32]:
distribution["56"]

0.0004033884630899556

`to_pass_states` yields the probability distribution of words per POS tag:

In [33]:
to_pass_states

[{
     "class" : "State",
     "distribution" : {
         "class" : "Distribution",
         "dtype" : "str",
         "name" : "DiscreteDistribution",
         "parameters" : [
             {
                 "Mr." : 0.003062876055131769,
                 "Podger" : 7.620900137176203e-05,
                 "use" : 0.0012774080229933445,
                 "advice" : 0.00018145000326610006,
                 "difference" : 0.0005370920096676562,
                 "opinion" : 0.0003411260061402681,
                 "board" : 0.0005987850107781302,
                 "instrument" : 0.00015967600287416806,
                 "elasticity" : 1.8145000326610006e-05,
                 "pastes" : 3.6290000653220013e-06,
                 "inks" : 3.6290000653220013e-06,
                 "paints" : 2.1774000391932006e-05,
                 "adhesives" : 7.258000130644003e-06,
                 "plastics" : 0.00010161200182901603,
                 "bread" : 0.00014878900267820204,
                 "dough" 

Let's add states to our model:

In [34]:
hmm_model.add_states() 

The start probability for each tag is how many times it is a sentence-starting POS tag divided by its total count. We build the starting transitions for our HMM model:

In [35]:
start_prob={}

for tag in tags:
    start_prob[tag] = starting_tag_count[tag] / tags_count[tag]

for tag_state in to_pass_states :
    hmm_model.add_transition(hmm_model.start, tag_state, start_prob[tag_state.name])  

The end probability for each tag is how many times it is a sentence-ending POS tag divided by its total count. We build the ending transitions for our HMM model:

In [36]:
end_prob={}

for tag in tags:
    end_prob[tag] = ending_tag_count[tag]/tags_count[tag]
    
for tag_state in to_pass_states :
    hmm_model.add_transition(tag_state, hmm_model.end, end_prob[tag_state.name])

We now add the transition probabilities for our model, which uses our POS bigrams to enumerate what the probabilities are for transiting from one POS tag to another.

In [2]:
transition_prob_pair={}

for key in tag_bigrams.keys():
    transition_prob_pair[key] = tag_bigrams.get(key)/tags_count[key[0]]
    
for tag_state in to_pass_states:
    for next_tag_state in to_pass_states:
        hmm_model.add_transition(tag_state, next_tag_state, transition_prob_pair[(tag_state.name, next_tag_state.name)])

NameError: name 'tag_bigrams' is not defined

We *bake* our model:

In [38]:
hmm_model.bake()

We can now evaluate the accuracy of our HMM model and compare it to our BOW model:

In [1]:
hmm_training_acc = accuracy(data.training_set.X, data.training_set.Y, hmm_model)
print("training accuracy basic hmm model: {:.2f}%".format(100 * hmm_training_acc))

hmm_testing_acc = accuracy(data.testing_set.X, data.testing_set.Y, hmm_model)
print("testing accuracy basic hmm model: {:.2f}%".format(100 * hmm_testing_acc))

NameError: name 'accuracy' is not defined

Here's a decoding example:

In [41]:
for key in data.testing_set.keys[:3]:
    print("Sentence Key: {}\n".format(key))
    print("Sentence: {}\n".format(data.sentences[key].words))
    print("Predicted labels:\n-----------------")
    print(simplify_decoding(data.sentences[key].words, hmm_model))
    print()
    print("Actual labels:\n--------------")
    print(data.sentences[key].tags)
    print("\n")

Sentence Key: b100-28144

Sentence: ('and', 'August', '15', ',', 'November', '15', ',', 'February', '17', ',', 'and', 'May', '15', ',', '(', 'Cranston', ')', '.')

Predicted labels:
-----------------
['CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.']

Actual labels:
--------------
('CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.')


Sentence Key: b100-23146

Sentence: ('She', 'had', 'the', 'opportunity', 'that', 'few', 'clever', 'women', 'can', 'resist', ',', 'of', 'showing', 'her', 'superiority', 'in', 'argument', 'over', 'a', 'man', '.')

Predicted labels:
-----------------
['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'VERB', '.', 'ADP', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'DET', 'NOUN', '.']

Actual labels:
--------------
('PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'VERB', '.', 'ADP', 'VERB

A 96% accuracy for our HMM model compared to a 93% accuracy for our BOW model is a ***huge*** improvement as it brings language understanding error to below 4%, and 5% error is considered a *gold standard* for NLP. Speech-to-text frameworks like Alexa and Siri only started betting popular when they crossed the 5% threshold.

# References

- Hands on Markov models with python, by Ankur Ankan and Abinash Panda, [on amazon](https://www.amazon.com/Hands-Markov-Models-Python-probabilistic/dp/1788625447/ref=sr_1_2?keywords=hands+on+markov+models+with+python&qid=1581280984&sr=8-2)</div>

- [Universal Dependency Parsing from Scratch](https://nlp.stanford.edu/pubs/qi2018universal.pdf)

- [Statistical Machine Translation](https://en.wikipedia.org/wiki/Statistical_machine_translation)

- [Language Models](https://en.wikipedia.org/wiki/Language_model).

# Homework

Use the methodology in this notebook to build a statistical language translator, *from your language to english*. So, from Hindi or Chinese to English. Teams of **3** students. You *have* to use a Hidden Markov Model and `pomegranate` as your HMM library, to ensure all student teams start from the same baseline. Start from a Most Frequent Word (BOW) translation baseline, then move on to a Hidden Markov Model to improve translation. How much can you improve it by? The translation engine with the best accuracy, per language, will be presented in class.