<div style="text-align: right">INFO 6105 Data Science Eng Methods and Tools, Lecture 6 Day 2</div>
<div style="text-align: right">Dino Konstantopoulos, 12 February 2020, with material by Ankur Ankan and Abinash Panda</div>

Some people have this to say about advantages of the **German** language:

<br />
<center>
<img src="images2/german-flag.jpg" width=300 />
</center>

- It is better to keep the most important piece of information at the end, to keep people’s attention. In German, the main verb in conjugation is at the end of a sentence: *Sie (You) haben (have) bestimmt (definitely) noch (still) nicht (not) viele (many) anständige (respectable) Zauberer (wizard) **kennen gelernt** (met)*. In Spanish and English, we say all the important information first and, for this reason, we tend to interrupt each other in the middle of a sentence.


- The purpose of words is to transmit **knowledge**, so they should be easily understood. Some people seem to use words no one knows just to look smart. In German, it is almost impossible to do this as names for objects describe those objects. I really love this about German: Glutenunverträglichkeit means gluten-not-compatible (celiac). It helped a lot while reading Harry Potter: Zauberer (wizard), Zauberwort (Magic word), Zaubererschule (School of Magic), Zauberstab (magic wand), Zaubererwelt (wizarding world), etc

# POS tagging with Hidden Markov Models

In this notebook you will witness how you can *cheat* Science by relying on data probabilities instead of trying to figure out the rules or laws of Science. I don't know the internals of [Universal Dependencies](https://universaldependencies.org/), but I suspect they do not worry about analyzing the structure of the German language, figuring out that the verb is at the end of a sentence, and accomodating for this in the German Tree Bank. Instead, their algorithms probably read in a lot of german text, and just by looking at the probabilities of where the verb lands in a sentence they can correctly figure out that it's at the end. Probabilities powers **statistics**, and having lots of data means your probabilities can be very *exact*.

Not too many weeks ago, you called R libraries to do POS tagging for you. Now that you know everything about probabilities, *you* can do the same thing *on your own*!

We'll use the [Brown]() corpus to build a [POS tagger](https://en.wikipedia.org/wiki/Part-of-speech_tagging), first using a simple [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) model (***most probable POS by count***), then using a **Hidden Markov Model** (HMM) that gets *transition* and *emission* probabilities from [POS bigrams](https://en.wikipedia.org/wiki/Bigram) (given a POS, what's the most probable ***next*** POS in the sentence?).

We'll divide the Brown corpus into training and test sets, and compare accuraces for BOW and HMM models.

We'll use some advanced python structures that are often used in Natural Language Processing (NLP).

# Reading in the Brown corpus efficiently

# Homework

Use the methodology in this notebook to build a statistical language translator, *from your language to english*. So, from Hindi or Chinese to English. Teams of **3** students. You *have* to use a Hidden Markov Model and `pomegranate` as your HMM library, to ensure all student teams start from the same baseline. Start from a Most Frequent Word (BOW) translation baseline, then move on to a Hidden Markov Model to improve translation. How much can you improve it by? The translation engine with the best accuracy, per language, will be presented in class.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from IPython.core.display import HTML
from itertools import chain
from collections import Counter, defaultdict, namedtuple, OrderedDict
from pomegranate import State, HiddenMarkovModel, DiscreteDistribution
import os
from io import BytesIO
from itertools import chain
import random

Some advanced python-fu:

Library `itertools` is a library of efficient iterators. `chain` makes an iterator that returns elements from the first iterable until it is exhausted, then proceeds to the next iterable, until all of the iterables are exhausted. It is used for treating consecutive sequences as a single sequence

In python, a single star `*` unpacks the sequence/collection into positional arguments, so you can do this:
```(python)
def sum(a, b):
    return a + b

values = (1, 2)

s = sum(*values)
```

This will unpack the tuple so that it actually executes as:
```(python)
s = sum(1, 2)
```

The double star `**` does the same, only using a dictionary and thus named arguments:
```(python)
values = { 'a': 1, 'b': 2 }
s = sum(**values)
```

A python `frozenset` is just an immutable version of a Python set object. 

While elements of a set can be modified at any time, elements of frozen set remains the same after creation. 

So, frozen sets can be used as keys in a sictionary or as element of another set.

`read_data` below reads files page by page (`\n\n`), then line by line (`\n`), uses the first line of a page as a key to an ordered dictionary, with the values being a zipper made out of words and POS tags. It accomodates the syntax of the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus), pictured here below.

<br />
<center>
<img src="images/brown.png" width=300 />
    Header of Brown corpus
</center>

In [None]:
#filename = "./test.txt"
# tags_file = "./pos-tagging-brown/tags-universal.txt"
tags_file = "./tags.txt"
# txt_file = "./pos-tagging-brown/brown-universal.txt"
txt_file = "./newfile.txt"

### seperate each 20 words as a sentence, newfile.txt is the seperated file

In [None]:
with open(filename,'r') as test:
    a = list(test.read().split("\n"))
    with open('./newfile.txt','w') as w_f:
        index = 0
        id = 0
        for str_file in a:
            if not str_file.split("\t")[0]==' ':
                if index == 20:
                    w_f.write(str_file+"\n\n")
                    w_f.write(str(id)+'\n')
                    index=0
                else:
                    w_f.write(str_file+"\n")
                index+=1
                id+=1

In [None]:
def read_data(filename):
    """Read tagged sentence data"""
    with open(filename, 'r', encoding = 'UTF-8') as f:
        sentence_lines = [l.split("\n") for l in f.read().split("\n\n")]
    return OrderedDict(((s[0], Sentence(*zip(*[l.strip().split("\t") for l in s[1:]]))) for s in sentence_lines if s[0]))

def read_tags(filename):
    """Read a list of word tag classes"""
    with open(filename, 'r', encoding = 'UTF-8') as f:
        tags = f.read().split("\n")
    return frozenset(tags)

Sentence = namedtuple("Sentence", "words tags")

Let's read in the Brown corpus to see if our python code works out:

In [None]:
tagset = read_tags(tags_file)
sentences = read_data(txt_file)
sentences

The class `Dataset` below incorporates our function above, reads in the Brown corpus and creates a collection of `keys`, a set of (unique) words, a sequence of words and a mirror sequence of tags as tuples, with `N` being the number of words in the Brown corpus.

Then it splits all this nice data into a training and test decomposition by using the `Subset` class defined further below, which mirrors the `Dataset` class.

In [28]:
class Dataset(namedtuple("_Dataset", "sentences keys vocab X tagset Y training_set testing_set N stream")):
    def __new__(cls, tagfile, datafile, train_test_split=0.8, seed=112890):
        tagset = read_tags(tagfile)
        sentences = read_data(datafile)
        keys = tuple(sentences.keys())
        wordset = frozenset(chain(*[s.words for s in sentences.values()]))
        word_sequences = tuple([sentences[k].words for k in keys])
        tag_sequences = tuple([sentences[k].tags for k in keys])
        N = sum(1 for _ in chain(*(s.words for s in sentences.values())))
        
        
        # split data into train/test sets
        _keys = list(keys)
        if seed is not None: random.seed(seed)
        random.shuffle(_keys)
        split = int(train_test_split * len(_keys))
        training_data = Subset(sentences, _keys[:split])
        testing_data = Subset(sentences, _keys[split:])
        stream = tuple(zip(chain(*word_sequences), chain(*tag_sequences)))
        return super().__new__(cls, dict(sentences), keys, wordset, word_sequences, tagset,
                               tag_sequences, training_data, testing_data, N, stream.__iter__)

    def __len__(self):
        return len(self.sentences)

    def __iter__(self):
        return iter(self.sentences.items())
    
    
class Subset(namedtuple("BaseSet", "sentences keys vocab X tagset Y N stream")):
    def __new__(cls, sentences, keys):
        word_sequences = tuple([sentences[k].words for k in keys])
        tag_sequences = tuple([sentences[k].tags for k in keys])
        wordset = frozenset(chain(*word_sequences))
        tagset = frozenset(chain(*tag_sequences))
        N = sum(1 for _ in chain(*(sentences[k].words for k in keys)))
        stream = tuple(zip(chain(*word_sequences), chain(*tag_sequences)))
        return super().__new__(cls, {k: sentences[k] for k in keys}, keys, wordset, word_sequences,
                               tagset, tag_sequences, N, stream.__iter__)

    def __len__(self):
        return len(self.sentences)

    def __iter__(self):
        return iter(self.sentences.items())

Let's read in the Brown corpus *again*, leveraging our classes above now, which order the corpus into efficiently navigable structures:

In [29]:
data = Dataset(tags_file, txt_file, train_test_split=0.8)

print("There are {} sentences in the corpus.".format(len(data)))
print("There are {} sentences in the training set.".format(len(data.training_set)))
print("There are {} sentences in the testing set.".format(len(data.testing_set)))

assert len(data) == len(data.training_set) + len(data.testing_set), \
       "The number of sentences in the training set + testing set should sum to the number of sentences in the corpus"

There are 1040 sentences in the corpus.
There are 832 sentences in the training set.
There are 208 sentences in the testing set.


In [30]:
print("There are a total of {} samples of {} unique words in the corpus."
      .format(data.N, len(data.vocab)))
print("There are {} samples of {} unique words in the training set."
      .format(data.training_set.N, len(data.training_set.vocab)))
print("There are {} samples of {} unique words in the testing set."
      .format(data.testing_set.N, len(data.testing_set.vocab)))
print("There are {} words in the test set that are missing in the training set."
      .format(len(data.testing_set.vocab - data.training_set.vocab)))

assert data.N == data.training_set.N + data.testing_set.N, \
       "The number of training + test samples should sum to the total number of samples"

There are a total of 20798 samples of 5223 unique words in the corpus.
There are 16638 samples of 4521 unique words in the training set.
There are 4160 samples of 1776 unique words in the testing set.
There are 702 words in the test set that are missing in the training set.


Let's look at an example POS tagging:

In [31]:
key = '10'
print("Sentence: {}".format(key))
print("words:\n\t{!s}".format(data.sentences[key].words))
print("tags:\n\t{!s}".format(data.sentences[key].tags))

Sentence: 10
words:
	('半个', '小时', '不到', '钟', '就', '敲响', '五点', '散课', '大家', '都', '进', '饭厅', '去', '吃', '茶点', '我', '这', '才', '大', '着', '胆')
tags:
	('Half', 'Hours', 'Less than', 'Minutes', 'Just.', 'Ring', 'Five', 'Breakout class', 'Everyone', 'All', 'Into', 'Dining room', 'To', 'Eat', 'Refreshment', 'I', 'This', "It's just", 'Big', 'On the', 'Bile')


This is how easy it is, now, to access words and associated tags, using the vocabulary of Machine Learning: `X` is the independent variable, and `Y` the dependent variable!

In [32]:
# accessing words with Dataset.X and tags with Dataset.Y 
for i in range(2):    
    print("Sentence {}:".format(i + 1), data.X[i])
    print()
    print("Labels {}:".format(i + 1), data.Y[i])
    print()

Sentence 1: ('半个', '小时', '不到', '钟', '就', '敲响', '五点', '散课', '大家', '都', '进', '饭厅', '去', '吃', '茶点', '我', '这', '才', '大', '着', '胆')

Labels 1: ('Half', 'Hours', 'Less than', 'Minutes', 'Just.', 'Ring', 'Five', 'Breakout class', 'Everyone', 'All', 'Into', 'Dining room', 'To', 'Eat', 'Refreshment', 'I', 'This', "It's just", 'Big', 'On the', 'Bile')

Sentence 2: ('走', '下', '凳子', '这时', '暮色', '正浓', '我', '躲进', '一个', '角落', '在', '地板', '上', '坐下', '来', '一直', '支撑', '着', '我', '的')

Labels 2: ('Go', 'Next', 'Stool', 'Then', 'Twilight', 'Positive', 'I', 'Hide in.', 'One', 'Corner', 'In', 'Floor', 'On', 'Sit down', 'To', 'Always', 'Support', 'On the', 'I', 'The')



Use `Dataset.stream()` to enumerate (word, tag) samples for the entire corpus. Let's enumerate the first 4:

In [33]:
print("\nStream (word, tag) pairs:\n")
for i, pair in enumerate(data.stream()):
    print("\t", pair)
    if i > 10: break


Stream (word, tag) pairs:

	 ('半个', 'Half')
	 ('小时', 'Hours')
	 ('不到', 'Less than')
	 ('钟', 'Minutes')
	 ('就', 'Just.')
	 ('敲响', 'Ring')
	 ('五点', 'Five')
	 ('散课', 'Breakout class')
	 ('大家', 'Everyone')
	 ('都', 'All')
	 ('进', 'Into')
	 ('饭厅', 'Dining room')


These are all words and tags in our **training set**. Let's uncover the first 4:

In [34]:
words = [word for i, (word, tag) in enumerate(data.training_set.stream())]
tags = [tag for i, (word, tag) in enumerate(data.training_set.stream())]
words[0:10], tags[0:10]

(['进步', '此外', '我', '也', '深受', '同学', '们', '的', '欢迎', '同'],
 ['Progress',
  'In addition,',
  'I',
  'Also',
  'By',
  'Classmates',
  'We',
  'The',
  'Welcome',
  'With'])

# POS Tagger using BOW Model

Let's create a dictionary of word + tag pairs where the values are just counts. Note that some words may be associated with different POS tags, in which case they will produce *distinct* pairs: 

In [35]:
def pair_counts(tags, words):
    d = defaultdict(lambda: defaultdict(int))
    for tag, word in zip(tags, words):
        d[tag][word] += 1
    return d
        
word_counts = pair_counts(words, tags)
word_counts

defaultdict(<function __main__.pair_counts.<locals>.<lambda>()>,
            {'进步': defaultdict(int, {'Progress': 4}),
             '此外': defaultdict(int, {'In addition,': 4}),
             '我': defaultdict(int, {'I': 823}),
             '也': defaultdict(int, {'Also': 86}),
             '深受': defaultdict(int, {'By': 1}),
             '同学': defaultdict(int, {'Classmates': 1}),
             '们': defaultdict(int, {'We': 17}),
             '的': defaultdict(int, {'The': 1124}),
             '欢迎': defaultdict(int, {'Welcome': 1}),
             '同': defaultdict(int, {'With': 45}),
             '年龄': defaultdict(int, {'Age': 3}),
             '相仿': defaultdict(int, {'Similar': 1}),
             '人': defaultdict(int, {'People': 88}),
             '对': defaultdict(int, {'Right': 42}),
             '平等': defaultdict(int, {'Equal': 3}),
             '相待': defaultdict(int, {'Wait inge': 2}),
             '屈尊': defaultdict(int, {'Condescension': 1}),
             '就': defaultdict(int, {'Just.': 47})

Let's produce a dictionary where words (keys) are associated with their ***most frequent*** POS tag:

In [36]:
mfc_table = dict((word, max(tags.keys(), key=lambda key: tags[key])) for word, tags in word_counts.items())
mfc_table

{'进步': 'Progress',
 '此外': 'In addition,',
 '我': 'I',
 '也': 'Also',
 '深受': 'By',
 '同学': 'Classmates',
 '们': 'We',
 '的': 'The',
 '欢迎': 'Welcome',
 '同': 'With',
 '年龄': 'Age',
 '相仿': 'Similar',
 '人': 'People',
 '对': 'Right',
 '平等': 'Equal',
 '相待': 'Wait inge',
 '屈尊': 'Condescension',
 '就': 'Just.',
 '驾': 'Driving',
 '结果': 'Results',
 '这样': 'This way,',
 '倒': 'Pour down',
 '更好': 'Better',
 '处境': 'Situation',
 '更': 'More',
 '自由': 'Free',
 '还': 'Also',
 '在': 'In',
 '沉思': 'Meditation',
 '着': 'On the',
 '这个': 'This one',
 '跟着': 'Follow',
 '费尔法克斯': 'Fairfax',
 '太太': 'Wife',
 '她': 'She',
 '把': 'Put',
 '刚才': 'Just now',
 '新闻': 'News',
 '重复': 'Repeat',
 '一遍': 'Once again.',
 '说': 'Said',
 '外科医生': 'Surgeon',
 '卡特': 'Carter',
 '已经': 'Have',
 '来': 'To',
 '这会儿': 'Now',
 '罗切斯特': 'Rochester',
 '先生': 'Mr',
 '睡': 'Sleep',
 '得': 'Get it.',
 '好好': 'Good',
 '’': "'",
 '”': '"',
 '喃喃地': 'Muttering',
 '“': '"',
 '你': 'You',
 '现在': 'Right now',
 '上': 'On',
 '哪儿': 'Where',
 '去': 'To',
 '呀': 'Oh, yes.',
 '因为': 'Be

In [37]:
i = 0
for key, value in mfc_table.items():
    print(key, value)
    i += 1
    if i > 120: break

进步 Progress
此外 In addition,
我 I
也 Also
深受 By
同学 Classmates
们 We
的 The
欢迎 Welcome
同 With
年龄 Age
相仿 Similar
人 People
对 Right
平等 Equal
相待 Wait inge
屈尊 Condescension
就 Just.
驾 Driving
结果 Results
这样 This way,
倒 Pour down
更好 Better
处境 Situation
更 More
自由 Free
还 Also
在 In
沉思 Meditation
着 On the
这个 This one
跟着 Follow
费尔法克斯 Fairfax
太太 Wife
她 She
把 Put
刚才 Just now
新闻 News
重复 Repeat
一遍 Once again.
说 Said
外科医生 Surgeon
卡特 Carter
已经 Have
来 To
这会儿 Now
罗切斯特 Rochester
先生 Mr
睡 Sleep
得 Get it.
好好 Good
’ '
” "
喃喃地 Muttering
“ "
你 You
现在 Right now
上 On
哪儿 Where
去 To
呀 Oh, yes.
因为 Because
— —
靠近 Near
大门 Door
那个 That one
教堂 Church
是 Is
他 He
管 Tube
这位 The
母亲 Mother
家 Home
那他 Then he
不是 No
自己 Myself
主动 Active
要 To
抚养 Raise
小姐 Miss
感到 Feel
很 Is
遗憾 Regret
不得不 Had
矮篱 Dwarf Hedges
草地 Grass
和 And
庭园 Garden
分开 Separate
上长 Upper long
一排排 A row of
巨大 Huge
老 Old
荆棘 Thorns
树丛 Trees
强劲 Strong
多节 Multi-section
大 Big
如 Such as
橡树 Oak
疼痛 Pain
蹒跚 Lurched
地 Ground
踱 I'd be
向 To
起身 Up
离开 Leave
台阶 Steps
一 One
屁股 Ass
坐下 Sit down

Python `namedtuple` supports a type of container-like dictionary that, like dictionaries, contains keys that are hashed to particular values. But it supports *both* access from key values as well as *iteration*, the functionality that dictionaries lack.

Let's write a class that takes in a table in its constructor and adds `<MISSING>` POS tags if the word is missing from the training set (possible that a word is in the test set but missing from the training set). It also has a `viterbi` method that takes in the table and builds a sequence of states that we will use in our Hidden Markov Model.

In [38]:
FakeState = namedtuple('FakeState', 'name')

class MFCTagger:
    missing = FakeState(name = '<MISSING>')
    
    def __init__(self, table):
        self.table = defaultdict(lambda: MFCTagger.missing)
        self.table.update({word: FakeState(name=tag) for word, tag in table.items()})
        
    def viterbi(self, seq):
        """This method simplifies predictions by matching the Pomegranate viterbi() interface"""
        return 0., list(enumerate(["<start>"] + [self.table[w] for w in seq] + ["<end>"]))

In [39]:
mfc_model = MFCTagger(mfc_table)

So essentially we built a table that associates words with their most frequent POS tag. This is a simplistic **bag of words** (BOW) model. Let's see, given a sentence, if we *guess the hidden states* (POS tags) right!

In [40]:
def replace_unknown(sequence):
    return [w if w in data.training_set.vocab else 'nan' for w in sequence]

def simplify_decoding(X, model):    
    _, state_path = model.viterbi(replace_unknown(X))
    return [state[1].name for state in state_path[1:-1]]

In [41]:
for key in data.testing_set.keys[:1]:
    print("Sentence Key: {}\n".format(key))
    print("Sentence: {}\n".format(data.sentences[key].words))
    print("Predicted labels:\n-----------------")
    print(simplify_decoding(data.sentences[key].words, mfc_model))
    print()
    print("Actual labels:\n--------------")
    print(data.sentences[key].tags)
    print("\n")

Sentence Key: 15680

Sentence: ('没听说过', '”', '费尔法克斯', '太太', '笑', '着', '说', '“', '鬼', '的', '传说', '也', '没有', '没有', '传奇', '或者', '鬼故事', '”', '“', '我')

Predicted labels:
-----------------
['<MISSING>', '"', 'Fairfax', 'Wife', 'Laugh', 'On the', 'Said', '"', 'Ghost', 'The', '<MISSING>', 'Also', 'No', 'No', 'Legend', 'Or', '<MISSING>', '"', '"', 'I']

Actual labels:
--------------
("I've never heard of it.", '"', 'Fairfax', 'Wife', 'Laugh', 'On the', 'Said', '"', 'Ghost', 'The', 'Legend', 'Also', 'No', 'No', 'Legend', 'Or', 'Ghost Story', '"', '"', 'I')




Pretty good! Let's evaluate the accuracy of our most-frequent-tag tagger:

In [42]:
def accuracy(X, Y, model):
    
    correct = total_predictions = 0
    for observations, actual_tags in zip(X, Y):
        
        # The model.viterbi call in simplify_decoding will return None if the HMM
        # raises an error (for example, if a test sentence contains a word that
        # is out of vocabulary for the training set). Any exception counts the
        # full sentence as an error (which makes this a conservative estimate).
        try:
            most_likely_tags = simplify_decoding(observations, model)
            correct += sum(p == t for p, t in zip(most_likely_tags, actual_tags))
        except:
            pass
        total_predictions += len(observations)
    return correct / total_predictions

In [43]:
mfc_training_acc = accuracy(data.training_set.X, data.training_set.Y, mfc_model)
print("training accuracy mfc_model: {:.2f}%".format(100 * mfc_training_acc))

mfc_testing_acc = accuracy(data.testing_set.X, data.testing_set.Y, mfc_model)
print("testing accuracy mfc_model: {:.2f}%".format(100 * mfc_testing_acc))

training accuracy mfc_model: 100.00%
testing accuracy mfc_model: 81.95%


# Hidden Markov Model

Let's build a POS tagger using a Hidden Markov Model.

First, let's see how many POS tags we have in our corpus, using the python `Counter` structure that we used last week to count instances.

In [44]:
def unigram_counts(sequences):
    return Counter(sequences)

tags = [tag for i, (word, tag) in enumerate(data.training_set.stream())]
tag_unigrams = unigram_counts(tags)
len(tag_unigrams)

3215

We'll *slightly* modify the code above to get our POS bigrams, from *both* training and test subsets, to uncover which POS tags follow which other POS tags. So, instead of a simple list of POS tags, `Counter` will count *neighboring* POS tuples! 

In [45]:
def bigram_counts(sequences):
    return Counter(sequences)

tags = [tag for i, (word, tag) in enumerate(data.stream())]
o = [(tags[i],tags[i+1]) for i in range(0,len(tags)-2,2)]
tag_bigrams = bigram_counts(o)
len(tag_bigrams) 

8188

What tags do our sentences *begin* with?

In [46]:
def starting_counts(sequences):
    return Counter(sequences)

tags = [tag for i, (word, tag) in enumerate(data.stream())]
starts_tag = [i[0] for i in data.Y]
tag_starts = starting_counts(starts_tag)
tag_starts

Counter({'Half': 1,
         'Go': 3,
         'Magic': 1,
         'No': 10,
         'Do': 4,
         'The': 73,
         'On the': 12,
         'Progress': 1,
         'I': 58,
         'The date': 1,
         'When': 4,
         'Walked': 1,
         'Said': 7,
         'Bread': 1,
         "Can't": 3,
         'Knee': 1,
         '"': 33,
         'People': 9,
         'Despise': 1,
         'Be sure.': 1,
         'Also': 10,
         'Public': 1,
         'You': 11,
         'Obvious': 1,
         'To': 14,
         'Whole': 1,
         'Stand': 3,
         'Horse': 2,
         'Love': 3,
         'Go inside.': 1,
         ';': 4,
         'Yes': 7,
         'Charges': 1,
         'Rewards': 1,
         'Sink': 1,
         'She': 17,
         'Moon': 1,
         'Recognize it.': 1,
         'We': 3,
         'Apartments': 1,
         'Chair': 1,
         'Do you?': 3,
         'Forever': 2,
         'Girl': 1,
         'Wife': 4,
         'But': 13,
         'Jane': 3,
        

What tags do our sentences *end* with?

In [47]:
def ending_counts(sequences):    
    return Counter(sequences)

end_tag = [i[len(i)-1] for i in data.Y]
tag_ends = ending_counts(end_tag)
# tag_ends

Not surprising that most end with a period `.`! Ideally, we should end with the previous-to-last tag! 

In [48]:
end_tag = [i[len(i)-2] for i in data.Y]
tag_ends = ending_counts(end_tag)
# tag_ends

Let's create our Hidden Markov Model and peek into most popular words per POS tag.

`tag_words_count` contains words associated to each POS tag, arranged by frequency so that we can eventually evaluate *emission* probabilities, which are probabilities of observable states (wrods) given hidden states (POS tags).

In [49]:
hmm_model = HiddenMarkovModel(name="base-hmm-tagger")

tags = [tag for i, (word, tag) in enumerate(data.stream())]
words = [word for i, (word, tag) in enumerate(data.stream())]

tags_count = unigram_counts(tags)
tag_words_count = pair_counts(tags, words)

starting_tag_list = [i[0] for i in data.Y]
#ending_tag_list = [i[-1] if len(i)==1 else i[-2] for i in data.Y]
#ending_tag_list = [i[-1] for i in data.Y]
ending_tag_list = [i[len(i)-2] for i in data.Y]

starting_tag_count = starting_counts(starting_tag_list) #the number of times a tag occured at the start
ending_tag_count = ending_counts(ending_tag_list)       #the number of times a tag occured at the end

# tag_words_count

In [50]:
ending_tag_list

['On the',
 'I',
 'No',
 'Performance',
 'Made',
 'Miss',
 'Same',
 'Equal',
 'And also',
 'This one',
 'Light',
 '"',
 'Or',
 'But',
 'On the',
 'First',
 'What about?',
 'Thousands',
 'Eighty.',
 'You',
 'I',
 'He',
 'Everywhere',
 'Yes',
 'Yes',
 'Myself',
 '"',
 'You',
 'No',
 'Look for it.',
 'I will',
 'Leg',
 'People',
 'Body',
 'Outside',
 'World',
 'In',
 'Suffered',
 'But',
 'Inside',
 'We',
 'Sad',
 'Mixed up',
 'One',
 'Take up',
 'Silently',
 'Cloud Block',
 'We',
 '"',
 "Let's do it",
 'She',
 'Low',
 'All',
 "'m afraid",
 'Yes',
 'A',
 '"',
 'The',
 'She',
 'Raise',
 'Good',
 'Myself',
 'Remember',
 'I',
 'Excited',
 'Warned',
 'Content',
 'Narrative',
 'No',
 'Crazy',
 'Is',
 'Some',
 'You',
 'Right',
 'Is',
 'That',
 'Joy',
 '"',
 'Stand',
 'She',
 'The',
 'Servant',
 'Plate',
 'The',
 "It's because",
 'She',
 'Go',
 'One moment.',
 'All',
 'Barbara',
 'This',
 'A small piece.',
 'Fruit',
 'Less',
 'Drink',
 'Hostess',
 'Tray',
 'And',
 'Solemn',
 'She',
 'The',
 'Is',

In [51]:
starting_tag_count

Counter({'Half': 1,
         'Go': 3,
         'Magic': 1,
         'No': 10,
         'Do': 4,
         'The': 73,
         'On the': 12,
         'Progress': 1,
         'I': 58,
         'The date': 1,
         'When': 4,
         'Walked': 1,
         'Said': 7,
         'Bread': 1,
         "Can't": 3,
         'Knee': 1,
         '"': 33,
         'People': 9,
         'Despise': 1,
         'Be sure.': 1,
         'Also': 10,
         'Public': 1,
         'You': 11,
         'Obvious': 1,
         'To': 14,
         'Whole': 1,
         'Stand': 3,
         'Horse': 2,
         'Love': 3,
         'Go inside.': 1,
         ';': 4,
         'Yes': 7,
         'Charges': 1,
         'Rewards': 1,
         'Sink': 1,
         'She': 17,
         'Moon': 1,
         'Recognize it.': 1,
         'We': 3,
         'Apartments': 1,
         'Chair': 1,
         'Do you?': 3,
         'Forever': 2,
         'Girl': 1,
         'Wife': 4,
         'But': 13,
         'Jane': 3,
        

Let's convert word frequencies by POS tag to probabilities by dividing by the total number of words per POS tag, yielding the `distribution` of words.

We'll define HMM emission probabilities using that `distribution`.

In [52]:
to_pass_states = []
for tag, words_dict in tag_words_count.items():
    total = float(sum(words_dict.values()))
    distribution = {word: count/total for word, count in words_dict.items()}
    tag_emissions = DiscreteDistribution(distribution)
#     print(distribution)
    tag_state = State(tag_emissions, name=tag)
    to_pass_states.append(tag_state)

In [53]:
distribution

{'并不需要': 1.0}

`to_pass_states` yields the probability distribution of words per POS tag:

In [54]:
# to_pass_states

Let's add states to our model:

In [55]:
hmm_model.add_states() 

The start probability for each tag is how many times it is a sentence-starting POS tag divided by its total count. We build the starting transitions for our HMM model:

In [56]:
start_prob={}

for tag in tags:
    start_prob[tag] = starting_tag_count[tag] / tags_count[tag]

for tag_state in to_pass_states :
    hmm_model.add_transition(hmm_model.start, tag_state, start_prob[tag_state.name])  

The end probability for each tag is how many times it is a sentence-ending POS tag divided by its total count. We build the ending transitions for our HMM model:

In [57]:
end_prob={}

for tag in tags:
    end_prob[tag] = ending_tag_count[tag]/tags_count[tag]
    
for tag_state in to_pass_states :
    hmm_model.add_transition(tag_state, hmm_model.end, end_prob[tag_state.name])

We now add the transition probabilities for our model, which uses our POS bigrams to enumerate what the probabilities are for transiting from one POS tag to another.

In [61]:
transition_prob_pair={}

for key in tag_bigrams.keys():
    transition_prob_pair[key] = tag_bigrams.get(key)/tags_count[key[0]]
    
for tag_state in to_pass_states:
    for next_tag_state in to_pass_states:
       if(tag_state.name,next_tag_state.name) in tag_bigrams.keys():
          hmm_model.add_transition(tag_state, next_tag_state, transition_prob_pair[(tag_state.name, next_tag_state.name)])

We *bake* our model:

In [62]:
hmm_model.bake()

We can now evaluate the accuracy of our HMM model and compare it to our BOW model:

In [63]:
hmm_training_acc = accuracy(data.training_set.X, data.training_set.Y, hmm_model)
print("training accuracy basic hmm model: {:.2f}%".format(100 * hmm_training_acc))

hmm_testing_acc = accuracy(data.testing_set.X, data.testing_set.Y, hmm_model)
print("testing accuracy basic hmm model: {:.2f}%".format(100 * hmm_testing_acc))

training accuracy basic hmm model: 0.12%
testing accuracy basic hmm model: 0.72%


Here's a decoding example:

In [59]:
for key in data.testing_set.keys[1:]:
    print("Sentence Key: {}\n".format(key))
    print("Sentence: {}\n".format(data.sentences[key].words))
    print("Predicted labels:\n-----------------")
    print(simplify_decoding(data.sentences[key].words, hmm_model))
   
    print("Actual labels:\n--------------")
    print(data.sentences[key].tags)
    print("\n")

Sentence Key: 8980

Sentence: ('手', '叫', '道', '我', '定睛一看', '见', '是', '一个', '少妇', '穿戴', '得', '像', '一个', '衣着', '讲究', '的', '仆人', '一', '付', '已婚')

Predicted labels:
-----------------


TypeError: 'NoneType' object is not subscriptable

A 96% accuracy for our HMM model compared to a 93% accuracy for our BOW model is a ***huge*** improvement as it brings language understanding error to below 4%, and 5% error is considered a *gold standard* for NLP. Speech-to-text frameworks like Alexa and Siri only started betting popular when they crossed the 5% threshold.

# References

- Hands on Markov models with python, by Ankur Ankan and Abinash Panda, [on amazon](https://www.amazon.com/Hands-Markov-Models-Python-probabilistic/dp/1788625447/ref=sr_1_2?keywords=hands+on+markov+models+with+python&qid=1581280984&sr=8-2)</div>

- [Universal Dependency Parsing from Scratch](https://nlp.stanford.edu/pubs/qi2018universal.pdf)

- [Statistical Machine Translation](https://en.wikipedia.org/wiki/Statistical_machine_translation)

- [Language Models](https://en.wikipedia.org/wiki/Language_model).

# Homework

Use the methodology in this notebook to build a statistical language translator, *from your language to english*. So, from Hindi or Chinese to English. Teams of **3** students. You *have* to use a Hidden Markov Model and `pomegranate` as your HMM library, to ensure all student teams start from the same baseline. Start from a Most Frequent Word (BOW) translation baseline, then move on to a Hidden Markov Model to improve translation. How much can you improve it by? The translation engine with the best accuracy, per language, will be presented in class.