# Examples of Markov Chains

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import numpy.random as rnd
import string

from pandas import Series
from pandas import DataFrame
from typing import List

from tqdm import tnrange
from plotnine import *

## I. Markov chain as language model 
There are many methods for detecting the language of a text document. 
The simplest one is based on Markov chains of order one, i.e., the last letter determines the probabilities for the next letter.
It is terribly naive, but it is useful for language detection.

## II. Likelihood and data generation

Let $\beta[x]$ denote the probability that the first letter is $x$. Let $\alpha[x,y]$ denote the probability that the next letter is $y$ provided that the last letter was $x$.
Then we can easily compute the probability that a word $x_0,\ldots, x_n$ came from this distribution:

\begin{align*}
\Pr[\boldsymbol{x}|\alpha,\beta]= \beta[x_0]\cdot\prod_{i=1}^n \alpha[x_{i-1},x_i]
\end{align*}

Let us define likelihood for a truly random lower-case language as an example.
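
A minimal sketch of such an example, assuming that "truly random" means uniform $\beta$ and $\alpha$ over the 26 letters a-z. The helper names `word_likelihood` and `sample_word` are illustrative and not part of the original notebook.

```python
import string
import numpy as np
import pandas as pd

# Assumption: the "truly random" language is uniform over a-z.
letters = list(string.ascii_lowercase)
k = len(letters)
uniform_beta = pd.Series(np.full(k, 1 / k), index=letters)
uniform_alpha = pd.DataFrame(np.full((k, k), 1 / k), index=letters, columns=letters)

def word_likelihood(word, alpha, beta):
    """Pr[x | alpha, beta] = beta[x_0] * prod_i alpha[x_{i-1}, x_i]."""
    prob = beta[word[0]]
    for prev, nxt in zip(word, word[1:]):
        prob *= alpha.loc[prev, nxt]
    return prob

def sample_word(length, alpha, beta, rng):
    """Draw a word of the given length from the Markov chain."""
    # First letter comes from beta, each later letter from the row of alpha
    # that belongs to the previous letter.
    word = [beta.index[rng.choice(len(beta), p=beta.to_numpy())]]
    for _ in range(length - 1):
        row = alpha.loc[word[-1]]
        word.append(row.index[rng.choice(len(row), p=row.to_numpy())])
    return "".join(word)

rng = np.random.default_rng(0)
random_word = sample_word(5, uniform_alpha, uniform_beta, rng)
random_word_likelihood = word_likelihood(random_word, uniform_alpha, uniform_beta)
```

Under these uniform parameters every word of length $n+1$ has the same likelihood $(1/26)^{n+1}$, which makes the model a convenient baseline for checking the likelihood code.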

## III. Naive parameter estimation

The parameters of the Markov chain can be estimated from the relative frequencies of start symbols and bigrams in words. The naive maximum likelihood estimates are 

\begin{align*}
 \beta[x]&=\frac{\# \text{words starting with }x}{\# \text{words}}\\
 \alpha[x,y]&=\frac{\# \text{bigrams of }xy}{\# \text{bigrams starting with }x}
\end{align*}
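
The two formulas can be sketched on a toy word list as follows. The function `estimate_mle` and the three example words are hypothetical and only serve to illustrate the counting.

```python
from collections import Counter

def estimate_mle(words):
    """Return (beta, alpha) as dicts of relative frequencies."""
    # Count first letters and all adjacent letter pairs (bigrams).
    first = Counter(w[0] for w in words if w)
    bigrams = Counter(pair for w in words for pair in zip(w, w[1:]))

    # Number of bigrams starting with each letter x.
    starts = Counter()
    for (x, _), count in bigrams.items():
        starts[x] += count

    n_words = sum(first.values())
    beta = {x: count / n_words for x, count in first.items()}
    alpha = {(x, y): count / starts[x] for (x, y), count in bigrams.items()}
    return beta, alpha

toy_beta, toy_alpha = estimate_mle(["aba", "ab", "ba"])
```

Pairs that never occur in the word list, such as `("a", "a")` here, are simply absent from `toy_alpha`, which corresponds to an estimated probability of zero.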

# Homework

## 9.1 Language detection without Laplace smoothing (<font color='red'>1p</font>)

Use files `est_training_set.csv` and `eng_training_set.csv` in the directory `data` to learn model parameters $\alpha$ and $\beta$ for both languages using maximum likelihood estimates.
Put these parameters into the formal model to compute probabilities

\begin{align*}
      p_1 &=\Pr[word|\mathsf{Estonian}]\\
      p_2 &=\Pr[word|\mathsf{English}]
\end{align*}

and then use Bayes formula

\begin{align*}
 \Pr[\mathsf{Estonian}|word]
 =\frac{\Pr[word|\mathsf{Estonian}]\Pr[\mathsf{Estonian}]}{\Pr[word]}
\end{align*}
to guess the language of a word on test samples `est_test_set.csv` and `eng_test_set.csv`.
Why does the procedure not work? 

**Hint:** The number of samples is not the problem. You can assume that there are enough samples to estimate all parameters with high accuracy. The same problem could manifest even with millions of example words.

## Solution

**We created a `language` class that estimates the parameters $\alpha$ and $\beta$ for each language and calculates the probability of a word. To be able to estimate whether an Estonian word is English (and vice versa), we used the union of the alphabets of both languages as the alphabet (all letters seen in the training data).**

In [2]:
from collections import defaultdict

class language:
    def __init__(self, path):
        self.letters = set()
        self.first_letters = defaultdict(int)
        self.letter_pairs = defaultdict(int)
        self.count_letters(path)
        
        # Placeholders; filled in by estimate_parameters()
        self.beta = Series(dtype=float)
        self.alpha = DataFrame()
        
    def count_letters(self, path):
        with open(path, 'r') as infile:
            for line in infile:
                word = line.strip().lower()
                if word.startswith('"') and word.endswith('"'):
                    word = word[1:-1]
                if len(word) == 0:
                    continue
                    
                self.first_letters[word[0]] += 1
                self.letters.add(word[0])        
                for pair in zip(word, word[1:]):
                    self.letters.add(pair[1])
                    self.letter_pairs[pair] += 1
                    
    def estimate_parameters(self, alphabet, laplace):
        self.estimate_alpha(alphabet, laplace)
        self.estimate_beta(alphabet, laplace)
    
    def estimate_alpha(self, alphabet, laplace):
        self.alpha = DataFrame(np.zeros((len(alphabet), len(alphabet))) + laplace, index = alphabet, columns = alphabet)
        
        for pair, count in self.letter_pairs.items():
            self.alpha.loc[pair[0], pair[1]] += count
        
        self.alpha = self.alpha.apply(lambda x: x if np.sum(x) == 0 else x / np.sum(x), axis = 1)
        
    def estimate_beta(self, alphabet, laplace):
        self.beta = Series(np.zeros(len(alphabet)) + laplace, index = alphabet)
        
        # Add the observed first-letter counts on top of the pseudo-counts
        for letter, count in self.first_letters.items():
            self.beta[letter] += count
            
        self.beta = self.beta / self.beta.sum()
            
    def word_probability(self, word):
        if len(word) == 0:
            return 0
        
        prob = self.beta[word[0]]
        for first, second in zip(word, word[1:]):
            prob *= self.alpha.loc[first, second]
            
        return prob
    
estonian = language('data/est_training_set.csv')
english = language('data/eng_training_set.csv')
alphabet = estonian.letters.union(english.letters)

estonian.estimate_parameters(alphabet, laplace = 0)
english.estimate_parameters(alphabet, laplace = 0)

**Having trained the language models, we used them to classify the words in the test data of both languages.**

In [3]:
def estimate_language(path, language_for, language_against, prior = np.array([0.5, 0.5])):
    estimates = []
    
    with open(path, 'r') as infile:
        for line in infile:
            word = line.strip().lower()
            if word.startswith('"') and word.endswith('"'):
                word = word[1:-1]
            if len(word) == 0:
                continue
            
            likelihood_for = language_for.word_probability(word)
            likelihood_against = language_against.word_probability(word)
            
            posterior = np.array([likelihood_for, likelihood_against]) * prior
            posterior = posterior / sum(posterior)
            
            estimates.append(posterior[0] > posterior[1])
            
    return estimates

estonian_estimates = estimate_language('data/est_test_set.csv', estonian, english)
english_estimates = estimate_language('data/eng_test_set.csv', english, estonian)
print('Accuracy classifying Estonian words: {}'.format(np.mean(estonian_estimates)))
print('Accuracy classifying English words: {}'.format(np.mean(english_estimates)))
print('Total accuracy classifying words: {}'.format(np.mean(estonian_estimates + english_estimates)))

Accuracy classifying Estonian words: 0.9
Accuracy classifying English words: 0.53
Total accuracy classifying words: 0.715




**The procedure fails for words whose likelihood is $0$ under both languages. This happens whenever a word starts with a letter or contains a bigram that was never observed in either training set, because the maximum likelihood estimate of the corresponding parameter is exactly $0$. In that case the normalizing term of the posterior distribution also becomes $0$ and we have a $0/0$ division. Of course, to eliminate the warning, we could have compared the likelihoods directly to determine the language of our words. However, that would not have helped against the rather poor performance of our classifier, as we would still not have known how to classify a word with zero likelihood under both languages.**
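
One way to make this failure explicit, rather than letting NumPy emit a division warning, is to check the normalizing term before dividing. The sketch below is an illustration only; the helper `posterior` is hypothetical and is not the function used in the cells above.

```python
import numpy as np

def posterior(likelihoods, prior):
    """Apply Bayes' formula; return None when every likelihood is zero."""
    joint = np.asarray(likelihoods, dtype=float) * np.asarray(prior, dtype=float)
    evidence = joint.sum()  # Pr[word], the normalizing term
    if evidence == 0:
        # 0/0 case: neither model assigns the word any probability
        return None
    return joint / evidence

prior = [0.5, 0.5]
undecided = posterior([0.0, 0.0], prior)  # zero likelihood under both models
decided = posterior([0.0, 1e-6], prior)   # zero under the first model only
```

Returning `None` only reports the problem. The word still cannot be classified, which is why the estimates themselves have to change.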

## 9.2 Language detection with Laplace smoothing (<font color='red'>1p</font>)

Use files `est_training_set.csv` and `eng_training_set.csv` in the directory `data` to learn model parameters $\alpha$ and $\beta$ for both languages using Laplace smoothing.
Put these parameters into the formal model to compute probabilities

\begin{align*}
      p_1 &=\Pr[word|\mathsf{Estonian}]\\
      p_2 &=\Pr[word|\mathsf{English}]
\end{align*}

and then use Bayes formula

\begin{align*}
 \Pr[\mathsf{Estonian}|word]
 =\frac{\Pr[word|\mathsf{Estonian}]\Pr[\mathsf{Estonian}]}{\Pr[word]}
\end{align*}
to guess the language of a word on test samples `est_test_set.csv` and `eng_test_set.csv`.
Did the problem disappear? If not, what do we have to consider when we apply Laplace smoothing?  

**To overcome the problem outlined above, we retrained our language models using Laplace smoothing, i.e., we added a pseudo-count of $1$ to the count of every possible starting letter and bigram before normalizing the counts to obtain $\alpha$ and $\beta$.**
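
The effect of the pseudo-count on a single row of counts can be sketched as follows. The helper `laplace_smooth` and the example counts are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def laplace_smooth(counts, pseudo_count=1.0):
    """Add a pseudo-count to every cell, then normalize to probabilities."""
    smoothed = np.asarray(counts, dtype=float) + pseudo_count
    return smoothed / smoothed.sum()

raw_counts = [3, 1, 0]                      # the third symbol was never observed
unsmoothed = laplace_smooth(raw_counts, 0)  # plain relative frequencies
smoothed = laplace_smooth(raw_counts)       # counts become 4, 2, 1 out of 7
```

With smoothing, the unseen symbol gets probability $1/7$ instead of $0$, so a single unseen bigram no longer forces the likelihood of a whole word to zero.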

In [4]:
estonian.estimate_parameters(alphabet, laplace = 1)
english.estimate_parameters(alphabet, laplace = 1)

estonian_estimates = estimate_language('data/est_test_set.csv', estonian, english)
english_estimates = estimate_language('data/eng_test_set.csv', english, estonian)
print('Accuracy classifying Estonian words: {}'.format(np.mean(estonian_estimates)))
print('Accuracy classifying English words: {}'.format(np.mean(english_estimates)))
print('Total accuracy classifying words: {}'.format(np.mean([estonian_estimates] + [english_estimates])))

Accuracy classifying Estonian words: 0.96
Accuracy classifying English words: 0.79
Total accuracy classifying words: 0.875


**We can see quite a big improvement in the classification, especially on English words, where the accuracy rises from 0.53 to 0.79.**