# Examples of Markov Chains

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import numpy.random as rnd
import string

from pandas import Series
from pandas import DataFrame
from typing import List

from tqdm import tnrange
from plotnine import *

# Local imports
from common import *
from convenience import *

## I. Markov chain as a language model 
There are many methods for detecting the language of a text document. 
One of the simplest is based on Markov chains of order one, i.e., the last letter alone determines the probabilities for the next letter.
The model is very naive, but it is still useful for language detection.

## II. Likelihood and data generation

Let $\beta[x]$ denote the probability that the first letter is $x$. Let $\alpha[x,y]$ denote the probability that the next letter is $y$ provided that the last letter was $x$.
Then we can directly compute the probability that this model generates a word $\boldsymbol{x} = (x_0,\ldots, x_n)$:

\begin{align*}
\Pr[\boldsymbol{x}|\alpha,\beta]= \beta[x_0]\cdot\prod_{i=1}^n \alpha[x_{i-1},x_i]\enspace.
\end{align*}
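
As a minimal sketch of the formula above, the product can be evaluated directly when $\beta$ and $\alpha$ are stored as plain dictionaries. The function name `word_likelihood` and the two-letter toy parameters are assumptions made for illustration only; they are not part of the notebook's helper modules.

```python
from typing import Dict, Tuple


def word_likelihood(word: str,
                    beta: Dict[str, float],
                    alpha: Dict[Tuple[str, str], float]) -> float:
    """Pr[word | alpha, beta] for a first-order Markov chain."""
    if len(word) == 0:
        return 1.0  # convention: the empty word has an empty product
    pr = beta[word[0]]  # probability of the first letter
    for prev, nxt in zip(word, word[1:]):
        pr *= alpha[(prev, nxt)]  # transition prev -> nxt
    return pr


# Hypothetical two-letter language, chosen only to make the arithmetic visible
toy_beta = {'a': 0.6, 'b': 0.4}
toy_alpha = {('a', 'a'): 0.7, ('a', 'b'): 0.3,
             ('b', 'a'): 0.2, ('b', 'b'): 0.8}

toy_pr = word_likelihood('aab', toy_beta, toy_alpha)  # 0.6 * 0.7 * 0.3
print(toy_pr)
```

For long words the product of many small factors can underflow, so summing logarithms of the same factors is a common alternative.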

As an example, let us define a likelihood for a truly random lower-case language.

In [2]:
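def rlang_likelihood(x: str) -> float:
    """Likelihood of the word x under the uniform random lower-case language."""
    alphabet = list(string.ascii_lowercase)
    # Uniform weights over all letter pairs (x, y).
    # NOTE: 1/len(df) equals 1/676 here, i.e. it is normalised over all pairs;
    # a transition probability conditioned on x would be 1/26.
    alpha = (combine_categories({'x': alphabet, 'y': alphabet})
             .assign(pr = lambda df: 1/len(df))
             .set_index(['x', 'y']))
    # Uniform distribution over the first letter
    beta = (DataFrame({'x': alphabet})
            .assign(pr = lambda df: 1/len(df))
            .set_index(['x']))

    # The empty word corresponds to an empty product
    if len(x) == 0:
        return 1.0

    pr = beta.loc[x[0], 'pr']
    for i in range(1, len(x)):
        pr *= alpha.loc[(x[i-1], x[i]), 'pr']

    return pr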

In [3]:
print(rlang_likelihood('abs'))
print(rlang_likelihood('sss'))

8.416533573215762e-08
8.416533573215762e-08


In [4]:
def rlang_gen(n: int) -> str:
    """Generate a word of n independent, uniformly random lower-case letters."""
    alphabet = np.array(list(string.ascii_lowercase))
    return ''.join(list(rnd.choice(alphabet, n, replace=True)))

In [5]:
rlang_gen(5)

'eopdt'
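
The generator above draws every letter independently. For a chain with non-uniform parameters, each letter has to be drawn from the row of $\alpha$ selected by the previous letter. The following is a minimal sketch under the assumption that $\beta$ and $\alpha$ are stored as dictionaries; `markov_gen` and the toy parameters are hypothetical names introduced here, and the standard-library `random` module is used instead of `numpy.random` to keep the example self-contained.

```python
import random
from typing import Dict, List, Optional, Tuple


def markov_gen(n: int,
               alphabet: List[str],
               beta: Dict[str, float],
               alpha: Dict[Tuple[str, str], float],
               rng: Optional[random.Random] = None) -> str:
    """Draw an n-letter word from a first-order Markov chain."""
    rng = rng or random.Random()
    if n <= 0:
        return ''
    # First letter is drawn from beta
    letters = rng.choices(alphabet, weights=[beta[a] for a in alphabet], k=1)
    # Every further letter is drawn from the row of alpha given by the last letter
    while len(letters) < n:
        prev = letters[-1]
        weights = [alpha[(prev, a)] for a in alphabet]
        letters += rng.choices(alphabet, weights=weights, k=1)
    return ''.join(letters)


# Hypothetical degenerate chain: always start with 'a', then alternate letters
alt_alphabet = ['a', 'b']
alt_beta = {'a': 1.0, 'b': 0.0}
alt_alpha = {('a', 'a'): 0.0, ('a', 'b'): 1.0,
             ('b', 'a'): 1.0, ('b', 'b'): 0.0}

alternating = markov_gen(6, alt_alphabet, alt_beta, alt_alpha, random.Random(0))
print(alternating)
```

The degenerate parameters make the output predictable, which is convenient for checking that the transitions are looked up in the right direction.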

## III. Naive parameter estimation

The parameters of the Markov chain can be estimated from the relative frequencies of start symbols and bigrams in the training words. The naive maximum likelihood estimates are:

\begin{align*}
 \beta[x]&=\frac{\# \text{words starting with }x}{\# \text{words}}\\
 \alpha[x,y]&=\frac{\# \text{bigrams of }xy}{\# \text{bigrams starting with }x}\enspace.
\end{align*}
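
The two estimates above amount to counting and dividing. A minimal sketch, assuming the training data is available as a plain list of words; the function name `estimate_mle` and the toy word list are illustrative assumptions and do not reflect the format of the course data files.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def estimate_mle(words: Iterable[str]) -> Tuple[Dict[str, float],
                                                Dict[Tuple[str, str], float]]:
    """Maximum likelihood estimates of beta and alpha from a list of words."""
    words = [w for w in words if len(w) > 0]
    # Number of words starting with each letter
    first = Counter(w[0] for w in words)
    # Number of occurrences of each bigram xy
    bigrams = Counter((x, y) for w in words for x, y in zip(w, w[1:]))
    # Number of bigrams starting with each letter x
    row_totals: Counter = Counter()
    for (x, _), count in bigrams.items():
        row_totals[x] += count

    beta = {x: count / len(words) for x, count in first.items()}
    alpha = {(x, y): count / row_totals[x] for (x, y), count in bigrams.items()}
    # Letters and bigrams that never occur are simply absent from the dictionaries
    return beta, alpha


toy_words = ['aba', 'ab', 'ba']  # hypothetical training set
mle_beta, mle_alpha = estimate_mle(toy_words)
print(mle_beta)
print(mle_alpha)
```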

# Homework

## 9.1 Language detection without Laplace smoothing (<font color='red'>1p</font>)

Use files `est_training_set.csv` and `eng_training_set.csv` in the directory `data` to learn model parameters $\alpha$ and $\beta$ for both languages using maximum likelihood estimates.
Put these parameters into the formal model to compute probabilities

\begin{align*}
      p_1 &=\Pr[word|\mathsf{Estonian}]\\
      p_2 &=\Pr[word|\mathsf{English}]
\end{align*}

and then use Bayes' formula

\begin{align*}
 \Pr[\mathsf{Estonian}|word]
 =\frac{\Pr[word|\mathsf{Estonian}]\Pr[\mathsf{Estonian}]}{\Pr[word]}
\end{align*}
to guess the language of a word on test samples `est_test_set.csv` and `eng_test_set.csv`.
Why does the procedure not work? 

**Hint:** The number of samples is not the problem. You can assume that there are enough samples to estimate all parameters with high accuracy. The same problem could have manifested even if there were millions of word examples.
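
For reference, the Bayes step can be sketched as follows, with the denominator $\Pr[word]$ expanded by the law of total probability over the two languages. The function name `posterior_estonian` and the default prior of 0.5 are assumptions made for illustration; the exercise does not fix the prior.

```python
def posterior_estonian(p_est: float, p_eng: float, prior_est: float = 0.5) -> float:
    """Pr[Estonian | word] from the two likelihoods and a prior.

    p_est = Pr[word | Estonian], p_eng = Pr[word | English].
    The prior of 0.5 is an assumed default, not given by the exercise.
    """
    prior_eng = 1.0 - prior_est
    # Pr[word] = Pr[word|Estonian] Pr[Estonian] + Pr[word|English] Pr[English]
    evidence = p_est * prior_est + p_eng * prior_eng
    if evidence == 0.0:
        raise ValueError('Pr[word] is zero, so the posterior is undefined')
    return p_est * prior_est / evidence


# Hypothetical likelihood values, chosen only to show the arithmetic
post = posterior_estonian(0.003, 0.001)
print(post)
```

A word is then labelled Estonian when the posterior exceeds one half, and English otherwise.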

## 9.2 Language detection with Laplace smoothing (<font color='red'>1p</font>)

Use files `est_training_set.csv` and `eng_training_set.csv` in the directory `data` to learn model parameters $\alpha$ and $\beta$ for both languages using Laplace smoothing.
Put these parameters into the formal model to compute probabilities

\begin{align*}
      p_1 &=\Pr[word|\mathsf{Estonian}]\\
      p_2 &=\Pr[word|\mathsf{English}]
\end{align*}

and then use Bayes' formula

\begin{align*}
 \Pr[\mathsf{Estonian}|word]
 =\frac{\Pr[word|\mathsf{Estonian}]\Pr[\mathsf{Estonian}]}{\Pr[word]}
\end{align*}
to guess the language of a word on test samples `est_test_set.csv` and `eng_test_set.csv`.
Did the problem disappear? If not, what do we have to consider when applying Laplace smoothing?
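
One common form of Laplace smoothing adds one pseudo-count to every possible start letter and every possible bigram before normalising. A minimal sketch under that assumption; `estimate_laplace` is a hypothetical name, the alphabet is passed in explicitly, and the code assumes that every letter in the words belongs to that alphabet.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def estimate_laplace(words: Iterable[str],
                     alphabet: List[str]) -> Tuple[Dict[str, float],
                                                   Dict[Tuple[str, str], float]]:
    """Add-one (Laplace) smoothed estimates of beta and alpha."""
    k = len(alphabet)
    words = [w for w in words if len(w) > 0]
    first = Counter(w[0] for w in words)
    bigrams = Counter((x, y) for w in words for x, y in zip(w, w[1:]))
    # Number of bigrams starting with x: every letter except the last of each word
    row_totals = Counter(x for w in words for x in w[:-1])

    # One pseudo-count per possible outcome, so k is added to each denominator
    beta = {x: (first[x] + 1) / (len(words) + k) for x in alphabet}
    alpha = {(x, y): (bigrams[(x, y)] + 1) / (row_totals[x] + k)
             for x in alphabet for y in alphabet}
    return beta, alpha


toy_alphabet = ['a', 'b']        # hypothetical alphabet
toy_training = ['aba', 'ab', 'ba']  # hypothetical training set
lap_beta, lap_alpha = estimate_laplace(toy_training, toy_alphabet)
print(lap_beta)
print(lap_alpha)
```

With these estimates every start letter and every bigram over the chosen alphabet receives a strictly positive probability, and each row of $\alpha$ still sums to one.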
