# Text segmentation using Hidden Markov Models

## Theoretical questions


This lab builds an email segmentation tool that separates the header of an email from its body. The task is performed by learning an HMM $(A, B, \pi)$ with two states: state $1$ for the header and state $2$ for the body. The model assumes that every mail contains a header, so the decoding necessarily begins in state $1$.

$\textbf{Q1 :}$ Give the value of the vector $\pi$ of initial probabilities.


Since each mail contains exactly one header and one body, each mail goes through the transition from $1$ to $2$ exactly once. The transition matrix $A(i, j) = P(j|i)$, estimated on a small labeled corpus, therefore has the following form:


 $$A = \begin{pmatrix} 0.999218078035812 & 0.000781921964187974 \\ 0 & 1 \end{pmatrix}$$



<div class='alert alert-block alert-warning'>
            Answer:</div>

The model assumes that every mail contains a header, so the decoding necessarily begins in state $1$.

This implies that $\pi = \begin{pmatrix} 1 & 0 \end{pmatrix}$.


$\textbf{Q2 :}$ What is the probability of moving from state 1 to state 2? What is the probability
of remaining in state 2? Which of the two is lower, and which is higher? Try to explain why.

<div class='alert alert-block alert-warning'>
            Answer:</div>
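Probability of moving from state 1 to state 2: $A(1, 2) = 0.000781921964187974$

Probability of remaining in state 2: $A(2, 2) = 1$

The transition from state 1 to state 2 has the lower probability and remaining in state 2 has the higher one. Each mail makes the $1 \to 2$ transition only once while its header spans many characters, so almost every step taken from state 1 stays in state 1. Once the body has started, the model never returns to the header ($A(2, 1) = 0$), so state 2 is absorbing and $A(2, 2) = 1$.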
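
One way to read the first row of $A$: the number of steps spent in state 1 before leaving it follows a geometric distribution, so its mean is $1 / A(1, 2)$. The short sketch below checks that $A$ is row-stochastic and computes this mean; the variable names are illustrative and not part of the lab material.

```python
import numpy as np

# Transition matrix from the statement (row = current state, column = next state)
A = np.array([[0.999218078035812, 0.000781921964187974],
              [0.0, 1.0]])

# Each row of a transition matrix must sum to 1
rows_ok = np.allclose(A.sum(axis=1), 1.0)

# Mean of a geometric distribution with success probability A[0, 1]:
# the average number of characters the model spends in the header state
expected_header_len = 1.0 / A[0, 1]

print(rows_ok, expected_header_len)
```

Under this model the header lasts on the order of a thousand characters on average, which is why the probability of leaving state 1 at any given character is so small.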

A mail is represented by a sequence of characters. Let $N$ be the number of distinct characters. Each
part of the mail is characterized by a discrete probability distribution over the characters, $P(c|s)$, with $s = 1$
or $s = 2$.

$\textbf{Q3 :}$ What is the size of B ?

<div class='alert alert-block alert-warning'>
            Answer:</div>
$B$ has size $2 \times N$: one row per state and one column per character.
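
To make the layout of $B$ concrete, here is a minimal sketch that builds a placeholder emission matrix with the right shape. The values are random and purely illustrative (the real ones are loaded from the model file later in the notebook), and $N = 256$ is an assumption matching one column per byte value.

```python
import numpy as np

N = 256                                  # assumed alphabet size (one column per byte value)
rng = np.random.default_rng(0)

# Placeholder emission matrix: row s holds the distribution P(c | s)
B = rng.random((2, N))
B /= B.sum(axis=1, keepdims=True)        # normalize each row to sum to 1

c = ord('a')                             # integer code of one character
print(B.shape)                           # shape of the matrix: (states, characters)
print(B[0, c], B[1, c])                  # P('a' | header), P('a' | body) in this toy matrix
```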

In [2]:
import os
import glob
import numpy as np

In [3]:
ROOT = os.path.abspath('.')

PERL_DIR = os.path.join(ROOT,'PerlScriptAndModel')
RES_DIR = os.path.join(ROOT,'res')

print(ROOT, PERL_DIR, RES_DIR)

c:\Users\samai\Dev\SD-TSIA214\TP2 c:\Users\samai\Dev\SD-TSIA214\TP2\PerlScriptAndModel c:\Users\samai\Dev\SD-TSIA214\TP2\res


In [4]:
glob.glob('*')

['dat',
 'PerlScriptAndModel',
 'TP_HMM_English.pdf',
 'TP_HMM_SD-TSIA-214_Students.ipynb',
 'visualize_segmentation.py']

### Coding/Decoding Mails

In [156]:
DATA_DIR = os.path.join(ROOT,'dat')

# Iterate through the .dat files and load each mail as an array of integers
def files_iter(data_dir, with_name=False):
    # Sort the paths so that the iteration order is reproducible
    files = sorted(glob.glob(os.path.join(data_dir, '*.dat')))
    for f in files:
        data = np.loadtxt(f).astype(int)
        if with_name:
            # File name without directory or extension, on any OS
            name = os.path.splitext(os.path.basename(f))[0]
            # Return filename and associated text
            yield name, data
        else:
            yield data

In [163]:
# And we get a generator that will allow us to iterate through the mails
mail_iter = files_iter(DATA_DIR, with_name=True)

In [164]:
for name, mail in mail_iter:
    print(name, mail.shape)
    break

mail1 (5216,)
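
Each mail is loaded as an array of integers. Assuming these integers are byte values in the range 0 to 255 (an assumption, suggested by the 256 columns of the emission matrix), they can be turned back into readable text for inspection. The helper name `codes_to_text` and the Latin-1 decoding are hypothetical choices for this sketch.

```python
import numpy as np

def codes_to_text(codes):
    """Convert a sequence of integer byte values (0..255) into a string."""
    # bytes() raises ValueError if a value falls outside 0..255
    raw = bytes(int(c) for c in codes)
    # Latin-1 maps every byte value to exactly one character
    return raw.decode('latin-1')

# Toy example: the codes of the characters 'F', 'r', 'o', 'm', ':'
sample = np.array([70, 114, 111, 109, 58])
text = codes_to_text(sample)
print(text)
```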


### Distribution files

In [34]:
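# Use a separate name so that PERL_DIR keeps pointing at the script directory
EMISSION_FILE = os.path.join(ROOT, 'PerlScriptAndModel', 'P.text')

# Load the emission probabilities, transposed to one row per state
def get_emission_prob(emission_file):
    return np.loadtxt(emission_file).T


(2, 256)

In [212]:
# Inputs to the Viterbi function
# States are 0-indexed in the code: 0 is the header (state 1 in the text), 1 is the body (state 2)
trans = np.array([[0.999218078035812, 0.000781921964187974],[0,1]])
emission_prob = get_emission_prob(EMISSION_FILE)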
states = [0,1]
start_prob = np.array([1,0])
np.log(trans+10**(-8))

array([[-7.82217817e-04, -7.15374282e+00],
       [-1.84206807e+01,  9.99999989e-09]])
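
Working with logarithms matters because a mail contains thousands of characters, and a product of that many probabilities underflows in floating point. The sketch below illustrates this with a made-up per-character probability of 0.01; only the sequence length is taken from the mail loaded earlier.

```python
import numpy as np

T = 5216                           # length of mail1, from the output shown earlier
probs = np.full(T, 0.01)           # hypothetical per-character probabilities

direct = np.prod(probs)            # product of probabilities: underflows in float64
in_logs = np.sum(np.log(probs))    # sum of log-probabilities: stays finite

print(direct, in_logs)
```

The direct product collapses to zero, so two candidate paths could no longer be compared, whereas the log-domain score remains a usable finite number.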

### To implement:

In [214]:
# Viterbi function
def viterbi(obs, states, start_prob, trans, emission_prob):
    """
        Viterbi Algorithm Implementation

        Keyword arguments:
            - obs: sequence of observations (integer column indices into emission_prob)
            - states: list of state indices
            - start_prob: vector of initial probabilities
            - trans: transition matrix
            - emission_prob: emission probability matrix, one row per state
        Returns:
            - path: most likely sequence of states
    """

    # Avoid underflow: use the logarithm !
    # Avoid 0 in logarithm: use a small constant !
    small = 10**(-15)
    
    start_prob = np.log(start_prob + small)
    trans = np.log(trans+small)
    emission_prob = np.log(emission_prob+small)

    T = len(obs) # Number of observations
    N = len(start_prob) # Number of model states

    # Initialisation
    log_l = np.full((T, N), -np.inf)
    # Backpointers are state indices, so store them as integers
    bcktr = np.zeros((T, N), dtype=int)
    
    # Viterbi
    # Forward loop:
    for s in states:
        log_l[0,s] = start_prob[s] + emission_prob[s,obs[0]]
    for t in range(1,T):
        for s in states:
            for r in states:
                new_prob = log_l[t-1,r]+trans[r,s]+emission_prob[s,obs[t]]
                if new_prob > log_l[t,s]:
                    log_l[t,s] = new_prob
                    bcktr[t,s] = r
    # Backward loop
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(log_l[-1,:])
    for i in range(T-1, 0, -1):
        path[i - 1] = bcktr[i, path[i]]
    return path
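
As a sanity check of the recursion, here is a minimal vectorized sketch of log-domain Viterbi run on a toy model. The function name `viterbi_log`, the toy matrices and the observation sequence are illustrative assumptions, not part of the lab material; only the overall scheme (logarithms, a small constant, backpointers) follows the function above.

```python
import numpy as np

def viterbi_log(obs, start_prob, trans, emission_prob, eps=1e-15):
    """Most likely state sequence, computed with log-probabilities."""
    log_pi = np.log(np.asarray(start_prob, dtype=float) + eps)
    log_A = np.log(np.asarray(trans, dtype=float) + eps)
    log_B = np.log(np.asarray(emission_prob, dtype=float) + eps)
    T, N = len(obs), len(log_pi)

    delta = np.empty((T, N))             # best log-score of a path ending in each state
    psi = np.zeros((T, N), dtype=int)    # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        # scores[r, s]: best path ending in r at t-1, followed by the move r -> s
        scores = delta[t - 1][:, None] + log_A
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]

    # Backtrack from the best final state
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = psi[t, path[t]]
    return path

# Toy model: state 0 mostly emits symbol 0, state 1 mostly emits symbol 1
toy_pi = np.array([1.0, 0.0])
toy_A = np.array([[0.9, 0.1], [0.0, 1.0]])
toy_B = np.array([[0.9, 0.1], [0.1, 0.9]])
toy_obs = np.array([0, 0, 0, 1, 1, 1])

toy_path = viterbi_log(toy_obs, toy_pi, toy_A, toy_B)
print(toy_path)
```

On this toy sequence the decoded path switches from state 0 to state 1 exactly where the observations change, which is the behaviour expected from the segmentation model.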

In [174]:
RES_DIR = os.path.join(ROOT,'res')

# Create a directory to store the output of the viterbi function
os.makedirs(RES_DIR, exist_ok=True)

# Write the viterbi path of a mail to a dedicated result file
def create_viterbi_path_file(mail_name, viterbi_path):
    with open(os.path.join(RES_DIR, '{}_path.txt'.format(mail_name)), 'w') as f:
        f.write(''.join([str(c) for c in viterbi_path]))

In [215]:
# Create a new generator over the mails, yielding each mail's name and data
mail_iter = files_iter(DATA_DIR, with_name=True)
for name_file, data in mail_iter:
    # Find out the viterbi path using viterbi
    viterbi_path = viterbi(data,states,start_prob,trans,emission_prob)
    # Put it in the result file
    create_viterbi_path_file(name_file, viterbi_path)
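
Because the body state is absorbing, a decoded path is a run of 0s (header) followed by a run of 1s (body), so the segmentation reduces to a single index. A minimal sketch of extracting it is given below; the helper name `header_body_boundary` is hypothetical and the example paths are made up.

```python
import numpy as np

def header_body_boundary(path):
    """Index of the first body position, or len(path) if the path never leaves the header."""
    body_positions = np.flatnonzero(np.asarray(path) == 1)
    if body_positions.size == 0:
        return len(path)
    return int(body_positions[0])

# Made-up paths for illustration
boundary = header_body_boundary([0, 0, 0, 1, 1])
header_only = header_body_boundary([0, 0])
print(boundary, header_only)
```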

### Visualizing segmentation

In [216]:
# Writing a function to go into the directory and execute the perl script "segment.pl" on the mail in the given path
def exec_perl_script(mail, path):
    res = !cd {PERL_DIR}; perl segment.pl {mail} {path}
    return res

# Writing a function getting the original mail, the result of viterbi, and applying the segmentation script
# Then putting the result
def segment_mail(mail_name, data_dir, output_dir):
    # Get the full path of the mail
    mail = ...
    # Get the full path of the result
    path = ...
    # Execute the visualization script
    formatted_mail = ...
    # Get the results
    formatted_mail_text = ...
    # Go through the resulting text until the cutting line
    ...
    # If this was not the last line, return the text cut into two parts: header and body
    ...
    # Otherwise, the mail consists of a header only
    ...

In [11]:
# Getting mails names
...
# Call the function and look at the result of segmentation
...