# Text segmentation using Hidden Markov Models
> Tristan Perrot

## Automatic segmentation of mails, problem statement

- ***Q1:** Give the value of the π vector of the initial probabilities*

Each mail is assumed to contain a header, so the decoding necessarily begins in state 1. The $\pi$ vector is therefore:
$$
\pi = \begin{pmatrix} 1 \\ 0 \end{pmatrix}
$$

- ***Q2:** What is the probability to move from state 1 to state 2 ? What is the probability to remain in state 2 ? What is the lower/higher probability ? Try to explain why*

The transition matrix estimated on a labeled small corpus has the following form :
$$
A = \begin{pmatrix} 0.999218078035812 & 0.000781921964187974 \\ 0 & 1 \end{pmatrix}
$$

The probability of moving from state 1 to state 2 is $0.000781921964187974$, and the probability of remaining in state 2 is $1$. The **lower** probability is that of **moving from state 1 to state 2**, and the **higher** is that of **remaining in state 2**. Once the decoding has moved from the header to the body, it is impossible to return to the header: each mail contains exactly one header and one body, so the transition from 1 to 2 occurs exactly once per mail. That single transition is spread over all the characters of the header, which is why its per-character probability is so small.
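
As a quick sanity check of these values, the following minimal sketch rebuilds $\pi$ and $A$ with NumPy and derives the mean time spent in state 1. The variable names are chosen here for illustration, and the states 1 and 2 are mapped to the indices 0 and 1. Since the time spent in the header is geometric, its mean is $1 / a_{12}$, which comes out at roughly 1279 characters.

```python
import numpy as np

# Initial and transition probabilities quoted in the answers above
# (state 1 -> index 0, state 2 -> index 1)
pi = np.array([1.0, 0.0])
A = np.array([
    [0.999218078035812, 0.000781921964187974],
    [0.0, 1.0],
])

# Each row of a transition matrix must sum to 1
assert np.allclose(A.sum(axis=1), 1.0)

# State 2 is absorbing: once in the body, the chain never returns to the header
assert A[1, 0] == 0.0 and A[1, 1] == 1.0

# The number of steps spent in state 1 is geometric, so its mean is 1 / a_12
expected_header_length = 1.0 / A[0, 1]
print(expected_header_length)
```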

- ***Q3:** What is the size of B ?*

$B$ is the observation matrix and $N$ is the number of distinct characters. Each part of the mail is characterized by a discrete probability distribution over the characters, $P(c|s)$, with $s = 1$ or $s = 2$, so $B$ has shape $(N, 2)$.
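
To make the shape concrete, here is a small sketch that builds a matrix with this layout. The value `N = 256` and the random counts are assumptions used only for illustration; the real $N$ and the real distributions come from the labeled corpus.

```python
import numpy as np

N = 256  # assumption: one row per byte value; the real N depends on the corpus
rng = np.random.default_rng(0)

# Placeholder counts standing in for character frequencies (NOT real data)
counts = rng.integers(1, 100, size=(N, 2))

# Normalize each column so that B[c, s] = P(c | s)
# (column 0 is state 1, the header; column 1 is state 2, the body)
B = counts / counts.sum(axis=0, keepdims=True)

print(B.shape)
```

Each column of `B` is a probability distribution over the characters, so each column sums to 1.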

## Material

### Coding/decoding mails

In [1]:
import os
import glob
import numpy as np

In [2]:
ROOT = os.path.abspath('.')

PERL_DIR = os.path.join(ROOT,'PerlScriptAndModel')
RES_DIR = os.path.join(ROOT,'res')

In [3]:
DATA_DIR = os.path.join(ROOT,'dat')

# Iterate through files and load the text 
def files_iter(data_dir, with_name=False):
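    # Sort the matches so that mails are always visited in the same order
    files = sorted(glob.glob(os.path.join(data_dir, '*.dat')))
    if with_name:
        for f in files:
            # Get the filename without its directory or extension
            name = os.path.splitext(os.path.basename(f))[0]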
            # Return filename and associated text
            yield name, np.loadtxt(f, dtype=int)
    else:
        for f in files:
            # Return text
            yield np.loadtxt(f, dtype=int)


In [4]:
# And we get a generator that will allow us to iterate through the mails
mail_iter = files_iter(DATA_DIR, with_name=True)

### Distribution files

In [5]:
PERL_DIR = ...

# Writing a function to get the probability data
def get_emission_prob(perl_dir):
    # We will store the probabilities in a dictionary
    prob_dict = {}
    # We iterate through the files
    
    return ...

In [6]:
# Inputs to the Viterbi function
trans = ...
emission_prob = get_emission_prob(PERL_DIR)
states = ...
start_prob = ...

### To implement:

In [7]:
# Viterbi function
def viterbi(obs, states, start_prob, trans, emission_prob):
    """
        Viterbi Algorithm Implementation

        Keyword arguments:
            - obs: sequence of observations
            - states: list of states
            - start_prob: vector of the initial probabilities
            - trans: transition matrix
            - emission_prob: emission probability matrix
        Returns:
            - seq: sequence of states
    """

    # Avoid underflow: use the logarithm !
    # Avoid 0 in logarithm: use a small constant !
    small = ...
    
    start_prob = ...
    trans = ...
    emission_prob = ...
    
    T = ... # Number of observations
    N = ... # Number of model states
    
    # Initialisation
    log_l = ...
    bcktr = ...
    
    # Viterbi
    
    # Forward loop:
    log_l[0,:]= ...
    for t in range(...):
        log_l[t, :] = ...
        bcktr[t, :] = ...
    # Backward loop
    path = ...
    path[-1] = ... 
    for i in range(...):
        path[i - 1] = ...

    return ...
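
For reference, one possible way to fill in a template like the one above is sketched below. It is not the expected solution, only a standard log-domain Viterbi under stated assumptions: the name `viterbi_sketch` and the default `small=1e-300` are choices made here, the emission matrix is indexed as `[character, state]` to match the shape $(N, 2)$, the sequence is assumed to be non-empty, and the returned states are 0-based (add 1 to recover the labels 1 and 2).

```python
import numpy as np

def viterbi_sketch(obs, start_prob, trans, emission_prob, small=1e-300):
    """Log-domain Viterbi decoding; returns 0-based state indices."""
    obs = np.asarray(obs, dtype=int)
    # Work with logarithms to avoid underflow; `small` keeps log(0) finite
    log_pi = np.log(np.asarray(start_prob, dtype=float) + small)
    log_a = np.log(np.asarray(trans, dtype=float) + small)
    log_b = np.log(np.asarray(emission_prob, dtype=float) + small)

    T = len(obs)        # number of observations (assumed >= 1)
    N = log_a.shape[0]  # number of model states

    log_l = np.empty((T, N))             # best log-likelihood ending in each state
    bcktr = np.zeros((T, N), dtype=int)  # best predecessor of each state

    # Forward loop
    log_l[0, :] = log_pi + log_b[obs[0], :]
    for t in range(1, T):
        # scores[i, j]: best score of being in i at t-1, then moving to j
        scores = log_l[t - 1, :, None] + log_a
        bcktr[t, :] = np.argmax(scores, axis=0)
        log_l[t, :] = np.max(scores, axis=0) + log_b[obs[t], :]

    # Backward loop: follow the stored predecessors from the best final state
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(log_l[-1, :])
    for t in range(T - 1, 0, -1):
        path[t - 1] = bcktr[t, path[t]]
    return path

# Toy model with two symbols and two states (illustrative values only)
toy_path = viterbi_sketch(
    obs=[0, 0, 0, 1, 1, 1],
    start_prob=[1.0, 0.0],
    trans=[[0.9, 0.1], [0.0, 1.0]],
    emission_prob=[[0.9, 0.1], [0.1, 0.9]],
)
print(toy_path)
```

On the toy model the decoded path switches from state 0 to state 1 exactly once, which is the behaviour expected from a transition matrix whose second state is absorbing.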

In [8]:
RES_DIR = ...

# Creating a directory to put the result of the viterbi function
if not os.path.exists(RES_DIR):
    os.mkdir(RES_DIR)
    
# Function that will write a viterbi path for a mail in a dedicated result file
def create_viterbi_path_file(mail_name, viterbi_path):
    with open('{}/{}_path.txt'.format(RES_DIR, mail_name), 'w') as f: 
        f.write(''.join([str(c) for c in viterbi_path]))

In [9]:
# Using our generator, we get the mail names and data
for name_file, data in mail_iter:
    # Find out the viterbi path using viterbi
    viterbi_path = ...
    # Put it in the result file
    create_viterbi_path_file(name_file, viterbi_path)

### Visualizing segmentation

In [10]:
# Writing a function to go into the directory and execute the perl script "segment.pl" on the mail in the given path
def exec_perl_script(mail, path):
    res = !cd {PERL_DIR}; perl segment.pl {mail} {path}
    return res

# Writing a function getting the original mail, the result of viterbi, and applying the segmentation script
# Then putting the result
def segment_mail(mail_name, data_dir, output_dir):
    # Get the full path of the mail
    mail = ...
    # Get the full path of the result
    path = ...
    # Execute the visualization script
    formatted_mail = ...
    # Get the results
    formatted_mail_text = ...
    # Go through the resulting text until the cutting line
    ...
    # If this was not the last line, return the text cut into two parts: header and body
    ...
    # Otherwise, the mail is just a header
    ...
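
The `!cd ...; perl ...` line in `exec_perl_script` is IPython syntax and only works inside a notebook. As a hedged alternative, the sketch below runs the same command through the standard `subprocess` module. The helper names `run_in_dir` and `exec_perl_script_subprocess` are hypothetical, and the sketch assumes that `perl` is on the `PATH` and that `segment.pl` takes the mail and path files as its two arguments, as in the cell above.

```python
import subprocess
import sys

def run_in_dir(cmd, cwd):
    """Run `cmd` (a list of arguments) inside `cwd` and return its stdout lines."""
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

def exec_perl_script_subprocess(mail, path, perl_dir):
    # Same command as the notebook version, without the IPython `!` syntax
    return run_in_dir(['perl', 'segment.pl', mail, path], cwd=perl_dir)

# Demonstration with the Python interpreter, so that Perl is not required here
demo_lines = run_in_dir([sys.executable, '-c', "print('header'); print('body')"], cwd='.')
print(demo_lines)
```

With `check=True`, a failing script raises `subprocess.CalledProcessError` instead of silently returning an empty result.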

In [11]:
# Getting mails names
...
# Call the function and look at the result of segmentation
...