# Text segmentation using Hidden Markov Models
> Tristan Perrot

## Automatic segmentation of mails, problem statement

- ***Q1:** Give the value of the π vector of the initial probabilities*

Each mail is assumed to contain a header, so decoding necessarily begins in state 1. The $\pi$ vector is therefore:
$$
\pi = \begin{pmatrix} 1 \\ 0 \end{pmatrix}
$$

- ***Q2:** What is the probability to move from state 1 to state 2 ? What is the probability to remain in state 2 ? What is the lower/higher probability ? Try to explain why*

The transition matrix estimated on a small labeled corpus has the following form:
$$
A = \begin{pmatrix} 0.999218078035812 & 0.000781921964187974 \\ 0 & 1 \end{pmatrix}
$$

The probability of moving from state 1 to state 2 is $0.000781921964187974$, and the probability of remaining in state 2 is $1$. The **lower** probability is that of **moving from state 1 to state 2**, and the **higher** is that of **remaining in state 2**. Remaining in state 2 is certain because each mail contains exactly one header followed by one body: once the decoder has moved from the header to the body, it can never return to the header. Moving from state 1 to state 2 is rare because this transition occurs exactly once per mail, whereas the header spans many characters.
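As a quick sanity check on these values, the sketch below verifies that each row of $A$ sums to 1 and derives the mean header length implied by the self-transition probability. The interpretation relies on a standard HMM property (the time spent in a state is geometrically distributed); the variable name `expected_header_length` is introduced here for illustration only.

```python
import numpy as np

# Transition matrix from the text (row/column 0 = header, 1 = body)
A = np.array([[0.999218078035812, 0.000781921964187974],
              [0.0, 1.0]])

# Each row must be a probability distribution over the next state
rows_ok = np.allclose(A.sum(axis=1), 1.0)

# The time spent in state 1 is geometric, so its mean is 1 / (1 - a11)
expected_header_length = 1.0 / (1.0 - A[0, 0])

print(rows_ok)
print(f"Expected header length: {expected_header_length:.1f} characters")
```

Under this reading, the tiny value of $a_{12}$ corresponds to headers that are on average roughly 1,279 characters long in the training corpus.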

- ***Q3:** What is the size of B ?*

$B$ is the observation matrix. Let $N$ be the number of distinct characters. Each part of the mail is characterized by a discrete probability distribution over the characters, $P(c|s)$, with $s = 1$ or $s = 2$. $B$ therefore has one row per character and one column per state, so its shape is $(N, 2)$.
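To make the shape concrete, here is a minimal sketch of how such a matrix could be estimated by counting characters per state on labeled data. Everything in it is an assumption made for illustration: the alphabet size `N = 256` (one row per byte value), the helper name `estimate_emissions`, the add-one smoothing, and the toy corpus. The model actually used in this notebook is loaded from a file and may have been built differently.

```python
import numpy as np

N = 256  # assumed alphabet size (one row per byte value); not stated in the text

def estimate_emissions(sequences, labels, num_symbols=N, num_states=2):
    """Count characters per state, then normalize each column into P(c | s)."""
    # Add-one smoothing (an assumption) keeps unseen characters from getting probability 0
    counts = np.ones((num_symbols, num_states))
    for seq, lab in zip(sequences, labels):
        np.add.at(counts, (seq, lab), 1)
    return counts / counts.sum(axis=0, keepdims=True)

# Hypothetical toy corpus: character codes with state labels (0 = header, 1 = body)
text = "From: a@b.c\n\nHello"
seq = np.frombuffer(text.encode("latin-1"), dtype=np.uint8).astype(int)
lab = np.array([0] * 13 + [1] * 5)

B_toy = estimate_emissions([seq], [lab])
print(B_toy.shape)  # (256, 2)
```

Each column of `B_toy` sums to 1, which matches the reading of column $s$ as the distribution $P(c|s)$.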

## Material

### Coding/decoding mails

In [1]:
import numpy as np

In [13]:
# Load the list of files
mailList = np.loadtxt('dat/mail.lst', dtype='str')
mails = np.array([np.loadtxt("dat/" + mail, dtype=int) for mail in mailList], dtype=object)

# Config
pi = np.array([1, 0])
A = np.array([[0.999218078035812, 0.000781921964187974], [0, 1]])
B = np.loadtxt("PerlScriptAndModel/P.text")

In [11]:
def viterbi(observations, pi, A, B):
    # Number of states and observations
    num_states = A.shape[0]
    num_observations = len(observations)

    # Initialize the Viterbi trellis and backpointers
    viterbi_trellis = np.zeros((num_states, num_observations))
    backpointers = np.zeros((num_states, num_observations), dtype=int)

    # Initialize the first column of the Viterbi trellis
    viterbi_trellis[:, 0] = pi * B[observations[0], :]

    # Perform the Viterbi recursion
    # Note: multiplying raw probabilities can underflow to 0 on long sequences
    for t in range(1, num_observations):
        for s in range(num_states):
            # Score of reaching state s at time t from each previous state
            scores = viterbi_trellis[:, t-1] * A[:, s] * B[observations[t], s]

            # Keep the best predecessor and its probability
            backpointers[s, t] = np.argmax(scores)
            viterbi_trellis[s, t] = scores[backpointers[s, t]]

    # Find the sequence of states with the highest probability
    best_sequence = [np.argmax(viterbi_trellis[:, -1])]
    for t in range(num_observations-1, 0, -1):
        best_sequence.append(backpointers[best_sequence[-1], t])

    best_sequence.reverse()
    return best_sequence

In [16]:
# Test Viterbi on some of the mails in the dat directory (mail11.txt to mail30.txt).
# For each mail, print whether the decoded path ever reaches the body (state index 1).

for i in range(10, 30):
    print("Mail " + str(i+1) + ":")
    print(1 in viterbi(mails[i], pi, A, B))

Mail 11:
False
Mail 12:
False
Mail 13:
False
Mail 14:
False
Mail 15:
False
Mail 16:
False
Mail 17:
False
Mail 18:
False
Mail 19:
False
Mail 20:
False
Mail 21:
False
Mail 22:
False
Mail 23:
False
Mail 24:
False
Mail 25:
False
Mail 26:
False
Mail 27:
False
Mail 28:
False
Mail 29:
False
Mail 30:
False
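Every mail above prints `False`, meaning the decoded path never leaves the header state. One plausible explanation, which is an assumption and has not been checked against the data, is numerical underflow: a product of thousands of per-character probabilities falls below the smallest representable float, the whole trellis becomes 0, and `np.argmax` then always returns index 0. A common remedy is to run the same recursion on log-probabilities. The sketch below is one such variant. The name `viterbi_log` and the two-symbol toy model are hypothetical and only serve to illustrate the idea.

```python
import numpy as np

def viterbi_log(observations, pi, A, B):
    """Viterbi decoding on log-probabilities; B has shape (num_symbols, num_states)."""
    with np.errstate(divide="ignore"):  # log(0) = -inf is intended for forbidden moves
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)

    num_states, T = A.shape[0], len(observations)
    trellis = np.full((num_states, T), -np.inf)
    back = np.zeros((num_states, T), dtype=int)
    trellis[:, 0] = log_pi + log_B[observations[0], :]

    for t in range(1, T):
        # scores[i, j]: best log-probability of being in i at t-1, then moving to j
        scores = trellis[:, t - 1][:, None] + log_A
        back[:, t] = np.argmax(scores, axis=0)
        trellis[:, t] = np.max(scores, axis=0) + log_B[observations[t], :]

    # Backtrack from the most likely final state
    path = [int(np.argmax(trellis[:, -1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return path[::-1]

# A product of many small probabilities underflows in double precision
underflowed = 0.05 ** 4000

# Hypothetical toy model: symbol 0 is typical of the header, symbol 1 of the body
pi_toy = np.array([1.0, 0.0])
A_toy = np.array([[0.999, 0.001], [0.0, 1.0]])
B_toy = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
obs = np.array([0] * 2000 + [1] * 2000)

path = viterbi_log(obs, pi_toy, A_toy, B_toy)
boundary = path.index(1)  # first position decoded as body
print(underflowed, boundary)
```

On the real data, the same function could be called as `viterbi_log(mails[i], pi, A, B)`, and `path.index(1)` would then give the position of the first character decoded as body, provided the path reaches state index 1 at all.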


- ***Q5:** How would you model the problem if you had to segment the mails in more than two parts (for example : header, body, signature) ?*

In this case the model simply has more states. If $Q$ denotes the number of states, $A$ has shape $(Q, Q)$ and $B$ has shape $(N, Q)$. With three parts (header, body, signature), $Q = 3$ and $\pi = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}$, since decoding still starts in the header. The rest of the method is unchanged.
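A minimal sketch of the three-state setup follows. It assumes the parts always appear in order (header, then body, then signature), which gives a left-to-right transition matrix. The exit probabilities in `p_leave` are placeholders chosen for illustration, not values estimated from the corpus.

```python
import numpy as np

Q = 3  # header, body, signature

# Decoding still starts in the header
pi = np.zeros(Q)
pi[0] = 1.0

# Left-to-right transitions: stay in the current part or advance to the next one.
# The exit probabilities below are placeholders, not corpus estimates.
p_leave = np.array([1e-3, 1e-3])
A = np.eye(Q)
for q in range(Q - 1):
    A[q, q] = 1.0 - p_leave[q]
    A[q, q + 1] = p_leave[q]

print(pi)
print(A)
```

The last state keeps a self-transition probability of 1, mirroring the role of the body state in the two-state model.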

- ***Q6:** How would you model the problem of separating the portions of mail included, knowing that they always start with the character ">"*
