Author: ZHANG Bolong,
        ZHU Fangda

In [186]:
import numpy as np
import difflib

# Task : automatic segmentation of mails, problem statement
This lab aims to build an email segmentation tool that separates the email header from its
body. We propose to perform this task by learning an HMM (A; B; π) with two states, one (state 1) for
the header, the other (state 2) for the body. In this model, it is assumed that each mail actually contains
a header: the decoding necessarily begins in state 1.

## Q1 : Give the value of the $\pi$ vector of the initial probabilities

According to the task, the decoding necessarily begins in state 1, so our HMM starts in state 1 with probability 1. The $\pi$ vector of the initial probabilities is therefore:
$$\pi = (1,0)^T$$ $\Box.$

Knowing that each mail contains exactly one header and one body, each mail goes through the transition
from 1 to 2 exactly once. The transition matrix $\left(A(i, j) = P(j|i)\right)$ estimated on a small labeled corpus thus has the following form:
$$
A =
\begin{pmatrix}
        0.999218078035812 & 0.000781921964187974\\
        0 & 1 \\
\end{pmatrix}
$$

## Q2.  What is the probability to move from state 1 to state 2 ? What is the probability to remain in state 2 ? What is the lower/higher probability ? Try to explain wh

We can read the probabilities of moving between states from the transition matrix $A$ given above. The row index is the starting state and the column index is the arriving state. Thus:
- Probability from state 1 to state 2:
$$ A(1,2) = 0.000781921964187974 $$
- Probability to remain in state 2:
$$ A(2, 2) = 1.0 $$


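The probability of remaining in state 2 is the highest, and the probability of moving from state 1 to state 2 is the lowest. Both values are fairly intuitive: once we are in the body, we cannot go back to the header as we keep reading the following characters, so $A(2,2) = 1$; and since the transition from 1 to 2 happens exactly once per mail while the header spans many characters, $A(1,2)$ must be small.$\Box$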
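
One way to make the small value of $A(1,2)$ concrete: under this model the number of characters spent in state 1 follows a geometric distribution, so the expected header length is $1/A(1,2)$ characters. The lines below are a minimal sketch of that computation; the variable names are illustrative and not part of the lab code.

```python
import numpy as np

# Transition matrix from the problem statement
# (row = current state, column = next state).
A = np.array([[0.999218078035812, 0.000781921964187974],
              [0.0, 1.0]])

# Each row must be a probability distribution over the next state.
rows_sum_to_one = np.allclose(A.sum(axis=1), 1.0)

# Time spent in state 1 is geometric with success probability A[0, 1],
# so its mean is 1 / A[0, 1].
expected_header_length = 1.0 / A[0, 1]
print(rows_sum_to_one, round(expected_header_length))  # True 1279
```

An expected header of roughly 1279 characters is of the same order as the header lengths of the mails in the corpus.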

A mail is represented by a sequence of characters. Let N be the number of different characters. Each
part of the mail is characterized by a discrete probability distribution on the characters $P(c|s)$, with s = 1
or s = 2.
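
As an illustration of what the distributions $P(c|s)$ look like in practice, the sketch below estimates them by counting characters in labeled header and body text. It assumes $N = 256$ byte values, a matrix layout with one row per character and one column per state, and add-one smoothing. The function `estimate_emission` and the toy strings are hypothetical; the lab does not state how its emission matrix was actually produced.

```python
import numpy as np

def estimate_emission(header_text, body_text, n_symbols=256):
    """Return an (n_symbols, 2) matrix whose column s holds P(c | state s)."""
    # Start from ones (add-one smoothing) so no character has probability zero.
    counts = np.ones((n_symbols, 2))
    for state, text in enumerate((header_text, body_text)):
        codes = np.frombuffer(text.encode("latin-1", errors="replace"), dtype=np.uint8)
        counts[:, state] += np.bincount(codes, minlength=n_symbols)
    # Normalize each column so that it sums to 1.
    return counts / counts.sum(axis=0, keepdims=True)

B = estimate_emission("From: a@b.c\nDate: today\n", "Hello,\nsee you soon.\n")
print(B.shape)                           # (256, 2)
print(B[ord(":"), 0] > B[ord(":"), 1])   # True: ':' is more likely in the header
```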

## To implement

In [187]:
def viterbi(obs, states, start_prob, trans, emission_prob):
    """
        Viterbi Algorithm Implementation (computed in the log domain)
        
        Keyword arguments:
            - obs: sequence of observations (integer symbol indices)
            - states: list of states (integer indices)
            - start_prob: vector of the initial probabilities
            - trans: transition matrix, trans[i, j] = P(j|i)
            - emission_prob: emission probability matrix, emission_prob[o, s] = P(o|s)
        Returns:
            - path: most likely sequence of states
    """
    # Work with log-probabilities to avoid underflow on long sequences.
    # log(0) = -inf is a valid score here, so silence the divide-by-zero warning.
    with np.errstate(divide='ignore'):
        start_prob = np.log(start_prob)
        trans = np.log(trans)
        emission_prob = np.log(emission_prob)
    # T1[t, s]: best log-score of a path ending in state s at time t
    # T2[t, s]: state at time t-1 on that best path (backpointer)
    T1 = np.zeros((obs.size, states.size))
    T2 = np.zeros((obs.size, states.size), dtype=np.intp)
    for state in states:
        T1[0, state] = start_prob[state] + emission_prob[obs[0]][state]
    
    for index in range(1, obs.size):
        for state in states:
            liste = [T1[index-1, start_state] + trans[start_state, state] for start_state in states]
            T1[index, state] = np.max(liste) + emission_prob[obs[index]][state]
            T2[index, state] = np.argmax(liste)

    # Backtrack from the best final state: T2[i+1] points to the state at time i
    path = np.zeros(len(obs), dtype=np.intp)
    path[-1] = np.argmax(T1[-1])
    for i in range(len(obs)-2, -1, -1):
        path[i] = T2[i+1, path[i+1]]
    return path

In [188]:
# Test (emis[o, s]: one row per observation symbol, one column per state)
A = np.array([[0.7,0.3], [0.4, 0.6]])
start_prob = np.array([0.6, 0.4])
emis = np.array([[0.1,0.6], [0.4,0.3], [0.5,0.1]])
obs = np.array([0,1,2])
states = np.array([0,1])

x = viterbi(obs, states, start_prob, A, emis)
print(x)

[1 0 0]


In [189]:
# Test (here emis[s, o] has one row per state, so it is transposed before the call)
A = np.array([[0.5,0.2, 0.3], [0.3, 0.5, 0.2], [0.2,0.3,0.5]])
start_prob = np.array([0.2, 0.4, 0.4])
emis = np.array([[0.5,0.5], [0.4,0.6], [0.7,0.3]])
obs = np.array([0,1,0])
states = np.array([0,1,2])

x = viterbi(obs, states, start_prob, A, emis.T)
print(x)

[2 2 2]
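
A brute-force search over all state sequences can serve as a reference when testing a Viterbi implementation on small inputs. The sketch below is not part of the lab code: `brute_force_path` is a hypothetical helper, and it reuses the parameters of the second test with the same `emission_prob[o, s]` convention.

```python
import itertools
import numpy as np

def brute_force_path(obs, start_prob, trans, emission_prob):
    """Most likely state sequence by exhaustive search; emission_prob[o, s] = P(o | s)."""
    n_states = len(start_prob)
    best_path, best_prob = None, -1.0
    # Enumerate every possible state sequence of the same length as obs.
    for path in itertools.product(range(n_states), repeat=len(obs)):
        prob = start_prob[path[0]] * emission_prob[obs[0], path[0]]
        for t in range(1, len(obs)):
            prob *= trans[path[t - 1], path[t]] * emission_prob[obs[t], path[t]]
        if prob > best_prob:
            best_path, best_prob = path, prob
    return np.array(best_path), best_prob

A = np.array([[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]])
start_prob = np.array([0.2, 0.4, 0.4])
emis = np.array([[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]])  # emis[s, o]
obs = np.array([0, 1, 0])

ref_path, ref_prob = brute_force_path(obs, start_prob, A, emis.T)
print(ref_path, ref_prob)
```

The result can be compared with the output of `viterbi` on the same inputs. The cost grows as (number of states) to the power of the sequence length, so this check is only practical for very short sequences.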


## Question 4 Print the track and present and discuss the results obtained on mail11.txt to mail30.txt

In [191]:
# Get emission matrix
emission = np.loadtxt('P.text')

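# Get transition matrix (the zero entry is replaced by 1e-100 so that its logarithm stays finite)
trans = np.array([[0.999218078035812,0.000781921964187974],[1e-100,1-1e-100]])
# Set start probability (same trick: 1e-100 instead of 0)
start_prob = np.array([1-1e-100,1e-100])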

# Import data file
with open('dat/mail.lst', 'r') as file:
    file_list = file.read().splitlines()
datas = [np.loadtxt('dat/'+ x, dtype = int) for x in file_list]
states = np.array([0,1])

In [236]:
# Test for mail1 - mail10
def spliceText(path, text):
    '''splice the text according to the states path'''
    vals, index = np.unique(path, return_index=True)
    index = index[1]
    return text[0:index], text[index:]

def verify_res(number):
        path = viterbi(np.loadtxt('dat/mail'+ str(number) + '.dat', dtype = int), states, start_prob, trans, emission)
        val, index = np.unique(path, return_index=True)
        index = index[1]
        print('+------------------------------------------ mail %d -------------------------------------+' % number)
        print('Test result:')
        print("state 1: 0 ~ " + str(index-1))
        print('state 2:' + str(index) + ' ~ ' + str(len(path)))
        with open('dat/mail' + str(number) + 'h.txt') as file:
            text_h = file.read()
            nb_h = len(text_h)
        with open('dat/mail' + str(number) + 'c.txt') as file:
            text_c = file.read()
            nb_c = len(text_c)
        with open('dat/mail'+ str(number) + '.txt') as file:
            text = file.read()
            header,body = spliceText(path,text)
        print('Real result:')
        print("state 1: 0 ~ " + str(nb_h))
        print('state 2:' + str(nb_h+1) + ' ~ ' + str(nb_h + nb_c))
        
        print("<------------------------------ Diff -------------------------------------------------------->")
        diff = difflib.context_diff(text_c.splitlines(), body.splitlines())
        print('\n'.join(list(diff)))
    
    
# def print_data():
#     for i,data in enumerate(['mail' + str(i) + '.dat' for i in range(1,11)]):

    

In [237]:
for i in range(1,11):
    verify_res(i)

+------------------------------------------ mail 1 -------------------------------------+
Test result:
state 1: 0 ~ 3796
state 2:3797 ~ 5216
Real result:
state 1: 0 ~ 3611
state 2:3612 ~ 5216
<------------------------------ Diff -------------------------------------------------------->
*** 

--- 

***************

*** 1,7 ****

- 
-     Date:        Wed, 21 Aug 2002 10:54:46 -0500
-     From:        Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>
-     Message-ID:  <1029945287.4797.TMDA@deepeddy.vircio.com>
  
  
    | I can't reproduce this error.
--- 1,3 ----

+------------------------------------------ mail 2 -------------------------------------+
Test result:
state 1: 0 ~ 2445
state 2:2446 ~ 3376
Real result:
state 1: 0 ~ 2476
state 2:2477 ~ 3376
<------------------------------ Diff -------------------------------------------------------->
*** 

--- 

***************

*** 1,3 ****

--- 1,4 ----

+ ntent-Transfer-Encoding: 7bit
  
  Martin A posted:
  Tassos Papadopoulos,

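The results for the first 10 mails seem broadly consistent with the labels: in the two outputs shown above, the predicted boundary falls within a few lines of the labeled one (about 185 characters late for mail 1 and about 30 characters early for mail 2).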
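
To quantify this kind of comparison, one option is to measure the signed offset between the predicted and the labeled boundary. The helper below is a hypothetical sketch. It assumes state 1 in the decoded path marks the body (as in the code above) and that the labeled boundary is the length of the header text (as in the `mail<k>h.txt` files).

```python
import numpy as np

def boundary_error(path, header_length):
    """Signed offset (in characters) between the predicted and labeled boundary."""
    body_positions = np.flatnonzero(np.asarray(path) == 1)
    # If the body state never appears, place the boundary at the end of the mail.
    predicted = body_positions[0] if body_positions.size else len(path)
    return int(predicted) - header_length

# Toy example: the predicted body starts at index 6, the labeled header has 4 characters.
toy_path = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
print(boundary_error(toy_path, 4))  # 2: the predicted boundary is two characters late
```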

In [253]:
# Test for mail10 - mail30
def spliceText(path, text):
    '''splice the text according to the states path'''
    vals, index = np.unique(path, return_index=True)
    index = index[1]
    return text[0:index], text[index:]

def verify_res_last(number):
        path = viterbi(np.loadtxt('dat/mail'+ str(number) + '.dat', dtype = int), states, start_prob, trans, emission)
        val, index = np.unique(path, return_index=True)
        index = index[1]
        print('+-------------------------------------- mail %d ----------------------------+' % number)
        print('Test result:')
        print("state 1: 0 ~ " + str(index-1))
        print('state 2:' + str(index) + ' ~ ' + str(len(path)))
        with open('dat/mail'+ str(number) + '.txt') as file:
            text = file.read()
            header,body = spliceText(path,text)
        print('<------------------------------ header ------------------------------------------------->')
        print('\n'.join(header.splitlines()[0:6]))
        print('...............')
        print('\n'.join(header.splitlines()[-6:]))
        print('<------------------------------ body --------------------------------------------------->')
        print('\n'.join(body.splitlines()[0:9]))
        print('...............\n')
        print('<------------------------------ End --------------------------------------------------->')
for i in range(10,31):
    verify_res_last(i)
    

+-------------------------------------- mail 10 ----------------------------+
Test result:
state 1: 0 ~ 2846
state 2:2847 ~ 3715
<------------------------------ header ------------------------------------------------->
From spamassassin-talk-admin@lists.sourceforge.net  Thu Aug 22 15:25:29 2002
Return-Path: <spamassassin-talk-admin@example.sourceforge.net>
Delivered-To: zzzz@localhost.netnoteinc.com
Received: from localhost (localhost [127.0.0.1])
	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id B48D543F99
	for <zzzz@localhost>; Thu, 22 Aug 2002 10:25:28 -0400 (EDT)
...............
List-Id: Talk about SpamAssassin <spamassassin-talk.example.sourceforge.net>
List-Unsubscribe: <https://example.sourceforge.net/lists/listinfo/spamassassin-talk>,
    <mailto:spamassassin-talk-request@lists.sourceforge.net?subject=unsubscribe>
List-Archive: <http://www.geocrawler.com/redir-sf.php3?list=spamassassin-talk>
X-Original-Date: Thu, 22 Aug 2002 10:16:36 -0400
Date: Thu, 22 Aug 2002 10:16:36 -

+-------------------------------------- mail 17 ----------------------------+
Test result:
state 1: 0 ~ 2282
state 2:2283 ~ 3425
<------------------------------ header ------------------------------------------------->
From robert.chambers@baesystems.com  Thu Aug 22 17:19:36 2002
Return-Path: <robert.chambers@baesystems.com>
Delivered-To: zzzz@localhost.netnoteinc.com
Received: from localhost (localhost [127.0.0.1])
	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id F2AD843F99
	for <zzzz@localhost>; Thu, 22 Aug 2002 12:19:25 -0400 (EDT)
...............
Reply-To: zzzzteana@yahoogroups.com
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

--- In forteana@y..., "D.McMann" <dmcmann@b...> wrote:
> Robert Moaby, 33,
<------------------------------ body --------------------------------------------------->
 who sent death threats to staff, was also jailed
> for hoarding indecent pictures of children on his home computer.
> 
> Hmm, if I didn't trust our government 

+-------------------------------------- mail 25 ----------------------------+
Test result:
state 1: 0 ~ 2319
state 2:2320 ~ 3238
<------------------------------ header ------------------------------------------------->
From ilug-admin@linux.ie  Fri Aug 23 11:07:47 2002
Return-Path: <ilug-admin@linux.ie>
Delivered-To: zzzz@localhost.netnoteinc.com
Received: from localhost (localhost [127.0.0.1])
	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 6F82C4416B
	for <zzzz@localhost>; Fri, 23 Aug 2002 06:06:31 -0400 (EDT)
...............
Precedence: bulk
List-Id: Irish Linux Users' Group <ilug.linux.ie>
X-Beenthere: ilug@linux.ie


> On Thu, 22 Aug 2002,
<------------------------------ body --------------------------------------------------->
 John P. Looney wrote:
> >  Sun's hardware in general is more reliable,
> ROFL. not in our experience.

Well at least our Caps-Lock keys work:

peter@staunton.ie said:
> Another problem. I have a Dell branded keyboard and if I hit Caps-Lock
> twice, 

In most cases, the algorithm splits the mail well. However, when the body begins with an included (quoted) mail, the algorithm has difficulty finding the boundary between the header and the body; this happens for **mail14**, **mail18** and **mail23**. The algorithm also sometimes places the boundary in the middle of a line (as in mail 17 and mail 25 above), which is not reasonable. I therefore think we should treat escape characters (such as the line break) as characters in their own right in order to resolve this problem.
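
The mid-line boundaries can be detected, and crudely repaired, after decoding. The sketch below is a simple post-processing heuristic, not the modeling change proposed above and not part of the lab code; both helper names and the toy mail are hypothetical.

```python
def is_line_start(text, index):
    """True if position `index` is the first character of a line."""
    return index == 0 or text[index - 1] == "\n"

def snap_to_next_line(text, index):
    """Move a boundary forward to the start of the next line (or to the end of the text)."""
    if is_line_start(text, index):
        return index
    newline = text.find("\n", index)
    return len(text) if newline == -1 else newline + 1

mail = "Subject: hi\nContent-Type: text/plain\n\nHello there\n"
predicted = mail.index("ntent")          # a boundary that falls mid-line
print(is_line_start(mail, predicted))    # False
snapped = snap_to_next_line(mail, predicted)
print(repr(mail[snapped:]))              # '\nHello there\n'
```

Snapping forward suits a boundary that falls inside a header line; when the decoder overshoots into the body, snapping backward to the previous line start would be the analogous choice.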

## Question5. How would you model the problem if you had to segment the mails in more than two parts (for example : header, body, signature) ?

In this case, we first have to re-estimate the transition matrix and the emission matrix from labeled samples. The transition matrix becomes a 3x3 matrix of this form:
$$ \begin{bmatrix}
    p_{11}       & p_{12} & 0 \\
     0       & p_{22} & p_{23} \\
     0       & 0 & 1
 \end{bmatrix} $$
The emission matrix becomes a 256x3 matrix (one column per state), and the initial vector π becomes:
$$π^{T} = (1, 0, 0)$$
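
Re-estimating the transition matrix from labeled samples amounts to counting transitions. The following is a minimal sketch under the assumption that each training mail comes with one state label per character (0 = header, 1 = body, 2 = signature); `estimate_transitions` and the toy labels are hypothetical.

```python
import numpy as np

def estimate_transitions(label_sequences, n_states):
    """Maximum-likelihood transition matrix from labeled state sequences."""
    counts = np.zeros((n_states, n_states))
    for labels in label_sequences:
        for prev, nxt in zip(labels[:-1], labels[1:]):
            counts[prev, nxt] += 1
    # A state with no observed outgoing transition is treated as absorbing.
    for s in range(n_states):
        if counts[s].sum() == 0:
            counts[s, s] = 1.0
    # Normalize each row so that it sums to 1.
    return counts / counts.sum(axis=1, keepdims=True)

toy_labels = [[0, 0, 0, 1, 1, 1, 1, 2, 2]]
A3 = estimate_transitions(toy_labels, 3)
print(A3)
```

Because the labels only ever move forward (header, then body, then signature), the estimated matrix automatically has the upper-triangular form shown above.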


## 6. How would you model the problem of separating the portions of mail included, knowing that they always start with the character ">".

In this case, the model would have four states: header, body_text, body_included, and signature. So we need a 4x4 transition matrix of this form:

$$ \begin{bmatrix}
    p_{11}       & p_{12} & p_{13} & 0 \\
     0       & p_{22} & p_{23} & p_{24} \\
     0       & p_{32} & p_{33} & p_{34} \\
     0       & 0 & 0 & 1
 \end{bmatrix} $$

and the initial vector π:

$$π^{T} = (1, 0, 0, 0)$$
We also need to re-estimate the emission matrix.

We can also process lines instead of characters, because most boundaries between the different parts do not fall in the middle of a line. If a line starts with ">", its conditional probability in the 'included mail' state increases. We can also use bigrams to increase the accuracy.
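
To illustrate the line-based variant, the sketch below maps each line of a mail to one of a few observation symbols, which could then be decoded with the same Viterbi routine and an emission matrix with one row per symbol. The four symbol classes, the regular expression, and the name `line_symbols` are assumptions chosen for illustration; they are not part of the lab.

```python
import re

# Hypothetical pattern for a header field such as "Subject: ..." or "X-Beenthere: ...".
HEADER_FIELD = re.compile(r"^[A-Za-z][A-Za-z0-9-]*:")

def line_symbols(text):
    """Map each line to a symbol: 0 = blank, 1 = starts with '>', 2 = header-like, 3 = other."""
    symbols = []
    for line in text.splitlines():
        if not line.strip():
            symbols.append(0)
        elif line.startswith(">"):
            symbols.append(1)
        elif HEADER_FIELD.match(line):
            symbols.append(2)
        else:
            symbols.append(3)
    return symbols

mail = "From: a@b.c\nSubject: hi\n\n> quoted line\nMy reply\n"
print(line_symbols(mail))  # [2, 2, 0, 1, 3]
```

With such symbols, a line starting with ">" would receive a high emission probability in the body_included state and a low one elsewhere, which is the effect described above.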