# Proyecto Final:
## Reconocimiento de Entidades Nombradas

El objetivo es identificar entidades nombradas en un corpus del área de genética. Las etiquetas que se buscan indican si se trata de un gen o no. Debe tomarse en cuenta que las entidades nombradas pueden constar de mas de un elemento. Por tanto, se utiliza un etiquetado BIO.

Las etiquetas son como siguen:
1. __B-tag__ indica el inicio de una entidad (de izquierda a derecha).
2. __I-tag__ indica que la palabra pertenece a una entidad etiquetada con B-tag. En este sentido, siempre debe existir una etiqueta B-tag, pero no necesariamente una I-tag.
3. Finalmente la etiqueta __O__ indica que no se trata de una entidad nombrada.l

## Cadenas Ocultas de Markov

Para dar solución al problema propuesto se usara el modelo de aprendizaje automático Hidden Markov Model (HMM). Es importante establecer que para construir un HMM se necesitan llevar a cabo tres etapas:


1. __Likelihood__: Calcular la probabilidad de las observaciones dado los parámetros del modelo $\lambda$: 

    $p(O|\lambda)$

Para calcular esta probabilidad se pueden usar los algoritmos _forward_ y _backward_, ambos producen el mismo resultado.

2. __Decodificacion__: Encontrar la secuencia de estados que maximice las observaciones: 

    $argmax_S(p(S|O))$

El algoritmo de _Viterbi_ se utiliza para calcular la secuencia mas probable.

3. __Aprendizaje__: Ajustar los parámetros A, B y $\Pi$ del HMM de tal forma que maximicen la probabilidad de observar una secuencia:

Con el algoritmo de _Expectation Maximization_ se ajustan las matrices de probabilidades del modelo dada una secuencia.

## Construccion del HMM $\lambda$

Un modelo oculto de Markov se denota mediante la letra $\lambda$ y es una 5-tupla:

$$ \lambda = (S, O, A, B, \Pi) $$

Donde: 

* $S = \{s_1, ..., s_N\}$
        Conjunto de estados
* $O = \{o_1, ..., o_T\}$
        Conjunto de emisiones
* $A = \{a_{i,j}\} = p(q_{t+1}=S_j|q_t=S_i)$
        Probabilidad de transiciones
* $B = \{b_{i,j}\} = p(q_t=o_i| q_t=S_j)$
        Probabilidad de emisiones
* $\Pi = \{\Pi_i\} = p(q_1=S_i)$
        Probabilidad de estados iniciales

In [1]:
import re
import numpy as np
import pandas as pd
from nltk import bigrams

### Lectura y preprocesamiento del corpus

El conjunto de estados ocultos $S$ se conformara por las etiquetas y el conjunto de simbolos de observación $O$ seran las palabras.

In [2]:
with open('Final1/data_test.txt', 'r') as file:
    raw_corpus = file.read().splitlines()

In [3]:
raw_corpus[:10]

['IL-2\tB-DNA',
 'gene\tI-DNA',
 'expression\tO',
 'and\tO',
 'NF-kappa\tB-protein',
 'B\tI-protein',
 'activation\tO',
 'through\tO',
 'CD28\tB-protein',
 'requires\tO']

In [4]:
# Crear el conjunto de estados y observaciones junto con un diccionario
# que almacena la secuencia de estados y emisiones como una cadena

corpus = {
    'states': [],
    'obs': []
}

S = []
O = []


string_obs = ''
string_state = ''

for phrase in raw_corpus:
    if phrase == '' and len(string_obs) > 0 and len(string_state) > 0:
        corpus['states'].append('<BOS> ' + string_state + ' <EOS>')
        corpus['obs'].append('<BOS> ' + string_obs + ' <EOS>')
        string_obs = ''
        string_state = ''
    try:
        obs , state = phrase.split('\t')
        string_obs += obs + ' ' 
        string_state += state + ' '
        if obs not in O:
            O.append(obs)
        if state not in S:
            S.append(state)
    except:
        pass

In [5]:
obs2indx = {}
state2indx = {}
indx2state = {}
indx2obs = {}

for idx, obs in enumerate(O):
    obs2indx[obs] = idx
    
for idx, state in enumerate(S):
    indx2state[idx] = state

In [6]:
print(len(corpus['obs'][0].split()))
print(corpus['obs'][0].split())

18
['<BOS>', 'IL-2', 'gene', 'expression', 'and', 'NF-kappa', 'B', 'activation', 'through', 'CD28', 'requires', 'reactive', 'oxygen', 'production', 'by', '5-lipoxygenase', '.', '<EOS>']


In [7]:
print(len(corpus['states'][0].split()))
print(corpus['states'][0].split())

18
['<BOS>', 'B-DNA', 'I-DNA', 'O', 'O', 'B-protein', 'I-protein', 'O', 'O', 'B-protein', 'O', 'O', 'O', 'O', 'O', 'B-protein', 'O', '<EOS>']


In [8]:
len_S = len(S)
len_O = len(O)

print('Numero de estados = ', len_S)
print('Alfabeto de observaciones = ', len_O)

Numero de estados =  11
Alfabeto de observaciones =  22053


### Obtener frecuencias de bigramas

In [9]:
freq_states = {}
freq_states2states = {}
freq_states2obs = {}
TRUE_LABELS = []

In [10]:
for sstate, sobs in zip(corpus['states'], corpus['obs']):
    tags = ['<BOS>', '<EOS>']
    for si, sj in list(bigrams(sstate.split())):
        if (si, sj) in freq_states2states:
            freq_states2states[(si, sj)] += 1
        else:
            freq_states2states[(si, sj)] = 1

    for si in sstate.split():
        if si in freq_states:
            freq_states[si] += 1
        else:
            freq_states[si] = 1
            
    for oj, si in zip(sobs.split()[1:-1], sstate.split()[1:-1]):
        if (si, oj) not in freq_states2obs:
            freq_states2obs[(si, oj)] = 1
            TRUE_LABELS.append((oj, si))
        else:
            freq_states2obs[(si, oj)] += 1

## Creación del modelo del lenguaje

El modelo del lenguaje será construido al hacer las matrices $A$, $B$ y $\Pi$ usando el corpus de entrenamiento que ya esta etiquetado.

In [11]:
A = np.zeros((len_S, len_S))
B = np.zeros((len_S, len_O))
Pi = np.zeros(len_S)

N = len_S
T = len_O

print(A.shape)
print(B.shape)
print(Pi.shape)

(11, 11)
(11, 22053)
(11,)


El __smoothing Laplaciano__ se usara para calcular la probabilidad condicional:

$$p(x_j|x_i) =  \frac{fr(x_i, x_j) + 1}{fr(x_i) + N}$$

In [12]:
def smoothingLaplacian(wi, wj, L, mode):
    if mode == 'A':
        si, sj = wi, wj
        try:
            prob = (freq_states2states[(si, sj)] + 1) / (freq_states[si] + L)
        except: 
            prob = 1 / (freq_states[wi] + L)
    elif mode == 'B':
        si, oj = wi, wj
        try:
            prob = (freq_states2obs[(si, oj)] + 1) / (freq_states[si] + L)
        except:
            prob = 1 / (freq_states[si] + L)
    elif mode == 'Pi':
        si = wi
        try:
            prob = (freq_states2states[('<BOS>', si)] + 1) / (freq_states['<BOS>'] + L)
        except:
            prob = 1 / (freq_states['<BOS>'] + L)
    return prob

#### Matriz de transiciones de estado

Representa la probabilidad de que el siguiente estado en el tiempo _t+1_ sea $S_j$, dado que el estado actual es $S_i$:

$$A_{i,j} = p(S^{t+1}_j|S^t_i)$$

$$\sum_j A_{ij} = 1$$

In [13]:
for i in range(N):
    for j in range(N):
        if j == N:
            A[i, j+1] = smoothingLaplacian(S[i] , '<EOS>', N, 'A')
        else:
            A[i, j] = smoothingLaplacian(S[i], S[j], N, 'A')

In [14]:
list(A.sum(axis=1))

[0.9999999999999999,
 0.9997465788139889,
 0.9516157264564267,
 0.999900826446281,
 0.9998389499536982,
 0.9995537042546863,
 0.9997712978845054,
 0.9997395154988278,
 1.0,
 1.0,
 1.0]

In [15]:
Aij = pd.DataFrame(A, columns=S, index=S) 
Aij

Unnamed: 0,B-DNA,I-DNA,O,B-protein,I-protein,B-cell_type,I-cell_type,B-cell_line,I-cell_line,B-RNA,I-RNA
B-DNA,0.001048,0.784345,0.213245,0.000524,0.000105,0.000105,0.000105,0.000105,0.000105,0.00021,0.000105
I-DNA,0.003548,0.525215,0.470223,0.000317,6.3e-05,6.3e-05,6.3e-05,6.3e-05,6.3e-05,6.3e-05,6.3e-05
O,0.024216,3e-06,0.825477,0.072979,3e-06,0.017023,3e-06,0.009588,3e-06,0.00232,3e-06
B-protein,0.000793,3.3e-05,0.55924,0.007273,0.432132,0.000132,3.3e-05,9.9e-05,3.3e-05,9.9e-05,3.3e-05
I-protein,0.000564,4e-05,0.501228,0.024278,0.473326,8.1e-05,4e-05,4e-05,4e-05,0.000161,4e-05
B-cell_type,0.000149,0.000149,0.251562,0.001339,0.000149,0.000893,0.744719,0.000149,0.000149,0.000149,0.000149
I-cell_type,0.000229,0.000114,0.570383,0.001372,0.000114,0.000229,0.426529,0.000457,0.000114,0.000114,0.000114
B-cell_line,0.00026,0.00026,0.149779,0.000781,0.00026,0.000781,0.00026,0.00026,0.846575,0.00026,0.00026
I-cell_line,0.000135,0.000135,0.430717,0.001218,0.000135,0.000406,0.000135,0.007578,0.558999,0.000406,0.000135
B-RNA,0.002079,0.00104,0.133056,0.00104,0.00104,0.00104,0.00104,0.00104,0.00104,0.00104,0.856549


#### Matriz de emisiones de observaciones

Esta matriz describe la probabilidad de que en el estado $S^t_i$ se emita la observación $O^t_j$:

$$ B_{i,j} = p(O^t_j|S^t_i) $$

$$ \sum_j B_{i,j} = 1 $$

In [16]:
for i in range(N):
    for j in range(T):
        B[i, j] = smoothingLaplacian(S[i], O[j], T, 'B')

In [17]:
list(B.sum(axis=1))

[0.9999999999999999,
 1.0,
 0.9999999999999999,
 1.0,
 0.9999999999999998,
 0.9999999999999998,
 1.0,
 1.0,
 0.9999999999999999,
 1.0000000000000002,
 1.0]

In [18]:
Bij = pd.DataFrame(B, columns=O, index=S)
Bij

Unnamed: 0,IL-2,gene,expression,and,NF-kappa,B,activation,through,CD28,requires,...,80-,chromatin-assembled,-assembly,packaging,counteracted,TFE-3,Individually,HMG-88,nucleosomal,bending
B-DNA,0.005446,0.000158,0.000158,3.2e-05,0.002944,0.000665,3.2e-05,3.2e-05,0.000317,3.2e-05,...,3.2e-05,6.3e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,6.3e-05,3.2e-05
I-DNA,0.00082,0.036536,0.001163,0.008724,0.000608,0.006133,0.000449,2.6e-05,5.3e-05,2.6e-05,...,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05
O,4.2e-05,0.002082,0.007483,0.029831,3.5e-05,0.000553,0.005437,0.001225,5e-06,0.000306,...,5e-06,2e-06,5e-06,5e-06,5e-06,2e-06,5e-06,2e-06,2e-06,5e-06
B-protein,0.010766,9.6e-05,1.9e-05,1.9e-05,0.019544,0.000861,0.000344,1.9e-05,0.002467,1.9e-05,...,1.9e-05,1.9e-05,1.9e-05,1.9e-05,1.9e-05,5.7e-05,1.9e-05,3.8e-05,1.9e-05,1.9e-05
I-protein,0.000299,0.001557,2.1e-05,0.00817,0.001408,0.033426,0.000853,2.1e-05,0.000256,2.1e-05,...,2.1e-05,2.1e-05,2.1e-05,2.1e-05,2.1e-05,2.1e-05,2.1e-05,2.1e-05,2.1e-05,2.1e-05
B-cell_type,7e-05,3.5e-05,3.5e-05,3.5e-05,3.5e-05,0.009734,3.5e-05,3.5e-05,3.5e-05,3.5e-05,...,3.5e-05,3.5e-05,3.5e-05,3.5e-05,3.5e-05,3.5e-05,3.5e-05,3.5e-05,3.5e-05,3.5e-05
I-cell_type,6.5e-05,3.2e-05,6.5e-05,0.006301,3.2e-05,0.006431,3.2e-05,3.2e-05,3.2e-05,3.2e-05,...,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05
B-cell_line,0.000155,3.9e-05,3.9e-05,3.9e-05,3.9e-05,0.001468,3.9e-05,3.9e-05,3.9e-05,3.9e-05,...,3.9e-05,3.9e-05,3.9e-05,3.9e-05,3.9e-05,3.9e-05,3.9e-05,3.9e-05,3.9e-05,3.9e-05
I-cell_line,0.000136,3.4e-05,3.4e-05,0.004213,3.4e-05,0.003771,3.4e-05,3.4e-05,3.4e-05,3.4e-05,...,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05
B-RNA,0.000869,4.3e-05,4.3e-05,4.3e-05,0.00013,4.3e-05,4.3e-05,4.3e-05,4.3e-05,4.3e-05,...,4.3e-05,4.3e-05,4.3e-05,4.3e-05,4.3e-05,4.3e-05,4.3e-05,4.3e-05,4.3e-05,4.3e-05


#### Vector de probabilidades iniciales

Se representa la probabilidad de que el estado $S_i$ sea el inicial:

$$\Pi_i = p(S^{t=1}_i|<BOS>)$$

$$\sum_i \Pi_i = 1$$

In [19]:
for i in range(N):
    Pi[i] = smoothingLaplacian(S[i], '', N, 'Pi')

In [20]:
Pi.sum()

1.0

In [21]:
Pii = pd.DataFrame(Pi, columns=['<BOS>'], index=S)
Pii

Unnamed: 0,<BOS>
B-DNA,0.008788
I-DNA,5.4e-05
O,0.89449
B-protein,0.078122
I-protein,5.4e-05
B-cell_type,0.009866
I-cell_type,5.4e-05
B-cell_line,0.005391
I-cell_line,5.4e-05
B-RNA,0.003073


### Creación del modelo de Markov

Para encontrar la probabilidad de las observaciones dado el modelo $\lambda$ se aplican los algoritmos forward y backward.

$$p(o_1,\dots, o_T| \lambda)$$

#### Algoritmo Forward

Se define la variable forward como:

$\alpha_t(j) = p(o_1, ..., o_t, q_t=S_j|\lambda)$

Para encontrar en cada iteración la probabilidad conjunta de las observaciones se define:

$\alpha_1(j) = \Pi_j b_j(o_1)$

$\alpha_{t+1}(j) = \sum_{i=1}^N \alpha_{t}(i) A_{i,j} b_j(o_t)$

En forma matricial se puede escribir como:

$\alpha_{t+1} = b(o_t) \odot ( A^T \bullet \alpha_t)$

#### Algoritmo Backward

Para encontrar en cada iteracion la probabilidad conjunta de las observaciones se define:

$\beta_T(j) = 1 $

$\beta_t(j) = \sum_i A_{j,i} b_i(O_{t+1}) \beta_{t+1}(i) $

#### Decodificacion

Se utiliza el algoritmo de Viterbi para encontrar la secuencia de estados que maximiza la probabilidad de emitir las observaciones dadas.

Se define la variable $\delta$:

$\delta_0(i) = \Pi_i b_i(O_0)$

$\delta_t(i) = max(\delta_{t-1}(j)a_{ji}b_i(O_t))$

En cada iteración del tiempo, se selecciona el estado que tenga la mayor probabilidad y se guarda la posición en una variable llamada _backpointer_ con un _argmax_ del mismo argumento de la variable $\delta$.

Si se comparan el algoritmo de Viterbi y el Forward, son basicamente el mismo algoritmo con la diferencia de que en Forward se toma la suma de las probabilidades de cada posible camino y en Viterbi se toma el camino con mayor probabilidad.

#### Aprendizaje 

Los parámetros del HMM se ajustan mediante la aplicación del algoritmo Forward-Backward que consiste de dos etapas:

1. Expectation: Calcular los valores esperados de transiciones de estados.

$\gamma_t(i) = \frac{\alpha_t(i)\beta_t(i)}{p(O|\lambda)} = \frac{\alpha_t(i)\beta_t(i)}{\alpha_T}$

$\xi_t(i,j) = \frac{\alpha_t(i)a_{ij}b_j(O_{t+1}\beta_{t+1}(j))}{\alpha_T}$

2. Maximization: Actualizar los parámetros del modelo.

$a_{ij} = \frac{\sum_t\xi_t(i,j)}{\sum_t\sum_j\xi_t(i,j)}$

$b_i(k) = \frac{\sum_t \begin{cases} 0 & O_t \neq V_k \\ \gamma_t(i) & O_t = V_k \end{cases} }{\sum_t\gamma_t(k)}$

$\Pi_i = \alpha_0(i)$

Este proceso se realiza hasta llegar a cierto numero de iteraciones o que se cumpla algún criterio.

In [22]:
class HMM():
    def __init__(self, S, O, A, B, Pi):
        self.S = S
        self.O = O
        self.A = A
        self.B = B
        self.Pi = Pi
        self.N = len(S)
        self.T = len(O)
    
    def forward(self, sequence):
        words = sequence.split()
        alpha = np.zeros((self.N, len(words)))
        alpha[:, 0] = self.Pi * Bij[words[0]]
        t = np.arange(len(words))
        for word, ti in zip(words[1:], t[1:]):
            alpha[:, ti] = np.dot(self.A.T, alpha[:, ti-1]) * self.B[:, obs2indx[word]]
        return alpha[:,-1].sum(), alpha
            
    def backward(self, sequence):
        words = sequence.split()
        beta = np.zeros((self.N, len(words)))
        beta[:,-1] = np.ones(self.N)
        for t in range(len(words)-1, 0, -1):
            for j in range(self.N):
                beta[j, t-1] = (self.A[j,:] * self.B[:, obs2indx[words[t]]] * beta[:, t]).sum()
        return (self.Pi * beta[:, 0] * self.B[:, obs2indx[words[0]]]).sum(), beta
    
    def viterbi(self, seq):
        words = seq.split()
        delta = np.zeros((self.N, len(words)))
        delta[:,0] = self.Pi * self.B[:, obs2indx[words[0]]]
        bckptr = np.zeros((self.N, len(words)))
        path = []
        for t in range(len(words)-1):
            for j in range(self.N):
                try:
                    z = self.A[:, j] * delta[:, t] * self.B[:, obs2indx[words[t+1]]]
                except: 
                    z = self.A[:, j] * delta[:, t] * 1
                delta[j, t+1] = np.max(z)
                bckptr[j, t+1] = np.argmax(z)
        prob = np.max(delta[:,-1], axis=0)
        for t in range(len(words)):
            z = np.max(delta[:, t])
            zidx = np.argmax(delta[:, t])
            path.append(zidx)
        return path
    
    def forward_backward(self, seq, steps):
        words = seq.split()
        alphaT, alpha = self.forward(seq)
        beta0, beta = self.backward(seq) 
        gamma = np.zeros((self.N, len(words)))
        E = np.zeros((len(words), self.N, self.N))
        for i in range(steps):
            # E-step
            for t in range(len(words)):
                for j in range(self.N):
                    gamma[j, t] = (alpha[j, t]*beta[j, t])/alphaT
            
            for t in range(len(words)-1):
                for i in range(self.N):
                    for j in range(self.N):
                        E[t, i, j] = (alpha[i, t]*self.A[i,j]*self.B[j, obs2indx[words[t+1]]]*beta[j, t+1])/alphaT
            # M-step
            for i in range(self.N):
                for j in range(self.N):
                    self.A[i, j] = np.sum(E[:-1, i, j])/np.sum([np.sum(E[t, i, :]) for t in range(len(words)-1)])
            
            for j in range(self.N):
                for k in range(self.T):
                    num = 0
                    for t in range(len(words)):
                        if obs2indx[words[t]] == k:
                            num += gamma[j, t]
                    #self.B[j, k] = num / np.sum(gamma[j, :])
            self.Pi = gamma[:, 0]
        #return gamma, E
        
    def score(self, predicted_labels, true_labels):
        sc = 0
        N = len(predicted_labels)
        for pred_pair, true_pair in zip(predicted_labels, true_labels):
            if pred_pair[1] == true_pair[1]:
                sc += 1
        return sc/N

In [23]:
# Predecir los estados o tags de una secuencia dado un modelo de markov
def predictTags(hmm, seq, RTG=False):
    words = seq.split()
    path = hmm.viterbi(seq)
    if RTG:
        return [(word, indx2state[itag]) for itag, word in zip(path, words)]
    for word, p in zip(words, path):
        print(f'{word}\t{indx2state[p]}')

#### Entrenamiento del HMM

Se muestra el entrenamiento del HMM con una oracion en particular del corpus.

In [24]:
hmm = HMM(S, O, A, B, Pi)

In [25]:
# Ejecucion del algoritmo forward y backward para el calculo de la probabilidad 
# de las observaciones dado el modelo de markov definido para las primeras
# 10 oraciones. El valor de alpha y beta debe ser el mismo como se verifica:
for seq in corpus['obs'][:10]:
    seq = seq.strip('<[B|E]OS>')
    alphaT , _ = hmm.forward(seq)
    beta0, _ = hmm.backward(seq)
    print('alpha = {} \t beta = {}'.format(alphaT, beta0))

alpha = 3.9589842025343836e-47 	 beta = 3.958984202534385e-47
alpha = 2.5586096125373176e-75 	 beta = 2.558609612537318e-75
alpha = 2.6646760772142282e-114 	 beta = 2.6646760772142282e-114
alpha = 4.713049060145071e-71 	 beta = 4.713049060145066e-71
alpha = 3.437866893431722e-60 	 beta = 3.4378668934317234e-60
alpha = 1.3337146861198805e-58 	 beta = 1.3337146861198798e-58
alpha = 1.0793322882318052e-55 	 beta = 1.079332288231806e-55
alpha = 1.940121066137957e-77 	 beta = 1.94012106613796e-77
alpha = 4.1462103665387673e-57 	 beta = 4.1462103665387684e-57
alpha = 1.9760470421394936e-66 	 beta = 1.9760470421394952e-66


In [26]:
# Algoritmo de Viterbi para la primera oracion
# Los numeros de la lista representan el indice de los estados o tags

seq = corpus['obs'][0].strip('<[B|E]OS>')

print('Secuencia de observaciones: ')
print(seq)

print('\nSecuencia de estados:')
print(hmm.viterbi(seq))

Secuencia de observaciones: 
 IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase .  

Secuencia de estados:
[3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]


In [27]:
print('Prediccion:')
predictTags(hmm, seq, True)

Prediccion:


[('IL-2', 'B-protein'),
 ('gene', 'O'),
 ('expression', 'O'),
 ('and', 'O'),
 ('NF-kappa', 'O'),
 ('B', 'O'),
 ('activation', 'O'),
 ('through', 'O'),
 ('CD28', 'O'),
 ('requires', 'O'),
 ('reactive', 'O'),
 ('oxygen', 'O'),
 ('production', 'O'),
 ('by', 'O'),
 ('5-lipoxygenase', 'O'),
 ('.', 'O')]

In [28]:
# Se imprime el verdadero etiquetado de la oracion
TRUE_LABELS[:16]

[('IL-2', 'B-DNA'),
 ('gene', 'I-DNA'),
 ('expression', 'O'),
 ('and', 'O'),
 ('NF-kappa', 'B-protein'),
 ('B', 'I-protein'),
 ('activation', 'O'),
 ('through', 'O'),
 ('CD28', 'B-protein'),
 ('requires', 'O'),
 ('reactive', 'O'),
 ('oxygen', 'O'),
 ('production', 'O'),
 ('by', 'O'),
 ('5-lipoxygenase', 'B-protein'),
 ('.', 'O')]

In [29]:
# Se entrena el HMM para la primer oracion
hmm.forward_backward(seq, 10)

In [30]:
# Se realiza una nueva prediccion despues del ajuste de parametros
print('Nueva prediccion:')
predictTags(hmm, seq, True)

Nueva prediccion:


[('IL-2', 'B-protein'),
 ('gene', 'O'),
 ('expression', 'B-protein'),
 ('and', 'B-protein'),
 ('NF-kappa', 'O'),
 ('B', 'O'),
 ('activation', 'B-protein'),
 ('through', 'O'),
 ('CD28', 'O'),
 ('requires', 'B-protein'),
 ('reactive', 'O'),
 ('oxygen', 'B-protein'),
 ('production', 'B-protein'),
 ('by', 'B-protein'),
 ('5-lipoxygenase', 'O'),
 ('.', 'B-protein')]

In [31]:
# Calcular el score del hmm 

TRUE_LABELS[:5]
PREDICTED_LABELS = []

for seq in corpus['obs']:
    seq = seq.strip('<[B|E]OS>')
    PREDICTED_LABELS += predictTags(hmm, seq, True)
    
#SCORE = hmm.score(TRUE_LABELS, PREDICTED_LABELS)
#print(SCORE*100)

### Predicción de estados (datos de prueba)

In [32]:
with open ("Final1/data_eval.txt", "r") as file:
    corpus_test = file.read().splitlines()

In [33]:
corpus_test

['HIV-1 envelope glycoproteins induce activation of activated protein-1 in CD4+ T cells [published erratum appears in J Biol Chem 1995 Dec 1; 270 (48): 29038]  Activation of CD4 positive T cells is a primary requirement for human immunodeficiency virus (HIV) entry, efficient HIV replication, and progression to AIDS, Utilizing CD4 positive T cell lines and purified T cells from normal individuals, we have demonstrated that native envelope glycoproteins of HIV, gp 160, can induce activation of transcription factor, activated protein-1 (AP-1).',
 'The stimulatory effects of gp160 are mediated through the CD4 molecule, since treatment of gp160 with soluble CD4-IgG abrogates its activity, and CD4 negative T cell lines fail to be stimulated with gp160.',
 'Immunoprecipitation of the gp 160-induced nuclear extracts with polyclonal antibodies to Fos and Jun proteins indicates that AP-1 complex is comprised of members of these family of proteins.',
 'The gp160-induced AP-1 complex is dependent 

In [34]:
# Predecir los tags de cada palabra de una secuencia 
ctags = []
for seq in corpus_test:
    ctags.append(predictTags(hmm, seq, RTG=True))

In [35]:
# Predicciones de la primera secuencia
ctags[0]

[('HIV-1', 'B-DNA'),
 ('envelope', 'I-DNA'),
 ('glycoproteins', 'O'),
 ('induce', 'B-protein'),
 ('activation', 'O'),
 ('of', 'B-protein'),
 ('activated', 'O'),
 ('protein-1', 'O'),
 ('in', 'B-protein'),
 ('CD4+', 'O'),
 ('T', 'B-protein'),
 ('cells', 'B-protein'),
 ('[published', 'O'),
 ('erratum', 'B-protein'),
 ('appears', 'B-protein'),
 ('in', 'B-protein'),
 ('J', 'O'),
 ('Biol', 'B-protein'),
 ('Chem', 'O'),
 ('1995', 'B-protein'),
 ('Dec', 'O'),
 ('1;', 'B-protein'),
 ('270', 'O'),
 ('(48):', 'B-protein'),
 ('29038]', 'O'),
 ('Activation', 'B-protein'),
 ('of', 'B-protein'),
 ('CD4', 'O'),
 ('positive', 'B-protein'),
 ('T', 'O'),
 ('cells', 'B-protein'),
 ('is', 'B-protein'),
 ('a', 'B-protein'),
 ('primary', 'O'),
 ('requirement', 'B-protein'),
 ('for', 'B-protein'),
 ('human', 'O'),
 ('immunodeficiency', 'O'),
 ('virus', 'B-protein'),
 ('(HIV)', 'O'),
 ('entry,', 'B-protein'),
 ('efficient', 'B-protein'),
 ('HIV', 'O'),
 ('replication,', 'B-protein'),
 ('and', 'B-protein'),
 ('

### Referencias

Para construir el HMM se consultaron las siguientes fuentes:
1. https://dev.widemeadows.de/2014/09/01/hidden-markov-model-training-using-the-baum-welch-algorithm/
2. https://medium.com/@kangeugine/hidden-markov-model-7681c22f5b9
3. https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf