
# Implementation of HMM algorithms

# Choosing the appropriate data structures

You can choose any data structure to represent your HMM as long as your implementation works. The important thing to bear in mind is that your model needs to contain:

- N states

  Represented by characters, strings or numbers. The set can include the begin and end states explicitly.
- A set of transition probabilities $\{a_{kj}\}$

  $a_{kj} = P(\pi(i) = j \mid \pi(i-1) = k)$

  The probability of being in state j at step i given that we were in state k at step i-1.
- A set of emission probabilities $\{e_k(c)\}$

  $e_k(c) = P(s_i = c \mid \pi(i) = k)$

  The probability of observing the symbol c at the i-th position of the sequence given that we are in state k at the i-th step.

One possible representation is described below.


# States

States are represented as a list of characters.


In [0]:
states = ['b','y','n','e']

# Transitions

Transitions are represented as a dictionary where keys are tuples and values are transition probabilities. 

The first element of each tuple is the state we transition from, and the second element is the state we transition to.

In [0]:
transitions = {('b','y') : 0.2,
               ('b','n') : 0.8,
               ('y','y') : 0.7,
               ('y','n') : 0.2,
               ('y','e') : 0.1,
               ('n','n') : 0.8,
               ('n','y') : 0.1,
               ('n','e') : 0.1}
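
A quick sanity check on this representation is that the probabilities leaving each state add up to 1. The sketch below repeats the dictionary so that it runs on its own; the name `outgoing` is an assumption introduced here for illustration.

```python
from collections import defaultdict
import math

# Same transition table as in the cell above.
transitions = {('b','y') : 0.2,
               ('b','n') : 0.8,
               ('y','y') : 0.7,
               ('y','n') : 0.2,
               ('y','e') : 0.1,
               ('n','n') : 0.8,
               ('n','y') : 0.1,
               ('n','e') : 0.1}

# Sum the outgoing probabilities of every source state.
outgoing = defaultdict(float)
for (source, target), prob in transitions.items():
    outgoing[source] += prob

for source, total in outgoing.items():
    # math.isclose tolerates floating-point rounding in the sums
    assert math.isclose(total, 1.0), (source, total)
```

The end state 'e' never appears as a source, so it has no outgoing row to check.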
               

# Emissions

Emissions are represented as a dictionary where keys are states and values are dictionaries. The internal dictionaries contain emitted symbols as keys and emission probabilities as values.

In [0]:
emissions = {'y' : {'A':0.1, 'C':0.4, 'G':0.4, 'T':0.1},
            'n': {'A': 0.25, 'C': 0.25,'G':0.25,'T':0.25}}
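
Note that only 'y' and 'n' have entries: the begin and end states emit no symbols. The sketch below shows one way to look up $e_k(c)$ with that in mind. The helper `emission_probability` is hypothetical and is not used elsewhere in this notebook.

```python
import math

# Same emission table as in the cell above.
emissions = {'y' : {'A':0.1, 'C':0.4, 'G':0.4, 'T':0.1},
             'n': {'A': 0.25, 'C': 0.25,'G':0.25,'T':0.25}}

def emission_probability(state, symbol):
    """Return e_state(symbol); states without an entry are treated as emitting nothing."""
    return emissions.get(state, {}).get(symbol, 0.0)

# Each emitting state should define a proper distribution over the alphabet.
for state, distribution in emissions.items():
    assert math.isclose(sum(distribution.values()), 1.0), state

p_y_c = emission_probability('y', 'C')   # 0.4, read directly from the table
p_e_a = emission_probability('e', 'A')   # 0.0, 'e' has no emissions entry
```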

# Sequence

The sequence is simply a string.

In [0]:
sequence = 'ATGCG'

# Initializing matrices

We can write a simple utility function to initialize matrices that will be useful later. It takes the desired number of rows, the desired number of columns and the value used to fill the matrix. If the third parameter is not provided, the matrix is filled with zeros by default.

In [0]:
def initialize_matrix(dim1, dim2, value=0):
    F = []
    for i in range(0, dim1):
        F.append([])
        for j in range(0,dim2):
            F[i].append(value)
    return F
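
The nested loops above create a separate inner list for every row, which matters for correctness. As a sketch, the same function can be written with a list comprehension; the variable names `safe` and `aliased` are assumptions introduced here to contrast it with a common shortcut that makes every row the same list object.

```python
def initialize_matrix(dim1, dim2, value=0):
    # One fresh inner list per row, so rows never share storage.
    return [[value for _ in range(dim2)] for _ in range(dim1)]

safe = initialize_matrix(2, 3)
safe[0][0] = 1
print(safe)      # [[1, 0, 0], [0, 0, 0]]

# Pitfall: list multiplication repeats a reference to the SAME row.
aliased = [[0] * 3] * 2
aliased[0][0] = 1
print(aliased)   # [[1, 0, 0], [1, 0, 0]]
```

With the aliased version, writing a single cell of F would change that column in every state's row.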
            

# Visualizing matrices

For visualization purposes only, we implement some printing functions. They are not part of the solution, and the details of their implementation can be ignored.

In [0]:
def print_matrix(matrix, axis1,axis2):
    w = '{:<10}'
    print(w.format('') + w.format('0') + ''.join([w.format(char) for char in axis2]) + w.format('0'))
    for i, row in enumerate(matrix):
        print(w.format(axis1[i]) + ''.join(['{:<10.2e}'.format(item) for item in row]))

def print_matrix_p(matrix, axis1, axis2):
    w = '{:<10}'
    print(w.format('') + w.format('0') + ''.join([w.format(char) for char in axis2])+w.format('0'))
    for i, row in enumerate(matrix):
        print(w.format(axis1[i]) + ''.join(['{:<10s}'.format(item) for item in row]))


# Forward algorithm

We start by initializing the matrix F, which has as many rows as there are states and as many columns as there are symbols in the sequence, plus two additional columns (the first and the last). The probability of starting in the begin state is 1, so we set the first element of the matrix to 1.

In [9]:
F = initialize_matrix(len(states), len(sequence)+2)
F[0][0] = 1
print_matrix(F,states, sequence)

          0         A         T         G         C         G         0         
b         1.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  
y         0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  
n         0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  
e         0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  


Next, we calculate the values for the first symbol. For each state, it is the probability of transitioning from the begin state to the current state times the probability of emitting the first symbol of the sequence in the current state.

In [14]:
for i in range(1, len(states) -1):
    F[i][1] = transitions[(states[0],states[i])] * emissions[states[i]][sequence[0]]

print_matrix(F, states, sequence)
# Worked values for the two emitting states:
# F[1][1] = transitions[(states[0], states[1])] * emissions[states[1]][sequence[0]]
# F[1][1] = transitions[('b','y')] * emissions['y']['A']
# F[1][1] = 0.2 * 0.1
# F[1][1] = 0.02
# F[1][1] = 2.00e-02
# F[2][1] = transitions[(states[0], states[2])] * emissions[states[2]][sequence[0]]
# F[2][1] = transitions[('b','n')] * emissions['n']['A']
# F[2][1] = 0.8 * 0.25
# F[2][1] = 0.2
# F[2][1] = 2.00e-01
# F[3][1] is not computed by the loop: the end state 'e' has no ('b','e') transition
# and no emissions entry, so the cell keeps its initial value
# F[3][1] = 0.00e+00

          0         A         T         G         C         G         0         
b         1.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  
y         0.00e+00  2.00e-02  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  
n         0.00e+00  2.00e-01  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  
e         0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  


For all of the other symbols, from the second to the last, we calculate each value as a sum over the previous states. For each state, it is the sum, over every possible previous state, of the value already stored for that previous state times the probability of transitioning from it to the current state, times the probability of emitting the corresponding symbol of the sequence in the current state.

In [16]:
# loop over the sequence positions, from the second symbol to the last (columns 2..len(sequence))
for j in range(2, len(sequence) + 1):
    # loop over the emitting states (the begin and end states are skipped)
    for i in range(1, len(states) - 1 ):
        p_sum = 0
        # loop over all of the possible previous emitting states
        for k in range(1, len(states) - 1):
            p_sum += F[k][j-1] * transitions[(states[k],states[i])] * emissions[states[i]][sequence[j-1]]
        F[i][j] = p_sum

print_matrix(F, states, sequence)




          0         A         T         G         C         G         0         
b         1.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  
y         0.00e+00  2.00e-02  3.40e-03  2.59e-03  1.06e-03  3.69e-04  0.00e+00  
n         0.00e+00  2.00e-01  4.10e-02  8.37e-03  1.80e-03  4.14e-04  0.00e+00  
e         0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  
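
The two steps so far can be written compactly. As a notational assumption not used elsewhere in this notebook, let $f_k(i)$ denote the entry of F for state k in the column of the i-th symbol, let b denote the begin state, and let L be the length of the sequence. The sum runs over the emitting states:

$$
f_k(1) = a_{bk}\, e_k(s_1), \qquad f_j(i) = e_j(s_i) \sum_{k} f_k(i-1)\, a_{kj} \quad (i = 2, \dots, L)
$$

For example, the entry for state y under the second symbol (T) is $f_y(2) = e_y(T)\,\big(f_y(1)\,a_{yy} + f_n(1)\,a_{ny}\big) = 0.1 \times (0.02 \times 0.7 + 0.2 \times 0.1) = 3.40 \times 10^{-3}$, which matches the table above.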


Finally, we calculate the value for the end state. It is the sum, over the emitting states, of the value stored for that state in the column of the last symbol times the probability of transitioning from that state to the end state.

In [17]:
p_sum = 0
for k in range(1, len(states) - 1 ):
    p_sum += F[k][len(sequence)] * transitions[(states[k], states[-1])]

F[-1][-1] = p_sum
print_matrix(F, states, sequence)


          0         A         T         G         C         G         0         
b         1.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  
y         0.00e+00  2.00e-02  3.40e-03  2.59e-03  1.06e-03  3.69e-04  0.00e+00  
n         0.00e+00  2.00e-01  4.10e-02  8.37e-03  1.80e-03  4.14e-04  0.00e+00  
e         0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  7.83e-05  
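
The value in the bottom-right cell is the total probability of the sequence summed over all state paths. For a sequence this short, that can be checked by brute force. The sketch below is only a verification aid, not part of the forward algorithm: it enumerates every path through the emitting states and adds up the path probabilities. The names `emitting` and `total` are assumptions introduced here.

```python
from itertools import product

# Model and sequence repeated from the cells above so the check runs on its own.
states = ['b','y','n','e']
transitions = {('b','y') : 0.2, ('b','n') : 0.8,
               ('y','y') : 0.7, ('y','n') : 0.2, ('y','e') : 0.1,
               ('n','n') : 0.8, ('n','y') : 0.1, ('n','e') : 0.1}
emissions = {'y' : {'A':0.1, 'C':0.4, 'G':0.4, 'T':0.1},
             'n': {'A': 0.25, 'C': 0.25,'G':0.25,'T':0.25}}
sequence = 'ATGCG'

emitting = states[1:-1]   # 'y' and 'n'
total = 0.0
# Enumerate every assignment of an emitting state to each sequence position.
for path in product(emitting, repeat=len(sequence)):
    # begin state -> first state, emitting the first symbol
    p = transitions[(states[0], path[0])] * emissions[path[0]][sequence[0]]
    # each later step: transition from the previous state, then emit the symbol
    for prev, cur, symbol in zip(path, path[1:], sequence[1:]):
        p *= transitions[(prev, cur)] * emissions[cur][symbol]
    # last state -> end state
    p *= transitions[(path[-1], states[-1])]
    total += p

print('{:.2e}'.format(total))   # 7.83e-05, the same value as F[-1][-1]
```

The enumeration visits 2^5 = 32 paths here and grows exponentially with the sequence length, whereas the forward algorithm fills one column per symbol.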


We can now put everything into a function that takes states, transitions, emissions and a sequence and returns a matrix F.