# Hidden Markov Model for DNA Sequence Analysis

**Have examples of the models we were working with and how we can make them more complex**

# Getting started

First, we're going to import any external packages we need.

For today, the only external package we're going to use is "numpy", which lets us create and manipulate arrays.

In [1]:
import numpy

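To make the training data more accessible to you, we've created files that have.... The data in those files uses letters for nucleotides and names for states, but numpy arrays are indexed by integers. The two dictionaries below map each nucleotide and each state name to an integer, so that we can later use them as row and column indices in our count and probability arrays.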

In [None]:
get_nuc_index = {
    'A' : 0,
    'C' : 1,
    'G' : 2,
    'T' : 3
}

get_state_index = {
    'Intergenic' : 0,
    'Start1' : 1,
    'Start2' : 2,
    'Start3' : 3,
    'Exon1' : 4,
    'Exon2' : 5,
    'Exon3' : 6,
    'Intron1' : 7,
    'Intron2' : 8,
    'Intron3' : 9
}
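
As a quick illustration of why the integer codes matter, the sketch below converts a short sequence to integers and uses them to index an array. The sequence and the values in `toy_emission_row` are made up for this example and are not part of the training data.

```python
import numpy

get_nuc_index = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

# Hypothetical emission probabilities for a single state (made-up numbers)
toy_emission_row = numpy.array([0.1, 0.4, 0.4, 0.1])

# Convert each letter to its integer code
toy_seq = 'GATC'
toy_indices = [get_nuc_index[nuc] for nuc in toy_seq]
print(toy_indices)  # [2, 0, 3, 1]

# Each integer can now be used as a position in the array
toy_probs = [toy_emission_row[i] for i in toy_indices]
print(toy_probs)
```

The letters themselves could not be used this way: `toy_emission_row['G']` raises an error, while `toy_emission_row[2]` does not.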

# Read in + format the data

Now that we are able to convert nucleotides and model states to integers, we want to actually read in the data.

We've put the training data in two `.csv` files in the same directory as this notebook.

The DNA sequence training data is in the file `HMM_DNA_training.csv`.
The state sequence training data is in the file `HMM_State_training.csv`.

Python lets us read through these files line by line, treating each line as a string. We strip the newline from the end of each line and split it at the commas to get a list.

In [11]:
DNA_training_data_file = 'HMM_DNA_training.csv'
State_training_data_file = 'HMM_State_training.csv'

for line in open(DNA_training_data_file):
    line_as_list = line.strip().split(',')
    print(line_as_list)
    break

['T', 'A', 'T', 'G', 'G', 'C', 'A', 'T', 'C', 'A', 'G']


Because the DNA data and the state data are in different files, we need to store the DNA data so that, when we read through the state data, we can connect each nucleotide emission with its state.

To store the DNA data, we can create an empty list, and populate it with each line of the data (creating a list of lists). Finally we can convert that into a numpy array.

In [None]:
# Empty list for the DNA sequence 
DNA_training_data = []

for line in open(DNA_training_data_file):
    line_as_list = line.strip().split(',')
    DNA_training_data.append(line_as_list)
    
# Convert the list of lists into a numpy array (we reuse the same variable name)
DNA_training_data = numpy.array(DNA_training_data)

# What does it look like?
print(DNA_training_data)

In [None]:
# Format the state sequence through the same process
State_training_data = []

for line in open(State_training_data_file):
    line_as_list = line.strip().split(',')
    State_training_data.append(line_as_list)
    
State_training_data = numpy.array(State_training_data)
print(State_training_data)
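
The counting step that follows pairs each nucleotide with the state at the same row and column, so the two arrays need to have the same shape. Here is a minimal sketch of that check. It uses two tiny in-memory "files" with made-up contents in place of the real CSVs, and `read_rows` is a helper name invented for this example.

```python
import io
import numpy

# In-memory stand-ins for the two CSV files (made-up data)
toy_dna_file = io.StringIO("A,T,G\nC,C,A\n")
toy_state_file = io.StringIO(
    "Intergenic,Start1,Start2\nIntergenic,Intergenic,Intergenic\n"
)

def read_rows(file_handle):
    """Read comma-separated lines into a numpy array of strings."""
    rows = [line.strip().split(',') for line in file_handle if line.strip()]
    return numpy.array(rows)

toy_dna = read_rows(toy_dna_file)
toy_states = read_rows(toy_state_file)

# Every nucleotide should have exactly one matching state
assert toy_dna.shape == toy_states.shape
print(toy_dna.shape)  # (2, 3)
```

Note that this only works cleanly when every line has the same number of entries, which is what lets numpy build a rectangular array.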

# Learn training values to use in our model

In [1]:
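# Set up arrays of zeros to hold our counts. Rows are states;
# for the emission counts, columns are nucleotides.
num_states = len(get_state_index)
num_nucs = len(get_nuc_index)
emission_counts = numpy.zeros((num_states, num_nucs))
transition_counts = numpy.zeros((num_states, num_states))
start_counts = numpy.zeros(num_states)

# The `shape` attribute gives us the dimensions (rows, columns) of an array.
# We pass each dimension to `range` to visit every row and every column in
# our DNA sequence training data.

# For each row in our DNA 
for row_num in range(DNA_training_data.shape[0]):
    # For each column in our DNA
    for col_num in range(DNA_training_data.shape[1]):
        # For the corresponding `state` for our HMM, we go to the state data and
        # find the same row and same column number 
        state = State_training_data[row_num,col_num]
        # We name the value at our given (row,col) in the DNA data `nucleotide`
        nucleotide = DNA_training_data[row_num,col_num]
        # Convert the state name and the nucleotide letter to their integer indices
        state_index = get_state_index[state]
        nucleotide_index = get_nuc_index[nucleotide]

        # Count one more emission of this nucleotide from this state
        emission_counts[state_index,nucleotide_index] += 1

        # If there is another column after this one, count the transition into it
        if col_num < State_training_data.shape[1]-1:
            # Look up the state at the next position
            next_state = State_training_data[row_num,col_num+1]
            # Convert the next state's name to its integer index
            next_state_index = get_state_index[next_state]

            # Count one more transition from the current state to the next state
            transition_counts[state_index, next_state_index] += 1

        # If we're at the first column (remember, python starts at zero),
        # this state is where the sequence starts, so count it as a start state
        if col_num == 0:
            start_counts[state_index] += 1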
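
To see what the counting produces, it can help to run the same idea by hand on something tiny. The sketch below uses a made-up two-state model (`Intergenic` and `Exon`) and a single four-nucleotide sequence. All names starting with `toy_` are invented for this example.

```python
import numpy

toy_nuc_index = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
toy_state_index = {'Intergenic': 0, 'Exon': 1}  # hypothetical two-state model

# One made-up training sequence and its state labels
toy_dna = ['A', 'C', 'G', 'G']
toy_states = ['Intergenic', 'Exon', 'Exon', 'Exon']

toy_emission_counts = numpy.zeros((2, 4))
toy_transition_counts = numpy.zeros((2, 2))

for i in range(len(toy_dna)):
    s = toy_state_index[toy_states[i]]
    n = toy_nuc_index[toy_dna[i]]
    # One more emission of nucleotide n from state s
    toy_emission_counts[s, n] += 1
    # One more transition from state s to the next state, if there is one
    if i < len(toy_dna) - 1:
        next_s = toy_state_index[toy_states[i + 1]]
        toy_transition_counts[s, next_s] += 1

print(toy_emission_counts)
print(toy_transition_counts)
```

`Intergenic` emitted one `A`, and `Exon` emitted one `C` and two `G`s. There is one `Intergenic` to `Exon` transition and two `Exon` to `Exon` transitions, three in total for a sequence of length four.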


## Convert emission counts to probabilities 

Remember the emission probabilities for our HMM? They are the probabilities of each state emitting each of the four nucleotides.

In [None]:
# First, we make an array of zeros in the same shape as our emission counts.
# This gives us an easy place (with the correct number of rows and columns)
# to store our emission probabilities once we have them.
emission_probs = numpy.zeros(emission_counts.shape)

# For each row
for row_num in range(emission_counts.shape[0]):
    # Get the number of emissions at each row and add them together. 
    # We're using the `sum` function from the numpy package.
    row_sum = numpy.sum(emission_counts[row_num])
    # If this state was seen at least once (so that we never divide by zero),
    if row_sum != 0:
        # divide each count in the row by the row total, so the row adds up to 1
        emission_probs[row_num] = emission_counts[row_num]/row_sum
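
The same row-by-row normalization can also be written without a loop. This is an alternative sketch rather than the approach used in this notebook. The counts in `toy_counts` are made up, with one all-zero row to show that states we never saw keep a row of zeros.

```python
import numpy

# Made-up counts: 3 states (rows) by 4 nucleotides (columns)
toy_counts = numpy.array([[2.0, 2.0, 4.0, 0.0],
                          [0.0, 0.0, 0.0, 0.0],   # a state never seen in training
                          [1.0, 0.0, 0.0, 3.0]])

# Sum across each row, keeping a column shape so it lines up with the rows
row_sums = toy_counts.sum(axis=1, keepdims=True)

# Divide only where the row sum is not zero; other rows keep the zeros from `out`
toy_probs = numpy.divide(toy_counts, row_sums,
                         out=numpy.zeros_like(toy_counts),
                         where=row_sums != 0)
print(toy_probs)
```

Each row with at least one count now adds up to 1, and the unseen state stays at zero instead of causing a division by zero.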

## Convert transition counts to probabilities 

Remember the transition probabilities for our HMM? They are the probabilities of being in each state at position x+1, given the state we were in at position x.

In [None]:
transition_probs = numpy.zeros(transition_counts.shape)

for row_num in range(transition_counts.shape[0]):
    row_sum = numpy.sum(transition_counts[row_num])
    if row_sum != 0:
        transition_probs[row_num] = transition_counts[row_num]/row_sum

## Convert start counts to probabilities 

In [None]:
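# Each training sequence (row) contributes exactly one start count, so dividing by
# the total gives the probability of a sequence starting in each state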
start_probs = start_counts / numpy.sum(start_counts)

# The Viterbi Algorithm

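First, we define a function that will convert our nucleotides to integers. This is important because the Viterbi function uses each nucleotide as a column index into the emission probability array, and array indices must be integers.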

In [None]:
# The input to our function is some sequence of DNA we're looking to encode
def encode_DNA(DNA_seq):

    encoded_seq = numpy.zeros(len(DNA_seq),dtype=int)

    # Here we use the range function to go through the entire length of the DNA_seq. 
    # The `for loop` takes us from the beginning of the sequence all the way to the end. 
    for i in range(len(DNA_seq)):
        # The nucleotide considered in each iteration of the for loop is 
        # the ith letter in the sequence. So DNA_seq[2] gets us the 3rd letter in the sequence. 
        nucleotide = DNA_seq[i]
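        # Look up the integer that stands for this nucleotide
        nuc_index = get_nuc_index[nucleotide]
        # Store that integer at the same position in the encoded sequence
        encoded_seq[i] = nuc_index

    # We have a function `return` a value that we want to use for further analysis.
    # Here we return a 1-D numpy array of integers, one per nucleotide.
    return encoded_seq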
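
For comparison, here is a more compact sketch of the same encoding, along with what the result looks like for a short made-up sequence. The name `encode_DNA_compact` is invented for this example; it is meant to behave like `encode_DNA` above.

```python
import numpy

get_nuc_index = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def encode_DNA_compact(DNA_seq):
    """Look up every nucleotide in one line and return an integer array."""
    return numpy.array([get_nuc_index[nuc] for nuc in DNA_seq], dtype=int)

toy_encoded = encode_DNA_compact('GATTACA')
print(toy_encoded)  # [2 0 3 3 0 1 0]
```

Either version raises a `KeyError` if the sequence contains a character that is not in `get_nuc_index`, such as `N` or a lowercase letter.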


Now we define a function that takes our start probabilities, transition probabilities, emission probabilities, and the encoded DNA sequence, and runs the Viterbi algorithm. Remember that Viterbi gives us the most probable path through a sequence of states, given a sequence of emissions (in this case, given a DNA sequence).

In the function below, `s_probs` are the start probabilities (the probability of each state at the first position of a sequence), `t_probs` are the transition probabilities, and `e_probs` are the emission probabilities.

In [None]:
# The inputs to our function are the three sets of probabilities and the DNA sequence,
# encoded as a numpy array of integers. This is the output of our above function, `encode_DNA`. 
def viterbi(s_probs, t_probs, e_probs, encoded_DNA_seq):

    # To figure out the length of our DNA sequence, we use the `shape` attribute to get the 
    # dimensions of our encoded DNA. The encoded sequence is one-dimensional, so `shape`
    # holds a single value. Taking the 0th term gives us the number of nucleotides. 
    DNA_length = encoded_DNA_seq.shape[0]
    # We do the same thing to get the number of possible states. 
    num_states = s_probs.shape[0]

    # Initialize the matrices (2-D arrays with one row per state and one column per position)
    # We make a matrix to store the traceback - the path through the states based on 
    # our emissions. (Think of that rectangle chart used in week 1, where we have some
    # probability for each position.)
    traceback_matrix = numpy.zeros((num_states,DNA_length), dtype=int)
    # The first position has no previous state, so we mark that column with -1.
    # (An integer array cannot hold `numpy.nan`, so assigning it would raise an error.)
    traceback_matrix[:,0] = -1

    # We do the same for our probabilities 
    probability_matrix = numpy.zeros((num_states,DNA_length))

    # Compute the probability and traceback matrices
    # `position` is just the term we use to refer to each step, as we go from the beginning of 
    # `DNA_length` to the end. 
    for position in range(DNA_length):
        # Name the variable `nucleotide` to be the value at the given position in our 
        # encoded DNA sequence
        nucleotide = encoded_DNA_seq[position]
        # If we're at the first position, kick it off! 
        if position == 0:
            # For any state in the series of states
            for state in range(num_states):
                # For the state, fill in our empty probability_matrix in the correct cell with 
                # the correct probability: the probability of being in that state times
                # the probability of emitting that nucleotide if in that state
                probability_matrix[state,position] = s_probs[state] * e_probs[state,nucleotide]
        # If we're at any position besides the first 
        else:
            # Consider each possible current state 
            for current_state in range(num_states):
                # The best previous state found so far (none yet)
                max_previous_state = None
                # The probability of the best path found so far (none yet)
                max_probability = None
                # We walk through the states, using the term `previous_state` 
                for previous_state in range(num_states):
                    # Compute the probability of the path of interest: the probability from the previous cell times
                    # the probability of getting to this state from the previous state times
                    # the probability of emitting this nucleotide from the current state
                    path_prob = (probability_matrix[previous_state,position-1]
                                 * t_probs[previous_state, current_state]
                                 * e_probs[current_state, nucleotide])
                    # We keep only the path with the higher probability 
                    if max_probability is None or path_prob > max_probability:
                        max_previous_state = previous_state
                        max_probability = path_prob
                # Update the probability matrix using the newly computed maximum probability 
                probability_matrix[current_state, position] = max_probability
                # Update the traceback matrix (the path through states) using 
                # the previous state that gave that maximum probability
                traceback_matrix[current_state, position] = max_previous_state

    # Navigate the traceback matrix
    # We find the maximum value in the probability matrix, considering
    # every row in the last column
    max_path_probability = numpy.max(probability_matrix[:,-1])
    # `argmax` gives us the row (state) where that maximum is found
    max_end_state = numpy.argmax(probability_matrix[:,-1])

    # Make an array of zeros to hold the most likely state at each position
    max_path = numpy.zeros(DNA_length, dtype=int)

    # Start from the most likely final state
    current_state = max_end_state
    # We use the `range` function to walk through the DNA sequence backwards. 
    # We go from the last position (DNA_length-1) down to the first (0)
    for i in range(DNA_length-1, -1, -1): # could also use a while loop
        # Record the state on the best path at this position
        max_path[i] = current_state
        # We replace our current state with the value in our `traceback_matrix` at the 
        # position of interest: the best previous state
        current_state = traceback_matrix[current_state, i]

    # The function returns the probability of the most likely path, `max_path_probability`,
    # and the most likely path itself, `max_path` 
    return max_path_probability, max_path
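
One practical concern with multiplying many probabilities together is that, for long sequences, the products can become too small for the computer to represent and get rounded to zero. A common workaround is to add logarithms instead of multiplying probabilities. The sketch below is one possible log-space variant, not the version used in this notebook. `viterbi_log` and the two-state toy parameters are made up for this example.

```python
import numpy

def viterbi_log(s_probs, t_probs, e_probs, encoded_seq):
    """Log-space Viterbi sketch. Returns (log probability, best path)."""
    # log(0) is -inf, which correctly marks impossible events
    with numpy.errstate(divide='ignore'):
        log_s = numpy.log(s_probs)
        log_t = numpy.log(t_probs)
        log_e = numpy.log(e_probs)

    num_states = log_s.shape[0]
    seq_length = encoded_seq.shape[0]
    log_prob = numpy.full((num_states, seq_length), -numpy.inf)
    traceback = numpy.full((num_states, seq_length), -1, dtype=int)

    # First position: start probability plus emission probability (in logs)
    log_prob[:, 0] = log_s + log_e[:, encoded_seq[0]]
    for position in range(1, seq_length):
        # candidates[previous, current] = best path into `previous` + transition
        candidates = log_prob[:, position - 1][:, None] + log_t
        traceback[:, position] = numpy.argmax(candidates, axis=0)
        log_prob[:, position] = (numpy.max(candidates, axis=0)
                                 + log_e[:, encoded_seq[position]])

    # Walk backwards from the best final state
    path = numpy.zeros(seq_length, dtype=int)
    path[-1] = numpy.argmax(log_prob[:, -1])
    for position in range(seq_length - 1, 0, -1):
        path[position - 1] = traceback[path[position], position]
    return numpy.max(log_prob[:, -1]), path

# Made-up two-state, two-symbol model for illustration
toy_s = numpy.array([0.5, 0.5])
toy_t = numpy.array([[0.8, 0.2], [0.2, 0.8]])
toy_e = numpy.array([[0.9, 0.1], [0.1, 0.9]])
toy_obs = numpy.array([0, 0, 1, 1])

toy_log_prob, toy_path = viterbi_log(toy_s, toy_t, toy_e, toy_obs)
print(toy_path)                 # [0 0 1 1]
print(numpy.exp(toy_log_prob))  # approximately 0.042
```

Because the logarithm preserves order, the path with the largest log probability is also the path with the largest probability, so both versions should pick the same path whenever the plain version does not run into rounding problems.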

We check our work by using a test sequence! 

In [None]:
# We make up a strand of DNA and name it `test_seq`
test_seq = 'CATGAGCTCTCGAGATCGATAGCTCTCGAGATGCGATATACGCTCGCGATGCATGCACTC'

# We encode our strand using the `encode_DNA` function and save that output in the variable
# `encoded_test_seq` to be used later
encoded_test_seq = encode_DNA(test_seq)

# We run the Viterbi Algorithm function, using the inputs `start_probs`, `transition_probs`,
# `emission_probs` from our training data and the encoded DNA output from the above function
viterbi_results = viterbi(start_probs, transition_probs, emission_probs, encoded_test_seq)

# The Viterbi Algorithm function returns two values: the `max_path_probability` and the 
# `max_path`. We're interested in what that path is - not really its probability. 
# Remember that python starts counting at zero. So since we want the second value returned
# from the Viterbi Algorithm function, `max_path`, we index the `viterbi_results` 
# asking for the [1] term. 
print(viterbi_results[1])
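
The path that gets printed is a list of integers, which is hard to read. One way to turn it back into state names is to flip the `get_state_index` dictionary around. In this sketch, `get_state_name` and `toy_path` are names made up for the example, and the path values are invented rather than real Viterbi output.

```python
import numpy

get_state_index = {
    'Intergenic': 0, 'Start1': 1, 'Start2': 2, 'Start3': 3,
    'Exon1': 4, 'Exon2': 5, 'Exon3': 6,
    'Intron1': 7, 'Intron2': 8, 'Intron3': 9
}

# Flip the dictionary: integer -> state name
get_state_name = {index: name for name, index in get_state_index.items()}

# A made-up path, standing in for the `max_path` returned by `viterbi`
toy_path = numpy.array([0, 0, 1, 2, 3, 4])

toy_names = [get_state_name[int(i)] for i in toy_path]
print(toy_names)
# ['Intergenic', 'Intergenic', 'Start1', 'Start2', 'Start3', 'Exon1']
```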