# Hidden Markov Model (HMM) Workshop 
## Sara Carioscia and Dylan Taylor
### Hosted by Agara Bio

Here you'll write your pseudocode for each of our prompts as we discuss them. Then you'll follow along as we write the code, or copy the relevant code from `HMM_Building_Key` that corresponds to your pseudocode. We'll go through this step by step to make sure we all understand a) in theory, what you'd like your model to do and b) how the actual Python code implements your vision.

## Step 1: Getting started
Because we'll be working primarily with arrays, we'll need to import the `numpy` library.

In [None]:
import numpy

The data is encoded as indices so that we can more easily look up the values we need from our various probability arrays (start, emission, and transition). We won't be interacting directly with the dictionaries below, but they show you how the data is encoded.

In [None]:
get_nuc_index = {
    'A' : 0,
    'C' : 1,
    'G' : 2,
    'T' : 3
}

get_state_index = {
    'intergenic' : 0,
    'start1' : 1,
    'start2' : 2,
    'start3' : 3,
    'exon1' : 4,
    'exon2' : 5,
    'exon3' : 6,
    'intron1' : 7,
    'intron2' : 8,
    'intron3' : 9,
    'stop1' : 10,
    'stop2' : 11,
    'stop3' : 12
}
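To see why integer encoding is convenient, here is a minimal sketch. The helper `encode_sequence` and the table `toy_emissions` are made up for illustration and are not part of the workshop code; the nucleotide mapping is the same as `get_nuc_index`.

```python
import numpy

# Same nucleotide encoding as get_nuc_index above.
nuc_index = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def encode_sequence(seq):
    """Hypothetical helper: convert a DNA string into an array of indices."""
    return numpy.array([nuc_index[base] for base in seq])

encoded = encode_sequence('GATTACA')

# A toy emission table with made-up values: 2 states x 4 nucleotides.
toy_emissions = numpy.array([
    [0.25, 0.25, 0.25, 0.25],
    [0.10, 0.40, 0.40, 0.10],
])

# The encoded base is used directly as a column index, no lookup needed.
prob_first_base_state1 = toy_emissions[1, encoded[0]]
```

Because each base is already a column index, looking up a probability is a single array access.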

## Step 2: Read in the data
Now we need to read in the encoded training data (both DNA sequences and state paths) and store them in numpy arrays.

In [1]:
DNA_training_data = numpy.load('training_data/HMM_DNA_training.npy')
State_training_data = numpy.load('training_data/HMM_State_training.npy')


## Step 3: Learn training values to use in our model
From the data we just read in, we need to figure out start counts, emission counts, and transition counts.

![Inferring Probabilities](images/inferring_probabilities.png)

If we store this data in arrays, how big will each array need to be?

In [None]:
# emission counts

In [None]:
# transition counts

In [None]:
# start counts
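One way to answer the size question, as a sketch: with 13 states and 4 nucleotides, the shapes follow from what each array is indexed by. The layout below (rows are states, columns are nucleotides or destination states) is an assumed convention; the workshop key may arrange its arrays differently.

```python
import numpy

n_states = 13  # number of entries in get_state_index
n_nucs = 4     # number of entries in get_nuc_index

# Assumed layout: emission_counts[state, nucleotide]
emission_counts = numpy.zeros((n_states, n_nucs))

# Assumed layout: transition_counts[from_state, to_state]
transition_counts = numpy.zeros((n_states, n_states))

# One count per state a sequence can start in
start_counts = numpy.zeros(n_states)
```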

Now that we've initialized our arrays, let's actually pull these counts from the training data arrays.

In [None]:
# Can we iterate through both the DNA_training_data array
# and the State_training_data array at the same time?
# Hint: Are they the same shape?
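A minimal sketch of the counting loop on toy data. It assumes both training arrays are 2D with one row per sequence and identical shapes, which is a guess about the real files. The function name `count_events` and the toy arrays are hypothetical.

```python
import numpy

def count_events(dna, states, n_states, n_nucs):
    """Count starts, emissions, and transitions from paired arrays."""
    emission_counts = numpy.zeros((n_states, n_nucs))
    transition_counts = numpy.zeros((n_states, n_states))
    start_counts = numpy.zeros(n_states)
    # zip walks through matching rows of both arrays together
    for seq, path in zip(dna, states):
        start_counts[path[0]] += 1
        for t in range(len(seq)):
            emission_counts[path[t], seq[t]] += 1
            if t > 0:
                transition_counts[path[t - 1], path[t]] += 1
    return emission_counts, transition_counts, start_counts

# Toy data: 2 sequences of length 4, 2 states, 4 nucleotides (made up).
toy_dna = numpy.array([[0, 1, 2, 3], [3, 3, 0, 1]])
toy_states = numpy.array([[0, 0, 1, 1], [1, 0, 0, 0]])

toy_emit, toy_trans, toy_start = count_events(
    toy_dna, toy_states, n_states=2, n_nucs=4)
```

Every position contributes one emission count, and every position after the first contributes one transition count, so the toy example yields 8 emissions and 6 transitions.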

If your counting code takes too long to run, stop the code cell and load the precomputed counts by running the cell below.

In [None]:
emission_counts = numpy.load('model_outputs/emission_counts.npy')
transition_counts = numpy.load('model_outputs/transition_counts.npy')
start_counts = numpy.load('model_outputs/start_counts.npy')

Great! Now we have our counts, but to use this information in an HMM, we need to convert it to probabilities.

In [None]:
# emission probabilities

In [None]:
# transition probabilities

In [None]:
# start probabilities
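Converting counts to probabilities amounts to dividing each count by the relevant total. The sketch below uses made-up toy counts and assumes the row-per-state layout (`[state, nucleotide]` and `[from_state, to_state]`), so each row is normalized to sum to 1.

```python
import numpy

# Toy counts (made up): 2 states, 4 nucleotides.
toy_emission_counts = numpy.array([[2., 2., 0., 4.], [0., 0., 1., 3.]])
toy_transition_counts = numpy.array([[3., 1.], [1., 1.]])
toy_start_counts = numpy.array([1., 3.])

# Divide each row by its own total; keepdims keeps the shape broadcastable.
toy_emission_probs = (
    toy_emission_counts / toy_emission_counts.sum(axis=1, keepdims=True))
toy_transition_probs = (
    toy_transition_counts / toy_transition_counts.sum(axis=1, keepdims=True))

# Start counts form a single distribution over states.
toy_start_probs = toy_start_counts / toy_start_counts.sum()
```

If a state never appears in the training data, its row total is zero and this division produces `nan`. Adding a small pseudocount to every cell before normalizing is one common way around that.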

## Step 4: The Forward Algorithm
Because we're potentially going to be using the forward algorithm on multiple models (spoiler alert: we will be), we'd like to make our algorithm a function, so we can use it multiple times without having to rewrite code.

![Forward Algorithm](forward_alg_graphic.png)

In [None]:
def forward_alg():  # What inputs does our function need to take?
    pass  # Replace this line with your implementation
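For reference, here is one way the forward algorithm can be sketched. This is not necessarily how `HMM_Building_Key` does it: the function name `forward_log_prob`, its argument order, and the layouts `transition_probs[from_state, to_state]` and `emission_probs[state, nucleotide]` are all assumptions. It also rescales at each step and returns a log probability, a design choice made here to avoid numerical underflow on long sequences.

```python
import numpy

def forward_log_prob(seq, start_probs, transition_probs, emission_probs):
    """Return log P(seq | model), summed over all possible state paths."""
    # alpha[s] is proportional to P(bases so far, current state = s)
    alpha = start_probs * emission_probs[:, seq[0]]
    log_prob = 0.0
    for t in range(1, len(seq)):
        # Rescale alpha to sum to 1 and remember the factor in log space
        scale = alpha.sum()
        log_prob += numpy.log(scale)
        alpha = alpha / scale
        # Sum over previous states, then emit the current base
        alpha = (alpha @ transition_probs) * emission_probs[:, seq[t]]
    return log_prob + numpy.log(alpha.sum())

# Toy 2-state model with made-up probabilities.
toy_start = numpy.array([0.5, 0.5])
toy_trans = numpy.array([[0.9, 0.1], [0.2, 0.8]])
toy_emit = numpy.array([[0.25, 0.25, 0.25, 0.25], [0.1, 0.4, 0.4, 0.1]])

toy_log_prob = forward_log_prob(
    numpy.array([1, 2]), toy_start, toy_trans, toy_emit)
```

For a two-base sequence the result can be checked by hand by summing the probability of all four state paths.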

Theoretically, our function is working! Let's run it on a test sequence to make sure. We can load in a 400 bp sequence to test it on. This sequence, like the training data, has already been encoded for us, so we don't need to worry about that.

In [None]:
test_seq = numpy.load('training_data/DNA_test_sequence.npy')

Now let's run our function on our `test_seq` along with the right probabilities.

In [None]:
full_model_test_prob = forward_alg()
# What arguments are we going to pass to our function?

## Step 5: Comparing to a null model
This test sequence comes from an exonic region in a real human gene, so we would hope that our model would be able to pick up on that. The probability we get from the forward algorithm on its own isn't very useful, so we need to compare it to the probability from a null model. Let's make a null model to run our test sequence on.

In [None]:
# What does the null model look like? What components does it have?
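One possible null model, offered as an assumption rather than the workshop's answer, is a single-state HMM that emits every nucleotide with equal probability. It has the same three components as the full model (start, transition, and emission probabilities), just with one state. The `null_*` names below are hypothetical.

```python
import numpy

# Assumed null model: one state, uniform emissions.
null_start_probs = numpy.array([1.0])            # always start in the one state
null_transition_probs = numpy.array([[1.0]])     # always stay in it
null_emission_probs = numpy.full((1, 4), 0.25)   # A, C, G, T equally likely

# Under this model every sequence of length n has probability 0.25 ** n,
# so its log probability can be computed directly.
seq_length = 400
null_log_prob = seq_length * numpy.log(null_emission_probs[0, 0])
```

Another reasonable choice would be to set the emission row to the observed nucleotide frequencies in the training data instead of uniform values.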

Now that we have our null model, let's re-run the forward algorithm on the `test_seq` using this model. Aren't you glad we made it a function?

In [None]:
null_model_test_prob = forward_alg() # What arguments are we going to pass to our function?

Now that we have our two probabilities, let's compare them using log-odds.
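The log-odds score is the log of the ratio of the two probabilities, which equals the difference of their logs. Below is a minimal sketch. The function name `log_odds` is hypothetical, and the two example probabilities are made-up numbers, not results from the workshop data.

```python
import numpy

def log_odds(full_model_prob, null_model_prob):
    """Natural-log odds of the full model versus the null model."""
    # log(a / b) computed as log(a) - log(b) to stay stable for tiny values
    return numpy.log(full_model_prob) - numpy.log(null_model_prob)

# Made-up probabilities for illustration only.
example_score = log_odds(1e-230, 1e-241)
```

A positive score means the sequence is more probable under the full model than under the null model, and a negative score means the opposite. If your forward function already returns log probabilities, simply subtract them instead of taking the log again.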