# Hidden Markov Model (HMM) Workshop 
## Intro to Python
## Sara Carioscia and Dylan Taylor
### Hosted by Agara Bio

In the lecture, we covered the fundamentals of HMMs, listed below. For a refresher, please check out `slides.pdf` in the GitHub repository (much of this is covered in the "Backup" slides).

- Uses of HMMs, both in biology (e.g., sequence characterization, alignment) and elsewhere (e.g., gait characterization, speech recognition) 
- Probability, including conditional probability (the probability of an event, given that another event has occurred)
- The *Markov assumption*, which assumes that the probability of an event depends only on the event right before it (not on the rest that came before)
- The inputs to an HMM: 
  - States (e.g., genic or intergenic region)
  - Emission probabilities (the probability of showing each nucleotide, given the state we're in)
  - Transition probabilities (the probability of being in a certain state for position 2, given the state we were in for position 1)
- Preparing our HMM with training data and counting 
- Two algorithms used in an HMM:
  - Viterbi Algorithm: find the most likely sequence of states, given our model and a sequence of emissions (e.g., nucleotides)
  - Forward Algorithm: find the total probability of getting a sequence of emissions (e.g., nucleotides), given our model 

In this notebook, we're taking that theoretical information and actually implementing it in Python. Here, we'll introduce the Python skills and structures you'll need to build your HMM. For a more detailed Python introduction, which covers what you need for the HMM and just a bit more, check out another [repository](https://github.com/dtaylo95/A-Computational-Approach-to-CRISPR-Reagent-Design/tree/main/intro_notebooks) we made with introductory Python notebooks. 

# Introduction to Python Basics

## Variables

A variable is a name that you give to some piece of data in Python. The name can be whatever you want, with some restrictions. It can't contain spaces, can't start with a number, and shouldn't reuse the name of one of Python's built-in functions or keywords. For example, I could create a variable called `my_var`.

The data that you store in a variable can be an integer (`3`), a float (`3.0`), a string (`'Hello world'`), or any other data type Python can handle. We'll talk about some other examples of data types later.

We can "assign" our variables using the `=` operator. So I can assign the value of `my_var` to be the integer `4` by doing the following:

`my_var = 4`

Once we assign a variable, we can get the associated value by just typing the variable name instead. For all intents and purposes, the variable *IS* that value.


In [None]:
my_var1 = 4
fruit = 'Hello'
salad = 6.3

To check out the value of any of your variables, you can use the `print` function.

In [None]:
print(my_var1)

We can also modify the values of variables.

In [None]:
my_var1 = 'world'
print(my_var1)

salad = salad + 1.7
print(salad)

salad += 2
print(salad)
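
To see which data type a variable currently holds, Python's built-in `type()` function can help. The short sketch below uses hypothetical variable names (`count`, `fraction`, `greeting`) chosen only for illustration. It also shows that reassigning a variable can change its type.

```python
# Hypothetical variable names, used only for this illustration
count = 3                 # an integer
fraction = 3.0            # a float
greeting = 'Hello world'  # a string

print(type(count))
print(type(fraction))
print(type(greeting))

# Reassigning a variable can also change the type of data it holds
count = 'three'
print(type(count))
```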

## Lists and Arrays

A list is a vector of values of any type: one can be a string, followed by an integer, followed by a float... it doesn't matter. A list has just one dimension.

A list uses square brackets `[]` and values in the list are separated by commas.

In the example below, our list holds three string values and one integer value. We can access the items in the list by indexing. Just remember that Python indexes from zero: the item our eyes see first is actually the zero-th item.

To index the list, we just use the variable name of the list, followed by `[]` with the index we want. So for the "zero-th" item:

In [None]:
sample_list = ["my", "dog", "is", 1]

print(sample_list[0])

Sometimes we need to see how long our list actually is. To do this, we use the `len()` function. You can either just query the length of the list, or you can assign that length value to a variable to be used later.

The format is `len({list to check})`

In [None]:
#How long is our list?
len(sample_list)

In [None]:
#Save the length to be used later 
#Here, we save the variable `len_sample_list` as the length of our list
len_sample_list = len(sample_list)
#Show us that length
print(len_sample_list)
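
Because Python indexes from zero, the last item of a list sits at index `len(list) - 1`, not at `len(list)`. Here is a small sketch using the same `sample_list`, where `last_index` is a name introduced just for this example:

```python
sample_list = ["my", "dog", "is", 1]

# The list has 4 items, so the valid indexes are 0, 1, 2, and 3
last_index = len(sample_list) - 1
print(last_index)

# Indexing with last_index gives us the final item in the list
print(sample_list[last_index])
```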

An array is like a list, in that it is a vector of values, but unlike a list, all of these values must be of the same type: they're all strings, or all integers, etc.

The benefit that an array offers us over a list is that an array can have *multiple* dimensions, like a matrix. Like the `len()` function for lists, we can use `.shape` to find the dimensions of an array.

Arrays are not a native feature of Python, so we are using the `numpy` package to access all the tools we need to work with the `array` class. We thus need to be sure we import the `numpy` package.

We can create an array from a list, using the following syntax: `numpy.array({list to convert})`

In [None]:
import numpy

In [None]:
#One-dimensional array
one_dim = numpy.array([1,2,3,4,5])
#Tell us how many rows and columns are in our array. Looking at the above, what do we expect? 
one_dim.shape
#The result (5,) tells us the array has a single dimension containing 5 values

A two-dimensional array can be created using a *list of lists*. Remember how lists can hold any datatype? They can also hold other lists! As long as these sub-lists have a single datatype and have the same length, the list can be made into an array.

In [None]:
#Two-dimensional array 
two_dim = numpy.array([[1,2,3],[4,5,6]])
#How many rows and columns? 
two_dim.shape
#The result (2,3) tells us the array has 2 rows and 3 columns

In [None]:
# We can index the shape, just like we index a list
print(two_dim.shape[0])
print(two_dim.shape[1])
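
We can index the array itself as well as its shape. For a two-dimensional array, we give the row index and the column index, separated by a comma. A minimal sketch using the same `two_dim` array:

```python
import numpy

two_dim = numpy.array([[1, 2, 3], [4, 5, 6]])

# Row 0, column 1 (remember that both indexes start at zero)
print(two_dim[0, 1])

# A single index gives us an entire row
print(two_dim[1])

# Row 1, column 2 is the bottom-right value
print(two_dim[1, 2])
```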

## Dictionaries

And one last data storage method... Dictionaries have tons of uses, but for today we're primarily going to be using them to encode our data.

A dictionary lets you pair two pieces of data, called the `key` and the `value`. If you "look up" your key in the dictionary, it will return the associated value.

A dictionary is enclosed in curly brackets `{}`, each key/value pair is written `key : value`, and pairs are separated by commas.

In today's workshop, it will be useful to encode our nucleotide observations and states as integers. We can do this with a dictionary.

So if we wanted to encode 'A', 'C', 'G', and 'T' as 0, 1, 2, and 3 respectively, we could create the below dictionary.

In [None]:
encode_dict = {
    'A' : 0,
    'C' : 1,
    'G' : 2,
    'T' : 3
}
print(encode_dict)

In [None]:
#How do we encode 'C' as an integer?
print(encode_dict['C'])
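
The same idea works for states. As a sketch, assuming we have the two example states from the lecture (genic and intergenic), we could encode them as shown below. The name `state_dict` and the particular integer assigned to each state are our own choices for illustration.

```python
# Hypothetical encoding of two example states as integers
state_dict = {
    'genic' : 0,
    'intergenic' : 1
}

# Look up the integer for a state
print(state_dict['intergenic'])

# len() also works on dictionaries and counts the key/value pairs
print(len(state_dict))

# We can check whether a key exists before looking it up
print('genic' in state_dict)
```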

## For Loops

Often, it is useful to be able to walk through an entire set of data and perform the same (or similar) actions on that data. For this purpose we can use something called a `for` loop.

Let's say we have the following list:

In [None]:
my_list = [1,2,3,4,5,6]

We want to run through this list and print each element of the list. To do so we use the following structure:

In [None]:
for x in my_list:
    print(x)

There are a few things to note here.

The `x` is a temporary variable. Each time you iterate through the list, `x` gets "set" as the next element in the list. As with other variables, you can (mostly) name this variable whatever you want.

The structure of the for loop is:

```
for {temporary variable} in {thing to loop through}:
    {do something}
```

The instructions of what to do on each iteration come after the first line and are indented by a single `tab` (or, by convention, four spaces).

The `for` loop starts at the beginning of the list, setting `x` as the first element of the list, in this case `1`. Then it does the instructions we gave it. In this case, `print(x)` which will print `1`. Because our instructions are now over, it will move to the next element in the list, setting `x` as `2` and so on. It will continue until the end of the list.

What if we wanted to print each element of the list multiplied by 2?

In [None]:
for x in my_list:
    print(2*x)

Once we un-indent, we're no longer in the `for` loop.

In [None]:
for x in my_list:
    print(x)

print('Hello')
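
A `for` loop combines nicely with a dictionary. The sketch below walks through a DNA sequence one nucleotide at a time and encodes each one as an integer. A few assumptions: the sequence `'GATTACA'` is a made-up example, looping over a string gives us one character at a time, and we use the list method `.append()` (not covered above) to add an item to the end of a list.

```python
encode_dict = {'A' : 0, 'C' : 1, 'G' : 2, 'T' : 3}

# Made-up example sequence
sequence = 'GATTACA'

# Start with an empty list and add one encoded value per nucleotide
encoded = []
for nucleotide in sequence:
    encoded.append(encode_dict[nucleotide])

print(encoded)
```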

## The range() Function

The `range()` function returns something like a list of sequential numbers. The syntax is `range(start, stop, step)`. The stop is *not* included in the sequence.

We typically use `range()` with a `for` loop to generate sequential integers that we can use to index a list or array.

In [None]:
#We have a series of numbers from 1 through 10 (not including 10) 
#We want to take every other number (step in groups of 2)
sample_range = range(1, 10, 2)
for value in sample_range:
    #What should this show us?
    print(value)

In [None]:
# This is an array rather than a list, so we name it accordingly
my_array = numpy.array([[1,0,1],[2,3,4]])
print(my_array)
print(my_array.shape)

# How can we iterate through the rows of the array?
for i in range(my_array.shape[0]):
    print(my_array[i])
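
The `start` and `step` arguments of `range()` are optional: if we leave them out, `start` defaults to 0 and `step` defaults to 1. To see the numbers a range produces, we can convert it to a list with `list()`. A quick sketch:

```python
# Only a stop: counts from 0 up to (but not including) 5
print(list(range(5)))

# A start and a stop: counts from 2 up to (but not including) 6
print(list(range(2, 6)))

# A start, a stop, and a step: every other number from 1 up to 10
print(list(range(1, 10, 2)))
```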

## Conditionals

Sometimes, we want to check the value of a variable and carry out one set of instructions if it meets a condition, and another set of instructions if it doesn't (or meets another condition).

We can do so using the `if`, `else` structure.

```
if {some conditional}:
    {do something}
else:
    {do something else}
```

There are tons of conditions we can check, but the basic ones are:

`x < y`  is x less than y<br>
`x <= y` is x less than or equal to y<br>
`x > y`  is x greater than y<br>
`x >= y` is x greater than or equal to y<br>
`x == y` is x equal to y (note that this is different from `=` which is used for variable assignment)

We first check a condition using `if`. For example, we could check whether our variable is smaller than 20, and print something if it is.

In [None]:
my_var = 12

if my_var < 20:
    print('This is a small number')

As it is now, if our variable is greater than or equal to 20, our code doesn't do anything and will just move on. If we want to do something else if this is the case, we can use an `else` statement.

In [None]:
my_var = 12

if my_var < 20:
    print('This is a small number')
else:
    print('This is a big number')

Let's say we're looping through a list using the `range()` function, and we want to print each value of the list *unless* it's the first element, in which case we want to print the value multiplied by 2.

In [None]:
my_list = [5,4,3,2,1]

for i in range(len(my_list)):
    # i is the index, so we use my_list[i] to get the value at that index
    if i == 0:
        print(2*my_list[i])
    else:
        print(my_list[i])
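
If we need to check more than two cases, Python also offers `elif` (short for "else if"), which sits between the `if` and the `else`. This goes a little beyond the structure described above, so treat it as an optional sketch. The variable `size` is a name made up for this example.

```python
my_var = 20

# Conditions are checked from top to bottom, and only the first
# one that is true has its instructions carried out
if my_var < 20:
    size = 'small'
elif my_var == 20:
    size = 'exactly 20'
else:
    size = 'big'

print(size)
```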

## Functions

The last topic we need to discuss is functions.

At its most basic, a function takes some number of inputs and *returns* some number of outputs (typically based on those inputs). A function has the following structure:

```
def my_function({inputs separated by commas}):
    {do something}
    return {outputs}
```

First, we name our function. Above, I named it `my_function`. Then, we decide on the inputs we need, in parentheses and separated by commas.

The body of our function contains the instructions for our function, which will often be manipulating the inputs in some way.

At the end of the body of our function, we have a `return` statement that will "output" what we tell it to. This output can be stored as a variable, used in a conditional, added to a list, etc.

Let's write a simple function that takes two numbers (`x` and `y`) and adds them together.

In [None]:
def our_add_function(x,y):
    added_value = x + y
    return added_value

Now we call our function on two numbers and store that output in a variable we call `added_number`. 

In [None]:
added_number = our_add_function(3,5.5)

print(added_number)

We can also write a function that returns two outputs.

Let's write a function that takes two numbers (`x` and `y`) as inputs and returns their sum, and also returns their product.

In [None]:
def our_double_function(x,y):
    return x+y, x*y

As before, let's call our function on two numbers and store the output in a variable.

In [None]:
results = our_double_function(2,3)

# Remember Python indexes from zero, so the "first" term we see with our eyes is 
# the zeroth term.
print(results[0])
print(results[1])
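
Instead of storing both outputs in one variable and indexing it, we can also assign each output to its own variable in a single line. A minimal sketch, where `total` and `product` are names chosen for this example:

```python
def our_double_function(x, y):
    return x + y, x * y

# The first output goes to `total`, the second to `product`
total, product = our_double_function(2, 3)

print(total)
print(product)
```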