# Course Project: What's this gene?

In this project you will implement a de novo sequence assembly algorithm. You will be given a small sample of fragments (reads, e.g. from an Illumina-type sequencing machine) covering part of a well-known protein-encoding human gene. Your task is to assemble the reads into a single sequence and then run an online [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) search to find out what the gene is.

This project draws on all of the techniques you will learn in the course, and we will give you opportunities to work on it as the course progresses.

---

## Background

We will use a ["greedy algorithm"](https://en.wikipedia.org/wiki/Sequence_assembly#Greedy_algorithm) for sequence assembly because it is straightforward to understand and implement, yet gives results good enough to solve the challenge. The steps of the algorithm are:

1. Calculate pairwise alignments of all fragments.
1. Choose two fragments with the largest overlap.
1. Merge chosen fragments.
1. Repeat steps 2 and 3 until only one fragment is left or no more fragments can be merged.
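
The steps above can be sketched on a toy example. This is only a minimal illustration, not the course's implementation: it assumes reads overlap exactly (a suffix of one read equals a prefix of another) instead of scoring full pairwise alignments, and the names `overlap`, `greedy_assemble` and `toy_reads` are made up for this sketch.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads):
    """Repeatedly merge the pair of reads with the largest exact overlap."""
    reads = list(reads)
    while len(reads) > 1:
        # Steps 1 and 2: score every ordered pair and remember the best one.
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(reads):
            for j, right in enumerate(reads):
                if i != j:
                    size = overlap(left, right)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps, so we cannot merge any more
        # Step 3: merge the chosen pair, keeping the overlapping part only once.
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical toy reads, much shorter than the real fragments.
toy_reads = ["ATGCGT", "CGTTAC", "TACGGA"]
print(greedy_assemble(toy_reads))
```

Real reads can contain sequencing errors, which is one reason the project scores overlaps with alignments instead of exact string matches.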

In the following code cell, we have supplied the fragments you should assemble the sequence from.

In [None]:
fragments = [
    "GAATTAGATAAATTCAAATTAGACTTAGGAAGGAATGTTCCCAATAGTAGACATAAAAGTCTTCGCACAGTGAAAACTAAAATGGATCAAGCAGATGATG",
    "TTCTTCAGAAGCTCCACCCTATAATTCTGAACCTGCAGAAGAATCTGAACATAAAAACAACAATTACGAACCAAACCTATTTAAAACTCCACAAAGGAAA",
    "TGCCTATTGGATCCAAAGAGAGGCCAACATTTTTTGAAATTTTTAAGACACGCTGCAACAAAGCAGATTTAGGACCAATAAGTCTTAATTGGTTTGAAGA",
    "CTCCACAAAGGAAACCATCTTATAATCAGCTGGCTTCAACTCCAATAATATTCAAAGAGCAAGGGCTGACTCTGCCGCTGTACCAATCTCCTGTAAAAGA",
    "CAACATTTTTTGAAATTTTTAAGACACGCTGCAACAAAGCAGATTTAGGACCAATAAGTCTTAATTGGTTTGAAGAACTTTCTTCAGAAGCTCCACCCTA",
    "GGCTGACTCTGCCGCTGTACCAATCTCCTGTAAAAGAATTAGATAAATTCAAATTAGACTTAGGAAGGAATGTTCCCAATAGTAGACATAAAAGTCTTCG",
    "CAAAGAGAGGCCAACATTTTTTGAAATTTTTAAGACACGCTGCAACAAAGCAGATTTAGGACCAATAAGTCTTAATTGGTTTGAAGAACTTTCTTCAGAA",
    "AGCAAGGGCTGACTCTGCCGCTGTACCAATCTCCTGTAAAAGAATTAGATAAATTCAAATTAGACTTAGGAAGGAATGTTCCCAATAGTAGACATAAAAG",
    "TCAGCTGGCTTCAACTCCAATAATATTCAAAGAGCAAGGGCTGACTCTGCCGCTGTACCAATCTCCTGTAAAAGAATTAGATAAATTCAAATTAGACTTA",
    "AATCTCCTGTAAAAGAATTAGATAAATTCAAATTAGACTTAGGAAGGAATGTTCCCAATAGTAGACATAAAAGTCTTCGCACAGTGAAAACTAAAATGGA",
    "ATGCCTATTGGATCCAAAGAGAGGCCAACATTTTTTGAAATTTTTAAGACACGCTGCAACAAAGCAGATTTAGGACCAATAAGTCTTAATTGGTTTGAAG",
    "GAATGTTCCCAATAGTAGACATAAAAGTCTTCGCACAGTGAAAACTAAAATGGATCAAGCAGATGATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAGT"
]

---

## Day 1: Getting started
You already have enough Python knowledge to start implementing our sequence assembly program. Let's start off by breaking down the problem into more manageable parts (this is a skill you will develop very quickly while programming).

Step 1 says we need to compute [pairwise alignments](https://en.wikipedia.org/wiki/Sequence_alignment#Pairwise_alignment). To do this we need two things: a way to score alignments, and a way to generate alignments. A simple way to score an alignment is the [_edit distance_](https://en.wikipedia.org/wiki/Levenshtein_distance), which is the minimum number of changes (character insertions, deletions, or substitutions) required to transform one string into another.

### Edit distance: an example
Imagine I want to find the edit distance between the strings "kitten" and "sitting":

1. **k**itten → **s**itten (substitution of "s" for "k")
2. sitt**e**n → sitt**i**n (substitution of "i" for "e")
3. sittin → sittin**g** (insertion of "g" at the end).

So the _edit distance_ between "kitten" and "sitting" is 3.
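
The three edits can be replayed with ordinary string slicing. This is just an illustrative sketch (the variable names are invented here); it shows that three edits are enough to turn one word into the other, not that three is the minimum.

```python
steps = ["kitten"]

# 1. Substitute "s" for "k" at position 0.
steps.append("s" + steps[-1][1:])
# 2. Substitute "i" for "e" at position 4.
steps.append(steps[-1][:4] + "i" + steps[-1][5:])
# 3. Insert "g" at the end.
steps.append(steps[-1] + "g")

number_of_edits = len(steps) - 1
print(steps)
print(number_of_edits)
```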

Your first task is to write the beginnings of a function to compute the edit distance. Instead of whole strings, the inputs will be two single characters. Please fill in the template provided for you and ensure it passes the test cases:

In [None]:
def edit_distance(queryA, queryB):
    if _:
        return _
    
    return _
    
assert edit_distance('A', 'T') == 1
assert edit_distance('G', 'G') == 0

Don't worry, you will soon extend this to longer sequences but it's a good start for now.

Step 3 of our algorithm also requires us to "merge" fragments. Your next task is to write a merging function that takes 2 fragments and merges them end-to-end in order. You can fill in the template below and ensure it passes the test cases:

In [None]:
def merge(fragA, fragB):
    return _
    
assert merge("", "ATG") == "ATG"
assert merge("ATG", "") == "ATG"
assert merge("ATG", "CCT") == "ATGCCT"
assert merge("A", "TG") == "ATG"

Well done! You have now completed the beginnings of the sequence assembly program. It doesn't look like much now but it's a great foundation for tomorrow. Congratulate yourself, you've earned it!

---

## Day 2: Sequence assembly made easy!
![How to draw an owl](https://i.kym-cdn.com/photos/images/newsfeed/000/572/078/d6d.jpg)

Now that you can iterate over sequences, it's time to extend your _edit distance_ function to work on sequences longer than a single character. We will use a technique called [_dynamic programming_](https://en.wikipedia.org/wiki/Dynamic_programming), which solves a big problem by breaking it into smaller and smaller sub-problems until they become "trivial". For example, the "trivial problem" when computing the edit distance is the comparison of 2 characters (which we solved yesterday). Dynamic programming also involves remembering (or [_memoising_](https://en.wikipedia.org/wiki/Memoization)) partial solutions as you build them up. So how is a solution "built up"? Let's have a look at the "kitten"/"sitting" example from before:

Start by setting up a matrix:
![Step 1](images/dp_1.jpg)

Label the rows and columns:
![Step 2](images/dp_2.jpg)

Fill in the "boundary cases":
![Step 3](images/dp_3.jpg)

The general case requires you find the `min()` between three cases:
![Recurrence relation](https://wikimedia.org/api/rest_v1/media/math/render/svg/10554aecc5e56da9be4657acd75b9a67b5e8b394)

If the two characters being compared (here "S" and "K") are equal, the substitution cost is `0`, otherwise `1`. The insertion and deletion costs are always `1`, so:
![General case](images/dp_4.jpg)
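
As a worked example for a single cell, here is the arithmetic for the cell that compares "S" with "K". This is only a sketch of one step, with the three neighbouring values taken from the boundary cases; the variable names are invented for illustration.

```python
# Neighbouring values that are already known from the boundary cases.
diagonal = 0  # "" compared with ""
above = 1     # "" compared with "K"
left = 1      # "S" compared with ""

# "S" and "K" differ, so substituting one for the other costs 1.
substitution_cost = 0 if "S" == "K" else 1

cell = min(
    diagonal + substitution_cost,  # substitution (or a free match)
    above + 1,                     # insertion or deletion
    left + 1,                      # insertion or deletion
)
print(cell)  # 1
```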

Then systematically fill out the entire matrix:
![Systematically fill in the matrix](images/dp_6.jpg)

Once the matrix is complete, the bottom-right corner cell will contain the edit distance:
![Edit distance](images/dp_8.jpg)

I encourage you to try this for yourself with pencil and paper. Once you have convinced yourself that you understand how it works, you can implement it in Python. To begin, the memoisation matrix can be represented using a list of lists:

```python
memo = [
   # -  K  I  T  T  E  N
    [0, 0, 0, 0, 0, 0, 0], # -
    [0, 0, 0, 0, 0, 0, 0], # S
    [0, 0, 0, 0, 0, 0, 0], # I
    [0, 0, 0, 0, 0, 0, 0], # T
    [0, 0, 0, 0, 0, 0, 0], # T
    [0, 0, 0, 0, 0, 0, 0], # I
    [0, 0, 0, 0, 0, 0, 0], # N
    [0, 0, 0, 0, 0, 0, 0]  # G
]
```

Let's start by working out how to read and write values in this matrix. Begin by writing a function that writes the value `5` to a given row and column of a matrix. Please fill out the function below and ensure it passes the test cases:

In [None]:
def set_value_in_matrix(row, column, matrix):
    # Set matrix at (row, column) to 5
    ...
    
m = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
set_value_in_matrix(0, 1, m)
set_value_in_matrix(2, 0, m)
set_value_in_matrix(2, 2, m)
assert m[0][1] == 5
assert m[2][0] == 5
assert m[2][2] == 5

OK, great work: we can read and write to our matrix. Now let's create a matrix of the correct size for our input strings. Remember that the dynamic programming matrix has one more row than the length of the first string and one more column than the length of the second. Please fix the function that follows and ensure it passes the tests.

In [None]:
def create_matrix(inputA, inputB):
    row = [0] * 5
    m = []
    for _ in range(5):
        m.append(row.copy())
    
    return m

assert create_matrix('','') == [[0]]
assert create_matrix('dn', 'a') == [[0, 0], [0, 0], [0, 0]]
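
A side note on why the template calls `row.copy()`: the sketch below (with invented variable names) contrasts a matrix whose rows are all the same list object with one whose rows are independent copies.

```python
# Multiplying the outer list repeats a reference to ONE inner list.
aliased = [[0] * 3] * 2
aliased[0][1] = 5
print(aliased)      # [[0, 5, 0], [0, 5, 0]]

# Copying the row gives every row its own list.
row = [0] * 3
independent = [row.copy() for _ in range(2)]
independent[0][1] = 5
print(independent)  # [[0, 5, 0], [0, 0, 0]]
```

With the aliased version, writing to one cell appears to change every row, which would silently corrupt the memoisation matrix.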

Now you can initialise the matrix with the boundary cases (the edit distances against empty strings). Remember that the edit distances along the boundary are just the column or row index. For example, this is a matrix with boundary values initialised for the strings "kitten" and "sitting":

```python
memo = [
   # -  K  I  T  T  E  N
    [0, 1, 2, 3, 4, 5, 6], # -
    [1, 0, 0, 0, 0, 0, 0], # S
    [2, 0, 0, 0, 0, 0, 0], # I
    [3, 0, 0, 0, 0, 0, 0], # T
    [4, 0, 0, 0, 0, 0, 0], # T
    [5, 0, 0, 0, 0, 0, 0], # I
    [6, 0, 0, 0, 0, 0, 0], # N
    [7, 0, 0, 0, 0, 0, 0]  # G
]
```

Please fix the function that follows and ensure it passes the tests.

In [None]:
def init_matrix(inputA, inputB):
    m = create_matrix(inputA, inputB)
    rows = 0
    cols = 0
    
    for _ in range(rows):
        ...
        
    for _ in range(cols):
        ...
    
    return m

assert init_matrix('', '') == [[0]]
assert init_matrix('dn', 'a') == [[0, 1], [1, 0], [2, 0]]

Now you're finally ready to complete the implementation of your `edit_distance()` function. You can use your `init_matrix()` function. The function template is given below, please complete and fix the function and ensure it passes the test cases.

In [None]:
def edit_distance(queryA, queryB):
    m = init_matrix(queryA, queryB)
    rows = 0
    cols = 0

    for col in range(1, cols):
        for row in range(1, rows):
            if queryA[row - 1] == queryB[col - 1]:
                cost = 0
            else:
                cost = 1
            
            m[row][col] = min([m[row-1][col-1],
                               m[row][col - 1],
                               m[row - 1][col]])
            
    return m[0][0]

assert edit_distance('A', 'T') == 1
assert edit_distance('G', 'G') == 0
assert edit_distance('kitten', 'sitting') == 3
assert edit_distance('', '') == 0
assert edit_distance('ABCD', 'EFGH') == 4
assert edit_distance('ABCD', 'ZBCZ') == 2

Great work! You can now compute the edit distance between arbitrary strings!
Now it's time to generate alignments... luckily you're already most of the way there 😊

### Generating alignments


In [None]:
from project import assemble, needleman_wunsch

assemble(fragments, needleman_wunsch)