# BMI/CS 576 HW1
The objectives of this homework are to practice

* working with the basic algorithms for sequence assembly
* reasoning about graphs and paths for the sequence assembly task

## HW policies
Before starting this homework, please read over the [homework policies](https://canvas.wisc.edu/courses/167969/pages/hw-policies) for this course.  In particular, note that homeworks are to be completed *individually*.

You are welcome to use any code from the weekly notebooks in your solutions to the HW.

## PROBLEM 1: The greedy algorithm for fragment assembly (60 points)
Write a function, `greedy_assemble`, that takes as input a list of read strings and uses the greedy fragment assembly algorithm to output a *single* superstring that contains all reads as substrings. You must use the graph-based (Hamiltonian path) version of the greedy algorithm. We will assume that:
1. we are assembling a single-stranded sequence and
2. no read is a substring of any other read.

To keep things simple, for this homework, we will allow overlaps of any length (including zero).  In practice, for sequence assembly we would typically require some minimum overlap length.
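
As a minimal sketch of what "overlap" means here, the hypothetical helper `overlap_length` below (the name and approach are illustrative assumptions, not code provided by the course) returns the length of the longest suffix of one read that is also a prefix of another. It relies on the assumption above that no read is a substring of another, so the overlap is always shorter than both reads.

```python
def overlap_length(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Hypothetical helper for illustration only. Assumes neither read is a
    substring of the other, so the overlap is shorter than both reads.
    """
    # Try the longest candidate first and shrink until a match is found.
    for length in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0  # zero-length overlaps are allowed in this homework
```

For example, `overlap_length("ATCTC", "CTCAA")` is 3 (the shared `CTC`), while `overlap_length("CTCAA", "GTT")` is 0.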

### Tie-breaking criteria

To make this algorithm deterministic, we must establish tie-breaking criteria for edges in the overlap graph that have the same weight. For two edges with the same weight, we first choose the edge whose source vertex read comes first in lexicographical order. If the source vertices are identical, we choose the edge whose target vertex read comes first in lexicographical order. For example, if e1 = ATCGGA → GGAT and e2 = ATCGGA → GGAA, we will attempt to use edge e2 first because GGAA < GGAT in lexicographical order.
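
One way to realize this ordering, sketched under the assumption that edges are stored as `(weight, source_read, target_read)` tuples (a representation chosen here for illustration, not one the assignment prescribes), is to sort with a key that negates the weight so that heavier edges come first and ties fall back to Python's lexicographical string comparison.

```python
# Hypothetical edges as (weight, source_read, target_read) tuples.
edges = [
    (2, "GGAT", "ATCGGA"),
    (3, "ATCGGA", "GGAT"),  # e1 from the example above
    (3, "ATCGGA", "GGAA"),  # e2 from the example above
]


def edge_priority(edge):
    """Sort key: largest weight first, then source read, then target read."""
    weight, source, target = edge
    return (-weight, source, target)


ordered = sorted(edges, key=edge_priority)
for weight, source, target in ordered:
    print(weight, source, target)
```

With these three edges, e2 is tried before e1, and the weight-2 edge comes last.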

In [1]:
# Code for PROBLEM 1
# You are welcome to develop your code as a separate Python module
# and import it here if that is more convenient for you.
def greedy_assemble(reads):
    """Assembles a set of reads using the graph-based greedy algorithm.
    
    Args:
        reads: a list of strings
    Returns:
        A string that is a superstring of the input reads
    """
    ### BEGIN SOLUTION
    import assemble
    return assemble.greedy_assemble(reads)
    ### END SOLUTION

Tests for `greedy_assemble` are provided at the bottom of this notebook.

## PROBLEM 2: Assembling a small subset of Ebola virus reads (10 points)

Included with this notebook is the file `ebola_reads.txt`, which is a small subset of the Illumina reads used to assemble the genome of an isolate of the Ebola virus, which caused a major epidemic in West Africa.

Use your `greedy_assemble` function to assemble these reads. Once correctly assembled, these reads form a short segment of the genome of this virus. To allow your assembler to succeed, the reads have been cleaned of errors and have been oriented so that they all come from the same strand of the genome.  You may find the following function useful; it produces a list of reads from the contents of a file.

In [2]:
def read_strings_from_file(filename):
    # Use a context manager so the file is closed as soon as it is read.
    with open(filename) as f:
        return [line.rstrip() for line in f]

In [3]:
### BEGIN SOLUTION TEMPLATE=code for assembling the ebola reads
ebola_reads = read_strings_from_file("ebola_reads.txt")
ebola_assembly = greedy_assemble(ebola_reads)
print(ebola_assembly)
print(len(ebola_assembly))
### END SOLUTION

ATTAAGAAAAACTGCTTATTGGGTCTTTCCGTGTTTTAGATGAAGCAGTTGACATTCTTCCTCTTGATATTAAATGGCTACACAACATACCCAATACCCAGACGCCAGGTTATCATCACCAATTGTATTGGACCAATGTGACCTTGTCACTAGAGCTTGCGGGTTGTATTCATCATACTCCCTTAATCCGCAACTACGCAACTGTAAACTCCCGAAACATATATACCGTTTAAAATATGATGTAACTGTTACCAAGTTCTTAAGTGATGTACCAGTGGCGACATTGCCCATAGATTTCATAGTCCCAATTCTTCTCAAGGCACTATCAGGCAATGGGTTCTGTCCTGTTGAGCCGCGGTGCCAACAGTTCTTAGATGAAATTATTAAGTACACAATGCAAGATGCTCTCTTCCTGAAATATTATCTCAAAAATGTGGGTGCTCAAGAAGACTGTGTTGATGACCACTTTCAAGAAAAAATCTTATCTTCAATTCAGGGCAATGAATTTTTACATCAAATGTTTTTCTGGTATGACCTGGCTATTTTAACTCGAAGGGGTAGATTAAATCGAGGAAACTCTAGATCAACGTGGTTTGTTCATGATGATTTAATAGACATCTTAGGCTATGGGGACTATGTTTTTTGGAAGATCCCAATTTCACTGTTACCACTGAACACACAAGGAATCCCCCATGCTGCTATGGATTGGTATCAGACATCAGTATTCAAAGAAGCGGTTCAAGGGCATACACACATTGTTTCTGTTTCTACTGCCGATGTCTTGATAATGTGCAAAGATTTAATTACATGTCGATTCAACACAACTCTAATCTCAAAAATAGCAGAGGTTGAGGACCCAGTTTGCTCTGATTATCCCAATTTTAAGATTGTGTCTATGCTTTACCAGAGCGGAGATTACTTACTCTCCATATTAGGGTCTGATGGGTATAAAATCATTAAGTTTCTCGAACCATTGTGCTTGGCTAAAATTCAATTGTGC

Once you have assembled the genomic segment, use the [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) web service to search the NCBI database of proteins with your assembled sequence. You should use BLASTX with its default settings. Based on the results of your BLASTX search, which gene is contained within this genomic segment?

### BEGIN SOLUTION  TEMPLATE=*YOUR ANSWER TO PROBLEM 2 HERE*
The protein-coding gene contained within this genomic segment is a polymerase, specifically, an RNA polymerase.
### END SOLUTION

### BONUS CHALLENGE
A subset of Illumina reads from an Ebola virus genome sequencing experiment that covers the entire genome is included in the file `ebola_full_genome_reads.txt`. Can you get your code to assemble these reads in under 2 minutes? Sorry, no extra credit here, just personal satisfaction!

## PROBLEM 3: SBH graphs and Eulerian paths (20 points) 
For the following strings, (i) give the k = 3 spectrum for the string, (ii) draw the SBH graph for the spectrum, (iii) give one Eulerian path and its corresponding string for the SBH graph, and (iv) show whether or not there exists an Eulerian path in the graph that corresponds to the original string.

(a) ATGGCCTGAATCC

(b) ATAGCCTAGCAAT

### BEGIN SOLUTION  TEMPLATE=*YOUR ANSWER TO PROBLEM 3 HERE*
(a) ATGGCCTGAATCC

i. {TGG, TGA, TCC, GGC, GCC, GAA, CTG, CCT, ATG, ATC, AAT}

ii. See figure below.

![](3-1-2.PNG)

iii. The three possible Eulerian paths in this graph and their corresponding
strings are:
- ATGGCCTGAATCC: AT -> TG -> GG -> GC -> CC -> CT -> TG -> GA -> AA -> AT -> TC -> CC
- ATGAATCCTGGCC: AT -> TG -> GA -> AA -> AT -> TC -> CC -> CT -> TG -> GG -> GC -> CC
- ATCCTGAATGGCC: AT -> TC -> CC -> CT -> TG -> GA -> AA -> AT -> TG -> GG -> GC -> CC

iv. Yes, there exists an Eulerian path in the graph that corresponds to the
original string. See the first path listed in part (iii) above.
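
The answers to parts (i), (iii), and (iv) can be double-checked with a small brute-force sketch. The helper names `spectrum` and `eulerian_path_strings` are hypothetical, and the exhaustive backtracking is only an illustration suitable for graphs this small, not an efficient Eulerian path algorithm.

```python
def spectrum(sequence, k):
    """Return the set of distinct k-mers occurring in `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def eulerian_path_strings(kmers):
    """Return, in sorted order, the strings spelled by all Eulerian paths.

    Each k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix.
    Exhaustive backtracking: only sensible for very small SBH graphs.
    """
    edges = frozenset(kmers)
    results = set()

    def extend(spelled, unused):
        if not unused:
            results.add(spelled)  # every edge has been used exactly once
            return
        for kmer in unused:
            # An edge can follow only if it leaves the current end vertex.
            if spelled.endswith(kmer[:-1]):
                extend(spelled + kmer[-1], unused - {kmer})

    for first in edges:
        extend(first, edges - {first})
    return sorted(results)


original = "ATGGCCTGAATCC"
paths = eulerian_path_strings(spectrum(original, 3))
print(paths)
print(original in paths)
```

For string (a) this finds exactly the three strings listed in part (iii), and the original string is among them.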

(b)  ATAGCCTAGCAAT

i. {TAG, GCC, GCA, CTA, CCT, CAA, ATA, AGC, AAT}

ii. See figure below.

![](3-2-2.PNG)

iii.  The two possible Eulerian paths in this graph and their corresponding
strings are:
- GCCTAGCAATA: GC -> CC -> CT -> TA -> AG -> GC -> CA -> AA -> AT -> TA
- GCAATAGCCTA: GC -> CA -> AA -> AT -> TA -> AG -> GC -> CC -> CT -> TA

iv.  No. The only two Eulerian paths are listed in part (iii) above, and neither
spells the original string. More generally, every Eulerian path in this graph
must start at GC and end at TA because those are the only two unbalanced
vertices: GC has one more outgoing edge than incoming edges, and TA has one
more incoming edge than outgoing edges. Because a path spelling the original
string must start and end at AT, such a path cannot be Eulerian.
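
The balance argument can be checked mechanically. The sketch below uses a hypothetical helper, `unbalanced_vertices`, that computes outdegree minus indegree for every vertex of the SBH graph and keeps only the nonzero entries.

```python
from collections import Counter


def unbalanced_vertices(kmers):
    """Map each unbalanced vertex to its outdegree minus its indegree."""
    balance = Counter()
    for kmer in kmers:
        balance[kmer[:-1]] += 1  # the edge leaves its prefix vertex
        balance[kmer[1:]] -= 1   # the edge enters its suffix vertex
    return {vertex: diff for vertex, diff in balance.items() if diff != 0}


spectrum_b = {"TAG", "GCC", "GCA", "CTA", "CCT", "CAA", "ATA", "AGC", "AAT"}
print(unbalanced_vertices(spectrum_b))
```

For the spectrum of string (b), the only entries are GC with +1 (a forced start) and TA with -1 (a forced end); AT is balanced.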

### END SOLUTION

## PROBLEM 4: Existence of DNA sequence with complete spectrum (10 points)
For any value of $k$, does there exist a DNA sequence that contains every possible $k$-mer exactly once?  Prove your answer.  *Hint: consider the SBH graph for the spectrum of such a sequence, should it exist*

### BEGIN SOLUTION  TEMPLATE=*YOUR ANSWER TO PROBLEM 4 HERE*

Consider the SBH graph for the spectrum that contains all $4^k$ possible $k$-mers.  Each vertex in the graph corresponds to a $(k-1)$-mer, and each edge corresponds to a $k$-mer.  Let $s$ be the $(k-1)$-mer labeling some vertex in the graph.  Outgoing edges from this vertex are $k$-mers that have $s$ as a prefix.  For DNA sequences, there are four such $k$-mers, $sA$, $sC$, $sG$, and $sT$, and thus the outdegree of every vertex is four.  Similarly, incoming edges to the vertex are $k$-mers that have $s$ as a suffix.  There are four such $k$-mers, $As$, $Cs$, $Gs$, and $Ts$, and thus the indegree of every vertex is also four.  Every vertex is therefore balanced.

The graph is also strongly connected: from any vertex $s$ we can reach any vertex $t$ in $k-1$ steps by following the edges that append the characters of $t$ one at a time.  A strongly connected directed graph in which every vertex is balanced contains an Eulerian cycle.  Any Eulerian cycle in this graph traverses each edge exactly once, so it spells a sequence of length $4^k + k - 1$ that contains each $k$-mer exactly once.  Thus the answer is **yes**: for every $k$, there exists a DNA sequence that contains every possible $k$-mer exactly once.
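
The proof is constructive, so it can be illustrated for small $k$. The sketch below is one possible construction, not part of the required answer: the hypothetical function `all_kmer_sequence` builds the complete SBH graph and traces an Eulerian cycle with Hierholzer's algorithm (a standard technique chosen here for illustration), then spells the cycle as a string.

```python
from itertools import product


def all_kmer_sequence(k, alphabet="ACGT"):
    """Spell an Eulerian cycle of the complete SBH graph for k-mers.

    Hypothetical illustration using an iterative Hierholzer traversal.
    """
    # Every (k-1)-mer vertex starts with one unused outgoing edge per base.
    remaining = {
        "".join(vertex): list(alphabet)
        for vertex in product(alphabet, repeat=k - 1)
    }
    start = alphabet[0] * (k - 1)
    # Each stack entry is (vertex, base appended to arrive at that vertex).
    stack = [(start, "")]
    tour = []
    while stack:
        vertex, _ = stack[-1]
        if remaining[vertex]:
            base = remaining[vertex].pop()
            # Following edge vertex+base leads to its (k-1)-mer suffix.
            stack.append(((vertex + base)[1:], base))
        else:
            # Dead end: this vertex is finished, so record its arrival base.
            tour.append(stack.pop()[1])
    tour.reverse()
    return start + "".join(tour)


for k in range(1, 4):
    sequence = all_kmer_sequence(k)
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    print(k, len(sequence), len(set(kmers)) == len(kmers) == 4 ** k)
```

For each $k$ tried, the resulting sequence has length $4^k + k - 1$ and its $k$-mers are all distinct, which matches the counting in the argument above.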

### END SOLUTION

### Tests for problem 1

In [4]:
def test_greedy_assemble_with_files(reads_filename, superstring_filename):
    reads = read_strings_from_file(reads_filename)
    [superstring] = read_strings_from_file(superstring_filename)
    assert greedy_assemble(reads) == superstring 

In [5]:
# TEST: greedy_assemble returns a string
sanity_test_reads = read_strings_from_file("tests/test_reads.txt")
assert isinstance(greedy_assemble(sanity_test_reads), str)
print("SUCCESS: greedy_assemble returns a string passed!")

SUCCESS: greedy_assemble returns a string passed!


In [6]:
# TEST: greedy_assemble returns a superstring
def is_superstring(s, reads):
    return all(read in s for read in reads)
assert is_superstring(greedy_assemble(sanity_test_reads), sanity_test_reads)
print("SUCCESS: greedy_assemble returns a superstring passed!")

SUCCESS: greedy_assemble returns a superstring passed!


In [7]:
# TEST: greedy_assemble_small_test_1
small_test1_reads = ["GTT", "ATCTC", "CTCAA"]
assert greedy_assemble(small_test1_reads) == "ATCTCAAGTT"
print("SUCCESS: greedy_assemble_small_test_1 passed!")

SUCCESS: greedy_assemble_small_test_1 passed!


In [8]:
# TEST: greedy_assemble_small_test_2
small_test2_reads = ["CGAAG", "ATCGA", "AGAG", "GGG"]
assert greedy_assemble(small_test2_reads) == "ATCGAAGAGGG"
print("SUCCESS: greedy_assemble_small_test_2 passed!")

SUCCESS: greedy_assemble_small_test_2 passed!


In [9]:
# TEST: greedy_assemble_small_test_3
small_test3_reads = ["C", "T", "G", "A"]
assert greedy_assemble(small_test3_reads) == 'ACGT'
print("SUCCESS: greedy_assemble_small_test_3 passed!")

SUCCESS: greedy_assemble_small_test_3 passed!


In [10]:
# TEST: greedy_assemble large test 1
test_greedy_assemble_with_files("tests/large_test1_reads.txt", "tests/large_test1_superstring.txt")
print("SUCCESS: greedy_assemble large test 1 passed!")

SUCCESS: greedy_assemble large test 1 passed!


In [11]:
# TEST: greedy_assemble_reads_7 (hidden)
### BEGIN HIDDEN TESTS
test_greedy_assemble_with_files("tests/reads7.in.txt", "tests/reads7.out.txt")
print("SUCCESS: greedy_assemble_reads_7 passed!")
### END HIDDEN TESTS

SUCCESS: greedy_assemble_reads_7 passed!


In [12]:
# TEST: greedy_assemble_reads_8 (hidden)
### BEGIN HIDDEN TESTS
test_greedy_assemble_with_files("tests/reads8.in.txt", "tests/reads8.out.txt")
print("SUCCESS: greedy_assemble_reads_8 passed!")
### END HIDDEN TESTS

SUCCESS: greedy_assemble_reads_8 passed!


In [13]:
# TEST: greedy_assemble_reads_9 (hidden)
### BEGIN HIDDEN TESTS
test_greedy_assemble_with_files("tests/reads9.in.txt", "tests/reads9.out.txt")
print("SUCCESS: greedy_assemble_reads_9 passed!")
### END HIDDEN TESTS

SUCCESS: greedy_assemble_reads_9 passed!


In [14]:
# TEST: greedy_assemble_reads_10 (hidden)
### BEGIN HIDDEN TESTS
test_greedy_assemble_with_files("tests/reads10.in.txt", "tests/reads10.out.txt")
print("SUCCESS: greedy_assemble_reads_10 passed!")
### END HIDDEN TESTS

SUCCESS: greedy_assemble_reads_10 passed!


In [15]:
# TEST: greedy_assemble_reads_11 (hidden)
### BEGIN HIDDEN TESTS
test_greedy_assemble_with_files("tests/reads11.in.txt", "tests/reads11.out.txt")
print("SUCCESS: greedy_assemble_reads_11 passed!")
### END HIDDEN TESTS

SUCCESS: greedy_assemble_reads_11 passed!
