# Day 6 notebook

The objective of this notebook is to practice the following concepts:

* de Bruijn assembly
* sequencing errors
* repeats

This notebook is intended to be solved by hand.  You are welcome to use any code if that helps you.  You are strongly encouraged to work with your group members to understand and solve each problem.

## PROBLEM 1: Minimum overlap lengths and read spectra (1 POINT)
Recall that the de Bruijn approach to genome assembly with shotgun sequencing reads approximates the $k$-mer spectrum of the genome by the union of the $k$-mer spectra of the reads and then uses an Eulerian path approach to compute the assembly.  In the absence of sequencing errors, this approach may be successful if the read spectrum is equal to the genome spectrum.
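As a concrete illustration of the read spectrum described above, here is a minimal sketch. The function names `kmer_spectrum` and `read_spectrum` and the toy sequences are made up for this example, and the spectrum is assumed to be a set of distinct $k$-mers (multiplicities are ignored).

```python
def kmer_spectrum(sequence, k):
    """Return the set of all length-k substrings of sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def read_spectrum(reads, k):
    """Return the union of the k-mer spectra of a collection of reads."""
    spectrum = set()
    for read in reads:
        spectrum |= kmer_spectrum(read, k)
    return spectrum


# Toy data (hypothetical, not course data): three error-free reads from a short genome
genome = "ATGCGTAC"
reads = ["ATGCG", "GCGTA", "CGTAC"]

# Compare the read spectrum to the genome spectrum for a few values of k
for k in (3, 4, 5):
    print(k, read_spectrum(reads, k) == kmer_spectrum(genome, k))
```

Whether the two spectra agree depends on $k$ as well as on how the reads overlap, which is the relationship explored in this problem.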

On the other hand, the fragment assembly approach for the same shotgun sequencing data is to find overlaps between pairs of reads and then to compute a superstring (ideally the shortest such superstring).  In practice, one typically only considers overlaps between pairs of reads that have at least some minimum length, `min_length`.  Thus, in the absence of sequencing errors, this approach may be successful if the reads cover every position of the genome *and* each pair of adjacent reads (in terms of their position along the genome) overlaps by at least `min_length`.

In this problem, we will explore the relationship between the `min_length` parameter for fragment assembly and the value of $k$ for the de Bruijn approach.

Suppose that a set of error-free shotgun sequencing reads satisfies the requirements for fragment assembly to be successful with a minimum overlap length of `min_length` and that all reads are longer than twice the value of `min_length`.  As a function of `min_length`, what is the largest value of $k$ such that we are guaranteed that the read spectrum is equal to the genome spectrum (such that the de Bruijn approach may be successful)?

*Hint: consider $k$-mers that come from a region of the genome where two reads overlap by exactly `min_length` bases.*
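If you would like to experiment with the hint in code, the following sketch may help. It works with read positions rather than sequences: it tiles a genome with reads whose neighbors overlap by exactly a chosen number of bases and checks whether every length-$k$ window of the genome lies entirely inside at least one read. The helper names `tile_reads` and `all_windows_in_some_read` and the toy parameter values are assumptions made for this example. Note that a window not contained in any read could still have its $k$-mer appear elsewhere in the reads by chance, which is why the question asks for a *guarantee*.

```python
def tile_reads(num_reads, read_length, overlap):
    """Return (start, end) intervals, end exclusive, for reads placed so that
    each pair of adjacent reads overlaps by exactly `overlap` positions."""
    step = read_length - overlap
    return [(i * step, i * step + read_length) for i in range(num_reads)]


def all_windows_in_some_read(intervals, k):
    """Return True if every length-k window of the genome lies within a read."""
    genome_length = max(end for _, end in intervals)
    for s in range(genome_length - k + 1):
        if not any(start <= s and s + k <= end for start, end in intervals):
            return False
    return True


# Toy setting (hypothetical values): reads of length 8 overlapping by exactly 3
intervals = tile_reads(num_reads=4, read_length=8, overlap=3)
for k in range(2, 8):
    print(k, all_windows_in_some_read(intervals, k))
```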

### BEGIN SOLUTION TEMPLATE=Your written answer for $k$ in terms of `min_length`
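$k = min\_length + 1$

Reasoning: every $k$-mer of the genome must lie entirely within at least one read.  Where two adjacent reads overlap by exactly `min_length` bases, the $k$-mer that starts one base before the second read begins must fit inside the first read, which requires $k - 1 \le$ `min_length`.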
### END SOLUTION

For autograding purposes, answer this question for `min_length` = 20 below by assigning the appropriate value to the variable `k`.

In [1]:
### BEGIN SOLUTION TEMPLATE=k = ?
k = 21
### END SOLUTION

In [2]:
# autograding tests for your specific value of k for the case when min_length = 20
assert isinstance(k, int), "k should be assigned an integer value"
### BEGIN HIDDEN TESTS
assert k == 21
### END HIDDEN TESTS

## PROBLEM 2: Sequencing errors and $k$-mers (1 POINT)
Sequencing errors in reads can introduce false $k$-mers (i.e., $k$-mers that are not in the genome spectrum) into the spectra of those reads.  For a read of length $L$, what is the minimum number of base substitution errors in the read such that the read may contain *only* false $k$-mers?
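To explore this question on small cases, one option is the brute-force sketch below. It assumes that any $k$-mer containing at least one substituted base is false (in real data a substitution could, by chance, produce a $k$-mer that occurs elsewhere in the genome). The function names `all_kmers_false` and `min_errors_brute_force` are made up for this example, and positions are 0-based.

```python
from itertools import combinations


def all_kmers_false(read_length, k, error_positions):
    """Return True if every length-k window of the read contains an error."""
    errors = set(error_positions)
    return all(
        any(p in errors for p in range(s, s + k))
        for s in range(read_length - k + 1)
    )


def min_errors_brute_force(read_length, k):
    """Try error sets of increasing size until one hits every k-mer window.
    Only practical for short reads, since it enumerates all position subsets."""
    for n in range(read_length + 1):
        for positions in combinations(range(read_length), n):
            if all_kmers_false(read_length, k, positions):
                return n


# Toy values (hypothetical): a read of length 10 with k = 3
print(all_kmers_false(10, 3, [2, 5]))
print(min_errors_brute_force(10, 3))
```

Trying a few small values of $L$ and $k$ can help you spot the pattern before writing your answer in general form.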

### BEGIN SOLUTION TEMPLATE=Your written answer in terms of $L$ and $k$
$\left\lfloor \frac{L}{k}\right\rfloor$
### END SOLUTION

For autograding purposes, answer this question for `L` = 100 and `k` = 15 below by assigning the appropriate value to the variable `min_errors`.

In [3]:
### BEGIN SOLUTION TEMPLATE=min_errors = ?
min_errors = 100 // 15
### END SOLUTION

In [4]:
# autograding tests for your specific value of min_errors for L = 100 and k = 15
assert isinstance(min_errors, int), "min_errors should be assigned an integer value"
### BEGIN HIDDEN TESTS
assert min_errors == 100 // 15
### END HIDDEN TESTS

## PROBLEM 3: Repeats (1 POINT)
Suppose you are given the following set of reads:

    AAACT
    AAAGG
    AAAGT
    AAATT
    CCAAA
    CTAAA
    GGAAA
    TTAAA

How many shortest superstrings are there for this set of reads?
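Although this problem is meant to be solved by hand, a brute-force check is feasible for eight reads. The sketch below tries every ordering of the reads, merges consecutive reads using their longest suffix-prefix overlap, and collects the distinct superstrings of minimum length. It relies on the assumption that no read is a substring of another, in which case every shortest superstring arises from some ordering merged this way. The function names `overlap` and `shortest_superstrings` are made up for this example.

```python
from itertools import permutations


def overlap(a, b):
    """Return the length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def shortest_superstrings(reads):
    """Return the set of distinct minimum-length superstrings found by
    merging the reads in every possible order with maximal overlaps."""
    best_length = None
    best = set()
    for order in permutations(reads):
        superstring = order[0]
        for prev, nxt in zip(order, order[1:]):
            superstring += nxt[overlap(prev, nxt):]
        if best_length is None or len(superstring) < best_length:
            best_length = len(superstring)
            best = {superstring}
        elif len(superstring) == best_length:
            best.add(superstring)
    return best


reads = ["AAACT", "AAAGG", "AAAGT", "AAATT", "CCAAA", "CTAAA", "GGAAA", "TTAAA"]
superstrings = shortest_superstrings(reads)
for s in sorted(superstrings):
    print(len(s), s)
```

Comparing the printed superstrings can help you see which parts of the sequence are fixed and which can be rearranged because of the repeated `AAA`.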

For autograding purposes, answer this question below by assigning the appropriate value to the variable `num_shortest_superstrings`.

In [5]:
### BEGIN SOLUTION TEMPLATE=num_shortest_superstrings = ?
num_shortest_superstrings = 6
### END SOLUTION

In [6]:
# autograding tests for num_shortest_superstrings
assert isinstance(num_shortest_superstrings, int), "num_shortest_superstrings should be assigned an integer value"
### BEGIN HIDDEN TESTS
assert num_shortest_superstrings == 6
### END HIDDEN TESTS