# Lab 2: Statistics behind Local Alignment
___

### Exercise 1
Say we have a query sequence of length $m$ and a subject (database) sequence of length $n$. We usually assume $m \ll n$, which is both simpler and realistic. Devise an algorithm that finds both the highest-scoring perfect match and the high-scoring segment with a time complexity of $\mathcal{O}(mn)$. In the scoring system, a match scores +1 and a mismatch scores -1.

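```python
# q: query sequence (length m); s: subject sequence (length n)

# the cumulative score
cumulativescore = 0

# minimum score up to now
minscore = 0

# highest cumulative score up to now
maxcumscore = 0

for i in range(n - m + 1):      # offset of the query along the subject
    for j in range(m):          # position within the query
        if q[j] == s[i + j]:
            pass  ## add some code here
        else:
            pass  ## add other code here
        ## add some code here

```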

Do you think your algorithm is sensitive enough to find all the high-scoring segments?

### How to determine the statistical significance of nucleotide local alignments?

Here is an example of alignment:
<code>
Q:    1     ttgacctagatgagatgtcgttcacttttactcaggtacagaaaa  45
            |||| |||||||||||| | |||||||||||| || |||||||||
S:    403   ttgatctagatgagatgccattcacttttactgagctacagaaaa  447
</code>

Try to identify high-scoring segments whose score $S$ exceeds a cutoff $x$ using a **local alignment** algorithm (e.g., BLAST).
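As a rough illustration of what a raw score is, the ungapped score of the example alignment can be computed by counting matches and mismatches. This is a minimal sketch, assuming the +1/-1 scoring from Exercise 1; the function name `ungapped_score` is hypothetical and is not part of BLAST.

```python
# Example alignment copied from the text (ungapped, both strings are 45 nt long).
query   = "ttgacctagatgagatgtcgttcacttttactcaggtacagaaaa"
subject = "ttgatctagatgagatgccattcacttttactgagctacagaaaa"

def ungapped_score(a, b, match=1, mismatch=-1):
    """Return (score, matches, mismatches) for two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    mismatches = len(a) - matches
    return matches * match + mismatches * mismatch, matches, mismatches

score, matches, mismatches = ungapped_score(query, subject)
print("score:", score, "matches:", matches, "mismatches:", mismatches)
print("identity: {0:.3f}".format(matches / len(query)))
```

Passing `mismatch=-2` instead shows how a harsher penalty lowers the score of the same alignment.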

The scores will follow an extreme value (a.k.a. Gumbel) distribution:
$$
P(S > x) = 1 - e^{-Kmn e^{-\lambda x}}
$$
where $K, \lambda$ are positive parameters that depend on the scoring system and the composition of the sequences being compared.
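The formula can be evaluated directly once $K$ and $\lambda$ are known. The sketch below is illustrative only: the value `K = 0.1` and the database length are made-up placeholders, not values from this lab, and `gumbel_p_value` is a hypothetical helper name.

```python
import math

def gumbel_p_value(x, m, n, K, lam):
    """P(S > x) = 1 - exp(-K * m * n * exp(-lam * x))."""
    expected_hits = K * m * n * math.exp(-lam * x)
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-expected_hits)

# Placeholder parameters (assumptions for illustration only).
m, n = 45, 1000000
K, lam = 0.1, math.log(3)

for x in (10, 20, 30):
    print(x, gumbel_p_value(x, m, n, K, lam))
```

For large $x$ the probability is very close to $Kmn e^{-\lambda x}$ itself, which is why that quantity is often read as the expected number of chance hits.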

### How does $\lambda$ relate to the scoring system and the composition of the sequences?

$\lambda$ is the unique positive solution of the following equation:
$$
\sum_{i,j=1}^4 p_i r_j e^{\lambda s_{ij}} = 1
$$
where:
- $p_i$ is the probability of nucleotide $i$ in the query sequence;
- $r_j$ is the probability of nucleotide $j$ in the database;
- $s_{ij}$ is the score for aligning nucleotide $i$ with nucleotide $j$.
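Since the equation generally has no closed-form solution, $\lambda$ is usually found numerically. Below is a minimal bisection sketch, assuming the expected score $\sum_{i,j} p_i r_j s_{ij}$ is negative and at least one score is positive (the conditions under which a positive root exists). The function name `solve_lambda` is hypothetical.

```python
import math

def solve_lambda(p, r, s, tol=1e-12):
    """Positive root of sum_ij p[i] * r[j] * exp(lam * s[i][j]) = 1, by bisection."""
    pairs = [(p[i] * r[j], s[i][j]) for i in range(len(p)) for j in range(len(r))]

    def f(lam):
        return sum(w * math.exp(lam * score) for w, score in pairs) - 1.0

    if sum(w * score for w, score in pairs) >= 0:
        raise ValueError("expected score must be negative")
    if max(score for _, score in pairs) <= 0:
        raise ValueError("at least one score must be positive")

    # f is negative between 0 and the root and positive beyond it,
    # so grow hi until f(hi) > 0, then bisect on [0, hi].
    lo, hi = 0.0, 1.0
    while f(hi) <= 0:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# Uniform composition with match +1 and mismatch -1.
uniform = [0.25] * 4
scores = [[1 if i == j else -1 for j in range(4)] for i in range(4)]
print(solve_lambda(uniform, uniform, scores))
```

Non-uniform compositions or other scoring matrices can be passed in the same way.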

#### Question
1. What kind of equation is this?
   - This is a transcendental equation. Let's plug in the simplest case by setting:
$$
\forall i, j = 1, \ldots, 4,  p_i = r_j = \frac{1}{4} , s_{ij} = \left\{
\begin{array}{ll}
1 & i=j\\
-1 & i \neq j
\end{array}
\right.
$$
then we have:
$$
\frac{1}{4}e^{\lambda} + \frac{3}{4}e^{-\lambda} = 1
$$
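Here $\frac{1}{4}$ is the probability of a match, while $\frac{3}{4}$ is the probability of a mismatch. Substituting $y = e^{\lambda}$ gives $y^2 - 4y + 3 = 0$, whose roots are $y = 1$ (that is, $\lambda = 0$, which is not positive) and $y = 3$. Hence $\lambda = \ln(3) \approx 1.099$.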
2. What would happen to $\lambda$ if we double all the scores in the scoring matrix?
   - $\lambda$ would be halved, since in the above equation $p_i$ and $r_j$ do not change. In order to keep the product $\lambda s_{ij}$ fixed, we need to ...

3. What does this tell us about the nature of $\lambda$?
   - $\lambda$ acts as a scale factor that normalizes the raw scores with respect to the scoring system and the compositions of the query and subject sequences. Multiplying all scores by a constant rescales $\lambda$ but does not change which segments are high-scoring.
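The scaling behaviour from question 2 can be checked numerically in the uniform case. With match score $+a$ and mismatch score $-a$, substituting $z = e^{a\lambda}$ leads to the same quadratic as before, so the candidate solution is $\lambda = \ln(3)/a$. The sketch below plugs that candidate back into the equation; the helper name `lhs` is hypothetical.

```python
import math

def lhs(lam, a):
    """Left-hand side of the lambda equation for uniform composition,
    match score +a and mismatch score -a."""
    return 0.25 * math.exp(lam * a) + 0.75 * math.exp(-lam * a)

for a in (1, 2, 4):
    lam = math.log(3) / a   # candidate solution for scores scaled by a
    print("scale:", a, "lambda:", round(lam, 4), "lhs:", lhs(lam, a))
```

Each printed left-hand side should equal 1 up to floating-point rounding, while $\lambda$ shrinks in proportion to the scale.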

### What scoring matrix to use for DNA?

We usually use a **simple match-mismatch scoring matrix**:
$$
S = \left[\begin{array}{cccc}
1 & m & m & m\\
m & 1 & m & m\\
m & m & 1 & m\\
m & m & m & 1
\end{array}\right]
$$
where $m < 0$ is the **mismatch penalty** (not to be confused with the query length $m$ used earlier).
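In code, such a matrix can be represented as a nested dictionary keyed by nucleotide. This is a small sketch under the assumption of a four-letter alphabet in the order A, C, G, T; the names `NUCLEOTIDES` and `match_mismatch_matrix` are illustrative.

```python
NUCLEOTIDES = "ACGT"

def match_mismatch_matrix(mismatch, match=1):
    """Return a dict-of-dicts scoring matrix: `match` on the diagonal,
    `mismatch` everywhere else."""
    if mismatch >= 0:
        raise ValueError("the mismatch penalty must be negative")
    return {a: {b: (match if a == b else mismatch) for b in NUCLEOTIDES}
            for a in NUCLEOTIDES}

S = match_mismatch_matrix(-2)
for a in NUCLEOTIDES:
    print(a, [S[a][b] for b in NUCLEOTIDES])
```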

#### Question

1. Should we use $m=-1$ or $m=-2$, $m=-4$, $\ldots$? And why?

2. Could you please propose some more complicated scoring matrices for application?

### How to choose the mismatch penalty?
It depends on the level of identity we expect in the alignments we are looking for.

Let's compute the **target frequencies** (the proportion of aligned pairs of nucleotides $i$ and $j$ in the target alignment):
$$
q_{ij} = p_i p_j e^{\lambda s_{ij}} \Rightarrow s_{ij} = \frac{\ln(q_{ij}/p_i p_j)}{\lambda}
$$

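If you want a region with identity $r$, then the four identical pairs share the fraction $r$ and the twelve mismatched pairs share the fraction $1-r$, so $q_{ii} = r/4$ and $q_{ij} = (1-r)/12$ for $i \neq j$. We can further set $s_{ii}=1$; then for $s_{ij}$: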
$$
m = s_{ij} = s_{ij} / s_{ii} = \frac{\frac{\ln(q_{ij}/p_i p_j)}{\lambda}}{\frac{\ln(q_{ii}/p_i p_i)}{\lambda}} = \frac{\ln(q_{ij}/p_i p_j)}{\ln(q_{ii}/p_i p_i)}
$$
Assume that $p_i = p_j = 1/4$ and $1/4 < r < 1$, then:
$$
m = \frac{\ln(4(1-r)/3)}{\ln(4r)}
$$
Note here that $r$ is the **expected fraction of identities in high-scoring BLAST hits**.

```python
import math

identities = [.75, .95, .99]
# mismatch score for each target identity r, with the match score fixed at +1
m = [math.log(4 * (1 - r) / 3) / math.log(4 * r) for r in identities]
print("Identities\tMismatch score")
for i, j in zip(identities, m):
    print("{0:.2f}\t\t{1:.1f}".format(i, j))
```

```
Identities	Mismatch score
0.75		-1.0
0.95		-2.0
0.99		-3.1
```


This means that, for a match score of +1 and a uniform distribution of the four nucleotides, if we want the high-scoring segments to have an identity of about 95% or more, we need to set the mismatch score to about -2 or lower.
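The relation can also be read in the other direction: given a mismatch score, which identity is it tuned for? The sketch below inverts the formula by bisection, assuming a match score of +1, uniform nucleotide frequencies, and $m < -1/3$ (the limit of the formula as $r \to 1/4$). Both function names are hypothetical.

```python
import math

def mismatch_for_identity(r):
    """m = ln(4(1-r)/3) / ln(4r), valid for 1/4 < r < 1."""
    return math.log(4 * (1 - r) / 3) / math.log(4 * r)

def identity_for_mismatch(m, tol=1e-10):
    """Invert mismatch_for_identity by bisection.

    Relies on m decreasing from -1/3 toward minus infinity as r goes from 1/4 to 1.
    """
    if m >= -1.0 / 3.0:
        raise ValueError("m must be below -1/3 for a match score of +1")
    lo, hi = 0.25 + 1e-9, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mismatch_for_identity(mid) > m:
            lo = mid    # penalty at mid is milder than m, so r must be larger
        else:
            hi = mid
    return (lo + hi) / 2

for m in (-1, -2, -3):
    print(m, round(identity_for_mismatch(m), 3))
```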

#### Question

All of the following questions are based on the assumption that the match score is +1.

1. In order to obtain almost-perfect alignment, what would you do in choosing a mismatch penalty score?

2. With a mismatch score of $m=-2$, can you obtain a high-scoring segment with an identity of 66%? If yes, explain why.

### BLAST Exercises

1. If you want to BLAST human EST data against the mouse genome, which BLAST program would you choose, BLASTN or TBLASTX? And why?

2. Sequence similarity usually implies structural/functional similarity. But is the converse true? Why?

3. If you need to search for divergent protein homologs, which scoring matrix would you use?

4. How do you interpret the two parameters, $K$ and $\lambda$?
   * $K$ and $\lambda$ are scaling parameters: $K$ for the size of the search space $mn$, and $\lambda$ for the scoring system $S$.

5. What are the raw score and the bit score in BLAST?