# Multiple sequence alignment (MSA)

The first step in many methods for determining genomic relatedness is to perform a multiple alignment of the sequences. Let's say we have the sequences *ATGCGT, ATTTCTGT, ATGCGGT, TTTTGGT* and *GTGGGT*. One possible alignment would be:

<img src="images/msa-ex.png" alt="multiple sequence alignment example" width="100" align="center">

But how can we evaluate whether this alignment is good for determining similarity between the sequences? Are there any better ones? How do we work with multiple sequences at once?

## Alignment representation - matrix

**A multiple alignment of K sequences with aligned length N can be represented as a KxN matrix** (K rows and N columns), where every sequence is padded with gaps to the same length N. For the proposed alignment of our sequences (K=5, N=8) it looks like this:

<img src="images/msa-mat.png" alt="multiple sequence alignment matrix" width="300" align="center">

How could we score this matrix? 

Notice that each column corresponds to one position in the alignment and can be scored on its own. Therefore, the score (S) for the whole matrix is the sum of the scores (s) of the individual columns.

<code> mat_score = col_score[0] + col_score[1] + ... + col_score[N-1] </code>

Mathematically, this can be expressed as:

$$
S(matrix) = \sum_{i=0}^{N-1} s( C_i ) = s( C_0 ) + s( C_1 ) + \dots + s( C_{N-1} )
$$


But how can we calculate the score of a single column?

## Sum of pairs (SP) scoring

The score for a column can be calculated as the sum of the scores of all pairs of characters within that column. Writing R_i for the character in row i of the column, we are going to score all the

$$ {K \choose 2} = \frac{K(K-1)}{2} = \frac{5 \cdot 4}{2} = 10 $$

pairs:

$$(R_0,R_1),(R_0,R_2),(R_0,R_3),(R_0,R_4),(R_1,R_2),(R_1,R_3),(R_1,R_4),(R_2,R_3),(R_2,R_4),(R_3,R_4)$$
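The pair count above can be checked with a few lines of Python. This is only a small sketch: the variable names `K` and `pairs` are chosen here for illustration, and the rows are identified by their indices 0 to 4.

```python
from itertools import combinations
from math import comb

K = 5  # number of rows (sequences) in the alignment matrix

# All unordered pairs of row indices, in the same order as listed above
pairs = list(combinations(range(K), 2))

print(pairs)
print(len(pairs), comb(K, 2))  # both values are K*(K-1)/2
```

`itertools.combinations` yields each unordered pair exactly once, which is what sum of pairs scoring needs: the pair (R_0, R_1) is scored, but (R_1, R_0) is not scored again.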

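Let's say that our pairwise scoring function S(x, y) is defined so that match = 3, mismatch = -1, S(x,'-') = -1, S('-','-') = 0. Note that this S takes two characters; it is not the matrix score S from the previous section.

Write a script that scores an alignment according to the sum of pairs principle.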

In [None]:
import numpy as np

aligned_sequences = ["GTA","G-A","T-A"]

# Expected output for these sequences: col_score(0)+ col_score(1)+ col_score(2) = 1 + (-2) + 9 = 8

# Define matrix

# Hint: You can use list() to create list of characters from a string and array() from numpy

matrix = ...

# Define pair character scoring function

def S(c1, c2):
    ...  # replace with: return score

# Define column scoring function

def col_score(col):
    ...  # replace with: return score

print(sum(col_score(col) for col in matrix.T)) # matrix.T is the transposed matrix
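For checking your own result, here is one possible way to complete the exercise, written as a sketch rather than the reference solution. It assumes the scoring values given above and, to stay free of dependencies, uses `zip(*aligned_sequences)` to walk the columns in place of the NumPy transpose `matrix.T`; the name `total` is introduced here for illustration.

```python
from itertools import combinations

aligned_sequences = ["GTA", "G-A", "T-A"]

def S(c1, c2):
    """Score one pair of characters (values taken from the text above)."""
    if c1 == "-" and c2 == "-":
        return 0          # gap against gap
    if c1 == "-" or c2 == "-":
        return -1         # character against gap
    return 3 if c1 == c2 else -1  # match or mismatch

def col_score(col):
    """Sum of pairs score of a single column."""
    return sum(S(a, b) for a, b in combinations(col, 2))

# zip(*rows) yields the columns, playing the role of matrix.T
columns = list(zip(*aligned_sequences))
total = sum(col_score(col) for col in columns)

print([col_score(col) for col in columns])  # per-column scores: [1, -2, 9]
print(total)                                # 8
```

The same `S` and `col_score` work unchanged on the columns of a NumPy character matrix, since they only iterate over the characters they are given.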
        

## Dynamic approach for MSA

Now that we have a method to score an alignment, how do we find the best alignment? Can we apply the dynamic programming approach that we used for pairwise sequence alignment?

To compare 2 sequences, we constructed a 2D array which contained the optimal scores for the alignments of prefixes of the sequences. Every possible alignment could be defined as a path in the array. Similarly, we could calculate the scores for the alignments of prefixes of 3 sequences. The results would be stored in a 3D array, whose graphic representation would be a cuboid. Each path within this cuboid would represent an alignment of the 3 sequences.

<a href="https://arxiv.org/pdf/0901.2747.pdf"><img src="images/msa-3d.png" alt="MSA 3D matrix" width="300" align="center"></a>
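The cuboid idea can be sketched in code for exactly three sequences. This is a minimal illustration under stated assumptions, not an implementation from any existing tool: it reuses the sum of pairs scoring values from above, stores the 3D table in a dictionary keyed by `(i, j, k)`, and returns only the best score, not the alignment itself. The names `pair_score`, `column_score` and `best_sp_score_3` are hypothetical.

```python
from itertools import combinations, product

def pair_score(c1, c2):
    """Pair scoring from the text: match 3, mismatch -1, gap -1, gap/gap 0."""
    if c1 == "-" and c2 == "-":
        return 0
    if c1 == "-" or c2 == "-":
        return -1
    return 3 if c1 == c2 else -1

def column_score(col):
    """Sum of pairs score of one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))

def best_sp_score_3(x, y, z):
    """Best sum of pairs score over all alignments of three sequences."""
    seqs = (x, y, z)
    lengths = tuple(len(s) for s in seqs)
    # A move consumes one character (1) or inserts a gap (0) per sequence;
    # the all-gap move is excluded, leaving 2**3 - 1 = 7 moves.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    best = {(0, 0, 0): 0}  # best[(i, j, k)]: best score for the prefixes
    # product() visits cells in lexicographic order, so every predecessor
    # of a cell has already been filled in when the cell is reached.
    for cell in product(*(range(n + 1) for n in lengths)):
        if cell == (0, 0, 0):
            continue
        candidates = []
        for move in moves:
            prev = tuple(c - m for c, m in zip(cell, move))
            if min(prev) < 0:
                continue  # this move would step outside the cuboid
            col = [s[p] if m else "-" for s, p, m in zip(seqs, prev, move)]
            candidates.append(best[prev] + column_score(col))
        best[cell] = max(candidates)
    return best[lengths]

print(best_sp_score_3("GTA", "GA", "TA"))
```

For the three ungapped sequences behind the exercise alignment (GTA, GA, TA) this sketch should report 11, reached for example by GTA / G-A / -TA. That is higher than the 8 computed for the alignment in the exercise: that alignment is valid, but not the best one under this scoring.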

This principle could be extended to any number of sequences. But is it practical to do so? For K sequences the scoring matrix would be K-dimensional. How would this affect the running time of the algorithm?

See pages 3 and 4 of <a href="https://user.ceng.metu.edu.tr/~tcan/ceng465/Schedule/ceng465_week5.pdf">this document</a> to understand the complexity of the dynamic programming approach applied to the MSA problem.

Computing an optimal MSA in reasonable time is actually not an easy problem. Multiple methods have been developed to date, but none is considered optimal. In the next tutorial "T2 - Advanced algorithms for MSA" we will look at various approaches that are implemented in existing tools for multiple sequence alignment.
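To get a feel for the growth, the following sketch estimates the work of the K-dimensional approach. It assumes K sequences of equal length N, a table with (N+1)^K cells, up to 2^K - 1 predecessor cells per cell, and K(K-1)/2 pair lookups to score one column. The function name `msa_dp_cost` is made up for this illustration, and the result is a rough upper-bound estimate, since cells on the border have fewer predecessors.

```python
def msa_dp_cost(n, k):
    """Rough size and work estimate for exact MSA by dynamic programming.

    n: length of each sequence, k: number of sequences.
    Returns (number of table cells, estimated pair-score lookups).
    """
    cells = (n + 1) ** k                 # one cell per combination of prefixes
    predecessors = 2 ** k - 1            # every non-empty choice of "advance"
    pairs_per_column = k * (k - 1) // 2  # sum of pairs within one column
    return cells, cells * predecessors * pairs_per_column

for k in (2, 3, 5, 10):
    cells, lookups = msa_dp_cost(100, k)
    print(f"K={k:2d}: {cells:.3e} cells, ~{lookups:.3e} pair lookups")
```

With N = 100, two sequences need about ten thousand cells, while ten sequences would need on the order of 10^20 cells, which is why the exact approach is only feasible for a handful of short sequences.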