<h1><center>Protein multiple sequence alignment</center></h1>

author: Rawad Ghostin

# Introduction

In the first part of the project, we looked at several ways of aligning two sequences. Very often, however, many similar sequences are already known to be homologous. These sequences are expected to share similar regions and functions, so it is more informative to align a given sequence to the whole set rather than to one specific member. Aligning to the whole set emphasizes how a sequence has deviated from its origins, rather than from another particular sequence. Moreover, the degree and type of residue conservation that the alignment highlights improve our ability to extract meaningful biological information.<br>
Several methods exist to study different aspects of the sequences. In this project, we examine the alignment of a set of sequences that belong to the same family, the WW domains. The algorithm is based on the Smith-Waterman implementation from the previous part. It allows us to compare a given protein sequence against a known group of sequences and to consider whether the sequence can be affiliated with the family.<br>
Throughout this report, we explain the concepts and implement the process. We then discuss the solutions produced by the algorithm in an attempt to make sense of the observations.

# Material and methods

## MSA
Most WW proteins have notably high pairwise similarity scores.
In the first part of the project, we looked at several ways of aligning two protein sequences. Comparing **sets of sequences**, however, is more informative about the role played by particular residues at particular locations.
It also emphasizes the degree of residue conservation and how far a protein has deviated from its origins, rather than from another specific protein.
Such alignments are called **MSA**s (multiple sequence alignments).

<figure>
    <img src="img/msa_example.png">
    <caption> Figure 1 : Example of an MSA</caption>
</figure>


#### Iterative algorithm for producing an MSA
A wide variety of methods exist to produce MSAs.<br>
In our project, we use an iterative approach, through the <a href="https://www.ebi.ac.uk/Tools/msa/muscle/">MUSCLE</a> tool.

This family of techniques starts from an initial alignment (a random one in the stochastic variants) and improves it iteratively until it reaches a good-enough solution.

<figure>
    <img src="img/iterative.png">
    <caption> Figure 2 : Diagram explaining the flow of an iterative algorithm</caption>
</figure>

##### Evaluating the alignment
Iterative algorithms operate by producing new generations until a satisfactory solution is obtained.
We start with a random population, *generation 0* (a set of candidate alignments). At each iteration, the algorithm evaluates the quality of the current population using the *sum-of-pairs* score and produces a new generation that replaces the current one, until a satisfactory score is reached, meaning that the desired precision is attained.

The sum-of-pairs score is defined as follows:
$$
S(m) = \sum_{k<l} S(m_{k,l}) \textrm{     total sum of pairs (sum over all pairs of sequences)}
$$

Where

$$
S(m_{k,l}) = \sum_i s(m^k_i, m^l_i) \textrm{     score of the pair of sequences } (k, l) \textrm{ (sum over all columns)}
$$

And $m^k_i$ is the residue of sequence $k$ at position $i$, and $s(m^k_i, m^l_i)$ is the score given by the substitution matrix.<br>
A new generation is derived from the current one through the improvement process explained in the next section.
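
The sum-of-pairs score can be sketched in a few lines. This is only an illustration: `sum_of_pairs` and `toy_score` are names made up for the example, and the toy scoring rule stands in for a real substitution matrix such as BLOSUM62.

```python
from itertools import combinations

def sum_of_pairs(alignment, score):
    """Sum-of-pairs score of an alignment given as equal-length strings."""
    total = 0
    for seq_k, seq_l in combinations(alignment, 2):               # every pair of sequences
        total += sum(score(a, b) for a, b in zip(seq_k, seq_l))   # every column i
    return total

def toy_score(a, b):
    # Hypothetical scoring rule (NOT a real substitution matrix):
    # +1 match, -1 mismatch, -2 residue against gap, 0 gap against gap
    if a == '-' and b == '-':
        return 0
    if a == '-' or b == '-':
        return -2
    return 1 if a == b else -1

alignment = ["AC-G", "ACTG", "A-TG"]
print(sum_of_pairs(alignment, toy_score))
```

Each pair of sequences is visited once, which matches a sum over the pairs rather than over ordered couples.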

##### Improvement process
The improvement step can rely on several techniques (sometimes combined), such as *stochastic hill climbing*, *simulated annealing*, *tabu search* and so forth.
For instance, the *SAGA* algorithm improves the alignment using a set of *crossover operators* and *mutation operators*. Crossover operators emulate the genetic crossing that occurs naturally when a new generation of individuals is born. Mutation operators emulate the mutations that affect the genome of the new generation.

At each iteration, operators from both categories are applied. They are selected randomly according to a probability distribution that is regularly updated based on a reward/penalty system.<br>
Operators that produce generations closer to the required precision are rewarded by increasing the likelihood of their selection in later iterations.
Likewise, poorly performing operators are penalized and become less likely to be picked in the future.
Finally, the algorithm applies a *best-fit* policy and only keeps the best 50% of individuals to form the new population.
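
The reward/penalty idea can be illustrated with a small sketch. It is not SAGA's actual update rule: the function names, the `step` and `floor` values and the two operator labels are assumptions chosen only to show how selection probabilities can follow operator performance.

```python
import random

def update_weights(weights, operator, improved, step=0.1, floor=0.05):
    """Reward or penalize one operator, then renormalize so the weights sum to 1."""
    new = dict(weights)
    delta = step if improved else -step
    new[operator] = max(floor, new[operator] + delta)   # never drop an operator entirely
    total = sum(new.values())
    return {name: w / total for name, w in new.items()}

def pick_operator(weights, rng=random):
    """Draw one operator name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

weights = {"crossover": 0.5, "mutation": 0.5}
weights = update_weights(weights, "crossover", improved=True)
print(weights, pick_operator(weights))
```

Keeping a small floor on every weight is one way of making sure that a penalized operator can still be tried again later.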

Unlike tools based on other kinds of techniques, e.g. *Clustal Omega* for progressive alignment, *MUSCLE* has the advantage of being effective for both small and large groups of sequences, which makes it our preferred choice.


The resulting MSA is saved in FASTA format, so we can simply reuse the FASTA parser from the previous part of the project.<br>



### MSA variables and operations 

We can already fix the notation for the variables and operations of an MSA. These are useful for computing a profile, a concept explained in detail in the next section.

- $N_{seq}$ is the number of sequences aligned in the MSA.<br>
- $n_{u,a}$ is the number of occurrences of an amino acid $a$ at position $u$ of the alignment.<br>
- $f_{u,a}$ is the relative frequency of an amino acid $a$ at position $u$. <br>
It is calculated as follows:<br>
$$ 
f_{u,a}=\frac{n_{u,a}}{N_{seq}}
$$
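
As a quick illustration of this notation, the sketch below computes $n_{u,a}$ and $f_{u,a}$ on a toy alignment. The four sequences and the names `aligned`, `column_counts` and `frequency` are invented for the example and are independent of the classes used in the rest of the report.

```python
from collections import Counter

# Toy alignment (made-up data): 4 aligned sequences of length 4
aligned = ["GWEE", "GWTE", "PWEE", "GW-E"]
n_seq = len(aligned)

def column_counts(u):
    """n_{u,a} for every symbol a observed in column u."""
    return Counter(seq[u] for seq in aligned)

def frequency(u, a):
    """f_{u,a} = n_{u,a} / N_seq"""
    return column_counts(u)[a] / n_seq

print(frequency(0, 'G'))   # G appears in 3 of the 4 sequences at column 0
print(frequency(2, 'E'))   # E appears in 2 of the 4 sequences at column 2
```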


The MSA structure and operations are defined in code as follows, but first we re-define some ADTs and the FASTA parser from the previous part of the project.

In [None]:
from math import log10
from collections.abc import MutableSequence
from copy import deepcopy

In [None]:
### Data structures from the previous part

class AminoAcid:
    __valid_AA = 'ABCDEFGHIKLMNPQRSTVWXYZ-'     # symbols representing an amino acid

    def __init__(self, symbol):
        self.__symbol = None
        self.symbol = symbol        # call to setter

    def __repr__(self):
        return 'AA({})'.format(self.symbol)

    def __str__(self):
        return self.symbol

    def __eq__(self, other):
        return self.symbol == other.symbol

    @property
    def symbol(self):
        return self.__symbol

    @symbol.setter
    def symbol(self, symbol):
        symbol = symbol.upper()
        assert symbol in self.__valid_AA    # asserts that symbol is valid before setting it
        self.__symbol = symbol
 

class AASequence(list):
    def __init__(self, fasta_header=None):
        super().__init__()
        self.fasta_header = fasta_header

    def __str__(self):
        """ Returns AASequence in format : 'ACDEFGHIKLMNPQRSTVWY' """
        return ''.join([str(am_acid) for am_acid in self])

    def __repr__(self):
        """ Returns AASequence in format : AASequence(ACDEFGHIKLMNPQRSTVWY) """
        return 'AASequence({})'.format(str(self))

    def fasta_repr(self):
        assert self.fasta_header is not None
        return self.fasta_header + '\n' + str(self) + '\n'

    def append(self, symbol):
        """
        Extension to the python list.append.
        Converts the str symbol into AminoAcid object before append
        """
        am_acid = AminoAcid(symbol)
        super().append(am_acid)

    def extend(self, symbol_iterable):
        """
        Overwriting python list.extend
        Given an iterable of string symbols, appends them to the sequence
        """
        for symbol in symbol_iterable:
            self.append(symbol)     # call to extended append
            
class Matrix:
    def __init__(self, nrows, ncols, init_val=None):
        self.nrows = nrows
        self.ncols = ncols
        self.init_val = init_val if not isinstance(init_val, MutableSequence) else deepcopy(init_val)
        self.mat = None
        self.reset()        # fill matrix in self.mat

    def __getitem__(self, idx):
        return self.mat[idx]

    def __setitem__(self, idx, newrow):
        assert len(newrow) == self.ncols    # a row holds one value per column
        self.mat[idx] = newrow

    def __str__(self):
        # self.mat is stored row by row
        return ''.join(' '.join(repr(x) for x in row) + '\n' for row in self.mat)

    def __repr__(self):
        content = ', '.join(repr(row) for row in self.mat)
        return 'Matrix({})'.format(content)

    def reset(self):
        self.mat = [[None for j in range(self.ncols)] for i in range(self.nrows)]
        for i in range(self.nrows):
            for j in range(self.ncols):
                if isinstance(self.init_val, MutableSequence):
                    self.mat[i][j] = deepcopy(self.init_val)
                else:
                    self.mat[i][j] = self.init_val

    def fillrow(self, i, val):
        """ fills row i with val """
        for j in range(self.ncols):
            self.mat[i][j] = val

    def fillcol(self, j, val):
        """ fills col j with val """
        for i in range(self.nrows):
            self.mat[i][j] = val

    def getmax(self):
        """ returns value, row, col of maximum in matrix """
        maxval = 0, 0, 0  # value, i, j
        for i in range(self.nrows):
            for j in range(self.ncols):
                if self.mat[i][j] > maxval[0]:
                    maxval = self.mat[i][j], i, j
        return maxval
    
    
def parse_fasta_file(filename):
    """
    Parses a FASTA file line by line, which keeps memory usage low for large files.
    :param filename:  path to fasta file
    :return: list of AASequences
    """
    seq_list = []
    curr_seq = None

    with open(filename, 'r') as f:
        for line in f:
            line = line.strip()
            # if sequence header, create new instance else continue parsing current one
            if line.startswith('>'):
                curr_seq = AASequence(fasta_header=line)         # create new AASequence instance
                seq_list.append(curr_seq)       # append reference to the AASequence object
            elif line and curr_seq is not None:     # skip blank lines and text before the first header
                curr_seq.extend(line)
    return seq_list

In [None]:
class MSA:
    def __init__(self, seq_list):
        self.__seq_list = seq_list

    @property
    def N_seq(self):
        """ Returns number of sequences being aligned """
        return len(self.__seq_list)

    @property
    def len(self):
        """ Returns the length of the sequences after being aligned """
        return len(self.__seq_list[0])

    def get_col(self, u):
        """ Helper function, returns the column u in a list format"""
        return [seq[u] for seq in self.__seq_list]

    def n(self, u, a):
        """ Returns the number of occurrence of an amino acid a at a position u"""
        a = AminoAcid(a)
        return self.get_col(u).count(a)

    def f(self, u, a):
        """ Returns the frequency of a at position u"""
        return self.n(u, a) / self.N_seq 

    def __str__(self):
        res = "[*] MSA:\nN_seq=%s; len=%s\n" % (self.N_seq, self.len)
        for seq in self.__seq_list:
            res += str(seq) + '\n'
        return res


## Profiles

As noted in the introduction, when many sequences are already known to be homologous, it is more informative to align a given sequence to the whole set than to one specific member: the alignment then shows how the sequence deviates from its origins, rather than from another particular sequence.
 
To this end, we need a representation of the essential properties of the set of sequences, to which the given sequence can be aligned.
Such a representation is a matrix called a **profile**; it holds compact information about the essential properties of an MSA.
 
### PSSM
 
**PSSM**s (position-specific scoring matrices), which are a type of profile, take into consideration the frequency of each amino acid in each column and its importance on an evolutionary scale.
A PSSM is a matrix of size $number\_of\_possible\_aacids \times len(MSA)$ whose entries are scores reflecting how likely a given amino acid is to be found at a specific position.
The score is calculated based on observed and theoretical/statistical data.
Hence, generating a PSSM requires an MSA (a set of aligned sequences) that provides the <u>observed</u> values for the group of sequences.
Multiple tools and algorithms exist to perform such alignments.
In this project, we use the **MUSCLE** tool.
Theoretical/statistical data can be obtained from public databases such as Swiss-Prot.

#### Computing a PSSM
There are multiple methods for calculating a PSSM. They mainly differ in their specifics, such as whether the evolutionary process is taken into account and which data (observed or theoretical) is given more weight.
 
In our case, we use the following formula:
$$
m_{u,a}= \log \frac{q_{u,a}}{p(a)}
$$
Where:

- $q_{u,a}$ is the probability of occurrence of the amino acid $a$ at position $u$ of the alignment.

- $p(a)$ is the probability of appearance of the amino acid $a$ in nature. It is computed by Swiss-Prot through a statistical analysis of the composition of its database. The probability depends on various evolutionary factors. For instance, amino acids that are naturally produced by *Homo sapiens* tend to be more prevalent than those that are not.

- $m_{u,a}$ is the score reflecting the likelihood of finding the amino acid $a$ at position $u$ of the alignment.

The score is a *log-odds* ratio (computed with a base-10 logarithm in our implementation), which is calculated as follows:
$$
\log \frac{p(event\_A\_in\_specific\_conditions)}{p(event\_A\_in\_random)}
$$
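
To get a feel for the sign of a log-odds score, here is a minimal numeric sketch. The helper `log_odds` and the three column probabilities are made up for the illustration; only the background value for W (1.07%) comes from the table used in this report.

```python
from math import log10

def log_odds(q, p):
    """Base-10 log-odds of a column probability q against a background probability p."""
    return log10(q / p)

p_w = 0.0107   # background probability of tryptophan (W)

enriched = log_odds(0.50, p_w)      # W far more frequent than in the background
neutral = log_odds(0.0107, p_w)     # W exactly as frequent as in the background
depleted = log_odds(0.001, p_w)     # W rarer than in the background

print(enriched > 0, neutral == 0, depleted < 0)
```

A positive score therefore means that the amino acid is favoured at that position, a negative score that it is disfavoured.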

 
### Overcoming lack of data with pseudocounts
The log-odds formula for computing the PSSM scores $m_{u,a}$ has the property that if an amino acid $a$ is not observed in a column $u$, the score tends towards $-\infty$.
As a consequence, the sequence to be aligned cannot place that amino acid at position $u$.
However, the absence of a specific amino acid from a specific column is most probably due to a lack of data rather than to an actual alignment preference, so this scheme is too restrictive.
 
The root of the problem lies in $q_{u,a}$, which equals $0$ whenever the amino acid was never observed in the column.
A technique used to overcome this issue is adding *pseudocounts*.
Pseudocounts encode our statistical knowledge of the system before any new residue is introduced. They can be used to *infer* data at particular positions, thus simulating a complete and rich system with no lack of data.<br>

This knowledge can be incorporated into the formula of $q_{u,a}$ as follows:

$$
q_{u,a} = \frac{\alpha \cdot f_{u,a} + \beta \cdot p(a)}{\alpha + \beta}
$$


$\alpha$ and $\beta$ are simple scaling parameters.<br>
$\alpha$ weights the frequency of the amino acid $a$ in column $u$.<br>
$\beta$ weights the probability of occurrence of the amino acid $a$ in nature.<br>
More abstractly, these parameters set the relative importance we give to observed and theoretical data in our system.<br>
The division by $\alpha+\beta$ keeps $q_{u,a}$ a valid probability: as a weighted average of $f_{u,a}$ and $p(a)$, each $q_{u,a}$ lies between $0$ and $1$, and the $q_{u,a}$ of a column sum to $1$ whenever the $f_{u,a}$ and the $p(a)$ do, as required by the Kolmogorov axioms of probability.<br>
We note that, since $p(a) \in (0,1]$ and $\beta > 0$, $q_{u,a}>0$ in all cases, which eliminates the possibility of having $m_{u,a} = \log 0 = -\infty$.

In our project, we set these factors to be:
$$
\alpha = N_{seq}-1 \text{ and } \beta=\sqrt{N_{seq}}
$$


When there is a lot of "real data", there is little need for pseudocounts and $\beta$ should be low relative to $N_{seq}$, whereas when only a small amount of data is available, relying on pseudocounts is necessary and $\beta$ should be high relative to $N_{seq}$. A common technique is to set $\beta=\sqrt{N_{seq}}$ as a balanced weight for pseudocounts.
This may leave $\beta$ too low for a small $N_{seq}$; for a large $N_{seq}$, however, the formula approaches $q_{u,a} \approx f_{u,a}$, as desired.<br>
As for $\alpha$, making the parameter depend on $N_{seq}$ seems reasonable: a good parameter is one that adapts to the quantity of data available for observation. We consider $\alpha=N_{seq}$ a poor choice because $\alpha \cdot f_{u,a} = N_{seq}\cdot\frac{n_{u,a}}{N_{seq}}=n_{u,a}$, which yields a formula based on raw counts, where the scaling factor for observed data is suppressed:
$$
q_{u,a} = \frac{n_{u,a} + \beta \cdot p(a)}{N_{seq} + \beta}
$$

With $\alpha=N_{seq}-1$ instead, $\alpha \cdot f_{u,a}=(N_{seq}-1)f_{u,a}=n_{u,a}-f_{u,a}$, so the observed term still depends on the frequency at which the amino acid appears in the column.
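
The effect of pseudocounts can be checked numerically. The sketch below is a standalone illustration, assuming $N_{seq}=16$ and the background value for C (1.36%) from the table used in this report; the function `q` is a hypothetical helper, not the method of the `PSSM` class.

```python
from math import log10, sqrt

def q(f_ua, p_a, alpha, beta):
    """Pseudocount-corrected probability: weighted average of f_{u,a} and p(a)."""
    return (alpha * f_ua + beta * p_a) / (alpha + beta)

n_seq = 16                               # assumed number of sequences
alpha, beta = n_seq - 1, sqrt(n_seq)     # alpha = N_seq - 1, beta = sqrt(N_seq)
p_c = 0.0136                             # background probability of cysteine (C)

q_unobserved = q(0.0, p_c, alpha, beta)  # C never observed in the column
score = log10(q_unobserved / p_c)        # finite: equals log10(beta / (alpha + beta))

print(q_unobserved > 0, score)
```

Without pseudocounts the same column would give $q_{u,a}=0$ and an undefined score; here the score is simply a moderate negative value.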

We define a PSSM structure as follows:

In [None]:
class PSSM:
    # probability of occurence of aminoacids in nature (in %)
    __p = {
        'A': 8.28,  'C': 1.36, 'D': 5.45, 'E': 6.76, 'F': 3.86,
        'G': 7.09, 'H': 2.27, 'I': 5.99, 'K': 5.85, 'L': 9.67, 
        'M': 2.43,'N': 4.05, 'P': 4.68, 'Q': 3.94, 'R': 5.53,
        'S': 6.50,'T': 5.32, 'V': 6.87, 'W': 1.07, 'Y': 2.91,
    }

    def __init__(self, msa, alpha, beta, E):
        # E (the gap penalty) is accepted but not used here: gaps are scored by PSSMAligner
        self.alpha = alpha
        self.beta = beta
        self.msa = msa
        # PSSM matrix, initially entirely filled with None (no computation yet)
        self.mat = {AA: [None for _ in range(self.msa.len)] for AA in self.aminoacids}
        self.compute()

    @classmethod
    def p(cls, a):
        """ Returns the probability of occurence of aminoacids in nature """
        a = a.upper()
        return cls.__p[a]/100

    @property
    def aminoacids(self):
        """returns a list of possible amino acids"""
        return self.__p.keys()

    def q(self, u, a):
        """ Returns the probability of occurence of an aminoacid a in the position u"""
        return (self.alpha * self.msa.f(u, a) + self.beta * self.p(a)) / (self.alpha + self.beta)

    def m(self, u, a):
        """
        Returns the score of occurence of an aminoacid a in the position u in that alignment.
        """
        if self.p(a) != 0:
            return round(log10(self.q(u, a) / self.p(a)), 3)
        else:
            raise Exception("logic error")

    def compute(self):
        """ Compute the PSSM"""
        for a in self.aminoacids:
            self.mat[a] = [self.m(u, a) for u in range(self.msa.len)]

    def __str__(self):
        # use the instance parameters, not module-level globals
        res = "[*] PSSM: alpha=%s; beta=%s;\n" % (self.alpha, self.beta)
        for a in self.mat.keys():
            res += str(a) + "|"
            res += ' '.join(map(str, self.mat[a])) + '\n'
        return res

    def __getitem__(self, item):
        return self.mat[str(item)]


## Aligning to a profile
Now that the methods for generating an MSA and a PSSM are established, we can focus on aligning a new sequence to a profile.
Aligning a sequence to a profile is a way of adding a new sequence to a group of sequences known to be related.

The alignment is based on the previously implemented **Smith-Waterman** algorithm, a technique for local sequence alignment. 

A score matrix, $S$, is used to evaluate alignment scores. The size of $S$ is $(len(seq\_to\_align)+1) \times (len(aligned\_msa\_seq)+1)$.<br>

<img src="img/profile_allure.png">


For the same reasons as explained in the previous report, the first row and column of the matrix are initialized to $0$ and represent a leading gap.

$S$ is computed using dynamic programming techniques and according to the following formula:

$$
\begin{equation}
    S(i, j)= max
    \begin{cases}
        S(i-1,j-1) + PSSM(seq[i-1],j-1) \\
        S(i-1, j) + PSSM('-',j) \\
        S(i,j-1) + PSSM('-',j-1)\\
        0\\
    \end{cases}
\end{equation}
$$

Where:

- $S(i,j)$ is the score of aligning the $i^{th}$ amino acid of the given sequence with the $j^{th}$ column of the PSSM.
- $(i,j)$ is the position in the score matrix.
- $PSSM(a, u)$ is the PSSM score of having an amino acid $a$ at column $u$.
- $PSSM('-',u)$ is the score of having a gap at position $u$ of the alignment.


In this formula, we evaluate the possible moves for each step of the alignment and select the one that yields the best score.

- $S(i-1,j-1)+PSSM(seq[i-1],j-1)$ is the score of aligning the current amino acid of the sequence with the current column, taking into consideration the PSSM score of doing so.<br>
The indices used to access the PSSM are decremented by 1 on both axes because of the additional first (gap) row and column of the score matrix.

- $S(i-1, j) + PSSM('-',j)$ is the score of consuming the current amino acid of the sequence without advancing in the profile, i.e. a gap is opened in the MSA at position $j$, and the score is penalized accordingly.
- $S(i,j-1) + PSSM('-',j-1)$ is the score of inserting a gap in the sequence, taking into consideration the penalty of a gap at this position.
- $0$ prevents negative scores. Allowing negative scores would produce alignments in which the high-similarity regions are spread sparsely across the alignment, whereas we want to find regions of the protein that are similar in a "condensed" way. Negative values are therefore replaced by $0$ and not considered.


In our case, gaps are penalized linearly according to the previously defined formula:

$$
g(n_{gap}) = - E \cdot n_{gap}
$$
Where $E$ is the gap penalty and $n_{gap}$ is the number of gaps.<br>

Therefore, the score matrix formula can be simplified to:
$$
\begin{equation}
    S(i, j)= max
    \begin{cases}
        S(i-1,j-1) + PSSM(seq[i-1],j-1) \\
        S(i-1, j) - E \\
        S(i,j-1) - E\\
        0\\
    \end{cases}
\end{equation}
$$
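
The recurrence can be tried on a toy profile before looking at the full implementation. The sketch below is a simplified, standalone version: `score_matrix` and `toy_pssm` are invented for the example, the profile values are arbitrary, and `gap` is passed as a positive penalty that is subtracted, as in the formula.

```python
def score_matrix(seq, pssm, n_cols, gap):
    """Fill S for `seq` against a profile with `n_cols` columns; `gap` is a positive penalty."""
    S = [[0.0] * (n_cols + 1) for _ in range(len(seq) + 1)]   # row 0 and column 0 stay at 0
    best = (0.0, 0, 0)                                        # value, i, j of the maximum
    for i in range(1, len(seq) + 1):
        for j in range(1, n_cols + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + pssm[seq[i - 1]][j - 1],    # residue i aligned with column j
                S[i - 1][j] - gap,                            # gap in the profile
                S[i][j - 1] - gap,                            # gap in the sequence
                0.0,                                          # restart the local alignment
            )
            if S[i][j] > best[0]:
                best = (S[i][j], i, j)
    return S, best

# Arbitrary toy profile with 3 columns (hypothetical values)
toy_pssm = {
    'G': [1.0, -1.0, -1.0],
    'W': [-1.0, 2.0, -1.0],
    'E': [-1.0, -1.0, 1.0],
    'A': [-0.5, -0.5, -0.5],
}
S, (value, i, j) = score_matrix("AGWEA", toy_pssm, n_cols=3, gap=0.2)
print(value, i, j)
```

On this input the best local alignment covers the residues `GWE`, which match the three profile columns, and the flanking `A` residues are left out.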


Similarly to Smith-Waterman, the algorithm first computes the score matrix while logging the path taken in a `PathMatrixMSA` (a modified version of the previous `PathMatrix`).
Then, once the score matrix is computed, the algorithm generates a solution with the function `next_solution`.
Solutions are produced by retracing the path with the `tracepath` function, while marking the visited cells in the `blacklist`.


The structures for `PathMatrixMSA` and for a solution are defined as follows:

In [None]:
class PathMatrixMSA(Matrix):
    """ Structure to log the path taken while calculating the score matrix """
    DIAG, LEFT, TOP = 0, 1, 2

    def __init__(self, seq, pssm):
        self.pssm = pssm
        self.seq = seq
        self.init_val = [False, False, False]
        super().__init__(ncols=self.pssm.msa.len+1, nrows=len(seq)+1, init_val=self.init_val)

    def isdiag(self, i, j):
        return self.mat[i][j][self.DIAG]

    def isleft(self, i, j):
        return self.mat[i][j][self.LEFT]

    def istop(self, i, j):
        return self.mat[i][j][self.TOP]

    def isnone(self, i, j):
        return not any(self.mat[i][j])      # [False, False, False]

    def setdiag(self, i, j):
        self.mat[i][j][self.DIAG] = True

    def setleft(self, i, j):
        self.mat[i][j][self.LEFT] = True

    def settop(self, i, j):
        self.mat[i][j][self.TOP] = True

    def erase(self, i, j):
        """ reset cell to default [false, false, false] """
        self.mat[i][j] = deepcopy([False, False, False])

    def isdivergent(self, i, j):
        """ checks if multiple origins possible to the cell"""
        return self.mat[i][j].count(True) > 1

    def fillrow(self, i, diag, left, top):
        for j in range(self.ncols):
            # diag, left and top are booleans
            x = [None, None, None]
            x[self.DIAG] = diag
            x[self.TOP] = top
            x[self.LEFT] = left
            self.mat[i][j] = deepcopy(x)

    def fillcol(self, j, diag, left, top):
        for i in range(self.nrows):
            # diag, left and top are booleans
            x = [None, None, None]
            x[self.DIAG] = diag
            x[self.TOP] = top
            x[self.LEFT] = left
            self.mat[i][j] = deepcopy(x)
            

class SolutionBasic:
    def __init__(self, s2, startpos, endpos):
        self.s2 = s2
        self.startpos = startpos
        self.endpos = endpos

    def __str__(self):
        res = self.s2 + '\n'
        res += "Position : %d - %d\n" % (self.startpos, self.endpos)
        res += 'Raw length (with gaps) : %d\n' % len(self.s2)
        res += 'Net length (without gaps) : %d\n' % len([x for x in self.s2 if x!='-'])
        return res

The algorithm for aligning sequences to a profile is defined below as `PSSMAligner`.
It takes as parameters a PSSM, a sequence to be aligned to the PSSM, and the gap penalty $E$.
An additional parameter $l$ sets how many solutions are computed.


In [None]:
class PSSMAligner:
    def __init__(self, seq, pssm, E, l):
        self.seq = seq
        self.pssm = pssm
        self.E = E
        self.l = l

        self.S_mat = Matrix(nrows=len(self.seq)+1, ncols=pssm.msa.len +1)
        # S matrix initialization
        self.S_mat.fillrow(i=0, val=0)
        self.S_mat.fillcol(j=0, val=0)

        self.path_mat = PathMatrixMSA(seq=self.seq, pssm=self.pssm)
        self.blacklist = Matrix(nrows=len(self.seq)+1, ncols=pssm.msa.len +1, init_val=False)
        self.solutions = []  # Holds structures of type SolutionBasic

    
    def display(self):
        """ display solutions """
        hr = '-' * 20   # horizontal line
        print(hr, '[*] Solution(s)', hr, sep='\n')
        print('- seq: %s' % self.seq, '- E=%.2f' % self.E, sep='\n')
        print('- Seq len : %s' % len(self.seq))
        print('- PSSM dim : %s x %s' % (len(self.pssm.mat.keys()), self.pssm.msa.len))
        print('- S dim : %s x %s' % (self.S_mat.nrows, self.S_mat.ncols))
        print('- # solutions : ', len(self.solutions),'\n')
        print('\n',self.pssm.msa,'\n')
        for sol in self.solutions:
            print(sol, end='\n\n')
        print(hr)

    def S(self, i, j):
        """ Computes S(i,j) according to the formula"""
        if self.S_mat[i][j] is None:   # dynamic programing
            aa = self.seq[i-1]
            diag_score = self.S(i - 1, j - 1) + self.pssm[aa][j-1]
            # self.E holds the signed gap score (e.g. -0.20), so it is added, not subtracted
            top_score = self.S(i-1, j) + self.E
            left_score = self.S(i, j-1) + self.E
            self.S_mat[i][j] = max(0, diag_score, top_score, left_score)

            # log path, diagonal is favoured
            if self.S_mat[i][j] == diag_score:
                self.path_mat.setdiag(i, j)
            elif self.S_mat[i][j] == top_score:
                self.path_mat.settop(i, j)
            elif self.S_mat[i][j] == left_score:
                self.path_mat.setleft(i, j)
        return self.S_mat[i][j]
    
    def setzero(self, i, j):
        """ mark cell as already treated, and set values to 0"""
        self.blacklist[i][j] = True
        self.S_mat[i][j] = 0
        self.path_mat.erase(i, j)

    def tracepath(self, i, j):
        """ Given initial position (i,j) find sequence alignement and the path taken
        according to the SmithWaterman algorithm"""
        s2 = ''
        path = []

        while not self.S_mat[i][j] == 0:
            path.append((i,j))
            if self.path_mat.isdiag(i, j):
                s2 += str(self.seq[i - 1])
                i = i-1
                j = j-1

            elif self.path_mat.isleft(i, j):
                s2 += '-'
                i = i
                j = j-1

            elif self.path_mat.istop(i, j):
                s2 += str(self.seq[i - 1])
                i = i-1
                j = j
            else:
                raise Exception('Logic error')
        return s2, path

    def next_solution(self):
        """
        Generates the next aligned sequence, or returns None when no positive score is left
        """
        # get max and calculate path
        value, maxi, maxj = self.S_mat.getmax()
        s, path = self.tracepath(i=maxi, j=maxj)
        if not path:        # every remaining score is 0: no further local alignment
            return None
        s = s[::-1]

        # mark and set to 0
        for i, j in path:
            self.setzero(i, j)

        # erase submatrix starting from last position before reaching the 0 to absolute bottom right position
        init_i, init_j = path[-1]
        for i in range(init_i, self.S_mat.nrows):
            for j in range(init_j, self.S_mat.ncols):
                if not self.blacklist[i][j]:
                    self.S_mat[i][j] = None
                    self.path_mat.erase(i, j)   # drop stale directions before recomputing

        # recalculate matrix
        for i in range(init_i, self.S_mat.nrows):
            for j in range(init_j, self.S_mat.ncols):
                if not self.blacklist[i][j]:
                    self.S_mat[i][j] = self.S(i, j)

        # start_i and end_i
        return SolutionBasic(s2=s, startpos=path[-1][0], endpos=maxi)

    def run(self):
        # fill the score matrix
        assert self.S_mat.nrows == len(self.seq)+1 and self.S_mat.ncols == self.pssm.msa.len+1
        for row in range(1, self.S_mat.nrows):
            for col in range(1, self.S_mat.ncols):
                self.S_mat[row][col] = self.S(row, col)

        # compute solutions, stopping early if no alignment is left
        for _ in range(self.l):
            solution = self.next_solution()
            if solution is None:
                break
            self.solutions.append(solution)

## Reducing the dataset
We get our dataset of sequences from *SMART*, a public protein domain database. For continuity, our study focuses on WW protein sequences that belong to *Homo sapiens* (humans). The data is in FASTA format and is stored in `to-be-aligned.fasta`.

The dataset contains 320 different (but similar) WW sequences.
Many of these sequences are highly similar, so we assume that reducing the dataset lets us approximate the end results in a more efficient and compact way.
 
To filter the dataset, we consider sequences with a similarity greater than 60% to be duplicates.
We therefore want to generate a reduced dataset in which all pairs of sequences have a similarity below $THRESHOLD=60\%$.
 
We use a heuristic algorithm that goes through the sequences linearly and groups them so that all pairs of sequences within a group have a similarity below 60%.
The group holding the most sequences is then selected as the new reduced dataset.
We choose the largest group so that our results are based on more data, which makes them statistically more reliable.
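
One possible reading of this grouping heuristic is sketched below. It is an assumption about how the grouping could work, not the content of the actual script: `identity` is a hypothetical stand-in for the alignment-based similarity, and the four sequences are made up.

```python
THRESHOLD = 0.60

def identity(s1, s2):
    """Hypothetical stand-in for the alignment-based similarity: fraction of equal positions."""
    same = sum(a == b for a, b in zip(s1, s2))
    return same / max(len(s1), len(s2))

def reduce_dataset(sequences, similarity=identity, threshold=THRESHOLD):
    """Greedy single pass: put each sequence in the first group where it is
    dissimilar to every member, otherwise open a new group; keep the largest group."""
    groups = []
    for seq in sequences:
        for group in groups:
            if all(similarity(seq, member) < threshold for member in group):
                group.append(seq)
                break
        else:                       # no compatible group was found
            groups.append([seq])
    return max(groups, key=len)

seqs = ["GWEERQ", "GWEERK", "PLPAGW", "TSDGRV"]   # made-up sequences
print(reduce_dataset(seqs))
```

Here `GWEERK` is too close to `GWEERQ` and ends up alone in a second group, so the first group of three sequences is returned.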
 
More specifically, we compare two sequences by aligning them with the Needleman-Wunsch algorithm developed in the previous part.
The alignment scores are computed with a gap-opening penalty of $I=12$ and a gap-extension penalty of $E=2$.
The **BLOSUM62** substitution matrix was used for these alignments. Other substitution matrices could have been used: after some testing, we observed that all the matrices we tried produce comparably good results. These tests and results are out of scope for this report.


The resulting dataset is stored in `to-be-aligned-reduced.fasta` and contains 15 sequences.
 
For performance, we used a *thread pool* to compute the comparisons with all the sequences of a group in parallel.
A thread pool is a fixed number of threads that cooperatively process a shared queue of tasks.
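
A thread pool of this kind can be sketched with the standard library. This is an illustration only, not the code of the actual script: `fits_group` and `identity` are hypothetical names, and `identity` again stands in for the alignment-based similarity.

```python
from concurrent.futures import ThreadPoolExecutor

def identity(s1, s2):
    """Hypothetical stand-in for the alignment-based similarity."""
    same = sum(a == b for a, b in zip(s1, s2))
    return same / max(len(s1), len(s2))

def fits_group(seq, group, threshold=0.60, workers=4):
    """Compare `seq` with every member of `group` using a fixed pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # each comparison is one task taken from the shared queue
        scores = list(pool.map(lambda member: identity(seq, member), group))
    return all(score < threshold for score in scores)

print(fits_group("TSDGRV", ["GWEERQ", "PLPAGW"]))
```

Note that for pure-Python, CPU-bound comparisons the speed-up from threads may be limited by the interpreter's global lock.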
 
The exact script used for reducing the dataset, *reduce_dataset.py*, produces the file `to-be-aligned-reduced.fasta` and is attached to the notebook.
The script contains some supporting code required for it to run (ADTs, parsers, the Needleman-Wunsch implementation from project 1). This less relevant code is placed between the tags "#### OVERHEAD - start/end".

# Results and discussions
We start by defining some constants, mainly file names and the `PENALTY` value.
Having obtained the alignments from *MUSCLE*, we parse the generated FASTA files to build our structures, notably the MSAs and PSSMs. The parameters $\alpha$ and $\beta$ are computed from the number of sequences ($N_{seq}$) in each MSA.

The gap penalty is $E=0.20$; in the code it is stored as the signed value `PENALTY = -0.20`, which is added to the score. The value was chosen heuristically, based on trials. Stricter penalties are not satisfactory because they produce (almost) no gaps in the alignment.

In [None]:
# MSA files
MSA_FILE = 'rsrc/msaresults-muscle.fasta'
MSA_REDUCED_FILE = 'rsrc/msaresults-reduced-muscle.fasta'
# protein sequences to be aligned
PROT_FILE = 'rsrc/protein-sequences.fasta'
# settings
PENALTY = -0.20

# parsed fastas
seq_list = parse_fasta_file(PROT_FILE)
msa_seq_list = parse_fasta_file(MSA_FILE)
msa_reduced_seq_list = parse_fasta_file(MSA_REDUCED_FILE)

# msa and pssm
msa = MSA(msa_seq_list)
alpha = msa.N_seq-1
beta = msa.N_seq**0.5
pssm = PSSM(msa=msa, alpha=alpha, beta=beta, E=PENALTY)

# msa_reduced and pssm_reduced
msa_reduced = MSA(msa_reduced_seq_list)
alpha_reduced = msa_reduced.N_seq-1
beta_reduced = msa_reduced.N_seq**0.5 
pssm_reduced = PSSM(msa=msa_reduced, alpha=alpha_reduced, beta=beta_reduced, E=PENALTY)

## PSSM evaluation
The scores of a PSSM are effective for computing and representing alignments. Nonetheless, they do not conveniently *display* the conservation or preference of amino acids at particular locations. This information is significant because it reveals the essential *functional* locations of the family. **Logos** are a graphical representation that makes these key residues easy to identify.

A logo is computed by calculating the information content of each column $u$ of the alignment:
$$
I_u = \log_2 20 - H_u
$$

where $H_u$ is the uncertainty (Shannon entropy) of column $u$, computed over the amino acids $a$ with frequency $f_{u,a}$ in that column:
$$
H_u = - \sum_{a} f_{u,a} \log_2(f_{u,a})
$$

The contribution of each residue is $f_{u,a} \cdot I_u$.
An in-depth explanation of these formulae is out of scope for our project.
For more details, see C. Shannon's pioneering work on information theory: <a href="http://www.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf">A Mathematical Theory of Communication</a>.
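
To make the formulae concrete, the following sketch computes $I_u$ and the per-residue contributions $f_{u,a} \cdot I_u$ for a single alignment column. The function name and the choice to ignore gap characters when computing frequencies are assumptions made for the example; Weblogo may apply additional corrections.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Return (I_u, contributions) for one column of an alignment.

    `column` is a string holding the residue of every sequence at that
    position. Gaps ('-') are ignored here, which is an assumption."""
    residues = [r for r in column if r != '-']
    total = len(residues)
    if total == 0:
        return 0.0, {}
    # f_{u,a}: observed frequency of each amino acid in the column
    freqs = {a: n / total for a, n in Counter(residues).items()}
    # H_u = - sum_a f_{u,a} * log2(f_{u,a})
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    # I_u = log2(20) - H_u
    info = math.log2(alphabet_size) - entropy
    # height of each letter in the logo: f_{u,a} * I_u
    return info, {a: f * info for a, f in freqs.items()}


info, heights = column_information("WWWW")
print(info, heights)
```

A fully conserved column reaches the maximum of $\log_2 20$ bits, whereas a column split evenly between two residues loses exactly one bit.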

For convenience, we compute the logos using <a href="http://weblogo.threeplusone.com">Weblogo</a>, an online tool.
 <figure>
 <figcaption>Weblogo generated based on the PSSM of the MSA of the 320 sequences</figcaption>
  <img src="img/logo_muscle.png">
  
</figure> 
<br><br>
 <figure>
   <figcaption>Weblogo generated based on the PSSM of the MSA of the reduced dataset (15 sequences)</figcaption>
  <img src="img/logo_muscle_reduced.png">
</figure> 
<br>

For comparison, the "accepted" logo of the WW domains is available on <a href="https://pfam.xfam.org">Pfam</a>.
 <figure>
  <img src="img/logo_HMM.png">
</figure> 

As we can observe, all 3 figures are highly similar. There appear to be 2 sites where the amino acid $W$ is substantially dominant and conserved, at positions 6 and 72.
We can therefore deduce that these residues are essential to the function of the protein. They also give the family its name, the **WW** domains.<br>
Other sites are also well conserved, like the amino acid P at position 76.
Unlike in the HMM logo, both of our logos show a lack of information from positions 20-22 to 65, which is caused by the gaps at these positions in our alignments.

From this, we can deduce that the PSSMs we computed from the MSAs are valid. <br>
This also suggests that our reduction of the dataset yields a correct approximation of the full dataset, as expected.

Additionally, we note that the PSSM of the reduced dataset presents less "pronounced" information: with less data available, the uncertainty is higher.

# Alignment evaluation

In order to test our algorithm in various situations, we first evaluate the alignment of the sequences in `protein-sequences.fasta` against both MSAs (full and reduced).
We then discuss the results and try to make sense of them by proposing hypotheses.

## Alignment based on the complete dataset

In the following, we align our sequences from `protein-sequences.fasta` with the PSSM of the complete dataset (320 sequences).




In [None]:
for seq in seq_list:
    pa = PSSMAligner(seq=seq, pssm=pssm, E=PENALTY, l=4)
    pa.run()
    pa.display()

## Alignment based on the reduced dataset
In the following, we align our sequences from `protein-sequences.fasta` with the PSSM of the reduced dataset (15 sequences).


In [None]:
for seq in seq_list:
    pa = PSSMAligner(seq=seq, pssm=pssm_reduced, E=PENALTY, l=4)
    pa.run()
    pa.display()

## Comparison of solutions

The comparison of both alignments is presented below.<br>
As we can observe, the two alignments do not produce exactly the same results.
Although at least one solution is common to both, the alignments produced with the reduced dataset tend to diverge from the ones produced with the complete dataset after the first or second solution.<br>
However, we are still able to identify similar segments in both.<br>
For instance, for the first sequence (O00213), we can observe similarities between the third solution of the complete dataset and the fourth solution of the reduced dataset.


LHAQPIIS<span style="background-color:rgb(51, 153, 254)">IRVWGV-GR-D--SGRERDFAYVARD--KLT</span> <br>and<br>
<span style="background-color:rgb(51, 153, 254)">I-R-VW--GV-GRDSGRERDFAYVARDKLT</span>QMLK--C-H

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-l6li{font-size:10px;border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-l6li">protein ID</th>
    <th class="tg-l6li">alignments (complete dataset)<br></th>
    <th class="tg-l6li">alignments (reduced dataset)<br></th>
  </tr>
  <tr>
    <td class="tg-l6li">O00213</td>
    <td class="tg-l6li">DLPAGW-MRVQD--TSG-TYYW-HIPT-------------------------GTTQWEPPGR<br><br>SL--GWVEM--TEEELAP-GRSSV--AVN-N<br><br>LHAQPIISIRVWGV-GR-D--SGRERDFAYVARD--KLT<br><br><br>PLPQEE-EKLPPR-NTNP--GIKC-FAV<br></td>
    <td class="tg-l6li">DLPAGW-MRVQD--TSG-TYYW-HIPT-------------------GTTQWEPPGR<br><br>-TT----QW<br><br>WATLSQG-SPSYGSPEDTDSF-WNPNAFET-DS--DLPA-GWMR--V-----QDTSGTYYW-HIPT<br><br>I-R-VW--GV-GRDSGRERDFAYVARDKLTQMLK--C-H<br></td>
  </tr>
  <tr>
    <td class="tg-l6li">P46937</td>
    <td class="tg-l6li">PLPDGWEQAMTQ-DGEIYYINHKN--------------------------KTTSWLDPRL<br><br>PLPAGWEMAKT-SSGQR-YFLNHID-------------------------QTTTWQDPR<br><br>PVSSPG-MSQELRTM-TTNSSDPFLNSG-TY---HSRDEST<br><br>PRTPDDFLN-SVDEM--D--TGDT--IN-QST<br></td>
    <td class="tg-l6li">PLPDGWEQAM-T-QDGEIYYINH-------K-N-----------KTTSWLDPR<br><br>PLPAGWE-M--AKTSSGQR-YF-LNHIDQT---------------------TTWQDPR<br><br>QNPVSSPG-MSQELRTMTTNSSDPFLNSG-T-YHSRD-EST<br><br>DFLNS--VDEM-D--TG-DT--IN-QST<br></td>
  </tr>
  <tr>
    <td class="tg-l6li">O75554</td>
    <td class="tg-l6li">DL-ISGASQWEKPEGFQGDLKKTAVKTVWVEGLSEDG-FTYYYNTET-------------------------GESRWEKPDD<br><br>D-PSKGRWVEGIT-SEGYHYYYD-LIS-------------------------G-ASQWEKP<br><br>--YHYY<br><br>WVEGLSEDGFTYYYNTE-TGESRWEKP-DDFI-PH-T<br></td>
    <td class="tg-l6li">TVWVEGLS-EDG-FTYYYN--T------E--T---------GE-SRWEKPDD<br><br>D-PSKGRWVEGIT-SEG-YHYYYDL-IS-------------------G-ASQWEKP<br><br>NPYGEWQEIKQEVESHEEVDLELPSTENE--YV-STS<br><br>-FTYY<br></td>
  </tr>
</table>

Therefore, we can deduce that reducing the dataset produces solutions that approximate those of the original dataset.
It is reasonable to suppose that the complete dataset produces more accurate solutions. Indeed, the full dataset contains more entries, which makes the frequencies $f_{u,a}$ used to compute the PSSM more representative. Additionally, since the reduction algorithm operates heuristically, its correctness is not proven for all cases, and the results depend on the order in which the sequences have been read.<br>
On the other hand, the alignments based on the complete dataset take into consideration **all** the WW protein sequences available in the dataset, so no available information is discarded during the computation.


## Comparison using Uniprot
For validation, we compare our alignments to the online database <a href="https://www.uniprot.org">Uniprot</a>.
Uniprot offers detailed genetic and biological information about proteins.
By consulting the *family & domains* section, we can learn which protein domains are present in a sequence and at which positions each domain resides. This information is quite useful to validate the accuracy of our alignments.

Below is a table comparing the positions of our alignments with Uniprot.

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-lboi{border-color:inherit;text-align:left;vertical-align:middle}
.tg .tg-0lax{text-align:left;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-lboi">protein ID</th>
    <th class="tg-lboi">Position of WW (Uniprot)</th>
    <th class="tg-lboi">Position of WW full dataset</th>
    <th class="tg-0lax">Position of WW reduced dataset</th>
  </tr>
  <tr>
    <td class="tg-lboi">O00213</td>
    <td class="tg-lboi">253 – 285</td>
    <td class="tg-lboi">254-285</td>
    <td class="tg-0lax">254-285</td>
  </tr>
  <tr>
    <td class="tg-lboi">P46937</td>
    <td class="tg-lboi">171 – 204&nbsp;&nbsp;&nbsp;(WW1)<br>230 – 263 (WW2)<br></td>
    <td class="tg-lboi">172-203<br>231-263</td>
    <td class="tg-0lax">172-203<br>231-264</td>
  </tr>
  <tr>
    <td class="tg-0pky">O75554</td>
    <td class="tg-0pky">122 – 155&nbsp;&nbsp;(WW1)<br>163 – 196&nbsp;&nbsp;(WW2)<br></td>
    <td class="tg-0pky">122-142<br>169-183<br></td>
    <td class="tg-0lax">122-141<br>167-182<br></td>
  </tr>
</table>

As we can observe, in the protein O00213, the WW domain resides at positions 253-285, which closely matches the positions produced by our alignments, 254-285 for both the full and the reduced dataset.<br>
In the protein P46937, there are 2 WW domains, at positions 171-204 and 230-263 respectively. Both datasets closely match them, producing positions 172-203 for WW1 and 231-263/264 for WW2.<br>
In the protein O75554, there are 2 WW domains, at positions 122-155 and 163-196 respectively. Both datasets locate them, producing positions 122-141/142 for WW1 and 167/169-182/183 for WW2, although these alignments are shorter than the annotated domains.<br>
We note that for P46937 and O75554, since there are 2 WW domains, at least two solutions are required ($l\ge2$). The solutions reported in the table above are the first 2 solutions.

We can deduce that both of our alignments produce reasonably accurate results compared to Uniprot and can therefore be considered correct.
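
The agreement with Uniprot can also be quantified. The sketch below computes, for each domain, the ratio between the intersection and the union of the two position ranges (1 means identical ranges). The positions are copied from the table above for the full dataset; the function name and the choice of this particular measure are assumptions made for the example.

```python
def overlap_ratio(a, b):
    """Intersection over union of two inclusive position ranges (start, end)."""
    (s1, e1), (s2, e2) = a, b
    inter = max(0, min(e1, e2) - max(s1, s2) + 1)
    union = (e1 - s1 + 1) + (e2 - s2 + 1) - inter
    return inter / union


# (Uniprot range, range found with the full dataset), taken from the table
domains = {
    "O00213 WW":  ((253, 285), (254, 285)),
    "P46937 WW1": ((171, 204), (172, 203)),
    "P46937 WW2": ((230, 263), (231, 263)),
    "O75554 WW1": ((122, 155), (122, 142)),
    "O75554 WW2": ((163, 196), (169, 183)),
}

for name, (uniprot, ours) in domains.items():
    print(f"{name}: {overlap_ratio(uniprot, ours):.2f}")
```

Such a measure makes explicit that the domains of O00213 and P46937 are recovered almost entirely, while those of O75554 are only partially covered.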

# Conclusion

In conclusion, this report examined the alignment of protein sequences against an MSA using a PSSM. Key principles were explained and illustrated throughout the report.

The algorithms were implemented and tested in order to make sense of the different concepts. The results were analyzed, compared to external tools for verification, and discussed in order to propose hypotheses about the outcome.

# References
- http://smart.embl.de
- https://www.uniprot.org/
- https://www.ebi.ac.uk/Tools/msa/muscle/
- https://www.comp.nus.edu.sg/~ksung/algo_in_bioinfo/slides/Ch6_MSA.pdf
- http://www.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf
- http://weblogo.threeplusone.com
- Introduction to Bioinformatics, M. Zvelebil and J. O. Baum
- https://uv.ulb.ac.be/pluginfile.php/1489420/mod_resource/content/1/L4%20Alignement%20de%20plusieurs%20sequences.pdf
- http://pfam.xfam.org
- Profile analysis: Detection of distantly related proteins, M. Gribskov, A. D. McLachlan, D. Eisenberg
