# Malign

`malign` is a set of experimental methods for sequence alignment that support:

- a separate alphabet for each sequence
- computation of k-best alignments
- use of asymmetric information
- true multiple alignment in a single pass (instead of a combination of pairwise alignments)

While usable with any type of sequence, including biological ones, it is intended mostly for linguistic data.

This notebook explores ideas on how to use it.

In [1]:
# Import the library
import malign
from malign import print_alms, print_malms

The simplest alignment method implemented in `malign` is the "dumb" one. It returns a single alignment for any number of sequences, padding each one with gaps on the left and right without considering the contents of the sequences. It is used to bootstrap a number of other alignment methods.

In [2]:
# Perform pairwise dumb alignment
seq_a = 'tra'
seq_b = 'fatata'
print_alms(malign.pw_align(seq_a, seq_b, method="dumb"))

# Perform multiwise dumb alignment
seqs = ['tra', 'fra', 'batata', 'virp', 'x']
print_malms(malign.multi_align(seqs, method="dumb"))

A 0 (0.50 / 0.50): - t r a - -
B 0 (0.50 / 0.50): f a t a t a
0 A {'seq': ['-', 't', 'r', 'a', '-', '-'], 'score': 0.5}
0 B {'seq': ['-', 'f', 'r', 'a', '-', '-'], 'score': 0.5}
0 C {'seq': ['b', 'a', 't', 'a', 't', 'a'], 'score': 1.0}
0 D {'seq': ['-', 'v', 'i', 'r', 'p', '-'], 'score': 0.6666666666666667}
0 E {'seq': ['-', '-', 'x', '-', '-', '-'], 'score': 0.16666666666666663}
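
The padding in the output above is consistent with centering each sequence within the length of the longest one and scoring it by its share of non-gap positions. The following is a minimal sketch of that idea, inferred only from the printed output; `dumb_align_sketch` is a hypothetical helper, not a `malign` function, and the library's actual implementation may differ.

```python
def dumb_align_sketch(seqs, gap="-"):
    """Center every sequence inside the longest one, ignoring its contents."""
    max_len = max(len(seq) for seq in seqs)
    aligned = []
    for seq in seqs:
        missing = max_len - len(seq)
        left = missing // 2          # any odd leftover gap goes to the right
        right = missing - left
        aligned.append(
            {
                "seq": [gap] * left + list(seq) + [gap] * right,
                # Assumed score: fraction of positions that are not gaps
                "score": len(seq) / max_len,
            }
        )
    return aligned


for row in dumb_align_sketch(["tra", "fra", "batata", "virp", "x"]):
    print(" ".join(row["seq"]), round(row["score"], 2))
```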


The more advanced methods begin with our modified Needleman-Wunsch (NW). They operate on multidimensional transition matrices that cover all symbols and gaps. All correspondences must be provided, even when the sequences share the same alphabet.

For illustration, let's start with a common DNA base matrix for pairwise sequence alignment:

|   | A  | G  | C  | T  | -  |
|---|----|----|----|----|----|
| A | 10 | -1 | -3 | -4 | -5 |
| G | -1 | 7  | -5 | -3 | -5 |
| C | -3 | -5 | 9  | 0  | -5 |
| T | -4 | -3 | 0  | 8  | -5 |
| - | -5 | -5 | -5 | -5 | 0  |

Note that in this case the matrix is symmetric, but this is not necessary as we'll see later. This matrix is available as `malign.utils.DNA_MATRIX`.
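
As a rough illustration, the table can be written as a mapping from symbol pairs to scores. The actual type of `malign.utils.DNA_MATRIX` is not shown in this notebook; the names `SYMBOLS`, `ROWS`, `dna_matrix` and `is_symmetric` below are assumptions made for this sketch, which only mirrors the table and checks its symmetry.

```python
# Row and column order of the table above
SYMBOLS = ["A", "G", "C", "T", "-"]
ROWS = [
    [10, -1, -3, -4, -5],
    [-1, 7, -5, -3, -5],
    [-3, -5, 9, 0, -5],
    [-4, -3, 0, 8, -5],
    [-5, -5, -5, -5, 0],
]

# One entry per (row symbol, column symbol) pair, gaps included
dna_matrix = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(SYMBOLS)
    for j, b in enumerate(SYMBOLS)
}


def is_symmetric(matrix):
    """True if swapping the two symbols never changes the score."""
    return all(matrix[a, b] == matrix[b, a] for (a, b) in matrix)


print(is_symmetric(dna_matrix))
```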

Here we start by performing some pairwise alignments on very simple genetic sequences. The parameter `k` requests at most `k` alignments, ranked by best score, if that many are computed. This method differs from the standard NW algorithm because it first computes alignments in terms of `seq_a` to `seq_b`, then in terms of `seq_b` to `seq_a`, and takes the mean score as representative of the alignment. The results are equal in this case, as the matrix is symmetric.
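
For reference, the standard single-direction NW algorithm that the modified method departs from can be sketched as follows. This is the textbook dynamic program, not `malign`'s implementation: `nw_align_sketch` and the toy `matrix` (+1 for a match, -1 for a mismatch or a gap) are assumptions made for illustration, and the sketch returns a single best alignment rather than the k best.

```python
GAP = "-"
ALPHABET = "ACGT" + GAP

# Toy scoring matrix covering every pair of symbols, gaps included
matrix = {}
for a in ALPHABET:
    for b in ALPHABET:
        if a == GAP and b == GAP:
            matrix[a, b] = 0
        elif a == GAP or b == GAP or a != b:
            matrix[a, b] = -1
        else:
            matrix[a, b] = 1


def nw_align_sketch(seq_a, seq_b, matrix, gap=GAP):
    """Global alignment of seq_a to seq_b; returns (aln_a, aln_b, score)."""
    n, m = len(seq_a), len(seq_b)
    # score[i][j] is the best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + matrix[seq_a[i - 1], gap]
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + matrix[gap, seq_b[j - 1]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + matrix[seq_a[i - 1], seq_b[j - 1]],
                score[i - 1][j] + matrix[seq_a[i - 1], gap],
                score[i][j - 1] + matrix[gap, seq_b[j - 1]],
            )

    # Trace back from the bottom-right cell, preferring diagonal moves
    aln_a, aln_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + matrix[seq_a[i - 1], seq_b[j - 1]]
        ):
            aln_a.append(seq_a[i - 1])
            aln_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + matrix[seq_a[i - 1], gap]:
            aln_a.append(seq_a[i - 1])
            aln_b.append(gap)
            i -= 1
        else:
            aln_a.append(gap)
            aln_b.append(seq_b[j - 1])
            j -= 1
    return aln_a[::-1], aln_b[::-1], score[n][m]


aln_a, aln_b, total = nw_align_sketch("GATTACA", "ATTT", matrix)
print(" ".join(aln_a))
print(" ".join(aln_b))
print(total)
```

Because the sketch uses integer scores, the equality checks in the traceback are exact; with floating-point scores it would be safer to store the chosen move for each cell instead.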

In [3]:
# Perform pairwise, modified NW alignment
print("===== Alignment 1")
seq_a = "GATTACA"
seq_b = "A"
print_alms(malign.pw_align(seq_a, seq_b, k=2, method="nw", matrix=malign.utils.DNA_MATRIX))

print("===== Alignment 2")
seq_a = "GATTACA"
seq_b = "ATTT"
print_alms(malign.pw_align(seq_a, seq_b, k=2, method="nw", matrix=malign.utils.DNA_MATRIX))

===== Alignment 1
A 0 (0.00 / -4.00): G A T T A C A
B 0 (-8.00 / -4.00): - - - - - - A
A 1 (0.00 / -5.00): G A T T A C A
B 1 (-10.00 / -5.00): - - - - A - -
===== Alignment 2
A 0 (0.00 / -4.50): G A T T A C A
B 0 (-9.00 / -4.50): - A T T - T -


But what if the matrix is not symmetric? This is not usually the case for genetic data, but for the sake of experiment let's simulate a condition where sequence B never has a T aligned with a C (from the "point of view" of A, the normal affinities hold). We simulate this by changing the matrix entry for (C, T), which now differs from the entry for (T, C) and so makes the matrix asymmetric, and check the results.

In [6]:
# Perform pairwise, asymmetric NW
matrix = malign.utils.DNA_MATRIX.copy()
matrix['C', 'T'] = -99
print_alms(malign.pw_align(seq_a, seq_b, k=4, method="nw", matrix=matrix))

A 0 (-3.00 / -5.50): G A - T T A C A
B 0 (-8.00 / -5.50): - A T T T - - -
A 1 (-3.00 / -5.50): G A T - T A C A
B 1 (-8.00 / -5.50): - A T T T - - -
A 2 (-3.00 / -5.50): G A T T - A C A
B 2 (-8.00 / -5.50): - A T T T - - -
A 3 (-3.00 / -5.50): G A T T A C A -
B 3 (-8.00 / -5.50): - A T T - - - T


You can see that the algorithm now does its best not to align a C in sequence A to a T in sequence B. This can go to extremes, as in the first example below, even though the inverse alignment still works as it normally does.

In [8]:
print("===== Alignment 1")
print_alms(malign.pw_align("GATTACA", "GATTATA", k=2, method="nw", matrix=matrix))
print("===== Alignment 2")
print_alms(malign.pw_align("GATTATA", "GATTACA", k=2, method="nw", matrix=matrix))

===== Alignment 1
A 0 (-3.00 / -3.00): G A T T A - C A
B 0 (-3.00 / -3.00): G A T T A T - A
A 1 (-3.00 / -3.00): G A T T A C - A
B 1 (-3.00 / -3.00): G A T T A - T A
===== Alignment 2
A 0 (0.00 / 0.00): G A T T A T A
B 0 (0.00 / 0.00): G A T T A C A
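
The effect of the single changed entry can be checked by hand by summing the matrix scores column by column. The sketch below uses only the entries needed for these sequences, copied from the DNA table above with (C, T) set to -99; `column_sum` and `entries` are hypothetical names, and the raw sums it prints are not the scores reported by `print_alms`, whose exact computation is not described in this notebook.

```python
# Only the matrix entries needed here, keyed as (symbol in A, symbol in B)
entries = {
    ("G", "G"): 7,
    ("A", "A"): 10,
    ("T", "T"): 8,
    ("T", "C"): 0,     # unchanged entry
    ("C", "T"): -99,   # modified entry
    ("C", "-"): -5,
    ("-", "T"): -5,
}


def column_sum(aln_a, aln_b, matrix):
    """Sum the score of every aligned column."""
    return sum(matrix[a, b] for a, b in zip(aln_a, aln_b))


# C in A against T in B hits the penalized entry
print(column_sum("GATTACA", "GATTATA", entries))    # -46
# Two gaps are far cheaper than that single column
print(column_sum("GATTA-CA", "GATTAT-A", entries))  # 43
# With the sequences swapped, the column is (T, C) and costs nothing
print(column_sum("GATTATA", "GATTACA", entries))    # 53
```

Under these raw sums, opening two gaps (43) is preferable to accepting the penalized column (-46), which matches the gapped alignments returned in the first example, while the swapped input keeps the ungapped alignment.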
