# Malign

`malign` is a set of experimental methods for sequence alignment. It supports:

- a unique alphabet for each sequence
- computation of k-best alignments
- use of asymmetric scoring information
- true multiple alignment in a single pass (instead of a combination of pairwise alignments)

While usable with any type of sequence, including biological ones, it is mostly intended for linguistic data.

This notebook explores ideas on how to use it.

In [1]:
# Import the library
import malign
from malign import print_alms, print_malms

The simplest alignment method implemented in `malign` is the "dumb" one. It returns a single alignment for any number of sequences, padding them with gaps on the left and right without considering their contents. It is used to bootstrap a number of other alignment methods.

In [2]:
# Perform pairwise dumb alignment
seq_a = 'tra'
seq_b = 'fatata'
print_alms(malign.pw_align(seq_a, seq_b, method="dumb"))

# Perform multiple dumb alignment
seqs = ['tra', 'fra', 'batata', 'virp', 'x']
print_malms(malign.multi_align(seqs, method="dumb"))

A 0 (0.50 / 0.50): - t r a - -
B 0 (0.50 / 0.50): f a t a t a
0 A (0.57) : ['-', 't', 'r', 'a', '-', '-']
0 B (0.57) : ['-', 'f', 'r', 'a', '-', '-']
0 C (0.57) : ['b', 'a', 't', 'a', 't', 'a']
0 D (0.57) : ['-', 'v', 'i', 'r', 'p', '-']
0 E (0.57) : ['-', '-', 'x', '-', '-', '-']
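
The padding in the output above is consistent with simply centering each sequence within the length of the longest one, with any odd extra gap going to the right. A minimal sketch of that idea in plain Python follows; `center_pad` is a hypothetical helper written for illustration, not a function of `malign`, and the library's actual code may differ.

```python
def center_pad(seqs, gap="-"):
    """Pad every sequence with gaps to the length of the longest one."""
    width = max(len(seq) for seq in seqs)
    padded = []
    for seq in seqs:
        missing = width - len(seq)
        left = missing // 2        # an odd extra gap ends up on the right
        right = missing - left
        padded.append([gap] * left + list(seq) + [gap] * right)
    return padded


# Same input as the multiple alignment in the cell above
for row in center_pad(["tra", "fra", "batata", "virp", "x"]):
    print(" ".join(row))
```

No symbol is ever compared with another, which is why this method can only serve as a starting point for the scoring-based methods.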


The more advanced methods begin with our modified Needleman-Wunsch (NW). These methods work on multidimensional transition matrices that include all symbols and gaps. All correspondences must be provided, even when the sequences share the same alphabet.

For illustration, let's start with a common DNA base matrix for pairwise sequence alignment:

  |   | A  | G  | C  | T  | -  |
  |---|----|----|----|----|----|
  | A | 10 | -1 | -3 | -4 | -5 |
  | G | -1 | 7  | -5 | -3 | -5 |
  | C | -3 | -5 | 9  | 0  | -5 |
  | T | -4 | -3 | 0  | 8  | -5 |
  | - | -5 | -5 | -5 | -5 | 0  |

The matrix is symmetric in this case, but that is not required, as we'll see later. This matrix is available as `malign.utils.DNA_MATRIX`.
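
As a rough illustration of the data structure, the table can be written as a dictionary keyed by (row symbol, column symbol) pairs. The later cells index the matrix as `matrix['C', 'T']`, which suggests a layout of this kind, but the exact structure of `malign.utils.DNA_MATRIX` is an assumption here. The names `SYMBOLS`, `ROWS`, `dna_matrix` and `is_symmetric` are illustrative only.

```python
SYMBOLS = ["A", "G", "C", "T", "-"]
ROWS = [
    [10, -1, -3, -4, -5],  # A
    [-1,  7, -5, -3, -5],  # G
    [-3, -5,  9,  0, -5],  # C
    [-4, -3,  0,  8, -5],  # T
    [-5, -5, -5, -5,  0],  # -
]

# Key each cell by (row symbol, column symbol)
dna_matrix = {
    (row_sym, col_sym): score
    for row_sym, row in zip(SYMBOLS, ROWS)
    for col_sym, score in zip(SYMBOLS, row)
}


def is_symmetric(matrix):
    """True if every (a, b) entry has the same score as (b, a)."""
    return all(matrix.get((b, a)) == score for (a, b), score in matrix.items())


print(dna_matrix["C", "T"], is_symmetric(dna_matrix))
```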

Here we start by performing some pairwise alignments on very simple genetic sequences. The parameter `k` sets the maximum number of alignments returned, ranked by score; fewer are returned if fewer were computed. This method differs from the standard NW algorithm because it first computes alignments in terms of `seq_a` to `seq_b`, then in terms of `seq_b` to `seq_a`, and takes the mean score as representative of the alignment. The results are equal in this case, as the matrix is symmetric.

In [3]:
# Perform pairwise, modified NW alignment
print("===== Alignment 1")
seq_a = "GATTACA"
seq_b = "A"
print_alms(malign.pw_align(seq_a, seq_b, k=2, method="nw", matrix=malign.utils.DNA_MATRIX))

print("===== Alignment 2")
seq_a = "GATTACA"
seq_b = "ATTT"
print_alms(malign.pw_align(seq_a, seq_b, k=2, method="nw", matrix=malign.utils.DNA_MATRIX))

===== Alignment 1
A 0 (0.00 / -4.00): G A T T A C A
B 0 (-8.00 / -4.00): - - - - - - A
A 1 (0.00 / -5.00): G A T T A C A
B 1 (-10.00 / -5.00): - - - - A - -
===== Alignment 2
A 0 (0.00 / -4.50): G A T T A C A
B 0 (-9.00 / -4.50): - A T T - T -
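
For readers less familiar with the baseline, here is a minimal sketch of textbook Needleman-Wunsch over a dictionary-keyed scoring matrix. It returns only the single best alignment in one direction, so it does not reproduce `malign`'s two-direction mean scoring or its k-best output. The function `needleman_wunsch` and the `toy_matrix` scores (match 1, mismatch -1, gap -1) are assumptions made for this example.

```python
def needleman_wunsch(seq_a, seq_b, matrix, gap="-"):
    """Return (score, aligned_a, aligned_b) for the best global alignment."""
    n, m = len(seq_a), len(seq_b)
    # score[i][j]: best score aligning seq_a[:i] with seq_b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + matrix[seq_a[i - 1], gap]
        move[i][0] = "up"
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + matrix[gap, seq_b[j - 1]]
        move[0][j] = "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (score[i - 1][j - 1] + matrix[seq_a[i - 1], seq_b[j - 1]], "diag"),
                (score[i - 1][j] + matrix[seq_a[i - 1], gap], "up"),
                (score[i][j - 1] + matrix[gap, seq_b[j - 1]], "left"),
            ]
            # On ties, the first option in the list wins
            score[i][j], move[i][j] = max(options, key=lambda opt: opt[0])

    # Trace back from the bottom-right corner to rebuild the alignment
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if move[i][j] == "diag":
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(gap)
            i -= 1
        else:
            aligned_a.append(gap)
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], aligned_a[::-1], aligned_b[::-1]


# Toy scores: match 1, mismatch -1, any gap -1
alphabet = "ACGT"
toy_matrix = {(a, b): (1 if a == b else -1) for a in alphabet for b in alphabet}
for symbol in alphabet:
    toy_matrix[symbol, "-"] = -1
    toy_matrix["-", symbol] = -1

best, row_a, row_b = needleman_wunsch("GATTACA", "A", toy_matrix)
print(best, " ".join(row_a), "/", " ".join(row_b))
```

Because the lookups use `matrix[a, b]` with the symbol of the first sequence first, and gap scores are read from `matrix[a, "-"]` and `matrix["-", b]` separately, the same code also accepts asymmetric matrices.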


But what if the matrix is not symmetric? This is unusual for genetic data, but for the sake of experiment let's simulate a condition where a C in sequence A is never aligned with a T in sequence B (in the opposite direction, the normal affinities hold). We simulate this by changing the matrix entry for (C, T), which now differs from the entry for (T, C), and check the results.

In [4]:
# Perform pairwise, asymmetric NW
matrix = malign.utils.DNA_MATRIX.copy()
matrix['C', 'T'] = -99
print_alms(malign.pw_align(seq_a, seq_b, k=4, method="nw", matrix=matrix))

A 0 (-3.00 / -5.50): G A - T T A C A
B 0 (-8.00 / -5.50): - A T T T - - -
A 1 (-3.00 / -5.50): G A T - T A C A
B 1 (-8.00 / -5.50): - A T T T - - -
A 2 (-3.00 / -5.50): G A T T - A C A
B 2 (-8.00 / -5.50): - A T T T - - -
A 3 (-3.00 / -5.50): G A T T A C A -
B 3 (-8.00 / -5.50): - A T T - - - T
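
When experimenting with edits like this one, it can help to list exactly where a matrix departs from symmetry. The sketch below does so for a dictionary keyed by symbol pairs over a single shared alphabet. `asymmetric_pairs` is a hypothetical helper, and the small matrix is a hand-copied subset of the DNA table above, not the library's object.

```python
def asymmetric_pairs(matrix):
    """Return the (a, b) pairs, with a < b, whose score differs from (b, a)."""
    return sorted(
        (a, b)
        for (a, b), score in matrix.items()
        if a < b and matrix.get((b, a)) != score
    )


# Subset of the DNA table restricted to C, T and the gap
small = {
    ("C", "C"): 9, ("C", "T"): 0, ("C", "-"): -5,
    ("T", "C"): 0, ("T", "T"): 8, ("T", "-"): -5,
    ("-", "C"): -5, ("-", "T"): -5, ("-", "-"): 0,
}
print(asymmetric_pairs(small))   # symmetric so far

modified = small.copy()
modified["C", "T"] = -99         # same edit as in the cell above
print(asymmetric_pairs(modified))
```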


You can see that the algorithm now does its best not to align a C in sequence A with a T in sequence B. This can go to extremes, as in the first example below, even though the inverse alignment still works as it normally does.

In [5]:
print("===== Alignment 1")
print_alms(malign.pw_align("GATTACA", "GATTATA", k=2, method="nw", matrix=matrix))
print("===== Alignment 2")
print_alms(malign.pw_align("GATTATA", "GATTACA", k=2, method="nw", matrix=matrix))

===== Alignment 1
A 0 (-3.00 / -3.00): G A T T A - C A
B 0 (-3.00 / -3.00): G A T T A T - A
A 1 (-3.00 / -3.00): G A T T A C - A
B 1 (-3.00 / -3.00): G A T T A - T A
===== Alignment 2
A 0 (0.00 / 0.00): G A T T A T A
B 0 (0.00 / 0.00): G A T T A C A


The asymmetry starts to make more sense when we deal with different alphabets, or at least with sequences that come from different domains. Take the manually crafted example below, where we prepare a matrix to align the spellings of Italian and Russian cognates, both written in their standard orthographies. Note how the matrix is asymmetric even in terms of gaps: in some cases a gap is more or less likely in the correspondence, and some gap scores are even higher than the scores for correspondences with other graphemes.

  |   | а  | т  | о  | м  | Я  | к  | в  | -  |
  |---|----|----|----|----|----|----|----|----|
  | a | 10 | -4 | 3  | -3 | 5  | -4 | -4 | -1 |
  | t | -4 | 10 | -4 | -3 | -4 | -1 | -3 | -3 |
  | o | 4  | -4 | 10 | -5 | 2  | -4 | -2 | 0  |
  | m | -2 | -1 | -4 | 9  | -3 | -2 | 2  | 1  |
  | G | -3 | 1  | -3 | -1 | 4  | 3  | -2 | 2  |
  | i | 5  | -3 | 4  | 0  | 6  | -3 | -3 | 2  |
  | c | -4 | -1 | -5 | -4 | -5 | 4  | 1  | -3 |
  | - | 1  | -2 | -1 | 2  | 2  | -4 | 2  | 0  |

In [6]:
matrix = {("a", "а"):10,
("a", "т"):-4,
("a", "о"):3,
("a", "м"):-3,
("a", "Я"):5,
("a", "к"):-4,
("a", "в"):-4,
("a", "-"):-1,

("t", "а"):-4,
("t", "т"):10,
("t", "о"):-4,
("t", "м"):-3,
("t", "Я"):-4,
("t", "к"):-1,
("t", "в"):-3,
("t", "-"):-3,

("o", "а"):4,
("o", "т"):-4,
("o", "о"):10,
("o", "м"):-5,
("o", "Я"):2,
("o", "к"):-4,
("o", "в"):-2,
("o", "-"):0,

("m", "а"):-2,
("m", "т"):-1,
("m", "о"):-4,
("m", "м"):9,
("m", "Я"):-3,
("m", "к"):-2,
("m", "в"):2,
("m", "-"):1,

("G", "а"):-3,
("G", "т"):1,
("G", "о"):-3,
("G", "м"):-1,
("G", "Я"):4,
("G", "к"):3,
("G", "в"):-2,
("G", "-"):2,

("i", "а"):5,
("i", "т"):-3,
("i", "о"):4,
("i", "м"):0,
("i", "Я"):6,
("i", "к"):-3,
("i", "в"):-3,
("i", "-"):2,

("c", "а"):-4,
("c", "т"):-1,
("c", "о"):-5,
("c", "м"):-4,
("c", "Я"):-5,
("c", "к"):4,
("c", "в"):1,
("c", "-"):-3,

("-", "а"):1,
("-", "т"):-2,
("-", "о"):-1,
("-", "м"):2,
("-", "Я"):2,
("-", "к"):-4,
("-", "в"):2,
("-", "-"):0,}

print("===== Alignment 1")
print_alms(malign.pw_align("atomo", "атом", k=2, method="nw", matrix=matrix))
print("===== Alignment 2")
print_alms(malign.pw_align("Giacomo", "Яков", k=4, method="nw", matrix=matrix))

===== Alignment 1
A 0 (0.00 / -1.50): a t o m o
B 0 (-3.00 / -1.50): а т о м -
===== Alignment 2
A 0 (0.00 / -4.50): G i a c o m o
B 0 (-9.00 / -4.50): - Я - к о в -
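
Since every symbol pair, gaps included, needs a score, a quick completeness check can catch omissions in hand-written matrices like the one above before they reach the aligner. The sketch below is plain Python; `missing_pairs` is a hypothetical helper and not part of `malign`.

```python
from itertools import product


def missing_pairs(matrix, alphabet_a, alphabet_b, gap="-"):
    """List the (a, b) pairs, gaps included, that have no score in `matrix`."""
    rows = list(alphabet_a) + [gap]
    cols = list(alphabet_b) + [gap]
    return [pair for pair in product(rows, cols) if pair not in matrix]


# A deliberately incomplete Italian -> Russian fragment
partial = {
    ("a", "а"): 10,
    ("a", "-"): -1,
    ("-", "а"): 1,
    ("-", "-"): 0,
}
print(missing_pairs(partial, "a", "а"))    # nothing missing for one symbol each
print(missing_pairs(partial, "at", "ат"))  # pairs involving t / т are missing
```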


We can combine more domains for a multiple NW alignment. Unlike most multiple NW methods, we don't build a tree internally, because an evolutionary perspective does not necessarily apply in our cases (even with an unrooted tree); details are provided in the paper. Here we extend our matrix a bit with Greek cognates. For showcasing, we set only the actual correspondences, which guides the alignment a bit too much, and let the library fill in the missing parts. We also add correspondences for two Italian graphemes, "v" and "s", which were not used in the example above.
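
To make "filling the missing parts" concrete, here is a sketch of one plausible filling strategy: keep every score that was given, and assign a default mismatch or gap score to every other pair. `fill_missing` is a hypothetical stand-in, not the code of `malign.utils.fill_matrix`, and scoring a gap against a gap as 0 is an assumption of this sketch.

```python
from itertools import product


def fill_missing(alphabet_a, alphabet_b, scores, mismatch=-5, gap_score=-5, gap="-"):
    """Return a copy of `scores` with defaults for every unlisted pair."""
    filled = dict(scores)
    rows = list(alphabet_a) + [gap]
    cols = list(alphabet_b) + [gap]
    for a, b in product(rows, cols):
        if (a, b) in filled:
            continue                    # keep scores that were set by hand
        if a == gap and b == gap:
            filled[a, b] = 0            # assumption: gap vs. gap is neutral
        elif a == gap or b == gap:
            filled[a, b] = gap_score
        else:
            filled[a, b] = mismatch
    return filled


sparse = {("a", "α"): 10, ("t", "τ"): 10}
dense = fill_missing("at", "ατ", sparse)
print(dense["a", "α"], dense["a", "τ"], dense["-", "α"])
```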

In [8]:
# Build a very simple Italian <-> Greek matrix

italian_greek = {    
  ("a", "α"):10,
  ("o", "α"):4,
  ("i", "α"):5,
  ("t", "τ"):10,
  ("c", "τ"):2,
  ("a", "ο"):6,
  ("o", "ο"):10,
  ("i", "ο"):3,
  ("m", "μ"):10,
  ("a", "Ι"):2,
  ("i", "Ι"):7,
  ("t", "κ"):2,
  ("c", "κ"):8,
  ("a", "ω"):4,
  ("o", "ω"):10,
  ("m", "β"):3,
  ("v", "β"):5,
  ("s", "ς"):-3,
}
    
# Fill the matrix
ita_alphabet = list(set([key[0] for key in italian_greek]))
grk_alphabet = list(set([key[1] for key in italian_greek]))
italian_greek = malign.utils.fill_matrix(ita_alphabet, grk_alphabet, italian_greek, mismatch=-5, gap=-5)

# Combine the two matrices into a single one, show a couple of examples
full_matrix = malign.utils.combine_matrices(matrix, italian_greek)
for key in [('-', 'к', 'ο'), ('i', 'а', 'Ι'), ('m', 'в', 'β')]:
    print(key, full_matrix[key])

# Perform the multiple alignments
print("===== Alignment 1")
seqs = ['atomo', "атом", "ατομο"]
print_malms(malign.multi_align(seqs, method="nw", k=4, matrix=full_matrix))

print("===== Alignment 2")
seqs = ["Giacomo", "Яков", "Ιακωβος"]
print_malms(malign.multi_align(seqs, method="nw", k=4, matrix=full_matrix))

('-', 'к', 'ο') -4.5
('i', 'а', 'Ι') 6.0
('m', 'в', 'β') 2.5
===== Alignment 1
0 A (7.60) : ('a', 't', 'o', 'm', 'o')
0 B (7.60) : ('а', 'т', 'о', '-', 'м')
0 C (7.60) : ('α', 'τ', 'ο', 'μ', 'ο')
1 A (8.90) : ('a', 't', 'o', 'm', 'o')
1 B (8.90) : ('а', 'т', 'о', 'м', '-')
1 C (8.90) : ('α', 'τ', 'ο', 'μ', 'ο')
===== Alignment 2
0 A (3.67) : ('G', 'i', 'a', 'c', 'o', 'm', 'o', '-')
0 B (3.67) : ('-', '-', 'Я', '-', '-', 'к', 'о', 'в')
0 C (3.67) : ('-', 'Ι', 'α', 'κ', 'ω', 'β', 'ο', 'ς')
1 A (4.30) : ('G', 'i', 'a', 'c', 'o', 'm', 'o', '-')
1 B (4.30) : ('-', '-', 'Я', 'к', '-', '-', 'о', 'в')
1 C (4.30) : ('-', 'Ι', 'α', 'κ', 'ω', 'β', 'ο', 'ς')
2 A (4.05) : ('G', 'i', 'a', 'c', 'o', 'm', 'o', '-')
2 B (4.05) : ('-', '-', 'Я', 'к', 'о', '-', 'в', '-')
2 C (4.05) : ('-', 'Ι', 'α', 'κ', 'ω', 'β', 'ο', 'ς')
3 A (3.92) : ('G', 'i', 'a', 'c', 'o', 'm', 'o', '-')
3 B (3.92) : ('-', 'Я', '-', 'к', 'о', '-', 'в', '-')
3 C (3.92) : ('-', 'Ι', 'α', 'κ', 'ω', 'β', 'ο', 'ς')
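
The three combined scores printed at the top of the output are each the mean of the two pairwise scores that share the Italian symbol: for example, ('i', 'а') is 5 and ('i', 'Ι') is 7, giving 6.0. The sketch below reproduces those three values from the corresponding pairwise entries. Whether `malign.utils.combine_matrices` computes exactly this in general is an assumption; `combine_by_mean` is a hypothetical helper.

```python
def combine_by_mean(scores_ab, scores_ac):
    """Build (a, b, c) scores as the mean of the (a, b) and (a, c) scores."""
    combined = {}
    for (a, b), score_ab in scores_ab.items():
        for (a_other, c), score_ac in scores_ac.items():
            if a == a_other:
                combined[a, b, c] = (score_ab + score_ac) / 2
    return combined


# Entries taken from the Italian-Russian matrix and the filled
# Italian-Greek matrix (the gap entry uses the gap=-5 default)
ita_rus = {("-", "к"): -4, ("i", "а"): 5, ("m", "в"): 2}
ita_grk = {("-", "ο"): -5, ("i", "Ι"): 7, ("m", "β"): 3}

triples = combine_by_mean(ita_rus, ita_grk)
for key in [("-", "к", "ο"), ("i", "а", "Ι"), ("m", "в", "β")]:
    print(key, triples[key])
```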
