# Aggregation alignment sandbox
### Primary issue (14 March 2020)
With the original scoring algorithm, it was acceptable to aggregate scores from the left side only. That no longer makes sense now that each ion type is scored separately: for y ions, we want to aggregate from the right. For example, take the following sequence.

Sequence: `MALWARMSTRV`

For our b ion scores, we want to aggregate kmer scores from the left. To identify this sequence, we want to score kmers that grow from the left end, like this:

```
MAL
MALWA
MALWARM
...
MALWARMSTRV
```

Aggregating here will capture the rise and fall of the score.

### Solution

However, with the left-anchored kmers above, the only full hit we get for the y ion score is the last one, the full sequence. To get better aggregation scores for the y ions, we want kmers that grow from the right end instead:

```
        TRV
      MSTRV
        ...
MALWARMSTRV
```

This should give us a better aggregation score for the y ions.
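
The two alignments can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `anchored_kmers` is a hypothetical helper that is not part of the `scoring` or `analysis` modules, and the kmer lengths are chosen to mirror the diagrams above.

```python
def anchored_kmers(seq, ks, side='b'):
    """Return one kmer per k, anchored at the left ('b') or right ('y') end.

    Hypothetical helper for illustration only; it is not part of the
    modules imported elsewhere in this notebook.
    """
    if side not in ('b', 'y'):
        raise ValueError("side must be 'b' or 'y'")
    # b side: prefixes that grow from the left end of the sequence
    # y side: suffixes that grow from the right end of the sequence
    return [seq[:k] if side == 'b' else seq[len(seq) - k:] for k in ks]


seq = 'MALWARMSTRV'
ks = [3, 5, 7, 9, 11]

for kmer in anchored_kmers(seq, ks, side='b'):
    print(kmer)

for kmer in anchored_kmers(seq, ks, side='y'):
    # right-justify so the shared right end lines up visually
    print(kmer.rjust(len(seq)))
```

The first loop prints the left-aligned staircase and the second prints the right-aligned one.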

### Import scoring tools

In [2]:
import sys
sys.path.append('/Users/zacharymcgrath/Documents/Layer_Research/Proteomics_Experiments/Database_Experiments/src')

from scoring import comparisons
from analysis import aggregations

### Mock data

In [3]:
sequence = 'MALWARMSTRVK'
ks = [4, 6, 8, 10, 12]
def make_mers(k, seq):
    # every contiguous substring of length k, from left to right
    return [seq[i: i+k] for i in range(len(seq) - k + 1)]
kmers = {'k={}'.format(k): make_mers(k, sequence) for k in ks}
print(kmers)

{'k=4': ['MALW', 'ALWA', 'LWAR', 'WARM', 'ARMS', 'RMST', 'MSTR', 'STRV', 'TRVK'], 'k=6': ['MALWAR', 'ALWARM', 'LWARMS', 'WARMST', 'ARMSTR', 'RMSTRV', 'MSTRVK'], 'k=8': ['MALWARMS', 'ALWARMST', 'LWARMSTR', 'WARMSTRV', 'ARMSTRVK'], 'k=10': ['MALWARMSTR', 'ALWARMSTRV', 'LWARMSTRVK'], 'k=12': ['MALWARMSTRVK']}


### Aligners for b and y side

In [4]:
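def print_b_alignment(kmers, seq):
    """Print every kmer of one length under seq, offset by its start index."""
    print(seq)
    for i, kmer in enumerate(kmers):
        print(' ' * i + kmer)

def print_y_alignment(kmers, seq):
    """Same as print_b_alignment, but listing the kmers from last to first."""
    print(seq)
    for i in range(len(kmers)-1, -1, -1):
        print(' ' * i + kmers[i])

def print_b_aligned_kmers(kmers, seq):
    """Print the first kmer for each k; all of them share the left end of seq."""
    print(seq)
    for mers in kmers.values():
        print(mers[0])

def print_y_aligned_kmers(kmers, seq):
    """Print the last kmer for each k, right-justified to the right end of seq."""
    print(seq)
    for mers in kmers.values():
        print(' ' * (len(seq) - len(mers[-1])) + mers[-1])

print_y_aligned_kmers(kmers, sequence)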

MALWARMSTRVK
        TRVK
      MSTRVK
    ARMSTRVK
  LWARMSTRVK
MALWARMSTRVK


### Scoring the current way

In [5]:
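def score_mers(mers, seq, ion):
    # score every kmer of one length against the full sequence for one ion type
    return [comparisons.compare_sequence_sequence_ion_type(m, seq, ion) for m in mers]
def score_kmers(kmers, seq, ion):
    # apply score_mers to each 'k=...' entry
    return {k: score_mers(kmers[k], seq, ion) for k in kmers}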
b_scored = score_kmers(kmers, sequence, 'b')
y_scored = score_kmers(kmers, sequence, 'y')
print(b_scored)
print('')
print(y_scored)

{'k=4': [1.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.0, 0.0], 'k=6': [1.9166666666666667, 0.25, 0.0, 0.0, 0.0, 0.0, 0.25], 'k=8': [2.5833333333333335, 0.25, 0.0, 0.0, 0.0], 'k=10': [3.25, 0.25, 0.0], 'k=12': [3.9166666666666665]}

{'k=4': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.25], 'k=6': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.9166666666666667], 'k=8': [0.0, 0.0, 0.0, 0.0, 2.5833333333333335], 'k=10': [0.0, 0.0, 3.25], 'k=12': [3.9166666666666665]}


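We see that the top scores are the same for both ion types, just in different positions: the b hits sit at the start of each list and the y hits at the end. We need to account for this when aggregating. Longer kmers produce shorter score lists, so for the y ions we should pad those lists with zeros on the left, but that still leaves us with the question of how to keep track of the padding.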

#### Note
We already have a function that pads zeros to the right of the score lists for longer kmers (their lists are shorter), so we can do the same on the left. We just need some way to keep track of it.
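
One way to keep track of the padding is to return the number of zeros added along with the padded list. The sketch below assumes a hypothetical helper, `pad_to_length`; it is not `score_utils.pad_scores`, whose exact signature and return value may differ.

```python
def pad_to_length(scores, length, side='r'):
    """Pad `scores` with zeros up to `length` on the right ('r') or left ('l').

    Returns the padded list and the number of zeros added, so the caller
    knows how far the original scores were shifted.
    Hypothetical helper; not the project's score_utils.pad_scores.
    """
    if side not in ('l', 'r'):
        raise ValueError("side must be 'l' or 'r'")
    n_pad = max(length - len(scores), 0)
    zeros = [0.0] * n_pad
    padded = zeros + list(scores) if side == 'l' else list(scores) + zeros
    return padded, n_pad


y_k10 = [0.0, 0.0, 3.25]  # y scores for k=10, copied from the output above
padded, offset = pad_to_length(y_k10, 9, side='l')
print(padded)  # the 3.25 hit stays in the last position
print(offset)  # number of leading zeros that were added
```

With left padding, index `i` of the padded list corresponds to index `i - offset` of the original list, which is one possible answer to the bookkeeping question.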

In [8]:
from analysis import score_utils

normalized_bs = {}
normalized_ys = {}

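for k in b_scored:
    # pad every list to the length of the k=4 list: b on the right, y on the left
    normalized_bs[k], _ = score_utils.pad_scores(b_scored[k], b_scored['k=4'])
    normalized_ys[k], _ = score_utils.pad_scores(y_scored[k], y_scored['k=4'], side='l')
    # add 3 more zeros on the same side, giving 12 entries (one per residue of `sequence`)
    normalized_bs[k] = normalized_bs[k] + [0] * 3
    normalized_ys[k] = [0] * 3 + normalized_ys[k]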
    
print(normalized_bs)
print('')
print(normalized_ys)

{'k=4': [1.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.0, 0.0, 0, 0, 0], 'k=6': [1.9166666666666667, 0.25, 0.0, 0.0, 0.0, 0.0, 0.25, 0, 0, 0, 0, 0], 'k=8': [2.5833333333333335, 0.25, 0.0, 0.0, 0.0, 0, 0, 0, 0, 0, 0, 0], 'k=10': [3.25, 0.25, 0.0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'k=12': [3.9166666666666665, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

{'k=4': [0, 0, 0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.25], 'k=6': [0, 0, 0, 0, 0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.9166666666666667], 'k=8': [0, 0, 0, 0, 0, 0, 0, 0.0, 0.0, 0.0, 0.0, 2.5833333333333335], 'k=10': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0, 0.0, 3.25], 'k=12': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3.9166666666666665]}


In [40]:
b_agged = aggregations.__z_score_sum(normalized_bs)
y_agged = aggregations.__z_score_sum(normalized_ys)
print(b_agged)
print('')
print(y_agged)

[13.113797840011015, -0.9527575573559024, -1.8198739859607125, -1.8198739859607125, -1.8198739859607125, -1.8198739859607125, -1.2417963668908392, -1.8198739859607125, -1.8198739859607125]

[-1.6485630655196284, -1.6485630655196284, -1.6485630655196284, -1.6485630655196284, -1.6485630655196284, -1.6485630655196284, -1.6485630655196284, -1.6485630655196284, 13.188504524157029]


### Explanation
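With this padding we get essentially the same aggregation for b and y, just reversed, which makes sense: the peak is at the first position for b and at the last position for y. The values differ slightly because the b scores include a few partial hits of 0.25. This is what we want. Below we run the aggregation the way it currently works, to make the point.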

In [42]:
current_bs = {}
current_ys = {}

for k in b_scored:
    current_bs[k], _ = score_utils.pad_scores(b_scored[k], b_scored['k=4'])
    current_ys[k], _ = score_utils.pad_scores(y_scored[k], y_scored['k=4'])
    
b_agged_current = aggregations.__z_score_sum(current_bs)
y_agged_current = aggregations.__z_score_sum(current_ys)
print(b_agged_current)
print('')
print(y_agged_current)

[13.113797840011015, -0.9527575573559024, -1.8198739859607125, -1.8198739859607125, -1.8198739859607125, -1.8198739859607125, -1.2417963668908392, -1.8198739859607125, -1.8198739859607125]

[2.8504187197371644, -1.6485630655196284, 2.084634586076433, -1.6485630655196284, 1.318850452415703, -1.6485630655196284, 0.5530663187549727, -1.6485630655196284, -0.21271781490575842]


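We get the same results for the b scores, but the y scores don't make much sense and we can't learn much from them. Without left padding, the single y hit for each k lands at a different index (0, 2, 4, 6 and 8), so the peaks never line up. It is better to use the side-aware aggregation.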
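
The misalignment can be checked directly from the y scores printed earlier. The sketch below uses plain Python lists rather than `score_utils.pad_scores`, and assumes the current behavior appends zeros on the right, as the note above describes.

```python
# y ion scores per k, copied from the scoring output earlier in this notebook
y_scored = {
    'k=4': [0.0] * 8 + [1.25],
    'k=6': [0.0] * 6 + [1.9166666666666667],
    'k=8': [0.0] * 4 + [2.5833333333333335],
    'k=10': [0.0] * 2 + [3.25],
    'k=12': [3.9166666666666665],
}
target = len(y_scored['k=4'])


def peak_index(scores):
    """Index of the largest score in a list."""
    return max(range(len(scores)), key=scores.__getitem__)


right_peaks = {}
left_peaks = {}
for k, scores in y_scored.items():
    zeros = [0.0] * (target - len(scores))
    right_peaks[k] = peak_index(scores + zeros)  # zeros appended on the right
    left_peaks[k] = peak_index(zeros + scores)   # zeros prepended on the left

print(right_peaks)
print(left_peaks)
```

With right padding the peaks fall at indices 8, 6, 4, 2 and 0 for k=4 through k=12, while with left padding every peak falls at index 8, so only the left-padded lists reinforce each other when summed position by position.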