# Nate k-mer splicing algo

This notebook tests implementations of different algorithm ideas and compares their accuracy and efficiency.

# Theoretical setup

Let $S_1$ and $S_2$ be ordered sets of length $k$. Let $G$ be the target set of length $k$, made up of some subset of $S_1$ followed by some subset of $S_2$.

# Notes

* In reality, not all of $S_1$ and $S_2$ are known.
* Need efficient way to compare many different sets
* Need to determine some kind of tiebreaker
* Precursor masses are usually known within 10 ppm

# Comparison Algorithm

In [31]:
%%time
k = 4 #k will always be known
S_1 = [5,6,17,21] #defining sample parameters
S_2 = [3,4,12,13]

G = [5,6,12,13]
G_val = 0 #Integer to hold what parts of G have been "figured out"

score = 0 #start at 0

for i in range(k):
    if (S_1[i] == G[G_val]):
        score = score + 1
        G_val = G_val + 1
        
for j in range(k):
    if (G_val < k and S_2[j] == G[G_val]):
        score = score + 1
        G_val = G_val + 1
        
print(score)

4
Wall time: 1e+03 µs


# Realistic Algorithm

The algorithm above is heavily simplified and runs in an abstract setting. A realistic version needs to account for the biology, such as incomplete information.

In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from src import gen_spectra
from src import utils
from src.scoring import scoring
from src.objects import Spectrum

In [3]:
gen_spectra.gen_spectrum("MALWAR", 1)

{'spectrum': [132.04776143499998,
  203.084875435,
  316.16893943499997,
  502.248252435,
  573.285366435,
  729.386477435,
  175.118952135,
  246.156066135,
  432.235379135,
  545.319443135,
  616.356557135,
  747.397042135],
 'precursor_mass': 747.3970421}
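
To make the output above easier to read: the first six values look like singly charged b-ions and the last six like singly charged y-ions of MALWAR. The sketch below reproduces them under that assumption. It is not the `gen_spectra` implementation; `singly_charged_ions` is a hypothetical helper, and the masses are standard monoisotopic values rounded to five decimals.

```python
# Monoisotopic residue masses in Da (only the residues needed for MALWAR).
RESIDUE_MASS = {
    "M": 131.04049, "A": 71.03711, "L": 113.08406,
    "W": 186.07931, "R": 156.10111,
}
WATER = 18.010565   # H2O, added to every y-ion
PROTON = 1.007276   # charge carrier for a singly charged ion

def singly_charged_ions(sequence):
    """Return (b_ions, y_ions) as m/z lists for charge 1 (hypothetical helper)."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses:                # b-ions grow from the N-terminus
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses):      # y-ions grow from the C-terminus
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions

b_ions, y_ions = singly_charged_ions("MALWAR")
```

Under these assumptions the last y-ion is the full peptide plus water and a proton, which is the value reported as `precursor_mass`.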

In [4]:
utils.ppm_to_da(1000,20)

0.02
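
The result above is consistent with the usual conversion, tolerance in Da = mass × ppm / 10^6. A minimal sketch of that conversion and of a ppm window check follows; `ppm_to_da_sketch` and `within_ppm` are hypothetical names, and this is not necessarily how `utils.ppm_to_da` is written.

```python
def ppm_to_da_sketch(mass, ppm_tolerance):
    """Convert a ppm tolerance at a given mass into an absolute width in Da."""
    return mass * ppm_tolerance / 1_000_000

def within_ppm(observed, theoretical, ppm_tolerance):
    """True if observed lies inside the ppm window centred on theoretical."""
    return abs(observed - theoretical) <= ppm_to_da_sketch(theoretical, ppm_tolerance)

# A 10 ppm window at the MALWAR precursor is roughly 0.0075 Da wide on each side.
window = ppm_to_da_sketch(747.397042, 10)
```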

In [5]:
%%time
def Compare(extended_b, extended_y, parent_sequence, k): #Compare scores the set of b and y ions to see if these two sets make up the target (parent) sequence
                      #k is the k-mer length; k and parent_sequence are passed in because neither was defined in this cell

    G_val = 0 #Index of the next part of the parent sequence still to be "figured out"

    score = 0 #start at 0
    for i in range(k):
        if (G_val < len(parent_sequence) and extended_b[i] == parent_sequence[G_val]):
            score = score + 1
            G_val = G_val + 1

    for j in range(k):
        if (G_val < len(parent_sequence) and extended_y[j] == parent_sequence[G_val]):
            score = score + 1
            G_val = G_val + 1

    print(score)
    return score #returned as well as printed so callers can compare candidates

Wall time: 0 ns


# Other ideas

* Determining a good cutoff: when do we know that a sequence is "good enough"? If this can be determined, we can discard many of the candidates and do a detailed search of the remaining few.
* Checked each value against the list of 20 amino acids

# Idea 2

* The goal of this algorithm is to identify the sequence from a spectrum.
* This is done by reading through the ions until we find the first element, then running through the rest and grabbing whatever pieces we can from there. Is there a way to fill in the gaps later?
* The algorithm below is fairly brute force and there are more elegant solutions, but the most important thing is to be fairly certain which peptide is being analyzed
* Should be modified to use a database search for additional confidence
* * Possibly performing a database search to check for missing information
* Way to learn when the "divide" happens
* * Could check database to get set of possible extensions for each amino acid but that could be very computationally heavy.
* Algorithm could look through the b ions and create an alignment based on the information present, then create an alignment from the y ions and check whether the two agree. What happens if they don't agree?

# Idea 2 pseudocode

* Loop through all of the extended b-ions
* Build target sequence from b-ions
* Build target sequence from y-ions
* * While building sequence if we skip an ion we can use the database to fill it in (what does it mean to skip an ion?)
* * Note that the length of the target sequence can be estimated from the precursor mass
* Compare and use database to validate sequence
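
The note above about estimating the target length from the precursor mass can be made concrete as a pair of bounds. This is a minimal sketch, assuming unmodified peptides built from the 20 standard amino acids; `length_bounds` is a hypothetical helper and the constants are standard monoisotopic masses.

```python
import math

WATER = 18.010565
PROTON = 1.007276
LIGHTEST_RESIDUE = 57.02146    # glycine
HEAVIEST_RESIDUE = 186.07931   # tryptophan

def length_bounds(precursor_mz, charge=1):
    """Return (shortest, longest) possible peptide length for a precursor m/z."""
    neutral = precursor_mz * charge - charge * PROTON   # remove the charge carriers
    residue_total = neutral - WATER                     # mass of the residues alone
    shortest = math.ceil(residue_total / HEAVIEST_RESIDUE)
    longest = math.floor(residue_total / LIGHTEST_RESIDUE)
    return shortest, longest

# MALWAR precursor from gen_spectrum above: true length 6 lies inside the bounds.
bounds = length_bounds(747.3970421, charge=1)
```

The bounds are loose, but they cap how far any extension needs to go before it can be discarded.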

In [4]:
%%time
Amino_name = ['A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
Amino_mass = [71.0788, 156.1875, 114.1038, 115.0886, 103.1388, 129.1155, 128.1307, 57.0519, 137.1411, 113.1594, 113.1594, 128.1741, 131.1926, 147.1766, 97.1167, 87.0782, 101.1051, 186.2132, 163.1760, 99.1326] #Average residue masses in Da
tolerance = .01 #Guess as to what the allowable tolerance is
#The expected length of Target_Sequence can be estimated from the precursor mass

def find_sequence(extended_b, extended_y, precursor_mass): #Input: lists of extended b and y ion masses. Output: a hybrid sequence and its score
    #Assumption: both lists are in ascending order and hold cumulative residue masses, with the proton (and water for y ions) already subtracted
    Target_Sequence = []
    score = 0
    temp_total = 0
    for b_mass in extended_b: #Checking b-ions
        for j in range(20): #For each listed amino acid
            if abs(b_mass - temp_total - Amino_mass[j]) <= tolerance: #If the mass step matches the mass of an amino acid within the tolerance
                Target_Sequence.append(Amino_name[j]) #I and L share a mass, so I is always reported
                temp_total = temp_total + Amino_mass[j]
                break

    temp_total = 0
    for i, y_mass in enumerate(extended_y, start=1): #Checking the y-ions, which read the sequence from its end
        for j in range(20):
            if abs(y_mass - temp_total - Amino_mass[j]) <= tolerance: #If the mass step matches the mass of an amino acid within the tolerance
                if (i <= len(Target_Sequence) and Target_Sequence[-i] == Amino_name[j]):
                    score = score + 1
                temp_total = temp_total + Amino_mass[j]
                break

    total = 0
    for name in Target_Sequence: #checking precursor mass
        total = total + Amino_mass[Amino_name.index(name)]
    if (total > precursor_mass + tolerance):
        print("bad match")
        return None

    return Target_Sequence, score #The sequence with the highest score is the most correct

# Idea 3 

* Loop through b ions until match is found. Insert into target sequence
* Loop through y ions until match is found. Insert into end of target sequence
* Check if b ions match this sequence and if y ions match this sequence.
* If not, use database to check expected extensions according to precursor mass since at this point we will be able to know where the divide is

# Current algorithm idea

* extend all the b and y k-mers



# Current ideas

* Optimizing current algorithm which generates all possible extensions of base b,y k-mers and then cuts off from the ends until combined mass matches precursor
* * Dynamic approach could shorten time

* Creating a b tree and y tree and seeing what we can create from that while staying under precursor mass
* * Use of trees could make dynamic program appealing
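
As a baseline for the first idea above, trimming the ends of two extensions until the combined mass matches the precursor can be sketched as a brute-force search. Everything here is an assumption for illustration: `trim_to_precursor` is a hypothetical helper, the example extensions are made up, and the masses are standard monoisotopic values. A dynamic or two-pointer version would avoid recomputing the sums.

```python
# Monoisotopic residue masses in Da (only the residues used in this example).
RESIDUE_MASS = {
    "M": 131.04049, "A": 71.03711, "L": 113.08406, "Q": 128.05858,
    "T": 101.04768, "W": 186.07931, "R": 156.10111, "K": 128.09496,
    "G": 57.02146,
}
WATER = 18.010565

def residue_sum(seq):
    return sum(RESIDUE_MASS[aa] for aa in seq)

def trim_to_precursor(b_ext, y_ext, neutral_mass, k=3, tol=0.01):
    """Return (b_part, y_part) pairs whose combined mass matches neutral_mass.

    b_ext is trimmed from its right end and y_ext from its left end;
    each part keeps at least its base k residues.
    """
    matches = []
    for i in range(k, len(b_ext) + 1):
        b_part = b_ext[:i]
        for j in range(len(y_ext) - k, -1, -1):
            y_part = y_ext[j:]
            total = residue_sum(b_part) + residue_sum(y_part) + WATER
            if abs(total - neutral_mass) <= tol:
                matches.append((b_part, y_part))
    return matches

target_mass = residue_sum("MALQTWAR") + WATER
matches = trim_to_precursor("MALQTKK", "GGTWAR", target_mass)
```

Note that more than one cut can reproduce the same mass (here `MALQ` + `TWAR` and `MALQT` + `WAR`), which is one reason a tiebreaker is needed.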

# How to do dynamic approach to first idea?

* Dynamic improvements could potentially happen in two places
* * First is during extension of base b and y k-mers
* * Second is during scoring

* First idea seems more valuable because that is where the most time is being taken up

# base k-mer extension principle

* Starting with a 3-mer (in both the b and y ions), generate all possible extensions until the theoretical mass > precursor mass
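
A minimal sketch of this extension step for the b side, assuming the database is a single protein string and the k-mer is extended to the right one residue at a time. `extend_right` is a hypothetical helper, and whether to keep the final extension that passes the mass limit is left open above; this sketch drops it.

```python
# Monoisotopic residue masses in Da (only the residues used in this example).
RESIDUE_MASS = {
    "G": 57.02146, "M": 131.04049, "A": 71.03711, "L": 113.08406,
    "Q": 128.05858, "T": 101.04768, "W": 186.07931, "R": 156.10111,
    "K": 128.09496,
}

def extend_right(protein, kmer, mass_limit):
    """Return every right extension of kmer in protein with residue mass <= mass_limit."""
    extensions = []
    start = protein.find(kmer)
    while start != -1:                          # visit every occurrence of the k-mer
        total = sum(RESIDUE_MASS[aa] for aa in kmer)
        end = start + len(kmer)
        while total <= mass_limit:
            extensions.append(protein[start:end])
            if end == len(protein):             # ran off the end of the protein
                break
            total += RESIDUE_MASS[protein[end]]  # running sum, no recomputation
            end += 1
        start = protein.find(kmer, start + 1)
    return extensions

extensions = extend_right("GGMALQTWARKK", "MAL", mass_limit=550.0)
```

The y side would mirror this by extending to the left from the end of the k-mer.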

# Dynamic approach to base k-mer extensions

* Original problem: let $A$ denote the array defined below.
* $A := \{x \in \text{Database} : x \text{ starts with the 3-mer},\ \text{mass}(x) > \text{mass}_{\text{precursor}} + \text{tolerance}\}$

* Subproblems: find the maximum of $A[0,\ldots,i]$ and the maximum of $A[i,\ldots,n]$
* Solving these subproblems solves the original problem, because the larger of the two maxima is the maximum of all of $A$.
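
A small sketch of these subproblems, assuming each element of A has already been reduced to a numeric score (the notes do not say how; the scores below are made up). `prefix_suffix_max` is a hypothetical helper that tabulates the best score of every prefix and suffix in linear time.

```python
from itertools import accumulate

def prefix_suffix_max(scores):
    """Return (prefix, suffix) where prefix[i] = max(scores[0..i])
    and suffix[i] = max(scores[i..n-1])."""
    prefix = list(accumulate(scores, max))
    suffix = list(accumulate(reversed(scores), max))[::-1]
    return prefix, suffix

scores = [3, 1, 4, 1, 5]   # made-up scores for five candidate extensions
prefix, suffix = prefix_suffix_max(scores)
```

For every split point `i`, `max(prefix[i], suffix[i])` equals `max(scores)`, which is the property the subproblem decomposition relies on.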

* Possibly apply a Burrows-Wheeler transform at the start to save space

* Generate trie where the nodes represent extensions and the edges represent scores

* Kind of like the knapsack problem, where the size of the bag is the number of missing extensions and the score is the quantity to maximize.

* Go both ways and compare sequences?
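
The trie idea above can be sketched as follows, assuming each edge carries the score gained by adding one residue. `TrieNode`, `insert`, `best_path`, and the toy scoring function are all hypothetical; a real scorer would compare against the observed spectrum.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # residue -> (edge_score, TrieNode)

def insert(root, extension, score_fn):
    """Add an extension to the trie, scoring each new edge by its prefix."""
    node, prefix = root, ""
    for residue in extension:
        prefix += residue
        if residue not in node.children:
            node.children[residue] = (score_fn(prefix), TrieNode())
        node = node.children[residue][1]

def best_path(root):
    """Return (total_score, sequence) for the best root-to-leaf path."""
    best = (0, "")
    stack = [(root, 0, "")]
    while stack:
        node, total, seq = stack.pop()
        if not node.children:                 # leaf: a complete extension
            best = max(best, (total, seq))
        for residue, (edge_score, child) in node.children.items():
            stack.append((child, total + edge_score, seq + residue))
    return best

# Toy scorer: a prefix earns 1 point if it is a prefix of the "true" sequence.
supported = {"M", "MA", "MAL", "MALQ"}
root = TrieNode()
for ext in ["MALQ", "MALK", "MAQ"]:
    insert(root, ext, lambda p: 1 if p in supported else 0)
result = best_path(root)
```

Shared prefixes are scored once and reused by every extension below them, which is where a trie could save work over scoring each extension separately.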

# Example

* Target sequence MALQTWAR
* x_given MAL
* y_given WAR
* Start with  
* $$best_{ext} = \max(A[0, 1, 2, \ldots, n])$$
* Generate all possible extensions from MAL - RRT
* Generate all possible extensions from RRT - MAL

# Scoring algorithm for an extension

* if there is a match of masses between observed spectra and theoretical spectra (including tolerance), then we add to the score
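
A minimal sketch of this scoring rule, assuming an absolute tolerance in Da and that each peak may be matched at most once. `count_matches` is a hypothetical helper; the two-pointer walk is linear once both lists are sorted, and the sort itself adds a log factor if the inputs are not already ordered.

```python
def count_matches(observed, theoretical, tol=0.01):
    """Count observed peaks that match a theoretical peak within tol."""
    observed = sorted(observed)
    theoretical = sorted(theoretical)
    i = j = score = 0
    while i < len(observed) and j < len(theoretical):
        diff = observed[i] - theoretical[j]
        if abs(diff) <= tol:     # match: consume both peaks
            score += 1
            i += 1
            j += 1
        elif diff < 0:           # observed peak is too small to match anything left
            i += 1
        else:                    # theoretical peak is too small to match anything left
            j += 1
    return score

observed = [132.048, 203.085, 400.0, 747.397]          # made-up observed peaks
theoretical = [132.0478, 203.0849, 316.1689, 747.3970]
score = count_matches(observed, theoretical)
```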

# Complexity to keep track of 

* Comparing observed and theoretical spectra $\Theta(n)$
* Developing theoretical spectra $\Theta(n)$
* Building table to backtrack $\Theta(n^2)$

# Other approach to optimize

* Extending the base k-mers out far too much
* The problem is that the dataset size is too large

# To ask Zach in meetings

* Is math imported twice in utils.py?
* When running gen_spectrum above, the precursor mass is not what it should be. This is due to the calc masses function returning the doubly charged precursor as the precursor value. Should the precursor just be the max mass? If gen_spectrum is run with only the singly charged parameter, there are two different precursor values. Should this be possible, given that this is a theoretical spectrum?
* Why is split_hybrid labelled as __split.hybrid?
* Setup in gen_spectra not recognizing sequence?
* Run through the failures if possible