# Processing the spectra into a graph
The MGF spectra are loaded into the model, where they are first processed by the ```1_spectra-preprocessing``` scripts. The processed spectra are then passed to ```process_spectrum_graph```.

This notebook explores the behaviour of ```process_spectrum_graph```.

In [None]:
from typing import List
import numpy as np

In [None]:
# config cell
config = {
    'max_num_peaks': 400,
    'aa_mass_tolerance': 0.05,
}
vocab_reverse = ['A',
                 'R',
                 'N',
                 'Nmod',
                 'D',
                 # 'C',
                 'Cmod',
                 'E',
                 'Q',
                 'Qmod',
                 'G',
                 'H',
                 'I',
                 'L',
                 'K',
                 'M',
                 'Mmod',
                 'F',
                 'P',
                 'S',
                 'T',
                 'W',
                 'Y',
                 'V',
                 ]

In [None]:
def process_spectrum_graph(spectrum: List):
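    """
    Process a single preprocessed mass spectrum.

    Parameters:
        spectrum: a sequence of six items, unpacked below as
            (scan, peptide_ids, spectrum_mz, spectrum_intensity,
            peptide_mass, pep_charge).
    """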

    scan, peptide_ids, spectrum_mz, spectrum_intensity, peptide_mass, pep_charge = spectrum

    max_num_peaks = config['max_num_peaks']

    aa_edge_precision = config['aa_mass_tolerance']  # tolerance window around each mass when matching

    mp = peptide_mass

    # normalize the intensities so that the most intense peak has value 1
    spectrum_intensity = np.divide(spectrum_intensity, np.max(spectrum_intensity))

    peaks = np.stack([spectrum_mz, spectrum_intensity], axis=1)
    # np.stack joins the two arrays along a new axis. With axis=1, row i holds
    # the pair (spectrum_mz[i], spectrum_intensity[i]), much like zip would.

    b_or_y, diffs = match_peaks(peptide_ids=peptide_ids,
                                spectrum_mz=spectrum_mz,
                                tolerance=aa_edge_precision)
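
The normalization and pairing steps above can be tried in isolation. The snippet below is a minimal sketch with made-up toy values (they are not taken from any real spectrum); it only shows the shape and content of the ```peaks``` array that the two steps produce.

```python
import numpy as np

# Toy values for illustration only, not from a real spectrum.
spectrum_mz = np.array([100.0, 200.0, 300.0])
spectrum_intensity = np.array([5.0, 20.0, 10.0])

# Scale the intensities so that the most intense peak becomes 1.0.
normalized = spectrum_intensity / spectrum_intensity.max()

# Pair each m/z value with its normalized intensity: one row per peak.
peaks = np.stack([spectrum_mz, normalized], axis=1)

print(peaks.shape)
print(peaks)
```

Each row of ```peaks``` is one (m/z, intensity) pair, so the array has as many rows as there are peaks and exactly two columns.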



The b and y ions and their mass differences are calculated by the ```match_peaks``` function. It takes the peptide sequence as vocab ids, the ```spectrum_mz``` values and the tolerance, and matches the theoretical fragments of the peptide against the observed m/z values.
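
To make the idea of tolerance matching concrete, here is a minimal sketch. The ```within_tol``` helper used by ```match_peaks``` is not shown in this notebook, so ```match_within_tolerance``` below is a hypothetical stand-in with an assumed return format, and the numbers are toy values. It may well differ from the real helper.

```python
import numpy as np

def match_within_tolerance(spectrum_mz, theoretical_mz, tolerance):
    """Hypothetical helper (not the notebook's within_tol).

    For each observed m/z, return the index of the closest theoretical peak
    (or -1 if it is further away than `tolerance`) and the signed difference
    to that closest peak.
    """
    observed = np.asarray(spectrum_mz, dtype=float)
    theoretical = np.asarray(theoretical_mz, dtype=float)
    # diffs[i, j] = observed[i] - theoretical[j]
    diffs = observed[:, None] - theoretical[None, :]
    closest = np.argmin(np.abs(diffs), axis=1)
    closest_diff = diffs[np.arange(len(observed)), closest]
    matched = np.abs(closest_diff) <= tolerance
    return np.where(matched, closest, -1), closest_diff

# Toy example: the first and third observed peaks have a theoretical partner.
idx, diff = match_within_tolerance([100.02, 150.0, 250.04],
                                   [100.0, 250.0, 400.0],
                                   tolerance=0.05)
print(idx)
print(diff)
```

Broadcasting builds the full table of differences at once, which avoids an explicit Python loop over the observed peaks. This is a design choice of the sketch, not necessarily of the original code.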

The function ```fragments_mgf``` creates theoretical fragments from the peptide sequence.
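
The body of ```fragments_mgf``` is not shown here, so the following is only a sketch of how singly charged b and y ion m/z values are commonly derived from monoisotopic residue masses. The name ```theoretical_b_y``` and the small mass table are illustrative assumptions; modified residues such as ```Nmod``` or ```Cmod``` are left out, and the real function may return different ion types or a different layout.

```python
# Monoisotopic residue masses in Da for a few unmodified amino acids.
RESIDUE_MASS = {
    'G': 57.02146,
    'A': 71.03711,
    'S': 87.03203,
    'P': 97.05276,
    'V': 99.06841,
}
PROTON = 1.007276   # mass of a proton
WATER = 18.010565   # mass of H2O

def theoretical_b_y(peptide):
    """Hypothetical helper: singly charged b and y ion m/z values.

    `peptide` is a list of residue labels. b ions are prefix sums plus a
    proton; y ions are suffix sums plus water plus a proton.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)          # b_i
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # y_(n-i)
    return b_ions, y_ions

b_ions, y_ions = theoretical_b_y(['G', 'A', 'S'])
print(b_ions)
print(y_ions)
```

Note that the y ions come out from longest to shortest here, so ```b_ions[k]``` and ```y_ions[k]``` are complementary fragments of the same cleavage site.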

In [None]:
def match_peaks(peptide_ids, spectrum_mz, tolerance, _8ions=False):
    """
    Match the peaks of the mass spectrum to the theoretical fragments of the peptide sequence.

    """
    peptide_str = [vocab_reverse[i] for i in peptide_ids]  # convert the vocab ids back to amino acid labels
    # these functions create theoretical fragments
    if _8ions:  # it is unclear what _8ions refers to
        true_peaks, tp_ions, tp_frags = fragments_mgf_8ion(peptide=peptide_str)
    else:
        true_peaks, tp_ions, tp_frags = fragments_mgf(peptide=peptide_str)

    # we now match the peaks in the spectra to the theoretical peaks
    matched_peaks = []
    matched_diffs = []
    # for each observed m/z, within_tol compares it against true_peaks; judging by
    # its use below, column 0 flags matches within tolerance and column 1 holds the difference
    for mz in spectrum_mz:
        log_diff = within_tol(mz, true_peaks, rtol=0, atol=tolerance)

        if np.any(log_diff[..., 0]):
            closest_idx = np.argmin(log_diff[..., 1])


