# Run a test of hypedsearch with generated data
The following steps describe how the test works
1. Load a fasta database
2. Generate
    1. Hybrid proteins
    2. Peptides
    3. Hybrid peptides from the hybrid proteins
3. Generate spectra for all the peptides created
4. Run hypedsearch with the .fasta file (no hybrid proteins included) and the spectra files
5. Load the summary.json file created
6. Determine what number of alignments were correct

## 1. Load fasta database

In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from src.file_io import fasta

fasta_file = '../data/databases/100prots.fasta'
database = fasta.read(fasta_file, True)

database = {x['name']: x for x in database}

## 2.  Generate the peptides, hybrid proteins and peptides

In [2]:
from sequence_generation import proteins, peptides
test_directory = '../data/testing_output/'

num_hybs = 5
min_length= 5
max_length = 35
num_peptides = 100
min_cont = 3 #min contribution for each side of a hybrid

# make hybrid proteins
hyb_prots = proteins.generate_hybrids([x for _, x in database.items()], num_hybs, min_contribution=max_length)
# create peptides
non_hybrid_peps = peptides.gen_peptides([x for _, x in database.items()], num_peptides, min_length=min_length, max_length=max_length, digest='random', dist='beta')
# create hybrid peptides
hyb_peps = peptides.gen_peptides(hyb_prots, num_hybs, min_length=min_length, max_length=max_length, digest='random', min_contribution=min_cont, hybrid_list=True)

all_proteins_raw = [x for _,x in database.items()] + hyb_prots
all_peptides_raw = non_hybrid_peps + hyb_peps

peptides = {}
for i, pep in enumerate(all_peptides_raw):
    peptides[i] = pep
    peptides[i]['scan_no'] = i

Generating hybrid protein 0/5[0%]Generating hybrid protein 1/5[20%]Generating hybrid protein 2/5[40%]Generating hybrid protein 3/5[60%]Generating hybrid protein 4/5[80%]
Finished generating hybrid proteins


## 2.1 Save this info so that I can analyze it later from Neo-Fusion

In [3]:
import json
experiment_info_file_name = 'experiment_info.json'

exp = {'database': fasta_file, 'peptides': peptides}
with open(test_directory + experiment_info_file_name, 'w') as o:
    json.dump(exp, o)


## 2.2 Load data if available instead of creating it

In [4]:
# import json

# expfile = '../data/testing_output/experiment_info.json'
# exp = json.load(open(expfile, 'r'))
# peptides = exp['peptides']

## 3. Generate spectra

In [5]:
from src.spectra import gen_spectra
from src.utils import utils
from sequence_generation import write_spectra

utils.make_dir(test_directory)

spectra = []
sorted_keys = [int(c) for c in peptides.keys()]
sorted_keys.sort()
for k in sorted_keys:
    pep = peptides[k]
    cont = gen_spectra.gen_spectrum(pep['sequence'])
    spec = cont['spectrum']
    pm = cont['precursor_mass']
    spectra.append({'spectrum': spec, 'precursor_mass': pm})
write_spectra.write_mzml('testSpectraFile', spectra, output_dir=test_directory)


Determination of memory status is not supported on this 
 platform, measuring for memoryleaks will never fail


'../data/testing_output/testSpectraFile.mzML'

## 4. Run hypedsearch

In [7]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from src import runner
from time import time

test_directory = '../data/testing_output/'
fasta_file = '../data/databases/100prots.fasta'

args = {
    'spectra_folder': test_directory,
    'database_file': fasta_file,
    'output_dir': test_directory,
    'min_peptide_len': 3,
    'max_peptide_len': 35,
}
st = time()
runner.run(args)
print('\nTotal runtime: {} seconds'.format(time() - st))



Loading database...
Adding protein 42/100 to tree

KeyboardInterrupt: 

## 5. Load the summary json

In [8]:
import json
test_directory = '../data/testing_output/'

summary = json.load(open(test_directory + 'summary.json', 'r'))
print(test_directory)

../data/testing_output/


#### summary format
Each entry is of the form
```python
{
    '<file-name>_<scan_number>':{
        'spectrum': {...},
        'alignments': [{...}],
        'b_scores': [{...}],
        'y_score': [{...}]
    }
}
```
The only attribute we're really interested in is the alignments attribute which has the form
```python
{
    'alignments': [{
        'b_alignment': {...},
        'y_alignment': {...}, 
        'spectrum': [...],
        'protein': str,
        'alignment_score': float,
        'sequence': str,
        'hybrid': bool,
        'hybrid_sequence': str
    }, ...
    ]
}
```

## 6. Determine which number of alignments were correct
This needs to be broken down into hybrid and non hybrid peptides to get some stats on how well its doing

In [9]:
n = 5
non_hyb_stats = {i: {
        'correct': 0,
        'correct_parent': 0,
        'correct_length': 0,
        'correct_start': 0,
        'correct_end': 0
    } for i in range(n)}
non_hyb_stats['count'] = 0

hyb_stats ={i: {
        'left_correct_parent': 0,
        'right_correct_parent': 0, 
        'correct_start': 0,
        'correct_end': 0,
        'correct_length': 0,
        'correct_sequence': 0,
        'correct': 0
    } for i in range(n)}
hyb_stats['count'] = 0

wrong_hybrid_alignemnts = []
wrong_nonhybrid_alignments = []

In [10]:
def hyb_calc(result, real_pep):
    hyb_stats['count'] += 1
    is_correct = False
    for i in range(min(n, len(result))):
        res = result[i]
        result_hybrid = bool(res['hybrid'])
        if not result_hybrid: 
            continue
        
        left_corrparent = real_pep['left_parent_name'] == res['protein'].split('~')[0]
        right_corrparent = real_pep['right_parent_name'] == res['protein'].split('~')[1]
        corr_start = real_pep['left_parent_starting_position'] == res['b_alignment']['kmer']['start_position']
        corr_end = real_pep['right_parent_ending_position'] == res['y_alignment']['kmer']['end_position']
        corr_len = len(real_pep['sequence']) == len(res['sequence'])
        corr_seq = real_pep['sequence'] == res['sequence']
        
        if not corr_seq: 
            print('real pep seq: {} \t result seq: {}'.format(real_pep['sequence'], res['sequence']))
            continue
        hyb_stats[i]['left_correct_parent'] += 1 if left_corrparent else 0
        hyb_stats[i]['right_correct_parent'] += 1 if right_corrparent else 0
        hyb_stats[i]['correct_start'] += 1 if corr_start else 0
        hyb_stats[i]['correct_end'] += 1 if corr_end else 0
        hyb_stats[i]['correct_length'] += 1 if corr_len else 0
        hyb_stats[i]['correct_sequence'] += 1 if corr_seq else 0
        hyb_stats[i]['correct'] += 1 if left_corrparent and right_corrparent and corr_start and corr_end and corr_len else 0
        
        is_correct = corr_seq
            
    if not is_correct:
        wrong_hybrid_alignemnts.append((result, real_pep))

In [11]:
def non_hyb_calc(result, real_pep):
    non_hyb_stats['count'] += 1
    iterrange = min(n, len(result))
    for i in range(iterrange):
        if result[i]['protein'] != real_pep['parent_name']:
            continue
        corrlen = len(result[i]['sequence']) == len(real_pep['sequence'])
        resstartpos = min(result[i]['b_alignment']['kmer']['start_position'], result[i]['y_alignment']['kmer']['start_position'])
        resendpos = min(result[i]['b_alignment']['kmer']['end_position'], result[i]['y_alignment']['kmer']['end_position'])
        corrstart = resstartpos == real_pep['starting_position']
        corrend = resendpos == real_pep['ending_position']
        
        non_hyb_stats[i]['correct_parent'] += 1 
        non_hyb_stats[i]['correct_length'] += 1 if corrlen else 0
        non_hyb_stats[i]['correct_start'] += 1 if corrstart else 0
        non_hyb_stats[i]['correct_end'] += 1 if corrend else 0
        non_hyb_stats[i]['correct'] += 1 if (corrend and corrlen and corrstart) else 0
        
        if i != 0 and not (corrend and corrlen and corrstart):
            wrong_nonhybrid_alignments.append((result, real_pep))
        return


In [12]:
expfile = '../data/testing_output/experiment_info.json'
exp = json.load(open(expfile, 'r'))

scan_no_keyed_results = {x['spectrum']['scan_number']: x for _, x in summary.items()}
sorted_keys = [int(c) for c in exp['peptides'].keys()]

for k in sorted_keys:
    pep = exp['peptides'][str(k)]
    if k not in scan_no_keyed_results:
        continue
    if 'hybrid' in pep['peptide_name'].lower():
        hyb_calc(scan_no_keyed_results[k]['alignments'], pep)
    else:
        non_hyb_calc(scan_no_keyed_results[k]['alignments'], pep)

real pep seq: MDKKHKEE 	 result seq: MDKKDKHKEE
real pep seq: MDKKHKEE 	 result seq: MDKKDKHKEE
real pep seq: MDKKHKEE 	 result seq: MDKKDKHKEE
real pep seq: MDKKHKEED 	 result seq: MMEKDKHKEED
real pep seq: MDKKHKEED 	 result seq: MDKKDKHKEED
real pep seq: MDKKHKEED 	 result seq: MDKKDKHKEED
real pep seq: MDKKHKEEDK 	 result seq: MDKKDKHKEEDK
real pep seq: MDKKHKEEDK 	 result seq: MDKKDKHKEEDK
real pep seq: MDKKHKEEDK 	 result seq: MDKKDKHKEEDK
real pep seq: MDKKHKEEDKG 	 result seq: MDKKDKHKEEDKG
real pep seq: MDKKHKEEDKG 	 result seq: MDKKDKHKEEDKG
real pep seq: MDKKHKEEDKG 	 result seq: MDKKDKHKEEDKG
real pep seq: MDKKHKEEDKGS 	 result seq: MDKKDKHKEEDKGS
real pep seq: MDKKHKEEDKGS 	 result seq: MDKKDKHKEEDKGS
real pep seq: MDKKHKEEDKGS 	 result seq: MDKKDKHKEEDKGS
real pep seq: KMDKKHKE 	 result seq: KMKDKHKE
real pep seq: KMDKKHKE 	 result seq: KMDKKDKHKE
real pep seq: KMDKKHKEE 	 result seq: KMKDKHKEE
real pep seq: KMDKKHKEE 	 result seq: KMDKKDKHKEE
real pep seq: KMDKKHKEE 	 re

In [13]:
percent = lambda a, b: (a* 100 // b)

printstat = lambda name, stat: '{}{}\n'.format(name, str(stat).rjust(60-len(name), '.'))

secbreak = ''.join(['=' for _ in range(60)])
headbreak = ''.join(['-' for _ in range(60)])
nhcount = non_hyb_stats['count']
topalign = non_hyb_stats[0]
otheralign = {}
for i in range(1, n):
    for stat in topalign.keys():
        if stat not in otheralign:
            otheralign[stat] = 0
        otheralign[stat] += non_hyb_stats[i][stat]

######################## NON HYBRID PRETTY PRINTING ############################

nonhybsum = 'NON HYBRID STATS\n' + headbreak + '\n'
nonhybsum += printstat('number of peptides', nhcount) 
nonhybsum += 'Top alignment\n\n'
nonhybsum += printstat('correct alignment', topalign['correct'])
nonhybsum += printstat('%', percent(topalign['correct'], nhcount)) 
nonhybsum += printstat('correct protein', topalign['correct_parent']) 
nonhybsum += printstat('%', percent(topalign['correct_parent'], nhcount))
nonhybsum += printstat('correct starting position', topalign['correct_start'])
nonhybsum += printstat('%', percent(topalign['correct_start'], nhcount))
nonhybsum += printstat('correct ending position', topalign['correct_end'])
nonhybsum += printstat('%', percent(topalign['correct_end'], nhcount))
nonhybsum += printstat('correct length', topalign['correct_length'])
nonhybsum += printstat('%', percent(topalign['correct_length'], nhcount))

nonhybsum += '\n2 to {} alignment\n\n'.format(n)
nonhybsum += printstat('number of peptides', nhcount) 
nonhybsum += printstat('correct alignment', otheralign['correct'])
nonhybsum += printstat('%', percent(otheralign['correct'], nhcount)) 
nonhybsum += printstat('correct protein', otheralign['correct_parent']) 
nonhybsum += printstat('%', percent(otheralign['correct_parent'], nhcount))
nonhybsum += printstat('correct starting position', otheralign['correct_start'])
nonhybsum += printstat('%', percent(otheralign['correct_start'], nhcount))
nonhybsum += printstat('correct ending position', otheralign['correct_end'])
nonhybsum += printstat('%', percent(otheralign['correct_end'], nhcount))
nonhybsum += printstat('correct length', otheralign['correct_length'])
nonhybsum += printstat('%', percent(otheralign['correct_length'], nhcount))
nonhybsum += '\n' + secbreak + '\n\n'

############################ HYBRID PRETYY PRINTING ##############################
hcount = hyb_stats['count']
topalignh = hyb_stats[0]
otheralignh = {}
for i in range(1, n):
    for stat in topalignh.keys():
        if stat not in otheralignh:
            otheralignh[stat] = 0
        otheralignh[stat] += hyb_stats[i][stat]

hybsum = 'HYBRID STATS\n'+ headbreak + '\n'
hybsum += printstat('number of peptides', hcount)
hybsum += 'Top alignment\n\n'
hybsum += printstat('correct alignment', topalignh['correct'])
hybsum += printstat('%', percent(topalignh['correct'], hcount))
hybsum += printstat('correct sequence', topalignh['correct_sequence'])
hybsum += printstat('%', percent(topalignh['correct_sequence'], hcount))
hybsum += printstat('correct left parent', topalignh['left_correct_parent'])
hybsum += printstat('%', percent(topalignh['left_correct_parent'], hcount))
hybsum += printstat('correct right parent', topalignh['right_correct_parent'])
hybsum += printstat('%', percent(topalignh['right_correct_parent'], hcount))
hybsum += printstat('correct starting position', topalignh['correct_start'])
hybsum += printstat('%', percent(topalignh['correct_start'], hcount))
hybsum += printstat('correct ending position', topalignh['correct_end'])
hybsum += printstat('%', percent(topalignh['correct_end'], hcount))
hybsum += printstat('correct length', topalignh['correct_length'])
hybsum += printstat('%', percent(topalignh['correct_length'], hcount))

hybsum += '\n2 to {} alignment\n\n'.format(n)
hybsum += printstat('correct alignment', otheralignh['correct'])
hybsum += printstat('%', percent(otheralignh['correct'], hcount))
hybsum += printstat('correct sequence', otheralignh['correct_sequence'])
hybsum += printstat('%', percent(otheralignh['correct_sequence'], hcount))
hybsum += printstat('correct left parent', otheralignh['left_correct_parent'])
hybsum += printstat('%', percent(otheralignh['left_correct_parent'], hcount))
hybsum += printstat('correct right parent', otheralignh['right_correct_parent'])
hybsum += printstat('%', percent(otheralignh['right_correct_parent'], hcount))
hybsum += printstat('correct starting position', otheralignh['correct_start'])
hybsum += printstat('%', percent(otheralignh['correct_start'], hcount))
hybsum += printstat('correct ending position', otheralignh['correct_end'])
hybsum += printstat('%', percent(otheralignh['correct_end'], hcount))
hybsum += printstat('correct length', otheralign['correct_length'])
hybsum += printstat('%', percent(otheralignh['correct_length'], hcount))
print(nonhybsum + hybsum)

NON HYBRID STATS
------------------------------------------------------------
number of peptides.......................................100
Top alignment

correct alignment.........................................98
%.........................................................98
correct protein...........................................98
%.........................................................98
correct starting position.................................98
%.........................................................98
correct ending position...................................98
%.........................................................98
correct length............................................98
%.........................................................98

2 to 5 alignment

number of peptides.......................................100
correct alignment..........................................1
%..........................................................1
correct protein...................

## For each of the hybrid missed alignments, see what was chosen instead

In [14]:
for i, bad in enumerate(wrong_hybrid_alignemnts):
    print('\nCorrect sequence: \t {} \t {}'.format(bad[1]['sequence'], bad[1]['hybrid_sequence']))
    print('Attempted alignments')
    for b in bad[0]:
        print('\t{}\t{}'.format(b['sequence'], b['hybrid_sequence']))

    


Correct sequence: 	 MDKKHKEE 	 MDK-KHKEE
Attempted alignments
	MDKKDKHKEE	MDK-KDKHKEE
	MDKKDKHKEE	MDK-KDKHKEE
	MDKKDKHKEE	MDK-KDKHKEE

Correct sequence: 	 MDKKHKEED 	 MDK-KHKEED
Attempted alignments
	MMEKDKHKEED	MME-KDKHKEED
	MDKKDKHKEED	MDK-KDKHKEED
	MDKKDKHKEED	MDK-KDKHKEED

Correct sequence: 	 MDKKHKEEDK 	 MDK-KHKEEDK
Attempted alignments
	MDKKDKHKEEDK	MDK-KDKHKEEDK
	MDKKDKHKEEDK	MDK-KDKHKEEDK
	MDKKDKHKEEDK	MDK-KDKHKEEDK

Correct sequence: 	 MDKKHKEEDKG 	 MDK-KHKEEDKG
Attempted alignments
	MDKKDKHKEEDKG	MDK-KDKHKEEDKG
	MDKKDKHKEEDKG	MDK-KDKHKEEDKG
	MDKKDKHKEEDKG	MDK-KDKHKEEDKG

Correct sequence: 	 MDKKHKEEDKGS 	 MDK-KHKEEDKGS
Attempted alignments
	MDKKDKHKEEDKGS	MDK-KDKHKEEDKGS
	MDKKDKHKEEDKGS	MDK-KDKHKEEDKGS
	MDKKDKHKEEDKGS	MDK-KDKHKEEDKGS

Correct sequence: 	 KMDKKHKEE 	 KMDK-KHKEE
Attempted alignments
	KMKDKHKEE	KM-KDKHKEE
	KMDKKDKHKEE	KMDK-KDKHKEE
	KMDKDKHKEE	KMD-KDKHKEE

Correct sequence: 	 KMDKKHKEED 	 KMDK-KHKEED
Attempted alignments
	KMDKKDKHKEED	KMDK-KDKHKEED
	KMDKDKHKEED	

In [21]:
print(len(wrong_hybrid_alignemnts))

119
