# Run a test of hypedsearch with generated data
The following steps describe how the test works:
1. Load a fasta database
2. Generate
    1. Hybrid proteins
    2. Peptides
    3. Hybrid peptides from the hybrid proteins
3. Generate spectra for all the peptides created
4. Run hypedsearch with the .fasta file (no hybrid proteins included) and the spectra files
5. Load the summary.json file created
6. Determine how many alignments were correct

## 1. Load fasta database

In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from src.file_io import fasta

fasta_file = '../data/databases/100prots.fasta'
database = fasta.read(fasta_file, True)

database = {x['name']: x for x in database}

## 2. Generate the hybrid proteins, peptides, and hybrid peptides

In [2]:
from sequence_generation import proteins, peptides
test_directory = '../data/testing_output/'

num_hybs = 5
min_length = 5
max_length = 35
num_peptides = 100
min_cont = 3  # minimum contribution from each side of a hybrid

# make hybrid proteins
hyb_prots = proteins.generate_hybrids([x for _, x in database.items()], num_hybs, min_contribution=max_length)
# create peptides
non_hybrid_peps = peptides.gen_peptides([x for _, x in database.items()], num_peptides, min_length=min_length, max_length=max_length, digest='random', dist='beta')
# create hybrid peptides
hyb_peps = peptides.gen_peptides(hyb_prots, num_hybs, min_length=min_length, max_length=max_length, digest='random', min_contribution=min_cont, hybrid_list=True)

all_proteins_raw = [x for _,x in database.items()] + hyb_prots
all_peptides_raw = non_hybrid_peps + hyb_peps

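# NOTE: this rebinds the name `peptides`, shadowing the module imported above;
# re-run the import before re-running this cell
peptides = {}
for i, pep in enumerate(all_peptides_raw):
    peptides[i] = pep
    peptides[i]['scan_no'] = i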

Generating hybrid protein 0/5[0%]
Generating hybrid protein 1/5[20%]
Generating hybrid protein 2/5[40%]
Generating hybrid protein 3/5[60%]
Generating hybrid protein 4/5[80%]
Finished generating hybrid proteins
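
The `min_cont` setting above is described as the minimum contribution from each side of a hybrid. As a rough illustration of what that constraint means, the sketch below checks it against a hybrid sequence written with a `-` at the junction, the format printed later in this notebook (for example `RSP-QFM`). The helper names `junction_contributions` and `satisfies_min_contribution` are hypothetical and are not part of `sequence_generation`.

```python
def junction_contributions(hybrid_sequence, sep='-'):
    """Return (left_len, right_len) for a hybrid sequence such as 'RSP-QFM'."""
    left, right = hybrid_sequence.split(sep)
    return len(left), len(right)


def satisfies_min_contribution(hybrid_sequence, min_cont=3):
    """True when both sides of the junction contribute at least min_cont residues."""
    left_len, right_len = junction_contributions(hybrid_sequence)
    return left_len >= min_cont and right_len >= min_cont


print(satisfies_min_contribution('RSP-QFM'))  # both sides contribute 3 residues
print(satisfies_min_contribution('RS-PQFM'))  # the left side contributes only 2
```

This only inspects the final peptide; how the generator itself enforces the constraint is not shown in this notebook.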


## 2.1 Save this info so that I can analyze it later from Neo-Fusion

In [3]:
import json
experiment_info_file_name = 'experiment_info.json'

exp = {'database': fasta_file, 'peptides': peptides}
with open(test_directory + experiment_info_file_name, 'w') as o:
    json.dump(exp, o)


## 2.2 Load data if available instead of creating it

In [4]:
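# import json

# expfile = '../data/testing_output/experiment_info.json'
# with open(expfile, 'r') as f:
#     exp = json.load(f)
# # JSON stores the integer scan numbers as strings; convert them back so
# # the spectra cell can index peptides with int keys
# peptides = {int(k): v for k, v in exp['peptides'].items()}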

## 3. Generate spectra

In [5]:
from src.spectra import gen_spectra
from src import utils
from sequence_generation import write_spectra

utils.make_dir(test_directory)

spectra = []
sorted_keys = sorted(int(c) for c in peptides.keys())
for k in sorted_keys:
    pep = peptides[k]
    cont = gen_spectra.gen_spectrum(pep['sequence'])
    spec = cont['spectrum']
    pm = cont['precursor_mass']
    spectra.append({'spectrum': spec, 'precursor_mass': pm})
write_spectra.write_mzml('testSpectraFile', spectra, output_dir=test_directory)


Determination of memory status is not supported on this platform, measuring for memoryleaks will never fail


'../data/testing_output/testSpectraFile.mzML'

## 4. Run hypedsearch

In [6]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from src import runner
from time import time

test_directory = '../data/testing_output/'
fasta_file = '../data/databases/100prots.fasta'

args = {
    'spectra_folder': test_directory,
    'database_file': fasta_file,
    'output_dir': test_directory,
    'min_peptide_len': 3,
    'max_peptide_len': 35,
}
st = time()
runner.run(args)
print('\nTotal runtime: {} seconds'.format(time() - st))



Loading database...
Adding protein 100/100 to tree
Done.
Building hashes for kmers...
Indexing database for k=35...
Done
Looking at kmer 55051/55092
Done.
Analyzing spectra file 1/1[0%]

Analyzing spectrum 345/345[99%]
Finished search. Writting results to ../data/testing_output/...

Total runtime: 243.0017910003662 seconds


## 5. Load the summary json

In [7]:
import json
test_directory = '../data/testing_output/'

with open(test_directory + 'summary.json', 'r') as f:
    summary = json.load(f)
print(test_directory)

../data/testing_output/


#### Summary format
Each entry has the form:
```python
{
    '<file-name>_<scan_number>':{
        'spectrum': {...},
        'alignments': [{...}],
        'b_scores': [{...}],
        'y_score': [{...}]
    }
}
```
The only attribute we're really interested in is `alignments`, which has the form:
```python
{
    'alignments': [{
        'b_alignment': {...},
        'y_alignment': {...}, 
        'spectrum': [...],
        'protein': str,
        'alignment_score': float,
        'sequence': str,
        'hybrid': bool,
        'hybrid_sequence': str
    }, ...
    ]
}
```
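
As a small illustration of how a structure like this can be walked, the sketch below pulls the first alignment's sequence out of each entry. The `toy_summary` dict and its values are made up for the example, and the helper `top_sequences` is hypothetical. It also assumes the alignments list is ordered best-first, which is how the rest of this notebook treats index 0.

```python
# Toy summary with the shape described above; the values are invented for illustration.
toy_summary = {
    'testSpectraFile_0': {
        'spectrum': {'scan_number': 0},
        'alignments': [
            {'sequence': 'RSPQFM', 'hybrid_sequence': 'RSP-QFM', 'alignment_score': 2.0},
            {'sequence': 'RPSQFM', 'hybrid_sequence': 'RPS-QFM', 'alignment_score': 1.5},
        ],
    },
    'testSpectraFile_1': {'spectrum': {'scan_number': 1}, 'alignments': []},
}


def top_sequences(summary):
    """Map each entry name to the sequence of its first alignment, or None if it has none."""
    return {
        name: (entry['alignments'][0]['sequence'] if entry['alignments'] else None)
        for name, entry in summary.items()
    }


print(top_sequences(toy_summary))
```

Guarding against an empty alignments list matters here because a spectrum may come back with no alignments at all.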

## 6. Determine how many alignments were correct
This needs to be broken down into hybrid and non-hybrid peptides to get some stats on how well it's doing
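
A minimal sketch of that breakdown, assuming (as the scoring loop later in this notebook does) that a peptide is hybrid when its `peptide_name` contains the word "hybrid". The helper `split_by_kind` and the toy peptide names are hypothetical.

```python
def split_by_kind(peptides):
    """Split a {scan_no: peptide} dict into (hybrids, non_hybrids) by peptide name."""
    hybrids, non_hybrids = {}, {}
    for scan_no, pep in peptides.items():
        target = hybrids if 'hybrid' in pep['peptide_name'].lower() else non_hybrids
        target[scan_no] = pep
    return hybrids, non_hybrids


# Toy input; the names and sequences are invented for illustration.
toy = {
    0: {'peptide_name': 'PEPTIDE_0', 'sequence': 'MALWTR'},
    1: {'peptide_name': 'HYBRID_PEPTIDE_0', 'sequence': 'RSPQFM'},
}
hybrids, non_hybrids = split_by_kind(toy)
print(len(hybrids), len(non_hybrids))
```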

In [8]:
n = 5
non_hyb_stats = {i: {
        'correct': 0,
        'correct_parent': 0,
        'correct_sequence': 0,
    } for i in range(n)}
non_hyb_stats['count'] = 0

hyb_stats ={i: {
        'left_correct_parent': 0,
        'right_correct_parent': 0, 
        'correct_sequence': 0,
        'correct': 0
    } for i in range(n)}
hyb_stats['count'] = 0

wrong_hybrid_alignemnts = []
wrong_nonhybrid_alignments = []

In [9]:
def hyb_calc(result, real_pep):
    """Tally stats for one hybrid peptide against its top-n alignments."""
    hyb_stats['count'] += 1
    is_correct = False
    for i in range(min(n, len(result))):
        res = result[i]
        # only hybrid alignments carry a 'hybrid_sequence' entry; skip the rest
        result_hybrid = 'hybrid_sequence' in res
        if not result_hybrid: 
            continue
        
        left_corrparent = real_pep['left_parent_name'] in res['left_proteins']
        right_corrparent = real_pep['right_parent_name'] in res['right_proteins']
        corr_seq = real_pep['sequence'] == res['sequence']
        
        # stats are bucketed by the rank i of the alignment
        hyb_stats[i]['left_correct_parent'] += 1 if left_corrparent else 0
        hyb_stats[i]['right_correct_parent'] += 1 if right_corrparent else 0
        hyb_stats[i]['correct_sequence'] += 1 if corr_seq else 0
        hyb_stats[i]['correct'] += 1 if left_corrparent and right_corrparent and corr_seq else 0
        
        # stop at the first alignment whose sequence matches
        is_correct = corr_seq
        
        if is_correct:
            break
            
    # no hybrid alignment in the top n had the right sequence: record a miss
    if not is_correct:
        print('appending {} to bad for real pep {}'.format([x['sequence'] for x in result], real_pep['sequence']))
        wrong_hybrid_alignemnts.append((result, real_pep))

In [10]:
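def non_hyb_calc(result, real_pep):
    """Tally stats for one non-hybrid peptide.

    Only the first of the top-n alignments that maps to the true parent
    protein is scored; if none does, nothing is tallied or recorded as wrong.
    """
    non_hyb_stats['count'] += 1
    iterrange = min(n, len(result))
    for i in range(iterrange):
        # skip alignments that do not map to the true parent protein
        if real_pep['parent_name'] not in result[i]['proteins']:
            continue

        corrseq = result[i]['sequence'] == real_pep['sequence']
        corrprotein = real_pep['parent_name'] in result[i]['proteins']
        
        non_hyb_stats[i]['correct_parent'] += 1 if corrprotein else 0 
        non_hyb_stats[i]['correct_sequence'] += 1 if corrseq else 0
        non_hyb_stats[i]['correct'] += 1 if (corrseq and corrprotein) else 0
        
        # recorded as wrong only when the match is below rank 0 and not fully correct
        if i != 0 and not (corrprotein and corrseq):
            wrong_nonhybrid_alignments.append((result, real_pep))
        return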


In [11]:
expfile = '../data/testing_output/experiment_info.json'
with open(expfile, 'r') as f:
    exp = json.load(f)

scan_no_keyed_results = {x['spectrum']['scan_number']: x for _, x in summary.items()}
# JSON keys are strings; convert to int so they can be sorted and matched to scan numbers
sorted_keys = sorted(int(c) for c in exp['peptides'].keys())

for k in sorted_keys:
    pep = exp['peptides'][str(k)]
    if k not in scan_no_keyed_results:
        continue
    if 'hybrid' in pep['peptide_name'].lower():
        hyb_calc(scan_no_keyed_results[k]['alignments'], pep)
    else:
        non_hyb_calc(scan_no_keyed_results[k]['alignments'], pep)

appending ['RPSQFM', 'RSPQMF', 'RSPFQM'] to bad for real pep RSPQFM
appending ['VQVDDDD', 'VVQDDDD', 'VQVDAEED'] to bad for real pep VQVDDD
appending ['NVVGADDD', 'NVQVDDDD', 'NVVGADDDD'] to bad for real pep NVQVDDD
appending ['RNVQVDDDD', 'RNVQVDAEED', 'RNVQVDEAED'] to bad for real pep RNVQVDDD
appending ['SRVNARDDD', 'SRNVQVDDDD', 'SRNVQVDEAED'] to bad for real pep SRNVQVDDD
appending ['FSRNVQVDDDD', 'FSRNVQVNRLC', 'FSRNVQVDWGSGSDTLRCL'] to bad for real pep FSRNVQVDDD
appending ['VFSRNVQVDDDD', 'VFSRNVQVEVNARDDD', 'VFSRNVQVNARDDD'] to bad for real pep VFSRNVQVDDD
appending ['TVFSRNVQVDDDD', 'TVFSRNVQVNARDDD', 'TVFSRNVQVEVNARDDD'] to bad for real pep TVFSRNVQVDDD


In [12]:
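# integer percentage of a out of b; 0 when there is nothing to count
percent = lambda a, b: (a * 100 // b) if b else 0

# name on the left, value right-aligned and padded with dots to 60 characters
printstat = lambda name, stat: '{}{}\n'.format(name, str(stat).rjust(60-len(name), '.'))

secbreak = '=' * 60
headbreak = '-' * 60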
nhcount = non_hyb_stats['count']
topalign = non_hyb_stats[0]
otheralign = {}
for i in range(1, n):
    for stat in topalign.keys():
        if stat not in otheralign:
            otheralign[stat] = 0
        otheralign[stat] += non_hyb_stats[i][stat]

######################## NON HYBRID PRETTY PRINTING ############################

nonhybsum = 'NON HYBRID STATS\n' + headbreak + '\n'
nonhybsum += printstat('number of peptides', nhcount) 
nonhybsum += 'Top alignment\n\n'
nonhybsum += printstat('correct alignment', topalign['correct'])
nonhybsum += printstat('%', percent(topalign['correct'], nhcount)) 
nonhybsum += printstat('correct protein', topalign['correct_parent']) 
nonhybsum += printstat('%', percent(topalign['correct_parent'], nhcount))
nonhybsum += printstat('correct sequence', topalign['correct_sequence'])
nonhybsum += printstat('%', percent(topalign['correct_sequence'], nhcount))

nonhybsum += '\n2 to {} alignment\n\n'.format(n)
nonhybsum += printstat('number of peptides', nhcount) 
nonhybsum += printstat('correct alignment', otheralign['correct'])
nonhybsum += printstat('%', percent(otheralign['correct'], nhcount)) 
nonhybsum += printstat('correct protein', otheralign['correct_parent']) 
nonhybsum += printstat('%', percent(otheralign['correct_parent'], nhcount))
nonhybsum += printstat('correct sequence', otheralign['correct_sequence'])
nonhybsum += printstat('%', percent(otheralign['correct_sequence'], nhcount))
nonhybsum += '\n' + secbreak + '\n\n'

############################ HYBRID PRETTY PRINTING ##############################
hcount = hyb_stats['count']
topalignh = hyb_stats[0]
otheralignh = {}
for i in range(1, n):
    for stat in topalignh.keys():
        if stat not in otheralignh:
            otheralignh[stat] = 0
        otheralignh[stat] += hyb_stats[i][stat]

hybsum = 'HYBRID STATS\n'+ headbreak + '\n'
hybsum += printstat('number of peptides', hcount)
hybsum += 'Top alignment\n\n'
hybsum += printstat('correct alignment', topalignh['correct'])
hybsum += printstat('%', percent(topalignh['correct'], hcount))
hybsum += printstat('correct sequence', topalignh['correct_sequence'])
hybsum += printstat('%', percent(topalignh['correct_sequence'], hcount))
hybsum += printstat('correct left parent', topalignh['left_correct_parent'])
hybsum += printstat('%', percent(topalignh['left_correct_parent'], hcount))
hybsum += printstat('correct right parent', topalignh['right_correct_parent'])
hybsum += printstat('%', percent(topalignh['right_correct_parent'], hcount))

hybsum += '\n2 to {} alignment\n\n'.format(n)
hybsum += printstat('correct alignment', otheralignh['correct'])
hybsum += printstat('%', percent(otheralignh['correct'], hcount))
hybsum += printstat('correct sequence', otheralignh['correct_sequence'])
hybsum += printstat('%', percent(otheralignh['correct_sequence'], hcount))
hybsum += printstat('correct left parent', otheralignh['left_correct_parent'])
hybsum += printstat('%', percent(otheralignh['left_correct_parent'], hcount))
hybsum += printstat('correct right parent', otheralignh['right_correct_parent'])
hybsum += printstat('%', percent(otheralignh['right_correct_parent'], hcount))

print(nonhybsum + hybsum)

NON HYBRID STATS
------------------------------------------------------------
number of peptides.......................................100
Top alignment

correct alignment........................................100
%........................................................100
correct protein..........................................100
%........................................................100
correct sequence.........................................100
%........................................................100

2 to 5 alignment

number of peptides.......................................100
correct alignment..........................................0
%..........................................................0
correct protein............................................0
%..........................................................0
correct sequence...........................................0
%..........................................................0


HYBRID STATS
-------------------

## For each of the hybrid missed alignments, see what was chosen instead

In [14]:
for bad in wrong_hybrid_alignemnts:
    print('\nCorrect sequence: \t {} \t {}'.format(bad[1]['sequence'], bad[1]['hybrid_sequence']))
    print('Attempted alignments')
    for b in bad[0]:
        # fall back to the plain sequence when the alignment is not a hybrid
        print('\t{}\t{}'.format(b['sequence'], b.get('hybrid_sequence', b['sequence'])))



Correct sequence: 	 RSPQFM 	 RSP-QFM
Attempted alignments
	RPSQFM	RPS-QFM
	RSPQMF	RSP-QMF
	RSPFQM	RSP-FQM

Correct sequence: 	 VQVDDD 	 VQV-DDD
Attempted alignments
	VQVDDDD	VQV-DDDD
	VVQDDDD	VVQ-DDDD
	VQVDAEED	VQV-DAEED

Correct sequence: 	 NVQVDDD 	 NVQV-DDD
Attempted alignments
	NVVGADDD	NVVGA-DDD
	NVQVDDDD	NVQV-DDDD
	NVVGADDDD	NVVGA-DDDD

Correct sequence: 	 RNVQVDDD 	 RNVQV-DDD
Attempted alignments
	RNVQVDDDD	RNVQV-DDDD
	RNVQVDAEED	RNVQV-DAEED
	RNVQVDEAED	RNVQV-DEAED

Correct sequence: 	 SRNVQVDDD 	 SRNVQV-DDD
Attempted alignments
	SRVNARDDD	SRV-NARDDD
	SRNVQVDDDD	SRNVQV-DDDD
	SRNVQVDEAED	SRNVQV-DEAED

Correct sequence: 	 FSRNVQVDDD 	 FSRNVQV-DDD
Attempted alignments
	FSRNVQVDDDD	FSRNVQV-DDDD
	FSRNVQVNRLC	FSRNVQV-NRLC
	FSRNVQVDWGSGSDTLRCL	FSRNVQV-DWGSGSDTLRCL

Correct sequence: 	 VFSRNVQVDDD 	 VFSRNVQV-DDD
Attempted alignments
	VFSRNVQVDDDD	VFSRNVQV-DDDD
	VFSRNVQVEVNARDDD	VFSRNVQV-EVNARDDD
	VFSRNVQVNARDDD	VFSRNVQV-NARDDD

Correct sequence: 	 TVFSRNVQVDDD 	 TVFSRNVQV-DDD
Attempted

In [21]:
print(len(wrong_hybrid_alignemnts))

119
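
Some of the missed alignments listed above, such as `RPSQFM` for the real peptide `RSPQFM`, are rearrangements of the correct residues. For unmodified peptides the precursor mass depends only on which residues are present, so candidates like these cannot be told apart by precursor mass alone. One quick way to flag such cases is to compare amino acid compositions, as in the sketch below; `same_composition` is a hypothetical helper, not part of hypedsearch.

```python
from collections import Counter


def same_composition(seq_a, seq_b):
    """True when both sequences contain exactly the same residues, in any order."""
    return Counter(seq_a) == Counter(seq_b)


print(same_composition('RSPQFM', 'RPSQFM'))   # a rearrangement of the real peptide
print(same_composition('VQVDDD', 'VQVDDDD'))  # the candidate has an extra D
```

Misses that fail this check, like the second pair, differ in length or residue content, so they point to a different cause than an ambiguous ordering.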
