# Testing database indexing
The current iteration takes too long to run, so we need to index the database.

#### Proposed idea
Create a new database in our own format from a .fasta file:
1. Identify all the 3-mers in the database
2. Use each 3-mer as a key in a hash map whose value lists every (protein, starting_pos, ending_pos) occurrence of that 3-mer
3. Only look through the 3-mers that are interesting
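
The three steps can be sketched on a toy database. This is only an illustration under stated assumptions: `build_kmer_index`, `lookup`, and the toy entries are hypothetical names, and "interesting" is assumed here to mean "the 3-mers that occur in a query sequence", which the plan above does not define.

```python
from collections import defaultdict


def build_kmer_index(database, kmer_len=3):
    """Steps 1 and 2: map every k-mer to its (protein, start, end) occurrences."""
    index = defaultdict(list)
    for entry in database:
        sequence = entry['sequence']
        for i in range(len(sequence) - kmer_len + 1):
            # positions are 0-based and the end position is inclusive
            index[sequence[i:i + kmer_len]].append((entry['name'], i, i + kmer_len - 1))
    return dict(index)


def lookup(index, query, kmer_len=3):
    """Step 3 (assumed meaning): only visit the k-mers that occur in the query."""
    hits = []
    for i in range(len(query) - kmer_len + 1):
        hits.extend(index.get(query[i:i + kmer_len], []))
    return hits


# hypothetical toy entries, shaped like the records used in this notebook
toy_database = [
    {'name': 'PROT_A', 'sequence': 'MYCNA'},
    {'name': 'PROT_B', 'sequence': 'AYCN'},
]
toy_index = build_kmer_index(toy_database)
ycn_hits = lookup(toy_index, 'YCN')
print(ycn_hits)
```

With an index like this, a search touches only the entries for the query's own 3-mers instead of scanning every protein sequence.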


## Import the FASTA file

In [22]:
import os
import sys

# make the src package importable from this notebook
for relative_path in ('..', '../..'):
    module_path = os.path.abspath(relative_path)
    if module_path not in sys.path:
        sys.path.append(module_path)

from src.file_io import fasta
fasta_file_name = '../../testing framework/data/databases/17029prots.fasta'
database = fasta.read(fasta_file_name)


## Find all the k-mers in the database and keep track of them

In [23]:
kmers = {}
kmer_len = 3
for entry in database:
    sequence = entry['sequence']
    for i in range(len(sequence) - kmer_len + 1):
        mer = sequence[i:i + kmer_len]
        # positions are 0-based and the end position is inclusive
        pairing = (entry['name'], i, i + kmer_len - 1)
        kmers.setdefault(mer, []).append(pairing)

print(kmers['YCN'])


[('AGAL_MOUSE', 221, 223), ('CRNS1_MOUSE', 785, 787), ('ABCA1_MOUSE', 353, 355), ('CALCR_MOUSE', 87, 89), ('ADAM5_MOUSE', 459, 461), ('CALRL_MOUSE', 62, 64), ('ADCY9_MOUSE', 908, 910), ('COBA1_MOUSE', 1635, 1637), ('BDH_MOUSE', 286, 288), ('BIR1A_MOUSE', 268, 270), ('CNTP1_MOUSE', 994, 996), ('EIF3L_MOUSE', 159, 161), ('ADA2B_MOUSE', 415, 417), ('ADAM2_MOUSE', 460, 462), ('ACHG_MOUSE', 126, 128), ('CO5A2_MOUSE', 1324, 1326), ('COPD_MOUSE', 477, 479), ('CP086_MOUSE', 185, 187), ('ATS7_MOUSE', 842, 844), ('CPNE3_MOUSE', 370, 372), ('CNTP4_MOUSE', 624, 626), ('ADA2A_MOUSE', 430, 432), ('ARHG4_MOUSE', 171, 173), ('ARHG9_MOUSE', 196, 198), ('ALD2_MOUSE', 48, 50), ('FBX32_MOUSE', 39, 41), ('BIR1E_MOUSE', 268, 270), ('CRBG3_MOUSE', 344, 346), ('CNTP2_MOUSE', 627, 629), ('CNTP2_MOUSE', 741, 743), ('5HT6R_MOUSE', 311, 313), ('SC6A8_MOUSE', 142, 144), ('S15A5_MOUSE', 518, 520), ('SBP2L_MOUSE', 671, 673), ('TTC28_MOUSE', 670, 672), ('FLRT1_MOUSE', 37, 39), ('AMPN_MOUSE', 795, 797), ('FA76A_MOUSE'
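
Each pairing stores a 0-based start and an inclusive end, so an entry such as `('AGAL_MOUSE', 221, 223)` spans three residues. A small sanity check of that convention, using a hypothetical helper `kmer_at` and a made-up toy sequence rather than the real database:

```python
def kmer_at(sequence, start, end):
    """Return the residues from start to end (0-based, both ends inclusive)."""
    return sequence[start:end + 1]


# made-up sequence, used only to illustrate the position convention
toy_sequence = 'MAYCNQ'
kmer_len = 3
start = toy_sequence.find('YCN')
pairing = ('TOY_PROTEIN', start, start + kmer_len - 1)

recovered = kmer_at(toy_sequence, pairing[1], pairing[2])
print(pairing, recovered)
```

Because the end is inclusive, slicing needs `end + 1`; forgetting that would return a 2-mer.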

## Write this all to a file
We should store all of the information (the FASTA database and its metadata) in one file. The format may be something like
```python3
{
    'metadata': <kmer dict>,
    'database': {
        <protein_name>: {
            'id': <id>,
            'sequence': <sequence>
        }
    }
}
```
Then pickle the whole structure to a file with the extension '.fastax'.


In [24]:
import pickle
filename = fasta_file_name + 'x'

filecontents = {'metadata': kmers, 'database': {}}

for entry in database: 
    indexed_entry = {
        'sequence': entry['sequence']
    }
    # add extra info if available
    if 'id' in entry:
        indexed_entry['id'] = entry['id']
    if 'human_readable_name' in entry: 
        indexed_entry['human_readable_name'] = entry['human_readable_name']
        
    filecontents['database'][entry['name']] = indexed_entry
    

# print a summary only: printing the whole structure exceeds the notebook's IOPub data rate limit
print(len(filecontents['metadata']), 'k-mers,', len(filecontents['database']), 'proteins')


In [25]:
# dump it to a file (pickle was already imported above)
with open(filename, 'wb') as out_file:
    pickle.dump(filecontents, out_file)
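
Reading the file back is the mirror image of the dump. The sketch below round-trips a tiny hand-made structure through a temporary '.fastax' file and then uses an index entry to pull the matching residues out of the stored sequence. The toy contents and file name are hypothetical, and the database is assumed to be keyed by protein name as in the format described above. Note that `pickle.load` can execute arbitrary code, so it should only be used on files we created ourselves.

```python
import os
import pickle
import tempfile

# hypothetical toy contents with the same shape as the proposed format
toy_contents = {
    'metadata': {'YCN': [('PROT_A', 1, 3)]},
    'database': {'PROT_A': {'sequence': 'MYCNA'}},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy.fastax')
    with open(toy_path, 'wb') as out_file:
        pickle.dump(toy_contents, out_file)
    with open(toy_path, 'rb') as in_file:
        loaded = pickle.load(in_file)

# follow one index entry back to the residues it points at
name, start, end = loaded['metadata']['YCN'][0]
recovered = loaded['database'][name]['sequence'][start:end + 1]
print(name, recovered)
```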