# Exploring the reference database
Let's see what properties we can find :)

## Structure
In `genomes/`, there are multiple sub-folders; we will start with `Bacteria`.
It contains every recorded species/strain, each in its own folder.


## Content of each species/strain folder
Each folder contains:
- .ASN with 
 - `taxname "Acetobacter pasteurianus IFO 3283-32"`
 - `db "taxon", tag id 634457`
 - `genus "Acetobacter", species "pasteurianus"`
 - `mod { {subtype strain, subname "IFO 3283" }, { subtype substrain, subname "IFO 3283-32" } },`
 - `lineage "Bacteria; Proteobacteria; Alphaproteobacteria; Rhodospirillales; Acetobacteraceae; Acetobacter",`
- .FAA (protein FASTA)
 - multiple headers such as ">gi|384064451|ref|YP_005479409.1| hypothetical protein APA32_44160 [Acetobacter pasteurianus IFO 3283-32]"
 - each followed by the amino-acid sequence of that protein
- .FFN (nucleotide FASTA of the coding regions)
 - multiple headers such as ">gi|384064450|ref|NC_017102.1|:c562-116 Acetobacter pasteurianus IFO 3283-32 plasmid pAPA32-040, complete sequence"
 - each followed by the DNA sequence of that region
- .FNA
 - also DNA: the nucleotide FASTA of the whole molecule (chromosome or plasmid)
- .GBK: human-readable GenBank format with most of the info!
 - has an identifier `/db_xref="taxon:634457"`
- .GFF with `##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=634457`
- .RPT
 - looks parseable as a simple INI-style config file (Python `configparser`):
   - `DNA  length = 3035`
   - `Taxname: Acetobacter pasteurianus IFO 3283-32`
   - `Taxid: 634457`
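
To check the INI idea, here is a minimal sketch that feeds `.rpt`-style lines to `configparser`. It assumes the file only holds `key = value` or `key: value` lines like the three quoted above; the real files may contain other line types, which is why the parser is set up to be tolerant. `RPT_EXCERPT` and `parse_rpt` are made-up names for this illustration.

```python
import configparser

# Hypothetical excerpt: only these three lines are taken from the notes above.
RPT_EXCERPT = """\
DNA  length = 3035
Taxname: Acetobacter pasteurianus IFO 3283-32
Taxid: 634457
"""

def parse_rpt(text):
    """Parse .rpt-style key/value lines by wrapping them in a dummy INI section."""
    parser = configparser.ConfigParser(
        strict=False,          # tolerate repeated keys
        allow_no_value=True,   # tolerate lines without a delimiter
        interpolation=None,    # keep '%' characters literal
    )
    parser.read_string("[rpt]\n" + text)
    # Keys are lower-cased by configparser's default optionxform.
    return dict(parser["rpt"])

info = parse_rpt(RPT_EXCERPT)
print(info["taxid"], "|", info["taxname"])
```

`configparser` needs a section header, hence the dummy `[rpt]` line prepended to the text.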


Reference for the NCBI file extensions: http://defindit.com/readme_files/ncbi_file_extension_format.html

What we need is the taxonomy ID, the name, and the DNA sequence, which can be found in:
 - .gbk for the taxonomy ID and the name
 - .fna for the sequence
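
As a rough sketch of how the taxonomy ID and name could be pulled out of the `.gbk` text, the snippet below uses two regular expressions on the `/db_xref="taxon:..."` and `/organism="..."` qualifiers. `GBK_EXCERPT` is a hand-written excerpt (the `1..100` location is invented) and `taxon_and_name` is a hypothetical helper, not something from the notebook.

```python
import re

# Hand-written excerpt for illustration; the location "1..100" is made up.
GBK_EXCERPT = '''\
     source          1..100
                     /organism="Acetobacter pasteurianus IFO 3283-32"
                     /db_xref="taxon:634457"
'''

TAXON_RE = re.compile(r'/db_xref="taxon:(\d+)"')
ORGANISM_RE = re.compile(r'/organism="([^"]+)"')

def taxon_and_name(gbk_text):
    """Return (taxonomy id, organism name) from GenBank text, or None for a missing field."""
    taxon = TAXON_RE.search(gbk_text)
    organism = ORGANISM_RE.search(gbk_text)
    taxid = taxon.group(1) if taxon else None
    # Long organism names can wrap over several lines, so collapse the whitespace.
    name = " ".join(organism.group(1).split()) if organism else None
    return taxid, name

print(taxon_and_name(GBK_EXCERPT))
```

Only the first match of each qualifier is used, which is enough when a file describes a single organism.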

#### File marker
https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly <br>
`NC_	Genomic	Complete genomic molecule, usually reference assembly`

#### Status
https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_status_codes/?report=objectonly <br>
The status is given in the `COMMENT` field; from most to least curated: REVIEWED > VALIDATED > PROVISIONAL > ...


## Coding
### Import and Paths

In [2]:
import os
import pandas as pd
import configparser
import pickle
from Bio import SeqIO
from time import time
from tqdm import tqdm_notebook as tqdm

In [3]:
path_ref_db = "/mnt/genomeDB/ncbi/genomes/Bacteria/"
path_kmer_freq = "/home/sjriondet/Data/Kmer_frequencies/"

In [4]:
os.chdir(path_ref_db)

## Functions

Counting kmer frequencies

In [5]:
nucleotides = "ACGT"

In [6]:
k = 4

In [7]:
def read_fna(file_path):
    # Assumes a single-record FASTA file: drop the header line and join the sequence lines.
    with open(file_path) as f:
        rec = f.readlines()
    return "".join(rec[1:]).replace("\n", "")
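
`read_fna` treats everything after the first line as one sequence, so a file with several `>` headers would end up with header text mixed into the DNA. A minimal sketch of a reader that handles several records, using only the standard library (`iter_fasta` is a hypothetical name, and upper-casing the sequence is my own choice):

```python
import io

def iter_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file or file-like object."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

# Tiny in-memory example instead of a real .fna file
example = io.StringIO(">seq1 first\nACGT\nacgt\n>seq2\nTTTT\n")
records = list(iter_fasta(example))
print(records)
```

With a real file this would be used as `with open(path) as f: records = list(iter_fasta(f))`.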

In [8]:
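def combinaisons(combi, n, instances=nucleotides):
    # Recursively build every string of length n over `instances`.
    if n == 1:
        return combi
    else:
        # `b` (not `n`) as loop variable, to avoid shadowing the length argument
        return [f"{a}{b}" for a in combinaisons(combi, n-1, instances) for b in instances]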

In [9]:
def kmers_dic(n):
    return {a:0 for a in combinaisons(nucleotides, n)}
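
The same list of k-mers can probably be built without recursion using `itertools.product`. This is only an alternative sketch; `all_kmers` and `empty_kmer_counts` are hypothetical names, not functions from this notebook.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

def all_kmers(k, alphabet=NUCLEOTIDES):
    """Every string of length k over the alphabet, in lexicographic order."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]

def empty_kmer_counts(k, alphabet=NUCLEOTIDES):
    """Dictionary mapping each k-mer to a count of 0."""
    return dict.fromkeys(all_kmers(k, alphabet), 0)

print(len(all_kmers(4)))   # 4**4 k-mers
print(all_kmers(2)[:5])
```

`product` yields the tuples in the order of the input alphabet, so the keys come out sorted in the same way as with the recursive version.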

In [10]:
def count_kmers(seq, kmer_dic, n):
    for kmer in window(seq, n):
        kmer_dic[kmer] += 1

In [11]:
def window(fseq, window_size=53):
    for i in range(len(fseq) - window_size + 1):
        yield fseq[i:i+window_size]
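
`count_kmers` raises a `KeyError` as soon as a window contains a character outside `ACGT` (for example the ambiguity code `N`). Whether the reference files contain such characters is an assumption I have not checked; the sketch below simply skips those windows and reports how many were skipped. `count_kmers_tolerant` is a hypothetical variant.

```python
from itertools import product

def count_kmers_tolerant(seq, k, alphabet="ACGT"):
    """Count k-mers with a sliding window, skipping windows with unknown characters."""
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    skipped = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
        else:
            skipped += 1  # window contains something other than A, C, G or T
    return counts, skipped

counts, skipped = count_kmers_tolerant("ACGTNACGT", 2)
print(counts["AC"], counts["CG"], counts["GT"], skipped)
```

In the example, the two windows overlapping the `N` (`TN` and `NA`) are the ones that get skipped.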

## Tests

In [12]:
os.chdir("Acetobacter_pasteurianus_IFO_3283_32_uid158375")

In [13]:
rec = read_fna("NC_017102.fna")
len(rec)

3204

### 3-mer

In [14]:
kmer_3 = kmers_dic(3)

In [15]:
# %%timeit
count_kmers(rec, kmer_3, 3)

### 4-mer

In [16]:
kmer_4 = kmers_dic(4)

In [17]:
kmer_4

{'AAAA': 0,
 'AAAC': 0,
 'AAAG': 0,
 'AAAT': 0,
 'AACA': 0,
 'AACC': 0,
 'AACG': 0,
 'AACT': 0,
 'AAGA': 0,
 'AAGC': 0,
 'AAGG': 0,
 'AAGT': 0,
 'AATA': 0,
 'AATC': 0,
 'AATG': 0,
 'AATT': 0,
 'ACAA': 0,
 'ACAC': 0,
 'ACAG': 0,
 'ACAT': 0,
 'ACCA': 0,
 'ACCC': 0,
 'ACCG': 0,
 'ACCT': 0,
 'ACGA': 0,
 'ACGC': 0,
 'ACGG': 0,
 'ACGT': 0,
 'ACTA': 0,
 'ACTC': 0,
 'ACTG': 0,
 'ACTT': 0,
 'AGAA': 0,
 'AGAC': 0,
 'AGAG': 0,
 'AGAT': 0,
 'AGCA': 0,
 'AGCC': 0,
 'AGCG': 0,
 'AGCT': 0,
 'AGGA': 0,
 'AGGC': 0,
 'AGGG': 0,
 'AGGT': 0,
 'AGTA': 0,
 'AGTC': 0,
 'AGTG': 0,
 'AGTT': 0,
 'ATAA': 0,
 'ATAC': 0,
 'ATAG': 0,
 'ATAT': 0,
 'ATCA': 0,
 'ATCC': 0,
 'ATCG': 0,
 'ATCT': 0,
 'ATGA': 0,
 'ATGC': 0,
 'ATGG': 0,
 'ATGT': 0,
 'ATTA': 0,
 'ATTC': 0,
 'ATTG': 0,
 'ATTT': 0,
 'CAAA': 0,
 'CAAC': 0,
 'CAAG': 0,
 'CAAT': 0,
 'CACA': 0,
 'CACC': 0,
 'CACG': 0,
 'CACT': 0,
 'CAGA': 0,
 'CAGC': 0,
 'CAGG': 0,
 'CAGT': 0,
 'CATA': 0,
 'CATC': 0,
 'CATG': 0,
 'CATT': 0,
 'CCAA': 0,
 'CCAC': 0,
 'CCAG': 0,
 ...

In [18]:
%%timeit
count_kmers(rec, kmer_4, 4)

1000 loops, best of 3: 1.29 ms per loop


In [19]:
%%timeit
kmer_4[max(kmer_4, key=kmer_4.get)]

10000 loops, best of 3: 26.4 µs per loop


In [20]:
%%timeit
max(kmer_4.values())

100000 loops, best of 3: 6.09 µs per loop


In [21]:
%%timeit
for k in kmer_4:
    kmer_4[k] /= 4

10000 loops, best of 3: 33.6 µs per loop


In [22]:
%%timeit
kmer_4.update((x, y/4) for x, y in kmer_4.items())

10000 loops, best of 3: 45.4 µs per loop
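
The two timing cells above scale the dictionary in place, so every `%%timeit` loop divides the values again. A sketch of a normalisation that returns a new dictionary and guards against an all-zero spectrum (both function names are hypothetical, and dividing by the total is an alternative I am adding for comparison, not something used elsewhere in this notebook):

```python
def normalise_by_max(counts):
    """Scale so that the most frequent k-mer gets 1.0; an all-zero input stays at 0.0."""
    max_val = max(counts.values(), default=0)
    if max_val == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: n / max_val for kmer, n in counts.items()}

def normalise_by_total(counts):
    """Scale so that the values sum to 1.0 (relative frequencies)."""
    total = sum(counts.values())
    if total == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: n / total for kmer, n in counts.items()}

print(normalise_by_max({"AA": 2, "AC": 1, "CA": 0}))
print(normalise_by_total({"AA": 2, "AC": 1, "CA": 1}))
```

Returning a new dictionary keeps the raw counts available, which makes it harmless to rerun the cell.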


In [23]:
def kmer_pkl_path(k, fna_path):
    path_gbk = fna_path.replace(".fna", ".gbk")
    assert os.path.isfile(path_gbk), f"{fna_path} DOESN'T have a .gbk file ??"
    
    with open(path_gbk) as gbk:
        description=gbk.read()  #.replace('\n', '')
        
    identificator = 'db_xref="taxon:'
    taxo_start = description.find(identificator)
    taxo = description[taxo_start+len(identificator):
                       taxo_start+description[taxo_start:].find('"\n')]
    assert len(taxo) < 10, f"The taxo id search failed, found an id of length {len(taxo)}..."
    
    # TODO: ADD full path of the original file in the file name, or maybe in the .pkl
    
    return os.path.join(path_kmer_freq, str(k), taxo + ".pkl")

In [24]:
def kmer_freq_to_file(kmer_dic, freq_path):
    with open(freq_path, 'wb') as f_out:
        pickle.dump(kmer_dic, f_out)
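
`kmer_freq_to_file` expects the target folder to exist already. A small sketch of saving and reloading a spectrum, creating the folder on the way; `save_kmer_freq` and `load_kmer_freq` are hypothetical names, and the example writes to a temporary directory instead of `path_kmer_freq`.

```python
import os
import pickle
import tempfile

def save_kmer_freq(kmer_dic, freq_path):
    """Pickle the k-mer dictionary, creating the parent folder if needed."""
    os.makedirs(os.path.dirname(freq_path), exist_ok=True)
    with open(freq_path, "wb") as f_out:
        pickle.dump(kmer_dic, f_out)

def load_kmer_freq(freq_path):
    """Read a k-mer dictionary back from its pickle file."""
    with open(freq_path, "rb") as f_in:
        return pickle.load(f_in)

# Round trip in a throw-away folder laid out as <root>/<k>/<taxid>.pkl
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "4", "634457.pkl")
    save_kmer_freq({"AAAA": 0.5, "AAAC": 1.0}, path)
    restored = load_kmer_freq(path)

print(restored)
```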

In [25]:
fna_path = "/mnt/genomeDB/ncbi/genomes/Bacteria/Aciduliprofundum_boonei_T469_uid43333/NC_013926.fna"
path_gbk = fna_path.replace(".fna", ".gbk")
with open(path_gbk) as gbk:
    description = gbk.read()
identificator = 'db_xref="taxon:'
taxo_start = description.find(identificator)
taxo = description[taxo_start+len(identificator):
                   taxo_start+description[taxo_start:].find('"\n')]

In [26]:
len(taxo)

6

In [27]:
taxo

'439481'

In [28]:
description.find('db_xref="')

4336

In [176]:
description.find('"\n')

4231

## Loop through all bacteria and retrieve the kmer spectrum

In [29]:
def count_all(scanning=path_ref_db, k=4):
    start = time()
    n = 0
    failed = 0
    for folder in tqdm(os.scandir(scanning), desc="Species", total=len(os.listdir(scanning))):
        if os.path.isdir(folder):
            files = [f for f in os.scandir(folder) if f.name.endswith(".fna")]
            if len(files) > 0:
                # Check if already done
                kmer_freq_path = kmer_pkl_path(k, files[0].path)
                if os.path.isfile(kmer_freq_path):
                    continue
                    
                try:
                    # Count kmers
                    kmer_count = kmers_dic(k)   # change to deepcopy for speed up
                    rec = read_fna(files[0].path)    # TODO: go through all files, not only the first
                    count_kmers(rec, kmer_count, k)

                    # Normalise by the highest count
                    max_val = max(kmer_count.values())
                    for key in kmer_count:      # was kmer_4, which only worked for k=4
                        kmer_count[key] /= max_val

                    # Save to a file
                    kmer_freq_to_file(kmer_count, kmer_freq_path)
                    n += 1
                
                except Exception:               # let KeyboardInterrupt stop the loop
                    failed += 1
                    print(f"Failed: {files[0].path}")

                if n + failed > 3000:
                    break
                    
    print(f"\n{n+failed} species have been scanned\n"
          f"Success: {n}, failed: {failed} \n"
          f"Took {time()-start:.1f}s to complete")

In [None]:
count_all()

Failed: /mnt/genomeDB/ncbi/genomes/Bacteria/Achromobacter_xylosoxidans_uid205255/NC_021285.fna
Failed: /mnt/genomeDB/ncbi/genomes/Bacteria/Acinetobacter_baumannii_ACICU_uid58765/NC_010605.fna
Failed: /mnt/genomeDB/ncbi/genomes/Bacteria/Acinetobacter_baumannii_TYTH_1_uid176498/NC_018706.fna
Failed: /mnt/genomeDB/ncbi/genomes/Bacteria/Actinobacillus_pleuropneumoniae_serovar_3_JL03_uid58891/NC_010278.fna
Failed: /mnt/genomeDB/ncbi/genomes/Bacteria/Actinobacillus_suis_H91_0380_uid176363/NC_018690.fna


In [47]:
for f in os.scandir():
    if f.name.endswith("fna"):
        print(f"{f.name}\t{os.path.getsize(f):>10,d} bytes")

NC_017102.fna	     3,356 bytes
NC_017103.fna	     3,185 bytes
NC_017111.fna	 2,946,222 bytes
NC_017112.fna	     1,947 bytes
NC_017134.fna	   194,284 bytes
NC_017135.fna	    50,781 bytes
NC_017149.fna	   185,660 bytes


In [74]:
%%timeit
rec = SeqIO.read("NC_017102.fna", "fasta")

1000 loops, best of 3: 535 µs per loop


In [75]:
rec

'CAAATTGCGCTACAGATAGTTTGATAGTCTTCCGAAGTTCCAGAGAGAGGCAGCAGAAACTCTTTTACTTTCGCTTCTCCACCAGCCTGCCTGAAAATTACATAAACTGAATCCATCATTTTTCTCCGGCTAATAACGCATCACCTTTTTCTTTCCACTCCTGCCTTTTGGTGTCGTTTATTTCGGCTTTTGCCATAGAAGCCAAAGCCCCCAGAAGTTCGCCTTTGTCCCATGTTGGATTGCCGGTCTTACTATCGACCAGACCGGCCAGAATAAGCAGGCCAGCAGACTGGATGAGGTTATGCGTTCTTTCCTTGTCCTTGGCGTCTCTTTCTCTCTTGGCGATTTTCGATGATCTGGCCTTTGCTTCGGATGCTTTGGCTTTAGCCTCTTTTACGGCTTCCTGCGCTTTTTCCGACCGGATGAGTGCCTTGAGGTCTCTTTCCTCTTTTGGCGTTCTGTCAGTCTTGTCATAGAGCAAGATCAGGATGCGCTGTTCTGTCGATGGCTTCTCAAGCTGTCGAAGGTTAGCGATATGCGTTTCAATCCGAAAAGAGGCCATAGGTTTTCATCCAAACATAAAACGATTATCCGGCTTTATAGCATAAAAATGGTAGCAAATAGACGAGAGCCAAATTATCCTTCCTTGTAGGAAGGCGCGCTTAGACGTTTCTCATAAATGAGAAACAGCACGTTAGGGGTAAACCCCTTAAAACCCCGTTTTAACTTCAAAGGATTGTGGTGTGGCGATTGGTCGTTTGTCGATGAAGGTTGGAGGCAAGGGCAAAGCCTGCGCCCACTCCAACTACATCGAACGCCAAGGCAAATATGCCCATCGCCTAGAAACTGGTGAACGACTGGTTGCGACCGGCAGCGGCAACATGCCGGAATGGGCGCAATCCTCATCACAGTTCTGGAAGGCAGCGGATGAGCATGAACGGGCGAATGGCACCACATACCGCGAAATGGAAATAGCCCTGCCGCGTGAACTGGATGACA

In [86]:
mer2 = {f"{a}{b}":0 for a in nucleotides for b in nucleotides}

In [87]:
mer2 

{'AA': 0,
 'AC': 0,
 'AG': 0,
 'AT': 0,
 'CA': 0,
 'CC': 0,
 'CG': 0,
 'CT': 0,
 'GA': 0,
 'GC': 0,
 'GG': 0,
 'GT': 0,
 'TA': 0,
 'TC': 0,
 'TG': 0,
 'TT': 0}