# Database

A metabolite database is represented by two types of files: a struct file and a mapping file. These databases are used in the
annotation process run by SmartPeak.

Both files are in :abbr:`TSV (tab-separated values)` format. The functions that handle these files are imported from
the ``io`` module.

In [3]:
from BFAIR.io import struct, mapping

## Struct

Struct files contain four columns: metabolite ID, formula, SMILES, and InChI. Only the first two columns are used by SmartPeak.

In [7]:
from pathlib import Path

struct_path = Path("../../tests/test_data/struct_test.tsv")

struct_test = struct.read(struct_path)
struct_test.head()

   id      formula    unused_smiles  unused_inchi
-  ------  ---------  -------------  ------------
0  glc__D  C6H12O6    smiles         inchi
1  gln__L  C5H10N2O3  smiles         inchi
2  glu__L  C5H9NO4    smiles         inchi
3  glx     C2H2O3     smiles         inchi
4  h2o     H2O        smiles         inchi
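
For readers without the test file at hand, the four-column layout can be mimicked with pandas alone. This is a minimal sketch, not BFAIR's implementation of `struct.read`: the inline rows are made-up sample data, and the absence of a header row in the file is an assumption.

```python
import io

import pandas as pd

# Made-up struct content: tab-separated, no header row (an assumption).
raw = "glc__D\tC6H12O6\tsmiles\tinchi\nh2o\tH2O\tsmiles\tinchi\n"

# Column names follow the ones shown in the output above.
columns = ["id", "formula", "unused_smiles", "unused_inchi"]
struct_df = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=columns)

# Only the first two columns are used by SmartPeak.
used = struct_df[["id", "formula"]]
print(used)
```

Keeping the unused columns in the frame preserves the four-column layout if the data is later written back to disk.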


## Mapping

A mapping file summarizes a struct file by grouping metabolites by formula. It contains three columns: mass, formula, and metabolite IDs. The mass column is unused, so its value is typically 0.

In [12]:
mapping_test = mapping.from_struct(struct_test)
mapping_test.head()

   unused_mass  formula        ids
-  -----------  -------------  ------
0            0  C10H14N5O7P    [amp]
1            0  C10H15N5O10P2  [adp]
2            0  C10H16N5O13P3  [atp]
3            0  C21H28N7O14P2  [nad]
4            0  C21H29N7O14P2  [nadh]
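
The grouping step can be illustrated with plain pandas. The sketch below is an assumption about the general idea rather than the actual code behind `mapping.from_struct`; the `fru` ID is a hypothetical entry added so that two metabolites share a formula.

```python
import pandas as pd

# Toy struct table; "fru" is a hypothetical ID sharing a formula with glc__D.
struct_df = pd.DataFrame(
    {
        "id": ["glc__D", "fru", "h2o"],
        "formula": ["C6H12O6", "C6H12O6", "H2O"],
    }
)

# Collect all metabolite IDs that share a formula into one list per formula.
mapping_df = (
    struct_df.groupby("formula")["id"]
    .apply(list)
    .reset_index(name="ids")
)

# The mass column is unused, so it is filled with 0.
mapping_df.insert(0, "unused_mass", 0)
print(mapping_df)
```

In this sketch, metabolites that share a formula end up in the same row, which is why the `ids` column holds lists rather than single values.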


In [14]:
from tabulate import tabulate

In [20]:
print(tabulate(mapping_test.iloc[:5], headers="keys", showindex=False))

  unused_mass  formula        ids
-------------  -------------  --------
            0  C10H14N5O7P    ['amp']
            0  C10H15N5O10P2  ['adp']
            0  C10H16N5O13P3  ['atp']
            0  C21H28N7O14P2  ['nad']
            0  C21H29N7O14P2  ['nadh']


