# Porting genome scale metabolic models for metabolomics
- from CarveMe

**source**
- https://github.com/cdanielmachado/carveme

**Use cobra to parse SBML models where applicable**

Not all models comply with the formats that cobra supports. Models from the UCSD and Thiele labs should comply.

**Base our code on metDataModel**

Each model needs a list of Reactions, list of Pathways, and a list of Compounds.
It's important to include all linked identifiers to other DBs (HMDB, PubChem, etc.) with Compounds, along with formulae (usually the charged form in these models) when available.
We can always update the data later. E.g., the neutral formulae can be inferred from the charged formulae or retrieved from a public metabolite database (e.g., HMDB) if linked.
Save in Python pickle and in JSON.

**No compartmentalization**
- After decompartmentalization,
  - transport reactions can be removed - they are identified by reactants and products being the same.
  - redundant reactions can be merged - the same reaction in different compartments becomes one.
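
The two clean-up steps above can be sketched with plain Python data structures. This is only an illustration, not the `jms` implementation: the dictionary layout, the helper names `is_transport` and `merge_redundant`, and the choice to merge gene lists are all assumptions, and it presumes compartment suffixes have already been stripped from the metabolite IDs.

```python
# Hypothetical sketch: reactions are plain dicts with decompartmentalized IDs.

def is_transport(rxn):
    """True when reactants and products are the same set of metabolites."""
    return sorted(rxn["reactants"]) == sorted(rxn["products"])


def merge_redundant(rxns):
    """Keep one reaction per (reactants, products) signature, merging genes."""
    merged = {}
    for rxn in rxns:
        key = (tuple(sorted(rxn["reactants"])), tuple(sorted(rxn["products"])))
        if key in merged:
            # Same chemistry seen before, e.g. in another compartment.
            genes = merged[key]["genes"]
            genes.extend(g for g in rxn["genes"] if g not in genes)
        else:
            merged[key] = {**rxn, "genes": list(rxn["genes"])}
    return list(merged.values())


# Made-up example data; only the ALAt2r reactants/products mirror the model.
reactions = [
    {"id": "ALAt2r", "reactants": ["ala__L", "h"], "products": ["ala__L", "h"], "genes": ["g1"]},
    {"id": "R1_c", "reactants": ["a", "b"], "products": ["c"], "genes": ["g2"]},
    {"id": "R1_p", "reactants": ["b", "a"], "products": ["c"], "genes": ["g3"]},
]

kept = merge_redundant([r for r in reactions if not is_transport(r)])
print([(r["id"], r["genes"]) for r in kept])
```

Sorting the metabolite IDs before comparing makes the check independent of the order in which a model lists reactants and products.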

Minghao Gong, 2022-04-26

In [1]:
# !pip install cobra --user --ignore-installed ruamel.yaml
# !pip install --upgrade metDataModel # https://github.com/shuzhao-li/metDataModel/ 
# !pip install --upgrade numpy pandas

In [2]:
import cobra # https://cobrapy.readthedocs.io/en/latest/io.html#SBML
from metDataModel.core import Compound, Reaction, Pathway, MetabolicModel
import json   # used below to read the staged BiGG JSON
import os     # used below to build file paths
import re
import sys

import requests

sys.path.append("/Users/gongm/Documents/projects/mass2chem/")
sys.path.append("/Users/gongm/Documents/projects/JMS/JMS/JMS")
from mass2chem.formula import *
from jms.formula import *
from jms.utils.gems import *
from jms.utils.git_download import *

In [3]:
# local CarveMe model file (Clostridium sporogenes) and output folder
model_name = 'Clostridium_sporogenes-Carve'
file_name = 'Clostridium_sporogenes_carveme.xml'
local_path = output_fdr = '../testdata/draftGenome-GEM-JMS/CarveMe/'

In [4]:
# Read the model via cobra
model = cobra.io.read_sbml_model(os.path.join(local_path,file_name))

Scaling...
 A: min|aij| =  1.000e+00  max|aij| =  1.000e+00  ratio =  1.000e+00
Problem data seem to be well scaled


In [5]:
model

| Attribute | Value |
|---|---|
| Name | Clostridium_sporogenes_ATCC_15579 |
| Memory address | 0x07fc463a378e0 |
| Number of metabolites | 1219 |
| Number of reactions | 1715 |
| Number of groups | 0 |
| Objective expression | `1.0*Growth - 1.0*Growth_reverse_699ae` |
| Compartments | cytosol, periplasm, extracellular space |


In [6]:
# metabolite entries, readily convert to list of metabolites
model.metabolites[33]

| Attribute | Value |
|---|---|
| Metabolite identifier | nadh_c |
| Name | Nicotinamide adenine dinucleotide - reduced |
| Memory address | 0x07fc463a6d700 |
| Formula | C21H27N7O14P2 |
| Compartment | C_c |
| In 128 reaction(s) | ILEDHr, M1PD, IMPD, EAR160x, EAR161x, PPND, NHFRBO, EAR180x, 4MBZDH, ICDHx, PGCD, AKGDH, ALCD2x, NADH10, NADH7, PERD, SBTPD, FMNRx, NADFADOR, OIVD3, EAR60x, FRNDPR2r_1, NADH8, PXMO, OIVD1r, SHCHD2,... |


In [7]:
# CarveMe has some particularly strange formulae, e.g. 'HPO4;HO4P' (two spellings joined by ';')
[x.formula for x in model.metabolites]

['C20H21N7O7',
 'C10H12N5O10P2',
 'C10H12N5O13P3',
 'C5H8NO4',
 'HPO4;HO4P',
 'C27H52O5',
 'C27H52O5',
 'C31H60O5',
 'C31H60O5',
 'C31H56O5',
 'C31H56O5',
 'C35H68O5',
 'C35H68O5',
 'C35H64O5',
 'C35H64O5',
 'C39H76O5',
 'C39H76O5',
 'C39H72O5',
 'C39H72O5',
 'H2O',
 'C3H6O',
 'C3H8O2',
 'C3H8O2',
 'C6H12O6',
 'C6H12O6',
 'H2O',
 'H2O2',
 'H4N;NH4',
 'O2',
 'H',
 'C9H11N2O12P2',
 'C15H22N2O17P2',
 'C21H26N7O14P2',
 'C21H27N7O14P2',
 'C21H32N7O16P3S',
 'C21H25N7O17P3',
 'C21H26N7O17P3',
 'H',
 'C5H9O4',
 'C6H11O4',
 'C6H9O6PS',
 'H2O',
 'C7H14N2O4',
 'C7H5NO4',
 'C19H36O7P1',
 'C19H36O7P1',
 'C17H36NO7P1',
 'C17H36NO7P1',
 'C19H40NO7P1',
 'C19H40NO7P1',
 'C19H38NO7P1',
 'C19H38NO7P1',
 'C21H44NO7P1',
 'C21H44NO7P1',
 'C21H42NO7P1',
 'C21H42NO7P1',
 'C23H48NO7P1',
 'C23H48NO7P1',
 'C23H46NO7P1',
 'C23H46NO7P1',
 'C10H12N5O7P',
 'C12H23O2',
 'C29H58N1O8P1',
 'HO7P2;P2HO7',
 'C33H66N1O8P1',
 'C14H27O2',
 'C33H62N1O8P1',
 'C14H25O2',
 'C16H31O2',
 'C37H74N1O8P1',
 'C16H29O2',
 'C37H70N1O8P1
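
The list above contains entries such as `'HPO4;HO4P'`, `'H4N;NH4'` and `'HO7P2;P2HO7'`, where two spellings of a formula are joined by `;`. One possible way to handle them is sketched below: take the first alternative and check that the alternatives describe the same atoms. The helper names `parse_formula` and `pick_formula` are hypothetical, and the parser assumes simple formulae without brackets or charges.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C21H27N7O14P2'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def pick_formula(raw, sep=";"):
    """Return (first alternative, whether all alternatives have equal composition)."""
    alternatives = [f for f in raw.split(sep) if f]
    if not alternatives:
        return "", True
    reference = parse_formula(alternatives[0])
    consistent = all(parse_formula(f) == reference for f in alternatives)
    return alternatives[0], consistent


for raw in ["HPO4;HO4P", "H4N;NH4", "HO7P2;P2HO7", "C3H6O"]:
    print(raw, "->", pick_formula(raw))
```

For the three examples taken from the list, the alternatives differ only in element order, so keeping the first one loses no information.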

In [8]:
model.metabolites[33].__dict__

{'_id': 'nadh_c',
 'name': 'Nicotinamide adenine dinucleotide - reduced',
 'notes': {'FORMULA': 'C21H27N7O14P2',
  'BioCyc': 'META:NADH',
  'SEED Compound': 'cpd00004',
  'UniPathway Compound': 'UPC00004',
  'KEGG Compound': 'C00004',
  'BioPath Molecule': 'NADH',
  'MetaNetX (MNX) Chemical': 'MNXM10',
  'Reactome': '192305;194697;29362;73473',
  'Human Metabolome Database': 'HMDB01487'},
 '_annotation': {},
 '_model': <Model Clostridium_sporogenes_ATCC_15579 at 0x7fc463a378e0>,
 '_reaction': {<Reaction 4MBZALDH at 0x7fc449961b50>,
  <Reaction 4MBZDH at 0x7fc44993c790>,
  <Reaction ABUTD at 0x7fc4499838e0>,
  <Reaction ACALD at 0x7fc4499830d0>,
  <Reaction ACOAD1 at 0x7fc44999dee0>,
  <Reaction ACOAD2 at 0x7fc4499a7ee0>,
  <Reaction ACOAD4_1 at 0x7fc4499a7f70>,
  <Reaction ACOAD5_1 at 0x7fc44997b040>,
  <Reaction ACTD2 at 0x7fc4499ccf70>,
  <Reaction AKGDH at 0x7fc449a12850>,
  <Reaction ALAD_L at 0x7fc449a12df0>,
  <Reaction ALCD19 at 0x7fc449a2eb80>,
  <Reaction ALCD1 at 0x7fc4499f8d

In [9]:
# reaction entries, Readily convert to list of reactions
model.reactions[33]

| Attribute | Value |
|---|---|
| Reaction identifier | 2AGPGAT160 |
| Name | 2-acyl-glycerophospho-glycerol acyltransferase (n-C16:0) |
| Memory address | 0x07fc4498fceb0 |
| Stoichiometry | `2agpg160_c + atp_c + hdca_c --> amp_c + pg160_c + ppi_c`  2-Acyl-sn-glycero-3-phosphoglycerol (n-C16:0) + ATP + Hexadecanoate (n-C16:0) --> AMP + Phosphatidylglycerol (dihexadecanoyl, n-C16:0) + Diphosphate |
| GPR | EDU36256_1 |
| Lower bound | 0.0 |
| Upper bound | 1000.0 |


In [10]:
model.reactions[33].__dict__

{'_id': '2AGPGAT160',
 'name': '2-acyl-glycerophospho-glycerol acyltransferase (n-C16:0)',
 'notes': {'MetaNetX (MNX) Equation': 'MNXR68143'},
 '_annotation': {},
 '_gene_reaction_rule': 'EDU36256_1',
 'subsystem': '',
 '_genes': {<Gene EDU36256_1 at 0x7fc449820220>},
 '_metabolites': {<Metabolite 2agpg160_c at 0x7fc463a6dfd0>: -1.0,
  <Metabolite atp_c at 0x7fc463a6d250>: -1.0,
  <Metabolite hdca_c at 0x7fc463a6dd60>: -1.0,
  <Metabolite amp_c at 0x7fc463a6db80>: 1.0,
  <Metabolite pg160_c at 0x7fc463adc310>: 1.0,
  <Metabolite ppi_c at 0x7fc463a6dc10>: 1.0},
 '_model': <Model Clostridium_sporogenes_ATCC_15579 at 0x7fc463a378e0>,
 '_lower_bound': 0.0,
 '_upper_bound': 1000.0}

In [11]:
# NO group or pathway information
# model.groups[33]

## Port metabolite

In [13]:
def port_metabolite(M):
    # convert cobra Metabolite to metDataModel Compound
    Cpd = Compound()
    Cpd.src_id = remove_compartment_by_split(M.id,'_')
    Cpd.id = remove_compartment_by_split(M.id,'_')              # temporarily the same with the source id
    Cpd.name = M.name
    Cpd.charged_formula = M.formula
    Cpd.db_ids = [[model_name,Cpd.src_id]] # also record the source model's own ID in the db_ids field
    for k,v in M.notes.items():
        if isinstance(v,list):
            Cpd.db_ids.extend([k,x] for x in v)  # one [db, id] pair per value, not a nested list
        else: 
            if ":" in v:
                Cpd.db_ids.append([k,v.split(":")[1]])
            else:
                Cpd.db_ids.append([k,v])
    
    inchi_list = [x[1].split('=')[1] for x in Cpd.db_ids if x[0] == 'inchi']
    if len(inchi_list) ==1:
        Cpd.inchi = inchi_list[0]
    elif len(inchi_list) >1:
        Cpd.inchi = inchi_list
        
    return Cpd

In [14]:
myCpds = []
for i in range(len(model.metabolites)):
    myCpds.append(port_metabolite(model.metabolites[i]))

In [15]:
len(myCpds)

1219

In [16]:
# remove duplicated compounds
myCpds = remove_duplicate_cpd(myCpds)

In [17]:
len(myCpds)

854
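
The drop from 1219 to 854 entries is what one would expect when the same metabolite appears in several compartments and collapses to one ID. A minimal sketch of that idea follows. It is not the `jms` code behind `remove_compartment_by_split` and `remove_duplicate_cpd`; the helper names are hypothetical, and it assumes every ID ends with an underscore plus a compartment tag.

```python
def strip_compartment(met_id, sep="_"):
    """Drop the trailing compartment tag, e.g. 'nadh_c' -> 'nadh'.

    Assumes the ID really carries a compartment suffix; an ID without one
    would lose its last underscore-separated part.
    """
    base, _, _compartment = met_id.rpartition(sep)
    return base if base else met_id


def dedupe_by_id(met_ids):
    """Strip compartments and keep the first occurrence of each ID, in order."""
    return list(dict.fromkeys(strip_compartment(m) for m in met_ids))


ids = ["h_c", "h_p", "h_e", "nadh_c", "idon__L_p", "idon__L_c"]
print(dedupe_by_id(ids))
```

Splitting on the last separator only matters for BiGG-style IDs such as `idon__L_p`, which contain underscores of their own.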

In [18]:
def fetch_CarveMe_GEM_charge_formula(compound_list, Bigg_json_path, overwrite=True):
    # Copy charge and charged formula from the staged BiGG JSON onto matching compounds.
    # Note: the BiGG JSON uses `identifiers` rather than `db_ids`.
    with open(Bigg_json_path, 'r') as f:
        list_Bigg_cpd = json.load(f)
    Bigg_dict = {Bigg_cpd['id']: Bigg_cpd for Bigg_cpd in list_Bigg_cpd}
    new_cpd_list = []
    for myCpd in compound_list:
        v = Bigg_dict.get(myCpd.id)   # direct lookup instead of scanning the dict
        if v is not None and overwrite:
            myCpd.charge = v['charge']
            myCpd.charged_formula = v['charged_formula']
        new_cpd_list.append(myCpd)
    return new_cpd_list

In [19]:
with open('../jms/data/staged/Bigg_FM_CG_updated0427.json','r') as f:
    list_Bigg_cpd = json.load(f)
Bigg_dict = {}
for Bigg_cpd in list_Bigg_cpd:
    Bigg_dict.update({Bigg_cpd['id']:Bigg_cpd})

In [20]:
Bigg_dict['idon__L']

{'id': 'idon__L',
 'name': 'L-Idonate',
 'identifiers': [['BIGG', 'idon_DASH_L_p'],
  ['BIGG', 'idon_L_p'],
  ['BIGG', 'idon__L'],
  ['BIGG', 'idon__L_p'],
  ['KEGG', 'C00770'],
  ['CHEBI', '13126'],
  ['CHEBI', '17796'],
  ['CHEBI', '21335'],
  ['CHEBI', '21336'],
  ['CHEBI', '57659'],
  ['CHEBI', '58494'],
  ['CHEBI', '6250'],
  ['BioCyc', 'L-IDONATE'],
  ['MetaNetX', 'MNXM1565'],
  ['inchikey', 'RGHNJXZEOKUKBD-SKNVOMKLSA-M'],
  ['SEED', 'cpd00573']],
 'neutral_formula': '',
 'charge': -1,
 'charged_formula': 'C6H11O7',
 'neutral_mono_mass': 0.0,
 'SMILES': '',
 'inchi': ''}

In [21]:
myCpds = fetch_CarveMe_GEM_charge_formula(myCpds,'../jms/data/staged/Bigg_FM_CG_updated0427.json')

In [22]:
myCpds[20].__dict__

{'internal_id': '',
 'id': 'udp',
 'name': 'UDP',
 'db_ids': [['Clostridium_sporogenes-Carve', 'udp'],
  ['FORMULA', 'C9H11N2O12P2'],
  ['BioCyc', 'UDP;META'],
  ['SEED Compound', 'cpd00014'],
  ['UniPathway Compound', 'UPC00015'],
  ['KEGG Compound', 'C00015;G10619'],
  ['BioPath Molecule', 'Uridine-5-prime-diphosphate'],
  ['MetaNetX (MNX) Chemical', 'MNXM17'],
  ['Reactome', '110096;111814;158602;205687'],
  ['Human Metabolome Database', 'HMDB00295']],
 'neutral_formula': '',
 'neutral_mono_mass': 0.0,
 'charge': -3,
 'charged_formula': 'C9H11N2O12P2',
 'SMILES': '',
 'inchi': '',
 'src_id': 'udp'}
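
Several values in the record above pack more than one identifier into a single string, for example `'C00015;G10619'` for KEGG and four Reactome IDs. If one identifier per entry is preferred, the pairs could be expanded as sketched here. The helper `split_db_ids` is hypothetical, and whether downstream code expects split or joined values is an assumption to check before adopting it.

```python
def split_db_ids(db_ids, sep=";"):
    """Expand [db, 'a;b'] pairs into one [db, id] pair per identifier."""
    expanded = []
    for db, value in db_ids:
        for item in str(value).split(sep):
            item = item.strip()
            if item:
                expanded.append([db, item])
    return expanded


# Values copied from the UDP record shown above.
db_ids = [
    ["KEGG Compound", "C00015;G10619"],
    ["Reactome", "110096;111814;158602;205687"],
    ["Human Metabolome Database", "HMDB00295"],
]

for pair in split_db_ids(db_ids):
    print(pair)
```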

In [23]:
import warnings
def update_neutral_formula_mass(compound_list):
    for myCpd in compound_list:
        try:
            myCpd.neutral_formula = adjust_charge_in_formula(myCpd.charged_formula,myCpd.charge) 
            myCpd.neutral_mono_mass = neutral_formula2mass(myCpd.neutral_formula)
        except Exception:
            warnings.warn(f'{myCpd.id} does not have a valid charge or charged formula to convert')

In [24]:
update_neutral_formula_mass(myCpds)

In [25]:
# check those that don't have the charge and charged formula
[x.__dict__ for x in myCpds if x.id == 'idon__L']

[{'internal_id': '',
  'id': 'idon__L',
  'name': 'L-Idonate',
  'db_ids': [['Clostridium_sporogenes-Carve', 'idon__L'],
   ['FORMULA', 'C6H11O7'],
   ['BioCyc', 'L-IDONATE'],
   ['SEED Compound', 'cpd00573'],
   ['UniPathway Compound', 'UPC00770'],
   ['KEGG Compound', 'C00770'],
   ['MetaNetX (MNX) Chemical', 'MNXM1565']],
  'neutral_formula': 'C6H12O7',
  'neutral_mono_mass': 196.058302,
  'charge': -1,
  'charged_formula': 'C6H11O7',
  'SMILES': '',
  'inchi': '',
  'src_id': 'idon__L'}]
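
The record above is consistent with simple proton bookkeeping: a charge of -1 means one proton was lost, so the neutral formula has one more H than the charged one (C6H11O7 becomes C6H12O7). The sketch below illustrates that adjustment and the resulting monoisotopic mass. It is not the `mass2chem`/`jms` implementation of `adjust_charge_in_formula` or `neutral_formula2mass`; the helper names are hypothetical, it assumes the charge comes only from gained or lost protons, and the mass table covers just six elements.

```python
import re

# Monoisotopic masses of the most abundant isotopes (limited, illustrative table).
MONO_MASS = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "P": 30.97376163,
    "S": 31.97207100,
}


def parse_formula(formula):
    """Return {element: count} for a simple formula without brackets."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def neutralize(charged_formula, charge):
    """Add or remove protons so that the net charge becomes zero."""
    counts = parse_formula(charged_formula)
    counts["H"] = counts.get("H", 0) - charge
    if counts["H"] < 0:
        raise ValueError("not enough H atoms to remove")
    return "".join(f"{el}{n if n != 1 else ''}" for el, n in counts.items() if n > 0)


def mono_mass(formula):
    """Sum monoisotopic masses; raises KeyError for elements not in the table."""
    return sum(MONO_MASS[el] * n for el, n in parse_formula(formula).items())


neutral = neutralize("C6H11O7", -1)
print(neutral, round(mono_mass(neutral), 6))
```

Metal complexes and permanently charged species such as quaternary amines break the proton-only assumption and would need separate handling.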

## Port reactions

In [26]:
[x.__dict__ for x in model.reactions]

[{'_id': '12DGR120tipp',
  'name': '1,2 diacylglycerol transport via flipping (periplasm to cytoplasm, n-C12:0)',
  'notes': {'MetaNetX (MNX) Equation': 'MNXR7'},
  '_annotation': {},
  '_gene_reaction_rule': '',
  'subsystem': '',
  '_genes': set(),
  '_metabolites': {<Metabolite 12dgr120_p at 0x7fc463a6d1c0>: -1.0,
   <Metabolite 12dgr120_c at 0x7fc463a6d220>: 1.0},
  '_model': <Model Clostridium_sporogenes_ATCC_15579 at 0x7fc463a378e0>,
  '_lower_bound': 0.0,
  '_upper_bound': 1000.0},
 {'_id': '12DGR140tipp',
  'name': '1,2 diacylglycerol transport via flipping (periplasm to cytoplasm, n-C14:0)',
  'notes': {'MetaNetX (MNX) Equation': 'MNXR81306'},
  '_annotation': {},
  '_gene_reaction_rule': '',
  'subsystem': '',
  '_genes': set(),
  '_metabolites': {<Metabolite 12dgr140_p at 0x7fc463a6d130>: -1.0,
   <Metabolite 12dgr140_c at 0x7fc463a6d1f0>: 1.0},
  '_model': <Model Clostridium_sporogenes_ATCC_15579 at 0x7fc463a378e0>,
  '_lower_bound': 0.0,
  '_upper_bound': 1000.0},
 {'_id':

In [27]:
# port reactions, to include genes and enzymes
def port_reaction(R):
    new = Reaction()
    new.id = R.id
    new.reactants = [remove_compartment_by_split(m.id,'_') for m in R.reactants] # decompartmentalization
    new.products = [remove_compartment_by_split(m.id,'_') for m in R.products]   # decompartmentalization
    new.genes = [g.id for g in R.genes]
    new.enzymes = R.notes.get('EC Number',[])
    new.db_ids = dict2listOfTuple(R.notes,';') # not sure ; is useful
    return new
    
test99 = port_reaction(model.reactions[200])
test99.__dict__

{'azimuth_id': '',
 'id': 'ALAt2r',
 'source': [],
 'version': '',
 'status': '',
 'reactants': ['ala__L', 'h'],
 'products': ['ala__L', 'h'],
 'enzymes': [],
 'genes': ['EDU37508_1', 'EDU36264_1', 'EDU37559_1'],
 'pathways': [],
 'ontologies': [],
 'species': '',
 'compartments': [],
 'cell_types': [],
 'tissues': [],
 'db_ids': [('MetaNetX (MNX) Equation', 'MNXR636'),
  ('BioCyc', 'META:RXN0-5202')]}

In [28]:
## Reactions to port
myRxns = []
for R in model.reactions:
    myRxns.append( port_reaction(R) )
    
print(len(myRxns))

1715


In [29]:
# remove duplicated reactions after decompartmentalization
myRxns = remove_duplicate_rxn(myRxns)

In [30]:
len(myRxns)

1457

In [31]:
myRxns[0].__dict__

{'azimuth_id': '',
 'id': '2AGPEAT120',
 'source': [],
 'version': '',
 'status': '',
 'reactants': ['2agpe120', 'atp', 'ddca'],
 'products': ['amp', 'pe120', 'ppi'],
 'enzymes': [],
 'genes': ['EDU36256_1'],
 'pathways': [],
 'ontologies': [],
 'species': '',
 'compartments': [],
 'cell_types': [],
 'tissues': [],
 'db_ids': [('MetaNetX (MNX) Equation', 'MNXR81336')]}

## Pathway information will be parsed later when looking back to the original data

## Collected data; now output

In [32]:
from datetime import datetime
today =  str(datetime.today()).split(" ")[0]

In [33]:
today

'2022-04-27'

In [34]:
note = """CarveMe model."""

## metabolicModel to export
MM = MetabolicModel()
MM.id = f'az_{model_name}_{today}' #
MM.meta_data = {
            'species': model_name.split('-')[0],
            'version': '',
            'sources': [f'https://github.com/cdanielmachado/carveme, retrieved {today}'], #
            'status': '',
            'last_update': today,  #
            'note': note,
        }
# MM.list_of_pathways = [P.serialize() for P in myPathways]
MM.list_of_reactions = [R.serialize() for R in  myRxns]
MM.list_of_compounds = [C.serialize() for C in myCpds]

In [35]:
import pickle
import os

# Write pickle file
export_pickle(os.path.join(output_fdr,f'{MM.id}.pickle'), MM)

In [36]:
# Write json file
export_json(os.path.join(output_fdr,f'{MM.id}.json'), MM)

In [37]:
# Write dataframe 
import pandas as pd
export_table(os.path.join(output_fdr,f'{MM.id}_list_of_compounds.csv'),MM, 'list_of_compounds')
export_table(os.path.join(output_fdr,f'{MM.id}_list_of_reactions.csv'),MM, 'list_of_reactions')
# export_table(os.path.join(output_fdr,f'{MM.id}_list_of_pathways.csv'),MM, 'list_of_pathways')

## Summary

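This ports reactions and compounds. Pathways are not ported yet because this CarveMe model carries no group or pathway information. Gene and enzyme information is included where the model provides it.

The exported pickle can be re-imported and uploaded to the database easily.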
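
As a rough illustration of the re-import step, the sketch below writes a small stand-in model to pickle and JSON with the standard library and reads both back. The `export_pickle` and `export_json` helpers used in this notebook come from `jms` and their internals are not shown here, so the dictionary layout and file names below are assumptions.

```python
import json
import os
import pickle
import tempfile

# Stand-in for a serialized model; the real object is a metDataModel MetabolicModel.
model = {
    "id": "az_example_model",
    "list_of_compounds": [{"id": "udp", "charge": -3, "charged_formula": "C9H11N2O12P2"}],
    "list_of_reactions": [{"id": "2AGPEAT120", "reactants": ["2agpe120", "atp", "ddca"],
                           "products": ["amp", "pe120", "ppi"]}],
}

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "model.pickle")
    json_path = os.path.join(tmp, "model.json")

    with open(pkl_path, "wb") as f:
        pickle.dump(model, f)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(model, f, indent=2)

    with open(pkl_path, "rb") as f:
        from_pickle = pickle.load(f)
    with open(json_path, "r", encoding="utf-8") as f:
        from_json = json.load(f)

print(from_pickle == model, from_json == model)
```

The example uses lists throughout because JSON has no tuple type: tuples, such as the reaction `db_ids` pairs shown earlier, come back from JSON as lists, whereas pickle preserves them.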

This notebook, the pickle file and the JSON file go to GitHub repo (https://github.com/shuzhao-li/Azimuth).