## Porting genome-scale metabolic models for metabolomics (AGORA)

- to make the model formats compatible with mummichog
- to link compounds to a common compound table
- to generate predicted mass peaks from the compound table, based on formulae
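
The last goal, predicting mass peaks from a formula, can be sketched roughly as follows. This is a minimal illustration and not the mummichog or mass2chem implementation: the helper names (`parse_formula`, `mono_mass`, `predicted_mz`), the small element table, and the two adducts shown are assumptions made for this example.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes; an illustrative subset only.
MONO_MASS = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "P": 30.97376163,
    "S": 31.97207100,
}
PROTON = 1.00727646688  # mass of a proton (Da)


def parse_formula(formula):
    """Parse a simple formula such as 'C6H12O6' into {'C': 6, 'H': 12, 'O': 6}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def mono_mass(formula):
    """Neutral monoisotopic mass; raises KeyError for elements not in MONO_MASS."""
    return sum(MONO_MASS[el] * n for el, n in parse_formula(formula).items())


def predicted_mz(formula):
    """Predicted m/z for two common singly charged adducts (hypothetical selection)."""
    m = mono_mass(formula)
    return {"[M+H]+": m + PROTON, "[M-H]-": m - PROTON}


glucose_peaks = predicted_mz("C6H12O6")
```

A real workflow would use the neutral formula of each compound and a longer adduct list; the point here is only that each compound in the table yields a small set of expected m/z values.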

As mummichog 3 is under development, treat this as part of development.

*Use cobra to parse SBML models wherever applicable*

Not all models comply with the formats that cobra supports. Models from the UCSD and Thiele labs should comply.

*Base our code on metDataModel*

Each model needs a list of Reactions, a list of Pathways, and a list of Compounds. It's important to include Compounds with all linked identifiers to other DBs (HMDB, PubChem, etc.), and with formulae (usually the charged form in these models) when available. We can always update the data later. E.g. the neutral formulae can be retrieved from HMDB if linked. Save in Python pickle and in JSON.
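
As a rough picture of the target structure, the sketch below builds a model from plain dictionaries and serializes it to both JSON and pickle. The field names mirror attributes used later in this notebook (`id`, `name`, `neutral_formula`, `db_ids`, `reactants`, `products`, `list_of_reactions`), but the exact schema of metDataModel may differ. The model id, pathway id, and pathway name are hypothetical placeholders.

```python
import json
import pickle

# One compound, one reaction and one pathway, written as plain dicts.
compound = {"id": "h2o", "name": "Water", "neutral_formula": "H2O", "db_ids": []}
reaction = {
    "id": "3HAD14M16",
    "reactants": ["14m3hpalmACP"],
    "products": ["14mtpalm2eACP", "h2o"],
}
pathway = {
    "id": "group1",             # hypothetical pathway id
    "name": "example pathway",  # hypothetical pathway name
    "list_of_reactions": ["3HAD14M16"],
}

model_dict = {
    "id": "example_model",      # hypothetical model id
    "list_of_compounds": [compound],
    "list_of_reactions": [reaction],
    "list_of_pathways": [pathway],
}

# JSON is human-readable and language-neutral; pickle is Python-only.
as_json = json.dumps(model_dict, indent=2)
as_pickle = pickle.dumps(model_dict)
```

Reactions and pathways refer to compounds and reactions by id only, so the three lists can be updated independently later.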

Minghao Gong, 2022-04-21; Georgi Kolishovski, 2021-05-12

In [1]:
# !pip install cobra
# !pip install --upgrade metDataModel

In [2]:
# https://cobrapy.readthedocs.io/en/latest/io.html#SBML
import cobra
from metDataModel.core import Compound, Reaction, Pathway, MetabolicModel
import sys
import os
sys.path.append("/Users/gongm/Documents/projects/JMS/JMS/JMS")
sys.path.append("/Users/gongm/Documents/projects/mass2chem/")
from jms.formula import *
from mass2chem.formula import *
from jms.utils.gems import *
from jms.utils import git_download
from datetime import datetime
today =  str(datetime.today()).split(" ")[0]

# Parse the model of Escherichia_albertii_KF1

In [3]:
url = "https://github.com/VirtualMetabolicHuman/AGORA/blob/master/CurrentVersion/AGORA_1_03/AGORA_1_03_With_Mucins_sbml/Escherichia_albertii_KF1.xml"
os.path.splitext(url.split('/')[-1])[0]

'Escherichia_albertii_KF1'
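
Note that a GitHub `blob` URL like the one above points to an HTML page, while the file content itself is served from `raw.githubusercontent.com`. How `git_download_from_file` handles this is not shown in this notebook; the sketch below only illustrates the URL mapping, and `blob_to_raw_url` is a hypothetical helper that is not part of jms.

```python
def blob_to_raw_url(url):
    """Convert https://github.com/<owner>/<repo>/blob/<branch>/<path>
    into the corresponding raw.githubusercontent.com URL (hypothetical helper)."""
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        raise ValueError(f"Not a GitHub blob URL: {url}")
    owner_repo, branch_and_path = url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{owner_repo}/{branch_and_path}"


raw_url = blob_to_raw_url(
    "https://github.com/VirtualMetabolicHuman/AGORA/blob/master/CurrentVersion/"
    "AGORA_1_03/AGORA_1_03_With_Mucins_sbml/Escherichia_albertii_KF1.xml"
)
```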

In [4]:
input_fdr = local_output_dir = "../test/input/test_output/"
output_fdr = "../test/input/test_output/"
file_name = url.split('/')[-1]
model_name = os.path.splitext(file_name)[0]

git_download.git_download_from_file(url = url,local_output_dir = local_output_dir,file_name = file_name)
model = cobra.io.read_sbml_model(os.path.join(local_output_dir,file_name))

Scaling...
 A: min|aij| =  1.000e+00  max|aij| =  1.000e+00  ratio =  1.000e+00
Problem data seem to be well scaled


In [5]:
model.name = model_name

## Notes and metadata for the model

In [6]:
note = f'AGORA cloned from https://github.com/VirtualMetabolicHuman, retrieved on {today}.'

meta_data = {
    'species': model.name,
    'version': url.split('/')[-2],  # parent folder name in the URL
    'sources': [f'https://github.com/VirtualMetabolicHuman, retrieved {today}'],  # f-string so {today} is filled in
    'status': '',
    'last_update': '20210512',
    'note': note,
}

In [7]:
meta_data

{'species': 'Escherichia_albertii_KF1',
 'version': 'AGORA_1_03_With_Mucins_sbml',
 'sources': ['https://github.com/VirtualMetabolicHuman, retrieved 2022-04-27'],
 'status': '',
 'last_update': '20210512',
 'note': 'AGORA cloned from https://github.com/VirtualMetabolicHuman, retrieved on 2022-04-27.'}

In [8]:
model.reactions[33]

| Field | Value |
|---|---|
| Reaction identifier | 3HAD14M16 |
| Name | 14-methyl-3-hydroxy-hexa-decanoyl-ACP hydro-lyase |
| Memory address | 0x07fe3986d4c10 |
| Stoichiometry | 14m3hpalmACP[c] --> 14mtpalm2eACP[c] + h2o[c] |
| | 14-methyl-3-hydroxy-hexa-decanoyl-ACP --> 14-methyl-trans-hexa-dec-2-enoyl-ACP + Water |
| GPR | g.214925.CDS.1408 |
| Lower bound | 0.0 |
| Upper bound | 1000.0 |


In [9]:
model.metabolites[33]

| Field | Value |
|---|---|
| Metabolite identifier | 12mtmrs2eACP[c] |
| Name | 12-methyl-trans-tetra-dec-2-enoyl-ACP |
| Memory address | 0x07fe3b82f4580 |
| Formula | C26H47N2O8PRS |
| Compartment | c |
| In 2 reaction(s) | EAR12M14x, 3HAD12M14 |


In [10]:
list_all_identifiers(model,'notes')

({'ChEBIID', 'InChIString', 'PubChemID'},
 {'ChEBIID': '17815',
  'InChIString': '1/C3H8O10P2/c4-2(1-12-14(6,7)8)3(5)13-15(9,10)11/h2,4H,1H2,(H2,6,7,8)(H2,9,10,11)/t2-/m1/s1/f/h6-7,9-10H',
  'PubChemID': 'C02737'})
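
The output above is a set of all identifier keys found in the metabolite notes, plus one example notes dictionary. A minimal sketch of how such a survey could be done is given below. It is not the jms `list_all_identifiers` implementation; `collect_note_keys` is a hypothetical helper, and `SimpleNamespace` objects stand in for cobra metabolites.

```python
from types import SimpleNamespace


def collect_note_keys(metabolites):
    """Return (all keys seen in .notes, the notes dict with the most keys)."""
    all_keys = set()
    richest = {}
    for met in metabolites:
        all_keys.update(met.notes)       # iterating a dict yields its keys
        if len(met.notes) > len(richest):
            richest = dict(met.notes)    # copy so the original is not shared
    return all_keys, richest


# Stand-ins for cobra metabolites; the note values are taken from the output above.
mets = [
    SimpleNamespace(notes={}),
    SimpleNamespace(notes={"ChEBIID": "17815", "PubChemID": "C02737"}),
    SimpleNamespace(notes={"ChEBIID": "17815"}),
]
keys, example = collect_note_keys(mets)
```

Surveying the keys first shows which database identifiers are actually available before deciding what to copy into `db_ids`.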

In [11]:
def port_metabolite(M):
    # convert cobra Metabolite to metDataModel Compound
    Cpd = Compound()
    Cpd.src_id = remove_compartment_by_split(M.id,'[') # remove the [c] from eg h2o[c]
    Cpd.id = remove_compartment_by_split(M.id,'[') # remove the [c] from eg h2o[c]
    Cpd.name = M.name
    Cpd.charge = M.charge
    Cpd.charged_formula = M.formula
    Cpd.neutral_formula = adjust_charge_in_formula(M.formula,M.charge)
    Cpd.neutral_mono_mass = neutral_formula2mass(Cpd.neutral_formula)
    Cpd.db_ids = list(M.notes.items())
    mydict = M.notes   # other database IDs are stored in the notes tag
    Cpd.SMILES = mydict.get("SMILES", None)  # unclear whether this is useful
    Cpd.inchi = mydict.get("InChIString", None)
    return Cpd

port_metabolite(model.metabolites[33]).id

'12mtmrs2eACP'
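
Two helpers used in `port_metabolite` come from jms and are not shown here. The sketch below illustrates what they plausibly do, under a loudly stated assumption: the charge is attributed entirely to gained or lost protons, so the neutral formula is obtained by subtracting the charge from the hydrogen count. The names `strip_compartment` and `neutralize_formula` are hypothetical, and the jms functions `remove_compartment_by_split` and `adjust_charge_in_formula` may behave differently.

```python
import re


def strip_compartment(met_id, sep="["):
    """'h2o[c]' -> 'h2o'; ids without a compartment suffix are returned unchanged."""
    return met_id.split(sep)[0]


def neutralize_formula(formula, charge):
    """Adjust the H count by the charge (assumes charge comes only from protons)."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    counts["H"] = counts.get("H", 0) - charge
    if counts["H"] < 0:
        raise ValueError("Not enough hydrogens to remove for this charge")
    # Rebuild the formula in the original element order, omitting counts of 1.
    return "".join(f"{el}{n if n > 1 else ''}" for el, n in counts.items() if n > 0)


lactic_acid = neutralize_formula("C3H5O3", -1)  # lactate anion to neutral lactic acid
```

This simple rule does not hold for metal ions or permanently charged groups, which is one reason neutral formulae retrieved from a linked database can be more reliable.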

In [12]:
# def fetch_AGORA_GEM_identifiers(compound_list,
#                                 json_path = json_path,
#                                 overwrite = True):
#     with open(json_path,'r') as f:
#         list_vmh_cpd = json.load(f)
#     vmh_dict = {}
#     for vmh_cpd in list_vmh_cpd:
#         vmh_dict.update({vmh_cpd['id']:vmh_cpd})
#     new_cpd_list = []
#     for myCpd in compound_list:
#         for k,v in vmh_dict.items():
#             if myCpd.id == k:
#                 if overwrite == True:
#                     myCpd.db_ids = v['identifiers'] # the vmh json is using `identifiers` rather than `db_ids`
#                 break
#         new_cpd_list.append(myCpd)
#     return new_cpd_list

In [13]:
# port reactions
def port_reaction(R):
    new = Reaction()
    new.id = R.id
    new.reactants = [remove_compartment_by_split(m.id,'[') for m in R.reactants] 
    new.products = [remove_compartment_by_split(m.id,'[') for m in R.products] 
    return new

In [14]:
# pathways, using group as pathway from AGORA. Other models may use subsystem etc.

def port_pathway(P):
    new = Pathway()
    new.id = P.id
    new.source = ['AGORA',]
    new.name = P.name
    new.list_of_reactions = [x.id for x in P.members]
    return new


# Run the functions

In [15]:
myCpds = []
for M in model.metabolites:
    myCpds.append(port_metabolite(M))

print(f'Before decompartmentalization, there are {len(myCpds)} compounds')

# remove duplicated compounds
myCpds = remove_duplicate_cpd(myCpds)

print(f'After decompartmentalization, there are {len(myCpds)} compounds left')

myCpds = fetch_AGORA_GEM_identifiers(myCpds,json_path = '../jms/data/staged/vmh.json',overwrite = True)

Before decompartmentalization, there are 1322 compounds
After decompartmentalization, there are 1150 compounds left
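
The drop from 1322 to 1150 compounds comes from the same metabolite appearing in several compartments, which collapse to one id once the compartment suffix is removed. A minimal sketch of such a de-duplication follows. It is not the jms `remove_duplicate_cpd` implementation; `dedupe_by_id` is a hypothetical helper and the objects are simple stand-ins for Compound instances.

```python
from types import SimpleNamespace


def dedupe_by_id(items):
    """Keep the first object seen for each `id`, preserving the original order."""
    seen = {}
    for item in items:
        seen.setdefault(item.id, item)  # later duplicates are ignored
    return list(seen.values())


# e.g. h2o[c] and h2o[e] both become 'h2o' after the compartment is stripped
cpds = [SimpleNamespace(id=i) for i in ["h2o", "12mtmrs2eACP", "h2o"]]
unique_cpds = dedupe_by_id(cpds)
```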


In [16]:
## Reactions to port
myRxns = []
for R in model.reactions:
    myRxns.append( port_reaction(R) )

print(f'Before removing transport reactions, there are {len(myRxns)} reactions')

# remove duplicated reactions after decompartmentalization
myRxns = remove_duplicate_rxn(myRxns)

print(f'After removing transport reactions, there are {len(myRxns)} reactions')

Before removing transport reactions, there are 1724 reactions
After removing transport reactions, there are 1590 reactions
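
The print messages suggest that transport reactions disappear in this step: once compartments are removed, a reaction that only moves a metabolite between compartments has identical reactants and products. The sketch below shows one way this could work, as an assumption rather than the actual jms `remove_duplicate_rxn` logic. `dedupe_reactions` is a hypothetical helper, and the reaction ids `H2Ot` and `3HAD14M16_copy` are made up for illustration.

```python
from types import SimpleNamespace


def dedupe_reactions(reactions):
    """Drop transport-like reactions and reactions with an already seen
    (reactants, products) pair. Stoichiometry and direction are ignored."""
    kept, seen = [], set()
    for rxn in reactions:
        key = (frozenset(rxn.reactants), frozenset(rxn.products))
        if key[0] == key[1]:  # same compounds on both sides: transport-like
            continue
        if key in seen:       # duplicate after decompartmentalization
            continue
        seen.add(key)
        kept.append(rxn)
    return kept


rxns = [
    SimpleNamespace(id="3HAD14M16", reactants=["14m3hpalmACP"], products=["14mtpalm2eACP", "h2o"]),
    SimpleNamespace(id="H2Ot", reactants=["h2o"], products=["h2o"]),  # hypothetical transport
    SimpleNamespace(id="3HAD14M16_copy", reactants=["14m3hpalmACP"], products=["h2o", "14mtpalm2eACP"]),  # hypothetical duplicate
]
kept_rxns = dedupe_reactions(rxns)
```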


In [17]:
## Pathways to port
myPathways = []
for P in model.groups:
    myPathways.append(port_pathway(P))

# retain only the reactions that are still valid in each pathway's reaction list
myPathways = retain_valid_Rxns_in_Pathways(myPathways,myRxns)

print(f'There are {len(myPathways)} pathways in the model')

There are 77 pathways in the model
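
Because some reactions were removed in the previous step, pathway membership has to be filtered against the surviving reaction ids. A minimal sketch of that filter is shown below. It is not the jms `retain_valid_Rxns_in_Pathways` implementation; `keep_valid_reactions` is a hypothetical helper, and the pathway id `group1` and reaction id `H2Ot` are placeholders. Whether the real function also drops pathways that end up empty is not addressed here.

```python
from types import SimpleNamespace


def keep_valid_reactions(pathways, reactions):
    """Restrict each pathway's list_of_reactions to ids present in `reactions`."""
    valid_ids = {rxn.id for rxn in reactions}
    for pathway in pathways:
        pathway.list_of_reactions = [
            rid for rid in pathway.list_of_reactions if rid in valid_ids
        ]
    return pathways


surviving = [SimpleNamespace(id="3HAD14M16"), SimpleNamespace(id="EAR12M14x")]
pathways = [
    SimpleNamespace(
        id="group1",  # hypothetical pathway id
        list_of_reactions=["3HAD14M16", "H2Ot", "EAR12M14x"],  # 'H2Ot' is hypothetical
    )
]
filtered = keep_valid_reactions(pathways, surviving)
```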


# Collected data; now output

In [18]:
## metabolicModel to export
MM = MetabolicModel()
MM.id = f'az_AGORA_{today}_{model.name}' #
MM.meta_data = meta_data

In [19]:
MM.list_of_pathways = [P.serialize() for P in myPathways]
MM.list_of_reactions = [R.serialize() for R in  myRxns]
MM.list_of_compounds = [C.serialize() for C in myCpds]

# Write pickle file
export_pickle(os.path.join(output_fdr,f'{MM.id}.pickle'), MM)
print(f'Export pickle file')

# Write json file
export_json(os.path.join(output_fdr,f'{MM.id}.json'), MM)
print(f'Export json file')

# Write dataframe 

export_table(os.path.join(output_fdr,f'{MM.id}_list_of_compounds.csv'),MM, 'list_of_compounds')
print(f'Export a table of the list of compounds')

Export pickle file
Export json file
Export a table of the list of compounds
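
For the compound table, a flat CSV with one row per compound is enough for inspection in a spreadsheet. The sketch below uses only the standard library and is not the jms `export_table` implementation; `compounds_to_csv` is a hypothetical helper and the column selection is an assumption.

```python
import csv
import io


def compounds_to_csv(compounds, fields):
    """Serialize a list of compound dicts to CSV text with the given columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        extrasaction="ignore",  # silently skip keys that are not requested
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(compounds)
    return buffer.getvalue()


rows = [{"id": "h2o", "name": "Water", "neutral_formula": "H2O", "db_ids": []}]
csv_text = compounds_to_csv(rows, ["id", "name", "neutral_formula"])
```

Nested fields such as `db_ids` do not map cleanly onto CSV columns, which is why they are left out of the column list in this example.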


## Summary

This ports reactions, pathways and compounds. Gene and enzyme information is not included yet and should be added in future work.

The exported pickles can easily be re-imported and uploaded to a database.

This notebook, the pickle file and the JSON file go to the GitHub repo (https://github.com/shuzhao-li/Azimuth).