# PyCoMo: fixing the validation bug #
In some models, the mass balance check fails with a `ValueError` stating that no elements are found in some metabolites. Here we fix this!

In [1]:
from pathlib import Path
import sys
import cobra
import os
import traceback

### Importing PyCoMo ###
As PyCoMo is currently only available as a local package, the direct path to the package directory needs to be used on import.
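If PyCoMo is not installed in the active environment, one way to make a local copy importable is to add its directory to `sys.path` before the import. The following is a minimal sketch: the helper name `add_package_dir_to_path` and the `../src` location are assumptions for illustration and need to be adapted to wherever the package actually lives.

```python
import sys
from pathlib import Path


def add_package_dir_to_path(package_dir):
    """Put a directory at the front of sys.path so packages inside it can be imported."""
    resolved = str(Path(package_dir).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Hypothetical location of the local package directory (an assumption, adjust as needed)
added = add_package_dir_to_path("../src")
```

After this call, `import pycomo` is resolved against the added directory first, provided the package is actually located there.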

In [2]:
import pycomo

2024-07-15 17:39:01,681 - pycomo.helper.multiprocess - INFO - Multiprocess Logger initialized.
2024-07-15 17:39:01,682 - pycomo.pycomo_models - INFO - Logger initialized.


### Functions for fixing the model ###
Include this in your notebook when dealing with models that lead to the validation bug.

Use `fixed_model = fix_missing_elements_validation_error(model)` to prepare the input models before merging (example is given further below).

In [3]:
import cobra
import re

def report_mass_balance(model):
    unbalanced_reactions = cobra.manipulation.validate.check_mass_balance(model)
    if not unbalanced_reactions:
        print("Model is balanced")
    else:
        print("Model is not balanced")
        
def remove_multiple_sbo_terms_from_reactions(model):
    for rxn in model.reactions:
        if isinstance(rxn.annotation.get("sbo"), list):
            rxn.annotation["sbo"] = rxn.annotation.get("sbo")[0]
    return model

def correct_model_metabolite_formulae(model):
    for metabolite in model.metabolites:
        if metabolite.elements is None:
            correct_metabolite_formula(metabolite)
    return model

def correct_metabolite_formula(metabolite):
    # Metabolite formulae may only contain letters (a-z, A-Z), digits (0-9) and decimal points (.).
    # Every other character is stripped; a missing formula (None) is left untouched.
    if metabolite.formula is not None:
        metabolite.formula = re.sub(r'[^a-zA-Z0-9.]+', '', metabolite.formula)
    
def find_metabolites_with_problematic_formulae(model):
    problematic = []
    for metabolite in model.metabolites:
        if metabolite.elements is None:
            problematic.append(metabolite)
    return problematic
    
def fix_missing_elements_validation_error(model):
    model = model.copy()
    try:
        try:
            report_mass_balance(model)
            print("No errors found")
        except TypeError:
            # This TypeError can come from multiple sbo terms being present in reaction annotations
            model = remove_multiple_sbo_terms_from_reactions(model)
            report_mass_balance(model)
            print("TypeError found: Multiple SBO terms are present in some reactions. Output model has been corrected.")
    except ValueError:
        model = correct_model_metabolite_formulae(model)
        report_mass_balance(model)
        print("ValueError found: Wrongly formatted formula given in some metabolites. " +
              "Output model has been corrected.\n" +
              "The applied correction should be considered a bandaid, " +
              "please look through the individual metabolites for fixing errors in the metabolite formulae!\n" +
              "Use find_metabolites_with_problematic_formulae to see which metabolites should be inspected."
             )
    return model
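The `TypeError` branch can be understood with plain Python. The sketch below assumes that the mass balance check tests each reaction's `sbo` annotation for membership in a set of exempt SBO terms; the set `exempt_terms` and its values are stand-ins, not taken from cobra. A set lookup hashes its argument, so a list of terms raises `TypeError`, while a single string works.

```python
# Hypothetical set of SBO terms exempt from mass balancing
# (stand-in values; the terms used by cobra may differ).
exempt_terms = {"SBO:0000627", "SBO:0000628"}


def is_exempt(sbo_annotation):
    # Membership in a set requires a hashable argument; a list is not hashable.
    return sbo_annotation in exempt_terms


single_term = is_exempt("SBO:0000627")  # a string is hashable, the lookup succeeds

try:
    is_exempt(["SBO:0000627", "SBO:0000176"])
    list_raised = False
except TypeError:
    list_raised = True

# Keeping only the first term, as remove_multiple_sbo_terms_from_reactions does,
# turns the annotation back into a hashable string.
first_term_only = is_exempt(["SBO:0000627", "SBO:0000176"][0])
```

Note that keeping the first term silently discards the others, so reactions with several SBO terms are worth reviewing by hand.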

## Testing with an example community ##

In this tutorial we are using a metabolic model from the AGORA2 collection.

In [4]:
test_model_dir = "../data/bugs/validation_no_elements" # Set the path to a folder, including only the test AGORA2 model
named_models = pycomo.load_named_models_from_dir(test_model_dir)

The models and file names were extracted and stored in named_models. Let's check the contents:

In [5]:
named_models

{'Klebsiella_pneumoniae_subsp_pneumoniae_KPNIH8': <Model M_Klebsiella_pneumoniae_subsp_pneumoniae_KPNIH8 at 0x24d175133b0>}

Now we create a community metabolic model whose two members are both copies of the input model.

In [6]:
single_org_models = []
for name, model in named_models.items():
    name_a = name + "_a"
    name_b = name + "_b"
    single_org_model = pycomo.SingleOrganismModel(model.copy(), name_a)
    single_org_models.append(single_org_model)
    single_org_model = pycomo.SingleOrganismModel(model.copy(), name_b)
    single_org_models.append(single_org_model)

In [7]:
community_name = "element_bug_community_model"
com_model_obj = pycomo.CommunityModel(single_org_models, community_name)

The cobra model of the community will be generated the first time it is needed. We can enforce this now by accessing it via `.model`.
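Generation on first access is a common lazy-initialisation pattern. The toy class below only illustrates the idea; `LazyCommunity`, `_build` and `build_count` are made-up names and this is not PyCoMo's implementation.

```python
class LazyCommunity:
    """Toy stand-in for an object that builds an expensive model on demand."""

    def __init__(self, member_names):
        self.member_names = member_names
        self._model = None      # nothing is built at construction time
        self.build_count = 0    # counts how often the expensive step ran

    def _build(self):
        self.build_count += 1
        return {"members": list(self.member_names)}

    @property
    def model(self):
        # Build once on first access, then reuse the cached object
        if self._model is None:
            self._model = self._build()
        return self._model


community = LazyCommunity(["a", "b"])
builds_before_access = community.build_count
first = community.model
second = community.model
```

This is why any problem with the member models only surfaces when `.model` is accessed, not when the community object is created.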

In [8]:
try:
    com_model_obj.model
except Exception as e:
    print(traceback.format_exc())

No community model generated yet. Generating now:
Note: no products in the objective function, adding biomass to it.


Could not identify an external compartment by name and choosing one with the most boundary reactions. That might be complete nonsense or change suddenly. Consider renaming your compartments using `Model.compartments` to fix this.
  return most[0]
Could not id

Note: no products in the objective function, adding biomass to it.


Could not identify an external compartment by name and choosing one with the most boundary reactions. That might be complete nonsense or change suddenly. Consider renaming your compartments using `Model.compartments` to fix this.
  return most[0]
Could not id

Generated community model.


  warn(f"invalid formula (has parenthesis) in '{self.formula}'")


This results in a warning that some metabolite formulae are wrongly formatted.

## Bug fixing ##
Here is a quick test case with metabolites that have either no formula or a wrongly formatted one.

In [9]:
d1 = cobra.Metabolite("dummy1")
d2 = cobra.Metabolite("dummy2")
d3 = cobra.Metabolite("dummy3", formula="C(O)2")
dr = cobra.Reaction("dummy_r")
dr.add_metabolites({d1:1, d2:-1})

In [10]:
d1.elements

{}

For a metabolite without a formula, `elements` is an empty dictionary, not `None`!

In [11]:
dr.check_mass_balance()

{}

Checking mass balance also works for metabolites without formulae, so a missing formula is not the culprit!
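The difference between an empty dictionary and `None` can be illustrated with a simplified balance check. This is a sketch of the idea under the assumption that an unparsable formula is represented by `None`; `element_imbalance` is a hypothetical helper and not cobra's actual code.

```python
from collections import defaultdict


def element_imbalance(stoichiometry):
    """Sum element counts over (elements, coefficient) pairs.

    elements is a dict of element counts, {} for a metabolite without a
    formula, or None for a formula that could not be parsed.
    """
    totals = defaultdict(int)
    for elements, coefficient in stoichiometry:
        if elements is None:
            raise ValueError("No elements found in metabolite")
        for element, amount in elements.items():
            totals[element] += coefficient * amount
    # Only elements that do not cancel out are reported
    return {element: total for element, total in totals.items() if total != 0}


# Two metabolites without formulae: nothing to sum, so nothing is unbalanced
no_formula = element_imbalance([({}, 1), ({}, -1)])

# A consumed and a produced CO2 cancel out
balanced = element_imbalance([({"C": 1, "O": 2}, -1), ({"C": 1, "O": 2}, 1)])

# An unparsable formula stops the check with a ValueError
try:
    element_imbalance([(None, 1), ({}, -1)])
    unparsable_raised = False
except ValueError:
    unparsable_raised = True
```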

In [12]:
d3.elements

  warn(f"invalid formula (has parenthesis) in '{self.formula}'")


Here, a warning should be displayed, stating that invalid characters (in this case, parentheses) are used. Only standard English alphabetic characters (a-z, A-Z), digits (0-9) and decimal points (.) are allowed in formulae.
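To see which characters of a formula fall outside the allowed set, a small check like the following can help. It is a sketch; `invalid_characters` is a hypothetical helper that uses the same character class as `correct_metabolite_formula`.

```python
import re

# Same character class as in correct_metabolite_formula
ALLOWED_CHARACTER = re.compile(r"[a-zA-Z0-9.]")


def invalid_characters(formula):
    """Return the sorted, distinct characters that are not allowed in a formula."""
    return sorted({ch for ch in formula if not ALLOWED_CHARACTER.fullmatch(ch)})


with_parentheses = invalid_characters("C(O)2")   # ['(', ')']
clean_formula = invalid_characters("C6H12O6")    # []
```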

### Correcting the model ###
Let's fix the metabolite with the correction function.

In [13]:
correct_metabolite_formula(d3)  

In [14]:
d3.elements

{'C': 1, 'O': 2}

Now the formula is read without an error, but we cannot be sure that the formula is correct. If a model has wrongly formatted formulae, they should be manually investigated.
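Why manual inspection matters can be shown by comparing two readings of a parenthesised formula: stripping the parentheses, as the correction function does, versus expanding the group with its multiplier. The sketch below uses hypothetical helpers (`count_elements`, `expand_groups`, `strip_invalid`) and assumes integer element counts. For `C(O)2` both readings agree, but for a made-up formula such as `C(OH)2` they do not.

```python
import re
from collections import Counter

ELEMENT_RE = re.compile(r"([A-Z][a-z]?)(\d*)")
GROUP_RE = re.compile(r"\(([^()]*)\)(\d*)")


def count_elements(flat_formula):
    """Count elements in a formula without parentheses (integer counts only)."""
    counts = Counter()
    for element, number in ELEMENT_RE.findall(flat_formula):
        counts[element] += int(number) if number else 1
    return dict(counts)


def expand_groups(formula):
    """Multiply out parenthesised groups, innermost first: 'C(OH)2' -> 'CO2H2'."""
    def _expand(match):
        multiplier = int(match.group(2)) if match.group(2) else 1
        inner = count_elements(match.group(1))
        return "".join(f"{el}{n * multiplier}" for el, n in inner.items())

    while "(" in formula or ")" in formula:
        expanded = GROUP_RE.sub(_expand, formula)
        if expanded == formula:
            raise ValueError(f"unbalanced parentheses in '{formula}'")
        formula = expanded
    return formula


def strip_invalid(formula):
    """The bandaid used in this notebook: drop every disallowed character."""
    return re.sub(r"[^a-zA-Z0-9.]+", "", formula)


# Both readings agree for the dummy metabolite
same_stripped = count_elements(strip_invalid("C(O)2"))
same_expanded = count_elements(expand_groups("C(O)2"))

# They disagree as soon as the group holds more than one element
diff_stripped = count_elements(strip_invalid("C(OH)2"))
diff_expanded = count_elements(expand_groups("C(OH)2"))
```

Whether expanding the group is the right interpretation depends on how the formula was written in the source database, so this comparison is a prompt for inspection rather than an automatic fix.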

Next, we try to apply the correction to the test model!

In [15]:
single_org_models = []
for name, model in named_models.items():
    model_fixed = fix_missing_elements_validation_error(model)
    name_a = name + "_a"
    name_b = name + "_b"
    single_org_model = pycomo.SingleOrganismModel(model_fixed.copy(), name_a)
    single_org_models.append(single_org_model)
    single_org_model = pycomo.SingleOrganismModel(model_fixed.copy(), name_b)
    single_org_models.append(single_org_model)

  warn(f"invalid formula (has parenthesis) in '{self.formula}'")


Model is not balanced
ValueError found: Wrongly formatted formula given in some metabolites. Output model has been corrected.
The applied correction should be considered a bandaid, please look through the individual metabolites for fixing errors in the metabolite formulae!
Use find_metabolites_with_problematic_formulae to see which metabolites should be inspected.


It appears that the model has metabolites with wrongly formatted metabolite formulae. Let's check which metabolites these are:

In [16]:
find_metabolites_with_problematic_formulae(model)

[<Metabolite M02626[c] at 0x24d18279310>,
 <Metabolite M03083[c] at 0x24d181255e0>]

As with `dummy3`, this check should display a warning for each of these metabolites, stating that invalid characters (in this case, parentheses) are used in their formulae.
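Before relying on the corrected model, it can be useful to list what the correction would change. The sketch below works on any objects with `id` and `formula` attributes; `preview_formula_changes` is a hypothetical helper, and the `SimpleNamespace` stand-ins carry made-up formulae rather than the ones in the test model.

```python
import re
from types import SimpleNamespace


def preview_formula_changes(metabolites):
    """Map metabolite id -> (original formula, stripped formula) for changed formulae."""
    changes = {}
    for met in metabolites:
        if met.formula is None:
            continue  # nothing to strip
        stripped = re.sub(r"[^a-zA-Z0-9.]+", "", met.formula)
        if stripped != met.formula:
            changes[met.id] = (met.formula, stripped)
    return changes


# Stand-ins for metabolites, with made-up ids and formulae
example_metabolites = [
    SimpleNamespace(id="met_a", formula="C2H4(OH)2"),
    SimpleNamespace(id="met_b", formula="H2O"),
    SimpleNamespace(id="met_c", formula=None),
]
changes = preview_formula_changes(example_metabolites)
```

With a cobra model, the same function could be called as `preview_formula_changes(model.metabolites)` on the unfixed model, since its metabolites expose `id` and `formula`.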

We can now check if our fixed model still has wrongly formatted metabolite formulae:

In [17]:
find_metabolites_with_problematic_formulae(model_fixed)

[]

None are found!

Now we can repeat the community construction with the fixed models.

In [18]:
community_name = "element_bug_community_model"
com_model_obj = pycomo.CommunityModel(single_org_models, community_name)

In [19]:
com_model_obj.model

No community model generated yet. Generating now:
Note: no products in the objective function, adding biomass to it.


Could not identify an external compartment by name and choosing one with the most boundary reactions. That might be complete nonsense or change suddenly. Consider renaming your compartments using `Model.compartments` to fix this.
  return most[0]
Could not id

Note: no products in the objective function, adding biomass to it.


Could not identify an external compartment by name and choosing one with the most boundary reactions. That might be complete nonsense or change suddenly. Consider renaming your compartments using `Model.compartments` to fix this.
  return most[0]
Could not id

Generated community model.


| Attribute | Value |
|---|---|
| Name | element_bug_community_model |
| Memory address | 24d0497ad20 |
| Number of metabolites | 12501 |
| Number of reactions | 12952 |
| Number of genes | 3232 |
| Number of groups | 122 |
| Objective expression | 1.0*community_biomass - 1.0*community_biomass_reverse_44dc1 |
| Compartments | Klebsiella_pneumoniae_subsp_pneumoniae_KPNIH8_a_c, Klebsiella_pneumoniae_subsp_pneumoniae_KPNIH8_a_e, Klebsiella_pneumoniae_subsp_pneumoniae_KPNIH8_a_p, Klebsiella_pneumoniae_subsp_pneumoniae_KPNIH8_a_medium, medium, fraction_reaction, Klebsiella_pneumoniae_subsp_pneumoniae_KPNIH8_b_c, Klebsiella_pneumoniae_subsp_pneumoniae_KPNIH8_b_e, Klebsiella_pneumoniae_subsp_pneumoniae_KPNIH8_b_p, Klebsiella_pneumoniae_subsp_pneumoniae_KPNIH8_b_medium |


In [20]:
com_model_obj.summary()

| Metabolite | Reaction | Flux | C-Number | C-Flux |
|---|---|---|---|---|
| _26dap_M_medium | EX__26dap_M_medium | 0.02501 | 7 | 0.09% |
| acgam_medium | EX_acgam_medium | 0.1501 | 8 | 0.64% |
| adocbl_medium | EX_adocbl_medium | 0.1549 | 72 | 5.90% |
| akg_medium | EX_akg_medium | 0.04214 | 5 | 0.11% |
| anzp_medium | EX_anzp_medium | 8.223 | 15 | 65.29% |
| arg_L_medium | EX_arg_L_medium | 0.2467 | 6 | 0.78% |
| asn_L_medium | EX_asn_L_medium | 1.374 | 4 | 2.91% |
| ca2_medium | EX_ca2_medium | 0.003096 | 0 | 0.00% |
| cgly_medium | EX_cgly_medium | 0.003096 | 5 | 0.01% |
| cl_medium | EX_cl_medium | 0.003097 | 0 | 0.00% |

| Metabolite | Reaction | Flux | C-Number | C-Flux |
|---|---|---|---|---|
| ac_medium | EX_ac_medium | -0.1979 | 2 | 0.27% |
| cbl1_medium | EX_cbl1_medium | -0.1549 | 62 | 6.54% |
| co2_medium | EX_co2_medium | -1.865 | 1 | 1.27% |
| for_medium | EX_for_medium | -1.277 | 1 | 0.87% |
| gly_medium | EX_gly_medium | -0.1351 | 2 | 0.18% |
| h2o_medium | EX_h2o_medium | -9.604 | 0 | 0.00% |
| ncam_medium | EX_ncam_medium | -1.668 | 6 | 6.82% |
| nh4_medium | EX_nh4_medium | -2.162 | 0 | 0.00% |
| nzp_medium | EX_nzp_medium | -8.223 | 15 | 84.04% |
| ppi_medium | EX_ppi_medium | -0.4929 | 0 | 0.00% |


Now everything worked!