# RBC-GEM 0.2.0 --> 0.3.0
The purpose of this notebook is to remove redundancies and imbalances, producing a reconstruction that highlights areas for expansion.

1. Remove reactions that are "duplicates" of other reactions except for directionality.
2. Remove pseudoreactions that enable flux consistency so that dead ends can be identified. Leave exchanges.
3. Remove the distinction between transcripts, ensuring only unique genes remain in the model.
4. Change all gene identifiers to HGNC symbols.
5. Update chemical formulas and charges for some metabolites.
6. Standardize metabolite formulas.
7. Correct reaction stoichiometry.
8. Pool lipid reactions.
9. Because the model's stoichiometry now differs from the iAB-RBC-283 model, officially change the model ID to RBC-GEM.

Bordbar, A., Jamshidi, N. & Palsson, B.O. iAB-RBC-283: A proteomically derived knowledge-base of erythrocyte metabolism that can be used to simulate its physiological and patho-physiological states. BMC Syst Biol 5, 110 (2011). https://doi.org/10.1186/1752-0509-5-110

## Setup
### Import packages

In [25]:
from cobra import Reaction
from cobra.manipulation import remove_genes, rename_genes
from rbc_gem_utils import (COBRA_CONFIGURATION, ROOT_PATH, build_string,
                           get_annotation_df, read_rbc_model, show_versions,
                           split_string, write_rbc_model)
from rbc_gem_utils.annotation import set_sbo_default_annotations
from rbc_gem_utils.qc import standardardize_metabolite_formulas

# Display package versions from the last successful run of this notebook
show_versions()


Package Information
-------------------
rbc-gem-utils 0.0.1

Dependency Information
----------------------
cobra      0.29.0
depinfo     2.2.0
matplotlib  3.8.2
memote     0.16.1
notebook    7.0.6
requests   2.31.0
scipy      1.11.4
seaborn    0.13.0

Build Tools Information
-----------------------
pip        23.3.1
setuptools 68.2.2
wheel      0.41.2

Platform Information
--------------------
Darwin  22.6.0-x86_64
CPython        3.12.0


### Define configuration
#### COBRA Configuration

In [26]:
COBRA_CONFIGURATION

Attribute,Description,Value
solver,Mathematical optimization solver,gurobi
tolerance,"General solver tolerance (feasibility, integrality, etc.)",1e-07
lower_bound,Default reaction lower bound,-1000.0
upper_bound,Default reaction upper bound,1000.0
processes,Number of parallel processes,15
cache_directory,Path for the model cache,/Users/zhaiman/Library/Caches/cobrapy
max_cache_size,Maximum cache size in bytes,104857600
cache_expiration,Model cache expiration time in seconds (if any),


## Load RBC-GEM model
### Version: 0.2.0

In [27]:
model = read_rbc_model(filetype="xml")
model

0,1
Name,iAB_RBC_283
Memory address,14ea90680
Number of metabolites,342
Number of reactions,469
Number of genes,349
Number of groups,33
Objective expression,1.0*NaKt - 1.0*NaKt_reverse_db47e
Compartments,"cytosol, extracellular space"


### Reactions
#### Remove "duplicated" reactions

In [28]:
# Make reactions reversible (PMID:1618773)
model.reactions.CRNAT_16_0.lower_bound = -1000
model.reactions.CRNAT_18_9Z.lower_bound = -1000
model.reactions.CRNAT_18_9Z12Z.lower_bound = -1000

model.remove_reactions(
    [
        model.reactions.CRNAT_16_0rbc,
        model.reactions.CRNAT_18_9Zrbc,
        model.reactions.CRNAT_18_9Z12Zrbc,
    ]
)
model

  warn("need to pass in a list")


0,1
Name,iAB_RBC_283
Memory address,14ea90680
Number of metabolites,342
Number of reactions,466
Number of genes,349
Number of groups,33
Objective expression,1.0*NaKt - 1.0*NaKt_reverse_db47e
Compartments,"cytosol, extracellular space"
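
The idea of reactions that are "duplicated" apart from directionality can be illustrated without the model. The following is a minimal sketch in plain Python, assuming toy stoichiometries and hypothetical reaction IDs (`R1`, `R1rbc`, `R2`); it is not how the reactions removed above were identified, only one way such pairs could be detected.

```python
from collections import defaultdict

# Toy stoichiometries (metabolite ID -> coefficient). These IDs are
# hypothetical stand-ins and are not taken from the model.
reactions = {
    "R1": {"a_c": -1, "b_c": 1},
    "R1rbc": {"a_c": 1, "b_c": -1},  # same conversion, written in reverse
    "R2": {"b_c": -1, "c_c": 1},
}


def direction_free_key(stoichiometry):
    """Return a key that is identical for a reaction and its reverse."""
    forward = tuple(sorted(stoichiometry.items()))
    reverse = tuple(sorted((met, -coeff) for met, coeff in stoichiometry.items()))
    # Picking the smaller orientation gives one canonical representation.
    return min(forward, reverse)


groups = defaultdict(list)
for rid, stoich in reactions.items():
    groups[direction_free_key(stoich)].append(rid)

# Keep only groups in which more than one reaction shares the same key.
duplicates = [sorted(rids) for rids in groups.values() if len(rids) > 1]
print(duplicates)
```

Once such a pair is found, keeping one member and widening its bounds (as done above with `lower_bound = -1000`) preserves both directions in a single reaction.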


#### Remove pseudoreactions, leaving exchanges

In [29]:
model.remove_reactions(model.demands + model.sinks)
model.remove_reactions(["NADHload"])
model

  warn("need to pass in a list")


0,1
Name,iAB_RBC_283
Memory address,14ea90680
Number of metabolites,342
Number of reactions,442
Number of genes,349
Number of groups,33
Objective expression,1.0*NaKt - 1.0*NaKt_reverse_db47e
Compartments,"cytosol, extracellular space"


#### Pool lipid reactions

In [30]:
pooled_reactions = {
    "CDIPT": "cdpdag_hs_c + inost_c <=> cmp_c + h_c + pail_hs_c",
    "CDS": "ctp_c + h_c + pa_hs_c --> cdpdag_hs_c + ppi_c",
    "CEPTC": "cdpchol_c + dag_hs_c --> cmp_c + h_c + pc_hs_c",
    "CEPTE": "cdpea_c + dag_hs_c --> cmp_c + h_c + pe_hs_c",
    "DAGK": "atp_c + dag_hs_c --> adp_c + h_c + pa_hs_c",
    "GPAT": "FAcoa_hs_c + glyc3p_c --> coa_c + lpa_hs_c",
    "LPAAT": "FAcoa_hs_c + lpa_hs_c --> coa_c + pa_hs_c",
    "LPCLPLB": "h2o_c + lpc_hs_c --> FA_hs_c + g3pc_c + h_c",
    "PCPLA2": "h2o_c + pc_hs_c --> FA_hs_c + h_c + lpc_hs_c",
    "PI45P5P": "h2o_c + pail45p_hs_c --> pail4p_hs_c + pi_c",
    "PI45PLC": "h2o_c + pail45p_hs_c --> dag_hs_c + h_c + mi145p_c",
    "PI4PLC": "h2o_c + pail4p_hs_c --> dag_hs_c + h_c + mi14p_c",
    "PIPLC": "h2o_c + pail_hs_c --> dag_hs_c + h_c + mi1p__D_c",
    "PI4K": "atp_c + pail_hs_c --> adp_c + h_c + pail4p_hs_c",
    "PI4P5K": "atp_c + pail4p_hs_c --> adp_c + h_c + pail45p_hs_c",
    "PI4PP": "h2o_c + pail4p_hs_c --> pail_hs_c + pi_c",
    "PAPP": "h2o_c + pa_hs_c --> dag_hs_c + pi_c",
}
new_mets = set()
for rid, reaction in pooled_reactions.items():
    # Collect the individual lipid reactions that share this ID prefix
    reactions = model.reactions.query(lambda x: x.id.startswith(rid))
    # Carry over properties from the first matching reaction
    subsystem = reactions[0].subsystem
    gpr = reactions[0].gene_reaction_rule
    bounds = reactions[0].bounds

    model.add_reactions([Reaction(rid)])
    new = model.reactions.get_by_id(rid)
    new.build_reaction_from_string(reaction)

    new.subsystem = subsystem
    new.gene_reaction_rule = gpr
    new.bounds = bounds
    for met in new.metabolites:
        # Track newly created pooled metabolites that still lack a formula
        if met.formula is None:
            new_mets.add(met)
        met.compartment = met.id[-1:]
    # Remove the original reactions along with any orphaned metabolites
    model.remove_reactions(reactions, remove_orphans=True)


# Any additional annotations can be obtained after linking to MetAtlas

unknown metabolite 'cdpdag_hs_c' created
unknown metabolite 'pail_hs_c' created
unknown metabolite 'pa_hs_c' created
unknown metabolite 'dag_hs_c' created
unknown metabolite 'pc_hs_c' created
unknown metabolite 'pe_hs_c' created
unknown metabolite 'FAcoa_hs_c' created
unknown metabolite 'lpa_hs_c' created
unknown metabolite 'lpc_hs_c' created
unknown metabolite 'FA_hs_c' created
unknown metabolite 'pail45p_hs_c' created
unknown metabolite 'pail4p_hs_c' created


### Genes
#### Remove isoforms from GPRs

In [31]:
# Gather UniProt and HGNC symbol annotations for every gene
df_isoforms_to_remove = get_annotation_df(model.genes, ["uniprot", "hgnc.symbol"])
df_isoforms_to_remove = df_isoforms_to_remove.sort_values(by="id")
# Keep the first gene per (uniprot, hgnc.symbol) pair; flag the rest as isoforms
df_isoforms_to_remove = df_isoforms_to_remove[
    df_isoforms_to_remove.loc[:, ["uniprot", "hgnc.symbol"]].duplicated(keep="first")
]
remove_genes(
    model, gene_list=list(df_isoforms_to_remove["id"].values), remove_reactions=False
)
df_isoforms_to_remove

Unnamed: 0,id,uniprot,hgnc.symbol
27,10327_AT2,P14550,AKR1A1
274,10423_AT2,O14735,CDIPT
281,1119_AT2,P35790,CHKA
283,1120_AT2,Q9Y259,CHKB
228,112_AT2,O43306,ADCY6
...,...,...,...
297,8525_AT2,Q13574,DGKZ
291,8525_AT3,Q13574,DGKZ
301,8527_AT2,Q16760,DGKD
185,8611_AT2,O14494,PLPP1


#### Rename genes to HGNC

In [32]:
gene_mapping = (
    get_annotation_df(getattr(model, "genes"), ["hgnc.symbol"])
    .set_index("id")["hgnc.symbol"]
    .to_dict()
)
rename_genes(model, gene_mapping)
for gene in model.genes:
    gene.name = ""

In [33]:
import pandas as pd

In [34]:
id_mapping_df = pd.DataFrame.from_dict(gene_mapping, orient="index")
id_mapping_df = id_mapping_df.reset_index(drop=False)
id_mapping_df.columns = ["geneRetired", "genes"]
id_mapping_df = id_mapping_df.loc[:, id_mapping_df.columns[::-1]]
id_mapping_df["genes"] = id_mapping_df["genes"].str.split(" and ")
id_mapping_df = id_mapping_df.explode("genes")


previous_id_mapping_df = pd.read_csv(
    f"{ROOT_PATH}/data/deprecatedIdentifiers/genes_deprecatedIdentifiers.tsv",
    sep="\t",
    index_col=0,
)

for idx, row in id_mapping_df.iterrows():
    new_id, retiring = row[["genes", "geneRetired"]]
    previously_retired = previous_id_mapping_df[
        previous_id_mapping_df["genes"] == retiring
    ]
    retired_set_of_ids = {retiring}
    if not previously_retired.empty:
        # Get all previously retired IDs
        try:
            retired_set_of_ids.update(
                previously_retired["geneRetired"].apply(split_string).item()
            )
        except ValueError:
            retired_set_of_ids.update(
                [
                    y
                    for x in previously_retired["geneRetired"].values
                    for y in split_string(x)
                ]
            )
        # Pulling the ID out of retirement
        if new_id in retired_set_of_ids:
            retired_set_of_ids.remove(new_id)
        retired_set_of_ids.add(retiring)

    id_mapping_df.loc[idx, "geneRetired"] = build_string(retired_set_of_ids, sep=";")

id_mapping_df.to_csv(
    f"{ROOT_PATH}/data/deprecatedIdentifiers/genes_deprecatedIdentifiers.tsv",
    sep="\t",
)
id_mapping_df

Unnamed: 0,genes,geneRetired
0,NMRK1,54981_AT1
1,RPE,6120_AT1
2,RPIA,22934_AT1
3,COMTD1,118881_AT1
4,SORD,6652_AT1
...,...,...
278,ATP1B1,481_AT1
279,ATP1B3,483_AT1
280,ATP1B2,482_AT1
281,ATP1B4,23439_AT1
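
The bookkeeping for retired identifiers reduces to a small set operation: combine the newly retired ID with any IDs retired earlier, and drop an ID that has come back into use. The sketch below is a simplified stand-in for the loop above. `merge_retired_ids` and the `OLD_ID_*` values are hypothetical, and sorting the result is a choice made here for deterministic output rather than something the notebook's `build_string` helper is known to do.

```python
def merge_retired_ids(new_id, retiring_id, previously_retired, sep=";"):
    """Combine a newly retired ID with IDs retired in earlier versions."""
    retired = {retiring_id}
    if previously_retired:
        retired.update(previously_retired.split(sep))
    # An ID that is pulled out of retirement is no longer listed as retired.
    retired.discard(new_id)
    return sep.join(sorted(retired))


# No earlier history: only the ID being retired now is recorded.
print(merge_retired_ids("NMRK1", "54981_AT1", ""))
# Earlier history is carried forward (hypothetical old IDs).
print(merge_retired_ids("NMRK1", "54981_AT1", "OLD_ID_A;OLD_ID_B"))
# The new ID was itself retired in the past, so it is removed from the list.
print(merge_retired_ids("NMRK1", "54981_AT1", "NMRK1;OLD_ID_A"))
```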


### Metabolites
#### Update existing chemical formulas and charges

In [35]:
updated_formula_charges = {
    # To update
    "ascb__L": ("C6H7O6", -1),
    "dhdascb__L": ("C6H5O6", -1),
    "bilglcur": ("C39H42N4O12", -2),
    "pe_hs": ("C7H12NO8PR2", 0),
    "pc_hs": ("C10H18NO8PR2", 0),
    "lpc_hs": ("C9H19NO7PR", 0),
    "cdpdag_hs": ("C14H17N3O15P2R2", -2),
    "FA_hs": ("CO2R", -1),
    "dag_hs": ("C5H6O5R2", 0),
    "lpa_hs": ("C4H6O7PR", -2),
    "pa_hs": ("C5H5O8PR2", -2),
    "pail_hs": ("C11H16O13PR2", -1),
    "pail4p_hs": ("C11H15O16P2R2", -3),
    "pail45p_hs": ("C11H14O19P3R2", -5),
    "FAcoa_hs": ("C22H31N7O17P3RS", -4),
}

for met_id, (new_formula, new_charge) in updated_formula_charges.items():
    for metabolite in model.metabolites.query(
        lambda x: x.id.replace(f"_{x.compartment}", "") == met_id
    ):
        print(metabolite)
        metabolite.formula = new_formula
        metabolite.charge = new_charge

# Additional annotations can be obtained after linking to MetAtlas
annotations = {
    "FA_hs_c": "MAM10005c",
    "pe_hs_c": "MAM02685c",
    "lpc_hs_c": "MAM00656c",
    "pc_hs_c": "MAM02684c",
    "dag_hs_c": "MAM00240c",
    "pail_hs_c": "MAM02750c",
    # NOTE: same identifier as `pe_hs_c`; verify against MetAtlas
    "pail4p_hs_c": "MAM02685c",
    "pail45p_hs_c": "MAM02736c",
    "lpa_hs_c": "MAM03419c",
    "FAcoa_hs_c": "MAM10007c",
    "cdpdag_hs_c": "MAM01427c",
}

for met_id, metatlas in annotations.items():
    metabolite = model.metabolites.get_by_id(met_id)
    metabolite.annotation["metatlas"] = metatlas

ascb__L_c
ascb__L_e
dhdascb__L_c
dhdascb__L_e
bilglcur_c
bilglcur_e
pe_hs_c
pc_hs_c
lpc_hs_c
cdpdag_hs_c
FA_hs_c
dag_hs_c
lpa_hs_c
pa_hs_c
pail_hs_c
pail4p_hs_c
pail45p_hs_c
FAcoa_hs_c


#### Standardize metabolite formulas

In [36]:
metabolite_formulas = dict(
    zip(model.metabolites.list_attr("id"), model.metabolites.list_attr("formula"))
)
standardized = standardardize_metabolite_formulas(metabolite_formulas)

for mid, updated_formula in standardized.items():
    if metabolite_formulas[mid] != updated_formula:
        print(f"Standardizing formula for `{mid}`")
        model_metabolite = model.metabolites.get_by_id(mid)
        model_metabolite.formula = updated_formula

#### Correct stoichiometry of reactions

In [37]:
reaction = model.reactions.get_by_id("BILIRBU")
# Increase the `h_c` coefficient by 2 (adds protons on the product side)
reaction.add_metabolites({"h_c": 2})

In [38]:
for reaction in model.reactions:
    if reaction.boundary:
        continue
    # `check_mass_balance` returns an empty dict for balanced reactions
    imbalance = reaction.check_mass_balance()
    if imbalance:
        print(reaction.id, imbalance)
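
To make the balance check concrete, the sketch below reimplements the idea in plain Python: sum each element and the charge over the stoichiometry, then report anything that does not cancel. It is a simplified illustration, not COBRApy's implementation. The formula parser handles only flat formulas without parentheses, and the bicarbonate example is a stand-in chosen for its well-known formulas, not a reaction corrected in this notebook.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count elements in a flat formula such as 'C6H7O6' (no parentheses)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def mass_imbalance(stoichiometry, formulas, charges):
    """Return net elements and charge; an empty dict means balanced."""
    net = Counter()
    for met, coeff in stoichiometry.items():
        for element, count in parse_formula(formulas[met]).items():
            net[element] += coeff * count
        net["charge"] += coeff * charges[met]
    return {key: value for key, value in net.items() if value != 0}


formulas = {"co2_c": "CO2", "h2o_c": "H2O", "hco3_c": "CHO3", "h_c": "H"}
charges = {"co2_c": 0, "h2o_c": 0, "hco3_c": -1, "h_c": 1}

# CO2 + H2O --> HCO3-, written without the proton: unbalanced.
missing_proton = {"co2_c": -1, "h2o_c": -1, "hco3_c": 1}
print(mass_imbalance(missing_proton, formulas, charges))

# Adding the proton as a product balances both mass and charge.
balanced = {**missing_proton, "h_c": 1}
print(mass_imbalance(balanced, formulas, charges))
```

The same pattern applies to the correction above: an imbalance in `H` and `charge` of equal size points to missing protons, and adding them as metabolites resolves both at once.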

### Add/reset SBO annotations

In [39]:
model = set_sbo_default_annotations(
    model, ["reactions", "metabolites", "genes"], verbose=True
)
model

SBO term set for CDIPT
SBO term set for CDS
SBO term set for CEPTC
SBO term set for CEPTE
SBO term set for DAGK
SBO term set for GPAT
SBO term set for LPAAT
SBO term set for LPCLPLB
SBO term set for PCPLA2
SBO term set for PI45P5P
SBO term set for PI45PLC
SBO term set for PI4PLC
SBO term set for PIPLC
SBO term set for PI4K
SBO term set for PI4P5K
SBO term set for PI4PP
SBO term set for PAPP
SBO term set for cdpdag_hs_c
SBO term set for pail_hs_c
SBO term set for pa_hs_c
SBO term set for dag_hs_c
SBO term set for pc_hs_c
SBO term set for pe_hs_c
SBO term set for FAcoa_hs_c
SBO term set for lpa_hs_c
SBO term set for lpc_hs_c
SBO term set for FA_hs_c
SBO term set for pail45p_hs_c
SBO term set for pail4p_hs_c


0,1
Name,iAB_RBC_283
Memory address,14ea90680
Number of metabolites,292
Number of reactions,348
Number of genes,283
Number of groups,33
Objective expression,1.0*NaKt - 1.0*NaKt_reverse_db47e
Compartments,"cytosol, extracellular space"


## Export updated model
### Version: 0.3.0

In [40]:
model.id = "RBC_GEM"
write_rbc_model(model, filetype="all")
model

0,1
Name,RBC_GEM
Memory address,14ea90680
Number of metabolites,292
Number of reactions,348
Number of genes,283
Number of groups,33
Objective expression,1.0*NaKt - 1.0*NaKt_reverse_db47e
Compartments,"cytosol, extracellular space"
