# Extract annotations from Metabolic Atlas via Human-GEM

The purpose of this notebook is to extract and format Human-GEM data for subsequent model annotation.

## Notebook Requirements:
*  Model genes, metabolites, and reactions **must** have the following annotations stored in `object.annotation`. Multiple values are expected to be separated by semicolons. Accepted keys currently include:
    * For genes:
        * `"uniprot"`
    * For metabolites:
        * `"metatlas"`
    * For reactions:
        * `"metatlas"`

*  Note: An internet connection is required to retrieve the latest files from the [MetabolicAtlas Human-GEM repository](https://github.com/SysBioChalmers/Human-GEM).
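
As a minimal sketch of the expected annotation format, the snippet below uses plain dictionaries in place of `object.annotation`. The helper `split_annotation` and all identifier values are hypothetical and only illustrate the semicolon-separated convention described above.

```python
# Hypothetical stand-ins for `object.annotation` on a gene and a metabolite.
gene_annotation = {"uniprot": "P00001;P00002"}
metabolite_annotation = {"metatlas": "MAM00001c"}


def split_annotation(annotation, key, sep=";"):
    """Return the individual values stored under `key`, or an empty list."""
    value = annotation.get(key, "")
    # Drop surrounding whitespace and skip empty fragments such as "a;;b".
    return [item.strip() for item in value.split(sep) if item.strip()]


uniprot_ids = split_annotation(gene_annotation, "uniprot")
metatlas_ids = split_annotation(metabolite_annotation, "metatlas")
missing_ids = split_annotation(gene_annotation, "metatlas")
```

A missing key yields an empty list, so objects without the required annotation can be detected before any mapping is attempted.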

### Citations

Robinson JL, Kocabaş P, Wang H, Cholley PE, Cook D, Nilsson A, Anton M, Ferreira R, Domenzain I, Billa V, Limeta A, Hedin A, Gustafsson J, Kerkhoven EJ, Svensson LT, Palsson BO, Mardinoglu A, Hansson L, Uhlén M, Nielsen J. An atlas of human metabolism. Sci Signal. 2020 Mar 24;13(624):eaaz1482. doi: 10.1126/scisignal.aaz1482. PMID: 32209698; PMCID: PMC7331181.

## Setup
### Import packages

In [None]:
from warnings import warn

import matplotlib.pyplot as plt
import pandas as pd
from rbc_gem_utils import (
    GEM_NAME,
    build_string,
    check_database_release_online,
    compare_tables,
    explode_column,
    get_annotation_df,
    get_dirpath,
    read_cobra_model,
    show_versions,
    visualize_comparison,
)
from rbc_gem_utils.database.metatlas import (
    HUMANGEM_DB_TAG,
    HUMANGEM_MIRIAM,
    HUMANGEM_RELEASE_EXPECTED,
    download_database_HumanGEM,
    get_annotations_HumanGEM,
)

# Display the package versions from the last time the notebook ran successfully
show_versions()

### Set notebook options

In [None]:
db_tag = HUMANGEM_DB_TAG
expected_release = HUMANGEM_RELEASE_EXPECTED
download_database = True

compare_figsize = (5, 5)
compare = True
display_nunique = True
overwrite = True

rename_miriam = True
mapping_keys = {
    "genes": "uniprot",
    "metabolites": "metatlas",
    "reactions": "metatlas",
}

## Check Human-GEM version
* If the current release does not match the expected release, the database has been updated since the last time this code was used.
    * If the notebook works without needing any significant modifications, the only update needed is to the expected release (`HUMANGEM_RELEASE_EXPECTED`) in the [metatlas.py](../../src/rbc_gem_utils/database/metatlas.py) source code file.
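
The fallback behaviour can be sketched in isolation. The helper `select_temp_dir` and the release strings below are hypothetical; the notebook itself relies on `check_database_release_online`, whose internals are not reproduced here.

```python
def select_temp_dir(expected_release, current_release):
    """Return "interim" when the releases differ, otherwise None.

    Mirrors the `use_temp="interim" if use_interim else None` pattern,
    so that output from an unexpected release never overwrites
    files produced from the expected release.
    """
    return None if expected_release == current_release else "interim"


# Placeholder release strings, not actual Human-GEM versions.
same_release = select_temp_dir("1.0.0", "1.0.0")
new_release = select_temp_dir("1.0.0", "1.1.0")
```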

In [None]:
use_interim = not check_database_release_online(db_tag, verbose=True)
# Use interim directory paths when the online release differs from the expected release
if use_interim:
    warn(
        "Online release of database has been updated since the last time notebook was used."
    )
database_dirpath = get_dirpath(
    "database", db_tag, use_temp="interim" if use_interim else None
)
annotation_dirpath = get_dirpath(
    "annotation", use_temp="interim" if use_interim else None
)
# Ensure directories exist
database_dirpath.mkdir(exist_ok=True, parents=True)
annotation_dirpath.mkdir(exist_ok=True, parents=True)

#### Download new files and update database
If an argument is not provided (`arg=None`), its default value for the repository is used.

In [None]:
if download_database:
    download_database_HumanGEM(
        annotation_type=list(mapping_keys),
        database_dirpath=database_dirpath,
        model_filetype={"xml", "yml"},
        model_release=expected_release,
    )

## Load models
### Load RBC-GEM model

In [None]:
model_dirpath = get_dirpath("model")
model = read_cobra_model(filename=model_dirpath / f"{GEM_NAME}.xml")
model

### Load Human-GEM model

In [None]:
HumanGEM = read_cobra_model(database_dirpath / "Human-GEM.xml")
HumanGEM

## Reactions

* `reactions.tsv` content:

|# |fieldname      |annotation               |Prefixes (https://identifiers.org/)|
|--|---------------|-------------------------|-----------------------------------|
|1 |rxns           |identical to `model.rxns`|metatlas                           |
|2 |rxnKEGGID      |KEGG reaction ID         |kegg.reaction                      |
|3 |rxnBiGGID      |BiGG reaction ID         |bigg.reaction                      |
|4 |rxnEHMNID      |EHMN reaction ID         |                                   |
|5 |rxnHepatoNET1ID|HepatoNET1 reaction ID   |                                   |
|6 |rxnREACTOMEID  |REACTOME ID              |reactome                           |
|7 |rxnRecon3DID   |Recon3D reaction ID      |vmhreaction                        |
|8 |rxnMetaNetXID  |MetaNetX reaction ID     |metanetx.reaction                  |
|9 |rxnHMR2ID      |HMR2 reaction ID         |                                   |
|10|rxnRatconID    |Ratcon reaction ID       |                                   |
|11|rxnTCDBID      |TCDB ID                  |tcdb                               |
|12|spontaneous    |Spontaneous status       |                                   |
|13|rxnRheaID      |Rhea ID                  |rhea                               |
|14|rxnRheaMasterID|Master Rhea ID           |rhea                               |
|15|rxnRetired     |Retired reaction IDs     |                                   |

##### Notes

* `spontaneous` status column is included.
* Otherwise, include columns that link to https://identifiers.org/.
* For `rhea`:
    * The master Rhea ID is utilized for the annotation.
    * The "RHEA:" prefix from Rhea IDs needs to be stripped.
* `ec-code` annotations are not part of the table above; they are currently stored in the Human-GEM model file and are extracted from there.
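
The note on stripping the "RHEA:" prefix deserves care when a field holds several semicolon-separated IDs. The sketch below is a plain-Python illustration, not the notebook's implementation; `strip_rhea_prefix` is a hypothetical helper and the IDs are placeholders. It also shows why `str.lstrip` is not a prefix remover: it strips a set of leading characters and only touches the start of the string.

```python
def strip_rhea_prefix(value, sep=";", prefix="RHEA:"):
    """Remove `prefix` from every `sep`-separated item in `value`."""
    items = []
    for item in value.split(sep):
        item = item.strip()
        if item.startswith(prefix):
            item = item[len(prefix):]
        if item:
            items.append(item)
    return sep.join(items)


single = strip_rhea_prefix("RHEA:12345")
multiple = strip_rhea_prefix("RHEA:12345;RHEA:67890")

# `lstrip` only affects the start of the string, so later prefixes survive.
lstrip_result = "RHEA:12345;RHEA:67890".lstrip("RHEA:")
```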

In [None]:
annotation_type = "reactions"
mapping_key = mapping_keys[annotation_type]
merge_key = {
    "metatlas": "MetAtlas",
    "bigg.reaction": "BiGGID",
    "kegg.reaction": "KEGGID",
    "reactome": "REACTOMEID",
    "vmhreaction": "Recon3DID",
    "metanetx.reaction": "MetaNetXID",
    "tcdb": "TCDBID",
    "rhea": "RheaMasterID",
}[mapping_key]
merge_key = f"rxn{merge_key}"
annotation_columns = HUMANGEM_MIRIAM[annotation_type].copy()

del annotation_columns["rxnRheaID"]
annotation_columns = list(annotation_columns) + ["spontaneous", "ec-code"]
df_model = get_annotation_df(getattr(model, annotation_type), [mapping_key]).rename(
    {"id": annotation_type}, axis=1
)
df_model

In [None]:
df_annotations = get_annotations_HumanGEM(annotation_type, database_dirpath)
df_annotations = df_annotations.rename({"rxns": merge_key}, axis=1)
df_annotations = df_annotations.merge(
    pd.DataFrame.from_dict(
        {
            rxn.id: build_string(rxn.annotation.get("ec-code", []))
            for rxn in HumanGEM.reactions
            if rxn.annotation.get("ec-code", [])
        },
        orient="index",
        columns=["ec-code"],
    ),
    left_on=merge_key,
    right_index=True,
    how="left",
)

df_annotations = df_model.merge(
    df_annotations,
    left_on=mapping_key,
    right_on=merge_key if merge_key is not None else mapping_key,
    how="left",
)

if merge_key is not None and merge_key != mapping_key:
    df_annotations = df_annotations.drop(mapping_key, axis=1)

df_annotations = df_annotations.loc[:, [annotation_type] + annotation_columns]
if rename_miriam:
    df_annotations = df_annotations.rename(HUMANGEM_MIRIAM[annotation_type], axis=1)
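    # Clean up Rhea if MIRIAM formats are being applied
    if "rhea" in df_annotations.columns:
        # Remove every "RHEA:" prefix; `str.lstrip` strips a character set, not a prefix
        df_annotations["rhea"] = df_annotations["rhea"].str.replace(
            "RHEA:", "", regex=False
        )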

df_annotations = df_annotations.replace(float("nan"), pd.NA).replace("", pd.NA)
if compare:
    compare_on_index = [annotation_type]
    try:
        df_previous = pd.read_csv(
            annotation_dirpath / f"{annotation_type}_{db_tag}.tsv",
            sep="\t",
            index_col=None,
            dtype=str,
        )
        df_previous = df_previous.replace(float("nan"), pd.NA).replace("", pd.NA)
    except FileNotFoundError:
        df_previous = pd.DataFrame([], columns=compare_on_index)
    df_comparision = compare_tables(
        df_previous.set_index(compare_on_index),
        df_annotations.set_index(compare_on_index),
    )

    fig, ax = plt.subplots(1, 1, figsize=compare_figsize)
    ax.yaxis.set_tick_params(labelsize=8)
    ax = visualize_comparison(df_comparision)

if display_nunique:
    for col in df_annotations.columns:
        df = explode_column(df_annotations, name=col, sep=";")
        df = df[col].drop_duplicates()
        print(f"{df.name}: {df.nunique()}")

if overwrite:
    df_annotations.to_csv(
        annotation_dirpath / f"{annotation_type}_{db_tag}.tsv", sep="\t", index=False
    )

df_annotations

## Metabolites

* `metabolites_Human-GEM.tsv` content:

|# |fieldname      |annotation                             |Prefixes (https://identifiers.org/)|
|--|---------------|---------------------------------------|-----------------------------------|
|1 |mets           |identical to `model.mets`              |metatlas                           |
|2 |metsNoComp     |`model.mets` without compartment suffix|                                   |
|3 |metBiGGID      |BiGG metabolite ID                     |bigg.metabolite                    |
|4 |metKEGGID      |KEGG metabolite ID                     |kegg.compound                      |
|5 |metHMDBID      |HMDB ID                                |hmdb                               |
|6 |metChEBIID     |ChEBI ID                               |chebi                              |
|7 |metPubChemID   |PubChem ID                             |pubchem.compound                   |
|8 |metLipidMapsID |LipidMaps ID                           |lipidmaps                          |
|9 |metEHMNID      |EHMN metabolite ID                     |                                   |
|10|metHepatoNET1ID|HepatoNET1 metabolite ID               |                                   |
|11|metRecon3DID   |Recon3D metabolite ID                  |vmhmetabolite                      |
|12|metMetaNetXID  |MetaNetX metabolite ID                 |metanetx.chemical                  |
|13|metHMR2ID      |HMR2 metabolite ID                     |                                   |
|14|metRetired     |Retired metabolite IDs                 |                                   |

##### Notes

* Include columns that link to https://identifiers.org/.
* `inchi` annotations are not part of the table above; they are currently stored in the Human-GEM model file and are extracted from there.
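
The mapping step used throughout this notebook is a left merge of the model table onto the Human-GEM table. The sketch below illustrates that pattern with small made-up tables; all identifiers are placeholders, and the column names merely follow the naming used in this notebook.

```python
import pandas as pd

# Model table: one row per model metabolite with its mapping annotation.
df_model = pd.DataFrame(
    {
        "metabolites": ["met_a_c", "met_b_c", "met_c_c"],
        "metatlas": ["MAM00001c", "MAM00002c", None],
    }
)
# Database table: placeholder rows standing in for the Human-GEM TSV content.
df_db = pd.DataFrame(
    {
        "metMetAtlas": ["MAM00001c", "MAM00002c", "MAM00003c"],
        "metKEGGID": ["C00001", "", "C00003"],
    }
)

# A left merge keeps every model metabolite, matched or not.
merged = df_model.merge(
    df_db, left_on="metatlas", right_on="metMetAtlas", how="left"
)
# The database key duplicates the mapping column, so drop it.
merged = merged.drop(columns="metMetAtlas")
# Treat empty strings as missing values, as done before writing the TSV.
merged = merged.replace("", pd.NA)
```

Using `how="left"` means unmapped model entries remain in the output with missing annotation values instead of being silently dropped.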

In [None]:
annotation_type = "metabolites"
mapping_key = mapping_keys[annotation_type]
merge_key = {
    "metatlas": "MetAtlas",
    "bigg.metabolite": "BiGGID",
    "kegg.compound": "KEGGID",
    "hmdb": "HMDBID",
    "chebi": "ChEBIID",
    "pubchem.compound": "PubChemID",
    "lipidmaps": "LipidMapsID",
    "vmhmetabolite": "Recon3DID",
    "metanetx.chemical": "MetaNetXID",
}[mapping_key]
merge_key = f"met{merge_key}"

annotation_columns = HUMANGEM_MIRIAM[annotation_type].copy()
annotation_columns = list(annotation_columns) + ["inchi"]
df_model = get_annotation_df(getattr(model, annotation_type), [mapping_key]).rename(
    {"id": annotation_type}, axis=1
)
df_model

In [None]:
df_annotations = get_annotations_HumanGEM(annotation_type, database_dirpath)
df_annotations = df_annotations.rename({"mets": merge_key}, axis=1)
df_annotations = df_annotations.merge(
    pd.DataFrame.from_dict(
        {
            met.id: build_string(met.annotation.get("inchi", []))
            for met in HumanGEM.metabolites
            if met.annotation.get("inchi", [])
        },
        orient="index",
        columns=["inchi"],
    ),
    left_on=merge_key,
    right_index=True,
    how="left",
)
df_annotations = df_model.merge(
    df_annotations,
    left_on=mapping_key,
    right_on=merge_key if merge_key is not None else mapping_key,
    how="left",
)
if merge_key is not None and merge_key != mapping_key:
    df_annotations = df_annotations.drop(mapping_key, axis=1)

df_annotations = df_annotations.loc[:, [annotation_type] + annotation_columns]
if rename_miriam:
    df_annotations = df_annotations.rename(HUMANGEM_MIRIAM[annotation_type], axis=1)

df_annotations = df_annotations.replace(float("nan"), pd.NA).replace("", pd.NA)
if compare:
    compare_on_index = [annotation_type]
    try:
        df_previous = pd.read_csv(
            annotation_dirpath / f"{annotation_type}_{db_tag}.tsv",
            sep="\t",
            index_col=None,
            dtype=str,
        )
        df_previous = df_previous.replace(float("nan"), pd.NA).replace("", pd.NA)
    except FileNotFoundError:
        df_previous = pd.DataFrame([], columns=compare_on_index)
    df_comparision = compare_tables(
        df_previous.set_index(compare_on_index),
        df_annotations.set_index(compare_on_index),
    )

    fig, ax = plt.subplots(1, 1, figsize=compare_figsize)
    ax.yaxis.set_tick_params(labelsize=8)
    ax = visualize_comparison(df_comparision)

if display_nunique:
    for col in df_annotations.columns:
        df = explode_column(df_annotations, name=col, sep=";")
        df = df[col].drop_duplicates()
        print(f"{df.name}: {df.nunique()}")

if overwrite:
    df_annotations.to_csv(
        annotation_dirpath / f"{annotation_type}_{db_tag}.tsv", sep="\t", index=False
    )
df_annotations

## Genes

* `genes_Human-GEM.tsv` content:

|# |fieldname     |annotation            |Prefixes (https://identifiers.org/)|
|--|--------------|----------------------|-----------------------------------|
|1 |genes         |Ensembl gene ID       |ensembl                            |
|2 |geneENSTID    |Ensembl transcript ID |ensembl                            |
|3 |geneENSPID    |Ensembl protein ID    |ensembl                            |
|4 |geneUniProtID |UniProt ID            |uniprot                            |
|5 |geneSymbols   |Gene Symbol           |hgnc.symbol                        |
|6 |geneEntrezID  |NCBI Entrez ID        |ncbigene                           |
|7 |geneNames     |Gene Name             |                                   |
|8 |geneAliases   |Alias Names           |                                   |
|9 |compartments  |Subcellular location  |                                   |
|10|compDataSource|Source for compartment|                                   |


##### Notes
* Unlike `reactions` and `metabolites`, the Human-GEM reconstruction does not have its own unique identifiers for genes and instead uses the Ensembl gene ID. The code below is therefore slightly altered to account for this difference.
* It is recommended to map the model to the annotations using the NCBI Entrez ID or the UniProt ID.
* Include columns that link to https://identifiers.org/.
* For `ensembl`, the Ensembl gene ID is utilized for the annotation.
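
Because one gene can map to several identifiers, the summary printed at the end of each section counts unique values after splitting on semicolons. The sketch below approximates that count with plain pandas; `count_unique` is a hypothetical helper, not the repository's `explode_column`, and the table contents are placeholders.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "genes": ["GENE_A", "GENE_B", "GENE_C"],
        "uniprot": ["P00001;P00002", "P00002", None],
    }
)


def count_unique(frame, column, sep=";"):
    """Count distinct values in a column of `sep`-separated strings."""
    values = frame[column].dropna().str.split(sep).explode().str.strip()
    # Ignore empty fragments left by stray separators.
    return int(values[values != ""].nunique())


counts = {col: count_unique(df, col) for col in df.columns}
```

Here `P00002` appears in two rows but is counted once, and the row without a UniProt ID contributes nothing to the count.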

In [None]:
annotation_type = "genes"
mapping_key = mapping_keys[annotation_type]
merge_key = {
    "ensembl": "EnsemblID",
    "uniprot": "UniProtID",
    "hgnc.symbol": "Symbols",
    "ncbigene": "EntrezID",
}[mapping_key]
merge_key = f"gene{merge_key}"
annotation_columns = HUMANGEM_MIRIAM[annotation_type].copy()
# Several columns map to `ensembl`; keep only the Ensembl gene ID
del annotation_columns["geneENSTID"]
del annotation_columns["geneENSPID"]
annotation_columns = list(annotation_columns)
df_model = get_annotation_df(getattr(model, annotation_type), [mapping_key]).rename(
    {"id": annotation_type}, axis=1
)

In [None]:
df_annotations = get_annotations_HumanGEM(annotation_type, database_dirpath)
df_annotations = df_annotations.rename({"genes": "geneEnsemblID"}, axis=1)
df_annotations = df_model.merge(
    df_annotations,
    left_on=mapping_key,
    right_on=merge_key if merge_key is not None else mapping_key,
    how="left",
)
if merge_key is not None and merge_key != mapping_key:
    df_annotations = df_annotations.drop(mapping_key, axis=1)

df_annotations = df_annotations.loc[:, [annotation_type] + annotation_columns]

if rename_miriam:
    df_annotations = df_annotations.rename(HUMANGEM_MIRIAM[annotation_type], axis=1)


df_annotations = df_annotations.replace(float("nan"), pd.NA).replace("", pd.NA)
if compare:
    compare_on_index = [annotation_type]
    try:
        df_previous = pd.read_csv(
            annotation_dirpath / f"{annotation_type}_{db_tag}.tsv",
            sep="\t",
            index_col=None,
            dtype=str,
        )
        df_previous = df_previous.replace(float("nan"), pd.NA).replace("", pd.NA)
    except FileNotFoundError:
        df_previous = pd.DataFrame([], columns=compare_on_index)
    df_comparision = compare_tables(
        df_previous.set_index(compare_on_index),
        df_annotations.set_index(compare_on_index),
    )

    fig, ax = plt.subplots(1, 1, figsize=compare_figsize)
    ax.yaxis.set_tick_params(labelsize=8)
    ax = visualize_comparison(df_comparision)

if display_nunique:
    for col in df_annotations.columns:
        df = explode_column(df_annotations, name=col, sep=";")
        df = df[col].drop_duplicates()
        print(f"{df.name}: {df.nunique()}")

if overwrite:
    df_annotations.to_csv(
        annotation_dirpath / f"{annotation_type}_{db_tag}.tsv", sep="\t", index=False
    )

df_annotations