# Extract relevant data from ENZYME - Enzyme nomenclature database

The purpose of this notebook is to extract relevant data about enzymes and the reactions they catalyze from the database.

## Notebook Requirements:
*  Model genes **must** have at least one of the following annotations stored in `object.annotation`. Values are expected to be separated by semicolons. Accepted keys currently include:
    * `"ec-code"`
    * `"uniprot"`
 

Note: Requires internet connection to download information from [ENZYME - Enzyme nomenclature database](https://enzyme.expasy.org/).
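
As a minimal sketch of the requirement above, the hypothetical helper below checks that an annotation dictionary carries at least one accepted key and splits its semicolon-separated values. The function name and the plain-`dict` input are assumptions for illustration only; they are not part of `rbc_gem_utils` or COBRApy.

```python
ACCEPTED_KEYS = ("ec-code", "uniprot")


def split_annotation_values(annotation, keys=ACCEPTED_KEYS, sep=";"):
    """Return {key: [values]} for each accepted key found in the annotation.

    Hypothetical helper: raises ValueError when no accepted key is present.
    """
    found = {}
    for key in keys:
        raw = annotation.get(key)
        if not raw:
            continue
        # Split on the separator and discard empty fragments
        values = [value.strip() for value in raw.split(sep) if value.strip()]
        if values:
            found[key] = values
    if not found:
        raise ValueError(f"Annotation needs at least one of: {', '.join(keys)}")
    return found


# Values taken from the AARS1 gene that appears later in this notebook
print(split_annotation_values({"ec-code": "6.1.1.7;6.-.-.-", "uniprot": "P49588"}))
```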

## Setup
### Import packages

In [1]:
import re
from warnings import warn

import pandas as pd
from rbc_gem_utils import (
    GEM_NAME,
    build_string,
    check_database_release_online,
    explode_column,
    get_annotation_df,
    get_dirpath,
    read_cobra_model,
    show_versions,
)
from rbc_gem_utils.database.ec import (
    EC_DB_TAG,
    EC_RELEASE_EXPECTED,
    download_database_EC,
)

# Display versions of last time notebook ran and worked
show_versions()


Package Information
-------------------
rbc-gem-utils 0.0.2

Dependency Information
----------------------
beautifulsoup4                       4.13.4
bio                                   1.8.0
cobra                                0.29.1
depinfo                               2.2.0
gurobipy                             12.0.2
matplotlib                           3.10.3
matplotlib-venn                       1.1.2
memote                               0.17.0
networkx                              3.4.2
notebook                              7.4.2
openpyxl                              3.1.5
pandas                                2.2.3
pre-commit                            4.2.0
rbc-gem-utils[database,network,vis] missing
requests                             2.32.3
scipy                                1.15.3
seaborn                              0.13.2

Build Tools Information
-----------------------
pip          25.1
setuptools 78.1.1
wheel      0.45.1

Platform Information
-------------------

## Set notebook options

In [2]:
db_tag = EC_DB_TAG
expected_release = EC_RELEASE_EXPECTED
download_database = True
display_nunique = True
overwrite = True

# Best mapping key is ec-code or uniprot
mapping_key = "uniprot"

## Check EC-ENZYME version
If the version does not match the expected version, the database has been updated since the last time this notebook was run.

### Expected EC-ENZYME version: 27-Nov-2024
* Updates to the database are made every eight weeks (need confirmation)
* Last release utilized: **27-Nov-2024**.
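
To illustrate what comparing releases involves, the sketch below parses release labels in the `DD-Mon-YYYY` form shown above and compares them as dates. It is not the implementation of `check_database_release_online`; the helper names are hypothetical, and it assumes English month abbreviations (the default "C" locale).

```python
from datetime import date, datetime

RELEASE_FORMAT = "%d-%b-%Y"  # e.g. "27-Nov-2024"; %b depends on the active locale


def parse_release(label: str) -> date:
    """Convert a release label such as '27-Nov-2024' into a date."""
    return datetime.strptime(label.strip(), RELEASE_FORMAT).date()


def release_is_newer(current: str, expected: str) -> bool:
    """Return True when the current release is dated after the expected one."""
    return parse_release(current) > parse_release(expected)


print(release_is_newer("09-Apr-2025", "27-Nov-2024"))
```

Comparing parsed dates rather than raw strings avoids false mismatches caused by stray whitespace and makes it possible to tell whether a release is newer or older.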

In [3]:
use_interim = not check_database_release_online(db_tag, verbose=True)
# Use different directory paths for unexpected behavior
if use_interim:
    warn(
        "Online release of database has been updated since the last time notebook was used."
    )


database_dirpath = get_dirpath(
    "database", db_tag, use_temp="interim" if use_interim else None
)
annotation_dirpath = get_dirpath(
    "annotation", use_temp="interim" if use_interim else None
)

# Ensure directories exist
database_dirpath.mkdir(exist_ok=True, parents=True)
annotation_dirpath.mkdir(exist_ok=True, parents=True)

Current and expected releases match. Current release: 09-Apr-2025


#### Download new files and update database
If an argument is not provided (`arg=None`), its default value for the repository is used.

In [4]:
if download_database:
    download_database_EC(filename="enzyme.dat", database_dirpath=database_dirpath)
    download_database_EC(filename="enzclass.txt", database_dirpath=database_dirpath)

## Read data files
### Enzymes

In [5]:
with open(database_dirpath / "ec_enzyme.dat") as file:
    lines = file.readlines()

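# Parse the flat file into one dict per entry, keyed by a running index
data = {}
idx = -1
for line in lines:
    line = line.strip()
    # Each line starts with a two-character line code followed by three spaces
    line_type = line[:2]
    line_value = line[2 + 3 :]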
    # All entries start with ID and end with '//' for termination
    if line.startswith("ID"):
        data[idx] = {"ID": line_value.split(" ")[-1]}
        continue
    elif line.startswith("//"):
        idx += 1
        continue
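    elif line.startswith("CC"):
        # Skip the file header comments that precede the first entry
        if idx not in data:
            continue

    elif line.startswith("DR"):
        # Keep only accessions of human proteins (entry names ending in "_HUMAN")
        line_value = [x.strip() for x in line_value.split(";") if x.strip()]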
        line_value = build_string(
            [
                x.split(", ")[0]
                for x in line_value
                if x.split(", ")[-1].endswith("_HUMAN")
            ]
        )

    if line_value:
        current = data.get(idx, {}).get(line_type, "")
        if current:
            current = build_string([current, line_value.rstrip(".")])
        else:
            current = line_value.rstrip(".")
        data[idx][line_type] = current

df_ec_enzyme = pd.DataFrame.from_dict(data, orient="index")
df_ec_enzyme = df_ec_enzyme[df_ec_enzyme["DE"].str.find("Transferred entry") == -1]
df_ec_enzyme = df_ec_enzyme.drop_duplicates()
df_ec_enzyme = df_ec_enzyme.rename(
    {
        "ID": "ec-code",
        "DE": "description",
        "AN": "alternate",
        "CA": "catalytic activity",
        "CC": "comments",
        "DR": "uniprot",
    },
    axis=1,
)
df_ec_enzyme["uniprot"] = df_ec_enzyme["uniprot"].apply(
    lambda x: x.split(";") if isinstance(x, str) else x
)
df_ec_enzyme = df_ec_enzyme.explode("uniprot")
df_ec_enzyme

| | ec-code | description | alternate | catalytic activity | comments | uniprot |
|---|---|---|---|---|---|---|
| 0 | 1.1.1.1 | alcohol dehydrogenase | aldehyde reductase | (1) a primary alcohol + NAD(+) = an aldehyde +... | -!- Acts on primary or secondary alcohols or h... | P07327 |
| 0 | 1.1.1.1 | alcohol dehydrogenase | aldehyde reductase | (1) a primary alcohol + NAD(+) = an aldehyde +... | -!- Acts on primary or secondary alcohols or h... | P00326 |
| 0 | 1.1.1.1 | alcohol dehydrogenase | aldehyde reductase | (1) a primary alcohol + NAD(+) = an aldehyde +... | -!- Acts on primary or secondary alcohols or h... | P28332 |
| 0 | 1.1.1.1 | alcohol dehydrogenase | aldehyde reductase | (1) a primary alcohol + NAD(+) = an aldehyde +... | -!- Acts on primary or secondary alcohols or h... | P40394 |
| 0 | 1.1.1.1 | alcohol dehydrogenase | aldehyde reductase | (1) a primary alcohol + NAD(+) = an aldehyde +... | -!- Acts on primary or secondary alcohols or h... | P11766 |
| ... | ... | ... | ... | ... | ... | ... |
| 8396 | 7.6.2.12 | ABC-type capsular-polysaccharide transporter | capsular-polysaccharide-transporting ATPase | ATP + H2O + capsular polysaccharide-[capsular ... | -!- ATP-binding cassette (ABC) type transporte... | |
| 8397 | 7.6.2.13 | ABC-type autoinducer-2 transporter | autoinducer-2 ABC transporter;autoinducer-2 tr... | ATP + H2O + (2R,4S)-2-methyl-2,3,3,4-tetrahydr... | -!- ATP-binding cassette (ABC) type transporte... | |
| 8398 | 7.6.2.14 | ABC-type aliphatic sulfonate transporter | aliphatic sulfonate ABC transporter;aliphatic ... | ATP + H2O + aliphatic sulfonate-[sulfonate-bin... | -!- ATP-binding cassette (ABC) type transporte... | |
| 8399 | 7.6.2.15 | ABC-type thiamine transporter | thiamine ABC transporter;thiamine transporting... | thiamine(out) + ATP + H2O = thiamine(in) + ADP... | -!- ATP-binding cassette (ABC) type transporte... | |
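
The parsing logic in this cell can be sketched on a small in-memory record. The sample below is an abbreviated, illustrative entry written in the ENZYME flat-file layout (two-character line code, three spaces, then the value; `//` terminates an entry). The function name is hypothetical, and joining repeated line codes with `;` is an assumption that mirrors the table output above.

```python
SAMPLE = """\
ID   1.1.1.1
DE   alcohol dehydrogenase.
AN   aldehyde reductase.
CA   (1) a primary alcohol + NAD(+) = an aldehyde + NADH + H(+).
CA   (2) a secondary alcohol + NAD(+) = a ketone + NADH + H(+).
DR   P07327, ADH1A_HUMAN;  P00326, ADH1G_HUMAN;  P00330, ADH1_YEAST;
//
"""


def parse_enzyme_records(text, sep=";"):
    """Parse ENZYME-style flat-file text into a list of dicts (one per entry)."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("//"):
            # Entry terminator: store the finished entry, if any
            if current is not None:
                records.append(current)
            current = None
            continue
        code, value = line[:2], line[5:].strip()
        if code == "ID":
            current = {"ID": value}
            continue
        if current is None or not value:
            continue  # text before the first entry, or an empty value
        if code == "DR":
            # Keep accessions whose entry name ends in "_HUMAN"
            pairs = [pair.strip() for pair in value.split(";") if pair.strip()]
            values = [p.split(",")[0].strip() for p in pairs if p.endswith("_HUMAN")]
        else:
            values = [value.rstrip(".")]
        if values:
            current.setdefault(code, []).extend(values)
    # Join repeated line codes into one separator-delimited string
    return [
        {key: val if key == "ID" else sep.join(val) for key, val in record.items()}
        for record in records
    ]


for record in parse_enzyme_records(SAMPLE):
    print(record)
```

Starting a fresh dictionary at each `ID` line and closing it at `//` keeps the entry boundaries explicit, so header text before the first entry never needs an index check.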


### Enzyme classes

In [6]:
items = [
    "class",
    "subclass",
    "subsubclass",
    "serial",
    "description",
]

with open(database_dirpath / "ec_enzclass.txt") as file:
    lines = file.readlines()

# Skip the leading 11 and trailing 5 lines of the file (header and footer text)
lines = lines[11:-5]

ec_enzclass_data = {}
for i, line in enumerate(lines):
    if not re.search(r"^(\d+|\-)\.", line):
        continue

    line_items = [
        substr.strip().rstrip(".")
        for string in line.split(".", maxsplit=3)
        for substr in string.split(" ", maxsplit=1)
        if substr.strip()
    ]
    if len(line_items) != 5:
        warn(f"Issue with parsing line {i+1}: {repr(line)}")
        continue
    ec_enzclass_data[i] = {
        "ec-code": ".".join(line_items[:4]).strip(),
        "description": line_items[4],
    }
df_ec_enzclass = pd.DataFrame.from_dict(ec_enzclass_data, orient="index")
df_ec_enzclass = pd.concat(
    (df_ec_enzclass, df_ec_enzyme[["ec-code", "description"]]), axis=0
)
df_ec_enzclass = (
    df_ec_enzclass.sort_values("ec-code")
    .reset_index(drop=True)
    .dropna()
    .drop_duplicates()
    .astype(str)
)
description_dict = df_ec_enzclass.set_index("ec-code").to_dict()["description"]
df_ec_enzclass

| | ec-code | description |
|---|---|---|
| 0 | 1.-.-.- | Oxidoreductases |
| 1 | 1.1.-.- | Acting on the CH-OH group of donors |
| 2 | 1.1.1.- | With NAD(+) or NADP(+) as acceptor |
| 3 | 1.1.1.1 | alcohol dehydrogenase |
| 8 | 1.1.1.10 | L-xylulose reductase |
| ... | ... | ... |
| 10359 | 7.6.2.5 | ABC-type heme transporter |
| 10360 | 7.6.2.6 | ABC-type guanine transporter |
| 10361 | 7.6.2.7 | ABC-type taurine transporter |
| 10362 | 7.6.2.8 | ABC-type vitamin B12 transporter |
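
As an alternative view of the class-file parsing, the sketch below uses a single regular expression instead of nested splits. The exact spacing of the sample lines is an assumption about the layout of `enzclass.txt`; the pattern tolerates optional spaces between levels. The names `CLASS_LINE` and `parse_class_line` are hypothetical.

```python
import re

# Three dot-terminated levels, a fourth level, whitespace, then the description
CLASS_LINE = re.compile(
    r"^\s*(\d+)\.\s*(\d+|-)\.\s*(\d+|-)\.\s*(\d+|-)\s+(.+?)\.?\s*$"
)


def parse_class_line(line):
    """Return (ec_code, description), or None if the line is not a class line."""
    match = CLASS_LINE.match(line)
    if match is None:
        return None
    *levels, description = match.groups()
    return ".".join(levels), description


sample_lines = [
    "1. -. -.-  Oxidoreductases.",
    "1. 1. -.-   Acting on the CH-OH group of donors.",
    "1. 1. 1.-    With NAD(+) or NADP(+) as acceptor.",
    "This line is not a class definition",
]
for sample in sample_lines:
    print(parse_class_line(sample))
```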


## Load RBC-GEM model

In [7]:
model_dirpath = get_dirpath("model")
model = read_cobra_model(filename=model_dirpath / f"{GEM_NAME}.xml")
model

Set parameter Username
Set parameter LicenseID to value 2664191
Academic license - for non-commercial use only - expires 2026-05-12


0,1
Name,RBC_GEM
Memory address,1d6aded5d90
Number of metabolites,2157
Number of reactions,3275
Number of genes,820
Number of groups,78
Objective expression,1.0*NaKt - 1.0*NaKt_reverse_db47e
Compartments,"cytosol, extracellular space"


### Extract current annotations from model

In [8]:
annotation_type = "genes"
df_model_mappings = (
    get_annotation_df(model.genes, ["ec-code", "uniprot"])
    .rename({"id": annotation_type}, axis=1)
    .dropna(subset=[mapping_key])
)
for col in df_model_mappings.columns:
    df_model_mappings = explode_column(df_model_mappings, name=col, sep=";")
df_model_mappings = df_model_mappings.sort_values(annotation_type)
df_model_mappings

| | genes | ec-code | uniprot |
|---|---|---|---|
| 626 | A4GALT | 2.4.1.228 | Q9NPC4 |
| 295 | AARS1 | 6.1.1.7 | P49588 |
| 295 | AARS1 | 6.-.-.- | P49588 |
| 477 | AASDHPPT | 2.7.8.7 | Q9NRN7 |
| 627 | ABCA1 | 7.6.2.1 | O95477 |
| ... | ... | ... | ... |
| 552 | ZDHHC20 | 2.3.1.- | Q5W0Z9 |
| 552 | ZDHHC20 | 2.3.1.225 | Q5W0Z9 |
| 525 | ZDHHC3 | 2.3.1.- | Q9NYG2 |
| 525 | ZDHHC3 | 2.3.1.225 | Q9NYG2 |
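
Exploding every column in turn yields one row per combination of values, which is why AARS1 appears twice above. The sketch below reproduces that effect with the standard library only. It assumes `explode_column` behaves like a split on the separator followed by a row-wise explode; `explode_record` is a hypothetical stand-in, not the `rbc_gem_utils` function.

```python
from itertools import product


def explode_record(record, sep=";"):
    """Expand one record with separator-joined fields into all combinations."""
    keys = list(record)
    # One list of candidate values per column
    options = [[part.strip() for part in record[key].split(sep)] for key in keys]
    return [dict(zip(keys, combo)) for combo in product(*options)]


row = {"genes": "AARS1", "ec-code": "6.1.1.7;6.-.-.-", "uniprot": "P49588"}
for exploded in explode_record(row):
    print(exploded)
```

Because the result is a Cartesian product, a gene with several EC codes and several UniProt accessions produces every pairing, including pairings that are not biologically linked. The later mapping step filters these by requiring the model's pair to match a pair found in the database.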


### Map to EC Codes

In [9]:
df_model_ec_enzyme = df_model_mappings[["genes", "ec-code", "uniprot"]].merge(
    df_ec_enzyme,
    left_on=mapping_key,
    right_on=mapping_key,
    how="left",
    suffixes=("", "_drop"),
)
for key in ["ec-code", "uniprot"]:
    if f"{key}_drop" in df_model_ec_enzyme.columns:
        drop_key = f"{key}_drop"
        df_model_ec_enzyme = df_model_ec_enzyme[
            df_model_ec_enzyme[key] == df_model_ec_enzyme[drop_key]
        ].drop(labels=[drop_key], axis=1)
df_model_ec_enzyme = df_model_ec_enzyme.reset_index(drop=True)


df_model_ec_enzyme["subsubclass"] = df_model_ec_enzyme["ec-code"].apply(
    lambda x: ".".join(x.rsplit(".", maxsplit=1)[:1] + 1 * ["-"])
)
df_model_ec_enzyme["subclass"] = df_model_ec_enzyme["ec-code"].apply(
    lambda x: ".".join(x.rsplit(".", maxsplit=2)[:1] + 2 * ["-"])
)
df_model_ec_enzyme["class"] = df_model_ec_enzyme["ec-code"].apply(
    lambda x: ".".join(x.rsplit(".", maxsplit=3)[:1] + 3 * ["-"])
)

df_model_ec_enzyme["ec-code.description"] = df_model_ec_enzyme["ec-code"].apply(
    lambda x: description_dict[x]
)
df_model_ec_enzyme["subsubclass.description"] = df_model_ec_enzyme["subsubclass"].apply(
    lambda x: description_dict[x]
)
df_model_ec_enzyme["subclass.description"] = df_model_ec_enzyme["subclass"].replace(
    description_dict
)
df_model_ec_enzyme["class.description"] = df_model_ec_enzyme["class"].replace(
    description_dict
)


df_model_ec_enzyme = df_model_ec_enzyme.loc[
    :,
    [
        "genes",
        "uniprot",
        "class",
        "class.description",
        "subclass",
        "subclass.description",
        "subsubclass",
        "subsubclass.description",
        "ec-code",
        "ec-code.description",
        "alternate",
        "catalytic activity",
        "comments",
    ],
]
df_model_ec_enzyme = df_model_ec_enzyme.groupby(
    ["genes", mapping_key], as_index=False
).agg(lambda x: build_string(x.dropna().unique()))
df_model_ec_enzyme = df_model_ec_enzyme.replace(float("nan"), pd.NA).replace("", pd.NA)


if display_nunique:
    for col in df_model_ec_enzyme.columns:
        df = explode_column(df_model_ec_enzyme, name=col, sep=";")
        df = df[col].drop_duplicates()
        print(f"{df.name}: {df.nunique()}")

if overwrite:
    df_model_ec_enzyme.to_csv(
        database_dirpath / f"{EC_DB_TAG}_{GEM_NAME}.tsv",
        sep="\t",
        index=False,
    )

df_model_ec_enzyme

genes: 600
uniprot: 600
class: 7
class.description: 7
subclass: 50
subclass.description: 50
subsubclass: 108
subsubclass.description: 99
ec-code: 508
ec-code.description: 509
alternate: 1502
catalytic activity: 797
comments: 1994


| | genes | uniprot | class | class.description | subclass | subclass.description | subsubclass | subsubclass.description | ec-code | ec-code.description | alternate | catalytic activity | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A4GALT | Q9NPC4 | 2.-.-.- | Transferases | 2.4.-.- | Glycosyltransferases | 2.4.1.- | Hexosyltransferases | 2.4.1.228 | lactosylceramide 4-alpha-galactosyltransferase | Galbeta1-4Glcbeta1-Cer alpha1,4-galactosyltran... | a beta-D-Gal-(1->4)-beta-D-Glc-(1<->1)-Cer(d18... | |
| 1 | AARS1 | P49588 | 6.-.-.- | Ligases | 6.1.-.- | Forming carbon-oxygen bonds | 6.1.1.- | Ligases forming aminoacyl-tRNA and related com... | 6.1.1.7 | alanine--tRNA ligase | alanine translase;alanyl-tRNA synthetase | tRNA(Ala) + L-alanine + ATP = L-alanyl-tRNA(Al... | |
| 2 | AASDHPPT | Q9NRN7 | 2.-.-.- | Transferases | 2.7.-.- | Transferring phosphorus-containing groups | 2.7.8.- | Transferases for other substituted phosphate g... | 2.7.8.7 | holo-[acyl-carrier-protein] synthase | 4'-phosphopantetheinyl transferase;ACPS;acyl c... | apo-[ACP] + CoA = holo-[ACP] + adenosine 3',5'... | -!- All polyketide synthases, fatty-acid synth... |
| 3 | ABCA1 | O95477 | 7.-.-.- | Translocases | 7.6.-.- | Catalysing the translocation of other compounds | 7.6.2.- | Linked to the hydrolysis of a nucleoside triph... | 7.6.2.1 | P-type phospholipid transporter | flippase;phospholipid-transporting ATPase | ATP + H2O + phospholipidSide 1 = ADP + phospha... | -!- A P-type ATPase that undergoes covalent ph... |
| 4 | ABCA7 | Q8IZY2 | 7.-.-.- | Translocases | 7.6.-.- | Catalysing the translocation of other compounds | 7.6.2.- | Linked to the hydrolysis of a nucleoside triph... | 7.6.2.1 | P-type phospholipid transporter | flippase;phospholipid-transporting ATPase | ATP + H2O + phospholipidSide 1 = ADP + phospha... | -!- A P-type ATPase that undergoes covalent ph... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 595 | YES1 | P07947 | 2.-.-.- | Transferases | 2.7.-.- | Transferring phosphorus-containing groups | 2.7.10.- | Protein-tyrosine kinases | 2.7.10.2 | non-specific protein-tyrosine kinase | cytoplasmic protein tyrosine kinase | L-tyrosyl-[protein] + ATP = O-phospho-L-tyrosy... | -!- Unlike EC 2.7.10.1, this protein-tyrosine ... |
| 596 | ZDHHC2 | Q9UIJ5 | 2.-.-.- | Transferases | 2.3.-.- | Acyltransferases | 2.3.1.- | Transferring groups other than amino-acyl groups | 2.3.1.225 | protein S-acyltransferase | DHHC palmitoyl transferase;G-protein palmitoyl... | L-cysteinyl-[protein] + hexadecanoyl-CoA = S-h... | -!- The enzyme catalyzes the post-translationa... |
| 597 | ZDHHC20 | Q5W0Z9 | 2.-.-.- | Transferases | 2.3.-.- | Acyltransferases | 2.3.1.- | Transferring groups other than amino-acyl groups | 2.3.1.225 | protein S-acyltransferase | DHHC palmitoyl transferase;G-protein palmitoyl... | L-cysteinyl-[protein] + hexadecanoyl-CoA = S-h... | -!- The enzyme catalyzes the post-translationa... |
| 598 | ZDHHC3 | Q9NYG2 | 2.-.-.- | Transferases | 2.3.-.- | Acyltransferases | 2.3.1.- | Transferring groups other than amino-acyl groups | 2.3.1.225 | protein S-acyltransferase | DHHC palmitoyl transferase;G-protein palmitoyl... | L-cysteinyl-[protein] + hexadecanoyl-CoA = S-h... | -!- The enzyme catalyzes the post-translationa... |
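
The class, subclass, and sub-subclass columns follow directly from the four-level structure of an EC number. A minimal sketch of that derivation, using a hypothetical `ec_hierarchy` helper and a small description lookup copied from the first row of the table above:

```python
def ec_hierarchy(ec_code):
    """Return the parent class codes of a four-level EC number."""
    parts = ec_code.split(".")
    if len(parts) != 4:
        raise ValueError(f"Expected four levels in EC number, got: {ec_code!r}")
    # Replace the trailing levels with "-" placeholders
    return {
        "class": ".".join(parts[:1] + ["-"] * 3),
        "subclass": ".".join(parts[:2] + ["-"] * 2),
        "subsubclass": ".".join(parts[:3] + ["-"]),
    }


# Descriptions taken from the A4GALT row; a real lookup would cover all codes
descriptions = {
    "2.-.-.-": "Transferases",
    "2.4.-.-": "Glycosyltransferases",
    "2.4.1.-": "Hexosyltransferases",
}

for level, code in ec_hierarchy("2.4.1.228").items():
    # dict.get avoids a KeyError when a code has no description
    print(level, code, descriptions.get(code))
```

Splitting once and slicing keeps the three derivations symmetric, and `dict.get` returns `None` for a missing code rather than raising as direct indexing would.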
