# Extract information from DrugBank

This notebook demonstrates how to extract information from various fields in the DrugBank database. 
Its purpose is to serve as a set of examples of how DrugBank data can be extracted. 

Data extracted for specific purposes, along with any subsequent analysis notebooks (e.g., drug interaction visualization), should be kept and maintained separately.


## DRUGBANK ONLINE
To utilize this notebook: 

1. Go to [DrugBank database](https://go.drugbank.com/releases/latest) and create an account.
2. Follow the instructions to obtain a free academic license.
3. Download and unzip the database file `"drugbank_all_full_database.xml.zip"`.
4. Rename the file `"full database.xml"` to `"drugbank_all_full_database.xml"`.
5. Remember to clear out any personal account information and ensure the downloaded DrugBank file remains local!
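
Steps 3 and 4 can also be done programmatically. The following is a minimal sketch, assuming the archive contains a member named `full database.xml` as described above; the helper `extract_drugbank_xml` is hypothetical and is not part of `rbc_gem_utils`.

```python
import zipfile
from pathlib import Path


def extract_drugbank_xml(
    zip_path,
    dest_dir,
    member="full database.xml",
    new_name="drugbank_all_full_database.xml",
):
    """Extract `member` from the zip archive into `dest_dir` and rename it."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # ZipFile.extract returns the path of the extracted file
        extracted = Path(archive.extract(member, path=dest_dir))
    target = dest_dir / new_name
    # Path.replace overwrites any existing file at the target location
    extracted.replace(target)
    return target
```

Calling `extract_drugbank_xml("drugbank_all_full_database.xml.zip", "some/local/dir")` would leave a single renamed XML file in the destination directory.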

The function `download_database_DrugBank` takes a username and password, downloads the data, and renames the file in the process.

Fields for the DrugBank XML schema are found [here](https://docs.drugbank.com/xml/#introduction).

Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, Sajed T, Johnson D, Li C, Sayeeda Z, Assempour N, Iynkkaran I, Liu Y, Maciejewski A, Gale N, Wilson A, Chin L, Cummings R, Le D, Pon A, Knox C, Wilson M. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2017 Nov 8. doi: 10.1093/nar/gkx1037.

### Import packages

In [1]:
from warnings import warn
import pandas as pd
from collections import defaultdict
from xml.etree import ElementTree


from rbc_gem_utils import (
    ROOT_PATH,
    INTERIM_PATH,
    DATABASE_PATH,
    ANNOTATION_PATH,
    get_annotation_df,
    read_rbc_model,
    check_database_version_online,
    check_version,
    show_versions,
)
from rbc_gem_utils.util import (
    build_string,
    strip_plural,
    ensure_iterable,
    split_string,
    has_value_type,
)

from rbc_gem_utils.database.drugbank import (
    DRUGBANK_NS,
    DRUGBANK_VERSION_EXPECTED,
    DRUGBANK_PATH,
    DRUGBANK_GENERAL_ELEMENTS,
    DRUGBANK_PHARMACOLOGY_ELEMENTS,
    DRUGBANK_CLASSIFICATION_ELEMENTS,
    DRUGBANK_MIXTURES_ELEMENTS,
    DRUGBANK_SALTS_ELEMENTS,
    DRUGBANK_PRICE_ELEMENTS,
    DRUGBANK_DOSAGE_ELEMENTS,
    DRUGBANK_PATENT_ELEMENTS,
    DRUGBANK_PATHWAY_ELEMENTS,
    strip_ns_DrugBank,
    get_version_DrugBank,
    download_database_DrugBank,
)


# Display versions of last time notebook ran and worked
show_versions()


Package Information
-------------------
rbc-gem-utils 0.0.1

Dependency Information
----------------------
beautifulsoup4                       4.12.3
bio                                   1.6.2
cobra                                0.29.0
depinfo                               2.2.0
kaleido                               0.2.1
matplotlib                            3.8.2
memote                               0.17.0
networkx                              3.2.1
notebook                              7.0.7
openpyxl                              3.1.2
pandas                                2.2.0
pre-commit                            3.6.0
pyvis                                 0.3.2
rbc-gem-utils[database,network,vis] missing
requests                             2.31.0
scipy                                1.12.0
seaborn                              0.13.2

Build Tools Information
-----------------------
pip        23.3.1
setuptools 68.2.2
wheel      0.41.2

Platform Information
-------------------

## Check DrugBank version
If the version does not match the expected version, the database has been updated since the last time this code was used. 
### Expected DrugBank version: 5.1.11
* Last release utilized: [5.1.11](https://go.drugbank.com/releases) published on **2024-01-03**
* Version in the DrugBank file is formatted as {major}.{minor}
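
To illustrate the `{major}.{minor}` format, the sketch below reads the attributes of the root element without loading the whole file. It assumes the root `<drugbank>` element carries `version` and `exported-on` attributes; `read_root_attributes` is a hypothetical helper and not the implementation of `get_version_DrugBank`.

```python
import io
from xml.etree import ElementTree


def read_root_attributes(source):
    """Return the attributes of the root element, reading only its start tag."""
    for _event, element in ElementTree.iterparse(source, events=("start",)):
        # The first "start" event always belongs to the root element
        return dict(element.attrib)
    return {}


# Toy document standing in for the real DrugBank file (values are illustrative)
sample = io.BytesIO(
    b'<drugbank xmlns="http://www.drugbank.ca" version="5.1" exported-on="2024-01-03"/>'
)
attributes = read_root_attributes(sample)
major, minor = attributes["version"].split(".")
print(major, minor)
```

Because the file stores only `{major}.{minor}`, a patch-level number such as the last component of 5.1.11 cannot be recovered this way.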

In [2]:
if not check_database_version_online("DrugBank"):
    warn(
        "Online version of database has been updated since the last time notebook was used."
    )

version = get_version_DrugBank()
if check_version(version, DRUGBANK_VERSION_EXPECTED, verbose=True):
    database_dirpath = f"{ROOT_PATH}{DATABASE_PATH}{DRUGBANK_PATH}"
    annotation_dirpath = f"{ROOT_PATH}{ANNOTATION_PATH}"
else:
    database_dirpath = f"{ROOT_PATH}{INTERIM_PATH}{DRUGBANK_PATH}"
    annotation_dirpath = f"{ROOT_PATH}{INTERIM_PATH}"
    version = DRUGBANK_VERSION_EXPECTED

Current and expected versions match.


#### Download new files and update database
If an argument is not provided (`arg=None`), its default value for the repository is used. 
A username and password must be provided for this function. Be sure to remove personal information after use!
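
One way to avoid leaving credentials in the notebook is to read them from environment variables. This is only a sketch: the variable names `DRUGBANK_USERNAME` and `DRUGBANK_PASSWORD` and the helper `get_drugbank_credentials` are assumptions made for illustration, not something `rbc_gem_utils` defines.

```python
import os


def get_drugbank_credentials(env=None):
    """Read DrugBank credentials from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    username = env.get("DRUGBANK_USERNAME")
    password = env.get("DRUGBANK_PASSWORD")
    if not username or not password:
        raise RuntimeError(
            "Set DRUGBANK_USERNAME and DRUGBANK_PASSWORD before downloading."
        )
    return username, password
```

The returned pair could then be passed as the `username` and `password` arguments of `download_database_DrugBank`, so that nothing personal is saved with the notebook.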

In [3]:
download = False
if download:
    # Download data
    download_database_DrugBank(
        username="USERNAME",
        password="PASSWORD",
        database_dirpath=database_dirpath,
        version=version,
    )
filepath = f"{database_dirpath}/drugbank_all_full_database.xml"

## Load RBC-GEM model

In [4]:
model = read_rbc_model(filetype="xml")
model

0,1
Name,RBC_GEM
Memory address,146dae350
Number of metabolites,1967
Number of reactions,2788
Number of genes,653
Number of groups,74
Objective expression,1.0*NaKt - 1.0*NaKt_reverse_db47e
Compartments,"cytosol, extracellular space"


In [5]:
annotation_type = "genes"
df_model_mappings = get_annotation_df(
    getattr(model, annotation_type), ["uniprot", "drugbank"]
).rename({"id": annotation_type}, axis=1)

df_model_mappings["drugbank"] = df_model_mappings["drugbank"].apply(
    lambda x: split_string(x)
)
df_model_mappings = (
    df_model_mappings.explode("drugbank").dropna(subset=["drugbank"]).drop_duplicates()
)
print(df_model_mappings.nunique())
drugbank_ids = set(df_model_mappings["drugbank"].dropna().unique())

df_model_mappings

genes        395
uniprot      395
drugbank    2065
dtype: int64


Unnamed: 0,genes,uniprot,drugbank
0,RPE,Q96AT9,DB00153
1,RPIA,P49247,DB01756
2,SORD,Q00796,DB00157
2,SORD,Q00796,DB04478
4,SRM,P19623,DB00118
...,...,...,...
651,GRIA1,P42261,DB05047
651,GRIA1,P42261,DB06247
651,GRIA1,P42261,DB08883
651,GRIA1,P42261,DB09289


## Parse DrugBank information into DataFrame
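
Elements in the DrugBank XML are namespace-qualified (the root tag is `{http://www.drugbank.ca}drugbank`), which is why every lookup in the cells of this section is prefixed with `DRUGBANK_NS`. A minimal sketch on a toy document, assuming `DRUGBANK_NS` is the string `{http://www.drugbank.ca}`; the local `NS` below is a stand-in for the imported constant, and the drug entry is made up.

```python
from xml.etree import ElementTree

NS = "{http://www.drugbank.ca}"  # assumed value of DRUGBANK_NS

toy = ElementTree.fromstring(
    """
    <drugbank xmlns="http://www.drugbank.ca">
      <drug type="small molecule">
        <drugbank-id primary="true">DB00000</drugbank-id>
        <drugbank-id>EXPT00000</drugbank-id>
        <name>Example drug</name>
      </drug>
    </drugbank>
    """.strip()
)
drug = toy.find(f"{NS}drug")
# Without the namespace prefix the child element is not found
unqualified = drug.find("name")
# The attribute predicate selects only the primary identifier
primary_id = drug.findtext(f"{NS}drugbank-id[@primary='true']")
print(unqualified, primary_id, drug.findtext(f"{NS}name"), drug.get("type"))
```

Note that XML attributes such as `type` and `primary` are not namespace-qualified, so `drug.get("type")` works without the prefix.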

In [6]:
root = ElementTree.parse(filepath).getroot()
root

<Element '{http://www.drugbank.ca}drugbank' at 0x1482f1850>

### Extract general information

In [7]:
dataframes = {}
idx = 0
data = defaultdict(dict)
for drug in root:
    # General information
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    data[idx].update({"drugbank-id": drugbank_id})
    data[idx].update({attr: drug.get(attr) for attr in ["type", "created", "updated"]})
    # Get element values that don't require diving further.
    for key in DRUGBANK_GENERAL_ELEMENTS:
        if key == "drugbank-id":
            continue
        # These elements hold several values; join the values into a single string.
        # AHFS codes seem to be empty for all entries?
        if key in {
            "groups",
            "affected-organisms",
            "ahfs-codes",
            "pdb-entries",
            "food-interactions",
        }:
            elements = drug.findall(
                f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}"
            )
            data[idx].update(
                {
                    key: build_string(
                        [
                            element.text
                            for element in elements
                            if has_value_type(element)
                        ],
                        sep=";;",
                    )
                }
            )
        elif key == "categories":
            # Drug categories
            # For mesh-id: https://registry.identifiers.org/registry/mesh
            for subkey in ["category", "mesh-id"]:
                elements = drug.findall(
                    f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}/{DRUGBANK_NS}{subkey}"
                )
                data[idx].update(
                    {
                        f"{key}.{subkey}": build_string(
                            [
                                element.text
                                for element in elements
                                if has_value_type(element)
                            ],
                            sep=";;",
                        )
                    }
                )
        elif key == "general-references":
            # For main dataframe, group all references and use the unique ID from DrugBank
            elements = [
                element.find(f"{DRUGBANK_NS}ref-id")
                for subkey in ["articles", "textbooks", "links", "attachments"]
                for element in drug.findall(
                    f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{subkey}/{DRUGBANK_NS}{strip_plural(subkey)}"
                )
            ]
            data[idx].update(
                {
                    f"{key}.ref-id": build_string(
                        [
                            element.text
                            for element in elements
                            if has_value_type(element)
                        ],
                        sep=";;",
                    )
                }
            )

        else:
            element = drug.find(f"{DRUGBANK_NS}{key}")
            if element is not None and has_value_type(element):
                data[idx].update({key: element.text})

    # Chemical Taxonomy from classyfire
    key = "classification"
    prefix = f"{key}."
    for subkey in DRUGBANK_CLASSIFICATION_ELEMENTS:
        element = drug.find(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{subkey}")
        if element is not None and has_value_type(element):
            data[idx].update({f"{prefix}{subkey}": element.text})

    # Synonyms with international brands elsewhere
    # External codes not included
    # Pharmacology events
    for key in DRUGBANK_PHARMACOLOGY_ELEMENTS:
        prefix = "pharmacology."
        element = drug.find(f"{DRUGBANK_NS}{key}")
        if element is not None and has_value_type(element):
            data[idx].update({f"{prefix}{key}": element.text})

    # Regional availability not included
    # International brands with synonyms elsewhere

    # Mixtures of drugs, names only
    # Packagers of drugs, names only
    # Patents of drugs, numbers only
    # Pathway involvement of drugs, smpdb-id only
    # Products of drugs, names only
    for key, subkey in zip(
        ["mixtures", "packagers", "patents", "pathways", "products"],
        ["names", "names", "numbers", "smpdb-ids", "names"],
    ):
        prefix = f"{key}."
        elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
        data[idx].update(
            {
                f"{prefix}{subkey}": build_string(
                    [
                        element.findtext(f"{DRUGBANK_NS}{strip_plural(subkey)}")
                        for element in elements
                    ],
                    sep=";;",
                )
            }
        )
    # Manufacturers of drugs, names only
    key = "manufacturers"
    prefix = f"{key}."
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    data[idx].update(
        {
            f"{prefix}{key}": build_string(
                [element.text for element in elements if has_value_type(element)],
                sep=";;",
            )
        }
    )
    # Prices of drugs elsewhere
    # Categories of drugs with general information
    # References are included with general
    # Therapeutic categories not included
    # Dosages of drugs elsewhere
    # ATC codes of drugs, codes only
    key = "atc-codes"
    prefix = f"{key}."
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    data[idx].update(
        {
            f"{prefix}{key}": build_string(
                [element.get("code") for element in elements], sep=";;"
            )
        }
    )

    # Drug interactions elsewhere
    # Structured Drug interactions and associated, not included.
    # Sequences
    key = "sequences"
    prefix = f"{key}."
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    data[idx].update(
        {
            f"{prefix}{key}": build_string(
                [element.text for element in elements if has_value_type(element)],
                sep=";;",
            )
        }
    )

    # Calculated and experimental properties
    for key in ["calculated-properties", "experimental-properties"]:
        prefix = f"{key}."
        properties = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}property")
        for prop in properties:
            # Named `prop` to avoid shadowing the built-in `property`
            property_dict = dict(
                zip(
                    [strip_ns_DrugBank(sub.tag) for sub in prop],
                    [sub.text for sub in prop],
                )
            )
            kind = property_dict["kind"].replace(" ", "-").lower()
            if key == "calculated-properties":
                # Include source of calculation in header (ALOGPS, ChemAxon)
                kind = f"{kind}.{property_dict['source']}"
                data[idx].update({f"{prefix}{kind}": property_dict["value"]})
            else:
                # Include source of experimental measurement in data columns
                data[idx].update(
                    {
                        f"{prefix}{kind}": property_dict["value"],
                        f"{prefix}{kind}.source": property_dict["source"],
                    }
                )

    # External Identifiers, not included
    # External Links

    # Reactions are included elsewhere
    # SNP Effects and SNP Adverse Drug Reactions, combined
    # Only SNP dbID, elsewhere
    data[idx].update(
        {
            "snp.rs-ids": build_string(
                [
                    element.text
                    for key in ["snp-effects", "snp-adverse-drug-reactions"]
                    for element in drug.findall(
                        f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key.split('-')[-1])}/{DRUGBANK_NS}rs-id"
                    )
                    if has_value_type(element)
                ],
                sep=";;",
            )
        }
    )

    # References are included with general and elsewhere
    # Salt forms of drugs
    key = "salts"
    prefix = f"{key}."
    elements = drug.findall(
        f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}/{DRUGBANK_NS}drugbank-id[@primary='true']"
    )
    data[idx].update(
        {
            f"{prefix}drugbank-ids": build_string(
                [element.text for element in elements if has_value_type(element)],
                sep=";;",
            )
        }
    )

    # Targets / Enzymes / Carriers / Transporters, only drugbank bioentities IDs
    for key in ["targets", "enzymes", "carriers", "transporters"]:
        prefix = f"{key}."
        elements = drug.findall(
            f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}/{DRUGBANK_NS}id"
        )
        data[idx].update(
            {
                f"{prefix}ids": build_string(
                    [element.text for element in elements if has_value_type(element)],
                    sep=";;",
                )
            }
        )

    idx += 1


df_drugbank_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
dataframes["general"] = df_drugbank_data
df_drugbank_data

Unnamed: 0,drugbank-id,type,created,updated,name,description,cas-number,unii,average-mass,monoisotopic-mass,...,experimental-properties.logp,experimental-properties.logp.source,experimental-properties.caco2-permeability,experimental-properties.caco2-permeability.source,experimental-properties.pka,experimental-properties.pka.source,experimental-properties.logs,experimental-properties.logs.source,experimental-properties.radioactivity,experimental-properties.radioactivity.source
0,DB00027,small molecule,2005-06-13,2024-01-02,Gramicidin D,Gramcidin D is a heterogeneous mixture of thre...,1405-97-6,5IE62321P4,1811.253,1810.033419343,...,,,,,,,,,,
1,DB00030,biotech,2005-06-13,2024-01-02,Insulin human,"Human Insulin, also known as Regular Insulin, ...",11061-68-0,1Y17CTI5SR,,,...,,,,,,,,,,
2,DB00035,small molecule,2005-06-13,2024-01-02,Desmopressin,"Desmopressin (dDAVP), a synthetic analogue of ...",16679-58-6,ENR1LLB0FP,1069.22,1068.426955905,...,,,,,,,,,,
3,DB00041,biotech,2005-06-13,2024-01-02,Aldesleukin,"Aldesleukin, a lymphokine, is produced by reco...",110942-02-4,M89N0Q7EQR,,,...,,,,,,,,,,
4,DB00046,biotech,2005-06-13,2024-01-02,Insulin lispro,Insulin lispro is a rapid-acting form of insul...,133107-64-9,GFX7QIS1II,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2060,DB16732,biotech,2021-09-21,2021-11-08,Tisotumab vedotin,Tisotumab vedotin is a tissue factor-directed ...,1418731-10-8,T41737F88A,,,...,,,,,,,,,,
2061,DB16826,small molecule,2022-07-15,2023-12-22,Repotrectinib,Repotrectinib is a next-generation tyrosine ki...,1802220-02-5,08O3FQ4UNP,355.373,355.144453003,...,,,,,,,,,,
2062,DB17083,small molecule,2022-10-26,2022-12-13,Linzagolix,"Linzagolix is a non-peptide, selective antagon...",935283-04-8,7CDW97HUEX,508.42,508.055206494,...,,,,,,,,,,
2063,DB17472,small molecule,2023-01-30,2023-12-07,Pirtobrutinib,Pirtobrutinib is a small molecule and a highly...,2101700-15-4,JNA39I7ZVB,479.436,479.158052208,...,,,,,,,,,,
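
Multi-valued fields in this table are stored as single strings joined with `;;`. The sketch below shows one way to expand such a column back into one row per value. The toy frame and its values are made up for illustration, and it assumes `build_string` does nothing beyond joining the values with the given separator.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "drugbank-id": ["DB00000", "DB00001"],
        "groups": ["approved;;investigational", ""],
    }
)
long_groups = (
    toy.assign(groups=toy["groups"].str.split(";;", regex=False))
    .explode("groups", ignore_index=True)
    # Empty strings come from drugs without a value for this field
    .loc[lambda df: df["groups"] != ""]
    .reset_index(drop=True)
)
print(long_groups)
```

The same pattern applies to any other `;;`-joined column, such as `categories.category` or `targets.ids`.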


In [8]:
for c in df_drugbank_data.columns:
    print(c)

drugbank-id
type
created
updated
name
description
cas-number
unii
average-mass
monoisotopic-mass
state
groups
categories.category
categories.mesh-id
affected-organisms
ahfs-codes
pdb-entries
msds
food-interactions
general-references.ref-id
synthesis-reference
classification.kingdom
classification.superclass
classification.direct-parent
classification.substituent
classification.description
classification.alternative-parent
pharmacology.indication
pharmacology.pharmacodynamics
pharmacology.mechanism-of-action
mixtures.names
packagers.names
patents.numbers
pathways.smpdb-ids
products.names
manufacturers.manufacturers
atc-codes.atc-codes
sequences.sequences
calculated-properties.logp.ALOGPS
calculated-properties.logs.ALOGPS
calculated-properties.water-solubility.ALOGPS
calculated-properties.logp.ChemAxon
calculated-properties.iupac-name.ChemAxon
calculated-properties.traditional-iupac-name.ChemAxon
calculated-properties.molecular-weight.ChemAxon
calculated-properties.monoisotopic-weight.Ch

### Extract property information

In [9]:
df_properties = df_drugbank_data.loc[
    :,
    ["drugbank-id"]
    + [
        c
        for c in df_drugbank_data.columns
        if c.startswith("calculated-properties")
        or c.startswith("experimental-properties")
    ],
]
df_properties

Unnamed: 0,drugbank-id,calculated-properties.logp.ALOGPS,calculated-properties.logs.ALOGPS,calculated-properties.water-solubility.ALOGPS,calculated-properties.logp.ChemAxon,calculated-properties.iupac-name.ChemAxon,calculated-properties.traditional-iupac-name.ChemAxon,calculated-properties.molecular-weight.ChemAxon,calculated-properties.monoisotopic-weight.ChemAxon,calculated-properties.smiles.ChemAxon,...,experimental-properties.logp,experimental-properties.logp.source,experimental-properties.caco2-permeability,experimental-properties.caco2-permeability.source,experimental-properties.pka,experimental-properties.pka.source,experimental-properties.logs,experimental-properties.logs.source,experimental-properties.radioactivity,experimental-properties.radioactivity.source
0,DB00027,4.38,-5.7,3.90e-03 g/l,5.96,(2R)-N-[(1S)-1-{[(1R)-1-{[(1S)-1-{[(1R)-1-{[(1...,(2R)-N-[(1S)-1-{[(1R)-1-{[(1S)-1-{[(1R)-1-{[(1...,1811.253,1810.033419343,CC(C)C[C@@H](NC(=O)CNC(=O)[C@@H](NC=O)C(C)C)C(...,...,,,,,,,,,,
1,DB00030,,,,,,,,,,...,,,,,,,,,,
2,DB00035,-1,-4,1.10e-01 g/l,-6.1,"(2R)-2-{[(2S)-1-[(4R,7S,10S,13S,16S)-13-benzyl...",tigecycline,1069.22,1068.426955905,NC(=O)CC[C@@H]1NC(=O)[C@H](CC2=CC=CC=C2)NC(=O)...,...,,,,,,,,,,
3,DB00041,,,,,,,,,,...,,,,,,,,,,
4,DB00046,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2060,DB16732,,,,,,,,,,...,,,,,,,,,,
2061,DB16826,2.33,-3.8,4.98e-02 g/l,2.17,"(3R,11S)-6-fluoro-3,11-dimethyl-10-oxa-2,13,17...","5-{[6-(1-cyclopropylpyrazol-4-yl)-[1,2,4]triaz...",355.373,355.144453003,C[C@H]1CNC(=O)C2=C3N=C(N[C@H](C)C4=CC(F)=CC=C4...,...,,,,,,,,,,
2062,DB17083,3.2,-5.4,1.98e-03 g/l,3.88,"3-{5-[(2,3-difluoro-6-methoxyphenyl)methoxy]-2...",linzagolix,508.42,508.055206494,COC1=C(COC2=C(OC)C=C(F)C(=C2)N2C(=O)NC3=CSC(C(...,...,,,,,,,,,,
2063,DB17472,3.17,-5.1,3.84e-03 g/l,3.35,5-amino-3-(4-{[(5-fluoro-2-methoxyphenyl)forma...,6-{[6-(4-aminobenzenesulfonamido)pyridin-3-yl]...,479.436,479.158052208,COC1=C(C=C(F)C=C1)C(=O)NCC1=CC=C(C=C1)C1=NN([C...,...,,,,,,,,,,
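
All property values are extracted as text, and missing values are filled with empty strings. For numeric analysis, a column can be converted afterwards. A minimal sketch on a made-up frame, using `pandas.to_numeric`:

```python
import pandas as pd

column = "calculated-properties.logp.ALOGPS"
toy = pd.DataFrame(
    {
        "drugbank-id": ["DB00000", "DB00001", "DB00002"],
        column: ["4.38", "", "-1"],
    }
)
# errors="coerce" turns empty or non-numeric strings into NaN
toy[column] = pd.to_numeric(toy[column], errors="coerce")
print(toy.dtypes)
```

Columns that carry units in the value (for example `3.90e-03 g/l` for water solubility) would be coerced to NaN by this approach and need the unit stripped first.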


### Extract classification information

In [10]:
df_drugbank_classification_data = df_drugbank_data.loc[
    :,
    ["drugbank-id"]
    + [f"classification.{subkey}" for subkey in DRUGBANK_CLASSIFICATION_ELEMENTS],
]
df_drugbank_classification_data

Unnamed: 0,drugbank-id,classification.kingdom,classification.superclass,classification.direct-parent,classification.subclass,classification.substituent,classification.description,classification.alternative-parent
0,DB00027,Organic compounds,Organic Polymers,Polypeptides,,3-alkylindole,This compound belongs to the class of organic ...,3-alkylindoles
1,DB00030,Organic Compounds,Organic Acids,Peptides,"Amino Acids, Peptides, and Analogues",,,
2,DB00035,,,,,,,
3,DB00041,Organic Compounds,Organic Acids,Peptides,"Amino Acids, Peptides, and Analogues",,,
4,DB00046,Organic Compounds,Organic Acids,Peptides,"Amino Acids, Peptides, and Analogues",,,
...,...,...,...,...,...,...,...,...
2060,DB16732,Organic Compounds,Organic Acids,Peptides,"Amino Acids, Peptides, and Analogues",,,
2061,DB16826,,,,,,,
2062,DB17083,,,,,,,
2063,DB17472,,,,,,,


### Extract synonyms and other aliases
#### Synonyms

In [11]:
idx = 0
data = defaultdict(dict)
prefix = True
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "synonyms"
    prefix = f"{key}." if prefix else ""
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    for element in elements:
        if has_value_type(element):
            data[idx].update({"drugbank-id": drugbank_id})
            data[idx].update({f"{prefix}{key}": element.text})
            data[idx].update(
                {
                    f"{prefix}{attr}": element.get(attr)
                    for attr in ["language", "coder"]
                    if element.get(attr)
                }
            )
            idx += 1

df_drugbank_synonym_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)

df_drugbank_synonym_data

Unnamed: 0,drugbank-id,synonyms.synonyms,synonyms.language,synonyms.coder
0,DB00027,Bacillus brevis gramicidin D,english,
1,DB00027,Gramicidin,english,
2,DB00027,Gramicidin A,english,
3,DB00027,Gramicidin B,english,
4,DB00027,Gramicidin C,english,
...,...,...,...,...
7336,DB17472,(s)-5-amino-3-(4-((5-fluoro-2-methoxybenzamido...,english,
7337,DB17472,"1h-pyrazole-4-carboxamide, 5-amino-3-(4-(((5-f...",english,
7338,DB17472,5-amino-3-(4-((5-fluoro-2-methoxybenzamido)met...,english,
7339,DB17635,DCR-PHXC free acid,english,


#### International brands

The proprietary names used by manufacturers for commercially available forms of the drug, focusing on brand names for products that are available in countries other than Canada and the United States.

In [12]:
idx = 0
data = defaultdict(dict)
prefix = True
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "international-brands"
    prefix = f"{key}." if prefix else ""
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    for element in elements:
        # Reuse the primary ID already looked up at the top of the loop
        data[idx].update({"drugbank-id": drugbank_id})
        for subkey in ["name", "company"]:
            data[idx].update(
                {f"{prefix}{subkey}": element.findtext(f"{DRUGBANK_NS}{subkey}")}
            )
        idx += 1

df_drugbank_intbrand_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
df_drugbank_intbrand_data

Unnamed: 0,drugbank-id,international-brands.name,international-brands.company
0,DB00027,Sofradex,Sanofi
1,DB00035,Adiuretin,Ferring
2,DB00035,DesmoMelt,Ferring
3,DB00047,Lantus R,
4,DB00047,Lusduna Nexvue,
...,...,...,...
3752,DB16390,Exkivity,Takeda Pharmaceutical Company Limited
3753,DB16627,Pepaxto,Oncopeptides AB
3754,DB16703,Rezurock,"Kadmon Holdings, Inc."
3755,DB16732,Tivdak,Seagen Inc. and Genmab A/S


### Extract salts information

In [13]:
idx = 0
data = defaultdict(dict)
prefix = True
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "salts"
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    prefix = f"{key}." if prefix else ""
    for element in elements:
        data[idx].update({"drugbank-id": drugbank_id})
        data[idx].update(
            {
                f"{prefix}drugbank-id": element.findtext(
                    f"{DRUGBANK_NS}drugbank-id[@primary='true']"
                )
            }
        )
        for subkey in DRUGBANK_SALTS_ELEMENTS:
            if subkey == "drugbank-id":
                continue
            subelement = element.find(f"{DRUGBANK_NS}{subkey}")
            if subelement is not None and has_value_type(subelement):
                data[idx].update(
                    {f"{prefix}{subkey}": element.findtext(f"{DRUGBANK_NS}{subkey}")}
                )

        idx += 1

df_drugbank_salts_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
dataframes["salts"] = df_drugbank_salts_data
df_drugbank_salts_data

Unnamed: 0,drugbank-id,salts.drugbank-id,salts.name,salts.cas-number,salts.unii,salts.inchikey,salts.average-mass,salts.monoisotopic-mass
0,DB00030,DBSALT001733,Insulin human zinc suspension,,,,,
1,DB00030,DBSALT001734,NPH insulin,53027-39-7,,,,
2,DB00035,DBSALT001154,Desmopressin acetate,62357-86-2,XB13HYU18U,YNKFCNRZZPFMEX-XHPDKPNGSA-N,1183.32,1182.479779327
3,DB00035,DBSALT000044,Desmopressin acetate anhydrous,62288-83-9,1K12647SFC,MLSVJHOYXJGGTR-IFHOVBQLSA-N,1129.269,1128.448084334
4,DB00071,DBSALT001735,Insulin suspension isophane purified pork,,,,,
...,...,...,...,...,...,...,...,...
877,DB16656,DBSALT003170,Zotiraciclib citrate,1204918-73-9,3VF50SU4RZ,NWYDRHSNEATNRI-SQQVDAMQSA-N,564.595,564.222014006
878,DB16703,DBSALT003188,Belumosudil mesylate,2109704-99-4,6MX7XE1M0U,BGNMZPDNJWWQCU-UHFFFAOYSA-N,548.62,548.1841892
879,DB16703,DBSALT003189,Belumosudil trifluoroacetate,1243152-02-4,LL4OG4RZ5D,PBWWPJDQYJXRII-UHFFFAOYSA-N,580.568,580.204587862
880,DB17083,DBSALT003262,Linzagolix choline,1321816-57-2,VHS6SC660Q,IAIVRTFCYOGNBW-UHFFFAOYSA-M,611.59,611.154920537


### Extract mixtures information

In [14]:
idx = 0
data = defaultdict(dict)
prefix = True
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "mixtures"
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    prefix = f"{key}." if prefix else ""
    for element in elements:
        for subkey in DRUGBANK_MIXTURES_ELEMENTS:
            subelement = element.find(f"{DRUGBANK_NS}{subkey}")
            if subelement is not None and has_value_type(subelement):
                data[idx].update({f"{prefix}{subkey}": subelement.text})

        idx += 1

df_drugbank_mixtures_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
dataframes["mixtures"] = df_drugbank_mixtures_data
df_drugbank_mixtures_data

Unnamed: 0,mixtures.name,mixtures.ingredients
0,Neomycin and Polymyxin B Sulfates and Gramicidin,Gramicidin D + Neomycin + Polymyxin B
1,Neosporin,Gramicidin D + Neomycin + Polymyxin B
2,Neocidin,Gramicidin D + Neomycin + Polymyxin B
3,Soframycin Nasal Spray,Framycetin + Gramicidin D + Phenylephrine
4,Triple Antibiotic Ointment,Bacitracin + Gramicidin D + Polymyxin B
...,...,...
18260,Bylvay,Odevixibat
18261,Rezlidhia,Olutasidenib
18262,KLH-2109 Choline,Linzagolix
18263,Yselty,Linzagolix


### Extract manufacturers information

In [15]:
idx = 0
data = defaultdict(dict)
prefix = True
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "manufacturers"
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    prefix = f"{key}." if prefix else ""
    for element in elements:
        if has_value_type(element):
            data[idx].update({"drugbank-id": drugbank_id})
            data[idx].update({f"{prefix}{strip_plural(key)}": element.text})
            data[idx].update({f"{prefix}generic": element.get("generic")})
            idx += 1


df_drugbank_manufacturers_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
dataframes["manufacturers"] = df_drugbank_manufacturers_data
df_drugbank_manufacturers_data

Unnamed: 0,drugbank-id,manufacturers.manufacturer,manufacturers.generic
0,DB00030,Novo nordisk inc,false
1,DB00035,Sanofi aventis us llc,false
2,DB00035,Bedford laboratories div ben venue laboratorie...,true
3,DB00035,Hospira inc,true
4,DB00035,Teva parenteral medicines inc,true
...,...,...,...
3463,DB01021,Par pharmaceutical inc,true
3464,DB01021,Sandoz inc,true
3465,DB01021,Tg united labs llc,true
3466,DB01021,Watson laboratories inc,true


### Extract prices information

In [16]:
idx = 0
data = defaultdict(dict)
prefix = True
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "prices"
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    prefix = f"{key}." if prefix else ""
    for element in elements:
        data[idx].update({"drugbank-id": drugbank_id})
        data[idx].update(
            {
                f"{prefix}{subkey}": element.findtext(f"{DRUGBANK_NS}{subkey}")
                for subkey in DRUGBANK_PRICE_ELEMENTS
            }
        )
        data[idx].update(
            {f"{prefix}currency": element.find(f"{DRUGBANK_NS}cost").get("currency")}
        )

        idx += 1

df_drugbank_prices_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
dataframes["prices"] = df_drugbank_prices_data
df_drugbank_prices_data

Unnamed: 0,drugbank-id,prices.description,prices.cost,prices.unit,prices.currency
0,DB00027,Neosporin gu irr 40 mg/ml amp,23.12,ml,USD
1,DB00027,Gramicidin d powder,240.0,g,USD
2,DB00027,Neosporin + pain relief cream,0.32,g,USD
3,DB00030,Novolin Ge Nph 100 unit/ml,2.14,cartridge,USD
4,DB00030,Novolin Ge Toronto 100 unit/ml,2.14,cartridge,USD
...,...,...,...,...,...
7013,DB06151,Mucomyst-10 10% Solution 30ml Vial,25.99,vial,USD
7014,DB06151,Acetylcysteine 20 % Solution,0.68,ml,USD
7015,DB06151,Mucomyst 20 % Solution,0.75,ml,USD
7016,DB06151,N-acetyl-l-cysteine powder,0.84,g,USD
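
The price loop above reads `currency` with `element.find(...).get(...)`, which assumes every `price` entry has a `cost` child. As a minimal sketch under stated assumptions (the namespace string `NS` and the XML fragment are illustrative and not taken from the DrugBank file; the notebook itself imports `DRUGBANK_NS` from `rbc_gem_utils`), a guarded variant might look like this:

```python
from xml.etree import ElementTree

# Assumed namespace string, for illustration only.
NS = "{http://www.drugbank.ca}"

# Hypothetical fragment: the second <price> has no <cost> child.
xml = """
<drug xmlns="http://www.drugbank.ca">
  <prices>
    <price>
      <description>Gramicidin d powder</description>
      <cost currency="USD">240.0</cost>
      <unit>g</unit>
    </price>
    <price>
      <description>Example without cost</description>
      <unit>ml</unit>
    </price>
  </prices>
</drug>
""".strip()

drug = ElementTree.fromstring(xml)
rows = []
for price in drug.findall(f"{NS}prices/{NS}price"):
    cost = price.find(f"{NS}cost")  # None when the child is absent
    rows.append(
        {
            "description": price.findtext(f"{NS}description"),
            "cost": float(cost.text) if cost is not None and cost.text else None,
            "unit": price.findtext(f"{NS}unit"),
            "currency": cost.get("currency") if cost is not None else None,
        }
    )
```

Converting `cost` to `float` here is an optional choice; the notebook keeps the values as text.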


### Extract dosage information

In [17]:
idx = 0
data = defaultdict(dict)
prefix = True
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "dosages"
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    prefix = f"{key}." if prefix else ""
    for element in elements:
        data[idx].update({"drugbank-id": drugbank_id})
        data[idx].update(
            {
                # Fall back to "" so a missing child does not raise on .lower()
                f"{prefix}{subkey}": (
                    element.findtext(f"{DRUGBANK_NS}{subkey}") or ""
                ).lower()
                for subkey in DRUGBANK_DOSAGE_ELEMENTS
            }
        )
        idx += 1

df_drugbank_dosages_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
dataframes["dosages"] = df_drugbank_dosages_data
df_drugbank_dosages_data

Unnamed: 0,drugbank-id,dosages.form,dosages.route,dosages.strength
0,DB00027,solution,auricular (otic); ophthalmic,
1,DB00027,solution / drops,auricular (otic),
2,DB00027,solution / drops,ophthalmic,
3,DB00027,solution,ophthalmic,
4,DB00027,solution,ophthalmic,0.025 mg
...,...,...,...,...
38680,DB17083,powder,,1 kg/1kg
38681,DB17083,"tablet, film coated",oral,100 mg
38682,DB17083,"tablet, film coated",oral,200 mg
38683,DB17472,"tablet, coated",oral,100 mg/1


### Extract ATC codes information

In [18]:
idx = 0
data = defaultdict(dict)
prefix = True
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "atc-codes"
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    prefix = f"{key}." if prefix else ""
    for element in elements:
        data[idx].update(
            {"drugbank-id": drugbank_id, f"{prefix}atc-code": element.get("code")}
        )
        data[idx].update({f"{prefix}description": drug.findtext(f"{DRUGBANK_NS}name")})
        data[idx].update({f"{prefix}code": element.get("code")})
        data[idx].update({f"{prefix}level": "substance"})
        idx += 1
        for level, subelement in zip(
            ["chemical", "pharmacological", "therapeutic", "anatomical"], list(element)
        ):
            data[idx].update(
                {"drugbank-id": drugbank_id, f"{prefix}atc-code": element.get("code")}
            )
            data[idx].update({f"{prefix}description": subelement.text})
            data[idx].update({f"{prefix}code": subelement.get("code")})
            data[idx].update({f"{prefix}level": level})
            idx += 1
df_atc_codes_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
dataframes["atc-codes"] = df_atc_codes_data
df_atc_codes_data

Unnamed: 0,drugbank-id,atc-codes.atc-code,atc-codes.description,atc-codes.code,atc-codes.level
0,DB00027,R02AB30,Gramicidin D,R02AB30,substance
1,DB00027,R02AB30,Antibiotics,R02AB,chemical
2,DB00027,R02AB30,THROAT PREPARATIONS,R02A,pharmacological
3,DB00027,R02AB30,THROAT PREPARATIONS,R02,therapeutic
4,DB00027,R02AB30,RESPIRATORY SYSTEM,R,anatomical
...,...,...,...,...,...
9960,DB17083,H01CC04,Linzagolix,H01CC04,substance
9961,DB17083,H01CC04,Anti-gonadotropin-releasing hormones,H01CC,chemical
9962,DB17083,H01CC04,HYPOTHALAMIC HORMONES,H01C,pharmacological
9963,DB17083,H01CC04,PITUITARY AND HYPOTHALAMIC HORMONES AND ANALOGUES,H01,therapeutic
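
The `atc-codes.level` labels above follow the ATC hierarchy, in which each higher level is a prefix of the full seven-character code (1, 3, 4, 5, and 7 characters). A minimal sketch, using a hypothetical helper `atc_levels` that is not part of the notebook or `rbc_gem_utils`, shows how the codes in the table could be cross-checked by slicing:

```python
# Prefix length of each ATC level, from broadest to most specific.
ATC_LEVEL_LENGTHS = {
    "anatomical": 1,
    "therapeutic": 3,
    "pharmacological": 4,
    "chemical": 5,
    "substance": 7,
}


def atc_levels(code):
    """Split a full seven-character ATC code into its five level codes."""
    if len(code) != 7:
        raise ValueError(f"expected a 7-character ATC code, got {code!r}")
    return {level: code[:length] for level, length in ATC_LEVEL_LENGTHS.items()}


levels = atc_levels("R02AB30")
```

For `R02AB30` this yields the same `R`, `R02`, `R02A`, `R02AB`, and `R02AB30` codes that appear in the first five rows of the table.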


### Extract patent information

In [19]:
idx = 0
data = defaultdict(dict)
prefix = True
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "patents"
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    prefix = f"{key}." if prefix else ""
    for element in elements:
        data[idx].update({"drugbank-id": drugbank_id})
        data[idx].update(
            {
                # Fall back to "" so a missing child does not raise on .lower()
                f"{prefix}{subkey}": (
                    element.findtext(f"{DRUGBANK_NS}{subkey}") or ""
                ).lower()
                for subkey in DRUGBANK_PATENT_ELEMENTS
            }
        )

        idx += 1

df_drugbank_patents_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
dataframes["patents"] = df_drugbank_patents_data
df_drugbank_patents_data

Unnamed: 0,drugbank-id,patents.number,patents.country,patents.approved,patents.expires,patents.pediatric-extension
0,DB00030,re37872,united states,2002-10-08,2010-02-12,false
1,DB00030,2183577,canada,2007-10-30,2015-02-07,false
2,DB00030,2253393,canada,2007-10-09,2017-05-07,false
3,DB00030,7291132,united states,2007-11-06,2024-08-09,false
4,DB00030,6257233,united states,2001-07-10,2019-05-14,false
...,...,...,...,...,...,...
6417,DB17635,11359203,united states,2015-10-09,2035-10-09,false
6418,DB17635,11286488,united states,2018-10-12,2038-10-12,false
6419,DB17635,10738311,united states,2015-10-09,2035-10-09,false
6420,DB17635,11053502,united states,2015-10-29,2035-10-29,false
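
The `patents.approved` and `patents.expires` columns are ISO-formatted date strings, so they can be compared as dates without extra parsing rules. A minimal sketch, assuming a hypothetical helper `active_patents` and three rows copied from the table above:

```python
from datetime import date

# Rows copied from the patents table above (number and expiry only).
patents = [
    {"number": "re37872", "expires": "2010-02-12"},
    {"number": "7291132", "expires": "2024-08-09"},
    {"number": "11359203", "expires": "2035-10-09"},
]


def active_patents(rows, on):
    """Return patent numbers whose expiry date falls on or after `on`."""
    return [
        row["number"] for row in rows if date.fromisoformat(row["expires"]) >= on
    ]


# The reference date is arbitrary and chosen only for illustration.
still_active = active_patents(patents, on=date(2024, 1, 2))
```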


### Extract drug interactions information
Note: These drug interactions come from the free version of DrugBank. The terminology is kept separate from the [Structured Drug Interactions](https://docs.drugbank.com/xml/#structured-drug-interactions) for DrugBank.

In [20]:
idx = 0
data = defaultdict(dict)
prefix = True
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "drug-interactions"
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    prefix = f"{key}." if prefix else ""
    for element in elements:
        data[idx].update(
            {"drugbank-id": drugbank_id, "name": drug.findtext(f"{DRUGBANK_NS}name")}
        )
        data[idx].update(
            {
                f"{prefix}{subkey}": element.findtext(f"{DRUGBANK_NS}{subkey}")
                for subkey in ["drugbank-id", "name", "description"]
            }
        )
        idx += 1

df_drug_interactions = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)

# # Aggregate dataframe values
# df_drug_interactions["drug-interactions.drugbank-ids"] = df_drug_interactions[["drugbank-id", "drug-interactions.drugbank-id"]].agg(
#     lambda x: build_string(x, sep=";;"),
#     axis=1
# )
# df_drug_interactions["drug-interactions.names"] = df_drug_interactions[["name", "drug-interactions.name"]].agg(
#     lambda x: build_string(x, sep=";;"),
#     axis=1
# )
# df_drug_interactions = df_drug_interactions.loc[:, [
#     "drugbank-id",
#     "drug-interactions.drugbank-ids",
#     "drug-interactions.names",
#     "drug-interactions.description"
# ]]

# dataframes["interactions"] = df_drug_interactions
df_drug_interactions

Unnamed: 0,drugbank-id,name,drug-interactions.drugbank-id,drug-interactions.name,drug-interactions.description
0,DB00027,Gramicidin D,DB12768,BCG vaccine,The therapeutic efficacy of BCG vaccine can be...
1,DB00027,Gramicidin D,DB00266,Dicoumarol,The risk or severity of bleeding can be increa...
2,DB00027,Gramicidin D,DB00498,Phenindione,The risk or severity of bleeding can be increa...
3,DB00027,Gramicidin D,DB00682,Warfarin,The risk or severity of bleeding can be increa...
4,DB00027,Gramicidin D,DB00946,Phenprocoumon,The risk or severity of bleeding can be increa...
...,...,...,...,...,...
1035276,DB17472,Pirtobrutinib,DB11718,Encorafenib,The serum concentration of Encorafenib can be ...
1035277,DB17472,Pirtobrutinib,DB11679,Fruquintinib,The metabolism of Fruquintinib can be decrease...
1035278,DB17472,Pirtobrutinib,DB12005,Nirogacestat,The serum concentration of Nirogacestat can be...
1035279,DB17472,Pirtobrutinib,DB18705,SARS-CoV-2 virus recombinant spike (S) protein...,The therapeutic efficacy of SARS-CoV-2 virus r...
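
With over a million rows, the interaction table may contain the same pair of drugs listed from both sides. A minimal sketch, assuming hypothetical sample pairs rather than the full dataframe, of collapsing directed rows into unordered pairs and a per-drug partner lookup:

```python
from collections import defaultdict

# Hypothetical (drugbank-id, drug-interactions.drugbank-id) pairs.
pairs = [
    ("DB00027", "DB00682"),
    ("DB00682", "DB00027"),  # same interaction, reversed direction
    ("DB00027", "DB00266"),
]

# Sorting each pair makes (a, b) and (b, a) identical, so the set drops repeats.
unique_pairs = sorted({tuple(sorted(pair)) for pair in pairs})

# Undirected adjacency: every drug maps to the set of its interaction partners.
partners = defaultdict(set)
for first, second in unique_pairs:
    partners[first].add(second)
    partners[second].add(first)
```

Whether both directions actually occur in a given DrugBank release is an assumption to verify against the data before relying on the reduced pair count.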


### Extract pathway information

In [21]:
idx = 0
data = defaultdict(dict)
prefix = True
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "pathways"
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    prefix = f"{key}."
    for element in elements:
        data[idx].update({"drugbank-id": drugbank_id})
        for subkey in DRUGBANK_PATHWAY_ELEMENTS:
            if subkey in {"smpdb-id", "name", "category"}:
                data[idx].update(
                    {f"{prefix}{subkey}": element.findtext(f"{DRUGBANK_NS}{subkey}")}
                )
            elif subkey == "drugs":
                # All drugbank IDs in this field will be redundant
                # as long as they also appear in the original drugbank ID column
                continue
            else:
                data[idx].update(
                    {
                        f"{prefix}uniprot-id": build_string(
                            [
                                subelem.text
                                for subelem in element.findall(
                                    f"{DRUGBANK_NS}{subkey}/{DRUGBANK_NS}uniprot-id"
                                )
                            ],
                            sep=";;",
                        )
                    }
                )

        idx += 1

df_pathways_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
dataframes["pathways"] = df_pathways_data
df_pathways_data

Unnamed: 0,drugbank-id,pathways.smpdb-id,pathways.name,pathways.category,pathways.uniprot-id
0,DB00086,SMP0000282,Streptokinase Action Pathway,drug_action,P00747;;P00748;;P02452;;P03952;;P03951;;P00740...
1,DB00114,SMP0000002,Carbamoyl Phosphate Synthetase Deficiency,disease,Q15758;;P43007;;P24298;;Q9UI32;;P00367;;P31327...
2,DB00114,SMP0000003,Argininosuccinic Aciduria,disease,Q15758;;P43007;;P24298;;Q9UI32;;P00367;;P31327...
3,DB00114,SMP0000004,Glycine and Serine Metabolism,metabolic,P21397;;P05091;;O75600;;Q9UI17;;Q9UL12;;P23378...
4,DB00114,SMP0000006,Tyrosine Metabolism,metabolic,P17735;;P17174;;P32754;;O43708;;P16930;;P20711...
...,...,...,...,...,...
2498,DB07780,SMP0000387,CHILD Syndrome,disease,Q9BWD1;;Q01581;;Q03426;;Q15126;;P53602;;Q13907...
2499,DB07780,SMP0000509,Hyper-IgD Syndrome,disease,Q9BWD1;;Q01581;;Q03426;;Q15126;;P53602;;Q13907...
2500,DB07780,SMP0000511,Wolman Disease,disease,Q9BWD1;;Q01581;;Q03426;;Q15126;;P53602;;Q13907...
2501,DB08231,SMP0000456,Fatty Acid Biosynthesis,metabolic,Q13085;;P49327
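
The `pathways.uniprot-id` column packs several identifiers into one string separated by `;;`. The notebook relies on `build_string` and `split_string` from `rbc_gem_utils` for this; the helpers below are hypothetical stand-ins whose behavior is assumed, shown only to illustrate the round trip:

```python
SEP = ";;"


def join_values(values, sep=SEP):
    """Join non-empty values, keeping first-seen order and dropping repeats."""
    unique = dict.fromkeys(value for value in values if value)
    return sep.join(unique)


def split_values(text, sep=SEP):
    """Split a joined string back into a list, ignoring empty pieces."""
    return [value for value in text.split(sep) if value]


joined = join_values(["Q13085", "P49327", "Q13085", None])
restored = split_values(joined)
```

A two-character separator such as `;;` keeps the split unambiguous even when a value might itself contain a single `;`.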


### Extract reactions

In [22]:
idx = 0
data = defaultdict(dict)
prefix = True
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "reactions"
    prefix = f"{key}."
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    for element in elements:
        key = "enzymes"
        for subelement in element.findall(
            f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}"
        ):
            data[idx].update({"drugbank-id": drugbank_id})
            data[idx].update(
                {
                    f"{prefix}{subkey}": element.findtext(
                        f"{DRUGBANK_NS}{subkey}/{DRUGBANK_NS}drugbank-id"
                    )
                    for subkey in ["left-element", "right-element"]
                }
            )
            data[idx].update(
                {
                    f"{prefix}{key}.{strip_ns_DrugBank(subelem.tag)}": subelem.text
                    for subelem in subelement
                    if has_value_type(subelem)
                }
            )
            idx += 1

df_reactions_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
dataframes["reactions"] = df_reactions_data
df_reactions_data

Unnamed: 0,drugbank-id,reactions.left-element,reactions.right-element,reactions.enzymes.drugbank-id,reactions.enzymes.name,reactions.enzymes.uniprot-id
0,DB00091,DB00091,DBMET02955,BE0002638,Cytochrome P450 3A4,P08684
1,DB00091,DB00091,DBMET02955,BE0002362,Cytochrome P450 3A5,P20815
2,DB00091,DB00091,DBMET02217,BE0002638,Cytochrome P450 3A4,P08684
3,DB00091,DB00091,DBMET02217,BE0002362,Cytochrome P450 3A5,P20815
4,DB00091,DB00091,DBMET00359,BE0002638,Cytochrome P450 3A4,P08684
...,...,...,...,...,...,...
2244,DB16650,DB16650,DBMET03545,BE0002433,Cytochrome P450 1A2,P05177
2245,DB16650,DB16650,DBMET03546,BE0003549,Cytochrome P450 2B6,P20813
2246,DB16650,DB16650,DBMET03546,BE0002363,Cytochrome P450 2D6,P10635
2247,DB16650,DB16650,DBMET03547,BE0004712,Cocaine esterase,O00748


### Extract SNPs

In [23]:
idx = 0
data = defaultdict(dict)
prefix = True
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    prefix = "snp."
    for key in ["snp-effects", "snp-adverse-drug-reactions"]:
        elements = drug.findall(
            f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key.split('-')[-1])}"
        )
        for element in elements:
            data[idx].update({"drugbank-id": drugbank_id})
            data[idx].update(
                {
                    f"{prefix}{strip_ns_DrugBank(subelement.tag)}": subelement.text
                    for subelement in element
                    if has_value_type(subelement)
                }
            )
            idx += 1
df_snp_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
dataframes["snp"] = df_snp_data
df_snp_data

Unnamed: 0,drugbank-id,snp.protein-name,snp.gene-symbol,snp.uniprot-id,snp.rs-id,snp.defining-change,snp.description,snp.pubmed-id,snp.allele,snp.adverse-reaction
0,DB00175,Kinesin-like protein KIF6,KIF6,Q6ZMV9,rs20455,C Allele,Patients with this genotype have a greater red...,18222353,,
1,DB00175,3-hydroxy-3-methylglutaryl-coenzyme A reductase,HMGCR,P04035,rs17244841,T Allele,Patients with this genotype have a lesser redu...,15199031,,
2,DB00176,Cytochrome P450 2D6,CYP2D6,P10635,rs35742686,2549delA,The presence of this polymorphism in CYP2D6 is...,25974703,CYP2D6*3,
3,DB00176,Cytochrome P450 2D6,CYP2D6,P10635,rs3892097,A allele,The presence of this polymorphism in CYP2D6 is...,25974703,CYP2D6*4,
4,DB00176,Cytochrome P450 2D6,CYP2D6,P10635,,Whole-gene deletion,The presence of this polymorphism in CYP2D6 is...,25974703,CYP2D6*5,
...,...,...,...,...,...,...,...,...,...,...
242,DB08916,Epidermal growth factor receptor,EGFR,P00533,rs121434568,T > G,The presence of this polymorphism in EGFR is a...,15118073,L858R,
243,DB08916,Epidermal growth factor receptor,EGFR,P00533,rs28929495,G > A or C or T,The presence of this polymorphism in EGFR is a...,15118073,G719A/C,
244,DB08930,UDP-glucuronosyltransferase 1-1,UGT1A1,P22309,rs8175347,extra TA in promoter,Poor drug metabolizer.,24329186,UGT1A1*28 or UGT1A 7/7,
245,DB08930,UDP-glucuronosyltransferase 1-1,UGT1A1,P22309,rs4148323,G > A,The presence of this polymorphism in UGT1A1 is...,24329186,UGT1A1*6,


### Extract products

A list of commercially available products that contain the drug, covering Canada and the United States; the output below also includes EU entries (source `EMA`).

In [24]:
idx = 0
data = defaultdict(dict)
prefix = True
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "products"
    prefix = f"{key}."
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    for element in elements:
        data[idx].update({"drugbank-id": drugbank_id})
        data[idx].update(
            {
                f"{prefix}{strip_ns_DrugBank(subelement.tag)}": subelement.text
                for subelement in element
                if has_value_type(subelement)
            }
        )
        idx += 1

df_products_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
dataframes["products"] = df_products_data
df_products_data

Unnamed: 0,drugbank-id,products.name,products.labeller,products.dpd-id,products.started-marketing-on,products.dosage-form,products.route,products.generic,products.over-the-counter,products.approved,products.country,products.source,products.ended-marketing-on,products.ndc-product-code,products.fda-application-number,products.ema-product-code,products.ema-ma-number,products.strength
0,DB00027,Antibiotic Cream,Cellchem Pharmaceuticals Inc.,02311844,2009-12-23,Cream,Topical,false,true,true,Canada,DPD,,,,,,
1,DB00027,Antibiotic Cream,Canadian Custom Packaging Company,02372029,2012-03-22,Cream,Topical,false,true,true,Canada,DPD,2020-09-11,,,,,
2,DB00027,Antibiotic Cream,Technilab Pharma Inc.,02208288,1998-11-03,Cream,Topical,false,true,true,Canada,DPD,2005-08-05,,,,,
3,DB00027,Antibiotic Cream for Kids,"Taro Pharmaceuticals, Inc.",02315033,2009-07-30,Cream,Topical,false,true,true,Canada,DPD,,,,,,
4,DB00027,Antibiotic Cream for Kids,Cellchem Pharmaceuticals Inc.,02317974,,Cream,Topical,false,true,true,Canada,DPD,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193796,DB17083,Yselty,Theramex Ireland Limited,,2023-05-04,"Tablet, film coated",Oral,false,false,true,EU,EMA,,,,EMEA/H/C/005442,EU/1/21/1606/004,200 mg
193797,DB17083,Yselty,Theramex Ireland Limited,,2022-07-15,"Tablet, film coated",Oral,false,false,true,EU,EMA,,,,EMEA/H/C/005442,EU/1/21/1606/001,100 mg
193798,DB17083,Yselty,Theramex Ireland Limited,,2022-07-15,"Tablet, film coated",Oral,false,false,true,EU,EMA,,,,EMEA/H/C/005442,EU/1/21/1606/002,200 mg
193799,DB17472,Jaypirca,Eli Lilly and Company,,2023-01-27,"Tablet, coated",Oral,false,false,true,US,FDA NDC,,0002-6902,NDA216059,,,50 mg/1


### Extract proteins

In [25]:
idx = 0
data = defaultdict(dict)
prefix = True
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    prefix = "proteins."
    for ptype in ["targets", "enzymes", "carriers", "transporters"]:
        elements = drug.findall(
            f"{DRUGBANK_NS}{ptype}/{DRUGBANK_NS}{strip_plural(ptype)}"
        )
        for element in elements:
            for subelement in element.findall(f"{DRUGBANK_NS}polypeptide"):
                data[idx].update({"drugbank-id": drugbank_id, f"{prefix}type": ptype})
                data[idx].update(
                    {
                        f"{prefix}{strip_ns_DrugBank(child.tag)}": child.text
                        for child in element
                        if has_value_type(child)
                    }
                )
                key = "actions"
                data[idx].update(
                    {
                        f"{prefix}{key}": build_string(
                            [
                                action.text
                                for action in element.findall(
                                    f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}"
                                )
                                if has_value_type(action)
                            ],
                            sep=";;",
                        )
                    }
                )
                key = "references"
                # For main dataframe, group all references and use the unique ID from DrugBank.
                # NOTE: the search starts at `drug`, not at the current protein `element`;
                # the "proteins.ref-id" column is empty in the output of this cell.
                ref_elements = [
                    ref.find(f"{DRUGBANK_NS}ref-id")
                    for subkey in ["articles", "textbooks", "links", "attachments"]
                    for ref in drug.findall(
                        f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{subkey}/{DRUGBANK_NS}{strip_plural(subkey)}"
                    )
                ]
                data[idx].update(
                    {
                        f"{prefix}ref-id": build_string(
                            [
                                ref_id.text
                                for ref_id in ref_elements
                                if has_value_type(ref_id)
                            ],
                            sep=";;",
                        )
                    }
                )

                # Polypeptide
                key = "polypeptide"
                data[idx].update(
                    {
                        f"{prefix}{key}.uniprot-id": subelement.get("id"),
                        f"{prefix}{key}.source": subelement.get("source"),
                    }
                )
                data[idx].update(
                    {
                        f"{prefix}{key}.{strip_ns_DrugBank(subelem.tag)}": subelem.text
                        for subelem in subelement
                        if has_value_type(subelem)
                    }
                )
                subkey = "pfams"
                data[idx].update(
                    {
                        f"{prefix}{key}.{subkey}": build_string(
                            [
                                subelem.text
                                for subelem in subelement.findall(
                                    f"{DRUGBANK_NS}{subkey}/{DRUGBANK_NS}{strip_plural(subkey)}/{DRUGBANK_NS}identifier"
                                )
                                if has_value_type(subelem)
                            ]
                        )
                    }
                )

                idx += 1

df_proteins_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
dataframes["proteins"] = df_proteins_data
df_proteins_data

Unnamed: 0,drugbank-id,proteins.type,proteins.id,proteins.name,proteins.organism,proteins.known-action,proteins.actions,proteins.ref-id,proteins.polypeptide.uniprot-id,proteins.polypeptide.source,...,proteins.polypeptide.theoretical-pi,proteins.polypeptide.molecular-weight,proteins.polypeptide.chromosome-location,proteins.polypeptide.organism,proteins.polypeptide.amino-acid-sequence,proteins.polypeptide.gene-sequence,proteins.polypeptide.pfams,proteins.polypeptide.signal-regions,proteins.induction-strength,proteins.inhibition-strength
0,DB00027,transporters,BE0001032,P-glycoprotein 1,Humans,unknown,substrate;;inhibitor,,P08183,Swiss-Prot,...,9.44,141477.255,7,Humans,>lcl|BSEQ0037114|Multidrug resistance protein ...,>lcl|BSEQ0016291|Multidrug resistance protein ...,PF00005;PF00664,,,
1,DB00030,targets,BE0000033,Insulin receptor,Humans,yes,agonist,,P06213,Swiss-Prot,...,6.18,156331.465,19,Humans,>lcl|BSEQ0036940|Insulin receptor\nMATGGRRGAAA...,>lcl|BSEQ0020443|Insulin receptor (INSR)\nATGG...,PF07714;PF00041;PF00757;PF01030,1-27,,
2,DB00030,targets,BE0000858,Insulin-like growth factor 1 receptor,Humans,unknown,activator,,P08069,Swiss-Prot,...,5.54,154791.73,15,Humans,>lcl|BSEQ0001710|Insulin-like growth factor 1 ...,>lcl|BSEQ0020490|Insulin-like growth factor 1 ...,PF07714;PF00757;PF01030,1-30,,
3,DB00030,targets,BE0001123,Carboxypeptidase E,Humans,unknown,modulator;;product of,,P16870,Swiss-Prot,...,4.78,53150.185,4,Humans,>lcl|BSEQ0002234|Carboxypeptidase E\nMAGRGGSAL...,>lcl|BSEQ0016324|Carboxypeptidase E (CPE)\nATG...,PF00246,1-25,,
4,DB00030,targets,BE0001147,Protein NOV homolog,Humans,unknown,downregulator,,P48745,Swiss-Prot,...,7.74,39161.82,8,Humans,>lcl|BSEQ0019069|Protein NOV homolog\nMQSVQSTS...,>lcl|BSEQ0019070|Protein NOV homolog (NOV)\nAT...,PF00093;PF00219;PF00007,1-31,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18135,DB17472,enzymes,BE0002363,Cytochrome P450 2D6,Humans,unknown,inhibitor,,P10635,Swiss-Prot,...,7.26,55768.94,22,Humans,>lcl|BSEQ0004641|Cytochrome P450 2D6\nMGLEALVP...,>lcl|BSEQ0019275|Cytochrome P450 2D6 (CYP2D6)\...,PF00067,,,unknown
18136,DB17472,enzymes,BE0002362,Cytochrome P450 3A5,Humans,unknown,inducer,,P20815,Swiss-Prot,...,9.09,57108.065,7,Humans,>lcl|BSEQ0004639|Cytochrome P450 3A5\nMDLIPNLA...,>lcl|BSEQ0016766|Cytochrome P450 3A5 (CYP3A5)\...,PF00067,,unknown,
18137,DB17472,transporters,BE0001032,P-glycoprotein 1,Humans,unknown,substrate;;inhibitor,,P08183,Swiss-Prot,...,9.44,141477.255,7,Humans,>lcl|BSEQ0037114|Multidrug resistance protein ...,>lcl|BSEQ0016291|Multidrug resistance protein ...,PF00005;PF00664,,,
18138,DB17472,transporters,BE0001067,ATP-binding cassette sub-family G member 2,Humans,unknown,substrate;;inhibitor,,Q9UNQ0,Swiss-Prot,...,8.9,72313.47,4,Humans,>lcl|BSEQ0002125|ATP-binding cassette sub-fami...,>lcl|BSEQ0016303|ATP-binding cassette sub-fami...,PF00005;PF01061,,,


#### Pharmacologically active

In [26]:
df_pharm = df_proteins_data[df_proteins_data["proteins.known-action"] == "yes"]
df_pharm = df_pharm.groupby(["proteins.id", "proteins.polypeptide.uniprot-id"])
df_pharm = df_pharm["drugbank-id"].agg(lambda x: build_string(x.unique())).reset_index()
df_pharm = df_pharm.rename(
    {
        "proteins.id": "bioentity",
        "proteins.polypeptide.uniprot-id": "uniprot",
        "drugbank-id": "drugbank",
    },
    axis=1,
)
df_pharm

Unnamed: 0,bioentity,uniprot,drugbank
0,BE0000005,P35228,DB05383
1,BE0000008,Q13423,DB09092
2,BE0000011,P50213,DB09092
3,BE0000012,Q9P0X4,DB00381;DB00909;DB01388
4,BE0000013,P30542,DB00201;DB00277;DB00640;DB00651;DB01223;DB01303
...,...,...,...
936,BE0010221,F5HC79,DB12070
937,BE0010221,F5HCU8,DB12070
938,BE0010221,F5HGI9,DB12070
939,BE0010514,Q9Y243,DB12218


### Create main dataframe
#### Map to model
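The commented-out blocks in the cell below all share one pattern: a left merge onto the main dataframe, with `suffixes=("", "_drop")` so that overlapping columns from the right-hand table can be identified and discarded afterwards. A minimal sketch of that pattern on hypothetical toy frames (the column values are illustrative only):

```python
import pandas as pd

# Toy stand-ins for the main dataframe and one of the extracted tables.
df_left = pd.DataFrame(
    {"drugbank-id": ["DB00027", "DB00035"], "name": ["Gramicidin D", "Desmopressin"]}
)
df_right = pd.DataFrame(
    {
        "drugbank-id": ["DB00027", "DB00027"],
        "name": ["Gramicidin D", "Gramicidin D"],  # overlaps with df_left
        "prices.cost": ["23.12", "240.0"],
    }
)

merged = df_left.merge(
    df_right,
    on="drugbank-id",
    how="left",  # keep drugs that have no rows in the right-hand table
    suffixes=("", "_drop"),  # only right-hand duplicates get a suffix
)
# Remove the duplicated right-hand columns marked by the suffix.
merged = merged.drop([c for c in merged.columns if c.endswith("_drop")], axis=1)
merged = merged.drop_duplicates().reset_index(drop=True)
```

Because the merge is one-to-many, each drug is repeated once per matching row, which is why the notebook warns about the number of entries.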

In [27]:
if not drugbank_ids:
    drugbank_ids = [
        ensure_iterable(gene.annotation.get("drugbank"))
        for gene in model.genes
        if gene.annotation.get("drugbank")
    ]
    drugbank_ids = sorted(
        set([db_id for db_id_list in drugbank_ids for db_id in db_id_list])
    )
df_main = dataframes["general"].copy()
df_main = df_main[df_main["drugbank-id"].isin(drugbank_ids)].copy()

# # Add SNP to main dataframe
# if dataframes.get("snp") is not None:
#     merge_key = "snp.rs-id"
#     df_main[merge_key] = df_main[f"{merge_key}s"].apply(lambda x: split_string(x, sep=";;"))
#     df_main = df_main.explode(merge_key)
#     df_main = df_main.merge(
#         dataframes["snp"],
#         left_on=merge_key,
#         right_on=merge_key,
#         how="left",
#         suffixes=("", "_drop"),
#     )
#     df_main = df_main.drop([c for c in df_main.columns if c.endswith("_drop")], axis=1)
#     df_main = df_main.drop_duplicates().reset_index(drop=True)


# # Add ATC codes to main dataframe, adds 5 entries per code
# if dataframes.get("atc-codes") is not None:
#     merge_key = "atc-codes.atc-code"
#     df_main[merge_key] = df_main[f"{merge_key}s"].apply(lambda x: split_string(x, sep=";;"))
#     df_main = df_main.explode(merge_key)
#     df_main = df_main.merge(
#         dataframes["atc-codes"],
#         left_on=merge_key,
#         right_on=merge_key,
#         how="left",
#         suffixes=("", "_drop"),
#     )
#     df_main = df_main.drop([c for c in df_main.columns if c.endswith("_drop")], axis=1)
#     df_main = df_main.drop_duplicates().reset_index(drop=True)

# # Add salts to main dataframe, caution with use due to number of entries
# if dataframes.get("salts") is not None:
#     merge_key = "salts.drugbank-id"
#     df_main[merge_key] = df_main[f"{merge_key}s"].apply(lambda x: split_string(x, sep=";;"))
#     df_main = df_main.explode(merge_key)
#     df_main = df_main.merge(
#         dataframes["salts"],
#         left_on=merge_key,
#         right_on=merge_key,
#         how="left",
#         suffixes=("", "_drop"),
#     )
#     df_main = df_main.drop([c for c in df_main.columns if c.endswith("_drop")], axis=1)
#     df_main = df_main.drop_duplicates().reset_index(drop=True)


# # Add mixtures to main dataframe, caution with use due to number of entries
# if dataframes.get("mixtures") is not None:
#     merge_key = "mixtures.name"
#     df_main[merge_key] = df_main[f"{merge_key}s"].apply(lambda x: split_string(x, sep=";;"))
#     df_main = df_main.explode(merge_key)
#     df_main = df_main.merge(
#         dataframes["mixtures"],
#         left_on=merge_key,
#         right_on=merge_key,
#         how="left",
#         suffixes=("", "_drop"),
#     )
#     df_main = df_main.drop([c for c in df_main.columns if c.endswith("_drop")], axis=1)
#     df_main = df_main.drop_duplicates().reset_index(drop=True)

# # Add pathways to main dataframe, caution with use due to number of entries
# if dataframes.get("pathways") is not None:
#     merge_key = "pathways.smpdb-id"
#     df_main[merge_key] = df_main[f"{merge_key}s"].apply(lambda x: split_string(x, sep=";;"))
#     df_main = df_main.explode(merge_key)
#     df_main = df_main.merge(
#         dataframes["pathways"],
#         left_on=merge_key,
#         right_on=merge_key,
#         how="left",
#         suffixes=("", "_drop"),
#     )
#     df_main = df_main.drop([c for c in df_main.columns if c.endswith("_drop")], axis=1)
#     df_main = df_main.drop_duplicates().reset_index(drop=True)

# # Add prices to main dataframe, caution with use due to number of entries
# if dataframes.get("prices") is not None:
#     merge_key = "drugbank-id"
#     df_main = df_main.merge(
#         dataframes["prices"],
#         left_on=merge_key,
#         right_on=merge_key,
#         how="left",
#         suffixes=("", "_drop"),
#     )
#     df_main = df_main.drop([c for c in df_main.columns if c.endswith("_drop")], axis=1)
#     df_main = df_main.drop_duplicates().reset_index(drop=True)

# # Add manufacturers to main dataframe, caution with use due to number of entries
# if dataframes.get("manufacturers") is not None:
#     merge_key = "drugbank-id"
#     df_main = df_main.merge(
#         dataframes["manufacturers"],
#         left_on=merge_key,
#         right_on=merge_key,
#         how="left",
#         suffixes=("", "_drop"),
#     )
#     df_main = df_main.drop([c for c in df_main.columns if c.endswith("_drop")], axis=1)
#     df_main = df_main.drop_duplicates().reset_index(drop=True)


# # Add dosages to main dataframe, caution with use due to number of entries
# if dataframes.get("dosages") is not None:
#     merge_key = "drugbank-id"
#     df_main = df_main.merge(
#         dataframes["dosages"],
#         left_on=merge_key,
#         right_on=merge_key,
#         how="left",
#         suffixes=("", "_drop"),
#     )
#     df_main = df_main.drop([c for c in df_main.columns if c.endswith("_drop")], axis=1)
#     df_main = df_main.drop_duplicates().reset_index(drop=True)

# # Add reactions to main dataframe, caution with use due to number of entries
# if dataframes.get("reactions") is not None:
#     merge_key = "drugbank-id"
#     df_main = df_main.merge(
#         dataframes["reactions"],
#         left_on=merge_key,
#         right_on=merge_key,
#         how="left",
#         suffixes=("", "_drop"),
#     )
#     df_main = df_main.drop([c for c in df_main.columns if c.endswith("_drop")], axis=1)
#     df_main = df_main.drop_duplicates().reset_index(drop=True)

# Add products to main dataframe, caution with use due to number of entries
# if dataframes.get("products") is not None:
#     merge_key = "drugbank-id"
#     df_main = df_main.merge(
#         dataframes["products"],
#         left_on=merge_key,
#         right_on=merge_key,
#         how="left",
#         suffixes=("", "_drop"),
#     )
#     df_main = df_main.drop([c for c in df_main.columns if c.endswith("_drop")], axis=1)
#     df_main = df_main.drop_duplicates().reset_index(drop=True)

# Add proteins to main dataframe, caution with use due to number of entries
# if dataframes.get("proteins") is not None:
#     merge_key = "drugbank-id"
#     df_main = df_main.merge(
#         dataframes["proteins"],
#         left_on=merge_key,
#         right_on=merge_key,
#         how="left",
#         suffixes=("", "_drop"),
#     )
#     df_main = df_main.drop([c for c in df_main.columns if c.endswith("_drop")], axis=1)
#     df_main = df_main.drop_duplicates().reset_index(drop=True)

df_main

Unnamed: 0,drugbank-id,type,created,updated,name,description,cas-number,unii,average-mass,monoisotopic-mass,...,experimental-properties.logp,experimental-properties.logp.source,experimental-properties.caco2-permeability,experimental-properties.caco2-permeability.source,experimental-properties.pka,experimental-properties.pka.source,experimental-properties.logs,experimental-properties.logs.source,experimental-properties.radioactivity,experimental-properties.radioactivity.source
0,DB00027,small molecule,2005-06-13,2024-01-02,Gramicidin D,Gramcidin D is a heterogeneous mixture of thre...,1405-97-6,5IE62321P4,1811.253,1810.033419343,...,,,,,,,,,,
1,DB00030,biotech,2005-06-13,2024-01-02,Insulin human,"Human Insulin, also known as Regular Insulin, ...",11061-68-0,1Y17CTI5SR,,,...,,,,,,,,,,
2,DB00035,small molecule,2005-06-13,2024-01-02,Desmopressin,"Desmopressin (dDAVP), a synthetic analogue of ...",16679-58-6,ENR1LLB0FP,1069.22,1068.426955905,...,,,,,,,,,,
3,DB00041,biotech,2005-06-13,2024-01-02,Aldesleukin,"Aldesleukin, a lymphokine, is produced by reco...",110942-02-4,M89N0Q7EQR,,,...,,,,,,,,,,
4,DB00046,biotech,2005-06-13,2024-01-02,Insulin lispro,Insulin lispro is a rapid-acting form of insul...,133107-64-9,GFX7QIS1II,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2060,DB16732,biotech,2021-09-21,2021-11-08,Tisotumab vedotin,Tisotumab vedotin is a tissue factor-directed ...,1418731-10-8,T41737F88A,,,...,,,,,,,,,,
2061,DB16826,small molecule,2022-07-15,2023-12-22,Repotrectinib,Repotrectinib is a next-generation tyrosine ki...,1802220-02-5,08O3FQ4UNP,355.373,355.144453003,...,,,,,,,,,,
2062,DB17083,small molecule,2022-10-26,2022-12-13,Linzagolix,"Linzagolix is a non-peptide, selective antagon...",935283-04-8,7CDW97HUEX,508.42,508.055206494,...,,,,,,,,,,
2063,DB17472,small molecule,2023-01-30,2023-12-07,Pirtobrutinib,Pirtobrutinib is a small molecule and a highly...,2101700-15-4,JNA39I7ZVB,479.436,479.158052208,...,,,,,,,,,,
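
The commented-out blocks above all repeat one pattern: left-merge on `drugbank-id`, drop the columns that received the `_drop` suffix, then drop duplicate rows. A minimal sketch of how that pattern could be factored into one helper is shown below. The function name `merge_optional_table` and the toy data are hypothetical and are not part of `rbc_gem_utils` or DrugBank.

```python
import pandas as pd


def merge_optional_table(df_main, df_other, merge_key="drugbank-id"):
    """Left-merge df_other into df_main, discarding columns that overlap with df_main."""
    if df_other is None:
        # Nothing to merge, return the input unchanged
        return df_main
    merged = df_main.merge(
        df_other, on=merge_key, how="left", suffixes=("", "_drop")
    )
    # Overlapping columns from df_other carry the "_drop" suffix
    merged = merged.drop(columns=[c for c in merged.columns if c.endswith("_drop")])
    return merged.drop_duplicates().reset_index(drop=True)


# Toy data invented for illustration (not DrugBank records)
df_main_demo = pd.DataFrame({"drugbank-id": ["DB1", "DB2"], "name": ["A", "B"]})
df_prices_demo = pd.DataFrame(
    {
        "drugbank-id": ["DB1", "DB1"],
        "name": ["A (price table)", "A (price table)"],
        "prices.cost": ["1.00", "2.50"],
    }
)

df_merged_demo = merge_optional_table(df_main_demo, df_prices_demo)
print(df_merged_demo)
```

Because each merge is one-to-many, the row count grows with every table added (two input rows become three here), which is the reason for the caution noted in the comments.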


### Aggregate all alternate aliases into a single column

In [28]:
df_aliases = (
    pd.concat(
        tuple(
            [
                df[["drugbank-id", key]].rename({key: "aliases"}, axis=1)
                for df, key in zip(
                    [
                        df_products_data,
                        df_drugbank_synonym_data,
                        df_drugbank_intbrand_data,
                    ],
                    ["products.name", "synonyms.synonyms", "international-brands.name"],
                )
            ]
        )
    )
    .groupby("drugbank-id")
    .agg(lambda x: build_string(x.unique(), sep=";;"))
)
df_aliases = df_aliases.reset_index(drop=False)
df_aliases

Unnamed: 0,drugbank-id,aliases
0,DB00027,Antibiotic Cream;;Antibiotic Cream for Kids;;A...
1,DB00030,Actraphane 30;;Actraphane 30 Flexpen;;Actrapha...
2,DB00035,Apo-desmopressin;;Bipazen;;Ddavp;;Ddavp Inj 4m...
3,DB00041,Proleukin;;125-L-serine-2-133-interleukin 2 (h...
4,DB00046,Admelog;;Admelog Solostar;;Humalog;;Humalog (c...
...,...,...
1427,DB16732,Tivdak;;HuMax-TF-ADC;;Tisotumab vedotin;;tisot...
1428,DB16826,"(3R,6S,)-45-FLUORO-3,6-DIMETHYL-5-OXA-2,8-DIAZ..."
1429,DB17083,"KLH-2109 Choline;;Yselty;;3-(5-((2,3-difluoro-..."
1430,DB17472,Jaypirca;;(s)-5-amino-3-(4-((5-fluoro-2-methox...
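
The aggregation can be illustrated on toy data. This sketch assumes that `build_string` joins the values it receives with the given separator, so it substitutes a plain `";;".join`; the table contents are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the products and synonyms tables
df_products_demo = pd.DataFrame(
    {
        "drugbank-id": ["DB1", "DB1", "DB2"],
        "products.name": ["Brand A", "Brand A", "Brand B"],
    }
)
df_synonyms_demo = pd.DataFrame(
    {"drugbank-id": ["DB1", "DB2"], "synonyms.synonyms": ["syn-a", "Brand B"]}
)

# Rename each source column to a shared "aliases" column before stacking
frames = [
    df[["drugbank-id", key]].rename(columns={key: "aliases"})
    for df, key in [
        (df_products_demo, "products.name"),
        (df_synonyms_demo, "synonyms.synonyms"),
    ]
]

df_aliases_demo = (
    pd.concat(frames)
    .dropna(subset=["aliases"])
    .groupby("drugbank-id")["aliases"]
    # unique() keeps first-seen order, so product names come before synonyms
    .agg(lambda x: ";;".join(x.unique()))
    .reset_index()
)
print(df_aliases_demo)
```

Duplicates are removed within each drug, both inside one source table and across tables, so an alias that appears as a product name and as a synonym is listed once.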


### Extract reference information

In [29]:
data = {}
idx = 0
prefix = True
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # if drugbank_id not in drugbank_ids:
    #     continue
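    # Keep the parent element under its own name so the inner loop below
    # (which rebinds `element`) does not overwrite it between reference types.
    references = drug.find(f"{DRUGBANK_NS}general-references")
    if references is None:
        continue
    for key in ["articles", "textbooks", "links", "attachments"]:
        elements = references.findall(
            f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}"
        )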
        for element in elements:
            data[idx] = {
                strip_ns_DrugBank(subelement.tag): subelement.text
                for subelement in element
                if has_value_type(subelement)
            }
            data[idx].update(
                {
                    # "drugbank-id": drugbank_id,
                    "type": strip_ns_DrugBank(element.tag),
                }
            )

            idx += 1

df_drugbank_reference_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .drop_duplicates()
    .reset_index(drop=True)
)
df_drugbank_reference_data

Unnamed: 0,ref-id,pubmed-id,citation,type,title,url,isbn
0,A1,16244762,"Smythe MA, Stephens JL, Koerber JM, Mattson JC...",article,,,
1,A2,16690967,"Tardy B, Lecompte T, Boelhen F, Tardy-Poncet B...",article,,,
2,A3,16241940,"Lubenow N, Eichler P, Lietz T, Greinacher A: L...",article,,,
3,A246609,19707378,Petros S: Lepirudin in the management of patie...,article,,,
4,A246624,25294122,"Chapin JC, Hajjar KA: Fibrinolysis and the con...",article,,,
...,...,...,...,...,...,...,...
16127,A33650,26254266,"Mehta HM, Malandra M, Corey SJ: G-CSF and GM-C...",article,,,
16128,A262172,17402806,"Cosler LE, Eldar-Lissai A, Culakova E, Kuderer...",article,,,
16129,A262551,37675837,"Borralleras C, Castrodeza Sanz J, Arrazola P, ...",article,,,
16130,L49156,,,link,Anti-SEZ6 Antibody-drug Conjugate ABBV-706 (Co...,https://ncit.nci.nih.gov/ncitbrowser/ConceptRe...,
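
The traversal of `general-references` can be tried on a small hand-written XML fragment. This is a sketch under stated assumptions: the namespace string is assumed to match `DRUGBANK_NS`, the singular tag names are written out rather than derived with `strip_plural`, and the XML content is invented for illustration.

```python
from xml.etree import ElementTree

NS = "{http://www.drugbank.ca}"  # assumed to match DRUGBANK_NS

xml_text = """
<drugbank xmlns="http://www.drugbank.ca">
  <drug>
    <drugbank-id primary="true">DB0001</drugbank-id>
    <general-references>
      <articles>
        <article><ref-id>A1</ref-id><pubmed-id>111</pubmed-id><citation>Toy citation</citation></article>
      </articles>
      <textbooks/>
      <links>
        <link><ref-id>L1</ref-id><title>Toy link</title><url>https://example.org</url></link>
      </links>
      <attachments/>
    </general-references>
  </drug>
</drugbank>
""".strip()

root_demo = ElementTree.fromstring(xml_text)

records = []
for drug in root_demo:
    drugbank_id = drug.findtext(f"{NS}drugbank-id[@primary='true']")
    references = drug.find(f"{NS}general-references")
    if references is None:
        continue
    # (container tag, item tag) pairs for each reference type
    for group, item in [
        ("articles", "article"),
        ("textbooks", "textbook"),
        ("links", "link"),
        ("attachments", "attachment"),
    ]:
        for ref in references.findall(f"{NS}{group}/{NS}{item}"):
            # Strip the namespace from each child tag to get plain field names
            record = {child.tag.replace(NS, ""): child.text for child in ref}
            record["type"] = item
            record["drugbank-id"] = drugbank_id
            records.append(record)

for record in records:
    print(record)
```

Keeping the parent element (`references`) and the loop variable (`ref`) under separate names means every reference type is searched from the same parent, so both the article and the link are collected.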


In [30]:
df_refs = df_main.copy()
df_refs["ref-id"] = df_refs["general-references.ref-id"].apply(
    lambda x: split_string(x, sep=";;")
)
df_refs = df_refs.explode("ref-id")
df_refs = df_refs.loc[:, ["ref-id"]].replace("", float("nan")).dropna()
df_refs = df_refs.merge(
    df_drugbank_reference_data, left_on="ref-id", right_on="ref-id", how="left"
)
df_refs = df_refs.drop_duplicates()
df_refs

Unnamed: 0,ref-id,pubmed-id,citation,type,title,url,isbn
0,A33,8810522,"Ketchem RR, Lee KC, Huo S, Cross TA: Macromole...",article,,,
1,A34,11570868,"Townsley LE, Tucker WA, Sham S, Hinton JF: Str...",article,,,
2,A35,10397797,"Burkhart BM, Gassman RM, Langs DA, Pangborn WA...",article,,,
3,A40,23512415,"Herrmann BL, Kasser C, Keuthage W, Huptas M, D...",article,,,
4,A41,11118018,"Lepore M, Pampanelli S, Fanelli C, Porcellati ...",article,,,
...,...,...,...,...,...,...,...
9495,L44873,,,,,,
9496,A261685,27188687,"Khan SR, Pearle MS, Robertson WG, Gambaro G, C...",article,,,
9497,A261690,36407951,"Liu A, Zhao J, Shah M, Migliorati JM, Tawfik S...",article,,,
9498,L48320,,,,,,
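
The split, explode, and merge steps can be followed on toy data. In this sketch, `str.split` stands in for `split_string` (assumed to split on the separator), and the `indicator=True` option of `merge` is added to flag reference IDs that have no match in the reference table. All values are invented for illustration.

```python
import pandas as pd

df_main_demo = pd.DataFrame(
    {
        "drugbank-id": ["DB1", "DB2", "DB3"],
        "general-references.ref-id": ["A1;;L1", "A1", ""],
    }
)
df_reference_demo = pd.DataFrame(
    {"ref-id": ["A1"], "citation": ["Toy citation"], "type": ["article"]}
)

df_refs_demo = df_main_demo.copy()
# One list of reference IDs per drug, then one row per reference ID
df_refs_demo["ref-id"] = df_refs_demo["general-references.ref-id"].str.split(";;")
df_refs_demo = df_refs_demo.explode("ref-id")
# Drugs without references produce an empty string, which is dropped here
df_refs_demo = df_refs_demo.loc[:, ["ref-id"]].replace("", float("nan")).dropna()
df_refs_demo = df_refs_demo.merge(
    df_reference_demo, on="ref-id", how="left", indicator=True
)
df_refs_demo = df_refs_demo.drop_duplicates().reset_index(drop=True)

# Reference IDs that were cited but not found in the reference table
unmatched = df_refs_demo.loc[df_refs_demo["_merge"] == "left_only", "ref-id"].tolist()
print(df_refs_demo)
print(unmatched)
```

Rows flagged `left_only` correspond to rows whose fields are all empty after the merge, which makes missing reference entries easy to list.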


### Extract interaction information

In [31]:
data = df_drug_interactions.copy()
data = data[data["drugbank-id"].isin(drugbank_ids)]
data = data[data["drug-interactions.drugbank-id"].isin(drugbank_ids)]
data = data.reset_index(drop=True)
# Build order-independent keys for each interaction pair
# (sorting makes A/B and B/A produce the same key)
data["interaction.ids"] = data[["drugbank-id", "drug-interactions.drugbank-id"]].agg(
    lambda x: build_string(sorted(x), sep=";;"), axis=1
)
data["interaction.names"] = data[["name", "drug-interactions.name"]].agg(
    lambda x: build_string(sorted(x), sep=";;"), axis=1
)

data = data[~data["interaction.ids"].duplicated()]
data = data[~data["interaction.names"].duplicated()]
data

Unnamed: 0,drugbank-id,name,drug-interactions.drugbank-id,drug-interactions.name,drug-interactions.description,interaction.ids,interaction.names
0,DB00027,Gramicidin D,DB01418,Acenocoumarol,The risk or severity of bleeding can be increa...,DB00027;;DB01418,Acenocoumarol;;Gramicidin D
1,DB00027,Gramicidin D,DB08794,Ethyl biscoumacetate,The risk or severity of bleeding can be increa...,DB00027;;DB08794,Ethyl biscoumacetate;;Gramicidin D
2,DB00027,Gramicidin D,DB00281,Lidocaine,The risk or severity of methemoglobinemia can ...,DB00027;;DB00281,Gramicidin D;;Lidocaine
3,DB00027,Gramicidin D,DB00721,Procaine,The risk or severity of methemoglobinemia can ...,DB00027;;DB00721,Gramicidin D;;Procaine
4,DB00027,Gramicidin D,DB00814,Meloxicam,The risk or severity of methemoglobinemia can ...,DB00027;;DB00814,Gramicidin D;;Meloxicam
...,...,...,...,...,...,...,...
428185,DB16650,Deucravacitinib,DB17472,Pirtobrutinib,The risk or severity of adverse effects can be...,DB16650;;DB17472,Deucravacitinib;;Pirtobrutinib
428249,DB16690,Tegoprazan,DB16703,Belumosudil,The serum concentration of Belumosudil can be ...,DB16690;;DB16703,Belumosudil;;Tegoprazan
428675,DB16703,Belumosudil,DB17472,Pirtobrutinib,The risk or severity of adverse effects can be...,DB16703;;DB17472,Belumosudil;;Pirtobrutinib
428685,DB16703,Belumosudil,DB16826,Repotrectinib,The serum concentration of Repotrectinib can b...,DB16703;;DB16826,Belumosudil;;Repotrectinib
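
The deduplication relies on a key that is the same regardless of which drug of a pair is listed first. A minimal sketch on toy data, with a plain `";;".join` standing in for `build_string` (assumed to join values with the separator):

```python
import pandas as pd

# Toy interactions: the first two rows describe the same pair in both directions
df_interactions_demo = pd.DataFrame(
    {
        "drugbank-id": ["DB1", "DB2", "DB1"],
        "name": ["Drug A", "Drug B", "Drug A"],
        "drug-interactions.drugbank-id": ["DB2", "DB1", "DB3"],
        "drug-interactions.name": ["Drug B", "Drug A", "Drug C"],
    }
)

# Sorting the two IDs before joining makes the key order-independent
pair_key = df_interactions_demo[
    ["drugbank-id", "drug-interactions.drugbank-id"]
].apply(lambda row: ";;".join(sorted(row)), axis=1)

# Keep the first occurrence of each unordered pair
df_unique_demo = df_interactions_demo[~pair_key.duplicated()].reset_index(drop=True)
print(pair_key.tolist())
print(df_unique_demo)
```

Only the first row of each unordered pair is kept, so the description retained is the one written from the perspective of whichever drug appears first in the table.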
