# Extract drug data from DrugBank

The purpose of this notebook is to extract and format drug interaction data from DrugBank for subsequent visualization.

## DRUGBANK ONLINE
To utilize this notebook: 

1. Go to [DrugBank database](https://go.drugbank.com/releases/latest) and create an account.
2. Follow the instructions to obtain a free academic license.
3. Download and unzip the database file `"drugbank_all_full_database.xml.zip"`.
4. Rename the file `"full database.xml"` to `"drugbank_all_full_database.xml"`.
5. Remember to clear out any personal account information and ensure the downloaded DrugBank file remains local!
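
Steps 3 and 4 can also be scripted. The following is a minimal sketch using only the standard library; the helper name `extract_and_rename` is hypothetical and not part of `rbc_gem_utils`, and the archive built in the demonstration is a stand-in for the real DrugBank download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_rename(zip_path, out_dir):
    """Unzip the DrugBank archive and rename the XML file (steps 3 and 4)."""
    out_dir = Path(out_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extract("full database.xml", path=out_dir)
    target = out_dir / "drugbank_all_full_database.xml"
    (out_dir / "full database.xml").replace(target)
    return target


# Demonstration with a stand-in archive; the real file comes from DrugBank.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "drugbank_all_full_database.xml.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("full database.xml", "<drugbank/>")
    renamed = extract_and_rename(demo_zip, tmp)
    renamed_name = renamed.name
    renamed_text = renamed.read_text()
    print(renamed_name)
```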

The function `download_database_DrugBank` takes a username and a password, downloads the data, and renames the file in the process.

Fields for the DrugBank XML schema are found [here](https://docs.drugbank.com/xml/#introduction).

Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, Sajed T, Johnson D, Li C, Sayeeda Z, Assempour N, Iynkkaran I, Liu Y, Maciejewski A, Gale N, Wilson A, Chin L, Cummings R, Le D, Pon A, Knox C, Wilson M. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2017 Nov 8. doi: 10.1093/nar/gkx1037.

## Setup
### Import packages

In [1]:
from collections import defaultdict
from xml.etree import ElementTree

import pandas as pd
from rbc_gem_utils import (
    ANNOTATION_PATH,
    DATABASE_PATH,
    INTERIM_PATH,
    ROOT_PATH,
    build_string,
    check_version,
    get_annotation_df,
    read_rbc_model,
    show_versions,
    split_string,
)
from rbc_gem_utils.database.drugbank import (
    DRUGBANK_GENERAL_ELEMENTS,
    DRUGBANK_NS,
    DRUGBANK_PATH,
    DRUGBANK_VERSION_EXPECTED,
    download_database_DrugBank,
    get_version_DrugBank,
    strip_ns_DrugBank,
)
from rbc_gem_utils.util import has_value_type, strip_plural

# Display the package versions from the last successful run of this notebook
show_versions()


Package Information
-------------------
rbc-gem-utils 0.0.1

Dependency Information
----------------------
beautifulsoup4                       4.12.3
bio                                   1.6.2
cobra                                0.29.0
depinfo                               2.2.0
kaleido                               0.2.1
matplotlib                            3.8.2
memote                               0.17.0
networkx                              3.2.1
notebook                              7.0.7
openpyxl                              3.1.2
pandas                                2.2.0
pre-commit                            3.6.0
pyvis                                 0.3.2
rbc-gem-utils[database,network,vis] missing
requests                             2.31.0
scipy                                1.12.0
seaborn                              0.13.2

Build Tools Information
-----------------------
pip        23.3.1
setuptools 68.2.2
wheel      0.41.2

Platform Information
-------------------

## Check DrugBank version
If the version does not match the expected version, the database has most likely been updated since this code was last run.
### Expected DrugBank version: 5.1.13
* Last release utilized: [5.1.13](https://go.drugbank.com/releases) published on **2025-01-02**
* Version in the DrugBank file is formatted as {major}.{minor}
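
Because the file reports only `{major}.{minor}` while the release is numbered `5.1.13`, a comparison has to ignore the trailing component. The sketch below shows one way to do that; `matches_major_minor` is a hypothetical helper and is not claimed to reflect how `check_version` in `rbc_gem_utils` behaves.

```python
def matches_major_minor(file_version, expected_version):
    """Compare only the leading {major}.{minor} parts of two version strings."""
    return file_version.split(".")[:2] == expected_version.split(".")[:2]


file_version = "5.1"  # format used inside the DrugBank file
expected_version = "5.1.13"  # last release utilized
is_match = matches_major_minor(file_version, expected_version)
print(is_match)
```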

In [2]:
database_dirpath = ROOT_PATH / DATABASE_PATH / DRUGBANK_PATH
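annotation_dirpath = ROOT_PATH / ANNOTATION_PATH
# The version check below is disabled, so use the expected version for the download cell
version = DRUGBANK_VERSION_EXPECTED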
# version = get_version_DrugBank()
# if check_version(version, DRUGBANK_VERSION_EXPECTED, verbose=True):
#     database_dirpath = ROOT_PATH / DATABASE_PATH / DRUGBANK_PATH
#     annotation_dirpath = ROOT_PATH / ANNOTATION_PATH
# else:
#     database_dirpath = ROOT_PATH / INTERIM_PATH / DRUGBANK_PATH
#     annotation_dirpath = ROOT_PATH / INTERIM_PATH
#     version = DRUGBANK_VERSION_EXPECTED

#### Download new files and update database
If an argument is not provided (`arg=None`), the repository's default value is used.
A username and password must be provided for this function. Be sure to remove personal information after use!
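
One way to avoid leaving credentials in the notebook is to read them from environment variables. This is a sketch only: the variable names `DRUGBANK_USERNAME` and `DRUGBANK_PASSWORD` and the helper `get_drugbank_credentials` are assumptions made for illustration, not something `rbc_gem_utils` defines.

```python
import os


def get_drugbank_credentials(environ=os.environ):
    """Read credentials from a mapping (defaults to the process environment)."""
    username = environ.get("DRUGBANK_USERNAME")  # hypothetical variable name
    password = environ.get("DRUGBANK_PASSWORD")  # hypothetical variable name
    if not username or not password:
        raise RuntimeError("Set DRUGBANK_USERNAME and DRUGBANK_PASSWORD first.")
    return username, password


# Demonstration with a stand-in mapping rather than real credentials
demo_env = {"DRUGBANK_USERNAME": "USERNAME", "DRUGBANK_PASSWORD": "PASSWORD"}
username, password = get_drugbank_credentials(demo_env)
```

The returned values could then be passed as the `username` and `password` arguments of `download_database_DrugBank`, so nothing personal is saved with the notebook.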

In [3]:
download = False
if download:
    # Download data
    download_database_DrugBank(
        username="USERNAME",
        password="PASSWORD",
        database_dirpath=database_dirpath,
        version=version,
    )
filepath = database_dirpath / "drugbank_all_full_database.xml"

## Load RBC-GEM model

In [4]:
model = read_rbc_model(filetype="xml")
model

Set parameter Username
Academic license - for non-commercial use only - expires 2025-11-21


0,1
Name,RBC_GEM
Memory address,152b9edd0
Number of metabolites,2157
Number of reactions,3275
Number of genes,820
Number of groups,78
Objective expression,1.0*NaKt - 1.0*NaKt_reverse_db47e
Compartments,"cytosol, extracellular space"


In [5]:
annotation_type = "genes"
df_model_mappings = get_annotation_df(
    getattr(model, annotation_type), ["uniprot", "drugbank"]
).rename({"id": annotation_type}, axis=1)

df_model_mappings["drugbank"] = df_model_mappings["drugbank"].apply(
    lambda x: split_string(x)
)
df_model_mappings = df_model_mappings.explode("drugbank").drop_duplicates()
print(df_model_mappings.nunique())
drugbank_ids = set(df_model_mappings["drugbank"].dropna().unique())
uniprot_ids = set(df_model_mappings["uniprot"].dropna().unique())
df_model_mappings

genes        820
uniprot      820
drugbank    2440
dtype: int64


Unnamed: 0,genes,uniprot,drugbank
0,RPE,Q96AT9,DB00153
1,RPIA,P49247,DB01756
2,SORD,Q00796,DB00157
2,SORD,Q00796,DB04478
3,AKR7A2,O43488,
...,...,...,...
816,VCPIP1,Q96JH7,
817,VPS4B,O75351,
818,WDR77,Q9BQA1,
819,YES1,P07947,DB01254


## Parse DrugBank information into DataFrame
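
Every tag in the DrugBank XML is namespace-qualified, so lookups must prepend the namespace in braces. The sketch below illustrates the lookup pattern used throughout this section on a tiny hand-written document. It assumes `DRUGBANK_NS` equals `"{http://www.drugbank.ca}"`, and the secondary identifier is a placeholder.

```python
from xml.etree import ElementTree

NS = "{http://www.drugbank.ca}"  # assumed to match DRUGBANK_NS

xml_text = """
<drugbank xmlns="http://www.drugbank.ca">
  <drug type="small molecule">
    <drugbank-id primary="true">DB00006</drugbank-id>
    <drugbank-id>SECONDARY-ID-PLACEHOLDER</drugbank-id>
    <name>Bivalirudin</name>
  </drug>
</drugbank>
"""
demo_root = ElementTree.fromstring(xml_text)
demo_drug = demo_root.find(f"{NS}drug")

# The [@primary='true'] predicate selects the primary identifier only
primary_id = demo_drug.findtext(f"{NS}drugbank-id[@primary='true']")
drug_name = demo_drug.findtext(f"{NS}name")
# Attributes are not namespace-qualified
drug_type = demo_drug.get("type")
print(primary_id, drug_name, drug_type)
```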

In [6]:
all_drug_dfs = {}
root = ElementTree.parse(filepath).getroot()
root

<Element '{http://www.drugbank.ca}drugbank' at 0x110ea76f0>

#### Extract general information

In [7]:
idx = 0
data = defaultdict(dict)
for drug in root:
    # General information
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    data[idx].update({"drugbank-id": drugbank_id})
    data[idx].update({attr: drug.get(attr) for attr in ["type", "created", "updated"]})
    for key in DRUGBANK_GENERAL_ELEMENTS:
        if key == "drugbank-id":
            continue

        if key in {"name", "cas-number"}:
            element = drug.find(f"{DRUGBANK_NS}{key}")
            if element is not None and has_value_type(element):
                data[idx].update({key: element.text})

    for key in {"products", "international-brands"}:
        subkey = "name"
        data[idx].update(
            {
                f"{key}": build_string(
                    [
                        element.findtext(f"{DRUGBANK_NS}{subkey}")
                        for element in drug.findall(
                            f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}"
                        )
                    ]
                )
            }
        )
    key = "synonyms"
    data[idx].update(
        {
            f"{key}": build_string(
                [
                    element.text
                    for element in drug.findall(
                        f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}"
                    )
                ]
            )
        }
    )

    idx += 1

df_drugbank_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
all_drug_dfs["General"] = df_drugbank_data
df_drugbank_data = df_drugbank_data.drop(["created", "updated"], axis=1)
df_drugbank_data

Unnamed: 0,drugbank-id,type,name,cas-number,international-brands,products,synonyms
0,DB00006,small molecule,Bivalirudin,128270-60-0,Angiox;Hirulog,Angiomax;Angiomax RTU;Angiox;Bivalirudin;Bival...,Bivalirudin;Bivalirudina;Bivalirudinum
1,DB00027,small molecule,Gramicidin D,1405-97-6,Sofradex,Antibiotic Cream;Antibiotic Cream for Kids;Ant...,Bacillus brevis gramicidin D;Gramicidin;Gramic...
2,DB00030,biotech,Insulin human,11061-68-0,,Actraphane 30;Actraphane 30 Flexpen;Actraphane...,High molecular weight insulin human;Human insu...
3,DB00035,small molecule,Desmopressin,16679-58-6,Adiuretin;DesmoMelt,Apo-desmopressin;Bipazen;Ddavp;Ddavp Inj 4mcg/...,1-(3-mercaptopropionic acid)-8-D-arginine-vaso...
4,DB00041,biotech,Aldesleukin,110942-02-4,,Proleukin,125-L-serine-2-133-interleukin 2 (human reduce...
...,...,...,...,...,...,...,...
2435,DB16732,biotech,Tisotumab vedotin,1418731-10-8,Tivdak,Tivdak,HuMax-TF-ADC;Tisotumab vedotin;tisotumab vedot...
2436,DB16826,small molecule,Repotrectinib,1802220-02-5,,Augtyro,"(3R,6S,)-45-FLUORO-3,6-DIMETHYL-5-OXA-2,8-DIAZ..."
2437,DB17083,small molecule,Linzagolix,935283-04-8,,KLH-2109 Choline;Yselty,"3-(5-((2,3-difluoro-6-methoxyphenyl)methoxy)-2..."
2438,DB17472,small molecule,Pirtobrutinib,2101700-15-4,Jaypirca,Jaypirca,(s)-5-amino-3-(4-((5-fluoro-2-methoxybenzamido...


In [8]:
idx = 0
data = defaultdict(dict)
for drug in root:
    # General information
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "categories"
    # Drug categories
    # For mesh-id: https://registry.identifiers.org/registry/mesh
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    for element in elements:
        data[idx].update({"drugbank-id": drugbank_id})
        data[idx]["category"] = element.findtext(f"{DRUGBANK_NS}category")
        data[idx]["mesh-id"] = element.findtext(f"{DRUGBANK_NS}mesh-id")
        idx += 1


df_drug_category = (
    pd.DataFrame.from_dict(data, orient="index")
    .replace("", float("nan"))
    .drop_duplicates()
    .reset_index(drop=True)
)
all_drug_dfs["Categories"] = df_drug_category
df_drug_category

Unnamed: 0,drugbank-id,category,mesh-id
0,DB00006,"Amino Acids, Peptides, and Proteins",D000602
1,DB00006,Anticoagulants,D000925
2,DB00006,Antithrombins,D000991
3,DB00006,Blood and Blood Forming Organs,
4,DB00006,Enzyme Inhibitors,D004791
...,...,...,...
30683,DB17472,P-glycoprotein inhibitors,
30684,DB17472,P-glycoprotein substrates,
30685,DB17472,Protein Kinase Inhibitors,D047428
30686,DB17472,Tyrosine Kinase Inhibitors,D000092004


#### Extract ATC codes
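
Each ATC code is flattened into one row by pairing its child elements with level names. The sketch below shows that pairing on a single hand-written entry. The element layout (child `level` elements ordered from most to least specific) is an assumption inferred from the level order used in the extraction code, and the values are those reported for DB00006 in the resulting table.

```python
from xml.etree import ElementTree

xml_text = """
<atc-code xmlns="http://www.drugbank.ca" code="B01AE06">
  <level code="B01AE">Direct thrombin inhibitors</level>
  <level code="B01A">ANTITHROMBOTIC AGENTS</level>
  <level code="B01">ANTITHROMBOTIC AGENTS</level>
  <level code="B">BLOOD AND BLOOD FORMING ORGANS</level>
</atc-code>
"""
atc = ElementTree.fromstring(xml_text)

row = {"substance.code": atc.get("code")}
level_names = ["chemical", "pharmacological", "therapeutic", "anatomical"]
# zip pairs each child element with its level name, in document order
for level, child in zip(level_names, atc):
    row[f"{level}.code"] = child.get("code")
    row[f"{level}.description"] = child.text
print(row)
```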

In [9]:
idx = 0
data = defaultdict(dict)
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "atc-codes"
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    for element in elements:
        data[idx].update(
            {
                "drugbank-id": drugbank_id,
                "substance.code": element.get("code"),
                "substance.description": drug.findtext(f"{DRUGBANK_NS}name"),
            }
        )
        for level, subelement in zip(
            ["chemical", "pharmacological", "therapeutic", "anatomical"], list(element)
        ):
            data[idx].update(
                {
                    f"{level}.description": subelement.text,
                    f"{level}.code": subelement.get("code"),
                }
            )
        idx += 1

df_atc_codes_data = pd.DataFrame.from_dict(data, orient="index")
df_atc_codes_data = df_atc_codes_data.loc[
    :, list(df_atc_codes_data.columns[:1]) + list(df_atc_codes_data.columns[1:][::-1])
]
df_atc_codes_data = (
    df_drugbank_data[["drugbank-id"]]
    .merge(
        df_atc_codes_data,
        left_on="drugbank-id",
        right_on="drugbank-id",
        how="left",
    )
    .drop_duplicates()
    .reset_index(drop=True)
)
all_drug_dfs["ATC"] = df_atc_codes_data

print(df_atc_codes_data.nunique())
df_atc_codes_data

drugbank-id                    2440
anatomical.code                  14
anatomical.description           14
therapeutic.code                 85
therapeutic.description          85
pharmacological.code            202
pharmacological.description     198
chemical.code                   527
chemical.description            487
substance.description          1010
substance.code                 1852
dtype: int64


Unnamed: 0,drugbank-id,anatomical.code,anatomical.description,therapeutic.code,therapeutic.description,pharmacological.code,pharmacological.description,chemical.code,chemical.description,substance.description,substance.code
0,DB00006,B,BLOOD AND BLOOD FORMING ORGANS,B01,ANTITHROMBOTIC AGENTS,B01A,ANTITHROMBOTIC AGENTS,B01AE,Direct thrombin inhibitors,Bivalirudin,B01AE06
1,DB00027,R,RESPIRATORY SYSTEM,R02,THROAT PREPARATIONS,R02A,THROAT PREPARATIONS,R02AB,Antibiotics,Gramicidin D,R02AB30
2,DB00030,A,ALIMENTARY TRACT AND METABOLISM,A10,DRUGS USED IN DIABETES,A10A,INSULINS AND ANALOGUES,A10AC,"Insulins and analogues for injection, intermed...",Insulin human,A10AC01
3,DB00030,A,ALIMENTARY TRACT AND METABOLISM,A10,DRUGS USED IN DIABETES,A10A,INSULINS AND ANALOGUES,A10AE,"Insulins and analogues for injection, long-acting",Insulin human,A10AE01
4,DB00030,A,ALIMENTARY TRACT AND METABOLISM,A10,DRUGS USED IN DIABETES,A10A,INSULINS AND ANALOGUES,A10AB,"Insulins and analogues for injection, fast-acting",Insulin human,A10AB01
...,...,...,...,...,...,...,...,...,...,...,...
3549,DB16732,L,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L01,ANTINEOPLASTIC AGENTS,L01F,MONOCLONAL ANTIBODIES AND ANTIBODY DRUG CONJUG...,L01FX,Other monoclonal antibodies and antibody drug ...,Tisotumab vedotin,L01FX23
3550,DB16826,,,,,,,,,,
3551,DB17083,H,"SYSTEMIC HORMONAL PREPARATIONS, EXCL. SEX HORM...",H01,PITUITARY AND HYPOTHALAMIC HORMONES AND ANALOGUES,H01C,HYPOTHALAMIC HORMONES,H01CC,Anti-gonadotropin-releasing hormones,Linzagolix,H01CC04
3552,DB17472,L,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L01,ANTINEOPLASTIC AGENTS,L01E,PROTEIN KINASE INHIBITORS,L01EL,Bruton's tyrosine kinase (BTK) inhibitors,Pirtobrutinib,L01EL05


#### Extract drug interactions
Only interactions in which both drugs map directly to the reconstruction are extracted.
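
Because interactions go both ways, the same pair can appear once under each drug. The sketch below shows the order-independent key used to keep each pair once, written in plain Python rather than pandas. It assumes `build_string` joins values with `";"`, as the `drugbank_A;drugbank_B` column in the output suggests.

```python
records = [
    ("DB00006", "DB06605"),
    ("DB06605", "DB00006"),  # same interaction, listed from the other side
    ("DB00006", "DB01254"),
]

seen = set()
unique_pairs = []
for drug_a, drug_b in records:
    # Sorting makes the key identical regardless of which drug is listed first
    key = ";".join(sorted((drug_a, drug_b)))
    if key not in seen:
        seen.add(key)
        unique_pairs.append(key)
print(unique_pairs)
```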

In [10]:
key = "drug-interactions"
# Columns taken from each interaction entry are prefixed with the element name
prefix = f"{key}."

idx = 0
data = defaultdict(dict)
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    for element in elements:
        interacting_id = element.findtext(f"{DRUGBANK_NS}drugbank-id")
        if interacting_id in drugbank_ids:
            data[idx].update(
                {
                    "drugbank-id": drugbank_id,
                    "name": drug.findtext(f"{DRUGBANK_NS}name"),
                }
            )
            data[idx].update(
                {
                    f"{prefix}{subkey}": element.findtext(f"{DRUGBANK_NS}{subkey}")
                    for subkey in ["drugbank-id", "name", "description"]
                }
            )
            idx += 1

df_drug_interactions = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)

# Drug interactions go both ways, so build an order-independent key to keep each pair only once
df_drug_interactions["drug;drug"] = df_drug_interactions[
    ["drugbank-id", "drug-interactions.drugbank-id"]
].apply(lambda x: build_string(sorted(x.values)), axis=1)
df_drug_interactions = df_drug_interactions.drop_duplicates(subset=["drug;drug"])
df_drug_interactions = df_drug_interactions.reset_index(drop=True)
df_drug_interactions = df_drug_interactions.rename(
    {
        "drugbank-id": "drugbank_A",
        "name": "name_A",
        "drug-interactions.drugbank-id": "drugbank_B",
        "drug-interactions.name": "name_B",
        "drug;drug": "drugbank_A;drugbank_B",
    },
    axis=1,
)
all_drug_dfs["Interactions"] = df_drug_interactions

print(df_drug_interactions.nunique())
df_drug_interactions

drugbank_A                         1184
name_A                             1184
drugbank_B                         1197
name_B                             1197
drug-interactions.description    231553
drugbank_A;drugbank_B            231553
dtype: int64


Unnamed: 0,drugbank_A,name_A,drugbank_B,name_B,drug-interactions.description,drugbank_A;drugbank_B
0,DB00006,Bivalirudin,DB06605,Apixaban,Apixaban may increase the anticoagulant activi...,DB00006;DB06605
1,DB00006,Bivalirudin,DB06695,Dabigatran etexilate,Dabigatran etexilate may increase the anticoag...,DB00006;DB06695
2,DB00006,Bivalirudin,DB01254,Dasatinib,The risk or severity of bleeding and hemorrhag...,DB00006;DB01254
3,DB00006,Bivalirudin,DB01586,Ursodeoxycholic acid,The risk or severity of bleeding and bruising ...,DB00006;DB01586
4,DB00006,Bivalirudin,DB02123,Glycochenodeoxycholic Acid,The risk or severity of bleeding and bruising ...,DB00006;DB02123
...,...,...,...,...,...,...
231548,DB16650,Deucravacitinib,DB17472,Pirtobrutinib,The risk or severity of adverse effects can be...,DB16650;DB17472
231549,DB16690,Tegoprazan,DB16703,Belumosudil,The serum concentration of Belumosudil can be ...,DB16690;DB16703
231550,DB16703,Belumosudil,DB17472,Pirtobrutinib,The risk or severity of adverse effects can be...,DB16703;DB17472
231551,DB16703,Belumosudil,DB16826,Repotrectinib,The serum concentration of Repotrectinib can b...,DB16703;DB16826


#### Extract proteins
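
Each target, enzyme, carrier, or transporter entry is flattened into one row: simple child elements become columns, and the nested polypeptide contributes `polypeptide.`-prefixed columns. The sketch below shows the idea on one hand-written entry using values from the YES1 row of the resulting table. `strip_ns` and the whitespace check are stand-ins for `strip_ns_DrugBank` and `has_value_type`, whose exact behavior is not shown here.

```python
from xml.etree import ElementTree

NS = "{http://www.drugbank.ca}"  # assumed to match DRUGBANK_NS


def strip_ns(tag):
    """Drop the '{namespace}' part of a qualified tag (stand-in helper)."""
    return tag.split("}", 1)[-1]


xml_text = """
<target xmlns="http://www.drugbank.ca">
  <id>BE0000840</id>
  <name>Tyrosine-protein kinase Yes</name>
  <organism>Humans</organism>
  <polypeptide id="P07947" source="Swiss-Prot">
    <name>Tyrosine-protein kinase Yes</name>
  </polypeptide>
</target>
"""
target = ElementTree.fromstring(xml_text)

row = {}
for child in target:
    # Keep only children carrying real text; containers hold just whitespace
    if child.text and child.text.strip():
        row[strip_ns(child.tag)] = child.text

polypeptide = target.find(f"{NS}polypeptide")
row["polypeptide.uniprot-id"] = polypeptide.get("id")
row["polypeptide.source"] = polypeptide.get("source")
for child in polypeptide:
    if child.text and child.text.strip():
        row[f"polypeptide.{strip_ns(child.tag)}"] = child.text
print(row)
```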

In [11]:
idx = 0
data = defaultdict(dict)
prefix = False
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    prefix = "proteins." if prefix else ""
    for ptype in ["targets", "enzymes", "carriers", "transporters"]:
        elements = drug.findall(
            f"{DRUGBANK_NS}{ptype}/{DRUGBANK_NS}{strip_plural(ptype)}"
        )
        for element in elements:
            for subelement in element.findall(f"{DRUGBANK_NS}polypeptide"):
                data[idx].update({"drugbank-id": drugbank_id, f"{prefix}type": ptype})
                # Use a distinct loop name so the polypeptide `subelement` is not shadowed
                data[idx].update(
                    {
                        f"{prefix}{strip_ns_DrugBank(child.tag)}": child.text
                        for child in element
                        if has_value_type(child)
                    }
                )

                # Polypeptide
                key = "polypeptide"
                data[idx].update(
                    {
                        f"{prefix}{key}.uniprot-id": subelement.get("id"),
                        f"{prefix}{key}.source": subelement.get("source"),
                    }
                )
                data[idx].update(
                    {
                        f"{prefix}{key}.{strip_ns_DrugBank(subelem.tag)}": subelem.text
                        for subelem in subelement
                        if has_value_type(subelem)
                    }
                )
                subkey = "pfams"
                data[idx].update(
                    {
                        f"{prefix}{key}.{subkey}": build_string(
                            [
                                subelem.text
                                for subelem in subelement.findall(
                                    f"{DRUGBANK_NS}{subkey}/{DRUGBANK_NS}{strip_plural(subkey)}/{DRUGBANK_NS}identifier"
                                )
                                if has_value_type(subelem)
                            ]
                        )
                    }
                )

                idx += 1

df_proteins = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
df_proteins = df_proteins[
    df_proteins[f"{prefix}polypeptide.uniprot-id"].isin(uniprot_ids)
]
df_proteins = df_proteins.drop_duplicates().reset_index(drop=True)
df_proteins = (
    df_model_mappings[["genes", "uniprot"]]
    .merge(df_proteins, left_on="uniprot", right_on="polypeptide.uniprot-id")
    .drop_duplicates()
    .drop(["uniprot"], axis=1)
    .reset_index(drop=True)
)

all_drug_dfs["Proteins"] = df_proteins
df_proteins

Unnamed: 0,genes,drugbank-id,type,id,name,organism,known-action,polypeptide.uniprot-id,polypeptide.source,polypeptide.name,...,polypeptide.theoretical-pi,polypeptide.molecular-weight,polypeptide.chromosome-location,polypeptide.organism,polypeptide.amino-acid-sequence,polypeptide.gene-sequence,polypeptide.pfams,inhibition-strength,polypeptide.transmembrane-regions,induction-strength
0,RPE,DB00153,enzymes,BE0009671,Ribulose-phosphate 3-epimerase,Humans,no,Q96AT9,Swiss-Prot,Ribulose-phosphate 3-epimerase,...,,24927.555,2,Humans,>lcl|BSEQ0052026|Ribulose-phosphate 3-epimeras...,>lcl|BSEQ0052027|Ribulose-phosphate 3-epimeras...,PF00834,,,
1,RPIA,DB01756,targets,BE0004438,Ribose-5-phosphate isomerase,Humans,unknown,P49247,Swiss-Prot,Ribose-5-phosphate isomerase,...,,33268.72,2,Humans,>lcl|BSEQ0009359|Ribose-5-phosphate isomerase\...,>lcl|BSEQ0020766|Ribose-5-phosphate isomerase ...,PF06026,,,
2,SORD,DB00157,targets,BE0000299,Sorbitol dehydrogenase,Humans,unknown,Q00796,Swiss-Prot,Sorbitol dehydrogenase,...,8.06,38324.25,15,Humans,>lcl|BSEQ0036986|Sorbitol dehydrogenase\nMAAAA...,>lcl|BSEQ0010069|Sorbitol dehydrogenase (SORD)...,PF08240;PF00107,,,
3,SORD,DB04478,targets,BE0000299,Sorbitol dehydrogenase,Humans,unknown,Q00796,Swiss-Prot,Sorbitol dehydrogenase,...,8.06,38324.25,15,Humans,>lcl|BSEQ0036986|Sorbitol dehydrogenase\nMAAAA...,>lcl|BSEQ0010069|Sorbitol dehydrogenase (SORD)...,PF08240;PF00107,,,
4,SRM,DB00118,enzymes,BE0000300,Spermidine synthase,Humans,unknown,P19623,Swiss-Prot,Spermidine synthase,...,5.17,33824.455,1,Humans,>lcl|BSEQ0010239|Spermidine synthase\nMEPGPDGP...,>lcl|BSEQ0010240|Spermidine synthase (SRM)\nAT...,PF01564;PF17284,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4655,SLK,DB12010,targets,BE0004041,STE20-like serine/threonine-protein kinase,Humans,unknown,Q9H2G2,Swiss-Prot,STE20-like serine/threonine-protein kinase,...,4.8,142693.96,10,Humans,>lcl|BSEQ0007723|STE20-like serine/threonine-p...,>lcl|BSEQ0019517|STE20-like serine/threonine-p...,PF12474;PF00069,,,
4656,UAP1,DB02196,targets,BE0001429,UDP-N-acetylhexosamine pyrophosphorylase,Humans,unknown,Q16222,Swiss-Prot,UDP-N-acetylhexosamine pyrophosphorylase,...,6.29,58768.705,1,Humans,>lcl|BSEQ0002837|UDP-N-acetylhexosamine pyroph...,>lcl|BSEQ0011002|UDP-N-acetylhexosamine pyroph...,PF01704,,,
4657,UAP1,DB03397,targets,BE0001429,UDP-N-acetylhexosamine pyrophosphorylase,Humans,unknown,Q16222,Swiss-Prot,UDP-N-acetylhexosamine pyrophosphorylase,...,6.29,58768.705,1,Humans,>lcl|BSEQ0002837|UDP-N-acetylhexosamine pyroph...,>lcl|BSEQ0011002|UDP-N-acetylhexosamine pyroph...,PF01704,,,
4658,YES1,DB01254,targets,BE0000840,Tyrosine-protein kinase Yes,Humans,yes,P07947,Swiss-Prot,Tyrosine-protein kinase Yes,...,6.71,60800.78,18,Humans,>lcl|BSEQ0010592|Tyrosine-protein kinase Yes\n...,>lcl|BSEQ0010593|Tyrosine-protein kinase Yes (...,PF00018;PF07714;PF00017,,,


In [12]:
idx = 0
data = defaultdict(dict)
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    for key in ["snp-effects", "snp-adverse-drug-reactions"]:
        elements = drug.findall(
            f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key.split('-')[-1])}"
        )
        for element in elements:
            data[idx].update({"drugbank-id": drugbank_id})
            data[idx].update(
                {
                    f"{strip_ns_DrugBank(subelement.tag)}": subelement.text
                    for subelement in element
                    if has_value_type(subelement)
                }
            )
            idx += 1
df_snp_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
df_snp_data = df_snp_data[df_snp_data["uniprot-id"].isin(uniprot_ids)]
df_snp_data = df_snp_data.drop_duplicates().reset_index(drop=True)
all_drug_dfs["SNP"] = df_snp_data

df_snp_data

Unnamed: 0,drugbank-id,protein-name,gene-symbol,uniprot-id,rs-id,defining-change,description,pubmed-id,allele,adverse-reaction
0,DB00215,Multidrug resistance protein 1,ABCB1,P08183,rs2032583,C Allele,Patients with this genotype have an increased ...,17913323,,
1,DB00285,Multidrug resistance protein 1,ABCB1,P08183,rs2032583,C Allele,Patients with this genotype have an increased ...,17913323,,
2,DB00285,Multidrug resistance protein 1,ABCB1,P08183,rs2032583,T > C,Patients with this genotype have increased ris...,22641028,,
3,DB00295,Multidrug resistance protein 1,ABCB1,P08183,rs1045642,T Allele,Patients with this genotype may have an increa...,17898703,,
4,DB00317,ATP-binding cassette sub-family G member 2,ABCG2,Q9UNQ0,rs2231142,,Patients with this genotype have an increased ...,17148776,,A allele
5,DB00321,Multidrug resistance protein 1,ABCB1,P08183,rs2032583,C Allele,Patients with this genotype have an increased ...,17913323,,
6,DB00321,Multidrug resistance protein 1,ABCB1,P08183,rs2032583,T > C,Patients with this genotype have increased ris...,22641028,,
7,DB00352,Thiopurine S-methyltransferase,TPMT,P51580,rs1800462,,The presence of this polymorphism in TPMT may ...,21270794,TPMT*2,G Allele
8,DB00352,Thiopurine S-methyltransferase,TPMT,P51580,rs1800460,,The presence of this polymorphism in TPMT may ...,21270794,TPMT*3A,A Allele
9,DB00352,Thiopurine S-methyltransferase,TPMT,P51580,rs1142345,,The presence of this polymorphism in TPMT may ...,21270794,TPMT*3C,G Allele


## Export drug data for subsequent visualization

In [13]:
print(list(all_drug_dfs.keys()))
for sheet_name, df in all_drug_dfs.items():
    df.to_csv(database_dirpath / f"{sheet_name}_DrugBank.tsv", sep="\t")

['General', 'Categories', 'ATC', 'Interactions', 'Proteins', 'SNP']
