# Extract drug data from DrugBank

The purpose of this notebook is to extract and format drug interaction data from DrugBank for subsequent visualization.

## DRUGBANK ONLINE
To utilize this notebook: 

1. Go to [DrugBank database](https://go.drugbank.com/releases/latest) and create an account.
2. Follow the instructions to obtain a free academic license.
3. Download and unzip the database file `"drugbank_all_full_database.xml.zip"`.
4. Rename the file `"full database.xml"` to `"drugbank_all_full_database.xml"`.
5. Remember to clear out any personal account information and ensure the downloaded DrugBank file remains local!
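
Steps 3 and 4 can also be scripted. The following is a minimal sketch using only the standard library; the helper name `unzip_and_rename` is hypothetical, and the archive member name and target name are taken from the steps above.

```python
import zipfile
from pathlib import Path


def unzip_and_rename(
    zip_path,
    out_dir,
    member="full database.xml",
    target="drugbank_all_full_database.xml",
):
    """Extract `member` from the archive and rename it to `target`.

    Hypothetical helper for illustration; not part of rbc_gem_utils.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extract(member, path=out_dir)
    renamed = out_dir / target
    # Path.replace overwrites an existing target file
    (out_dir / member).replace(renamed)
    return renamed
```

Calling `unzip_and_rename("drugbank_all_full_database.xml.zip", "some/local/dir")` would leave a single renamed XML file in the chosen directory, which should stay local.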

The function `download_database_DrugBank` takes a username and a password, downloads the data, and renames the file in the process.

Fields for the DrugBank XML schema are found [here](https://docs.drugbank.com/xml/#introduction).

Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, Sajed T, Johnson D, Li C, Sayeeda Z, Assempour N, Iynkkaran I, Liu Y, Maciejewski A, Gale N, Wilson A, Chin L, Cummings R, Le D, Pon A, Knox C, Wilson M. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2017 Nov 8. doi: 10.1093/nar/gkx1037.

## Setup
### Import packages

In [1]:
from warnings import warn
import pandas as pd
import numpy as np
from collections import defaultdict
from xml.etree import ElementTree
import matplotlib.pyplot as plt

from rbc_gem_utils import (
    ROOT_PATH,
    INTERIM_PATH,
    DATABASE_PATH,
    ANNOTATION_PATH,
    get_annotation_df,
    read_rbc_model,
    check_database_version_online,
    check_version,
    show_versions,
    build_string,
    split_string,
)
from rbc_gem_utils.database.drugbank import (
    DRUGBANK_NS,
    DRUGBANK_VERSION_EXPECTED,
    DRUGBANK_PATH,
    DRUGBANK_GENERAL_ELEMENTS,
    strip_ns_DrugBank,
    get_version_DrugBank,
    download_database_DrugBank,
)

from rbc_gem_utils.util import (
    strip_plural,
    has_value_type,
)

# Display the package versions from the last successful run of this notebook
show_versions()


Package Information
-------------------
rbc-gem-utils 0.0.1

Dependency Information
----------------------
beautifulsoup4                       4.12.3
bio                                   1.6.2
cobra                                0.29.0
depinfo                               2.2.0
kaleido                               0.2.1
matplotlib                            3.8.2
memote                               0.17.0
networkx                              3.2.1
notebook                              7.0.7
openpyxl                              3.1.2
pandas                                2.2.0
pre-commit                            3.6.0
pyvis                                 0.3.2
rbc-gem-utils[database,network,vis] missing
requests                             2.31.0
scipy                                1.12.0
seaborn                              0.13.2

Build Tools Information
-----------------------
pip        23.3.1
setuptools 68.2.2
wheel      0.41.2

Platform Information
-------------------

## Check DrugBank version
If the version does not match the expected version, the database has been updated since this code was last used. 
### Expected DrugBank version: 5.1.11
* Last release utilized: [5.1.11](https://go.drugbank.com/releases) published on **2024-01-03**
* Version in the DrugBank file is formatted as {major}.{minor}
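
Because the file only records `{major}.{minor}`, a comparison against a full release number such as 5.1.11 has to ignore the last component. The sketch below shows one way to do that; the helper names are hypothetical, and this is not necessarily how `check_version` is implemented.

```python
def major_minor(version):
    """Return the (major, minor) pair of a dotted version string."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def same_major_minor(file_version, expected_version):
    """Compare two versions on their major and minor components only."""
    return major_minor(file_version) == major_minor(expected_version)


print(same_major_minor("5.1", "5.1.11"))  # True
print(same_major_minor("5.2", "5.1.11"))  # False
```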

In [2]:
if not check_database_version_online("DrugBank"):
    warn(
        "Online version of database has been updated since the last time notebook was used."
    )

version = get_version_DrugBank()
if check_version(version, DRUGBANK_VERSION_EXPECTED, verbose=True):
    database_dirpath = f"{ROOT_PATH}{DATABASE_PATH}{DRUGBANK_PATH}"
    annotation_dirpath = f"{ROOT_PATH}{ANNOTATION_PATH}"
else:
    database_dirpath = f"{ROOT_PATH}{INTERIM_PATH}{DRUGBANK_PATH}"
    annotation_dirpath = f"{ROOT_PATH}{INTERIM_PATH}"
    version = DRUGBANK_VERSION_EXPECTED

Current and expected versions match.


#### Download new files and update database
If an argument is not provided (`arg=None`), its default value for the repository is used. 
A username and password must be provided for this function. Be sure to remove personal information after use!
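
One way to avoid leaving credentials in the notebook is to read them from environment variables. This is only a sketch: the helper name and the variable names `DRUGBANK_USERNAME` and `DRUGBANK_PASSWORD` are assumptions made for illustration, not something the repository defines.

```python
import os


def get_drugbank_credentials(env=None):
    """Return (username, password) read from environment variables.

    The variable names are hypothetical, chosen for this sketch.
    """
    env = os.environ if env is None else env
    username = env.get("DRUGBANK_USERNAME")
    password = env.get("DRUGBANK_PASSWORD")
    if not username or not password:
        raise RuntimeError(
            "Set DRUGBANK_USERNAME and DRUGBANK_PASSWORD before downloading."
        )
    return username, password
```

The returned pair could then be passed as the `username` and `password` arguments of `download_database_DrugBank`, so nothing personal needs to be typed into a cell.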

In [3]:
download = False
if download:
    # Download data
    download_database_DrugBank(
        username="USERNAME",
        password="PASSWORD",
        database_dirpath=database_dirpath,
        version=version,
    )
filepath = f"{database_dirpath}/drugbank_all_full_database.xml"

## Load RBC-GEM model

In [4]:
model = read_rbc_model(filetype="xml")
model

| | |
|---|---|
| Name | RBC_GEM |
| Memory address | 1506c49d0 |
| Number of metabolites | 1971 |
| Number of reactions | 2798 |
| Number of genes | 656 |
| Number of groups | 75 |
| Objective expression | 1.0*NaKt - 1.0*NaKt_reverse_db47e |
| Compartments | cytosol, extracellular space |


In [5]:
annotation_type = "genes"
df_model_mappings = get_annotation_df(
    getattr(model, annotation_type), ["uniprot", "drugbank"]
).rename({"id": annotation_type}, axis=1)

df_model_mappings["drugbank"] = df_model_mappings["drugbank"].apply(
    lambda x: split_string(x)
)
df_model_mappings = df_model_mappings.explode("drugbank").drop_duplicates()
print(df_model_mappings.nunique())
drugbank_ids = set(df_model_mappings["drugbank"].dropna().unique())
uniprot_ids = set(df_model_mappings["uniprot"].dropna().unique())
df_model_mappings

genes        656
uniprot      656
drugbank    2066
dtype: int64


| | genes | uniprot | drugbank |
|---|---|---|---|
| 0 | RPE | Q96AT9 | DB00153 |
| 1 | RPIA | P49247 | DB01756 |
| 2 | SORD | Q00796 | DB00157 |
| 2 | SORD | Q00796 | DB04478 |
| 3 | AKR7A2 | O43488 | |
| ... | ... | ... | ... |
| 652 | SMPD1 | P17405 | DB12151 |
| 652 | SMPD1 | P17405 | DB14009 |
| 653 | SPHK1 | Q9NYA1 | DB08868 |
| 654 | TRPC6 | Q9Y210 | |
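
The split-then-explode step is what turns one row per gene into one row per gene and DrugBank ID, which is why the index values repeat in the output above. A small sketch with two genes from that table follows; it assumes a `";"` separator and uses `str.split` as a stand-in for `split_string`, whose exact behavior is not shown here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "genes": ["SORD", "AKR7A2"],
        "uniprot": ["Q00796", "O43488"],
        "drugbank": ["DB00157;DB04478", ""],
    }
)
# Stand-in for split_string: assumed ";" separator, empty string -> missing
df["drugbank"] = df["drugbank"].apply(lambda x: x.split(";") if x else None)
# One row per (gene, DrugBank ID); genes without IDs keep a single row
df = df.explode("drugbank").drop_duplicates()
drugbank_ids = set(df["drugbank"].dropna().unique())
print(df)
print(drugbank_ids)
```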


## Parse DrugBank information into DataFrame

In [6]:
all_drug_dfs = {}
# `filepath` was defined in the download cell and points to the same file
root = ElementTree.parse(filepath).getroot()
root

<Element '{http://www.drugbank.ca}drugbank' at 0x152e69850>
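
The root tag shows that every element lives in the `http://www.drugbank.ca` namespace, so lookups must be namespace-qualified. The sketch below demonstrates this on a tiny made-up document; the value of `NS` is an assumption about what `DRUGBANK_NS` holds, and the secondary ID is a placeholder.

```python
from xml.etree import ElementTree

# Assumed value; the notebook imports DRUGBANK_NS from rbc_gem_utils instead
NS = "{http://www.drugbank.ca}"

# Toy document for illustration only; the secondary ID is a placeholder
xml_text = """
<drugbank xmlns="http://www.drugbank.ca" version="5.1">
  <drug type="small molecule">
    <drugbank-id primary="true">DB00027</drugbank-id>
    <drugbank-id>EXAMPLE0001</drugbank-id>
    <name>Gramicidin D</name>
  </drug>
</drugbank>
"""
toy_root = ElementTree.fromstring(xml_text)
drug = toy_root[0]

# The attribute predicate selects the primary ID among several drugbank-id tags
primary_id = drug.findtext(f"{NS}drugbank-id[@primary='true']")
# Without the namespace prefix nothing matches and findtext returns None
unqualified = drug.findtext("drugbank-id")
print(toy_root.tag, primary_id, unqualified)
```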

#### Extract general information

In [7]:
idx = 0
data = defaultdict(dict)
for drug in root:
    # General information
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    data[idx].update({"drugbank-id": drugbank_id})
    data[idx].update({attr: drug.get(attr) for attr in ["type", "created", "updated"]})
    for key in DRUGBANK_GENERAL_ELEMENTS:
        if key == "drugbank-id":
            continue

        if key in {"name", "cas-number"}:
            element = drug.find(f"{DRUGBANK_NS}{key}")
            if element is not None and has_value_type(element):
                data[idx].update({key: element.text})

    for key in {"products", "international-brands"}:
        subkey = "name"
        data[idx].update(
            {
                f"{key}": build_string(
                    [
                        element.findtext(f"{DRUGBANK_NS}{subkey}")
                        for element in drug.findall(
                            f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}"
                        )
                    ]
                )
            }
        )
    key = "synonyms"
    data[idx].update(
        {
            f"{key}": build_string(
                [
                    element.text
                    for element in drug.findall(
                        f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}"
                    )
                ]
            )
        }
    )

    idx += 1

df_drugbank_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
all_drug_dfs["General"] = df_drugbank_data
df_drugbank_data = df_drugbank_data.drop(["created", "updated"], axis=1)
df_drugbank_data

| | drugbank-id | type | name | cas-number | products | international-brands | synonyms |
|---|---|---|---|---|---|---|---|
| 0 | DB00027 | small molecule | Gramicidin D | 1405-97-6 | Antibiotic Cream;Antibiotic Cream for Kids;Ant... | Sofradex | Bacillus brevis gramicidin D;Gramicidin;Gramic... |
| 1 | DB00030 | biotech | Insulin human | 11061-68-0 | Actraphane 30;Actraphane 30 Flexpen;Actraphane... | | High molecular weight insulin human;Human insu... |
| 2 | DB00035 | small molecule | Desmopressin | 16679-58-6 | Apo-desmopressin;Bipazen;Ddavp;Ddavp Inj 4mcg/... | Adiuretin;DesmoMelt | 1-(3-mercaptopropionic acid)-8-D-arginine-vaso... |
| 3 | DB00041 | biotech | Aldesleukin | 110942-02-4 | Proleukin | | 125-L-serine-2-133-interleukin 2 (human reduce... |
| 4 | DB00046 | biotech | Insulin lispro | 133107-64-9 | Admelog;Admelog Solostar;Humalog;Humalog (cart... | | Insulin lispro;Insulin lispro (genetical recom... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2061 | DB16732 | biotech | Tisotumab vedotin | 1418731-10-8 | Tivdak | Tivdak | HuMax-TF-ADC;Tisotumab vedotin;tisotumab vedot... |
| 2062 | DB16826 | small molecule | Repotrectinib | 1802220-02-5 | | | (3R,6S,)-45-FLUORO-3,6-DIMETHYL-5-OXA-2,8-DIAZ... |
| 2063 | DB17083 | small molecule | Linzagolix | 935283-04-8 | KLH-2109 Choline;Yselty | | 3-(5-((2,3-difluoro-6-methoxyphenyl)methoxy)-2... |
| 2064 | DB17472 | small molecule | Pirtobrutinib | 2101700-15-4 | Jaypirca | Jaypirca | (s)-5-amino-3-(4-((5-fluoro-2-methoxybenzamido... |
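
The list-valued columns above come from a repeated pattern: a plural container tag (`products`, `synonyms`) holding singular children, whose text is joined into one string. The sketch below reproduces that pattern on a toy document. `strip_plural_sketch` is a hypothetical stand-in for `strip_plural`, the `NS` value is assumed, and joining with `";"` is assumed to mirror `build_string`.

```python
from xml.etree import ElementTree

NS = "{http://www.drugbank.ca}"  # assumed value of DRUGBANK_NS


def strip_plural_sketch(word):
    """Hypothetical stand-in for strip_plural."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    return word[:-1] if word.endswith("s") else word


def child_texts(drug, key, subkey=None):
    """Collect text from <key>/<singular key> children, or from their <subkey>."""
    path = f"{NS}{key}/{NS}{strip_plural_sketch(key)}"
    if subkey is not None:
        return [el.findtext(f"{NS}{subkey}") for el in drug.findall(path)]
    return [el.text for el in drug.findall(path)]


# Toy record for illustration only
drug = ElementTree.fromstring(
    """
<drug xmlns="http://www.drugbank.ca">
  <synonyms>
    <synonym>Bacillus brevis gramicidin D</synonym>
    <synonym>Gramicidin</synonym>
  </synonyms>
  <products>
    <product><name>Antibiotic Cream</name></product>
    <product><name>Antibiotic Cream for Kids</name></product>
  </products>
</drug>
"""
)
synonyms = ";".join(child_texts(drug, "synonyms"))
products = ";".join(child_texts(drug, "products", subkey="name"))
print(synonyms)
print(products)
```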


In [8]:
idx = 0
data = defaultdict(dict)
for drug in root:
    # General information
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "categories"
    # Drug categories
    # For mesh-id: https://registry.identifiers.org/registry/mesh
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    for element in elements:
        data[idx].update({"drugbank-id": drugbank_id})
        data[idx]["category"] = element.findtext(f"{DRUGBANK_NS}category")
        data[idx]["mesh-id"] = element.findtext(f"{DRUGBANK_NS}mesh-id")
        idx += 1


df_drug_category = (
    pd.DataFrame.from_dict(data, orient="index")
    .replace("", float("nan"))
    .drop_duplicates()
    .reset_index(drop=True)
)
all_drug_dfs["Categories"] = df_drug_category
df_drug_category

| | drugbank-id | category | mesh-id |
|---|---|---|---|
| 0 | DB00027 | Amino Acids, Peptides, and Proteins | D000602 |
| 1 | DB00027 | Anti-Bacterial Agents | D000900 |
| 2 | DB00027 | Anti-Infective Agents | D000890 |
| 3 | DB00027 | Anti-Infective Agents, Local | D000891 |
| 4 | DB00027 | Membrane Proteins | D008565 |
| ... | ... | ... | ... |
| 29132 | DB17472 | P-glycoprotein inhibitors | |
| 29133 | DB17472 | P-glycoprotein substrates | |
| 29134 | DB17472 | Protein Kinase Inhibitors | D047428 |
| 29135 | DB17472 | Tyrosine Kinase Inhibitors | D000092004 |


#### Extract ATC codes

In [9]:
idx = 0
data = defaultdict(dict)
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "atc-codes"
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    for element in elements:
        data[idx].update(
            {
                "drugbank-id": drugbank_id,
                "substance.code": element.get("code"),
                "substance.description": drug.findtext(f"{DRUGBANK_NS}name"),
            }
        )
        for level, subelement in zip(
            ["chemical", "pharmacological", "therapeutic", "anatomical"], list(element)
        ):
            data[idx].update(
                {
                    f"{level}.description": subelement.text,
                    f"{level}.code": subelement.get("code"),
                }
            )
        idx += 1

df_atc_codes_data = pd.DataFrame.from_dict(data, orient="index")
df_atc_codes_data = df_atc_codes_data.loc[
    :, list(df_atc_codes_data.columns[:1]) + list(df_atc_codes_data.columns[1:][::-1])
]
df_atc_codes_data = (
    df_drugbank_data[["drugbank-id"]]
    .merge(
        df_atc_codes_data,
        left_on="drugbank-id",
        right_on="drugbank-id",
        how="left",
    )
    .drop_duplicates()
    .reset_index(drop=True)
)
all_drug_dfs["ATC"] = df_atc_codes_data

print(df_atc_codes_data.nunique())
df_atc_codes_data

drugbank-id                    2066
anatomical.code                  14
anatomical.description           14
therapeutic.code                 85
therapeutic.description          85
pharmacological.code            198
pharmacological.description     194
chemical.code                   509
chemical.description            472
substance.description           947
substance.code                 1741
dtype: int64


| | drugbank-id | anatomical.code | anatomical.description | therapeutic.code | therapeutic.description | pharmacological.code | pharmacological.description | chemical.code | chemical.description | substance.description | substance.code |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DB00027 | R | RESPIRATORY SYSTEM | R02 | THROAT PREPARATIONS | R02A | THROAT PREPARATIONS | R02AB | Antibiotics | Gramicidin D | R02AB30 |
| 1 | DB00030 | A | ALIMENTARY TRACT AND METABOLISM | A10 | DRUGS USED IN DIABETES | A10A | INSULINS AND ANALOGUES | A10AC | Insulins and analogues for injection, intermed... | Insulin human | A10AC01 |
| 2 | DB00030 | A | ALIMENTARY TRACT AND METABOLISM | A10 | DRUGS USED IN DIABETES | A10A | INSULINS AND ANALOGUES | A10AE | Insulins and analogues for injection, long-acting | Insulin human | A10AE01 |
| 3 | DB00030 | A | ALIMENTARY TRACT AND METABOLISM | A10 | DRUGS USED IN DIABETES | A10A | INSULINS AND ANALOGUES | A10AB | Insulins and analogues for injection, fast-acting | Insulin human | A10AB01 |
| 4 | DB00030 | A | ALIMENTARY TRACT AND METABOLISM | A10 | DRUGS USED IN DIABETES | A10A | INSULINS AND ANALOGUES | A10AD | Insulins and analogues for injection, intermed... | Insulin human | A10AD01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3108 | DB16732 | L | ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS | L01 | ANTINEOPLASTIC AGENTS | L01F | MONOCLONAL ANTIBODIES AND ANTIBODY DRUG CONJUG... | L01FX | Other monoclonal antibodies and antibody drug ... | Tisotumab vedotin | L01FX23 |
| 3109 | DB16826 | | | | | | | | | | |
| 3110 | DB17083 | H | SYSTEMIC HORMONAL PREPARATIONS, EXCL. SEX HORM... | H01 | PITUITARY AND HYPOTHALAMIC HORMONES AND ANALOGUES | H01C | HYPOTHALAMIC HORMONES | H01CC | Anti-gonadotropin-releasing hormones | Linzagolix | H01CC04 |
| 3111 | DB17472 | | | | | | | | | | |
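
The code columns in this table are nested prefixes of the substance code: 1, 3, 4, 5, and 7 characters for the anatomical, therapeutic, pharmacological, chemical, and substance levels. The sketch below derives the higher levels from a substance code by slicing; the helper name `atc_levels` is hypothetical and is not used by the notebook, which reads the levels from the XML instead.

```python
def atc_levels(code):
    """Split a 7-character ATC substance code into its five levels."""
    if len(code) != 7:
        raise ValueError(f"Expected a 7-character ATC code, got {code!r}")
    return {
        "anatomical": code[:1],
        "therapeutic": code[:3],
        "pharmacological": code[:4],
        "chemical": code[:5],
        "substance": code,
    }


print(atc_levels("A10AC01"))
```

Slicing like this can serve as a consistency check on the parsed columns, for example to confirm that `chemical.code` is always a prefix of `substance.code`.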


#### Extract drug interactions
Only interactions in which both drugs map to the reconstruction are extracted.

In [10]:
prefix = True

idx = 0
data = defaultdict(dict)
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    key = "drug-interactions"
    elements = drug.findall(f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key)}")
    prefix = f"{key}." if prefix else ""
    for element in elements:
        interacting_id = element.findtext(f"{DRUGBANK_NS}drugbank-id")
        if interacting_id in drugbank_ids:
            data[idx].update(
                {
                    "drugbank-id": drugbank_id,
                    "name": drug.findtext(f"{DRUGBANK_NS}name"),
                }
            )
            data[idx].update(
                {
                    f"{prefix}{subkey}": element.findtext(f"{DRUGBANK_NS}{subkey}")
                    for subkey in ["drugbank-id", "name", "description"]
                }
            )
            idx += 1

df_drug_interactions = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)

# Each interaction is listed in both directions; build an order-independent key to keep unique pairs
df_drug_interactions["drug;drug"] = df_drug_interactions[
    ["drugbank-id", "drug-interactions.drugbank-id"]
].apply(lambda x: build_string(sorted(x.values)), axis=1)
df_drug_interactions = df_drug_interactions.drop_duplicates(subset=["drug;drug"])
df_drug_interactions = df_drug_interactions.reset_index(drop=True)
df_drug_interactions = df_drug_interactions.rename(
    {
        "drugbank-id": "drugbank_A",
        "name": "name_A",
        "drug-interactions.drugbank-id": "drugbank_B",
        "drug-interactions.name": "name_B",
        "drug;drug": "drugbank_A;drugbank_B",
    },
    axis=1,
)
all_drug_dfs["Interactions"] = df_drug_interactions

print(df_drug_interactions.nunique())
df_drug_interactions

drugbank_A                         1118
name_A                             1118
drugbank_B                         1131
name_B                             1131
drug-interactions.description    214891
drugbank_A;drugbank_B            214891
dtype: int64


| | drugbank_A | name_A | drugbank_B | name_B | drug-interactions.description | drugbank_A;drugbank_B |
|---|---|---|---|---|---|---|
| 0 | DB00027 | Gramicidin D | DB01418 | Acenocoumarol | The risk or severity of bleeding can be increa... | DB00027;DB01418 |
| 1 | DB00027 | Gramicidin D | DB08794 | Ethyl biscoumacetate | The risk or severity of bleeding can be increa... | DB00027;DB08794 |
| 2 | DB00027 | Gramicidin D | DB00281 | Lidocaine | The risk or severity of methemoglobinemia can ... | DB00027;DB00281 |
| 3 | DB00027 | Gramicidin D | DB00721 | Procaine | The risk or severity of methemoglobinemia can ... | DB00027;DB00721 |
| 4 | DB00027 | Gramicidin D | DB00814 | Meloxicam | The risk or severity of methemoglobinemia can ... | DB00027;DB00814 |
| ... | ... | ... | ... | ... | ... | ... |
| 214886 | DB16650 | Deucravacitinib | DB17472 | Pirtobrutinib | The risk or severity of adverse effects can be... | DB16650;DB17472 |
| 214887 | DB16690 | Tegoprazan | DB16703 | Belumosudil | The serum concentration of Belumosudil can be ... | DB16690;DB16703 |
| 214888 | DB16703 | Belumosudil | DB17472 | Pirtobrutinib | The risk or severity of adverse effects can be... | DB16703;DB17472 |
| 214889 | DB16703 | Belumosudil | DB16826 | Repotrectinib | The serum concentration of Repotrectinib can b... | DB16703;DB16826 |
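
The deduplication relies on an order-independent key: sorting the two IDs before joining them makes A;B and B;A identical. The sketch below shows the idea on three toy rows, with `";".join` standing in for `build_string` (an assumption about its separator).

```python
import pandas as pd

df = pd.DataFrame(
    {
        "drugbank_A": ["DB00027", "DB01418", "DB00027"],
        "drugbank_B": ["DB01418", "DB00027", "DB00281"],
    }
)
# Sorting the pair makes the key independent of direction
df["pair"] = df[["drugbank_A", "drugbank_B"]].apply(
    lambda row: ";".join(sorted(row.values)), axis=1
)
unique_pairs = df.drop_duplicates(subset=["pair"]).reset_index(drop=True)
print(unique_pairs)
```

Note that `drop_duplicates` keeps the first occurrence, so the description that survives for a pair is the one from whichever direction was encountered first.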


#### Extract proteins

In [11]:
idx = 0
data = defaultdict(dict)
prefix = False
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    prefix = f"proteins." if prefix else ""
    for ptype in ["targets", "enzymes", "carriers", "transporters"]:
        elements = drug.findall(
            f"{DRUGBANK_NS}{ptype}/{DRUGBANK_NS}{strip_plural(ptype)}"
        )
        for element in elements:
            for subelement in element.findall(f"{DRUGBANK_NS}polypeptide"):
                data[idx].update({"drugbank-id": drugbank_id, f"{prefix}type": ptype})
                # Fields of the protein entry itself; `child` avoids shadowing the
                # polypeptide `subelement` of the enclosing loop
                data[idx].update(
                    {
                        f"{prefix}{strip_ns_DrugBank(child.tag)}": child.text
                        for child in element
                        if has_value_type(child)
                    }
                )

                # Polypeptide
                key = "polypeptide"
                data[idx].update(
                    {
                        f"{prefix}{key}.uniprot-id": subelement.get("id"),
                        f"{prefix}{key}.source": subelement.get("source"),
                    }
                )
                data[idx].update(
                    {
                        f"{prefix}{key}.{strip_ns_DrugBank(subelem.tag)}": subelem.text
                        for subelem in subelement
                        if has_value_type(subelem)
                    }
                )
                subkey = "pfams"
                data[idx].update(
                    {
                        f"{prefix}{key}.{subkey}": build_string(
                            [
                                subelem.text
                                for subelem in subelement.findall(
                                    f"{DRUGBANK_NS}{subkey}/{DRUGBANK_NS}{strip_plural(subkey)}/{DRUGBANK_NS}identifier"
                                )
                                if has_value_type(subelem)
                            ]
                        )
                    }
                )

                idx += 1

df_proteins = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
df_proteins = df_proteins[
    df_proteins[f"{prefix}polypeptide.uniprot-id"].isin(uniprot_ids)
]
df_proteins = df_proteins.drop_duplicates().reset_index(drop=True)

all_drug_dfs["Proteins"] = df_proteins
df_proteins

| | drugbank-id | type | id | name | organism | known-action | polypeptide.uniprot-id | polypeptide.source | polypeptide.name | polypeptide.general-function | ... | polypeptide.theoretical-pi | polypeptide.molecular-weight | polypeptide.chromosome-location | polypeptide.organism | polypeptide.amino-acid-sequence | polypeptide.gene-sequence | polypeptide.pfams | polypeptide.signal-regions | induction-strength | inhibition-strength |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DB00027 | transporters | BE0001032 | P-glycoprotein 1 | Humans | unknown | P08183 | Swiss-Prot | Multidrug resistance protein 1 | Xenobiotic-transporting atpase activity | ... | 9.44 | 141477.255 | 7 | Humans | >lcl\|BSEQ0037114\|Multidrug resistance protein ... | >lcl\|BSEQ0016291\|Multidrug resistance protein ... | PF00005;PF00664 | | | |
| 1 | DB00030 | targets | BE0000033 | Insulin receptor | Humans | yes | P06213 | Swiss-Prot | Insulin receptor | Receptor signaling protein tyrosine kinase act... | ... | 6.18 | 156331.465 | 19 | Humans | >lcl\|BSEQ0036940\|Insulin receptor\nMATGGRRGAAA... | >lcl\|BSEQ0020443\|Insulin receptor (INSR)\nATGG... | PF07714;PF00041;PF00757;PF01030 | 1-27 | | |
| 2 | DB00035 | enzymes | BE0000262 | Prostaglandin G/H synthase 2 | Humans | unknown | P35354 | Swiss-Prot | Prostaglandin G/H synthase 2 | Prostaglandin-endoperoxide synthase activity | ... | 7.41 | 68995.625 | 1 | Humans | >lcl\|BSEQ0021832\|Prostaglandin G/H synthase 2\\... | >lcl\|BSEQ0021833\|Prostaglandin G/H synthase 2 ... | PF03098;PF00008 | 1-17 | unknown | |
| 3 | DB00041 | enzymes | BE0000262 | Prostaglandin G/H synthase 2 | Humans | unknown | P35354 | Swiss-Prot | Prostaglandin G/H synthase 2 | Prostaglandin-endoperoxide synthase activity | ... | 7.41 | 68995.625 | 1 | Humans | >lcl\|BSEQ0021832\|Prostaglandin G/H synthase 2\\... | >lcl\|BSEQ0021833\|Prostaglandin G/H synthase 2 ... | PF03098;PF00008 | 1-17 | unknown | |
| 4 | DB00041 | enzymes | BE0000657 | Cytosolic phospholipase A2 | Humans | unknown | P47712 | Swiss-Prot | Cytosolic phospholipase A2 | Phospholipase a2 activity | ... | 5.03 | 85238.2 | 1 | Humans | >lcl\|BSEQ0010456\|Cytosolic phospholipase A2\nM... | >lcl\|BSEQ0010457\|Cytosolic phospholipase A2 (P... | PF00168;PF01735 | | unknown | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3746 | DB16826 | transporters | BE0001032 | P-glycoprotein 1 | Humans | no | P08183 | Swiss-Prot | Multidrug resistance protein 1 | Xenobiotic-transporting atpase activity | ... | 9.44 | 141477.255 | 7 | Humans | >lcl\|BSEQ0037114\|Multidrug resistance protein ... | >lcl\|BSEQ0016291\|Multidrug resistance protein ... | PF00005;PF00664 | | | |
| 3747 | DB17083 | transporters | BE0001067 | ATP-binding cassette sub-family G member 2 | Humans | no | Q9UNQ0 | Swiss-Prot | ATP-binding cassette sub-family G member 2 | Xenobiotic-transporting atpase activity | ... | 8.9 | 72313.47 | 4 | Humans | >lcl\|BSEQ0002125\|ATP-binding cassette sub-fami... | >lcl\|BSEQ0016303\|ATP-binding cassette sub-fami... | PF00005;PF01061 | | | |
| 3748 | DB17472 | transporters | BE0001032 | P-glycoprotein 1 | Humans | unknown | P08183 | Swiss-Prot | Multidrug resistance protein 1 | Xenobiotic-transporting atpase activity | ... | 9.44 | 141477.255 | 7 | Humans | >lcl\|BSEQ0037114\|Multidrug resistance protein ... | >lcl\|BSEQ0016291\|Multidrug resistance protein ... | PF00005;PF00664 | | | |
| 3749 | DB17472 | transporters | BE0001067 | ATP-binding cassette sub-family G member 2 | Humans | unknown | Q9UNQ0 | Swiss-Prot | ATP-binding cassette sub-family G member 2 | Xenobiotic-transporting atpase activity | ... | 8.9 | 72313.47 | 4 | Humans | >lcl\|BSEQ0002125\|ATP-binding cassette sub-fami... | >lcl\|BSEQ0016303\|ATP-binding cassette sub-fami... | PF00005;PF01061 | | | |
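
The `polypeptide.*` columns come from flattening a nested element into one record with dotted keys. The sketch below shows that flattening on a toy transporter entry with a single polypeptide. `has_text` is a hypothetical stand-in for `has_value_type`, the `NS` value is assumed, and `";".join` is assumed to mirror `build_string`.

```python
from xml.etree import ElementTree

NS = "{http://www.drugbank.ca}"  # assumed value of DRUGBANK_NS


def has_text(el):
    """Hypothetical stand-in for has_value_type: non-blank text only."""
    return el.text is not None and el.text.strip() != ""


def flatten_protein(element, ptype):
    """Flatten a protein entry with one polypeptide into a dotted-key dict."""
    record = {"type": ptype}
    record.update({el.tag.replace(NS, ""): el.text for el in element if has_text(el)})
    poly = element.find(f"{NS}polypeptide")
    record["polypeptide.uniprot-id"] = poly.get("id")
    record["polypeptide.source"] = poly.get("source")
    record.update(
        {f"polypeptide.{c.tag.replace(NS, '')}": c.text for c in poly if has_text(c)}
    )
    pfam_path = f"{NS}pfams/{NS}pfam/{NS}identifier"
    record["polypeptide.pfams"] = ";".join(el.text for el in poly.findall(pfam_path))
    return record


# Toy entry for illustration only
entry = ElementTree.fromstring(
    """
<transporter xmlns="http://www.drugbank.ca">
  <id>BE0001032</id>
  <name>P-glycoprotein 1</name>
  <polypeptide id="P08183" source="Swiss-Prot">
    <name>Multidrug resistance protein 1</name>
    <pfams>
      <pfam><identifier>PF00005</identifier></pfam>
      <pfam><identifier>PF00664</identifier></pfam>
    </pfams>
  </polypeptide>
</transporter>
"""
)
record = flatten_protein(entry, "transporters")
print(record)
```

Container tags such as `polypeptide` and `pfams` hold only whitespace as text, so the text filter skips them and they are filled in explicitly afterwards.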


In [12]:
idx = 0
data = defaultdict(dict)
for drug in root:
    drugbank_id = drug.findtext(f"{DRUGBANK_NS}drugbank-id[@primary='true']")
    # Get only drugbank IDs specified
    if drugbank_ids and drugbank_id not in drugbank_ids:
        continue
    for key in ["snp-effects", "snp-adverse-drug-reactions"]:
        elements = drug.findall(
            f"{DRUGBANK_NS}{key}/{DRUGBANK_NS}{strip_plural(key.split('-')[-1])}"
        )
        for element in elements:
            data[idx].update({"drugbank-id": drugbank_id})
            data[idx].update(
                {
                    f"{strip_ns_DrugBank(subelement.tag)}": subelement.text
                    for subelement in element
                    if has_value_type(subelement)
                }
            )
            idx += 1
df_snp_data = (
    pd.DataFrame.from_dict(data, orient="index")
    .fillna("")
    .drop_duplicates()
    .reset_index(drop=True)
)
df_snp_data = df_snp_data[df_snp_data["uniprot-id"].isin(uniprot_ids)]
df_snp_data = df_snp_data.drop_duplicates().reset_index(drop=True)
all_drug_dfs["SNP"] = df_snp_data

df_snp_data

| | drugbank-id | protein-name | gene-symbol | uniprot-id | rs-id | defining-change | description | pubmed-id | allele | adverse-reaction |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DB00215 | Multidrug resistance protein 1 | ABCB1 | P08183 | rs2032583 | C Allele | Patients with this genotype have an increased ... | 17913323 | | |
| 1 | DB00285 | Multidrug resistance protein 1 | ABCB1 | P08183 | rs2032583 | C Allele | Patients with this genotype have an increased ... | 17913323 | | |
| 2 | DB00285 | Multidrug resistance protein 1 | ABCB1 | P08183 | rs2032583 | T > C | Patients with this genotype have increased ris... | 22641028 | | |
| 3 | DB00295 | Multidrug resistance protein 1 | ABCB1 | P08183 | rs1045642 | T Allele | Patients with this genotype may have an increa... | 17898703 | | |
| 4 | DB00317 | ATP-binding cassette sub-family G member 2 | ABCG2 | Q9UNQ0 | rs2231142 | | Patients with this genotype have an increased ... | 17148776 | | A allele |
| 5 | DB00321 | Multidrug resistance protein 1 | ABCB1 | P08183 | rs2032583 | C Allele | Patients with this genotype have an increased ... | 17913323 | | |
| 6 | DB00321 | Multidrug resistance protein 1 | ABCB1 | P08183 | rs2032583 | T > C | Patients with this genotype have increased ris... | 22641028 | | |
| 7 | DB00352 | Thiopurine S-methyltransferase | TPMT | P51580 | rs1800462 | | The presence of this polymorphism in TPMT may ... | 21270794 | TPMT\*2 | G Allele |
| 8 | DB00352 | Thiopurine S-methyltransferase | TPMT | P51580 | rs1800460 | | The presence of this polymorphism in TPMT may ... | 21270794 | TPMT\*3A | A Allele |
| 9 | DB00352 | Thiopurine S-methyltransferase | TPMT | P51580 | rs1142345 | | The presence of this polymorphism in TPMT may ... | 21270794 | TPMT\*3C | G Allele |


## Export drug data for subsequent visualization

In [13]:
print(list(all_drug_dfs.keys()))
for sheet_name, df in all_drug_dfs.items():
    df.to_csv(f"{database_dirpath}/{sheet_name}_DrugBank.tsv", sep="\t")

['General', 'Categories', 'ATC', 'Interactions', 'Proteins', 'SNP']
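
Because the export writes the DataFrame index as the first column, a downstream notebook would need `index_col=0` when reading the files back. The sketch below round-trips a toy table through a temporary directory; `keep_default_na=False` is one way to keep the empty strings produced by `fillna("")` from turning into `NaN` on load.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame(
    {
        "drugbank-id": ["DB00027", "DB16826"],
        "products": ["Antibiotic Cream", ""],
    }
)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "General_DrugBank.tsv"
    df.to_csv(path, sep="\t")
    # index_col=0 restores the saved index instead of an "Unnamed: 0" column
    loaded = pd.read_csv(path, sep="\t", index_col=0, keep_default_na=False)
print(loaded)
```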
