# Processing Drug-Target Interaction Data

This notebook covers:
- Converting merged DTI data into an `h5torch` dataset
- Splitting the dataset (stratified) into train/val/test in two settings: a random split and a cold-start split
- Computing embeddings from foundation models and storing them in the `h5torch` file
    - Drugs: `MMELON` (graph, image, text), and `RDKit` fingerprints
    - Targets: `NT`, `ESM`, and `ESPF` fingerprints
- Visualizing the foundation model embeddings
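
The difference between the two split settings can be illustrated with a minimal sketch. It is not the splitting code used in this project: the helper names `random_split` and `cold_start_split`, the toy pairs, and the 20% test fraction are all assumptions, and stratification is left out for brevity.

```python
import random


def random_split(pairs, test_frac=0.2, seed=0):
    """Random split: shuffle interactions, so a drug may appear in both train and test."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_frac))
    return shuffled[n_test:], shuffled[:n_test]


def cold_start_split(pairs, test_frac=0.2, seed=0):
    """Cold-start split on drugs: hold out whole drugs, so test drugs are unseen in training."""
    rng = random.Random(seed)
    drugs = sorted({drug for drug, _ in pairs})
    rng.shuffle(drugs)
    n_test = max(1, int(round(len(drugs) * test_frac)))
    held_out = set(drugs[:n_test])
    train = [p for p in pairs if p[0] not in held_out]
    test = [p for p in pairs if p[0] in held_out]
    return train, test


# Toy data (hypothetical): 10 drugs x 3 targets, 30 unique (drug, target) pairs
pairs = [(f"D{i % 10}", f"T{i % 3}") for i in range(30)]

rand_train, rand_test = random_split(pairs)
cold_train, cold_test = cold_start_split(pairs)

# In the cold-start setting, no drug is shared between train and test
shared = {d for d, _ in cold_train} & {d for d, _ in cold_test}
print(len(cold_train), len(cold_test), shared)
```

The same idea applies to a cold-start split on targets by grouping on the second element of each pair instead of the first.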

In [1]:
from resolve import *

Setting working directory to: /home/robsyc/Desktop/thesis/MB-VAE-DTI


The drug and protein embedding generation was offloaded to an HPC. We used:
- A Digital Ocean droplet with a 48 GB NVIDIA L40S GPU
- Ubuntu 22.04, Python 3.11, and basic virtual environments

Due to dependency conflicts between the foundation models, we had to create a separate venv for each model (basic `requirements.txt` files can be found in the corresponding folders in the `external` directory). Check the `scripts/embedding.sh` file for more details.

The `embeddings.sh` script creates HDF5 files in the `external/temp` directory: `dti_smiles.hdf5`, `dti_aa.hdf5`, and `dti_dna.hdf5` for the DTI dataset, and `pretrain_smiles.hdf5`, `pretrain_aa.hdf5`, and `pretrain_dna.hdf5` for the pre-training datasets.

The `h5torch_creation.py` script then uses these files to construct three `h5torch` files: `dti.h5torch`, `drugs.h5torch`, and `targets.h5torch`. Below we inspect the structure and contents of these files and show how they are used to instantiate the `PretrainDataset` and `DTIDataset` dataloaders.

> Note: The pretrain dataset `drugs.h5torch` was limited to 2 million entities (from the original 3,460,396) due to storage constraints. See `cap_drugs_h5torch.py` for more details.
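
As a rough illustration of what such a cap involves, the sketch below draws a fixed-size random subset of row indices. It is only an assumption about how a cap could be done: the helper `cap_indices` is hypothetical, and the actual logic lives in `cap_drugs_h5torch.py` and may differ (for example, it might keep the first rows rather than a random sample).

```python
import random


def cap_indices(n_total, n_keep, seed=0):
    """Return a sorted, reproducible random subset of row indices (hypothetical helper)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), n_keep))


# Small demo; the real sizes would be n_total=3_460_396 and n_keep=2_000_000
kept = cap_indices(n_total=1000, n_keep=578)

# Fraction of the original drug set retained after capping (a little under 58%)
fraction = 2_000_000 / 3_460_396
print(len(kept), round(fraction, 3))
```

Sorting the kept indices preserves the original row order, which is convenient when reading slices from an HDF5 file.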

## Pretrain Datasets

In [12]:
from pathlib import Path  # explicit import; may already be provided by `from resolve import *`
from mb_vae_dti.processing.h5factory import inspect_h5torch_file

output_dir = Path("/home/robsyc/Desktop/thesis/MB-VAE-DTI/data/input")

target_output_file = output_dir / "targets.h5torch"
inspect_h5torch_file(target_output_file)

drug_output_file = output_dir / "drugs.h5torch"
inspect_h5torch_file(drug_output_file)

2025-05-30 18:00:34,739 - INFO - --- Inspecting H5torch File: targets.h5torch ---
2025-05-30 18:00:34,741 - INFO - --- Finished Inspecting: targets.h5torch ---
2025-05-30 18:00:34,742 - INFO - --- Inspecting H5torch File: drugs.h5torch ---
2025-05-30 18:00:34,743 - INFO - --- Finished Inspecting: drugs.h5torch ---



[Root Attributes]
  - entity_type: target
  - n_items: 190851

[Central Dataset]
  Mode: N/A (Implicitly N-D or similar)
    - Name: central
      - Path: /central
      - Shape/Length: (190851,)
      - Saved Dtype: uint32

[Aligned Axes]

  --- Axis 0 ---
    - Name: EMB-ESM
      - Path: /0/EMB-ESM
      - Shape/Length: (190851, 1152)
      - Saved Dtype: float32
    - Name: EMB-NT
      - Path: /0/EMB-NT
      - Shape/Length: (190851, 1024)
      - Saved Dtype: float32
    - Name: FP-ESP
      - Path: /0/FP-ESP
      - Shape/Length: (190851, 4170)
      - Saved Dtype: uint8
    - Name: aa
      - Path: /0/aa
      - Shape/Length: Length: 190851
      - Saved Dtype: |S1280
    - Name: dna
      - Path: /0/dna
      - Shape/Length: Length: 190851
      - Saved Dtype: |S3843

[Unstructured Datasets]
    - Name: is_train
      - Path: /unstructured/is_train
      - Shape/Length: (190851,)
      - Saved Dtype: bool

[Root Attributes]
  - entity_type: drug
  - n_items: 2000000

[Central

In [31]:
from mb_vae_dti.processing.h5datasets import PretrainDataset
from external.ESPF.script import get_target_fingerprint
import numpy as np

targets_pretrain_training = PretrainDataset(
    h5_path=target_output_file,
    subset_filters={'split_col': 'is_train', 'split_value': True}
)
sample = targets_pretrain_training[42]
for key, value in sample.items():
    print(key, value)

# Sanity check: the stored ESPF fingerprint matches one recomputed from the amino-acid sequence
np.all(sample["features"]["FP-ESP"] == get_target_fingerprint(sample["representations"]["aa"]))

2025-05-30 18:15:27,808 - INFO - Subset mask for targets.h5torch: kept 171765 / 190851 items
2025-05-30 18:15:27,810 - INFO - Initialized PretrainDataset from targets.h5torch. Size: 171765 items.
2025-05-30 18:15:27,811 - INFO -   Features (Axis 0): ['EMB-ESM', 'EMB-NT', 'FP-ESP']
2025-05-30 18:15:27,811 - INFO -   Representations (Axis 0): ['aa', 'dna']


id 49
representations {'aa': 'MAAAMTFCRLLNRCGEAARSLPLGARCFGVRVSPTGEKVTHTGQVYDDKDYRRIRFVGRQKEVNENFAIDLIAEQPVSEVETRVIACDGGGGALGHPKVYINLDKETKTGTCGYCGLQFRQHHH', 'dna': 'ATGGCGGCGGCGATGACCTTCTGCCGGCTGCTGAACCGGTGCGGCGAGGCGGCGCGGAGCCTGCCCCTGGGCGCCAGGTGTTTCGGGGTGCGGGTCTCGCCGACCGGGGAGAAGGTCACGCACACTGGCCAGGTTTATGATGATAAAGACTACAGGAGAATTCGGTTTGTAGGTCGTCAGAAAGAGGTGAATGAAAACTTTGCCATTGATTTGATAGCAGAGCAGCCCGTGAGCGAGGTGGAGACTCGGGTGATAGCGTGCGATGGCGGCGGGGGAGCTCTTGGCCACCCAAAAGTGTATATAAACTTGGACAAAGAAACAAAAACCGGCACATGCGGTTACTGTGGGCTCCAGTTCAGACAGCACCACCACTAG'}
features {'EMB-ESM': array([-0.01264881,  0.00669643, -0.00759549, ...,  0.00806052,
        0.01426091, -0.00678943], dtype=float32), 'EMB-NT': array([ 0.3568277 ,  0.11620766, -0.11930461, ...,  0.15212396,
       -0.13019717,  0.31840327], dtype=float32), 'FP-ESP': array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)}


True
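
The log above reports that the `is_train` filter kept 171765 of 190851 targets (about 90%). A minimal sketch of how a filter such as `{'split_col': 'is_train', 'split_value': True}` could translate into a list of kept row indices is given below. The helper `subset_indices` and the toy mask are assumptions, not the `PretrainDataset` implementation.

```python
def subset_indices(split_column, split_value):
    """Return positions in the full file whose split flag equals `split_value`."""
    return [i for i, flag in enumerate(split_column) if flag == split_value]


# Toy boolean split column (hypothetical), standing in for /unstructured/is_train
is_train = [True, True, False, True, False, True, True, True]

train_idx = subset_indices(is_train, True)
val_idx = subset_indices(is_train, False)

# Position 2 in the training subset maps to row 3 of the full file,
# because row 2 was filtered out
print(train_idx, val_idx, train_idx[2])

# Fraction of targets kept for training, taken from the log above
kept_fraction = 171765 / 190851
print(round(kept_fraction, 3))
```

Under the assumption that `id` is the row index in the full file, this kind of remapping would be consistent with `targets_pretrain_training[42]` returning `id` 49.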

In [39]:
from mb_vae_dti.processing.h5datasets import PretrainDataset
from external.MorganFP.script import get_drug_fingerprint
import numpy as np

drugs_pretrain_validation = PretrainDataset(
    h5_path=drug_output_file,
    subset_filters={'split_col': 'is_train', 'split_value': False}
)
sample = drugs_pretrain_validation[42]
print(sample)

# Sanity check: the stored Morgan fingerprint matches one recomputed from the SMILES string
np.all(sample["features"]["FP-Morgan"] == get_drug_fingerprint(sample["representations"]["smiles"]))

2025-05-30 18:21:07,335 - INFO - Subset mask for drugs.h5torch: kept 200000 / 2000000 items
2025-05-30 18:21:07,342 - INFO - Initialized PretrainDataset from drugs.h5torch. Size: 200000 items.
2025-05-30 18:21:07,342 - INFO -   Features (Axis 0): ['EMB-BiomedGraph', 'EMB-BiomedImg', 'EMB-BiomedText', 'FP-Morgan']
2025-05-30 18:21:07,343 - INFO -   Representations (Axis 0): ['smiles']


{'id': 313, 'representations': {'smiles': 'CCOc1cc2c(c(O)c1OCC)C(=O)NC1C2CC(O)C(O)C1O'}, 'features': {'EMB-BiomedGraph': array([ 3.28338034e-02,  5.93008325e-02, -6.75319880e-02, -2.33449712e-01,
        4.99292940e-01,  3.82202864e-03,  8.50097910e-02, -1.92996599e-02,
       -2.97931135e-01, -9.10175368e-02, -3.02154664e-03, -5.64331174e-01,
        1.00632846e-01, -3.97299835e-03, -1.95052072e-01,  8.50920454e-02,
        6.33843467e-02,  1.54944748e-01,  4.93423976e-02,  1.09941289e-01,
       -1.31553829e-01, -2.16462798e-02, -4.77177389e-02,  3.33764516e-02,
        2.83989847e-01, -8.55271611e-03,  1.59002018e+00, -5.26811257e-02,
        4.47995365e-02,  2.23343277e+00,  7.47756287e-03, -2.32666992e-02,
       -2.66399048e-02,  5.62607646e-02,  2.10713923e-01, -4.17282850e-01,
       -2.69951411e-02,  8.95682499e-02, -8.06409940e-02, -1.74574982e-02,
        4.65488546e-02,  7.21551552e-02,  2.56469473e-02,  4.38158214e-02,
       -6.77484721e-02, -1.05443504e-02, -2.94203628e-

True

## DTI Dataset

In [None]:
import pandas as pd

df = pd.read_csv("data/processed/dti.csv")
df

In [10]:
from pathlib import Path
from mb_vae_dti.processing.h5factory import inspect_h5torch_file
import pandas as pd

data_dir = Path("/home/robsyc/Desktop/thesis/MB-VAE-DTI/data/processed")

dti_df = pd.read_csv(data_dir / "dti.csv")

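# NOTE: `temp_dir` (pointing at `external/temp`) and `output_dir` must already be defined in an earlier cell
dti_target_input_files = [temp_dir / "dti_aa.hdf5", temp_dir / "dti_dna.hdf5"]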
dti_drug_input_files = [temp_dir / "dti_smiles.hdf5"]

dti_output_file = output_dir / "dti.h5torch"

# create_dti_h5torch(
#     dti_df,
#     dti_drug_input_files,
#     dti_target_input_files,
#     dti_output_file
# )

inspect_h5torch_file(dti_output_file)

2025-05-30 14:31:40,695 - INFO - --- Inspecting H5torch File: dti.h5torch ---
2025-05-30 14:31:40,701 - INFO - --- Finished Inspecting: dti.h5torch ---



[Root Attributes]
  - created_at: 2025-05-02T10:28:30.834915
  - n_drugs: 126811
  - n_interactions: 339197
  - n_targets: 1976
  - sparsity: 0.0013536554463707139

[Central Dataset]
  Mode: coo
  Shape (Attr): [126811   1976]
    - Name: indices
      - Path: /central/indices
      - Shape/Length: (2, 339197)
      - Saved Dtype: int64
    - Dataset 'values' not found or not a dataset.
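
The central object is stored in COO form: a `(2, n_interactions)` index array with no accompanying `values` dataset. The sketch below shows how such an array can be read as a list of (drug, target) pairs, and checks that the reported `sparsity` attribute equals the fraction of observed pairs. The toy array and the convention that row 0 holds drug indices and row 1 holds target indices are assumptions.

```python
# Toy COO index array: row 0 = drug indices, row 1 = target indices (assumed convention)
indices = [
    [0, 0, 2, 3],
    [1, 4, 0, 4],
]
n_drugs_toy, n_targets_toy = 4, 5

# Each column is one observed (drug, target) interaction
pairs = list(zip(indices[0], indices[1]))

# Expand to a dense 0/1 matrix; every listed pair is marked as 1
dense = [[0] * n_targets_toy for _ in range(n_drugs_toy)]
for drug, target in pairs:
    dense[drug][target] = 1

density_toy = len(pairs) / (n_drugs_toy * n_targets_toy)

# Same computation with the root attributes reported above
n_drugs, n_targets, n_interactions = 126811, 1976, 339197
density = n_interactions / (n_drugs * n_targets)
print(pairs, density_toy, density)
```

With the real attributes this gives roughly 0.00135, matching the `sparsity` value in the file: only about 0.14% of all possible drug-target pairs have a recorded interaction.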

[Aligned Axes]

  --- Axis 0 ---
    - Name: Drug_ID
      - Path: /0/Drug_ID
      - Shape/Length: Length: 126811
      - Saved Dtype: |S7
    - Name: Drug_InChIKey
      - Path: /0/Drug_InChIKey
      - Shape/Length: Length: 126811
      - Saved Dtype: object
    - Name: EMB-BiomedGraph
      - Path: /0/EMB-BiomedGraph
      - Shape/Length: (126811, 512)
      - Saved Dtype: float32
    - Name: EMB-BiomedImg
      - Path: /0/EMB-BiomedImg
      - Shape/Length: (126811, 512)
      - Saved Dtype: float32
    - Name: EMB-BiomedText
      - Path: /0/EMB-BiomedText
      - Shape/Length: (126811, 768)
   