# PyKale Tutorial: Drug-Target Interaction Prediction using DeepDTA

| [Open In Colab](https://colab.research.google.com/github/pykale/pykale/blob/main/examples/bindingdb_deepdta/tutorial.ipynb) (click `Runtime` → `Run all (Ctrl+F9)`) |

If you are using [Google Colab](https://colab.research.google.com), you can enable a free GPU to save time by setting `Runtime` → `Change runtime type` → `Hardware accelerator: GPU`.

## Introduction
Drug-target interaction prediction is an important research area in drug discovery. It refers to predicting the binding affinity between given chemical compounds and protein targets. In this example, we train a standard DeepDTA model as a baseline on BindingDB, a public, web-accessible database of measured binding affinities.

### DeepDTA
[DeepDTA](https://academic.oup.com/bioinformatics/article/34/17/i821/5093245) models protein sequences and 1D compound representations with convolutional neural networks (CNNs). The overall architecture of DeepDTA is shown below.

![DeepDTA](figures/deepdta.png)
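
Before a 1D sequence can be fed to an embedding layer, it has to be turned into a fixed-length list of integers. The sketch below shows one common way to do this (label encoding with truncation and zero padding). The vocabulary built here is hypothetical and is not the one PyKale uses; the length 85 is taken from the drug embedding input shown in the architecture summary at the end of this tutorial.

```python
def build_vocab(sequences):
    """Assign each distinct character an integer id, starting at 1 (0 is padding)."""
    chars = sorted({ch for seq in sequences for ch in seq})
    return {ch: idx for idx, ch in enumerate(chars, start=1)}


def encode(sequence, vocab, max_length):
    """Map characters to ids, truncate to max_length, then right-pad with 0.

    Characters missing from the vocabulary are also mapped to 0 in this sketch.
    """
    ids = [vocab.get(ch, 0) for ch in sequence[:max_length]]
    return ids + [0] * (max_length - len(ids))


smiles = ["CC(=O)O", "c1ccccc1"]  # acetic acid and benzene, for illustration
vocab = build_vocab(smiles)
encoded = [encode(s, vocab, max_length=85) for s in smiles]
print(len(encoded[0]), encoded[0][:8])
```

Protein sequences can be encoded the same way with a longer maximum length.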

### Datasets
We construct **three datasets** from BindingDB, distinguished by their affinity measurement metrics (**Kd, IC50 and Ki**). They are acquired from [Therapeutics Data Commons](https://tdcommons.ai/) (TDC), a collection of machine learning tasks spanning different domains of therapeutics. The dataset statistics are shown below:

|  Metrics   | Drugs | Targets | Pairs |
|  :----:  | :----:  |   :----:  | :----:  |
| Kd  | 10,655 | 1,413 | 52,284 |
| IC50  | 549,205 | 5,078 | 991,486 |
| Ki | 174,662 | 3,070 | 375,032 |

The figure below shows the binding affinity distribution of each of the three datasets, with the metric values (x-axis) transformed into log space.

![Binding affinity distribution](figures/bindingdb.jpg)

This tutorial uses the smallest dataset, **Kd**.

## Setup

The first few blocks of code are necessary to set up the notebook execution environment and import the required modules, including PyKale.

This checks if the notebook is running on Google Colab and installs required packages.

In [None]:
if 'google.colab' in str(get_ipython()):
    print('Running on CoLab')
    !pip uninstall --yes imgaug && pip uninstall --yes albumentations && pip install git+https://github.com/aleju/imgaug.git
    !pip install rdkit-pypi torchaudio torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.10.0+cu113.html 
    !pip install git+https://github.com/pykale/pykale.git 

    !git clone https://github.com/pykale/pykale.git
    %cd pykale/examples/bindingdb_deepdta
else:
    print('Not running on CoLab')

This imports required modules.

In [None]:
import pytorch_lightning as pl
import torch
from config import get_cfg_defaults
from model import get_model
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger
from torch.utils.data import DataLoader, Subset

from kale.loaddata.tdc_datasets import BindingDBDataset
from kale.utils.seed import set_seed

## Configuration

The customized configuration used in this tutorial is stored in `./configs/tutorial.yaml`. This file overrides the defaults in `config.py` wherever it specifies a value.

To keep the running time of the whole pipeline short, we sample small train/valid/test subsets (8,000/1,000/1,000) from the original BindingDB dataset.

In [None]:
cfg_path = "./configs/tutorial.yaml"
train_subset_size, valid_subset_size, test_subset_size = 8000, 1000, 1000

cfg = get_cfg_defaults()
cfg.merge_from_file(cfg_path)
cfg.freeze()
print(cfg)

set_seed(cfg.SOLVER.SEED)
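
The override behaviour described above can be pictured as a recursive dictionary merge. The sketch below is only an analogy written with plain dictionaries, not the implementation behind `get_cfg_defaults()` or `merge_from_file()`. The key names come from this tutorial, but the values and the helper `merge_overrides` are hypothetical.

```python
import copy


def merge_overrides(defaults, overrides):
    """Return a copy of `defaults` with the values in `overrides` applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested sections so sibling keys keep their defaults.
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical values, chosen only for illustration.
defaults = {"SOLVER": {"SEED": 2020, "MAX_EPOCHS": 100}, "DATASET": {"NAME": "BindingDB_Kd"}}
overrides = {"SOLVER": {"MAX_EPOCHS": 3}}

merged = merge_overrides(defaults, overrides)
print(merged)
```

Only `MAX_EPOCHS` changes; every key that the override leaves out keeps its default value.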

## Check if a GPU is available

If a CUDA GPU is available, this should be used to accelerate the training process. The code below checks and reports on this.


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using: " + device)
gpus = 1 if device == "cuda" else 0

## Select Datasets

The training, validation and test splits are created with the `BindingDBDataset` class, reduced to random subsets with `Subset`, and loaded with `DataLoader`.

In [None]:
train_dataset = BindingDBDataset(name=cfg.DATASET.NAME, split="train", path=cfg.DATASET.PATH)
valid_dataset = BindingDBDataset(name=cfg.DATASET.NAME, split="valid", path=cfg.DATASET.PATH)
test_dataset = BindingDBDataset(name=cfg.DATASET.NAME, split="test", path=cfg.DATASET.PATH)
train_size, valid_size, test_size = len(train_dataset), len(valid_dataset), len(test_dataset)
# Draw a random subset of indices for each split, then wrap each split in a Subset.
train_sample_indices = torch.randperm(train_size)[:train_subset_size].tolist()
valid_sample_indices = torch.randperm(valid_size)[:valid_subset_size].tolist()
test_sample_indices = torch.randperm(test_size)[:test_subset_size].tolist()
train_dataset = Subset(train_dataset, train_sample_indices)
valid_dataset = Subset(valid_dataset, valid_sample_indices)
test_dataset = Subset(test_dataset, test_sample_indices)
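
The subset selection above takes the first few entries of a random permutation of the indices. As a rough standard-library illustration of the same idea (the helper `sample_indices` and the dataset size are hypothetical, and the indices drawn will differ from those of `torch.randperm`):

```python
import random


def sample_indices(dataset_size, subset_size, seed):
    """Return up to `subset_size` distinct indices from a seeded random permutation."""
    rng = random.Random(seed)
    permutation = list(range(dataset_size))
    rng.shuffle(permutation)
    # Slicing past the end is safe: a smaller dataset simply returns all indices.
    return permutation[:subset_size]


indices = sample_indices(dataset_size=36000, subset_size=8000, seed=0)
print(len(indices), len(set(indices)))
```

Because the indices come from a permutation, no sample is selected twice, and fixing the seed makes the subset reproducible.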

In [None]:
cfg.DATASET.PATH

In [None]:
train_loader = DataLoader(dataset=train_dataset, shuffle=True, batch_size=cfg.SOLVER.TRAIN_BATCH_SIZE)
valid_loader = DataLoader(dataset=valid_dataset, shuffle=True, batch_size=cfg.SOLVER.TEST_BATCH_SIZE)
test_loader = DataLoader(dataset=test_dataset, shuffle=True, batch_size=cfg.SOLVER.TEST_BATCH_SIZE)

## Setup Model

Here, we use the previously defined configuration and dataset to set up the model we will subsequently train.

In [None]:
model = get_model(cfg)

## Setup Logger

A logger is used to store output generated during and after model training. This information can be used to assess the effectiveness of the training and to identify problems.

In [None]:
tb_logger = TensorBoardLogger("tb_logs", name=cfg.DATASET.NAME)

## Setup Trainer

The trainer runs the optimization loop that fits the model parameters. Here it is configured with the minimum and maximum number of epochs, the hardware to use, the logger, and a checkpoint callback that keeps the model with the lowest validation loss.

In [None]:
checkpoint_callback = ModelCheckpoint(monitor="valid_loss", mode="min")
trainer = pl.Trainer(min_epochs=cfg.SOLVER.MIN_EPOCHS, 
                     max_epochs=cfg.SOLVER.MAX_EPOCHS, 
                     gpus=gpus, logger=tb_logger, 
                     callbacks=[checkpoint_callback])
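
Conceptually, monitoring `valid_loss` with `mode="min"` means remembering the epoch with the smallest validation loss seen so far. The toy class below sketches that bookkeeping only. It is not how `ModelCheckpoint` is implemented, the class name is hypothetical, and the loss values are made up.

```python
class BestCheckpointTracker:
    """Toy stand-in for 'monitor a metric and keep the best epoch' with mode='min'."""

    def __init__(self):
        self.best_value = float("inf")
        self.best_epoch = None

    def update(self, epoch, value):
        """Record the epoch if `value` beats the best so far; return True if it did."""
        if value < self.best_value:
            self.best_value = value
            self.best_epoch = epoch
            return True
        return False


tracker = BestCheckpointTracker()
for epoch, valid_loss in enumerate([9.1, 7.8, 8.4]):  # made-up validation losses
    tracker.update(epoch, valid_loss)
print(tracker.best_epoch, tracker.best_value)
```

In this example the last epoch is not the best one, so it is the checkpoint from the middle epoch that would be kept.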

## Train Model

Optimize model parameters using the trainer.

In [None]:
%time trainer.fit(model, train_dataloader=train_loader, val_dataloaders=valid_loader)

## Test Optimized Model

Evaluate the trained model on the test data, which was not used during training.

In [None]:
trainer.test(test_dataloaders=test_loader)

You should get a test loss of approximately 7.3 in root mean square error (RMSE). The target value ($y$) ranges over [-13, 20] in log space, so after only three epochs the model predicts the target value with an RMSE of 7.3 over a range of width 33.

To save time in this tutorial, we set the maximum number of epochs to 3 and train on a small subset (8,000/1,000/1,000). You may change these settings. Setting the maximum number of epochs to 100 and using the full dataset gives a much better result (test loss < 1).
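
To put the numbers above in context, the sketch below computes RMSE on a few made-up values and expresses the reported error as a fraction of the target range. The helper `rmse` and the toy inputs are illustrative only and are not part of PyKale.

```python
import math


def rmse(predictions, targets):
    """Root mean square error between two equal-length sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up predictions and targets in log space, for illustration only.
toy_rmse = rmse([5.0, 7.0, 9.0], [6.0, 7.0, 11.0])

# The RMSE quoted in this tutorial, relative to the width of the target range.
target_min, target_max = -13, 20
relative_error = 7.3 / (target_max - target_min)
print(round(toy_rmse, 3), round(relative_error, 3))
```

An RMSE of 7.3 corresponds to roughly a fifth of the full target range, which is why longer training on the full dataset is recommended.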

## Architecture
Below is the architecture of DeepDTA with the default hyperparameter settings.
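
Several figures in this summary can be reproduced by hand. The sketch below assumes a kernel size of 8 for every `Conv1d` layer and assumes that the first convolution of each encoder takes the sequence length (85 or 1200) as its number of input channels. Both assumptions are inferred from the parameter counts and output shapes in the summary rather than taken from the PyKale source.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights (in * out * kernel) plus one bias per output channel."""
    return in_channels * out_channels * kernel_size + out_channels


def conv1d_output_length(input_length, kernel_size):
    """Output length for stride 1 with no padding or dilation."""
    return input_length - kernel_size + 1


def linear_params(in_features, out_features):
    """Weight matrix plus one bias per output feature."""
    return in_features * out_features + out_features


KERNEL_SIZE = 8      # assumption inferred from the parameter counts
EMBEDDING_DIM = 128  # last dimension of the Embedding outputs

# First convolution: the sequence length (85 or 1200) acts as the input channels.
drug_conv1 = conv1d_params(85, 32, KERNEL_SIZE)
target_conv1 = conv1d_params(1200, 32, KERNEL_SIZE)

# Each convolution shortens the embedding dimension by KERNEL_SIZE - 1.
lengths = [EMBEDDING_DIM]
for _ in range(3):
    lengths.append(conv1d_output_length(lengths[-1], KERNEL_SIZE))

# The first Linear layer takes both 96-dimensional encodings (96 + 96 inputs).
decoder_fc1 = linear_params(96 + 96, 1024)

print(drug_conv1, target_conv1, lengths, decoder_fc1)
```

The same helpers also account for the remaining `Conv1d` and `Linear` rows, for example `conv1d_params(32, 64, 8)` and `linear_params(1024, 512)`.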

<pre>
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
├─CNNEncoder: 1-1                        [256, 96]                 --
|    └─Embedding: 2-1                    [256, 85, 128]            8,320
|    └─Conv1d: 2-2                       [256, 32, 121]            21,792
|    └─Conv1d: 2-3                       [256, 64, 114]            16,448
|    └─Conv1d: 2-4                       [256, 96, 107]            49,248
|    └─AdaptiveMaxPool1d: 2-5            [256, 96, 1]              --
├─CNNEncoder: 1-2                        [256, 96]                 --
|    └─Embedding: 2-6                    [256, 1200, 128]          3,328
|    └─Conv1d: 2-7                       [256, 32, 121]            307,232
|    └─Conv1d: 2-8                       [256, 64, 114]            16,448
|    └─Conv1d: 2-9                       [256, 96, 107]            49,248
|    └─AdaptiveMaxPool1d: 2-10           [256, 96, 1]              --
├─MLPDecoder: 1-3                        [256, 1]                  --
|    └─Linear: 2-11                      [256, 1024]               197,632
|    └─Dropout: 2-12                     [256, 1024]               --
|    └─Linear: 2-13                      [256, 1024]               1,049,600
|    └─Dropout: 2-14                     [256, 1024]               --
|    └─Linear: 2-15                      [256, 512]                524,800
|    └─Linear: 2-16                      [256, 1]                  513
==========================================================================================
Total params: 2,244,609
Trainable params: 2,244,609
Non-trainable params: 0
Total mult-adds (M): 58.08
==========================================================================================
Input size (MB): 1.32
Forward/backward pass size (MB): 429.92
Params size (MB): 8.98
Estimated Total Size (MB): 440.21