## Bayesian neural network with pretrained protein embedding enhances prediction accuracy of drug-protein interaction

ABSTRACT: The characterization of drug-protein interactions is crucial in high-throughput screening for drug
discovery. Deep learning-based approaches have attracted attention because they can predict
drug-protein interactions without human trial and error. However, because data labeling requires
significant resources, the available protein data size is relatively small, which consequently decreases
model performance. Here we propose two methods to construct a deep learning framework that exhibits
superior performance with a small labeled dataset. First, we use transfer learning to encode protein
sequences with a pretrained model, which learns general sequence representations in an unsupervised
manner. Second, we use a Bayesian neural network to make a robust model by estimating the data
uncertainty. As a result, our model performs better than the previous baselines for predicting
drug-protein interactions (DPI). We also show that the quantified uncertainty from the Bayesian inference is related
to the confidence and can be used for screening DPI data points.

Link to paper: https://arxiv.org/pdf/2012.08194v2.pdf

Credit: https://github.com/QHwan/PretrainDPI

In [1]:
# Clone the repository and cd into directory
!git clone https://github.com/QHwan/PretrainDPI.git
%cd PretrainDPI

/content/PretrainDPI


In [None]:
# Install dependencies / requirements
!pip install rdkit-pypi==2021.3.1.5 fair-esm
!pip install torch==1.4.0
!pip install torch-geometric \
  torch-sparse==latest+cu101 \
  torch-scatter==latest+cu101 \
  torch-cluster==latest+cu101 \
  -f https://pytorch-geometric.com/whl/torch-1.4.0.html

This code performs binary classification of drug-protein interactions using a pretrained protein embedding and a Bayesian neural network.

First, the raw data must be converted into a format ready for deep learning. To do so, run `preprocess.py` in the `./data` folder.

In [12]:
# Use an absolute path: the first cell already changed into /content/PretrainDPI
%cd /content/PretrainDPI/data
!python preprocess.py --dataset human --n_split 10 --pretrained transformer12
%cd ..

/content/PretrainDPI/data
Number of pairs: 6727
Number of positive interactions: 3369
Number of negative interactions: 3358
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t12_85M_UR50S.pt" to /root/.cache/torch/hub/checkpoints/esm1_t12_85M_UR50S.pt
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/regression/esm1_t12_85M_UR50S-contact-regression.pt" to /root/.cache/torch/hub/checkpoints/esm1_t12_85M_UR50S-contact-regression.pt
  2% 52/3364 [03:56<6:30:46,  7.08s/it]
tcmalloc: large alloc 1987010560 bytes == 0x560bd3ff0000 @ [stack addresses omitted]

We use a pretrained model from Rives et al. (bioRxiv 622803; https://doi.org/10.1101/622803). The model is available in the GitHub repository (github.com/facebookresearch/esm). If the model is not already cached locally, `preprocess.py` downloads it automatically.
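A pretrained sequence model of this kind returns one vector per residue, while a classifier needs one fixed-length vector per protein. The sketch below shows one common way to bridge the two, averaging over residues. It is illustrative only: whether `preprocess.py` pools this way is an assumption, as is the layout (one start token in row 0, padding after the sequence). The helper `mean_pool` and the toy array are hypothetical stand-ins for real model output.

```python
import numpy as np

def mean_pool(residue_repr, seq_len):
    """Average per-residue vectors over the real residues of one sequence.

    residue_repr: array of shape (padded_len, dim). Row 0 is assumed to be
    a start token and rows after seq_len are assumed to be padding.
    """
    return residue_repr[1:seq_len + 1].mean(axis=0)

# Toy stand-in for model output: 1 start token + 4 residues + 2 padding rows.
dim = 3
residue_repr = np.arange(7 * dim, dtype=float).reshape(7, dim)

protein_vec = mean_pool(residue_repr, seq_len=4)
print(protein_vec.shape)  # one fixed-length vector of size dim
```

Pooling this way gives every protein a vector of the same size regardless of sequence length, so proteins can be batched together for training.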

The repository provides two training scripts, `main_nn.py` and `main_dropout.py`. The latter performs Bayesian training with the MC-dropout method, using concrete dropout.
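The idea behind MC-dropout can be sketched in a few lines of NumPy: dropout stays active at prediction time, the network is run several times, and the spread of the outputs serves as an uncertainty estimate. This is a minimal sketch, not the code in `main_dropout.py`. It uses a fixed dropout rate rather than concrete dropout (which learns the rate), and the two-layer network, the function `mc_dropout_predict`, the random weights, and the 0.1 screening threshold are all hypothetical.

```python
import numpy as np

def mc_dropout_predict(x, w1, w2, p_drop=0.1, n_samples=50, seed=0):
    """Run n_samples stochastic forward passes with dropout kept on.

    Returns the mean predicted probability and its standard deviation
    across passes, one value per input row.
    """
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_samples):
        hidden = np.maximum(x @ w1, 0.0)             # ReLU hidden layer
        mask = rng.random(hidden.shape) >= p_drop    # keep a unit with prob 1 - p_drop
        hidden = hidden * mask / (1.0 - p_drop)      # inverted dropout scaling
        logits = hidden @ w2
        probs.append(1.0 / (1.0 + np.exp(-logits)))  # sigmoid for binary output
    probs = np.stack(probs)                          # shape (n_samples, n_inputs)
    return probs.mean(axis=0), probs.std(axis=0)

# Toy inputs and random weights, purely illustrative.
rng = np.random.default_rng(42)
x = rng.normal(size=(5, 8))
w1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=16)

mean_prob, uncertainty = mc_dropout_predict(x, w1, w2)

# Screen predictions: keep those whose spread is below a hypothetical threshold.
confident = uncertainty < 0.1
print(mean_prob.round(3), uncertainty.round(3), confident)
```

Averaging over passes gives the predicted interaction probability, and the standard deviation gives a per-sample score that can be thresholded to set aside uncertain drug-protein pairs.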

In [3]:
!python main_dropout.py --dataset_file './data/human/human_transformer12.npz' --save_model './saved_models' --save_result './results'

Counter({1: 2696, 0: 2688})
Counter({1: 337, 0: 336})
Counter({1: 336, 0: 334})
Time: 3.22, Epoch: 1, Train Loss: 0.5719, Val Loss: 0.4750, Val Score: 0.8476, Best Score: 0.0000
Time: 3.20, Epoch: 2, Train Loss: 0.4188, Val Loss: 0.3806, Val Score: 0.9101, Best Score: 0.8476
Time: 3.25, Epoch: 3, Train Loss: 0.3949, Val Loss: 0.3681, Val Score: 0.9193, Best Score: 0.9101
Time: 3.23, Epoch: 4, Train Loss: 0.3711, Val Loss: 0.3875, Val Score: 0.9208, Best Score: 0.9193
Time: 3.22, Epoch: 5, Train Loss: 0.3505, Val Loss: 0.3060, Val Score: 0.9413, Best Score: 0.9208
Time: 3.15, Epoch: 6, Train Loss: 0.3451, Val Loss: 0.3230, Val Score: 0.9390, Best Score: 0.9413
Time: 3.14, Epoch: 7, Train Loss: 0.3259, Val Loss: 0.2929, Val Score: 0.9488, Best Score: 0.9413
Time: 3.17, Epoch: 8, Train Loss: 0.3122, Val Loss: 0.2888, Val Score: 0.9482, Best Score: 0.9488
Time: 3.18, Epoch: 9, Train Loss: 0.3085, Val Loss: 0.2769, Val Score: 0.9538, Best Score: 0.9488
Time: 3.20, Epoch: 10, Train Loss: 0.2