## DEEPScreen: high performance drug–target interaction prediction with convolutional neural networks using 2-D structural compound representations

ABSTRACT: The identification of physical interactions between drug candidate compounds and target biomolecules is
an important process in drug discovery. Since conventional screening procedures are expensive and time
consuming, computational approaches are employed to provide aid by automatically predicting novel
drug–target interactions (DTIs). In this study, we propose a large-scale DTI prediction system,
DEEPScreen, for early stage drug discovery, using deep convolutional neural networks. One of the main
advantages of DEEPScreen is employing readily available 2-D structural representations of compounds at
the input level instead of conventional descriptors that display limited performance. DEEPScreen learns
complex features inherently from the 2-D representations, thus producing highly accurate predictions.
The DEEPScreen system was trained for 704 target proteins (using curated bioactivity data) and finalized
with rigorous hyper-parameter optimization tests. We compared the performance of DEEPScreen
against the state-of-the-art on multiple benchmark datasets to indicate the effectiveness of the
proposed approach and verified selected novel predictions through molecular docking analysis and
literature-based validation. Finally, JAK proteins that were predicted by DEEPScreen as new targets of
a well-known drug cladribine were experimentally demonstrated in vitro on cancer cells through STAT3
phosphorylation, which is the downstream effector protein. The DEEPScreen system can be exploited in the fields of drug discovery and repurposing for in silico screening of the chemogenomic space, to
provide novel DTIs which can be experimentally pursued.

Link to paper: https://pubs.rsc.org/en/content/articlepdf/2020/sc/c9sc03414e

Credit: https://github.com/cansyl/DEEPScreen

Google Colab: https://colab.research.google.com/drive/1eW7ji8wxR2RChfP6hlcaJ0WL4EYYiYOD?usp=sharing

In [1]:
# Clone the repository and cd into directory
!git clone https://github.com/cansyl/DEEPScreen.git
%cd DEEPScreen/bin

/content/DEEPScreen/bin


In [None]:
# Install requirements / dependencies
!pip install -r requirements.txt

### Descriptions of folders and files in the DEEPScreen repository

* **bin** folder includes the source code of DEEPScreen.

* **training_files** folder includes the files directly used in the training and testing of the system:
    * **chembl27_preprocessed_filtered_bioactivity_dataset.tsv.zip** is the updated version of the preprocessed and filtered ChEMBL dataset. It contains drug/compound-target interactions from the ChEMBL database (v27) after the application of multiple filtering operations to obtain a clean training set,
    * **chembl27_training_target_list.txt** is the list of target ChEMBL ids,
    * **target_training_datasets** contains a folder (e.g. CHEMBL286) for each target, where each target folder contains
        * a json file named **train_val_test_dict.json**, which includes the train/validation/test compound ids,
        * a folder named **imgs**, which holds the images of the compounds.
       
    * **chembl27_preprocessed_filtered_act_inact_comps_10.0_20.0_blast_comp_0.2.txt** contains the active and inactive compound information for each target protein in ChEMBL, after the similarity-based negative training dataset enrichment process. In this file, there are two lines for each target, in the following format:
        
        ```
       CHEMBL286_act	CHEMBL1818056,CHEMBL2115367,CHEMBL344651,CHEMBL62054, ...
       CHEMBL286_inact	CHEMBL288434,CHEMBL584926,CHEMBL406111,CHEMBL151055, ...
       ```
       
       Each line gives the comma-separated list of active/inactive compounds (i.e., the second tab-separated column: *CHEMBL1818056,C...*) for the corresponding target (i.e., the first column: *CHEMBL286_act*),
       
    * **chembl27_uniprot_mapping.txt** contains the id mapping between UniProt accessions and ChEMBL ids for proteins, in tab-separated format (Target UniProt accession, Target ChEMBL id, Target protein name and Target type).
    
* **result_files** folder contains the results of various tests/analyses.

* **2-D images of:** 
   - 409,311 ChEMBL compounds in the train/validation/test datasets of 812 target proteins of DEEPScreen can be downloaded from [here](https://drive.google.com/file/d/1E7ZpLN_fMdXmPJPP7WH3IPWPceleP_3a/view?usp=sharing)
   - all compounds (~2M) in ChEMBL v27 can be downloaded from [here](https://drive.google.com/file/d/16T8NI1Umf8A0qeLu90Akbx3ic-vdAbUO/view?usp=sharing)
   - all drugs (~11K) in DrugBank v5.1.7 can be downloaded from [here](https://drive.google.com/file/d/11vSqg1SgX7y25TbX4EzNOjWNkSFVZzek/view?usp=sharing)
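
The two-lines-per-target layout of the active/inactive compound file described above can be read with a few lines of Python. The following is a minimal sketch, not code from the repository: the function name `parse_act_inact` is hypothetical, and the sample lines contain only the four ids shown in the example (the elided remainder is left out).

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_act_inact(lines: Iterable[str]) -> Dict[str, Dict[str, List[str]]]:
    """Group compound ids by target id and by label ('act' or 'inact')."""
    targets: Dict[str, Dict[str, List[str]]] = defaultdict(dict)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # First column: '<target>_<label>'; second column: comma-separated ids.
        key, _, compounds = line.partition("\t")
        # Split on the last underscore: 'CHEMBL286_act' -> ('CHEMBL286', 'act').
        target_id, _, label = key.rpartition("_")
        targets[target_id][label] = [c.strip() for c in compounds.split(",") if c.strip()]
    return dict(targets)


# Sample lines built from the ids shown in the format example.
sample = [
    "CHEMBL286_act\tCHEMBL1818056,CHEMBL2115367,CHEMBL344651,CHEMBL62054",
    "CHEMBL286_inact\tCHEMBL288434,CHEMBL584926,CHEMBL406111,CHEMBL151055",
]
parsed = parse_act_inact(sample)
print(len(parsed["CHEMBL286"]["act"]), len(parsed["CHEMBL286"]["inact"]))
```

To read the real file, pass an open file handle instead of `sample`, since the function accepts any iterable of lines.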

### How to train DEEPScreen models and get performance results 

* Clone the Git Repository

* Download the compressed file for the target that you want to train from [here](https://www.dropbox.com/sh/as18uxmctnf39kc/AADUqZX3XAiQRU6UVp3SsBRXa?dl=0)

* Locate the zipped target file under **training_files/target_training_datasets** and unzip it

* Run the **main_training.py** script as shown below
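
The unzip step can also be done from Python, followed by a quick check that the target folder has the layout described earlier (a **train_val_test_dict.json** file and an **imgs** folder). This is a rough sketch under assumptions: the helper names `unzip_target` and `missing_items` are hypothetical, and the archive is assumed to contain a top-level folder named after the target id.

```python
import zipfile
from pathlib import Path
from typing import List


def unzip_target(zip_path: str, datasets_dir: str) -> None:
    """Extract a downloaded target archive into the datasets folder."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(datasets_dir)


def missing_items(datasets_dir: str, target_id: str) -> List[str]:
    """Return the expected entries that are absent from the target folder."""
    target_dir = Path(datasets_dir) / target_id
    missing = []
    # Expected layout per the folder description: one json file and one image folder.
    if not (target_dir / "train_val_test_dict.json").is_file():
        missing.append("train_val_test_dict.json")
    if not (target_dir / "imgs").is_dir():
        missing.append("imgs")
    return missing
```

For example, after `unzip_target(<downloaded zip>, <path to training_files/target_training_datasets>)`, an empty list from `missing_items(<same path>, "CHEMBL286")` suggests the folder is ready for training.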

### Explanation of Parameters

* **--targetid**: Target to be trained (default: CHEMBL286)

* **--model**: CNN architecture to be used (default: CNNModel1)

* **--fc1**: number of neurons in the first fully-connected layer (default: 512)

* **--fc2**: number of neurons in the second fully-connected layer (default: 256)

* **--lr**: learning rate (default: 0.001)

* **--bs**: batch size (default: 32)

* **--dropout**: dropout rate (default: 0.1)

* **--epoch**: number of epochs (default: 200)

* **--en**: the name of the experiment (default: my_experiment)
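
The options and defaults listed above map naturally onto an `argparse` parser. The following sketch only mirrors the documented flags and defaults; the argument types and the `build_parser` name are assumptions, and the actual definitions in **main_training.py** may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options (types are assumed)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented training options")
    parser.add_argument("--targetid", type=str, default="CHEMBL286", help="target to be trained")
    parser.add_argument("--model", type=str, default="CNNModel1", help="CNN architecture")
    parser.add_argument("--fc1", type=int, default=512, help="neurons in the first fully-connected layer")
    parser.add_argument("--fc2", type=int, default=256, help="neurons in the second fully-connected layer")
    parser.add_argument("--lr", type=float, default=0.001, help="learning rate")
    parser.add_argument("--bs", type=int, default=32, help="batch size")
    parser.add_argument("--dropout", type=float, default=0.1, help="dropout rate")
    parser.add_argument("--epoch", type=int, default=200, help="number of epochs")
    parser.add_argument("--en", type=str, default="my_experiment", help="experiment name")
    return parser


# Options that are not given on the command line fall back to the defaults.
args = build_parser().parse_args(["--targetid", "CHEMBL286", "--fc1", "256", "--lr", "0.01"])
print(args)
```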

#### To perform training for a target (CHEMBL286 in the example below):

In [3]:
!python main_training.py --targetid CHEMBL286 --model CNNModel1 --fc1 256 --fc2 128 --lr 0.01 --bs 64 --dropout 0.25 --epoch 100 --en my_chembl286_training

Namespace(bs=64, dropout=0.25, en='my_chembl286_training', epoch=100, fc1=256, fc2=128, lr=0.01, model='CNNModel1', targetid='CHEMBL286')
Arguments: CHEMBL286-CNNModel1-256-128-0.01-64-0.25-100-my_chembl286_training
GPU is available on this device!
Epoch :0
Training mode: True
Epoch 0 training loss: 23.956565976142883
Validation mode: True
Epoch :1
Training mode: True
Epoch 1 training loss: 12.626873254776001
Validation mode: True
Epoch :2
Training mode: True
Epoch 2 training loss: 12.199288070201874
Validation mode: True
Epoch :3
Training mode: True
Epoch 3 training loss: 11.504060626029968
Validation mode: True
Epoch :4
Training mode: True
Epoch 4 training loss: 11.481438398361206
Validation mode: True
Epoch :5
Training mode: True
Epoch 5 training loss: 11.039315849542618
Validation mode: True
Epoch :6
Training mode: True
Epoch 6 training loss: 10.671604573726654
Validation mode: True
Epoch :7
Training mode: True
Epoch 7 training loss: 10.56933107972145
Validation mode: True
Epoch :8

#### Output of the scripts
**main_training.py** creates a folder named **<experiment_name>** (given by the **--en** argument) under the **result_files/experiments** folder. Two files are created under **result_files/experiments/<experiment_name>**:
* **best_val_test_predictions-<hyperparameters_separated_by_dash>-<experiment_name>.txt** contains the predictions for the independent test dataset.
* **best_val_test_performance_results-<hyperparameters_separated_by_dash>-<experiment_name>.txt** contains the best test performance results. Sample output files for the CHEMBL286 target are given under **result_files/experiments/my_chembl286_training**.
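
Because the hyperparameter part of the file names depends on the run, it may be easier to locate the two output files by pattern than to rebuild their names. Below is a minimal sketch using only the documented prefixes and the experiment name; the `find_result_files` helper is hypothetical, and the path to the experiments folder must be given relative to the current working directory.

```python
from pathlib import Path
from typing import Dict, Optional


def find_result_files(experiments_dir: str, experiment_name: str) -> Dict[str, Optional[Path]]:
    """Locate the prediction and performance files of one experiment, if present."""
    exp_dir = Path(experiments_dir) / experiment_name
    prefixes = {
        "predictions": "best_val_test_predictions-",
        "performance": "best_val_test_performance_results-",
    }
    found: Dict[str, Optional[Path]] = {}
    for kind, prefix in prefixes.items():
        # The hyperparameter part is matched by a wildcard instead of being rebuilt.
        matches = sorted(exp_dir.glob(f"{prefix}*-{experiment_name}.txt"))
        found[kind] = matches[0] if matches else None
    return found
```

A value of `None` for either key means that no file with the expected prefix and experiment name was found in that folder.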