## Multi-Objective Molecule Generation using Interpretable Substructures

ABSTRACT: Drug discovery aims to find novel compounds with specified chemical property profiles. In terms of generative modeling, the goal is to learn to sample molecules in the intersection of multiple property constraints. This task becomes increasingly challenging when there are many property constraints. We propose to offset this complexity by composing molecules from a vocabulary of substructures that we call molecular rationales. These rationales are identified from molecules as substructures that are likely responsible for each property of interest. We then learn to expand rationales into a full molecule using graph generative models. Our final generative model composes molecules as mixtures of multiple rationale completions, and this mixture is fine-tuned to preserve the properties of interest. We evaluate our model on various drug design tasks and demonstrate significant improvements over state-of-the-art baselines in terms of accuracy, diversity, and novelty of generated compounds.

Link to paper: https://arxiv.org/pdf/2002.03244v3.pdf

Credit: https://github.com/wengong-jin/multiobj-rationale

Google Colab: https://colab.research.google.com/drive/1qmWMxUg8OSX_oxT1zcPT5tUCNxUGgz_E?usp=sharing

In [1]:
# Clone the repository and cd into directory
!git clone https://github.com/wengong-jin/multiobj-rationale.git
%cd multiobj-rationale

/content/multiobj-rationale


In [None]:
# Install chemprop module
!pip install chemprop

# Install RDKit
!pip install rdkit-pypi==2021.3.1.5

### Property Predictors

The property predictors for GSK3 and JNK3 are provided in `data/gsk3/gsk3.pkl` and `data/jnk3/jnk3.pkl`. For example, to predict properties of given molecules, run

In [3]:
!python properties.py --prop jnk3 < data/jnk3/rationales.txt
!python properties.py --prop gsk3,jnk3 < data/dual_gsk3_jnk3/rationales.txt

COC1CCC(C(=O)Nc2cc(-c3ccnc(Nc4ccc(F)cc4)c3)ccn2)CC1 c1cc(-c2ccn[cH:1]c2)cc(Nc2cc[cH:1]cc2)n1 0.71
COC1CCC(C(=O)Nc2cc(-c3ccnc(Nc4ccc(F)cc4)c3)ccn2)CC1 c1cc(-c2ccnc([NH2:1])c2)cc(Nc2cc[cH:1]cc2)n1 0.78
Cn1ccc(-c2cc(Nc3ccc(-n4cnc(N5CCOCC5)n4)cc3)ncc2F)n1 c1c[cH:1]ccc1Nc1cc(-c2cc[nH:1]n2)[cH:1]cn1 0.67
Cn1ccc(-c2cc(Nc3ccc(-n4cnc(N5CCOCC5)n4)cc3)ncc2F)n1 c1cc(-n2cn[cH:1]n2)ccc1Nc1cc([CH3:1])[cH:1]cn1 0.53
Oc1nnc(-c2ccnc(Nc3ccccc3)c2)n1-c1ccc2ccccc2c1 c1ccc(Nc2cc(-c3nn[cH:1][nH:1]3)ccn2)cc1 0.63
Oc1nnc(-c2ccnc(Nc3ccccc3)c2)n1-c1ccc2ccccc2c1 c1ccc(Nc2cc(-c3nn[cH:1]n3[CH3:1])ccn2)cc1 0.65
NC(=O)c1ccccc1Nc1ccnc(Oc2ccccc2)c1 NC(=O)c1ccccc1Nc1ccnc(O[CH3:1])c1 0.52
Cc1[nH]c(C=C2C(=O)Nc3ccc(F)cc32)c(C)c1C(=O)NCC(O)CN1CCOCC1 Cc1[nH]c(C=C2C(=O)Nc3cc[cH:1]cc32)c(C)[cH:1]1 0.69
Cc1[nH]c(C=C2C(=O)Nc3ccc(F)cc32)c(C)c1C(=O)NCC(O)CN1CCOCC1 Cc1[nH]c(C=C2C(=O)Nc3cc[cH:1]cc32)c(C)c1[CH3:1] 0.51
O=Nc1c(O)n(Cc2cc(F)cc3c2OCOC3)c2cccc(-c3cccc(F)c3)c12 O=Nc1c(O)n(Cc2c[cH:1]c[cH:1][cH:1]2)c2ccc[cH:1]c12 0.91
O=Nc1c

### Rationale Extraction

The rationale extraction module produces a list of triplets (molecule, rationale, score), where molecule is an active compound, rationale is a subgraph that explains the property, and score is the rationale's predicted property score. The following script uses 4 CPU cores (adjustable with the `--ncpu` argument):

In [None]:
!python mcts.py --data data/jnk3/actives.txt --prop jnk3 --ncpu 4 > jnk3_rationales.txt
!python mcts.py --data data/gsk3/actives.txt --prop gsk3 --ncpu 4 > gsk3_rationales.txt
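The triplet output can be loaded back into Python for inspection. The following is a minimal sketch, assuming each line holds three whitespace-separated fields (molecule SMILES, rationale SMILES, score) as in the `properties.py` output shown earlier. The names `RationaleRecord` and `parse_rationale_lines` are hypothetical helpers and are not part of the repository.

```python
from typing import Iterable, List, NamedTuple


class RationaleRecord(NamedTuple):
    molecule: str   # SMILES of the active compound
    rationale: str  # SMILES of the extracted subgraph (attachment atoms carry the :1 tag)
    score: float    # predicted property score of the rationale


def parse_rationale_lines(lines: Iterable[str]) -> List[RationaleRecord]:
    """Parse 'molecule rationale score' lines, skipping malformed ones."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # blank or truncated line
        try:
            score = float(fields[2])
        except ValueError:
            continue  # third field is not a number
        records.append(RationaleRecord(fields[0], fields[1], score))
    return records


# One line copied from the output above, plus a truncated line that is ignored.
sample = [
    "NC(=O)c1ccccc1Nc1ccnc(Oc2ccccc2)c1 NC(=O)c1ccccc1Nc1ccnc(O[CH3:1])c1 0.52",
    "O=Nc1c",
]
records = parse_rationale_lines(sample)
best = max(records, key=lambda r: r.score)
print(len(records), best.rationale, best.score)
```

To read a real file, pass an open file handle instead of `sample`, for example `parse_rationale_lines(open("jnk3_rationales.txt"))`.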

To construct multi-property rationales, we can merge the single-property rationales for GSK3 and JNK3:

In [None]:
!python merge_rationale.py --rationale1 data/gsk3/rationales.txt --rationale2 data/jnk3/rationales.txt > gsk3_jnk3.txt

### Generative Model Pre-training

The molecule completion model is pre-trained on the ChEMBL dataset. To construct the training set, run

In [None]:
!python preprocess.py --train data/chembl/all.txt --ncpu 4

!mkdir chembl-processed
!mv tensor-* chembl-processed

To train the molecule completion model, run

In [None]:
!python gnn_train.py --train chembl-processed --save_dir ckpt/chembl-molgen

### GSK3 + JNK3 + QED + SA Molecule Design

This task seeks to design dual inhibitors against GSK3 and JNK3 under drug-likeness (QED) and synthetic accessibility (SA) constraints. The multi-property rationales are already computed in `data/gsk3_jnk3_qed_sa/rationales.txt`. This file is the subset of the GSK3-JNK3 rationales with QED > 0.6 and SA < 4.0.

#### Step 1: Fine-tuning with Policy Gradient

Given a set of rationales, the model learns to complete them into full molecules. The molecule completion model has been pre-trained on ChEMBL, and it needs to be fine-tuned so that generated molecules will satisfy all the property constraints. To fine-tune the model on the GSK3 + JNK3 + QED + SA task, run

In [None]:
!python finetune.py \
  --init_model ckpt/chembl-h400beta0.3/model.20 --save_dir ckpt/tmp/ \
  --rationale data/gsk3_jnk3_qed_sa/rationales.txt --num_decode 200 --prop gsk3,jnk3,qed,sa --epoch 30 --alpha 0.5
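To make "satisfy all the property constraints" concrete, here is a minimal sketch of a binary reward that such fine-tuning could use. It is not the code in `finetune.py`. The QED and SA cutoffs follow the thresholds quoted above for the rationales; the 0.5 activity cutoff for GSK3 and JNK3 is an assumption, and `satisfies_constraints` and `reward` are hypothetical names.

```python
from typing import Dict

# Assumed cutoffs. QED and SA mirror the rationale filter described above;
# the 0.5 activity threshold for gsk3/jnk3 is an assumption of this sketch.
ACTIVITY_MIN = 0.5
QED_MIN = 0.6
SA_MAX = 4.0


def satisfies_constraints(scores: Dict[str, float]) -> bool:
    """Return True only if every property constraint holds at once."""
    return (
        scores["gsk3"] >= ACTIVITY_MIN
        and scores["jnk3"] >= ACTIVITY_MIN
        and scores["qed"] > QED_MIN
        and scores["sa"] < SA_MAX
    )


def reward(scores: Dict[str, float]) -> float:
    """Binary reward: 1.0 for a molecule that meets all constraints, else 0.0."""
    return 1.0 if satisfies_constraints(scores) else 0.0


# Made-up property scores for two hypothetical completions.
good = {"gsk3": 0.81, "jnk3": 0.66, "qed": 0.72, "sa": 2.9}
bad = {"gsk3": 0.81, "jnk3": 0.66, "qed": 0.41, "sa": 2.9}  # fails QED
print(reward(good), reward(bad))
```

Because the reward is all-or-nothing, a completion that is highly active but fails a single constraint contributes nothing, which is what pushes the model toward the intersection of the constraints.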

#### Step 2: Molecule Generation

The molecule generation script expands the extracted rationales into full molecules. The output is a list of pairs (rationale, molecule), where molecule is a completion of rationale. In the following example, each rationale is completed 100 times, each time with a different sampled latent vector z.

In [None]:
!python decode.py --model ckpt/gsk3_jnk3_qed_sa/model.final > outputs.txt
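Since each rationale is completed many times, it can be useful to check how many distinct molecules each one yields. The sketch below assumes every line of `outputs.txt` holds two whitespace-separated SMILES strings (rationale, then molecule), which matches the pair format described above but is not verified against `decode.py`. `group_completions` is a hypothetical helper, and the second completion in the sample is made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def group_completions(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each rationale to the set of distinct molecules generated from it."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        rationale, molecule = fields
        groups[rationale].add(molecule)
    return groups


rationale = "NC(=O)c1ccccc1Nc1ccnc(O[CH3:1])c1"
sample = [
    f"{rationale} NC(=O)c1ccccc1Nc1ccnc(Oc2ccccc2)c1",
    f"{rationale} NC(=O)c1ccccc1Nc1ccnc(OCC)c1",        # made-up completion
    f"{rationale} NC(=O)c1ccccc1Nc1ccnc(Oc2ccccc2)c1",  # duplicate, counted once
]
groups = group_completions(sample)
for rat, mols in groups.items():
    print(rat, len(mols))
```

Comparing strings only catches exact duplicates. Two different SMILES strings can encode the same molecule, so canonicalizing them first (for example with RDKit) gives a more reliable count.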

#### Step 3: Evaluation

To evaluate the outputs for the four-property task, run

In [None]:
!python properties.py --prop gsk3,jnk3,qed,sa < outputs.txt | python scripts/qed_sa_dual_eval.py --ref_path data/dual_gsk3_jnk3/actives.txt

Here `--ref_path` points to the file of reference molecules, which are used to compute the novelty score.
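As an illustration of how novelty and diversity can be computed against a reference set, the sketch below uses Tanimoto similarity over fingerprints represented as plain Python sets of on-bit indices. It is not the code in `scripts/qed_sa_dual_eval.py`. The 0.4 nearest-neighbor threshold is an assumption, the toy fingerprints are made up, and a real pipeline would derive fingerprints from the molecules (for example Morgan fingerprints from RDKit).

```python
from itertools import combinations
from typing import FrozenSet, Sequence

Fingerprint = FrozenSet[int]  # indices of the bits that are set


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def novelty(generated: Sequence[Fingerprint],
            reference: Sequence[Fingerprint],
            threshold: float = 0.4) -> float:
    """Fraction of generated molecules whose nearest reference neighbor
    has similarity below `threshold` (the 0.4 default is an assumption)."""
    novel = sum(
        1 for g in generated
        if max((tanimoto(g, r) for r in reference), default=0.0) < threshold
    )
    return novel / len(generated)


def diversity(generated: Sequence[Fingerprint]) -> float:
    """One minus the mean pairwise similarity among generated molecules."""
    pairs = list(combinations(generated, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy fingerprints, made up for illustration.
reference = [frozenset({1, 2, 3, 4})]
generated = [
    frozenset({1, 2, 3, 4}),  # identical to a reference molecule, not novel
    frozenset({5, 6, 7, 8}),  # shares no bits with the reference
    frozenset({1, 5, 6, 9}),  # shares one bit with the reference
]
novelty_score = novelty(generated, reference)
diversity_score = diversity(generated)
print(round(novelty_score, 3), round(diversity_score, 3))
```

With these toy inputs, two of the three generated fingerprints fall below the similarity threshold, so the novelty is 2/3.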