## ISMD (Inversed Synthesizable Molecular Design) Tutorial

This tutorial proceeds as follows:

1. initial setup and data preparation
2. descriptor preparation for the forward model (likelihood)
3. forward model (likelihood) preparation
4. proposal model preparation
5. a complete ISMD run

### 1.1 import packages

In [1]:
import warnings
warnings.filterwarnings('ignore')

import xenonpy
import onmt
from xenonpy.descriptor import Fingerprints
from xenonpy.inverse.iqspr import GaussianLogLikelihood
from xenonpy.contrib.ismd import ReactionDescriptor
from xenonpy.contrib.ismd import ReactantPool
from xenonpy.contrib.ismd import Reactor

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.sparse  # import the submodule explicitly; scipy.sparse.load_npz is used below

### 1.2 load data

In [2]:
# ground truth data
ground_truth_path = "/home/qiz/data/lab_database/ismd_data/STEREO_id_reactant_product_xlogp_tpsa.csv"
data = pd.read_csv(ground_truth_path)[:10000]
data.head()

Unnamed: 0,reactant_index,reactant,product,XLogP,TPSA
0,12163.22445,CCS(=O)(=O)Cl.OCCBr,CCS(=O)(=O)OCCBr,0.8,51.8
1,863.20896,CC(C)CS(=O)(=O)Cl.OCCCl,CC(C)CS(=O)(=O)OCCCl,1.6,51.8
2,249087.0,O=[N+]([O-])c1cccc2cnc(Cl)cc12,Nc1cccc2cnc(Cl)cc12,2.4,38.9
3,153658.2344,Cc1cc2c([N+](=O)[O-])cccc2c[n+]1[O-].O=P(Cl)(C...,Cc1cc2c([N+](=O)[O-])cccc2c(Cl)n1,3.3,58.7
4,297070.0,CCCCC[C@H](O)C=CC1C=CC(=O)C1CC=CCCCC(=O)O,CCCCC[C@H](O)C=CC1CCC(=O)C1CC=CCCCC(=O)O,3.8,74.6


In [3]:
# reactant pool
reactant_pool_path = "/home/qiz/data/lab_database/ismd_data/STEREO_pool.txt"

with open(reactant_pool_path, 'r') as f: 
    reactant_pool = f.read().splitlines()  # len(reactant_pool)=637645

# show the first three elements in the reactant pool
print(reactant_pool[:3])

['O=C(Cl)Oc1ccc(Cc2ccc(C(F)(F)F)cc2)cc1', 'CCc1cc(C2CCN(C(=O)OC(C)(C)C)CC2)ccc1Nc1ncc(C(F)(F)F)c(C#Cc2ccccc2CC(=O)OC)n1', 'CC(NC(=O)OCc1ccccc1)C(C)NC(=O)c1ccccc1O']


In [4]:
# similarity matrix of reactant pool
sim_matrix_path = "/home/qiz/data/lab_database/ismd_data/ZINC_sim_sparse.npz"
reactant_pool_sim = scipy.sparse.load_npz(sim_matrix_path).tocsr()

# list the indices of the molecules that are similar to the first molecule in the reactant pool
print(reactant_pool_sim[0,:].nonzero()[1].tolist())

[0, 9850, 11897, 23561, 25594, 28947, 30750, 31361, 44204, 46017, 76945, 118108, 145556, 145734, 164311, 186671, 205326, 207174, 209595, 215653, 218310, 222491, 224002, 232232, 233447, 252758, 274284, 278177, 288659, 291331, 294003, 294172, 300867, 306289, 307663, 331897, 334538, 335455, 343644, 360531, 364663, 365676, 376086, 378821, 412563, 442160, 443411, 452943, 460860, 479253, 487849, 491373, 499241, 500259, 523929, 525478, 528040, 559770, 567735, 568783, 582833, 584542, 586316, 588491, 595094, 599275, 601808, 603887, 617189]
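
The row lookup above can be tried on a toy matrix. This is only a minimal sketch: the 4x4 matrix, its entries, and the helper name `similar_indices` are made up for illustration, and it assumes (as the output above suggests) that a nonzero entry in row i marks a molecule similar to molecule i.

```python
import numpy as np
import scipy.sparse

# Hypothetical 4-molecule pool: a nonzero entry (i, j) means "j is similar to i".
rows = np.array([0, 0, 1, 2, 2, 3])
cols = np.array([0, 2, 1, 0, 2, 3])
vals = np.ones(len(rows))
toy_sim = scipy.sparse.csr_matrix((vals, (rows, cols)), shape=(4, 4))

def similar_indices(sim, i):
    """Return the sorted column indices of the nonzero entries in row i."""
    return sorted(sim[i, :].nonzero()[1].tolist())

print(similar_indices(toy_sim, 0))
```

Each molecule in this toy matrix lists itself as a neighbor, mirroring the real output above, where index 0 appears in its own row.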


### 2.1 descriptor
The data is transformed through the following steps:

index of reactant -> smiles of reactant -> smiles of product -> fingerprint of product

In [5]:
# take some samples (index of reactant)
samples = data["reactant_index"][:10].tolist()
print(samples)

['12163.22445', '863.20896', '249087', '153658.23440', '297070', '208421', '412634.601987', '10425.19854', '9361.387984.30667', '50995.305035']
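
Each sample is a string of pool indices joined by dots, one index per reactant. The sketch below shows how such a string could be resolved to a reactant SMILES. It assumes each dot-separated field is a zero-based position in the pool list; `index_to_smiles` and `toy_pool` are hypothetical names for illustration, not part of XenonPy.

```python
def index_to_smiles(index_string, pool, splitter='.'):
    """Map e.g. '0.2' to the SMILES of pool[0] and pool[2], joined by the splitter."""
    return splitter.join(pool[int(i)] for i in index_string.split(splitter))

# Toy pool for illustration only; the real pool has 637645 entries.
toy_pool = ['CCS(=O)(=O)Cl', 'CC(C)CS(=O)(=O)Cl', 'OCCBr']

print(index_to_smiles('0.2', toy_pool))
print(index_to_smiles('1', toy_pool))
```

A single-field string such as `'1'` resolves to one reactant, which matches samples like `'249087'` above.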


### 2.1.1 index of reactant -> smiles of reactant
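Obtain the SMILES from the reactant indices via the ReactantPool module.

Note: the ReactantPool is also used as the proposal model in step 4. The `proposal` call in cell 7 below returns modified reactant index strings, not SMILES.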

In [6]:
pool_obj = ReactantPool(pool_data=reactant_pool, similarity_matrix=reactant_pool_sim, splitter='.')

In [7]:
pool_obj.proposal(samples)

['249484.22445',
 '543266.20896',
 '28044',
 '399948.23440',
 '514511',
 '167418',
 '146642.601987',
 '100134.19854',
 '377994.387984.30667',
 '50995.107564']

### 2.1.2 smiles of reactant -> fingerprint of product

In [8]:
# build molecular transformer (smiles of reactant -> smiles of product)
reactor_path = "/home/qiz/data/lab_database/models/STEREO_mixed_augm_model_average_20.pt"
ChemicalReactor = Reactor()
ChemicalReactor.BuildReactor(model_list=[reactor_path], max_length=100, n_best=1, gpu=0)

In [9]:
# build fingerprint descriptor (smiles of product -> fingerprint of product)

RDKit_FPs = Fingerprints(featurizers=['ECFP', 'MACCS'], input_type='smiles')

In [10]:
# build reaction descriptor (index of reactant -> fingerprint of product)
# a combination of the reactor and the fingerprint descriptor

RD = ReactionDescriptor(descriptor_calculator=RDKit_FPs, reactor=ChemicalReactor, reactant_pool=pool_obj)

In [11]:
sample_fps = RD.transform(samples)
sample_fps.head(3)

Unnamed: 0,maccs:0,maccs:1,maccs:2,maccs:3,maccs:4,maccs:5,maccs:6,maccs:7,maccs:8,maccs:9,...,ecfp3:2038,ecfp3:2039,ecfp3:2040,ecfp3:2041,ecfp3:2042,ecfp3:2043,ecfp3:2044,ecfp3:2045,ecfp3:2046,ecfp3:2047
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 3 Log-likelihood calculator

Compute the log-likelihood of hitting the target property region for the given samples (reactant indices).

In [12]:
# set target
prop = ['XLogP', 'TPSA']
target_range = {'XLogP': (-2, 2), 'TPSA': (0, 25)}

# build the Gaussian likelihood calculator and set the target region of each property
likelihood_calculator = GaussianLogLikelihood(descriptor=RD, targets=target_range)

In [13]:
%%time

# train forward models inside ismd
likelihood_calculator.fit(data['reactant_index'], data[prop])

RDKit ERROR: [13:24:45] Can't kekulize mol.  Unkekulized atoms: 1 2 3 4 5 6 13
RDKit ERROR: 
RDKit ERROR: [13:24:45] Can't kekulize mol.  Unkekulized atoms: 6 13 14 15 22 23 24
RDKit ERROR: 
RDKit ERROR: [13:24:45] Can't kekulize mol.  Unkekulized atoms: 9 10 11 13 14 15 16 18 19
RDKit ERROR: 
RDKit ERROR: [13:24:45] Can't kekulize mol.  Unkekulized atoms: 9 10 11 13 14 15 16 18 19
RDKit ERROR: 
RDKit ERROR: [13:24:45] SMILES Parse Error: unclosed ring for input: 'CC(=O)O[C@H]1[C@]2(O)C[C@]3(O)SS[C@]4(CO)C(=O)N(C)C(=O)N3[C@@H]2C[C@@]13OC(C)C(C)(C)C3=O'
RDKit ERROR: [13:24:45] SMILES Parse Error: unclosed ring for input: 'COc1cccc(C23CCc4[nH]nc(O)c4C2)c1'
RDKit ERROR: [13:24:45] Can't kekulize mol.  Unkekulized atoms: 1 2 3 4 5 7 8 9 10 12 13
RDKit ERROR: 
RDKit ERROR: [13:24:45] SMILES Parse Error: syntax error while parsing: CC[C@H](C)[C@H](NC(=O)[C@H](C)NC(=O)[C@@H](NC(=O)[C@H](CCC(N)=O)NC(=O)[C@@H]1CCCN1C(=O)[C@H](Cc1ccccc1)NC(=O)[C@@H](NC(=O)OCc1ccccc1)[C@@H](
RDKit ERROR: [13:24:4

CPU times: user 4min 24s, sys: 4.48 s, total: 4min 28s
Wall time: 2min 16s


In [14]:
# predicted properties of samples
property_prediction = likelihood_calculator.predict(samples)
print(property_prediction.head())

   XLogP: mean  XLogP: std  TPSA: mean  TPSA: std
0     3.282271    1.982079   60.131416  40.738042
1     3.133565    1.989335   57.167321  40.914520
2     2.854156    1.991209   63.369932  40.970915
3     3.131680    1.995231   52.716277  41.059153
4     3.048982    2.006740   77.328248  41.295918


In [15]:
# compute the log likelihood of samples
likelihood_prediction = likelihood_calculator(samples, **target_range)
print(likelihood_prediction.head())

      XLogP      TPSA
0 -1.366536 -2.085252
1 -1.274870 -2.004692
2 -1.119061 -2.175631
3 -1.272124 -1.895516
4 -1.221978 -2.631307
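
The values above are consistent with the log-probability that a normal distribution, with the predicted mean and standard deviation from cell 14, falls inside the target range. The sketch below checks this for sample 0 using only the standard library. This reading is inferred from the printed numbers rather than from the XenonPy source, and the helper names are hypothetical.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of Normal(mean, std**2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def interval_log_likelihood(mean, std, low, high):
    """log P(low <= Y <= high) for Y ~ Normal(mean, std**2)."""
    return math.log(normal_cdf(high, mean, std) - normal_cdf(low, mean, std))

# Predicted mean/std for sample 0 (cell 14) and the target ranges (cell 12).
xlogp_ll = interval_log_likelihood(3.282271, 1.982079, -2, 2)
tpsa_ll = interval_log_likelihood(60.131416, 40.738042, 0, 25)
print(xlogp_ll, tpsa_ll)
```

The two results should land close to the first row of the table above (about -1.37 for XLogP and -2.09 for TPSA).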


### 4 proposal model

Proposals are drawn from the given reactant pool: each sample (a reactant index string) is modified by randomly replacing one of its reactants with a similar one.

In [16]:
# proposal based on samples
new_samples = pool_obj.proposal(samples)
print(samples)
print(new_samples)

['12163.22445', '863.20896', '249087', '153658.23440', '297070', '208421', '412634.601987', '10425.19854', '9361.387984.30667', '50995.305035']
['49642.22445', '49518.20896', '423560', '153658.635380', '143872', '520818', '601987.601987', '10425.584959', '9361.387984.91120', '50995.50771']
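
The proposal step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the ReactantPool implementation: `propose` and `toy_neighbors` are hypothetical, and the neighbor lists stand in for the nonzero columns of the similarity matrix rows.

```python
import random

def propose(index_string, neighbors, rng, splitter='.'):
    """Replace one randomly chosen reactant index with one of its similar indices."""
    parts = index_string.split(splitter)
    pos = rng.randrange(len(parts))          # pick which reactant to modify
    parts[pos] = str(rng.choice(neighbors[int(parts[pos])]))
    return splitter.join(parts)

# Hypothetical neighbor lists; each index counts itself as a neighbor.
toy_neighbors = {0: [0, 3], 1: [1, 2], 2: [1, 2], 3: [0, 3]}

rng = random.Random(0)
old = '0.1'
new = propose(old, toy_neighbors, rng)
print(old, '->', new)
```

Because each index is its own neighbor in this toy setup, a proposal can return the sample unchanged; at most one field differs between `old` and `new`.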


### 5 complete run of ismd

In [17]:
# set up initial reactants: candidates are the reactant index strings whose product has XLogP > 4
cans = [idx for i, idx in enumerate(data['reactant_index'])
        if (data['XLogP'].iloc[i] > 4)]
init_samples = np.random.choice(cans, 10)
print(init_samples)

['309392.132804' '520688.48372' '595608.382533' '471747.9540'
 '29691.626061' '226418.524475' '300500' '38620.302355' '190313.190313'
 '354190.50012']


In [18]:
# set up annealing schedule
beta = np.hstack([np.linspace(0.01,0.2,20),np.linspace(0.21,0.4,10),np.linspace(0.4,1,10),np.linspace(1,1,10)])
print('Number of steps: %i' % len(beta))
print(beta)

Number of steps: 50
[0.01       0.02       0.03       0.04       0.05       0.06
 0.07       0.08       0.09       0.1        0.11       0.12
 0.13       0.14       0.15       0.16       0.17       0.18
 0.19       0.2        0.21       0.23111111 0.25222222 0.27333333
 0.29444444 0.31555556 0.33666667 0.35777778 0.37888889 0.4
 0.4        0.46666667 0.53333333 0.6        0.66666667 0.73333333
 0.8        0.86666667 0.93333333 1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.        ]


In [19]:
# IQSPR class from XenonPy-iQSPR, used here to run ISMD
from xenonpy.inverse.iqspr import IQSPR

# set up likelihood and modifier models in iQSPR
ismd = IQSPR(estimator=likelihood_calculator, modifier=pool_obj)

np.random.seed(201906)  # fix the random seed (init_samples above was drawn before this call)
# main loop of iQSPR
ismd_samples, ismd_loglike, ismd_prob, ismd_freq = [], [], [], []
for s, ll, p, freq in ismd(init_samples, beta, yield_lpf=True):
    ismd_samples.append(s)
    ismd_loglike.append(ll)
    ismd_prob.append(p)
    ismd_freq.append(freq)
# record all outputs
ismd_results = {
    "samples": ismd_samples,
    "loglike": ismd_loglike,
    "prob": ismd_prob,
    "freq": ismd_freq,
    "beta": beta
}


RDKit ERROR: [13:25:28] Can't kekulize mol.  Unkekulized atoms: 5 6 7 15 16 17 18 20 21
RDKit ERROR: 
RDKit ERROR: [13:25:33] Can't kekulize mol.  Unkekulized atoms: 5 6 7 8 10 12 14 16 18
RDKit ERROR: 
RDKit ERROR: [13:25:43] SMILES Parse Error: unclosed ring for input: 'C[C@]12O[C@@H]3O[C@H]1O[C@@H](CO)[C@@H]3O'
RDKit ERROR: [13:25:52] SMILES Parse Error: extra open parentheses for input: 'C/C(=C\C(=O)Oc1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl)c1ccc(OP(=O)(O)OC[C@H]2O[C@@H](n3c(=O)sc4c(=O)[nH]c(N)nc43)C[C@@H]2OC(=O)c2ccc(C)c'
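
To make the loop above more concrete, here is a toy tempered sampling loop on integers instead of reactant indices. It is a sketch of the general weight, resample, propose pattern and is not XenonPy's implementation: all names are hypothetical, and the frequency-weighted form of the weights is an assumption.

```python
import math
import random

def smc_step(samples, log_likelihood, propose, beta, rng):
    """One tempered step: weight the unique samples, resample, then propose."""
    unique = sorted(set(samples))
    freq = [samples.count(u) for u in unique]
    # ASSUMPTION: weight is proportional to freq * exp(beta * log-likelihood).
    logw = [beta * log_likelihood(u) + math.log(f) for u, f in zip(unique, freq)]
    top = max(logw)                          # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logw]
    total = sum(weights)
    prob = [w / total for w in weights]
    resampled = rng.choices(unique, weights=prob, k=len(samples))
    return [propose(s, rng) for s in resampled], prob

def toy_log_likelihood(x):
    """Toy target: values near 0 are most likely."""
    return -0.5 * x * x

def toy_propose(x, rng):
    """Toy proposal: move to a neighboring integer or stay."""
    return x + rng.choice([-1, 0, 1])

rng = random.Random(201906)
toy_samples = [rng.randint(-10, 10) for _ in range(10)]
toy_betas = [0.1, 0.25, 0.5, 1.0, 1.0]
for b in toy_betas:
    toy_samples, toy_prob = smc_step(toy_samples, toy_log_likelihood, toy_propose, b, rng)
print(toy_samples)
```

A small beta flattens the weights so early steps explore broadly, while beta = 1 applies the full likelihood, which is the role the annealing schedule plays in the run above.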


In [20]:
# have a look at the result
ismd_result_df = pd.DataFrame(ismd_results)
ismd_result_df.head()

Unnamed: 0,samples,loglike,prob,freq,beta
0,"[190313.190313, 226418.524475, 29691.626061, 3...",XLogP TPSA 0 -1.167471 -1.696214 1 ...,"[0.10034176181501515, 0.09965952074382722, 0.0...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",0.01
1,"[118532.382533, 190313.146317, 226418.279371, ...",XLogP TPSA 0 -1.147225 -2.349376 1 ...,"[0.09949206751089879, 0.10066490254398162, 0.2...","[1, 1, 2, 1, 1, 1, 1, 1, 1]",0.02
2,"[119657.132804, 26859.190313, 281169.395970, 2...",XLogP TPSA 0 -0.897741 -2.222192 1 ...,"[0.10075220379229193, 0.10086582774911296, 0.0...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",0.03
3,"[151010.382533, 279547.395970, 281169.609634, ...",XLogP TPSA 0 -1.282635 -1.976090 1 ...,"[0.1006297428198096, 0.09941539970586154, 0.09...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",0.04
4,"[279547.231768, 281169.494451, 281169.65387, 2...",XLogP TPSA 0 -1.316572 -2.294022 1 ...,"[0.09891860395431068, 0.1019427190535249, 0.10...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",0.05


In [21]:
ismd_result_df['samples'][0]

array(['190313.190313', '226418.524475', '29691.626061', '300500',
       '309392.132804', '354190.50012', '38620.302355', '471747.9540',
       '520688.48372', '595608.382533'], dtype='<U32')

In [22]:
ismd_result_df['loglike'][0]

Unnamed: 0,XLogP,TPSA
0,-1.167471,-1.696214
1,-1.601733,-1.944191
2,-1.309444,-1.995235
3,-1.08257,-2.308092
4,-1.161176,-1.838347
5,-1.015167,-1.749178
6,-0.718269,-2.209244
7,-1.474706,-2.157337
8,-1.184331,-1.836367
9,-1.166416,-2.437978


In [23]:
ismd_result_df['prob'][0]

array([0.10034176, 0.09965952, 0.09990023, 0.09981437, 0.10020555,
       0.10044149, 0.10027774, 0.09957373, 0.10018434, 0.09960127])
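
The probabilities above can be reproduced from the step-0 log-likelihoods and beta = 0.01 by summing the log-likelihoods over the properties, scaling by beta, and normalizing. This relationship is inferred from the printed numbers, not taken from the library source. All frequencies equal 1 at this step, so any frequency weighting cannot be checked here.

```python
import math

# (XLogP, TPSA) log-likelihoods copied from the step-0 output above.
loglike = [
    (-1.167471, -1.696214), (-1.601733, -1.944191), (-1.309444, -1.995235),
    (-1.08257, -2.308092), (-1.161176, -1.838347), (-1.015167, -1.749178),
    (-0.718269, -2.209244), (-1.474706, -2.157337), (-1.184331, -1.836367),
    (-1.166416, -2.437978),
]
beta0 = 0.01  # first value of the annealing schedule

scores = [beta0 * (x + t) for x, t in loglike]
top = max(scores)                            # subtract the max for numerical stability
weights = [math.exp(s - top) for s in scores]
total = sum(weights)
prob = [w / total for w in weights]
print([round(p, 6) for p in prob])
```

With such a small beta, the probabilities stay close to the uniform value 0.1, as in the output above.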