## AttentiveFP: Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism

ABSTRACT: Hunting for chemicals with favorable pharmacological, toxicological, and pharmacokinetic properties remains a formidable challenge for drug discovery. Deep learning provides us with powerful tools to build predictive models that are appropriate for the rising amounts of data, but the gap between what these neural networks learn and what human beings can comprehend is growing. Moreover, this gap may induce distrust and restrict deep learning applications in practice. Here, we introduce a new graph neural network architecture called <code>AttentiveFP</code> for molecular representation that uses a graph attention mechanism to learn from relevant drug discovery datasets. We demonstrate that <code>AttentiveFP</code> achieves state-of-the-art predictive performances on a variety of datasets and that what it learns is interpretable. The feature visualization for <code>AttentiveFP</code> suggests that it automatically learns nonlocal intramolecular interactions from specified tasks, which can help us gain chemical insights directly from data beyond human perception.

Link to paper: https://pubs.acs.org/doi/pdf/10.1021/acs.jmedchem.9b00959

Credit: https://github.com/OpenDrugAI/AttentiveFP


In [1]:
# Clone the repository and cd into directory
!git clone https://github.com/saams4u/AttentiveFP.git
%cd AttentiveFP

Cloning into 'AttentiveFP'...
remote: Enumerating objects: 70, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 70 (delta 3), reused 7 (delta 2), pack-reused 61
Unpacking objects: 100% (70/70), done.
/content/AttentiveFP


In [2]:
# Install RDKit
!pip install rdkit-pypi==2021.3.1.5

Collecting rdkit-pypi==2021.3.1.5
  Downloading https://files.pythonhosted.org/packages/7a/8b/4eab9cd448c40fd0d3034771963c903579a6697454682b6fbb9114beca91/rdkit_pypi-2021.3.1.5-cp37-cp37m-manylinux2014_x86_64.whl (18.0MB)
     |████████████████████████████████| 18.0MB 272kB/s
Installing collected packages: rdkit-pypi
Successfully installed rdkit-pypi-2021.3.1.5


### Example: Malaria Bioactivity

In [4]:
import os

from IPython.display import SVG, display

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as Data

torch.manual_seed(8)

import time
import numpy as np
import gc
import sys

sys.setrecursionlimit(50000)

import pickle

torch.backends.cudnn.benchmark = True
# Create new float tensors on the GPU by default; this notebook therefore needs a CUDA device
torch.set_default_tensor_type('torch.cuda.FloatTensor')

# from tensorboardX import SummaryWriter
torch.nn.Module.dump_patches = True

import copy
import pandas as pd

from sklearn.metrics import roc_auc_score, matthews_corrcoef, recall_score, accuracy_score, r2_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, precision_score, precision_recall_curve
from sklearn.metrics import auc, f1_score

from rdkit import Chem
from rdkit.Chem import AllChem, QED, rdMolDescriptors, MolSurf, rdDepictor
from rdkit.Chem.Draw import SimilarityMaps, rdMolDraw2D

%matplotlib inline

from numpy.polynomial.polynomial import polyfit

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib

import seaborn as sns; sns.set()
import sascorer

import AttentiveLayers, AttentiveLayers_viz
import Featurizer, Featurizer_aromaticity_rm, getFeatures, getFeatures_aromaticity_rm

from getFeatures import save_smiles_dicts, get_smiles_array
from AttentiveLayers import Fingerprint

In [5]:
task_name = 'Malaria Bioactivity'
tasks = ['Loge EC50']

raw_filename = "data/malaria-processed.csv"
feature_filename = raw_filename.replace('.csv','.pickle')
filename = raw_filename.replace('.csv','')
prefix_filename = raw_filename.split('/')[-1].replace('.csv','')

smiles_tasks_df = pd.read_csv(raw_filename, names = ["Loge EC50", "smiles"])
smilesList = smiles_tasks_df.smiles.values

if os.path.isfile(feature_filename):
    feature_dicts = pickle.load(open(feature_filename, "rb" ))
else:
    feature_dicts = save_smiles_dicts(smilesList,filename)

print("number of all smiles: ",len(smilesList))

atom_num_dist = []
remained_smiles = []
canonical_smiles_list = []

for smiles in smilesList:
    try:
        mol = Chem.MolFromSmiles(smiles)  # None if RDKit cannot parse the SMILES
        atom_num_dist.append(len(mol.GetAtoms()))
        remained_smiles.append(smiles)
        canonical_smiles_list.append(Chem.MolToSmiles(mol, isomericSmiles=True))
    except Exception:
        # Report and skip any SMILES that RDKit could not process
        print(smiles)

print("number of successfully processed smiles: ", len(remained_smiles))
# .copy() makes the filtered frame independent, so adding a column later is safe
smiles_tasks_df = smiles_tasks_df[smiles_tasks_df["smiles"].isin(remained_smiles)].copy()

print(smiles_tasks_df)
smiles_tasks_df['cano_smiles'] = canonical_smiles_list

plt.figure(figsize=(5, 3))
sns.set(font_scale=1.5)
ax = sns.distplot(atom_num_dist, bins=28, kde=False)
plt.tight_layout()

plt.savefig("atom_num_dist_"+prefix_filename+".png",dpi=200)
plt.show()
plt.close()

feature dicts file saved as data/malaria-processed.pickle
number of all smiles:  9999
number of successfully processed smiles:  9999
      Loge EC50                                             smiles
0      2.708050  COc1ccc(C)c2sc(nc12)N(Cc3cccnc3)C(=O)c4ccc(cc4...
1      2.708050               CC(Sc1ccc(Cl)cc1)C(=O)Nc2ccc(Br)cc2F
2      2.708050  Cc1ccc(cc1)S(=O)(=O)NCC(N2CCN(CC2)c3ccccc3F)c4...
3      2.708050       Cc1ccc2oc(nc2c1)c3cccc(NC(=O)C(Cl)(Cl)Cl)c3C
4      2.708050          FC(F)(F)c1cccc(NC(=O)C2=Cc3ccccc3OC2=O)c1
...         ...                                                ...
9994  -5.880997  CC[C@@]1(CCC(O1)[C@@]2(C)CC[C@@]3(C[C@@H](O)[C...
9995  -6.017809  CC1=CC(=O)OC[C@]23C[C@H](O)C(=C[C@H]2O[C@@H]4C...
9996  -6.019453      CCOC(=O)C(=O)N1c2ccc(OC)cc2C3=C(SSC3=S)C1(C)C
9997  -6.021511                  CCN(CC)CCCC(C)Nc1ccnc2cc(Cl)ccc12
9998  -6.921854                   CC1(C)N=C(N)N=C(N)N1c2ccc(Cl)cc2

[9999 rows x 2 columns]
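
The four path variables at the top of the cell above are all derived from `raw_filename` by plain string manipulation. A minimal sketch of what they evaluate to for the malaria file, using only the standard library and reading no file:

```python
raw_filename = "data/malaria-processed.csv"

# Cache file for the featurized molecules, stored next to the CSV
feature_filename = raw_filename.replace('.csv', '.pickle')
# Path stem that the notebook passes to save_smiles_dicts
filename = raw_filename.replace('.csv', '')
# Bare dataset name, reused in plot and checkpoint file names
prefix_filename = raw_filename.split('/')[-1].replace('.csv', '')

print(feature_filename)   # data/malaria-processed.pickle
print(filename)           # data/malaria-processed
print(prefix_filename)    # malaria-processed
```

The first value matches the path reported in the cell output ("feature dicts file saved as data/malaria-processed.pickle").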




In [6]:
random_seed = 68
start_time = str(time.ctime()).replace(':','-').replace(' ','_')

batch_size = 200
epochs = 800

p_dropout= 0.03
fingerprint_dim = 200

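weight_decay = 4.3 # negative log10 of the L2 weight decay; the optimizer uses 10**-weight_decay
learning_rate = 4 # negative log10 of the learning rate; the optimizer uses 10**-learning_rate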
radius = 2
T = 1

per_task_output_units_num = 1 # for regression model
output_units_num = len(tasks) * per_task_output_units_num
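
Note that `learning_rate` and `weight_decay` are stored as exponents rather than as the values themselves: the optimizer cell further down passes `10**-learning_rate` and `10**-weight_decay` to Adam. A small sketch of the conversion for the malaria settings (the names `effective_lr` and `effective_weight_decay` are introduced here for illustration only):

```python
learning_rate = 4      # exponent, as set in the cell above
weight_decay = 4.3     # exponent, as set in the cell above

# Values that actually reach the optimizer
effective_lr = 10 ** -learning_rate
effective_weight_decay = 10 ** -weight_decay

print(f"learning rate: {effective_lr:.1e}")            # 1.0e-04
print(f"weight decay:  {effective_weight_decay:.2e}")  # about 5.01e-05
```

Searching on a log scale like this is a common choice for both quantities, since useful values tend to span several orders of magnitude.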

In [7]:
# feature_dicts = get_smiles_dicts(smilesList)
remained_df = smiles_tasks_df[smiles_tasks_df["cano_smiles"].isin(feature_dicts['smiles_to_atom_mask'].keys())]
uncovered_df = smiles_tasks_df.drop(remained_df.index)
uncovered_df

Unnamed: 0,Loge EC50,smiles,cano_smiles


In [8]:
test_df = remained_df.sample(frac=0.2,random_state=random_seed)
train_df = remained_df.drop(test_df.index)

train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

print(len(test_df), sorted(test_df.cano_smiles.values))

2000 ['Brc1ccc2ncnc(NCC3CCCO3)c2c1', 'C#CCN(CC=C)Cc1nc(-c2cc(OC)ccc2OC)oc1C', 'C#CCN(Cc1cc2c(O)nc(N)nc2cc1C)c1ccc(C(=O)N[C@@H](CCC(=O)O)C(=O)O)cc1', 'C#CCN(Cc1ccc2nc(N)nc(O)c2c1)c1ccc(C(=O)N[C@@H](CCC(=O)O)C(=O)O)cc1', 'C#CCSc1c(CCCC)cnc2c1c(=O)n(C)c(=O)n2C', 'C#CC[n+]1ccn2c(C)ccc2c1CC', 'C#Cc1cccc(NC(=O)[C@H](CC2CCCCC2)Nc2ccc(C#N)c3ccccc23)c1', 'C(#Cc1ccccc1)CN(Cc1ccccn1)Cc1ccccn1', 'C(=Cc1ccccc1)C=NNc1ccnc2ccccc12', 'C(=NNc1nc2ccccc2[nH]1)c1cccc(Oc2ccccc2)c1', 'C/C(=C\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@@H]2CCC[C@H](C)C(O)[C@@H](C)C(=O)C(C)(C)[C@@H](O)CC(=O)O1', 'C/C(C=C1Sc2ccc3ccoc3c2N1CCO)=C\\c1sc2ccc3occc3c2[n+]1CCO', 'C/C=C/C(C(=O)NCCCC)N1C(=O)c2cc(NC(C)=O)ccc2NC(=O)[C@@H]1C', 'C/C=C/[C@H]1O[C@@](O)([C@@H](C)C(O)C(C)[C@H]2OC(=O)C(OC)=CC(C)=C[C@@H](C)[C@H](O)C(CC)[C@@H](O)[C@H](C)CC(C)=CC=C[C@@H]2OC)C[C@@H](O[C@H]2C[C@@H](O)[C@H](OC(N)=O)[C@@H](C)O2)[C@@H]1C', 'C/C=C1/C[C@H]2[C@@H](OC)Nc3cc(O)c(OC)cc3C(=O)N2C1', 'C1=C(c2ccc(CN3CCCCC3)cc2)N2CCN=C2c2ccccc21', 'C1=CN(Cc2ccccc2)C=CC1=C1C=N
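
The split above holds out a random 20% of the rows as a test set and trains on the rest. The sketch below reproduces only the bookkeeping with the standard library `random` module; it is not pandas' sampler, so the selected rows differ from the notebook's, and it assumes the test size is `frac * n_rows` rounded to the nearest integer (which agrees with the 2000 printed above). `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n_rows, test_frac, seed):
    """Return (train_idx, test_idx) for a random hold-out split."""
    n_test = round(test_frac * n_rows)
    rng = random.Random(seed)
    test_idx = sorted(rng.sample(range(n_rows), n_test))
    held_out = set(test_idx)
    train_idx = [i for i in range(n_rows) if i not in held_out]
    return train_idx, test_idx

train_idx, test_idx = split_indices(n_rows=9999, test_frac=0.2, seed=68)
print(len(train_idx), len(test_idx))   # 7999 2000
```

Because the split is a single random hold-out, the reported test error depends on `random_seed`; repeating the run with several seeds gives a better sense of the spread.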

In [9]:
x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array([canonical_smiles_list[0]],feature_dicts)

num_atom_features = x_atom.shape[-1]
num_bond_features = x_bonds.shape[-1]

loss_function = nn.MSELoss()

model = Fingerprint(radius, T, num_atom_features, num_bond_features,
            fingerprint_dim, output_units_num, p_dropout)
model.cuda()

# optimizer = optim.Adam(model.parameters(), learning_rate, weight_decay=weight_decay)
optimizer = optim.Adam(model.parameters(), 10**-learning_rate, weight_decay=10**-weight_decay)
# optimizer = optim.SGD(model.parameters(), 10**-learning_rate, weight_decay=10**-weight_decay)
# tensorboard = SummaryWriter(log_dir="runs/"+start_time+"_"+prefix_filename+"_"+str(fingerprint_dim)+"_"+str(p_dropout))

model_parameters = filter(lambda p: p.requires_grad, model.parameters())
params = sum([np.prod(p.size()) for p in model_parameters])

print(params)

for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.data.shape)

863604
atom_fc.weight torch.Size([200, 39])
atom_fc.bias torch.Size([200])
neighbor_fc.weight torch.Size([200, 49])
neighbor_fc.bias torch.Size([200])
GRUCell.0.weight_ih torch.Size([600, 200])
GRUCell.0.weight_hh torch.Size([600, 200])
GRUCell.0.bias_ih torch.Size([600])
GRUCell.0.bias_hh torch.Size([600])
GRUCell.1.weight_ih torch.Size([600, 200])
GRUCell.1.weight_hh torch.Size([600, 200])
GRUCell.1.bias_ih torch.Size([600])
GRUCell.1.bias_hh torch.Size([600])
align.0.weight torch.Size([1, 400])
align.0.bias torch.Size([1])
align.1.weight torch.Size([1, 400])
align.1.bias torch.Size([1])
attend.0.weight torch.Size([200, 200])
attend.0.bias torch.Size([200])
attend.1.weight torch.Size([200, 200])
attend.1.bias torch.Size([200])
mol_GRUCell.weight_ih torch.Size([600, 200])
mol_GRUCell.weight_hh torch.Size([600, 200])
mol_GRUCell.bias_ih torch.Size([600])
mol_GRUCell.bias_hh torch.Size([600])
mol_align.weight torch.Size([1, 400])
mol_align.bias torch.Size([1])
mol_attend.weight torch.Si
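
The printed shapes hint at how the attention step is wired: each `align` layer maps 400 inputs to a single score, which is consistent with scoring the concatenation of two 200-dimensional vectors (an atom and one of its neighbors). The sketch below is a generic graph-attention step written under that assumption, in pure Python with tiny hypothetical dimensions. It is not the repository's code; the `attend` transform and the GRU update are omitted, and `AttentiveLayers.py` remains the authoritative implementation.

```python
import math

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def softmax(scores):
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_to_neighbors(h_atom, h_neighbors, align_w, align_b):
    """Score each neighbor against the atom, then average neighbors by those scores."""
    scores = []
    for h_n in h_neighbors:
        pair = h_atom + h_n                 # list concatenation: length 2 * dim
        scores.append(leaky_relu(sum(w * x for w, x in zip(align_w, pair)) + align_b))
    weights = softmax(scores)
    context = [sum(a * h_n[k] for a, h_n in zip(weights, h_neighbors))
               for k in range(len(h_atom))]
    return weights, context

h_atom = [1.0, 0.0]
h_neighbors = [[0.0, 1.0], [2.0, 0.0], [0.0, -1.0]]
align_w = [0.0, 0.0, 1.0, 1.0]              # hypothetical weights, length 2 * dim
weights, context = attend_to_neighbors(h_atom, h_neighbors, align_w, align_b=0.0)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

The attention weights sum to one, so the context vector is a weighted average of the neighbor states; inspecting such weights is what makes this kind of model amenable to visualization.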

In [10]:
def train(model, dataset, optimizer, loss_function):
    model.train()
    np.random.seed(epoch)  # reads the global `epoch` set by the training loop, so each epoch's shuffle is reproducible
    valList = np.arange(0,dataset.shape[0])

    #shuffle them
    np.random.shuffle(valList)
    batch_list = []

    for i in range(0, dataset.shape[0], batch_size):
        batch = valList[i:i+batch_size]
        batch_list.append(batch)   
        
    for counter, batch in enumerate(batch_list):
        batch_df = dataset.loc[batch,:]
        smiles_list = batch_df.cano_smiles.values
        y_val = batch_df[tasks[0]].values
        
        x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array(smiles_list,feature_dicts)
        atoms_prediction, mol_prediction = model(torch.Tensor(x_atom),torch.Tensor(x_bonds),torch.cuda.LongTensor(x_atom_index),torch.cuda.LongTensor(x_bond_index),torch.Tensor(x_mask))
        
        optimizer.zero_grad()
        loss = loss_function(mol_prediction, torch.Tensor(y_val).view(-1,1))     
        loss.backward()
        optimizer.step()

def eval(model, dataset):
    model.eval()
    test_MAE_list = []
    test_MSE_list = []
    valList = np.arange(0,dataset.shape[0])
    batch_list = []

    for i in range(0, dataset.shape[0], batch_size):
        batch = valList[i:i+batch_size]
        batch_list.append(batch) 

    for counter, batch in enumerate(batch_list):
        batch_df = dataset.loc[batch,:]
        smiles_list = batch_df.cano_smiles.values
        print(batch_df)
        y_val = batch_df[tasks[0]].values
        
        x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array(smiles_list,feature_dicts)
        atoms_prediction, mol_prediction = model(torch.Tensor(x_atom),torch.Tensor(x_bonds),torch.cuda.LongTensor(x_atom_index),torch.cuda.LongTensor(x_bond_index),torch.Tensor(x_mask))
        MAE = F.l1_loss(mol_prediction, torch.Tensor(y_val).view(-1,1), reduction='none')        
        MSE = F.mse_loss(mol_prediction, torch.Tensor(y_val).view(-1,1), reduction='none')
        print(x_mask[:2],atoms_prediction.shape, mol_prediction,MSE)
        
        test_MAE_list.extend(MAE.data.squeeze().cpu().numpy())
        test_MSE_list.extend(MSE.data.squeeze().cpu().numpy())

    return np.array(test_MAE_list).mean(), np.array(test_MSE_list).mean()
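
Both functions above build their mini-batches the same way: take the row positions, optionally shuffle them, and cut them into slices of `batch_size`. A minimal sketch of that step, using the standard library `random` module in place of `np.random` (`make_batches` is a hypothetical helper, not part of the repository):

```python
import random

def make_batches(n_rows, batch_size, seed=None):
    """Split row positions 0..n_rows-1 into consecutive batches.

    With a seed, the positions are shuffled first (training);
    without one, they stay in order (evaluation).
    """
    indices = list(range(n_rows))
    if seed is not None:
        random.Random(seed).shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, n_rows, batch_size)]

# 7999 training rows remain after holding out 2000 of the 9999 molecules
batches = make_batches(n_rows=7999, batch_size=200, seed=0)
print(len(batches), len(batches[0]), len(batches[-1]))   # 40 200 199
```

The last batch is smaller than the others whenever the row count is not a multiple of `batch_size`, which is why `eval` pools per-sample errors instead of averaging per-batch means.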

In [11]:
best_param ={}
best_param["train_epoch"] = 0
best_param["test_epoch"] = 0
best_param["train_MSE"] = 9e8
best_param["test_MSE"] = 9e8

for epoch in range(epochs):  # epochs = 800, set with the other hyperparameters
    train_MAE, train_MSE = eval(model, train_df)
    test_MAE, test_MSE = eval(model, test_df)

    if train_MSE < best_param["train_MSE"]:
        best_param["train_epoch"] = epoch
        best_param["train_MSE"] = train_MSE

    if test_MSE < best_param["test_MSE"]:
        best_param["test_epoch"] = epoch
        best_param["test_MSE"] = test_MSE

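        if test_MSE < 1.1:
            # Checkpoint only when the test MSE beats this threshold; create the folder if needed
            os.makedirs('saved_models', exist_ok=True)
            torch.save(model, 'saved_models/model_'+prefix_filename+'_'+start_time+'_'+str(epoch)+'.pt')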
    if (epoch - best_param["train_epoch"] >2) and (epoch - best_param["test_epoch"] >18):        
        break
        
    print(epoch, train_MSE, test_MSE)
    
    train(model, train_df, optimizer, loss_function)

Streaming output truncated to the last 5000 lines.
        [-0.2405],
        [-0.5702],
        [-0.4328],
        [-0.3424],
        [-0.6913],
        [-1.1007],
        [-1.7524],
        [ 0.3732],
        [-0.8988],
        [-0.2757],
        [-2.5649],
        [-0.8163],
        [-1.3173],
        [-0.2572],
        [-1.1371],
        [-1.1295],
        [ 0.0405],
        [-1.0224],
        [-1.3242],
        [-0.7857],
        [-0.7580],
        [-1.0891],
        [-1.9349],
        [-2.6613],
        [ 0.0923],
        [-2.0675],
        [-1.1656],
        [-1.7320],
        [-0.0798],
        [-0.2989],
        [-1.5479],
        [-1.6282],
        [-1.2799],
        [-1.3100],
        [-0.9730],
        [ 0.0658],
        [-1.5036],
        [-0.2392],
        [-1.3781],
        [-2.9009],
        [-2.3212],
        [-0.7003],
        [-0.5294],
        [-0.6178],
        [-1.9307],
        [-1.4975],
        [-2.4591],
        [-2.3054],
        [-1.4007],
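
The training loop stops early once the train MSE has not improved for more than 2 epochs and the test MSE has not improved for more than 18 epochs. The sketch below replays that rule on a made-up loss history so the stopping point can be checked by hand; `early_stop_epoch` and the history values are hypothetical, and no model is trained.

```python
def early_stop_epoch(history, train_patience=2, test_patience=18):
    """Replay the notebook's stopping rule on (train_mse, test_mse) pairs.

    Returns (stopped_at, best); stopped_at is None if the rule never fires.
    """
    best = {"train_epoch": 0, "test_epoch": 0, "train_MSE": 9e8, "test_MSE": 9e8}
    for epoch, (train_mse, test_mse) in enumerate(history):
        if train_mse < best["train_MSE"]:
            best["train_epoch"], best["train_MSE"] = epoch, train_mse
        if test_mse < best["test_MSE"]:
            best["test_epoch"], best["test_MSE"] = epoch, test_mse
        if (epoch - best["train_epoch"] > train_patience
                and epoch - best["test_epoch"] > test_patience):
            return epoch, best
    return None, best

# Made-up curves: train MSE improves until epoch 10, test MSE until epoch 5
history = [(10.0 - min(e, 10), 6.0 - min(e, 5)) for e in range(40)]
stopped_at, best = early_stop_epoch(history)
print(stopped_at, best["train_epoch"], best["test_epoch"])   # 24 10 5
```

Both conditions must hold at once, so here the looser test-side patience decides the stopping epoch (5 + 18 + 1 = 24).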
    

In [12]:
# evaluate model
best_model = torch.load('saved_models/model_'+prefix_filename+'_'+start_time+'_'+str(best_param["test_epoch"])+'.pt')     

best_model_dict = best_model.state_dict()
best_model_wts = copy.deepcopy(best_model_dict)

model.load_state_dict(best_model_wts)
(best_model.align[0].weight == model.align[0].weight).all()
test_MAE, test_MSE = eval(model, test_df)
print("best epoch:",best_param["test_epoch"],"\n","test MSE:",test_MSE)

     Loge EC50  ...                                        cano_smiles
0     0.184240  ...  COc1ccccc1C1CCN(C2CCC(NC(=O)/C=C/c3cc(C(F)(F)F...
1    -1.077439  ...  CCCN(CCC)CCCOc1ccc(C(=O)c2c(-c3ccc(OCCCN(CCC)C...
2    -0.075931  ...  CCCCCN1CCCN(Cc2cccc(NC(=O)c3ccc(Cl)c(Cl)c3)c2)CC1
3     2.708050  ...         CCc1c(C)nc(-n2nc(C)cc2NC(=O)c2ccccc2I)nc1O
4     0.341593  ...  COc1ccc(CNCCc2cccs2)cc1-c1ccc(OC)c(S(=O)(=O)NC...
..         ...  ...                                                ...
195  -2.477276  ...  O=C(NC1CCN(CC2(c3ccc(Cl)cc3)CCCCC2)CC1)c1cc[nH]n1
196   0.004068  ...  CC(Cc1ccc(NC(=O)c2ccc(CC(C)NCc3ccc(Cl)c(Cl)c3)...
197   2.708050  ...  CCOC(=O)C1C(CN(C)c2ccc(F)cc2)=NC(=O)NC1c1ccc(C...
198  -1.934005  ...  N#Cc1cccc(NC(=O)Nc2ccc(-c3ccc(-c4nc5cc(C(F)(F)...
199  -0.916291  ...           COc1cc2ncnc(NC3CCN(Cc4ccccc4)CC3)c2cc1OC

[200 rows x 3 columns]
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 

### Example: Photovoltaic Efficiency

In [None]:
task_name = 'Photovoltaic efficiency'
tasks = ['PCE']

raw_filename = "data/cep-processed.csv"
feature_filename = raw_filename.replace('.csv','.pickle')
filename = raw_filename.replace('.csv','')
prefix_filename = raw_filename.split('/')[-1].replace('.csv','')

smiles_tasks_df = pd.read_csv(raw_filename)
smilesList = smiles_tasks_df.smiles.values
print("number of all smiles: ",len(smilesList))

if os.path.isfile(feature_filename):
    feature_dicts = pickle.load(open(feature_filename, "rb" ))
else:
    feature_dicts = save_smiles_dicts(smilesList,filename)

atom_num_dist = []
remained_smiles = []
canonical_smiles_list = []

for smiles in smilesList:
    try:
        mol = Chem.MolFromSmiles(smiles)  # None if RDKit cannot parse the SMILES
        atom_num_dist.append(len(mol.GetAtoms()))
        remained_smiles.append(smiles)
        canonical_smiles_list.append(Chem.MolToSmiles(mol, isomericSmiles=True))
    except Exception:
        # Report and skip any SMILES that RDKit could not process
        print(smiles)

print("number of successfully processed smiles: ", len(remained_smiles))
# .copy() makes the filtered frame independent, so adding a column later is safe
smiles_tasks_df = smiles_tasks_df[smiles_tasks_df["smiles"].isin(remained_smiles)].copy()
smiles_tasks_df['cano_smiles'] =canonical_smiles_list

plt.figure(figsize=(5, 3))
sns.set(font_scale=1.5)
ax = sns.distplot(atom_num_dist, bins=28, kde=False)

plt.tight_layout()
plt.savefig("atom_num_dist_"+prefix_filename+".png",dpi=200)
plt.show()
plt.close()

number of all smiles:  29978
number of successfully processed smiles:  29978




In [None]:
random_seed = 888
start_time = str(time.ctime()).replace(':','-').replace(' ','_')

batch_size = 200
epochs = 800

p_dropout= 0.15
fingerprint_dim = 200

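weight_decay = 4.5 # negative log10 of the L2 weight decay; the optimizer uses 10**-weight_decay
learning_rate = 3.6 # negative log10 of the learning rate; the optimizer uses 10**-learning_rate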
radius = 3
T = 1

per_task_output_units_num = 1 # for regression model
output_units_num = len(tasks) * per_task_output_units_num

In [None]:
# feature_dicts = get_smiles_dicts(smilesList)
remained_df = smiles_tasks_df[smiles_tasks_df["cano_smiles"].isin(feature_dicts['smiles_to_atom_mask'].keys())]
uncovered_df = smiles_tasks_df.drop(remained_df.index)
uncovered_df

feature dicts file saved as /content/AttentiveFP/data/cep-processed.pickle


Unnamed: 0,smiles,PCE,cano_smiles


In [None]:
test_df = remained_df.sample(frac=1/5,random_state=random_seed)
train_df = remained_df.drop(test_df.index)

train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

print(len(test_df),sorted(test_df.cano_smiles.values))

5996 ['C1=C(C2=CC=C(c3cccc4c[nH]cc34)[SiH2]2)[SiH2]C(c2ccc[nH]2)=C1', 'C1=C(C2=CC=C(c3nccc4nsnc34)[SiH2]2)CC(c2cccc3cscc23)=C1', 'C1=C(C2=CC=C(c3nccc4nsnc34)[SiH2]2)[SiH2]C(c2scc3[nH]ccc23)=C1', 'C1=C(C2=Cc3[nH]ccc3C2)Cc2ccccc21', 'C1=C(C2=Cc3c(ccc4c3=C[SiH2]C=4)[SiH2]2)Cc2cc[se]c21', 'C1=C(C2=Cc3c(ccc4cscc34)[SiH2]2)[SiH2]c2cc[nH]c21', 'C1=C(c2cc3c(ccc4ccccc43)cn2)[SiH2]c2cccnc21', 'C1=C(c2cc3c4c(ccc3c3ccccc23)=C[SiH2]C=4)Cc2ccncc21', 'C1=C(c2cc3c4c[nH]cc4ccc3c3c[nH]cc23)[SiH2]c2ccc3c(c21)=C[SiH2]C=3', 'C1=C(c2cc3c4c[nH]cc4ccc3c3ccccc23)[SiH2]c2ccc3nsnc3c21', 'C1=C(c2cc3ccccc3c3c[nH]cc23)[SiH2]c2ccoc21', 'C1=C(c2cc3cnccc3c3c[nH]cc23)[SiH2]c2ccc3cscc3c21', 'C1=C(c2cc3ncccc3[se]2)[SiH2]c2ccccc21', 'C1=C(c2cc3ncccc3o2)Cc2ccc3cocc3c21', 'C1=C(c2ccc(-c3ccc[nH]3)cc2)CC(c2scc3cc[nH]c23)=C1', 'C1=C(c2ccc(-c3ccc[nH]3)cn2)CC(c2ccco2)=C1', 'C1=C(c2ccc(-c3cccc4c3=C[SiH2]C=4)nc2)CC(c2cccc3nsnc23)=C1', 'C1=C(c2ccc(-c3cccc4c3=C[SiH2]C=4)nc2)[SiH2]C(c2cccc3cocc23)=C1', 'C1=C(c2ccc(-c3cccc4c[nH]cc34)[

In [None]:
x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array([canonical_smiles_list[0]],feature_dicts)
loss_function = nn.MSELoss()

num_atom_features = x_atom.shape[-1]
num_bond_features = x_bonds.shape[-1]

model = Fingerprint(radius, T, num_atom_features, num_bond_features,
            fingerprint_dim, output_units_num, p_dropout)
model.cuda()

# optimizer = optim.Adam(model.parameters(), learning_rate, weight_decay=weight_decay)
optimizer = optim.Adam(model.parameters(), 10**-learning_rate, weight_decay=10**-weight_decay)
# optimizer = optim.SGD(model.parameters(), 10**-learning_rate, weight_decay=10**-weight_decay)
# tensorboard = SummaryWriter(log_dir="runs/"+start_time+"_"+prefix_filename+"_"+str(fingerprint_dim)+"_"+str(p_dropout))

model_parameters = filter(lambda p: p.requires_grad, model.parameters())
params = sum([np.prod(p.size()) for p in model_parameters])
print(params)

for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.data.shape)

1145405
atom_fc.weight torch.Size([200, 39])
atom_fc.bias torch.Size([200])
neighbor_fc.weight torch.Size([200, 49])
neighbor_fc.bias torch.Size([200])
GRUCell.0.weight_ih torch.Size([600, 200])
GRUCell.0.weight_hh torch.Size([600, 200])
GRUCell.0.bias_ih torch.Size([600])
GRUCell.0.bias_hh torch.Size([600])
GRUCell.1.weight_ih torch.Size([600, 200])
GRUCell.1.weight_hh torch.Size([600, 200])
GRUCell.1.bias_ih torch.Size([600])
GRUCell.1.bias_hh torch.Size([600])
GRUCell.2.weight_ih torch.Size([600, 200])
GRUCell.2.weight_hh torch.Size([600, 200])
GRUCell.2.bias_ih torch.Size([600])
GRUCell.2.bias_hh torch.Size([600])
align.0.weight torch.Size([1, 400])
align.0.bias torch.Size([1])
align.1.weight torch.Size([1, 400])
align.1.bias torch.Size([1])
align.2.weight torch.Size([1, 400])
align.2.bias torch.Size([1])
attend.0.weight torch.Size([200, 200])
attend.0.bias torch.Size([200])
attend.1.weight torch.Size([200, 200])
attend.1.bias torch.Size([200])
attend.2.weight torch.Size([200, 200]
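
With `radius = 3` the model gains a third `GRUCell`/`align`/`attend` group compared with the `radius = 2` malaria model, and the parameter count rises from 863604 to 1145405. The arithmetic below checks that the difference is exactly one such group, using the shapes printed above. It assumes `attend.2.bias` has 200 entries like `attend.0.bias`, since the listing is cut off before that line; the helper names are introduced here for illustration.

```python
fingerprint_dim = 200

def linear_params(out_features, in_features):
    # weight matrix plus bias vector
    return out_features * in_features + out_features

def gru_cell_params(hidden, inputs):
    # weight_ih and weight_hh each stack three gates; each has its own bias vector
    return 3 * hidden * inputs + 3 * hidden * hidden + 2 * 3 * hidden

per_layer = (
    gru_cell_params(fingerprint_dim, fingerprint_dim)    # GRUCell.2
    + linear_params(1, 2 * fingerprint_dim)              # align.2
    + linear_params(fingerprint_dim, fingerprint_dim)    # attend.2
)
print(per_layer)             # 281801
print(863604 + per_layer)    # 1145405
```

Each extra attention layer therefore costs about 0.28 million parameters at `fingerprint_dim = 200`, most of it in the GRU cell.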

In [None]:
def train(model, dataset, optimizer, loss_function):
    model.train()
    np.random.seed(epoch)  # reads the global `epoch` set by the training loop, so each epoch's shuffle is reproducible
    valList = np.arange(0,dataset.shape[0])

    #shuffle them
    np.random.shuffle(valList)
    batch_list = []

    for i in range(0, dataset.shape[0], batch_size):
        batch = valList[i:i+batch_size]
        batch_list.append(batch)   

    for counter, batch in enumerate(batch_list):
        batch_df = dataset.loc[batch,:]
        smiles_list = batch_df.cano_smiles.values
        y_val = batch_df[tasks[0]].values
        
        x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array(smiles_list,feature_dicts)
        atoms_prediction, mol_prediction = model(torch.Tensor(x_atom),torch.Tensor(x_bonds),torch.cuda.LongTensor(x_atom_index),torch.cuda.LongTensor(x_bond_index),torch.Tensor(x_mask))
        
        optimizer.zero_grad()
        loss = loss_function(mol_prediction, torch.Tensor(y_val).view(-1,1))     
        loss.backward()
        optimizer.step()

def eval(model, dataset):
    model.eval()
    test_MAE_list = []
    test_MSE_list = []
    valList = np.arange(0,dataset.shape[0])
    batch_list = []

    for i in range(0, dataset.shape[0], batch_size):
        batch = valList[i:i+batch_size]
        batch_list.append(batch) 
    for counter, batch in enumerate(batch_list):
        batch_df = dataset.loc[batch,:]
        smiles_list = batch_df.cano_smiles.values
        print(batch_df)
        y_val = batch_df[tasks[0]].values
        
        x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array(smiles_list,feature_dicts)
        atoms_prediction, mol_prediction = model(torch.Tensor(x_atom),torch.Tensor(x_bonds),torch.cuda.LongTensor(x_atom_index),torch.cuda.LongTensor(x_bond_index),torch.Tensor(x_mask))
        MAE = F.l1_loss(mol_prediction, torch.Tensor(y_val).view(-1,1), reduction='none')        
        MSE = F.mse_loss(mol_prediction, torch.Tensor(y_val).view(-1,1), reduction='none')
        print(x_mask[:2],atoms_prediction.shape, mol_prediction,MSE)
        
        test_MAE_list.extend(MAE.data.squeeze().cpu().numpy())
        test_MSE_list.extend(MSE.data.squeeze().cpu().numpy())
        
    return np.array(test_MAE_list).mean(), np.array(test_MSE_list).mean()
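
`eval` collects the error of every molecule (`reduction='none'`) and averages once at the end, rather than averaging the per-batch means. The two only agree when all batches have the same size. A toy illustration with made-up errors, in plain Python:

```python
def mean(values):
    return sum(values) / len(values)

per_sample_errors = [1.0, 1.0, 1.0, 1.0, 4.0]   # made-up squared errors
batch_size = 4
batches = [per_sample_errors[i:i + batch_size]
           for i in range(0, len(per_sample_errors), batch_size)]

# What eval does: pool every per-sample error, then average once
pooled = mean([e for batch in batches for e in batch])
# The alternative: average each batch, then average those means
mean_of_batch_means = mean([mean(batch) for batch in batches])

print(pooled, mean_of_batch_means)   # 1.6 2.5
```

Pooling weights every molecule equally, so a short final batch cannot distort the reported MSE.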

In [None]:
best_param ={}
best_param["train_epoch"] = 0
best_param["test_epoch"] = 0
best_param["train_MSE"] = 9e8
best_param["test_MSE"] = 9e8

for epoch in range(epochs):  # epochs = 800, set with the other hyperparameters
    train_MAE, train_MSE = eval(model, train_df)
    test_MAE, test_MSE = eval(model, test_df)

    if train_MSE < best_param["train_MSE"]:
        best_param["train_epoch"] = epoch
        best_param["train_MSE"] = train_MSE
    if test_MSE < best_param["test_MSE"]:
        best_param["test_epoch"] = epoch
        best_param["test_MSE"] = test_MSE
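        if test_MSE < 0.9:
            # Checkpoint only when the test MSE beats this threshold; create the folder if needed
            os.makedirs('saved_models', exist_ok=True)
            torch.save(model, 'saved_models/model_'+prefix_filename+'_'+start_time+'_'+str(epoch)+'.pt')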
    if (epoch - best_param["train_epoch"] >2) and (epoch - best_param["test_epoch"] >18):        
        break
    print(epoch, train_MSE, test_MSE)
    
    train(model, train_df, optimizer, loss_function)

Streaming output truncated to the last 5000 lines.
        [6.5327e-03],
        [9.4773e-03],
        [2.4889e-01],
        [5.8520e-03],
        [1.4389e-01],
        [3.6791e-02],
        [7.3747e+01],
        [1.9169e-01],
        [1.9907e-05],
        [1.0974e-02],
        [3.9780e-03],
        [5.0348e-02],
        [6.9856e-05],
        [6.0729e-02],
        [7.4032e-02],
        [2.8800e-04]], grad_fn=<MseLossBackward>)
                                                 smiles  ...                                        cano_smiles
3600        [nH]1c2cc(ncc2c2c3cocc3c3C=CCc3c12)-c1ccco1  ...      C1=Cc2c(c3[nH]c4cc(-c5ccco5)ncc4c3c3cocc23)C1
3601  c1sc(-c2ccc(nc2)-c2ccc(-c3cccc4c[nH]cc34)c3coc...  ...  c1cc(-c2ccc(-c3ccc(-c4scc5ccoc45)cn3)c3cocc23)...
3602  c1sc(-c2cc3cnc4c5ccccc5sc4c3c3=CCC=c23)c2[nH]c...  ...  C1=c2c(-c3scc4cc[nH]c34)cc3cnc4c5ccccc5sc4c3c2...
3603  C1C=c2c3c(sc4cc(C5=CC=CC5)c5nsnc5c34)c3c4cscc4...  ...  C1=CCC(c2cc3sc4c(c5c(c6ccc7cscc7c64)=CCC=5)c3

In [None]:
# evaluate model
best_model = torch.load('saved_models/model_'+prefix_filename+'_'+start_time+'_'+str(best_param["test_epoch"])+'.pt')     

# best_model_dict = best_model.state_dict()
# best_model_wts = copy.deepcopy(best_model_dict)

# model.load_state_dict(best_model_wts)
# (best_model.align[0].weight == model.align[0].weight).all()
test_MAE, test_MSE = eval(best_model, test_df)
print("best epoch:",best_param["test_epoch"],"\n","test MSE:",test_MSE)

Streaming output truncated to the last 5000 lines.
        [3.0970e-01],
        [1.3815e-04],
        [6.9379e-02],
        [6.4682e-02],
        [7.3690e+01],
        [5.5596e-01],
        [2.6784e-02],
        [2.9299e-02],
        [2.2962e-03],
        [9.6965e-02],
        [6.4013e-03],
        [4.7466e-02],
        [2.7309e-01],
        [3.4328e-03]], grad_fn=<MseLossBackward>)
                                                 smiles  ...                                        cano_smiles
3600        [nH]1c2cc(ncc2c2c3cocc3c3C=CCc3c12)-c1ccco1  ...      C1=Cc2c(c3[nH]c4cc(-c5ccco5)ncc4c3c3cocc23)C1
3601  c1sc(-c2ccc(nc2)-c2ccc(-c3cccc4c[nH]cc34)c3coc...  ...  c1cc(-c2ccc(-c3ccc(-c4scc5ccoc45)cn3)c3cocc23)...
3602  c1sc(-c2cc3cnc4c5ccccc5sc4c3c3=CCC=c23)c2[nH]c...  ...  C1=c2c(-c3scc4cc[nH]c34)cc3cnc4c5ccccc5sc4c3c2...
3603  C1C=c2c3c(sc4cc(C5=CC=CC5)c5nsnc5c34)c3c4cscc4...  ...  C1=CCC(c2cc3sc4c(c5c(c6ccc7cscc7c64)=CCC=5)c3c...
3604      c1cc2sc3c(oc4cc([se]c34)-c3cc

### Example: Solubility



In [None]:
task_name = 'solubility'
tasks = ['measured log solubility in mols per litre']

raw_filename = "data/delaney-processed.csv"
feature_filename = raw_filename.replace('.csv','.pickle')
filename = raw_filename.replace('.csv','')
prefix_filename = raw_filename.split('/')[-1].replace('.csv','')

smiles_tasks_df = pd.read_csv(raw_filename)
smilesList = smiles_tasks_df.smiles.values
print("number of all smiles: ",len(smilesList))

if os.path.isfile(feature_filename):
    feature_dicts = pickle.load(open(feature_filename, "rb" ))
else:
    feature_dicts = save_smiles_dicts(smilesList,filename)

atom_num_dist = []
remained_smiles = []
canonical_smiles_list = []

for smiles in smilesList:
    try:
        mol = Chem.MolFromSmiles(smiles)  # None if RDKit cannot parse the SMILES
        atom_num_dist.append(len(mol.GetAtoms()))
        remained_smiles.append(smiles)
        canonical_smiles_list.append(Chem.MolToSmiles(mol, isomericSmiles=True))
    except Exception:
        # Report and skip any SMILES that RDKit could not process
        print(smiles)

print("number of successfully processed smiles: ", len(remained_smiles))
# .copy() makes the filtered frame independent, so adding a column later is safe
smiles_tasks_df = smiles_tasks_df[smiles_tasks_df["smiles"].isin(remained_smiles)].copy()

print(smiles_tasks_df)
smiles_tasks_df['cano_smiles'] =canonical_smiles_list
assert canonical_smiles_list[8]==Chem.MolToSmiles(Chem.MolFromSmiles(smiles_tasks_df['cano_smiles'][8]), isomericSmiles=True)

plt.figure(figsize=(5, 3))
sns.set(font_scale=1.5)
ax = sns.distplot(atom_num_dist, bins=28, kde=False)

plt.tight_layout()
plt.savefig("atom_num_dist_"+prefix_filename+".png",dpi=200)
plt.show()
plt.close()

number of all smiles:  1128
number of successfully processed smiles:  1128
         Compound ID  ...                                             smiles
0          Amigdalin  ...  OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...
1           Fenfuram  ...                             Cc1occc1C(=O)Nc2ccccc2
2             citral  ...                               CC(C)=CCCC(C)=CC(=O)
3             Picene  ...                 c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43
4          Thiophene  ...                                            c1ccsc1
...              ...  ...                                                ...
1123       halothane  ...                                   FC(F)(F)C(Cl)Br 
1124          Oxamyl  ...                          CNC(=O)ON=C(SC)C(=O)N(C)C
1125       Thiometon  ...                                  CCSCCSP(=S)(OC)OC
1126  2-Methylbutane  ...                                            CCC(C)C
1127        Stirofos  ...              COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl





In [None]:
random_seed = 888 # 69, 103, 107
start_time = str(time.ctime()).replace(':','-').replace(' ','_')

batch_size = 200
epochs = 200

p_dropout= 0.2
fingerprint_dim = 200

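weight_decay = 5 # negative log10 of the L2 weight decay; the optimizer uses 10**-weight_decay
learning_rate = 2.5 # negative log10 of the learning rate; the optimizer uses 10**-learning_rate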
radius = 2
T = 2

per_task_output_units_num = 1 # for regression model
output_units_num = len(tasks) * per_task_output_units_num

In [None]:
# feature_dicts = get_smiles_dicts(smilesList)
remained_df = smiles_tasks_df[smiles_tasks_df["cano_smiles"].isin(feature_dicts['smiles_to_atom_mask'].keys())]
uncovered_df = smiles_tasks_df.drop(remained_df.index)
uncovered_df

C
feature dicts file saved as /content/AttentiveFP/data/delaney-processed.pickle


Unnamed: 0,Compound ID,ESOL predicted log solubility in mols per litre,Minimum Degree,Molecular Weight,Number of H-Bond Donors,Number of Rings,Number of Rotatable Bonds,Polar Surface Area,measured log solubility in mols per litre,smiles,cano_smiles
934,Methane,-0.636,0,16.043,0,0,0,0.0,-0.9,C,C
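
The cell above keeps only molecules whose canonical SMILES appears among the featurizer's keys, and `uncovered_df` lists the rest; here that is the single methane row. The same membership filter in plain Python, with a hypothetical stand-in for the keys of `feature_dicts['smiles_to_atom_mask']`:

```python
def split_by_coverage(smiles_list, featurized):
    """Separate SMILES that have features from those that do not."""
    covered = [s for s in smiles_list if s in featurized]
    uncovered = [s for s in smiles_list if s not in featurized]
    return covered, uncovered

featurized = {"CCO", "c1ccccc1"}        # hypothetical stand-in for the feature keys
smiles = ["CCO", "C", "c1ccccc1"]
covered, uncovered = split_by_coverage(smiles, featurized)
print(covered, uncovered)   # ['CCO', 'c1ccccc1'] ['C']
```

In the notebook this leaves 1127 of the 1128 molecules for the train/test split.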


In [None]:
test_df = remained_df.sample(frac=0.2,random_state=random_seed)
train_df = remained_df.drop(test_df.index)

train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

print(len(test_df),sorted(test_df.cano_smiles.values))

225 ['BrCBr', 'Brc1ccc(I)cc1', 'Brc1cccc2ccccc12', 'Brc1ccccc1', 'Brc1ccccc1Br', 'C#CCCC', 'C1CCCCC1', 'C1CCCCCCC1', 'C1CCOC1', 'C=C(C)C', 'C=C(C)C1CC=C(C)C(=O)C1', 'C=C(Cl)CSC(=S)N(CC)CC', 'C=CC(=O)OC', 'C=CC(C)(O)CCC=C(C)C', 'C=CCC1(c2ccccc2)C(=O)NC(=O)NC1=O', 'C=Cc1ccccc1', 'CC(=O)C1(C)CCC2C3C=C(C)C4=CC(=O)CCC4(C)C3CCC21C', 'CC(=O)C1(O)CCC2C3CCC4=CC(=O)CCC4(C)C3CCC21C', 'CC(=O)Nc1ccc(O)cc1', 'CC(=O)OC(C)C', 'CC(=O)OCC(=O)C1(O)C(C)CC2C3CCC4=CC(=O)C=CC4(C)C3(F)C(O)CC21C', 'CC(=O)OCC(=O)C1(O)C(OC(C)=O)CC2C3CCC4=CC(=O)C=CC4(C)C3(F)C(O)CC21C', 'CC(=O)OCC(=O)C1(O)CCC2C3CCC4=CC(=O)CCC4(C)C3C(=O)CC21C', 'CC(=O)OCC(C)C', 'CC(C)=CC1C(C(=O)OCc2cccc(Oc3ccccc3)c2)C1(C)C', 'CC(C)=CCCC(C)=CC=O', 'CC(C)C(=O)C(C)C', 'CC(C)C(C)C(C)C', 'CC(C)C(Nc1ccc(C(F)(F)F)cc1Cl)C(=O)OC(C#N)c1cccc(Oc2ccccc2)c1', 'CC(C)CBr', 'CC(C)CC(C)C', 'CC(C)CCOC=O', 'CC(C)O', 'CC(C)OC(=O)C(O)(c1ccc(Br)cc1)c1ccc(Br)cc1', 'CC(C)c1ccc(NC(=O)N(C)C)cc1', 'CC(C)c1ccccc1', 'CC(Cl)(Cl)Cl', 'CC(N)=O', 'CC1(C)C(C=C(Cl)C(F)(F)F)C1C(=O)OC(

In [None]:
x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array([canonical_smiles_list[0]],feature_dicts)
loss_function = nn.MSELoss()

num_atom_features = x_atom.shape[-1]
num_bond_features = x_bonds.shape[-1]

model = Fingerprint(radius, T, num_atom_features, num_bond_features,
            fingerprint_dim, output_units_num, p_dropout)
model.cuda()

# optimizer = optim.Adam(model.parameters(), learning_rate, weight_decay=weight_decay)
optimizer = optim.Adam(model.parameters(), 10**-learning_rate, weight_decay=10**-weight_decay)
# optimizer = optim.SGD(model.parameters(), 10**-learning_rate, weight_decay=10**-weight_decay)

# tensorboard = SummaryWriter(log_dir="runs/"+start_time+"_"+prefix_filename+"_"+str(fingerprint_dim)+"_"+str(p_dropout))

model_parameters = filter(lambda p: p.requires_grad, model.parameters())
params = sum([np.prod(p.size()) for p in model_parameters])
print(params)

for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.data.shape)

863604
atom_fc.weight torch.Size([200, 39])
atom_fc.bias torch.Size([200])
neighbor_fc.weight torch.Size([200, 49])
neighbor_fc.bias torch.Size([200])
GRUCell.0.weight_ih torch.Size([600, 200])
GRUCell.0.weight_hh torch.Size([600, 200])
GRUCell.0.bias_ih torch.Size([600])
GRUCell.0.bias_hh torch.Size([600])
GRUCell.1.weight_ih torch.Size([600, 200])
GRUCell.1.weight_hh torch.Size([600, 200])
GRUCell.1.bias_ih torch.Size([600])
GRUCell.1.bias_hh torch.Size([600])
align.0.weight torch.Size([1, 400])
align.0.bias torch.Size([1])
align.1.weight torch.Size([1, 400])
align.1.bias torch.Size([1])
attend.0.weight torch.Size([200, 200])
attend.0.bias torch.Size([200])
attend.1.weight torch.Size([200, 200])
attend.1.bias torch.Size([200])
mol_GRUCell.weight_ih torch.Size([600, 200])
mol_GRUCell.weight_hh torch.Size([600, 200])
mol_GRUCell.bias_ih torch.Size([600])
mol_GRUCell.bias_hh torch.Size([600])
mol_align.weight torch.Size([1, 400])
mol_align.bias torch.Size([1])
mol_attend.weight torch.Si
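
The shapes printed above can be checked by hand. The sketch below is a minimal, standalone calculation (it does not import the notebook's `Fingerprint` class) that counts the parameters of one GRU cell and one linear layer from their sizes. The helper names `gru_cell_param_count` and `linear_param_count` are hypothetical and introduced only for illustration; the factor of 3 reflects the three gate blocks that a GRU cell stacks in `weight_ih` and `weight_hh`.

```python
def gru_cell_param_count(input_size: int, hidden_size: int) -> int:
    """Trainable parameters in one GRU cell with biases.

    weight_ih has shape (3 * hidden, input), weight_hh has shape
    (3 * hidden, hidden), and bias_ih / bias_hh have shape (3 * hidden,).
    """
    weights = 3 * hidden_size * (input_size + hidden_size)
    biases = 2 * 3 * hidden_size
    return weights + biases


def linear_param_count(in_features: int, out_features: int) -> int:
    """Trainable parameters in a dense layer: weight matrix plus bias."""
    return in_features * out_features + out_features


fingerprint_dim = 200  # matches the [200, 39] shape of atom_fc.weight above

gru_params = gru_cell_param_count(fingerprint_dim, fingerprint_dim)
atom_fc_params = linear_param_count(39, fingerprint_dim)
neighbor_fc_params = linear_param_count(49, fingerprint_dim)

print(gru_params, atom_fc_params, neighbor_fc_params)
```

Summing such per-layer counts over every layer of the model should reproduce the total printed by the cell, although the layer list shown here is truncated and cannot be checked in full.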

In [None]:
def train(model, dataset, optimizer, loss_function):
    model.train()
    np.random.seed(epoch)  # relies on the global `epoch` set by the training loop
    valList = np.arange(0,dataset.shape[0])
    
    #shuffle them
    np.random.shuffle(valList)
    batch_list = []

    for i in range(0, dataset.shape[0], batch_size):
        batch = valList[i:i+batch_size]
        batch_list.append(batch)   

    for counter, batch in enumerate(batch_list):
        batch_df = dataset.loc[batch,:]
        smiles_list = batch_df.cano_smiles.values
        y_val = batch_df[tasks[0]].values
        
        x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array(smiles_list,feature_dicts)
        atoms_prediction, mol_prediction = model(torch.Tensor(x_atom),torch.Tensor(x_bonds),torch.cuda.LongTensor(x_atom_index),torch.cuda.LongTensor(x_bond_index),torch.Tensor(x_mask))
        
        model.zero_grad()
        loss = loss_function(mol_prediction, torch.Tensor(y_val).view(-1,1))     
        loss.backward()
        optimizer.step()

def eval(model, dataset):
    model.eval()
    test_MAE_list = []
    test_MSE_list = []
    valList = np.arange(0,dataset.shape[0])
    batch_list = []

    for i in range(0, dataset.shape[0], batch_size):
        batch = valList[i:i+batch_size]
        batch_list.append(batch) 

    for counter, batch in enumerate(batch_list):
        batch_df = dataset.loc[batch,:]
        smiles_list = batch_df.cano_smiles.values
        print(batch_df)
        y_val = batch_df[tasks[0]].values
        
        x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array(smiles_list,feature_dicts)
        atoms_prediction, mol_prediction = model(torch.Tensor(x_atom),torch.Tensor(x_bonds),torch.cuda.LongTensor(x_atom_index),torch.cuda.LongTensor(x_bond_index),torch.Tensor(x_mask))
        MAE = F.l1_loss(mol_prediction, torch.Tensor(y_val).view(-1,1), reduction='none')        
        MSE = F.mse_loss(mol_prediction, torch.Tensor(y_val).view(-1,1), reduction='none')
        print(x_mask[:2],atoms_prediction.shape, mol_prediction,MSE)
        
        test_MAE_list.extend(MAE.data.squeeze().cpu().numpy())
        test_MSE_list.extend(MSE.data.squeeze().cpu().numpy())

    return np.array(test_MAE_list).mean(), np.array(test_MSE_list).mean()
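
Both `train` and `eval` above split the row indices into consecutive chunks of `batch_size`, and `train` shuffles them first with a seed taken from the epoch number. A minimal sketch of that batching step follows. It uses the standard-library `random` module instead of `np.random` so that it runs without NumPy, and the helper name `make_batches` is an assumption made for illustration.

```python
import random


def make_batches(num_rows, batch_size, seed=None):
    """Split row indices 0..num_rows-1 into batches of at most batch_size.

    When a seed is given, the indices are shuffled reproducibly first,
    which mirrors seeding the shuffle with the epoch number.
    """
    indices = list(range(num_rows))
    if seed is not None:
        random.Random(seed).shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, num_rows, batch_size)]


# Evaluation order: no shuffling, last batch is smaller.
eval_batches = make_batches(num_rows=5, batch_size=2)

# Training order: shuffled, but identical for a given epoch seed.
train_batches = make_batches(num_rows=10, batch_size=4, seed=0)

print(eval_batches)
print([len(b) for b in train_batches])
```

Seeding by epoch keeps each epoch's batch order reproducible across runs while still changing the order from one epoch to the next.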

In [None]:
best_param ={}
best_param["train_epoch"] = 0
best_param["test_epoch"] = 0
best_param["train_MSE"] = 9e8
best_param["test_MSE"] = 9e8

for epoch in range(800):
    train_MAE, train_MSE = eval(model, train_df)
    test_MAE, test_MSE = eval(model, test_df)

    if train_MSE < best_param["train_MSE"]:
        best_param["train_epoch"] = epoch
        best_param["train_MSE"] = train_MSE

    if test_MSE < best_param["test_MSE"]:
        best_param["test_epoch"] = epoch
        best_param["test_MSE"] = test_MSE
        if test_MSE < 0.35:
            torch.save(model, 'saved_models/model_'+prefix_filename+'_'+start_time+'_'+str(epoch)+'.pt')
    if (epoch - best_param["train_epoch"] >10) and (epoch - best_param["test_epoch"] >18):        
        break
    print(epoch, train_MSE, test_MSE)
    
    train(model, train_df, optimizer, loss_function)

Streaming output truncated to the last 5000 lines.
        [6.8878e-02],
        [1.2117e-04],
        [6.1625e-02],
        [1.1025e-02],
        [8.2167e-03],
        [3.4613e-01],
        [1.5582e-02],
        [1.2623e-02],
        [2.4261e-01],
        [3.1612e-04],
        [2.3293e+00],
        [2.3700e+00],
        [8.1108e-03],
        [6.9644e-01],
        [1.4063e+00],
        [4.8853e-02],
        [1.2787e-01],
        [4.9951e-01],
        [8.7744e-03],
        [5.0927e-01],
        [1.9355e-01],
        [3.5432e-01],
        [3.4909e-02],
        [1.4570e-02],
        [4.5561e-01],
        [4.6312e-03],
        [1.3948e-01],
        [2.2541e-01],
        [1.9747e-01],
        [1.1778e-03],
        [8.1480e-02],
        [1.5078e-01],
        [4.0593e-04],
        [2.4728e-02],
        [2.4883e-01],
        [1.9224e-03],
        [4.5016e-02],
        [6.4182e-01],
        [3.5148e-03],
        [5.6527e-03],
        [3.5276e-03],
        [2.6184e-02],
        [2.
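
The training loop above stops only when both metrics have stalled: more than 10 epochs without a new best training MSE and more than 18 epochs without a new best test MSE. The sketch below isolates that rule in a hypothetical helper, `should_stop`, whose name and keyword arguments are assumptions for illustration; the thresholds are the ones used in the loop.

```python
def should_stop(epoch, best_train_epoch, best_test_epoch,
                train_patience=10, test_patience=18):
    """Return True when neither metric has improved within its patience window."""
    train_stalled = (epoch - best_train_epoch) > train_patience
    test_stalled = (epoch - best_test_epoch) > test_patience
    return train_stalled and test_stalled


# Training MSE improved 5 epochs ago, so training continues.
print(should_stop(epoch=30, best_train_epoch=25, best_test_epoch=5))

# Both metrics have stalled past their windows, so training stops.
print(should_stop(epoch=30, best_train_epoch=19, best_test_epoch=11))
```

Requiring both conditions means the loop keeps running for as long as the training MSE is still improving, even if the test MSE has plateaued.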

In [None]:
# evaluate model
best_model = torch.load('saved_models/model_'+prefix_filename+'_'+start_time+'_'+str(best_param["test_epoch"])+'.pt')     

best_model_dict = best_model.state_dict()
best_model_wts = copy.deepcopy(best_model_dict)

model.load_state_dict(best_model_wts)
# sanity check: the weights restored into `model` match the saved best model
assert (best_model.align[0].weight == model.align[0].weight).all()
test_MAE, test_MSE = eval(model, test_df)
print("best epoch:",best_param["test_epoch"],"\n","test MSE:",test_MSE)

            Compound ID  ...                        cano_smiles
0       o-Chloroaniline  ...                        Nc1ccccc1Cl
1          hydrobenzoin  ...           OC(c1ccccc1)C(O)c1ccccc1
2        m-Nitroaniline  ...            Nc1cccc([N+](=O)[O-])c1
3     1-Bromonapthalene  ...                   Brc1cccc2ccccc12
4    3,4-Dichlorophenol  ...                  Oc1ccc(Cl)c(Cl)c1
..                  ...  ...                                ...
195         Succinimide  ...                      O=C1CCC(=O)N1
196      Dipropyl ether  ...                            CCCOCCC
197           Equilenin  ...  CC12CCc3c(ccc4cc(O)ccc34)C1CCC2=O
198     1-Chloroheptane  ...                          CCCCCCCCl
199    2-Methy-2-Butene  ...                           CC=C(C)C

[200 rows x 11 columns]
[[1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 

### Example: Bioactivity BACE

In [None]:
task_name = 'BACE'
tasks = ['Class']
raw_filename = "data/bace.csv"

feature_filename = raw_filename.replace('.csv','.pickle')
filename = raw_filename.replace('.csv','')
prefix_filename = raw_filename.split('/')[-1].replace('.csv','')

smiles_tasks_df = pd.read_csv(raw_filename)
smilesList = smiles_tasks_df.mol.values
print("number of all smiles: ",len(smilesList))

if os.path.isfile(feature_filename):
    feature_dicts = pickle.load(open(feature_filename, "rb" ))
else:
    feature_dicts = save_smiles_dicts(smilesList,filename)

atom_num_dist = []
remained_smiles = []
canonical_smiles_list = []

for smiles in smilesList:
    try:        
        mol = Chem.MolFromSmiles(smiles)
        atom_num_dist.append(len(mol.GetAtoms()))
        remained_smiles.append(smiles)
        canonical_smiles_list.append(Chem.MolToSmiles(Chem.MolFromSmiles(smiles), isomericSmiles=True))
    except Exception:
        print("not successfully processed smiles: ", smiles)

print("number of successfully processed smiles: ", len(remained_smiles))
smiles_tasks_df = smiles_tasks_df[smiles_tasks_df["mol"].isin(remained_smiles)]

print(smiles_tasks_df)
smiles_tasks_df['cano_smiles'] =canonical_smiles_list

assert canonical_smiles_list[8]==Chem.MolToSmiles(Chem.MolFromSmiles(smiles_tasks_df['cano_smiles'][8]), isomericSmiles=True)

plt.figure(figsize=(5, 3))
sns.set(font_scale=1.5)
ax = sns.distplot(atom_num_dist, bins=28, kde=False)

plt.tight_layout()
plt.savefig("atom_num_dist_"+prefix_filename+".png",dpi=200)
plt.show()
plt.close()

print(len([i for i in atom_num_dist if i<51]),len([i for i in atom_num_dist if i>50]))

number of all smiles:  1513
number of successfully processed smiles:  1513
                                                    mol  ... canvasUID
0     O1CC[C@@H](NC(=O)[C@@H](Cc2cc3cc(ccc3nc2N)-c2c...  ...         1
1     Fc1cc(cc(F)c1)C[C@H](NC(=O)[C@@H](N1CC[C@](NC(...  ...         2
2     S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H](...  ...         3
3     S1(=O)(=O)C[C@@H](Cc2cc(O[C@H](COCC)C(F)(F)F)c...  ...         4
4     S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H](...  ...         5
...                                                 ...  ...       ...
1508          Clc1cc2nc(n(c2cc1)C(CC(=O)NCC1CCOCC1)CC)N  ...      1543
1509          Clc1cc2nc(n(c2cc1)C(CC(=O)NCc1ncccc1)CC)N  ...      1544
1510             Brc1cc(ccc1)C1CC1C=1N=C(N)N(C)C(=O)C=1  ...      1545
1511       O=C1N(C)C(=NC(=C1)C1CC1c1cc(ccc1)-c1ccccc1)N  ...      1546
1512                Clc1cc2nc(n(c2cc1)CCCC(=O)NCC1CC1)N  ...      1547

[1513 rows x 595 columns]




1471 42


In [None]:
random_seed = 88
start_time = str(time.ctime()).replace(':','-').replace(' ','_')
start = time.time()

batch_size = 100
epochs = 800
p_dropout = 0.1
fingerprint_dim = 150

radius = 3
T = 2
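weight_decay = 2.9 # negative log10 exponent: the optimizer uses an L2 weight decay of 10**-2.9
learning_rate = 3.5 # negative log10 exponent: the optimizer uses a learning rate of 10**-3.5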

per_task_output_units_num = 2 # for classification model with 2 classes
output_units_num = len(tasks) * per_task_output_units_num
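
Note that `learning_rate` and `weight_decay` are stored as exponents: the optimizer cell that follows passes `10**-learning_rate` and `10**-weight_decay` to Adam. A minimal sketch of the conversion for the BACE settings is shown below; the variable names ending in `_exponent` are introduced here only for clarity.

```python
learning_rate_exponent = 3.5  # value of `learning_rate` in the cell above
weight_decay_exponent = 2.9   # value of `weight_decay` in the cell above

# Values actually handed to the optimizer.
effective_lr = 10 ** -learning_rate_exponent
effective_weight_decay = 10 ** -weight_decay_exponent

print(f"learning rate: {effective_lr:.3e}")
print(f"weight decay:  {effective_weight_decay:.3e}")
```

Searching over exponents rather than raw values is a common way to sample these hyperparameters evenly on a log scale.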

In [None]:
# smilesList = [smiles for smiles in canonical_smiles_list if len(Chem.MolFromSmiles(smiles).GetAtoms())<151]
# uncovered = [smiles for smiles in canonical_smiles_list if len(Chem.MolFromSmiles(smiles).GetAtoms())>150]

# smiles_tasks_df = smiles_tasks_df[~smiles_tasks_df["cano_smiles"].isin(uncovered)]
# feature_dicts = get_smiles_dicts(smilesList)

remained_df = smiles_tasks_df[smiles_tasks_df["cano_smiles"].isin(feature_dicts['smiles_to_atom_mask'].keys())]
uncovered_df = smiles_tasks_df.drop(remained_df.index)
uncovered_df

feature dicts file saved as /content/AttentiveFP/data/bace.pickle


Unnamed: 0,mol,CID,Class,Model,pIC50,MW,AlogP,HBA,HBD,RB,HeavyAtomCount,ChiralCenterCount,ChiralCenterCountAllPossible,RingCount,PSA,Estate,MR,Polar,sLi_Key,ssBe_Key,ssssBem_Key,sBH2_Key,ssBH_Key,sssB_Key,ssssBm_Key,sCH3_Key,dCH2_Key,ssCH2_Key,tCH_Key,dsCH_Key,aaCH_Key,sssCH_Key,ddC_Key,tsC_Key,dssC_Key,aasC_Key,aaaC_Key,ssssC_Key,sNH3_Key,sNH2_Key,...,Ring perimeter (RNGPERM),Ring bridge count (RNGBDGE),Molecule cyclized degree (MCD),Ring Fusion density (RFDELTA),Ring complexity index (RCI),Van der Waals surface area (VSA),MR1 (MR1),MR2 (MR2),MR3 (MR3),MR4 (MR4),MR5 (MR5),MR6 (MR6),MR7 (MR7),MR8 (MR8),ALOGP1 (ALOGP1),ALOGP2 (ALOGP2),ALOGP3 (ALOGP3),ALOGP4 (ALOGP4),ALOGP5 (ALOGP5),ALOGP6 (ALOGP6),ALOGP7 (ALOGP7),ALOGP8 (ALOGP8),ALOGP9 (ALOGP9),ALOGP10 (ALOGP10),PEOE1 (PEOE1),PEOE2 (PEOE2),PEOE3 (PEOE3),PEOE4 (PEOE4),PEOE5 (PEOE5),PEOE6 (PEOE6),PEOE7 (PEOE7),PEOE8 (PEOE8),PEOE9 (PEOE9),PEOE10 (PEOE10),PEOE11 (PEOE11),PEOE12 (PEOE12),PEOE13 (PEOE13),PEOE14 (PEOE14),canvasUID,cano_smiles


In [None]:
weights = []

for i,task in enumerate(tasks):    
    negative_df = remained_df[remained_df[task] == 0][["mol",task]]
    positive_df = remained_df[remained_df[task] == 1][["mol",task]]
    weights.append([(positive_df.shape[0]+negative_df.shape[0])/negative_df.shape[0],\
                    (positive_df.shape[0]+negative_df.shape[0])/positive_df.shape[0]])

test_df = remained_df.sample(frac=1/10, random_state=random_seed) # test set
training_data = remained_df.drop(test_df.index) # training data

# training data is further divided into validation set and train set
valid_df = training_data.sample(frac=1/9, random_state=random_seed) # validation set
train_df = training_data.drop(valid_df.index) # train set

train_df = train_df.reset_index(drop=True)
valid_df = valid_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)
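
Sampling 1/10 of the data for the test set and then 1/9 of the remainder for validation gives an approximate 80/10/10 split. The sketch below works through the arithmetic, assuming that all 1513 BACE rows are retained and that each sampled fraction is rounded to the nearest whole row. The helper `split_sizes` is hypothetical and only illustrates the counts.

```python
def split_sizes(n_rows, test_frac=1 / 10, valid_frac=1 / 9):
    """Return (train, valid, test) row counts for the two-stage split."""
    n_test = round(n_rows * test_frac)
    n_rest = n_rows - n_test
    # The validation fraction applies to what is left after the test split.
    n_valid = round(n_rest * valid_frac)
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test


n_train, n_valid, n_test = split_sizes(1513)
print(n_train, n_valid, n_test)
print(round(n_train / 1513, 3), round(n_valid / 1513, 3), round(n_test / 1513, 3))
```

Because the second fraction is taken from the remaining nine tenths, 1/9 of it equals roughly one tenth of the full dataset.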

In [None]:
x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array([canonical_smiles_list[0]],feature_dicts)

num_atom_features = x_atom.shape[-1]
num_bond_features = x_bonds.shape[-1]

loss_function = [nn.CrossEntropyLoss(torch.Tensor(weight),reduction='mean') for weight in weights]
model = Fingerprint(radius, T, num_atom_features,num_bond_features,
            fingerprint_dim, output_units_num, p_dropout)
model.cuda()
# tensorboard = SummaryWriter(log_dir="runs/"+start_time+"_"+prefix_filename+"_"+str(fingerprint_dim)+"_"+str(p_dropout))

# optimizer = optim.Adam(model.parameters(), learning_rate, weight_decay=weight_decay)
optimizer = optim.Adam(model.parameters(), 10**-learning_rate, weight_decay=10**-weight_decay)
model_parameters = filter(lambda p: p.requires_grad, model.parameters())

params = sum([np.prod(p.size()) for p in model_parameters])
print(params)

for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.data.shape)

649206
atom_fc.weight torch.Size([150, 39])
atom_fc.bias torch.Size([150])
neighbor_fc.weight torch.Size([150, 49])
neighbor_fc.bias torch.Size([150])
GRUCell.0.weight_ih torch.Size([450, 150])
GRUCell.0.weight_hh torch.Size([450, 150])
GRUCell.0.bias_ih torch.Size([450])
GRUCell.0.bias_hh torch.Size([450])
GRUCell.1.weight_ih torch.Size([450, 150])
GRUCell.1.weight_hh torch.Size([450, 150])
GRUCell.1.bias_ih torch.Size([450])
GRUCell.1.bias_hh torch.Size([450])
GRUCell.2.weight_ih torch.Size([450, 150])
GRUCell.2.weight_hh torch.Size([450, 150])
GRUCell.2.bias_ih torch.Size([450])
GRUCell.2.bias_hh torch.Size([450])
align.0.weight torch.Size([1, 300])
align.0.bias torch.Size([1])
align.1.weight torch.Size([1, 300])
align.1.bias torch.Size([1])
align.2.weight torch.Size([1, 300])
align.2.bias torch.Size([1])
attend.0.weight torch.Size([150, 150])
attend.0.bias torch.Size([150])
attend.1.weight torch.Size([150, 150])
attend.1.bias torch.Size([150])
attend.2.weight torch.Size([150, 150])

In [None]:
def train(model, dataset, optimizer, loss_function):
    model.train()
    np.random.seed(epoch)  # relies on the global `epoch` set by the training loop
    valList = np.arange(0,dataset.shape[0])

    #shuffle them
    np.random.shuffle(valList)
    batch_list = []

    for i in range(0, dataset.shape[0], batch_size):
        batch = valList[i:i+batch_size]
        batch_list.append(batch)   

    for counter, train_batch in enumerate(batch_list):
        batch_df = dataset.loc[train_batch,:]
        smiles_list = batch_df.cano_smiles.values
        
        x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array(smiles_list,feature_dicts)
        atoms_prediction, mol_prediction = model(torch.Tensor(x_atom),torch.Tensor(x_bonds),torch.cuda.LongTensor(x_atom_index),torch.cuda.LongTensor(x_bond_index),torch.Tensor(x_mask))
#         print(torch.Tensor(x_atom).size(),torch.Tensor(x_bonds).size(),torch.cuda.LongTensor(x_atom_index).size(),torch.cuda.LongTensor(x_bond_index).size(),torch.Tensor(x_mask).size())
        
        optimizer.zero_grad()
        loss = 0.0
        
        for i,task in enumerate(tasks):
            y_pred = mol_prediction[:, i * per_task_output_units_num:(i + 1) *
                                    per_task_output_units_num]
            y_val = batch_df[task].values

            validInds = np.where((y_val==0) | (y_val==1))[0]
#             validInds = np.where(y_val != -1)[0]
            if len(validInds) == 0:
                continue
            y_val_adjust = np.array([y_val[v] for v in validInds]).astype(float)
            validInds = torch.cuda.LongTensor(validInds).squeeze()
            y_pred_adjust = torch.index_select(y_pred, 0, validInds)

            loss += loss_function[i](
                y_pred_adjust,
                torch.cuda.LongTensor(y_val_adjust))
            
        # Backward pass and parameter update
        loss.backward()
        optimizer.step()

def eval(model, dataset):
    model.eval()
    y_val_list = {}
    y_pred_list = {}
    losses_list = []
    valList = np.arange(0,dataset.shape[0])
    batch_list = []

    for i in range(len(tasks)):
        y_val_list[i] = []
        y_pred_list[i] = []
    
    for i in range(0, dataset.shape[0], batch_size):
        batch = valList[i:i+batch_size]
        batch_list.append(batch)   

    for counter, eval_batch in enumerate(batch_list):
        batch_df = dataset.loc[eval_batch,:]
        smiles_list = batch_df.cano_smiles.values
        
        x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array(smiles_list,feature_dicts)
        atoms_prediction, mol_prediction = model(torch.Tensor(x_atom),torch.Tensor(x_bonds),torch.cuda.LongTensor(x_atom_index),torch.cuda.LongTensor(x_bond_index),torch.Tensor(x_mask))
        atom_pred = atoms_prediction.data[:,:,1].unsqueeze(2).cpu().numpy()

        for i,task in enumerate(tasks):
            y_pred = mol_prediction[:, i * per_task_output_units_num:(i + 1) *
                                    per_task_output_units_num]
            y_val = batch_df[task].values

            validInds = np.where((y_val==0) | (y_val==1))[0]
#             validInds = np.where((y_val=='0') | (y_val=='1'))[0]
#             print(validInds)
            if len(validInds) == 0:
                continue
            y_val_adjust = np.array([y_val[v] for v in validInds]).astype(float)
            validInds = torch.cuda.LongTensor(validInds).squeeze()
            y_pred_adjust = torch.index_select(y_pred, 0, validInds)
#             print(validInds)
            loss = loss_function[i](
                y_pred_adjust,
                torch.cuda.LongTensor(y_val_adjust))
#             print(y_pred_adjust)
            y_pred_adjust = F.softmax(y_pred_adjust,dim=-1).data.cpu().numpy()[:,1]
            losses_list.append(loss.cpu().detach().numpy())
        
            y_val_list[i].extend(y_val_adjust)
            y_pred_list[i].extend(y_pred_adjust)
                
    eval_roc = [roc_auc_score(y_val_list[i], y_pred_list[i]) for i in range(len(tasks))]
#     eval_prc = [auc(precision_recall_curve(y_val_list[i], y_pred_list[i])[1],precision_recall_curve(y_val_list[i], y_pred_list[i])[0]) for i in range(len(tasks))]
#     eval_precision = [precision_score(y_val_list[i],
#                                      (np.array(y_pred_list[i]) > 0.5).astype(int)) for i in range(len(tasks))]
#     eval_recall = [recall_score(y_val_list[i],
#                                (np.array(y_pred_list[i]) > 0.5).astype(int)) for i in range(len(tasks))]
    eval_loss = np.array(losses_list).mean()
    
    return eval_roc, eval_loss #eval_prc, eval_precision, eval_recall,
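
The `eval` function above converts each pair of output logits into a probability for class 1 with a softmax and then scores the ranking with ROC AUC. The standalone sketch below reproduces both steps in pure Python on made-up logits and labels. It computes ROC AUC as the fraction of positive/negative pairs that are ranked correctly (ties count as one half), which is the pairwise definition of the metric; the notebook itself calls `roc_auc_score`. The function names here are assumptions for illustration.

```python
import math


def softmax_positive_prob(logits):
    """Probability of class 1 from a [class0, class1] logit pair."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    return exps[1] / sum(exps)


def pairwise_roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical logits and labels, for illustration only.
example_logits = [[2.0, -1.0], [0.5, 0.5], [-1.0, 3.0], [0.0, 1.0]]
example_labels = [0, 1, 1, 0]

example_scores = [softmax_positive_prob(row) for row in example_logits]
example_auc = pairwise_roc_auc(example_labels, example_scores)
print(example_auc)
```

The pairwise loop is quadratic in the number of samples, so it is meant as a readable definition rather than a replacement for a library implementation.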

In [None]:
best_param ={}
best_param["roc_epoch"] = 0
best_param["loss_epoch"] = 0
best_param["valid_roc"] = 0
best_param["valid_loss"] = 9e8

for epoch in range(epochs):    
    train_roc, train_loss = eval(model, train_df)
    valid_roc, valid_loss = eval(model, valid_df)
    
    train_roc_mean = np.array(train_roc).mean()
    valid_roc_mean = np.array(valid_roc).mean()
    
#     tensorboard.add_scalars('ROC',{'train_roc':train_roc_mean,'valid_roc':valid_roc_mean},epoch)
#     tensorboard.add_scalars('Losses',{'train_losses':train_loss,'valid_losses':valid_loss},epoch)

    if valid_roc_mean > best_param["valid_roc"]:
        best_param["roc_epoch"] = epoch
        best_param["valid_roc"] = valid_roc_mean
        if valid_roc_mean > 0.85:
            torch.save(model, 'saved_models/model_'+prefix_filename+'_'+start_time+'_'+str(epoch)+'.pt')
    
    if valid_loss < best_param["valid_loss"]:
        best_param["loss_epoch"] = epoch
        best_param["valid_loss"] = valid_loss

    print("EPOCH:\t"+str(epoch)+'\n'
        +"train_roc"+":"+str(train_roc)+'\n'
        +"valid_roc"+":"+str(valid_roc)+'\n'
#         +"train_roc_mean"+":"+str(train_roc_mean)+'\n'
#         +"valid_roc_mean"+":"+str(valid_roc_mean)+'\n'
        )
    if (epoch - best_param["roc_epoch"] >18) and (epoch - best_param["loss_epoch"] >28):        
        break
        
    torch.manual_seed(epoch)    
    train(model, train_df, optimizer, loss_function)

EPOCH:	0
train_roc:[0.6771461621085682]
valid_roc:[0.6334507042253521]

EPOCH:	1
train_roc:[0.6788372028973534]
valid_roc:[0.6147887323943662]

EPOCH:	2
train_roc:[0.6857197939904707]
valid_roc:[0.6283450704225352]

EPOCH:	3
train_roc:[0.6872139139056431]
valid_roc:[0.6341549295774648]

EPOCH:	4
train_roc:[0.6899611666528961]
valid_roc:[0.6345070422535212]

EPOCH:	5
train_roc:[0.6925225150789062]
valid_roc:[0.6367957746478873]

EPOCH:	6
train_roc:[0.6994519265195958]
valid_roc:[0.6408450704225351]

EPOCH:	7
train_roc:[0.7035473298631193]
valid_roc:[0.6417253521126761]

EPOCH:	8
train_roc:[0.7097386322950233]
valid_roc:[0.6463028169014085]

EPOCH:	9
train_roc:[0.7166928309785454]
valid_roc:[0.6524647887323943]

EPOCH:	10
train_roc:[0.7239279517475006]
valid_roc:[0.6577464788732394]

EPOCH:	11
train_roc:[0.7271447850395218]
valid_roc:[0.6605633802816903]

EPOCH:	12
train_roc:[0.7326585695006748]
valid_roc:[0.6661971830985915]

EPOCH:	13
train_roc:[0.7328789005480736]
valid_roc:[0.6651408

In [None]:
# evaluate model
best_model = torch.load('saved_models/model_'+prefix_filename+'_'+start_time+'_'+str(best_param["roc_epoch"])+'.pt')     

best_model_dict = best_model.state_dict()
best_model_wts = copy.deepcopy(best_model_dict)

model.load_state_dict(best_model_wts)
# sanity check: the weights restored into `model` match the saved best model
assert (best_model.align[0].weight == model.align[0].weight).all()
test_roc, test_losses = eval(model, test_df)

print("best epoch:"+str(best_param["roc_epoch"])
      +"\n"+"test_roc:"+str(test_roc)
      +"\n"+"test_roc_mean:",str(np.array(test_roc).mean())
     )

best epoch:149
test_roc:[0.8681574239713775]
test_roc_mean: 0.8681574239713775


### Example: Bioactivity HIV

In [None]:
task_name = 'HIV'
tasks = ['HIV_active']
raw_filename = "data/HIV.csv"

feature_filename = raw_filename.replace('.csv','.pickle')
filename = raw_filename.replace('.csv','')

prefix_filename = raw_filename.split('/')[-1].replace('.csv','')
smiles_tasks_df = pd.read_csv(raw_filename)
smilesList = smiles_tasks_df.smiles.values
print("number of all smiles: ",len(smilesList))

atom_num_dist = []
remained_smiles = []
canonical_smiles_list = []

for smiles in smilesList:
    try:        
        mol = Chem.MolFromSmiles(smiles)
        atom_num_dist.append(len(mol.GetAtoms()))
        canonical_smiles_list.append(Chem.MolToSmiles(Chem.MolFromSmiles(smiles), isomericSmiles=True))
        remained_smiles.append(smiles)
    except Exception:
        print("not successfully processed smiles: ", smiles)

print("number of successfully processed smiles: ", len(remained_smiles))
smiles_tasks_df = smiles_tasks_df[smiles_tasks_df["smiles"].isin(remained_smiles)]
smiles_tasks_df['cano_smiles'] =canonical_smiles_list

plt.figure(figsize=(5, 3))
sns.set(font_scale=1.5)
ax = sns.distplot(atom_num_dist, bins=28, kde=False)

plt.tight_layout()
plt.savefig("atom_num_dist_"+prefix_filename+".png",dpi=200)
plt.show()
plt.close()

print(len([i for i in atom_num_dist if i<51]),len([i for i in atom_num_dist if i>50]))

number of all smiles:  41127
number of successfully processed smiles:  41127




39650 1477


In [None]:
random_seed = 8
start_time = str(time.ctime()).replace(':','-').replace(' ','_')
start = time.time()

batch_size = 200
epochs = 800
p_dropout = 0.1
fingerprint_dim = 150

radius = 4
T = 2
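weight_decay = 3.9 # negative log10 exponent: the optimizer uses an L2 weight decay of 10**-3.9
learning_rate = 3 # negative log10 exponent: the optimizer uses a learning rate of 10**-3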

per_task_output_units_num = 2 # for classification model with 2 classes
output_units_num = len(tasks) * per_task_output_units_num

In [None]:
smilesList = [smiles for smiles in canonical_smiles_list if len(Chem.MolFromSmiles(smiles).GetAtoms())<101]

if os.path.isfile(feature_filename):
    feature_dicts = pickle.load(open(feature_filename, "rb" ))
else:
    feature_dicts = save_smiles_dicts(smilesList,filename)
# feature_dicts = get_smiles_dicts(smilesList)

remained_df = smiles_tasks_df[smiles_tasks_df["cano_smiles"].isin(feature_dicts['smiles_to_atom_mask'].keys())]
uncovered_df = smiles_tasks_df.drop(remained_df.index)
uncovered_df

C1CN[Co-4]23(N1)(NCCN2)NCCN3
O=C1O[Cu-5]2(O)(O)(OC1=O)OC(=O)C(=O)O2
CCc1cc[n+]([Mn](SC#N)(SC#N)([n+]2ccc(CC)cc2)([n+]2ccc(CC)cc2)[n+]2ccc(CC)cc2)cc1
O=C1O[Al]23(OC1=O)(OC(=O)C(=O)O2)OC(=O)C(=O)O3
O=C1C[N+]23CC[N+]45CC(=O)O[Ni-4]24(O1)(OC(=O)C3)OC(=O)C5
CC1=[O+][Zr]234([O+]=C(C)C1)([O+]=C(C)CC(C)=[O+]2)([O+]=C(C)CC(C)=[O+]3)[O+]=C(C)CC(C)=[O+]4
O=C1C[N+]23CC[N+]45CC(=O)O[Cu-5]24(O1)(OC(=O)C3)OC(=O)C5
c1ccc2c3c(ccc2c1)O[Fe-4]12(Oc4ccc5ccccc5c4N=[O+]1)(Oc1ccc4ccccc4c1N=[O+]2)[O+]=N3
C[N+]1(C)COC(=S)S[Fe-4]123(SC(=S)OC[N+]2(C)C)SC(=S)OC[N+]3(C)C
c1c[n+]([Ni-4]([n+]2cc[nH]c2)([n+]2cc[nH]c2)([n+]2cc[nH]c2)([n+]2cc[nH]c2)[n+]2cc[nH]c2)c[nH]1
Cl[Pd-4]12([S+]=c3nc[nH]c4[nH]cnc34)([S+]=C3N=CNc4[nH]c[n+]1c43)[S+]=C1N=CNc3[nH]c[n+]2c31
CC1=[O+][Mn]23([O+]=C(C)C1)([O+]=C(C)CC(C)=[O+]2)[O+]=C(C)CC(C)=[O+]3
O=C1C[N+]23CCO[Fe-4]245(O1)OC(=O)C[N+]4(CC3)CC(=O)O5
Cl[Sn](Cl)(C12C3=C4C5=C1[Fe]45321678C2=C1C6C7=C28)C12C3=C4C5=C1[Fe]45321678C2=C1C6C7=C28
Br[Ni-4]12(Br)(NCCN1)NCCN2
C1CN[Ni-4]23(N1)(NCCN2)NCCN

Unnamed: 0,smiles,activity,HIV_active,cano_smiles
71,C1CN[Co-4]23(N1)(NCCN2)NCCN3,CI,0,C1CN[Co-4]23(N1)(NCCN2)NCCN3
79,O=C1O[Cu-5]2(O)(O)(OC1=O)OC(=O)C(=O)O2,CI,0,O=C1O[Cu-5]2(O)(O)(OC1=O)OC(=O)C(=O)O2
88,CCc1cc[n+]([Mn](SC#N)(SC#N)([n+]2ccc(CC)cc2)([...,CI,0,CCc1cc[n+]([Mn](SC#N)(SC#N)([n+]2ccc(CC)cc2)([...
137,O=C1O[Al]23(OC1=O)(OC(=O)C(=O)O2)OC(=O)C(=O)O3,CI,0,O=C1O[Al]23(OC1=O)(OC(=O)C(=O)O2)OC(=O)C(=O)O3
138,O=C1C[N+]23CC[N+]45CC(=O)O[Ni-4]24(O1)(OC(=O)C...,CI,0,O=C1C[N+]23CC[N+]45CC(=O)O[Ni-4]24(O1)(OC(=O)C...
...,...,...,...,...
40746,CC1OC(OC2C(O)COC(OC3C(C)OC(OC4C(OC(=O)C56CCC(C...,CM,1,CC1OC(OC2C(O)COC(OC3C(C)OC(OC4C(OC(=O)C56CCC(C...
40975,CC(C)CCCC(C)C1CCC2C3CCC4CC(CCC=C(c5cc(Cl)c(OCc...,CI,0,CC(C)CCCC(C)C1CCC2C3CCC4CC(CCC=C(c5cc(Cl)c(OCc...
41000,CC(C)CCCC(C)C1CCC2C3CCC4CC(CCC=C(c5cc(Cl)c(OCc...,CI,0,CC(C)CCCC(C)C1CCC2C3CCC4CC(CCC=C(c5cc(Cl)c(OCc...
41027,CC(C)CCCC(C)C1CCC2C3CCC4CC(CCC=C(c5cc(Cl)c(OCc...,CI,0,CC(C)CCCC(C)C1CCC2C3CCC4CC(CCC=C(c5cc(Cl)c(OCc...


In [None]:
weights = []

for i,task in enumerate(tasks):    
    negative_df = remained_df[remained_df[task] == 0][["smiles",task]]
    positive_df = remained_df[remained_df[task] == 1][["smiles",task]]
    weights.append([(positive_df.shape[0]+negative_df.shape[0])/negative_df.shape[0],\
                    (positive_df.shape[0]+negative_df.shape[0])/positive_df.shape[0]])

test_df = remained_df.sample(frac=1/10, random_state=random_seed) # test set
training_data = remained_df.drop(test_df.index) # training data

# training data is further divided into validation set and train set
valid_df = training_data.sample(frac=1/9, random_state=random_seed) # validation set
train_df = training_data.drop(valid_df.index) # train set

train_df = train_df.reset_index(drop=True)
valid_df = valid_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)
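
The `weights` list built above holds one pair per task: the total number of labelled molecules divided by the negative count, and the same total divided by the positive count. These pairs are later passed to `nn.CrossEntropyLoss` as per-class weights, so the rarer class contributes more to the loss. The sketch below shows the calculation on made-up class counts; the helper name `inverse_frequency_weights` and the counts are hypothetical.

```python
def inverse_frequency_weights(n_negative, n_positive):
    """Per-class weights [w_class0, w_class1] inversely proportional to class frequency."""
    total = n_negative + n_positive
    return [total / n_negative, total / n_positive]


# Hypothetical imbalanced task: 900 inactive and 100 active molecules.
example_weights = inverse_frequency_weights(n_negative=900, n_positive=100)
print(example_weights)
```

With these counts the positive class is weighted nine times more heavily than the negative class, which offsets the 9:1 imbalance.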

In [None]:
x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array([canonical_smiles_list[0]],feature_dicts)

num_atom_features = x_atom.shape[-1]
num_bond_features = x_bonds.shape[-1]

loss_function = [nn.CrossEntropyLoss(torch.Tensor(weight),reduction='mean') for weight in weights]

model = Fingerprint(radius, T, num_atom_features,num_bond_features,
            fingerprint_dim, output_units_num, p_dropout)
model.cuda()

# tensorboard = SummaryWriter(log_dir="runs/"+start_time+"_"+prefix_filename+"_"+str(fingerprint_dim)+"_"+str(p_dropout))
# optimizer = optim.Adam(model.parameters(), learning_rate, weight_decay=weight_decay)

optimizer = optim.Adam(model.parameters(), 10**-learning_rate, weight_decay=10**-weight_decay)
model_parameters = filter(lambda p: p.requires_grad, model.parameters())
params = sum([np.prod(p.size()) for p in model_parameters])
print(params)

for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.data.shape)

808057
atom_fc.weight torch.Size([150, 39])
atom_fc.bias torch.Size([150])
neighbor_fc.weight torch.Size([150, 49])
neighbor_fc.bias torch.Size([150])
GRUCell.0.weight_ih torch.Size([450, 150])
GRUCell.0.weight_hh torch.Size([450, 150])
GRUCell.0.bias_ih torch.Size([450])
GRUCell.0.bias_hh torch.Size([450])
GRUCell.1.weight_ih torch.Size([450, 150])
GRUCell.1.weight_hh torch.Size([450, 150])
GRUCell.1.bias_ih torch.Size([450])
GRUCell.1.bias_hh torch.Size([450])
GRUCell.2.weight_ih torch.Size([450, 150])
GRUCell.2.weight_hh torch.Size([450, 150])
GRUCell.2.bias_ih torch.Size([450])
GRUCell.2.bias_hh torch.Size([450])
GRUCell.3.weight_ih torch.Size([450, 150])
GRUCell.3.weight_hh torch.Size([450, 150])
GRUCell.3.bias_ih torch.Size([450])
GRUCell.3.bias_hh torch.Size([450])
align.0.weight torch.Size([1, 300])
align.0.bias torch.Size([1])
align.1.weight torch.Size([1, 300])
align.1.bias torch.Size([1])
align.2.weight torch.Size([1, 300])
align.2.bias torch.Size([1])
align.3.weight torch.S

In [None]:
def train(model, dataset, optimizer, loss_function):
    model.train()
    np.random.seed(epoch)  # relies on the global `epoch` set by the training loop
    valList = np.arange(0,dataset.shape[0])

    #shuffle them
    np.random.shuffle(valList)
    batch_list = []

    for i in range(0, dataset.shape[0], batch_size):
        batch = valList[i:i+batch_size]
        batch_list.append(batch) 

    for counter, train_batch in enumerate(batch_list):
        batch_df = dataset.loc[train_batch,:]
        smiles_list = batch_df.cano_smiles.values
        
        x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array(smiles_list,feature_dicts)
        atoms_prediction, mol_prediction = model(torch.Tensor(x_atom),torch.Tensor(x_bonds),torch.cuda.LongTensor(x_atom_index),torch.cuda.LongTensor(x_bond_index),torch.Tensor(x_mask))
#         print(torch.Tensor(x_atom).size(),torch.Tensor(x_bonds).size(),torch.cuda.LongTensor(x_atom_index).size(),torch.cuda.LongTensor(x_bond_index).size(),torch.Tensor(x_mask).size())
        
        optimizer.zero_grad()
        loss = 0.0

        for i,task in enumerate(tasks):
            y_pred = mol_prediction[:, i * per_task_output_units_num:(i + 1) *
                                    per_task_output_units_num]
            y_val = batch_df[task].values

            validInds = np.where((y_val==0) | (y_val==1))[0]
#             validInds = np.where(y_val != -1)[0]
            if len(validInds) == 0:
                continue
            y_val_adjust = np.array([y_val[v] for v in validInds]).astype(float)
            validInds = torch.cuda.LongTensor(validInds).squeeze()
            y_pred_adjust = torch.index_select(y_pred, 0, validInds)

            loss += loss_function[i](
                y_pred_adjust,
                torch.cuda.LongTensor(y_val_adjust))
            
        # Backward pass and parameter update
        loss.backward()
        optimizer.step()

def eval(model, dataset):
    model.eval()
    y_val_list = {}
    y_pred_list = {}
    losses_list = []
    valList = np.arange(0,dataset.shape[0])
    batch_list = []

    for i in range(0, dataset.shape[0], batch_size):
        batch = valList[i:i+batch_size]
        batch_list.append(batch)  

    for counter, eval_batch in enumerate(batch_list):
        batch_df = dataset.loc[eval_batch,:]
        smiles_list = batch_df.cano_smiles.values
        
        x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array(smiles_list,feature_dicts)
        atoms_prediction, mol_prediction = model(torch.Tensor(x_atom),torch.Tensor(x_bonds),torch.cuda.LongTensor(x_atom_index),torch.cuda.LongTensor(x_bond_index),torch.Tensor(x_mask))
        atom_pred = atoms_prediction.data[:,:,1].unsqueeze(2).cpu().numpy()
        
        for i,task in enumerate(tasks):
            y_pred = mol_prediction[:, i * per_task_output_units_num:(i + 1) *
                                    per_task_output_units_num]
            y_val = batch_df[task].values

            validInds = np.where((y_val==0) | (y_val==1))[0]
#             validInds = np.where((y_val=='0') | (y_val=='1'))[0]
#             print(validInds)
            if len(validInds) == 0:
                continue
            y_val_adjust = np.array([y_val[v] for v in validInds]).astype(float)
            validInds = torch.cuda.LongTensor(validInds).squeeze()
            y_pred_adjust = torch.index_select(y_pred, 0, validInds)
#             print(validInds)
            loss = loss_function[i](
                y_pred_adjust,
                torch.cuda.LongTensor(y_val_adjust))
#             print(y_pred_adjust)
            y_pred_adjust = F.softmax(y_pred_adjust,dim=-1).data.cpu().numpy()[:,1]
            losses_list.append(loss.cpu().detach().numpy())
            # Create each task's lists on first use, then append this batch's labels and scores
            y_val_list.setdefault(i, []).extend(y_val_adjust)
            y_pred_list.setdefault(i, []).extend(y_pred_adjust)
                
    eval_roc = [roc_auc_score(y_val_list[i], y_pred_list[i]) for i in range(len(tasks))]
#     eval_prc = [auc(precision_recall_curve(y_val_list[i], y_pred_list[i])[1],precision_recall_curve(y_val_list[i], y_pred_list[i])[0]) for i in range(len(tasks))]
#     eval_precision = [precision_score(y_val_list[i],
#                                      (np.array(y_pred_list[i]) > 0.5).astype(int)) for i in range(len(tasks))]
#     eval_recall = [recall_score(y_val_list[i],
#                                (np.array(y_pred_list[i]) > 0.5).astype(int)) for i in range(len(tasks))]
    eval_loss = np.array(losses_list).mean()
    
    return eval_roc, eval_loss #eval_prc, eval_precision, eval_recall, 

In [None]:
best_param ={}
best_param["roc_epoch"] = 0
best_param["loss_epoch"] = 0
best_param["valid_roc"] = 0
best_param["valid_loss"] = 9e8

for epoch in range(epochs):    
    train_roc, train_loss = eval(model, train_df)
    valid_roc, valid_loss = eval(model, valid_df)
    train_roc_mean = np.array(train_roc).mean()
    valid_roc_mean = np.array(valid_roc).mean()
    
#     tensorboard.add_scalars('ROC',{'train_roc':train_roc_mean,'valid_roc':valid_roc_mean},epoch)
#     tensorboard.add_scalars('Losses',{'train_losses':train_loss,'valid_losses':valid_loss},epoch)

    if valid_roc_mean > best_param["valid_roc"]:
        best_param["roc_epoch"] = epoch
        best_param["valid_roc"] = valid_roc_mean
        # A checkpoint is written only when the new best validation ROC-AUC also exceeds 0.80
        if valid_roc_mean > 0.80:
            torch.save(model, 'saved_models/model_'+prefix_filename+'_'+start_time+'_'+str(epoch)+'.pt')
    
    if valid_loss < best_param["valid_loss"]:
        best_param["loss_epoch"] = epoch
        best_param["valid_loss"] = valid_loss

    print("EPOCH:\t"+str(epoch)+'\n'
        +"train_roc"+":"+str(train_roc)+'\n'
        +"valid_roc"+":"+str(valid_roc)+'\n'
#         +"train_roc_mean"+":"+str(train_roc_mean)+'\n'
#         +"valid_roc_mean"+":"+str(valid_roc_mean)+'\n'
        )
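    # Early stopping: quit once validation ROC-AUC has not improved for more than 16 epochs
    # and validation loss has not improved for more than 18 epochs
    if (epoch - best_param["roc_epoch"] > 16) and (epoch - best_param["loss_epoch"] > 18):
        break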
        
    train(model, train_df, optimizer, loss_function)

EPOCH:	0
train_roc:[0.3576718002116027]
valid_roc:[0.3844487230459274]

EPOCH:	1
train_roc:[0.7180680257295052]
valid_roc:[0.7143795258495529]

EPOCH:	2
train_roc:[0.7483357766348686]
valid_roc:[0.7364944852202114]

EPOCH:	3
train_roc:[0.7621301951002843]
valid_roc:[0.7399476442129433]

EPOCH:	4
train_roc:[0.7708520926296769]
valid_roc:[0.7506108919692834]

EPOCH:	5
train_roc:[0.7725496963749798]
valid_roc:[0.7460632647317659]

EPOCH:	6
train_roc:[0.7748986160448451]
valid_roc:[0.7795696282828464]

EPOCH:	7
train_roc:[0.7861898572276156]
valid_roc:[0.7749214886555016]

EPOCH:	8
train_roc:[0.7874842756930314]
valid_roc:[0.776330895721746]

EPOCH:	9
train_roc:[0.791599234966536]
valid_roc:[0.7905790853819248]

EPOCH:	10
train_roc:[0.7910599749924081]
valid_roc:[0.7852184245911379]

EPOCH:	11
train_roc:[0.7942709352126133]
valid_roc:[0.7722478590860968]

EPOCH:	12
train_roc:[0.7944213022752153]
valid_roc:[0.7861297369255719]

EPOCH:	13
train_roc:[0.8052723033894257]
valid_roc:[0.785180453

In [None]:
# evaluate model
best_model = torch.load('saved_models/model_'+prefix_filename+'_'+start_time+'_'+str(best_param["roc_epoch"])+'.pt')     

# best_model_dict = best_model.state_dict()
# best_model_wts = copy.deepcopy(best_model_dict)

# model.load_state_dict(best_model_wts)
# (best_model.align[0].weight == model.align[0].weight).all()

test_roc, test_losses = eval(best_model, test_df)

print("best epoch:"+str(best_param["roc_epoch"])
      +"\n"+"test_roc:"+str(test_roc)
      +"\n"+"test_roc_mean:",str(np.array(test_roc).mean())
     )

best epoch:74
test_roc:[0.8497731663288572]
test_roc_mean: 0.8497731663288572


### Example: Bioactivity MUV

In [None]:
task_name = 'muv'
tasks = [
    "MUV-466","MUV-548","MUV-600","MUV-644","MUV-652","MUV-689","MUV-692","MUV-712","MUV-713","MUV-733","MUV-737","MUV-810","MUV-832","MUV-846","MUV-852","MUV-858","MUV-859"
]
raw_filename = "data/muv.csv"

feature_filename = raw_filename.replace('.csv','.pickle')
filename = raw_filename.replace('.csv','')
prefix_filename = raw_filename.split('/')[-1].replace('.csv','')

smiles_tasks_df = pd.read_csv(raw_filename)
smilesList = smiles_tasks_df.smiles.values
print("number of all smiles: ",len(smilesList))

atom_num_dist = []
remained_smiles = []
canonical_smiles_list = []

for smiles in smilesList:
    try:
        # MolFromSmiles returns None for unparsable input, which raises on the next line
        mol = Chem.MolFromSmiles(smiles)
        atom_num_dist.append(len(mol.GetAtoms()))
        remained_smiles.append(smiles)
        # Reuse the parsed molecule instead of parsing the SMILES string a second time
        canonical_smiles_list.append(Chem.MolToSmiles(mol, isomericSmiles=True))
    except Exception:
        print("not successfully processed smiles: ", smiles)

print("number of successfully processed smiles: ", len(remained_smiles))
smiles_tasks_df = smiles_tasks_df[smiles_tasks_df["smiles"].isin(remained_smiles)]

print(smiles_tasks_df)
smiles_tasks_df['cano_smiles'] =canonical_smiles_list
assert canonical_smiles_list[8]==Chem.MolToSmiles(Chem.MolFromSmiles(smiles_tasks_df['cano_smiles'][8]), isomericSmiles=True)

plt.figure(figsize=(5, 3))
sns.set(font_scale=1.5)
ax = sns.distplot(atom_num_dist, bins=28, kde=False)

plt.tight_layout()
plt.savefig("atom_num_dist_"+prefix_filename+".png",dpi=200)
plt.show()
plt.close()

print(len([i for i in atom_num_dist if i<51]),len([i for i in atom_num_dist if i>50]))

number of all smiles:  93087
number of successfully processed smiles:  93087
       MUV-466  ...                                             smiles
0          NaN  ...    Cc1cccc(N2CCN(C(=O)C34CC5CC(CC(C5)C3)C4)CC2)c1C
1          0.0  ...                Cn1ccnc1SCC(=O)Nc1ccc(Oc2ccccc2)cc1
2          NaN  ...  COc1cc2c(cc1NC(=O)CN1C(=O)NC3(CCc4ccccc43)C1=O...
3          NaN  ...  O=C1/C(=C/NC2CCS(=O)(=O)C2)c2ccccc2C(=O)N1c1cc...
4          0.0  ...                          NC(=O)NC(Cc1ccccc1)C(=O)O
...        ...  ...                                                ...
93082      NaN  ...                           O=C(NCc1ccccc1Cl)C1CCCO1
93083      NaN  ...        COc1cc(NCCCCCN2C(=O)c3ccccc3C2=O)c2ncccc2c1
93084      NaN  ...                 CCN(CC)c1ccc(/C=C2/C(=O)ON=C2C)cc1
93085      NaN  ...   Cc1cc(=O)oc2cc(OCC(=O)c3ccc4c(c3)NC(=O)CO4)ccc12
93086      NaN  ...           COc1ccc([N+](=O)[O-])cc1NC(=O)c1ccc(C)o1

[93087 rows x 19 columns]




93087 0


In [None]:
random_seed = 68
start_time = str(time.ctime()).replace(':','-').replace(' ','_')
start = time.time()

batch_size = 100
epochs = 800
p_dropout = 0.2
fingerprint_dim = 250

radius = 3
T = 2
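weight_decay = 3.5 # negative base-10 exponent: Adam receives weight_decay=10**-3.5 (L2 regularization strength)
learning_rate = 3.7 # negative base-10 exponent: Adam receives lr=10**-3.7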

per_task_output_units_num = 2 # for classification model with 2 classes
output_units_num = len(tasks) * per_task_output_units_num
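
The `learning_rate` and `weight_decay` values in this cell are not used directly: the optimizer cell further down passes `10**-learning_rate` and `10**-weight_decay` to `optim.Adam`. As a quick check of what that means numerically, the sketch below converts both values. The names `effective_lr` and `effective_weight_decay` are introduced here for illustration only and do not appear in the notebook.

```python
# Exponent-style hyperparameters, as defined in the cell above.
learning_rate = 3.7
weight_decay = 3.5

# Hypothetical names for the values the optimizer actually receives.
effective_lr = 10 ** -learning_rate
effective_weight_decay = 10 ** -weight_decay

print(f"lr = {effective_lr:.3e}")
print(f"weight_decay = {effective_weight_decay:.3e}")
```

Storing the exponent rather than the value itself is convenient when a hyperparameter search samples on a log scale, though the notebook does not say whether that was the motivation here.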

In [None]:
smilesList = [smiles for smiles in canonical_smiles_list if len(Chem.MolFromSmiles(smiles).GetAtoms())<151]
# uncovered = [smiles for smiles in canonical_smiles_list if len(Chem.MolFromSmiles(smiles).GetAtoms())>150]

# smiles_tasks_df = smiles_tasks_df[~smiles_tasks_df["cano_smiles"].isin(uncovered)]

if os.path.isfile(feature_filename):
    # Reuse cached features; the with-block closes the file handle after loading
    with open(feature_filename, "rb") as f:
        feature_dicts = pickle.load(f)
else:
    feature_dicts = save_smiles_dicts(smilesList,filename)
# feature_dicts = get_smiles_dicts(smilesList)

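# feature_dicts was built from the canonical SMILES in smilesList, so rows whose raw "smiles"
# string is not among its keys are set aside in uncovered_df
remained_df = smiles_tasks_df[smiles_tasks_df["smiles"].isin(feature_dicts['smiles_to_atom_mask'].keys())]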
uncovered_df = smiles_tasks_df.drop(remained_df.index)
uncovered_df

feature dicts file saved as /content/AttentiveFP/data/muv.pickle


Unnamed: 0,MUV-466,MUV-548,MUV-600,MUV-644,MUV-652,MUV-689,MUV-692,MUV-712,MUV-713,MUV-733,MUV-737,MUV-810,MUV-832,MUV-846,MUV-852,MUV-858,MUV-859,mol_id,smiles,cano_smiles
11942,,0.0,,,,,0.0,0.0,,,,,,,,,,CID750288,COC(=O)Cn1cnc2cccc3cccc1c32,COC(=O)CN1C=Nc2cccc3cccc1c23
15844,,,,,,0.0,,,,,,,,,,,,CID6901636,CNC(=S)N/N=c1\c(=O)c2cccc3cccc1c32,CNC(=S)N/N=C1\C(=O)c2cccc3cccc1c23
17840,,,,0.0,,,,,,,,,,,,,,CID5771339,CC(=O)c1ccc2cccc3[nH]c(C)nc1c23,CC(=O)c1ccc2cccc3c2c1N=C(C)N3
23103,0.0,,,,0.0,,,0.0,,0.0,,,,,,,0.0,CID1046667,CCOC(=O)c1cn2nc(N3CCCC3)sc3c(F)c(F)c(F)c(c1=O)c32,CCOC(=O)c1cn2c3c(c(F)c(F)c(F)c3c1=O)SC(N1CCCC1...
23165,,,,,,0.0,,,,,,0.0,,,,,,CID3149395,COc1ccc(-c2nc3cccc4cccc([nH]2)c43)cc1,COc1ccc(C2=Nc3cccc4cccc(c34)N2)cc1
32750,,,,,,,,,,,,0.0,,,,,,CID801625,COc1ccccc1-c1nc2cccc3cccc([nH]1)c32,COc1ccccc1C1=Nc2cccc3cccc(c23)N1
39313,,,0.0,,,,,,,,,,0.0,,,,,CID2313925,CC(=O)n1c(=O)c2cccc3cccc(c(=O)n1C(C)=O)c32,CC(=O)N1C(=O)c2cccc3cccc(c23)C(=O)N1C(C)=O
40311,,,,,,0.0,,,,,,,,,,,,CID4053114,CC(=O)c1ccc2cccc3c2c1ncn3C,CC(=O)c1ccc2cccc3c2c1N=CN3C
46460,,,,,,,,,,,,,,,,0.0,,CID2179252,C=CCn1c(=O)n(/C=C/C)c2cccc3cccc1c32,C=CCN1C(=O)N(/C=C/C)c2cccc3cccc1c23
66186,,,,,,,0.0,0.0,,,0.0,,,,,,,CID703829,O=c1c2ccccc2nc2n1-n1cnnc1-c1ccccc1-2,O=c1c2ccccc2nc2c3ccccc3c3nncn3n12


In [None]:
weights = []

for i,task in enumerate(tasks):    
    negative_df = remained_df[remained_df[task] == 0][["smiles",task]]
    positive_df = remained_df[remained_df[task] == 1][["smiles",task]]

    negative_test = negative_df.sample(frac=1/10,random_state=random_seed)
    negative_valid = negative_df.drop(negative_test.index).sample(frac=1/9,random_state=random_seed)
    negative_train = negative_df.drop(negative_test.index).drop(negative_valid.index)
    
    positive_test = positive_df.sample(frac=1/10,random_state=random_seed)
    positive_valid = positive_df.drop(positive_test.index).sample(frac=1/9,random_state=random_seed)
    positive_train = positive_df.drop(positive_test.index).drop(positive_valid.index)
    
    # Per-class loss weights [total/negatives, total/positives]; the counts come from the test split
    weights.append([(positive_test.shape[0]+negative_test.shape[0])/negative_test.shape[0],
                    (positive_test.shape[0]+negative_test.shape[0])/positive_test.shape[0]])
    
    train_df_new = pd.concat([negative_train,positive_train])
    valid_df_new = pd.concat([negative_valid,positive_valid])
    test_df_new = pd.concat([negative_test,positive_test])

    if i==0:
        train_df = train_df_new
        test_df = test_df_new
        valid_df = valid_df_new
    else:
        train_df = pd.merge(train_df, train_df_new, on='smiles', how='outer') 
        test_df = pd.merge(test_df, test_df_new, on='smiles', how='outer')
        valid_df = pd.merge(valid_df, valid_df_new, on='smiles', how='outer')
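
The split in this cell takes 1/10 of each class for testing and then 1/9 of what remains for validation, which works out to roughly 8:1:1 for train, validation, and test. A minimal sketch of that arithmetic follows, using made-up class counts rather than the real MUV numbers. `split_sizes` and `class_weights` are hypothetical helpers, and the rounding assumes that pandas `sample(frac=...)` rounds the requested fraction of rows to the nearest integer.

```python
def split_sizes(n, test_frac=1/10, valid_frac_of_rest=1/9):
    """Return (train, valid, test) sizes for the two-stage fractional split."""
    n_test = round(n * test_frac)
    n_valid = round((n - n_test) * valid_frac_of_rest)
    return n - n_test - n_valid, n_valid, n_test


def class_weights(n_negative, n_positive):
    """Return [total/negatives, total/positives], the same form as the weights list."""
    total = n_negative + n_positive
    return [total / n_negative, total / n_positive]


# Made-up counts for a single task: 9000 inactive and 30 active molecules.
neg_train, neg_valid, neg_test = split_sizes(9000)
pos_train, pos_valid, pos_test = split_sizes(30)

print("negatives (train, valid, test):", (neg_train, neg_valid, neg_test))
print("positives (train, valid, test):", (pos_train, pos_valid, pos_test))
print("class weights from test counts:", class_weights(neg_test, pos_test))
```

With so few positives per task, the positive-class weight comes out far larger than the negative-class weight, which is how the weighted cross-entropy loss keeps the rare active molecules from being ignored.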

In [None]:
x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array([smilesList[0]],feature_dicts)

num_atom_features = x_atom.shape[-1]
num_bond_features = x_bonds.shape[-1]

loss_function = [nn.CrossEntropyLoss(torch.Tensor(weight),reduction='mean') for weight in weights]
model = Fingerprint(radius, T, num_atom_features,num_bond_features,
            fingerprint_dim, output_units_num, p_dropout)
model.cuda()

# tensorboard = SummaryWriter(log_dir="runs/"+start_time+"_"+prefix_filename+"_"+str(fingerprint_dim)+"_"+str(p_dropout))
# optimizer = optim.Adam(model.parameters(), learning_rate, weight_decay=weight_decay)

optimizer = optim.Adam(model.parameters(), 10**-learning_rate, weight_decay=10**-weight_decay)
model_parameters = filter(lambda p: p.requires_grad, model.parameters())

params = sum([np.prod(p.size()) for p in model_parameters])
print(params)

for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.data.shape)

1790038
atom_fc.weight torch.Size([250, 39])
atom_fc.bias torch.Size([250])
neighbor_fc.weight torch.Size([250, 49])
neighbor_fc.bias torch.Size([250])
GRUCell.0.weight_ih torch.Size([750, 250])
GRUCell.0.weight_hh torch.Size([750, 250])
GRUCell.0.bias_ih torch.Size([750])
GRUCell.0.bias_hh torch.Size([750])
GRUCell.1.weight_ih torch.Size([750, 250])
GRUCell.1.weight_hh torch.Size([750, 250])
GRUCell.1.bias_ih torch.Size([750])
GRUCell.1.bias_hh torch.Size([750])
GRUCell.2.weight_ih torch.Size([750, 250])
GRUCell.2.weight_hh torch.Size([750, 250])
GRUCell.2.bias_ih torch.Size([750])
GRUCell.2.bias_hh torch.Size([750])
align.0.weight torch.Size([1, 500])
align.0.bias torch.Size([1])
align.1.weight torch.Size([1, 500])
align.1.bias torch.Size([1])
align.2.weight torch.Size([1, 500])
align.2.bias torch.Size([1])
attend.0.weight torch.Size([250, 250])
attend.0.bias torch.Size([250])
attend.1.weight torch.Size([250, 250])
attend.1.bias torch.Size([250])
attend.2.weight torch.Size([250, 250]

In [None]:
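def train(model, dataset, optimizer, loss_function):
    # Relies on notebook globals: epoch, batch_size, tasks, feature_dicts, per_task_output_units_num
    model.train()
    np.random.seed(epoch)  # reseed every epoch so each epoch's shuffle is reproducible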
    valList = np.arange(0,dataset.shape[0])

    #shuffle them
    np.random.shuffle(valList)
    batch_list = []

    for i in range(0, dataset.shape[0], batch_size):
        batch = valList[i:i+batch_size]
        batch_list.append(batch)   

    for counter, train_batch in enumerate(batch_list):
        batch_df = dataset.loc[train_batch,:]
        smiles_list = batch_df.smiles.values
        
        x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array(smiles_list,feature_dicts)
        atoms_prediction, mol_prediction = model(torch.Tensor(x_atom),torch.Tensor(x_bonds),torch.cuda.LongTensor(x_atom_index),torch.cuda.LongTensor(x_bond_index),torch.Tensor(x_mask))
#         print(torch.Tensor(x_atom).size(),torch.Tensor(x_bonds).size(),torch.cuda.LongTensor(x_atom_index).size(),torch.cuda.LongTensor(x_bond_index).size(),torch.Tensor(x_mask).size())
        
        model.zero_grad()

        # Accumulate the class-weighted cross-entropy loss over every task that has labels in this batch
        loss = 0.0
        for i,task in enumerate(tasks):
            y_pred = mol_prediction[:, i * per_task_output_units_num:(i + 1) *
                                    per_task_output_units_num]
            y_val = batch_df[task].values

            validInds = np.where((y_val==0) | (y_val==1))[0]
#             validInds = np.where(y_val != -1)[0]
            if len(validInds) == 0:
                continue
            y_val_adjust = np.array([y_val[v] for v in validInds]).astype(float)
            validInds = torch.cuda.LongTensor(validInds).squeeze()
            y_pred_adjust = torch.index_select(y_pred, 0, validInds)

            loss += loss_function[i](
                y_pred_adjust,
                torch.cuda.LongTensor(y_val_adjust))
            
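        # Backward pass and parameter update; skipped when no task had a valid label in this batch,
        # because loss is then still the float 0.0 and has no .backward()
        if torch.is_tensor(loss):
            loss.backward()
            optimizer.step()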
        
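def eval(model, dataset):
    # Shadows Python's built-in eval(); also reads loss_function, tasks and batch_size from the notebook's globals
    model.eval()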
    y_val_list = {}
    y_pred_list = {}
    losses_list = []
    valList = np.arange(0,dataset.shape[0])
    batch_list = []

    for i in range(0, dataset.shape[0], batch_size):
        batch = valList[i:i+batch_size]
        batch_list.append(batch) 

    for counter, eval_batch in enumerate(batch_list):
        batch_df = dataset.loc[eval_batch,:]
        smiles_list = batch_df.smiles.values
        
        x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array(smiles_list,feature_dicts)
        atoms_prediction, mol_prediction = model(torch.Tensor(x_atom),torch.Tensor(x_bonds),torch.cuda.LongTensor(x_atom_index),torch.cuda.LongTensor(x_bond_index),torch.Tensor(x_mask))
        atom_pred = atoms_prediction.data[:,:,1].unsqueeze(2).cpu().numpy()

        for i,task in enumerate(tasks):
            y_pred = mol_prediction[:, i * per_task_output_units_num:(i + 1) *
                                    per_task_output_units_num]
            y_val = batch_df[task].values

            validInds = np.where((y_val==0) | (y_val==1))[0]
#             validInds = np.where((y_val=='0') | (y_val=='1'))[0]
#             print(validInds)
            if len(validInds) == 0:
                continue
            y_val_adjust = np.array([y_val[v] for v in validInds]).astype(float)
            validInds = torch.cuda.LongTensor(validInds).squeeze()
            y_pred_adjust = torch.index_select(y_pred, 0, validInds)
#             print(validInds)
            loss = loss_function[i](
                y_pred_adjust,
                torch.cuda.LongTensor(y_val_adjust))
#             print(y_pred_adjust)
            y_pred_adjust = F.softmax(y_pred_adjust,dim=-1).data.cpu().numpy()[:,1]
            losses_list.append(loss.cpu().detach().numpy())
            # Create each task's lists on first use, then append this batch's labels and scores
            y_val_list.setdefault(i, []).extend(y_val_adjust)
            y_pred_list.setdefault(i, []).extend(y_pred_adjust)
#             print(y_val,y_pred,validInds,y_val_adjust,y_pred_adjust)            
    eval_roc = [roc_auc_score(y_val_list[i], y_pred_list[i]) for i in range(len(tasks))]
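    # Area under the precision-recall curve per task: recall (index 1) is the x-axis, precision (index 0) the y-axis
    eval_prc = [auc(precision_recall_curve(y_val_list[i], y_pred_list[i])[1],precision_recall_curve(y_val_list[i], y_pred_list[i])[0]) for i in range(len(tasks))]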
#     eval_precision = [precision_score(y_val_list[i],
#                                      (np.array(y_pred_list[i]) > 0.5).astype(int)) for i in range(len(tasks))]
#     eval_recall = [recall_score(y_val_list[i],
#                                (np.array(y_pred_list[i]) > 0.5).astype(int)) for i in range(len(tasks))]
    eval_loss = np.array(losses_list).mean()
    
    return eval_roc, eval_prc, eval_loss # eval_precision, eval_recall, 
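
As the dataframe printed earlier suggests, MUV labels are sparse: a molecule has a 0 or 1 for some tasks and `NaN` for the rest. That is why both `train` and `eval` keep only the rows whose label is exactly 0 or 1 before computing the loss or the metrics. The following plain-Python sketch shows that masking step on made-up values. It mirrors the `np.where((y_val==0) | (y_val==1))` line rather than reproducing the notebook's tensor code, and `valid_label_indices` is a hypothetical helper.

```python
import math


def valid_label_indices(labels):
    """Indices whose label is exactly 0 or 1; NaN (no measurement) is skipped."""
    return [i for i, y in enumerate(labels) if y == 0 or y == 1]


# Made-up labels for one task in one batch, and made-up positive-class probabilities.
labels = [math.nan, 0.0, 1.0, math.nan, 0.0]
scores = [0.2, 0.1, 0.9, 0.5, 0.4]

keep = valid_label_indices(labels)
y_true = [labels[i] for i in keep]
y_score = [scores[i] for i in keep]

print("kept indices:", keep)
print("labels used for the metric:", y_true)
print("scores used for the metric:", y_score)
```

Comparisons with `NaN` are always false, so unlabelled rows drop out without any explicit `isnan` check.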

In [None]:
best_param ={}
best_param["roc_epoch"] = 0
best_param["loss_epoch"] = 0
best_param["valid_roc"] = 0
best_param["valid_loss"] = 9e8

for epoch in range(epochs):    
    train_roc, train_prc, train_loss = eval(model, train_df)
    valid_roc, valid_prc, valid_loss = eval(model, valid_df)
    train_roc_mean = np.array(train_roc).mean()
    valid_roc_mean = np.array(valid_roc).mean()
    train_prc_mean = np.array(train_prc).mean()
    valid_prc_mean = np.array(valid_prc).mean()
    
#     tensorboard.add_scalars('ROC',{'train_roc':train_roc_mean,'valid_roc':valid_roc_mean},epoch)
#     tensorboard.add_scalars('Losses',{'train_losses':train_loss,'valid_losses':valid_loss},epoch)

    if valid_roc_mean > best_param["valid_roc"]:
        best_param["roc_epoch"] = epoch
        best_param["valid_roc"] = valid_roc_mean
        # A checkpoint is written only when the new best validation ROC-AUC also exceeds 0.75
        if valid_roc_mean > 0.75:
            torch.save(model, 'saved_models/model_'+prefix_filename+'_'+start_time+'_'+str(epoch)+'.pt')
    
    if valid_loss < best_param["valid_loss"]:
        best_param["loss_epoch"] = epoch
        best_param["valid_loss"] = valid_loss

    print("EPOCH:\t"+str(epoch)+'\n'
        +"train_roc"+":"+str(train_roc)+'\n'
        +"valid_roc"+":"+str(valid_roc)+'\n'
        +"train_roc_mean"+":"+str(train_roc_mean)+'\n'
        +"valid_roc_mean"+":"+str(valid_roc_mean)+'\n'
        +"train_prc_mean"+":"+str(train_prc_mean)+'\n'
        +"valid_prc_mean"+":"+str(valid_prc_mean)+'\n'
        )
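    # Early stopping: quit once validation ROC-AUC has not improved for more than 6 epochs
    # and validation loss has not improved for more than 8 epochs
    if (epoch - best_param["roc_epoch"] > 6) and (epoch - best_param["loss_epoch"] > 8):
        break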
        
    train(model, train_df, optimizer, loss_function)

EPOCH:	0
train_roc:[0.4967834741693488, 0.19534348972100124, 0.574257180119645, 0.7371480482024044, 0.5888294820396811, 0.43160427508253596, 0.4578627652292951, 0.34963963852079993, 0.4776754088131114, 0.5742790008762882, 0.3999673792569143, 0.542576106487699, 0.5051638480810842, 0.26187485808356037, 0.7088595597909713, 0.5244908491070277, 0.6442557527383883]
valid_roc:[0.6022957461174882, 0.2111111111111111, 0.4718820861678005, 0.721498743431574, 0.7285362026451468, 0.19331960649736904, 0.36344969199178645, 0.5827538247566064, 0.2599594868332208, 0.41069397042093286, 0.5393360618462937, 0.701802418434862, 0.8729508196721311, 0.38192552225249776, 0.7147742818057455, 0.4176843057440072, 0.6912364130434783]
train_roc_mean:0.4982712421364562
valid_roc_mean:0.521482958398356
train_prc_mean:0.00551369328951437
valid_prc_mean:0.004575792850201917

EPOCH:	1
train_roc:[0.5595067324035343, 0.7520508090483286, 0.4631035411527884, 0.8017790342385562, 0.6879922231033764, 0.5348963870703001, 0.6304
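
The training loop for this example stops only when both validation metrics have stalled: more than 6 epochs since the best mean ROC-AUC and more than 8 epochs since the lowest loss. A small sketch of that rule in isolation may make the interaction between the two conditions easier to see. `should_stop` is a hypothetical helper and the epoch numbers are made up, not taken from the run above.

```python
def should_stop(epoch, best_roc_epoch, best_loss_epoch, roc_patience=6, loss_patience=8):
    """True once both validation metrics have gone longer than their patience without improving."""
    roc_stalled = epoch - best_roc_epoch > roc_patience
    loss_stalled = epoch - best_loss_epoch > loss_patience
    return roc_stalled and loss_stalled


# Made-up history: best ROC-AUC at epoch 31, lowest loss at epoch 29.
for epoch in (37, 38, 40):
    print(epoch, should_stop(epoch, best_roc_epoch=31, best_loss_epoch=29))
```

Because the two conditions are joined with `and`, a recent improvement in either metric is enough to keep training going.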

In [None]:
# evaluate model
best_model = torch.load('saved_models/model_'+prefix_filename+'_'+start_time+'_'+str(best_param["roc_epoch"])+'.pt')     

# best_model_dict = best_model.state_dict()
# best_model_wts = copy.deepcopy(best_model_dict)

# model.load_state_dict(best_model_wts)
# (best_model.align[0].weight == model.align[0].weight).all()

test_roc, test_prc, test_losses = eval(best_model, test_df)

print("best epoch:"+str(best_param["roc_epoch"])
      +"\n"+"test_roc:"+str(test_roc)
      +"\n"+"test_roc_mean:",str(np.array(test_roc).mean())
      +"\n"+"test_prc_mean:",str(np.array(test_prc).mean())
     )

best epoch:31
test_roc:[0.7454422687373397, 0.973922902494331, 0.9501133786848073, 0.9748686314827507, 0.9417171037883884, 0.9226721574010524, 0.4396532055669633, 0.789522484932777, 0.8726085977942831, 0.9440273037542662, 0.8051386994088222, 0.9219712525667352, 0.9719945355191256, 0.987511353315168, 0.9997720018239853, 0.9513794663048395, 0.5591032608695652]
test_roc_mean: 0.8677305061438352
test_prc_mean: 0.15666023637334905
