# Train BonDNet 

In this notebook, we show how to train the BonDNet graph neural network model for bond dissociation energy (BDE) prediction. We only show how to train on CPUs. See [train_bde_distributed.py](./) for a script for training on GPUs (a single GPU or distributed training on multiple GPUs). 

In [1]:
import torch
import time 
import pandas as pd 
from torch.nn import MSELoss
from bondnet.data.dataset import ReactionNetworkDataset, ReactionNetworkDatasetGraphs
from bondnet.data.dataloader import DataLoaderReactionNetwork
from bondnet.data.featurizer import (
    AtomFeaturizerMinimum, BondAsNodeFeaturizerMinimum, 
    GlobalFeaturizer, BondAsNodeFeaturizerFull,
    AtomFeaturizerGraph, BondAsNodeGraphFeaturizer, GlobalFeaturizerGraph,
)
from bondnet.data.grapher import HeteroMoleculeGraph, HeteroCompleteGraphFromPandas, HeteroCompleteGraphFromDGLAndPandas
from bondnet.data.dataset import train_validation_test_split
from bondnet.model.gated_reaction_network import GatedGCNReactionNetwork
from bondnet.scripts.create_label_file import read_input_files
from bondnet.model.metric import WeightedL1Loss
from bondnet.utils import seed_torch
from torchsummary import summary

Using backend: pytorch


In [2]:
print(torch.__version__)

1.6.0


## Dataset 

We work with a small dataset consisting of 200 BDEs for neutral and charged molecules. The dataset is specified in three files:
- `molecules.sdf` This file contains all the molecules (both reactants and products) in the bond dissociation reactions. The molecules are specified in SDF format. 
- `molecule_attributes.yaml` This file contains extra molecular attributes (charges here) for the molecules given in `molecules.sdf`. Some molecular attributes can be inferred from a molecule's SDF block; these are overridden by the attributes specified in the `molecule_attributes.yaml` file.  
- `reactions.csv` This file lists the bond dissociation reactions formed by the molecules given in `molecules.sdf`. Each line lists the reactant, products, and BDE of a reaction. The reactant and products are specified by their index in `molecules.sdf`. 

See [here](./examples/train) for the three files used in this notebook. 
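
To make the index-based layout concrete, here is a minimal sketch that parses a tiny reactions table with Python's `csv` module. The column names (`reactant`, `product1`, `product2`, `bde`) and all values are hypothetical and chosen only for illustration; check the actual reactions file for its real header.

```python
import csv
import io

# Hypothetical file contents; column names and values are illustrative only.
REACTIONS_CSV = """reactant,product1,product2,bde
0,1,2,4.31
3,4,,2.75
"""

def read_reactions(text):
    """Return a list of (reactant_index, product_indices, bde) tuples."""
    reactions = []
    for row in csv.DictReader(io.StringIO(text)):
        # An empty product column means the reaction has a single product
        # (for example, breaking a bond in a ring).
        products = [int(row[k]) for k in ("product1", "product2") if row[k]]
        reactions.append((int(row["reactant"]), products, float(row["bde"])))
    return reactions

reactions = read_reactions(REACTIONS_CSV)
for reactant, products, bde in reactions:
    print(f"molecule {reactant} -> molecules {products}, BDE = {bde}")
```

Each integer is a position in `molecules.sdf`, so the reactions file stays small and a molecule shared by several reactions is stored only once.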

#### Grapher 

BonDNet is a graph neural network model that takes atom features (e.g. atom type), bond features (e.g. whether a bond is in a ring), and global features (e.g. total charge) as input. We extract the features for a molecule using a grapher.
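
As a rough illustration of what such features can look like, the sketch below one-hot encodes the atom type and total charge and uses a single 0/1 flag for ring membership. The vocabularies and the `featurize` helper are hypothetical; the featurizer classes in `bondnet.data.featurizer` define their own, richer feature sets.

```python
# Hypothetical vocabularies; the real featurizers define their own.
ATOM_TYPES = ["H", "C", "N", "O"]
ALLOWED_CHARGES = [-2, -1, 0, 1]

def one_hot(value, choices):
    """Encode `value` as a 0/1 list over `choices`."""
    if value not in choices:
        raise ValueError(f"{value!r} not in {choices}")
    return [1 if value == c else 0 for c in choices]

def featurize(atoms, bonds, charge):
    """atoms: list of element symbols; bonds: list of (i, j, in_ring)."""
    return {
        "atom": [one_hot(a, ATOM_TYPES) for a in atoms],
        "bond": [[1 if in_ring else 0] for _, _, in_ring in bonds],
        "global": [one_hot(charge, ALLOWED_CHARGES)],
    }

# Hydroxide anion: two atoms, one acyclic bond, total charge -1.
feats = featurize(["O", "H"], [(0, 1, False)], charge=-1)
print(feats)
```

The three keys mirror the three node types (`atom`, `bond`, `global`) that the training code later reads from the batched graph.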

In [3]:
def get_grapher():
    # Alternative: minimal featurizers with HeteroMoleculeGraph.
    #atom_featurizer = AtomFeaturizerMinimum()
    #bond_featurizer = BondAsNodeFeaturizerMinimum()
    #global_featurizer = GlobalFeaturizer(allowed_charges=[-2, -1, 0, 1])
    #grapher = HeteroMoleculeGraph(atom_featurizer, bond_featurizer, global_featurizer)
    
    atom_featurizer = AtomFeaturizerGraph()
    bond_featurizer = BondAsNodeGraphFeaturizer()
    global_featurizer = GlobalFeaturizerGraph(allowed_charges=[-2, -1, 0, 1])
    grapher = HeteroCompleteGraphFromDGLAndPandas(
        atom_featurizer, bond_featurizer, global_featurizer)
    return grapher

get_grapher()

<bondnet.data.grapher.HeteroCompleteGraphFromDGLAndPandas at 0x7fdf90243710>

#### Read dataset 

Let's now read the dataset and featurize the molecules using the grapher defined above. The dataset is split into a training set (60%), validation set (20%), and test set (20%). We will train our model on the training set, use the validation set to select the best model, and report the error on the test set. 
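
The idea behind a three-way split can be sketched with a shuffle followed by slicing. This is an illustrative stand-in under that assumption, not necessarily how `train_validation_test_split` is implemented; `split_indices` and its `seed` argument are hypothetical names.

```python
import random

def split_indices(n, validation=0.2, test=0.2, seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_val = int(n * validation)
    n_test = int(n * test)
    n_train = n - n_val - n_test
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test_part = indices[n_train + n_val:]
    return train, val, test_part

train_idx, val_idx, test_idx = split_indices(200)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the three lists are disjoint slices of one shuffled list, no reaction can appear in more than one subset.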

In [4]:
seed_torch()
path_mg_data = "/home/santiagovargas/Documents/Dataset/mg_dataset/"

#mols, attrs, labels = read_input_files(
#    'examples/train/molecules.sdf', 
#    'examples/train/molecule_attributes.yaml', 
#   'examples/train/reactions.yaml', 
#    #path_mg_data + "mg_struct_bond_rgrn.sdf",
#    #path_mg_data + "mg_feature_bond_rgrn.yaml",
#    #path_mg_data + "mg_label_bond_rgrn.yaml",
#)
#dataset = ReactionNetworkDataset(
#    grapher=get_grapher(),
#    molecules=mols,
#    labels=labels,
#    extra_features=attrs
#)

dataset = ReactionNetworkDatasetGraphs(
    grapher=get_grapher(),
    file=path_mg_data,
    out_file = './'

)

detected bond in prod. not in react.
cannot handle three or more products

RDKit ERROR: [17:11:18] Explicit valence for atom # 9 N, 4, is greater than permitted
[17:11:18] Explicit valence for atom # 9 N, 4, is greater than permitted
RDKit ERROR: [17:11:18] Explicit valence for atom # 5 N, 4, is greater than permitted
[17:11:18] Explicit valence for atom # 5 N, 4, is greater than permitted
RDKit ERROR: [17:11:18] Explicit valence for atom # 4 C, 5, is greater than permitted
[17:11:18] Explicit valence for atom # 4 C, 5, is greater than permitted
RDKit ERROR: [17:11:18] Explicit valence for atom # 13 N, 4, is greater than permitted
[17:11:18] Explicit valence for atom # 13 N, 4, is greater than permitted
RDKit ERROR: [17:11:18] Explicit valence for atom # 4 N, 4, is greater than permitted
[17:11:18] Explicit valence for atom # 4 N, 4, is greater than permitted
RDKit ERROR: [17:11:18] Explicit valence for atom # 4 N, 4, is greater than permitted
[17:11:18] Explicit valence for atom # 4 N, 4, is greater than permitted
RDKit ERROR: [17:11:19] Explicit valence for

features: 158
labels: 63
molecules: 158
number molecules:158
number of features: (158,)


AttributeError: 'tuple' object has no attribute 'remove'

In [5]:
trainset, valset, testset = train_validation_test_split(dataset, validation=0.2, test=0.2)
dataset_loader = DataLoaderReactionNetwork(dataset, batch_size=100, shuffle=True)
train_loader = DataLoaderReactionNetwork(trainset, batch_size=100, shuffle=True)
val_loader = DataLoaderReactionNetwork(valset, batch_size=len(valset), shuffle=False)
test_loader = DataLoaderReactionNetwork(testset, batch_size=len(testset), shuffle=False)
#test_ind = 3
#elements = [i['name'] for i in dataset.pandas_df.iloc[test_ind]['reactant_molecule_graph']['molecule']['sites']]
#print(elements)
#print(len(dataset.molecules))

In [5]:
#dataset[0][0].reactions[1].reactants
print(dataset.all_labels)

NameError: name 'dataset' is not defined

## Model 

We create the BonDNet model by instantiating the `GatedGCNReactionNetwork` class and providing the parameters defining the model structure. 
- `embedding_size` The size to unify the atom, bond, and global feature length.
- `gated_num_layers` Number of graph-to-graph modules used to learn the molecular representation. 
- `gated_hidden_size` Hidden layer size in the graph-to-graph modules. 
- `gated_activation` Activation function applied after the hidden layers in the graph-to-graph modules. 
- `fc_num_layers` Number of hidden layers in the fully connected network that maps the reaction feature to the BDE. The reaction feature is obtained as the difference between the features of the products and those of the reactant. 
- `fc_hidden_size` Size of the hidden layers. 
- `fc_activation` Activation function applied after the hidden layers. 

There are other arguments (e.g. residual connection, dropout ratio, batch norm) that give finer control over the model. See the documentation of `GatedGCNReactionNetwork` for more information.  
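
The reaction feature mentioned under `fc_num_layers` can be illustrated with plain lists. In this sketch the product feature vectors are summed before the reactant's is subtracted; that summation is an assumption made for illustration, and the real model works on learned tensors rather than hand-written numbers.

```python
def reaction_feature(reactant, products):
    """Sum the product feature vectors and subtract the reactant's."""
    summed = [sum(values) for values in zip(*products)]
    return [p - r for p, r in zip(summed, reactant)]

# Toy molecule-level feature vectors (made-up numbers).
reactant = [1.0, 2.0, 3.0]
products = [[0.5, 1.0, 1.0], [0.25, 0.5, 2.5]]

diff = reaction_feature(reactant, products)
print(diff)
```

The resulting vector has the same length as a single molecule's feature vector, which is what the fully connected layers then map to one BDE value.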

In [7]:
model = GatedGCNReactionNetwork(
    in_feats=dataset.feature_size,
    embedding_size=24,
    gated_num_layers=2,
    gated_hidden_size=[64, 64, 64],
    gated_activation="ReLU",
    fc_num_layers=2,
    fc_hidden_size=[128, 64],
    fc_activation='ReLU'
)
#print(dict(model.named_parameters()))

## Train the model 

Before going to the main training loop, we define two functions: `train` and `evaluate` that will be used later. 

The `train` function optimizes the model parameters for one epoch. Note that our target BDEs are centered and then normalized by their standard deviation (this is done in `ReactionNetworkDataset`). To measure the mean absolute error in the original units, we therefore need to multiply by the standard deviation again. This is achieved by the `WeightedL1Loss` function passed as `metric_fn`.   
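
A small numeric check shows why multiplying by the standard deviation recovers the error in the original units. The sketch assumes the population standard deviation and made-up BDE values; the dataset class may use a different estimator, and `standardize` and `mae` are hypothetical helpers.

```python
import statistics

def standardize(values):
    """Center the values and divide by their (population) standard deviation."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values], mean, stdev

def mae(pred, target):
    """Mean absolute error between two equally long lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

bdes = [2.0, 4.0, 6.0, 8.0]            # made-up targets in original units
scaled, mean, stdev = standardize(bdes)

# Pretend the model predicts each scaled target with a small offset.
pred_scaled = [s + 0.1 for s in scaled]

scaled_mae = mae(pred_scaled, scaled)
original_mae = mae([p * stdev + mean for p in pred_scaled], bdes)

# The two numbers agree up to floating-point rounding.
print(original_mae, scaled_mae * stdev)
```

The mean cancels in the difference between prediction and target, so only the standard deviation is needed to convert the error back.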

In [8]:
def train(optimizer, model, nodes, data_loader, loss_fn, metric_fn):

    model.train()

    epoch_loss = 0.0
    accuracy = 0.0
    count = 0.0

    for it, (batched_graph, label) in enumerate(data_loader):
        feats = {nt: batched_graph.nodes[nt].data["feat"] for nt in nodes}
        target = label["value"]
        stdev = label["scaler_stdev"]

        pred = model(batched_graph, feats, label["reaction"])
        pred = pred.view(-1)

        loss = loss_fn(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step() # here is the actual optimizer step

        epoch_loss += loss.detach().item()
        accuracy += metric_fn(pred, target, stdev).detach().item()
        count += len(target)
    
    epoch_loss /= it + 1
    accuracy /= count

    return epoch_loss, accuracy

The `evaluate` function computes the mean absolute error for the validation set (or test set).

In [9]:
def evaluate(model, nodes, data_loader, metric_fn):
    model.eval()

    with torch.no_grad():
        accuracy = 0.0
        count = 0.0

        for batched_graph, label in data_loader:
            feats = {nt: batched_graph.nodes[nt].data["feat"] for nt in nodes}
            target = label["value"]
            stdev = label["scaler_stdev"]

            pred = model(batched_graph, feats, label["reaction"])
            pred = pred.view(-1)

            accuracy += metric_fn(pred, target, stdev).detach().item()
            count += len(target)

    return accuracy / count

Now, we have all the ingredients to train the model. 

We optimize the model parameters by minimizing a mean squared error loss function using the `Adam` optimizer with a learning rate of `0.001`. Here we train the model for 20 epochs, save the best-performing model (the one with the smallest mean absolute error on the validation set), and finally evaluate that model on the test set. 
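
The keep-the-best-checkpoint pattern can be shown independently of PyTorch. In this sketch `fit`, `step_fn`, and `validate_fn` are hypothetical names, and a dictionary copied in memory stands in for the model state that the notebook writes to disk with `torch.save`.

```python
import copy

def fit(step_fn, validate_fn, state, num_epochs):
    """Run `num_epochs` updates and return the state with the lowest validation error."""
    best_error = float("inf")
    best_state = copy.deepcopy(state)
    for epoch in range(num_epochs):
        step_fn(state)                 # one epoch of training (mutates state)
        error = validate_fn(state)     # error on held-out data
        if error < best_error:         # strictly better: take a snapshot
            best_error = error
            best_state = copy.deepcopy(state)
    return best_state, best_error

# Toy problem: the "model" is a single weight whose ideal value is 3.
state = {"w": 0.0}

def step(s):
    s["w"] += 1.0

def validate(s):
    return abs(s["w"] - 3.0)

best_state, best_error = fit(step, validate, state, num_epochs=5)
print(best_state, best_error, state)
```

Training continues past the best epoch, so the final state is worse than the snapshot; this is why the best checkpoint is reloaded before testing.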

In [10]:


t1 = time.time()
# optimizer, loss function and metric function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_func = MSELoss(reduction="mean")
metric = WeightedL1Loss(reduction="sum")

feature_names = ["atom", "bond", "global"]
best = 1e10
num_epochs = 20

# main training loop
print("# Epoch     Loss         TrainAcc        ValAcc")
for epoch in range(num_epochs):

    # train on training set 
    loss, train_acc = train(optimizer, model, feature_names, train_loader, loss_func, metric)
    # evaluate on validation set
    val_acc = evaluate(model, feature_names, val_loader, metric)
    # save checkpoint for best performing model 
    is_best = val_acc < best
    if is_best:
        best = val_acc
        torch.save(model.state_dict(), 'checkpoint.pkl')
    print("{:5d}   {:12.6e}   {:12.6e}   {:12.6e}".format(epoch, loss, train_acc, val_acc))


# load the best-performing model and test its performance on the test set
checkpoint = torch.load("checkpoint.pkl")
model.load_state_dict(checkpoint)
test_acc = evaluate(model, feature_names, test_loader, metric)
print("TestAcc: {:12.6e}".format(test_acc))
t2 = time.time()
print("Training time: {:5.1f} seconds".format(float(t2 - t1)))


# Epoch     Loss         TrainAcc        ValAcc


AssertionError: products_ft (28) and mappings[atom] (8) have different length

In [None]:
# not working

#model.readout_layer.requires_grad_ = False
#for ind in range(len(model.gated_layers)):
#    model.gated_layers[ind].requires_grad_ = False
#for ind in range(len(model.fc_layers)):
#    model.fc_layers[ind].requires_grad_ = False
#print(dict(model.named_parameters()))

# works but messy 
# for ind, param in enumerate(model.parameters()):
#    print(ind)
#    param.requires_grad = True

model.gated_layers.requires_grad_(False)
model.fc_layers.requires_grad_(False)

import time 

t1 = time.time()
# optimizer, loss function and metric function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_func = MSELoss(reduction="mean")
metric = WeightedL1Loss(reduction="sum")

feature_names = ["atom", "bond", "global"]
best = 1e10
num_epochs = 20

# main training loop
print("# Epoch     Loss         TrainAcc        ValAcc")
for epoch in range(num_epochs):

    # train on training set 
    loss, train_acc = train( optimizer, model, feature_names, train_loader, loss_func, metric)

    # evaluate on validation set
    val_acc = evaluate(model, feature_names, val_loader, metric)

    # save checkpoint for best performing model 
    is_best = val_acc < best
    if is_best:
        best = val_acc
        torch.save(model.state_dict(), 'checkpoint.pkl')
        
    print("{:5d}   {:12.6e}   {:12.6e}   {:12.6e}".format(epoch, loss, train_acc, val_acc))


# load the best-performing model and test its performance on the test set
checkpoint = torch.load("checkpoint.pkl")
model.load_state_dict(checkpoint)
test_acc = evaluate(model, feature_names, test_loader, metric)

print("TestAcc: {:12.6e}".format(test_acc))
t2 = time.time()

print("Training time: {:5.1f} seconds".format(float(t2 - t1)))



# Epoch     Loss         TrainAcc        ValAcc


RuntimeError: Boolean value of Tensor with more than one value is ambiguous

In [None]:


print(dict(model.named_parameters()))


{'embedding.linears.atom.weight': Parameter containing:
tensor([[-0.1229, -0.1820,  0.2259, -0.1601, -0.1236, -0.2565, -0.2475, -0.1548,
         -0.1487, -0.1562,  0.0024, -0.1058, -0.0682,  0.1771,  0.1847],
        [-0.1228,  0.0706, -0.0556,  0.1635,  0.0976,  0.2374,  0.2082,  0.1164,
          0.2426,  0.0646,  0.1697,  0.1223, -0.0323,  0.1194, -0.0056],
        [-0.1548, -0.2274,  0.0442,  0.1656, -0.1996,  0.0885, -0.2471,  0.2082,
          0.2225,  0.1668,  0.1877,  0.1623, -0.2363,  0.0457,  0.2221],
        [ 0.0475,  0.2010,  0.2128,  0.0443, -0.1652,  0.1681,  0.1750, -0.0066,
         -0.1921,  0.2215, -0.1688,  0.0517, -0.0322,  0.0566,  0.1166],
        [-0.2433,  0.0629, -0.1690,  0.2420,  0.1004, -0.0404,  0.0288, -0.2140,
         -0.1770, -0.1850, -0.2456,  0.1955,  0.0194,  0.0457,  0.0032],
        [-0.1956,  0.1031,  0.2124,  0.1097,  0.0513,  0.0603,  0.0997,  0.2145,
          0.1170,  0.1613, -0.2174,  0.1457, -0.1428,  0.0820,  0.0378],
        [ 0.1718, -0

In [None]:
# dict(model.named_parameters())
# print(dict(model.named_parameters())['embedding.linears.atom.weight'])
# for i in model.gated_layers:
#     i.requires_grad = False
# for i in model.fc_layers:
#     i.requires_grad = False
# model.parameters['gated_layers']
# model.fc_layers[-1]['weight'].requires_grad = False
# print(dict(model.fc_layers[-1].named_parameters())['weight'])
# print(dict(model.fc_layers.named_parameters()))
# fc_layers.4.bias
#for i in dict(model.named_parameters()):
print(model.parameters())


<generator object Module.parameters at 0x7f6a82d5e408>
