# Train BonDNet 

In this notebook, we show how to train the BonDNet graph neural network model for bond dissociation energy (BDE) prediction. We only show how to train on CPUs. See [train_bde_distributed.py](./) for a script for training on GPUs (a single GPU or distributed training on multiple GPUs). 

In [1]:
import torch
from torch.nn import MSELoss
from bondnet.data.dataset import ReactionNetworkDataset
from bondnet.data.dataloader import DataLoaderReactionNetwork
from bondnet.data.featurizer import AtomFeaturizerMinimum, BondAsNodeFeaturizerMinimum, GlobalFeaturizer, BondAsNodeFeaturizerFull
from bondnet.data.grapher import HeteroMoleculeGraph
from bondnet.data.dataset import train_validation_test_split
from bondnet.model.gated_reaction_network import GatedGCNReactionNetwork
from bondnet.scripts.create_label_file import read_input_files
from bondnet.model.metric import WeightedL1Loss
from bondnet.utils import seed_torch
from torchsummary import summary

Using backend: pytorch


In [2]:
print(torch.__version__)

1.6.0


## Dataset 

We work with a small dataset consisting of 200 BDEs for neutral and charged molecules. The dataset is specified in three files:
- `molecules.sdf` This file contains all the molecules (both reactants and products) in the bond dissociation reactions. The molecules are specified in SDF format.
- `molecule_attributes.yaml` This file contains extra molecular attributes (charges here) for the molecules given in `molecules.sdf`. Some molecular attributes can be inferred from a molecule's SDF block; these are overridden by the attributes specified in the `molecule_attributes.yaml` file.
- `reactions.csv` This file lists the bond dissociation reactions formed by the molecules given in `molecules.sdf`. Each line lists the reactant, products, and BDE of a reaction. The reactant and products are specified by their index in `molecules.sdf`.

See [here](./examples/train) for the three files used in this notebook. 

#### Grapher 

BonDNet is a graph neural network model that takes atom features (e.g. atom type), bond features (e.g. whether a bond is in a ring), and global features (e.g. total charge) as input. We extract the features for a molecule using a grapher.

In [3]:
def get_grapher():
    atom_featurizer = AtomFeaturizerMinimum()
    bond_featurizer = BondAsNodeFeaturizerMinimum()
    #bond_featurizer = BondAsNodeFeaturizerFull()
    # our example dataset contains molecules of charges -1, 0, and 1
    global_featurizer = GlobalFeaturizer(allowed_charges=[-2, -1, 0, 1, 2])

    grapher = HeteroMoleculeGraph(atom_featurizer, bond_featurizer, global_featurizer)
    
    return grapher
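
To make the role of `allowed_charges` more concrete, here is a minimal pure-Python sketch of how a global feature such as total charge could be turned into a fixed-length vector by one-hot encoding. This is an illustration only: the helper `one_hot_charge` is hypothetical, and `GlobalFeaturizer` may encode the charge differently.

```python
def one_hot_charge(charge, allowed_charges):
    """Return a one-hot list marking where `charge` sits in `allowed_charges`.

    Hypothetical helper for illustration; not part of bondnet.
    """
    if charge not in allowed_charges:
        raise ValueError(f"charge {charge} is not in {allowed_charges}")
    return [1 if charge == c else 0 for c in allowed_charges]


allowed = [-2, -1, 0, 1, 2]

# Encode the three charges mentioned in the comment of the cell above.
encoded = {q: one_hot_charge(q, allowed) for q in (-1, 0, 1)}
for q, vec in encoded.items():
    print(q, vec)
```

With an encoding of this kind, the length of the global feature vector depends on how many charges are allowed, which is one reason the allowed range has to cover every charge that occurs in the dataset.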

#### Read dataset 

Let's now read the dataset and featurize the molecules using the grapher defined above. The dataset is split into a training set (80%), a validation set (10%), and a test set (10%). We train the model on the training set, use the validation set to select the best-performing checkpoint, and report the error on the test set.

In [4]:
# seed random number generators 
seed_torch()

#mols, attrs, labels = read_input_files(
#    'examples/train/molecules.sdf', 
#    'examples/train/molecule_attributes.yaml', 
#    'examples/train/reactions.yaml', 
#)
path_mg_data = "/home/santiagovargas/Documents/Dataset/mg_dataset/"

#mols, attrs, labels = read_input_files(
#    #'examples/train/molecules.sdf', 
#    #'examples/train/molecule_attributes.yaml', 
#    #'examples/train/reactions.yaml', 
#    path_mg_data + "mg_struct_bond_rgrn.sdf",
#    path_mg_data + "mg_feature_bond_rgrn.yaml",
#    path_mg_data + "mg_label_bond_rgrn.yaml",
#)
from bondnet.dataset.mg_barrier import process_data

#mols, labels, attrs = process_data()
mols = path_mg_data + "mg_struct_bond_rgrn.sdf"
attrs = path_mg_data + "mg_feature_bond_rgrn.yaml"
labels = path_mg_data + "mg_label_bond_rgrn.yaml"

dataset = ReactionNetworkDataset(
    grapher=get_grapher(),
    molecules=mols,
    labels=labels,
    extra_features=attrs
)

trainset, valset, testset = train_validation_test_split(dataset, validation=0.1, test=0.1)

# we train with a batch size of 100
train_loader = DataLoaderReactionNetwork(trainset, batch_size=100, shuffle=True)
val_loader = DataLoaderReactionNetwork(valset, batch_size=len(valset), shuffle=False)
test_loader = DataLoaderReactionNetwork(testset, batch_size=len(testset), shuffle=False)

[08:43:20] skipping block at line 110: 'BEGIN BOND'
RDKit ERROR: [08:43:20] Explicit valence for atom # 4 O, 3, is greater than permitted
RDKit ERROR: [08:43:20] ERROR: Could not sanitize molecule ending on line 317
RDKit ERROR: [08:43:20] ERROR: Explicit valence for atom # 4 O, 3, is greater than permitted
RDKit ERROR: [08:43:20] Explicit valence for atom # 4 O, 3, is greater than permitted
RDKit ERROR: [08:43:20] ERROR: Could not sanitize molecule ending on line 371
RDKit ERROR: [08:43:20] ERROR: Explicit valence for atom # 4 O, 3, is greater than permitted
RDKit ERROR: [08:43:20] Explicit valence for atom # 0 H, 2, is greater than permitted
RDKit ERROR: [08:43:20] ERROR: Could not sanitize molecule ending on line 397
RDKit ERROR: [08:43:20] ERROR: Explicit valence for atom # 0 H, 2, is greater than permitted
RDKit ERROR: [08:43:20] Explicit valence for atom # 1 O, 3, is greater than permitted
RDKit ERROR: [08:43:20] ERROR: Could not sanitize molecule ending on line 423
RDKit ERROR: 
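
The 80/10/10 split used above can be sketched on plain indices as follows. This is a minimal illustration, assuming a random shuffle followed by contiguous cuts; the function `split_indices` and its `seed` argument are hypothetical and are not the implementation of `train_validation_test_split`.

```python
import random


def split_indices(n, validation=0.1, test=0.1, seed=35):
    """Shuffle range(n) and cut it into train/validation/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched

    n_val = int(n * validation)
    n_test = int(n * test)
    n_train = n - n_val - n_test  # the remainder goes to the training set

    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx


# A dataset of 200 items, as in the small example dataset described earlier.
train_idx, val_idx, test_idx = split_indices(200)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed makes the split reproducible between runs, which is the same purpose `seed_torch()` serves in the cell above.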

In [14]:
dataset.get_molecules()

TypeError: get_molecules() missing 1 required positional argument: 'molecules'

## Model 

We create the BonDNet model by instantiating the `GatedGCNReactionNetwork` class and providing the parameters defining the model structure. 
- `embedding_size` Common size to which the atom, bond, and global feature vectors are embedded, so that they all have the same length.
- `gated_num_layers` Number of graph-to-graph modules used to learn the molecular representation.
- `gated_hidden_size` Hidden layer size in the graph-to-graph modules.
- `gated_activation` Activation function applied after the hidden layers in the graph-to-graph modules.
- `fc_num_layers` Number of hidden layers in the fully connected network that maps the reaction feature to the BDE. The reaction feature is obtained as the difference between the features of the products and those of the reactant.
- `fc_hidden_size` Size of the hidden layers. 
- `fc_activation` Activation function applied after the hidden layers. 

There are other arguments (e.g. residual connection, dropout ratio, batch norm) that give finer control over the model. See the documentation of `GatedGCNReactionNetwork` for more information.
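
The reaction feature mentioned under `fc_num_layers` can be illustrated with plain lists. The sketch below assumes that each molecule has already been reduced to one fixed-length feature vector and that the product vectors are summed before the reactant vector is subtracted; the function name and the numbers are made up for illustration and do not come from the model.

```python
def reaction_feature(reactant_feat, product_feats):
    """Sum the product feature vectors element-wise, then subtract the reactant's."""
    summed_products = [sum(column) for column in zip(*product_feats)]
    return [p - r for p, r in zip(summed_products, reactant_feat)]


# One reactant breaking into two products, each with a 3-dimensional feature.
reactant = [1.0, 2.0, 3.0]
products = [[0.5, 1.0, 1.0], [0.25, 0.5, 1.5]]

diff = reaction_feature(reactant, products)
print(diff)
```

The resulting vector has the same length as a single molecular feature, so the fully connected network sees a fixed-size input regardless of how many products a reaction has.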

In [5]:
model = GatedGCNReactionNetwork(
    in_feats=dataset.feature_size,
    embedding_size=24,
    gated_num_layers=3,
    gated_hidden_size=[64, 64, 64],
    gated_activation="ReLU",
    fc_num_layers=2,
    fc_hidden_size=[128, 64],
    fc_activation='ReLU'
)

print(dict(model.named_parameters()))


{'embedding.linears.atom.weight': Parameter containing:
tensor([[-0.1154, -0.1710,  0.2122, -0.1504, -0.1161, -0.2409, -0.2325, -0.1454,
         -0.1396, -0.1468,  0.0023, -0.0994, -0.0641,  0.1664,  0.1735, -0.1153,
          0.0663],
        [-0.0522,  0.1535,  0.0917,  0.2230,  0.1956,  0.1093,  0.2279,  0.0607,
          0.1594,  0.1149, -0.0303,  0.1121, -0.0052, -0.1454, -0.2136,  0.0416,
          0.1555],
        [-0.1875,  0.0831, -0.2321,  0.1955,  0.2090,  0.1567,  0.1763,  0.1524,
         -0.2220,  0.0429,  0.2086,  0.0446,  0.1888,  0.1999,  0.0416, -0.1552,
          0.1579],
        [ 0.1644, -0.0062, -0.1804,  0.2080, -0.1585,  0.0486, -0.0302,  0.0532,
          0.1095, -0.2286,  0.0591, -0.1587,  0.2273,  0.0943, -0.0380,  0.0271,
         -0.2010],
        [-0.1663, -0.1737, -0.2307,  0.1836,  0.0182,  0.0429,  0.0030, -0.1837,
          0.0969,  0.1995,  0.1031,  0.0482,  0.0566,  0.0936,  0.2015,  0.1099,
          0.1515],
        [-0.2042,  0.1368, -0.1341,  0.

## Train the model 

Before going to the main training loop, we define two functions: `train` and `evaluate` that will be used later. 

The `train` function optimizes the model parameters for one epoch. Note that the target BDEs are centered and then normalized by their standard deviation (this is done in `ReactionNetworkDataset`). To report the mean absolute error in the original units, we therefore need to multiply by the standard deviation again. This is achieved by the `WeightedL1Loss` function passed as `metric_fn`.

In [6]:
def train(optimizer, model, nodes, data_loader, loss_fn, metric_fn):

    model.train()

    epoch_loss = 0.0
    accuracy = 0.0
    count = 0.0

    for it, (batched_graph, label) in enumerate(data_loader):
        feats = {nt: batched_graph.nodes[nt].data["feat"] for nt in nodes}
        target = label["value"]
        stdev = label["scaler_stdev"]

        pred = model(batched_graph, feats, label["reaction"])
        pred = pred.view(-1)

        loss = loss_fn(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step() # here is the actual optimizer step

        epoch_loss += loss.detach().item()
        accuracy += metric_fn(pred, target, stdev).detach().item()
        count += len(target)
    
    epoch_loss /= it + 1
    accuracy /= count

    return epoch_loss, accuracy
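
The rescaling performed by the metric can be checked with a few lines of plain Python. This is a sketch under stated assumptions: the targets are standardized with the population standard deviation, and the weighted L1 metric is taken to be the sum of `|pred - target| * weight`. The helper names and the BDE values are made up and are not bondnet code.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center the values and divide by their standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values], mu, sigma


def weighted_l1_sum(pred, target, weight):
    """Sum of absolute errors, each multiplied by its weight."""
    return sum(abs(p - t) * w for p, t, w in zip(pred, target, weight))


bde = [1.0, 2.0, 3.0, 4.0]  # made-up BDE values
scaled_target, mu, sigma = standardize(bde)

# Pretend the model is off by 0.1 in scaled units for every reaction.
scaled_pred = [t + 0.1 for t in scaled_target]

# Weighting by the standard deviation converts the error to original units.
stdev = [sigma] * len(bde)
mae = weighted_l1_sum(scaled_pred, scaled_target, stdev) / len(bde)

# Same quantity computed directly after undoing the scaling.
pred_original = [p * sigma + mu for p in scaled_pred]
mae_direct = mean(abs(p - t) for p, t in zip(pred_original, bde))
print(mae, mae_direct)
```

Both numbers agree up to floating-point rounding, which is why `train` and `evaluate` can accumulate the weighted sum per batch and divide by the number of reactions at the end.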

The `evaluate` function computes the mean absolute error for the validation set (or test set).

In [7]:
def evaluate(model, nodes, data_loader, metric_fn):
    model.eval()

    with torch.no_grad():
        accuracy = 0.0
        count = 0.0

        for batched_graph, label in data_loader:
            feats = {nt: batched_graph.nodes[nt].data["feat"] for nt in nodes}
            target = label["value"]
            stdev = label["scaler_stdev"]

            pred = model(batched_graph, feats, label["reaction"])
            pred = pred.view(-1)

            accuracy += metric_fn(pred, target, stdev).detach().item()
            count += len(target)

    return accuracy / count

Now, we have all the ingredients to train the model. 

We optimize the model parameters by minimizing a mean squared error loss function using the `Adam` optimizer with a learning rate of `0.001`. Here we train the model for 20 epochs, save the model that achieves the smallest mean absolute error on the validation set, and finally evaluate that model on the test set.

In [8]:
import time 

t1 = time.time()
# optimizer, loss function and metric function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_func = MSELoss(reduction="mean")
metric = WeightedL1Loss(reduction="sum")

feature_names = ["atom", "bond", "global"]
best = 1e10
num_epochs = 20

# main training loop
print("# Epoch     Loss         TrainAcc        ValAcc")
for epoch in range(num_epochs):

    # train on training set 
    loss, train_acc = train(optimizer, model, feature_names, train_loader, loss_func, metric)

    # evaluate on validation set
    val_acc = evaluate(model, feature_names, val_loader, metric)

    # save checkpoint for best performing model 
    is_best = val_acc < best
    if is_best:
        best = val_acc
        torch.save(model.state_dict(), 'checkpoint.pkl')
        
    print("{:5d}   {:12.6e}   {:12.6e}   {:12.6e}".format(epoch, loss, train_acc, val_acc))


# load the best performing model and test its performance on the test set
checkpoint = torch.load("checkpoint.pkl")
model.load_state_dict(checkpoint)
test_acc = evaluate(model, feature_names, test_loader, metric)

print("TestAcc: {:12.6e}".format(test_acc))
t2 = time.time()
print("Training time: {:5.1f} seconds".format(t2 - t1))


# Epoch     Loss         TrainAcc        ValAcc


AssertionError: product item not smaller than size
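
The checkpoint selection used in the loop above reduces to keeping the epoch with the smallest validation error. The sketch below isolates that logic in plain Python; `select_best_epoch` is a hypothetical helper, and the validation errors are made-up numbers rather than results from this notebook.

```python
def select_best_epoch(val_errors):
    """Return (epoch, error) for the smallest validation error.

    A strict `<` comparison is used, so ties keep the earliest epoch,
    matching `is_best = val_acc < best` in the training loop.
    """
    best = float("inf")
    best_epoch = None
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch = err, epoch  # a checkpoint would be saved here
    return best_epoch, best


# Made-up validation errors for five epochs.
val_errors = [1.88, 1.87, 1.90, 1.85, 1.85]
epoch, err = select_best_epoch(val_errors)
print(epoch, err)
```

Because the checkpoint is overwritten only on strict improvement, the file on disk at the end of training always corresponds to the epoch returned by this selection.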

In [11]:
# not working: assigning to `requires_grad_` overwrites the method instead of calling it

#model.readout_layer.requires_grad_ = False
#for ind in range(len(model.gated_layers)):
#    model.gated_layers[ind].requires_grad_ = False
#for ind in range(len(model.fc_layers)):
#    model.fc_layers[ind].requires_grad_ = False
#print(dict(model.named_parameters()))

# works but messy 
# for ind, param in enumerate(model.parameters()):
#    print(ind)
#    param.requires_grad = True

model.gated_layers.requires_grad_(False)
model.fc_layers.requires_grad_(False)

import time 

t1 = time.time()
# optimizer, loss function and metric function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_func = MSELoss(reduction="mean")
metric = WeightedL1Loss(reduction="sum")

feature_names = ["atom", "bond", "global"]
best = 1e10
num_epochs = 20

# main training loop
print("# Epoch     Loss         TrainAcc        ValAcc")
for epoch in range(num_epochs):

    # train on training set 
    loss, train_acc = train(optimizer, model, feature_names, train_loader, loss_func, metric)

    # evaluate on validation set
    val_acc = evaluate(model, feature_names, val_loader, metric)

    # save checkpoint for best performing model 
    is_best = val_acc < best
    if is_best:
        best = val_acc
        torch.save(model.state_dict(), 'checkpoint.pkl')
        
    print("{:5d}   {:12.6e}   {:12.6e}   {:12.6e}".format(epoch, loss, train_acc, val_acc))


# load the best performing model and test its performance on the test set
checkpoint = torch.load("checkpoint.pkl")
model.load_state_dict(checkpoint)
test_acc = evaluate(model, feature_names, test_loader, metric)

print("TestAcc: {:12.6e}".format(test_acc))
t2 = time.time()

print("Training time: {:5.1f} seconds".format(t2 - t1))



# Epoch     Loss         TrainAcc        ValAcc
    0   1.075364e+00   2.650825e+00   1.882989e+00
    1   9.821787e-01   2.625252e+00   1.880130e+00
    2   9.507621e-01   2.612707e+00   1.885765e+00
    3   8.983829e-01   2.590667e+00   1.883977e+00
    4   8.983957e-01   2.570117e+00   1.879380e+00
    5   9.359967e-01   2.542933e+00   1.871620e+00
    6   9.032701e-01   2.518162e+00   1.860771e+00
    7   9.331724e-01   2.504215e+00   1.846488e+00
    8   8.535580e-01   2.479704e+00   1.833602e+00
    9   8.570958e-01   2.450828e+00   1.818076e+00
   10   7.650453e-01   2.422336e+00   1.795764e+00
   11   8.498226e-01   2.391902e+00   1.768186e+00
   12   8.430975e-01   2.369736e+00   1.738051e+00
   13   7.846934e-01   2.338836e+00   1.708720e+00
   14   7.764476e-01   2.295667e+00   1.687622e+00
   15   7.120104e-01   2.265451e+00   1.666801e+00
   16   7.470441e-01   2.241836e+00   1.648616e+00
   17   6.573917e-01   2.199967e+00   1.609619e+00
   18   6.524919e-01   2.158915e+0

In [12]:


print(dict(model.named_parameters()))


{'embedding.linears.atom.weight': Parameter containing:
tensor([[ 4.9460e-02, -2.4917e-01, -8.0144e-02,  7.2956e-03, -1.2729e-01,
         -5.7689e-03, -8.2238e-02,  2.1951e-01,  3.3598e-02,  8.4139e-02,
          1.1528e-01,  7.2560e-02,  2.2278e-01],
        [ 2.1574e-01,  2.5080e-01,  1.4124e-01,  4.1507e-02,  1.1311e-01,
         -3.8082e-02, -1.8386e-01,  1.6025e-01,  2.6416e-01,  3.8437e-02,
          1.6790e-01, -8.0157e-02,  2.2799e-01],
        [ 2.2316e-01, -8.4115e-02, -1.8951e-01, -1.9814e-01, -1.1727e-01,
          6.4106e-02, -1.6739e-01, -1.2682e-01,  7.3353e-02, -5.8392e-04,
         -1.4186e-01,  2.0208e-01,  2.6837e-02],
        [-1.7118e-01,  2.0525e-01, -4.1438e-02,  8.0060e-02, -2.5566e-01,
          1.5491e-01,  2.8335e-02,  3.2526e-02,  3.8630e-02, -1.1949e-01,
         -1.2733e-01, -5.5469e-02, -6.9119e-02],
        [-2.5243e-02,  1.4605e-01,  1.3689e-01,  2.3643e-01,  4.6312e-02,
         -9.2213e-02,  1.3232e-01, -1.0464e-02, -2.2488e-02,  1.9374e-01,
        

In [15]:
# dict(model.named_parameters())
# print(dict(model.named_parameters())['embedding.linears.atom.weight'])
# for i in model.gated_layers:
#     i.requires_grad = False
# for i in model.fc_layers:
#     i.requires_grad = False
# model.parameters['gated_layers']
# model.fc_layers[-1]['weight'].requires_grad = False
# print(dict(model.fc_layers[-1].named_parameters())['weight'])
# print(dict(model.fc_layers.named_parameters()))
# fc_layers.4.bias
#for i in dict(model.named_parameters()):
print(model.parameters())


<generator object Module.parameters at 0x7f6a82d5e408>


In [3]:
import pandas as pd 
mg_file_ref = "/home/santiagovargas/Documents/Dataset/20210721_madeira_g2.json" 
libe_file_ref = "/home/santiagovargas/Documents/Dataset/libe.json"
mg_df = pd.read_json(mg_file_ref)
libe_df = pd.read_json(libe_file_ref)

In [9]:
libe_df.columns

Index(['molecule_id', 'bonds', 'charge', 'chemical_system', 'composition',
       'elements', 'formula_alphabetical', 'molecule', 'molecule_graph',
       'number_atoms', 'number_elements', 'partial_charges', 'partial_spins',
       'point_group', 'species', 'spin_multiplicity', 'thermo', 'vibration',
       'xyz'],
      dtype='object')

In [8]:
mg_df.columns

Index(['molecule_id', 'bonds', 'charge', 'chemical_system', 'composition',
       'elements', 'formula_alphabetical', 'molecule', 'molecule_graph',
       'number_atoms', 'number_elements', 'partial_charges', 'partial_spins',
       'point_group', 'species', 'spin_multiplicity', 'thermo', 'vibration',
       'xyz'],
      dtype='object')