<a href="https://colab.research.google.com/github/zahra-teb/Graph-ML-Final-Project/blob/main/Graph_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We are going to implement several graph neural network (GNN) classifiers for a classification task on the molecular dataset BBBP, using the Deep Graph Library (DGL).

First, let's install required packages.

In [1]:
!pip install dgl -f https://data.dgl.ai/wheels/repo.html

Looking in links: https://data.dgl.ai/wheels/repo.html
Collecting dgl
  Downloading dgl-1.1.1-cp310-cp310-manylinux1_x86_64.whl (6.3 MB)
Installing collected packages: dgl
Successfully installed dgl-1.1.1


In [2]:
!pip install dglgo -f https://data.dgl.ai/wheels-test/repo.html

Looking in links: https://data.dgl.ai/wheels-test/repo.html
Collecting dglgo
  Downloading dglgo-0.0.2-py3-none-any.whl (63 kB)
Collecting isort>=5.10.1 (from dglgo)
  Downloading isort-5.12.0-py3-none-any.whl (91 kB)
Collecting autopep8>=1.6.0 (from dglgo)
  Downloading autopep8-2.0.2-py2.py3-none-any.whl (45 kB)
Collecting numpydoc>=1.1.0 (from dglgo)
  Downloading numpydoc-1.5.0-py3-none-any.whl (52 kB)
Collecting ruamel.yaml>=0.17.20 (from dglgo)
  Downloading ruamel.yaml-0.17.32-py3-none-any.whl (1

#### Imports

We mainly use PyTorch and DGL.

In [3]:
%matplotlib inline
import os

os.environ["DGLBACKEND"] = "pytorch"
import dgl
import numpy as np
import networkx as nx
import torch
import torch.nn as nn
import dgl.function as fn
import torch.nn.functional as F
import shutil
from torch.utils.data import DataLoader
import cloudpickle
from dgl.nn import GraphConv

Now let's access the dataset that was uploaded to Google Drive.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
!mkdir BBBP_dataset

Unzipping dataset into the directory we just made:

In [6]:
!unzip /content/drive/MyDrive/graph_dataset.zip -d BBBP_dataset

Archive:  /content/drive/MyDrive/graph_dataset.zip
  inflating: BBBP_dataset/scaffold_0_smiles_train.pickle  
  inflating: BBBP_dataset/scaffold_0_test.bin  
  inflating: BBBP_dataset/scaffold_0_val.bin  
  inflating: BBBP_dataset/scaffold_0_smiles_val.pickle  
  inflating: BBBP_dataset/scaffold_0_smiles_test.pickle  
  inflating: BBBP_dataset/scaffold_0_train.bin  


#### Exploring data

In [7]:
train = dgl.load_graphs('BBBP_dataset/scaffold_0_train.bin')
type(train)

tuple

The training set is actually a pair: the first component is a list of molecular graphs, and the second is a dictionary consisting of the labels, masks, and global features of each graph:

In [8]:
train

([Graph(num_nodes=8, num_edges=14,
        ndata_schemes={'v': Scheme(shape=(128,), dtype=torch.float32)}
        edata_schemes={'e': Scheme(shape=(13,), dtype=torch.float32)}),
  Graph(num_nodes=2, num_edges=2,
        ndata_schemes={'v': Scheme(shape=(128,), dtype=torch.float32)}
        edata_schemes={'e': Scheme(shape=(13,), dtype=torch.float32)}),
  Graph(num_nodes=9, num_edges=16,
        ndata_schemes={'v': Scheme(shape=(128,), dtype=torch.float32)}
        edata_schemes={'e': Scheme(shape=(13,), dtype=torch.float32)}),
  Graph(num_nodes=4, num_edges=6,
        ndata_schemes={'v': Scheme(shape=(128,), dtype=torch.float32)}
        edata_schemes={'e': Scheme(shape=(13,), dtype=torch.float32)}),
  Graph(num_nodes=1, num_edges=0,
        ndata_schemes={'v': Scheme(shape=(128,), dtype=torch.float32)}
        edata_schemes={'e': Scheme(shape=(20,), dtype=torch.float32)}),
  Graph(num_nodes=10, num_edges=18,
        ndata_schemes={'v': Scheme(shape=(128,), dtype=torch.float32)}
      

The validation and test sets are similar to the train set.

In [9]:
train_masks = train[1]['masks']
torch.unique(train_masks)

tensor([1.])

In [10]:
valid = dgl.load_graphs('BBBP_dataset/scaffold_0_val.bin')
valid_masks = valid[1]['masks']
torch.unique(valid_masks)

tensor([1.])

In [11]:
test = dgl.load_graphs('BBBP_dataset/scaffold_0_test.bin')
test_masks = test[1]['masks']
torch.unique(test_masks)

tensor([1.])

So all values of the masks tensors are 1 in the training, validation, and test sets. We therefore do not need the masks and ignore them from here on.

#### Paths

Here we set some paths for saving the models and their parameters.

In [12]:
current_dir = "./"
checkpoint_path = current_dir + "save_models/model_checkpoints/" + "checkpoint"
os.makedirs(checkpoint_path, exist_ok=True)

best_model_path = current_dir + "save_models/best_model/"

folder_data_temp = current_dir +"data_temp/"
shutil.rmtree(folder_data_temp, ignore_errors=True)

#### Custom PyTorch Datasets

Now we implement a custom dataset class, DGLDatasetClass. It reshapes the label and global-feature tensors so that each has one row per graph, and for each index it returns the graph together with its corresponding label and global features.

In [13]:
""" Classification Dataset """
class DGLDatasetClass(torch.utils.data.Dataset):
    def __init__(self, address):
        self.address = address + ".bin"
        self.list_graphs, train_labels_masks_globals = dgl.load_graphs(self.address)
        num_graphs = len(self.list_graphs)
        self.labels = train_labels_masks_globals["labels"].view(num_graphs, -1)
        self.globals = train_labels_masks_globals["globals"].view(num_graphs, -1)
    def __len__(self):
        return len(self.list_graphs)
    def __getitem__(self, idx):
        return  self.list_graphs[idx], self.labels[idx], self.globals[idx]
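
To make the reshaping concrete, here is a small sketch with made-up tensors. The sizes (3 graphs, 4 global features) and the layout of `raw_globals` are assumptions for illustration, not the real BBBP shapes. `view(num_graphs, -1)` keeps one row per graph and flattens everything else, so indexing by graph returns a 1-D row.

```python
import torch

num_graphs = 3  # hypothetical number of graphs

# Labels stored as a flat vector, one binary label per graph.
flat_labels = torch.tensor([1.0, 0.0, 1.0])
# Global features stored with a singleton middle dimension (assumed layout).
raw_globals = torch.arange(12, dtype=torch.float32).reshape(num_graphs, 1, 4)

labels = flat_labels.view(num_graphs, -1)    # shape (3, 1)
globals_ = raw_globals.view(num_graphs, -1)  # shape (3, 4)

print(labels.shape, globals_.shape)
print(labels[0].shape, globals_[0].shape)  # per-graph rows
```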

#### Defining Train, Validation and Test sets

In [14]:
path_data_temp =  'BBBP_dataset/scaffold_0'
train_set = DGLDatasetClass(address=path_data_temp+"_train")
val_set = DGLDatasetClass(address=path_data_temp+"_val")
test_set = DGLDatasetClass(address=path_data_temp+"_test")

print('Train set size: ', len(train_set))
print('Validation set size: ', len(val_set))
print('Test set size: ', len(test_set))

Train set size:  1631
Validation set size:  203
Test set size:  205


#### DataLoader

Here we load the training, validation, and test data with the PyTorch DataLoader. We also implement a custom collate function, because the default one does not work on graph data.

In [15]:
def collate(batch):
    # batch is a list of tuples (graph, label, globals)
    # Concatenate a sequence of graphs
    graphs = [e[0] for e in batch]
    g = dgl.batch(graphs)

    # Concatenate a sequence of tensors (labels) along a new dimension
    labels = [e[1] for e in batch]
    labels = torch.stack(labels, 0)

    # Concatenate a sequence of tensors (globals) along a new dimension
    globals = [e[2] for e in batch]
    globals = torch.stack(globals, 0)

    return g, labels, globals


def loader(batch_size=64):
    train_dataloader = DataLoader(train_set,
                              batch_size=batch_size,
                              collate_fn=collate,
                              drop_last=False,
                              shuffle=True,
                              num_workers=1)

    val_dataloader =  DataLoader(val_set,
                             batch_size=batch_size,
                             collate_fn=collate,
                             drop_last=False,
                             shuffle=False,
                             num_workers=1)

    test_dataloader = DataLoader(test_set,
                             batch_size=batch_size,
                             collate_fn=collate,
                             drop_last=False,
                             shuffle=False,
                             num_workers=1)
    return train_dataloader, val_dataloader, test_dataloader
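
Graphs in a batch have different numbers of nodes, so they cannot simply be stacked into one tensor. `dgl.batch` instead merges them into one large graph made of disconnected components, with nodes relabeled consecutively. The pure-Python sketch below illustrates that idea on edge lists. The helper `batch_edge_lists` is hypothetical and is not part of DGL.

```python
def batch_edge_lists(graphs):
    """Merge small graphs into one disconnected graph.

    Each graph is a pair (num_nodes, edges), where edges is a list of
    (src, dst) node-index pairs local to that graph.
    """
    merged_edges = []
    nodes_per_graph = []
    offset = 0
    for num_nodes, edges in graphs:
        # Shift local node ids so they do not collide with earlier graphs.
        merged_edges.extend((src + offset, dst + offset) for src, dst in edges)
        nodes_per_graph.append(num_nodes)
        offset += num_nodes
    return offset, merged_edges, nodes_per_graph


# Two toy "molecules": a 3-node path and a 2-node pair (edges in both directions).
g1 = (3, [(0, 1), (1, 0), (1, 2), (2, 1)])
g2 = (2, [(0, 1), (1, 0)])

total_nodes, edges, sizes = batch_edge_lists([g1, g2])
print(total_nodes)  # 5
print(edges)        # the second graph's nodes become 3 and 4
print(sizes)        # [3, 2]
```

Keeping the per-graph node counts is what later allows a readout to produce one prediction per molecule.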

In [16]:
train_dataloader, val_dataloader, test_dataloader = loader(batch_size=64)

Let's look at one batch of the training set:

In [18]:
g, labels, globals = next(iter(train_dataloader))
print(g)
print(globals)

Graph(num_nodes=508, num_edges=942,
      ndata_schemes={'v': Scheme(shape=(128,), dtype=torch.float32)}
      edata_schemes={'e': Scheme(shape=(13,), dtype=torch.float32)})
tensor([[9.6293e-01, 3.4845e-02, 3.6572e-02,  ..., 4.7036e-08, 1.0000e+00,
         8.1621e-01],
        [8.2602e-01, 2.4392e-01, 2.1237e-01,  ..., 4.7036e-08, 1.6663e-01,
         9.4531e-01],
        [9.6153e-01, 4.4801e-02, 6.1361e-02,  ..., 4.7036e-08, 1.6663e-01,
         8.7552e-01],
        ...,
        [9.7611e-01, 2.3887e-01, 4.5423e-01,  ..., 4.7036e-08, 1.6663e-01,
         4.2378e-01],
        [7.3355e-01, 3.3079e-01, 4.2144e-01,  ..., 4.7036e-08, 1.6663e-01,
         2.4726e-01],
        [7.4036e-01, 7.4415e-01, 7.9298e-01,  ..., 4.7036e-08, 1.6663e-01,
         1.5924e-02]])


## GNNs

#### Some variables

In [19]:
# Size of global feature of each graph
global_size = 200

# Number of epochs to train the model
num_epochs = 100

# Number of steps to wait if the model performance on the validation set does not improve
patience = 10

#Configurations to instantiate the model
config = {"node_feature_size":127, "edge_feature_size":12, "hidden_size":100}


In [20]:
config.get('node_feature_size')

127

#### GCN with two convolutional layers

Here we use the GraphConv of DGL for message passing.
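
To our understanding, `GraphConv` with its default `norm='both'` setting scales each message by the inverse square root of the source node's out-degree times the destination node's in-degree before summing. The sketch below reproduces only that aggregation step on scalar node values. It leaves out the learnable weight and bias, and `graphconv_normalized_sum` is an illustrative name rather than a DGL function.

```python
import math


def graphconv_normalized_sum(num_nodes, edges, values):
    """Aggregate scalar node values with symmetric degree normalization.

    Each message from src to dst is scaled by
    1 / sqrt(out_degree(src) * in_degree(dst)).
    """
    out_deg = [0] * num_nodes
    in_deg = [0] * num_nodes
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    agg = [0.0] * num_nodes
    for src, dst in edges:
        agg[dst] += values[src] / math.sqrt(out_deg[src] * in_deg[dst])
    return agg


# Path graph 0 - 1 - 2 with edges in both directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
aggregated = graphconv_normalized_sum(3, edges, [1.0, 2.0, 3.0])
print(aggregated)
```

Note that a node's own value does not enter its aggregate here, because the toy graph has no self-loops.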

In [21]:
class GCN1(nn.Module):
    def __init__(self, config, global_size = 200, num_tasks = 1):
        super().__init__()
        self.config = config
        self.num_tasks = num_tasks

        # Node feature size
        self.node_feature_size = self.config.get('node_feature_size', 127)

        # Edge feature size
        self.edge_feature_size = self.config.get('edge_feature_size', 12)

        # Hidden size
        self.hidden_size = self.config.get('hidden_size', 100)

        # allow_zero_in_degree expects a bool; isolated atoms (e.g. single-node graphs) have in-degree 0
        self.conv1 = GraphConv(self.node_feature_size, self.hidden_size, allow_zero_in_degree=True)
        self.conv2 = GraphConv(self.hidden_size, self.num_tasks, allow_zero_in_degree=True)

    # def forward(self, g, in_feat):
    def forward(self, mol_dgl_graph, globals):
        mol_dgl_graph.ndata["v"]= mol_dgl_graph.ndata["v"][:,:self.node_feature_size]
        mol_dgl_graph.edata["e"] = mol_dgl_graph.edata["e"][:,:self.edge_feature_size]

        h = self.conv1(mol_dgl_graph, mol_dgl_graph.ndata["v"])
        h = F.relu(h)
        h = self.conv2(mol_dgl_graph, h)
        mol_dgl_graph.ndata["h"] = h

        return dgl.mean_nodes(mol_dgl_graph, "h")
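
The final `dgl.mean_nodes` call turns per-node outputs into one value per graph by averaging over the nodes of each graph in the batch. A minimal sketch of that readout, assuming scalar per-node scores and a list of graph sizes (`mean_readout` is an illustrative helper, not a DGL function):

```python
def mean_readout(node_values, nodes_per_graph):
    """Average node values separately for each graph in a batch."""
    readout = []
    start = 0
    for n in nodes_per_graph:
        segment = node_values[start:start + n]
        readout.append(sum(segment) / n)
        start += n
    return readout


# Five nodes in total: the first three belong to graph 0, the last two to graph 1.
node_logits = [0.5, 1.5, 1.0, -2.0, 0.0]
graph_logits = mean_readout(node_logits, [3, 2])
print(graph_logits)  # [1.0, -1.0]
```

Each entry of the result plays the role of the raw logit that the loss function receives for one molecule.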

#### Function to compute the score (ROC-AUC)

In [22]:
from sklearn.metrics import roc_auc_score

def compute_score(model, data_loader):
    model.eval()
    metric = roc_auc_score
    with torch.no_grad():
        prediction_all= torch.empty(0)
        labels_all= torch.empty(0)
        for i, (mol_dgl_graph, labels, globals) in enumerate(data_loader):
            prediction = model(mol_dgl_graph, globals)
            prediction = torch.sigmoid(prediction)
            prediction_all = torch.cat((prediction_all, prediction), 0)
            labels_all = torch.cat((labels_all, labels), 0)
    try:
      score = metric(labels_all.int().cpu(), prediction_all.cpu()).item()
    except ValueError:
      score = 0
    return score
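
For intuition about the metric, ROC-AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it by brute force in pure Python. It is meant only to illustrate the definition and is not how scikit-learn implements `roc_auc_score`.

```python
def roc_auc_pairwise(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
print(roc_auc_pairwise(labels, scores))  # 0.75
```

Two consequences follow from this definition. A score of 0.5 corresponds to random ranking, so values below 0.5 mean the ranking is worse than chance. And because only the ordering matters, applying the sigmoid to the predictions does not change the score.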


#### Loss Function

The task is binary classification, so we use the BCEWithLogitsLoss loss function. It applies the sigmoid internally, so the models output raw logits.

In [20]:
def loss_func(output, label):
    pos_weight = torch.ones(1)
    criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    loss = criterion(output,label)
    return loss
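
As a reference for what this loss computes on a single example, here is a pure-Python sketch of binary cross-entropy on a raw logit, written in the numerically stable form and compared against the direct formula. The function names are illustrative. With `pos_weight` set to ones, as above, the loss is unweighted, which is the case shown here.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    # Equivalent to -[t * log(sigmoid(x)) + (1 - t) * log(1 - sigmoid(x))].
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def naive_bce(logit, target):
    """Direct formula, fine for moderate logits but unstable for extreme ones."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


for x, t in [(2.0, 1.0), (-1.0, 1.0), (0.5, 0.0)]:
    print(x, t, bce_with_logits(x, t), naive_bce(x, t))
```

The stable form never takes the logarithm of a probability that has rounded to 0 or 1, which is why working on logits is preferred over applying a sigmoid followed by a plain cross-entropy.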

#### Training

Here we implement a function that trains the model for one epoch (train_epoch) and use it in the main training function (train_evaluate). We save the best model to the path we set earlier so that it can be used in the test phase.

In [21]:
# Training function

def train_epoch(train_dataloader, model, optimizer):
    epoch_train_loss = 0
    iterations = 0
    model.train() # Prepare model for training
    for i, (mol_dgl_graph, labels, globals) in enumerate(train_dataloader):
        prediction = model(mol_dgl_graph, globals)
        loss_train = loss_func(prediction, labels)
        optimizer.zero_grad(set_to_none=True)
        loss_train.backward()
        optimizer.step()
        epoch_train_loss += loss_train.detach().item()
        iterations += 1
    epoch_train_loss /= iterations

    return epoch_train_loss


def train_evaluate(model):
    optimizer = torch.optim.Adam(model.parameters(), lr = 0.0001)

    best_val = 0
    patience_count = 1
    epoch = 1

    while epoch <= num_epochs:
        if patience_count <= patience:
            model.train()
            loss_train = train_epoch(train_dataloader, model, optimizer)
            model.eval()
            score_val = compute_score(model, val_dataloader)
            if score_val > best_val:
                best_val = score_val
                print("Save checkpoint")
                path = os.path.join(checkpoint_path, 'checkpoint.pth')
                dict_checkpoint = {"score_val": score_val}
                dict_checkpoint.update({"model_state_dict": model.state_dict(), "optimizer_state": optimizer.state_dict()})
                with open(path, "wb") as outputfile:
                    cloudpickle.dump(dict_checkpoint, outputfile)
                patience_count = 1
            else:
                print("Patience", patience_count)
                patience_count += 1

            print("Epoch: {}/{} | Training Loss: {:.3f} | Valid Score: {:.3f}".format(
            epoch, num_epochs, loss_train, score_val))

            print(" ")
            print("Epoch: {}/{} | Best Valid Score Until Now: {:.3f}".format(epoch, num_epochs, best_val), "\n")
        epoch += 1

    # best model save
    shutil.rmtree(best_model_path, ignore_errors=True)
    shutil.copytree(checkpoint_path, best_model_path)

    print("Final results:")
    print("Average Valid Score: {:.3f}".format(np.mean(best_val)), "\n")
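
Note that the loop in train_evaluate keeps counting epochs up to num_epochs after the patience budget is used up and only skips the training step. A variant that stops immediately can be sketched as follows. This is a simplified stand-alone illustration that works on a list of validation scores, not a drop-in replacement, and the function name and the example scores are made up.

```python
def best_epoch_with_patience(val_scores, patience):
    """Return (best_epoch, best_score, epochs_run) under early stopping.

    Training stops once `patience` consecutive epochs fail to beat the best
    validation score seen so far.
    """
    best_score = float("-inf")
    best_epoch = 0
    stale = 0
    epochs_run = 0
    for epoch, score in enumerate(val_scores, start=1):
        epochs_run = epoch
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # stop instead of idling through the remaining epochs
    return best_epoch, best_score, epochs_run


# Hypothetical validation curve that peaks at epoch 3.
scores = [0.70, 0.72, 0.80, 0.79, 0.78, 0.80, 0.77, 0.76]
print(best_epoch_with_patience(scores, patience=3))  # (3, 0.8, 6)
```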


#### Function to compute test set score of the final saved model

The following function loads the best saved model and computes its ROC-AUC score on the test set.

In [22]:
import time
start_time = time.time()

def test_evaluate(model):
    path = os.path.join(best_model_path, 'checkpoint.pth')
    with open(path, 'rb') as f:
        checkpoint = cloudpickle.load(f)
    model.load_state_dict(checkpoint["model_state_dict"])
    model.eval()
    test_score = compute_score(model, test_dataloader)

    print("Test Score: {:.3f}".format(test_score), "\n")
    print("Execution time: {:.3f} seconds".format(time.time() - start_time))
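
Because start_time is set once, when this cell runs, the reported execution time covers everything executed since then rather than a single model. If per-model timings are wanted, one option is to time each call separately. A minimal sketch, with `timed` as a hypothetical helper and `sum` standing in for the training call:

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) for this call only."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


result, seconds = timed(sum, range(1_000_000))
print(result, f"{seconds:.3f} seconds")
```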


#### Train the model and evaluate its performance

In [23]:
model_1 = GCN1(config, global_size)

In [24]:
train_evaluate(model_1)
test_evaluate(model_1)

Save checkpoint
Epoch: 1/100 | Training Loss: 0.663 | Valid Score: 0.331
 
Epoch: 1/100 | Best Valid Score Until Now: 0.331 

Patience 1
Epoch: 2/100 | Training Loss: 0.633 | Valid Score: 0.283
 
Epoch: 2/100 | Best Valid Score Until Now: 0.331 

Patience 2
Epoch: 3/100 | Training Loss: 0.615 | Valid Score: 0.282
 
Epoch: 3/100 | Best Valid Score Until Now: 0.331 

Patience 3
Epoch: 4/100 | Training Loss: 0.606 | Valid Score: 0.288
 
Epoch: 4/100 | Best Valid Score Until Now: 0.331 

Patience 4
Epoch: 5/100 | Training Loss: 0.595 | Valid Score: 0.299
 
Epoch: 5/100 | Best Valid Score Until Now: 0.331 

Patience 5
Epoch: 6/100 | Training Loss: 0.592 | Valid Score: 0.314
 
Epoch: 6/100 | Best Valid Score Until Now: 0.331 

Patience 6
Epoch: 7/100 | Training Loss: 0.586 | Valid Score: 0.326
 
Epoch: 7/100 | Best Valid Score Until Now: 0.331 

Save checkpoint
Epoch: 8/100 | Training Loss: 0.584 | Valid Score: 0.337
 
Epoch: 8/100 | Best Valid Score Until Now: 0.337 

Save checkpoint
Epoch:

#### GCN with four convolutional layers, batch normalization, and dropout

In [25]:
class GCN2(nn.Module):
    def __init__(self, config, global_size = 200, num_tasks = 1):
        super().__init__()
        self.config = config
        self.num_tasks = num_tasks

        # Node feature size
        self.node_feature_size = self.config.get('node_feature_size', 127)

        # Edge feature size
        self.edge_feature_size = self.config.get('edge_feature_size', 12)

        # Hidden size
        self.hidden_size = self.config.get('hidden_size', 100)

        # allow_zero_in_degree expects a bool, not the string 'True'
        self.conv1 = GraphConv(self.node_feature_size, self.hidden_size, allow_zero_in_degree=True)
        self.bn1 = nn.BatchNorm1d(self.hidden_size)
        self.dropout1 = nn.Dropout(0.2)

        self.conv2 = GraphConv(self.hidden_size, self.hidden_size, allow_zero_in_degree=True)
        self.bn2 = nn.BatchNorm1d(self.hidden_size)
        self.dropout2 = nn.Dropout(0.2)

        self.conv3 = GraphConv(self.hidden_size, self.hidden_size, allow_zero_in_degree=True)
        self.bn3 = nn.BatchNorm1d(self.hidden_size)
        self.dropout3 = nn.Dropout(0.2)

        self.conv4 = GraphConv(self.hidden_size, self.num_tasks, allow_zero_in_degree=True)

    # def forward(self, g, in_feat):
    def forward(self, mol_dgl_graph, globals):
        mol_dgl_graph.ndata["v"]= mol_dgl_graph.ndata["v"][:,:self.node_feature_size]
        mol_dgl_graph.edata["e"] = mol_dgl_graph.edata["e"][:,:self.edge_feature_size]

        h = self.conv1(mol_dgl_graph, mol_dgl_graph.ndata["v"])
        h = self.bn1(h)
        h = F.relu(h)
        h = self.dropout1(h)

        h = self.conv2(mol_dgl_graph, h)
        h = self.bn2(h)
        h = F.relu(h)
        h = self.dropout2(h)

        h = self.conv3(mol_dgl_graph, h)
        h = self.bn3(h)
        h = F.relu(h)
        h = self.dropout3(h)

        h = self.conv4(mol_dgl_graph, h)

        mol_dgl_graph.ndata["h"] = h

        return dgl.mean_nodes(mol_dgl_graph, "h")
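
Batch normalization and dropout behave differently in training and evaluation mode, which is why the training code switches between model.train() and model.eval(). In this architecture the batch dimension seen by `BatchNorm1d` is the set of all nodes in the batched graph. The toy sketch below (6 nodes and 4 features are arbitrary sizes) shows both modes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden = torch.randn(6, 4)  # 6 nodes, 4 hidden features (toy sizes)
bn = nn.BatchNorm1d(4)
dropout = nn.Dropout(0.2)

# Training mode: batch norm uses the statistics of the current batch of nodes,
# and dropout zeroes entries with probability 0.2 and rescales the rest by 1 / 0.8.
bn.train()
dropout.train()
normalized = bn(hidden)
print(normalized.mean(dim=0))  # close to zero for every feature
dropped = dropout(normalized)

# Evaluation mode: dropout becomes the identity and batch norm switches to
# its running statistics, so outputs are deterministic.
bn.eval()
dropout.eval()
print(torch.equal(dropout(hidden), hidden))  # True
```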

In [26]:
model_2 = GCN2(config, global_size)

In [27]:
train_evaluate(model_2)
test_evaluate(model_2)

Save checkpoint
Epoch: 1/100 | Training Loss: 0.630 | Valid Score: 0.646
 
Epoch: 1/100 | Best Valid Score Until Now: 0.646 

Save checkpoint
Epoch: 2/100 | Training Loss: 0.582 | Valid Score: 0.709
 
Epoch: 2/100 | Best Valid Score Until Now: 0.709 

Save checkpoint
Epoch: 3/100 | Training Loss: 0.539 | Valid Score: 0.734
 
Epoch: 3/100 | Best Valid Score Until Now: 0.734 

Save checkpoint
Epoch: 4/100 | Training Loss: 0.517 | Valid Score: 0.749
 
Epoch: 4/100 | Best Valid Score Until Now: 0.749 

Save checkpoint
Epoch: 5/100 | Training Loss: 0.512 | Valid Score: 0.757
 
Epoch: 5/100 | Best Valid Score Until Now: 0.757 

Save checkpoint
Epoch: 6/100 | Training Loss: 0.489 | Valid Score: 0.761
 
Epoch: 6/100 | Best Valid Score Until Now: 0.761 

Save checkpoint
Epoch: 7/100 | Training Loss: 0.479 | Valid Score: 0.766
 
Epoch: 7/100 | Best Valid Score Until Now: 0.766 

Save checkpoint
Epoch: 8/100 | Training Loss: 0.469 | Valid Score: 0.769
 
Epoch: 8/100 | Best Valid Score Until Now: 

Adding more layers together with batch normalization and dropout improves the results: after 8 epochs the validation ROC-AUC is 0.769, compared with 0.337 for the two-layer GCN.

#### GraphSAGE

Here we implement our own SAGEConv layer for message passing. The following GNN uses only node features.

In [28]:
class SAGEConv(nn.Module):
    def __init__(self, in_feat, out_feat):
        super(SAGEConv, self).__init__()
        # A linear submodule for projecting the input and neighbor feature to the output.
        self.linear = nn.Linear(in_feat * 2, out_feat)

    def forward(self, g, h):
        with g.local_scope():
            g.ndata["h"] = h
            # update_all is a message passing API.
            g.update_all(
                message_func=fn.copy_u("h", "m"),
                reduce_func=fn.mean("m", "h_N"),
            )
            h_N = g.ndata["h_N"]
            h_total = torch.cat([h, h_N], dim=1)
            return self.linear(h_total)
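
The aggregation inside this layer can be illustrated without DGL. For every node we take the mean of its in-neighbors' features and concatenate it with the node's own features, which is why the linear layer has `in_feat * 2` inputs. The sketch below uses plain lists and a hypothetical helper name. Filling the neighbor part with zeros for nodes without incoming edges is an assumption of this sketch.

```python
def mean_neighbor_concat(num_nodes, edges, features):
    """Concatenate each node's features with the mean of its in-neighbors' features.

    edges is a list of (src, dst) pairs; features is a list of per-node lists.
    Nodes without incoming edges get a zero vector as the neighbor part.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_nodes)]
    counts = [0] * num_nodes
    for src, dst in edges:
        counts[dst] += 1
        for k in range(dim):
            sums[dst][k] += features[src][k]
    out = []
    for v in range(num_nodes):
        mean = [s / counts[v] if counts[v] else 0.0 for s in sums[v]]
        out.append(features[v] + mean)  # list concatenation, like torch.cat(dim=1)
    return out


# Path graph 0 - 1 - 2 with edges in both directions and 2-d node features.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
features = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
combined = mean_neighbor_concat(3, edges, features)
for row in combined:
    print(row)
```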

In [29]:
class GraphSAGE1(nn.Module):
    def __init__(self, config, global_size = 200, num_tasks = 1):
        super().__init__()
        self.config = config
        self.num_tasks = num_tasks

        # Node feature size
        self.node_feature_size = self.config.get('node_feature_size', 127)

        # Edge feature size
        self.edge_feature_size = self.config.get('edge_feature_size', 12)

        # Hidden size
        self.hidden_size = self.config.get('hidden_size', 100)

        self.conv1 = SAGEConv(self.node_feature_size, self.hidden_size)
        self.conv2 = SAGEConv(self.hidden_size, self.num_tasks)

    def forward(self, mol_dgl_graph, globals):
        mol_dgl_graph.ndata["v"]= mol_dgl_graph.ndata["v"][:,:self.node_feature_size]
        mol_dgl_graph.edata["e"] = mol_dgl_graph.edata["e"][:,:self.edge_feature_size]
        h = self.conv1(mol_dgl_graph, mol_dgl_graph.ndata["v"])
        h = F.relu(h)
        h = self.conv2(mol_dgl_graph, h)
        mol_dgl_graph.ndata["h"] = h
        return dgl.mean_nodes(mol_dgl_graph, "h")

In [30]:
model_3 = GraphSAGE1(config, global_size)

In [31]:
train_evaluate(model_3)
test_evaluate(model_3)

Save checkpoint
Epoch: 1/100 | Training Loss: 0.651 | Valid Score: 0.383
 
Epoch: 1/100 | Best Valid Score Until Now: 0.383 

Patience 1
Epoch: 2/100 | Training Loss: 0.627 | Valid Score: 0.346
 
Epoch: 2/100 | Best Valid Score Until Now: 0.383 

Patience 2
Epoch: 3/100 | Training Loss: 0.609 | Valid Score: 0.339
 
Epoch: 3/100 | Best Valid Score Until Now: 0.383 

Patience 3
Epoch: 4/100 | Training Loss: 0.597 | Valid Score: 0.343
 
Epoch: 4/100 | Best Valid Score Until Now: 0.383 

Patience 4
Epoch: 5/100 | Training Loss: 0.587 | Valid Score: 0.357
 
Epoch: 5/100 | Best Valid Score Until Now: 0.383 

Patience 5
Epoch: 6/100 | Training Loss: 0.583 | Valid Score: 0.372
 
Epoch: 6/100 | Best Valid Score Until Now: 0.383 

Save checkpoint
Epoch: 7/100 | Training Loss: 0.576 | Valid Score: 0.390
 
Epoch: 7/100 | Best Valid Score Until Now: 0.390 

Save checkpoint
Epoch: 8/100 | Training Loss: 0.574 | Valid Score: 0.417
 
Epoch: 8/100 | Best Valid Score Until Now: 0.417 

Save checkpoint
E

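Not a good performance! In the epochs shown, the validation ROC-AUC stays below 0.5 (0.417 at epoch 8), which means the model ranks molecules worse than random guessing.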

#### GraphSAGE with four layers, batch normalization, and dropout

In [32]:
class GraphSAGE2(nn.Module):
    def __init__(self, config, global_size = 200, num_tasks = 1):
        super().__init__()
        self.config = config
        self.num_tasks = num_tasks

        # Node feature size
        self.node_feature_size = self.config.get('node_feature_size', 127)

        # Edge feature size
        self.edge_feature_size = self.config.get('edge_feature_size', 12)

        # Hidden size
        self.hidden_size = self.config.get('hidden_size', 100)

        self.conv1 = SAGEConv(self.node_feature_size, self.hidden_size)
        self.bn1 = nn.BatchNorm1d(self.hidden_size)
        self.dropout1 = nn.Dropout(0.2)

        self.conv2 = SAGEConv(self.hidden_size, self.hidden_size)
        self.bn2 = nn.BatchNorm1d(self.hidden_size)
        self.dropout2 = nn.Dropout(0.2)

        self.conv3 = SAGEConv(self.hidden_size, self.hidden_size)
        self.bn3 = nn.BatchNorm1d(self.hidden_size)
        self.dropout3 = nn.Dropout(0.2)

        self.conv4 = SAGEConv(self.hidden_size, self.num_tasks)

    def forward(self, mol_dgl_graph, globals):
        mol_dgl_graph.ndata["v"]= mol_dgl_graph.ndata["v"][:,:self.node_feature_size]
        mol_dgl_graph.edata["e"] = mol_dgl_graph.edata["e"][:,:self.edge_feature_size]

        h = self.conv1(mol_dgl_graph, mol_dgl_graph.ndata["v"])
        h = self.bn1(h)
        h = F.relu(h)
        h = self.dropout1(h)

        h = self.conv2(mol_dgl_graph, h)
        h = self.bn2(h)
        h = F.relu(h)
        h = self.dropout2(h)

        h = self.conv3(mol_dgl_graph, h)
        h = self.bn3(h)
        h = F.relu(h)
        h = self.dropout3(h)

        h = self.conv4(mol_dgl_graph, h)

        mol_dgl_graph.ndata["h"] = h

        return dgl.mean_nodes(mol_dgl_graph, "h")

In [33]:
model_4 = GraphSAGE2(config, global_size)

In [34]:
train_evaluate(model_4)
test_evaluate(model_4)

Save checkpoint
Epoch: 1/100 | Training Loss: 0.668 | Valid Score: 0.727
 
Epoch: 1/100 | Best Valid Score Until Now: 0.727 

Save checkpoint
Epoch: 2/100 | Training Loss: 0.599 | Valid Score: 0.788
 
Epoch: 2/100 | Best Valid Score Until Now: 0.788 

Save checkpoint
Epoch: 3/100 | Training Loss: 0.552 | Valid Score: 0.803
 
Epoch: 3/100 | Best Valid Score Until Now: 0.803 

Patience 1
Epoch: 4/100 | Training Loss: 0.519 | Valid Score: 0.802
 
Epoch: 4/100 | Best Valid Score Until Now: 0.803 

Save checkpoint
Epoch: 5/100 | Training Loss: 0.493 | Valid Score: 0.809
 
Epoch: 5/100 | Best Valid Score Until Now: 0.809 

Patience 1
Epoch: 6/100 | Training Loss: 0.473 | Valid Score: 0.808
 
Epoch: 6/100 | Best Valid Score Until Now: 0.809 

Patience 2
Epoch: 7/100 | Training Loss: 0.460 | Valid Score: 0.809
 
Epoch: 7/100 | Best Valid Score Until Now: 0.809 

Patience 3
Epoch: 8/100 | Training Loss: 0.449 | Valid Score: 0.805
 
Epoch: 8/100 | Best Valid Score Until Now: 0.809 

Patience 4
E

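A good improvement! The validation score already reaches 0.809 after five epochs, compared with 0.417 after eight epochs for the two-layer GraphSAGE.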

#### Custom GNNs

Here we implement a custom GNN. We use u_mul_v as the message function and sum as the aggregation function. The architecture has four layers with batch normalization and dropout.

In [65]:
class CustomGraphConv1(nn.Module):
    def __init__(self, in_feat, out_feat):
        super(CustomGraphConv1, self).__init__()
        # A linear submodule for projecting the input and neighbor feature to the output.
        self.linear = nn.Linear(in_feat * 2, out_feat)

    def forward(self, g, h):
        with g.local_scope():
            g.ndata["h"] = h
            # update_all is a message passing API.
            g.update_all(
                message_func=fn.u_mul_v("h", "h", "m"),
                reduce_func=fn.sum("m", "h_N"),
            )
            h_N = g.ndata["h_N"]
            h_total = torch.cat([h, h_N], dim=1)
            return self.linear(h_total)
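
To see what this combination computes, the sketch below reproduces it on plain lists: the message on an edge (u, v) is the elementwise product of the two endpoint features, and each node sums the messages on its incoming edges. The helper name and the toy graph are illustrative only.

```python
def mul_sum_aggregate(num_nodes, edges, features):
    """Sum, over incoming edges (u, v), the elementwise product h_u * h_v."""
    dim = len(features[0])
    agg = [[0.0] * dim for _ in range(num_nodes)]
    for u, v in edges:
        for k in range(dim):
            agg[v][k] += features[u][k] * features[v][k]
    return agg


# Path graph 0 - 1 - 2 with edges in both directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
aggregated = mul_sum_aggregate(3, edges, features)
print(aggregated)  # [[3.0, 8.0], [18.0, 32.0], [15.0, 24.0]]
```

Unlike the mean used in SAGEConv, the sum is not normalized by degree, so nodes with more neighbors produce larger values. This may be one reason the training loss of this model fluctuates more, although we have not verified that.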

In [66]:
class GNN1(nn.Module):
    def __init__(self, config, global_size = 200, num_tasks = 1):
        super().__init__()
        self.config = config
        self.num_tasks = num_tasks

        # Node feature size
        self.node_feature_size = self.config.get('node_feature_size', 127)

        # Edge feature size
        self.edge_feature_size = self.config.get('edge_feature_size', 12)

        # Hidden size
        self.hidden_size = self.config.get('hidden_size', 100)

        self.conv1 = CustomGraphConv1(self.node_feature_size, self.hidden_size)
        self.bn1 = nn.BatchNorm1d(self.hidden_size)
        self.dropout1 = nn.Dropout(0.2)

        self.conv2 = CustomGraphConv1(self.hidden_size, self.hidden_size)
        self.bn2 = nn.BatchNorm1d(self.hidden_size)
        self.dropout2 = nn.Dropout(0.2)

        self.conv3 = CustomGraphConv1(self.hidden_size, self.hidden_size)
        self.bn3 = nn.BatchNorm1d(self.hidden_size)
        self.dropout3 = nn.Dropout(0.2)

        self.conv4 = CustomGraphConv1(self.hidden_size, self.num_tasks)

    def forward(self, mol_dgl_graph, globals):
        mol_dgl_graph.ndata["v"]= mol_dgl_graph.ndata["v"][:,:self.node_feature_size]
        mol_dgl_graph.edata["e"] = mol_dgl_graph.edata["e"][:,:self.edge_feature_size]

        h = self.conv1(mol_dgl_graph, mol_dgl_graph.ndata["v"])
        h = self.bn1(h)
        h = F.relu(h)
        h = self.dropout1(h)

        h = self.conv2(mol_dgl_graph, h)
        h = self.bn2(h)
        h = F.relu(h)
        h = self.dropout2(h)

        h = self.conv3(mol_dgl_graph, h)
        h = self.bn3(h)
        h = F.relu(h)
        h = self.dropout3(h)

        h = self.conv4(mol_dgl_graph, h)

        mol_dgl_graph.ndata["h"] = h

        return dgl.mean_nodes(mol_dgl_graph, "h")

In [67]:
model_5 = GNN1(config, global_size)

In [68]:
train_evaluate(model_5)
test_evaluate(model_5)

Save checkpoint
Epoch: 1/100 | Training Loss: 0.795 | Valid Score: 0.549
 
Epoch: 1/100 | Best Valid Score Until Now: 0.549 

Save checkpoint
Epoch: 2/100 | Training Loss: 0.747 | Valid Score: 0.653
 
Epoch: 2/100 | Best Valid Score Until Now: 0.653 

Save checkpoint
Epoch: 3/100 | Training Loss: 0.805 | Valid Score: 0.715
 
Epoch: 3/100 | Best Valid Score Until Now: 0.715 

Patience 1
Epoch: 4/100 | Training Loss: 0.746 | Valid Score: 0.695
 
Epoch: 4/100 | Best Valid Score Until Now: 0.715 

Patience 2
Epoch: 5/100 | Training Loss: 0.715 | Valid Score: 0.694
 
Epoch: 5/100 | Best Valid Score Until Now: 0.715 

Patience 3
Epoch: 6/100 | Training Loss: 0.760 | Valid Score: 0.699
 
Epoch: 6/100 | Best Valid Score Until Now: 0.715 

Patience 4
Epoch: 7/100 | Training Loss: 0.694 | Valid Score: 0.696
 
Epoch: 7/100 | Best Valid Score Until Now: 0.715 

Patience 5
Epoch: 8/100 | Training Loss: 0.711 | Valid Score: 0.697
 
Epoch: 8/100 | Best Valid Score Until Now: 0.715 

Patience 6
Epoch:

Using edge features for message passing:

In [174]:
class CustomGraphConv2(nn.Module):
    def __init__(self, in_feat, out_feat):
        super(CustomGraphConv2, self).__init__()
        # 12 is the edge feature size (edge_feature_size in config), hard-coded here
        self.linear = nn.Linear(in_feat + 12, out_feat)

    def forward(self, g, h, w):
        with g.local_scope():
            g.ndata["h"] = h
            g.edata["w"] = w
            g.update_all(
                message_func=fn.copy_e("w", "m"), # computing message using edge feature
                reduce_func=fn.sum("m", "h_N"),
            )
            h_N = g.ndata["h_N"]
            h_total = torch.cat([h, h_N], dim=1)
            # print(h_total.shape)
            return self.linear(h_total)
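
With copy_e and sum, the aggregated vector of a node is simply the sum of the features on its incoming edges, which is then concatenated with the node's own features. A pure-Python sketch of the aggregation step, using a hypothetical helper and made-up 2-d edge features:

```python
def sum_incoming_edge_features(num_nodes, edges, edge_features):
    """For every node, sum the feature vectors of its incoming edges."""
    dim = len(edge_features[0])
    agg = [[0.0] * dim for _ in range(num_nodes)]
    for (src, dst), feat in zip(edges, edge_features):
        for k in range(dim):
            agg[dst][k] += feat[k]
    return agg


# Path graph 0 - 1 - 2; both directions of a bond carry the same edge feature.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
edge_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
aggregated = sum_incoming_edge_features(3, edges, edge_features)
print(aggregated)  # [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
```

One thing this makes visible: the aggregate depends only on edge features, so in this layer the features of neighboring nodes are not passed along, and the same aggregate is reused in every layer.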


In [175]:
class GNN2(nn.Module):
    def __init__(self, config, global_size = 200, num_tasks = 1):
        super().__init__()
        self.config = config
        self.num_tasks = num_tasks

        # Node feature size
        self.node_feature_size = self.config.get('node_feature_size', 127)

        # Edge feature size
        self.edge_feature_size = self.config.get('edge_feature_size', 12)

        # Hidden size
        self.hidden_size = self.config.get('hidden_size', 100)

        self.conv1 = CustomGraphConv2(self.node_feature_size, self.hidden_size)
        self.bn1 = nn.BatchNorm1d(self.hidden_size)
        self.dropout1 = nn.Dropout(0.2)

        self.conv2 = CustomGraphConv2(self.hidden_size, self.hidden_size)
        self.bn2 = nn.BatchNorm1d(self.hidden_size)
        self.dropout2 = nn.Dropout(0.2)

        self.conv3 = CustomGraphConv2(self.hidden_size, self.hidden_size)
        self.bn3 = nn.BatchNorm1d(self.hidden_size)
        self.dropout3 = nn.Dropout(0.2)

        self.conv4 = CustomGraphConv2(self.hidden_size, self.num_tasks)

    def forward(self, mol_dgl_graph, globals):
        mol_dgl_graph.ndata["v"]= mol_dgl_graph.ndata["v"][:,:self.node_feature_size]
        mol_dgl_graph.edata["e"] = mol_dgl_graph.edata["e"][:,:self.edge_feature_size]

        h = self.conv1(mol_dgl_graph, mol_dgl_graph.ndata["v"], mol_dgl_graph.edata["e"])
        h = self.bn1(h)
        h = F.relu(h)
        h = self.dropout1(h)

        h = self.conv2(mol_dgl_graph, h, mol_dgl_graph.edata["e"])
        h = self.bn2(h)
        h = F.relu(h)
        h = self.dropout2(h)

        h = self.conv3(mol_dgl_graph, h, mol_dgl_graph.edata["e"])
        h = self.bn3(h)
        h = F.relu(h)
        h = self.dropout3(h)

        h = self.conv4(mol_dgl_graph, h, mol_dgl_graph.edata["e"])

        mol_dgl_graph.ndata["h"] = h

        return dgl.mean_nodes(mol_dgl_graph, "h")

In [176]:
model_6 = GNN2(config, global_size)

In [177]:
train_evaluate(model_6)
test_evaluate(model_6)

Save checkpoint
Epoch: 1/100 | Training Loss: 0.677 | Valid Score: 0.718
 
Epoch: 1/100 | Best Valid Score Until Now: 0.718 

Save checkpoint
Epoch: 2/100 | Training Loss: 0.631 | Valid Score: 0.721
 
Epoch: 2/100 | Best Valid Score Until Now: 0.721 

Save checkpoint
Epoch: 3/100 | Training Loss: 0.599 | Valid Score: 0.740
 
Epoch: 3/100 | Best Valid Score Until Now: 0.740 

Save checkpoint
Epoch: 4/100 | Training Loss: 0.566 | Valid Score: 0.761
 
Epoch: 4/100 | Best Valid Score Until Now: 0.761 

Save checkpoint
Epoch: 5/100 | Training Loss: 0.546 | Valid Score: 0.768
 
Epoch: 5/100 | Best Valid Score Until Now: 0.768 

Save checkpoint
Epoch: 6/100 | Training Loss: 0.525 | Valid Score: 0.776
 
Epoch: 6/100 | Best Valid Score Until Now: 0.776 

Save checkpoint
Epoch: 7/100 | Training Loss: 0.508 | Valid Score: 0.789
 
Epoch: 7/100 | Best Valid Score Until Now: 0.789 

Save checkpoint
Epoch: 8/100 | Training Loss: 0.492 | Valid Score: 0.796
 
Epoch: 8/100 | Best Valid Score Until Now: 

This is the best result. It looks like adding edge features was effective!