# Dataset ogbg-molhiv Project

## Introduction

Introduction of dataset taken from https://ogb.stanford.edu/docs/graphprop/#ogbg-mol

Graph: The ogbg-molhiv and ogbg-molpcba datasets are two molecular property prediction datasets of different sizes: ogbg-molhiv (small) and ogbg-molpcba (medium). They are adopted from the MoleculeNet [1], and are among the largest of the MoleculeNet datasets. All the molecules are pre-processed using RDKit [2]. Each graph represents a molecule, where nodes are atoms, and edges are chemical bonds. Input node features are 9-dimensional, containing atomic number and chirality, as well as other additional atom features such as formal charge and whether the atom is in the ring or not. The full description of the features is provided in code. The script to convert the SMILES string [3] to the above graph object can be found here. Note that the script requires RDkit to be installed. The script can be used to pre-process external molecule datasets so that those datasets share the same input feature space as the OGB molecule datasets. This is particularly useful for pre-training graph models, which has great potential to significantly increase generalization performance on the (downstream) OGB datasets [4].  

Beside the two main datasets, we additionally provide 10 smaller datasets from MoleculeNet. They are ogbg-moltox21, ogbg-molbace, ogbg-molbbbp, ogbg-molclintox, ogbg-molmuv, ogbg-molsider, and ogbg-moltoxcast for (multi-task) binary classification, and ogbg-molesol, ogbg-molfreesolv, and ogbg-mollipo for regression. Evaluators are also provided for these datasets. These datasets can be used to stress-test molecule-specific methods or transfer learning [4].  

For encoding these raw input features, we prepare simple modules called AtomEncoder and BondEncoder. They can be used as follows to embed raw atom and bond features to obtain atom_emb and bond_emb.





Prediction task: The task is to predict the target molecular properties as accurately as possible, where the molecular properties are cast as binary labels, e.g, whether a molecule inhibits HIV virus replication or not. Note that some datasets (e.g., ogbg-molpcba) can have multiple tasks, and can contain nan that indicates the corresponding label is not assigned to the molecule. For evaluation metric, we closely follow [2]. Specifically, for ogbg-molhiv, we use ROC-AUC for evaluation. For ogbg-molpcba, as the class balance is extremely skewed (only 1.4% of data is positive) and the dataset contains multiple classification tasks, we use the Average Precision (AP) averaged over the tasks as the evaluation metric.



Dataset splitting: We adopt the scaffold splitting procedure that splits the molecules based on their two-dimensional structural frameworks. The scaffold splitting attempts to separate structurally different molecules into different subsets, which provides a more realistic estimate of model performance in prospective experimental settings [1].



## Data Loader

In [30]:
from ogb.graphproppred import PygGraphPropPredDataset
from torch_geometric.data import DataLoader

dataset = PygGraphPropPredDataset(name = "ogbg-molhiv", root='../')
 
split_idx = dataset.get_idx_split() 
train_loader = DataLoader(dataset[split_idx["train"]], batch_size=64, shuffle=True)
valid_loader = DataLoader(dataset[split_idx["valid"]], batch_size=64, shuffle=False)
test_loader = DataLoader(dataset[split_idx["test"]], batch_size=64, shuffle=False)

In [31]:
# for step, data in enumerate(train_loader):
#     print(f'Step {step + 1}:')
#     print('=======')
#     print(f'Number of graphs in the current batch: {data.num_graphs}')
#     print(data)
#     print()

COO stores a list of (row, column, value) tuples. Ideally, the entries are sorted first by row index and then by column index, to improve random access times. This is another format that is good for incremental matrix construction.  
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html

data.x: Node feature matrix with shape [num_nodes, num_node_features]  
data.edge_index: Graph connectivity in COO format with shape `[2, num_edges]` and type torch.long i.e `[2,3]` node soucre (2) goes to destination node (3).  
data.edge_attr: Edge feature matrix with shape `[num_edges, num_edge_features]`.  
data.y: Target to train against (may have arbitrary shape), e.g., node-level targets of shape  

In [32]:
len(dataset), dataset[0], dataset[0].is_undirected()

(41127,
 Data(edge_attr=[40, 3], edge_index=[2, 40], x=[19, 9], y=[1, 1]),
 tensor(True))

In [33]:
dataset[0].edge_attr[0], dataset[0].edge_index[:,0]

(tensor([0, 0, 0]), tensor([0, 1]))

In [34]:

# Gather some statistics about the graph.
print(f'Dataset: {dataset}:')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')
print('======================')
data = dataset[0]  # Get the first graph object.
print(data)
print('==============================================================')
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
print(f'Contains self-loops: {data.contains_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')

Dataset: PygGraphPropPredDataset(41127):
Number of graphs: 41127
Number of features: 9
Number of classes: 2
Data(edge_attr=[40, 3], edge_index=[2, 40], x=[19, 9], y=[1, 1])
Number of nodes: 19
Number of edges: 40
Average node degree: 2.11
Contains self-loops: False
Is undirected: True


In [35]:
print('# of graphs = {0}\n# of classes = {1}\n# of node features = {2}\n# of edge features = {3}'.\
         format(len(dataset), dataset.num_classes, dataset.num_node_features, dataset.num_edge_features))

if isinstance(dataset, PygGraphPropPredDataset): # OGB datasets
    print('# of tasks = {}'.format(dataset.num_tasks))

# of graphs = 41127
# of classes = 2
# of node features = 9
# of edge features = 3
# of tasks = 1


We see that the first graph has 19 nodes, each one having 9 features. The graph has 40 edges, each one having 3 features.

In [36]:
dataset[0].x.shape, dataset[0].x, dataset[0].edge_index.shape, dataset[0].edge_index, dataset[0].edge_attr.shape, dataset[0].edge_attr, dataset[0].y

(torch.Size([19, 9]),
 tensor([[ 5,  0,  4,  5,  3,  0,  2,  0,  0],
         [ 5,  0,  4,  5,  2,  0,  2,  0,  0],
         [ 5,  0,  3,  5,  0,  0,  1,  0,  1],
         [ 7,  0,  2,  6,  0,  0,  1,  0,  1],
         [28,  0,  4,  2,  0,  0,  5,  0,  1],
         [ 7,  0,  2,  6,  0,  0,  1,  0,  1],
         [ 5,  0,  3,  5,  0,  0,  1,  0,  1],
         [ 5,  0,  4,  5,  2,  0,  2,  0,  0],
         [ 5,  0,  4,  5,  3,  0,  2,  0,  0],
         [ 5,  0,  4,  5,  2,  0,  2,  0,  1],
         [ 7,  0,  2,  6,  0,  0,  1,  0,  1],
         [ 5,  0,  3,  5,  0,  0,  1,  0,  1],
         [ 5,  0,  4,  5,  2,  0,  2,  0,  0],
         [ 5,  0,  4,  5,  3,  0,  2,  0,  0],
         [ 5,  0,  4,  5,  2,  0,  2,  0,  1],
         [ 5,  0,  3,  5,  0,  0,  1,  0,  1],
         [ 5,  0,  4,  5,  2,  0,  2,  0,  0],
         [ 5,  0,  4,  5,  3,  0,  2,  0,  0],
         [ 7,  0,  2,  6,  0,  0,  1,  0,  1]]),
 torch.Size([2, 40]),
 tensor([[ 0,  1,  1,  2,  2,  3,  3,  4,  4,  5,  5,  6,  6,

In [37]:
dataset.num_classes, dataset[100].y.item()

(2, 0)

In [38]:
from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder
atom_encoder = AtomEncoder(emb_dim = 100)
bond_encoder = BondEncoder(emb_dim = 100)

x, edge_attr = dataset[0].x, dataset[0].edge_attr
atom_emb = atom_encoder(x) # x is input atom feature
edge_emb = bond_encoder(edge_attr) # edge_attr is input edge feature
atom_emb.shape, edge_emb.shape

(torch.Size([19, 100]), torch.Size([40, 100]))

## Training a Graph Neural Network (GNN)
Training a GNN for graph classification usually follows a simple recipe:
- Embed each node by performing multiple rounds of message passing. 
- Aggregate node embeddings into a unified graph embedding (readout layer). 
- Train a final classifier on the graph embedding. 

In [55]:
import torch
import torch.nn.functional as F
from torch.autograd import Variable
from torch_geometric.datasets import Entities
from torch_geometric.utils import k_hop_subgraph
from torch_geometric.nn import RGCNConv, FastRGCNConv

In [80]:
IN_CHANNELS = 9
OUT_CHANNELS = 16
NUM_RELATIONS = 5 * 5 * 2 #given by the possible value of 'possible_bond_type_list', 'possible_bond_stereo_list' and 'possible_is_conjugated_list'

In [None]:
https://github.com/lightaime/deep_gcns_torch/blob/master/examples/ogb/ogbg_mol/model.py

In [87]:
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = RGCNConv(IN_CHANNELS, OUT_CHANNELS, NUM_RELATIONS,num_bases=10)
        self.conv2 = RGCNConv(OUT_CHANNELS, dataset.num_classes, NUM_RELATIONS, num_bases=10)

    def forward(self, x, edge_index, edge_type):
        y = F.relu(self.conv1(x, edge_index, edge_type))
        print(y.shape)
        y = self.conv2(x, edge_index, edge_type)
        return F.log_softmax(y, dim=1)

In [88]:
model, data = Net().to(device), data.to(device) # puts model and data on GPU / CPU
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=0.0005)

In [92]:
def train():
    model.train()
    for batch_idx, data in enumerate(train_loader):
        optimizer.zero_grad()
        x , edge_index, edge_type = Variable(data.x).to(device), Variable(data.edge_index).to(device), Variable(data.edge_attr).to(device)
        print(x.shape, edge_index.shape, edge_type.shape)
        out = model(x , edge_index, edge_type)
        loss = F.nll_loss(out, data.y)
        loss.backward()
        optimizer.step()
        if batch_idx %100 ==0:
            print('epoch {} batch {} [{}/{}] training loss: {}'.format(epoch,batch_idx,batch_idx*len(x),
                    len(train_loader.dataset),loss.item()))
        return loss.item()

In [90]:
@torch.no_grad()
def test():
    model.eval()
    pred = model(data.x, data.edge_index, data.edge_type).argmax(dim=1)
    train_acc = pred[data.train_idx].eq(data.train_y).to(torch.float).mean()
    test_acc = pred[data.test_idx].eq(data.test_y).to(torch.float).mean()
    return train_acc.item(), test_acc.item()

In [93]:
for epoch in range(1, 51):
    loss = train()
    #train_acc, test_acc = test()
    #print(f'Epoch: {epoch:02d}, Loss: {loss:.4f}, Train: {train_acc:.4f} 'f'Test: {test_acc:.4f}')

torch.Size([1783, 9]) torch.Size([2, 3932]) torch.Size([3932, 3])


IndexError: too many indices for tensor of dimension 2

In [12]:
from torch.nn import Linear
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.nn import global_mean_pool

class GCN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super(GCN, self).__init__()
        torch.manual_seed(12345)
        self.conv1 = GCNConv(dataset.num_node_features, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.conv3 = GCNConv(hidden_channels, hidden_channels)
        self.lin = Linear(hidden_channels, dataset.num_classes)

    def forward(self, x, edge_index, batch):
        # 1. Obtain node embeddings 
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = self.conv2(x, edge_index)
        x = x.relu()
        x = self.conv3(x, edge_index)

        # 2. Readout layer
        x = global_mean_pool(x, batch)  # [batch_size, hidden_channels]

        # 3. Apply a final classifier
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.lin(x)
        
        return x

model = GCN(hidden_channels=64)
print(model)

NameError: name 'torch' is not defined

https://colab.research.google.com/drive/1I8a0DfQ3fI7Njc62__mVXUlcAleUclnb?usp=sharing

In [None]:
atom_encoder = AtomEncoder(emb_dim = 100)
bond_encoder = BondEncoder(emb_dim = 100)

x, edge_attr = dataset[0].x, dataset[0].edge_attr
atom_emb = atom_encoder(x) # x is input atom feature
edge_emb = bond_encoder(edge_attr) # edge_attr is input edge feature
atom_emb.shape, edge_emb.shape

In [None]:
class DeeperGCN(torch.nn.Module):
    def __init__(self, hidden_channels, num_layers):
        super(DeeperGCN, self).__init__()

        atom_encoder = AtomEncoder(emb_dim = hidden_channels)
        bond_encoder = BondEncoder(emb_dim = hidden_channels)

        self.node_encoder = atom_encoder(data.x.size(-1), hidden_channels)
        self.edge_encoder = bond_encoder(data.edge_attr.size(-1), hidden_channels)

        self.layers = torch.nn.ModuleList()

        
        for i in range(1, num_layers + 1):
            conv = GENConv(hidden_channels, hidden_channels, aggr='softmax',
                           t=1.0, learn_t=True, num_layers=2, norm='layer')
            norm = LayerNorm(hidden_channels, elementwise_affine=True)
            act = ReLU(inplace=True)

            layer = DeepGCNLayer(conv, norm, act, block='res+', dropout=0.1,
                                 ckpt_grad=i % 3)
            self.layers.append(layer)

        self.lin = Linear(hidden_channels, data.y.size(-1))

    def forward(self, x, edge_index, edge_attr):
        x = self.node_encoder(x)
        edge_attr = self.edge_encoder(edge_attr)

        x = self.layers[0].conv(x, edge_index, edge_attr)

        for layer in self.layers[1:]:
            x = layer(x, edge_index, edge_attr)

        x = self.layers[0].act(self.layers[0].norm(x))
        x = F.dropout(x, p=0.1, training=self.training)

        return self.lin(x)


to do GCNConv, GENConv and GATConv