# Obtaining the Dataset: Cora

PyTorch Geometric is an extension library to the popular deep learning framework PyTorch, and consists of various methods and utilities to ease the implementation of Graph Neural Networks.

First, we need a dataset for training, validating and testing the Graph Neural Network (GNN). Here we load the Cora dataset, which ships with PyTorch Geometric. The citation network datasets "Cora", "CiteSeer" and "PubMed" are all available through the `Planetoid` class.

In [49]:
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Cora', name='Cora')
print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of classes: {dataset.num_classes}')  # number of node classes

Dataset: Cora():
Number of graphs: 1
Number of classes: 7


PyTorch Geometric provides a **Data** class. An object of the **Data** class represents a single homogeneous graph and can hold node-level, link-level and graph-level attributes.

In general, **Data** tries to mimic the behavior of a regular Python dictionary. 

Some commonly used properties of **Data** objects are:
1. num_node_features: int
2. num_features: int
3. num_edge_features: int
4. num_node_types: int
5. num_edge_types: int
6. num_edges: int
7. num_nodes: int

In [27]:
for i, graph in enumerate(dataset):
    print(f'{i}-th graph')
    print('------------')
    print(f'Number of nodes: {graph.num_nodes}')
    print(f'Number of node features: {graph.num_node_features}')
    print(f'Number of edges: {graph.num_edges}')
    print(f'Number of edge features: {graph.num_edge_features}')
    print(f'Average node degree: {graph.num_edges / graph.num_nodes:.2f}')
    print(f'Number of training nodes: {graph.train_mask.sum()}')
    print(f'Number of validation nodes: {graph.val_mask.sum()}')
    print(f'Number of test nodes: {graph.test_mask.sum()}')
    print(f'Has isolated nodes(nodes without edges): {graph.has_isolated_nodes()}')
    print(f'Has self-loops: {graph.has_self_loops()}')
    print(f'Is undirected: {graph.is_undirected()}')

0-th graph
------------
Number of nodes: 2708
Number of node features: 1433
Number of edges: 10556
Number of edge features: 0
Average node degree: 3.90
Number of training nodes: 140
Number of validation nodes: 500
Number of test nodes: 1000
Has isolated nodes(nodes without edges): False
Has self-loops: False
Is undirected: True


In [31]:
dataset[0].y #contains the ground-truth label of each of the 2708 nodes in the graph

tensor([3, 4, 4,  ..., 3, 3, 3])

In [124]:
len(dataset[0].y)

2708

In [32]:
import numpy as np
len(np.unique(np.array(dataset[0].y)))  # np.unique() returns the sorted unique values of an array, so the length is the number of distinct labels

7

In [29]:
dataset[0].x  # node feature matrix of shape [2708, 1433]: one 1433-dimensional input feature vector per node (these are features, not labels)

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

In [30]:
len(dataset[0].x)

2708

In [128]:
dataset[0].x[2700]  # feature vector of the node at index 2700 (the 2701st node)

tensor([0., 0., 1.,  ..., 0., 0., 0.])

By printing '**edge_index**', we can see how PyG represents graph connectivity internally. It is a tensor of shape `[2, num_edges]` in which each column describes one edge: the first row holds the index of the source node and the second row holds the index of the destination node.

This representation is known as the **COO format** (coordinate format), which is commonly used for representing sparse matrices. Instead of holding the adjacency information in a dense matrix, PyG represents graphs sparsely, that is, it stores only the coordinates of the non-zero entries.
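
To make the COO layout concrete, here is a minimal, dependency-free sketch that converts a small dense adjacency matrix into the same `[2, num_edges]` layout. The three-node graph and the variable names are illustrative assumptions, not part of the Cora data.

```python
# Dense adjacency matrix of a small undirected graph with edges 0-1 and 1-2.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

# COO keeps only the coordinates of the non-zero entries:
# one list of source indices and one list of destination indices.
sources, destinations = [], []
for src, row in enumerate(adjacency):
    for dst, value in enumerate(row):
        if value != 0:
            sources.append(src)
            destinations.append(dst)

edge_index = [sources, destinations]  # same [2, num_edges] layout as PyG
print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]

# An undirected edge is stored once per direction, so every (u, v) has a (v, u).
edges = set(zip(sources, destinations))
is_undirected = all((v, u) in edges for (u, v) in edges)
print(is_undirected)  # True
```

Because each undirected edge is stored once per direction, the 10556 entries reported for Cora would correspond to 5278 undirected citation links, assuming the edge list contains no duplicates.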

In [33]:
dataset[0].edge_index

tensor([[   0,    0,    0,  ..., 2707, 2707, 2707],
        [ 633, 1862, 2582,  ...,  598, 1473, 2706]])

In [34]:
len(dataset[0].edge_index)

2

In [50]:
import torch
torch.transpose(dataset[0].edge_index,0,1) #shows the source and the destination nodes of each edge in the graph

tensor([[   0,  633],
        [   0, 1862],
        [   0, 2582],
        ...,
        [2707,  598],
        [2707, 1473],
        [2707, 2706]])

In [35]:
dataset[0].edge_attr #Edge feature matrix (default: None)

In [36]:
dataset[0].pos #Node position matrix (default: None)

## Implementing a Graph Neural Network (GNN)

In the following cell, we implement a two-layer GNN. Each layer performs the following graph convolution operation ([Kipf & Welling (2017)](https://arxiv.org/abs/1609.02907)):

$$
\mathbf{x}_v^{(\ell + 1)} = \mathbf{W}^{(\ell + 1)} \sum_{w \in \mathcal{N}(v) \, \cup \, \{ v \}} \frac{1}{c_{w,v}} \cdot \mathbf{x}_w^{(\ell)}
$$

where $\mathbf{W}^{(\ell + 1)}$ denotes a trainable weight matrix of shape `[num_output_features, num_input_features]` and $c_{w,v}$ refers to a fixed normalization coefficient for each edge.
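
The aggregation in this formula can be traced by hand on a toy graph. The sketch below is plain Python and assumes the symmetric normalization of Kipf & Welling, $c_{w,v} = \sqrt{\deg(w)}\,\sqrt{\deg(v)}$, with degrees counted after adding self-loops. The three-node path graph, the scalar features and the weight of 1 are illustrative choices, not values taken from Cora.

```python
import math

# Toy undirected path graph 0 - 1 - 2, stored once per direction as in PyG.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
num_nodes = 3
x = [1.0, 2.0, 3.0]   # one scalar feature per node (assumption for illustration)
weight = 1.0          # stand-in for the trainable matrix W

# Add a self-loop to every node so that v is part of its own neighbourhood.
edges_with_loops = edges + [(v, v) for v in range(num_nodes)]

# Degree of each node, counted after adding self-loops.
degree = [0] * num_nodes
for _, dst in edges_with_loops:
    degree[dst] += 1

# Sum the normalized features of all neighbours (and the node itself),
# then apply the weight.
aggregated = [0.0] * num_nodes
for src, dst in edges_with_loops:
    c = math.sqrt(degree[src]) * math.sqrt(degree[dst])
    aggregated[dst] += x[src] / c

out = [weight * a for a in aggregated]
print(degree)                              # [2, 3, 2]
print([round(value, 4) for value in out])  # [1.3165, 2.2997, 2.3165]
```

In a real layer, `x` holds a feature vector per node and the weight is a matrix, but the per-edge normalization works the same way.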

PyG implements this layer in its [`GCNConv`](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GCNConv) class, specifically in its `forward` method, which takes the node feature matrix `x` and the COO graph connectivity `edge_index` as inputs.

The Graph Neural Network (GNN) architecture is defined in a class `GCN` that subclasses PyTorch's `torch.nn.Module`. The `GCN` class specifies the dimensions of the individual convolution layers and the order in which the convolution layers and the non-linear activations that follow them are applied.

In [51]:
from torch_geometric.nn import GCNConv
import torch.nn.functional as F

class GCN(torch.nn.Module):  # torch.nn.Module is the base class for all neural network modules in PyTorch
    def __init__(self):
        # Define and initialize the layers of the GNN.
        super().__init__()
        # GCNConv computes messages, aggregates them, and updates the node representations.
        # Its first argument is the number of input features per node,
        # its second the number of output features per node.
        self.conv1 = GCNConv(dataset.num_node_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, data):
        # Define the computation flow of the network.
        # 'x' is the node feature matrix; 'edge_index' is the graph connectivity in COO format.
        x, edge_index = data.x, data.edge_index

        # Invokes conv1.forward(x, edge_index).
        x = self.conv1(x, edge_index)
        # Apply the ReLU activation to the output of the first graph convolution.
        x = F.relu(x)

        # Randomly zero elements of 'x' with probability p (default: 0.5), sampled from a
        # Bernoulli distribution. Passing training=self.training makes dropout active only
        # in training mode; in evaluation mode it leaves 'x' unchanged.
        x = F.dropout(x, training=self.training)

        # Invokes conv2.forward(x, edge_index).
        x = self.conv2(x, edge_index)
        # Apply log-softmax over the class dimension, giving per-node log-probabilities.
        return F.log_softmax(x, dim=1)

Next, we choose the device on which the GNN and the dataset will be placed.

A **torch.device** is an object representing the device on which a torch.Tensor is or will be allocated. It contains a device type (for example 'cpu', 'cuda' or 'mps').

In [52]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Next, we move the GNN parameters to the chosen device.

In [53]:
model = GCN().to(device)

Also, we move the dataset to the chosen device.

In [54]:
data = dataset[0].to(device)

## Training the GNN

Then, we choose the optimization algorithm (from the *torch.optim* package) for training the parameters of our GNN. We pass *model.parameters()* to tell the optimizer which parameters (tensors) to optimize, and we also set the learning rate (lr) and the weight decay.

**NOTE:** During training, the weights (parameters) of the GNN are learnt, not the embeddings of the nodes. The GNN computes the node embeddings from the input node features using *neural message passing*; that is, it maps the input node features to new embeddings. In this sense, the GNN transforms the input graph into an output graph with the same connectivity but new node representations.

In [55]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
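
To give a feel for what `lr` and `weight_decay` do, the following dependency-free sketch applies a single Adam update to one scalar parameter. It assumes PyTorch's default `betas=(0.9, 0.999)` and `eps=1e-8` and the L2-style weight decay used by `torch.optim.Adam`. The helper `adam_first_step` is hypothetical and only mirrors the update rule; it is not a library function.

```python
import math

def adam_first_step(param, grad, lr=0.01, weight_decay=5e-4,
                    beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter, starting from zero moments."""
    # L2-style weight decay: add weight_decay * param to the gradient.
    grad = grad + weight_decay * param
    # Exponential moving averages of the gradient and of its square.
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    # Bias correction for step t = 1.
    m_hat = m / (1 - beta1 ** 1)
    v_hat = v / (1 - beta2 ** 1)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

new_param = adam_first_step(param=0.5, grad=2.0)
print(new_param)  # very close to 0.49
```

On the first step the bias-corrected moments reduce to the gradient and its square, so the parameter moves by roughly `lr` in the direction opposite to the gradient, regardless of the gradient's magnitude.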

Next, we set the GCN module (a torch.nn.Module) to training mode. '*model.train()*' simply sets the '*self.training*' flag to `True`, recursively for all submodules.

**Note**: Modules are in training mode by default, which is why many examples omit the '*model.train()*' call.

In [56]:
model.train()

GCN(
  (conv1): GCNConv(1433, 16)
  (conv2): GCNConv(16, 7)
)
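
The effect of the training flag is easiest to see on dropout itself. The sketch below uses only plain PyTorch; the tensor `x` and the `torch.nn.Linear` module named `layer` are illustrative stand-ins, not part of the GCN defined above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # only to make the random mask repeatable
x = torch.ones(8)

# Training mode: each element is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p).
train_out = F.dropout(x, p=0.5, training=True)

# Evaluation mode: dropout is the identity.
eval_out = F.dropout(x, p=0.5, training=False)

# .train() and .eval() flip the `training` flag that a forward pass
# can hand to F.dropout; every torch.nn.Module carries this flag.
layer = torch.nn.Linear(4, 2)
layer.eval()
flag_after_eval = layer.training    # False
layer.train()
flag_after_train = layer.training   # True

print(train_out)
print(eval_out)
print(flag_after_eval, flag_after_train)
```

With `p=0.5`, every entry of `train_out` is either 0 or 2, while `eval_out` equals the input.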

We train the GCN module for 200 epochs. 

In [57]:
for epoch in range(200):
    optimizer.zero_grad()  # Reset the gradients accumulated during the previous epoch
    out = model(data)  # Calls the 'forward' method of the class GCN; 'out' holds the log-probabilities of every node
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])  # Negative log-likelihood (NLL) loss between the predictions ('out') and the true labels ('data.y'), computed on the training nodes only
    loss.backward()  # Backward pass: compute the gradients of the loss w.r.t. the learnable parameters
    optimizer.step()  # Update the parameters using the computed gradients
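
The loss in this loop combines two steps: the model ends with a log-softmax, and `F.nll_loss` then averages the negative log-probability assigned to each node's true class. The plain-Python sketch below reproduces that arithmetic on made-up scores. The helper names `log_softmax` and `nll_loss` are local re-implementations for illustration, and the mean reduction is assumed to match the default of `F.nll_loss`.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax of one row of raw class scores."""
    shift = max(scores)
    log_norm = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return [s - log_norm for s in scores]

def nll_loss(log_prob_rows, targets):
    """Mean negative log-probability assigned to the true class of each row."""
    picked = [row[t] for row, t in zip(log_prob_rows, targets)]
    return -sum(picked) / len(picked)

raw_scores = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # two nodes, three classes
targets = [0, 2]                                   # true class of each node

log_probs = [log_softmax(row) for row in raw_scores]
loss = nll_loss(log_probs, targets)
print(loss)
```

The loss shrinks towards 0 as the log-probability of the true class approaches 0, that is, as its probability approaches 1.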

**Output of the trained GNN (log-probabilities of the 7 classes for each node):**

In [94]:
model(data) #model(data) is the tensor containing, for each node, the log-probabilities of the 7 classes computed by the GNN

tensor([[-7.5913e+00, -9.7466e+00, -8.1143e+00,  ..., -7.7475e+00,
         -8.4288e+00, -7.6125e+00],
        [-1.0887e+01, -1.3381e+01, -1.6949e+01,  ..., -2.5868e-05,
         -1.3200e+01, -1.2966e+01],
        [-8.3471e+00, -1.0761e+01, -1.2356e+01,  ..., -8.2685e-04,
         -1.0384e+01, -1.0233e+01],
        ...,
        [-1.8628e+00, -1.1648e+00, -6.6731e+00,  ..., -5.1301e+00,
         -8.1067e-01, -2.7729e+00],
        [-6.0501e+00, -7.4291e+00, -6.4877e+00,  ..., -5.0024e+00,
         -6.2770e+00, -7.9431e+00],
        [-5.6314e+00, -6.7476e+00, -5.7861e+00,  ..., -4.9334e+00,
         -5.6416e+00, -7.4269e+00]], device='cuda:0',
       grad_fn=<LogSoftmaxBackward0>)

## Testing the trained GNN

Next, we switch the GCN module (torch.nn.Module) to evaluation mode, which disables dropout. This is equivalent to executing: ***model.train(mode=False)***.

In [58]:
model.eval()

GCN(
  (conv1): GCNConv(1433, 16)
  (conv2): GCNConv(16, 7)
)

In [96]:
model(data)[0]

tensor([-7.5913e+00, -9.7466e+00, -8.1143e+00, -2.0090e-03, -7.7475e+00,
        -8.4288e+00, -7.6125e+00], device='cuda:0', grad_fn=<SelectBackward0>)

In [99]:
model(data)[0].argmax(dim=0)

tensor(3, device='cuda:0')

In [100]:
model(data)[1]

tensor([-1.0887e+01, -1.3381e+01, -1.6949e+01, -1.3455e+01, -2.5868e-05,
        -1.3200e+01, -1.2966e+01], device='cuda:0', grad_fn=<SelectBackward0>)

In [101]:
model(data)[1].argmax(dim=0)

tensor(4, device='cuda:0')

In [102]:
model(data)[0:2]

tensor([[-7.5913e+00, -9.7466e+00, -8.1143e+00, -2.0090e-03, -7.7475e+00,
         -8.4288e+00, -7.6125e+00],
        [-1.0887e+01, -1.3381e+01, -1.6949e+01, -1.3455e+01, -2.5868e-05,
         -1.3200e+01, -1.2966e+01]], device='cuda:0', grad_fn=<SliceBackward0>)

In [103]:
model(data)[0:2].argmax(dim=0) #dim=0 => row index of the largest value in each column (class)

tensor([0, 0, 0, 0, 1, 0, 0], device='cuda:0')

In [104]:
model(data)[0:2].argmax(dim=1) #dim=1 => column index of the largest value in each row (one row per node)

tensor([3, 4], device='cuda:0')

We obtain the predicted label of each node by taking the index of the largest log-probability in its output row. Each of the 7 entries in a node's output row corresponds to one of the 7 possible labels.

In [122]:
pred = model(data).argmax(dim=1)
pred #tensor containing the predicted class index (argmax of the output row) of each node in the graph

tensor([3, 4, 4,  ..., 5, 3, 3], device='cuda:0')
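
The difference between `dim=0` and `dim=1` can be checked by hand on the two output rows printed earlier. The sketch below is plain Python and re-implements the argmax for illustration; the variable names are assumptions, while the numbers are copied from the outputs of nodes 0 and 1 above.

```python
import math

# Log-probabilities of nodes 0 and 1, copied from the output above.
log_probs = [
    [-7.5913e+00, -9.7466e+00, -8.1143e+00, -2.0090e-03, -7.7475e+00, -8.4288e+00, -7.6125e+00],
    [-1.0887e+01, -1.3381e+01, -1.6949e+01, -1.3455e+01, -2.5868e-05, -1.3200e+01, -1.2966e+01],
]

# dim=1: for every row (node), the column (class) with the largest value.
per_node = [row.index(max(row)) for row in log_probs]

# dim=0: for every column (class), the row (node) with the largest value.
per_class = [
    max(range(len(log_probs)), key=lambda r: log_probs[r][c])
    for c in range(len(log_probs[0]))
]

# exp() turns the winning log-probability back into a probability.
confidence = [math.exp(max(row)) for row in log_probs]

print(per_node)   # [3, 4]
print(per_class)  # [0, 0, 0, 0, 1, 0, 0]
print(confidence)
```

Only the `dim=1` result is a per-node class prediction, which is why `pred` is computed with `argmax(dim=1)`.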

In [107]:
len(pred)

2708

In [89]:
data.test_mask #boolean tensor that marks each of the 2708 nodes of the graph as a test sample (True) or a non-test sample (False)

tensor([False, False, False,  ...,  True,  True,  True], device='cuda:0')

In [108]:
len(data.test_mask)

2708

In [109]:
pred[data.test_mask]

tensor([1, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2,
        2, 1, 2, 1, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 5, 5, 4, 4, 4, 1, 1, 3, 0, 1, 1, 6,
        2, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5,
        5, 5, 2, 2, 2, 2, 2, 6, 6, 3, 0, 0, 0, 0, 5, 0, 0, 0, 3, 0, 0, 6, 0, 6,
        3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 4, 4, 4, 4, 4, 3, 2, 5, 5, 5, 5,
        6, 5, 5, 5, 5, 6, 4, 4, 0, 3, 1, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0,
        0, 0, 0, 0, 3, 4, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
        6, 6, 5, 6, 6, 0, 5, 5, 5, 0, 5, 4, 4, 4, 3, 3, 3, 3, 3, 1, 3, 3, 3, 6,
        3, 3, 1, 3, 3, 4, 4, 4, 3, 3, 3,

In [110]:
len(pred[data.test_mask])

1000

In [111]:
data

Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])

In [112]:
data.y

tensor([3, 4, 4,  ..., 3, 3, 3], device='cuda:0')

In [113]:
len(data.y)

2708

In [114]:
data.y[data.test_mask]

tensor([3, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2,
        2, 2, 2, 1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 5, 2, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 3, 4, 4, 4, 4, 1, 1, 3, 1, 0, 3, 0,
        2, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5,
        5, 5, 2, 2, 2, 2, 1, 6, 6, 3, 0, 0, 5, 0, 5, 0, 3, 5, 3, 0, 0, 6, 0, 6,
        3, 3, 1, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 4, 4, 4, 0, 3, 3, 2, 5, 5, 5, 5,
        6, 5, 5, 5, 5, 0, 4, 4, 4, 0, 0, 5, 0, 0, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0,
        3, 0, 0, 0, 3, 3, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 3, 5, 5, 5, 5, 3, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
        6, 6, 5, 6, 6, 3, 5, 5, 5, 0, 5, 0, 4, 4, 3, 3, 3, 2, 2, 1, 3, 3, 3, 3,
        3, 3, 5, 3, 3, 4, 4, 3, 3, 3, 3,

In [116]:
len(data.y[data.test_mask])

1000

Compute the total number of correct predictions on the test nodes:

In [72]:
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
correct

tensor(807, device='cuda:0')

Finally, we compute and print the test accuracy of the GNN.

In [121]:
acc = int(correct) / int(data.test_mask.sum())
print(f'Accuracy: {acc:.4f}')

Accuracy: 0.8070
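
The same computation can be wrapped in a small helper so that it works with any boolean mask. This is a minimal sketch on toy tensors: the function name `masked_accuracy` is hypothetical, and the three tensors are made-up stand-ins for `pred`, `data.y` and `data.test_mask`.

```python
import torch

def masked_accuracy(pred, y, mask):
    """Fraction of nodes selected by `mask` whose prediction equals the label."""
    correct = (pred[mask] == y[mask]).sum()
    return int(correct) / int(mask.sum())

# Toy stand-ins for `pred`, `data.y` and `data.test_mask` (illustrative values).
pred = torch.tensor([3, 4, 4, 1, 0, 2])
y = torch.tensor([3, 4, 2, 1, 5, 2])
test_mask = torch.tensor([False, False, True, True, True, True])

acc = masked_accuracy(pred, y, test_mask)
print(f'Accuracy: {acc:.4f}')  # Accuracy: 0.5000
```

Passing a validation mask instead of the test mask would give the validation accuracy in the same way.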
