<a href="https://colab.research.google.com/github/radu93/graph_learning/blob/main/Node_Classification_with_GAT_Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Node Classification



## Initial imports

In [None]:
import os
import torch
os.environ['TORCH'] = torch.__version__
print(torch.__version__)

!pip install -q torch-scatter -f https://data.pyg.org/whl/torch-${TORCH}.html
!pip install -q torch-sparse -f https://data.pyg.org/whl/torch-${TORCH}.html
!pip install -q git+https://github.com/pyg-team/pytorch_geometric.git

1.11.0+cu113


In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

## Implementing the forward method

### Linear Transformation

Consider the node $i$ with feature vector $\bar{h}_i\in\mathbb R^{F}$. We apply a linear transformation to get a new feature vector $\bar{h'}_i \in \mathbb{R}^{F'}$:
$$
\bar{h'}_i = \textbf{W}\cdot \bar{h}_i
$$
with $\textbf{W}\in\mathbb R^{F'\times F}$.

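The following code applies a sample linear transformation to *nb_nodes* nodes at once. Note that the code stores the weights as a matrix of shape `(in_features, out_features)`, i.e. $\textbf{W}^T$, and multiplies the row-stacked node features by it from the right.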

In [None]:
in_features = 5
out_features = 2
nb_nodes = 3

# Weight matrix, filled in by Xavier (Glorot) uniform initialization
W = nn.Parameter(torch.zeros(size=(in_features, out_features)))
nn.init.xavier_uniform_(W.data, gain=1.414)

print(W)

input = torch.rand(nb_nodes, in_features) 

print(input)

# linear transformation
h = torch.mm(input, W)
N = h.size()[0]

print(h.shape)

Parameter containing:
tensor([[-0.4490, -0.4999],
        [ 1.0120, -0.7833],
        [ 0.5757,  1.1265],
        [ 0.2161, -0.8341],
        [ 1.2974, -0.3723]], requires_grad=True)
tensor([[0.0161, 0.4029, 0.8167, 0.1829, 0.5342],
        [0.6370, 0.5044, 0.1990, 0.0086, 0.0902],
        [0.4763, 0.8376, 0.4879, 0.8873, 0.1465]])
torch.Size([3, 2])
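
The formula above treats each $\bar{h}_i$ as a column vector, while the code stacks the nodes as rows and multiplies by a matrix of shape `(in_features, out_features)`. The following sketch checks that the two conventions agree. The names `X`, `W_code`, and `H_per_node` are illustrative and do not appear in the original notebook.

```python
import torch

torch.manual_seed(0)  # fixed seed, only to make the sketch reproducible

in_features, out_features, nb_nodes = 5, 2, 3
W_code = torch.rand(in_features, out_features)   # shape (F, F'), as stored in the code
X = torch.rand(nb_nodes, in_features)            # one node per row

# Batched form used in the notebook: all nodes at once
H = X @ W_code                                   # shape (N, F')

# Per-node form from the formula: h'_i = W h_i, with W = W_code^T of shape (F', F)
H_per_node = torch.stack([W_code.T @ X[i] for i in range(nb_nodes)])

conventions_agree = torch.allclose(H, H_per_node, atol=1e-6)
print(H.shape, conventions_agree)
```

The batched form is the one to prefer in practice, since it replaces a Python loop with a single matrix multiplication.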


### Attention Mechanism

This section walks through the attention mechanism step by step. Its purpose is to learn a weight for each pair of neighboring nodes.

![title](https://github.com/AntonioLonga/PytorchGeometricTutorial/blob/main/Tutorial3/AttentionMechanism.png?raw=1)

Initialize the attention weights vector $a\in\mathbb{R}^{2F'}$. This vector is shared across all node pairs and acts like a global learnable filter that scores each pair of neighboring nodes.

In [None]:
# Attention vector, filled in by Xavier (Glorot) uniform initialization
a = nn.Parameter(torch.zeros(size=(2*out_features, 1)))
nn.init.xavier_uniform_(a.data, gain=1.414)
print(a.shape)
print(a)

leakyrelu = nn.LeakyReLU(0.2)  # LeakyReLU with negative slope 0.2

torch.Size([4, 1])
Parameter containing:
tensor([[-1.1643],
        [-0.9359],
        [ 1.2391],
        [ 1.1750]], requires_grad=True)


Compute $a_{input}\in\mathbb R^{N \times N \times 2F'}$, where $N$ is the number of nodes in the graph. Entry $(i, j)$ holds the concatenation of the transformed feature vectors of nodes $i$ and $j$, so the tensor covers all possible pairs of nodes.

In [None]:
a_input = torch.cat([h.repeat(1, N).view(N * N, -1), h.repeat(N, 1)], dim=1).view(N, -1, 2 * out_features)

print(a_input)

tensor([[[ 1.6033,  0.2450,  1.6033,  0.2450],
         [ 1.6033,  0.2450,  0.4578, -0.5301],
         [ 1.6033,  0.2450,  1.2966, -1.1393]],

        [[ 0.4578, -0.5301,  1.6033,  0.2450],
         [ 0.4578, -0.5301,  0.4578, -0.5301],
         [ 0.4578, -0.5301,  1.2966, -1.1393]],

        [[ 1.2966, -1.1393,  1.6033,  0.2450],
         [ 1.2966, -1.1393,  0.4578, -0.5301],
         [ 1.2966, -1.1393,  1.2966, -1.1393]]], grad_fn=<ViewBackward0>)


![title](https://github.com/AntonioLonga/PytorchGeometricTutorial/blob/main/Tutorial3/a_input.png?raw=1)
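
The `repeat`/`view` expression that builds `a_input` is compact but hard to read. As a sanity check, the sketch below rebuilds the same tensor with broadcasting and confirms that entry `[i, j]` is the concatenation of `h[i]` and `h[j]`. The broadcasting variant and the names `h_i`, `h_j`, and `a_input_alt` are assumptions added for illustration, not part of the original notebook.

```python
import torch

torch.manual_seed(0)  # fixed seed, only to make the sketch reproducible
N, out_features = 3, 2
h = torch.rand(N, out_features)

# Construction used in the notebook
a_input = torch.cat(
    [h.repeat(1, N).view(N * N, -1), h.repeat(N, 1)], dim=1
).view(N, -1, 2 * out_features)

# Equivalent construction via broadcasting (illustrative alternative)
h_i = h.unsqueeze(1).expand(N, N, out_features)  # h_i[i, j] == h[i]
h_j = h.unsqueeze(0).expand(N, N, out_features)  # h_j[i, j] == h[j]
a_input_alt = torch.cat([h_i, h_j], dim=2)

same = torch.equal(a_input, a_input_alt)
pair_ok = torch.equal(a_input[1, 2], torch.cat([h[1], h[2]]))
print(a_input.shape, same, pair_ok)
```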

The multiplication $a_{input} \times a$ applies the filter $a$ to every pair of nodes. Note that this example does not yet use an adjacency matrix, so each node is treated as connected to every other node (as in a complete graph).
The result is passed through a LeakyReLU and stored in $e\in\mathbb{R}^{N \times N}$.

In [None]:
e = leakyrelu(torch.matmul(a_input, a).squeeze(2))

In [None]:
print(a_input.shape, a.shape)
print("")
print(torch.matmul(a_input, a).shape)
print("")
print(torch.matmul(a_input, a).squeeze(2).shape)

torch.Size([3, 3, 4]) torch.Size([4, 1])

torch.Size([3, 3, 1])

torch.Size([3, 3])


In [None]:
print(e)
print(e.shape)

tensor([[ 0.1786, -0.4303, -0.3656],
        [ 2.2377, -0.0185,  0.2311],
        [ 1.8312, -0.0998, -0.0351]], grad_fn=<LeakyReluBackward0>)
torch.Size([3, 3])
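
Because $a$ is applied to a concatenation, the score of a pair splits into one term per node: $a^T[\bar{h'}_i \,\|\, \bar{h'}_j] = a_1^T\bar{h'}_i + a_2^T\bar{h'}_j$, where $a_1$ and $a_2$ are the first and second halves of $a$. The sketch below uses this to compute `e` without materializing the $N \times N \times 2F'$ tensor. This is an optional optimization under that identity, not the notebook's method, and `src_score`, `dst_score`, and `e_split` are illustrative names.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed, only to make the sketch reproducible
N, out_features = 3, 2
h = torch.rand(N, out_features)
a = torch.rand(2 * out_features, 1)
leakyrelu = nn.LeakyReLU(0.2)

# Reference: pairwise tensor, as in the notebook
a_input = torch.cat(
    [h.repeat(1, N).view(N * N, -1), h.repeat(N, 1)], dim=1
).view(N, -1, 2 * out_features)
e = leakyrelu(torch.matmul(a_input, a).squeeze(2))

# Split form: one score per node for each half of a, combined by broadcasting
src_score = h @ a[:out_features]               # shape (N, 1), contribution of node i
dst_score = h @ a[out_features:]               # shape (N, 1), contribution of node j
e_split = leakyrelu(src_score + dst_score.T)   # shape (N, N)

scores_match = torch.allclose(e, e_split, atol=1e-6)
print(e_split.shape, scores_match)
```

The split form needs memory proportional to $N^2$ rather than $N^2 \cdot 2F'$, which matters for graphs much larger than this three-node example.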


### Masked Attention

Let's now consider a random adjacency matrix, in order to get closer to a real graph.

In [None]:
# Masked Attention
adj = torch.randint(2, (3, 3))

print('adj:')
print(adj)

# Large negative filler (despite the name, not zeros): softmax maps these entries to 0
zero_vec  = -9e15 * torch.ones_like(e)
print(zero_vec.shape)

adj:
tensor([[1, 0, 1],
        [0, 1, 0],
        [1, 1, 1]])
torch.Size([3, 3])


The previous result $e\in\mathbb{R}^{N \times N}$ contains a score for every possible edge between the $N$ graph nodes. We keep from $e$ only the scores of edges present in the adjacency matrix and replace all other entries with a large negative value.

In [None]:
attention = torch.where(adj > 0, e, zero_vec)
print(adj,"\n",e,"\n",zero_vec)
print('attention:\n', attention)
print('attention shape:\n', attention.shape)


tensor([[1, 0, 1],
        [0, 1, 0],
        [1, 1, 1]]) 
 tensor([[ 0.1786, -0.4303, -0.3656],
        [ 2.2377, -0.0185,  0.2311],
        [ 1.8312, -0.0998, -0.0351]], grad_fn=<LeakyReluBackward0>) 
 tensor([[-9.0000e+15, -9.0000e+15, -9.0000e+15],
        [-9.0000e+15, -9.0000e+15, -9.0000e+15],
        [-9.0000e+15, -9.0000e+15, -9.0000e+15]])
attention:
 tensor([[ 1.7862e-01, -9.0000e+15, -3.6559e-01],
        [-9.0000e+15, -1.8499e-02, -9.0000e+15],
        [ 1.8312e+00, -9.9797e-02, -3.5074e-02]], grad_fn=<SWhereBackward0>)
attention shape:
 torch.Size([3, 3])


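Now pass the result through a softmax over each row (`dim=1`), so that the attention weights of every node sum to 1 and the masked entries become 0. Multiplying the normalized weights by $h$ then aggregates the neighbors' features into the new node features `h_prime`.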

In [None]:
attention = F.softmax(attention, dim=1)
h_prime   = torch.matmul(attention, h)

In [None]:
attention

tensor([[0.6328, 0.0000, 0.3672],
        [0.0000, 1.0000, 0.0000],
        [0.7694, 0.1116, 0.1190]], grad_fn=<SoftmaxBackward0>)
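
To see why the `-9e15` filler yields a weight of exactly 0, the following plain-Python sketch recomputes the first row of `attention` by hand from the first rows of `e` and `adj` printed above. The helper `masked_softmax_row` is hypothetical and written only for this illustration.

```python
import math

def masked_softmax_row(scores, mask, fill=-9e15):
    """Softmax over one row; entries whose mask is 0 are replaced by `fill`."""
    masked = [s if m > 0 else fill for s, m in zip(scores, mask)]
    top = max(masked)                           # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in masked]  # exp of a huge negative number underflows to 0.0
    total = sum(exps)
    return [v / total for v in exps]

# First row of e and first row of adj from the outputs above
weights = masked_softmax_row([0.1786, -0.4303, -0.3656], [1, 0, 1])
print([round(w, 4) for w in weights])
```

The printed weights agree with the first row of `attention` up to rounding, and the masked middle entry contributes nothing to the sum.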

In [None]:
h_prime

tensor([[ 1.4907, -0.2633],
        [ 0.4578, -0.5301],
        [ 1.4390, -0.0062]], grad_fn=<MmBackward0>)

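Comparing `h_prime` with `h` shows the effect of the attention filter on the node features. The node with index 1, which is connected only to itself in this adjacency matrix, keeps its features unchanged, while the other nodes become weighted mixes of their neighbors' features: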

In [None]:
print(h_prime,"\n",h)

tensor([[ 1.4907, -0.2633],
        [ 0.4578, -0.5301],
        [ 1.4390, -0.0062]], grad_fn=<MmBackward0>) 
 tensor([[ 1.6033,  0.2450],
        [ 0.4578, -0.5301],
        [ 1.2966, -1.1393]], grad_fn=<MmBackward0>)


# Building the Attentional Layer

This part puts together all the building blocks and implements a Graph Attentional Layer as a PyTorch `nn.Module`.

In [None]:
class GATLayer(nn.Module):
    def __init__(self, in_features, out_features, dropout, alpha, concat=True):
        super(GATLayer, self).__init__()
        self.dropout       = dropout        # dropout probability (e.g. 0.6)
        self.in_features   = in_features    # size of each input node feature vector (F)
        self.out_features  = out_features   # size of each output node feature vector (F')
        self.alpha         = alpha          # negative slope of the LeakyReLU (e.g. 0.2)
        self.concat        = concat         # concat = True for all layers except the output layer

        
        # Xavier Initialization of Weights
        # Alternatively use weights_init to apply weights of choice 
        self.W = nn.Parameter(torch.zeros(size=(in_features, out_features)))
        nn.init.xavier_uniform_(self.W.data, gain=1.414)
        
        self.a = nn.Parameter(torch.zeros(size=(2*out_features, 1)))
        nn.init.xavier_uniform_(self.a.data, gain=1.414)
        
        # LeakyReLU
        self.leakyrelu = nn.LeakyReLU(self.alpha)

    def forward(self, input, adj):
        # Linear Transformation
        h = torch.mm(input, self.W) # matrix multiplication
        N = h.size()[0]

        # Attention Mechanism
        a_input = torch.cat([h.repeat(1, N).view(N * N, -1), h.repeat(N, 1)], dim=1).view(N, -1, 2 * self.out_features)
        e       = self.leakyrelu(torch.matmul(a_input, self.a).squeeze(2))

        # Masked Attention
        zero_vec  = -9e15*torch.ones_like(e)
        attention = torch.where(adj > 0, e, zero_vec)
        
        attention = F.softmax(attention, dim=1)
        attention = F.dropout(attention, self.dropout, training=self.training)
        h_prime   = torch.matmul(attention, h)

        if self.concat:
            return F.elu(h_prime)
        else:
            return h_prime
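
One consequence of the masking used in `forward` is worth checking: if a row of `adj` is all zeros, every entry in that row is filled with the same large negative value, and the softmax over equal values is uniform. The node then attends equally to all nodes in the graph instead of to none. The sketch below reproduces this on a toy graph and shows that adding self-loops avoids it. The helper `masked_softmax` and the toy tensors are assumptions made for illustration, not part of the original notebook.

```python
import torch
import torch.nn.functional as F

def masked_softmax(e, adj):
    """Same masking and normalization as in GATLayer.forward."""
    masked = torch.where(adj > 0, e, torch.full_like(e, -9e15))
    return F.softmax(masked, dim=1)

# Toy graph: nodes 0 and 1 are linked, node 2 has no edges at all
adj = torch.tensor([[0, 1, 0],
                    [1, 0, 0],
                    [0, 0, 0]])
e = torch.tensor([[0.5,  1.0, -0.3],
                  [0.2,  0.1,  0.4],
                  [2.0, -1.0,  0.3]])

att_no_loops = masked_softmax(e, adj)

# Adding self-loops guarantees at least one unmasked entry per row
adj_loops = adj + torch.eye(3, dtype=adj.dtype)
att_loops = masked_softmax(e, adj_loops)

print(att_no_loops[2])  # uniform over all three nodes
print(att_loops[2])     # all weight on node 2 itself
```

When feeding a graph without self-loops to this layer, adding the identity to `adj` first is one simple way to keep each node's own features in the aggregation.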

# Testing

## Cora Dataset

We will use **Cora**, a citation graph containing 2708 scientific publications. For each publication there is a 1433-dimensional feature vector, which is a bag-of-words representation (with a small, fixed dictionary) of the paper text. The edges in this graph represent citations, and are commonly treated as undirected. The goal is to classify each paper into one of seven classes (topics).

Load and inspect the Cora dataset using the PyTorch Geometric library:

In [None]:
from torch_geometric.data import Data
from torch_geometric.nn import GATConv
from torch_geometric.datasets import Planetoid
import torch_geometric.transforms as T

import matplotlib.pyplot as plt

name_data = 'Cora'
dataset = Planetoid(root= '/tmp/' + name_data, name = name_data)
dataset.transform = T.NormalizeFeatures()

print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')
print()

g0_data = dataset[0]
print(g0_data)
print(f'Number of nodes: {g0_data.num_nodes}')
print(f'Number of edges: {g0_data.num_edges}')
print(f'Average node degree: {g0_data.num_edges / g0_data.num_nodes:.2f}')
print(f'Number of training nodes: {g0_data.train_mask.sum()}')
print(f'Number of validation nodes: {g0_data.val_mask.sum()}')
print(f'Number of test nodes: {g0_data.test_mask.sum()}')
print()

print('Train mask:')
print(g0_data.train_mask)
print('Edge index:')
print(g0_data.edge_index)
print()

# Some more utility functions on a Data object 
print(f'Contains isolated nodes: {g0_data.has_isolated_nodes()}')
print(f'Contains self-loops: {g0_data.has_self_loops()}')
print(f'Is undirected: {g0_data.is_undirected()}')

Number of graphs: 1
Number of features: 1433
Number of classes: 7

Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])
Number of nodes: 2708
Number of edges: 10556
Average node degree: 3.90
Number of training nodes: 140
Number of validation nodes: 500
Number of test nodes: 1000

Train mask:
tensor([ True,  True,  True,  ..., False, False, False])
Edge index:
tensor([[   0,    0,    0,  ..., 2707, 2707, 2707],
        [ 633, 1862, 2582,  ...,  598, 1473, 2706]])

Contains isolated nodes: False
Contains self-loops: False
Is undirected: True
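
The `GATLayer` defined earlier expects a dense adjacency matrix `adj`, whereas the dataset stores the graph as an `edge_index` tensor of shape `[2, num_edges]`, with each undirected edge listed once per direction. The following sketch shows one possible conversion on a toy graph. It is an assumption about how the two parts could be bridged, since the notebook itself does not show this step, and the toy `edge_index` is made up for illustration.

```python
import torch

num_nodes = 4
# Toy edge list in the same layout as g0_data.edge_index: row 0 = source, row 1 = target.
# The undirected edges 0-1 and 1-2 each appear in both directions; node 3 is isolated.
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])

# Dense adjacency matrix: adj[i, j] = 1 if there is an edge i -> j
adj = torch.zeros(num_nodes, num_nodes, dtype=torch.long)
adj[edge_index[0], edge_index[1]] = 1

# Optional: add self-loops so every node can attend to itself
adj_with_loops = adj.clone()
adj_with_loops.fill_diagonal_(1)

# Same ratio as the "Average node degree" printed above
avg_degree = edge_index.size(1) / num_nodes
print(adj)
print(avg_degree)
```

For Cora the dense matrix would have 2708 × 2708 entries, which is still manageable. PyTorch Geometric also provides `torch_geometric.utils.to_dense_adj` for this conversion.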


## GAT Network

We use a GAT Network with 2 Attentional Layers.

The first layer uses multi-head attention. This means it applies the attentional convolution multiple times independently and concatenates the results. Its output therefore has $8 \times 8 = 64$ features for each node (8 heads with 8 output features each).

The second layer takes the 64 concatenated features and outputs 7 values per node, one for each class.
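
The shape arithmetic of the two layers can be sketched with random tensors standing in for the per-head outputs. This only illustrates how head outputs are combined (concatenation in the hidden layer, averaging when `concat=False`). It does not compute any attention, and all names in it are illustrative.

```python
import torch

torch.manual_seed(0)  # fixed seed, only to make the sketch reproducible
num_nodes, heads, hidden, num_classes = 5, 8, 8, 7

# Hidden layer: each head yields (num_nodes, hidden); the heads are concatenated
head_outputs = [torch.rand(num_nodes, hidden) for _ in range(heads)]
hidden_out = torch.cat(head_outputs, dim=1)                  # shape (5, 64)

# Output layer with concat=False: the heads are averaged instead.
# With a single output head the average is that head's output.
out_heads = 1
class_outputs = [torch.rand(num_nodes, num_classes) for _ in range(out_heads)]
final_out = torch.stack(class_outputs, dim=0).mean(dim=0)    # shape (5, 7)

print(hidden_out.shape, final_out.shape)
```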

For the first test, we use the `GATConv` layer provided by PyTorch Geometric.

In [None]:
class GAT(torch.nn.Module):
    def __init__(self):
        super(GAT, self).__init__()
        self.hid = 8
        self.in_head = 8
        self.out_head = 1
        
        self.conv1 = GATConv(dataset.num_features, self.hid, heads=self.in_head, dropout=0.6)
        self.conv2 = GATConv(self.hid*self.in_head, dataset.num_classes, concat=False,
                             heads=self.out_head, dropout=0.6)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
                
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv1(x, edge_index)
        x = F.elu(x)
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv2(x, edge_index)
        
        return F.log_softmax(x, dim=1)

We train the model using an Adam optimizer for 1000 epochs. Each epoch feeds the entire Cora graph into the network. The computed loss takes into consideration only the nodes from the training set.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = "cpu"  # forces CPU; comment this line out to use the GPU when available

model = GAT().to(device)
print(model)

data = dataset[0].to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-4)

model.train()
for epoch in range(1000):
    model.train()
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    
    if epoch%200 == 0:
        print(loss)
    
    loss.backward()
    optimizer.step()
    
    

GAT(
  (conv1): GATConv(1433, 8, heads=8)
  (conv2): GATConv(64, 7, heads=1)
)
tensor(1.9452, grad_fn=<NllLossBackward0>)
tensor(0.6751, grad_fn=<NllLossBackward0>)
tensor(0.5434, grad_fn=<NllLossBackward0>)
tensor(0.5658, grad_fn=<NllLossBackward0>)
tensor(0.5649, grad_fn=<NllLossBackward0>)
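
The loss in the training loop is computed on the whole graph's output but restricted to the training nodes by boolean indexing. The toy sketch below shows what that indexing does: `F.nll_loss` averages the negative log-probability of the correct class over the masked nodes only. The tensors `logits`, `y`, and `mask` are made up for illustration.

```python
import torch
import torch.nn.functional as F

# Three nodes, two classes; only the first two nodes belong to the "training set"
logits = torch.tensor([[2.0, 0.0],
                       [0.0, 2.0],
                       [1.0, 1.0]])
y = torch.tensor([0, 1, 0])
mask = torch.tensor([True, True, False])

log_probs = F.log_softmax(logits, dim=1)      # same output form as the GAT model
loss = F.nll_loss(log_probs[mask], y[mask])   # mean over the masked nodes only

# Manual check: average negative log-probability of the true class for nodes 0 and 1
manual = -(log_probs[0, 0] + log_probs[1, 1]) / 2
print(loss.item(), torch.isclose(loss, manual).item())
```

Node 2 still takes part in the forward pass, so it can influence its neighbors through message passing, but its label never enters the loss.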


Evaluate the model's predictions for the nodes in the test set. The run recorded below reaches 82.2% accuracy; the exact figure varies between runs because of random initialization and dropout.

In [None]:
model.eval()
_, pred = model(data).max(dim=1)
correct = float(pred[data.test_mask].eq(data.y[data.test_mask]).sum().item())
acc = correct / data.test_mask.sum().item()
print('Accuracy: {:.4f}'.format(acc))

print(pred[data.test_mask])

Accuracy: 0.8220
tensor([2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 6, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 6,
        2, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5,
        5, 5, 2, 2, 2, 2, 2, 6, 6, 3, 0, 0, 0, 0, 5, 0, 0, 0, 3, 0, 0, 6, 0, 0,
        3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3,
        3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 4, 4, 4, 4, 4, 3, 2, 5, 5, 5, 5,
        6, 5, 5, 5, 5, 6, 4, 4, 0, 0, 1, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0,
        0, 0, 0, 0, 3, 4, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
        6, 6, 5, 6, 6, 0, 5, 5, 5, 0, 5, 4, 4, 0, 3, 3, 3, 2, 3, 1, 3, 3, 3, 2,
        3, 3, 1, 3, 2, 