# Graph Neural Networks

Graph Neural Networks (GNNs) are currently the most popular approach to machine learning on graphs. Many GNN architectures can be unified by the Message-Passing Neural Networks (MPNNs) framework. Below we will describe (a variant of) this framework and implement and train several examples of MPNNs.

First, let's introduce the notation we will be using in this notebook. Let $G = (V, E)$ be a simple undirected graph without self-loops with node set $V$ and edge set $E$, $|V| = n$, $|E| = m$. Sometimes it will also be handy to use the set $E_{dir}$ of directed edges, $|E_{dir}| = 2m$. Let $N(v)$ be the set of one-hop neighbors of node $v$, and $deg(v)$ be the degree of node $v$, $deg(v) = |N(v)|$. Let $A$ be the adjacency matrix of graph $G$ and $D$ be the diagonal degree matrix of graph $G$, i.e., $D = diag \Big( deg(v_1), \; deg(v_2), \; ..., \; deg(v_n) \Big)$.
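To make the notation concrete, here is a minimal NumPy sketch on a hypothetical 4-node toy graph (the graph and all variable names are made up for illustration and are not part of the dataset used later).

```python
import numpy as np

# Hypothetical toy graph: 4 nodes, undirected edges without self-loops.
n = 4
E = [(0, 1), (0, 2), (1, 2)]          # node 3 is isolated
m = len(E)

# E_dir stores every undirected edge in both directions, so |E_dir| = 2m.
E_dir = E + [(j, i) for i, j in E]

# Adjacency matrix A and one-hop neighborhoods N(v).
A = np.zeros((n, n), dtype=int)
for i, j in E_dir:
    A[i, j] = 1
N = {v: set(np.flatnonzero(A[v]).tolist()) for v in range(n)}

# deg(v) = |N(v)|, and D is the diagonal matrix of degrees.
deg = A.sum(axis=1)
D = np.diag(deg)

print(len(E_dir), N, deg)
```

Because the graph is undirected, $A$ is symmetric, and row sums and column sums both give the degrees.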

In each layer $l$, an MPNN creates a representation $x_i^l$ of each node $v_i$ from its previous-layer representation $x_i^{l-1}$ and the previous-layer representations of its neighbors. The formula for this transformation at layer $l+1$ is:

$$ x_i^{l+1} = \mathrm{Update} \Bigg( x_i^l, \; \mathrm{Aggregate} \Big( \Big\{ (x_i^l, \; x_j^l): \; v_j \in N(v_i) \Big\} \Big) \Bigg) $$

Here, $\mathrm{Aggregate}$ is a function that aggregates information from the set of neighbors (since it operates on a set, it should be invariant to the order of neighbors) and $\mathrm{Update}$ is a function that combines the node's previous-layer representation with the aggregated information from its neighbors. For example, $\mathrm{Aggregate}$ can be the elementwise mean operation over the set of neighbors and $\mathrm{Update}$ can be an MLP that takes two concatenated vectors as input:

$$ x_i^{l+1} = \mathrm{MLP} \Bigg( \bigg[ x_i^l \; \mathbin\Vert \; \mathrm{mean} \Big( \Big\{ x_j^l: \; v_j \in N(v_i) \Big\} \Big) \bigg] \Bigg) $$

(this is actually the first GNN that we will implement in this seminar).
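The formula above can be sketched directly with a loop over nodes. This is only an illustration in NumPy, assuming a hypothetical toy graph given as neighbor lists, random weights, and ReLU as the nonlinearity; the actual model in this notebook is built differently further below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy graph given as neighbor lists; every node has at least one neighbor.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
n, dim = 4, 3
X = rng.normal(size=(n, dim))          # x_i^l stacked row-wise

# Parameters of a two-layer MLP that acts on the concatenation [x_i || mean].
W1 = rng.normal(size=(2 * dim, dim))
W2 = rng.normal(size=(dim, dim))


def mlp(z):
    return np.maximum(z @ W1, 0.0) @ W2   # ReLU stands in for any nonlinearity


def mpnn_layer(X, neighbors):
    out = np.empty((len(X), W2.shape[1]))
    for i, nbrs in neighbors.items():
        aggregated = X[nbrs].mean(axis=0)                 # Aggregate: elementwise mean
        out[i] = mlp(np.concatenate([X[i], aggregated]))  # Update: MLP on the concatenation
    return out


X_next = mpnn_layer(X, neighbors)

# Aggregate acts on a set, so reordering each neighbor list must not change the result.
shuffled = {i: list(reversed(nbrs)) for i, nbrs in neighbors.items()}
assert np.allclose(X_next, mpnn_layer(X, shuffled))
```

The final check illustrates the permutation invariance of $\mathrm{Aggregate}$ mentioned above.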

The $\mathrm{Aggregate}$ operation is often called message passing, neighborhood aggregation, or graph convolution. Sometimes this operation is split into $\mathrm{Message}$ and $\mathrm{Reduce}$ functions.
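As a rough illustration of the $\mathrm{Message}$ / $\mathrm{Reduce}$ split, the sketch below works on a list of directed edges. The function names `message` and `reduce_mean` and the toy data are hypothetical and chosen for this example only.

```python
import numpy as np

# Hypothetical toy graph as directed edge pairs (src -> dst), each undirected edge stored twice.
src = np.array([0, 1, 0, 2, 1, 2])
dst = np.array([1, 0, 2, 0, 2, 1])
n = 3
X = np.array([[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]])


def message(X, src):
    # One message per directed edge: here simply a copy of the source representation.
    return X[src]


def reduce_mean(messages, dst, n):
    # Sum the messages arriving at each destination node, then divide by their count.
    total = np.zeros((n, messages.shape[1]))
    np.add.at(total, dst, messages)
    count = np.bincount(dst, minlength=n)[:, None]
    return total / np.maximum(count, 1)   # nodes without messages keep a zero vector


aggregated = reduce_mean(message(X, src), dst, n)
print(aggregated)
```

Swapping the body of `message` or the reduction changes the graph convolution without touching the rest of the layer.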

Note that variations of the above MPNN formula are possible. For example, edge representations can be added, but we won't do it in this seminar.

In [1]:
# !pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121

In [2]:
from tqdm.notebook import tqdm
import numpy as np
import torch
from torch import nn
from torch.nn import functional as F
from torch.cuda.amp import autocast, GradScaler
from sklearn.metrics import roc_auc_score

In [3]:
device = 'cuda:0'

Now, let's get a graph. The PyTorch Geometric library provides many popular graph datasets. We will use the Amazon-Computers dataset. It is a co-purchasing network where nodes represent products, edges indicate that two products are frequently bought together, node features are bag-of-words-encoded product reviews, and node labels are product categories. Note that this graph has multiple connected components and some isolated nodes.

In [4]:
# !pip install torch_geometric==2.5.0

In [5]:
from torch_geometric import datasets

In [6]:
data = datasets.Amazon(name='computers', root='data')[0]
features = data.x
labels = data.y
edges = data.edge_index.T

# The graph is undirected, but is stored as a directed one (like all graphs in PyTorch Geometric),
# so each edge appears twice.
print(f'Number of nodes: {len(labels)}')
print(f'Number of edges: {len(edges) // 2}')
print(f'Average node degree: {len(edges) / len(labels):.2f}')
print(f'Number of classes: {len(labels.unique())}')

Downloading https://github.com/shchur/gnn-benchmark/raw/master/data/npz/amazon_electronics_computers.npz
Processing...


Number of nodes: 13752
Number of edges: 245861
Average node degree: 35.76
Number of classes: 10


Done!


In [7]:
from sklearn.model_selection import train_test_split

In [8]:
full_idx = np.arange(len(labels))
train_idx, val_and_test_idx = train_test_split(full_idx, test_size=0.5, random_state=0,
                                               stratify=labels)

val_idx, test_idx = train_test_split(val_and_test_idx, test_size=0.5, random_state=0,
                                     stratify=labels[val_and_test_idx])

train_idx = torch.from_numpy(train_idx)
val_idx = torch.from_numpy(val_idx)
test_idx = torch.from_numpy(test_idx)

Let's prepare a training loop.

In [9]:
def train_step(model, loss_fn, optimizer, scaler, amp, graph, features, labels, train_idx):
    model.train()

    with autocast(enabled=amp):
        logits = model(graph=graph, x=features).squeeze(1)
        loss = loss_fn(input=logits[train_idx], target=labels[train_idx])

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()


@torch.no_grad()
def evaluate(model, metric, amp, graph, features, labels, train_idx, test_idx, val_idx):
    model.eval()

    with autocast(enabled=amp):
        logits = model(graph=graph, x=features)

    if metric == 'ROC AUC':
        labels = labels.cpu().numpy()
        logits = logits.cpu().numpy()
        train_idx = train_idx.cpu().numpy()
        val_idx = val_idx.cpu().numpy()
        test_idx = test_idx.cpu().numpy()
        
        train_metric = roc_auc_score(y_true=labels[train_idx], y_score=logits[train_idx]).item()
        val_metric = roc_auc_score(y_true=labels[val_idx], y_score=logits[val_idx]).item()
        test_metric = roc_auc_score(y_true=labels[test_idx], y_score=logits[test_idx]).item()
        
    elif metric == 'accuracy':
        preds = logits.argmax(axis=1)
        
        train_metric = (preds[train_idx] == labels[train_idx]).float().mean().item()
        val_metric = (preds[val_idx] == labels[val_idx]).float().mean().item()
        test_metric = (preds[test_idx] == labels[test_idx]).float().mean().item()
    
    else:
        raise ValueError(f'Unknown metric: {metric}.')
    
    metrics = {
        f'train {metric}': train_metric,
        f'val {metric}': val_metric,
        f'test {metric}': test_metric
    }

    return metrics


def run_experiment(graph, features, labels, train_idx, val_idx, test_idx, graph_conv_module, num_layers=2,
                   hidden_dim=256, num_heads=4, dropout=0.2, lr=3e-5, num_steps=500, device='cuda:0', amp=False):
    num_classes = len(labels.unique())
    loss_fn = F.binary_cross_entropy_with_logits if num_classes == 2 else F.cross_entropy
    metric = 'ROC AUC' if num_classes == 2 else 'accuracy'
    if num_classes == 2:
        labels = labels.float()
    
    model = Model(graph_conv_module=graph_conv_module,
                  num_layers=num_layers,
                  input_dim=features.shape[1],
                  hidden_dim=hidden_dim,
                  output_dim=1 if num_classes == 2 else num_classes,
                  num_heads=num_heads,
                  dropout=dropout)
    model.to(device)
    
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scaler = GradScaler(enabled=amp)
    
    graph = graph.to(device)
    features = features.to(device)
    labels = labels.to(device)
    train_idx = train_idx.to(device)
    val_idx = val_idx.to(device)
    test_idx = test_idx.to(device)
    
    best_val_metric = 0
    corresponding_test_metric = 0
    best_step = None
    with tqdm(total=num_steps) as progress_bar:
        for step in range(1, num_steps + 1):
            train_step(model=model, loss_fn=loss_fn, optimizer=optimizer, scaler=scaler, amp=amp, graph=graph,
                       features=features, labels=labels, train_idx=train_idx)
            metrics = evaluate(model=model, metric=metric, amp=amp, graph=graph, features=features, labels=labels,
                               train_idx=train_idx, val_idx=val_idx, test_idx=test_idx)

            progress_bar.update()
            progress_bar.set_postfix({metric: f'{value:.2f}' for metric, value in metrics.items()})
            
            if metrics[f'val {metric}'] > best_val_metric:
                best_val_metric = metrics[f'val {metric}']
                corresponding_test_metric = metrics[f'test {metric}']
                best_step = step
    
    print(f'Best val {metric}: {best_val_metric:.4f}')
    print(f'Corresponding test {metric}: {corresponding_test_metric:.4f}')
    print(f'(step {best_step})')


This should look quite similar to your standard training loop, but with one notable difference - there are no mini-batches, we are always training on the whole graph. Since the data samples (graph nodes) are not independent, we cannot trivially sample a mini-batch.
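The key pattern in `train_step` is that the forward pass covers all nodes, while the loss is computed only on the training nodes. Below is a small NumPy sketch of that idea; the logits, labels, and the `cross_entropy` helper are made up for illustration and stand in for the PyTorch calls used above.

```python
import numpy as np


def cross_entropy(logits, targets):
    # Numerically stable log-softmax followed by the mean negative log-likelihood.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


# Hypothetical outputs of one forward pass over a 5-node graph with 3 classes.
logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 3.0],
                   [1.0, 1.0, 1.0],
                   [0.5, 0.2, 0.1]])
labels = np.array([0, 1, 2, 0, 1])
train_idx = np.array([0, 2])   # only these nodes contribute to the loss

# The model sees every node, but the loss is restricted to the training nodes.
train_loss = cross_entropy(logits[train_idx], labels[train_idx])
print(f'{train_loss:.4f}')
```

Validation and test nodes still take part in message passing during training; only their labels are hidden from the loss.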

Now, let's implement a model. Don't forget about skip connections and layer normalization - they can significantly boost the performance of a deep learning model.

In [10]:
class FeedForwardModule(nn.Module):
    def __init__(self, dim, num_inputs, dropout):
        super().__init__()
        self.linear_1 = nn.Linear(in_features=num_inputs * dim, out_features=dim)
        self.dropout_1 = nn.Dropout(p=dropout)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(in_features=dim, out_features=dim)
        self.dropout_2 = nn.Dropout(p=dropout)
    
    def forward(self, x):
        x = self.linear_1(x)
        x = self.dropout_1(x)
        x = self.act(x)
        x = self.linear_2(x)
        x = self.dropout_2(x)

        return x


class ResidualModule(nn.Module):
    def __init__(self, graph_conv_module, dim, num_heads, dropout):
        super().__init__()
        self.normalization = nn.LayerNorm(normalized_shape=dim)
        self.graph_conv = graph_conv_module(dim=dim, num_heads=num_heads)
        self.feed_forward = FeedForwardModule(dim=dim, num_inputs=2, dropout=dropout)
    
    def forward(self, graph, x):
        x_res = self.normalization(x)
        
        x_aggregated = self.graph_conv(graph, x_res) # <---- 
        x_res = torch.cat([x_res, x_aggregated], axis=1)
        
        x_res = self.feed_forward(x_res)
        
        x = x + x_res
        
        return x


class Model(nn.Module):
    def __init__(self, graph_conv_module, num_layers, input_dim, hidden_dim, output_dim, num_heads, dropout):
        super().__init__()
        self.input_linear = nn.Linear(in_features=input_dim, out_features=hidden_dim)
        self.input_dropout = nn.Dropout(p=dropout)
        self.input_act = nn.GELU()
        
        self.residual_modules = nn.ModuleList(
            ResidualModule(graph_conv_module=graph_conv_module, dim=hidden_dim, num_heads=num_heads,
                           dropout=dropout)
            for _ in range(num_layers)
        )
        
        self.output_normalization = nn.LayerNorm(hidden_dim)
        self.output_linear = nn.Linear(in_features=hidden_dim, out_features=output_dim)
    
    def forward(self, graph, x):
        x = self.input_linear(x)
        x = self.input_dropout(x)
        x = self.input_act(x)
        
        for residual_module in self.residual_modules:
            x = residual_module(graph, x)
        
        x = self.output_normalization(x)
        logits = self.output_linear(x)
        
        return logits


Now everything is ready - except for the graph convolution module. We will implement several variants of this module, which will constitute the only difference between our GNNs. But first - as a simple baseline - let's implement a graph convolution module that does nothing. It will allow us to see how a graph-agnostic model performs, so we can then compare our GNNs to this baseline.

In [11]:
class DummyGraphConv(nn.Module):
    def __init__(self, **kwargs):
        super().__init__()

    def forward(self, graph, x):
        return torch.zeros_like(x)


In [12]:
graph = torch.empty(0)   # We don't care about graph representation for this experiment.

run_experiment(graph=graph, features=features, labels=labels,
               train_idx=train_idx, val_idx=val_idx, test_idx=test_idx,
               graph_conv_module=DummyGraphConv,
               device=device, amp=True)

Best val accuracy: 0.8464
Corresponding test accuracy: 0.8319
(step 459)


Now let's implement some real graph convolutions. Simple graph convolutions can be represented as operations with (sparse) matrices, so they can be implemented in pure PyTorch. We will need the graph adjacency matrix $A$, the graph degree matrix $D$, and the matrix $X^l$ of node representations at layer $l$. Further, let $\tilde{x}_i^{l}$ be the output of the $\mathrm{Aggregate}$ function at layer $l$ for node $v_i$, and let $\widetilde{X}^l$ be the matrix of stacked vectors $\tilde{x}_i^{l}$ for all nodes.

For the next couple of experiments, assume that the graph argument of the graph convolution forward method is a sparse adjacency matrix of the graph.

In [13]:
edges.T.shape

torch.Size([2, 491722])

In [14]:
len(labels)

13752

In [15]:
graph = torch.sparse_coo_tensor(
    indices=edges.T,
    values=torch.ones(len(edges)),
    size=(len(labels), len(labels))
)

graph

tensor(indices=tensor([[    0,     0,     0,  ..., 13751, 13751, 13751],
                       [  507,  6551,  8210,  ..., 12751, 13019, 13121]]),
       values=tensor([1., 1., 1.,  ..., 1., 1., 1.]),
       size=(13752, 13752), nnz=491722, layout=torch.sparse_coo)

Let's implement a graph convolution that simply takes the mean of neighbors' representations. We can write:

$$ \tilde{x}_i^{l+1} = \frac{1}{|N(v_i)|} \sum_{v_j \in N(v_i)} x_j^l $$

This operation can be written in matrix form:

$$ \widetilde{X}^{l+1} = D^{-1} A X^l $$
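As a quick sanity check, the matrix form can be compared with the node-wise definition on a small dense example. This is a NumPy sketch on a hypothetical toy graph; treating the degree of isolated nodes as 1 is a convention chosen here to avoid division by zero.

```python
import numpy as np

# Hypothetical toy graph with 4 nodes; node 3 is isolated.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 0]], dtype=float)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

deg = A.sum(axis=1)
# Treat the degree of isolated nodes as 1 so that D^{-1} stays finite; their row of A is
# all zeros anyway, so their aggregated vector is zero.
D_inv = np.diag(1.0 / np.maximum(deg, 1.0))

X_tilde = D_inv @ A @ X

# Node-wise definition: average the representations of the neighbors.
for i in range(len(A)):
    nbrs = np.flatnonzero(A[i])
    expected = X[nbrs].mean(axis=0) if len(nbrs) > 0 else np.zeros(X.shape[1])
    assert np.allclose(X_tilde[i], expected)
```

On a real graph the matrices should be sparse, since a dense $n \times n$ adjacency matrix quickly becomes too large.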

Let's implement it!

Additionally, we can compute the product $D^{-1} A$ once and cache it, which speeds up execution even more. However, this optimization doesn't work in the inductive setting, where the graph changes at inference time.

In [None]:
class MeanGraphConv(nn.Module):
    def __init__(self, **kwargs):
        super().__init__()
        self.cache = None

    def forward(self, graph, x):
        ### YOUR CODE HERE ###
    
        A = graph
        X = x
    
        if self.cache is None:
            # compute node degrees:
            degrees_sparse = graph.sum(axis=0)

            # construct sparse tensor with diagonal values populated by inverse node degrees D^{-1}:
            degrees_sparse_indices = degrees_sparse.indices().squeeze()
            degrees_sparse_values = degrees_sparse.values()
            degrees_sparse_values[degrees_sparse_values == 0] = 1
            diagonal_indices = torch.stack([degrees_sparse_indices, degrees_sparse_indices])

            inverse_degree_sparse_tensor = torch.sparse_coo_tensor(
                indices=diagonal_indices,
                values=1 / degrees_sparse_values,
                size=(len(graph), len(graph))
            )

            # compute D^{-1} @ A and cache this fused matrix
            self.cache = inverse_degree_sparse_tensor @ A


        # aggregate neighbor representations using the cached D^{-1} A
        x_aggregated = self.cache @ X
    
        ######################
        return x_aggregated


(Note that the implementation above computes $D^{-1} A$ on the first forward pass and caches it, so the cache must be reset if the graph changes.)

In [44]:
run_experiment(graph=graph, features=features, labels=labels,
               train_idx=train_idx, val_idx=val_idx, test_idx=test_idx,
               graph_conv_module=MeanGraphConv,
               device=device)

Best val accuracy: 0.9040
Corresponding test accuracy: 0.8973
(step 488)


As we can see, the accuracy is a lot better than in the previous experiment - our GNN works better than a graph-agnostic model on this dataset.

Now, let's try another simple GNN variant - this time we will implement a graph convolution proposed in [the GCN paper](https://arxiv.org/abs/1609.02907). The formula is:

$$ \tilde{x}_i^{l+1} = \sum_{v_j \in N(v_i)} \frac{1}{\sqrt{deg(v_i) deg(v_j)}} x_j^l $$

It's very similar to the mean convolution, except we normalize each neighbor's representation not by the degree of the ego node, but by the geometric mean of the degrees of the ego node and the neighbor. This operation can be written in matrix form:

$$ \widetilde{X}^{l+1} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}} X^l $$
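The symmetric normalization can be checked by hand on a tiny example. The sketch below uses NumPy and a hypothetical 3-node path graph; it is meant to illustrate the formula, not to replace a sparse implementation.

```python
import numpy as np

# Hypothetical toy graph: a path 0 - 1 - 2, so the degrees are 1, 2, 1.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0], [2.0], [3.0]])

deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1.0)))

# Matrix form: every entry A_ij is scaled by 1 / sqrt(deg(v_i) * deg(v_j)).
A_norm = D_inv_sqrt @ A @ D_inv_sqrt
X_tilde = A_norm @ X

# Node-wise form for node 1, whose neighbors 0 and 2 both have degree 1.
expected_node_1 = X[0] / np.sqrt(2 * 1) + X[2] / np.sqrt(2 * 1)
assert np.allclose(X_tilde[1], expected_node_1)
assert np.allclose(A_norm, A_norm.T)   # unlike D^{-1} A, this matrix is symmetric
```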

Let's implement it!

In [44]:
class GCNGraphConv(nn.Module):
    def __init__(self, **kwargs):
        super().__init__()
    
    def forward(self, graph, x):
        ### YOUR CODE HERE ###
        
        A = graph
        X = x
        d = A.sum(axis=0).to_dense()[:, None]
        d[d == 0] = 1   # There are some isolated nodes, and we don't want to divide by zero.
        d_inv_sqrt = d.rsqrt()

        x_aggregated = (A * d_inv_sqrt * d_inv_sqrt.T) @ X        
        ######################
        
        return x_aggregated


(The computations can be sped up by precomputing $D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$, but we won't do it.)

In [45]:
run_experiment(graph=graph, features=features, labels=labels,
               train_idx=train_idx, val_idx=val_idx, test_idx=test_idx,
               graph_conv_module=GCNGraphConv,
               device=device)

Best val accuracy: 0.9055
Corresponding test accuracy: 0.8973
(step 434)


The results are similar to those in the previous experiment.

Simple graph convolutions can be expressed as matrix operations and thus can be implemented in pure PyTorch. However, an efficient implementation of more complex graph convolutions requires specialized libraries. The two most popular GNN libraries for PyTorch are [PyTorch Geometric (PyG)](https://github.com/pyg-team/pytorch_geometric) and [Deep Graph Library (DGL)](https://www.dgl.ai/). In this seminar, we will be using DGL, because ~it is objectively better~ the instructor likes it more.

In [46]:
# !pip install dgl -f https://data.dgl.ai/wheels/torch-2.4/cu124/repo.html

In [47]:
import dgl
from dgl import ops

There are many features for deep learning on graphs in DGL, but we will only be using two of them - the Graph class, which is obviously used for representing a graph, and the [ops module](https://docs.dgl.ai/api/python/dgl.ops.html), which contains operators for message passing on graphs.

First, let's create a graph representation which we will be using in the next few experiments.

In [49]:
graph = dgl.graph((edges[:, 0], edges[:, 1]), num_nodes=len(labels))
graph

Graph(num_nodes=13752, num_edges=491722,
      ndata_schemes={}
      edata_schemes={})

Now let's reimplement the mean graph convolution, this time using DGL. For this we will need a certain operation from the ops module - can you guess which one by their names?

In [51]:
class DGLMeanGraphConv(nn.Module):
    def __init__(self, **kwargs):
        super().__init__()
    
    def forward(self, graph, x):
        ### YOUR CODE HERE ###
        
        x_aggregated = ops.copy_u_mean(graph, x)
        
        ######################
        
        return x_aggregated


In [52]:
run_experiment(graph=graph, features=features, labels=labels,
               train_idx=train_idx, val_idx=val_idx, test_idx=test_idx,
               graph_conv_module=DGLMeanGraphConv,
               device=device, amp=True)

Best val accuracy: 0.9081
Corresponding test accuracy: 0.9011
(step 455)


The results are roughly the same as for the pure PyTorch implementation, but the training is faster. Graph message passing operations with DGL are generally faster than PyTorch sparse matrix multiplications. Further, DGL supports AMP for most of its operations, while PyTorch does not (yet) allow using AMP with sparse matrix operations.

By simply swapping the ops.copy_u_mean function for the ops.copy_u_max function, we can get another graph convolution that computes the elementwise maximum of neighbors' representations. This one cannot be efficiently implemented in pure PyTorch. Let's see how it performs.
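To see what the max aggregation computes, here is a small NumPy sketch on a hypothetical toy graph. It only illustrates the semantics (an elementwise maximum over incoming messages) and says nothing about efficiency.

```python
import numpy as np

# Hypothetical toy graph as directed edge pairs (src -> dst); node 3 receives no messages.
src = np.array([0, 1, 0, 2, 1, 2])
dst = np.array([1, 0, 2, 0, 2, 1])
n = 4
X = np.array([[1.0, -5.0], [0.0, 2.0], [4.0, -1.0], [9.0, 9.0]])

# Start from -inf and take a running elementwise maximum over incoming messages.
aggregated = np.full((n, X.shape[1]), -np.inf)
np.maximum.at(aggregated, dst, X[src])

# Nodes without neighbors are still at -inf; replace that with zeros.
aggregated[np.isinf(aggregated)] = 0.0
print(aggregated)
```

Note that each output coordinate may come from a different neighbor, which is what distinguishes this aggregation from picking a single "best" neighbor.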

In [53]:
class DGLMaxGraphConv(nn.Module):
    def __init__(self, **kwargs):
        super().__init__()
    
    def forward(self, graph, x):
        ### YOUR CODE HERE ###
        
        x_aggregated = ops.copy_u_max(graph, x)
        x_aggregated[x_aggregated.isinf()] = 0   # There are some isolated nodes, and we do not want -inf.
        
        ######################
        
        return x_aggregated


In [54]:
run_experiment(graph=graph, features=features, labels=labels,
               train_idx=train_idx, val_idx=val_idx, test_idx=test_idx,
               graph_conv_module=DGLMaxGraphConv,
               device=device)   # This one currently does not work with AMP.

Best val accuracy: 0.9107
Corresponding test accuracy: 0.9098
(step 405)


Now, let's reimplement the GCN graph convolution using DGL.

In [55]:
class DGLGCNGraphConv(nn.Module):
    def __init__(self, **kwargs):
        super().__init__()
    
    def forward(self, graph, x):
        ### YOUR CODE HERE ###
        d = graph.in_degrees()
        # Isolated nodes get inf here, but they have no edges, so this value is never used.
        d_inv_sqrt = 1 / d.sqrt()
        
        weights = ops.u_mul_v(graph, d_inv_sqrt, d_inv_sqrt)
        
        x_aggregated = ops.u_mul_e_sum(graph, x, weights)        
        
        ######################
        
        return x_aggregated


(The computations can be sped up by precomputing weights, but we won't do it.)

In [56]:
run_experiment(graph=graph, features=features, labels=labels,
               train_idx=train_idx, val_idx=val_idx, test_idx=test_idx,
               graph_conv_module=DGLGCNGraphConv,
               device=device, amp=True)

Best val accuracy: 0.9072
Corresponding test accuracy: 0.8895
(step 465)


Now let's implement something more complex - the graph convolution proposed in [the GAT paper](https://arxiv.org/abs/1710.10903). This one uses attention (although a very simple version of it). The formulas are:

$$ s_{ij} = \mathrm{LeakyReLU} \Big( w_1^T x_i^l + w_2^T x_j^l + b \Big)\;\;\;\;\; \forall (i, j) \in E_{dir} $$

$$ \Big( p_{ij}: \; v_j \in N(v_i) \Big) = softmax \Big( s_{ij}: \; v_j \in N(v_i) \Big) \;\;\;\;\; \forall i = 1, ..., n$$

This clunky notation means that for each node we take the softmax of attention scores ($s_{ij}$) of its neighbors to get attention probabilities ($p_{ij}$) corresponding to these neighbors. Another way to write it is:

$$ p_{ij} = \frac{ \exp{(s_{ij})} }{ \sum_{v_k \in N(v_i)} \exp{(s_{ik})} } \;\;\;\;\; \forall (i, j) \in E_{dir} $$

The necessary edge softmax function is available in DGL. The attention probabilities then serve as weights in the aggregation:

$$ \tilde{x}_i^{l+1} = \sum_{v_j \in N(v_i)} p_{ij} x_j^l $$

Note that, additionally, the attention mechanism is multi-headed: each head computes its own attention scores and probabilities.
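The per-node softmax over incoming edges can be sketched in NumPy for a single attention head. The helper `edge_softmax_numpy`, the toy graph, and the scores below are hypothetical and only mirror the formulas above; they are not DGL's implementation.

```python
import numpy as np

# Hypothetical toy graph as directed edge pairs (src j -> dst i) and one score per edge.
src = np.array([1, 2, 0, 2, 0])
dst = np.array([0, 0, 1, 1, 2])
n = 3
scores = np.array([0.0, 0.0, 1.0, 3.0, -2.0])     # s_ij for each directed edge
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])


def edge_softmax_numpy(scores, dst, n):
    # Subtract the per-destination maximum for numerical stability.
    max_per_node = np.full(n, -np.inf)
    np.maximum.at(max_per_node, dst, scores)
    exp_scores = np.exp(scores - max_per_node[dst])
    # Normalize by the sum over all edges that share the same destination node.
    denom = np.zeros(n)
    np.add.at(denom, dst, exp_scores)
    return exp_scores / denom[dst]


probs = edge_softmax_numpy(scores, dst, n)         # p_ij

# Weighted sum of neighbor representations with attention probabilities as weights.
X_tilde = np.zeros_like(X)
np.add.at(X_tilde, dst, probs[:, None] * X[src])
```

For each node, the probabilities of its incoming edges sum to one, so the result is a convex combination of the neighbors' representations.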

In [57]:
from dgl.nn.functional import edge_softmax

In [58]:
class DGLGATGraphConv(nn.Module):
    def __init__(self, dim, num_heads=4, **kwargs):
        super().__init__()
        ### YOUR CODE HERE ###
        
        self.dim = dim
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
    
        self.attn_linear_u = nn.Linear(in_features=dim, out_features=num_heads)
        self.attn_linear_v = nn.Linear(in_features=dim, out_features=num_heads, bias=False)
        self.attn_act = nn.LeakyReLU(negative_slope=0.2)
        
        ######################
    
    def forward(self, graph, x):
        ### YOUR CODE HERE ###
        
        attn_scores_u = self.attn_linear_u(x)
        attn_scores_v = self.attn_linear_v(x)

        attn_scores = ops.u_add_v(graph, attn_scores_u, attn_scores_v)

        attn_scores = self.attn_act(attn_scores)
        attn_probs = edge_softmax(graph, attn_scores)
        
        x = x.reshape(len(x), self.head_dim, self.num_heads)
        x_aggregated = ops.u_mul_e_sum(graph, x, attn_probs)
        x_aggregated = x_aggregated.reshape(len(x), self.dim)        
        ######################
        
        return x_aggregated


In [59]:
run_experiment(graph=graph, features=features, labels=labels,
               train_idx=train_idx, val_idx=val_idx, test_idx=test_idx,
               graph_conv_module=DGLGATGraphConv,
               device=device, amp=True)

Best val accuracy: 0.9113
Corresponding test accuracy: 0.8985
(step 424)


Now, let's see if GNNs can achieve strong performance on a heterophilous graph.
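A graph is usually called heterophilous when edges tend to connect nodes with different labels. One simple way to quantify this, used here only as an illustration and not taken from this notebook, is the fraction of edges whose endpoints share a label; the function name `edge_homophily` and the toy data are assumptions.

```python
import numpy as np


def edge_homophily(src, dst, labels):
    # Fraction of edges whose two endpoints carry the same label.
    return float(np.mean(labels[src] == labels[dst]))


# Hypothetical toy graph: a 4-cycle 0 - 1 - 2 - 3 - 0 with alternating labels.
src = np.array([0, 1, 2, 3])
dst = np.array([1, 2, 3, 0])
alternating = np.array([0, 1, 0, 1])
constant = np.array([0, 0, 0, 0])

print(edge_homophily(src, dst, alternating))   # every edge joins different labels
print(edge_homophily(src, dst, constant))      # every edge joins equal labels
```

Low values of this ratio indicate that averaging over neighbors mixes information from other classes, which is why such graphs are an interesting test for GNNs.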

In [60]:
data = datasets.HeterophilousGraphDataset(name='amazon-ratings', root='data/heterophilous-graphs')[0]
features = data.x
labels = data.y
edges = data.edge_index.T

# The graph is undirected, but is stored as a directed one (like all graphs in PyTorch Geometric),
# so each edge appears twice.
print(f'Number of nodes: {len(labels)}')
print(f'Number of edges: {len(edges) // 2}')
print(f'Average node degree: {len(edges) / len(labels):.2f}')
print(f'Number of classes: {len(labels.unique())}')

Number of nodes: 24492
Number of edges: 93050
Average node degree: 7.60
Number of classes: 5


In [61]:
# The same split as in the previous seminar, except we now use part of the previous seminar's 
# train set as a val set.
train_idx = torch.where(data.train_mask[:, 0])[0]
val_idx = torch.where(data.val_mask[:, 0])[0]
test_idx = torch.where(data.test_mask[:, 0])[0]

In [62]:
graph = dgl.graph((edges[:, 0], edges[:, 1]), num_nodes=len(labels))
graph

Graph(num_nodes=24492, num_edges=186100,
      ndata_schemes={}
      edata_schemes={})

As always, let's first try a graph-agnostic baseline.

In [63]:
run_experiment(graph=graph, features=features, labels=labels,
               train_idx=train_idx, val_idx=val_idx, test_idx=test_idx,
               graph_conv_module=DummyGraphConv,
               device=device, amp=True, num_steps=2500)

Best val accuracy: 0.4684
Corresponding test accuracy: 0.4761
(step 2473)


Let's see if a GNN with mean graph convolution can achieve significantly better results.

In [64]:
run_experiment(graph=graph, features=features, labels=labels,
               train_idx=train_idx, val_idx=val_idx, test_idx=test_idx,
               graph_conv_module=DGLMeanGraphConv,
               device=device, amp=True, num_steps=2500)

Best val accuracy: 0.5314
Corresponding test accuracy: 0.5404
(step 1828)
