In [1]:
import os 
import torch
# pre-partitioning the graph into subgraphs on which one can operate in a mini-batch fashion.

**Cluster-GCN** ([Chiang et al. (2019)](https://arxiv.org/abs/1905.07953)) first partitions the graph into subgraphs using a graph partitioning algorithm.
GNNs are then restricted to convolve only inside their own subgraph, which avoids the problem of **neighborhood explosion**.

However, partitioning removes the links between subgraphs, which may limit the model's performance due to a biased estimation.
To address this issue, Cluster-GCN also **incorporates between-cluster links inside a mini-batch**, which results in a **stochastic partitioning scheme**.
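
The effect of merging clusters can be illustrated on a toy graph. The sketch below is dependency-free and uses made-up data: the edge list, the `cluster_of` assignment, and the helpers `nodes_in` and `induced_edges` are all hypothetical, and a real run would obtain the partition from a graph partitioning algorithm rather than by hand. It compares the edges that survive when each cluster is processed alone against the edges that survive when randomly chosen clusters are merged into one mini-batch.

```python
import random

# Toy undirected graph: a ring over 8 nodes (hypothetical data).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 0)]

# Hypothetical partition (node -> cluster id), written by hand for illustration.
cluster_of = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
num_clusters = 4


def nodes_in(cluster_ids, cluster_of):
    """Collect all nodes that belong to any of the given clusters."""
    return {node for node, c in cluster_of.items() if c in cluster_ids}


def induced_edges(node_set, edges):
    """Keep only the edges whose two endpoints both lie in `node_set`."""
    return [(u, v) for u, v in edges if u in node_set and v in node_set]


# Plain partitioning: every cluster is its own batch, so all
# between-cluster edges are dropped.
within_only = sum(
    len(induced_edges(nodes_in({c}, cluster_of), edges))
    for c in range(num_clusters)
)

# Stochastic scheme: shuffle the clusters, merge two per mini-batch, and
# keep every edge inside the merged node set. Edges between the two
# selected clusters are restored.
rng = random.Random(0)
order = list(range(num_clusters))
rng.shuffle(order)
batches = [set(order[i:i + 2]) for i in range(0, num_clusters, 2)]
with_links = sum(
    len(induced_edges(nodes_in(batch, cluster_of), edges))
    for batch in batches
)

print(f'Edges kept with isolated clusters: {within_only} of {len(edges)}')
print(f'Edges kept with merged clusters:   {with_links} of {len(edges)}')
```

Merging can never keep fewer edges than processing clusters in isolation. How many between-cluster edges come back depends on which clusters happen to be drawn together, and because the clusters are reshuffled every epoch, different links are seen over time.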

In [2]:
import torch
from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import NormalizeFeatures

dataset = Planetoid(root='./data/Planetoid', name='PubMed', transform=NormalizeFeatures())

print()
print(f'Dataset: {dataset}:')
print('==================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

data = dataset[0]  # Get the first graph object.

print()
print(data)
print('===============================================================================================================')

# Gather some statistics about the graph.
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
print(f'Number of training nodes: {data.train_mask.sum()}')
print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.3f}')
print(f'Has isolated nodes: {data.has_isolated_nodes()}')
print(f'Has self-loops: {data.has_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')


Dataset: PubMed():
Number of graphs: 1
Number of features: 500
Number of classes: 3

Data(x=[19717, 500], edge_index=[2, 88648], y=[19717], train_mask=[19717], val_mask=[19717], test_mask=[19717])
Number of nodes: 19717
Number of edges: 88648
Average node degree: 4.50
Number of training nodes: 60
Training node label rate: 0.003
Has isolated nodes: False
Has self-loops: False
Is undirected: True
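
The derived statistics follow directly from the raw counts. The sketch below recomputes them from the numbers printed above, as a sanity check that needs no dataset download. The halved edge count rests on an assumption: that the graph stores each undirected edge once per direction and contains no duplicates, which is consistent with `Is undirected: True` and `Has self-loops: False` but is not printed explicitly.

```python
# Values copied from the printed statistics above.
num_nodes = 19717
num_edges = 88648  # entries in edge_index (directed pairs)
num_train = 60

avg_degree = num_edges / num_nodes
label_rate = num_train / num_nodes

# Assumption: each undirected edge appears once per direction, no duplicates.
num_undirected_edges = num_edges // 2

print(f'Average node degree: {avg_degree:.2f}')
print(f'Training node label rate: {label_rate:.3f}')
print(f'Undirected edges (assumed): {num_undirected_edges}')
```

With only 60 labeled nodes out of 19717, roughly 0.3% of the graph carries a training label.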


PyTorch Geometric provides a **two-stage implementation** of the Cluster-GCN algorithm:
1. [**`ClusterData`**](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.ClusterData) converts a `Data` object into a dataset of subgraphs containing `num_parts` partitions.
2. Given a user-defined `batch_size`, [**`ClusterLoader`**](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.ClusterLoader) implements the stochastic partitioning scheme in order to create mini-batches.
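
A quick back-of-the-envelope calculation shows how `num_parts` and `batch_size` interact. The sketch assumes that `ClusterLoader` batches over partition indices like a standard data loader that keeps the last incomplete batch. The values `num_parts=128` and `batch_size=32` are the ones used in the code cell that follows, and the node count is the one printed for PubMed.

```python
import math

num_parts = 128    # number of partitions created in stage 1
batch_size = 32    # number of partitions merged into one mini-batch in stage 2
num_nodes = 19717  # PubMed node count from the statistics above

# Assumption: the last incomplete batch is kept, hence the ceiling.
num_batches = math.ceil(num_parts / batch_size)

# Average only; actual batches differ because partitions are not equal in size.
avg_nodes_per_batch = num_nodes / num_batches

print(f'Mini-batches per epoch: {num_batches}')
print(f'Average nodes per mini-batch: {avg_nodes_per_batch:.0f}')
```

A larger `batch_size` restores more between-cluster links per step at the cost of more memory per mini-batch.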

The procedure to craft mini-batches then looks as follows:

In [5]:
from torch_geometric.loader import ClusterData, ClusterLoader

torch.manual_seed(12345)
cluster_data = ClusterData(data, num_parts=128)  # 1. Create subgraphs.
train_loader = ClusterLoader(cluster_data, batch_size=32, shuffle=True)  # 2. Stochastic partitioning scheme.

print()
total_num_nodes = 0
for step, sub_data in enumerate(train_loader):
    print(f'Step {step + 1}:')
    print('=======')
    print(f'Number of nodes in the current batch: {sub_data.num_nodes}')
    print(sub_data)
    print()
    total_num_nodes += sub_data.num_nodes

print(f'Iterated over {total_num_nodes} of {data.num_nodes} nodes!')

Computing METIS partitioning...


ImportError: 'ClusterData' requires either 'pyg-lib' or 'torch-sparse'
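
The `ImportError` above means that neither of the two optional packages is installed in this environment. A minimal sketch for checking this before building `ClusterData` is shown below. It assumes the import names `pyg_lib` and `torch_sparse` for the distributions `pyg-lib` and `torch-sparse` named in the error message. `available_backends` is a hypothetical helper, not part of PyTorch Geometric.

```python
from importlib.util import find_spec

# Distribution name -> assumed import name.
BACKENDS = {'pyg-lib': 'pyg_lib', 'torch-sparse': 'torch_sparse'}


def available_backends(backends=BACKENDS):
    """Return the distribution names whose module can be located."""
    return [
        dist for dist, module in backends.items()
        if find_spec(module) is not None
    ]


found = available_backends()
if found:
    print(f'ClusterData backend(s) found: {", ".join(found)}')
else:
    print("Neither 'pyg-lib' nor 'torch-sparse' was found; "
          "ClusterData is expected to raise the ImportError shown above.")
```

Installing either package should resolve the error. The right wheel depends on the installed PyTorch and CUDA versions, so the PyTorch Geometric installation instructions are the place to look up the exact command.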