<a href="https://colab.research.google.com/github/vent0906/ww/blob/main/self_learn_GCN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Project Summary: Learning Graph Convolutional Networks (GCN) with PyTorch Geometric

In this project, I explored how to build and train a Graph Convolutional Network (GCN) using the PyTorch Geometric (PyG) framework. The main goal was to understand how GCNs operate on graph-structured data and apply them to a real-world graph classification problem.

I used the **MUTAG** dataset, a well-known benchmark from the TUDataset collection, in which each graph represents a molecule and the task is to classify whether it is mutagenic. I started by analyzing the dataset's structure (the number of graphs, node features, edges, and class distribution) and then checked key graph properties: whether a graph has isolated nodes, whether it has self-loops, and whether it is undirected.

I then implemented a **3-layer GCN** model with ReLU activations after the first two layers, global mean pooling for graph-level readout, and dropout (p = 0.5) before a fully connected classification layer. I defined training and evaluation procedures using batched data loaders and measured performance by classification accuracy.
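
For reference, the architecture above can be written compactly. This is a sketch in notation introduced here (not taken from the notebook): $A$ is a graph's adjacency matrix, $H^{(l)}$ holds the node features after layer $l$ with $H^{(0)} = X$, and $W^{(l)}$, $b^{(l)}$ are the learnable weights and bias of layer $l$. The first line is the propagation rule that `GCNConv` uses with its default settings (self-loops added, symmetric normalization). In this model $\sigma$ is ReLU for the first two layers and the identity for the third.

$$
\hat{A} = A + I, \qquad \hat{D}_{ii} = \sum_j \hat{A}_{ij}, \qquad
H^{(l+1)} = \sigma\!\left(\hat{D}^{-1/2}\,\hat{A}\,\hat{D}^{-1/2}\,H^{(l)}\,W^{(l)} + b^{(l)}\right)
$$

The readout averages the final node embeddings $h_v^{(3)}$ over the node set $V_G$ of each graph $G$, and the linear layer turns the result into two class scores:

$$
h_G = \frac{1}{|V_G|}\sum_{v \in V_G} h_v^{(3)}, \qquad
\hat{y}_G = \arg\max\left(W_{\text{lin}}\,h_G + b_{\text{lin}}\right)
$$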

To evaluate model performance, I tracked both training and test accuracy across 170 epochs and visualized the results with Matplotlib. This allowed me to observe convergence behavior and check for potential overfitting.
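
One simple way to read the per-epoch numbers for overfitting is to look at the gap between training and test accuracy. The sketch below is only an illustration and is not part of the original notebook: `accuracy_gaps` is a hypothetical helper, and the sample rows are epochs 6 to 10 copied from the training log further down.

```python
def accuracy_gaps(history):
    """Return (epoch, train_acc - test_acc) pairs, rounded to 4 decimals."""
    return [(epoch, round(train_acc - test_acc, 4))
            for epoch, train_acc, test_acc in history]

# Epochs 6-10 as (epoch, train_acc, test_acc), copied from the log in this notebook.
history = [
    (6, 0.6533, 0.7368),
    (7, 0.7467, 0.7632),
    (8, 0.7267, 0.7632),
    (9, 0.7200, 0.7632),
    (10, 0.7133, 0.7895),
]

for epoch, gap in accuracy_gaps(history):
    print(f"Epoch {epoch:03d}: train - test = {gap:+.4f}")
```

A gap that stays positive and keeps growing would point toward overfitting. In these early epochs the gap is negative, and because the test set has only 38 graphs, a single graph shifts test accuracy by about 2.6 percentage points, so small differences should be read with caution.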

Overall, this hands-on experiment helped me understand the full GCN training pipeline, including dataset loading, model design, loss optimization, evaluation, and visualization in the context of graph classification.


In [4]:
!pip install torch_geometric

Collecting torch_geometric
  Downloading torch_geometric-2.6.1-py3-none-any.whl.metadata (63 kB)
Downloading torch_geometric-2.6.1-py3-none-any.whl (1.1 MB)
Installing collected packages: torch_geometric
Successfully installed torch_geometric-2.6.1


In [5]:
import torch
from torch_geometric.datasets import TUDataset

# Load the MUTAG dataset from the TUDataset collection
dataset = TUDataset(root='data/TUDataset', name='MUTAG')

print()
print(f'Dataset: {dataset}:')
print('====================')
print(f'Number of graphs: {len(dataset)}')              # Total number of graph samples in the dataset
print(f'Number of features: {dataset.num_features}')    # Node feature dimension
print(f'Number of classes: {dataset.num_classes}')      # Number of target classes for classification

# Access the first graph object in the dataset
data = dataset[0]

print()
print(data)  # Print full info of the graph object (e.g., edge_index, x, y)
print('=============================================================')

# Extract structural statistics of the graph
print(f'Number of nodes: {data.num_nodes}')                     # Total number of nodes
print(f'Number of edges: {data.num_edges}')                     # Directed edge count (an undirected edge is stored once per direction)
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')  # Directed edges per node
print(f'Has isolated nodes: {data.has_isolated_nodes()}')       # Whether there are nodes with no connections
print(f'Has self-loops: {data.has_self_loops()}')               # Whether the graph contains self-loops
print(f'Is undirected: {data.is_undirected()}')                 # Whether the graph is undirected



Downloading https://www.chrsmrrs.com/graphkerneldatasets/MUTAG.zip



Dataset: MUTAG(188):
Number of graphs: 188
Number of features: 7
Number of classes: 2

Data(edge_index=[2, 38], x=[17, 7], edge_attr=[38, 4], y=[1])
Number of nodes: 17
Number of edges: 38
Average node degree: 2.24
Has isolated nodes: False
Has self-loops: False
Is undirected: True


Processing...
Done!
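
To make the statistics above concrete, here is a plain-Python sketch of what such checks amount to on a directed edge list. It is an illustration under stated assumptions, not PyG's implementation: `graph_stats` and the toy graph are hypothetical. Note that `edge_index` stores an undirected edge once per direction, so the 38 edges reported for the first MUTAG graph (undirected, no self-loops) correspond to 19 bonds.

```python
def graph_stats(num_nodes, edges):
    """Compute simple structural statistics from a directed edge list.

    `edges` is a list of (source, target) pairs; an undirected edge
    appears twice, once per direction, as in PyG's `edge_index`.
    """
    edge_set = set(edges)
    # Nodes that touch at least one edge that is not a self-loop.
    connected = {n for s, t in edges if s != t for n in (s, t)}
    return {
        "num_edges": len(edges),
        "avg_degree": len(edges) / num_nodes,
        "has_isolated_nodes": len(connected) < num_nodes,
        "has_self_loops": any(s == t for s, t in edges),
        "is_undirected": all((t, s) in edge_set for s, t in edges),
    }

# Hypothetical toy graph: a triangle 0-1-2 plus an isolated node 3.
toy_edges = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)]
stats = graph_stats(num_nodes=4, edges=toy_edges)
print(stats)
```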


In [6]:
# Set random seed for reproducibility
torch.manual_seed(12345)

# Shuffle the dataset to randomize the order of graphs
dataset = dataset.shuffle()

# Split the dataset into training and testing sets
train_dataset = dataset[:150]  # First 150 graphs for training
test_dataset = dataset[150:]   # Remaining graphs for testing

# Print the size of each split
print(f'Number of training graphs: {len(train_dataset)}')
print(f'Number of test graphs: {len(test_dataset)}')


Number of training graphs: 150
Number of test graphs: 38
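
The split above is a seeded shuffle followed by slicing. The sketch below reproduces that idea on plain indices and adds a majority-class baseline, which is a useful reference point when reading accuracy on a small, imbalanced dataset. Both helpers are hypothetical, the labels are toy values, and Python's `random` module will not produce the same permutation as `dataset.shuffle()` under `torch.manual_seed(12345)`.

```python
import random

def shuffle_split(num_items, num_train, seed):
    """Return disjoint train/test index lists after a seeded shuffle."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    return indices[:num_train], indices[num_train:]

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return max(counts.values()) / len(labels)

train_idx, test_idx = shuffle_split(num_items=188, num_train=150, seed=12345)
print(len(train_idx), len(test_idx))  # 150 38

# Hypothetical toy labels, only to exercise the helper.
print(majority_baseline([1, 1, 0, 1]))  # 0.75
```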


In [8]:
from torch_geometric.loader import DataLoader

# Create a data loader for the training set with batching and shuffling
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Create a data loader for the test set without shuffling
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
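
PyG's `DataLoader` batches graphs by merging them into one large disconnected graph: node features are concatenated, edge indices are shifted by the number of nodes that came before, and a `batch` vector records which graph each node belongs to. The plain-Python sketch below illustrates that idea. `collate_graphs` and the dict layout are hypothetical and are not the library's code. With `batch_size=64`, the 150 training graphs give 3 batches per epoch (64, 64, and 22 graphs), and the 38 test graphs fit in a single batch.

```python
def collate_graphs(graphs):
    """Merge small graphs into one disconnected graph.

    Each graph is a dict with `x` (list of node feature lists) and
    `edges` (list of (source, target) pairs using local node indices).
    """
    x, edges, batch = [], [], []
    offset = 0  # Number of nodes already placed in the merged graph
    for graph_id, g in enumerate(graphs):
        x.extend(g["x"])
        edges.extend((s + offset, t + offset) for s, t in g["edges"])
        batch.extend([graph_id] * len(g["x"]))
        offset += len(g["x"])
    return x, edges, batch

# Two hypothetical toy graphs with one feature per node.
g0 = {"x": [[1.0], [2.0]], "edges": [(0, 1), (1, 0)]}
g1 = {"x": [[3.0], [4.0], [5.0]], "edges": [(0, 2), (2, 0)]}

x, edges, batch = collate_graphs([g0, g1])
print(edges)  # [(0, 1), (1, 0), (2, 4), (4, 2)]
print(batch)  # [0, 0, 1, 1, 1]
```

Because no edge connects nodes from different graphs, message passing on the merged graph never mixes information between molecules.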



In [7]:
from torch.nn import Linear
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class GCN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        torch.manual_seed(12345)

        # Define 3 GCN layers
        self.conv1 = GCNConv(dataset.num_node_features, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.conv3 = GCNConv(hidden_channels, hidden_channels)

        # Final linear classifier
        self.lin = Linear(hidden_channels, dataset.num_classes)

    def forward(self, x, edge_index, batch):
        # Step 1: Compute node embeddings with three GCN layers (no ReLU after the last one)
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = self.conv2(x, edge_index)
        x = x.relu()
        x = self.conv3(x, edge_index)

        # Step 2: Readout layer - aggregate node features into graph embeddings
        x = global_mean_pool(x, batch)  # Shape: [num_graphs, hidden_channels]

        # Step 3: Classification
        x = F.dropout(x, p=0.5, training=self.training)  # Apply dropout during training
        x = self.lin(x)  # Linear classification layer

        return x

# Instantiate the model
model = GCN(hidden_channels=64)
print(model)


GCN(
  (conv1): GCNConv(7, 64)
  (conv2): GCNConv(64, 64)
  (conv3): GCNConv(64, 64)
  (lin): Linear(in_features=64, out_features=2, bias=True)
)
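
The readout in `forward` turns a variable number of node embeddings into one fixed-size vector per graph. The sketch below shows what a mean readout computes, using plain Python lists in place of tensors. `mean_pool` is a hypothetical stand-in for illustration, not the implementation of `global_mean_pool`.

```python
def mean_pool(x, batch):
    """Average node feature vectors per graph id.

    `x` is a list of node feature lists; `batch[i]` is the graph id of node i.
    """
    num_graphs = max(batch) + 1
    dim = len(x[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for features, graph_id in zip(x, batch):
        counts[graph_id] += 1
        for j, value in enumerate(features):
            sums[graph_id][j] += value
    return [[s / c for s in row] for row, c in zip(sums, counts)]

# Three nodes: the first two belong to graph 0, the last to graph 1.
x = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
batch = [0, 0, 1]
print(mean_pool(x, batch))  # [[2.0, 3.0], [10.0, 20.0]]
```

Averaging rather than summing keeps the graph embedding on a comparable scale for molecules of different sizes.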


In [9]:
from IPython.display import Javascript, display

# Adjust iframe height for better visualization in Google Colab
display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 300})'''))

# Initialize the GCN model, optimizer, and loss function
model = GCN(hidden_channels=64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

def train():
    model.train()  # Set model to training mode

    for data in train_loader:  # Iterate over training batches
        out = model(data.x, data.edge_index, data.batch)  # Forward pass
        loss = criterion(out, data.y)  # Compute cross-entropy loss
        loss.backward()  # Backpropagation
        optimizer.step()  # Update model parameters
        optimizer.zero_grad()  # Clear gradients for next step

@torch.no_grad()  # Gradients are not needed for evaluation
def test(loader):
    model.eval()  # Set model to evaluation mode (disables dropout)

    correct = 0
    for data in loader:  # Iterate over the batches of the given loader
        out = model(data.x, data.edge_index, data.batch)  # Forward pass
        pred = out.argmax(dim=1)  # Predicted class = index of the largest logit
        correct += int((pred == data.y).sum())  # Count correct predictions
    return correct / len(loader.dataset)  # Fraction of correctly classified graphs

# Training loop
for epoch in range(1, 171):  # Train for 170 epochs
    train()
    train_acc = test(train_loader)  # Evaluate on training set
    test_acc = test(test_loader)    # Evaluate on test set
    print(f'Epoch: {epoch:03d}, Train Acc: {train_acc:.4f}, Test Acc: {test_acc:.4f}')



<IPython.core.display.Javascript object>

Epoch: 001, Train Acc: 0.6467, Test Acc: 0.7368
Epoch: 002, Train Acc: 0.6467, Test Acc: 0.7368
Epoch: 003, Train Acc: 0.6467, Test Acc: 0.7368
Epoch: 004, Train Acc: 0.6467, Test Acc: 0.7368
Epoch: 005, Train Acc: 0.6467, Test Acc: 0.7368
Epoch: 006, Train Acc: 0.6533, Test Acc: 0.7368
Epoch: 007, Train Acc: 0.7467, Test Acc: 0.7632
Epoch: 008, Train Acc: 0.7267, Test Acc: 0.7632
Epoch: 009, Train Acc: 0.7200, Test Acc: 0.7632
Epoch: 010, Train Acc: 0.7133, Test Acc: 0.7895
Epoch: 011, Train Acc: 0.7200, Test Acc: 0.7632
Epoch: 012, Train Acc: 0.7200, Test Acc: 0.7895
Epoch: 013, Train Acc: 0.7200, Test Acc: 0.7895
Epoch: 014, Train Acc: 0.7133, Test Acc: 0.8421
Epoch: 015, Train Acc: 0.7133, Test Acc: 0.8421
Epoch: 016, Train Acc: 0.7533, Test Acc: 0.7368
Epoch: 017, Train Acc: 0.7400, Test Acc: 0.7632
Epoch: 018, Train Acc: 0.7133, Test Acc: 0.8421
Epoch: 019, Train Acc: 0.7400, Test Acc: 0.7895
Epoch: 020, Train Acc: 0.7533, Test Acc: 0.7368
Epoch: 021, Train Acc: 0.7467, Test Acc: