# HW2 part 2: GNN 

## Graph Attention Network (GAT) Implementation for Node Classification

**Objective:** Implement a GAT from scratch to perform node classification using an OGB dataset. Develop neural components, including the forward pass, as well as training and testing routines. There are 17 `todo`s and 2 questions for GAT.

A Graph Attention Network (GAT) applies the concept of self-attention to graph-structured data. Unlike traditional Graph Convolutional Networks (GCNs) that rely on fixed or uniform weights derived from adjacency structures, GAT learns to assign different weights (attention scores) to different edges. This allows the model to focus more on important neighbors while possibly ignoring less relevant ones. By doing so, GAT effectively captures how each node interacts with its neighbors in a more flexible and adaptive way.
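
To make the idea of learned edge weights concrete, here is a minimal sketch in plain Python of how attention coefficients could be computed for one node and its neighbors. It assumes the linear transform has already been applied to the features; the feature values and the attention vector `a` are hypothetical numbers chosen for illustration, and the helper names are not part of PyG.

```python
import math

def leaky_relu(z, negative_slope=0.2):
    # LeakyReLU keeps positive values and scales negative ones
    return z if z > 0 else negative_slope * z

def softmax(values):
    # Subtract the maximum for numerical stability before exponentiating
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def gat_attention(h_i, neighbor_feats, a):
    """Attention weights of node i over its neighbors (single head)."""
    scores = []
    for h_j in neighbor_feats:
        concat = h_i + h_j  # list concatenation plays the role of [h_i || h_j]
        scores.append(leaky_relu(sum(w * v for w, v in zip(a, concat))))
    return softmax(scores)

# Hypothetical (already transformed) features and attention vector
h_i = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
a = [0.5, -0.5, 1.0, -1.0]

alpha = gat_attention(h_i, neighbors, a)
# New representation of node i: attention-weighted sum of neighbor features
h_i_new = [sum(w * h_j[d] for w, h_j in zip(alpha, neighbors)) for d in range(2)]
print(alpha, h_i_new)
```

The weights sum to one over the neighborhood, and neighbors with higher scores contribute more to the updated representation. A uniform average would instead give every neighbor the same weight.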

In [None]:
# The first part of this notebook demonstrates how to train and evaluate a GAT model on the ogbn-arxiv dataset for node classification (predicting the category of each paper). Make sure you know the basics of GNNs, PyG, and PyTorch before starting; if not, work through the corresponding tutorials first.

# If you use Google Colab, you can uncomment and run the following commands to install the required packages.
# For this assignment, we recommend using your own local computer: RAM usage is high, and Colab may crash as a result.
# We ran this assignment on our local computers using Python 3.10.16.
# !pip install torch==2.5.0
# !pip install torch-geometric==2.6.1
# !pip install ogb==1.3.6

import torch
import torch.nn.functional as F
import torch.nn as nn
from torch_geometric.nn import GATConv
from ogb.nodeproppred import PygNodePropPredDataset, Evaluator

# We recommend using a GPU for training if possible: GAT is a relatively large model, so training on a CPU can be slow. In our tests, however, training GAT on a CPU was still acceptable.


if torch.cuda.is_available():
    device = torch.device("cuda")  # NVIDIA GPU
else:
    device = torch.device("cpu")   # CPU fallback

print(f"Using device: {device}")


import random
def set_seed(seed):
    random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    # make cudnn deterministic
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# set a random seed for reproducibility
set_seed(42)



In [None]:

# ## Define a GAT Model for Node Classification

#### Model Structure

## The model consists of two GATConv layers:
##   The first layer uses multi-head attention (specified by num_heads) to produce richer representations. Here, each head outputs hidden_channels features, and they are concatenated, resulting in a total output dimension of hidden_channels * num_heads.
##   The second layer is a single-head output layer that directly predicts the final node embeddings or categories. Its output dimension corresponds to the number of classes (out_channels).

#### Layer Normalization

## After the first GAT layer, the output is passed through nn.LayerNorm(hidden_channels * num_heads). This helps stabilize training by normalizing each node's feature vector across its feature dimension.

#### Forward Pass

## First apply dropout to the input features (x) to reduce overfitting.
## Pass the data through the first GAT layer (gat1).
## Apply an ELU activation, then layer normalization, and another dropout.
## Finally, pass the features through the second GAT layer (gat2) to get the class scores (logits); the predicted paper category is the one with the highest score.

class GAT(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, num_heads=8, dropout=0.6, add_self_loops=True):
        super(GAT, self).__init__()
        self.dropout = dropout
        # The first layer: multi-head attention. Each head outputs hidden_channels features, so the total output dimension is hidden_channels * num_heads (i.e. the input dimension of the second layer). GATConv handles the concatenation itself, so you don't need to implement it in this layer.
        # The usage of GATConv is: GATConv(in_channels, out_channels, heads=XXX, concat=XXX, dropout=dropout, add_self_loops=XXX)
        ## TODO 1: Define the first GAT layer (2 points)
        
        ## TODO 1: Define the first GAT layer
        # The second layer: single-head output, not concatenated, directly output the dimension of the node categories
        ## TODO 2: Define the second GAT layer (2 points)
        
        ## TODO 2: Define the second GAT layer
        # Layer normalization for stable training, use nn.LayerNorm, the input is the hidden_channels * num_heads
        ## TODO 3: Define a LayerNorm layer (2 points)
        
        ## TODO 3: Define a LayerNorm layer

    def forward(self, data):
        # data contains x and edge_index
        x, edge_index = data.x, data.edge_index
        ## TODO 4: use a dropout layer for the input features, then apply the first GAT layer (2 points)
        # use F.dropout to perform dropout, the input is x, and the dropout rate is self.dropout, set training=self.training


        ## TODO 4: use a dropout layer for the input features, then apply the first GAT layer
        ## TODO 5: apply ELU activation, layer normalization and dropout (2 points)
        # use F.elu for ELU activation, the input is x



        ## TODO 5: apply ELU activation, layer normalization and dropout
        ## TODO 6: apply the second GAT layer (2 points)

        ## TODO 6: apply the second GAT layer
        return x



# Load the ogbn-arxiv dataset
dataset_arxiv = PygNodePropPredDataset(name='ogbn-arxiv')


# Let's see some properties of the dataset
## TODO 7: print the number of graphs in the dataset (2 points)

## TODO 7: print the number of graphs in the dataset

# Let's see how many features and classes are in the dataset
## TODO 8: print the number of features and classes in the dataset (2 points)

## TODO 8: print the number of features and classes in the dataset


data_arxiv = dataset_arxiv[0]
# Let's see the shape of the node features and the target labels
print(data_arxiv.x.shape)
print(data_arxiv.y.shape)

# Get the data split index
split_idx_arxiv = dataset_arxiv.get_idx_split()
train_idx_arxiv = split_idx_arxiv['train']
valid_idx_arxiv = split_idx_arxiv['valid']
test_idx_arxiv  = split_idx_arxiv['test']

# Model, optimizer, scheduler, and evaluator settings
model_arxiv = GAT(in_channels=dataset_arxiv.num_features,
                         hidden_channels=64,
                         out_channels=dataset_arxiv.num_classes,
                         num_heads=8,
                         dropout=0.6).to(device)
data_arxiv = data_arxiv.to(device)
optimizer_arxiv = torch.optim.Adam(model_arxiv.parameters(), lr=0.005, weight_decay=5e-4)
scheduler_arxiv = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer_arxiv, mode='min', factor=0.7, patience=10, verbose=True)
evaluator_arxiv = Evaluator(name='ogbn-arxiv')

def train_arxiv():
    model_arxiv.train()
    optimizer_arxiv.zero_grad()
    ## TODO 9: forward propagation, loss calculation and backward propagation (2 points)
    # use F.cross_entropy as the loss function, the input is the output of the model and the target labels, and only use the training set for loss calculation, then call loss.backward()



    ## TODO 9: forward propagation, loss calculation and backward propagation
    optimizer_arxiv.step()
    return loss.item()

@torch.no_grad()
def evaluate_arxiv():
    model_arxiv.eval()
    ## TODO 10: get the prediction and use argmax to get the final category (2 points)


    ## TODO 10: get the prediction and use argmax to get the final category
    train_acc = evaluator_arxiv.eval({'y_true': data_arxiv.y[train_idx_arxiv],
                                      'y_pred': y_pred[train_idx_arxiv]})['acc']
    valid_acc = evaluator_arxiv.eval({'y_true': data_arxiv.y[valid_idx_arxiv],
                                      'y_pred': y_pred[valid_idx_arxiv]})['acc']
    test_acc  = evaluator_arxiv.eval({'y_true': data_arxiv.y[test_idx_arxiv],
                                      'y_pred': y_pred[test_idx_arxiv]})['acc']
    return train_acc, valid_acc, test_acc


In [None]:
## You can change these hyperparameters to see if you can get better results, but the defaults should work. Make sure the hyperparameters are the same across all three experiments (full graph, self-loops only, no edges).
num_epochs = 30
best_valid_acc_arxiv = 0
patience_arxiv = 30
trigger_times_arxiv = 0
best_model_state_arxiv = None


In [None]:
print("---- ogbn-arxiv training start ----")
for epoch in range(1, num_epochs + 1):
    loss = train_arxiv()
    scheduler_arxiv.step(loss)
    train_acc, valid_acc, test_acc = evaluate_arxiv()
    ## TODO 11: early stopping, in an if-else block (2 points)
    if valid_acc > best_valid_acc_arxiv:



    else:
       
    ## TODO 11: early stopping, in an if-else block

    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Train: {train_acc:.4f}, Valid: {valid_acc:.4f}, Test: {test_acc:.4f}')

    if trigger_times_arxiv >= patience_arxiv:
        print("Early stopping triggered!")
        break


In [None]:
# Load the best model state
model_arxiv.load_state_dict(best_model_state_arxiv)
final_train, final_valid, final_test = evaluate_arxiv()
print(f"[ogbn-arxiv] Best validation accuracy: {final_valid:.4f}, corresponding test accuracy: {final_test:.4f}")



* What's the best validation accuracy you can get? (2 points) What's the corresponding test accuracy? (2 points) Please report the results in this markdown cell.

* Let's see if we remove the graph structure and only use the node features, how well the model can perform.

In [None]:
import copy

# construct a dummy edge_index with self-loops only
num_nodes = data_arxiv.num_nodes
dummy_edge_index = torch.arange(num_nodes, device=data_arxiv.x.device).unsqueeze(0).repeat(2, 1)

# copy the original data and replace the edge_index with the dummy one
data_arxiv_no_graph = copy.deepcopy(data_arxiv)
data_arxiv_no_graph.edge_index = dummy_edge_index

# define a new model for the data without graph structure, with the same hyperparameters
model_no_graph = GAT(
    in_channels=dataset_arxiv.num_features,
    hidden_channels=64,
    out_channels=dataset_arxiv.num_classes,
    num_heads=8,
    dropout=0.6
).to(device)


optimizer_no_graph = torch.optim.Adam(model_no_graph.parameters(), lr=0.005, weight_decay=5e-4)
scheduler_no_graph = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer_no_graph, mode='min', factor=0.7, patience=10, verbose=True
)

def train_no_graph():
    ## TODO 12: training code for the model without graph structure, should be the same as the original training code (2 points)


    # use the same model but with data without graph structure




    ## TODO 12: training code for the model without graph structure, should be the same as the original training code
    return loss.item()

@torch.no_grad()
def evaluate_no_graph():
    ## TODO 13: evaluation code for the model without graph structure, should be the same as the original evaluation code (2 points)

    ## TODO 13: evaluation code for the model without graph structure, should be the same as the original evaluation code
    return train_acc, valid_acc, test_acc


## You can change these hyperparameters to see if you can get better results, but the defaults should work. Make sure the hyperparameters are the same across all three experiments (full graph, self-loops only, no edges).
num_epochs_no_graph = 30
best_valid_acc_arxiv_no_graph = 0
patience_arxiv_no_graph = 30
trigger_times_arxiv_no_graph = 0
best_model_state_arxiv_no_graph = None
# start training and evaluation
for epoch in range(1, num_epochs_no_graph+1):
    loss = train_no_graph()
    scheduler_no_graph.step(loss)
    train_acc, valid_acc, test_acc = evaluate_no_graph()
    ## TODO 14: early stopping, in an if-else block (2 points)
    if valid_acc > best_valid_acc_arxiv_no_graph:



    else:
        
    ## TODO 14: early stopping, in an if-else block

    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Train: {train_acc:.4f}, Valid: {valid_acc:.4f}, Test: {test_acc:.4f}')

    if trigger_times_arxiv_no_graph >= patience_arxiv_no_graph:
        print("Early stopping triggered!")
        break
# Load the best model state
model_no_graph.load_state_dict(best_model_state_arxiv_no_graph)
final_train, final_valid, final_test = evaluate_no_graph()
print(f"[ogbn-arxiv_no_graph] Best validation accuracy: {final_valid:.4f}, corresponding test accuracy: {final_test:.4f}")


* What's the best validation accuracy you can get? (2 points) What's the corresponding test accuracy? (2 points) Please report the results in this markdown cell.

* Let's see if we further remove the edge structure and only use the node features, how well the model can perform.

In [None]:
import copy

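# an empty edge_index: no edges at all, not even self-loops
# (note: PyG's GATConv adds self-loops internally unless add_self_loops=False is passed to it)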
dummy_edge_index = torch.empty((2, 0), dtype=torch.long, device=data_arxiv.x.device)

# duplicate the original data and replace the edge_index with the dummy one
data_arxiv_no_edge = copy.deepcopy(data_arxiv)
data_arxiv_no_edge.edge_index = dummy_edge_index

# define a new model for the data without edges, with the same hyperparameters
model_no_edge = GAT(
    in_channels=dataset_arxiv.num_features,
    hidden_channels=64,
    out_channels=dataset_arxiv.num_classes,
    num_heads=8,
    dropout=0.6
).to(device)

optimizer_no_edge = torch.optim.Adam(model_no_edge.parameters(), lr=0.005, weight_decay=5e-4)
scheduler_no_edge = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer_no_edge, mode='min', factor=0.7, patience=10, verbose=True
)

def train_no_edge():
    ## TODO 15: training code for the model without edge structure, should be the same as the original training code (2 points)


    # use the same model architecture but with the data without edges



    ## TODO 15: training code for the model without edge structure, should be the same as the original training code
    return loss.item()

@torch.no_grad()
def evaluate_no_edge():
    ## TODO 16: evaluation code for the model without edge structure, should be the same as the original evaluation code (2 points)

    ## TODO 16: evaluation code for the model without edge structure, should be the same as the original evaluation code
    return train_acc, valid_acc, test_acc


## You can change these hyperparameters to see if you can get better results, but the defaults should work. Make sure the hyperparameters are the same across all three experiments (full graph, self-loops only, no edges).
num_epochs_no_edge = 30
best_valid_acc_no_edge = 0
patience_no_edge = 30
trigger_times_no_edge = 0
best_model_state_no_edge = None

# start training and evaluation
for epoch in range(1, num_epochs_no_edge+1):
    loss = train_no_edge()
    scheduler_no_edge.step(loss)
    train_acc, valid_acc, test_acc = evaluate_no_edge()
    ## TODO 17: early stopping, in an if-else block (2 points)
    if valid_acc > best_valid_acc_no_edge:



    else:

    ## TODO 17: early stopping, in an if-else block

    print(f"Epoch: {epoch:03d}, Loss: {loss:.4f}, Train: {train_acc:.4f}, Valid: {valid_acc:.4f}, Test: {test_acc:.4f}")

    if trigger_times_no_edge >= patience_no_edge:
        print("Early stopping triggered!")
        break

# load the best model state
model_no_edge.load_state_dict(best_model_state_no_edge)
final_train, final_valid, final_test = evaluate_no_edge()
print(f"[ogbn-arxiv_no_edge] Best validation accuracy: {final_valid:.4f}, corresponding test accuracy: {final_test:.4f}")


* What's the best validation accuracy you can get? What's the corresponding test accuracy? Please report the results in this markdown cell (2 points).
* What do you observe when comparing the results of all three settings? Please write down your observations in this markdown cell (2 points).

## Graph Convolutional Network (GCN) Implementation for Graph Classification

**Objective:** Implement a GCN from scratch to perform graph classification using an OGB dataset. Develop neural components, including the forward pass, as well as training and testing routines. There are 11 `todo`s and 7 questions for GCN.

## 1. Introduction

Provide an overview of Graph Neural Networks (GNNs) and the significance of GCNs in processing graph-structured data. Discuss the relevance of graph classification tasks and the role of the OGB datasets in benchmarking.

## 2. Dataset Exploration

### 2.1. Importing Libraries



In [None]:
import torch
import torch.nn.functional as F
from torch_geometric.loader import DataLoader  # torch_geometric.data.DataLoader is deprecated
from ogb.graphproppred import PygGraphPropPredDataset, Evaluator
import torch_geometric

### 2.2. Loading Dataset

In [None]:
# Import necessary modules
from ogb.graphproppred import PygGraphPropPredDataset  # OGB dataset loader for graph property prediction
from torch_geometric.loader import DataLoader  # PyG DataLoader for handling batches of graphs

# Load the OGB dataset for molecular property prediction.
# 'ogbg-molhiv' is a graph-level dataset where each graph represents a molecular structure,
# and the task is binary classification to predict whether the molecule inhibits HIV replication.
dataset = PygGraphPropPredDataset(name='ogbg-molhiv')

# Split the dataset into training, validation, and test sets.
# The dataset provides predefined splits to ensure consistency in evaluation.
split_idx = dataset.get_idx_split()
train_dataset = dataset[split_idx['train']]  # Training set
valid_dataset = dataset[split_idx['valid']]  # Validation set
test_dataset = dataset[split_idx['test']]    # Test set

# Create DataLoaders for batch processing.
# The DataLoader enables efficient loading of graphs in batches for training and evaluation.
# - batch_size: Number of graphs per batch.
# - shuffle: Whether to shuffle the dataset at each epoch (only done for training).
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)   # Shuffle for training
valid_loader = DataLoader(valid_dataset, batch_size=128, shuffle=False)  # No shuffle for validation
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)    # No shuffle for testing


In [None]:
# Display dataset information
print(f'Dataset name: {dataset.name}')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of tasks (output dimensions): {dataset.num_tasks}')
print(f'Number of node features: {dataset.num_node_features}')

## 3. Model Implementation
### 3.1. GCN Convolution Layer Implementation

In [None]:
import torch.nn as nn
from torch_geometric.nn import MessagePassing  # Base class for defining message-passing layers
from torch_geometric.utils import add_self_loops, degree  # Utilities for graph processing

class GCNConv(MessagePassing):
    def __init__(self, in_channels, out_channels):
        """
        Custom implementation of a Graph Convolutional Network (GCN) layer.

        Args:
            in_channels (int): Number of input node features.
            out_channels (int): Number of output node features.

        The layer follows the formulation:
            H' = σ(D^(-1/2) A D^(-1/2) H W)
        where:
            - A is the adjacency matrix with self-loops.
            - D is the degree matrix.
            - H is the input node feature matrix.
            - W is the trainable weight matrix.
            - σ is a non-linearity (like ReLU).
        """
        super(GCNConv, self).__init__(aggr='add')  # "Add" aggregation means summing messages from neighbors.

        # TODO 1: Define a linear transformation layer for node features (2 points)
        self.linear = nn.Linear(______, ______)
        # TODO 1: Define a linear transformation layer for node features

    def forward(self, x, edge_index):
        """
        Forward pass of the GCN layer.

        Args:
            x (Tensor): Node feature matrix of shape [num_nodes, in_channels].
            edge_index (Tensor): Graph connectivity in COO format, shape [2, num_edges].

        Returns:
            Tensor: Updated node features of shape [num_nodes, out_channels].
        """

        # TODO 2: Add self-loops to the adjacency matrix (2 points)
        edge_index, _ = add_self_loops(______, num_nodes=______)
        # TODO 2: Add self-loops to the adjacency matrix

        # TODO 3: Compute node degrees (2 points)
        row, col = edge_index  # Extract row (source nodes) and col (destination nodes)
        deg = degree(______, x.size(0), dtype=x.dtype)  # Compute degree of each node
        deg_inv_sqrt = deg.pow(-0.5)  # Compute D^(-1/2)
        # TODO 3: Compute node degrees 

        # Prevent division by zero for isolated nodes
        deg_inv_sqrt[deg_inv_sqrt == float('inf')] = 0  

        # TODO 4: Compute normalized adjacency matrix (2 points)
        norm = ______
        # TODO 4: Compute normalized adjacency matrix 

        # TODO 5: Perform message passing using PyTorch Geometric's propagate function (2 points)
        out = self.propagate(______, x=x, norm=norm)
        # TODO 5: Perform message passing using PyTorch Geometric's propagate function 

        # TODO 6: Apply the linear transformation after aggregation (2 points)
        out = ______
        # TODO 6: Apply the linear transformation after aggregation 

        return out

    def message(self, x_j, norm):
        """
        Message function: Defines how information is aggregated from neighboring nodes.

        Args:
            x_j (Tensor): Features of neighboring nodes.
            norm (Tensor): Normalization coefficients computed from node degrees.

        Returns:
            Tensor: Normalized node features.
        """
        # TODO 7: Apply normalization to the node features (2 points)
        return ______
        # TODO 7: Apply normalization to the node features


## TODO:
### Why is it important to add self-loops in the graph convolution process, and how does the normalization strategy here help stabilize training in graph neural networks? (2 points)
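
As a small worked illustration of the formula in the `GCNConv` docstring, the sketch below builds the normalized matrix D^(-1/2) (A + I) D^(-1/2) densely for a hypothetical 3-node path graph (0 - 1 - 2) in plain Python. It is only meant to show what the normalization coefficients look like; the layer above works edge-wise on `edge_index` instead of on a dense matrix, and the function name is an assumption made for this example.

```python
import math

def normalized_adjacency(num_nodes, edges):
    """Return D^(-1/2) (A + I) D^(-1/2) as a dense list-of-lists matrix."""
    a_hat = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:           # undirected edges: fill both directions
        a_hat[i][j] = 1.0
        a_hat[j][i] = 1.0
    for i in range(num_nodes):   # add self-loops
        a_hat[i][i] = 1.0
    deg = [sum(row) for row in a_hat]  # degrees include the self-loop
    return [
        [a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
        for i in range(num_nodes)
    ]

# Hypothetical path graph 0 - 1 - 2
norm_adj = normalized_adjacency(3, [(0, 1), (1, 2)])
for row in norm_adj:
    print([round(v, 3) for v in row])
```

Each entry is scaled by the degrees of both endpoints, so the matrix stays symmetric and the contribution of high-degree nodes is damped.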

### 3.2. Implement GCN

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import global_mean_pool

class GCN(torch.nn.Module):
    def __init__(self, num_node_features, hidden_channels, num_classes, dropout_rate):
        """
        Graph Convolutional Network (GCN) for graph classification.

        Args:
            num_node_features (int): Number of input node features.
            hidden_channels (int): Number of hidden units in the GCN layers.
            num_classes (int): Number of output classes (for classification tasks).
            dropout_rate (float): Dropout probability for regularization.

        This model consists of:
        - Two GCNConv layers to extract graph structure-aware features.
        - ReLU activation and dropout for regularization.
        - A global mean pooling layer to aggregate node features into graph-level representations.
        - A final linear layer for classification.
        """
        super(GCN, self).__init__()

        self.dropout_rate = dropout_rate  # Store dropout rate

        # TODO 8: Complete the network layer definitions (2 points)
        # First graph convolution layer: transforms node features to hidden dimension
        # Second graph convolution layer: refines node embeddings
        # Final linear layer to produce class predictions from the pooled graph representation
        # TODO 8: Complete the network layer definitions (2 points)

    def forward(self, x, edge_index, batch):
        """
        Forward pass through the GCN.

        Args:
            x (Tensor): Node feature matrix of shape [num_nodes, num_node_features].
            edge_index (Tensor): Graph connectivity in COO format [2, num_edges].
            batch (Tensor): Batch index for each node, used for global pooling.

        Returns:
            Tensor: Output class logits of shape [num_graphs, num_classes].
        """

        # TODO 9: Complete the implementation of the forward function (2 points)

        # First GCN layer: apply convolution, activation, and dropout
        # ReLU activation layer
        # Dropout layer for regularization
        # Second GCN layer
        # Global mean pooling to aggregate node features into a graph 
        #   representation (Output shape: [num_graphs, hidden_channels])
        # Final linear layer for classification (Output shape: [num_graphs, num_classes])
        # TODO 9: Complete the implementation of the forward function (2 points)

        return x


## 4. Model Training
### 4.1. Implement Training and Evaluation Function

In [None]:
def train(model, loader, device):
    """
    Training function for the GCN model.

    Args:
        model (torch.nn.Module): The GCN model.
        loader (DataLoader): DataLoader for the training set.
        device (torch.device): Device to run computations on (CPU or GPU).

    Note:
        The loss function and the optimizer are not parameters of this
        function, so your implementation has to obtain them elsewhere
        (for example, from objects defined at notebook level).

    Returns:
        float: Average training loss per graph.
    """

    model.train()
    total_loss = 0
    # Iterate over batches in the DataLoader
    for data in loader:
        data = data.to(device)

        # TODO 10: Complete the training function (2 points)
        # Zero the gradients of the Adam optimizer from `torch.optim`

        # Perform a forward pass through the model

        # Compute the loss using `BCEWithLogitsLoss()`

        # Perform backpropagation

        # Update model parameters

        # TODO 10: Complete the training function

        # Accumulate loss
        total_loss += loss.item() * data.num_graphs

    return total_loss / len(loader.dataset)


## TODO:
### How does the choice of the BCEWithLogitsLoss and the Adam optimizer influence the training dynamics in this setting? (2 points)
### What are potential alternative choices and their pros/cons for graph-based binary classification tasks? (2 points)
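
For reference when thinking about these questions, the sketch below shows in plain Python what a binary cross-entropy loss on logits computes for a single example. It compares a numerically stable form against the textbook form that first applies a sigmoid. The function names are made up for this illustration, and this is not PyTorch's actual implementation.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy computed directly from a logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def bce_from_probability(logit, target):
    """Textbook form: apply a sigmoid first, then take the cross-entropy."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# Hypothetical (logit, target) pairs
for logit, target in [(2.0, 1.0), (-1.5, 0.0), (0.3, 1.0)]:
    print(logit, target, bce_with_logits(logit, target), bce_from_probability(logit, target))
```

Both forms agree for moderate logits, while the first one stays finite for very large magnitudes. When using `torch.nn.BCEWithLogitsLoss`, keep in mind that the targets must be floating point and have the same shape as the logits.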

In [None]:
def evaluate(model, loader, evaluator, device):
    model.eval()
    y_true = []
    y_pred = []
    with torch.no_grad():
        for data in loader:
            data = data.to(device)
            out = model(data.x, data.edge_index, data.batch)
            y_true.append(data.y.view(-1, 1))
            y_pred.append(out)
    y_true = torch.cat(y_true, dim=0)
    y_pred = torch.cat(y_pred, dim=0)
    return evaluator.eval({'y_true': y_true, 'y_pred': y_pred})


## TODO:
### What potential pitfalls can arise during the evaluation phase of graph neural networks? (2 points)
### How might one detect issues like overfitting or underfitting using evaluation metrics such as ROC-AUC? (2 points)
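
As background for the ROC-AUC questions, here is a minimal sketch of the metric as a ranking statistic: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half. The labels and scores are hypothetical, and the quadratic loop is for illustration only. Library implementations such as the one behind the OGB evaluator are more efficient.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC is undefined when only one class is present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical labels and raw model outputs (logits)
labels = [0, 0, 1, 1]
logits = [-2.0, 0.5, 0.1, 3.0]
print(roc_auc(labels, logits))
```

Because only the ordering of the scores matters, any strictly increasing transformation of the scores (such as a sigmoid) leaves the value unchanged, which is why raw logits can be passed to the evaluator.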

### 4.2. Training Loop (Hyper-parameter Tuning to reach `Validation AUC` $\ge$ 0.7 ) 

1 pt: `Validation AUC` $\ge$ 0.69


1.5 pts: `Validation AUC` $\ge$ 0.71


2 pts: `Validation AUC` $\ge$ 0.73

In [None]:
# TODO 11: Tune your hyperparameters to achieve high performance (2 points)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GCN(num_node_features=dataset.num_node_features, hidden_channels=32, num_classes=dataset.num_tasks, dropout_rate=0.5).to(device)
evaluator = Evaluator(name='ogbg-molhiv')
# TODO 11: Tune your hyperparameters to achieve high performance



In [None]:
from tqdm import tqdm
num_epochs = 5
best_valid_auc = 0

for epoch in tqdm(range(1, num_epochs + 1)):
    # Training
    train_loss = train(model, train_loader, device)

    # Evaluation
    train_result = evaluate(model, train_loader, evaluator, device)
    valid_result = evaluate(model, valid_loader, evaluator, device)

    train_auc = train_result['rocauc']
    valid_auc = valid_result['rocauc']

    # Print metrics
    print(f'Epoch: {epoch:03d}, '
          f'Train Loss: {train_loss:.4f}, '
          f'Train AUC: {train_auc:.4f}, '
          f'Validation AUC: {valid_auc:.4f}')

    # Track the best validation AUC (the model weights themselves are not saved here)
    if valid_auc > best_valid_auc:
        best_valid_auc = valid_auc
        print(f'Best model with Validation AUC: {best_valid_auc:.4f}')


### OPEN ENDED QUESTION:
#### Reflect on the entire implementation of the GCN model for molecular property prediction.
#### - What potential improvements or modifications could you suggest to enhance the model's performance? Consider aspects such as network architecture, hyperparameter tuning, data preprocessing, and evaluation strategies. (2 points)
#### - How might you leverage additional graph-specific techniques or modern architectures to push the boundaries of performance on this task? (2 points)

