<a href="https://colab.research.google.com/github/starkjiang/TrAC-GNN/blob/main/Homework/module2_gnn_homework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Homework for Module 2 of Graph Neural Network**

In this Colab, we will implement GraphSAGE (https://arxiv.org/pdf/1706.02216) and then run our model on the PubMed and Cora datasets. Both are standard citation network benchmark datasets (https://arxiv.org/pdf/1603.08861) and are built into PyG. See `torch_geometric.datasets` for more details.

**Note: Make sure to run all the cells in each section in order so that the intermediate variables and packages carry over to the next cell.**

# Device

You might want to use a GPU for this Colab.

Click `Runtime`, then `Change runtime type`, and set the hardware accelerator to **GPU**.

# Installation

In [None]:
"""Install packages that are required to execute this notebook.

"""
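# `torch` must be imported before the last `pip` line below, because IPython
# interpolates `{torch.__version__}` into the wheel index URL.
import torch

# Install packages. This may take some time.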
!pip install torch-scatter~=2.1.0
!pip install torch-sparse~=0.6.16
!pip install torch-cluster~=1.6.0
!pip install torch-spline-conv~=1.2.1
!pip install torch-geometric==2.2.0 -f https://data.pyg.org/whl/torch-{torch.__version__}.html

# !rm -rf /root/.pyg/Planetoid
# !rm -rf /root/.torch_geometric
# !rm -rf ./Pubmed

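# Prevent a version incompatibility issue: recent PyTorch releases default
# `torch.load` to `weights_only=True`, which can fail on PyG's cached dataset
# files, so we restore `weights_only=False`.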
import functools
import torch
old_load = torch.load
torch.load = functools.partial(old_load, weights_only=False)

# Set random seed.
torch.manual_seed(0)
torch.cuda.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# PubMed Dataset

**Question 1:** Write code to report the number of graphs, the number of nodes, the number of features, and the number of classes in this dataset. Label each value clearly rather than printing a bare number (4 points).

In [None]:
# Use the PubMed dataset from PyG.
from torch_geometric.datasets import Planetoid
dataset_pm = Planetoid(root='.', name="Pubmed")

data_pm = dataset_pm[0]

###############################################################################
# TODO: Your code here.
# Print the information about the graph here.
# Our implementation is ~5 lines, but don't worry if you deviate from this.
###############################################################################

**Question 2:** Report the number of training nodes, the number of validation nodes, the number of test nodes, whether the edges are directed, whether the graph has isolated nodes, and whether the graph has self-loops (6 points).
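
To make the three structural properties concrete, here is a minimal pure-Python sketch on a made-up toy graph. The edge list, node count, and variable names are illustrative assumptions and are not part of the assignment. It only shows what each property means, and library helpers may treat edge cases differently (for example, a node whose only edge is a self-loop).

```python
# Toy graph in COO format: edge_index[0] holds sources, edge_index[1] targets.
# The node count and edges below are made up for illustration.
num_nodes = 5
edge_index = [
    [0, 1, 1, 2, 3],
    [1, 0, 2, 1, 3],
]

edges = set(zip(edge_index[0], edge_index[1]))

# Undirected: every edge (u, v) has a matching reverse edge (v, u).
is_undirected = all((v, u) in edges for u, v in edges)

# Self-loop: an edge whose source and target are the same node.
has_self_loops = any(u == v for u, v in edges)

# Isolated node (in this sketch): a node that appears in no edge at all.
touched = set(edge_index[0]) | set(edge_index[1])
isolated = sorted(set(range(num_nodes)) - touched)

print(f"undirected: {is_undirected}")
print(f"self-loops: {has_self_loops}")
print(f"isolated nodes: {isolated}")
```

In this toy graph, node 4 appears in no edge and node 3 is connected only to itself.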

In [None]:
# Print information about the graph.
###############################################################################
# TODO: Your code here.
# Print the information about the graph here.
# Our implementation is ~6 lines, but don't worry if you deviate from this.
###############################################################################

# Neighbor Sampling

We use the `NeighborLoader` class to create mini-batches via neighbor sampling. **Please do not modify the default parameters, as they are used for grading.**
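
As a rough intuition for what a list of fanouts such as `num_neighbors=[5, 10]` means, the sketch below samples a multi-hop neighborhood around seed nodes in plain Python. The function `sample_subgraph`, the toy adjacency dictionary, and the smaller fanouts are hypothetical, and the procedure is a simplification rather than the sampler that `NeighborLoader` actually uses.

```python
import random


def sample_subgraph(adj, seeds, fanouts, rng):
    """Sample up to fanouts[k] neighbors for each frontier node at hop k."""
    nodes = set(seeds)
    edges = set()
    frontier = list(seeds)
    for fanout in fanouts:
        next_frontier = []
        for v in frontier:
            neighbors = adj.get(v, [])
            # Keep at most `fanout` neighbors, chosen without replacement.
            picked = rng.sample(neighbors, min(fanout, len(neighbors)))
            for u in picked:
                edges.add((u, v))  # message flows from neighbor u to node v
                if u not in nodes:
                    nodes.add(u)
                    next_frontier.append(u)
        # Only newly discovered nodes are expanded at the next hop.
        frontier = next_frontier
    return nodes, edges


# Toy graph: a ring of 8 nodes plus one chord between nodes 0 and 4.
adj = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
adj[0].append(4)
adj[4].append(0)

rng = random.Random(0)
nodes, edges = sample_subgraph(adj, seeds=[0], fanouts=[2, 1], rng=rng)
print(sorted(nodes))
print(sorted(edges))
```

With fanouts `[2, 1]`, the seed keeps two of its three neighbors, and each of those keeps one neighbor of its own, so the sampled subgraph stays small no matter how large the full graph is.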

In [None]:
from torch_geometric.loader import NeighborLoader
from torch_geometric.utils import to_networkx

# Create batches with neighbor sampling.
train_loader = NeighborLoader(
    data_pm,
    num_neighbors=[5, 10],
    batch_size=16,
    input_nodes=data_pm.train_mask,
)

# Print each subgraph.
for i, subgraph in enumerate(train_loader):
    print(f'Subgraph {i}: {subgraph}')

In [None]:
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

# Plot the first four subgraphs in a 2x2 grid.
fig = plt.figure(figsize=(16,16))
for idx, (subdata, pos) in enumerate(zip(train_loader, [221, 222, 223, 224])):
    G = to_networkx(subdata, to_undirected=True)
    ax = fig.add_subplot(pos)
    ax.set_title(f'Subgraph {idx}', fontsize=24)
    plt.axis('off')
    nx.draw_networkx(
        G,
        pos=nx.spring_layout(G, seed=0),
        with_labels=False,
        node_color=subdata.y,
    )
plt.show()

# GraphSAGE Implementation

**Question 3:** In this block, we build the `GraphSAGE` module used for model training and testing in the following sections, based on the `SAGEConv` layer. In the `GraphSAGE` module, implement **2** `SAGEConv` layers using the input dimension, the hidden size, and the output dimension. Then implement the **forward** function, which adds nonlinearity to the model: between the 2 layers, use the **ReLU** activation function and a **Dropout layer** with a rate of **0.5**. The training and testing functions are provided in the same module. **Please do not modify them for grading purposes** (5 points).
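
For intuition about what a single GraphSAGE layer with a mean aggregator computes, the pure-Python sketch below applies the update h_v' = W_self · h_v + W_neigh · mean(h_u for u in N(v)) to a toy graph. All names (`sage_mean_layer`, `W_self`, `W_neigh`) and all numbers are illustrative assumptions. The sketch omits bias terms, dropout, and batching, and it is not a substitute for the `SAGEConv` layers required above.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mean_vectors(vectors, dim):
    """Element-wise mean of a list of vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def relu(x):
    return [max(0.0, xi) for xi in x]


def sage_mean_layer(h, adj, W_self, W_neigh):
    """One mean-aggregator layer: combine each node with its neighbor mean."""
    out = {}
    for v, h_v in h.items():
        neigh = mean_vectors([h[u] for u in adj.get(v, [])], len(h_v))
        self_part = matvec(W_self, h_v)
        neigh_part = matvec(W_neigh, neigh)
        out[v] = [a + b for a, b in zip(self_part, neigh_part)]
    return out


# Toy inputs (made up): 3 nodes with 2-dimensional features.
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
adj = {0: [1, 2], 1: [0], 2: []}  # adj[v] lists the neighbors of v
W_self = [[1.0, 0.0], [0.0, 1.0]]
W_neigh = [[1.0, 0.0], [0.0, -2.0]]

out = sage_mean_layer(h, adj, W_self, W_neigh)
activated = {v: relu(x) for v, x in out.items()}
print(out)
print(activated)
```

Node 0 averages the features of nodes 1 and 2 before mixing them with its own features, and the ReLU then zeroes out the negative component. Stacking two such layers lets each node see information from its 2-hop neighborhood.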

In [None]:
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv


def accuracy(pred_y, y):
    """Calculate accuracy."""
    return ((pred_y == y).sum() / len(y)).item()


class GraphSAGE(torch.nn.Module):
    """GraphSAGE."""
    def __init__(self, dim_in, dim_h, dim_out):
        super().__init__()

        self.sage1 = None
        self.sage2 = None
        #######################################################################
        # TODO: Your code here!
        # Please construct two GraphSAGE layers using SAGEConv.
        # Our implementation is ~2 lines, but don't worry if you deviate
        # from this.
        #######################################################################


    def forward(self, x, edge_index):
        #######################################################################
        # TODO: Your code here!
        # Please implement the forward function based on the requirements. Also,
        # modify the return value to reflect your implementation.
        # Our implementation is ~3 lines, but don't worry if you deviate
        # from this.
        #######################################################################

        return x

    def fit(self, loader, epochs):
        criterion = torch.nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(self.parameters(), lr=0.01)

        self.train()
        for epoch in range(epochs+1):
            total_loss = 0
            acc = 0
            val_loss = 0
            val_acc = 0

            # Train on batches.
            for batch in loader:
                optimizer.zero_grad()
                out = self(batch.x, batch.edge_index)
                loss = criterion(
                    out[batch.train_mask],
                    batch.y[batch.train_mask]
                )
                total_loss += loss.item()
                acc += accuracy(
                    out[batch.train_mask].argmax(dim=1),
                    batch.y[batch.train_mask]
                )
                loss.backward()
                optimizer.step()

                # Validation.
                val_loss += criterion(
                    out[batch.val_mask],
                    batch.y[batch.val_mask]
                )
                val_acc += accuracy(
                    out[batch.val_mask].argmax(dim=1),
                    batch.y[batch.val_mask]
                )

            # Print metrics every 20 epochs.
            if epoch % 20 == 0:
                print(f'Epoch {epoch:>3} | Train Loss: {total_loss/len(loader):.3f} '
                      f'| Train Acc: {acc/len(loader)*100:>6.2f}% '
                      f'| Val Loss: {val_loss/len(loader):.2f} '
                      f'| Val Acc: {val_acc/len(loader)*100:.2f}%'
                     )

    @torch.no_grad()
    def test(self, data):
        self.eval()
        out = self(data.x, data.edge_index)
        acc = accuracy(
            out.argmax(dim=1)[data.test_mask],
            data.y[data.test_mask]
        )
        return acc

**Question 4:** What is the maximum test accuracy you can achieve with GraphSAGE? (5 points)

Feel free to tune the hyperparameters to get your value. Make sure that running the cell reproduces the accuracy you report. If the two values differ, you will lose the points.

In [None]:
# Create GraphSAGE.
graphsage = GraphSAGE(dataset_pm.num_features, 32, dataset_pm.num_classes)
print(graphsage)

# Train.
graphsage.fit(train_loader, 100)

# Test.
acc = graphsage.test(data_pm)
print(f'GraphSAGE test accuracy: {acc*100:.2f}%')

# Cora Dataset

**Question 5:** Use another dataset, Cora, to train and test a separate model. Cora is also a standard citation network benchmark dataset. **Follow steps similar to those above and report your best test accuracy.** You also need to print the information about the graph and show the neighbor sampling (10 points).

In [None]:
################################################################################
# TODO: Your code here!
# Please import the Cora dataset from PyG and follow similar steps to report
# the information about the graph. Then implement the neighbor sampling.
# Finally, create the GraphSAGE model you have implemented, and train and test
# it on this dataset.
################################################################################

# Submission

To receive points, you must answer all the questions listed above. Make sure the output of each code cell is visible in your submitted `.ipynb` file, since the outputs of specific cells will be checked during grading. To submit, run the notebook in your own Colab account and share the link. If you prefer to run the notebook locally on your machine, upload the completed notebook to your Colab account and share the link. In case of technical issues, you may also email me your file directly. Please name the file in the format `Name_module2_gnn_homework.ipynb`.