<a href="https://colab.research.google.com/github/sjdunand/NodeClassification/blob/main/CSCE533_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, let's make sure all dependencies are properly installed. This is a slightly unusual way to do it, but pip was getting stuck on "Building wheel" for torch-sparse. Pointing pip at the prebuilt PyG wheels that match the installed torch version avoids the source build. The fix is from this StackOverflow discussion:
https://stackoverflow.com/questions/67285115/building-wheels-for-torch-sparse-in-colab-takes-forever

In [None]:
import torch

!pip uninstall -y torch-scatter torch-sparse torch-geometric torch-cluster
!pip install torch-scatter -f https://data.pyg.org/whl/torch-{torch.__version__}.html
!pip install torch-sparse -f https://data.pyg.org/whl/torch-{torch.__version__}.html
!pip install torch-cluster -f https://data.pyg.org/whl/torch-{torch.__version__}.html
!pip install git+https://github.com/pyg-team/pytorch_geometric.git


# Let's set a seed for both the CPU and GPU to ensure reproducibility.
The following code was from ChatGPT 11/23/2024 using the prompt:
"What is the best way to set a random seed for PyTorch to ensure reproducability?"


In [None]:
import numpy as np
import random
import torch
seed = 42
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(seed)
random.seed(seed)

We will first use a somewhat "default" GCN model, with no specific hyperparameter tuning or data transformations. The accuracy and AUC results from this run (on the Cora dataset) will serve as a baseline for judging the possible improvements we try later on.
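
Since AUC is one of the two baseline metrics, here is a minimal pure-Python sketch of what a macro-averaged one-vs-rest AUC measures. The helper names `binary_auc` and `macro_ovr_auc` and the toy probabilities are made up for illustration; the notebook itself relies on scikit-learn's `roc_auc_score`, and this sketch is only meant to show the idea behind that number.

```python
def binary_auc(scores, labels):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(probs, y_true, num_classes):
    """Average the binary AUC of each class against all other classes."""
    per_class = []
    for c in range(num_classes):
        scores = [row[c] for row in probs]              # predicted P(class c)
        labels = [1 if y == c else 0 for y in y_true]   # class c vs. the rest
        per_class.append(binary_auc(scores, labels))
    return sum(per_class) / num_classes


# Toy example (hypothetical data): 6 nodes, 3 classes.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
    [0.5, 0.4, 0.1],
    [0.3, 0.3, 0.4],
    [0.4, 0.5, 0.1],
]
y_true = [0, 1, 2, 1, 2, 0]

auc = macro_ovr_auc(probs, y_true, num_classes=3)
print(f"macro one-vs-rest AUC: {auc:.4f}")
```

Unlike accuracy, this score depends on how the predicted probabilities rank the nodes, not only on which class gets the highest probability.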

In [None]:
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

# load cora using pytorch geometric
dataset = Planetoid(root='/tmp/Cora', name='Cora')


# create graph conv model
class GCN(torch.nn.Module):
    def __init__(self):
        super(GCN, self).__init__()
        # starting with 2 layers
        self.conv1 = GCNConv(dataset.num_node_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, data): # standard forward pass. no relu after the second conv: log_softmax turns its raw scores into log-probabilities for nll_loss
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

# send to gpu if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GCN().to(device)
data = dataset[0].to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4) # "default"-ish parameters

def train():
    model.train()
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask]) # mask on trainable samples (still somewhat weird to me)
    loss.backward()
    optimizer.step()
    return loss.item()

# for validation
def test():
    model.eval()
    logits, accs = model(data), []
    for mask in [data.train_mask, data.val_mask]: # use the "validation" masks this time
        pred = logits[mask].max(1)[1]
        acc = pred.eq(data.y[mask]).sum().item() / mask.sum().item()
        accs.append(acc)
    return accs

from sklearn.metrics import roc_auc_score

# keep a held-aside set of data for a final evaluation
def final_test():
    model.eval()
    logits = model(data)
    test_mask = data.test_mask

    preds = logits[test_mask].max(1)[1].cpu().numpy()
    probs = F.softmax(logits[test_mask], dim=1).cpu().detach().numpy() # logits -> probabilities
    true_labels = data.y[test_mask].cpu().numpy()

    accuracy = (preds == true_labels).sum() / test_mask.sum().item()

    # compute AUC
    auc = roc_auc_score(
        F.one_hot(torch.tensor(true_labels), dataset.num_classes).numpy(),
        probs,
        multi_class="ovr",
    )

    return accuracy, auc


# training loop
best_val_acc = 0
for epoch in range(200):
    loss = train()
    if epoch % 10 == 0:
        train_acc, val_acc = test()
        print(f'Epoch {epoch:03d}, Loss: {loss:.4f}, Train Acc: {train_acc:.4f}, '
              f'Val Acc: {val_acc:.4f}')

        # cool feature of loading a previously "best" performing model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
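            # clone the tensors: state_dict() shares storage with the live parameters, so later training steps would overwrite an uncopied "best" state
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}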

# use best model on the final test set
model.load_state_dict(best_model_state)
test_acc, test_auc = final_test()
print(f"Final Test Accuracy: {test_acc:.4f}, Test AUC: {test_auc:.4f}")
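
Restoring the "best" checkpoint only works if the saved state is an independent copy. As a minimal sketch (using a toy `torch.nn.Linear` layer rather than the GCN above, and an in-place `add_` as a stand-in for an optimizer step), the tensors returned by `state_dict()` keep tracking the live parameters, while a deep copy stays frozen:

```python
import copy

import torch

torch.manual_seed(0)
layer = torch.nn.Linear(2, 1)  # toy stand-in for the model

alias = layer.state_dict()                     # shares storage with the parameters
snapshot = copy.deepcopy(layer.state_dict())   # independent copy of the tensors

with torch.no_grad():
    layer.weight.add_(1.0)  # stand-in for an optimizer update

alias_tracks_update = torch.equal(alias["weight"], layer.weight)
snapshot_is_frozen = not torch.equal(snapshot["weight"], layer.weight)
print(alias_tracks_update, snapshot_is_frozen)
```

If the saved state is only an alias, loading it back later just reloads whatever weights the model ended up with.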

# Next, let's try implementing k-fold cross validation, as the CORA dataset is somewhat small (at least compared to traditional deep learning benchmark datasets)

We will mainly do this because our baseline model appears to overfit the data a bit (the training accuracy hit 100% very quickly).

We will split the dataset 80/20 into a train+validation set and a held-out test set, respectively.

Using 5-fold cross validation on the 80% portion, we will checkpoint each fold's model at its best validation accuracy, report the average validation accuracy and AUC across folds, and then run a final accuracy and AUC test on the unseen held-out data.
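
To make the splitting scheme concrete, here is a small pure-Python sketch of an 80/20 hold-out followed by 5 folds. The function name `holdout_and_folds` is hypothetical, the node count of 2708 is assumed to be Cora's, and the strided fold assignment is a simplification; it is not how scikit-learn's `train_test_split` and `KFold` pick the exact indices.

```python
import random


def holdout_and_folds(num_nodes, test_fraction=0.2, num_folds=5, seed=42):
    """Split node indices into a held-out test set and k train/val folds."""
    rng = random.Random(seed)
    indices = list(range(num_nodes))
    rng.shuffle(indices)

    num_test = round(num_nodes * test_fraction)
    test_idx = indices[:num_test]        # never touched during cross validation
    train_val_idx = indices[num_test:]

    # Deal the remaining indices into k folds of near-equal size.
    folds = [train_val_idx[i::num_folds] for i in range(num_folds)]

    splits = []
    for k in range(num_folds):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train_idx, val_idx))
    return test_idx, splits


test_idx, splits = holdout_and_folds(num_nodes=2708)  # 2708: assumed size of Cora
print(f"test nodes: {len(test_idx)}")
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}")
```

Every non-test node is used for validation exactly once, and the test nodes stay out of every fold.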

In [None]:
from sklearn.model_selection import KFold, train_test_split
import torch

def kFoldTrain(numFolds, modelType, num_epochs):

  # use sklearn to split the data easily
  data = dataset[0]
  data.to(device)
  num_nodes = data.num_nodes
  train_val_idx, test_idx = train_test_split(range(num_nodes), test_size=0.2, random_state=42)

  # keep a held aside testing set that remains unseen until the very end
  data.test_mask = torch.zeros(num_nodes, dtype=torch.bool)
  data.test_mask[test_idx] = True

  # use sklearn again for k-fold cross validation
  kf = KFold(n_splits=numFolds, shuffle=True, random_state=42)
  fold_accuracies = []
  fold_aucs = []

  # main loop
  for fold_idx, (train_idx, val_idx) in enumerate(kf.split(train_val_idx)):
      print(f"Fold {fold_idx + 1}/{numFolds}")

      # get train and val data for this fold
      train_fold_idx = [train_val_idx[i] for i in train_idx]
      val_fold_idx = [train_val_idx[i] for i in val_idx]

      # just to be able to use different models (later)
      if (modelType == "BaseGCN"):
        model = GCN().to(device)
      elif (modelType == "DeeperGCN"):
        model = DeeperGCN().to(device)
      else:
        print("Invalid model type")
        return

      # same optimizer and hyperparameters
      optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

      # set up masks
      data.train_mask = torch.zeros(num_nodes, dtype=torch.bool)
      data.val_mask = torch.zeros(num_nodes, dtype=torch.bool)
      data.train_mask[train_fold_idx] = True
      data.val_mask[val_fold_idx] = True

      # train!
      best_val_acc = 0
      for epoch in range(num_epochs):
          model.train()
          optimizer.zero_grad()
          out = model(data)
          loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask]) # negative log likelihood. good for multiclass classifier
          loss.backward() # compute gradients
          optimizer.step() # update parameters

          # validation
          if epoch % 10 == 0:
              model.eval()
              logits = model(data)
              train_pred = logits[data.train_mask].max(1)[1]
              val_pred = logits[data.val_mask].max(1)[1]
              train_acc = train_pred.eq(data.y[data.train_mask]).sum().item() / data.train_mask.sum().item()
              val_acc = val_pred.eq(data.y[data.val_mask]).sum().item() / data.val_mask.sum().item()
              print(f'Epoch {epoch:03d}, Train Acc: {train_acc:.4f}, Val Acc: {val_acc:.4f}')


              if val_acc > best_val_acc:
                  best_val_acc = val_acc
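                  # clone the tensors: state_dict() shares storage with the live parameters
                  best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}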

      model.load_state_dict(best_model_state)
      model.eval()
      logits = model(data)
      val_pred = logits[data.val_mask].max(1)[1]
      val_acc = val_pred.eq(data.y[data.val_mask]).sum().item() / data.val_mask.sum().item()

      # AUC
      probs = F.softmax(logits[data.val_mask], dim=1).cpu().detach().numpy()
      true_labels = data.y[data.val_mask].cpu().numpy()
      auc = roc_auc_score(
          F.one_hot(torch.tensor(true_labels), dataset.num_classes).numpy(),
          probs,
          multi_class="ovr",
      )

      print(f"Fold {fold_idx + 1} Val Accuracy: {val_acc:.4f}, Val AUC: {auc:.4f}")
      fold_accuracies.append(val_acc)
      fold_aucs.append(auc)

  # performance metrics
  avg_accuracy = sum(fold_accuracies) / numFolds
  avg_auc = sum(fold_aucs) / numFolds
  print(f"\nK-Fold Cross-Validation Results:")
  print(f"Average Validation Accuracy: {avg_accuracy:.4f}, Average Validation AUC: {avg_auc:.4f}")

  # final evaluation on held-aside data
  model.load_state_dict(best_model_state)  # checkpoint saved during the last fold (best_model_state is overwritten in every fold)
  model.eval()
  logits = model(data)
  test_mask = data.test_mask

  # final accuracy
  test_pred = logits[test_mask].max(1)[1]
  test_acc = test_pred.eq(data.y[test_mask]).sum().item() / test_mask.sum().item()

  # final auc
  test_probs = F.softmax(logits[test_mask], dim=1).cpu().detach().numpy()
  test_labels = data.y[test_mask].cpu().numpy()
  test_auc = roc_auc_score(
      F.one_hot(torch.tensor(test_labels), dataset.num_classes).numpy(),
      test_probs,
      multi_class="ovr",
  )

  print(f"Hold-out Test Accuracy: {test_acc:.4f}, Test AUC: {test_auc:.4f}")


# able to change model or num epochs
kFoldTrain(5, "BaseGCN", 200)

Now that we've seen some significant improvements in accuracy using k-fold cross validation, let's see if a deeper model can help: wider GCN layers followed by a couple of fully connected layers!

In [None]:
# create graph conv model
class DeeperGCN(torch.nn.Module):
    def __init__(self):
        super(DeeperGCN, self).__init__()
        # 2 gcnconv layers
        self.conv1 = GCNConv(dataset.num_node_features, 32)
        self.conv2 = GCNConv(32, 16)

        # try some fully connected layers!
        self.fc1 = torch.nn.Linear(16, 32)
        self.fc2 = torch.nn.Linear(32, dataset.num_classes)

    def forward(self, data): # forward pass: two GCN layers (each with relu + dropout), then two fully connected layers and log_softmax
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)

        # fully connected layers
        x = F.relu(self.fc1(x))
        x = self.fc2(x)

        return F.log_softmax(x, dim=1)

kFoldTrain(5, "DeeperGCN", 200)

Let's try one final experiment: increase the training time from 200 epochs to 1,000!

In [None]:
kFoldTrain(5, "DeeperGCN", 1000)

# Now we will run similar experiments using the Citeseer dataset!

First, we will use the baseline training algorithm with our "default" GCN model.

In [None]:
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

dataset = Planetoid(root='/tmp/Citeseer', name='Citeseer')

# create graph conv model
class GCN(torch.nn.Module):
    def __init__(self):
        super(GCN, self).__init__()
        # starting with 2 layers
        self.conv1 = GCNConv(dataset.num_node_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, data): # standard forward pass. no relu after the second conv: log_softmax turns its raw scores into log-probabilities for nll_loss
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

# send to gpu if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GCN().to(device)
data = dataset[0].to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4) # "default"-ish parameters

def train():
    model.train()
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask]) # mask on trainable samples (still somewhat weird to me)
    loss.backward()
    optimizer.step()
    return loss.item()

# for validation (reports accuracy on the train and validation masks)
def test():
    model.eval()
    logits, accs = model(data), []
    for mask in [data.train_mask, data.val_mask]: # use the "validation" masks this time
        pred = logits[mask].max(1)[1]
        acc = pred.eq(data.y[mask]).sum().item() / mask.sum().item()
        accs.append(acc)
    return accs

from sklearn.metrics import roc_auc_score

# keep a held-aside set of data for a final evaluation
def final_test():
    model.eval()
    logits = model(data)
    test_mask = data.test_mask

    preds = logits[test_mask].max(1)[1].cpu().numpy()
    probs = F.softmax(logits[test_mask], dim=1).cpu().detach().numpy() # logits -> probabilities
    true_labels = data.y[test_mask].cpu().numpy()

    accuracy = (preds == true_labels).sum() / test_mask.sum().item()

    # compute AUC
    auc = roc_auc_score(
        F.one_hot(torch.tensor(true_labels), dataset.num_classes).numpy(),
        probs,
        multi_class="ovr",
    )

    return accuracy, auc


# training loop
best_val_acc = 0
for epoch in range(200):
    loss = train()
    if epoch % 10 == 0:
        train_acc, val_acc = test()
        print(f'Epoch {epoch:03d}, Loss: {loss:.4f}, Train Acc: {train_acc:.4f}, '
              f'Val Acc: {val_acc:.4f}')

        # cool feature of loading a previously "best" performing model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
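            # clone the tensors: state_dict() shares storage with the live parameters, so later training steps would overwrite an uncopied "best" state
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}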

# use best model on the final test set
model.load_state_dict(best_model_state)
test_acc, test_auc = final_test()
print(f"Final Test Accuracy: {test_acc:.4f}, Test AUC: {test_auc:.4f}")

It seems we're overfitting the training data just as we did with Cora (only it's worse now!). Let's see what k-fold cross validation can do for us.

In [None]:
kFoldTrain(5, "BaseGCN", 200)

Much better accuracy from employing 5-fold cross validation. Next, let's see if increasing the model depth can help us get the representation power we need to model the more challenging Citeseer dataset.

In [None]:
kFoldTrain(5, "DeeperGCN", 200)

Oh no! Adding more layers actually decreased the performance! Maybe we should try increasing the number of epochs to give the model more time to fit the extra parameters that come with the deeper architecture.
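
To get a feel for how many extra parameters the deeper model has to fit, here is a rough pure-Python count. It assumes each `GCNConv` and `Linear` layer holds one weight matrix plus one bias vector, and that Citeseer has 3703 input features and 6 classes; these figures are assumptions stated here, not values printed by the notebook. The helper names are made up for this sketch.

```python
def layer_params(in_dim, out_dim):
    # One weight matrix (in_dim x out_dim) plus one bias vector (out_dim).
    return in_dim * out_dim + out_dim


def base_gcn_params(num_features, num_classes):
    # GCNConv(num_features, 16) -> GCNConv(16, num_classes)
    return layer_params(num_features, 16) + layer_params(16, num_classes)


def deeper_gcn_params(num_features, num_classes):
    # GCNConv(num_features, 32) -> GCNConv(32, 16) -> Linear(16, 32) -> Linear(32, num_classes)
    return (
        layer_params(num_features, 32)
        + layer_params(32, 16)
        + layer_params(16, 32)
        + layer_params(32, num_classes)
    )


# Assumed Citeseer dimensions: 3703 features, 6 classes.
num_features, num_classes = 3703, 6
base = base_gcn_params(num_features, num_classes)
deeper = deeper_gcn_params(num_features, num_classes)
fc_only = layer_params(16, 32) + layer_params(32, num_classes)

print(f"BaseGCN parameters:   {base}")
print(f"DeeperGCN parameters: {deeper}")
print(f"of which in the fully connected layers: {fc_only}")
```

Under these assumptions the deeper model has roughly twice as many parameters, and most of the growth comes from doubling the width of the first GCN layer rather than from the fully connected layers.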

In [None]:
kFoldTrain(5, "DeeperGCN", 1000)