In [19]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

# Load the Cora dataset
dataset = Planetoid(root='data/Cora', name='Cora')

# Prepare data
data = dataset[0]

# Define a 2-layer GCN
class GCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.relu(x)
        x = self.conv2(x, edge_index)
        return torch.log_softmax(x, dim=1)

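# Initialize model, optimizer, and loss function
model = GCN(input_dim=dataset.num_node_features, hidden_dim=16, output_dim=dataset.num_classes)
optimizer = optim.Adam(model.parameters(), lr=0.01)
# Note: nn.CrossEntropyLoss applies log-softmax internally, and the model already
# returns log-probabilities. Applying log-softmax twice gives the same values, so the
# loss is unaffected, but nn.NLLLoss is the conventional pairing for log_softmax outputs.
criterion = nn.CrossEntropyLoss()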

# Training loop
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    
    # Forward pass
    out = model(data)
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

    if epoch == 99:
        print(f'Final Loss: {loss.item()}')

print("Training complete!")


Epoch 0, Loss: 1.9495115280151367
Epoch 10, Loss: 0.5894498229026794
Epoch 20, Loss: 0.0932442769408226
Epoch 30, Loss: 0.019700631499290466
Epoch 40, Loss: 0.007515148725360632
Epoch 50, Loss: 0.0043111443519592285
Epoch 60, Loss: 0.003122517140582204
Epoch 70, Loss: 0.0025340530555695295
Epoch 80, Loss: 0.002172798151150346
Epoch 90, Loss: 0.0019147110870108008
Final Loss: 0.0017305193468928337
Training complete!


## Explanation:
GCN aggregates features from a node’s neighbors using graph convolutions. This allows the network to learn representations based on both node features and graph structure.
The Cora dataset is used to classify nodes into one of 7 research topics.
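
As a rough illustration of the aggregation step, the sketch below applies one GCN-style propagation to a toy 3-node path graph in plain Python. It assumes the symmetric normalization with self-loops, D^-1/2 (A + I) D^-1/2, which is what `GCNConv` uses by default as far as I understand. The graph, the features, and the function name `gcn_propagate` are made up for this example, and the learned weight matrix and nonlinearity are left out.

```python
import math

def gcn_propagate(edges, features):
    """One propagation step X' = D^-1/2 (A + I) D^-1/2 X on an undirected graph.

    edges: list of (u, v) pairs; features: list of per-node feature lists.
    No learned weights or nonlinearity, only neighbor aggregation.
    """
    n = len(features)
    # Build neighbor sets and add a self-loop to every node.
    neighbors = [{i} for i in range(n)]
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    deg = [len(nb) for nb in neighbors]  # degree including the self-loop

    dim = len(features[0])
    out = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            w = 1.0 / math.sqrt(deg[i] * deg[j])  # symmetric normalization
            for k in range(dim):
                out[i][k] += w * features[j][k]
    return out

# Toy path graph 0 - 1 - 2 with one-hot node features (hypothetical data).
edges = [(0, 1), (1, 2)]
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
propagated = gcn_propagate(edges, features)
for node, row in enumerate(propagated):
    print(node, [round(v, 3) for v in row])
```

After one step, node 0 carries a share of node 1's features but nothing from node 2, which is two hops away. A second step would bring node 2 in, which is why a 2-layer GCN sees a 2-hop neighborhood.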

## Questions (1 point each):

1. What would happen if we added more GCN layers (e.g., 3 layers instead of 2)? How would this affect over-smoothing?
2. What would happen if we used a larger hidden dimension (e.g., 64 instead of 16)? How would this impact the model's capacity?
3. What would happen if we replaced ReLU activation with a sigmoid function? Would the performance change?

4. What would happen if we trained on only 10% of the nodes and tested on the remaining 90%? How would the performance be affected?
5. What would happen if we used a different optimizer (e.g., RMSprop) instead of Adam? Would it affect the convergence speed?

Extra credit: 
1. What would happen if we used edge weights (non-binary) in the adjacency matrix? How would it affect message passing?
2. What would happen if we removed the log-softmax function in the output layer? Would the loss function still work correctly?

## No points, just for you to think about:
1. What would happen if we applied dropout to the node features during training? How would it affect the model’s generalization?
2. What would happen if we used mean-pooling instead of summing the messages in the GCN layers?
3. What would happen if we pre-trained the node features using a different algorithm, like Node2Vec, before feeding them into the GCN?


What would happen if we added more GCN layers (e.g., 3 layers instead of 2)? How would this affect over-smoothing?

My intuition is that adding layers would lower the training loss, because a deeper network can capture more complex features that help classification. However, more layers would also increase over-smoothing: each GCN layer mixes a node's representation with its neighbors', so with k layers every node aggregates information from its k-hop neighborhood. As k grows, those neighborhoods overlap more and more, and the node representations become increasingly similar and harder to tell apart. Note that the experiment below only reports training loss, so it cannot show whether test accuracy improves.
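
To make the over-smoothing argument concrete, here is a small sketch in plain Python. It is a simplification, not the notebook's model: it uses mean aggregation over each node and its neighbors (instead of `GCNConv`'s symmetric normalization), has no learned weights, and runs on a hypothetical 5-node path graph. It tracks how far apart the node values are after each round of aggregation.

```python
def mean_aggregate(neighbors, values):
    """Replace each node's value with the mean over itself and its neighbors."""
    return [sum(values[j] for j in nbrs) / len(nbrs) for nbrs in neighbors]

def spread(values):
    """Gap between the largest and smallest node value."""
    return max(values) - min(values)

# Path graph 0 - 1 - 2 - 3 - 4, with self-loops included in each neighbor list.
neighbors = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4]]
values = [1.0, 0.0, 0.0, 0.0, 0.0]  # one scalar feature per node

spreads = [spread(values)]
for _ in range(20):
    values = mean_aggregate(neighbors, values)
    spreads.append(spread(values))

for k in (0, 2, 3, 10, 20):
    print(f"after {k:2d} rounds: spread = {spreads[k]:.4f}")
```

The spread shrinks with every round, so after many rounds all nodes hold nearly the same value. Going from 2 to 3 layers is only one extra round, so the effect should be mild at that depth.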

In [10]:
class GCN3Layer(nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, output_dim):
        super(GCN3Layer, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim1)
        self.conv2 = GCNConv(hidden_dim1, hidden_dim2)
        self.conv3 = GCNConv(hidden_dim2, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.relu(x)
        x = self.conv2(x, edge_index)
        x = torch.relu(x)
        x = self.conv3(x, edge_index)
        return torch.log_softmax(x, dim=1)

# Initialize model, optimizer, and loss function
model = GCN3Layer(input_dim=dataset.num_node_features, hidden_dim1=16, hidden_dim2=8, output_dim=dataset.num_classes)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    
    # Forward pass
    out = model(data)
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

    if epoch == 99:
        print(f'Final Loss: {loss.item()}')

print("Training complete!")


Epoch 0, Loss: 1.9400514364242554
Epoch 10, Loss: 0.9110769033432007
Epoch 20, Loss: 0.26767370104789734
Epoch 30, Loss: 0.05882469564676285
Epoch 40, Loss: 0.012696413323283195
Epoch 50, Loss: 0.004139997996389866
Epoch 60, Loss: 0.002468880033120513
Epoch 70, Loss: 0.0018266976112499833
Epoch 80, Loss: 0.0014895839849486947
Epoch 90, Loss: 0.0012793855275958776
Final Loss: 0.00114056293386966
Training complete!


In [11]:
(0.00114056293386966 - 0.0018777881050482392) / 0.0018777881050482392

-0.39260296153576946

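By adding a third layer with 8 hidden units, we see a 39% decrease in final training loss. The baseline of 0.00188 used here presumably comes from an earlier run of the 2-layer model; the run printed above ended at 0.00173, and against that run the decrease is about 34%. The exact percentage therefore varies from run to run.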
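
The comparison cells all use the same formula, so it can be wrapped in a small helper. This is just a convenience sketch; the function name `relative_change` is my own, and the two loss values are copied from the cells above.

```python
def relative_change(new, baseline):
    """Fractional change of `new` relative to `baseline` (-0.39 means 39% lower)."""
    return (new - baseline) / baseline

baseline_loss = 0.0018777881050482392   # 2-layer baseline value used in the notebook
three_layer_loss = 0.00114056293386966  # final loss of the 3-layer run

change = relative_change(three_layer_loss, baseline_loss)
# The % format multiplies by 100, so a fraction is never misread as a percentage.
print(f"{change:+.1%}")
```

Formatting with `%` avoids mixing up a fractional change with a percentage: a fractional change of 1.0 is a 100% increase.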

What would happen if we used a larger hidden dimension (e.g., 64 instead of 16)? How would this impact the model's capacity?

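With a larger hidden dimension, I would expect the model's capacity to increase. The number of parameters grows roughly in proportion to the hidden dimension, so going from 16 to 64 hidden units about quadruples the parameter count. With that much extra capacity and the same training nodes, I would expect the model to fit the training data more closely and be more prone to overfitting.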
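
A quick count makes the size difference concrete. The sketch below assumes each GCN layer has one weight matrix and one bias vector, which matches my understanding of `GCNConv`'s defaults, and uses Cora's 1,433 input features and 7 classes. The helper name `gcn_param_count` is made up for this example.

```python
def gcn_param_count(input_dim, hidden_dim, output_dim):
    """Parameters of a 2-layer GCN, assuming one weight matrix and one bias per layer."""
    layer1 = input_dim * hidden_dim + hidden_dim
    layer2 = hidden_dim * output_dim + output_dim
    return layer1 + layer2

# Cora: 1,433 input features, 7 classes.
small = gcn_param_count(1433, 16, 7)
large = gcn_param_count(1433, 64, 7)
print(small, large, round(large / small, 2))
```

Nearly all parameters sit in the first layer because the input dimension is so large, so the total scales almost exactly with the hidden dimension.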

In [12]:
class GCNLargerHL(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCNLargerHL, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.relu(x)
        x = self.conv2(x, edge_index)
        return torch.log_softmax(x, dim=1)

# Initialize model, optimizer, and loss function
model = GCNLargerHL(input_dim=dataset.num_node_features, hidden_dim=64, output_dim=dataset.num_classes)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    
    # Forward pass
    out = model(data)
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

    if epoch == 99:
        print(f'Final Loss: {loss.item()}')

print("Training complete!")

Epoch 0, Loss: 1.9406793117523193
Epoch 10, Loss: 0.07691950350999832
Epoch 20, Loss: 0.0033538031857460737
Epoch 30, Loss: 0.0006547874072566628
Epoch 40, Loss: 0.0002803529496304691
Epoch 50, Loss: 0.0001895413442980498
Epoch 60, Loss: 0.00015777147200424224
Epoch 70, Loss: 0.00014230006490834057
Epoch 80, Loss: 0.00013241938722785562
Epoch 90, Loss: 0.00012499105650931597
Final Loss: 0.00011930983600905165
Training complete!


In [14]:
(0.00011930983600905165 - 0.0018777881050482392) / 0.0018777881050482392

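-0.9364625669486885

Going from 16 to 64 hidden units lowers the final training loss by about 94%. This is consistent with a higher-capacity model fitting the training nodes more closely, but training loss alone cannot tell us whether it overfits; that would require checking accuracy on the validation or test nodes.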

What would happen if we replaced ReLU activation with a sigmoid function? Would the performance change?

I don't have too much intuition on this, but in the last assignment, my performance greatly increased when I used ReLU instead of sigmoid. So based on that, I expect maybe a similar thing here - switching to sigmoid would decrease performance.

In [15]:
class GCNSigmoid(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCNSigmoid, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.sigmoid(x)
        x = self.conv2(x, edge_index)
        return torch.log_softmax(x, dim=1)

# Initialize model, optimizer, and loss function
model = GCNSigmoid(input_dim=dataset.num_node_features, hidden_dim=16, output_dim=dataset.num_classes)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    
    # Forward pass
    out = model(data)
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

    if epoch == 99:
        print(f'Final Loss: {loss.item()}')

print("Training complete!")

Epoch 0, Loss: 2.044727325439453
Epoch 10, Loss: 1.4465745687484741
Epoch 20, Loss: 0.9942996501922607
Epoch 30, Loss: 0.6476757526397705
Epoch 40, Loss: 0.4087792634963989
Epoch 50, Loss: 0.26063838601112366
Epoch 60, Loss: 0.17464640736579895
Epoch 70, Loss: 0.12472358345985413
Epoch 80, Loss: 0.09445361793041229
Epoch 90, Loss: 0.07492941617965698
Final Loss: 0.06269379705190659
Training complete!


In [16]:
(0.06269379705190659 - 0.0018777881050482392) / 0.0018777881050482392

32.38704557950963

The final loss with sigmoid is about 33 times the baseline, i.e. roughly a 3,200% increase (the 32.39 above is a fractional change, not a percentage). I'm not sure of the exact mathematical reason, but maybe we can touch on this in discussions. One likely factor is saturation: the sigmoid flattens out for very large positive or negative inputs, so its gradient is close to zero there, and even at its steepest point the gradient is only 0.25. Smaller gradients mean smaller weight updates, which would explain the slower drop in loss over the same 100 epochs.
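
The flat regions of the sigmoid mentioned above can be checked numerically. This sketch compares the derivative of the sigmoid with the derivative of ReLU at a few input values; it is plain Python and independent of the model code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), which peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"x={x:+.1f}  sigmoid'={sigmoid_grad(x):.4f}  relu'={relu_grad(x):.1f}")
```

The sigmoid's gradient never exceeds 0.25 and falls off quickly away from zero, while ReLU passes a gradient of 1 for every positive input. That difference is one plausible reason the sigmoid model learns more slowly here.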

What would happen if we trained on only 10% of the nodes and tested on the remaining 90%? How would the performance be affected?

Naturally, with less data, I would expect the performance of the model to decrease. If we train on only 10% of the nodes, the model sees fewer labeled examples from which to recognize patterns, so it captures less information about the data as a whole. I'd expect noticeably worse and less stable performance on the test set than with a larger training set, because the training set may not be representative enough for what the model learns to carry over. One caveat: the standard Planetoid split for Cora already trains on only 140 of the 2,708 nodes (about 5%), so a random 10% split would actually give the model more labels than the runs above.
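
This question has no experiment in the notebook, so here is a hedged sketch of how one could set it up. It builds a random 10%/90% split as boolean masks and defines an accuracy helper; the function names `make_random_masks` and `masked_accuracy` and the seed are my own choices, and only `torch` is needed to run it.

```python
import torch

def make_random_masks(num_nodes, train_frac=0.1, seed=0):
    """Return boolean (train_mask, test_mask) with a random train_frac / rest split."""
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_nodes, generator=generator)
    num_train = int(train_frac * num_nodes)
    train_mask = torch.zeros(num_nodes, dtype=torch.bool)
    train_mask[perm[:num_train]] = True
    return train_mask, ~train_mask

def masked_accuracy(log_probs, labels, mask):
    """Fraction of masked nodes whose highest-scoring class matches the label."""
    pred = log_probs[mask].argmax(dim=1)
    return (pred == labels[mask]).float().mean().item()

# Cora has 2,708 nodes; with the notebook's objects this would be data.num_nodes.
train_mask, test_mask = make_random_masks(2708, train_frac=0.1)
print(int(train_mask.sum()), int(test_mask.sum()))
```

To run the experiment, one would use `train_mask` in place of `data.train_mask` in the training loop, then call `model.eval()` and compute `masked_accuracy(model(data), data.y, test_mask)` inside `torch.no_grad()`. Reporting test accuracy rather than training loss is what actually answers the question.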

What would happen if we used a different optimizer (e.g., RMSprop) instead of Adam? Would it affect the convergence speed?

In [17]:
class GCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.relu(x)
        x = self.conv2(x, edge_index)
        return torch.log_softmax(x, dim=1)

# Initialize model, optimizer, and loss function
model = GCN(input_dim=dataset.num_node_features, hidden_dim=16, output_dim=dataset.num_classes)
optimizer = optim.RMSprop(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    
    # Forward pass
    out = model(data)
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

    if epoch == 99:
        print(f'Final Loss: {loss.item()}')

print("Training complete!")


Epoch 0, Loss: 1.9499214887619019
Epoch 10, Loss: 0.0588170662522316
Epoch 20, Loss: 0.019654853269457817
Epoch 30, Loss: 0.010645276866853237
Epoch 40, Loss: 0.006953630596399307
Epoch 50, Loss: 0.00499807670712471
Epoch 60, Loss: 0.003810119116678834
Epoch 70, Loss: 0.00302361068315804
Epoch 80, Loss: 0.0024691051803529263
Epoch 90, Loss: 0.002059818943962455
Final Loss: 0.001775294542312622
Training complete!


In [18]:
(0.001775294542312622 - 0.0018777881050482392) / 0.0018777881050482392

-0.05458207050096537

I didn't have much intuition on this question. In this run, changing from Adam to RMSprop decreased the final training loss by around 5%.

I did some research on the different optimizers. Regarding convergence speed, I found that Adam is typically considered faster because it adds a momentum component (and bias correction) to the RMSprop algorithm, which allows quicker convergence to a minimum. Because of this, I had expected the switch to RMSprop to decrease performance, since it would take more epochs to reach the loss that Adam found. Instead, RMSprop dropped faster early on (0.059 versus 0.589 at epoch 10) and ended at a similar loss. A 5% gap is small enough to be run-to-run variation, so more epochs or repeated runs with different seeds would be needed to tell whether RMSprop is really better for this problem than Adam.
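
To see how the two update rules differ, here is a minimal scalar sketch of both, written from the textbook formulas rather than copied from PyTorch. The learning rate of 0.01 matches the notebook; the other hyperparameters (`alpha=0.99`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are what I believe the PyTorch defaults to be, so treat them as assumptions. It compares the size of the very first update on a toy function f(x) = x^2.

```python
import math

def rmsprop_step(param, grad, state, lr=0.01, alpha=0.99, eps=1e-8):
    """One RMSprop update: scale the gradient by a running RMS of past gradients."""
    state["v"] = alpha * state["v"] + (1 - alpha) * grad * grad
    return param - lr * grad / (math.sqrt(state["v"]) + eps)

def adam_step(param, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: RMSprop-style scaling plus momentum and bias correction."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # undo the bias toward zero
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

# First update on f(x) = x^2 starting from x = 1, where the gradient is 2x = 2.
x0, grad0 = 1.0, 2.0
x_rms = rmsprop_step(x0, grad0, {"v": 0.0})
x_adam = adam_step(x0, grad0, {"t": 0, "m": 0.0, "v": 0.0})
print(f"RMSprop first step: {x0 - x_rms:.4f}")
print(f"Adam first step:    {x0 - x_adam:.4f}")
```

Without bias correction, RMSprop's running average starts near zero, so its first steps come out about ten times larger than the learning rate, while Adam's first step is about equal to the learning rate. That may be part of why the RMSprop run above dropped faster in the first ten epochs, though this toy example does not prove it for the GCN.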