# Graph Neural Network Classification on the PROTEINS Dataset

For the first approach, I'm going to use [Spektral](https://graphneural.network/getting-started/) for Python to build my GCN layer and then perform our classification. 

Spektral is a library for Python for Graph Neural Networks, built on Tensorflow and Keras. 

Our second experiment will be built with PyTorch Geometric.

In [None]:
# Uncomment me and run this cell!
# !pip install spektral

In [3]:
# Reading in the PROTEINS dataset
from spektral.datasets import TUDataset

# Spectral provides the TUDataset class, which contains benchmark datasets for graph classification
data = TUDataset('PROTEINS')
data

Downloading PROTEINS dataset.


100%|█████████████████████████████████████████| 447k/447k [00:00<00:00, 948kB/s]


Successfully loaded PROTEINS.


TUDataset(n_graphs=1113)

In [4]:
# Since we want to utilize the Spektral GCN layer, we want to follow the original paper for this method and perform some preprocessing:
from spektral.transforms import GCNFilter

data.apply(GCNFilter())

In [5]:
# Split our train and test data. This just splits based on the first 80%/second 20% which isn't entirely ideal, so we'll shuffle the data first.
import numpy as np

np.random.shuffle(data)
split = int(0.8 * len(data))
data_train, data_test = data[:split], data[split:]

In [6]:
# Spektral is built on top of Keras, so we can use the Keras functional API to build a model that first embeds,
# then sums the nodes together (global pooling), then classifies the result with a dense softmax layer

# First, let's import the necessary layers:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout
from spektral.layers import GCNConv, GlobalSumPool

In [7]:
# Now, we can use model subclassing to define our model:

class ProteinsGNN(Model):
  
  def __init__(self, n_hidden, n_labels):
    super().__init__()
    # Define our GCN layer with our n_hidden layers
    self.graph_conv = GCNConv(n_hidden)
    # Define our global pooling layer
    self.pool = GlobalSumPool()
    # Define our dropout layer, initialize dropout freq. to .5 (50%)
    self.dropout = Dropout(0.5)
    # Define our Dense layer, with softmax activation function
    self.dense = Dense(n_labels, 'softmax')

  # Define class method to call model on input
  def call(self, inputs):
    out = self.graph_conv(inputs)
    out = self.dropout(out)
    out = self.pool(out)
    out = self.dense(out)

    return out

In [8]:
# Instantiate our model for training
model = ProteinsGNN(32, data.n_labels)

In [9]:
# Compile model with our optimizer (adam) and loss function
model.compile('adam', 'categorical_crossentropy')

In [10]:
# Here's the trick - we can't just call Keras' fit() method on this model.
# Instead, we have to use Loaders, which Spektral walks us through. Loaders create mini-batches by iterating over the graph
# Since we're using Spektral for an experiment, for our first trial we'll use the recommended loader in the getting started tutorial

# TODO: read up on modes and try other loaders later
from spektral.data import BatchLoader

loader = BatchLoader(data_train, batch_size=32)

In [11]:
# Now we can train! We don't need to specify a batch size, since our loader is basically a generator
# But we do need to specify the steps_per_epoch parameter

model.fit(loader.load(), steps_per_epoch=loader.steps_per_epoch, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f7780700ad0>

In [12]:
# To evaluate, let's instantiate another loader to test

test_loader = BatchLoader(data_test, batch_size=32)

In [13]:
# And feed it to our model by calling .load()

loss = model.evaluate(loader.load(), steps=loader.steps_per_epoch)

print('Test loss: {}'.format(loss))

Test loss: 3.0448803901672363


## PyTorch Geometric GCN

Pytorch Geometric provides [GCN layers](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html) based on Kipf & Welling's original paper: ["Semi-Supervised Classification with Graph Convolutional Networks"](https://arxiv.org/abs/1609.02907) on which I've based the bulk of my research and write-ups.

While my original goal was to use my [original experiment](https://colab.research.google.com/drive/1NUQgUdrdvIddewdQyGEpas_mPaFzC8-e?usp=sharing) (based off of [this](https://towardsdatascience.com/understanding-graph-convolutional-networks-for-node-classification-a2bfdb7aba7b) resource) to build this from scratch, I ran into difficulties trying to embed and classify such a large dataset, specifically with Colab RAM allowances.

For this reason, I sought out different methods and found that this problem had already been solved, and for purposes of time and demonstration chose to delve into Pytorch Geometric rather than invent the wheel. 

In order to successfully learn to implement this approach with this library, I relied on the Pytorch Geometric [documentation](https://pytorch-geometric.readthedocs.io/en/latest/index.html) as well as [this notebook](https://colab.research.google.com/drive/1I8a0DfQ3fI7Njc62__mVXUlcAleUclnb?usp=sharing) written by matthias.fey@tu-dortmund.de. 

I would like to extend thanks and all due credit to these authors, as this work and research would not be possible without them. Further credits and citations can be found in the [README](https://github.com/sidneyarcidiacono/graph-convolutional-networks) of this repository.

In [14]:
# Install required packages. Uncomment me and run!
# !pip install -q torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
# !pip install -q torch-sparse -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
# !pip install -q torch-geometric

[K     |████████████████████████████████| 2.6MB 2.5MB/s 
[K     |████████████████████████████████| 1.5MB 2.6MB/s 
[K     |████████████████████████████████| 215kB 5.1MB/s 
[K     |████████████████████████████████| 235kB 8.4MB/s 
[K     |████████████████████████████████| 2.2MB 9.7MB/s 
[K     |████████████████████████████████| 51kB 5.7MB/s 
[?25h  Building wheel for torch-geometric (setup.py) ... [?25l[?25hdone


In [16]:
import torch
from torch_geometric.datasets import TUDataset

# Like Spektral, pytorch geometric provides us with benchmark TUDatasets
dataset = TUDataset(root='data/TUDataset', name='PROTEINS')

# Let's take a look at our data. We'll look at dataset (all data) and data (our first graph):

data = dataset[0]  # Get the first graph object.

print()
print(f'Dataset: {dataset}:')
print('====================')
# How many graphs?
print(f'Number of graphs: {len(dataset)}')
# How many features?
print(f'Number of features: {dataset.num_features}')
# Now, in our first graph, how many edges?
print(f'Number of edges: {data.num_edges}')
# Average node degree?
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
# Do we have isolated nodes?
print(f'Contains isolated nodes: {data.contains_isolated_nodes()}')
# Do we contain self-loops?
print(f'Contains self-loops: {data.contains_self_loops()}')
# Is this an undirected graph?
print(f'Is undirected: {data.is_undirected()}')


Dataset: PROTEINS(1113):
Number of graphs: 1113
Number of features: 3
Number of edges: 162
Average node degree: 3.86
Contains isolated nodes: False
Contains self-loops: False
Is undirected: True


In [20]:
# Now, we need to perform our train/test split.
# We create a seed, and then shuffle our data
torch.manual_seed(12345)
dataset = dataset.shuffle()

# Once it's shuffled, we slice the data to split
train_dataset = dataset[150:-150]
test_dataset = dataset[0:150]

# Take a look at the training versus test graphs
print(f'Number of training graphs: {len(train_dataset)}')
print(f'Number of test graphs: {len(test_dataset)}')

Number of training graphs: 813
Number of test graphs: 150


In [22]:
# Import DataLoader for batching
from torch_geometric.data import DataLoader

# our DataLoader creates diagonal adjacency matrices, and concatenates features
# and target matrices in the node dimension. This allows differing numbers of nodes and edges 
# over examples in one batch. (from pytorch geometric docs)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Take a look at the output to understand this further:
for step, data in enumerate(train_loader):
    print(f'Step {step + 1}:')
    print('=======')
    print(f'Number of graphs in the current batch: {data.num_graphs}')
    print(data)
    print()

Step 1:
Number of graphs in the current batch: 64
Batch(batch=[3454], edge_index=[2, 12586], ptr=[65], x=[3454, 3], y=[64])

Step 2:
Number of graphs in the current batch: 64
Batch(batch=[2181], edge_index=[2, 8612], ptr=[65], x=[2181, 3], y=[64])

Step 3:
Number of graphs in the current batch: 64
Batch(batch=[2259], edge_index=[2, 8598], ptr=[65], x=[2259, 3], y=[64])

Step 4:
Number of graphs in the current batch: 64
Batch(batch=[2338], edge_index=[2, 8642], ptr=[65], x=[2338, 3], y=[64])

Step 5:
Number of graphs in the current batch: 64
Batch(batch=[2475], edge_index=[2, 8958], ptr=[65], x=[2475, 3], y=[64])

Step 6:
Number of graphs in the current batch: 64
Batch(batch=[2879], edge_index=[2, 11016], ptr=[65], x=[2879, 3], y=[64])

Step 7:
Number of graphs in the current batch: 64
Batch(batch=[1811], edge_index=[2, 6808], ptr=[65], x=[1811, 3], y=[64])

Step 8:
Number of graphs in the current batch: 64
Batch(batch=[2742], edge_index=[2, 10060], ptr=[65], x=[2742, 3], y=[64])

Step 

In [23]:
# Import everything we need to build our network:
from torch.nn import Linear
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.nn import global_mean_pool

# Define our GCN class as a pytorch Module
class GCN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super(GCN, self).__init__()
        # We inherit from pytorch geometric's GCN class, and we initialize three layers
        self.conv1 = GCNConv(dataset.num_node_features, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.conv3 = GCNConv(hidden_channels, hidden_channels)
        # Our final linear layer will define our output
        self.lin = Linear(hidden_channels, dataset.num_classes)

    def forward(self, x, edge_index, batch):
        # 1. Obtain node embeddings 
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = self.conv2(x, edge_index)
        x = x.relu()
        x = self.conv3(x, edge_index)

        # 2. Readout layer
        x = global_mean_pool(x, batch)  # [batch_size, hidden_channels]

        # 3. Apply a final classifier
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.lin(x)
        return x

model = GCN(hidden_channels=64)
print(model)

GCN(
  (conv1): GCNConv(3, 64)
  (conv2): GCNConv(64, 64)
  (conv3): GCNConv(64, 64)
  (lin): Linear(in_features=64, out_features=2, bias=True)
)


In [24]:
# Initialize our model from our GCN class:
model = GCN(hidden_channels=64)
# Set our optimizer (adam)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Define our loss function
criterion = torch.nn.CrossEntropyLoss()

# Initialize our train function
def train():
    model.train()

    for data in train_loader:  # Iterate in batches over the training dataset.
      out = model(data.x, data.edge_index, data.batch)  # Perform a single forward pass.
      loss = criterion(out, data.y)  # Compute the loss.
      loss.backward()  # Derive gradients.
      optimizer.step()  # Update parameters based on gradients.
      optimizer.zero_grad()  # Clear gradients.

# Define our test function
def test(loader):
  model.eval()

  correct = 0
  for data in loader:  # Iterate in batches over the training/test dataset.
      out = model(data.x, data.edge_index, data.batch)  
      pred = out.argmax(dim=1)  # Use the class with highest probability.
      correct += int((pred == data.y).sum())  # Check against ground-truth labels.
  return correct / len(loader.dataset)  # Derive ratio of correct predictions.

# Run for 200 epochs (range is exclusive in the upper bound)
for epoch in range(1, 201):
    train()
    train_acc = test(train_loader)
    test_acc = test(test_loader)
    print(f'Epoch: {epoch:03d}, Train Acc: {train_acc:.4f}, Test Acc: {test_acc:.4f}')

Epoch: 001, Train Acc: 0.6138, Test Acc: 0.5533
Epoch: 002, Train Acc: 0.6335, Test Acc: 0.5933
Epoch: 003, Train Acc: 0.6950, Test Acc: 0.6467
Epoch: 004, Train Acc: 0.6950, Test Acc: 0.6667
Epoch: 005, Train Acc: 0.7085, Test Acc: 0.6600
Epoch: 006, Train Acc: 0.6765, Test Acc: 0.6400
Epoch: 007, Train Acc: 0.7245, Test Acc: 0.6667
Epoch: 008, Train Acc: 0.6839, Test Acc: 0.6600
Epoch: 009, Train Acc: 0.6986, Test Acc: 0.6400
Epoch: 010, Train Acc: 0.6790, Test Acc: 0.6533
Epoch: 011, Train Acc: 0.6925, Test Acc: 0.6400
Epoch: 012, Train Acc: 0.7060, Test Acc: 0.6533
Epoch: 013, Train Acc: 0.7060, Test Acc: 0.6667
Epoch: 014, Train Acc: 0.6851, Test Acc: 0.6333
Epoch: 015, Train Acc: 0.7134, Test Acc: 0.6800
Epoch: 016, Train Acc: 0.7245, Test Acc: 0.6733
Epoch: 017, Train Acc: 0.7245, Test Acc: 0.7000
Epoch: 018, Train Acc: 0.6482, Test Acc: 0.5933
Epoch: 019, Train Acc: 0.6802, Test Acc: 0.6600
Epoch: 020, Train Acc: 0.6962, Test Acc: 0.6400
Epoch: 021, Train Acc: 0.6691, Test Acc: