# Classification on the PROTEINS Dataset

We're going to do our walkthrough with [Spektral](https://graphneural.network/getting-started/) for Python to demonstrate the implementation of GCNs. 

The PROTEINS dataset contains 1,113 graphs, all representative of the structure of different proteins. For our walkthrough, we'll mostly be focused in on building a GCN, but I encourage you to dive deeper into this dataset later if you're interested in learning more about the feature representations, or what we're classifying! 

Spektral is a library for Python for Graph Neural Networks, built on Tensorflow and Keras. Another great alternative is PyTorch Geometric. 

While Spektral is fabulous for quickly getting a model up and running (like for this walkthrough) PyTorch Geometric will be more comfortable  for those who prefer PyTorch over Tensorflow & Keras, it just takes a little bit more time to get set up. 

In [3]:
# Uncomment me and run this cell!
# !pip install spektral





In [4]:
# Reading in the PROTEINS dataset
from spektral.datasets import TUDataset

# Spectral provides the TUDataset class, which contains benchmark datasets for graph classification
data = TUDataset('PROTEINS')
data

Downloading PROTEINS dataset.


100%|█████████████████████████████████████████| 447k/447k [00:00<00:00, 496kB/s]


Successfully loaded PROTEINS.


TUDataset(n_graphs=1113)

In [5]:
# Since we want to utilize the Spektral GCN layer, we want to follow the original paper for this method and perform some preprocessing:
from spektral.transforms import GCNFilter

# Apply the built-in filter to all of our data:
data.apply(GCNFilter())

In [6]:
# Split our train and test data. This just splits based on the first 80%/second 20% which isn't entirely ideal, so we'll shuffle the data first.
import numpy as np

np.random.shuffle(data)
split = int(0.8 * len(data))
data_train, data_test = data[:split], data[split:]

In [7]:
# Spektral is built on top of Keras, so we can use the Keras functional API to build a model that first embeds,
# then sums the nodes together (global pooling), then classifies the result with a dense softmax layer

# First, let's import the necessary layers:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout
from spektral.layers import GCNConv, GlobalSumPool

In [8]:
# Now, we can use model subclassing to define our model:

class ProteinsGNN(Model):
  
  def __init__(self, n_hidden, n_labels):
    super().__init__()
    # Define our GCN layer with our n_hidden layers
    self.graph_conv = GCNConv(n_hidden)
    # Define our global pooling layer
    self.pool = GlobalSumPool()
    # Define our dropout layer, initialize dropout freq. to .5 (50%)
    self.dropout = Dropout(0.5)
    # Define our Dense layer, with softmax activation function
    self.dense = Dense(n_labels, 'softmax')

  # Define class method to call model on input
  def call(self, inputs):
    out = self.graph_conv(inputs)
    out = self.dropout(out)
    out = self.pool(out)
    out = self.dense(out)

    return out

In [9]:
# Instantiate our model for training
model = ProteinsGNN(32, data.n_labels)

In [10]:
# Compile model with our optimizer (adam) and loss function
model.compile('adam', 'categorical_crossentropy')

In [11]:
# Here's the trick - we can't just call Keras' fit() method on this model.
# Instead, we have to use Loaders, which Spektral walks us through. Loaders create mini-batches by iterating over the graph
# Since we're using Spektral for an experiment, for our first trial we'll use the recommended loader in the getting started tutorial

# TODO: read up on modes and try other loaders later
from spektral.data import BatchLoader

loader = BatchLoader(data_train, batch_size=32)

In [12]:
# Now we can train! We don't need to specify a batch size, since our loader is basically a generator
# But we do need to specify the steps_per_epoch parameter

model.fit(loader.load(), steps_per_epoch=loader.steps_per_epoch, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f84ae419400>

In [13]:
# To evaluate, let's instantiate another loader to test

test_loader = BatchLoader(data_test, batch_size=32)

In [15]:
# And feed it to our model by calling .load()

loss = model.evaluate(loader.load(), steps=loader.steps_per_epoch)

print('Test loss: {}'.format(loss))

Test loss: 3.7211432456970215
