# torch geometric + skorch @ CORA dataset

In [1]:
!date

Di 21. Jun 15:04:50 CEST 2022


This example shows how to use skorch with [torch geometric](https://pytorch-geometric.readthedocs.io/). The code is based on the [introduction example](https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html) but is modified to use a proper train/valid/test split. The data set is small enough to be trained efficiently without batching; how to do batching with skorch + torch geometric is not covered here.

Dependencies of this notebook besides skorch base installation:

---

In [2]:
import skorch
import torch

### Data Loading

In [3]:
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Cora', name='Cora')

In [4]:
dataset.data, dataset.num_classes

(Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708]),
 7)

To use torch geometric and the Cora dataset with skorch,
we need to address the following things:

1. graph convolutions cannot handle missing nodes (=> splitting the node attributes while keeping `edge_index` intact will lead to errors)
2. the Cora dataset marks its splits with separate mask attributes (i.e. `train_mask`, `val_mask`, `test_mask`)
3. skorch expects (X, y) pairs for classification tasks

To deal with (1), we split the data into three datasets, creating three sub-graphs in the process; each sub-graph is self-contained and can therefore be convolved over without errors.
We use the masks mentioned in (2) to identify the nodes and edges of each sub-graph.

(3) is handled by defining our own `XYDataset`, which has length 1 and returns the whole (sub-)graph together with the respective y values. In effect, we simulate a `batch_size=1` scenario.

In [5]:
from torch_geometric.data import Data

# simulating batch_size=1 by returning the whole dataset and the
# y-values. this way, the data loader can iterate over the 'batches'
# and produce X/y values for us.
class XYDataset(torch.utils.data.Dataset):
    def __init__(self, data: Data, y: torch.Tensor):
        self.data = data
        self.y = y
        
    def __len__(self):
        return 1
        
    def __getitem__(self, i):
        return self.data, self.y

### Data Splitting

Split the graph into train, validation and test sub-graphs.
This ensures that no information leaks between the splits when we apply graph
convolution operators on the graph.

We use `relabel_nodes=True` to re-index the edges so that they refer to positions
in the masked node tensor. Without this flag the edges keep their original node
indices: with a mask like `[0, 1, 0]`, an edge would still point to index 1, but
the masked node tensor `all_nodes[mask] == [<node_1>]` has only a single entry, at index 0.

In [6]:
from torch_geometric.utils import subgraph

data = dataset[0]

edge_index_train, _ = subgraph(
    subset=data.train_mask, 
    edge_index=data.edge_index, 
    relabel_nodes=True
)
ds_train = XYDataset(
    Data(x=data.x[data.train_mask], edge_index=edge_index_train),
    data.y[data.train_mask],
)

edge_index_valid, _ = subgraph(
    subset=data.val_mask, 
    edge_index=data.edge_index, 
    relabel_nodes=True
)
ds_valid = XYDataset(
    Data(x=data.x[data.val_mask], edge_index=edge_index_valid),
    data.y[data.val_mask],
)

edge_index_test, _ = subgraph(
    subset=data.test_mask, 
    edge_index=data.edge_index, 
    relabel_nodes=True
)
ds_test = XYDataset(
    Data(x=data.x[data.test_mask], edge_index=edge_index_test),
    data.y[data.test_mask],
)
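
To make the effect of `relabel_nodes=True` more concrete, here is a small, dependency-free sketch of the idea. It is not the torch geometric implementation; `induced_subgraph` and the toy graph are hypothetical names and data invented for illustration, and plain Python lists stand in for tensors.

```python
def induced_subgraph(mask, edges, relabel_nodes=True):
    """Keep only the edges whose two endpoints are both selected by `mask`.

    Hypothetical helper for illustration; it only mimics the idea behind
    torch geometric's `subgraph`, not its actual code.
    """
    kept = [i for i, keep in enumerate(mask) if keep]
    # Map each original node index to its position in the masked node list.
    new_id = {old: new for new, old in enumerate(kept)}
    sub_edges = [(s, t) for s, t in edges if s in new_id and t in new_id]
    if relabel_nodes:
        sub_edges = [(new_id[s], new_id[t]) for s, t in sub_edges]
    return kept, sub_edges


# Toy graph with four nodes; the mask selects nodes 1 and 2.
all_features = ["f0", "f1", "f2", "f3"]
edges = [(0, 1), (1, 2), (2, 1), (2, 3)]
mask = [False, True, True, False]

kept, relabeled = induced_subgraph(mask, edges)
_, original_ids = induced_subgraph(mask, edges, relabel_nodes=False)
sub_features = [all_features[i] for i in kept]

print(relabeled)     # [(0, 1), (1, 0)]
print(original_ids)  # [(1, 2), (2, 1)]
print(sub_features)  # ['f1', 'f2']
```

Without relabeling, the surviving edges still refer to node 2, which does not exist in the two-element `sub_features` list; after relabeling, every edge index is a valid position in it.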

### Data Feeding

Our "batch" consists of the whole dataset, so if we unpack the
batch into `(X, y)` we will have `X = Data(...)` and `y = [y_true]`.
The `DataLoader` does not modify `X`, but `y` gets a new batch dimension, so
`y.shape` becomes `(1, num_samples)` while the model predicts one row per node.
This would lead to a shape mismatch. Therefore, we need our own loader that strips
the first dimension so that the predicted `y` and the labelled `y` match in length.

In [7]:
from torch_geometric.loader import DataLoader

class RawLoader(DataLoader):
    def __iter__(self):
        it = super().__iter__()
        for X, y in it:
            yield X, y[0]
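
The extra batch dimension can be reproduced with plain PyTorch. The sketch below is an illustration under assumptions, not part of the original notebook: `WholeSetDataset` is a hypothetical stand-in for `XYDataset` that holds plain tensors instead of a `Data` object, and it uses `torch.utils.data.DataLoader` rather than the torch geometric loader. With plain tensors, both `x` and `y` receive the batch dimension, whereas in the notebook only `y` is affected.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class WholeSetDataset(Dataset):
    """Hypothetical stand-in for XYDataset that stores plain tensors."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        # A single "sample": the whole data set.
        return 1

    def __getitem__(self, i):
        return self.x, self.y


num_nodes = 5
ds = WholeSetDataset(torch.randn(num_nodes, 3), torch.arange(num_nodes))

# The default collate function stacks the single sample into a batch of size 1.
x_batch, y_batch = next(iter(DataLoader(ds, batch_size=1)))
y_stripped = y_batch[0]  # the same indexing that RawLoader applies

print(y_batch.shape)     # leading batch dimension of size 1
print(y_stripped.shape)  # one label per node again
```

Indexing with `[0]` only undoes the collation; the label values themselves are unchanged.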

### Modelling

This is the CORA example module as seen in the [torch geometric introduction](https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html).

In [8]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_node_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, data):        
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)

        return F.softmax(x, dim=1)

### Fitting

In [9]:
from skorch.helper import predefined_split

torch.manual_seed(42)

net = skorch.NeuralNetClassifier(
    module=GCN,
    lr=0.1,
    optimizer__weight_decay=5e-4,
    max_epochs=200,
    train_split=predefined_split(ds_valid),
    batch_size=1,
    iterator_train=RawLoader,
    iterator_valid=RawLoader,
)

In [10]:
net.fit(ds_train, None)

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        1.9659       0.1620        1.9380  0.1744
      2        1.9286       0.1640        1.9361  0.0062
      3        1.9395       0.1640        1.9333  0.0062
      4        1.9249       0.1780        1.9311  0.0067
      5        1.9008       0.1880        1.9292  0.0066
      6        1.8900       0.1860        1.9276  0.0059
      7        1.8748       0.1900        1.9261  0.0098
      8        1.8742       0.1940        1.9244  0.0063
      9        1.8715       0.1900        1.9227  0.0093
     10        1.8554       0.2000        1.9201  0.0063
     11        1.8497       0.2100        1.9173  0.008

    107        0.9672       0.5320        1.5743  0.0071
    108        0.9656       0.5360        1.5719  0.0073
    109        1.0130       0.5340        1.5680  0.0072
    110        0.9821       0.5380        1.5630  0.0070
    111        0.9151       0.5360        1.5598  0.0077
    112        0.8390       0.5360        1.5549  0.0093
    113        0.9395       0.5420        1.5505  0.0074
    114        0.9234       0.5440        1.5465  0.0069
    115        0.9149       0.5420        1.5425  0.0072
    116        0.9710       0.5400        1.5399  0.0071
    117        0.9081       0.5400        1.5393  0.0086
    118        0.9285       0.5380        1.5347  0.0063
    119        0.8962       0.5440        1.5320  0.0080
    120        0.9140       0.5420        1.5285  0.0066
    121        0.8729

<class 'skorch.classifier.NeuralNetClassifier'>[initialized](
  module_=GCN(
    (conv1): GCNConv(1433, 16)
    (conv2): GCNConv(16, 7)
  ),
)

### Evaluation

In [11]:
from sklearn.metrics import accuracy_score

In [13]:
accuracy_score(ds_test.y, net.predict(ds_test))

0.683