# Introduction to PyTorch Geometric (PyG)

PyTorch Geometric (PyG): https://pytorch-geometric.readthedocs.io/en/latest/

## 0. Installation

See https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html

## 1. Data Format

See https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html#data-handling-of-graphs and understand the meaning of `edge_index`.
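
To make the meaning of `edge_index` concrete, here is a minimal dependency-free sketch that turns a small adjacency matrix into the two-row "source / target" layout. The 3-node graph and the helper `to_edge_index` are hypothetical and only for illustration. In PyG the result would be stored as a `torch.long` tensor of shape `[2, num_edges]`.

```python
# Hypothetical undirected path graph: 0 - 1 - 2
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

def to_edge_index(adj):
    """Collect (source, target) pairs of all non-zero entries, row by row."""
    sources, targets = [], []
    for i, row in enumerate(adj):
        for j, value in enumerate(row):
            if value != 0:
                sources.append(i)
                targets.append(j)
    # Row 0 holds the source nodes, row 1 the target nodes.
    return [sources, targets]

edge_index = to_edge_index(adjacency)
print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```

Each column is one directed edge, so an undirected edge such as 0 - 1 appears twice: once as (0, 1) and once as (1, 0).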

## 2. Example

The following code shows how to use PyG to build a GCN for the node classification task. We will walk through the code and add comments to it in this lecture.

In [14]:
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Citeseer', name='Citeseer')

In [15]:
dataset 

Citeseer()

In [16]:
data = dataset[0]
data

Data(x=[3327, 3703], edge_index=[2, 9104], y=[3327], train_mask=[3327], val_mask=[3327], test_mask=[3327])

In [17]:
data.x.shape

torch.Size([3327, 3703])

In [18]:
data.y.max() + 1

tensor(6)

In [19]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

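class GCN(torch.nn.Module):
    def __init__(self, num_node_features, num_hidden, num_classes):
        super().__init__()
        # two graph convolution layers: features -> hidden -> class scores
        self.conv1 = GCNConv(num_node_features, num_hidden)
        self.conv2 = GCNConv(num_hidden, num_classes)

    def forward(self, data):
        # x: [num_nodes, num_node_features], edge_index: [2, num_edges]
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        # dropout is active only in training mode (after model.train())
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)

        # log-probabilities per class, to be paired with F.nll_loss
        return F.log_softmax(x, dim=1)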
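
To give some intuition for what a single graph convolution computes, the following is a minimal NumPy sketch of the standard GCN propagation rule, `D^{-1/2} (A + I) D^{-1/2} X W`, assuming self-loops and symmetric normalization. It is a dense toy version without a bias term, not PyG's sparse implementation. The function `gcn_layer` and the tiny graph are hypothetical.

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One dense GCN propagation step: D^{-1/2} (A + I) D^{-1/2} X W."""
    adj_hat = adj + np.eye(adj.shape[0])   # add self-loops
    deg = adj_hat.sum(axis=1)              # node degrees including self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Scale entry (i, j) by 1 / sqrt(deg_i * deg_j).
    norm_adj = adj_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return norm_adj @ features @ weight

# Hypothetical path graph 0 - 1 - 2 with one-hot node features.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
features = np.eye(3)
weight = np.ones((3, 2))   # maps 3 input features to 2 output channels

out = gcn_layer(adj, features, weight)
print(out.shape)  # (3, 2)
```

Each output row is a normalized weighted sum over the node itself and its neighbors, which is why stacking two such layers lets a node see information from two hops away.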

In [20]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data = dataset[0].to(device)
model = GCN(num_node_features=data.x.shape[1], 
            num_hidden=16,
            num_classes=(data.y.max()+1).item()
           ).to(device)

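# Adam with L2 regularization (weight_decay) on all parameters
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()  # enable dropout
for epoch in range(200):
    optimizer.zero_grad()  # clear gradients from the previous step
    out = model(data)      # full-batch forward pass over all nodes
    # compute the loss on the training nodes only
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print('Epoch {0}: {1}'.format(epoch, loss.item()))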

Epoch 0: 1.7944680452346802
Epoch 10: 0.32050636410713196
Epoch 20: 0.09472008049488068
Epoch 30: 0.0439465306699276
Epoch 40: 0.049599695950746536
Epoch 50: 0.048935387283563614
Epoch 60: 0.042063795030117035
Epoch 70: 0.028407957404851913
Epoch 80: 0.04070654138922691
Epoch 90: 0.041722141206264496
Epoch 100: 0.053463321179151535
Epoch 110: 0.021726898849010468
Epoch 120: 0.043291542679071426
Epoch 130: 0.018726319074630737
Epoch 140: 0.03828861191868782
Epoch 150: 0.01987367682158947
Epoch 160: 0.026789499446749687
Epoch 170: 0.029931657016277313
Epoch 180: 0.06403932720422745
Epoch 190: 0.02754107676446438
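
A quick sanity check on the numbers above: the epoch-0 loss of about 1.79 is close to ln 6, which is what the negative log-likelihood gives when a model spreads its probability roughly uniformly over six classes. The sketch below reproduces that value in plain Python. The helper `nll_loss` and the target labels are hypothetical stand-ins, not PyTorch's implementation.

```python
import math

def nll_loss(log_probs, targets):
    """Mean negative log-likelihood of the target class for each row."""
    return -sum(row[t] for row, t in zip(log_probs, targets)) / len(targets)

num_classes = 6
# Four hypothetical nodes, each with a uniform prediction over the classes.
uniform_log_probs = [[math.log(1.0 / num_classes)] * num_classes for _ in range(4)]
targets = [0, 3, 5, 2]

loss = nll_loss(uniform_log_probs, targets)
print(round(loss, 4))  # 1.7918
```

If the first printed loss is far from this value, it is worth checking the number of classes and the label encoding before training further.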


In [21]:
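model.eval()  # disable dropout for evaluation
pred = model(data).argmax(dim=1)  # predicted class for every node
# accuracy on the held-out test nodes
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())
print(f'Accuracy: {acc:.4f}')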

Accuracy: 0.6790


## Q: How to use PyG in our project?

Essentially, only a few more steps are needed:
- Convert the provided data into the PyG format.
    - PyG provides `from_scipy_sparse_matrix` in `torch_geometric.utils` to turn a SciPy sparse adjacency matrix into `edge_index`.

In [12]:
import scipy.sparse as sp
import numpy as np
import json
adj = sp.load_npz('./cse881project/data_2023/adj.npz')
feat = np.load('./cse881project/data_2023/features.npy')
labels = np.load('./cse881project/data_2023/labels.npy')
with open('./cse881project/data_2023/splits.json') as f:
    splits = json.load(f)
idx_train, idx_test = splits['idx_train'], splits['idx_test']

In [2]:
from torch_geometric.utils import from_scipy_sparse_matrix

In [3]:
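# from_scipy_sparse_matrix returns a tuple (edge_index, edge_weight);
# use edge_index[0] (or unpack the tuple) to get the [2, num_edges] tensor
edge_index = from_scipy_sparse_matrix(adj)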

In [4]:
edge_index

(tensor([[   0,    0,    0,  ..., 2478, 2478, 2479],
         [1084, 1104, 1288,  ...,  931,  933,  999]]),
 tensor([1., 1., 1.,  ..., 1., 1., 1.]))

In [5]:
feat

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0]])

In [7]:
labels.shape

(496,)

In [11]:
len(splits['idx_train']), len(splits['idx_test'])

(496, 1984)
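
Note that `labels` has only 496 entries, one per training node, while the graph has 496 + 1984 = 2480 nodes. One way to line things up is to build a full-length label vector plus boolean masks, in the same spirit as `data.y`, `data.train_mask` and `data.test_mask` in the Citeseer example. The sketch below uses small hypothetical stand-in arrays and assumes that `labels[k]` belongs to node `idx_train[k]`. That assumption should be checked against the project description.

```python
import numpy as np

# Hypothetical stand-ins for the project arrays (the real ones are loaded from disk).
num_nodes = 10
idx_train = [7, 2, 5]               # node ids that have labels
idx_test = [0, 1, 3, 4, 6, 8, 9]    # node ids to predict
labels = np.array([1, 0, 2])        # assumed: labels[k] belongs to node idx_train[k]

# Full-length label vector; -1 marks nodes without a known label.
y = np.full(num_nodes, -1, dtype=np.int64)
y[idx_train] = labels

# Boolean masks selecting the training and test nodes.
train_mask = np.zeros(num_nodes, dtype=bool)
train_mask[idx_train] = True
test_mask = np.zeros(num_nodes, dtype=bool)
test_mask[idx_test] = True

print(y)  # [-1 -1  0 -1 -1  2 -1  1 -1 -1]
print(train_mask.sum(), test_mask.sum())  # 3 7
```

These arrays can then be converted to tensors and attached to a PyG `Data` object together with the features and `edge_index`, so that the training loop can index the loss with the training mask only.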

## How to submit the result

In [22]:
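# `pred` should hold the predictions of a model trained on the project graph
preds = pred[idx_test].cpu().numpy()  # move to CPU before saving with NumPy
np.savetxt('submission.txt', preds, fmt='%d')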
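
Before submitting, it can help to read the file back and confirm that it contains one integer label per line, in the same order as `idx_test`. The following is a minimal round-trip sketch with hypothetical predictions, written to a temporary directory so that it does not touch a real submission file.

```python
import os
import tempfile
import numpy as np

# Hypothetical predictions for five test nodes (stand-in for pred[idx_test]).
preds = np.array([3, 0, 6, 2, 2])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'submission.txt')
    np.savetxt(path, preds, fmt='%d')

    # Read the file back as integers, one value per line.
    loaded = np.loadtxt(path, dtype=int)

print(loaded.tolist())
print(loaded.shape == preds.shape)
```

For the real file, the number of loaded values should equal `len(idx_test)`.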