# SD212: Graph mining

## Lab 7: Graph neural networks

In this lab, you will learn to classify nodes using graph neural networks.

We use [DGL](https://www.dgl.ai) (Deep Graph Library), which relies on PyTorch.

In [1]:
# pip install dgl

## Import

In [2]:
import numpy as np
from scipy import sparse

In [3]:
from sknetwork.data import load_netset
from sknetwork.classification import DiffusionClassifier
from sknetwork.embedding import Spectral
from sknetwork.utils import directed2undirected

In [4]:
import dgl
from dgl.nn import SAGEConv
from dgl import function as fn

In [5]:
import torch
from torch import nn
import torch.nn.functional as F

In [6]:
# ignore warnings from DGL
import warnings
warnings.filterwarnings('ignore')

## Load data

We will work on the following datasets (see the [NetSet](https://netset.telecom-paris.fr/) collection for details):
* Cora (directed graph + bipartite graph)
* WikiVitals (directed graph + bipartite graph)

Both datasets are graphs with node features (given by the bipartite graph) and ground-truth labels.

In [7]:
cora = load_netset('cora')
wikivitals = load_netset('wikivitals')

Parsing files...
Done.
Parsing files...
Done.


In [8]:
dataset = cora
#dataset = wikivitals

In [9]:
adjacency = dataset.adjacency
biadjacency = dataset.biadjacency
labels = dataset.labels

In [10]:
# we use undirected graphs
adjacency = directed2undirected(adjacency)

In [11]:
# for Wikivitals, use spectral embedding of the bipartite graph as features

if dataset.meta.name.startswith('Wikivitals'):
    spectral = Spectral(50)
    features = spectral.fit_transform(biadjacency)
else:
    features = biadjacency.toarray()

In [12]:
def split_train_test_val(n_samples, test_ratio=0.1, val_ratio=0.1, seed=None):
    """Split the samples into train / test / validation sets.
    
    Returns
    -------
    train: np.ndarray
        Boolean mask
    test: np.ndarray
        Boolean mask
    validation: np.ndarray
        Boolean mask
    """
    # compare with None so that seed=0 is also taken into account
    if seed is not None:
        np.random.seed(seed)

    # test
    index = np.random.choice(n_samples, int(np.ceil(n_samples * test_ratio)), replace=False)
    test = np.zeros(n_samples, dtype=bool)
    test[index] = 1
    
    # validation
    index = np.random.choice(np.argwhere(~test).ravel(), int(np.ceil(n_samples * val_ratio)), replace=False)
    val = np.zeros(n_samples, dtype=bool)
    val[index] = 1
    
    # train
    train = np.ones(n_samples, dtype=bool)
    train[test] = 0
    train[val] = 0
    return train, test, val
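
A split of this kind should produce three masks that are pairwise disjoint and together cover all samples. The sketch below checks this on a permutation-based variant. The name `split_masks` and the use of `np.random.default_rng` are assumptions made for illustration; this is not the function defined above.

```python
import numpy as np

def split_masks(n_samples, test_ratio=0.1, val_ratio=0.1, seed=None):
    """Hypothetical permutation-based split into boolean masks."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(np.ceil(n_samples * test_ratio))
    n_val = int(np.ceil(n_samples * val_ratio))
    test_mask = np.zeros(n_samples, dtype=bool)
    val_mask = np.zeros(n_samples, dtype=bool)
    # the first indices of the permutation go to test, the next ones to validation
    test_mask[order[:n_test]] = True
    val_mask[order[n_test:n_test + n_val]] = True
    # everything else is used for training
    train_mask = ~(test_mask | val_mask)
    return train_mask, test_mask, val_mask

train_mask, test_mask, val_mask = split_masks(100, seed=0)
# each sample belongs to exactly one of the three sets
membership_count = train_mask.astype(int) + test_mask.astype(int) + val_mask.astype(int)
assert np.all(membership_count == 1)
```

Drawing a single permutation guarantees disjoint sets by construction, whereas the function above obtains the same property by sampling the validation set among the non-test nodes.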

In [13]:
train, test, val = split_train_test_val(len(labels))

## Graph and tensors

In DGL, the graph is represented as a `DGLGraph` object, while the features and labels are stored as PyTorch tensors.

In [14]:
# graph as an object
graph = dgl.from_scipy(adjacency)

In [15]:
type(graph)

dgl.heterograph.DGLGraph

In [16]:
# features and labels as tensors
features = torch.Tensor(features)
labels = torch.Tensor(labels).long()

In [17]:
# masks as tensors
train = torch.Tensor(train).bool()
test = torch.Tensor(test).bool()
val = torch.Tensor(val).bool()

## Graph neural network

We start with a simple graph neural network without a hidden layer. The output layer is of type [GraphSAGE](https://docs.dgl.ai/generated/dgl.nn.pytorch.conv.SAGEConv.html).

In [18]:
class GNN(nn.Module):
    def __init__(self, dim_input, dim_output):
        super(GNN, self).__init__()
        self.conv = SAGEConv(dim_input, dim_output, aggregator_type='mean')
        
    def forward(self, graph, features):
        output = self.conv(graph, features)
        return output
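
To make the role of the layer above more concrete, here is a minimal dense NumPy sketch of what a GraphSAGE layer with mean aggregation computes: a linear map of the node's own features plus a linear map of the average of its neighbors' features. The function name `sage_mean_layer` and the weight names are hypothetical, and the sketch leaves out the optional extras of the DGL layer; it is not DGL's implementation.

```python
import numpy as np

def sage_mean_layer(adjacency, features, w_self, w_neigh, bias):
    """Illustrative mean-aggregator GraphSAGE layer on a dense adjacency matrix."""
    adjacency = np.asarray(adjacency, dtype=float)
    features = np.asarray(features, dtype=float)
    degrees = adjacency.sum(axis=1, keepdims=True)
    # average of the neighbors' features; nodes without neighbors get zeros
    neighbor_mean = np.divide(adjacency @ features, degrees,
                              out=np.zeros_like(features), where=degrees > 0)
    return features @ w_self + neighbor_mean @ w_neigh + bias

# path graph 0 - 1 - 2 with two features per node
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]])
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
# identity weights and zero bias, so the output is "self + neighbor mean"
output = sage_mean_layer(adjacency, features, np.eye(2), np.eye(2), np.zeros(2))
```

With identity weights, node 1 gets its own features `[0, 1]` plus the mean `[1, 0.5]` of nodes 0 and 2.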

## To do

* Train the model on Cora and get accuracy.
* Compare with the same model trained on an empty graph.
* Add a hidden layer with ReLU activation function (e.g., dimension = 20) and retrain the model.
* Compare with a classifier based on heat diffusion.

In [19]:
def init_model(model, features, labels):
    '''Init the GNN with appropriate dimensions.'''
    dim_input = features.shape[1]
    dim_output = len(labels.unique())
    return model(dim_input, dim_output)   

In [20]:
def eval_model(gnn, graph, features, labels, test=test):
    '''Evaluate the model in terms of accuracy.'''
    gnn.eval()
    with torch.no_grad():
        output = gnn(graph, features)
        labels_pred = torch.max(output, dim=1)[1]
        score = np.mean(np.array(labels[test]) == np.array(labels_pred[test]))
    return score

In [21]:
def train_model(gnn, graph, features, labels, train=train, val=val, n_epochs=100, lr=0.01, verbose=True):
    '''Train the GNN.'''
    optimizer = torch.optim.Adam(gnn.parameters(), lr=lr)
    
    scores = []
    
    for t in range(n_epochs):
        # back to training mode (eval_model switches the model to evaluation mode)
        gnn.train()

        # forward
        output = gnn(graph, features)
        logp = nn.functional.log_softmax(output, 1)
        loss = nn.functional.nll_loss(logp[train], labels[train])

        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # evaluation
        score = eval_model(gnn, graph, features, labels, val)
        scores.append(score)
        
        if verbose and t % 10 == 0:
            print("Epoch {:02d} | Loss {:.3f} | Accuracy {:.3f}".format(t, loss.item(), score))
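
The loss used in the training loop is the mean negative log-likelihood of the true labels over the training nodes, obtained by a log-softmax followed by a selection of the log-probability of the true class. A minimal NumPy sketch, with the hypothetical name `masked_nll`, could look as follows.

```python
import numpy as np

def masked_nll(logits, labels, mask):
    """Mean negative log-likelihood over the nodes selected by a boolean mask."""
    # numerically stable log-softmax over the class axis
    shifted = logits - logits.max(axis=1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # keep the log-probability of the true class, on masked nodes only
    rows = np.flatnonzero(mask)
    return -logp[rows, labels[rows]].mean()

logits = np.zeros((4, 3))                      # uniform scores over 3 classes
true_labels = np.array([0, 2, 1, 1])
train_mask = np.array([True, True, False, False])
loss = masked_nll(logits, true_labels, train_mask)
```

With uniform scores over $k$ classes, the loss equals $\log k$. Cora has 7 classes and $\log 7 \approx 1.95$, which is the order of magnitude of the loss printed at epoch 0.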

In [22]:
gnn = init_model(GNN, features, labels)

In [23]:
train_model(gnn, graph, features, labels)

Epoch 00 | Loss 1.975 | Accuracy 0.310
Epoch 10 | Loss 0.602 | Accuracy 0.797
Epoch 20 | Loss 0.256 | Accuracy 0.841
Epoch 30 | Loss 0.140 | Accuracy 0.856
Epoch 40 | Loss 0.091 | Accuracy 0.856
Epoch 50 | Loss 0.066 | Accuracy 0.856
Epoch 60 | Loss 0.051 | Accuracy 0.852
Epoch 70 | Loss 0.042 | Accuracy 0.845
Epoch 80 | Loss 0.036 | Accuracy 0.849
Epoch 90 | Loss 0.031 | Accuracy 0.849


In [24]:
eval_model(gnn, graph, features, labels)

0.8671586715867159

In [25]:
empty_graph = dgl.from_scipy(sparse.csr_matrix(adjacency.shape))

In [26]:
gnn = init_model(GNN, features, labels)

In [27]:
# note: training here still uses the full graph; only the evaluation below uses the empty graph
train_model(gnn, graph, features, labels)

Epoch 00 | Loss 1.993 | Accuracy 0.339
Epoch 10 | Loss 0.607 | Accuracy 0.801
Epoch 20 | Loss 0.257 | Accuracy 0.849
Epoch 30 | Loss 0.141 | Accuracy 0.838
Epoch 40 | Loss 0.091 | Accuracy 0.838
Epoch 50 | Loss 0.066 | Accuracy 0.849
Epoch 60 | Loss 0.052 | Accuracy 0.845
Epoch 70 | Loss 0.042 | Accuracy 0.845
Epoch 80 | Loss 0.036 | Accuracy 0.845
Epoch 90 | Loss 0.031 | Accuracy 0.841


In [28]:
eval_model(gnn, empty_graph, features, labels, test)

0.7306273062730627

In [29]:
class GNN(nn.Module):
    def __init__(self, dim_input, dim_output, dim_hidden=20):
        super(GNN, self).__init__()
        self.conv1 = SAGEConv(dim_input, dim_hidden, 'mean')
        self.conv2 = SAGEConv(dim_hidden, dim_output, 'mean')
        
    def forward(self, graph, features):
        h = self.conv1(graph, features)
        h = F.relu(h)
        output = self.conv2(graph, h)
        return output

In [30]:
gnn = init_model(GNN, features, labels)

In [31]:
train_model(gnn, graph, features, labels)

Epoch 00 | Loss 2.132 | Accuracy 0.358
Epoch 10 | Loss 0.223 | Accuracy 0.863
Epoch 20 | Loss 0.057 | Accuracy 0.849
Epoch 30 | Loss 0.016 | Accuracy 0.849
Epoch 40 | Loss 0.006 | Accuracy 0.852
Epoch 50 | Loss 0.004 | Accuracy 0.849
Epoch 60 | Loss 0.002 | Accuracy 0.852
Epoch 70 | Loss 0.002 | Accuracy 0.849
Epoch 80 | Loss 0.002 | Accuracy 0.849
Epoch 90 | Loss 0.001 | Accuracy 0.849


In [32]:
eval_model(gnn, graph, features, labels)

0.8597785977859779

In [33]:
# comparison with heat diffusion
y_true = dataset.labels
y_train = y_true.copy()
y_train[test|val] = -1

In [34]:
algo = DiffusionClassifier()

In [35]:
y_pred = algo.fit_predict(biadjacency, y_train)

In [36]:
np.mean(y_pred[test] == y_true[test])

0.7675276752767528

In [37]:
y_pred = algo.fit_predict(adjacency, y_train)

In [38]:
np.mean(y_pred[test] == y_true[test])

0.8413284132841329

## Build your own GNN

You will now build your own GNN. We start with a simple graph convolution layer. 

In [39]:
class GraphConvLayer(nn.Module):
    def __init__(self, dim_input, dim_output):
        super(GraphConvLayer, self).__init__()
        self.layer = nn.Linear(dim_input, dim_output)
        
    def forward(self, graph, signal):
        with graph.local_scope():
            # message passing
            graph.ndata['node'] = signal
            graph.update_all(fn.copy_u('node', 'message'), fn.mean('message', 'average'))
            h = graph.ndata['average']
            return self.layer(h)

Observe that the message passing is based on the diffusion
$$
U\mapsto PU
$$
where $P = D^{-1}A$ is the transition matrix of the random walk and $U$ is the matrix of node signals.
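
The following NumPy sketch illustrates this on a toy graph (the graph and the variable names are chosen for illustration): multiplying the signal by the transition matrix gives, for each node, the mean of the signal over its neighbors, which is what the `mean` reducer computes in the layer above.

```python
import numpy as np

# small undirected graph: a triangle 0-1-2 plus a node 3 attached to node 2
adjacency = np.array([[0, 1, 1, 0],
                      [1, 0, 1, 0],
                      [1, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
degrees = adjacency.sum(axis=1)
transition = adjacency / degrees[:, None]          # P = D^{-1} A, rows sum to 1

signal = np.array([[1.0], [2.0], [3.0], [4.0]])    # U, one column per feature
diffused = transition @ signal                     # PU

# row i of PU is the mean of U over the neighbors of node i
neighbor_means = np.array([[signal[adjacency[i] > 0].mean()] for i in range(4)])
assert np.allclose(diffused, neighbor_means)
```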

## To do

* Build a GNN with two layers based on this graph convolution layer.
* Train this GNN and compare the results with the previous one.
* Add the input signal of each node, so that the message passing becomes:
$$
U\mapsto (I + P)U
$$
* Retrain the GNN and observe the results.
* Retrain the same GNN without message passing in the first layer.
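
As a hint for the third item, the sketch below compares $PU$ and $(I + P)U$ in NumPy on a toy graph with an isolated node. It assumes the convention that a node without neighbors receives a zero message, so the corresponding row of $P$ is zero; the variable names are illustrative.

```python
import numpy as np

# one edge 0-1 plus an isolated node 2
adjacency = np.array([[0, 1, 0],
                      [1, 0, 0],
                      [0, 0, 0]], dtype=float)
degrees = adjacency.sum(axis=1)
# P = D^{-1} A, with zero rows for isolated nodes (assumed convention)
inverse_degrees = np.divide(1.0, degrees, out=np.zeros_like(degrees), where=degrees > 0)
transition = inverse_degrees[:, None] * adjacency

signal = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [5.0, 5.0]])                      # U

without_self = transition @ signal                   # PU: neighbors only
with_self = (np.eye(3) + transition) @ signal        # (I + P)U = U + PU
```

With $PU$ alone, each node forgets its own signal (nodes 0 and 1 simply swap their features, and the isolated node gets zeros). With $(I + P)U$, the input signal of each node is kept and the neighbors' average is added to it.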

In [40]:
class GNN(nn.Module):
    def __init__(self, n_input, n_output, n_hidden=20):
        super(GNN, self).__init__()
        self.conv1 = GraphConvLayer(n_input, n_hidden)
        self.conv2 = GraphConvLayer(n_hidden, n_output)
        
    def forward(self, graph, features):
        h = self.conv1(graph, features)
        h = F.relu(h)
        h = self.conv2(graph, h)
        return h

In [41]:
gnn = init_model(GNN, features, labels)
train_model(gnn, graph, features, labels, train, val, verbose=True)

Epoch 00 | Loss 1.962 | Accuracy 0.210
Epoch 10 | Loss 1.104 | Accuracy 0.672
Epoch 20 | Loss 0.469 | Accuracy 0.849
Epoch 30 | Loss 0.275 | Accuracy 0.856
Epoch 40 | Loss 0.204 | Accuracy 0.863
Epoch 50 | Loss 0.161 | Accuracy 0.856
Epoch 60 | Loss 0.132 | Accuracy 0.860
Epoch 70 | Loss 0.111 | Accuracy 0.845
Epoch 80 | Loss 0.093 | Accuracy 0.830
Epoch 90 | Loss 0.079 | Accuracy 0.827


In [42]:
eval_model(gnn, graph, features, labels, test)

0.8302583025830258

In [43]:
class GraphConvLayer(nn.Module):
    def __init__(self, dim_input, dim_output):
        super(GraphConvLayer, self).__init__()
        self.layer = nn.Linear(dim_input, dim_output)
        
    def forward(self, graph, signal):
        with graph.local_scope():
            # message passing
            graph.ndata['node'] = signal
            graph.update_all(fn.copy_u('node', 'message'), fn.mean('message', 'average'))
            h = graph.ndata['average']
            return self.layer(h + signal)

In [44]:
class GNN(nn.Module):
    def __init__(self, n_input, n_output, n_hidden=20):
        super(GNN, self).__init__()
        self.conv1 = GraphConvLayer(n_input, n_hidden)
        self.conv2 = GraphConvLayer(n_hidden, n_output)
        
    def forward(self, graph, features):
        h = self.conv1(graph, features)
        h = F.relu(h)
        h = self.conv2(graph, h)
        return h

In [45]:
gnn = init_model(GNN, features, labels)
train_model(gnn, graph, features, labels, train, val, verbose=True)

Epoch 00 | Loss 1.969 | Accuracy 0.590
Epoch 10 | Loss 0.397 | Accuracy 0.867
Epoch 20 | Loss 0.158 | Accuracy 0.863
Epoch 30 | Loss 0.081 | Accuracy 0.845
Epoch 40 | Loss 0.046 | Accuracy 0.838
Epoch 50 | Loss 0.030 | Accuracy 0.838
Epoch 60 | Loss 0.020 | Accuracy 0.834
Epoch 70 | Loss 0.015 | Accuracy 0.838
Epoch 80 | Loss 0.012 | Accuracy 0.838
Epoch 90 | Loss 0.010 | Accuracy 0.834


In [46]:
eval_model(gnn, graph, features, labels, test)

0.8597785977859779

In [47]:
class ConvLayer(nn.Module):
    def __init__(self, dim_input, dim_output):
        super(ConvLayer, self).__init__()
        self.layer = nn.Linear(dim_input, dim_output)
        
    def forward(self, features):
        output = self.layer(features)
        return output

In [48]:
class GNN(nn.Module):
    def __init__(self, n_input, n_output, n_hidden=20):
        super(GNN, self).__init__()
        self.conv1 = ConvLayer(n_input, n_hidden)
        self.conv2 = GraphConvLayer(n_hidden, n_output)
        
    def forward(self, graph, features):
        h = self.conv1(features)
        h = F.relu(h)
        h = self.conv2(graph, h)
        return h

In [49]:
gnn = init_model(GNN, features, labels)

In [50]:
train_model(gnn, graph, features, labels, train, val, verbose=True)

Epoch 00 | Loss 1.970 | Accuracy 0.328
Epoch 10 | Loss 0.660 | Accuracy 0.845
Epoch 20 | Loss 0.190 | Accuracy 0.867
Epoch 30 | Loss 0.079 | Accuracy 0.863
Epoch 40 | Loss 0.037 | Accuracy 0.863
Epoch 50 | Loss 0.022 | Accuracy 0.863
Epoch 60 | Loss 0.015 | Accuracy 0.860
Epoch 70 | Loss 0.011 | Accuracy 0.860
Epoch 80 | Loss 0.009 | Accuracy 0.860
Epoch 90 | Loss 0.007 | Accuracy 0.860


In [51]:
eval_model(gnn, graph, features, labels, test)

0.8671586715867159

## Heat diffusion as a GNN

Node classification by heat diffusion can be seen as a GNN without training, using a one-hot encoding of labels. Features are ignored.

## To do

* Build a special GNN whose output corresponds to one step of heat diffusion in the graph.
* Use this GNN to classify nodes by heat diffusion, with temperature centering.
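
For reference, the whole procedure can be sketched in plain NumPy under stated assumptions: one-hot temperatures on the labeled nodes, zero elsewhere, repeated averaging over neighbors with the labeled nodes reset after each step, then centering and an argmax. The function name `diffusion_classify` is hypothetical, and this is not the implementation used by scikit-network.

```python
import numpy as np

def diffusion_classify(adjacency, labels, n_iter=20):
    """Sketch of node classification by heat diffusion (labels < 0 are unknown)."""
    adjacency = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    n_nodes = len(labels)
    seeds = labels >= 0
    n_classes = labels.max() + 1
    degrees = adjacency.sum(axis=1)
    inverse_degrees = np.divide(1.0, degrees, out=np.zeros_like(degrees), where=degrees > 0)
    transition = inverse_degrees[:, None] * adjacency

    # one-hot temperatures on seed nodes, zero elsewhere
    seed_temperatures = np.zeros((n_nodes, n_classes))
    seed_temperatures[seeds, labels[seeds]] = 1.0

    temperatures = seed_temperatures.copy()
    for _ in range(n_iter):
        temperatures = transition @ temperatures         # one diffusion step
        temperatures[seeds] = seed_temperatures[seeds]   # seeds keep their temperature

    # temperature centering: subtract the mean temperature of each class
    temperatures -= temperatures.mean(axis=0)
    return temperatures.argmax(axis=1)

# path graph 0 - 1 - 2 - 3 - 4 - 5 with one labeled node at each end
path = np.diag(np.ones(5), 1) + np.diag(np.ones(5), -1)
predicted = diffusion_classify(path, np.array([0, -1, -1, -1, -1, 1]))
```

On this path, each unlabeled node is assigned the class of the closest labeled end.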

In [52]:
from sknetwork.utils import get_membership

In [53]:
labels_one_hot = get_membership(labels).toarray()
labels_one_hot = torch.Tensor(labels_one_hot)

In [54]:
class Diffusion(nn.Module):
    def __init__(self):
        super(Diffusion, self).__init__()
        
    def forward(self, graph, features, mask):
        '''Mask is a boolean tensor giving the training set.'''
        with graph.local_scope():
            h_node = features
            # diffusion
            graph.ndata['node'] = h_node
            # fn.copy_u replaces fn.copy_src, which is no longer available in recent DGL versions
            graph.update_all(fn.copy_u('node', 'message'), fn.mean('message', 'neighbor'))
            h_neighbor = graph.ndata['neighbor']
            # seed nodes
            h_neighbor[mask] = h_node[mask]
            return h_neighbor

In [55]:
diffusion = Diffusion()

In [56]:
n_iter = 20

# work on a copy so that labels_one_hot is not modified in place
temperatures = labels_one_hot.clone()
temperatures[~train] = 0
for t in range(n_iter):
    temperatures = diffusion(graph, temperatures, train)
    
# temperature centering
temperatures -= temperatures.mean(axis=0)

In [57]:
labels_pred = np.argmax(temperatures.numpy(), axis=1)

In [58]:
np.mean(labels.numpy()[test] == labels_pred[test])

In [59]:
# comparison with scikit-network
from sknetwork.classification import DiffusionClassifier

In [60]:
algo = DiffusionClassifier()

In [61]:
labels[~train] = -1
labels = labels.numpy()
labels_true = dataset.labels

In [62]:
labels_pred = algo.fit_predict(adjacency, labels)

In [63]:
np.mean(labels_true[test] == labels_pred[test])

0.8413284132841329