# Deep learning on graph data 

Laurent Cetinsoy

Reference : https://tkipf.github.io/graph-convolutional-networks/


Graphs are a powerful data structure that can represent many kinds of data:

- molecules
- social networks interactions

Deep learning on graph data has become a major subfield of deep learning. 

Our goal is to learn the basics of machine learning on graphs.

## Graph reminders

A graph is described by two sets:

- the set of nodes (also called vertices). Each node can hold values
- the set of edges between the nodes.

There are many ways to represent graph data in computer science.

One simple way is to have:
- a list of node values
- a list of tuples representing the edges. Each tuple would be (iNodeSource, iNodeTarget)
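
As a minimal sketch of this representation, here is an arbitrary 4-node graph (the values and names below are made up for illustration and are not the ones from the exercise that follows):

```python
# Hypothetical 4-node graph used only for illustration
node_values = [0.5, -2.0, 3.0, 1.5]

# Each tuple is (i_node_source, i_node_target)
edges = [(0, 1), (1, 2), (2, 3)]

# Walk over the edges and look up the value stored on each endpoint
for i_source, i_target in edges:
    print(f"node {i_source} (value {node_values[i_source]}) -> "
          f"node {i_target} (value {node_values[i_target]})")
```

The node index is simply the position in the list of values, so an edge is valid as long as both of its indices are smaller than the number of nodes.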



Create a list of 3 nodes having the values (-1, 2, 4).



 
Create a list of undirected connections where node 0 is connected to nodes 1 and 2.

Another way to represent the edges of a graph is to use the adjacency matrix.

Let's say that the nodes of your graph are indexed (0, 1, 2, ..., n-1).

Then the adjacency matrix is an n by n matrix.

The value $a_{ij}$ indicates whether node $i$ is connected to node $j$ (1 if connected, 0 otherwise).

The adjacency matrix is very interesting because it carries a lot of information about the underlying graph.

However, one drawback is that it can consume a lot of memory (this is where scipy.sparse matrices come in).
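
To make the memory remark concrete, here is a small sketch that compares a dense adjacency matrix with a scipy.sparse one. The ring graph and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from scipy import sparse

n_nodes = 1000
# Hypothetical ring graph: node i is connected to node (i + 1) % n_nodes
rows = np.arange(n_nodes)
cols = (rows + 1) % n_nodes

# Store both directions so that the matrix is symmetric (undirected graph)
values = np.ones(2 * n_nodes, dtype=np.int8)
A_sparse = sparse.coo_matrix(
    (values, (np.concatenate([rows, cols]), np.concatenate([cols, rows]))),
    shape=(n_nodes, n_nodes),
).tocsr()

A_dense = A_sparse.toarray()
print("dense entries stored :", A_dense.size)   # n_nodes * n_nodes
print("sparse entries stored:", A_sparse.nnz)   # 2 * number of edges
```

The dense matrix stores every pair of nodes, while the sparse one only stores the pairs that are actually connected.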

Modify the following adjacency matrix so that node 1 is connected to node 3 (nodes are counted from 1 here, since the matrix is 3 by 3).


In [None]:
import numpy as np 
A = np.array([
    [0, 0, 0], 
    [0, 0, 1], 
    [0, 1, 0]]
)
A

array([[0, 0, 1],
       [0, 0, 1],
       [1, 0, 0]])

Create a function build_adjency_matrix(edges) which builds a numpy adjacency matrix from a list of edges.
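
One possible sketch, assuming the edges are undirected and given as (source, target) tuples. The optional n_nodes parameter is an addition for convenience and is not part of the exercise statement.

```python
import numpy as np

def build_adjency_matrix(edges, n_nodes=None):
    """Build a dense adjacency matrix from a list of (source, target) tuples.

    n_nodes is an optional extra parameter (an assumption, not in the
    exercise statement); by default it is inferred from the largest index.
    """
    if n_nodes is None:
        n_nodes = max(max(i, j) for i, j in edges) + 1
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    for i, j in edges:
        A[i, j] = 1
        A[j, i] = 1  # edges are treated as undirected
    return A

print(build_adjency_matrix([(0, 1), (0, 2)]))
```

Note that inferring the size from the edges misses isolated nodes with a large index, which is why passing n_nodes explicitly can be useful.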

Code a function neighbors(edges, i_node) which returns the list of indices of the nodes that are direct neighbors of i_node.
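
A possible sketch, assuming undirected edges stored as tuples (returning the indices sorted is a choice made here, not a requirement of the exercise):

```python
def neighbors(edges, i_node):
    """Return the sorted indices of the nodes sharing an edge with i_node."""
    found = set()
    for i, j in edges:
        if i == i_node:
            found.add(j)
        elif j == i_node:
            found.add(i)
    return sorted(found)

edges = [(0, 1), (0, 2)]
print(neighbors(edges, 0))
print(neighbors(edges, 2))
```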


Code a function average_neighbors(nodes, edges, i_node) which computes the average of the values of the direct neighbors of the node i_node.
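
A possible sketch, reusing a simple neighbors helper. Returning 0.0 for a node without neighbors is a convention chosen here, not something the exercise specifies.

```python
def neighbors(edges, i_node):
    """Return the sorted indices of the nodes sharing an edge with i_node."""
    found = set()
    for i, j in edges:
        if i == i_node:
            found.add(j)
        elif j == i_node:
            found.add(i)
    return sorted(found)

def average_neighbors(nodes, edges, i_node):
    """Average the values of the direct neighbors of i_node."""
    idx = neighbors(edges, i_node)
    if not idx:
        return 0.0  # convention chosen here for isolated nodes
    return sum(nodes[j] for j in idx) / len(idx)

nodes = [-1, 2, 4]
edges = [(0, 1), (0, 2)]
print(average_neighbors(nodes, edges, 0))  # (2 + 4) / 2
```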

Code a function weighted_average_neihhbors(nodes, edges, i_node) which computes a weighted average of the neighbor values using randomly generated weights.

The above function can be seen as a random convolution on a graph node. Use it to compute a new set of node values on all the nodes of the graph
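
A possible sketch of both steps. The rng parameter and the weight range (0.1 to 1.0, chosen so that the weights never sum to zero) are assumptions added for reproducibility.

```python
import random

def neighbors(edges, i_node):
    """Return the sorted indices of the nodes sharing an edge with i_node."""
    found = set()
    for i, j in edges:
        if i == i_node:
            found.add(j)
        elif j == i_node:
            found.add(i)
    return sorted(found)

def weighted_average_neihhbors(nodes, edges, i_node, rng=random):
    """Weighted average of the neighbor values with random positive weights."""
    idx = neighbors(edges, i_node)
    if not idx:
        return 0.0  # convention chosen here for isolated nodes
    weights = [rng.uniform(0.1, 1.0) for _ in idx]
    total = sum(weights)
    return sum(w * nodes[j] for w, j in zip(weights, idx)) / total

nodes = [-1, 2, 4]
edges = [(0, 1), (0, 2)]
rng = random.Random(0)  # fixed seed so that the run is repeatable

# "Random convolution": recompute every node value from its neighbors
new_nodes = [weighted_average_neihhbors(nodes, edges, i, rng)
             for i in range(len(nodes))]
print(new_nodes)
```

Because the weights are normalized by their sum, each new value stays between the smallest and the largest neighbor value.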




The previous function allowed us to create a new set of node features by locally combining each node with its neighbors. This is how graph convolutional layers work: a GCN layer processes each node in the same way, but it uses learnable weights instead of random ones.


Besides, instead of just taking an average, more complex operations can be used.

Remark: usually the edges are not modified by the convolution.

Such approaches are called message passing, because each node sends a message (its value) to its neighbors.
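
Message passing over a whole graph can also be written with the adjacency matrix. The sketch below, on a small hand-made graph, computes the plain neighbor average for every node at once (the variable names are illustrative).

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)   # node 0 connected to nodes 1 and 2
x = np.array([-1.0, 2.0, 4.0])           # one value per node

messages = A @ x            # each node sums the values sent by its neighbors
degree = A.sum(axis=1)      # number of neighbors of each node
x_new = messages / degree   # neighbor average (every node here has a neighbor)
print(x_new)
```

Row i of A selects the neighbors of node i, so the matrix product performs the "send your value to your neighbors" step for all nodes in a single operation.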

![graph_convolution.png](graph_convolution.png)

## Deep learning on graphs with PyTorch Geometric

Now that we have some basics, we will learn to use PyTorch Geometric to do simple graph classification with deep learning.

Install PyTorch Geometric.

In [None]:
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.10.0+cpu.html

Import pytorch and the Data class from pytorch Geometric

In [None]:
import torch
from torch_geometric.data import Data

Create a PyTorch tensor with the values -1.0, 2.0, 3.5 and store it in the variable **nodes**.
It will represent the node values of your graph. We consider that the first node has index 0 and value -1.0.

In [None]:
nodes = torch.tensor([-1.0, 2.0, 3.5])


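The following code lets you build the edges of the graph the PyTorch Geometric way. Each column of edge_index is one directed edge (source on the first row, target on the second), so an undirected edge appears twice, once per direction.

Modify it so that node 2 is also connected to node 3.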



In [None]:
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)

# x is expected to have shape [num_nodes, num_node_features]
data = Data(x=nodes.view(-1, 1), edge_index=edge_index)
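
The edge_index layout can also be derived from a plain list of edge tuples. A minimal sketch, assuming a hypothetical undirected edge list named edges:

```python
import torch

edges = [(0, 1), (1, 2)]  # hypothetical undirected edge list

# Store each undirected edge in both directions: (i, j) and (j, i)
directed = edges + [(j, i) for i, j in edges]

# Transpose the (num_edges, 2) list into the (2, num_edges) layout
edge_index = torch.tensor(directed, dtype=torch.long).t().contiguous()
print(edge_index)
```

The order of the columns does not matter for the graph itself; only the set of (source, target) pairs does.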

Display the number of edges with the .num_edges attribute



Display the number of nodes with the .num_nodes attribute

## Let's load a dataset


Run the following cell, which loads a dataset.

In [None]:
from torch_geometric.datasets import TUDataset

dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')

Display the type of one element of the dataset (for example dataset[0]). You should see the same Data type as before.

Display the number of classes.

Display the number of features per node (a node can hold a vector, not only a single number!).

Import the DataLoader class from torch_geometric.loader.

In [None]:
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader


From dataset variable, create a DataLoader with batch_size=32, shuffle=True

In [None]:

dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES', use_node_attr=True)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    print(batch)
    # e.g. DataBatch(batch=[1082], edge_index=[2, 4066], x=[1082, 21], y=[32])

    print(batch.num_graphs)
    # e.g. 32

The following class lets you define a PyTorch model using GCN layers.

Note the following interesting points:

When you use the GCNConv layer in the forward function, you pass both the node features and the edges.

Besides, it only updates the node values (the features), not the edges!
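
To see why a GCN layer needs both inputs, here is a NumPy sketch of the propagation rule described in the reference at the top of this notebook: add self-loops, normalize the adjacency matrix symmetrically, then mix the features with a weight matrix. The random W stands in for the learnable weights, the bias term is omitted, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)   # graph connectivity
X = np.array([[-1.0], [2.0], [3.5]])     # 3 nodes, 1 feature each
W = rng.normal(size=(1, 4))              # stands in for the learnable weights

A_tilde = A + np.eye(3)                  # add self-loops
d = A_tilde.sum(axis=1)                  # degrees including the self-loop
D_inv_sqrt = np.diag(d ** -0.5)
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization

H = A_hat @ X @ W                        # new node features, shape (3, 4)
print(H.shape)
```

The output has one row per node, just like the input: only the node features change, while the connectivity A is reused unchanged by the next layer.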

In [None]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_node_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        # Each GCN layer updates the node features using the graph connectivity
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.conv2(x, edge_index)

        # ENZYMES has one label per graph: average the node outputs of each graph
        x = global_mean_pool(x, data.batch)

        return F.log_softmax(x, dim=1)

In addition, since the output of a GCN layer is again one vector per node, you do not need flatten or dense layers (but you can add dense layers if you want). Because ENZYMES has one label per graph, the node vectors of each graph are averaged before the softmax.

So what happened? The GCN layers have just transformed the node features using both the original node features and the graph connectivity.

How does it work? Let's see.

Create an Adam optimizer and train your model. Remember to use the zero_grad and backward methods.


In [None]:
device = 'cpu'
model = GCN().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

# ENZYMES has no train/test masks: hold out 10% of the graphs for testing
dataset = dataset.shuffle()
n_train = int(0.9 * len(dataset))
train_loader = DataLoader(dataset[:n_train], batch_size=32, shuffle=True)
test_loader = DataLoader(dataset[n_train:], batch_size=32)

model.train()
for epoch in range(200):
    for batch in train_loader:
        batch = batch.to(device)
        optimizer.zero_grad()
        out = model(batch)
        loss = F.nll_loss(out, batch.y)
        loss.backward()
        optimizer.step()


Evaluate your model.

In [None]:
model.eval()
correct = 0
with torch.no_grad():
    for batch in test_loader:
        batch = batch.to(device)
        pred = model(batch).argmax(dim=1)
        correct += int((pred == batch.y).sum())
acc = correct / len(test_loader.dataset)
print(f'Accuracy: {acc:.4f}')

