# Graph Learning Tasks
In this tutorial, we demonstrate some basic tasks in graph learning. Many graph learning problems fall into one of the following categories:

* Node classification: assign a label to a node.
* Link prediction: predict the existence of an edge between two nodes.
* Graph classification: assign a label to a graph.

Many real-world applications can be formulated as one of these graph problems:

* Fraud detection in financial transactions: transactions form a graph, where users are nodes and transactions are edges. We want to detect malicious users, which amounts to assigning a binary label to each user. This is a node classification problem.
* Community detection in a social network: a social network is naturally a graph, where nodes are users and edges are interactions between users. We want to predict which community a node belongs to, which is again node classification.
* Recommendation: users and items form a bipartite graph, with an edge between a user and an item when the user purchases the item. Given users' purchase history, we want to predict which items a user will purchase in the near future. Thus, recommendation is a link prediction problem.
* Drug discovery: a molecule is a graph whose nodes are atoms. We want to predict a property of the molecule, which means assigning a label to a whole graph. This is a graph classification problem.
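
To make the recommendation example concrete, the sketch below builds a tiny bipartite purchase graph in plain Python and lists the user-item pairs that a link prediction model would have to score. The users, items, and purchases are made-up illustration data and are not part of any dataset used in this tutorial.

```python
from itertools import product

# Hypothetical toy data: two users, three items, three observed purchases.
users = ["u1", "u2"]
items = ["i1", "i2", "i3"]
purchases = {("u1", "i1"), ("u1", "i2"), ("u2", "i2")}  # existing edges

# Link prediction asks: which of the missing user-item edges are likely to appear?
candidates = [(u, i) for u, i in product(users, items) if (u, i) not in purchases]

for user, item in candidates:
    print("score needed for edge", user, "->", item)
```

A real recommender would assign a score to each candidate pair, for example from the embeddings of the two endpoint nodes, and rank the items per user.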

## Get started

DGL can be used with different deep learning frameworks; currently, it supports PyTorch and MXNet. Here, we show how DGL works with PyTorch.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

When we load DGL, we need to set its backend to one of the supported deep learning frameworks. Because this tutorial develops models in PyTorch, we set the DGL backend to PyTorch.

In [2]:
import dgl
from dgl import DGLGraph

# Load Pytorch as backend
dgl.load_backend('pytorch')

Load the rest of the necessary libraries.

In [3]:
import numpy as np

## GNN model

A GNN is typically used to compute meaningful node embeddings. With these embeddings, we can perform many downstream tasks.

DGL provides two ways of implementing a GNN model:
* using the [nn module](https://doc.dgl.ai/features/nn.html), which contains many commonly used GNN modules.
* using the message passing interface to implement a GNN model from scratch.

For simplicity, we implement the GNN model in the tutorial with the nn module.

In this tutorial, we use [GraphSage](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf), one of the first inductive GNN models. GraphSage performs the following computation on every node $v$ in the graph:

$$h_{N(v)}^{(l)} \gets \mathrm{AGGREGATE}_l(\{h_u^{(l-1)}, \forall u \in N(v)\})$$
$$h_v^{(l)} \gets \sigma(W^{(l)} \cdot \mathrm{CONCAT}(h_v^{(l-1)}, h_{N(v)}^{(l)})),$$

where $N(v)$ is the neighborhood of node $v$, $l$ is the layer index, $W^{(l)}$ is the weight matrix of layer $l$, and $\sigma$ is a nonlinear activation function.

The GraphSage model has multiple layers. In each layer, a node accesses its direct neighbors. When we stack $k$ layers in a model, a node $v$ accesses neighbors within $k$ hops. The output of the GraphSage model is a set of node embeddings, each of which represents a node together with the information in its $k$-hop neighborhood.
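
The two update equations can be traced by hand on a small graph. The following is a minimal pure-Python sketch of one layer, assuming a mean aggregator, ReLU as $\sigma$, and that every node has at least one neighbor. The helper names (`mean_aggregate`, `sage_layer`), the toy graph, and the weight values are all illustrative assumptions; this is not how DGL's `SAGEConv` is implemented.

```python
def mean_aggregate(h, neighbors):
    """AGGREGATE step: element-wise mean of the neighbors' embeddings."""
    dim = len(h[neighbors[0]])
    return [sum(h[u][d] for u in neighbors) / len(neighbors) for d in range(dim)]

def sage_layer(h, adj, weight):
    """One layer: h_v <- relu(W . concat(h_v, h_N(v))) for every node v."""
    out = {}
    for v, neighbors in adj.items():
        h_neigh = mean_aggregate(h, neighbors)
        concat = h[v] + h_neigh  # list concatenation plays the role of CONCAT
        out[v] = [max(0.0, sum(w * x for w, x in zip(row, concat))) for row in weight]
    return out

# Toy graph (assumed for illustration): node 0 is connected to nodes 1 and 2.
adj = {0: [1, 2], 1: [0], 2: [0]}
h0 = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [4.0, 0.0]}

# Weight matrix with 2 output dims and 4 input dims (two concatenated 2-d vectors).
W = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, -1.0]]

h1 = sage_layer(h0, adj, W)
print(h1)
```

Applying `sage_layer` a second time to `h1` would let nodes 1 and 2 see each other's features through node 0, which is the 2-hop behavior described above.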

<img src="https://github.com/zheng-da/DGL_devday_tutorial/raw/master/GNN.png" alt="drawing" width="600"/>

We use DGL's `nn` module to build the GraphSage model. `SAGEConv` implements the computation of a single GraphSage layer.

In [4]:
from dgl.nn.pytorch import conv as dgl_conv

class GraphSAGEModel(nn.Module):
    def __init__(self,
                 in_feats,
                 n_hidden,
                 out_dim,
                 n_layers,
                 activation,
                 dropout,
                 aggregator_type):
        super(GraphSAGEModel, self).__init__()
        self.layers = nn.ModuleList()

        # input layer
        self.layers.append(dgl_conv.SAGEConv(in_feats, n_hidden, aggregator_type,
                                         feat_drop=dropout, activation=activation))
        # hidden layers
        for i in range(n_layers - 1):
            self.layers.append(dgl_conv.SAGEConv(n_hidden, n_hidden, aggregator_type,
                                             feat_drop=dropout, activation=activation))
        # output layer
        self.layers.append(dgl_conv.SAGEConv(n_hidden, out_dim, aggregator_type,
                                         feat_drop=dropout, activation=None))

    def forward(self, g, features):
        h = features
        for layer in self.layers:
            h = layer(g, h)
        return h

Interested readers can check out our [online tutorials](https://doc.dgl.ai/tutorials/models/index.html) to see how to use DGL's message passing interface to implement GNN models.
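
It can help to check how many layers the `GraphSAGEModel` constructor actually stacks and what their dimensions are. The helper below, `sage_layer_dims`, is a hypothetical function written for this illustration; it mirrors the three `append` steps of the constructor without importing DGL. The argument values are the ones used later in this tutorial (500 input features, 64 hidden units, 3 classes, `n_layers = 2`).

```python
def sage_layer_dims(in_feats, n_hidden, out_dim, n_layers):
    """Return (input_dim, output_dim) for each SAGEConv the constructor appends."""
    dims = [(in_feats, n_hidden)]                    # input layer
    dims += [(n_hidden, n_hidden)] * (n_layers - 1)  # hidden layers
    dims.append((n_hidden, out_dim))                 # output layer
    return dims

dims = sage_layer_dims(in_feats=500, n_hidden=64, out_dim=3, n_layers=2)
for i, (d_in, d_out) in enumerate(dims):
    print("layer", i, ":", d_in, "->", d_out)
```

With these settings the model contains `n_layers + 1` convolution layers, so each output embedding depends on a 3-hop neighborhood.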

## Prepare the dataset for the tutorial

DGL has a large collection of built-in datasets. Please see [this doc](https://doc.dgl.ai/api/python/data.html) for more information.

In this tutorial, we use a citation network called pubmed for demonstration. A node in the citation network is a paper and an edge represents the citation between two papers. This dataset has 19,717 papers and 88,651 citations. Each paper has a sparse bag-of-words feature vector and a class label.

Graph data other than the structure, such as node features and labels, are stored as NumPy arrays. After loading them, we convert them to PyTorch tensors.

In [5]:
from dgl.data import citegrh

# load and preprocess the pubmed dataset
data = citegrh.load_pubmed()

# sparse bag-of-words features of papers
features = torch.FloatTensor(data.features)
# the number of input node features
in_feats = features.shape[1]
# class labels of papers
labels = torch.LongTensor(data.labels)
# the number of unique classes on the nodes.
n_classes = data.num_labels

Finished data loading and preprocessing.
  NumNodes: 19717
  NumEdges: 88651
  NumFeats: 500
  NumClasses: 3
  NumTrainingSamples: 60
  NumValidationSamples: 500
  NumTestSamples: 1000


In [6]:
print(n_classes)
print(data.labels.shape)
print(features[0])

3
(19717,)
tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0031, 0.0000,
        0.0000, 0.0000, 0.0302, 0.0000, 0.0082, 0.0108, 0.0000, 0.0110, 0.0000,
        0.0064, 0.0164, 0.0000, 0.0000, 0.0080, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0131, 0.0000, 0.0000, 0.0097, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0136, 0.0107, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0122, 0.0000, 0.0179, 0.0000, 0.0218,
        0.0000, 0.0000, 0.0000, 0.0099, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0387, 0.0000, 0.0144,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0098,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0300, 0.0000,
        0.0000, 0.0000, 0.000

For small datasets, DGL stores the network structure in a [NetworkX](https://networkx.github.io) object. NetworkX is a very popular Python graph library. It provides a comprehensive API for graph manipulation and is very useful for preprocessing small graphs.

We then create a DGLGraph from the graph dataset and convert it to a read-only DGLGraph, which supports more efficient computation. Currently, DGL's sampling API only works on read-only DGLGraphs.

In [7]:
g = DGLGraph(data.graph)
g.readonly()

## Node classification in the semi-supervised setting

Let us perform node classification in a semi-supervised setting. In this setting, we have the entire graph structure and all node features, but labels for only some of the nodes, and we want to predict the labels of the remaining nodes. Even though some nodes have no labels, they are connected to labeled nodes, so their features still contribute to the embeddings of labeled nodes through neighborhood aggregation. Thus, we train the model with both labeled and unlabeled nodes, which usually improves performance compared with using the labeled nodes alone.
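
A quick calculation shows how few labels are available in this setting. The sketch below uses only the dataset statistics printed when pubmed was loaded (19,717 nodes, 60 training, 500 validation, and 1,000 test samples); the variable names are chosen for illustration.

```python
# Dataset statistics as printed by the pubmed loader above.
num_nodes = 19717
num_train = 60
num_val = 500
num_test = 1000

# Fraction of nodes whose labels are used in the training loss.
labeled_fraction = num_train / num_nodes
# Nodes that take part in message passing without contributing a label to the loss.
nodes_without_train_label = num_nodes - num_train

print("training labels: {:.2%} of all nodes".format(labeled_fraction))
print("nodes without a training label:", nodes_without_train_label)
```

Roughly 0.3% of the nodes carry a training label, so nearly all of the feature information the model sees comes from nodes that are unlabeled during training.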

<img src="https://github.com/zheng-da/DGL_devday_tutorial/raw/master/node_classify1.png" alt="drawing" width="200"/>

This dependency graph gives a clearer view of how labeled and unlabeled nodes are used in training.
<img src="https://github.com/zheng-da/DGL_devday_tutorial/raw/master/node_classify2.png" alt="drawing" width="800"/>

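First, we create a GraphSage model with `n_layers = 2`. Note that, as `GraphSAGEModel` is defined above, this yields three `SAGEConv` layers: an input layer, one hidden layer, and an output layer.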

In [8]:
# Hyperparameters
n_hidden = 64
n_layers = 2
dropout = 0.5
aggregator_type = 'gcn'

gconv_model = GraphSAGEModel(in_feats,
                             n_hidden,
                             n_classes,
                             n_layers,
                             F.relu,
                             dropout,
                             aggregator_type)

Now we create the node classification model based on the GraphSage model. The GraphSage model takes a DGLGraph object and node features as input and computes node embeddings as output. With node embeddings, we use a cross entropy loss to train the node classification model.
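
The loss is computed only on the labeled training nodes, even though the model produces logits for every node. The following pure-Python sketch of softmax cross entropy makes this masking explicit. The function name `cross_entropy` and the toy logits are assumptions for illustration; the tutorial itself relies on `torch.nn.CrossEntropyLoss`.

```python
import math

def cross_entropy(logits, labels, idx):
    """Mean softmax cross entropy over the nodes listed in idx."""
    total = 0.0
    for i in idx:
        row = logits[i]
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[labels[i]]
    return total / len(idx)

# Toy logits for 3 nodes and 3 classes (made-up values).
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
labels = [0, 1, 1]
train_idx = [1, 2]  # node 0 is treated as unlabeled and is excluded from the loss

loss = cross_entropy(logits, labels, train_idx)
print("loss on training nodes: {:.4f}".format(loss))
```

As a reference point, a model that assigns equal logits to all 3 classes has a loss of $\ln 3 \approx 1.0986$, which is a useful baseline when reading the loss values of the first training epochs.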

In [9]:
class NodeClassification(nn.Module):
    def __init__(self, gconv_model, n_hidden, n_classes):
        super(NodeClassification, self).__init__()
        self.gconv_model = gconv_model
        self.loss_fcn = torch.nn.CrossEntropyLoss()

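    def forward(self, g, features, train_idx):
        # compute per-node class logits with the GNN
        logits = self.gconv_model(g, features)
        # NOTE: `labels` is the global tensor loaded with the dataset above;
        # the loss is computed only on the training nodes
        return self.loss_fcn(logits[train_idx], labels[train_idx])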

After defining a model for node classification, we need to define an evaluation function to evaluate the performance of a trained model.

In [10]:
def NCEvaluate(model, g, features, labels, test_idx):
    model.eval()
    with torch.no_grad():
        # compute embeddings with GNN
        logits = model.gconv_model(g, features)
        logits = logits[test_idx]
        test_labels = labels[test_idx]
        _, indices = torch.max(logits, dim=1)
        correct = torch.sum(indices == test_labels)
        return correct.item() * 1.0 / len(test_labels)
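
The metric computed by `NCEvaluate` is plain accuracy: the predicted class of a node is the index of its largest logit, and accuracy is the fraction of evaluated nodes whose prediction matches the label. Below is a small pure-Python sketch of the same computation on made-up logits; the function name `accuracy` and the data are illustrative assumptions.

```python
def accuracy(logits, labels, idx):
    """Fraction of nodes in idx whose arg-max logit equals the true label."""
    correct = 0
    for i in idx:
        row = logits[i]
        pred = max(range(len(row)), key=lambda c: row[c])  # arg-max over classes
        correct += int(pred == labels[i])
    return correct / len(idx)

# Toy logits for 4 nodes and 3 classes (made-up values).
logits = [[2.0, 0.1, 0.0],
          [0.3, 0.2, 1.5],
          [0.0, 3.0, 0.0],
          [1.0, 0.5, 0.2]]
labels = [0, 1, 1, 0]
test_idx = [1, 2, 3]  # evaluate only on these nodes

acc = accuracy(logits, labels, test_idx)
print("accuracy: {:.4f}".format(acc))
```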

Next, we prepare the data for semi-supervised node classification.

In [11]:
# the dataset is split into training set, validation set and testing set.
train_idx = np.where(data.train_mask > 0)[0]
val_idx = np.where(data.val_mask > 0)[0]
test_idx = np.where(data.test_mask > 0)[0]

print("""----Data statistics------
      #Classes %d
      #Train samples %d
      #Val samples %d
      #Test samples %d""" %
          (n_classes,
           train_idx.shape[0],
           val_idx.shape[0],
           test_idx.shape[0]))

----Data statistics------
      #Classes 3
      #Train samples 60
      #Val samples 500
      #Test samples 1000
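
The split is stored as one mask per subset, and `np.where(mask > 0)[0]` turns each mask into the list of node IDs it selects. The sketch below reproduces that conversion in plain Python on a hypothetical 6-node graph and checks that the three subsets do not overlap; the masks and the helper name `mask_to_idx` are assumptions for illustration.

```python
def mask_to_idx(mask):
    """Return the positions where the mask is set, like np.where(mask > 0)[0]."""
    return [i for i, m in enumerate(mask) if m > 0]

# Hypothetical masks for a 6-node graph; node 5 belongs to no subset.
train_mask = [1, 0, 0, 1, 0, 0]
val_mask = [0, 1, 0, 0, 0, 0]
test_mask = [0, 0, 1, 0, 1, 0]

train_idx = mask_to_idx(train_mask)
val_idx = mask_to_idx(val_mask)
test_idx = mask_to_idx(test_mask)

# A node should appear in at most one subset.
all_idx = train_idx + val_idx + test_idx
disjoint = len(all_idx) == len(set(all_idx))
print(train_idx, val_idx, test_idx, disjoint)
```

Nodes that appear in none of the masks, like node 5 here, still take part in neighborhood aggregation; they simply never contribute to the loss or to the reported accuracy.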


After defining the model and evaluation function, we can put everything into the training loop to train the model.

In [12]:
# Node classification task
model = NodeClassification(gconv_model, n_hidden, n_classes)

# Training hyperparameters
weight_decay = 5e-4
n_epochs = 150
lr = 1e-3

# create the Adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)

for epoch in range(n_epochs):
    # Set the model in the training mode.
    model.train()
    # forward
    loss = model(g, features, train_idx)
    
    # backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # validation
    acc = NCEvaluate(model, g, features, labels, val_idx)
    print("Epoch {:05d} | Loss {:.4f} | Accuracy {:.4f}"
          .format(epoch, loss.item(), acc))

print()
acc = NCEvaluate(model, g, features, labels, test_idx)
print("Test Accuracy {:.4f}".format(acc))

Epoch 00000 | Loss 1.1155 | Accuracy 0.1960
Epoch 00001 | Loss 1.0978 | Accuracy 0.1960
Epoch 00002 | Loss 1.0900 | Accuracy 0.1960
Epoch 00003 | Loss 1.0897 | Accuracy 0.1960
Epoch 00004 | Loss 1.0964 | Accuracy 0.1960
Epoch 00005 | Loss 1.0907 | Accuracy 0.2240
Epoch 00006 | Loss 1.0923 | Accuracy 0.3820
Epoch 00007 | Loss 1.0728 | Accuracy 0.4540
Epoch 00008 | Loss 1.0824 | Accuracy 0.5060
Epoch 00009 | Loss 1.0843 | Accuracy 0.5260
Epoch 00010 | Loss 1.0918 | Accuracy 0.5400
Epoch 00011 | Loss 1.0901 | Accuracy 0.5560
Epoch 00012 | Loss 1.0798 | Accuracy 0.5660
Epoch 00013 | Loss 1.0862 | Accuracy 0.6180
Epoch 00014 | Loss 1.0860 | Accuracy 0.6480
Epoch 00015 | Loss 1.0757 | Accuracy 0.6600
Epoch 00016 | Loss 1.0601 | Accuracy 0.6700
Epoch 00017 | Loss 1.0654 | Accuracy 0.6800
Epoch 00018 | Loss 1.0639 | Accuracy 0.6880
Epoch 00019 | Loss 1.0658 | Accuracy 0.6940
Epoch 00020 | Loss 1.0638 | Accuracy 0.6980
Epoch 00021 | Loss 1.0605 | Accuracy 0.6980
Epoch 00022 | Loss 1.0426 | Accu
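
The per-epoch log lines have a fixed format, so they are easy to post-process, for example to find the epoch with the highest validation accuracy. The sketch below parses four lines copied from the output above with a regular expression; the variable names are illustrative, and selecting a model by best validation accuracy is an assumption here, not something the training loop above does.

```python
import re

# Four lines copied verbatim from the training output above.
log = """Epoch 00000 | Loss 1.1155 | Accuracy 0.1960
Epoch 00005 | Loss 1.0907 | Accuracy 0.2240
Epoch 00010 | Loss 1.0918 | Accuracy 0.5400
Epoch 00020 | Loss 1.0638 | Accuracy 0.6980"""

pattern = re.compile(r"Epoch (\d+) \| Loss ([\d.]+) \| Accuracy ([\d.]+)")
# Each record is (epoch, training loss, validation accuracy).
records = [(int(e), float(l), float(a)) for e, l, a in pattern.findall(log)]

best_epoch, _, best_acc = max(records, key=lambda r: r[2])
print("best validation accuracy {:.4f} at epoch {}".format(best_acc, best_epoch))
```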

## Take home exercise

Interested users can try other GNN models to compute node embeddings and use them for node classification. Please check out the [nn module](https://doc.dgl.ai/features/nn.html) in DGL.