<h1 style="color:rgb(0,120,170)">Neural Networks and Deep Learning</h1>
<h2 style="color:rgb(0,120,170)">Graph Convolutional Networks</h2>


! pip install -U stellargraph

In [1]:
import os

import matplotlib.pyplot as plt
import pandas as pd

import stellargraph as sg
#from stellargraph.mapper import FullBatchNodeGenerator
#from stellargraph.layer import GCN

import tensorflow as tf
#from tensorflow.keras import layers, optimizers, losses, metrics, Model
from sklearn import preprocessing, model_selection

from IPython.display import display, HTML

%matplotlib inline

In [2]:
tf.__version__

'2.3.1'

## Data Preparation  

### Loading the CORA network

We can retrieve a StellarGraph graph object holding the Cora dataset using the Cora loader from the datasets submodule. The loader also provides the ground-truth subject class of each node. It is implemented using Pandas; see the “Loading data into StellarGraph from Pandas” notebook for details on how data can be loaded.

(Note: Cora is a citation network, which is a directed graph, but, like most users of this graph, we ignore the edge direction and treat it as undirected.)

In [3]:
dataset = sg.datasets.Cora()
display(HTML(dataset.description))
G, node_subjects = dataset.load()

In [4]:
print(G.info())

StellarGraph: Undirected multigraph
 Nodes: 2708, Edges: 5429

 Node types:
  paper: [2708]
    Features: float32 vector, length 1433
    Edge types: paper-cites->paper

 Edge types:
    paper-cites->paper: [5429]
        Weights: all 1 (default)
        Features: none


We aim to train a graph-ML model that will predict the “subject” attribute of the nodes. Each subject is one of 7 categories, and some categories are more common than others:

In [5]:
node_subjects.value_counts().to_frame()

| | subject |
|---|---|
| Neural_Networks | 818 |
| Probabilistic_Methods | 426 |
| Genetic_Algorithms | 418 |
| Theory | 351 |
| Case_Based | 298 |
| Reinforcement_Learning | 217 |
| Rule_Learning | 180 |


### Splitting the data

For machine learning we want to take a subset of the nodes for training, and use the rest for validation and testing. We’ll use scikit-learn’s train_test_split function to do this.

Here we’re taking 140 node labels for training, 500 for validation, and the rest for testing.

In [6]:
train_subjects, test_subjects = model_selection.train_test_split(node_subjects, train_size=140, test_size=None, stratify=node_subjects)
val_subjects, test_subjects = model_selection.train_test_split(test_subjects, train_size=500, test_size=None, stratify=test_subjects)

Note that stratified sampling gives the following class counts in the training set:

In [7]:
train_subjects.value_counts().to_frame()

| | subject |
|---|---|
| Neural_Networks | 42 |
| Probabilistic_Methods | 22 |
| Genetic_Algorithms | 22 |
| Theory | 18 |
| Case_Based | 16 |
| Reinforcement_Learning | 11 |
| Rule_Learning | 9 |
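To see where these training counts come from, here is a minimal sketch of proportional allocation with largest remainders, applied to the full-dataset class counts listed earlier. The helper `stratified_allocation` is hypothetical and written only for illustration. It is not scikit-learn's implementation, although on these numbers it arrives at the same per-class counts.

```python
from math import floor

# Class counts for the full Cora dataset, copied from the earlier value_counts output.
class_counts = {
    "Neural_Networks": 818,
    "Probabilistic_Methods": 426,
    "Genetic_Algorithms": 418,
    "Theory": 351,
    "Case_Based": 298,
    "Reinforcement_Learning": 217,
    "Rule_Learning": 180,
}


def stratified_allocation(counts, n_draw):
    """Hypothetical helper: split n_draw samples across classes proportionally.

    Each class first gets the floor of its proportional share; the samples
    still unassigned go to the classes with the largest fractional remainders.
    """
    total = sum(counts.values())
    shares = {label: n_draw * count / total for label, count in counts.items()}
    allocation = {label: floor(share) for label, share in shares.items()}
    leftover = n_draw - sum(allocation.values())
    by_remainder = sorted(
        shares, key=lambda label: shares[label] - allocation[label], reverse=True
    )
    for label in by_remainder[:leftover]:
        allocation[label] += 1
    return allocation


train_allocation = stratified_allocation(class_counts, 140)
for label, n in train_allocation.items():
    print(f"{label:<24} {n}")
```

Which nodes end up in each class's share is still random; stratification only fixes how many nodes each class contributes.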


The training set has a class imbalance that might need to be compensated for, e.g., by using a weighted cross-entropy loss in model training, with class weights inversely proportional to class support. However, for simplicity, we will ignore the class imbalance in this example.
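Although we do not use class weights in this notebook, the idea of weights inversely proportional to class support can be sketched as follows. The helper `inverse_frequency_weights` and its scaling convention, n_samples / (n_classes * class_count), are assumptions made for illustration; other scalings are equally valid.

```python
# Training-set class counts, copied from the stratified split shown above.
train_counts = {
    "Neural_Networks": 42,
    "Probabilistic_Methods": 22,
    "Genetic_Algorithms": 22,
    "Theory": 18,
    "Case_Based": 16,
    "Reinforcement_Learning": 11,
    "Rule_Learning": 9,
}


def inverse_frequency_weights(counts):
    """Hypothetical helper: weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


class_weights = inverse_frequency_weights(train_counts)
for label, weight in class_weights.items():
    print(f"{label:<24} {weight:.3f}")
```

With this scaling the rarest class (Rule_Learning) receives the largest weight, and the weights average to 1 over the training samples, so the overall scale of the loss is roughly preserved.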

### Converting to numeric arrays

For our categorical target, we will use one-hot vectors that will be compared against the model’s soft-max output. To do this conversion we can use the LabelBinarizer transform from scikit-learn. Another option would be the pandas.get_dummies function, but the scikit-learn transform makes it easy to apply the inverse transform later in the notebook, when we interpret the predictions.

In [8]:
target_encoding = preprocessing.LabelBinarizer()

train_targets = target_encoding.fit_transform(train_subjects)
val_targets = target_encoding.transform(val_subjects)
test_targets = target_encoding.transform(test_subjects)
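To make the encoding and its inverse concrete, here is a minimal pure-Python sketch of what a one-hot transform does on a toy list of subjects. The helpers `one_hot` and `inverse_one_hot` are hypothetical stand-ins written for illustration, not scikit-learn code.

```python
# Toy labels; the real notebook encodes the Cora subjects.
subjects = ["Theory", "Neural_Networks", "Theory", "Case_Based"]

# The sorted list of distinct labels plays the role of the fitted class list.
classes = sorted(set(subjects))


def one_hot(label, classes):
    """Hypothetical helper: return a 0/1 vector with a single 1 at the label's position."""
    return [1 if label == c else 0 for c in classes]


def inverse_one_hot(vector, classes):
    """Hypothetical helper: map a vector back to the class at its largest entry."""
    best_index = max(range(len(vector)), key=vector.__getitem__)
    return classes[best_index]


encoded = [one_hot(s, classes) for s in subjects]
decoded = [inverse_one_hot(v, classes) for v in encoded]

print(classes)
print(encoded)
print(decoded)
```

Because the inverse picks the position of the largest entry, it also works on soft-max outputs such as `[0.1, 0.7, 0.2]`, which is how predictions can be mapped back to subject names.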

The CORA dataset contains attributes w_x that correspond to words found in each publication. If a word occurs at least once in a publication, the relevant attribute is set to one; otherwise it is zero. These numeric attributes have been automatically included in the StellarGraph instance G, so we do not have to do any further conversion.

Each paper is analysed to see whether it contains each of 1433 words.

![Cora](../../../images/Cora-features.png)
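As an illustration of this kind of binary word-presence feature, here is a minimal sketch with a toy five-word vocabulary. The vocabulary, the example text, and the helper `binary_bag_of_words` are all made up for illustration; the preprocessing actually used to build the Cora features is not described here.

```python
# Toy vocabulary; Cora uses 1433 words.
vocabulary = ["graph", "neural", "network", "genetic", "bayesian"]


def binary_bag_of_words(text, vocabulary):
    """Hypothetical helper: 1 if the vocabulary word appears in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


features = binary_bag_of_words(
    "Neural network models on graph data and graph kernels", vocabulary
)
print(features)
```

The word "graph" occurs twice in the toy text, but its attribute is still just 1: the features record presence, not frequency.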

## Creating the GCN layers

A machine learning model in StellarGraph consists of a pair of items:

+ the layers themselves, such as graph convolution, dropout and even conventional dense layers
+ a data generator to convert the core graph structure and node features into a format that can be fed into the Keras model for training or prediction

GCN is a full-batch model and we’re doing node classification here, which means the FullBatchNodeGenerator class is the appropriate generator for our task. StellarGraph provides many generators to support its various models and tasks.

Specifying the method='gcn' argument to the FullBatchNodeGenerator means it will yield data appropriate for the GCN algorithm specifically, by using the normalized graph Laplacian matrix to capture the graph structure.

In [11]:
generator = sg.mapper.FullBatchNodeGenerator(G, method="gcn")

Using GCN (local pooling) filters...
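For intuition about the graph normalization involved, the form usually associated with GCN is the symmetrically normalized adjacency matrix with self-loops, D^(-1/2) (A + I) D^(-1/2). Below is a minimal pure-Python sketch on a three-node path graph. That the generator computes exactly this internally is an assumption here; the helper `normalize_with_self_loops` is hypothetical and is not StellarGraph code.

```python
from math import sqrt

# Toy undirected path graph 0 - 1 - 2 as a dense adjacency matrix.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]


def normalize_with_self_loops(adj):
    """Hypothetical helper: return D^(-1/2) (A + I) D^(-1/2) as a nested list."""
    n = len(adj)
    # Add self-loops so each node keeps its own features when aggregating.
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    # Degree of each node in the graph with self-loops.
    degrees = [sum(row) for row in a_hat]
    # Scale entry (i, j) by 1 / sqrt(d_i * d_j).
    return [
        [a_hat[i][j] / sqrt(degrees[i] * degrees[j]) for j in range(n)]
        for i in range(n)
    ]


normalized = normalize_with_self_loops(adjacency)
for row in normalized:
    print([round(v, 3) for v in row])
```

The result stays symmetric, and entries between unconnected nodes (here 0 and 2) remain zero, so the matrix still encodes which nodes are neighbours.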


A generator just encodes the information required to produce the model inputs.  
Calling the flow method with a set of nodes and their true labels produces an object that can be used to train the model on the specified nodes and labels.  
We created a training set above, so that’s what we’re going to use here.

In [12]:
train_gen = generator.flow(train_subjects.index, train_targets)

Now we can specify our machine learning model by building a stack of layers. We can use StellarGraph’s GCN class, which packages up the creation of this stack of graph convolution and dropout layers. We can specify a few parameters to control this:

+ layer_sizes: the number of hidden GCN layers and their sizes. In this case, two GCN layers with 16 units each.
+ activations: the activation to apply to each GCN layer’s output. In this case, ReLU for both layers.
+ dropout: the rate of dropout for the input of each GCN layer. In this case, 50%.


In [14]:
gcn = sg.layer.GCN(layer_sizes=[16, 16], activations=["relu", "relu"], generator=generator, dropout=0.5)
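To give a feel for what one graph convolution layer computes, here is a minimal pure-Python sketch of the usual formulation, relu(Â X W), where Â is a normalized adjacency matrix, X the node features and W the layer weights. All values below are toy numbers, dropout is omitted, and the helpers are hypothetical; this is not the library's layer implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def relu(matrix):
    """Apply max(0, x) elementwise."""
    return [[max(0.0, v) for v in row] for row in matrix]


def gcn_layer(a_norm, features, weights):
    """Hypothetical helper: one graph convolution, relu(A_norm @ X @ W)."""
    return relu(matmul(matmul(a_norm, features), weights))


# Two connected nodes with self-loops, already normalized (toy values).
a_norm = [[0.5, 0.5], [0.5, 0.5]]
# One feature vector per node (2 nodes, 2 features).
features = [[1.0, 0.0], [0.0, 1.0]]
# Made-up weights mapping 2 input features to 2 hidden units.
weights = [[1.0, -1.0], [2.0, 0.0]]

hidden = gcn_layer(a_norm, features, weights)
print(hidden)
```

Each node's output mixes its own features with its neighbour's before the weights are applied, which is why the two connected nodes end up with identical hidden vectors in this toy case.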

To create a Keras model we now expose the input and output tensors of the GCN model for node prediction, via the GCN.in_out_tensors method:

In [15]:
x_inp, x_out = gcn.in_out_tensors()

x_out

<tf.Tensor 'gather_indices/GatherV2:0' shape=(1, None, 16) dtype=float32>

The x_out value is a TensorFlow tensor that holds a 16-dimensional vector for each of the nodes requested when training or predicting. The actual prediction of each node’s class/subject needs to be computed from this vector. StellarGraph is built on Keras, so this can be done with standard Keras functionality: an additional dense layer (with one unit per class) using a softmax activation. This activation function ensures that the final output for each input node is a vector of “probabilities”, where every value is between 0 and 1 and the whole vector sums to 1. The predicted class is the element with the highest value.

In [16]:
predictions = tf.keras.layers.Dense(units=train_targets.shape[1], activation="softmax")(x_out)
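The softmax step can be sketched in a few lines of plain Python. The input scores below are arbitrary toy values standing in for the dense layer's raw outputs for one node, and the helper `softmax` is written here only for illustration.

```python
from math import exp


def softmax(logits):
    """Turn raw scores into values in (0, 1) that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy raw scores for a node over three classes.
probabilities = softmax([2.0, 1.0, 0.1])

# The predicted class is the index of the largest probability.
predicted_index = max(range(len(probabilities)), key=probabilities.__getitem__)

print([round(p, 3) for p in probabilities])
print(predicted_index)
```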

## Training and evaluating

### Training the model

Now let’s create the actual Keras model, with the input tensors x_inp and, as output, the predictions tensor from the final dense layer. Our task is a categorical prediction task, so a categorical cross-entropy loss function is appropriate. Even though we’re doing graph ML with StellarGraph, we’re still working with conventional Keras prediction values, so we can use the loss function from Keras directly.

In [18]:
model = tf.keras.Model(inputs=x_inp, outputs=predictions)
model.compile(
    # learning_rate is the current argument name; lr is a deprecated alias
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    loss=tf.keras.losses.categorical_crossentropy,
    metrics=["acc"],
)
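As a reminder of what the categorical cross-entropy measures, here is a simplified sketch for a single node with a one-hot target. The helper below is written for illustration only and leaves out details of the Keras implementation, such as batching and how predictions are clipped or averaged.

```python
from math import log


def categorical_crossentropy(target, predicted, eps=1e-7):
    """Simplified loss: minus the log-probability assigned to the true class."""
    return -sum(t * log(max(p, eps)) for t, p in zip(target, predicted))


# One-hot target: the true class is the second of three.
target = [0, 1, 0]

# Two toy soft-max outputs: one confident and correct, one unsure.
confident = [0.05, 0.90, 0.05]
unsure = [0.40, 0.30, 0.30]

loss_confident = categorical_crossentropy(target, confident)
loss_unsure = categorical_crossentropy(target, unsure)

print(round(loss_confident, 4), round(loss_unsure, 4))
```

The loss is small when the model puts most of its probability on the true class and grows as that probability shrinks.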

As we’re training the model, we’ll also want to keep track of its generalisation performance on the validation set, which means creating another data generator, using the FullBatchNodeGenerator we created above.

In [19]:
val_gen = generator.flow(val_subjects.index, val_targets)

We can directly use the EarlyStopping functionality offered by Keras to stop training if the validation accuracy stops improving.
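The idea behind stopping once validation accuracy no longer improves can be sketched with a simple patience counter. The helper `early_stopping_epoch` and the accuracy values are hypothetical; the Keras callback offers further options that are not modeled here.

```python
def early_stopping_epoch(val_accuracies, patience):
    """Hypothetical helper: return the 0-based epoch at which training would stop.

    Training stops once `patience` consecutive epochs pass without the
    validation accuracy exceeding its best value so far; returns None if
    that never happens.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(val_accuracies):
        if accuracy > best:
            best = accuracy
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation accuracies, one per epoch.
history = [0.60, 0.70, 0.75, 0.74, 0.75, 0.73, 0.72]
stop_epoch = early_stopping_epoch(history, patience=3)
print(stop_epoch)
```

A larger patience tolerates longer plateaus before giving up, at the cost of extra training epochs.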