# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Demo Notebook: Stellar Graph Library

## Learning Objectives

At the end of the experiment, you will be able to:

- understand the applications of the StellarGraph library
- understand the tools and functional API of the StellarGraph library

### Introduction

The StellarGraph library offers state-of-the-art algorithms for graph machine learning, making it easy to discover patterns and answer questions about graph-structured data. It can solve many machine learning tasks:

* Representation learning for nodes and edges, to be used for visualisation and various downstream machine learning tasks;

* Classification and attribute inference of nodes or edges;

* Classification of whole graphs;

* Link prediction;

* Interpretation of node classification

Graph-structured data represent entities as nodes (or vertices) and relationships between them as edges (or links), and can include data associated with either as attributes. StellarGraph supports analysis of many kinds of graphs:

* homogeneous (with nodes and links of one type),

* heterogeneous (with more than one type of nodes and/or links)

* knowledge graphs (extremely heterogeneous graphs with thousands of types of edges)

* graphs with or without data associated with nodes

* graphs with edge weights

StellarGraph is built on `TensorFlow` and its high-level `Keras` API, as well as `Pandas` and `NumPy`, which makes it user-friendly, modular and extensible. It interoperates smoothly with code that builds on these, such as the standard Keras layers and `scikit-learn`, so it is easy to augment the core graph machine learning algorithms provided by StellarGraph. It can be installed with `pip` or Anaconda.

One of the most exciting features of StellarGraph 1.0 is a new graph data structure, built using NumPy and Pandas, that results in significantly lower memory usage and faster construction times.

### Faster machine learning on larger graphs with NumPy and Pandas


The core abstraction in the library is the [StellarGraph class](https://stellargraph.readthedocs.io/en/stable/api.html#stellargraph.StellarGraph), which is a graph data structure that manages all the information about the graph or graphs being used for machine learning.

Previous versions of the StellarGraph class were backed by [NetworkX](https://networkx.org/), which allowed for quick and effective development of many graph machine learning algorithms because of its convenient and flexible API, built using nested dictionaries. However, this flexibility meant it wasn’t optimised for graph machine learning: NetworkX has different trade-offs than those best for machine learning, the most notable being the amount of memory required to store a graph.

So, over the releases leading up to 1.0, the NetworkX-backed graph data structure was replaced with a new one built using NumPy and Pandas.

#### How does it work?

There are three key parts to the new StellarGraph class:

- Efficient storage of edges
- Keeping node features available for quick indexing
- Support for arbitrary node IDs.

**Efficient storage of edges**

The new StellarGraph class stores most of its data using NumPy arrays.

![Image](https://www.kdnuggets.com/wp-content/uploads/stellargraph-numpy-pandas-2.png)

$\text{Figure 1: A NumPy array can consist of a single chunk of memory with values stored inline}$

*A Python list stores pointers to other Python objects, and each of these has extra metadata overhead, dramatically increasing the cost over NumPy*.
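
To make the memory difference concrete, the sketch below compares the approximate footprint of a Python list of integers with a NumPy `int64` array holding the same values. The size `n` and the variable names are illustrative, and `sys.getsizeof` is only a rough estimate on CPython, so treat the numbers as indicative rather than exact.

```python
import sys

import numpy as np

n = 100_000
values = list(range(n))                 # Python list: a table of pointers to int objects
array = np.arange(n, dtype=np.int64)    # NumPy array: one contiguous buffer of raw values

# Approximate cost of the list: the pointer table plus every int object it refers to.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Cost of the array data: exactly 8 bytes per int64 value.
array_bytes = array.nbytes

print(f"list : {list_bytes / n:.1f} bytes per value")
print(f"array: {array_bytes / n:.1f} bytes per value")
```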

The edges of the graph are conceptually pairs of a source node ID and a target node ID, representing the connection between the two nodes. In the new StellarGraph class, the edges are stored as NumPy arrays containing the sources and targets in a [“structure of arrays”](https://en.wikipedia.org/wiki/AoS_and_SoA) style.
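
As a rough illustration of the "structure of arrays" idea, the sketch below stores a small five-edge graph as two parallel NumPy arrays and finds the neighbours of one node with vectorised operations. The integer node IDs and the names `sources`, `targets` and `neighbours` are assumptions made for this example; it is not the library's internal implementation.

```python
import numpy as np

# One array per field ("structure of arrays") instead of one Python object per edge.
# Edge i connects sources[i] to targets[i].
sources = np.array([0, 1, 2, 3, 0], dtype=np.int64)
targets = np.array([1, 2, 3, 0, 2], dtype=np.int64)

# Select every edge that touches node 0, treating the graph as undirected.
node = 0
touching = (sources == node) | (targets == node)

# For each selected edge, take the endpoint that is not the query node.
neighbours = np.where(
    sources[touching] == node, targets[touching], sources[touching]
)
print(neighbours)
```

Because each field is a single contiguous array, a query like this is a handful of array operations rather than a Python-level loop over edge objects.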

**Keeping node features available for quick indexing**

NumPy arrays are also used for node features. StellarGraph is optimised for machine learning, which typically means working with vectors of “features”, or lists of numbers that encode information about each entity.
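
The sketch below shows one way such a feature store could be laid out: a single 2D array with one row per node, plus a lookup from arbitrary node IDs to row positions. The `id_to_row` dictionary is a hypothetical stand-in for whatever index the library uses internally; only the general idea of "row lookup, then array indexing" is intended.

```python
import numpy as np

# One row of features per node; node IDs can be arbitrary labels.
node_ids = ["a", "b", "c", "d"]
features = np.array([[-1, 0.4], [2, 0.1], [-3, 0.9], [4, 0.0]])

# Hypothetical lookup table from external node ID to row position.
id_to_row = {node_id: row for row, node_id in enumerate(node_ids)}

# Fetch the features for a batch of nodes with a single indexing operation.
batch = ["c", "a"]
rows = np.array([id_to_row[node_id] for node_id in batch])
batch_features = features[rows]
print(batch_features)
```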


### Getting Started

In [None]:
!pip install chardet

In [None]:
# Install StellarGraph
#!pip -q install stellargraph

In [None]:
!pip install git+https://github.com/VenkateshwaranB/stellargraph.git

In [None]:
# verify that we're using the correct version of StellarGraph for this notebook
import stellargraph as sg

try:
    sg.utils.validate_notebook_version("1.2.1")
except AttributeError:
    raise ValueError(
        f"This notebook requires StellarGraph version 1.2.1, but a different version {sg.__version__} is installed.  Please see <https://github.com/stellargraph/stellargraph/issues/1172>."
    ) from None

In [None]:
import numpy as np
import pandas as pd

### Core

To create a StellarGraph object, at a minimum pass the edges as a Pandas DataFrame. Each row of the edges DataFrame represents an edge, where the index is the ID of the edge, and the source and target columns store the node ID of the source and target nodes.

For example, suppose we’re modelling a graph that’s a square with a diagonal:

```
a -- b
| \  |
|  \ |
d -- c
```

The DataFrame might look like:



In [None]:
edges = pd.DataFrame(
    {"source": ["a", "b", "c", "d", "a"], "target": ["b", "c", "d", "a", "c"]}
)

If this data represents an undirected graph (the ordering of each edge source/target doesn’t matter):

In [None]:
Gs = sg.StellarGraph(edges=edges)

If this data represents a directed graph (the ordering does matter):

In [None]:
Gs = sg.StellarDiGraph(edges=edges)

One can also pass information about nodes, as either:

* an `IndexedArray`

* a NumPy array, if the node IDs are 0, 1, 2, …

* a Pandas DataFrame

Each row of the nodes frame (first dimension of the NumPy array) represents a node in the graph, where the index is the ID of the node.

In [None]:
nodes = sg.IndexedArray(index=["a", "b", "c", "d"])
Gs = sg.StellarGraph(nodes, edges)

Numeric node features are taken as any columns of the nodes DataFrame. For example, if the graph above has two features `x` and `y` associated with each node:

In [None]:
# As an IndexedArray (no column names):
feature_array = np.array([[-1, 0.4], [2, 0.1], [-3, 0.9], [4, 0]])
nodes = sg.IndexedArray(feature_array, index=["a", "b", "c", "d"])

# As a Pandas DataFrame:
nodes = pd.DataFrame(
    {"x": [-1, 2, -3, 4], "y": [0.4, 0.1, 0.9, 0]}, index=["a", "b", "c", "d"]
)

# As a NumPy array:
# Note, edges must change to using 0, 1, 2, 3 (instead of a, b, c, d)
nodes = feature_array

Construction directly from an `IndexedArray` or NumPy array will have the least overhead, but construction from Pandas allows for convenient data transformation.
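
When the nodes are given as a plain NumPy array, the edges have to refer to nodes by position (0, 1, 2, 3) rather than by label. A minimal sketch of that relabelling with Pandas is shown below; the names `position` and `int_edges` are illustrative and not part of StellarGraph.

```python
import pandas as pd

edges = pd.DataFrame(
    {"source": ["a", "b", "c", "d", "a"], "target": ["b", "c", "d", "a", "c"]}
)

# Row position of each node in the feature array: a -> 0, b -> 1, c -> 2, d -> 3.
node_ids = ["a", "b", "c", "d"]
position = {node_id: i for i, node_id in enumerate(node_ids)}

# Rewrite both endpoint columns in terms of integer positions.
int_edges = pd.DataFrame(
    {
        "source": edges["source"].map(position),
        "target": edges["target"].map(position),
    }
)
print(int_edges)
```

The relabelled frame would then be the one to pair with the NumPy feature array when constructing the graph.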

Edge weights are taken as the optional weight column of the edges DataFrame:

In [None]:
edges = pd.DataFrame({
    "source": ["a", "b", "c", "d", "a"],
    "target": ["b", "c", "d", "a", "c"],
    "weight": [10, 0.5, 1, 3, 13]
})

Numeric edge features are taken from any columns that do not have a special meaning (that is, excluding `source`, `target` and the optional `weight` or `edge_type_column` columns). For example, if the graph has weighted edges with two features `a` and `b` associated with each edge:

In [None]:
edges = pd.DataFrame({
    "source": ["a", "b", "c", "d", "a"],
    "target": ["b", "c", "d", "a", "c"],
    "weight": [10, 0.5, 1, 3, 13],
    "a": [-1, 2, -3, 4, -5],
    "b": [0.4, 0.1, 0.9, 0, 0.9],
})
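
To see which columns would end up as edge features under the rule described above, the sketch below separates the special columns from the rest. It mimics the rule in plain Pandas and is not the library's own code; the `special` set is an assumption based on the default column names used in this notebook.

```python
import pandas as pd

edges = pd.DataFrame({
    "source": ["a", "b", "c", "d", "a"],
    "target": ["b", "c", "d", "a", "c"],
    "weight": [10, 0.5, 1, 3, 13],
    "a": [-1, 2, -3, 4, -5],
    "b": [0.4, 0.1, 0.9, 0, 0.9],
})

# Columns with a special meaning; an edge type column would also belong here if present.
special = {"source", "target", "weight"}

# Everything else is treated as a numeric edge feature.
feature_columns = [col for col in edges.columns if col not in special]
edge_features = edges[feature_columns].to_numpy()
print(feature_columns, edge_features.shape)
```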

### Edge Splitter


```
class stellargraph.data.EdgeSplitter(g, g_master=None)
```

Class for generating training and test data for `link prediction` in graphs.

Parameters:
- `g (StellarGraph)` – The graph to sample edges from.

- `g_master (StellarGraph)` – The graph representing the original dataset and a superset of the graph g. If it is not None, then when positive and negative edges are sampled, care is taken to make sure that a true positive edge is not sampled as a negative edge.

```
train_test_split(p=0.5, method='global', probs=None, keep_connected=False, edge_label=None, edge_attribute_label=None, edge_attribute_threshold=None, attribute_is_datetime=None, seed=None)
```

Returns:

The reduced graph (positive edges removed) and the edge data as two NumPy arrays: the first, of shape N × 2 (where N is the number of edges in the edge data), holds the node IDs for the edges; the second, of shape N × 1, holds the edge labels, 0 for negative and 1 for positive examples. The returned graph has the same type as the input graph passed to the EdgeSplitter constructor: it is a `StellarGraph` instance if the input graph was one.
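
The idea behind this return value can be sketched in plain NumPy: hold out a fraction `p` of the existing edges as positive examples, sample unconnected node pairs as negative examples, and return the remaining edges together with the labelled pairs. This is a conceptual illustration under simplifying assumptions (a small undirected ring graph, uniform sampling, as many negatives as positives, and a flat label vector), not the algorithm `EdgeSplitter` actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# A six-node ring, nodes 0..5, stored as undirected (source, target) pairs.
num_nodes = 6
edges = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 0]])
p = 0.5  # fraction of edges to hold out as positive examples

# Positive examples: a random subset of existing edges, removed from the graph.
num_pos = int(round(p * len(edges)))
held_out = rng.choice(len(edges), size=num_pos, replace=False)
positives = edges[held_out]
reduced_edges = np.delete(edges, held_out, axis=0)  # edges of the "reduced graph"

# Negative examples: node pairs with no edge in the full graph.
existing = {frozenset(pair) for pair in edges.tolist()}
non_edges = np.array([
    (u, v)
    for u in range(num_nodes)
    for v in range(u + 1, num_nodes)
    if frozenset((u, v)) not in existing
])
negatives = non_edges[rng.choice(len(non_edges), size=num_pos, replace=False)]

# Edge data: node ID pairs plus a label, 1 for positive and 0 for negative.
edge_ids = np.vstack([positives, negatives])
edge_labels = np.concatenate(
    [np.ones(num_pos, dtype=int), np.zeros(num_pos, dtype=int)]
)
print(edge_ids.shape, edge_labels)
```

A link prediction model is then trained on the reduced graph to recover the labels of the held-out pairs.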

### Layers and models

The layer package contains implementations of popular neural network layers for graph ML as Keras layers.

#### GCN

```
class stellargraph.layer.GCN(layer_sizes, generator, bias=True, dropout=0.0, activations=None, kernel_initializer='glorot_uniform', kernel_regularizer=None, kernel_constraint=None, bias_initializer='zeros', bias_regularizer=None, bias_constraint=None, squeeze_output_batch=True)
```

A stack of Graph Convolutional layers that implement a graph convolution network model as in https://arxiv.org/abs/1609.02907

The model minimally requires specification of the layer sizes as a list of int corresponding to the feature dimensions for each hidden layer, activation functions for each hidden layer, and a generator object.

To use this class as a Keras model, the features and preprocessed adjacency matrix should be supplied using:

- the `FullBatchNodeGenerator` class for node inference

- the `ClusterNodeGenerator` class for scalable/inductive node inference using the Cluster-GCN training procedure (https://arxiv.org/abs/1905.07953)

- the `FullBatchLinkGenerator` class for link inference

To apply the appropriate preprocessing, the generator object should be instantiated with the `method='gcn'` argument.
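
For intuition about what this preprocessing involves, the sketch below applies the normalisation from the GCN paper linked above, $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$, to the square-with-a-diagonal graph and runs one graph convolution in NumPy. It assumes that `method='gcn'` corresponds to this normalisation; the random weight matrix `W` and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Adjacency matrix of the square with a diagonal (a, b, c, d -> rows 0..3).
A = np.array(
    [
        [0, 1, 1, 1],
        [1, 0, 1, 0],
        [1, 1, 0, 1],
        [1, 0, 1, 0],
    ],
    dtype=float,
)

# Add self loops, then normalise symmetrically by the degree of each node.
A_tilde = A + np.eye(4)
degree = A_tilde.sum(axis=1)
d_inv_sqrt = degree ** -0.5
A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

# One graph convolution: aggregate neighbour features, transform, apply ReLU.
X = np.array([[-1, 0.4], [2, 0.1], [-3, 0.9], [4, 0.0]])  # node features x, y
W = rng.normal(size=(2, 3))                               # illustrative weights
H = np.maximum(A_hat @ X @ W, 0.0)
print(H.shape)
```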


Example

Creating a GCN node classification model from an existing StellarGraph object G:
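```
from stellargraph.layer import GCN
from stellargraph.mapper import FullBatchNodeGenerator

# G is an existing StellarGraph object with node features
generator = FullBatchNodeGenerator(G, method="gcn")
gcn = GCN(
    layer_sizes=[32, 4],
    activations=["elu", "softmax"],
    generator=generator,
    dropout=0.5,
)
# Keras input and output tensors, ready to be wrapped in a Keras model
x_inp, predictions = gcn.in_out_tensors()
```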

Examples using GCN:

- node classification

- node classification trained with Cluster-GCN

- semi-supervised node classification

- link prediction

- interpreting GCN predictions: dense, sparse

- ensemble model for node classification

- comparison of link prediction algorithms

Appropriate data generators: `FullBatchNodeGenerator`, `FullBatchLinkGenerator`, `ClusterNodeGenerator`.

Related models:

- Other full-batch models: see the documentation of [FullBatchNodeGenerator](https://stellargraph.readthedocs.io/en/stable/api.html#stellargraph.mapper.FullBatchNodeGenerator) for a full list

- `GCNSupervisedGraphClassification` for graph classification by pooling the output of `GCN`

- `GCN_LSTM` for time-series and sequence prediction, incorporating the graph structure via `GCN`

Parameters:
- `layer_sizes (list of int)` – Output sizes of GCN layers in the stack.

- `generator (FullBatchNodeGenerator)` – The generator instance.

- `bias (bool)` – If True, a bias vector is learnt for each layer in the GCN model.

- `dropout (float)` – Dropout rate applied to input features of each GCN layer.

- `activations (list of str or func)` – Activations applied to each layer’s output; defaults to ['relu', ..., 'relu'].

- `kernel_initializer (str or func, optional)` – The initialiser to use for the weights of each layer.

- `kernel_regularizer (str or func, optional)` – The regulariser to use for the weights of each layer.

- `kernel_constraint (str or func, optional)` – The constraint to use for the weights of each layer.

- `bias_initializer (str or func, optional)` – The initialiser to use for the bias of each layer.

- `bias_regularizer (str or func, optional)` – The regulariser to use for the bias of each layer.

- `bias_constraint (str or func, optional)` – The constraint to use for the bias of each layer.

- `squeeze_output_batch (bool, optional)` – If True, remove the batch dimension when the batch size is 1. If False, leave the batch dimension.

#### GCN Supervised Graph Classification

```
class stellargraph.layer.GCNSupervisedGraphClassification(layer_sizes, activations, generator, bias=True, dropout=0.0, pooling=None, pool_all_layers=False, kernel_initializer=None, kernel_regularizer=None, kernel_constraint=None, bias_initializer=None, bias_regularizer=None, bias_constraint=None)
```

A stack of GraphConvolution layers together with a Keras GlobalAveragePooling1D layer (by default) that implement a supervised graph classification network using the GCN convolution operator (https://arxiv.org/abs/1609.02907).

The model minimally requires specification of the GCN layer sizes as a list of int corresponding to the feature dimensions for each hidden layer, activation functions for each hidden layer, and a generator object.
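
The pooling step is what turns per-node outputs into one vector per graph. A minimal NumPy sketch of average pooling over a padded batch is shown below; the toy embeddings, the `mask` array and the assumption that padded rows are excluded from the average are illustrative, not a description of the Keras layer's internals.

```python
import numpy as np

# Node embeddings for a batch of 2 graphs, padded to 3 nodes each, embedding size 2.
embeddings = np.array(
    [
        [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],  # graph 0: 2 real nodes + 1 padding row
        [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],  # graph 1: 3 real nodes
    ]
)
# True where a row is a real node, False where it is padding.
mask = np.array([[True, True, False], [True, True, True]])

# Sum the real node embeddings of each graph and divide by the real node count.
summed = (embeddings * mask[:, :, None]).sum(axis=1)
counts = mask.sum(axis=1, keepdims=True)
graph_vectors = summed / counts
print(graph_vectors)
```

Each row of `graph_vectors` is a fixed-size summary of one graph, which is what the dense classification layers consume.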

Examples

Creating a graph classification model from a list of StellarGraph objects (`graphs`). We also add two fully connected dense layers, using the last one for binary classification with softmax activation:

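```
from tensorflow.keras.layers import Dense
from stellargraph.layer import GCNSupervisedGraphClassification
from stellargraph.mapper import PaddedGraphGenerator

# graphs is an existing list of StellarGraph objects
generator = PaddedGraphGenerator(graphs)
model = GCNSupervisedGraphClassification(
    layer_sizes=[32, 32],
    activations=["elu", "elu"],
    generator=generator,
    dropout=0.5,
)
x_inp, x_out = model.in_out_tensors()
# Two dense layers on top of the pooled graph representation
predictions = Dense(units=8, activation="relu")(x_out)
predictions = Dense(units=2, activation="softmax")(predictions)
```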

Related models:

- `DeepGraphCNN` for a specialisation using SortPooling

- `GCN` for predictions for individual nodes or links

#### Link Prediction
```
class stellargraph.layer.LinkEmbedding(*args, **kwargs)
```

Defines an edge inference function that takes source and destination node embeddings (node features) as input, and returns a numeric vector of size `output_dim`.

This class takes as input either:

- A list of two tensors of shape (N, M) being the embeddings for each of the nodes in the link, where N is the number of links, and M is the node embedding size.

- A single tensor of shape (…, N, 2, M) where the axis second from last indexes the nodes in the link and N is the number of links and M the embedding size.

Examples

Consider two tensors containing the source and destination embeddings of size M:

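```
import tensorflow as tf
from stellargraph.layer import LinkEmbedding

# x_src and x_dst initially hold the M embedding values of each node
x_src = tf.constant(x_src, shape=(1, M), dtype="float32")
x_dst = tf.constant(x_dst, shape=(1, M), dtype="float32")

# Combine the pair with an inner product, then apply a sigmoid
li = LinkEmbedding(method="ip", activation="sigmoid")([x_src, x_dst])
```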


Parameters:
- `axis (int)` – If a single tensor is supplied this is the axis that indexes the node embeddings so that the indices 0 and 1 give the node embeddings to be combined. This is ignored if two tensors are supplied as a list.

- `activation (str)` – activation function applied to the output, one of “softmax”, “sigmoid”, etc., or any activation function supported by Keras, see https://keras.io/activations/ for more information.

- `method (str)` – Name of the method of combining `(src,dst)` node features or embeddings into edge embeddings. One of:

  - `concat` – concatenation,

  - `ip` or `dot` – inner product, $ip(u,v)=\sum_{i=1}^{d} u_i v_i$,

  - `mul` or `hadamard` – element-wise multiplication, $h(u,v)_i=u_i v_i$,

  - `l1` – L1 operator, $l1(u,v)_i=|u_i-v_i|$,

  - `l2` – L2 operator, $l2(u,v)_i=(u_i-v_i)^2$,

  - `avg` – average, $avg(u,v)=\frac{u+v}{2}$.

For all methods except `ip` or `dot`, a dense layer is applied on top of the combined edge embedding to transform it to a vector of size `output_dim`.
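
The combination operators listed above are simple enough to check by hand. The NumPy sketch below evaluates each of them on a pair of toy embeddings; the vectors `u` and `v` and the `combine` dictionary are illustrative, and the dense layer and activation applied by `LinkEmbedding` are left out.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])   # source node embedding
v = np.array([4.0, 0.0, -1.0])  # destination node embedding

# One function per combination method, following the formulas above.
combine = {
    "concat": lambda u, v: np.concatenate([u, v]),
    "ip": lambda u, v: np.dot(u, v),
    "mul": lambda u, v: u * v,
    "l1": lambda u, v: np.abs(u - v),
    "l2": lambda u, v: (u - v) ** 2,
    "avg": lambda u, v: (u + v) / 2,
}

results = {name: fn(u, v) for name, fn in combine.items()}
for name, value in results.items():
    print(name, value)
```

Note that `ip` reduces the pair to a single number, while the other methods return a vector, which is why only those are followed by a dense layer.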