# Stellargraph example: GraphSAGE on the CORA citation network

Import NetworkX and StellarGraph:

In [1]:
import networkx as nx
import pandas as pd
import os

import stellargraph as sg
from stellargraph.mapper import GraphSAGENodeMapper
from stellargraph.layer import GraphSAGE

from keras import layers, optimizers, losses, metrics, Model
from sklearn import preprocessing, feature_extraction, model_selection

Using TensorFlow backend.


### Loading the CORA network

**Downloading the CORA dataset:**
    
The dataset used in this demo can be downloaded from https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz

The following is the description of the dataset:
> The Cora dataset consists of 2708 scientific publications classified into one of seven classes.
> The citation network consists of 5429 links. Each publication in the dataset is described by a
> 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary.
> The dictionary consists of 1433 unique words. The README file in the dataset provides more details.

Download and unzip the cora.tgz file to a location on your computer and set the `data_dir` variable to
point to the location of the dataset (the directory containing "cora.cites" and "cora.content").

In [2]:
data_dir = "~/data/cora"

Load the graph from the edge list:

In [3]:
edgelist = pd.read_table(os.path.join(data_dir, "cora.cites"), header=None, names=["source", "target"])

In [4]:
G = nx.from_pandas_edgelist(edgelist)
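The dataset description quotes 5429 links, while the graph summary printed later in this notebook reports 5278 edges. A likely reason is that `nx.from_pandas_edgelist` builds an undirected simple graph by default, so reciprocal or repeated citation pairs collapse into a single edge. The sketch below illustrates the effect on a made-up edge list; `undirected_edge_count` and `toy_citations` are hypothetical names used only for illustration, not part of NetworkX.

```python
def undirected_edge_count(pairs):
    """Count distinct undirected edges: (a, b) and (b, a) are the same edge."""
    return len({frozenset(pair) for pair in pairs})


# Made-up citation pairs: one reciprocal pair and one exact duplicate.
toy_citations = [(1, 2), (2, 1), (1, 3), (1, 3), (3, 4)]

print(len(toy_citations), "rows ->", undirected_edge_count(toy_citations), "undirected edges")
```

The same comparison could be run on the real `edgelist` rows to check whether this accounts for the whole difference.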

Load the features and subject for the nodes

In [5]:
feature_names = ["w_{}".format(ii) for ii in range(1433)]
column_names =  feature_names + ["subject"]
node_data = pd.read_table(os.path.join(data_dir, "cora.content"), header=None, names=column_names)

We aim to train a graph-ML model that will predict the "subject" attribute on the nodes. These subjects are one of 7 categories:

In [6]:
set(node_data["subject"])

{'Case_Based',
 'Genetic_Algorithms',
 'Neural_Networks',
 'Probabilistic_Methods',
 'Reinforcement_Learning',
 'Rule_Learning',
 'Theory'}

### Splitting the data

For machine learning we want to take a subset of the nodes for training and use the rest for testing. We'll use scikit-learn's `train_test_split` to do this, keeping 140 nodes for training and stratifying by subject.

In [7]:
train_data, test_data = model_selection.train_test_split(node_data, train_size=140, test_size=None, stratify=node_data['subject'])

Note that stratified sampling gives the following class counts in the training set:

In [8]:
from collections import Counter
Counter(train_data['subject'])

Counter({'Genetic_Algorithms': 22,
         'Probabilistic_Methods': 22,
         'Neural_Networks': 42,
         'Case_Based': 16,
         'Theory': 18,
         'Reinforcement_Learning': 11,
         'Rule_Learning': 9})

The training set has a class imbalance that might need to be compensated for, e.g., by using a weighted cross-entropy loss in model training, with class weights inversely proportional to class support. However, we will ignore the class imbalance in this example for simplicity.
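Although the notebook does not apply class weights, the idea can be sketched from the counts printed above. The formula used here, `n_samples / (n_classes * count)`, is one common inverse-frequency heuristic and is an assumption, as are the names `inverse_frequency_weights` and `keras_class_weight`.

```python
from collections import Counter

# Training-set class counts as printed by the stratified split above.
train_counts = Counter({
    "Genetic_Algorithms": 22,
    "Probabilistic_Methods": 22,
    "Neural_Networks": 42,
    "Case_Based": 16,
    "Theory": 18,
    "Reinforcement_Learning": 11,
    "Rule_Learning": 9,
})


def inverse_frequency_weights(counts):
    """Return n_samples / (n_classes * count) per class, so rarer classes weigh more."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


class_weights = inverse_frequency_weights(train_counts)

# Assumption: target columns are ordered alphabetically by subject name,
# so class index i corresponds to the i-th name in sorted order.
keras_class_weight = {i: class_weights[label] for i, label in enumerate(sorted(class_weights))}

for label in sorted(class_weights):
    print("{:<24s} {:.3f}".format(label, class_weights[label]))
```

A dictionary shaped like `keras_class_weight` could be passed through the `class_weight` argument of Keras training methods, provided the index order really matches the one-hot column order.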

### Converting to numeric arrays

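For our categorical target, we will use one-hot vectors that will be fed into a softmax Keras layer during training. To do this conversion we use scikit-learn's `DictVectorizer`, fitted on the training targets and then applied to the test targets: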

In [9]:
target_encoding = feature_extraction.DictVectorizer(sparse=False)

train_targets = target_encoding.fit_transform(train_data[["subject"]].to_dict('records'))
test_targets = target_encoding.transform(test_data[["subject"]].to_dict('records'))

In [10]:
target_encoding = feature_extraction.DictVectorizer(sparse=False)
node_targets = target_encoding.fit_transform(node_data[["subject"]].to_dict('records'))
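To make the encoding concrete, here is a small pure-Python sketch of one-hot encoding for string-valued records. It imitates the `subject=<value>` column naming that shows up in the prediction table at the end of this notebook, but it is an illustration only, not scikit-learn's implementation; `fit_one_hot`, `transform_one_hot` and `toy_records` are hypothetical names.

```python
def fit_one_hot(records):
    """Learn a sorted list of 'key=value' column names from a list of dicts."""
    return sorted({"{}={}".format(k, v) for rec in records for k, v in rec.items()})


def transform_one_hot(records, columns):
    """Encode each record as a 0/1 row with one column per learned 'key=value' name."""
    position = {name: i for i, name in enumerate(columns)}
    rows = []
    for rec in records:
        row = [0.0] * len(columns)
        for k, v in rec.items():
            name = "{}={}".format(k, v)
            if name in position:  # categories not seen during fitting are skipped
                row[position[name]] = 1.0
        rows.append(row)
    return rows


toy_records = [{"subject": "Theory"}, {"subject": "Case_Based"}, {"subject": "Theory"}]
columns = fit_one_hot(toy_records)
encoded = transform_one_hot(toy_records, columns)
print(columns)
print(encoded)
```

Fitting on one set of records and transforming another with the same columns mirrors the `fit_transform` / `transform` split used for the training and test targets.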

We now do the same for the node attributes we want to use to predict the subject. These are the feature vectors that the Keras model will use as input. The CORA dataset contains attributes 'w_x' that correspond to words found in that publication. If a word occurs at least once in a publication the relevant attribute is set to one; otherwise it is zero.

In [11]:
node_features = node_data[feature_names].values

We now put these numeric features into the graph as node attributes

In [12]:
for nid, f in zip(node_data.index, node_features):
    # G.nodes replaces the deprecated G.node accessor in NetworkX 2.x
    G.nodes[nid]["feature"] = f
    G.nodes[nid]["label"] = "paper"

## Creating the GraphSAGE model in Keras

Now create a StellarGraph object from the NetworkX graph, which carries the node features as attributes. StellarGraph objects are what this library uses to perform machine learning tasks.

In [13]:
G = sg.StellarGraph(G)

Prepare the StellarGraph object for machine learning:

In [14]:
G.fit_attribute_spec()



In [15]:
print(G.info())

StellarGraph: Undirected multigraph
 Nodes: 2708, Edges: 5278

 Node types:
  paper: [2708]
        Attributes: {'feature'}
    Edge types: paper-->paper

 Edge types:
    paper-->paper: [5278]



To feed data from the graph to the Keras model we need a mapper. The mappers are specialized to the model and the learning task so we choose the `GraphSAGENodeMapper` as we are predicting node attributes with a GraphSAGE model.

We need two other parameters: the `batch_size` to use for training and the number of nodes to sample at each level of the model. Here we choose a two-level model with 10 nodes sampled in the first layer and 5 in the second.

In [16]:
batch_size = 10
num_samples = [10, 5]
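The effect of `num_samples = [10, 5]` can be pictured with a simplified sampler: each head node draws 10 neighbours, and each of those draws 5 more. The sketch below samples with replacement on a made-up adjacency dictionary. It is not StellarGraph's sampler, whose details may differ; `sample_neighbourhood` and `toy_adjacency` are hypothetical names.

```python
import random


def sample_neighbourhood(adjacency, head_node, num_samples, rng):
    """Return one list of sampled nodes per hop, starting with [head_node].

    Neighbours are drawn with replacement, so every hop has a fixed size
    regardless of node degree; nodes without neighbours are padded with None.
    """
    layers = [[head_node]]
    for n in num_samples:
        next_layer = []
        for node in layers[-1]:
            neighbours = adjacency.get(node, []) if node is not None else []
            if neighbours:
                next_layer.extend(rng.choice(neighbours) for _ in range(n))
            else:
                next_layer.extend([None] * n)
        layers.append(next_layer)
    return layers


# Made-up undirected graph stored as an adjacency dictionary.
toy_adjacency = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
layers = sample_neighbourhood(toy_adjacency, "a", [10, 5], random.Random(0))
print([len(layer) for layer in layers])  # [1, 10, 50]
```

Under this scheme every head node contributes 1 + 10 + 50 = 61 feature vectors, so the cost per batch grows with the product of the sample sizes.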

For training we map only the training nodes returned from our splitter and the target values.

In [17]:
train_nodes = train_data.index
test_nodes = test_data.index

In [18]:
train_mapper = GraphSAGENodeMapper(G, train_nodes, batch_size, num_samples, targets=train_targets)

Now we can specify our machine learning model. We need a few more parameters for this:

 * `layer_sizes` is the hidden feature size of each layer in the model
 * `bias` and `dropout` are internal parameters of the model

In [19]:
graphsage_model = GraphSAGE(
    layer_sizes=[20, 20],
    mapper=train_mapper,
    bias=True,
    dropout=0.5,
)

Now we create a model to predict the 7 categories using a Keras softmax layer. The number of output units equals the number of categories, which we read from the width of the one-hot target array, `train_targets.shape[1]`.

In [20]:
x_inp, x_out = graphsage_model.default_model(flatten_output=True)
prediction = layers.Dense(units=train_targets.shape[1], activation="softmax")(x_out)
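For readers unfamiliar with the softmax activation, the following stdlib-only sketch shows how a vector of seven raw scores becomes a probability distribution. The scores are made up, and `softmax` here is a hypothetical helper, not the Keras implementation.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that are positive and sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


toy_scores = [2.0, 1.0, 0.1, -1.0, 0.0, 0.5, -0.5]  # one made-up score per class
probs = softmax(toy_scores)
print(["{:.3f}".format(p) for p in probs])
```

The largest score always receives the largest probability, which is why taking the most probable class later recovers a single predicted subject per node.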

### Training the model

Now let's create the actual Keras model, with the graph inputs `x_inp` provided by `graphsage_model` and the outputs being the predictions from the softmax layer:

In [21]:
model = Model(inputs=x_inp, outputs=prediction)
model.compile(
    optimizer=optimizers.Adam(lr=0.005),
    loss=losses.categorical_crossentropy,
    metrics=[metrics.categorical_accuracy],
)
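The loss and metric chosen above can be illustrated with a simplified pure-Python version. This is a sketch of the usual definitions (mean negative log-probability of the true class, and fraction of correct arg-max predictions); the Keras functions may differ in details such as clipping and normalisation. The toy arrays are made up.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(true * log(pred)); eps guards against log(0)."""
    losses = [
        -sum(t * math.log(max(p, eps)) for t, p in zip(true_row, pred_row))
        for true_row, pred_row in zip(y_true, y_pred)
    ]
    return sum(losses) / len(losses)


def categorical_accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class matches the one-hot target."""
    hits = sum(
        true_row.index(max(true_row)) == pred_row.index(max(pred_row))
        for true_row, pred_row in zip(y_true, y_pred)
    )
    return hits / len(y_true)


# Two made-up samples over three classes.
y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]

print("loss: {:.4f}".format(categorical_crossentropy(y_true, y_pred)))
print("accuracy: {:.2f}".format(categorical_accuracy(y_true, y_pred)))
```

As a point of reference, a model that assigned equal probability to all 7 subjects would score a loss of ln 7, roughly 1.946.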

In [22]:
history = model.fit_generator(
    train_mapper,
    epochs=10,
    verbose=1,
    shuffle=True,
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Now that we have trained the model, we can evaluate it on the test set. We need to create another mapper for this using the test node IDs:

In [23]:
test_mapper = GraphSAGENodeMapper(
    G, test_nodes, batch_size, num_samples, targets=test_targets
)
test_metrics = model.evaluate_generator(test_mapper)
print("\nTest Set Metrics:")
for name, val in zip(model.metrics_names, test_metrics):
    print("\t{}: {:0.4f}".format(name, val))


Test Set Metrics:
	loss: 0.8991
	categorical_accuracy: 0.7547


### Making predictions with the model

Now let's get the predictions themselves for all nodes using another mapper:

In [24]:
all_nodes = node_data.index
all_mapper = GraphSAGENodeMapper(G, all_nodes, batch_size, num_samples)
all_predictions = model.predict_generator(all_mapper)

These predictions are the output of the softmax layer, so to get the final categories we'll use the `inverse_transform` method of our target encoding to turn these values back into the original categories.

In [25]:
node_predictions = target_encoding.inverse_transform(all_predictions)
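Conceptually, recovering a category from a softmax row amounts to picking the column with the highest probability. The sketch below does this directly on made-up probability rows. It assumes the columns follow the alphabetically sorted `subject=<value>` names, and `decode_predictions` is a hypothetical helper rather than part of scikit-learn or StellarGraph.

```python
# Assumed column order: alphabetically sorted 'subject=<value>' names.
columns = [
    "subject=Case_Based",
    "subject=Genetic_Algorithms",
    "subject=Neural_Networks",
    "subject=Probabilistic_Methods",
    "subject=Reinforcement_Learning",
    "subject=Rule_Learning",
    "subject=Theory",
]


def decode_predictions(probability_rows, columns, prefix="subject="):
    """Map each probability row to the name of its most probable column."""
    labels = []
    for row in probability_rows:
        best = max(range(len(row)), key=row.__getitem__)  # arg-max index
        labels.append(columns[best][len(prefix):])  # drop the 'subject=' prefix
    return labels


# Two made-up softmax outputs, one row per node.
toy_probabilities = [
    [0.05, 0.05, 0.70, 0.05, 0.05, 0.05, 0.05],
    [0.10, 0.10, 0.10, 0.10, 0.10, 0.40, 0.10],
]
print(decode_predictions(toy_probabilities, columns))
```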

Let's have a look at a few:

In [26]:
results = pd.DataFrame(node_predictions, index=all_nodes).idxmax(axis=1)
pd.DataFrame({"Predicted": results, "True": node_data['subject']}).head(10)

| | Predicted | True |
|---|---|---|
| 31336 | subject=Neural_Networks | Neural_Networks |
| 1061127 | subject=Rule_Learning | Rule_Learning |
| 1106406 | subject=Rule_Learning | Reinforcement_Learning |
| 13195 | subject=Probabilistic_Methods | Reinforcement_Learning |
| 37879 | subject=Probabilistic_Methods | Probabilistic_Methods |
| 1126012 | subject=Probabilistic_Methods | Probabilistic_Methods |
| 1107140 | subject=Probabilistic_Methods | Theory |
| 1102850 | subject=Neural_Networks | Neural_Networks |
| 31349 | subject=Neural_Networks | Neural_Networks |
| 1106418 | subject=Theory | Theory |
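The predicted column carries a `subject=` prefix while the true column does not, so the two cannot be compared directly. A minimal sketch of such a comparison, using only the ten rows displayed above (the variable names are hypothetical):

```python
# (predicted, true) pairs copied from the ten rows shown above.
shown = [
    ("subject=Neural_Networks", "Neural_Networks"),
    ("subject=Rule_Learning", "Rule_Learning"),
    ("subject=Rule_Learning", "Reinforcement_Learning"),
    ("subject=Probabilistic_Methods", "Reinforcement_Learning"),
    ("subject=Probabilistic_Methods", "Probabilistic_Methods"),
    ("subject=Probabilistic_Methods", "Probabilistic_Methods"),
    ("subject=Probabilistic_Methods", "Theory"),
    ("subject=Neural_Networks", "Neural_Networks"),
    ("subject=Neural_Networks", "Neural_Networks"),
    ("subject=Theory", "Theory"),
]

prefix = "subject="
# Strip the prefix from each prediction before comparing with the true label.
matches = sum(pred[len(prefix):] == true for pred, true in shown)
print("{}/{} displayed predictions match".format(matches, len(shown)))
```

Ten rows are far too few to estimate accuracy, and some of these nodes may belong to the training set; the test-set metric reported earlier remains the meaningful figure.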
