# Classification of Reddit user posts using graph machine learning with StellarGraph

We apply the graph machine learning algorithm APPNP [1] to the task of classifying Reddit user posts into 41 different categories, using the dataset published in [2], which can be downloaded [here](http://snap.stanford.edu/graphsage/reddit.zip).

The following is a description of the dataset [2]:

>Reddit is a large online discussion forum where users post and comment on content in different topical
>communities. We constructed a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or “subreddit”, that a post belongs to. We sampled
>50 large communities and built a post-to-post graph, connecting posts if the same user comments
>on both. In total this dataset contains 232,965 posts with an average degree of 492. We use the first
>20 days for training and the remaining days for testing (with 30% used for validation). For features,
>we use off-the-shelf 300-dimensional GloVe CommonCrawl word vectors [3]; for each post, we
>concatenated (i) the average embedding of the post title, (ii) the average embedding of all the post’s
>comments (iii) the post’s score, and (iv) the number of comments made on the post.
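
To make the graph construction in the description concrete, here is a minimal sketch of how a post-to-post edge list could be derived from comment records. The `comments` list and all user and post names in it are invented for illustration. The published dataset already ships the finished graph, so this is not the authors' preprocessing code.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (user, post) comment records; not taken from the real dataset.
comments = [
    ("alice", "post_a"),
    ("alice", "post_b"),
    ("bob", "post_b"),
    ("bob", "post_c"),
    ("carol", "post_a"),
    ("alice", "post_a"),  # a second comment on the same post adds nothing new
]

# Collect the set of posts each user commented on.
posts_by_user = defaultdict(set)
for user, post in comments:
    posts_by_user[user].add(post)

# Connect every pair of posts that share at least one commenter.
edges = set()
for posts in posts_by_user.values():
    for u, v in combinations(sorted(posts), 2):
        edges.add((u, v))

print(sorted(edges))
```

Storing each edge as a sorted pair in a set keeps the graph undirected and free of duplicate edges, even when several users comment on the same two posts.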


We demonstrate the advantage of using graph features and the scalability of the APPNP algorithm for node classification on the Reddit dataset. We first train an MLP on the node features and then propagate this model using APPNP. Because training uses only the node features, it can be completed in under 1 minute on an 8th gen quad-core i7.
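
The "predict then propagate" idea can be sketched in a few lines of plain Python. The sketch follows the update rule from [1], `Z <- (1 - alpha) * A_hat @ Z + alpha * H`, where `H` holds the per-node MLP predictions and `A_hat` is a normalized adjacency matrix. The toy graph, the values in `h`, and the helper names `matmul` and `appnp_propagate` are assumptions made for illustration; this is not StellarGraph's implementation.

```python
import math


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def appnp_propagate(a_hat, h, alpha, num_iterations):
    """Run the APPNP power iteration, starting from the predictions h."""
    z = [row[:] for row in h]
    for _ in range(num_iterations):
        az = matmul(a_hat, z)
        # mix the neighbourhood average with the original prediction (teleport)
        z = [
            [(1 - alpha) * az[i][j] + alpha * h[i][j] for j in range(len(h[0]))]
            for i in range(len(h))
        ]
    return z


# Toy path graph 0 - 1 - 2 with self-loops, symmetrically normalized by hand.
s = 1 / math.sqrt(6)
a_hat = [[1 / 2, s, 0.0], [s, 1 / 3, s], [0.0, s, 1 / 2]]

# Hypothetical MLP outputs: nodes 0 and 2 are confident, node 1 is undecided.
h = [[0.9, 0.1], [0.5, 0.5], [0.9, 0.1]]

z = appnp_propagate(a_hat, h, alpha=0.2, num_iterations=10)
for node, scores in enumerate(z):
    print(node, [round(v, 3) for v in scores])
```

With `alpha=1.0` the propagation returns the MLP output unchanged, while smaller values let the neighbours contribute more. The `teleport_probability=0.2` used later in this notebook plays the role of `alpha`.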


**References**

1. [Predict then propagate: Graph neural networks meet personalized pagerank](https://arxiv.org/pdf/1810.05997.pdf). J. Klicpera, A. Bojchevski, and S. Günnemann. arXiv:1810.05997, 2018.


2. [Inductive Representation Learning on Large Graphs](https://arxiv.org/pdf/1706.02216.pdf). W.L. Hamilton, R. Ying, and J. Leskovec arXiv:1706.02216 [cs.SI], 2017.


3. [Glove: Global vectors for word representation](https://nlp.stanford.edu/pubs/glove.pdf). J. Pennington, R. Socher, and C. D. Manning. In EMNLP, 2014.

<table><tr><td>Run the master version of this notebook:</td><td><a href="https://mybinder.org/v2/gh/stellargraph/stellargraph/master?urlpath=lab/tree/demos/node-classification/ppnp/appnp-reddit.ipynb" alt="Open In Binder" target="_parent"><img src="https://mybinder.org/badge_logo.svg"/></a></td><td><a href="https://colab.research.google.com/github/stellargraph/stellargraph/blob/master/demos/node-classification/ppnp/appnp-reddit.ipynb" alt="Open In Colab" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg"/></a></td></tr></table>

In [1]:
# install StellarGraph if running on Google Colab
import sys
if 'google.colab' in sys.modules:
  %pip install -q stellargraph[demos]

In [2]:
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

import networkx as nx
import pandas as pd
import numpy as np
import os

import stellargraph as sg
from stellargraph.mapper import FullBatchNodeGenerator
from stellargraph.layer import APPNP
from stellargraph.layer.appnp import APPNPPropagationLayer
from stellargraph.core.utils import GCN_Aadj_feats_op

from tensorflow.keras import layers, optimizers, losses, metrics, Model, models
from sklearn import preprocessing, feature_extraction
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
# aliased so it does not shadow tensorflow.keras.metrics imported above
from sklearn import metrics as sk_metrics
from scipy.sparse import coo_matrix

import matplotlib.pyplot as plt
import json

%matplotlib inline

## Loading the Reddit Dataset

First, we load the Reddit dataset, which is stored as a series of JSON files. We load the graph data first, then the node features and labels, and ensure that indexing is consistent across the graph, labels, and node features.

In [3]:
data_dir = os.path.expanduser("~/data/reddit")

In [4]:
def load_reddit(data_dir):
    """Load the Reddit graph, node features, and labels in a consistent node order."""

    with open(os.path.join(data_dir, "reddit-G.json")) as gfile:
        graph_data = json.load(gfile)

    list_node_ids = [d["id"] for d in graph_data["nodes"]]

    # each link is a (source, target) pair, so name the columns in that order
    edge_generator = ((link["source"], link["target"]) for link in graph_data["links"])
    edge_df = pd.DataFrame(edge_generator, columns=["source", "target"])

    with open(os.path.join(data_dir, "reddit-class_map.json")) as tfile:
        labels = json.load(tfile)

    feats = np.load(os.path.join(data_dir, "reddit-feats.npy"))

    # log-transform the first two feature columns; the second column is
    # shifted up by at least 1 before the log is taken
    feats[:, 0] = np.log(feats[:, 0] + 1.0)
    feats[:, 1] = np.log(feats[:, 1] - min(np.min(feats[:, 1]), -1))

    with open(os.path.join(data_dir, "reddit-id_map.json")) as mfile:
        feat_id_map = json.load(mfile)

    # reorder node features to follow the node order of the graph file
    sorted_idxs = np.array([feat_id_map[key] for key in list_node_ids])
    feats = feats[sorted_idxs, :]
    node_data = pd.DataFrame(feats)

    # reorder node labels to follow the same node order
    labels = np.array([labels[key] for key in list_node_ids])

    return edge_df, node_data, labels, list_node_ids, graph_data
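
The reordering step inside `load_reddit` is easier to follow on a toy example. The sketch below uses small hand-made stand-ins for the graph node list, the id map, the feature rows, and the class map. All ids and values are hypothetical; only the indexing pattern mirrors the function above.

```python
# Hypothetical stand-ins for the files read by load_reddit.
graph_node_ids = ["p2", "p0", "p1"]  # node order in the graph file
id_to_row = {"p0": 0, "p1": 1, "p2": 2}  # id map: node id -> feature row
features = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]  # rows stored in id-map order
class_map = {"p0": 7, "p1": 3, "p2": 5}  # node id -> class label

# Look up the feature row of each node, following the graph's node order.
row_order = [id_to_row[node_id] for node_id in graph_node_ids]
aligned_features = [features[row] for row in row_order]

# Look up the labels in the same order, so that position i refers to the
# same node in the graph, the features, and the labels.
aligned_labels = [class_map[node_id] for node_id in graph_node_ids]

print(row_order)
print(aligned_features)
print(aligned_labels)
```

After this step, row `i` of the features and entry `i` of the labels both describe the `i`-th node of the graph file, which is the consistency the loading code relies on.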

In [5]:
edge_df, node_data, labels, list_node_ids, graph_data = load_reddit(data_dir)
print("Number of nodes:", len(node_data))
print("Number of edges:", len(edge_df))

Number of nodes: 231443
Number of edges: 11606919


In [6]:
target_encoding = preprocessing.OneHotEncoder(sparse=False, categories="auto")
targets = target_encoding.fit_transform(labels.reshape(-1, 1))
targets = pd.DataFrame(targets)
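
As a dependency-free illustration of what the one-hot encoding produces, the sketch below encodes a handful of made-up integer labels and decodes them again. The label values are hypothetical, and the sorted category order is assumed to match how the encoder orders categories when `categories="auto"`.

```python
# Hypothetical class labels for four posts.
labels_toy = [3, 0, 3, 7]

# One column per distinct label, in sorted order.
categories = sorted(set(labels_toy))
column_of = {label: col for col, label in enumerate(categories)}

# Each row has a single 1.0 in the column of its label.
one_hot = [
    [1.0 if column_of[label] == col else 0.0 for col in range(len(categories))]
    for label in labels_toy
]

# Decoding takes the column with the largest value in each row; the same
# argmax step turns model output probabilities back into class labels.
decoded = [categories[row.index(max(row))] for row in one_hot]

print(one_hot)
print(decoded)
```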

We then split the data into train/val/test sets based on the split flags stored in the dataset. This is the same train/val/test split used in the [GraphSAGE paper](https://arxiv.org/pdf/1706.02216.pdf) [2]. We then fit a standard scaler on the training data only and use it to standardize all of the data.

In [7]:
def map_node_to_split(node):
    if node["test"]:
        return "test"
    elif node["val"]:
        return "val"
    else:
        return "train"


train_test_val_dict = dict(
    zip(list_node_ids, map(map_node_to_split, graph_data["nodes"]))
)

train_mask = [(train_test_val_dict[key] == "train") for key in list_node_ids]
val_mask = [(train_test_val_dict[key] == "val") for key in list_node_ids]
test_mask = [(train_test_val_dict[key] == "test") for key in list_node_ids]

scaler = preprocessing.StandardScaler()
scaler.fit(node_data[train_mask].values)
node_data.iloc[:, :] = scaler.transform(node_data.values)

train_data, train_targets = node_data[train_mask], targets[train_mask]
val_data, val_targets = node_data[val_mask], targets[val_mask]
test_data, test_targets = node_data[test_mask], targets[test_mask]

print("{} training nodes.".format(len(train_data)))
print("{} validation nodes.".format(len(val_data)))
print("{} testing nodes.".format(len(test_data)))

152410 training nodes.
23699 validation nodes.
55334 testing nodes.
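
The reason for fitting the scaler on the training rows only is that statistics of the validation and test data should not leak into preprocessing. A minimal sketch with one made-up feature column follows. The values and the mask are hypothetical, and the population standard deviation is used on the assumption that this matches the scaler's default behaviour.

```python
from statistics import mean, pstdev

# Hypothetical single feature column and a train mask.
values = [1.0, 2.0, 3.0, 10.0, 20.0]
is_train = [True, True, True, False, False]

# Estimate mean and standard deviation from the training rows only.
train_values = [v for v, keep in zip(values, is_train) if keep]
train_mean = mean(train_values)
train_std = pstdev(train_values)

# Apply the same transformation to every row, train or not.
scaled = [(v - train_mean) / train_std for v in values]

print([round(v, 3) for v in scaled])
```

The training rows end up with zero mean and unit variance, while the held-out rows are merely expressed on the same scale and may fall well outside that range.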


## Creating a StellarGraph object and generator

We now create a `StellarGraph` object and use this to create the data generators.

In [8]:
G = sg.StellarGraph(
    nodes=node_data, edges=edge_df, source_column="source", target_column="target"
)
generator = FullBatchNodeGenerator(G, method="gcn", sparse=True)

test_gen = generator.flow(test_data.index, test_targets)

Using GCN (local pooling) filters...
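
The `method="gcn"` option asks the generator to preprocess the adjacency matrix. As we understand it, this is the symmetric normalization with self-loops that [1] also uses, `A_hat = D^(-1/2) (A + I) D^(-1/2)`. The sketch below applies that formula to a tiny dense matrix. The helper name `gcn_normalize` and the toy graph are assumptions for illustration, and the library's sparse implementation may differ in detail.

```python
import math


def gcn_normalize(adj):
    """Return D^(-1/2) (A + I) D^(-1/2) for a dense 0/1 adjacency matrix."""
    n = len(adj)
    # add self-loops by setting the diagonal to 1
    a_tilde = [
        [1.0 if i == j else float(adj[i][j]) for j in range(n)] for i in range(n)
    ]
    # node degrees, counting the self-loop
    deg = [sum(row) for row in a_tilde]
    return [
        [a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


# Toy undirected path graph 0 - 1 - 2.
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

a_hat_toy = gcn_normalize(adj)
for row in a_hat_toy:
    print([round(v, 3) for v in row])
```

The normalized matrix stays symmetric, and each entry is scaled down by the degrees of both endpoints, so that high-degree posts do not dominate the propagation.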


## Training

We now create an MLP and train it on the node features.

In [9]:
# shape must be a tuple, hence the trailing comma
in_layer = layers.Input(shape=(train_data.shape[-1],))

layer = layers.Dense(512, activation="relu", kernel_regularizer="l2")(in_layer)
layer = layers.Dropout(0.5)(layer)
layer = layers.Dense(512, activation="relu", kernel_regularizer="l2")(layer)
layer = layers.Dropout(0.5)(layer)

# note the dimension of the output should equal the number of classes to predict!
layer = layers.Dense(train_targets.shape[-1], activation="softmax")(layer)

fully_connected_model = Model(inputs=in_layer, outputs=layer)

fully_connected_model.compile(
    optimizer=optimizers.Adam(learning_rate=0.0001),
    loss=losses.categorical_crossentropy,
    metrics=["acc"],
)



In [10]:
history = fully_connected_model.fit(
    train_data,
    train_targets,
    epochs=100,
    validation_data=(val_data, val_targets),
    batch_size=300,
    verbose=0,
)

The MLP attains an accuracy of ~72% on the test set.

In [11]:
test_metrics = fully_connected_model.evaluate(test_data, test_targets)
print("\nTest Set Metrics:")
for name, val in zip(fully_connected_model.metrics_names, test_metrics):
    print("\t{}: {:0.4f}".format(name, val))


Test Set Metrics:
	loss: 1.2515
	acc: 0.7222


## APPNP Propagation

We now create an APPNP model and use it to propagate the predictions of the trained MLP. No further training happens in this step. We then evaluate the propagated model on the test data.

In [12]:
appnp = APPNP(
    layer_sizes=[train_targets.shape[-1]],
    activations=["relu"],
    bias=True,
    generator=generator,
    teleport_probability=0.2,
    dropout=0.5,
    kernel_regularizer="l2",
)

x_inp, x_out = appnp.propagate_model(fully_connected_model)
predictions = layers.Softmax()(x_out)

propagated_model = Model(inputs=x_inp, outputs=predictions)
propagated_model.compile(
    loss="categorical_crossentropy", metrics=["acc"], optimizer="Adam"
)

In [13]:
test_metrics = propagated_model.evaluate(test_gen)
print("\nTest Set Metrics:")
for name, val in zip(propagated_model.metrics_names, test_metrics):
    print("\t{}: {:0.4f}".format(name, val))


Test Set Metrics:
	loss: 3.6114
	acc: 0.8842


Propagating the MLP with APPNP raises the test set accuracy by about 16 percentage points (from 0.72 to 0.88) without any further training. As we are performing single-label multiclass classification, the accuracy is equivalent to the micro F1 metric. This micro F1 of 0.88 is broadly comparable to, though lower than, the best supervised F1 of 0.95 that GraphSAGE with LSTM aggregation attains in [2]. However, APPNP only required ~3 minutes of training on an 8th gen i7, compared to the hours required for GraphSAGE.
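
The equivalence between accuracy and micro F1 for single-label multiclass classification can be checked by hand. The sketch below counts true positives, false positives, and false negatives over all classes for a few made-up predictions. The labels are hypothetical and unrelated to the Reddit results.

```python
# Hypothetical true and predicted class labels for six posts.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]

classes = sorted(set(y_true) | set(y_pred))
pairs = list(zip(y_true, y_pred))

# Micro averaging pools the counts over all classes before computing the score.
tp = sum(1 for c in classes for t, p in pairs if t == c and p == c)
fp = sum(1 for c in classes for t, p in pairs if p == c and t != c)
fn = sum(1 for c in classes for t, p in pairs if t == c and p != c)

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

print(micro_f1, accuracy)
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the pooled precision and recall both reduce to the fraction of correct predictions.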
