End-to-end graph classification on the SNAP Reddit Threads dataset with PyTorch Geometric. Three encoders are compared under an identical protocol: GIN, PNA, and GAT. Model selection uses validation Matthews correlation coefficient. Structural node features are engineered because the dataset provides no raw node attributes.
The source is SNAP Reddit Threads. The graphs were collected in May 2018. Nodes are Reddit users who participate in a thread. Undirected edges are reply relations between users. The task is binary graph classification: predict whether a thread is discussion-based.
| Property | Value |
|---|---|
| Number of graphs | 203,088 |
| Directed | No |
| Node features | No |
| Edge features | No |
| Graph labels | Binary |
| Temporal | No |
| Nodes per graph | 11–97 |
| Density | 0.021–0.382 |
| Diameter | 2–27 |
Raw files:
- `reddit_target.csv` with columns `id`, `target`
- `reddit_edges.json` mapping graph id to edge lists
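As a minimal sketch of how the two raw files fit together, the snippet below parses them with the standard library only. The inline sample strings stand in for the real files. The JSON layout (graph id as a string key mapping to a list of `[source, target]` pairs) is an assumption based on the description above, not a confirmed schema, and `load_threads` is a hypothetical helper name.

```python
import csv
import io
import json

# Tiny stand-ins for reddit_edges.json and reddit_target.csv (assumed layout).
EDGES_JSON = '{"0": [[0, 1], [1, 2]], "1": [[0, 1], [0, 2], [0, 3]]}'
TARGET_CSV = "id,target\n0,1\n1,0\n"


def load_threads(edges_text, target_text):
    """Return {graph_id: (edge_list, label)} from the two raw sources."""
    edges = {int(gid): [tuple(pair) for pair in pairs]
             for gid, pairs in json.loads(edges_text).items()}
    labels = {int(row["id"]): int(row["target"])
              for row in csv.DictReader(io.StringIO(target_text))}
    # Keep only graphs that have both an edge list and a label.
    return {gid: (edges[gid], labels[gid]) for gid in edges if gid in labels}


threads = load_threads(EDGES_JSON, TARGET_CSV)
for gid, (edge_list, label) in sorted(threads.items()):
    num_nodes = 1 + max(max(u, v) for u, v in edge_list)
    print(gid, num_nodes, len(edge_list), label)
```

With the real files, the two strings would be replaced by the file contents read from disk.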
```
threads-gnn/
├── configs/default.yaml
├── .env.example
├── scripts/install.sh
├── scripts/push_github.sh
├── data/
├── features/
├── models/
├── training/
├── scripts/
├── schemas.py
└── main.py
```
Open a terminal in Colab and run the commands below.
```bash
git clone https://github.com/pymlex/threads-gnn.git
cd threads-gnn
```

Creates `.env` from `.env.example`, reinstalls PyG wheels matched to the Colab PyTorch build, and verifies `torch-scatter`. GitHub authentication runs only when `gh` is not already logged in.

```bash
bash scripts/install.sh
```

Edit `.env` and set `HF_TOKEN`. Optional fields: `GITHUB_NAME`, `GITHUB_EMAIL`.
Downloads SNAP data and builds sharded processed graphs with structural features.
```bash
python scripts/preprocess.py --config configs/default.yaml
```

| Argument | Default | Description |
|---|---|---|
| `--config` | `configs/default.yaml` | Experiment configuration path |
Trains GIN, PNA, and GAT by default with identical splits and hyperparameters.
```bash
python scripts/train.py --config configs/default.yaml
```

| Argument | Default | Description |
|---|---|---|
| `--config` | `configs/default.yaml` | Experiment configuration path |
| `--architecture` | `all` | `all`, `gin`, `pna`, or `gat` |
| `--pooling` | from config | `mean`, `sum`, or `attention` |
Ranks models by validation MCC and writes runs/selected_model.json.
```bash
python scripts/compare.py
```

| Argument | Default | Description |
|---|---|---|
| `--runs-dir` | `runs` | Directory with run outputs |
| `--seed` | `42` | Random seed in run folder names |
Evaluates all best checkpoints on the test split by default.
```bash
python scripts/eval.py --config configs/default.yaml
```

| Argument | Default | Description |
|---|---|---|
| `--config` | `configs/default.yaml` | Experiment configuration path |
| `--architecture` | `all` | `all`, `gin`, `pna`, or `gat` |
| `--checkpoint` | auto | Path for a single-architecture run |
| `--split` | `test` | `train`, `val`, or `test` |
| `--seed` | `42` | Seed used in checkpoint filenames |
```bash
python scripts/plot_curves.py
```

| Argument | Default | Description |
|---|---|---|
| `--runs-dir` | `runs` | Directory with epoch metrics |
| `--seed` | `42` | Random seed in run folder names |
Loads best checkpoints, runs inference on the test split, and writes logit histograms plus ROC curves.
```bash
python scripts/plot_diagnostics.py --config configs/default.yaml
```

| Argument | Default | Description |
|---|---|---|
| `--config` | `configs/default.yaml` | Experiment configuration path |
| `--runs-dir` | `runs` | Directory for figure output |
| `--checkpoints-dir` | `checkpoints` | Checkpoint directory |
| `--split` | `test` | `train`, `val`, or `test` |
| `--seed` | `42` | Random seed in checkpoint filenames |
Aggregates metrics, writes comparison tables and training curves, then commits and pushes runs/ artefacts to the repository.
```bash
bash scripts/push_github.sh
```

| Argument | Default | Description |
|---|---|---|
| `seed` | `42` | Random seed in run folder names |
Tracked files: architecture_comparison.csv, selected_model.json, training_curves.png, per-architecture epoch_metrics.csv, final_metrics.json, confusion matrices, classification reports, and test predictions. Checkpoints remain local and are uploaded to Hugging Face separately.
```bash
python scripts/run_all.py --config configs/default.yaml
```

| Argument | Default | Description |
|---|---|---|
| `--config` | `configs/default.yaml` | Experiment configuration path |
Reads HF_TOKEN from .env. Uploads the selected GIN checkpoint and model_card.md.
```bash
python scripts/push_hf.py --repo-id pymlex/threads-gnn
```

| Argument | Default | Description |
|---|---|---|
| `--repo-id` | `pymlex/threads-gnn` | Hugging Face model repository |
| `--runs-dir` | `runs` | Directory with experiment outputs |
| `--checkpoints-dir` | `checkpoints` | Checkpoint directory |
| `--seed` | `42` | Random seed in checkpoint filenames |
Because the dataset has no node features, each graph receives engineered structural descriptors controlled by FeatureConfig in schemas.py. All features are enabled by default. The exact configuration is saved to data/processed/feature_config.json during preprocessing.
For a graph
Degree
Log degree
Normalised degree
Degree bucket embedding
One-hot encoding of
Clustering coefficient
where
K-core number
The core number
PageRank
with
Laplacian positional encodings
Let
Random-walk structural encodings
Let
With the default configuration, the input dimension is
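To make a few of the descriptors listed above concrete, here is a small pure-Python sketch that computes degree, log degree, clustering coefficient, k-core number, and random-walk return probabilities on a toy graph. It uses the standard textbook definitions. The project's exact formulas (for example whether log degree is `log(1 + d)`) are not shown here, so treat these choices and all function names as assumptions rather than the repository's implementation.

```python
import math
from itertools import combinations


def adjacency(num_nodes, edges):
    """Build an undirected adjacency list as a list of neighbour sets."""
    adj = [set() for _ in range(num_nodes)]
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def clustering(adj, v):
    """Fraction of neighbour pairs of v that are themselves connected."""
    k = len(adj[v])
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def core_numbers(adj):
    """K-core number per node by repeatedly peeling a minimum-degree node."""
    degree = [len(nbrs) for nbrs in adj]
    remaining = set(range(len(adj)))
    core = [0] * len(adj)
    current = 0
    while remaining:
        v = min(remaining, key=lambda n: degree[n])
        current = max(current, degree[v])
        core[v] = current
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


def return_probabilities(adj, steps):
    """Random-walk encoding: diagonal of (D^-1 A)^k for k = 1..steps."""
    n = len(adj)
    trans = [[(1.0 / len(adj[i]) if j in adj[i] else 0.0) for j in range(n)]
             for i in range(n)]
    power = [row[:] for row in trans]
    out = [[] for _ in range(n)]
    for _ in range(steps):
        for i in range(n):
            out[i].append(power[i][i])
        power = [[sum(power[i][k] * trans[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return out


edges = [(0, 1), (1, 2), (0, 2), (2, 3)]  # triangle plus one pendant node
adj = adjacency(4, edges)
degree = [len(nbrs) for nbrs in adj]
log_degree = [math.log1p(d) for d in degree]  # assumed form: log(1 + d)
clust = [clustering(adj, v) for v in range(4)]
core = core_numbers(adj)
rw = return_probabilities(adj, steps=2)
print(degree, core)
```

On this toy graph the pendant node has core number 1 while the triangle nodes have core number 2, and the pendant node's two-step return probability is 1/3 because its only neighbour has degree 3.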
Three graph encoders are compared under an identical protocol:
Shared input projection
Shared virtual node update after every encoder layer. For graph
Shared classifier head on the pooled graph embedding
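The virtual node idea can be illustrated with a short pure-Python sketch of one pool-broadcast round. This is only an outline of the general pattern: the learned update between pooling and broadcasting is replaced by the identity, the pool-then-broadcast ordering is an assumption, and `virtual_node_step` is a hypothetical name, not a function from the repository.

```python
def virtual_node_step(node_states, batch, virtual_states):
    """One pool-broadcast round for a batch of graphs.

    node_states: per-node feature vectors (lists of floats)
    batch: graph index of each node
    virtual_states: one feature vector per graph
    """
    dim = len(virtual_states[0])
    # Pool: add the sum of each graph's node states to its virtual node.
    pooled = [list(state) for state in virtual_states]
    for h, g in zip(node_states, batch):
        for d in range(dim):
            pooled[g][d] += h[d]
    # A learned update (for example an MLP) would transform `pooled` here;
    # this sketch uses the identity so the data flow stays visible.
    new_virtual = pooled
    # Broadcast: every node receives its own graph's virtual state.
    new_nodes = [[h[d] + new_virtual[g][d] for d in range(dim)]
                 for h, g in zip(node_states, batch)]
    return new_nodes, new_virtual


nodes = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
batch = [0, 0, 1]  # first two nodes belong to graph 0, the last to graph 1
virtual = [[0.0, 0.0], [0.0, 0.0]]
new_nodes, new_virtual = virtual_node_step(nodes, batch, virtual)
print(new_nodes, new_virtual)
```

The point of the pattern is that every node gets a one-hop shortcut to a graph-level summary, which matters for threads whose diameter is large relative to the number of encoder layers.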
Graph Isomorphism Network treats neighbourhood aggregation as an injective multiset function. For layer
With
```mermaid
flowchart TB
    X["Structural features x"] --> Proj["Input projection"]
    Proj --> Enc["GINConv, MLP, Residual, LayerNorm, Virtual node pool-broadcast, x4 layers"]
    Enc --> Pool["Attention pooling"]
    Pool --> Out["Classifier MLP, Binary logits"]
    Enc --> Sum["Neighbour sum"]
    Sum --> Enc
```
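As a rough illustration of the GIN update from Xu et al., the sketch below applies the standard rule `h_v <- MLP((1 + eps) * h_v + sum of neighbour features)` on plain Python lists. The MLP defaults to the identity, and residual connections, LayerNorm, and the virtual node are left out, so this is not the project's encoder block. The `gin_layer` name and its parameters are hypothetical.

```python
def gin_layer(features, adj, eps=0.0, mlp=lambda h: h):
    """GIN update: h_v <- MLP((1 + eps) * h_v + sum of neighbour features)."""
    out = []
    for v, h_v in enumerate(features):
        agg = [(1.0 + eps) * x for x in h_v]
        for u in adj[v]:
            agg = [a + x for a, x in zip(agg, features[u])]
        out.append(mlp(agg))
    return out


# Path graph 0 - 1 - 2 with one scalar feature per node.
features = [[1.0], [2.0], [4.0]]
adj = [[1], [0, 2], [1]]
plain = gin_layer(features, adj)
weighted = gin_layer(features, adj, eps=0.5)
print(plain, weighted)
```

Sum aggregation keeps neighbourhood size information that a mean would discard, which is the property behind the injectivity argument for GIN.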
Principal Neighbourhood Aggregation keeps multiple statistics over each neighbourhood and rescales them by node degree. Let
Degree scalers
```mermaid
flowchart TB
    X["Structural features x"] --> Proj["Input projection"]
    Proj --> PNA["PNAConv"]
    subgraph Agg["Neighbourhood statistics"]
        M["mean"]
        MX["max"]
        MN["min"]
        SD["std"]
    end
    Join["Combine"]
    subgraph Sc["Degree scalers"]
        Id["identity"]
        Amp["amplification"]
        Att["attenuation"]
    end
    PNA --> M
    PNA --> MX
    PNA --> MN
    PNA --> SD
    M --> Join
    MX --> Join
    MN --> Join
    SD --> Join
    Join --> Id
    Join --> Amp
    Join --> Att
    Id --> Enc["Virtual node, PNA block x3"]
    Amp --> Enc
    Att --> Enc
    Enc --> Pool["Attention pooling"]
    Pool --> Out["Classifier MLP, Binary logits"]
```
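The aggregator and scaler combination in the diagram can be sketched for a single node with scalar neighbour values. The scaler formulas follow the PNA paper (amplification `log(d + 1) / delta`, attenuation `delta / log(d + 1)`, with `delta` the average of `log(d + 1)` over training nodes). Using the population standard deviation and the name `pna_aggregate` are assumptions made for this illustration, and the node is assumed to have at least one neighbour.

```python
import math
from statistics import mean, pstdev


def pna_aggregate(neighbour_values, delta):
    """Return 4 aggregators x 3 scalers = 12 values for one node."""
    d = len(neighbour_values)  # node degree, assumed to be at least 1
    stats = [mean(neighbour_values), max(neighbour_values),
             min(neighbour_values), pstdev(neighbour_values)]
    log_deg = math.log(d + 1)
    # identity, amplification, attenuation
    scalers = [1.0, log_deg / delta, delta / log_deg]
    return [s * a for s in scalers for a in stats]


# Degree-2 node; delta chosen so that all three scalers equal 1.
out = pna_aggregate([1.0, 3.0], delta=math.log(3))
# Same node when the average training degree is lower (delta = log 2).
out_low = pna_aggregate([1.0, 3.0], delta=math.log(2))
print(out[:4], out_low[4])
```

When a node's degree is above the training average, the amplification scaler exceeds 1 and the attenuation scaler drops below 1, which is how the layer exposes degree information to the downstream transformation.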
Graph Attention Network assigns a data-dependent weight to every edge. For head
Layers
```mermaid
flowchart TB
    X["Structural features x"] --> Proj["Input projection"]
    Proj --> GAT["GATConv"]
    subgraph Heads["Attention heads"]
        H1["Head 1"]
        H2["Head 2"]
        H3["Head 3"]
        H4["Head 4"]
    end
    GAT --> H1
    GAT --> H2
    GAT --> H3
    GAT --> H4
    H1 --> Merge["Concat or mean"]
    H2 --> Merge
    H3 --> Merge
    H4 --> Merge
    Merge --> Enc["Virtual node, GAT block x3"]
    Enc --> Pool["Attention pooling"]
    Pool --> Out["Classifier MLP, Binary logits"]
```
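A single attention head can be sketched on scalar node features as follows. The scoring rule is the one from the GAT paper (LeakyReLU of a linear function of the transformed target and neighbour features, then a softmax over the neighbourhood). The scalar weight `w`, the two attention coefficients `a_self` and `a_nbr`, and the function names are hypothetical stand-ins for learned parameters, and neighbour lists are assumed to include a self-loop.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_head(features, adj, w, a_self, a_nbr):
    """One attention head on scalar features; returns outputs and weights."""
    z = [w * h for h in features]
    out, weights = [], []
    for i, nbrs in enumerate(adj):
        scores = [leaky_relu(a_self * z[i] + a_nbr * z[j]) for j in nbrs]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]  # softmax over the neighbourhood
        weights.append(alpha)
        out.append(sum(a * z[j] for a, j in zip(alpha, nbrs)))
    return out, weights


features = [0.0, 1.0, 2.0]
adj = [[0, 1], [0, 1, 2], [1, 2]]  # path graph with self-loops
uniform_out, uniform_w = gat_head(features, adj, w=1.0, a_self=0.0, a_nbr=0.0)
sharp_out, sharp_w = gat_head(features, adj, w=1.0, a_self=0.0, a_nbr=1.0)
print(uniform_out, sharp_out)
```

With zero attention coefficients every edge gets the same weight and the head reduces to a neighbourhood mean. With a positive `a_nbr` the head shifts weight toward neighbours with larger features.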
Three pooling operators are implemented.
Global mean pooling
Global sum pooling
Attention pooling
All three architectures use the same pooling method from configs/default.yaml.
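The three pooling operators can be compared on a toy graph with the pure-Python sketch below. Attention pooling is written here as a softmax-weighted sum with a scalar gate per node, which is one common formulation. The project's gate network is not shown in this document, so the `gate` callables and function names are placeholders.

```python
import math


def mean_pool(nodes):
    dim = len(nodes[0])
    return [sum(h[d] for h in nodes) / len(nodes) for d in range(dim)]


def sum_pool(nodes):
    dim = len(nodes[0])
    return [sum(h[d] for h in nodes) for d in range(dim)]


def attention_pool(nodes, gate):
    """Softmax-weighted sum; gate maps a node vector to a scalar score."""
    scores = [gate(h) for h in nodes]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(nodes[0])
    return [sum(e / total * h[d] for e, h in zip(exps, nodes))
            for d in range(dim)]


nodes = [[1.0, 0.0], [3.0, 2.0]]
pooled_mean = mean_pool(nodes)
pooled_sum = sum_pool(nodes)
pooled_flat = attention_pool(nodes, gate=lambda h: 0.0)    # constant gate
pooled_sharp = attention_pool(nodes, gate=lambda h: h[0])  # favours node 1
print(pooled_mean, pooled_sum, pooled_sharp)
```

A constant gate makes attention pooling coincide with mean pooling, while sum pooling is the only one of the three whose output scales with the number of nodes.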
- stratified train, validation, and test split with ratios $0.8 / 0.1 / 0.1$
- random seed $42$
- AdamW optimiser with learning rate $3 \times 10^{-3}$ and weight decay $10^{-4}$
- cosine learning-rate schedule over $40$ epochs
- full-precision training on GPU
- gradient clipping with max norm $1.0$
- early stopping on validation MCC with patience $8$
- batch size $4096$
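The schedule and stopping rule from the list above can be sketched without any training framework. The cosine formula below assumes no warm-up and a minimum learning rate of zero, neither of which is stated in this document. The validation curve is synthetic, and `cosine_lr` and `EarlyStopping` are illustrative names rather than the project's classes.

```python
import math

BASE_LR = 3e-3
NUM_EPOCHS = 40
PATIENCE = 8


def cosine_lr(epoch, base_lr=BASE_LR, total=NUM_EPOCHS):
    """Learning rate at the start of `epoch` (0-based), annealed to zero."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total))


class EarlyStopping:
    """Stop when the score has not improved for `patience` epochs in a row."""

    def __init__(self, patience=PATIENCE):
        self.patience = patience
        self.best = -math.inf
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, score, epoch):
        if score > self.best:
            self.best, self.best_epoch, self.bad_epochs = score, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Synthetic validation MCC curve (made-up numbers for illustration only).
val_mcc = [0.30, 0.45, 0.52, 0.55, 0.56] + [0.555] * 35
stopper = EarlyStopping()
stopped_at = None
for epoch in range(NUM_EPOCHS):
    lr = cosine_lr(epoch)  # would be handed to the optimiser in a real loop
    if stopper.step(val_mcc[epoch], epoch):
        stopped_at = epoch
        break
print(stopper.best_epoch, stopped_at)  # prints "4 12"
```

In this synthetic run the best score occurs at epoch 4 and training halts at epoch 12, after eight consecutive epochs without improvement. The checkpoint from the best epoch is the one that would be kept.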
The test split is never used for model selection. Architectures are ranked by best validation MCC. Test metrics for the selected architecture are reported once after training.
Per-epoch metrics are logged with tqdm and saved to runs/<architecture>_seed42/epoch_metrics.csv. Final metrics are saved to runs/<architecture>_seed42/final_metrics.json.
Matthews correlation coefficient:

$$
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
$$
Additional metrics: accuracy, balanced accuracy, precision, recall, F1, ROC-AUC, PR-AUC, confusion matrix, classification report.
Experiments use seed $42$.
| Architecture | Best val MCC | Val F1 | Val ROC-AUC | Test MCC | Test F1 | Test ROC-AUC |
|---|---|---|---|---|---|---|
| GIN | 0.5609 | 0.7998 | 0.8414 | 0.5642 | 0.8017 | 0.8417 |
| PNA | 0.5609 | 0.8001 | 0.8414 | 0.5635 | 0.8016 | 0.8419 |
| GAT | 0.5592 | 0.7971 | 0.8416 | 0.5655 | 0.8002 | 0.8418 |
GIN is selected with validation MCC $0.5609$; PNA reaches the same value to four decimal places.
Validation MCC rises sharply in the first five epochs and plateaus near
The three ROC curves overlap almost completely. Test ROC-AUC values are 0.8417 for GIN, 0.8419 for PNA, and 0.8418 for GAT. All models reach true positive rate 0.80 near false positive rate 0.25, then flatten toward the upper-right corner. Ranking quality is therefore encoder-invariant on this split: differences between architectures appear only after fixing a decision threshold, not in the order of predicted scores.
Histograms plot the class-1 logit on the test split, split by ground-truth label.
GIN forms two well-separated modes. True class
PNA compresses the positive mode into a narrow spike near logit
GAT shows the widest overlap between classes. True class
Per-architecture figures: runs/gin_seed42/test_roc_curve.png, runs/pna_seed42/test_roc_curve.png, runs/gat_seed42/test_roc_curve.png, and the matching test_logit_histogram.png files.
All models favour recall on the positive class. Recall on class 0 stays near
GIN, test split. Rows are true labels, columns are predicted labels. Left panel shows counts, right panel shows row-normalised rates.
PNA, test split. Same layout as GIN. PNA recovers slightly more true class 1 graphs than GIN at the cost of extra false positives on class 0.
GAT, test split. Same layout as GIN. GAT reduces false positives on class 0 relative to GIN and PNA.
Test metrics for the checkpoint with best validation MCC:
| Metric | Value |
|---|---|
| MCC | 0.5642 |
| Accuracy | 0.7783 |
| Balanced accuracy | 0.7758 |
| Precision | 0.7400 |
| Recall | 0.8745 |
| F1 | 0.8017 |
| ROC-AUC | 0.8417 |
| PR-AUC | 0.8087 |
Test confusion counts:
| | Pred 0 | Pred 1 |
|---|---|---|
| True 0 | 6706 | 3197 |
| True 1 | 1306 | 9100 |
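As a consistency check, the threshold-based metrics reported for GIN can be recomputed from these four confusion counts with the standard definitions. The sketch below uses only the standard library. ROC-AUC and PR-AUC are left out because they need the raw scores rather than the counts.

```python
import math

# GIN test confusion counts (rows are true labels, columns are predictions).
TN, FP = 6706, 3197
FN, TP = 1306, 9100

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)          # recall on class 1
specificity = TN / (TN + FP)     # recall on class 0
balanced_accuracy = 0.5 * (recall + specificity)
f1 = 2 * precision * recall / (precision + recall)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)

for name, value in [
    ("MCC", mcc),
    ("Accuracy", accuracy),
    ("Balanced accuracy", balanced_accuracy),
    ("Precision", precision),
    ("Recall", recall),
    ("F1", f1),
]:
    print(f"{name}: {value:.4f}")
```

The recomputed values match the reported test metrics to four decimal places. The gap between recall on class 1 and recall on class 0 (`specificity`) is what pulls balanced accuracy slightly below plain accuracy.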
On this dataset the choice of graph encoder has a small effect once structural features, virtual node, and attention pooling are fixed. GIN wins model selection by validation MCC, yet none of the three encoders separates on ROC-AUC. Logit histograms explain the residual gap: GIN and PNA sharpen the score distribution, while GAT keeps more mass near the boundary and shifts the precision-recall trade-off toward class 0. PNA ranks first on test ROC-AUC, but only by $0.0001$ over GAT and $0.0002$ over GIN at the reported precision.
Best checkpoint: pymlex/threads-gnn
```python
from huggingface_hub import hf_hub_download
import torch

# Download the checkpoint file from the Hugging Face Hub (cached locally).
checkpoint_path = hf_hub_download(repo_id="pymlex/threads-gnn", filename="model.pt")
# weights_only=False unpickles arbitrary Python objects, so only load
# checkpoints from a source you trust.
checkpoint = torch.load(checkpoint_path, map_location="cpu", weights_only=False)
```

| Parameter | Value |
|---|---|
| hidden_dim | 128 |
| num_layers | 4 |
| dropout | 0.2 |
| num_heads | 4 |
| batch_size | 4096 |
| learning_rate | $3 \times 10^{-3}$ |
| num_epochs | 40 |
| weight_decay | $10^{-4}$ |
| early_stopping_patience | 8 |
| virtual_node | enabled |
| pooling | attention |
```bibtex
@misc{threads_gnn,
  author = {Alex Zyukov},
  title = {Graph Classification on SNAP Reddit Threads},
  year = {2026},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/pymlex/threads-gnn}},
}
```

The project is released under the GPL-3.0 license.
```bibtex
@inproceedings{karateclub,
  title = {{Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs}},
  author = {Benedek Rozemberczki and Oliver Kiss and Rik Sarkar},
  year = {2020},
  pages = {3125--3132},
  booktitle = {Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20)},
  organization = {ACM},
}

@inproceedings{xu2019gin,
  title = {How Powerful are Graph Neural Networks?},
  author = {Keyulu Xu and Weihua Hu and Jure Leskovec and Stefanie Jegelka},
  booktitle = {International Conference on Learning Representations},
  year = {2019},
}

@inproceedings{corso2020pna,
  title = {Principal Neighbourhood Aggregation for Graph Nets},
  author = {Gabriele Corso and Luca Cavalleri and Dominique Beaini and Pietro Li{\`o} and Petar Veli{\v{c}}kovi{\'c}},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2020},
}

@inproceedings{velickovic2018gat,
  title = {Graph Attention Networks},
  author = {Petar Veli{\v{c}}kovi{\'c} and Guillem Cucurull and Arantxa Casanova and Adriana Romero and Pietro Li{\`o} and Yoshua Bengio},
  booktitle = {International Conference on Learning Representations},
  year = {2018},
}
```




