This repository supports the following paper:
M. Zhang, P. Li, Y. Xia, K. Wang, and L. Jin, Revisiting Graph Neural Networks for Link Prediction. [PDF]
SEAL is a GNN-based link prediction method. It first extracts a k-hop enclosing subgraph for each target link, then applies the Double Radius Node Labeling (DRNL) scheme to assign each node an integer label as an additional feature. Finally, these labeled enclosing subgraphs are fed to a graph neural network to predict link existence.
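The subgraph extraction step can be illustrated with a minimal sketch in plain Python. This is not the repository's code: the function name `k_hop_enclosing_subgraph` and the adjacency-dict representation are assumptions made for illustration, as is the choice to drop the target link itself from the extracted edges.

```python
def k_hop_enclosing_subgraph(adj, src, dst, num_hops):
    """Return (nodes, edges) of the subgraph induced by all nodes within
    num_hops of src or dst. `adj` maps each integer node to a set of neighbors.
    Hypothetical helper, not part of seal_link_pred.py."""
    visited = {src, dst}
    frontier = {src, dst}
    for _ in range(num_hops):
        # Expand one hop outward from the current frontier.
        frontier = {v for u in frontier for v in adj[u]} - visited
        if not frontier:
            break
        visited |= frontier
    # Keep every edge with both endpoints inside the node set, except the
    # target link itself (assumption: the link to predict is hidden).
    edges = set()
    for u in visited:
        for v in adj[u]:
            if v in visited and {u, v} != {src, dst}:
                edges.add((min(u, v), max(u, v)))
    return visited, edges


# Toy path graph 0-1-2-3-4-5 with target link (2, 3).
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
nodes_1hop, edges_1hop = k_hop_enclosing_subgraph(path, 2, 3, num_hops=1)
nodes_2hop, edges_2hop = k_hop_enclosing_subgraph(path, 2, 3, num_hops=2)
```

With one hop the subgraph covers nodes 1 to 4. With two hops it covers the whole path.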
This repository reimplements SEAL with the PyTorch-Geometric library and tests it on the Open Graph Benchmark (OGB) datasets. SEAL ranked first on 3 out of 4 link prediction datasets in the OGB Leaderboard at the time of submission. The code additionally supports Planetoid-like datasets, such as Cora, CiteSeer and PubMed, for which a random 0.85/0.05/0.1 split and the AUC metric are used. To use a custom dataset, replace the Planetoid dataset with your own.
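As a rough illustration of the Planetoid-style protocol, the sketch below splits an edge list at random and computes AUC by pairwise comparison. It assumes the 0.85/0.05/0.1 proportions refer to train/validation/test in that order. The helper names `random_edge_split` and `auc` are hypothetical and do not come from the repository.

```python
import random


def random_edge_split(edges, val_ratio=0.05, test_ratio=0.1, seed=0):
    """Shuffle the edges and cut them into train/val/test lists."""
    edges = list(edges)
    random.Random(seed).shuffle(edges)
    n_val = round(len(edges) * val_ratio)
    n_test = round(len(edges) * test_ratio)
    val = edges[:n_val]
    test = edges[n_val:n_val + n_test]
    train = edges[n_val + n_test:]
    return train, val, test


def auc(pos_scores, neg_scores):
    """Probability that a positive link outscores a negative one (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


toy_edges = [(i, i + 1) for i in range(100)]
train_edges, val_edges, test_edges = random_edge_split(toy_edges)
toy_auc = auc([0.9, 0.8], [0.1, 0.85])
```

The quadratic AUC loop is only meant to make the definition explicit. A library routine would be preferable on real data.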
| | ogbl-ppa | ogbl-collab | ogbl-ddi | ogbl-citation2 |
|---|---|---|---|---|
| Val results | 51.25%±2.52%* | 63.89%±0.49%* | 28.49%±2.69% | 87.57%±0.31%* |
| Test results | 48.80%±3.16%* | 63.64%±0.71%* | 30.56%±3.86% | 87.67%±0.32%* |
* State-of-the-art results. The evaluation metrics are Hits@100, Hits@50, Hits@20 and MRR, respectively. For ogbl-collab, we have switched to the new rule: after all hyperparameters are determined on the validation set, we include the validation edges in the training graph and retrain to report the test performance. ogbl-citation2 is an updated version of the deprecated ogbl-citation.
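For readers unfamiliar with these metrics, here is a minimal sketch of Hits@K and MRR in plain Python. It follows the usual definitions (a positive counts as a hit if it scores above the K-th best negative; MRR averages the reciprocal rank of each positive among its own negatives). It is not the OGB evaluator, and the pessimistic tie handling in `mean_reciprocal_rank` is an assumption.

```python
def hits_at_k(pos_scores, neg_scores, k):
    """Fraction of positives scoring strictly above the k-th highest negative."""
    if len(neg_scores) < k:
        return 1.0
    kth_neg = sorted(neg_scores, reverse=True)[k - 1]
    return sum(s > kth_neg for s in pos_scores) / len(pos_scores)


def mean_reciprocal_rank(pos_scores, neg_scores_per_pos):
    """Average of 1/rank, where each positive is ranked against its own negatives."""
    total = 0.0
    for pos, negs in zip(pos_scores, neg_scores_per_pos):
        # Assumption: ties are resolved against the positive (pessimistic rank).
        rank = 1 + sum(n >= pos for n in negs)
        total += 1.0 / rank
    return total / len(pos_scores)


toy_hits2 = hits_at_k([0.9, 0.5, 0.3], [0.8, 0.6, 0.4, 0.2], k=2)
toy_hits3 = hits_at_k([0.9, 0.5, 0.3], [0.8, 0.6, 0.4, 0.2], k=3)
toy_mrr = mean_reciprocal_rank([0.9, 0.5], [[0.1, 0.2], [0.6, 0.7, 0.1]])
```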
The original implementation of SEAL is here.
The original paper of SEAL is:
M. Zhang and Y. Chen, Link Prediction Based on Graph Neural Networks, Advances in Neural Information Processing Systems (NIPS-18). [PDF]
Latest tested combination: Python 3.8.5 + PyTorch 1.6.0 + PyTorch_Geometric 1.6.1 + OGB 1.2.4.
Install PyTorch
Install PyTorch_Geometric
Install OGB
Other required Python libraries include numpy, scipy and tqdm, among others.
python seal_link_pred.py --dataset ogbl-ppa --num_hops 1 --use_feature --use_edge_weight --eval_steps 5 --epochs 20 --dynamic_train --dynamic_val --dynamic_test --train_percent 5
python seal_link_pred.py --dataset ogbl-collab --num_hops 1 --use_feature --train_percent 10 --use_valedges_as_input
According to OGB, ogbl-collab allows including validation links in training once all the hyperparameters have been finalized using the validation set. Thus, you should first tune your hyperparameters without "--use_valedges_as_input", and then append "--use_valedges_as_input" to your final command once all the hyperparameters are determined. See issue.
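Conceptually, the two phases differ only in which edges make up the observed graph. The sketch below uses a hypothetical `build_adjacency` helper on toy edge lists to show the difference. It makes no claim about how "--use_valedges_as_input" is implemented in the script.

```python
def build_adjacency(edge_lists):
    """Build an undirected adjacency dict from one or more edge lists."""
    adj = {}
    for edges in edge_lists:
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj


toy_train_edges = [(0, 1), (1, 2)]
toy_val_edges = [(2, 3)]

# Phase 1: tune hyperparameters on a graph built from training edges only.
tuning_graph = build_adjacency([toy_train_edges])

# Phase 2: with hyperparameters fixed, add the validation edges to the
# observed graph and retrain before reporting test performance.
final_graph = build_adjacency([toy_train_edges, toy_val_edges])
```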
python seal_link_pred.py --dataset ogbl-ddi --num_hops 1 --ratio_per_hop 0.2 --use_edge_weight --eval_steps 1 --epochs 10 --dynamic_val --dynamic_test --train_percent 1
For the above three datasets, append "--runs 10" to repeat the experiment 10 times and report the average results.
python seal_link_pred.py --dataset ogbl-citation2 --num_hops 1 --use_feature --use_edge_weight --eval_steps 1 --epochs 10 --dynamic_train --dynamic_val --dynamic_test --train_percent 2 --val_percent 1 --test_percent 1
Because this dataset uses mean reciprocal rank (MRR) as the evaluation metric, where each positive testing link is ranked against 1000 random negative ones, it requires extracting 1001 enclosing subgraphs for every testing link, which is very time-consuming. Thus, the above command uses "--val_percent 1" and "--test_percent 1" to evaluate on only 1% of the validation and test data, giving a fast unbiased estimate of the true MRR. To get the true MRR, change them to "--val_percent 100" and "--test_percent 100".

Also, because this dataset is expensive to evaluate, we first train 10 models with 1% validation data in parallel, record the best epoch's model from each run, and then evaluate all 10 best models together using the "--test_multiple_models --val_percent 100 --test_percent 100" option. This option evaluates multiple pretrained models with a single subgraph extraction pass per link, which avoids extracting the subgraphs of the testing links 10 times. You need to specify your pretrained model paths in "seal_link_pred.py".
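The two cost-saving ideas described here, evaluating on a random subsample and sharing one subgraph extraction across several models, can be sketched as follows. All names (`subsample`, `evaluate_models_together`, `toy_extract`) are hypothetical, and the "models" are plain callables standing in for trained networks.

```python
import random


def subsample(links, percent, seed=0):
    """Pick a uniform random subset containing `percent`% of the links."""
    n = max(1, round(len(links) * percent / 100))
    return random.Random(seed).sample(links, n)


def evaluate_models_together(links, extract_subgraph, models):
    """Score every link with every model, extracting each subgraph only once."""
    scores = [[] for _ in models]
    for link in links:
        subgraph = extract_subgraph(link)  # the expensive step, once per link
        for i, model in enumerate(models):
            scores[i].append(model(subgraph))
    return scores


extraction_calls = {"count": 0}


def toy_extract(link):
    # Stand-in for enclosing subgraph extraction; counts how often it runs.
    extraction_calls["count"] += 1
    return link


toy_models = [lambda sg: sum(sg), lambda sg: sg[0] - sg[1]]
toy_test_links = [(i, i + 1) for i in range(200)]
sampled_links = subsample(toy_test_links, percent=1)
all_scores = evaluate_models_together(sampled_links, toy_extract, toy_models)
```

Here the extraction function runs once per sampled link regardless of how many models are scored.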
python seal_link_pred.py --dataset Cora --num_hops 3 --use_feature --hidden_channels 256 --runs 10
python seal_link_pred.py --dataset CiteSeer --num_hops 3 --hidden_channels 256 --runs 10
python seal_link_pred.py --dataset PubMed --num_hops 3 --use_feature --dynamic_train --runs 10
For all datasets, if you specify "--dynamic_train", the enclosing subgraphs of the training links are extracted on the fly instead of being preprocessed and saved to disk. The same applies to "--dynamic_val" and "--dynamic_test". You can increase "--num_workers" to accelerate the dynamic subgraph extraction process.
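The trade-off between the two modes can be illustrated with two toy dataset classes. This is a sketch under simplifying assumptions: the class names are hypothetical, the precomputed variant keeps results in memory rather than on disk, and the extraction function is a stand-in.

```python
class PrecomputedSubgraphs:
    """Extract all subgraphs up front; item access is then cheap."""

    def __init__(self, links, extract):
        self.items = [extract(link) for link in links]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]


class DynamicSubgraphs:
    """Store only the links; extract a subgraph each time an item is requested."""

    def __init__(self, links, extract):
        self.links = list(links)
        self.extract = extract

    def __len__(self):
        return len(self.links)

    def __getitem__(self, idx):
        return self.extract(self.links[idx])


call_log = []


def logging_extract(link):
    # Stand-in for subgraph extraction that records each call.
    call_log.append(link)
    return {"link": link}


toy_links = [(0, 1), (1, 2), (2, 3)]
dynamic = DynamicSubgraphs(toy_links, logging_extract)
calls_after_dynamic_init = len(call_log)   # nothing extracted yet
first_item = dynamic[0]
calls_after_one_access = len(call_log)     # exactly one extraction
precomputed = PrecomputedSubgraphs(toy_links, logging_extract)
calls_after_precompute = len(call_log)     # all links extracted at construction
```

The dynamic variant has no upfront cost and a small footprint, but pays the extraction cost on every access. That per-item work is what parallel data-loading workers can spread out.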
By default, SEAL uses the DRNL labeling trick. You can alternatively use other labeling tricks such as DE (distance encoding), DE+, or ZO (zero-one labeling) by appending "--node_label de", "--node_label de+", or "--node_label zo", respectively.
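To make the labeling idea concrete, here is a minimal sketch of DRNL and zero-one labeling on an adjacency dict. The DRNL hash follows the formula from the original SEAL paper, 1 + min(dx, dy) + (d/2)[(d/2) + (d%2) - 1] with d = dx + dy, where the two target nodes receive label 1 and nodes unreachable from either target receive 0. Computing each distance with the other target node masked out is an assumption of this sketch, as are the helper names. The ZO function reflects the common reading of zero-one labeling (targets 1, all others 0).

```python
from collections import deque


def bfs_distances(adj, source, blocked=None):
    """Shortest-path hop counts from `source`, never passing through `blocked`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v == blocked or v in dist:
                continue
            dist[v] = dist[u] + 1
            queue.append(v)
    return dist


def drnl_labels(adj, x, y):
    """Double Radius Node Labeling for target link (x, y)."""
    dist_x = bfs_distances(adj, x, blocked=y)  # assumption: y masked out
    dist_y = bfs_distances(adj, y, blocked=x)  # assumption: x masked out
    labels = {}
    for node in adj:
        if node in (x, y):
            labels[node] = 1
        elif node not in dist_x or node not in dist_y:
            labels[node] = 0
        else:
            dx, dy = dist_x[node], dist_y[node]
            d = dx + dy
            labels[node] = 1 + min(dx, dy) + (d // 2) * (d // 2 + d % 2 - 1)
    return labels


def zo_labels(adj, x, y):
    """Zero-one labeling: 1 for the two target nodes, 0 for everything else."""
    return {node: int(node in (x, y)) for node in adj}


toy_adj = {
    "x": {"a", "b"}, "y": {"a", "c"}, "a": {"x", "y"},
    "b": {"x", "c"}, "c": {"b", "y"}, "z": set(),
}
toy_drnl = drnl_labels(toy_adj, "x", "y")
toy_zo = zo_labels(toy_adj, "x", "y")
```

In the toy graph, the common neighbor `a` gets label 2, the nodes at distances (1, 2) get label 3, and the isolated node `z` gets 0.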
This repository also implements two link prediction heuristics, Common Neighbor (CN) and Adamic-Adar (AA), which perform surprisingly well and outperform many GNN methods on ogbl-ppa and ogbl-collab. An example usage of Common Neighbor is:
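For reference, the two heuristics have simple textbook definitions, sketched below on an adjacency dict. The function names are illustrative and this is not the repository's implementation.

```python
import math


def common_neighbors(adj, u, v):
    """CN score: number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])


def adamic_adar(adj, u, v):
    """AA score: shared neighbors weighted by 1 / log(degree).

    Any common neighbor of two distinct nodes has degree >= 2,
    so the logarithm is strictly positive.
    """
    return sum(1.0 / math.log(len(adj[w])) for w in adj[u] & adj[v])


toy_graph = {0: {2, 3}, 1: {2, 3}, 2: {0, 1, 4}, 3: {0, 1}, 4: {2}}
cn_score = common_neighbors(toy_graph, 0, 1)
aa_score = adamic_adar(toy_graph, 0, 1)
```

Adamic-Adar down-weights high-degree common neighbors, so node 2 (degree 3) contributes less than node 3 (degree 2).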
python seal_link_pred.py --use_heuristic CN --dataset ogbl-ppa
SEAL_OGB is released under an MIT license. Find out more about it here.
If you find the code useful, please cite our papers.
@misc{zhang2020revisiting,
title={Revisiting Graph Neural Networks for Link Prediction},
author={Muhan Zhang and Pan Li and Yinglong Xia and Kai Wang and Long Jin},
year={2020},
eprint={2010.16103},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@inproceedings{zhang2018link,
title={Link prediction based on graph neural networks},
author={Zhang, Muhan and Chen, Yixin},
booktitle={Advances in Neural Information Processing Systems},
pages={5165--5175},
year={2018}
}
Muhan Zhang, Facebook AI
10/13/2020