
Inconsistent evaluation on ogbl-collab datasets #457

Closed
Barcavin opened this issue Sep 1, 2023 · 3 comments


Barcavin (Contributor) commented Sep 1, 2023:

Hi,

According to the evaluation rules (https://ogb.stanford.edu/docs/leader_rules/#:~:text=The%20only%20exception,the%20validation%20labels.), ogbl-collab allows using the validation set during model training. However, the example code (https://github.com/snap-stanford/ogb/blob/master/examples/linkproppred/collab/gnn.py) appears to use the validation set only for inference, not for training. After adding these validation edges to the training edges, vanilla SAGE can reach 68+ Hits@50.

The implementation can be found here (https://github.com/Barcavin/ogb/tree/val_as_input_collab/examples/linkproppred/collab).
In fact, GCN reaches 69.45 ± 0.52 and SAGE reaches 68.20 ± 0.35 Hits@50. The differences between this implementation and the original example code are:

  1. Use validation edges as both training supervision and message-passing edges (sketched below).
  2. Use only one GNN layer.
  3. Score edges with the inner product rather than the Hadamard product followed by an MLP.
  4. Train for 2000 epochs.
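
The first change roughly amounts to the following sketch (assuming the OGB-style split_edge dict and a torch_sparse adjacency, as in the official gnn.py; the helper name build_train_inputs is illustrative, not the exact code from the linked branch):

```python
import torch
from torch_sparse import SparseTensor

def build_train_inputs(data, split_edge):
    # Supervision: treat validation positives as extra training positives.
    train_pos = split_edge['train']['edge']   # shape [num_train_edges, 2]
    valid_pos = split_edge['valid']['edge']   # shape [num_valid_edges, 2]
    pos_train_edge = torch.cat([train_pos, valid_pos], dim=0)

    # Message passing: add the validation edges (in both directions) to the graph.
    full_edge_index = torch.cat(
        [data.edge_index, valid_pos.t(), valid_pos.t().flip(0)], dim=1)
    adj_t = SparseTensor.from_edge_index(
        full_edge_index, sparse_sizes=(data.num_nodes, data.num_nodes))
    return pos_train_edge, adj_t.to_symmetric()
```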

I believe the most critical trick for making the model perform well is the learnable node embedding used in place of the node attributes. To reproduce, please run python gnn.py --use_valedges_as_input [--use_sage]
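
For concreteness, the 1-layer GNN over a learnable embedding with an inner-product decoder looks roughly like this (illustrative sketch only; the class and function names are not from the linked branch):

```python
import torch
from torch_geometric.nn import SAGEConv

class OneLayerSAGE(torch.nn.Module):
    def __init__(self, num_nodes, hidden_channels):
        super().__init__()
        # Free learnable embedding per node, used instead of the node attributes.
        self.emb = torch.nn.Embedding(num_nodes, hidden_channels)
        self.conv = SAGEConv(hidden_channels, hidden_channels)

    def forward(self, adj_t):
        return self.conv(self.emb.weight, adj_t)

def score(h, edge):
    # Inner product between endpoint embeddings, instead of Hadamard product + MLP.
    return (h[edge[:, 0]] * h[edge[:, 1]]).sum(dim=-1)
```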

Therefore, I am confused about the correct way to evaluate model performance on ogbl-collab.

Besides, I found that some submissions on the ogbl-collab leaderboard use the validation set as training edges (both as supervision signal and as message-passing edges), while others use it only for inference (message-passing edges). This may cause an evaluation discrepancy between these models. For example, the current top-1 (GIDN@YITU) uses the validation set during training, while ELPH uses it only for inference.

Thus, I believe a common protocol for evaluating models on ogbl-collab needs to be established for a fair comparison.

Thanks,

weihua916 (Contributor) commented:
Hi! The evaluation rule is stated as is. One can use validation edges for both training and inference as long as all hyper-parameters are selected based on validation edges (not test edges). As you rightly pointed out, our example code indeed only uses the validation set for inference, but it is just for simplicity. Your example code is totally valid, but it's a bit interesting to see you are validating on validation edges while also using validation edges as training supervision. So you are essentially using training loss to do model selection? Wouldn't that cause serious over-fitting?

Barcavin (Contributor, Author) commented Sep 2, 2023 via email

weihua916 (Contributor) commented:
Got it. Thanks for clarifying. Please feel free to submit to our leaderboard yourself.
