RandomLinkSplit gives no neg_edge_label on dense graphs #5581
Hey @andreimargeloiu, sorry, I couldn't understand one thing in your example: if you are only using the ...

But to your issue: how about we instead proportionally assign the negative edges to the three datasets? Right now we do (roughly):

```python
num_neg_train = int(num_train * self.neg_sampling_ratio)
num_neg_val = int(num_val * self.neg_sampling_ratio)
num_neg_test = int(num_test * self.neg_sampling_ratio)
num_neg = num_neg_train + num_neg_val + num_neg_test

neg_edge_index = negative_sampling(edge_index, size,
                                   num_neg_samples=num_neg,
                                   method='sparse')
```

We could adjust these counts once we know the actual number of negative samples, something like `num_neg_train = num_neg_train / num_neg * neg_edge_index.size(1)`.
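The proportional adjustment sketched above could look roughly like this (a plain-Python sketch, not PyG's actual implementation; the helper name and example numbers are made up, and integer arithmetic stands in for the float expression above):

```python
def rescale_neg_counts(num_neg_train, num_neg_val, num_neg_test, num_sampled):
    """Proportionally shrink the requested per-split negative-edge counts
    when negative_sampling() could only return `num_sampled` negatives."""
    num_neg = num_neg_train + num_neg_val + num_neg_test
    if num_sampled >= num_neg:  # enough negatives: keep the original request
        return num_neg_train, num_neg_val, num_neg_test
    # Integer version of num_neg_train / num_neg * neg_edge_index.size(1)
    train = num_neg_train * num_sampled // num_neg
    val = num_neg_val * num_sampled // num_neg
    test = num_sampled - train - val  # remainder goes to the last split
    return train, val, test

# Requested 7000/1500/500 negatives, but only 900 could be sampled:
print(rescale_neg_counts(7000, 1500, 500, 900))  # (700, 150, 50)
```

This keeps the train/val/test proportions intact and guarantees the three counts still sum to the number of negatives actually drawn, so no split silently ends up without negative labels.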
Thanks for chiming in @Padarn. I don't think we want to re-use negative samples across the different splits, since this would introduce data leakage.
Yeah, that also makes sense @rusty1s. I do think we should probably provide a path to having it work (i.e., in this case it should optionally not raise an error), but it could be controlled by a flag. Maybe something like ...
What does ... ?
I think this could all happen before sampling (if we assume ...).
Oh, I see. Yeah, we can adjust the ratio in this case. Raising a warning and doing it by default should work as well. This avoids the need to add another argument to the already pretty long argument list of `RandomLinkSplit`.
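The warn-and-adjust behaviour discussed here could look roughly like this (a sketch only; the helper name is hypothetical and this is not PyG's actual code):

```python
import warnings

def capped_neg_sampling_ratio(num_pos, num_available_neg, requested_ratio):
    """Hypothetical helper: cap the negative-sampling ratio at what the
    graph can actually provide, warning instead of failing later with a
    missing neg_edge_label key."""
    achievable = num_available_neg / num_pos
    if achievable < requested_ratio:
        warnings.warn(
            f"Only {num_available_neg} negative edges are available; "
            f"reducing neg_sampling_ratio from {requested_ratio} "
            f"to {achievable:.3f}.")
        return achievable
    return requested_ratio

# The dense example from this issue: 9000 positives, 900 possible negatives.
print(capped_neg_sampling_ratio(9000, 900, 1.0))  # 0.1 (with a warning)
```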
I'm also okay with that. I'll raise a PR when I get some time.
Addressed in the above PR |
🐛 Describe the bug
Context: I'm programmatically creating undirected graphs for training a Graph Autoencoder (GAE). Some graphs are very dense, containing 90% of all possible edges.
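To make the density arithmetic concrete, here is a small plain-Python sketch (the helper is hypothetical; the numbers match the example discussed in this issue, which counts each edge direction separately):

```python
def available_negatives(num_nodes, num_edges):
    """Missing (negative) edges in a graph without self-loops, counting
    both directions, as the 9900 figure in this issue does."""
    max_edges = num_nodes * (num_nodes - 1)
    return max_edges - num_edges

num_nodes, num_edges = 100, 9000     # ~90% of all possible edges present
requested = int(num_edges * 1.0)     # neg_sampling_ratio=1.0 -> 9000 wanted
print(available_negatives(num_nodes, num_edges))  # only 900 negatives exist
```

With only 900 candidate negatives against 9000 requested, any sampler that assumes `neg_sampling_ratio * num_edges` negatives exist must come up short.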
Training the GAE requires sampling missing edges as follows:

```python
T.RandomLinkSplit(num_val=0.15, num_test=0, is_undirected=True,
                  split_labels=True, add_negative_train_samples=True,
                  neg_sampling_ratio=1.0)
```
Issue: Because the graphs are so dense, there aren't enough missing edges to sample as negative edges. Consider a graph with 100 nodes and 9000 edges (out of a maximum of 9900 edges). Only 900 missing edges are available to sample as negatives, but because I'm setting `neg_sampling_ratio=1.0`, PyG expects to find 9000 of them. The problem is that PyG doesn't check whether enough missing edges exist, and it creates the training dataset without setting `neg_edge_label`: the current code doesn't assign any `neg_edge_index` to the training data (see the train/valid/test data below).

Error: Creating a dataloader then raises `KeyError: 'neg_edge_label'`, because the training data is missing this key.

Suggested solution: if one split contains `neg_edge_label` and `neg_edge_label_index`, then all three splits should contain them. In the above example, valid/test contain these keys, but the training dataset doesn't.

Environment