Split Error in RandomLinkSplit #3668
I think this is totally correct. It seems like you are looking at the shapes of edge_index, which holds the edges used for message passing, rather than edge_label_index, which holds the edges used for supervision.
Let me know if this resolves your concerns :)
It is not completely solved yet. One question: when a link appears in training, validation, and testing at the same time, is there information leakage among the different sets, especially for link prediction?
You mean that the link appears during training both for message passing and as ground-truth? I think it depends. For example, if you want to classify edges into ratings, it is totally fine to use the knowledge of the existence of edges during message passing (it would be different if you used the knowledge of the ratings that serve as supervision). To completely eliminate any data leakage, have a look at the disjoint_train_ratio argument.
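As a rough illustration of that option (a minimal sketch on toy data; the shapes are only illustrative):

import torch
from torch_geometric.data import Data
from torch_geometric.transforms import RandomLinkSplit

# Toy graph: 100 nodes with random features and 500 random edges.
data = Data(x=torch.randn(100, 16),
            edge_index=torch.randint(0, 100, (2, 500)))

# With disjoint_train_ratio > 0, that fraction of the training edges is used
# only as supervision (edge_label_index) and is removed from the training
# edge_index, so those links are never seen during message passing.
transform = RandomLinkSplit(num_val=0.1, num_test=0.1,
                            disjoint_train_ratio=0.3)
train_data, val_data, test_data = transform(data)

print(train_data.edge_index.size(1))        # message passing edges
print(train_data.edge_label_index.size(1))  # supervision edges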
Thanks, this provides a reasonable interpretation of the split for link prediction.
@rusty1s, would you please elaborate on what you mean by the message passing phase for link prediction?
For link prediction with GNNs, we first perform message passing on the original graph and use the resulting node embeddings to infer the probability of new links. As such, we have links to perform message passing on (edge_index) and links we want to predict (edge_label_index).
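To make that concrete, here is a minimal sketch (the model and the dot-product decoder are illustrative choices, not the only way to do it; train_data is assumed to come from RandomLinkSplit):

import torch
from torch_geometric.nn import GCNConv

class LinkPredictor(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)

    def encode(self, x, edge_index):
        # Message passing over the observed graph (train_data.edge_index).
        return self.conv2(self.conv1(x, edge_index).relu(), edge_index)

    def decode(self, z, edge_label_index):
        # Score candidate links (train_data.edge_label_index) via dot products.
        return (z[edge_label_index[0]] * z[edge_label_index[1]]).sum(dim=-1)

# z = model.encode(train_data.x, train_data.edge_index)
# logits = model.decode(z, train_data.edge_label_index)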
@rusty1s, according to this video, for the link prediction task we have 4 types of edges: training message edges, training supervision edges, validation edges, and test edges. How do these map to the output of RandomLinkSplit?
A simple example would help a lot! Thanks in advance.
Thank you @rusty1s, I guess I get the idea. My last question: the validation and test edges are always disjoint from each other, right?
Yes, this is correct. Validation and test edges always need to be disjoint.
It seems that negative samples are automatically generated in edge_label and edge_label_index of the validation and test sets, even when add_negative_train_samples=False. Is this to evaluate the model more fairly?
Yes, this is correct. For inference, we typically want to evaluate on the same set of positive and negative edges across epochs.
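For example (a small sketch on toy data; the printed values are only what one would expect, not verified output):

import torch
from torch_geometric.data import Data
from torch_geometric.transforms import RandomLinkSplit

data = Data(x=torch.randn(100, 16),
            edge_index=torch.randint(0, 100, (2, 500)))

transform = RandomLinkSplit(num_val=0.1, num_test=0.1,
                            add_negative_train_samples=False)
train_data, val_data, test_data = transform(data)

# Training split: positives only; negatives are meant to be re-sampled
# on the fly in every epoch.
print(train_data.edge_label.unique())  # expected: tensor([1.])
# Validation/test splits: a fixed set of positives and negatives.
print(val_data.edge_label.unique())    # expected: tensor([0., 1.])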
Sorry for hijacking the thread, but does RandomLinkSplit perform splits on edge_attr as well?
I am having a hard time interpreting the output of the RandomLinkSplit function.
The split is performed based on edge_index; the edges used for supervision in each split are given by edge_label and edge_label_index.
So how should I utilize edge_label_index? I tried, but edge_label only contains values 0 and 1, and edge_label_index has shape [2, num_edges]. I am a bit confused as to how I can leverage those to split edge_attr. I tried setting key='y', which results in a successful split of y with the desired outcome, but not for edge_attr. Do you have a code snippet that can explain the process? Thanks for the prompt reply.
Note that RandomLinkSplit only splits the ground-truth labels referenced by the key argument; other edge-level attributes such as edge_attr follow the message passing edges (edge_index), not the supervision edges.
However, y is related to the edge_attr: each edge has an edge_attr and a corresponding label y.
In that case, you might want to drop RandomLinkSplit and split the edges manually:

perm = torch.randperm(data.num_edges)
data.train_idx = perm[:int(0.8 * data.num_edges)]
data.val_idx = perm[int(0.8 * data.num_edges):int(0.9 * data.num_edges)]
data.test_idx = perm[int(0.9 * data.num_edges):]

Let me know if that works for you.
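For anyone reading along later, a small sketch of how such index tensors can then be used (it assumes edge-level attributes edge_attr and y exist on the Data object, as in the setup above):

# Slice edge-level tensors with the stored permutation indices:
train_edge_index = data.edge_index[:, data.train_idx]
train_edge_attr = data.edge_attr[data.train_idx]
train_y = data.y[data.train_idx]  # y is assumed to be an edge-level target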
That works, thanks! This is worth documenting. I could contribute to the docs, but I don't know how to.
Sounds good. We could just add this bit of information as a note to the RandomLinkSplit documentation.
I thought it would be better at the top, as a piece of contextual information, rather than at the bottom.
Please feel free to contribute this in a PR so we can credit you. I can fine-tune it afterwards :)
I feel a bit stupid opening a PR for such a small commit. Are there any task boards I can view? I'll see if I can make any other contributions.
Small PRs are the best :) Otherwise, we are also looking for some help to fill our "Dataset Cheatsheet".
Okay. I'll see what I can do :)
Sorry for updating the thread, but I just want to be sure that I'm correctly understanding the insights discussed by @rusty1s and @katyansun. If I understand well:
- edge_index contains the edges used for message passing;
- edge_label_index contains the edges used for supervision, and edge_label contains their labels (1 for existing edges, 0 for sampled negative edges).
Then the usage of those edge_label edges depends on us: for tasks like rating classification, the supervision edges may also appear in the message passing graph, whereas for predicting missing links they should be kept disjoint from it.
Let me know if this makes sense. |
It indeed makes sense, thanks for clarifying! If that sounds good to you, I could add some lines to the docs and/or write a short function to split edges for the case where we want to find missing links.
Sure, happy to extend the documentation in this regard :)
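Such a helper could look roughly like the following (only a sketch built on the existing disjoint_train_ratio and add_negative_train_samples options, not an official API):

from torch_geometric.transforms import RandomLinkSplit

def split_for_missing_links(data, num_val=0.1, num_test=0.1,
                            disjoint_train_ratio=0.3):
    # Keep the supervision edges out of the message passing graph and leave
    # negative training edges to be sampled inside the training loop.
    transform = RandomLinkSplit(num_val=num_val, num_test=num_test,
                                disjoint_train_ratio=disjoint_train_ratio,
                                add_negative_train_samples=False)
    return transform(data)

# train_data, val_data, test_data = split_for_missing_links(data)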
Sorry to repeat this thread, @rusty1s. I have a simple question about a mismatch in the sizes of the split datasets.

data
# Data(x=[47957, 256], edge_index=[2, 2161412])
# Train : Val : Test = 7 : 1 : 2
# Expected number of split edges: train(1512988), val(216141), test(432282)
# Case1: using RandomLinkSplit
transform = RandomLinkSplit(num_val=0.1, num_test=0.2, is_undirected=True, split_labels=True)
train_data, val_data, test_data = transform(data)
train_data
# Data(x=[47957, 256], edge_index=[2, 2222032], pos_edge_label=[1111016], pos_edge_label_index=[2, 1111016], neg_edge_label=[1111016], neg_edge_label_index=[2, 1111016])
val_data
# Data(x=[47957, 256], edge_index=[2, 2222032], pos_edge_label=[158716], pos_edge_label_index=[2, 158716], neg_edge_label=[158716], neg_edge_label_index=[2, 158716])
test_data
# Data(x=[47957, 256], edge_index=[2, 2539464], pos_edge_label=[317432], pos_edge_label_index=[2, 317432], neg_edge_label=[317432], neg_edge_label_index=[2, 317432])

Although I have read all the comments in this thread, I am not sure why some edges went missing. Is it due to isolated edges which cannot perform message passing?

# Case2: manual splitting
perm = torch.randperm(data.num_edges)
data.train_idx = perm[:int(0.7 * data.num_edges)]
data.val_idx = perm[int(0.7 * data.num_edges):int(0.8 * data.num_edges)]
data.test_idx = perm[int(0.8 * data.num_edges):]
data
# Data(x=[47957, 256], edge_index=[2, 2161412], train_idx=[1512988], val_idx=[216141], test_idx=[432283])
This is likely due to the is_undirected=True option: it assumes that every edge already exists in both directions, so if your graph is not actually stored as an undirected graph, the resulting counts will not line up with num_edges / 2.
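A quick way to check whether that assumption holds for your graph (sketch):

from torch_geometric.utils import to_undirected

# is_undirected=True assumes every edge is already stored in both directions.
# If the two counts below differ, the graph is not stored as undirected and
# the split sizes will not match num_edges / 2.
print(data.is_undirected())
print(data.edge_index.size(1), to_undirected(data.edge_index).size(1))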
I also have a question in this context. By default, no negative edges are sampled for the training set in the link prediction example (add_negative_train_samples=False), and new negative edges are sampled in every training epoch instead.
Would the behavior be the same if I don't sample negative edges in the training loop, but let RandomLinkSplit generate them once up front?
It is not the same. If you sample negative training edges in RandomLinkSplit, they are fixed across all epochs, while sampling them inside the training loop gives you a new set of negative edges in every epoch, so the model sees different negatives over the course of training.
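A sketch of the re-sampling variant, for reference (it reuses the illustrative model from the earlier sketch and assumes train_data comes from RandomLinkSplit with add_negative_train_samples=False):

import torch
import torch.nn.functional as F
from torch_geometric.utils import negative_sampling

# Inside the training loop: draw a fresh set of negative edges every epoch.
neg_edge_index = negative_sampling(
    edge_index=train_data.edge_index,
    num_nodes=train_data.num_nodes,
    num_neg_samples=train_data.edge_label_index.size(1),
)
edge_label_index = torch.cat(
    [train_data.edge_label_index, neg_edge_index], dim=-1)
edge_label = torch.cat(
    [train_data.edge_label,
     train_data.edge_label.new_zeros(neg_edge_index.size(1))])

z = model.encode(train_data.x, train_data.edge_index)
loss = F.binary_cross_entropy_with_logits(
    model.decode(z, edge_label_index), edge_label)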
Thanks for clarifying @rusty1s
🐛 Bug
When I use RandomLinkSplit to split the MovieLens dataset, I found that the split data is wrong.

To Reproduce
The link prediction task is as follows:
I get the following result:
train: 80670 (this is right)
val: 80670 (wrong)
test: 90753 (wrong)
Expected behavior
The number of ('user', 'rates', 'movie') edges in this dataset is 100836. According to the ratio (0.8, 0.1, 0.1), we should get the following split:
train: 80670 (this is right)
val: 10083 (wrong)
test: 10083 (wrong)
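As the discussion above suggests, the counts to compare against this expectation are the supervision edges rather than edge_index; a quick check could look like this (a sketch; it assumes train_data, val_data, test_data come from RandomLinkSplit applied to the heterogeneous MovieLens data with this edge type):

edge_type = ('user', 'rates', 'movie')
# edge_label_index holds the supervision edges of each split; edge_index also
# carries the message passing edges from earlier splits, which is why its
# size grows from train to val to test.
print(train_data[edge_type].edge_label_index.size(1))
print(val_data[edge_type].edge_label_index.size(1))
print(test_data[edge_type].edge_label_index.size(1))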
Environment
PyG version (torch_geometric.__version__): 2.0.2
PyTorch version (torch.__version__): 1.10.0
Python version (e.g., 3.9): 3.8
How you installed PyTorch and PyG (conda, pip, source): pip
Any other relevant information (e.g., version of torch-scatter): Not yet.

Additional context
I reviewed the source code and found that the error may come from line 176 in RandomLinkSplit, which appears to be called with wrong parameters.