Split Error in RandomLinkSplit #3668

Open
lmy86263 opened this issue Dec 10, 2021 · 34 comments

@lmy86263

🐛 Bug

When I use RandomLinkSplit to split the MovieLens dataset, I find that the split data is wrong.

To Reproduce

The link prediction split is set up as follows:

import torch_geometric.transforms as T

train_data, val_data, test_data = T.RandomLinkSplit(
        num_val=0.1,
        num_test=0.1,
        neg_sampling_ratio=0.0,
        edge_types=[('user', 'rates', 'movie')],
        rev_edge_types=[('movie', 'rev_rates', 'user')],
    )(data)

I get the following result:

train: 80670 (this is right)
val: 80670 (wrong)
test: 90753 (wrong)

Expected behavior

The number of ('user', 'rates', 'movie') edges in this dataset is 100836. According to the ratio (0.8, 0.1, 0.1), we should get the following split:

train: 80670 (as observed)
val: 10083 (instead of 80670)
test: 10083 (instead of 90753)

Environment

  • PyG version (torch_geometric.__version__): 2.0.2
  • PyTorch version (torch.__version__): 1.10.0
  • OS (e.g., Linux): macOS
  • Python version (e.g., 3.9): 3.8
  • CUDA/cuDNN version: CPU
  • How you installed PyTorch and PyG (conda, pip, source): pip
  • Any other relevant information (e.g., version of torch-scatter): Not yet.

Additional context

Reviewing the source code, I found that the error may come from line 176 in RandomLinkSplit being called with wrong parameters.

@lmy86263 lmy86263 added the bug label Dec 10, 2021
@rusty1s
Member

rusty1s commented Dec 11, 2021

I think this is totally correct. It seems like you are looking at the shapes of edge_index, while you may want to look at the shapes of edge_label and edge_label_index (which correctly model an 80/10/10 split ratio; see the sketch after the list below). Here, edge_index is solely used for message passing, i.e.,

  • for training, we exchange messages on all training edges
  • for validation, we exchange messages on all training edges
  • for testing, we exchange messages on all training and validation edges
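
A quick way to verify this (a sketch, assuming the heterogeneous MovieLens split from above; only the supervision edges should reflect the 80/10/10 ratio):

# Compare message-passing edges with supervision edges per split:
for name, split in [('train', train_data), ('val', val_data), ('test', test_data)]:
    store = split['user', 'rates', 'movie']
    print(name,
          '| message passing:', store.edge_index.size(1),
          '| supervision:', store.edge_label_index.size(1))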

Let me know if this resolves your concerns :)

@rusty1s rusty1s removed the bug label Dec 11, 2021
@lmy86263
Author

It is not completely resolved yet. One question: when a link occurs in training, validation, and testing at the same time, is there information leakage among the different datasets, especially for link prediction?

@rusty1s
Member

rusty1s commented Dec 13, 2021

You mean that the link appears during training both for message passing and as ground-truth? I think it depends. For example, in the case that you want to classify edges into ratings, it's totally fine to use the knowledge of the existence of edges during message passing (it would be different if you were to use the knowledge of the ratings used for supervision).

To completely eliminate any data leakage, have a look at the disjoint_train_ratio of RandomLinkSplit.
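
For example (a sketch; disjoint_train_ratio reserves a fraction of the training edges for supervision only, so they are excluded from the training message-passing graph):

import torch_geometric.transforms as T

# 30% of the training edges are used only for supervision and never
# appear in the training split's edge_index:
transform = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    disjoint_train_ratio=0.3,
    edge_types=[('user', 'rates', 'movie')],
    rev_edge_types=[('movie', 'rev_rates', 'user')],
)
train_data, val_data, test_data = transform(data)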

@lmy86263
Author

Thanks, this provides a reasonable interpretation of the split for link prediction.

@shahinghasemi

Here, edge_index is solely used for message passing, i.e.,
for training, we exchange messages on all training edges
for validation, we exchange messages on all training edges
for testing, we exchange messages on all training and validation edges

@rusty1s, would you please elaborate on this? What do you mean by the message passing phase for link prediction?

@rusty1s
Member

rusty1s commented Dec 18, 2021

For link prediction with GNNs, we first perform message passing on the original graph and use the resulting node embeddings to infer the probability of new links. As such, we have links to perform message passing on (edge_index), and links which we want to train/evaluate against (edge_label_index). RandomLinkSplit takes care of separating these two correctly.
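
In code, this two-step scheme looks roughly as follows (a minimal sketch for a homogeneous Data object produced by RandomLinkSplit; the module and layer choices are illustrative):

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class LinkPredictor(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)

    def encode(self, x, edge_index):
        # Message passing runs on edge_index only:
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

    def decode(self, z, edge_label_index):
        # Link logits are computed for the supervision edges only:
        src, dst = edge_label_index
        return (z[src] * z[dst]).sum(dim=-1)

model = LinkPredictor(train_data.num_features, 64)
z = model.encode(train_data.x, train_data.edge_index)
out = model.decode(z, train_data.edge_label_index)
loss = F.binary_cross_entropy_with_logits(out, train_data.edge_label)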

@shahinghasemi

shahinghasemi commented Dec 25, 2021

@rusty1s According to this video, for the link prediction task we have 4 types of edges: training supervision edges, training message edges, validation edges, and testing edges. I'm a little bit confused about training supervision edges and training message edges.
Here are my questions (context: a heterogeneous network):

  1. What's the difference between training supervision edges and training message edges? I know they're both used in the training phase, but I don't know the difference.
  2. Can training supervision edges and training message edges have common edges, or should they be disjoint sets?

A simple example would help a lot! thanks in advance.

@rusty1s
Member

rusty1s commented Dec 25, 2021

  1. "Training message edges" are the edges that are used in the GNN part of your model: The edges that you use to exchange neighborhood information and to enhance your node representations. "Training supervision edges" are then used to train your final link predictor: Given a training supervision edge, you take the source and destination node representations obtained from a GNN and use them as input to predict the probability of a link.

  2. This depends on the model and validation performance. In GAE (https://arxiv.org/abs/1611.07308), training supervision edges and training message edges denote the same set of edges. IN SEAL (https://arxiv.org/pdf/1802.09691.pdf), training supervision edges and training message edges are disjoint.

    In general, I think using the same set of edges for message passing and supervision may lead to same data leakage in your training phase, but this depends on the power/expressiveness of your model. For example, GAE uses a GCN-based encoder and a dot-product based decoder. Both encoder and decoder have limited power, so the data leakage capabilities of the model are limited as well.

@shahinghasemi

Thank you @rusty1s, I think I get the idea. My last question:
is this correct? The test edges should not be included in either the message edges or the supervision edges used for training; in other words, they're disjoint sets.

@rusty1s
Member

rusty1s commented Dec 26, 2021

Yes, this is correct. Validation and test edges need to always be disjoint.

@CocoGzh

CocoGzh commented Jan 2, 2022

For link prediction with GNNs, we first perform message passing on the original graph and use the resulting node embeddings to infer the probability of new links. As such, we have links to perform message passing on (edge_index), and links which we want to train/evaluate against (edge_label_index). RandomLinkSplit takes care of separating these two correctly.

It seems that negative samples are automatically generated in edge_label and edge_label_index of the validation and test sets even when add_negative_train_samples=False. Is this to evaluate the model more fairly?

@rusty1s
Member

rusty1s commented Jan 5, 2022

Yes, this is correct. For inference, we typically want to evaluate on the same set of positive and negative edges across epochs.
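
For example (a sketch; with add_negative_train_samples=False the training supervision contains positives only, while validation and test come with a fixed set of negatives, since neg_sampling_ratio defaults to 1.0):

import torch_geometric.transforms as T

transform = T.RandomLinkSplit(num_val=0.1, num_test=0.1,
                              add_negative_train_samples=False)
train_data, val_data, test_data = transform(data)

print(train_data.edge_label.unique())  # positives only: tensor([1.])
print(val_data.edge_label.unique())    # fixed positives and negatives: tensor([0., 1.])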

@ashim-mahara

Sorry for hijacking the thread, but does RandomLinkSplit perform splits on edge_attr and the label tensor y too? If yes, how do I access the edge attributes? BTW, my output after splitting is:

from torch_geometric.transforms import RandomLinkSplit

split_transform = RandomLinkSplit(num_test=0.2, num_val=0.1, is_undirected=False)
train_data, val_data, test_data = split_transform(data)

print(train_data)

Data(x=[19129, 1], edge_index=[2, 1979514], edge_attr=[1979514, 80], y=[1979514], is_directed=True, edge_label=[3959028], edge_label_index=[2, 3959028])

print(val_data)

Data(x=[19129, 1], edge_index=[2, 1979514], edge_attr=[1979514, 80], y=[1979514], is_directed=True, edge_label=[565574], edge_label_index=[2, 565574])

print(test_data)

Data(x=[19129, 1], edge_index=[2, 2262301], edge_attr=[2262301, 80], y=[2262301], is_directed=True, edge_label=[1131150], edge_label_index=[2, 1131150])

I am sorry but I am having a hard time interpreting the output of the RandomLinkSplit function.

@rusty1s
Member

rusty1s commented Jan 24, 2022

The split is performed based on edge_index and applied to all attributes that are identified as edge features (in your case edge_attr and y). It will also create edge_label and edge_label_index attributes, which will contain the negatively sampled edges and their labels. I hope this clarifies some of your doubts.

@ashim-mahara

So how should I utilize edge_label_index? I tried, but edge_label contains values in {0, 1} and edge_label_index has shape [2, num_edges]. I am a bit confused as to how I can leverage those to split edge_attr. I tried setting key='y', which results in a successful split of y with the desired outcome, but not for edge_attr. Do you have a code snippet that can explain the process? Thanks for the prompt reply.

@rusty1s
Member

rusty1s commented Jan 24, 2022

Note that edge_attr is already split as well. With key="y", you get the following behavior (see the sketch after this list):

  • edge_index and edge_attr shall be used for message passing via a given GNN
  • edge_label_index and y shall be used for supervision/loss computation. That is, for each edge in edge_label_index, y denotes the ground-truth labels. Furthermore, additional labels are added for negatively sampled edges.
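
A minimal sketch of this setup (key='y' tells RandomLinkSplit which edge attribute holds the labels; the split ratios are illustrative, and this assumes y is recognized as an edge-level attribute as above):

from torch_geometric.transforms import RandomLinkSplit

transform = RandomLinkSplit(num_val=0.1, num_test=0.2, key='y',
                            is_undirected=False)
train_data, val_data, test_data = transform(data)

# edge_index/edge_attr drive message passing; the supervision labels
# (including labels for negatively sampled edges) live under 'y':
print(train_data)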

@ashim-mahara

However, y is related to edge_attr, as in y = theta(edge_attr). So for each (source, edge_attr, destination) triple, I would like to compute a label y. y could also be interpreted as an edge_label. I am sorry, but I am very new to GNNs and trying to learn.

@rusty1s
Member

rusty1s commented Jan 24, 2022

In that case, you might want to drop the RandomLinkSplit transform (which is more applicable for a link prediction scenario in which links in the graph are actually missing), and perform a standard random splitting on your own:

import torch

perm = torch.randperm(data.num_edges)
data.train_idx = perm[:int(0.8 * data.num_edges)]
data.val_idx = perm[int(0.8 * data.num_edges):int(0.9 * data.num_edges)]
data.test_idx = perm[int(0.9 * data.num_edges):]
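
These indices can then be used to slice any edge-level tensor per split, e.g. (a sketch):

# Supervision edges, attributes, and labels for the training split:
train_edge_label_index = data.edge_index[:, data.train_idx]
train_edge_attr = data.edge_attr[data.train_idx]
train_y = data.y[data.train_idx]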

Let me know if that works for you.

@ashim-mahara

That works. Thanks! That snippet deserves to be saved somewhere. I could contribute to the docs, but I don't know how to.

@rusty1s
Member

rusty1s commented Jan 26, 2022

Sounds good. We could just add this bit of information as a note to the RandomLinkSplit documentation, see here.

@ashim-mahara

"The RandomLinkSplit transform is primarily used for a link prediction scenario, whereby the task is to predict missing links in a graph." at line 22.

I thought it would be better at the top as a piece of contextual information rather than at the bottom.

@rusty1s
Member

rusty1s commented Jan 26, 2022

Please feel free to contribute this in a PR so that you get credit for it. I can fine-tune it afterwards :)

@ashim-mahara

I feel a bit silly opening a PR for such a small commit. Are there any task boards I can view? I'll see if I can make any other contributions.

@rusty1s
Member

rusty1s commented Jan 26, 2022

Small PRs are the best :) Otherwise, we are also looking for some help to fill our "Dataset Cheatsheet".

@ashim-mahara

Okay. I'll see what I can do :)

@SimonCrouzet

Sorry for bumping the thread, but I just want to be sure that I'm correctly understanding the insights discussed by @rusty1s and @katyansun.

If I'm understanding correctly:

  • if we want to label edges ourselves, the best way to do it is to set the label as data[('user', 'rates', 'movie')].y, no matter whether the label is also a feature in data[('user', 'rates', 'movie')].edge_attr or not (therefore, if the label is part of edge_attr, it's up to us to either remove it from edge_attr or use layers that do not propagate edge_attr, to prevent any data contamination), and set key='y' when using RandomLinkSplit
  • negative test edges, i.e. "nonexistent edges", are always added to the val and test splits, and after applying RandomLinkSplit on data with binary edge labels (y = {0, 1}) we end up with edge_label being 0 for nonexistent edges, 1 for edges with y=0, and 2 for edges with y=1

Then the usage of those edge_label values is up to us:

  • if for our task we consider that negative test edges are indeed nonexistent (i.e., we know the entire graph but would like to be able to predict links and their labels), we compute our loss and metrics over every type of edge_label
  • if our task is to find missing links (i.e., from the edges we know exist we would like to guess new links, treating nonexistent links as missing data to be retrieved), we have to compute our loss and metrics without considering negative edges (edge_label=0). In this case, if we want to perform regression rather than classification (y being continuous), we cannot use RandomLinkSplit and should perform a standard random split on our own, following the suggestion made by @rusty1s, correct?

@rusty1s
Member

rusty1s commented Aug 3, 2022

  • Yes, that is correct.
  • If you have labels y = {0, 1}, then after negative sampling they will be increased to y = {1, 2}. Whenever you input edge_label_index or y into the split function, we assume that the task you are trying to solve is an edge-level classification task.
  • If the task is to find missing links, you usually wouldn't need to provide labels yourself, but rather treat positive links as label 1 and negative links as label 0. (See the sketch below for the label shift.)
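
To illustrate the label shift (a sketch on a toy graph; this assumes y is picked up as an edge-level attribute, as discussed earlier in this thread):

import torch
from torch_geometric.data import Data
from torch_geometric.transforms import RandomLinkSplit

# Toy graph with binary edge labels y in {0, 1}:
data = Data(x=torch.randn(5, 8),
            edge_index=torch.tensor([[0, 1, 2, 3, 4, 0],
                                     [1, 2, 3, 4, 0, 2]]),
            y=torch.tensor([0, 1, 0, 1, 0, 1]))

transform = RandomLinkSplit(num_val=0.2, num_test=0.2, key='y',
                            is_undirected=False)
train_data, val_data, test_data = transform(data)

# Existing labels are shifted to {1, 2}; negatively sampled
# (nonexistent) edges receive label 0:
print(val_data.y)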

Let me know if this makes sense.

@SimonCrouzet

It indeed makes sense, thanks for clarifying!

If that makes sense, I could add some lines to the docs and/or write a short function to split edges for the case where we want to find missing links.

@rusty1s
Member

rusty1s commented Aug 8, 2022

Sure, happy to extend the documentation in this regard :)

@songsong0425
Contributor

songsong0425 commented Jun 8, 2023

Sorry to revive this thread, @rusty1s. I have a simple question about the mismatch in the number of edges of the split datasets.
When I run RandomLinkSplit from torch_geometric.transforms for a link prediction task, it returns a different number of edges than expected, as shown below:

data
# Data(x=[47957, 256], edge_index=[2, 2161412])
# Train : Val : Test = 7 : 1 : 2
# Expected split sizes: train (1512988), val (216141), test (432282)

# Case 1: using RandomLinkSplit
from torch_geometric.transforms import RandomLinkSplit

transform = RandomLinkSplit(num_val=0.1, num_test=0.2, is_undirected=True, split_labels=True)
train_data, val_data, test_data = transform(data)

train_data
# Data(x=[47957, 256], edge_index=[2, 2222032], pos_edge_label=[1111016], pos_edge_label_index=[2, 1111016], neg_edge_label=[1111016], neg_edge_label_index=[2, 1111016])
val_data
# Data(x=[47957, 256], edge_index=[2, 2222032], pos_edge_label=[158716], pos_edge_label_index=[2, 158716], neg_edge_label=[158716], neg_edge_label_index=[2, 158716])
test_data
# Data(x=[47957, 256], edge_index=[2, 2539464], pos_edge_label=[317432], pos_edge_label_index=[2, 317432], neg_edge_label=[317432], neg_edge_label_index=[2, 317432])

Although I read all the comments in this thread, I'm not sure why there are missing edges. Is it due to isolated edges that can't participate in message passing?
Also, if I split the edges manually, will it cause any problem during model training, validation, and testing?

# Case 2: manual splitting
import torch

perm = torch.randperm(data.num_edges)
data.train_idx = perm[:int(0.7 * data.num_edges)]
data.val_idx = perm[int(0.7 * data.num_edges):int(0.8 * data.num_edges)]
data.test_idx = perm[int(0.8 * data.num_edges):]

data
# Data(x=[47957, 256], edge_index=[2, 2161412], train_idx=[1512988], val_idx=[216141], test_idx=[432283])

@rusty1s
Member

rusty1s commented Jun 10, 2023

This is likely due to the is_undirected option since it will only return the upper half of edges for supervision. Is your graph really undirected?
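
A quick way to check (a sketch):

# True only if every edge (i, j) also appears as (j, i):
print(data.is_undirected())

# With is_undirected=True, each undirected edge is counted once for
# supervision, so the expected positive counts are based on
# data.num_edges // 2 rather than data.num_edges.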

@LuisaWerner

I also have a question in the context of RandomLinkSplit:

In the example script for link prediction here, no negative edges are sampled for the training set by RandomLinkSplit itself (add_negative_train_samples=False); instead, negative edges are sampled in the train method:

from torch_geometric.utils import negative_sampling

neg_edge_index = negative_sampling(
        edge_index=train_data.edge_index, num_nodes=train_data.num_nodes,
        num_neg_samples=train_data.edge_label_index.size(1), method='sparse')

Would the behavior be the same if I didn't sample negative edges in the train method but instead used T.RandomLinkSplit(num_val=0.05, num_test=0.1, is_undirected=False, add_negative_train_samples=True)?
In other words, does setting add_negative_train_samples=True do the same as adding negative sampling to the training method?

@rusty1s
Member

rusty1s commented Aug 11, 2023

It is not the same. If you sample negative training edges in RandomLinkSplit, these negative samples will be fixed for the whole training procedure. Sampling negatives on-the-fly instead guarantees that we see a different set of negative samples in every epoch, thus providing a better learning signal (in general).
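
A sketch of the on-the-fly variant (adapted from the linked example; model, optimizer, and train_data are assumed to exist, with encode/decode as in the sketch earlier in this thread):

import torch
import torch.nn.functional as F
from torch_geometric.utils import negative_sampling

def train():
    model.train()
    optimizer.zero_grad()
    z = model.encode(train_data.x, train_data.edge_index)

    # A fresh set of negative edges is drawn in every epoch:
    neg_edge_index = negative_sampling(
        edge_index=train_data.edge_index, num_nodes=train_data.num_nodes,
        num_neg_samples=train_data.edge_label_index.size(1), method='sparse')

    edge_label_index = torch.cat(
        [train_data.edge_label_index, neg_edge_index], dim=-1)
    edge_label = torch.cat(
        [train_data.edge_label,
         train_data.edge_label.new_zeros(neg_edge_index.size(1))], dim=0)

    out = model.decode(z, edge_label_index)
    loss = F.binary_cross_entropy_with_logits(out, edge_label)
    loss.backward()
    optimizer.step()
    return float(loss)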

@LuisaWerner

Thanks for clarifying @rusty1s
