
RandomLinkSplit gives no neg_edge_label on dense graphs #5581

Closed
andreimargeloiu opened this issue Sep 30, 2022 · 8 comments

Comments


🐛 Describe the bug

Context: I'm programmatically creating undirected graphs for training a Graph Autoencoder (GAE). Some graphs are very dense, containing 90% of all possible edges.

Training the GAE requires sampling missing edges as follows:
T.RandomLinkSplit(num_val=0.15, num_test=0, is_undirected=True, split_labels=True, add_negative_train_samples=True, neg_sampling_ratio=1.0)

Issue: Because the graphs are so dense, there aren't enough missing edges to sample negative edges from. Consider a graph with 100 nodes and 9,000 edges (out of a maximum of 9,900). There are only 900 missing edges available as negatives, but because I'm setting neg_sampling_ratio=1.0, PyG expects to find 9,000 missing edges.

The issue is that PyG doesn't check whether enough missing edges exist, and it silently creates a training split without setting neg_edge_label. When creating the training data, the current code doesn't assign any negative edges at all. See the train/valid/test data below.

(Data(x=[141, 141], edge_index=[2, 16072], pos_edge_label=[8036], pos_edge_label_index=[2, 8036]), # Train; notice no negative edges
 Data(x=[141, 141], edge_index=[2, 16072], pos_edge_label=[1417], pos_edge_label_index=[2, 1417], neg_edge_label=[834], neg_edge_label_index=[2, 834]), # Validation
 Data(x=[141, 141], edge_index=[2, 18906], pos_edge_label=[0], pos_edge_label_index=[2, 0])) # Test
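
For completeness, a minimal sketch that reproduces this on PyG 2.1.0 (the random dense graph is illustrative, not my actual data):

import torch
import torch_geometric.transforms as T
from torch_geometric.data import Data
from torch_geometric.utils import to_undirected

# Build a dense undirected graph: 100 nodes, ~90% of all possible edges.
num_nodes = 100
row, col = torch.combinations(torch.arange(num_nodes), r=2).t()
mask = torch.rand(row.size(0)) < 0.9  # keep ~90% of the node pairs
edge_index = to_undirected(torch.stack([row[mask], col[mask]]))

data = Data(x=torch.eye(num_nodes), edge_index=edge_index)
transform = T.RandomLinkSplit(num_val=0.15, num_test=0, is_undirected=True,
                              split_labels=True,
                              add_negative_train_samples=True,
                              neg_sampling_ratio=1.0)
train_data, val_data, test_data = transform(data)

print('neg_edge_label' in train_data)  # False on PyG 2.1.0: key is missing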

Error: Creating a DataLoader then raises KeyError: 'neg_edge_label' because the training data is missing this key.

gnn_dataloader = tg.loader.DataLoader(graphs_dataset, batch_size=len(graphs_dataset), shuffle=True)
gnn_batch_train, _, _ = next(iter(gnn_dataloader))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_175984/1912375602.py in <module>
      1 gnn_dataloader = tg.loader.DataLoader(graphs_dataset, batch_size=len(graphs_dataset), shuffle=True) # exclude_keys=['neg_edge_label']
----> 2 gnn_batch_train, _, _ = next(iter(gnn_dataloader))

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    519             if self._sampler_iter is None:
    520                 self._reset()
--> 521             data = self._next_data()
    522             self._num_yielded += 1
    523             if self._dataset_kind == _DatasetKind.Iterable and \

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    559     def _next_data(self):
    560         index = self._next_index()  # may raise StopIteration
--> 561         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    562         if self._pin_memory:
    563             data = _utils.pin_memory.pin_memory(data)

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     50         else:
     51             data = self.dataset[possibly_batched_index]
---> 52         return self.collate_fn(data)

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch_geometric/loader/dataloader.py in __call__(self, batch)
     32             return type(elem)(*(self(s) for s in zip(*batch)))
     33         elif isinstance(elem, Sequence) and not isinstance(elem, str):
---> 34             return [self(s) for s in zip(*batch)]
     35 
     36         raise TypeError(f'DataLoader found invalid type: {type(elem)}')

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch_geometric/loader/dataloader.py in <listcomp>(.0)
     32             return type(elem)(*(self(s) for s in zip(*batch)))
     33         elif isinstance(elem, Sequence) and not isinstance(elem, str):
---> 34             return [self(s) for s in zip(*batch)]
     35 
     36         raise TypeError(f'DataLoader found invalid type: {type(elem)}')

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch_geometric/loader/dataloader.py in __call__(self, batch)
     18         if isinstance(elem, BaseData):
     19             return Batch.from_data_list(batch, self.follow_batch,
---> 20                                         self.exclude_keys)
     21         elif isinstance(elem, torch.Tensor):
     22             return default_collate(batch)

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch_geometric/data/batch.py in from_data_list(cls, data_list, follow_batch, exclude_keys)
     72             add_batch=not isinstance(data_list[0], Batch),
     73             follow_batch=follow_batch,
---> 74             exclude_keys=exclude_keys,
     75         )
     76 

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch_geometric/data/collate.py in collate(cls, data_list, increment, add_batch, follow_batch, exclude_keys)
     68                 continue
     69 
---> 70             values = [store[attr] for store in stores]
     71 
     72             # The `num_nodes` attribute needs special treatment, as we need to

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch_geometric/data/collate.py in <listcomp>(.0)
     68                 continue
     69 
---> 70             values = [store[attr] for store in stores]
     71 
     72             # The `num_nodes` attribute needs special treatment, as we need to

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch_geometric/data/storage.py in __getitem__(self, key)
     68 
     69     def __getitem__(self, key: str) -> Any:
---> 70         return self._mapping[key]
     71 
     72     def __setitem__(self, key: str, value: Any):

KeyError: 'neg_edge_label'

Suggested solution: a two-step fix:

  1. Raise a warning that there aren't enough missing edges to sample negative edges from (a sketch of such a check follows below).
  2. If any one of the train/valid/test splits contains the keys neg_edge_label and neg_edge_label_index, then all three should contain them. In the example above, valid/test contain these keys, but the training split doesn't.
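
For step 1, a hypothetical illustration of such a check (check_negative_budget and its arguments are made-up names, not existing PyG API):

import warnings

def check_negative_budget(num_neg: int, max_neg: int) -> int:
    # num_neg: requested number of negative edges;
    # max_neg: number of missing (non-existing) edges actually available.
    if num_neg > max_neg:
        warnings.warn(f'Requested {num_neg} negative edges, but only '
                      f'{max_neg} missing edges exist; capping at {max_neg}.')
        return max_neg
    return num_neg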

Environment

  • PyG version: 2.1.0
  • PyTorch version: 1.10.1
  • Python version: 3.7.9
Padarn (Contributor) commented Sep 30, 2022

Hey @andreimargeloiu, sorry, there's one thing I couldn't understand. In your example:

gnn_dataloader = tg.loader.DataLoader(graphs_dataset, batch_size=len(graphs_dataset), shuffle=True)
gnn_batch_train, _, _ = next(iter(gnn_dataloader))

If you are only using the train data, why are you creating a loader from all three datasets?

But to your issue: how about we instead assign the negative edges to the three datasets proportionally? Right now we do (roughly):

num_neg_train = int(num_train * self.neg_sampling_ratio)
num_neg_val = int(num_val * self.neg_sampling_ratio)
num_neg_test = int(num_test * self.neg_sampling_ratio)
num_neg = num_neg_train + num_neg_val + num_neg_test

neg_edge_index = negative_sampling(edge_index, size,
                                   num_neg_samples=num_neg,
                                   method='sparse')

We could adjust these once we know how many negative samples were actually drawn, something like:

num_neg_train = int(num_neg_train / num_neg * neg_edge_index.size(1))
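
Extending that to all three splits, a sketch (the int rounding and handing the remainder to the test split are my assumptions):

# negative_sampling may return fewer than num_neg edges on dense graphs.
# Re-split whatever was actually drawn in the same train/val/test proportions.
num_sampled = neg_edge_index.size(1)
num_neg_train = int(num_neg_train / num_neg * num_sampled)
num_neg_val = int(num_neg_val / num_neg * num_sampled)
num_neg_test = num_sampled - num_neg_train - num_neg_val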

rusty1s (Member) commented Oct 1, 2022

Thanks for chiming in, @Padarn. Adjusting neg_sampling_ratio sounds good, but we could still show a warning or error out with a better message here. WDYT?

I don't think we want to re-use negative samples across the different splits since this will introduce data leakage.

Padarn (Contributor) commented Oct 2, 2022

Yeah, that also makes sense, @rusty1s.

I do think we should provide a path to having it work (i.e., in this case it should optionally not raise an error), controlled by a flag.

Maybe something like neg_sampling_ratio_fixed=True by default and False optionally. The naming of this flag is a bit nasty, though.

rusty1s (Member) commented Oct 3, 2022

What would neg_sampling_ratio_fixed=True mean? And what would the code path that makes it work without an error look like? Not sure I understand.

Padarn (Contributor) commented Oct 3, 2022

neg_sampling_ratio_fixed=True would mean an error is raised if there are not enough negative edges available to match that ratio.

neg_sampling_ratio_fixed=False would allow the ratio to shrink if there are not enough negative edges.

I think this could all happen before sampling (assuming edge_index only has unique elements, or checking that). In the neg_sampling_ratio_fixed=False case, when there are not enough missing edges, we would adjust the ratio down to the maximum so that train/val/test all get the same proportion; see the sketch below.
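
As an illustration (num_nodes, num_pos, and neg_sampling_ratio_fixed are placeholders for whatever the transform would expose; assumes an undirected graph without self-loops):

# num_pos: number of unique undirected positive edges in the graph.
# The complete graph on num_nodes vertices has num_nodes*(num_nodes-1)/2
# edges, so the pool of possible negatives is the difference:
max_neg = num_nodes * (num_nodes - 1) // 2 - num_pos
if int(num_pos * neg_sampling_ratio) > max_neg:
    if neg_sampling_ratio_fixed:
        raise ValueError(f'Cannot sample {int(num_pos * neg_sampling_ratio)} '
                         f'negative edges; only {max_neg} missing edges exist.')
    # Shrink the ratio so train/val/test all receive the same proportion.
    neg_sampling_ratio = max_neg / num_pos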

rusty1s (Member) commented Oct 3, 2022

Oh, I see. Yeah, we can adjust the ratio in this case. Raising a warning and doing it by default should work as well. This avoids the need to add another argument to the already pretty long list of arguments of RandomLinkSplit.

Padarn (Contributor) commented Oct 3, 2022

I'm also okay with that. I'll raise a PR when I get some time.

Padarn (Contributor) commented Oct 10, 2022

Addressed in the above PR

Padarn closed this as completed Oct 10, 2022