
RandomLinkSplit gives no neg_edge_label on dense graphs #5581

Closed
andreimargeloiu opened this issue Sep 30, 2022 · 8 comments

Comments


🐛 Describe the bug

Context: I'm programmatically creating undirected graphs for training a Graph Autoencoder (GAE). Some graphs are very dense, containing 90% of all possible edges.

Training the GAE requires sampling missing edges as follows:
T.RandomLinkSplit(num_val=0.15, num_test=0, is_undirected=True, split_labels=True, add_negative_train_samples=True, neg_sampling_ratio=1.0)

Issue: Because the graphs are so dense, there aren't enough missing edges to sample negative edges from. Consider a graph with 100 nodes and 9,000 edges (out of a maximum of 9,900). There are only 900 missing edges available as negatives, but because I'm setting neg_sampling_ratio=1.0, PyG expects to find 9,000 missing edges.

The issue is that PyG doesn't check whether enough missing edges exist, and it silently creates a training split without setting neg_edge_label. When creating the training data, the current code doesn't assign any negative edges at all. See the train/valid/test data below.

(Data(x=[141, 141], edge_index=[2, 16072], pos_edge_label=[8036], pos_edge_label_index=[2, 8036]), # Train; notice no negative edges
 Data(x=[141, 141], edge_index=[2, 16072], pos_edge_label=[1417], pos_edge_label_index=[2, 1417], neg_edge_label=[834], neg_edge_label_index=[2, 834]), # Validation
 Data(x=[141, 141], edge_index=[2, 18906], pos_edge_label=[0], pos_edge_label_index=[2, 0])) # Test
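
For completeness, a minimal sketch that reproduces this on PyG 2.1.0 (the random dense graph is illustrative, not my actual data):

import torch
import torch_geometric.transforms as T
from torch_geometric.data import Data
from torch_geometric.utils import to_undirected

# Build a dense undirected graph: 100 nodes, ~90% of all possible edges.
num_nodes = 100
row, col = torch.combinations(torch.arange(num_nodes), r=2).t()
mask = torch.rand(row.size(0)) < 0.9  # keep ~90% of the node pairs
edge_index = to_undirected(torch.stack([row[mask], col[mask]]))

data = Data(x=torch.eye(num_nodes), edge_index=edge_index)
transform = T.RandomLinkSplit(num_val=0.15, num_test=0, is_undirected=True,
                              split_labels=True,
                              add_negative_train_samples=True,
                              neg_sampling_ratio=1.0)
train_data, val_data, test_data = transform(data)

print('neg_edge_label' in train_data)  # False on PyG 2.1.0: key is missing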

Error: Creating a DataLoader then raises KeyError: 'neg_edge_label' because the training data is missing this key.

gnn_dataloader = tg.loader.DataLoader(graphs_dataset, batch_size=len(graphs_dataset), shuffle=True)
gnn_batch_train, _, _ = next(iter(gnn_dataloader))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_175984/1912375602.py in <module>
      1 gnn_dataloader = tg.loader.DataLoader(graphs_dataset, batch_size=len(graphs_dataset), shuffle=True) # exclude_keys=['neg_edge_label']
----> 2 gnn_batch_train, _, _ = next(iter(gnn_dataloader))

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    519             if self._sampler_iter is None:
    520                 self._reset()
--> 521             data = self._next_data()
    522             self._num_yielded += 1
    523             if self._dataset_kind == _DatasetKind.Iterable and \

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    559     def _next_data(self):
    560         index = self._next_index()  # may raise StopIteration
--> 561         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    562         if self._pin_memory:
    563             data = _utils.pin_memory.pin_memory(data)

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     50         else:
     51             data = self.dataset[possibly_batched_index]
---> 52         return self.collate_fn(data)

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch_geometric/loader/dataloader.py in __call__(self, batch)
     32             return type(elem)(*(self(s) for s in zip(*batch)))
     33         elif isinstance(elem, Sequence) and not isinstance(elem, str):
---> 34             return [self(s) for s in zip(*batch)]
     35 
     36         raise TypeError(f'DataLoader found invalid type: {type(elem)}')

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch_geometric/loader/dataloader.py in <listcomp>(.0)
     32             return type(elem)(*(self(s) for s in zip(*batch)))
     33         elif isinstance(elem, Sequence) and not isinstance(elem, str):
---> 34             return [self(s) for s in zip(*batch)]
     35 
     36         raise TypeError(f'DataLoader found invalid type: {type(elem)}')

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch_geometric/loader/dataloader.py in __call__(self, batch)
     18         if isinstance(elem, BaseData):
     19             return Batch.from_data_list(batch, self.follow_batch,
---> 20                                         self.exclude_keys)
     21         elif isinstance(elem, torch.Tensor):
     22             return default_collate(batch)

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch_geometric/data/batch.py in from_data_list(cls, data_list, follow_batch, exclude_keys)
     72             add_batch=not isinstance(data_list[0], Batch),
     73             follow_batch=follow_batch,
---> 74             exclude_keys=exclude_keys,
     75         )
     76 

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch_geometric/data/collate.py in collate(cls, data_list, increment, add_batch, follow_batch, exclude_keys)
     68                 continue
     69 
---> 70             values = [store[attr] for store in stores]
     71 
     72             # The `num_nodes` attribute needs special treatment, as we need to

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch_geometric/data/collate.py in <listcomp>(.0)
     68                 continue
     69 
---> 70             values = [store[attr] for store in stores]
     71 
     72             # The `num_nodes` attribute needs special treatment, as we need to

~/miniconda3/envs/low-data/lib/python3.7/site-packages/torch_geometric/data/storage.py in __getitem__(self, key)
     68 
     69     def __getitem__(self, key: str) -> Any:
---> 70         return self._mapping[key]
     71 
     72     def __setitem__(self, key: str, value: Any):

KeyError: 'neg_edge_label'

Suggested solution: a two-step fix:

  1. Raise a warning that there aren't enough missing edges to sample negative edges from (a sketch of such a check follows below).
  2. If any one of the train/valid/test splits contains the keys neg_edge_label and neg_edge_label_index, then all three should contain them. In the example above, valid/test contain these keys, but the training split doesn't.
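
For step 1, a hypothetical illustration of such a check (check_negative_budget and its arguments are made-up names, not existing PyG API):

import warnings

def check_negative_budget(num_neg: int, max_neg: int) -> int:
    # num_neg: requested number of negative edges;
    # max_neg: number of missing (non-existing) edges actually available.
    if num_neg > max_neg:
        warnings.warn(f'Requested {num_neg} negative edges, but only '
                      f'{max_neg} missing edges exist; capping at {max_neg}.')
        return max_neg
    return num_neg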

Environment

  • PyG version: 2.1.0
  • PyTorch version: 1.10.1
  • Python version: 3.7.9
Padarn (Contributor) commented Sep 30, 2022

Hey @andreimargeloiu, sorry, there's one thing I couldn't understand. In your example:

gnn_dataloader = tg.loader.DataLoader(graphs_dataset, batch_size=len(graphs_dataset), shuffle=True)
gnn_batch_train, _, _ = next(iter(gnn_dataloader))

If you are only using the train data, why are you creating a loader from all three datasets?

But to your issue: how about we instead assign the negative edges to the three datasets proportionally? Right now we do (roughly):

num_neg_train = int(num_train * self.neg_sampling_ratio)
num_neg_val = int(num_val * self.neg_sampling_ratio)
num_neg_test = int(num_test * self.neg_sampling_ratio)
num_neg = num_neg_train + num_neg_val + num_neg_test

neg_edge_index = negative_sampling(edge_index, size,
                                   num_neg_samples=num_neg,
                                   method='sparse')

We could adjust these once we know how many negative samples were actually drawn, something like:

num_neg_train = int(num_neg_train / num_neg * neg_edge_index.size(1))
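
Extending that to all three splits, a sketch (the int rounding and handing the remainder to the test split are my assumptions):

# negative_sampling may return fewer than num_neg edges on dense graphs.
# Re-split whatever was actually drawn in the same train/val/test proportions.
num_sampled = neg_edge_index.size(1)
num_neg_train = int(num_neg_train / num_neg * num_sampled)
num_neg_val = int(num_neg_val / num_neg * num_sampled)
num_neg_test = num_sampled - num_neg_train - num_neg_val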

rusty1s (Member) commented Oct 1, 2022

Thanks for chiming in, @Padarn. Adjusting neg_sampling_ratio sounds good, but we could still show a warning or error out with a better message here. WDYT?

I don't think we want to re-use negative samples across the different splits since this will introduce data leakage.

Padarn (Contributor) commented Oct 2, 2022

Yeah, that also makes sense, @rusty1s.

I do think we should provide a path to having it work (i.e., in this case it should optionally not raise an error), controlled by a flag.

Maybe something like neg_sampling_ratio_fixed=True by default and False optionally. The naming of this flag is a bit nasty, though.

rusty1s (Member) commented Oct 3, 2022

What would neg_sampling_ratio_fixed=True mean? And what would the code path that makes it work without an error look like? Not sure I understand.

Padarn (Contributor) commented Oct 3, 2022

neg_sampling_ratio_fixed=True would mean an error is raised if there are not enough negative edges available to match that ratio.

neg_sampling_ratio_fixed=False would allow the ratio to shrink if there are not enough negative edges.

I think this could all happen before sampling (assuming edge_index only has unique elements, or checking that). In the neg_sampling_ratio_fixed=False case, when there are not enough missing edges, we would adjust the ratio down to the maximum so that train/val/test all get the same proportion; see the sketch below.
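
As an illustration (num_nodes, num_pos, and neg_sampling_ratio_fixed are placeholders for whatever the transform would expose; assumes an undirected graph without self-loops):

# num_pos: number of unique undirected positive edges in the graph.
# The complete graph on num_nodes vertices has num_nodes*(num_nodes-1)/2
# edges, so the pool of possible negatives is the difference:
max_neg = num_nodes * (num_nodes - 1) // 2 - num_pos
if int(num_pos * neg_sampling_ratio) > max_neg:
    if neg_sampling_ratio_fixed:
        raise ValueError(f'Cannot sample {int(num_pos * neg_sampling_ratio)} '
                         f'negative edges; only {max_neg} missing edges exist.')
    # Shrink the ratio so train/val/test all receive the same proportion.
    neg_sampling_ratio = max_neg / num_pos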

rusty1s (Member) commented Oct 3, 2022

Oh, I see. Yeah, we can adjust the ratio in this case. Raising a warning and doing it by default should work as well. This avoids the need to add another argument to the already pretty long list of arguments of RandomLinkSplit.

Padarn (Contributor) commented Oct 3, 2022

I'm also okay with that. I'll raise a PR when I get some time.

Padarn (Contributor) commented Oct 10, 2022

Addressed in the above PR

Padarn closed this as completed Oct 10, 2022