
Upgrade sampler capability #209

Merged: lvermue merged 34 commits from upgrade_sampler into master on Dec 13, 2020

Conversation

lvermue (Member) commented Dec 10, 2020

This PR will:

  • Allow configuring which sides should be corrupted ({head, relation, tail})
  • Add filtering to the sampler, i.e., removing true triples found in the training dataset from the proposed corrupted triples (see the usage sketch below)
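A rough usage sketch of the two new options (the class name and import path are assumed from the files touched in this PR; triples_factory and the two batch variables are hypothetical):

```python
from pykeen.sampling import BasicNegativeSampler  # assumed import path

# Corrupt heads, relations, and tails, and filter out corrupted triples
# that actually occur in the training data.
sampler = BasicNegativeSampler(
    triples_factory=training_triples_factory,  # hypothetical triples source
    num_negs_per_pos=1,
    filtered=True,
    corruption_scheme=('h', 'r', 't'),
)
# Per the test change later in this PR, sample() returns a second value too.
negative_batch = sampler.sample(positive_batch=positive_batch)[0]
```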

mberr (Member) commented Dec 10, 2020

@lvermue sorry if I am commenting on work in progress 😅

lvermue (Member, Author) commented Dec 10, 2020

🙄

lvermue (Member, Author) commented Dec 10, 2020

@mali-git @mberr Does it make sense to include relation corruption for the Bernoulli sampler?
This was not done in Wang et al. (2014), or am I mistaken?

lvermue (Member, Author) commented Dec 11, 2020

@PyKEEN-bot Would you mind testing this PR?

@abstractmethod
def sample(self, positive_batch: torch.LongTensor) -> torch.LongTensor:
"""Generate negative samples from the positive batch."""
raise NotImplementedError

def _filter_negative_triples(self, negative_batch: torch.LongTensor) -> torch.LongTensor:
Review comment (Member):

Is this necessary? I thought we argued in the paper that we could skip this because the probability is low.

At least add a note to the docstring about when this assumption isn't so good.

Reply (Member Author):

That depends. In typical datasets the probability is low, but there are some subtleties.
For example, unfiltered sampling can act a bit like regularization, since it adds a noise signal to the training data. To complicate things further, the added noise depends on the ratio of true triples for a given entity-relation or entity-entity pair. The effects are therefore hard to disentangle, and a researcher might want to rule out the possibility of false negatives among the proposed negative triples.
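For illustration, a minimal sketch of such a filtering step (not the PR's actual implementation; the function and argument names are made up):

```python
import torch

def filter_negative_triples(
    negative_batch: torch.LongTensor,  # shape: (num_negatives, 3)
    true_triples: set,                 # set of (h, r, t) tuples from the training data
) -> torch.LongTensor:
    """Keep only proposed negatives that are not known true triples."""
    keep = torch.tensor(
        [tuple(triple.tolist()) not in true_triples for triple in negative_batch],
        dtype=torch.bool,
    )
    return negative_batch[keep]
```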

Reply (Member):

sounds like something that could go in the module docstring ;)

mali-git (Member) commented Dec 11, 2020

> Does it make sense to include relation corruption for the Bernoulli sampler?
> This was not done in Wang et al. (2014), or am I mistaken?

In my opinion, no: heads and tails are corrupted based on the properties of the relations, and uniformly sampling the relations would not reflect the initial motivation behind the Bernoulli sampler.
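For context: in Wang et al. (2014), the Bernoulli sampler computes, per relation, the average number of tails per head (tph) and heads per tail (hpt), and then corrupts the head with probability tph / (tph + hpt) and the tail otherwise. A minimal sketch of computing these probabilities (the helper name and data layout are illustrative, not PyKEEN's API):

```python
from collections import defaultdict

def bernoulli_head_corruption_probabilities(triples):
    """Compute p(corrupt head) per relation, following Wang et al. (2014)."""
    tails_per_head = defaultdict(set)  # (r, h) -> set of observed tails
    heads_per_tail = defaultdict(set)  # (r, t) -> set of observed heads
    for h, r, t in triples:
        tails_per_head[r, h].add(t)
        heads_per_tail[r, t].add(h)

    # Aggregate the two statistics per relation.
    stats = defaultdict(lambda: [0, 0, 0, 0])  # r -> [tph_sum, n_heads, hpt_sum, n_tails]
    for (r, _), tails in tails_per_head.items():
        stats[r][0] += len(tails)
        stats[r][1] += 1
    for (r, _), heads in heads_per_tail.items():
        stats[r][2] += len(heads)
        stats[r][3] += 1

    probabilities = {}
    for r, (tph_sum, n_heads, hpt_sum, n_tails) in stats.items():
        tph = tph_sum / n_heads  # average tails per head
        hpt = hpt_sum / n_tails  # average heads per tail
        probabilities[r] = tph / (tph + hpt)
    return probabilities
```

The sampler then draws from Bernoulli(p_r) per positive triple to decide which side to corrupt, which is exactly why uniformly swapping in random relations would sit outside this scheme.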

lvermue (Member, Author) commented Dec 11, 2020

@PyKEEN-bot test

:param num_negs_per_pos: Number of negative samples to make per positive triple. Defaults to 1.
:param filtered: Whether proposed corrupted triples that are in the training data should be filtered.
    Defaults to False.
:param corruption_scheme: Which sides ('h', 'r', 't') should be corrupted. Defaults to head and tail ('h', 't').
Review comment (Member):

Did the reviewer have a specific use case in mind where you might want something other than the default?

Reply (Member Author):

No, but there could be reasonable situations, e.g. when working with images where you know the objects/entities but are interested in training a model to reason about the most likely relation.
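In that setting, the new corruption_scheme argument would allow corrupting only the relation (again assuming the class name from this PR's files, with a hypothetical triples source):

```python
# Corrupt only the relation, keeping head and tail entities fixed.
sampler = BasicNegativeSampler(
    triples_factory=training_triples_factory,
    corruption_scheme=('r',),
)
```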

Reply (Member):

I cannot find a reference right now, but IIRC visual relation detection usually first detects all objects in the image (h/t) and then predicts the relations between them (r). One argument is that the distribution of visual representations of relations has more of a long tail and may include representations of the objects: e.g., in (man, rides, bike), the visual representation of rides likely (partially) shows both the man and the bike.

Far better explanations can be given by @sharifza 🙂

cthoyt (Member) left a comment:

The new API looks good and seems to be tested and documented very well.

There are two remaining comments from my side: one about possibly adding more documentation on a use case for non-default corruption schemes. If there are good reasons a person should know this feature exists besides satisfying the reviewer, it would be nice to include them in this PR. The second is just a cosmetic thing. After CI is done, feel free to address those or just merge. Thanks for the good work.

@mberr can hopefully check the more methodological aspects and then get this one done. Thanks Laurent!

lvermue (Member, Author) commented Dec 12, 2020

@PyKEEN-bot Here's the merge contestant 🎉

mberr (Member) left a comment:

Besides these comments, LGTM.

(Several inline review threads on src/pykeen/sampling/basic_negative_sampler.py and src/pykeen/sampling/negative_sampler.py were resolved.)
```diff
@@ -59,7 +59,7 @@ def setUp(self) -> None:

 def test_sample(self) -> None:
     # Generate negative sample
-    negative_batch = self.negative_sampler.sample(positive_batch=self.positive_batch)
+    negative_batch, _ = self.negative_sampler.sample(positive_batch=self.positive_batch)
```
mberr (Member) commented Dec 13, 2020:

TL;DR: here, this comment is unnecessary.

_ is actually a variable name, although by code convention it is normally not used for "proper" variables. It does, however, mean that the second return value is bound to a variable and kept in memory until it goes out of scope. Thus, in particular with PyTorch code, it is usually preferable to use

negative_batch = self.negative_sampler.sample(positive_batch=self.positive_batch)[0]

As said in the TL;DR, this is only a unit test, so minimizing memory utilization shouldn't be of primary interest here 😅
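A small illustration of the point about `_` keeping the tensor alive (hypothetical example, not from the PR):

```python
import torch

def sample_like():
    # Stand-in for sample(): returns a batch plus a large auxiliary tensor.
    return torch.zeros(4, 3, dtype=torch.long), torch.zeros(10_000, 10_000)

# The name `_` still holds a reference, so the large tensor stays in memory
# until `_` goes out of scope or is rebound.
negative_batch, _ = sample_like()

# Indexing drops the tuple (and the large tensor) right away, so it can be freed.
negative_batch = sample_like()[0]
```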

@lvermue lvermue marked this pull request as ready for review December 13, 2020 12:10
lvermue (Member, Author) commented Dec 13, 2020

@PyKEEN-bot What do you think about the latest commit?

@lvermue lvermue merged commit 421ced1 into master Dec 13, 2020
@lvermue lvermue deleted the upgrade_sampler branch December 13, 2020 20:29