
Upgrade sampler capability #209

Merged: lvermue merged 34 commits from upgrade_sampler into master on Dec 13, 2020

Conversation

lvermue (Member) commented Dec 10, 2020

This PR will:

  • Allow configuring which sides should be corrupted ({head, relation, tail})
  • Add filtering to the sampler, i.e., removing true triples found in the training dataset from the proposed corrupted triples (see the usage sketch below)
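A rough usage sketch of the two new options (the class name and import path are assumed from the files touched in this PR; triples_factory and the two batch variables are hypothetical):

```python
from pykeen.sampling import BasicNegativeSampler  # assumed import path

# Corrupt heads, relations, and tails, and filter out corrupted triples
# that actually occur in the training data.
sampler = BasicNegativeSampler(
    triples_factory=training_triples_factory,  # hypothetical triples source
    num_negs_per_pos=1,
    filtered=True,
    corruption_scheme=('h', 'r', 't'),
)
# Per the test change later in this PR, sample() returns a second value too.
negative_batch = sampler.sample(positive_batch=positive_batch)[0]
```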

mberr (Member) commented Dec 10, 2020

@lvermue sorry if I am commenting on work in progress 😅

lvermue (Member, Author) commented Dec 10, 2020

🙄

lvermue (Member, Author) commented Dec 10, 2020

@mali-git @mberr Does it make sense to include relation corruption for the Bernoulli sampler?
This was not done in Wang et al. (2014), or am I mistaken?

lvermue (Member, Author) commented Dec 11, 2020

@PyKEEN-bot Would you mind testing this PR?

@abstractmethod
def sample(self, positive_batch: torch.LongTensor) -> torch.LongTensor:
"""Generate negative samples from the positive batch."""
raise NotImplementedError

def _filter_negative_triples(self, negative_batch: torch.LongTensor) -> torch.LongTensor:
Review comment (Member):

Is this necessary? I thought we argued in the paper that we could skip this because the probability is low.

At least add a note to the docstring about when this assumption isn't so good.

Reply (Member Author):

That depends. In typical datasets the probability is low, but there are some subtleties.
For example, unfiltered sampling can act a bit like regularization, since it adds a noise signal to the training data. To complicate things further, the added noise depends on the ratio of true triples for a given entity-relation or entity-entity pair. The effects are therefore hard to disentangle, and a researcher might want to rule out the possibility of false negatives among the proposed negative triples.
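For illustration, a minimal sketch of such a filtering step (not the PR's actual implementation; the function and argument names are made up):

```python
import torch

def filter_negative_triples(
    negative_batch: torch.LongTensor,  # shape: (num_negatives, 3)
    true_triples: set,                 # set of (h, r, t) tuples from the training data
) -> torch.LongTensor:
    """Keep only proposed negatives that are not known true triples."""
    keep = torch.tensor(
        [tuple(triple.tolist()) not in true_triples for triple in negative_batch],
        dtype=torch.bool,
    )
    return negative_batch[keep]
```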

Reply (Member):

sounds like something that could go in the module docstring ;)

mali-git (Member) commented Dec 11, 2020

> Does it make sense to include relation corruption for the Bernoulli sampler?
> This was not done in Wang et al. (2014), or am I mistaken?

In my opinion, no: heads and tails are corrupted based on the properties of the relations, and uniformly sampling the relations would not reflect the initial motivation behind the Bernoulli sampler.
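For context: in Wang et al. (2014), the Bernoulli sampler computes, per relation, the average number of tails per head (tph) and heads per tail (hpt), and then corrupts the head with probability tph / (tph + hpt) and the tail otherwise. A minimal sketch of computing these probabilities (the helper name and data layout are illustrative, not PyKEEN's API):

```python
from collections import defaultdict

def bernoulli_head_corruption_probabilities(triples):
    """Compute p(corrupt head) per relation, following Wang et al. (2014)."""
    tails_per_head = defaultdict(set)  # (r, h) -> set of observed tails
    heads_per_tail = defaultdict(set)  # (r, t) -> set of observed heads
    for h, r, t in triples:
        tails_per_head[r, h].add(t)
        heads_per_tail[r, t].add(h)

    # Aggregate the two statistics per relation.
    stats = defaultdict(lambda: [0, 0, 0, 0])  # r -> [tph_sum, n_heads, hpt_sum, n_tails]
    for (r, _), tails in tails_per_head.items():
        stats[r][0] += len(tails)
        stats[r][1] += 1
    for (r, _), heads in heads_per_tail.items():
        stats[r][2] += len(heads)
        stats[r][3] += 1

    probabilities = {}
    for r, (tph_sum, n_heads, hpt_sum, n_tails) in stats.items():
        tph = tph_sum / n_heads  # average tails per head
        hpt = hpt_sum / n_tails  # average heads per tail
        probabilities[r] = tph / (tph + hpt)
    return probabilities
```

The sampler then draws from Bernoulli(p_r) per positive triple to decide which side to corrupt, which is exactly why uniformly swapping in random relations would sit outside this scheme.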

lvermue (Member, Author) commented Dec 11, 2020

@PyKEEN-bot test

:param num_negs_per_pos: Number of negative samples to make per positive triple. Defaults to 1.
:param filtered: Whether proposed corrupted triples that are in the training data should be filtered.
    Defaults to False.
:param corruption_scheme: Which sides ('h', 'r', 't') should be corrupted. Defaults to head and tail ('h', 't').
Review comment (Member):

Did the reviewer have a specific use case in mind where you might want something other than the default?

Reply (Member Author):

No, but there could be reasonable situations, e.g. when working with images where you know the objects/entities but are interested in training a model to reason about the most likely relation.
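In that setting, the new corruption_scheme argument would allow corrupting only the relation (again assuming the class name from this PR's files, with a hypothetical triples source):

```python
# Corrupt only the relation, keeping head and tail entities fixed.
sampler = BasicNegativeSampler(
    triples_factory=training_triples_factory,
    corruption_scheme=('r',),
)
```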

Reply (Member):

I cannot find a reference right now, but IIRC visual relation detection usually first detects all objects in the image (h/t) and then predicts the relations between them (r). One argument is that the distribution of visual representations of relations has more of a long tail and may include representations of the objects: e.g., in (man, rides, bike), the visual representation of rides likely (partially) shows both the man and the bike.

Far better explanations can be given by @sharifza 🙂

cthoyt (Member) left a comment:

The new API looks good and seems to be tested and documented very well.

There are two remaining comments from my side: one about possibly adding more documentation on a use case for non-default corruption schemes. If there are good reasons a person should know this feature exists besides satisfying the reviewer, it would be nice to include them in this PR. The second is just a cosmetic thing. After CI is done, feel free to address those or just merge. Thanks for the good work.

@mberr can hopefully check the more methodological aspects and then get this one done. Thanks Laurent!

lvermue (Member, Author) commented Dec 12, 2020

@PyKEEN-bot Here's the merge contestant 🎉

mberr (Member) left a comment:

Besides these comments, LGTM.

(Several inline review threads on src/pykeen/sampling/basic_negative_sampler.py and src/pykeen/sampling/negative_sampler.py were resolved.)
```diff
@@ -59,7 +59,7 @@ def setUp(self) -> None:

 def test_sample(self) -> None:
     # Generate negative sample
-    negative_batch = self.negative_sampler.sample(positive_batch=self.positive_batch)
+    negative_batch, _ = self.negative_sampler.sample(positive_batch=self.positive_batch)
```
mberr (Member) commented Dec 13, 2020:

TL;DR: here, this comment is unnecessary.

_ is actually a variable name, although by code convention it is normally not used for "proper" variables. It does, however, mean that the second return value is bound to a variable and kept in memory until it goes out of scope. Thus, in particular with PyTorch code, it is usually preferable to use

negative_batch = self.negative_sampler.sample(positive_batch=self.positive_batch)[0]

As said in the TL;DR, this is only a unit test, so minimizing memory utilization shouldn't be of primary interest here 😅
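A small illustration of the point about `_` keeping the tensor alive (hypothetical example, not from the PR):

```python
import torch

def sample_like():
    # Stand-in for sample(): returns a batch plus a large auxiliary tensor.
    return torch.zeros(4, 3, dtype=torch.long), torch.zeros(10_000, 10_000)

# The name `_` still holds a reference, so the large tensor stays in memory
# until `_` goes out of scope or is rebound.
negative_batch, _ = sample_like()

# Indexing drops the tuple (and the large tensor) right away, so it can be freed.
negative_batch = sample_like()[0]
```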

@lvermue lvermue marked this pull request as ready for review December 13, 2020 12:10
lvermue (Member, Author) commented Dec 13, 2020

@PyKEEN-bot What do you think about the latest commit?

@lvermue lvermue merged commit 421ced1 into master Dec 13, 2020
@lvermue lvermue deleted the upgrade_sampler branch December 13, 2020 20:29