
🌼 🔍 Add filterer for negative samples based on Bloom Filter #401

Merged (38 commits into master, Apr 25, 2021)

Conversation


@mberr mberr commented Apr 23, 2021

This PR revises the filtering of generated negative samples during training. It removes the old, faulty implementation (cf. #272), and implements a fast filterer based on Bloom filters, as well as a slower, Python-set-based variant with guaranteed correctness. It also updates the documentation.

Bloom Filterer

The Bloom filterer uses a pure PyTorch implementation of a Bloom filter for existence checks.

False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set".

Thus, some negative samples may be rejected even though they do not occur in the training triples; however, a triple that does occur in the training triples can never pass the filter. The implementation allows specifying a desired error rate (defaulting to 0.1%), and derives appropriate data structure parameters from theoretical results.

Bloom filters scale nicely in time and space complexity.
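To make "derives appropriate data structure parameters" concrete: the standard Bloom filter sizing formulas give the number of bits m = ceil(-n ln p / (ln 2)^2) and the number of hash rounds k = ceil(-log2 p) for n elements at false-positive rate p. The following sketch is illustrative (the function name is made up, and it is not PyKEEN's actual code), but for the YAGO3-10 numbers it reproduces the size and round count printed in the example output below:

```python
import math

def bloom_parameters(num_elements: int, error_rate: float) -> tuple[int, int]:
    """Classic Bloom filter sizing: bit-array size m and hash rounds k
    for n elements at false-positive rate p (illustrative sketch)."""
    size = math.ceil(-num_elements * math.log(error_rate) / math.log(2) ** 2)
    rounds = math.ceil(-math.log2(error_rate))
    return size, rounds

# YAGO3-10 training has 1,079,040 triples; with error_rate=0.001 this
# yields size=15513993 bits and rounds=10, matching the repr below.
size, rounds = bloom_parameters(1_079_040, 0.001)
```

Note that the size grows only linearly in the number of elements and logarithmically in 1/p, which is why the index stays small even for large training sets.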

The implementation takes inspiration from https://github.com/hiway/python-bloom-filter/, with hash functions selected from https://github.com/skeeto/hash-prospector#two-round-functions.
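For context, hash-prospector's "two-round functions" are integer mixers of the shape xorshift, multiply, xorshift, multiply, xorshift. Below is a Python sketch of skeeto's well-known lowbias32 variant; whether these exact constants are the ones used in the PR is an assumption, and 32-bit overflow is emulated with explicit masking:

```python
MASK32 = 0xFFFFFFFF

def lowbias32(x: int) -> int:
    """Two-round integer mixer (xorshift/multiply) from hash-prospector.
    Constants are skeeto's published lowbias32 values; masking emulates
    32-bit unsigned arithmetic in Python."""
    x &= MASK32
    x ^= x >> 16
    x = (x * 0x7FEB352D) & MASK32
    x ^= x >> 15
    x = (x * 0x846CA68B) & MASK32
    x ^= x >> 16
    return x
```

A Bloom filter can derive its k probe positions from one such mixer by, e.g., combining the element's value with the round index before mixing, so each round acts as an independent hash function.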

Example results

YAGO3-10: exact results with an index of less than 2 MiB.
import humanize
from pykeen.datasets import get_dataset
from pykeen.sampling.filtering import BloomFilterer
dataset = get_dataset(dataset="yago310")
f = BloomFilterer(triples_factory=dataset.training, error_rate=0.001)
print(f)
print(humanize.naturalsize(f.bit_array.numel() / 8))
for key, value in dataset.factory_dict.items():
    print(f"{key:20}: {float(f.contains(batch=value.mapped_triples).float().mean()):3.2%}")
BloomFilterer(error_rate=0.001, size=15513993, rounds=10, ideal_num_elements=1079040, )
1.9 MB
training            : 100.00%
testing             : 0.00%
validation          : 0.00%

More thorough benchmarks are available at https://github.com/pykeen/bloom-filterer-benchmark

Related Issues

Fixes #272
Fixes #273

@mberr mberr changed the title Add filterer for negative samples based on Bloom Filter 🌼🔍 Add filterer for negative samples based on Bloom Filter Apr 23, 2021
@cthoyt cthoyt changed the title 🌼🔍 Add filterer for negative samples based on Bloom Filter 🌼 🔍 Add filterer for negative samples based on Bloom Filter Apr 23, 2021
        The triples.
        """
        for i in self.probe(batch=triples):
            self.bit_array[i] = True
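The excerpt above shows the insertion loop: each hash round yields one batch of probe positions, and the corresponding bits are set. A minimal pure-Python analogue of the same add/contains pattern (hypothetical names; Python's built-in hash stands in for the PR's PyTorch mixers):

```python
import math

class TinyBloom:
    """Minimal illustrative Bloom filter mirroring the add/contains pattern."""

    def __init__(self, num_elements: int, error_rate: float = 0.001):
        # standard sizing formulas for m bits and k hash rounds
        self.size = math.ceil(-num_elements * math.log(error_rate) / math.log(2) ** 2)
        self.rounds = math.ceil(-math.log2(error_rate))
        self.bits = [False] * self.size

    def probe(self, item):
        # one bit position per hash round; the round index seeds the hash
        for r in range(self.rounds):
            yield hash((r, item)) % self.size

    def add(self, item) -> None:
        for i in self.probe(item):
            self.bits[i] = True

    def __contains__(self, item) -> bool:
        # "possibly in set" only if every probed bit is set
        return all(self.bits[i] for i in self.probe(item))

bf = TinyBloom(num_elements=1_000, error_rate=0.001)
bf.add((0, 1, 2))
```

Membership queries then touch only k bits per item, independent of how many items were inserted.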

@cthoyt cthoyt Apr 23, 2021


is it possible to do self.bit_array[self.probe(batch=triples)] = True? If so, is there any benefit?

@mberr (author) replied:

I don't think this will work:

self.probe is a generator, but indexing into arrays requires a list instead. With a list, we would need to allocate num_probes * size arrays instead of the current more memory-friendly solution.

P.S.: self.bit_array[i] with duplicate i can lead to race conditions. Here we do not care, since each of the updates writes the same value.
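To illustrate the point about the generator (a hypothetical plain-Python sketch, not the PR's PyTorch code): yielding one batch of positions per hash round keeps peak memory at O(batch) instead of O(rounds × batch), and writing True through duplicate indices is order-independent because every write stores the same value.

```python
def probe(batch, rounds, size):
    # lazily yield one list of bit positions per hash round,
    # instead of materializing a rounds * len(batch) index array
    for r in range(rounds):
        yield [hash((r, triple)) % size for triple in batch]

size = 64
bit_array = [False] * size
batch = [(0, 1, 2), (3, 4, 5)]
for positions in probe(batch, rounds=4, size=size):
    for i in positions:
        bit_array[i] = True  # duplicate i just rewrites True: no conflict
```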


cthoyt commented Apr 23, 2021

@mberr can you add some high level docs on how/why you could use the bloom filter versus the normal one versus no filter?


mberr commented Apr 23, 2021

> @mberr can you add some high level docs on how/why you could use the bloom filter versus the normal one versus no filter?

The first decision is whether to use filtering for negative samples or not.

If you opt in, the existing implementation is quite expensive, since it more or less compares against all training triples every time a new batch of negative samples is created. Bloom filters are an index structure for existence queries, which offer (theoretical) improvements in space and time complexity over the existing solution.

One downside is that they do not guarantee their result: they might reject some negative samples even though these are not in the training triples. The chance of incorrectness can be controlled to some degree by trading a larger size and slower inference for a smaller incorrectness probability.

Part of the motivation comes from #273, where it was reported that filtering is quite slow. A simple solution was suggested there, which however does not scale well (it requires num_entities**2 * num_relations < MAX_LONG_VALUE). Bloom filters are an index structure specifically designed for existence checks, and they overcome this size limitation. I have not benchmarked enough to see whether the theoretical superiority is also reflected in better runtime.
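The scaling constraint is easy to check, presumably arising because the simple approach packs each possible triple into a single 64-bit integer id. A sketch with hypothetical sizes (the entity/relation counts below are illustrative):

```python
MAX_LONG = 2**63 - 1  # upper bound of a signed 64-bit integer (torch.long)

def fits_in_long(num_entities: int, num_relations: int) -> bool:
    # the simple approach needs a unique 64-bit id per possible triple,
    # so the total number of possible triples must fit in a long
    return num_entities**2 * num_relations < MAX_LONG

small = fits_in_long(123_182, 37)        # YAGO3-10-like sizes: fits
large = fits_in_long(100_000_000, 1_000)  # very large KG: overflows
```

A Bloom filter sidesteps this entirely: it hashes triples into a fixed-size bit array, so its size depends only on the number of stored triples and the desired error rate.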

There are also some concerns about the correctness of the existing non-Bloom-filter-based implementation (cf. #272), which seem justified at first glance.

P.S.: Filtering negative samples during training is different from the filtered evaluation setting. For training, we might want to trade some correctness for speed.


cthoyt commented Apr 23, 2021

@mberr great, that is very helpful. Now we just need a code example on how a user might turn on bloom filtering (or turn off filtering completely, since the default filter is turned on by default). I'll admit I know this isn't so obvious how to do at the moment - I think it means you need to use the negative_sampler_kwargs, right?


mberr commented Apr 23, 2021

> @mberr great, that is very helpful. Now we just need a code example on how a user might turn on bloom filtering (or turn off filtering completely, since the default filter is turned on by default). I'll admit I know this isn't so obvious how to do at the moment - I think it means you need to use the negative_sampler_kwargs, right?

I am no super-user of the pipeline method, but I guess

pipeline(model="distmult", dataset="nations", negative_sampler_kwargs=dict(filtered=True, filterer="bloom"))

should do the trick. Any suggestion on where to put this?

The formulation of standard negative sampling algorithms leads to the generation of known false negatives, which will
have low ranks and therefore worsen the metrics reported during rank-based evaluation. The "filtered setting" proposed
by [bordes2013]_ can remove known false negatives, but it comes with a large performance cost. Since the number of
false negatives is effectively small, this correction can be reasonably omitted. By default, PyKEEN does *not*
@cthoyt commented:

need to double check that the messaging is consistent. PyKEEN does not do filtering by default.

@mberr (author) replied:

@cthoyt the filtered setting from Bordes is different, and applies for the filtered evaluation. This PR is about filtering false negatives during training.

@mberr (author) added:

During evaluation, we use filtering by default (afaik)

@cthoyt cthoyt left a comment

@mberr double check that we're consistently messaging that PyKEEN does not use filtering by default, then after CI passes let's merge

@mberr mberr marked this pull request as ready for review April 25, 2021 09:24
@cthoyt cthoyt merged commit 5e5bd9d into master Apr 25, 2021
@cthoyt cthoyt deleted the bloom-filter-2 branch April 25, 2021 09:44
Merging this pull request closes: Improve negative triple filtering performance (#273) and Bug in filtered negative triples (#272).