🌼 🔍 Add filterer for negative samples based on Bloom Filter #401
Conversation
        The triples.
        """
        for i in self.probe(batch=triples):
            self.bit_array[i] = True
Is it possible to do self.bit_array[self.probe(batch=triples)] = True? If so, is there any benefit?
I don't think this will work: self.probe is a generator, but indexing into arrays requires a list instead. With a list, we would need to allocate num_probes * size arrays instead of the current, more memory-friendly solution.

P.S.: self.bit_array[i] with duplicate i can lead to race conditions. Here we do not care, since each of the updates writes the same value.
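For illustration, here is a minimal pure-Python sketch of the two update styles discussed above (the probe function, its defaults, and the list-based bit array are hypothetical stand-ins, not the PR's actual code):

```python
def probe(batch, num_probes=3, size=16):
    """Yield one bit index at a time (a generator, so no large temporary)."""
    for value in batch:
        for i in range(num_probes):
            yield hash((value, i)) % size

size = 16
bit_array = [False] * size

# Generator-based update: constant extra memory, one write per index.
for i in probe([1, 2, 3], size=size):
    bit_array[i] = True

# Fancy indexing would first materialize all num_probes * batch_size indices:
indices = list(probe([1, 2, 3], size=size))

# Duplicate indices are harmless here: every write stores the same value True.
bit_array2 = [False] * size
for i in indices:
    bit_array2[i] = True
```

Both variants set the same bits; the difference is only the temporary allocation for the index list.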
@mberr can you add some high-level docs on how/why you would use the Bloom filter versus the normal one versus no filter?
The first decision is whether to filter negative samples at all. If you do, the current implementation is quite expensive, since it more or less compares against all triples every time a new batch of negative samples is created. Bloom filters are an index structure for existence queries which provide (theoretical) space and time improvements over the existing solution. One downside is that they do not guarantee their result: they might reject some negative samples even though these do not occur in the training triples. The chance of incorrectness can be controlled to some degree by trading a larger index size / slower lookups for a smaller error probability.

Part of the motivation comes from #273, where it was reported that filtering is quite slow; a simple solution was also proposed there, which however does not scale very well (it requires

There are also some concerns about the correctness of the existing non-Bloom-filter-based implementation, cf. #272, which seem justified at first glance.

P.S.: Filtering negative samples during training is different from the filtered evaluation setting. For training, we might want to accept some incorrectness in exchange for speed.
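The asymmetry described above (possible false rejections of genuinely new negatives, but never a missed training triple) can be illustrated with a toy Bloom filter. This is a pure-Python sketch for intuition only; TinyBloom and its SHA-256-based probing are inventions for illustration, not PyKEEN's implementation:

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=64, num_probes=3):
        self.size = size
        self.num_probes = num_probes
        self.bits = [False] * size

    def _indices(self, item):
        # Derive num_probes bit positions from independent-ish hashes.
        for i in range(self.num_probes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = True

    def __contains__(self, item):
        # All probed bits set -> "probably present"; any unset -> definitely absent.
        return all(self.bits[i] for i in self._indices(item))

training = {("a", "r", "b"), ("b", "r", "c")}
bloom = TinyBloom()
for triple in training:
    bloom.add(triple)

# Every stored triple tests positive, so known triples are always rejected;
# an unseen triple *may* collide on all probes and be wrongly rejected too.
```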
@mberr great, that is very helpful. Now we just need a code example showing how a user might turn on Bloom filtering (or turn off filtering completely, since the filter is turned on by default). I'll admit it isn't so obvious how to do this at the moment - I think it means you need to use the
I am no power user of the pipeline method, but I guess pipeline(model="distmult", dataset="nations", negative_sampler_kwargs=dict(filtered=True, filterer="bloom")) should do the trick. Any suggestion on where to put this?
Trigger CI
The formulation of standard negative sampling algorithms leads to the generation of known false negatives, which will
have low ranks and therefore worsen the metrics reported during rank-based evaluation. The "filtered setting" proposed
by [bordes2013]_ can remove known false negatives, but it comes with a large performance cost. Since the number of
false negatives is effectively small, this correction can be reasonably omitted. By default, PyKEEN does *not*
need to double check that the messaging is consistent. PyKEEN does not do filtering by default.
@cthoyt the filtered setting from Bordes is different, and applies to the filtered evaluation. This PR is about filtering false negatives during training.
During evaluation, we use filtering by default (afaik)
Trigger CI
@mberr double check that we're consistently messaging that PyKEEN does not use filtering by default, then after CI passes let's merge
Trigger CI
This PR revises the filtering of generated negative samples during training. It removes the old, faulty implementation (cf. #272) and implements a fast filterer based on Bloom filters, as well as a slower, Python-set-based variant with guaranteed correctness. It also updates the documentation.
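For intuition, a set-based filterer with guaranteed correctness could look roughly like the following sketch (the class name, method name, and triple representation are assumptions for illustration, not PyKEEN's exact API):

```python
class PythonSetFilterer:
    """Exact but slower filterer: drop negatives that occur in training."""

    def __init__(self, triples):
        # Store training triples as hashable tuples for O(1) membership tests.
        self.triples = {tuple(t) for t in triples}

    def filter(self, batch):
        # Keep exactly those candidate negatives absent from the training set.
        return [t for t in batch if tuple(t) not in self.triples]

training = [(0, 0, 1), (1, 0, 2)]
filterer = PythonSetFilterer(training)
negatives = [(0, 0, 1), (0, 0, 2), (2, 0, 1)]
kept = filterer.filter(negatives)  # (0, 0, 1) is a known false negative
```

Unlike the Bloom filterer, this variant never rejects a genuinely new negative, but the set costs memory proportional to the number of training triples.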
Bloom Filterer
The Bloom filterer uses a pure PyTorch implementation of a Bloom filter for existence checks.
Thus, it can happen that some negative samples are rejected even though they do not occur in the training triples; however, non-rejection of existing triples is impossible. The implementation allows specifying a desired error rate (defaulting to 0.1%) and derives appropriate data-structure parameters based on theoretical results.
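The standard Bloom filter bounds give, for n elements and target error rate p, a bit-array size of m = -n ln p / (ln 2)^2 and k = (m / n) ln 2 hash functions. A sketch of such a parameter derivation (bloom_parameters is a hypothetical helper, not necessarily the formula PyKEEN uses):

```python
import math

def bloom_parameters(num_elements, error_rate=0.001):
    """Derive bit-array size m and probe count k from standard Bloom bounds."""
    m = math.ceil(-num_elements * math.log(error_rate) / math.log(2) ** 2)
    k = max(1, round(m / num_elements * math.log(2)))
    return m, k

# E.g. for one million triples at the 0.1% default error rate,
# m lands around 14.4 million bits, i.e. under 2 MiB of index.
m, k = bloom_parameters(1_000_000, 0.001)
```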
Bloom filters scale nicely in time and space complexity.
For the implementation, I took inspiration from https://github.com/hiway/python-bloom-filter/, and selected hash functions from https://github.com/skeeto/hash-prospector#two-round-functions.
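For reference, one of the two-round 32-bit mixing functions listed in the hash-prospector repository (lowbias32), transcribed to Python with explicit 32-bit masking; whether this exact variant is the one used in the PR is not stated here:

```python
MASK32 = 0xFFFFFFFF

def lowbias32(x: int) -> int:
    """Two-round 32-bit integer hash (constants from hash-prospector)."""
    x ^= x >> 16
    x = (x * 0x7FEB352D) & MASK32  # first multiply round
    x ^= x >> 15
    x = (x * 0x846CA68B) & MASK32  # second multiply round
    x ^= x >> 16
    return x

# Mapping a hash value to a Bloom-filter bit index:
index = lowbias32(12345) % 1024
```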
Example results
YAGO3-10
Exact results with less than 2 MiB index size. See more thorough benchmarking at https://github.com/pykeen/bloom-filterer-benchmark
Dependencies
Related Issues
Fixes #272
Fixes #273