🌼 🔍 Add filterer for negative samples based on Bloom Filter #401
Conversation
        The triples.
        """
        for i in self.probe(batch=triples):
            self.bit_array[i] = True
Is it possible to do self.bit_array[self.probe(batch=triples)] = True? If so, is there any benefit?
I don't think this will work: self.probe is a generator, but indexing into arrays requires a list instead. With a list, we would need to allocate num_probes * size arrays instead of the current, more memory-friendly solution.

P.S.: self.bit_array[i] with duplicate i can lead to race conditions. Here we do not care, since each of the updates writes the same value.
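For illustration, here is a minimal pure-Python sketch of the two update styles discussed above (the probe function, its defaults, and the list-based bit array are hypothetical stand-ins, not the PR's actual code):

```python
def probe(batch, num_probes=3, size=16):
    """Yield one bit index at a time (a generator, so no large temporary)."""
    for value in batch:
        for i in range(num_probes):
            yield hash((value, i)) % size

size = 16
bit_array = [False] * size

# Generator-based update: constant extra memory, one write per index.
for i in probe([1, 2, 3], size=size):
    bit_array[i] = True

# Fancy indexing would first materialize all num_probes * batch_size indices:
indices = list(probe([1, 2, 3], size=size))

# Duplicate indices are harmless here: every write stores the same value True.
bit_array2 = [False] * size
for i in indices:
    bit_array2[i] = True
```

Both variants set the same bits; the difference is only the temporary allocation for the index list.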
@mberr can you add some high-level docs on how/why you would use the Bloom filter versus the normal one versus no filter?
The first decision is whether to filter negative samples at all. If you do, the current implementation is quite expensive, since it more or less compares against all triples every time a new batch of negative samples is created. Bloom filters are an index structure for existence queries which provide (theoretical) space and time improvements over the existing solution. One downside is that they do not guarantee their result: they might reject some negative samples even though these do not occur in the training triples. The chance of incorrectness can be controlled to some degree by trading a larger index size / slower lookups for a smaller error probability.

Part of the motivation comes from #273, where it was reported that filtering is quite slow; a simple solution was also proposed there, which however does not scale very well (it requires

There are also some concerns about the correctness of the existing non-Bloom-filter-based implementation, cf. #272, which seem justified at first glance.

P.S.: Filtering negative samples during training is different from the filtered evaluation setting. For training, we might want to accept some incorrectness in exchange for speed.
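The asymmetry described above (possible false rejections of genuinely new negatives, but never a missed training triple) can be illustrated with a toy Bloom filter. This is a pure-Python sketch for intuition only; TinyBloom and its SHA-256-based probing are inventions for illustration, not PyKEEN's implementation:

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=64, num_probes=3):
        self.size = size
        self.num_probes = num_probes
        self.bits = [False] * size

    def _indices(self, item):
        # Derive num_probes bit positions from independent-ish hashes.
        for i in range(self.num_probes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = True

    def __contains__(self, item):
        # All probed bits set -> "probably present"; any unset -> definitely absent.
        return all(self.bits[i] for i in self._indices(item))

training = {("a", "r", "b"), ("b", "r", "c")}
bloom = TinyBloom()
for triple in training:
    bloom.add(triple)

# Every stored triple tests positive, so known triples are always rejected;
# an unseen triple *may* collide on all probes and be wrongly rejected too.
```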
@mberr great, that is very helpful. Now we just need a code example showing how a user might turn on Bloom filtering (or turn off filtering completely, since the filter is turned on by default). I'll admit it isn't so obvious how to do this at the moment - I think it means you need to use the
I am no power user of the pipeline method, but I guess pipeline(model="distmult", dataset="nations", negative_sampler_kwargs=dict(filtered=True, filterer="bloom")) should do the trick. Any suggestion on where to put this?
Trigger CI
The formulation of standard negative sampling algorithms leads to the generation of known false negatives, which will
have low ranks and therefore worsen the metrics reported during rank-based evaluation. The "filtered setting" proposed
by [bordes2013]_ can remove known false negatives, but it comes with a large performance cost. Since the number of
false negatives is effectively small, this correction can be reasonably omitted. By default, PyKEEN does *not*
need to double check that the messaging is consistent. PyKEEN does not do filtering by default.
@cthoyt the filtered setting from Bordes is different, and applies to the filtered evaluation. This PR is about filtering false negatives during training.
During evaluation, we use filtering by default (afaik)
Trigger CI
@mberr double check that we're consistently messaging that PyKEEN does not use filtering by default, then after CI passes let's merge
Trigger CI
This PR revises the filtering of generated negative samples during training. It removes the old, faulty implementation (cf. #272) and implements a fast filterer based on Bloom filters, as well as a slower, Python-set-based variant with guaranteed correctness. It also updates the documentation.
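For intuition, a set-based filterer with guaranteed correctness could look roughly like the following sketch (the class name, method name, and triple representation are assumptions for illustration, not PyKEEN's exact API):

```python
class PythonSetFilterer:
    """Exact but slower filterer: drop negatives that occur in training."""

    def __init__(self, triples):
        # Store training triples as hashable tuples for O(1) membership tests.
        self.triples = {tuple(t) for t in triples}

    def filter(self, batch):
        # Keep exactly those candidate negatives absent from the training set.
        return [t for t in batch if tuple(t) not in self.triples]

training = [(0, 0, 1), (1, 0, 2)]
filterer = PythonSetFilterer(training)
negatives = [(0, 0, 1), (0, 0, 2), (2, 0, 1)]
kept = filterer.filter(negatives)  # (0, 0, 1) is a known false negative
```

Unlike the Bloom filterer, this variant never rejects a genuinely new negative, but the set costs memory proportional to the number of training triples.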
Bloom Filterer
The Bloom filterer uses a pure PyTorch implementation of a Bloom filter for existence checks.
Thus, it can happen that some negative samples are rejected even though they do not occur in the training triples; however, non-rejection of existing triples is impossible. The implementation allows specifying a desired error rate (defaulting to 0.1%) and derives appropriate data-structure parameters based on theoretical results.
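The standard Bloom filter bounds give, for n elements and target error rate p, a bit-array size of m = -n ln p / (ln 2)^2 and k = (m / n) ln 2 hash functions. A sketch of such a parameter derivation (bloom_parameters is a hypothetical helper, not necessarily the formula PyKEEN uses):

```python
import math

def bloom_parameters(num_elements, error_rate=0.001):
    """Derive bit-array size m and probe count k from standard Bloom bounds."""
    m = math.ceil(-num_elements * math.log(error_rate) / math.log(2) ** 2)
    k = max(1, round(m / num_elements * math.log(2)))
    return m, k

# E.g. for one million triples at the 0.1% default error rate,
# m lands around 14.4 million bits, i.e. under 2 MiB of index.
m, k = bloom_parameters(1_000_000, 0.001)
```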
Bloom filters scale nicely in time and space complexity.
For the implementation, I took inspiration from https://github.com/hiway/python-bloom-filter/, and selected hash functions from https://github.com/skeeto/hash-prospector#two-round-functions.
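For reference, one of the two-round 32-bit mixing functions listed in the hash-prospector repository (lowbias32), transcribed to Python with explicit 32-bit masking; whether this exact variant is the one used in the PR is not stated here:

```python
MASK32 = 0xFFFFFFFF

def lowbias32(x: int) -> int:
    """Two-round 32-bit integer hash (constants from hash-prospector)."""
    x ^= x >> 16
    x = (x * 0x7FEB352D) & MASK32  # first multiply round
    x ^= x >> 15
    x = (x * 0x846CA68B) & MASK32  # second multiply round
    x ^= x >> 16
    return x

# Mapping a hash value to a Bloom-filter bit index:
index = lowbias32(12345) % 1024
```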
Example results
YAGO3-10
Exact results with less than 2 MiB index size. See more thorough benchmarking at https://github.com/pykeen/bloom-filterer-benchmark
Dependencies
Related Issues
Fixes #272
Fixes #273