Bloom filters

ni implements a minimalistic bloom filter library to do efficient membership queries. This is visible in two ways -- first as a set of Perl functions:

$f = bloom_new($n, $p): creates an empty bloom filter designed to contain $n elements with a false-positive probability of $p.
bloom_add($f, $x): adds the string $x to $f destructively.
bloom_contains($f, $x): returns 1 if $f might contain $x, 0 if it definitely does not contain $x.

These functions are available inside all Perl contexts. The other, and more common, way that you'd use bloom filters is by a trio of operators zB (compress into bloom filter), rb (rows which match bloom filter), and r^b (rows which don't match bloom filter). The syntax is this:

$ ni rows.txt zB45 \>filter     # compress lines as elements
                                # resulting binary filter will store 10^4 items
                                # at 10^-5 false positive rate

$ ni other-data.txt rbAfilter   # return rows of other-data.txt for which field
                                # A occurs inside the bloom filter

$ ni other-data.txt r^bAfilter  # return rows of other-data.txt for which field
                                # A does not occur inside the bloom filter

Because the filter data is just a lambda, it's also possible to say something like rbA[rows.txt zB45]. For example:

$ ni nE4 rbA[i108 i571 i3491 zB45]
108
571
3491
$ ni nE4 r^bA[nE4 rp'a != 61 && a != 108' zB45]
61
108

You can also stuff a bloom filter into a data closure:

$ ni ::bloom[i108 i571 i3491 zB45] nE4 fAA rbA//:bloom
108	108
571	571
3491	3491

And, of course, you can use bloom filters directly:

$ ni ::bloom[i100 i101 i102 zB45] nE4 p'r a, a + 1' rp'bloom_contains bloom, a'
100	101
101	102
102	103

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bloom.md

bloom.md

Bloom filters

Files

bloom.md

Latest commit

History

bloom.md

File metadata and controls

Bloom filters