Chunk filters #122

Merged
merged 1 commit into from
Jun 20, 2023

kjnilsson
Contributor

@kjnilsson kjnilsson commented May 15, 2023

  • Initial implementation
  • Use unfiltered as a special kind of filter
  • Allow multiple filters for reads
  • bloom filter size config
  • Optimise bloom filter hashing to hash only once and bit mangle to get two values out of it.
  • More tests!
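
The "hash only once" optimisation in the list above might look roughly like the sketch below. The hash function actually used by the PR is not shown on this page, so `blake2b` and the 32/32 bit split are stand-in assumptions, and `two_bit_positions` is a hypothetical name.

```python
import hashlib


def two_bit_positions(value: bytes, nbits: int) -> tuple[int, int]:
    """Derive two bit indexes from a single 64-bit hash of `value`."""
    # Assumption: blake2b stands in for whatever hash the PR really uses.
    digest = hashlib.blake2b(value, digest_size=8).digest()
    h = int.from_bytes(digest, "big")
    # Split the one 64-bit result into two 32-bit halves
    # instead of running a second hash function.
    high, low = h >> 32, h & 0xFFFFFFFF
    return high % nbits, low % nbits
```

The point of the split is that the relatively expensive step, hashing the filter value, runs once per value, while the two indexes come from cheap shifts and masks.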

@kjnilsson kjnilsson changed the title first cut chunk filtering Chunk filters May 18, 2023
@kjnilsson kjnilsson force-pushed the chunk-filtering branch 2 times, most recently from 4093202 to 0afaa38 Compare June 20, 2023 09:15
This implements a reasonably efficient way to avoid reading chunks
that contain no messages the reader is interested in.

For each chunk where at least one write includes a filter value, a
bloom filter is added to the chunk immediately after the chunk header.
The chunk filter is calculated from all the filter values in the
chunk. If any writes contain no filter value, the chunk filter's first
bit is also set to indicate that there are "unfiltered" entries in the
chunk.
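
A minimal sketch of the write side described above, under stated assumptions: "first bit" is taken to mean the most significant bit of the first byte, hashed positions are kept away from that bit, and `blake2b` stands in for the real hash. All function names are hypothetical.

```python
import hashlib
from typing import Iterable, Optional

UNFILTERED_BIT = 0  # assumption: "first bit" = MSB of the first byte


def bit_positions(value: bytes, nbits: int) -> tuple[int, int]:
    h = int.from_bytes(hashlib.blake2b(value, digest_size=8).digest(), "big")
    # Map into 1..nbits-1 so bit 0 stays reserved for "unfiltered" (assumption).
    return 1 + (h >> 32) % (nbits - 1), 1 + (h & 0xFFFFFFFF) % (nbits - 1)


def set_bit(bits: bytearray, pos: int) -> None:
    bits[pos // 8] |= 0x80 >> (pos % 8)


def build_chunk_filter(values: Iterable[Optional[bytes]],
                       size: int = 16) -> Optional[bytes]:
    """Return the filter for one chunk, or None if no write had a filter value.

    `values` holds one item per write: the filter value, or None if absent.
    """
    values = list(values)
    if all(v is None for v in values):
        return None  # no filter is added to this chunk
    bits = bytearray(size)
    for v in values:
        if v is None:
            set_bit(bits, UNFILTERED_BIT)  # chunk also has unfiltered entries
        else:
            for pos in bit_positions(v, size * 8):
                set_bit(bits, pos)
    return bytes(bits)
```

For example, `build_chunk_filter([b"eu", None])` yields a 16-byte filter with the unfiltered bit set, while `build_chunk_filter([None, None])` yields `None`.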

At read time, an offset reader can supply one or more filter values
that are matched against the bloom filter. If none of the read filter
values match, the chunk body is not read and the reader moves on to
the next chunk, and so on.

There is also an option to include "unfiltered" entries in the match
evaluation.
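
The read-side decision could be sketched as follows. This assumes two bit positions per value, the first bit (MSB of byte 0) as the "unfiltered" flag, and `blake2b` as a stand-in hash; `chunk_matches` and `match_unfiltered` are hypothetical names. How chunks that carry no filter at all are handled is not covered here.

```python
import hashlib
from typing import Iterable


def bit_positions(value: bytes, nbits: int) -> tuple[int, int]:
    h = int.from_bytes(hashlib.blake2b(value, digest_size=8).digest(), "big")
    # Assumption: bit 0 is reserved, so positions fall in 1..nbits-1.
    return 1 + (h >> 32) % (nbits - 1), 1 + (h & 0xFFFFFFFF) % (nbits - 1)


def bit_is_set(bits: bytes, pos: int) -> bool:
    return bool(bits[pos // 8] & (0x80 >> (pos % 8)))


def chunk_matches(chunk_filter: bytes, wanted: Iterable[bytes],
                  match_unfiltered: bool = False) -> bool:
    """True if the chunk body should be read, False if it can be skipped."""
    if match_unfiltered and bit_is_set(chunk_filter, 0):
        return True  # reader opted in to chunks holding unfiltered entries
    nbits = len(chunk_filter) * 8
    # A value matches only if every one of its bits is set in the filter.
    return any(
        all(bit_is_set(chunk_filter, p) for p in bit_positions(v, nbits))
        for v in wanted
    )
```

A `True` result only means the chunk may contain a wanted value: bloom filters can report false positives but never false negatives, so a reader would still need to check the individual entries after reading the chunk body.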

The bloom filter size can range from 16 to 255 bytes; the default is
the smallest value, 16 bytes. This should be good for distinguishing
between ~50 unique filter values.
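
To get a feel for the trade-off between filter size and the number of unique values, the textbook false-positive estimate for a bloom filter can be evaluated directly. The sketch below assumes two bit positions per value (inferred from the "two values" hashing note) and ignores the reserved first bit, so it is an approximation and not a measurement of this implementation.

```python
import math


def false_positive_rate(size_bytes: int, unique_values: int,
                        hashes: int = 2) -> float:
    """Classic estimate (1 - e^(-k*n/m))^k for a bloom filter of m bits."""
    m = size_bytes * 8
    return (1.0 - math.exp(-hashes * unique_values / m)) ** hashes


if __name__ == "__main__":
    for size in (16, 64, 255):
        rate = false_positive_rate(size, 50)
        print(f"{size:>3} bytes, 50 unique values: {rate:.3f}")
```

A false positive here only costs an unnecessary chunk read, so a larger filter trades a few extra bytes per chunk for fewer wasted reads.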

The bloom filter size is recorded in the first of the four spare bytes
at the end of the chunk header.
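
As an illustration of the header change, the sketch below models only those four trailing bytes, not the rest of the chunk header (whose layout is not described here). Treating a size of 0 as "no filter" is an assumption, and `pack_trailer` and `unpack_filter_size` are hypothetical names.

```python
import struct

# Hypothetical model of the 4 trailing header bytes only:
# byte 0 = bloom filter size, bytes 1-3 = still spare (written as zero).
TRAILER = struct.Struct(">B3x")


def pack_trailer(filter_size: int) -> bytes:
    # Assumption: 0 means the chunk carries no filter.
    if filter_size != 0 and not 16 <= filter_size <= 255:
        raise ValueError("filter size must be 0 or in the range 16..255")
    return TRAILER.pack(filter_size)


def unpack_filter_size(trailer: bytes) -> int:
    (size,) = TRAILER.unpack(trailer)
    return size
```

Storing the size in a single byte fits with the 255-byte upper bound on the filter, and it tells a reader how many bytes to read after the header before the chunk body starts.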
@kjnilsson kjnilsson marked this pull request as ready for review June 20, 2023 09:18
@kjnilsson kjnilsson merged commit 32b7f82 into main Jun 20, 2023
6 checks passed
@acogoluegnes acogoluegnes deleted the chunk-filtering branch June 20, 2023 09:23