Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optimized implementation for string indexes #632

Merged
merged 26 commits into from Jan 10, 2020
Merged

Conversation

mavam
Copy link
Member

@mavam mavam commented Nov 1, 2019

The new index is meant for opaque identifiers that can only be retrieved via equality lookups. If a string type contains the #index=hash attribute, the index factory chooses this new hash-based implementation.

@mavam mavam requested a review from a team November 1, 2019 22:13
@mavam mavam added enhancement ✨ performance Improvements or regressions of performance labels Nov 1, 2019
@mavam
Copy link
Member Author

mavam commented Nov 1, 2019

I haven't had the time yet, but it would be great to have some numbers on the performance delta. Ideally for both space and time.

@mavam
Copy link
Member Author

mavam commented Nov 2, 2019

There's a fundamental issue I overlooked in the design: the index does not support inequality lookups. It can't because it hashes the input. We'll have to reconsider the usefulness of a space-efficient equality-only string index before proceeding.

@mavam mavam force-pushed the story/ch4493 branch 3 times, most recently from ce68bfe to 88b7f60 Compare November 18, 2019 20:55
@mavam mavam added feature New functionality and removed enhancement ✨ labels Nov 18, 2019
@mavam mavam changed the title Add optimized implementation of opaque string indexes Add optimized implementation for string indexes Nov 18, 2019
libvast/vast/hash_index.hpp Outdated Show resolved Hide resolved
This index is meant for opaque identifiers that are can only be
retrieved via equality lookups. If a string type contains the #id
attribute in a schema, the index factory chooses a new hash-based
implementation.

The hash index is a vector of digests, where each digest is trimmed to a
specific byte length. This works only if the hash function distributes
its value uniformly over the entire output bits. Whether our current
choice of hash function (xxhash64) is an open question that needs to
figured out.
The value index keeps two extra bitmaps independent of its
implementation: a null bitmap and a mask bitmap. The null bitmap is for
all nil values and the mask value used to be for all IDs that exist in
the index *and* the nil values.

This change makes the two bitmaps disjoint. The result is not only less
bits overall, but also less bit-wise operations when performing a
lookup, because the null bitmap no longer needs to be NANDed out.
@mavam mavam force-pushed the story/ch4493 branch 2 times, most recently from a0dcc1c to 3ca898e Compare January 8, 2020 21:17
Copy link
Member

@tobim tobim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of the first 3 commits only, so I can do this in chunks.

Looking good so far.

libvast/src/value_index_factory.cpp Show resolved Hide resolved
Copy link
Member

@dominiklohmann dominiklohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a first pass. Conceptually, this seems all good to me, and I didn't notice anything weird in a short test run. Just a few small notes here and there for now.

libvast/src/format/zeek.cpp Outdated Show resolved Hide resolved
libvast/src/format/zeek.cpp Outdated Show resolved Hide resolved
libvast/src/view.cpp Show resolved Hide resolved
libvast/vast/view.hpp Show resolved Hide resolved
libvast/vast/hash_index.hpp Outdated Show resolved Hide resolved
libvast/vast/hash_index.hpp Outdated Show resolved Hide resolved
libvast/vast/hash_index.hpp Show resolved Hide resolved
libvast/vast/hash_index.hpp Outdated Show resolved Hide resolved
Copy link
Member

@tobim tobim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed my review strategy to go file by file.

libvast/src/format/zeek.cpp Outdated Show resolved Hide resolved
libvast/src/value_index_factory.cpp Show resolved Hide resolved
libvast/vast/value_index.hpp Outdated Show resolved Hide resolved
@dominiklohmann
Copy link
Member

The commit ede727e seems to have triggered an ADL correctness issue, causing the wrong operator to be called and thus the wrong type name to be displayed. That should likely be fixed in a separate PR.

This only popped up because of an overload in the Zeek type printer that
didn't get triggered because it lacked a const qualifier.
tobim
tobim previously approved these changes Jan 9, 2020
@mavam
Copy link
Member Author

mavam commented Jan 9, 2020

@dominiklohmann Sorry, I fixed the issue already before seeing your request to do this separately.

@dominiklohmann dominiklohmann merged commit 00e8abf into master Jan 10, 2020
@dominiklohmann dominiklohmann deleted the story/ch4493 branch January 10, 2020 09:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New functionality performance Improvements or regressions of performance
Projects
None yet
3 participants