Type FlatBuffers #1888
Adding a file via SOURCES in `add_custom_command` does not make the target depend on it automatically; DEPENDS should be used for that instead.
It is due for a revamp and was unused until now, so let's start with a clean slate.
This was simply forgotten when renaming the old `type` to `legacy_type`.
This commit introduces the new `vast::type`, a FlatBuffers-based type with a flat representation, support for memory-mapping and zero-copy slicing of nested types.
This turned out to be a bit more complex than initially assumed, and the new tests failed before I made the changes to the sum type access specialization. We need a double dispatch here: from type index to sum type token index, and then from that to a dispatch function that calls the visitor with the correct type. We could theoretically map the type index to the dispatch function directly, but that causes a lot of binary bloat: the mapping between type index and sum type token index would then no longer be computed just once in total, but once for every instantiation of apply. So a double dispatch will have to do.
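The double dispatch described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the three types, their `type_index` values, `type_list`, `index_table`, `dispatch`, `apply_visitor`, and `name_visitor` are all hypothetical names chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical concrete types, each with a sparse runtime type index.
struct bool_type { static constexpr std::uint8_t type_index = 1; };
struct integer_type { static constexpr std::uint8_t type_index = 2; };
struct string_type { static constexpr std::uint8_t type_index = 7; };

template <class... Ts>
struct type_list {};

using concrete_types = type_list<bool_type, integer_type, string_type>;

constexpr std::uint8_t invalid_token = 0xFF;

// First dispatch: type index -> dense token index. This table does not depend
// on the visitor, so it is computed exactly once for the whole program.
template <class... Ts>
constexpr std::array<std::uint8_t, 256> make_index_table(type_list<Ts...>) {
  std::array<std::uint8_t, 256> table{};
  for (auto& entry : table)
    entry = invalid_token;
  std::uint8_t token = 0;
  ((table[Ts::type_index] = token++), ...);
  return table;
}

inline constexpr auto index_table = make_index_table(concrete_types{});

// Second dispatch: token index -> function that calls the visitor with the
// matching type. Only this small table is instantiated per visitor.
template <class Result, class Visitor, class T>
Result dispatch(Visitor& visitor) {
  return visitor(T{});
}

template <class Result, class Visitor, class... Ts>
Result apply_impl(Visitor& visitor, std::uint8_t type_index,
                  type_list<Ts...>) {
  using dispatch_fn = Result (*)(Visitor&);
  static constexpr dispatch_fn dispatch_table[]
    = {&dispatch<Result, Visitor, Ts>...};
  const std::uint8_t token = index_table[type_index];
  assert(token != invalid_token);
  return dispatch_table[token](visitor);
}

template <class Result, class Visitor>
Result apply_visitor(Visitor&& visitor, std::uint8_t type_index) {
  return apply_impl<Result>(visitor, type_index, concrete_types{});
}

// Example visitor that names the type it is called with.
struct name_visitor {
  std::string operator()(bool_type) const { return "bool"; }
  std::string operator()(integer_type) const { return "integer"; }
  std::string operator()(string_type) const { return "string"; }
};

inline std::string describe(std::uint8_t type_index) {
  return apply_visitor<std::string>(name_visitor{}, type_index);
}
```

In this sketch, a direct mapping from type index to dispatch function would require a 256-entry table per visitor instantiation, whereas the two-step lookup keeps the per-visitor table as small as the number of alternatives.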
The static factory function had a rather unfortunate name. This avoids needless confusion and ensures `chunk.empty()` is not mistaken for `chunk.size() == 0`.
Usually you already have a subspan of the chunk's underlying span that you want to create a sliced chunk from, and this overload offers that directly.
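To illustrate why a subspan-based overload is convenient, here is a minimal sketch under stated assumptions. `simple_chunk` and `byte_span` are hypothetical stand-ins and not the chunk API from this PR; the sketch only shows a slice that shares ownership of the underlying buffer and accepts a subspan directly instead of an offset and a length.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a read-only span of bytes.
struct byte_span {
  const std::byte* data = nullptr;
  std::size_t size = 0;
};

// Hypothetical, simplified chunk: a view into a shared, immutable buffer.
class simple_chunk {
public:
  explicit simple_chunk(std::vector<std::byte> bytes)
    : buffer_{std::make_shared<const std::vector<std::byte>>(std::move(bytes))},
      view_{buffer_->data(), buffer_->size()} {
  }

  byte_span view() const { return view_; }

  // Offset-based slicing: the caller computes offset and length.
  simple_chunk slice(std::size_t offset, std::size_t length) const {
    assert(offset <= view_.size && length <= view_.size - offset);
    return simple_chunk{buffer_, byte_span{view_.data + offset, length}};
  }

  // Subspan-based slicing: the caller already holds a subspan of view().
  simple_chunk slice(byte_span subspan) const {
    assert(subspan.data >= view_.data);
    assert(subspan.data + subspan.size <= view_.data + view_.size);
    return simple_chunk{buffer_, subspan};
  }

  // Number of chunks that currently keep the buffer alive.
  long owners() const { return buffer_.use_count(); }

private:
  simple_chunk(std::shared_ptr<const std::vector<std::byte>> buffer,
               byte_span view)
    : buffer_{std::move(buffer)}, view_{view} {
  }

  std::shared_ptr<const std::vector<std::byte>> buffer_;
  byte_span view_;
};
```

With the second overload, code that already holds a subspan (for example, a nested table inside a larger buffer) does not need to convert it back into an offset and a length first.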
After some internal back and forth, this is what we ended up wanting things to be named:
- s/tagged_type/enriched_type
- s/tag/attribute
This is inherently unsafe, so let's forbid it. It's safe for basic types because they have static lifetime.
Co-authored-by: Tobias Mayer <tobim@fastmail.fm>
Co-authored-by: tobim <tobim@fastmail.fm>
This branch has been running on a testbed at Tenzir for the past 15 hours with no unexpected performance changes.
The time to load a partition with the old serialization format has increased slightly, but we expect it to return to the old numbers once the partition uses the new serialization format for its layout.
📔 Description
This refactors our type system to be based on a single, extensible FlatBuffers table. This comes with multiple advantages:
- `type` is now cache friendly. It's probably safe to say that this was the biggest downside of the old type system. Iterating over the leaves of a `record_type` is no longer significantly slower than iterating over just the fields.
- `type` is now sliceable from other FlatBuffers tables (e.g., Table Slices), and shares its lifetime with the sliced-from underlying chunk.
- `type` no longer needs to be deserialized, as its underlying data format is a contiguous buffer now.

I've taken the opportunity to clean up the type API at the same time. This includes the following developer-facing changes:
- `alias_type` no longer exists. Visiting a `type` steps through aliases transparently.
- Attributes are now called tags. (reverted later on)
- `record_type` API no longer operates with keys directly, but rather only with indices and offsets. The functions `record_type::resolve_prefix` and `record_type::resolve_suffix` must be used to convert a key into offsets.
- `enumeration_type` API allows for explicitly assigning keys to individual fields. This is preparatory work for aligning our data model with Apache Arrow's, which offers the related `arrow::DictionaryType`.
- `record_type::fields` and `record_type::leaves` return iterables over fields and leaves of a record respectively.
- `concrete_type`, `basic_type`, and `complex_type` concepts make writing generic visitors a lot easier.

There exist conversions between `type` and `legacy_type` in both directions to ease the transition. The conversion from `type` to `legacy_type` will be removed towards the end of this PR.
The user-facing changes of this PR are hopefully minimal. The conversion from `legacy_type` to `type` when exporting old data from pre-existing databases may incur a small performance overhead, but for newly imported data this should speed up both the import and export pipelines across the board. There are a lot of smaller bug fixes for edge cases with leaf field key handling and alias type resolution throughout the code base.

🚦 Status
- Consider implementing a new version of the partition FlatBuffers table. (Not in this PR.)
- Consider making `record_type::iterable::iterator` a random access iterator. (Not in this PR.)
- Consider making `record_type::leaf_iterable::iterator` a bidirectional iterator. (Not in this PR.)

📝 Checklist
🎯 Review Instructions
`type.hpp` file.