Type FlatBuffers #1888

Merged
merged 125 commits into from
Nov 25, 2021

dominiklohmann
Member

@dominiklohmann dominiklohmann commented Sep 23, 2021

📔 Description

This refactors our type system to be based on a single, extensible FlatBuffers table. This comes with multiple advantages:

  • type is now cache friendly. It's probably safe to say that cache unfriendliness was the biggest downside of the old type system. Iterating over the leaves of a record_type is no longer significantly slower than iterating over just the fields.
  • type is now sliceable from other FlatBuffers tables (e.g., Table Slices), and shares its lifetime with the underlying chunk it was sliced from. type no longer needs to be deserialized, as its underlying data format is now a contiguous buffer.

I've taken the opportunity to clean up the type API at the same time. This includes the following developer-facing changes:

  • The alias_type no longer exists. Visiting a type steps through aliases transparently.
  • Attributes are now called tags. (reverted later on)
  • The record_type API no longer operates with keys directly, but rather only with indices and offsets. The functions record_type::resolve_prefix and record_type::resolve_suffix must be used to convert a key into offsets.
  • The enumeration_type API allows for explicitly assigning keys to individual fields. This is preparatory work for aligning our data model with Apache Arrow's, which offers the related arrow::DictionaryType.
  • The functions record_type::fields and record_type::leaves return iterables over fields and leaves of a record respectively.
  • Complex types are no longer default-constructible.
  • The new concrete_type, basic_type, and complex_type concepts make writing generic visitors a lot easier.

Conversions between type and legacy_type exist in both directions to ease the transition. The conversion from type to legacy_type will be removed towards the end of this PR.

The user-facing changes of this PR are hopefully minimal. The conversion from legacy_type to type when exporting old data from pre-existing databases may incur a small performance overhead, but for newly imported data this should speed up both the import and export pipelines across the board. There are a lot of smaller bug fixes for edge cases with leaf field key handling and alias type resolution throughout the code base.

🚦 Status

  • Design the new type FlatBuffers table and the basic type API.
  • Add support for basic types.
  • Add support for type aliases and tags.
  • Add support for list types.
  • Add support for map types.
  • Add support for enumeration types.
  • Add support for record types.
  • Implement new versions of encoding-specific table slice FlatBuffers tables.
  • Convert libvast to operate on new type system directly.
  • Convert unit tests, tools, and plugins to operate on new type system directly.
  • Remove now dead code from the legacy type API (most operations should be unnecessary now).
  • Consider implementing a new version of the partition FlatBuffers table. Not in this PR.
  • Consider making record_type::iterable::iterator a random access iterator. Not in this PR.
  • Consider making record_type::leaf_iterable::iterator a bidirectional iterator. Not in this PR.

📝 Checklist

  • All user-facing changes have changelog entries.
  • The changes are reflected on docs.tenzir.com/vast, if necessary.
  • The PR description contains instructions for the reviewer, if necessary.

🎯 Review Instructions

  1. Look at the new Type FlatBuffers table and at the new type.hpp file.
  2. Look at the individual commits. Most of them should be self-explanatory.
  3. Wait until this PR is out of draft mode.
  4. Measure overall performance impact on both pre-existing and newly created databases.

@dominiklohmann dominiklohmann added feature New functionality performance Improvements or regressions of performance labels Sep 23, 2021
@dominiklohmann dominiklohmann force-pushed the story/ch20439/type-flatbuffers branch 3 times, most recently from ab1039b to ffd90cb Compare September 28, 2021 11:53
Adding a file via SOURCES in `add_custom_command` does not make the
target automatically depend on it; DEPENDS should be used for that
instead.
It is due for a revamp and was unused until now, so let's start with a
clean slate.
This was simply forgotten when renaming the old `type` to `legacy_type`.
This commit introduces the new `vast::type`, a FlatBuffers-based type
with a flat representation, support for memory-mapping and zero-copy
slicing of nested types.
This has turned out to be a bit more complex than initially assumed, and
the new tests failed before I made the changes to the sum type access
specialization. We need to do a double dispatch here: From type index to
sum type token index, and then from that to a dispatch function that
calls the visitor with the correct type.

We could theoretically map type index to dispatch function directly, but
that causes a lot of binary bloat because the mapping between type index
and sum type token index would then be computed once for every
instantiation of apply rather than just once in total. So a double
dispatch will have to do.
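
The double dispatch described above could look roughly like the following sketch. It is not the sum type access specialization from this commit: the concrete types, their `type_index` values, `position_table`, and `apply_visitor` are hypothetical names chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical concrete types, each tagged with a (sparse) type index.
struct bool_type {
  static constexpr std::uint8_t type_index = 1;
  static constexpr const char* name = "bool";
};
struct integer_type {
  static constexpr std::uint8_t type_index = 2;
  static constexpr const char* name = "integer";
};
struct string_type {
  static constexpr std::uint8_t type_index = 9;
  static constexpr const char* name = "string";
};

template <class... Ts>
struct type_list {};
using concrete_types = type_list<bool_type, integer_type, string_type>;

inline constexpr std::uint8_t unknown = 0xFF;

// First dispatch: type index -> position in `concrete_types`. This table
// does not depend on any visitor, so it is computed exactly once.
template <class... Ts>
constexpr std::array<std::uint8_t, 256> make_position_table(type_list<Ts...>) {
  std::array<std::uint8_t, 256> table{};
  for (auto& entry : table)
    entry = unknown;
  std::uint8_t position = 0;
  ((table[Ts::type_index] = position++), ...);
  return table;
}
inline constexpr auto position_table = make_position_table(concrete_types{});

// Calls the visitor with a default-constructed T.
template <class Result, class Visitor, class T>
Result invoke_as(Visitor& visitor) {
  return visitor(T{});
}

// Second dispatch: position -> function that calls the visitor with the
// concrete type. Only this small table is instantiated per visitor.
template <class Visitor, class T, class... Ts>
auto apply_impl(std::uint8_t type_index, Visitor& visitor,
                type_list<T, Ts...>) {
  using result = std::invoke_result_t<Visitor&, T>;
  using dispatch_fn = result (*)(Visitor&);
  static constexpr dispatch_fn dispatch[]
    = {&invoke_as<result, Visitor, T>, &invoke_as<result, Visitor, Ts>...};
  const auto position = position_table[type_index];
  assert(position != unknown && "unknown type index");
  return dispatch[position](visitor);
}

template <class Visitor>
auto apply_visitor(std::uint8_t type_index, Visitor visitor) {
  return apply_impl(type_index, visitor, concrete_types{});
}
```

In this sketch the 256-entry `position_table` exists once in the binary, while each distinct visitor only adds a table with one function pointer per concrete type.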
The static factory function had a rather unfortunate name. This avoids
needless confusion and ensures `chunk.empty()` is not mistaken for
`chunk.size() == 0`.
Usually you already have a subspan of the chunk's underlying span that
you want to create a sliced chunk from, and this overload offers that
directly.
@dominiklohmann dominiklohmann force-pushed the story/ch20439/type-flatbuffers branch from 3443395 to 21d7f17 Compare November 19, 2021 07:59
@dominiklohmann dominiklohmann force-pushed the story/ch20439/type-flatbuffers branch from d8ce204 to e6a2b54 Compare November 19, 2021 14:54
@dominiklohmann dominiklohmann requested a review from tobim November 19, 2021 14:54
After some internal back and forth, this is what we ended up wanting
things to be named:
 - s/tagged_type/enriched_type
 - s/tag/attribute
@dominiklohmann dominiklohmann force-pushed the story/ch20439/type-flatbuffers branch from e6a2b54 to 99eb390 Compare November 19, 2021 15:02
@dominiklohmann dominiklohmann mentioned this pull request Nov 19, 2021
3 tasks
@dominiklohmann dominiklohmann force-pushed the story/ch20439/type-flatbuffers branch from 1c92dae to 4a3989e Compare November 22, 2021 14:56
dominiklohmann and others added 3 commits November 24, 2021 09:28
This is inherently unsafe, so let's forbid it. It's safe for basic types
because they have static lifetime.
Co-authored-by: Tobias Mayer <tobim@fastmail.fm>
@dominiklohmann dominiklohmann force-pushed the story/ch20439/type-flatbuffers branch from a41ee3f to af9a22a Compare November 24, 2021 08:29
Co-authored-by: tobim <tobim@fastmail.fm>
Member

@tobim tobim left a comment

This branch has been running on a testbed at Tenzir for the past 15 hours with no unexpected performance changes.

The time to load a partition with the old serialization format has increased slightly, but we expect it to return to the old numbers once the partition uses the new serialization format for its layout.

Successfully merging this pull request may close these issues.

Use FlatBuffers for serializing type
3 participants