Client-side chunks 1: introduce `Chunk` and its shuffle/sort routines
#6438
// TODO(cmc): maybe this would be better as raw i64s so getting time columns in and out of
// chunks is just a blind memcpy… it's probably not worth the hassle for now though.
// We'll see how things evolve as we start putting chunks in the backend.
pub(crate) times: Vec<TimeInt>,
Depending on how the backend side goes, I might actually end up not deserializing these at all, if I can afford it. That would be sweet.
/// data within.
#[derive(Debug, Clone)]
pub struct Chunk {
    pub(crate) id: ChunkId,
I'm exploring the possibility of always making sure that the ID of a chunk is the same as the ID of its first row (in sorted order).
That would be way more useful than a random ID generated post-micro-batching, and would give way more meaning to sorting chunks based on their IDs.
///
/// Iff you know for sure whether the data is already appropriately sorted or not, specify `is_sorted`.
/// When left unspecified (`None`), it will be computed in O(n) time.
pub fn new(
TODO in this PR or another: when creating a chunk of static data, there is no reason to keep anything but the last row (in sorted row-id order).
The backend will have to support multi-rows static chunks anyhow since clients can send anything, which both the query engine and compaction will know how to take care of, but it's a nice little optimization on the standard path.
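A minimal sketch of the optimization described in this comment, assuming a simplified row type. `StaticRow` and `keep_last_static_row` are hypothetical names used only for illustration; the real `Chunk` stores Arrow columns rather than row structs, and `u64` stands in for the real row ID type.

```rust
/// Hypothetical, simplified row. The real `Chunk` stores Arrow columns, not structs.
#[derive(Debug, Clone, PartialEq)]
struct StaticRow {
    /// Stand-in for the real row ID type.
    row_id: u64,
    payload: String,
}

/// Keeps only the row with the greatest `row_id`, i.e. the last row in
/// sorted row-id order. Returns `None` when there are no rows.
fn keep_last_static_row(rows: Vec<StaticRow>) -> Option<StaticRow> {
    rows.into_iter().max_by_key(|row| row.row_id)
}
```

For static data only the most recent row can ever be observed, which is why a single linear pass is enough here and no full sort is needed.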
/// Empty if this is a static chunk.
pub(crate) timelines: BTreeMap<Timeline, ChunkTimeline>,

/// A sparse `ListArray` for each component.
To my knowledge arrow doesn't have a spec for "sparse" listarray.
Do you mean nullable listarray?
Also, worth thinking about. Arrow now supports a ListView: https://arrow.apache.org/docs/format/Columnar.html#listview-layout
This could give us a mechanism to shuffle just the offsets in cases where we don't want to pay the full cost of rearranging the child buffer.
> To my knowledge arrow doesn't have a spec for "sparse" listarray.
> Do you mean nullable listarray?
I just find the "official" terminology extremely confusing: what's a nullable listarray exactly? a listarray that can be null? a listarray that can contain null values? both?
#[allow(clippy::collapsible_if)] // readability
if cfg!(debug_assertions) {
    for &time in times {
        if time < time_range.min() || time > time_range.max() {
Is time_range allowed to be conservative or should we also be sanity-checking that this is a tight bound?
Tighter checks definitely cannot hurt
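To make the distinction in this thread concrete, here is a rough sketch of a conservative check versus a tight one. It uses plain `i64` values instead of the PR's time types, and both function names are hypothetical.

```rust
/// Conservative check: every time lies within `[min, max]`, but the bounds
/// are allowed to be looser than the data.
fn is_conservative_time_range(times: &[i64], min: i64, max: i64) -> bool {
    times.iter().all(|&t| min <= t && t <= max)
}

/// Tight check: `[min, max]` is exactly the smallest range covering `times`.
/// An empty slice has no tight range, so this sketch returns `false` for it
/// (an assumption, not something the PR specifies).
fn is_tight_time_range(times: &[i64], min: i64, max: i64) -> bool {
    match (times.iter().min(), times.iter().max()) {
        (Some(&lo), Some(&hi)) => lo == min && hi == max,
        _ => false,
    }
}
```

A tight range always passes the conservative check, so the tight variant is the stricter sanity check of the two.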
crates/re_chunk/src/shuffle.rs
Outdated
///
/// If `make_contiguous` is `true`, the underlying arrow data will be copied and shuffled in
/// memory in order to make it contiguous.
/// Otherwise, only the offsets will be shuffled.
> Otherwise, only the offsets will be shuffled.
I don't believe this is allowed for ListArray. Offsets must be monotonically increasing and dense -- the length of each array is (offset[n+1] - offset[n])
We could, however, do this with ListView instead.
Oh yeah, nice catch. No idea why `arrow2` allows it :|
We're not going to get `ListView` into `arrow2` any time soon obviously, so I'll just remove the non-contiguous path and leave a TODO that links to our `arrow-rs` migration ticket.
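The difference discussed in this thread can be sketched without any Arrow dependency. The types below are hypothetical stand-ins built on plain `Vec`s (real Arrow offsets are signed integers and the arrays also carry validity bitmaps); they only illustrate why a list array needs its child values rewritten on shuffle while a list-view layout does not.

```rust
/// List-array-style layout: row `i` is `values[offsets[i]..offsets[i + 1]]`,
/// so the offsets have to be monotonically non-decreasing.
#[derive(Debug, PartialEq)]
struct Lists {
    offsets: Vec<usize>,
    values: Vec<i32>,
}

/// Contiguous shuffle: copies the child values in the new row order and
/// rebuilds the offsets so they stay monotonic.
fn shuffle_contiguous(lists: &Lists, order: &[usize]) -> Lists {
    let mut offsets = Vec::with_capacity(order.len() + 1);
    let mut values = Vec::with_capacity(lists.values.len());
    offsets.push(0);
    for &row in order {
        values.extend_from_slice(&lists.values[lists.offsets[row]..lists.offsets[row + 1]]);
        offsets.push(values.len());
    }
    Lists { offsets, values }
}

/// List-view-style layout: each row carries its own offset and size, so rows
/// may point anywhere in `values`, in any order.
#[derive(Debug, PartialEq)]
struct ListViews {
    offsets: Vec<usize>,
    sizes: Vec<usize>,
    values: Vec<i32>,
}

impl ListViews {
    /// Returns the values of row `i`.
    fn row(&self, i: usize) -> &[i32] {
        &self.values[self.offsets[i]..self.offsets[i] + self.sizes[i]]
    }
}

/// Offset-only shuffle: the child values are moved over untouched, only the
/// per-row offsets and sizes are reordered.
fn shuffle_as_view(lists: Lists, order: &[usize]) -> ListViews {
    let offsets = order.iter().map(|&row| lists.offsets[row]).collect();
    let sizes = order
        .iter()
        .map(|&row| lists.offsets[row + 1] - lists.offsets[row])
        .collect();
    ListViews { offsets, sizes, values: lists.values }
}
```

The contiguous path pays for a copy of the child buffer; the view path pays only for the per-row bookkeeping, at the cost of non-sequential reads later.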
This new and improved `re_format_arrow` ™️ brings two major improvements:
- It is now designed to format standard Arrow dataframes (aka chunks or batches), i.e. a `Schema` and a `Chunk`. In particular: chunk-level and field-level schema metadata will now be rendered properly with the rest of the table.
- Tables larger than your terminal will now do their best to fit in, while making sure to still show just enough data.

E.g. here's an excerpt of a real-world Rerun dataframe from our `helix` example:
```
cargo r -p rerun-cli --no-default-features --features native_viewer -- print helix.rrd --verbose
```

before (`main`):
![image](https://github.com/rerun-io/rerun/assets/2910679/99169b2a-d972-439d-900a-8f122a4d5ca3)

and after:
![image](https://github.com/rerun-io/rerun/assets/2910679/3fe7acce-d646-4ff2-bfae-eb5073d17741)

---

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
A `TransportChunk` is a `Chunk` that is ready for transport and/or storage. It is very cheap to go from `Chunk` to a `TransportChunk` and vice-versa.

A `TransportChunk` maps 1:1 to a native Arrow `RecordBatch`. It has a stable ABI, and can be cheaply sent across process boundaries. `arrow2` has no `RecordBatch` type; we will get one once we migrate to `arrow-rs`.

A `TransportChunk` is self-describing: it contains all the data _and_ metadata needed to index it into storage.

We rely heavily on chunk-level and field-level metadata to communicate Rerun-specific semantics over the wire, e.g. whether some columns are already properly sorted.

The Arrow metadata system is fairly limited (it's all untyped strings), but for now that seems good enough. It will be trivial to switch to something else later, if need be.

- Fixes #1760
- Fixes #1692
- Fixes #3360
- Fixes #1696

---

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
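As an illustration of carrying semantics through untyped string metadata, here is a minimal sketch using a plain `BTreeMap<String, String>` as a stand-in for Arrow's metadata map. The key name `example.is_sorted` and both helper functions are hypothetical; the actual keys used by `TransportChunk` are not shown in this commit message.

```rust
use std::collections::BTreeMap;

/// Hypothetical metadata key, for illustration only.
const IS_SORTED_KEY: &str = "example.is_sorted";

/// Encodes the sortedness flag as a string, since the metadata is untyped.
fn write_is_sorted(metadata: &mut BTreeMap<String, String>, is_sorted: bool) {
    metadata.insert(IS_SORTED_KEY.to_owned(), is_sorted.to_string());
}

/// Decodes the flag. Returns `None` if the key is missing or the value does
/// not parse as a boolean, so the receiver can fall back to recomputing it.
fn read_is_sorted(metadata: &BTreeMap<String, String>) -> Option<bool> {
    metadata.get(IS_SORTED_KEY)?.parse().ok()
}
```

Treating a malformed value the same as a missing one keeps the receiving side safe: the worst case is an extra O(n) sortedness check, never a wrong assumption.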
This is a fork of the old `DataTable` batcher, and works very similarly.

Like before, this batcher will micro-batch using both space and time thresholds.
There are two main differences:
- This batcher maintains a dataframe per-entity, as opposed to the old one which worked globally.
- Once a threshold is reached, this batcher further splits the incoming batch in order to fulfill these invariants:
  ```rust
  /// In particular, a [`Chunk`] cannot:
  /// * contain data for more than one entity path
  /// * contain rows with different sets of timelines
  /// * use more than one datatype for a given component
  /// * contain more rows than a pre-configured threshold if one or more timelines are unsorted
  ```

Most of the code is the same, the real interesting piece is `PendingRow::many_into_chunks`, as well as the newly added tests.

- Fixes #4431

---

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
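The first two invariants listed in this commit message amount to grouping rows by a composite key. The sketch below shows that idea only; it is not the logic of `PendingRow::many_into_chunks`, and `PendingRowSketch`, `pending_row`, and `split_into_chunks` are hypothetical names. The datatype and row-count invariants are deliberately left out.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, simplified pending row: just enough fields to show the split.
#[derive(Debug, Clone, PartialEq)]
struct PendingRowSketch {
    entity_path: String,
    /// Names of the timelines this row has a value for.
    timelines: BTreeSet<String>,
    row_id: u64,
}

/// Convenience constructor for the sketch.
fn pending_row(entity_path: &str, timelines: &[&str], row_id: u64) -> PendingRowSketch {
    PendingRowSketch {
        entity_path: entity_path.to_owned(),
        timelines: timelines.iter().map(|t| t.to_string()).collect(),
        row_id,
    }
}

/// Groups rows so that each resulting chunk holds a single entity path and a
/// single set of timelines. Rows keep their arrival order within a chunk.
fn split_into_chunks(rows: Vec<PendingRowSketch>) -> Vec<Vec<PendingRowSketch>> {
    let mut groups: BTreeMap<(String, BTreeSet<String>), Vec<PendingRowSketch>> =
        BTreeMap::new();
    for row in rows {
        let key = (row.entity_path.clone(), row.timelines.clone());
        groups.entry(key).or_default().push(row);
    }
    groups.into_values().collect()
}
```

Using a `BTreeMap` makes the output order deterministic, which is convenient for tests; a hash map would work equally well if chunk order does not matter.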
Integrate the new chunk batcher in all SDKs, and get rid of the old one.

On the backend, we make sure to deserialize incoming chunks into the old `DataTable`s, so business can continue as usual.

Although the new batcher has a much more complicated task with all these sub-splits to manage, it is somehow already more performant than the old one 🤷‍♂️:
```bash
# this branch
cargo b -p log_benchmark --release && hyperfine --runs 15 './target/release/log_benchmark --benchmarks points3d_many_individual'
Benchmark 1: ./target/release/log_benchmark --benchmarks points3d_many_individual
  Time (mean ± σ):      4.499 s ±  0.117 s    [User: 5.544 s, System: 1.836 s]
  Range (min … max):    4.226 s …  4.640 s    15 runs

# main
cargo b -p log_benchmark --release && hyperfine --runs 15 './target/release/log_benchmark --benchmarks points3d_many_individual'
Benchmark 1: ./target/release/log_benchmark --benchmarks points3d_many_individual
  Time (mean ± σ):      4.407 s ±  0.773 s    [User: 8.423 s, System: 0.880 s]
  Range (min … max):    2.997 s …  6.148 s    15 runs
```

Notice the massive difference in user time.

---

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
Introduces the new `re_chunk` crate:

Specifically, it introduces the `Chunk` type itself, and all methods and helpers related to sorting.

A `Chunk` is self-describing: it contains all the data and metadata needed to index it into storage.

There are a lot of things that need to be sorted within a `Chunk`, and as such we must make sure to keep track of what is or isn't sorted at all times, to avoid needlessly re-sorting things every time a chunk changes hands.
This necessitates a bunch of sanity checking all over the place to make sure we never end up in undefined states.

`Chunk` is not about transport, it's about providing a nice-to-work-with representation when manipulating a chunk in memory.
Transporting a `Chunk` happens in the next PR.

- `DataTable::sort` shared with `DataStore` #1981

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- `Chunk` and its shuffle/sort routines #6438
- `TransportChunk` #6439

Checklist
- `main` build: rerun.io/viewer
- `nightly` build: rerun.io/viewer

To run all checks from `main`, comment on the PR with `@rerun-bot full-check`.
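The sortedness tracking described in the PR description, together with the `is_sorted: Option<bool>` parameter documented on `Chunk::new`, can be sketched roughly as follows. `SortTracked` and `resolve_is_sorted` are hypothetical names, and `u64` stands in for the real row ID type; this is an illustration under those assumptions, not the crate's implementation.

```rust
/// Trusts the caller's hint when given; otherwise checks sortedness in O(n).
fn resolve_is_sorted(row_ids: &[u64], is_sorted: Option<bool>) -> bool {
    is_sorted.unwrap_or_else(|| row_ids.windows(2).all(|pair| pair[0] <= pair[1]))
}

/// Hypothetical container that remembers whether its row IDs are sorted.
#[derive(Debug)]
struct SortTracked {
    row_ids: Vec<u64>,
    is_sorted: bool,
}

impl SortTracked {
    /// Pass `Some(..)` only if the sortedness is known for sure.
    fn new(row_ids: Vec<u64>, is_sorted: Option<bool>) -> Self {
        let is_sorted = resolve_is_sorted(&row_ids, is_sorted);
        Self { row_ids, is_sorted }
    }

    /// Sorts only when needed, so a chunk that is already sorted costs nothing
    /// when it changes hands.
    fn sort_if_unsorted(&mut self) {
        if !self.is_sorted {
            self.row_ids.sort_unstable();
            self.is_sorted = true;
        }
    }
}
```

A wrong hint would silently leave the data in an inconsistent state, which is presumably what the sanity checks mentioned in the description guard against in debug builds.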