Client-side chunks 1: introduce `Chunk` and its shuffle/sort routines #6438

teh-cmc · 2024-05-27T15:00:46Z

Introduces the new re_chunk crate:

A chunk of Rerun data, encoded using Arrow. Used for logging, transport, storage and compute.

Specifically, it introduces the Chunk type itself, and all methods and helpers related to sorting.
A Chunk is self-describing: it contains all the data and metadata needed to index it into storage.

There are a lot of things that need to be sorted within a Chunk, so we must keep track of what is or isn't sorted at all times, to avoid needlessly re-sorting things every time a chunk changes hands.
This necessitates a bunch of sanity checking all over the place to make sure we never end up in undefined states.
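As a rough illustration of the bookkeeping described above, here is a minimal sketch of a column that caches its own sortedness. `SortedColumn` and its methods are hypothetical names invented for this example, not the `re_chunk` API, and plain `u64` values stand in for real row IDs.

```rust
/// O(n) scan: true if every element is <= its successor.
fn compute_is_sorted(values: &[u64]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}

/// Hypothetical, simplified stand-in for one sortable column of a chunk.
#[derive(Debug, Clone)]
pub struct SortedColumn {
    values: Vec<u64>,
    /// Cached so that nobody has to re-scan `values` to learn whether it is sorted.
    is_sorted: bool,
}

impl SortedColumn {
    /// Pass `Some(..)` if the sortedness is already known, `None` to compute it in O(n).
    pub fn new(values: Vec<u64>, is_sorted: Option<bool>) -> Self {
        let is_sorted = match is_sorted {
            Some(claimed) => {
                // Sanity check (debug builds only): the caller's claim must match reality.
                debug_assert_eq!(claimed, compute_is_sorted(&values), "wrong sortedness claim");
                claimed
            }
            None => compute_is_sorted(&values),
        };
        Self { values, is_sorted }
    }

    pub fn is_sorted(&self) -> bool {
        self.is_sorted
    }

    pub fn values(&self) -> &[u64] {
        &self.values
    }

    /// Appends a value and updates the cached flag in O(1).
    pub fn push(&mut self, value: u64) {
        if let Some(&last) = self.values.last() {
            self.is_sorted &= last <= value;
        }
        self.values.push(value);
    }

    /// Sorts only if needed; a no-op when the flag says the data is already sorted.
    pub fn sort(&mut self) {
        if self.is_sorted {
            return;
        }
        self.values.sort_unstable();
        self.is_sorted = true;
    }
}
```

The flag is only trustworthy if every mutation goes through methods that maintain it, which is why the fields are private in this sketch.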

Chunk is not about transport; it's about providing a nice-to-work-with representation when manipulating a chunk in memory.
Transporting a Chunk happens in the next PR.

Fixes Efficient DataTable::sort shared with DataStore #1981

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441

Checklist

I have read and agree to Contributor Guide and the Code of Conduct
I've included a screenshot or gif (if applicable)
I have tested the web demo (if applicable):
- Using examples from latest main build: rerun.io/viewer
- Using full set of examples from nightly build: rerun.io/viewer
The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG
If applicable, add a new check to the release checklist!

To run all checks from main, comment on the PR with @rerun-bot full-check.

teh-cmc · 2024-05-30T06:43:04Z

crates/re_chunk/src/chunk.rs

+    // TODO(cmc): maybe this would be better as raw i64s so getting time columns in and out of
+    // chunks is just a blind memcpy… it's probably not worth the hassle for now though.
+    // We'll see how things evolve as we start putting chunks in the backend.
+    pub(crate) times: Vec<TimeInt>,


Depending on how the backend side goes, I might actually end up not deserializing these at all, if I can afford it. That would be sweet.

teh-cmc · 2024-05-30T07:10:04Z

crates/re_chunk/src/chunk.rs

+/// data within.
+#[derive(Debug, Clone)]
+pub struct Chunk {
+    pub(crate) id: ChunkId,


I'm exploring the possibility of always making sure that the ID of a chunk is the same as the ID of its first row (in sorted order).

That would be way more useful than a random ID generated post-micro-batching, and would give way more meaning to sorting chunks based on their IDs.
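To make the idea concrete, here is a small sketch under simplifying assumptions: `RowId` and `ChunkId` are plain integers here (the real types are not), and both function names are hypothetical. It only shows why deriving the chunk ID from the first sorted row makes ordering chunks by ID meaningful.

```rust
/// Hypothetical stand-ins: the real row and chunk IDs are not plain integers.
pub type RowId = u64;
pub type ChunkId = u64;

/// Derives a chunk ID from the smallest row ID, i.e. the first row in sorted order.
/// Returns `None` for an empty chunk, which has no row to take an ID from.
pub fn chunk_id_from_rows(row_ids: &[RowId]) -> Option<ChunkId> {
    row_ids.iter().copied().min()
}

/// Orders chunks (each represented by its row IDs) by their derived chunk ID,
/// which is the same as ordering them by their earliest row.
pub fn sort_chunks_by_id(chunks: &mut [Vec<RowId>]) {
    chunks.sort_by_key(|rows| chunk_id_from_rows(rows));
}
```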

teh-cmc · 2024-05-30T07:14:29Z

crates/re_chunk/src/chunk.rs

+    ///
+    /// Iff you know for sure whether the data is already appropriately sorted or not, specify `is_sorted`.
+    /// When left unspecified (`None`), it will be computed in O(n) time.
+    pub fn new(


TODO in this PR or another: when creating a chunk of static data, there is no reason to keep anything but the last row (in sorted row-id order).

The backend will have to support multi-row static chunks anyway since clients can send anything, which both the query engine and compaction will know how to take care of, but it's a nice little optimization on the standard path.
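A minimal sketch of that optimization, assuming a static chunk can be modeled as a list of rows that each carry a complete payload. `StaticRow` and `compact_static_rows` are hypothetical names, and the sketch ignores the case where different rows carry different components.

```rust
/// Hypothetical, simplified row of static data.
#[derive(Debug, Clone, PartialEq)]
pub struct StaticRow {
    pub row_id: u64,
    pub payload: String,
}

/// Keeps only the row that comes last in row-id order; earlier static rows are shadowed by it.
/// Returns `None` if there are no rows at all.
pub fn compact_static_rows(rows: Vec<StaticRow>) -> Option<StaticRow> {
    rows.into_iter().max_by_key(|row| row.row_id)
}
```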

jleibs · 2024-05-30T15:30:28Z

crates/re_chunk/src/chunk.rs

+    /// Empty if this is a static chunk.
+    pub(crate) timelines: BTreeMap<Timeline, ChunkTimeline>,
+
+    /// A sparse `ListArray` for each component.


To my knowledge arrow doesn't have a spec for "sparse" listarray.

Do you mean nullable listarray?

Also, worth thinking about. Arrow now supports a ListView: https://arrow.apache.org/docs/format/Columnar.html#listview-layout

This could give us a mechanism to shuffle just the offsets in cases where we don't want to pay the full cost of rearranging the child buffer.

> To my knowledge arrow doesn't have a spec for "sparse" listarray.
>
> Do you mean nullable listarray?

I just find the "official" terminology extremely confusing: what's a nullable listarray exactly? a listarray that can be null? a listarray that can contain null values? both?

jleibs · 2024-05-30T15:36:21Z

crates/re_chunk/src/chunk.rs

+        #[allow(clippy::collapsible_if)] // readability
+        if cfg!(debug_assertions) {
+            for &time in times {
+                if time < time_range.min() || time > time_range.max() {


Is time_range allowed to be conservative or should we also be sanity-checking that this is a tight bound?

Tighter checks definitely cannot hurt
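The difference between the two checks can be sketched as follows. `TimeRange`, `is_within` and `is_tight` are hypothetical names using plain `i64` timestamps, not the types from the diff; the point is only that a tight bound requires the range endpoints to equal the actual minimum and maximum.

```rust
/// Hypothetical inclusive time range over raw `i64` timestamps.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TimeRange {
    pub min: i64,
    pub max: i64,
}

/// Conservative check: every timestamp falls inside the range.
pub fn is_within(times: &[i64], range: TimeRange) -> bool {
    times.iter().all(|&t| range.min <= t && t <= range.max)
}

/// The tightest possible range, or `None` if there are no timestamps.
pub fn tight_range(times: &[i64]) -> Option<TimeRange> {
    let min = times.iter().copied().min()?;
    let max = times.iter().copied().max()?;
    Some(TimeRange { min, max })
}

/// Tight check: the range is exactly the min/max of the data.
/// An empty column has no tight range, so this returns `false` for it.
pub fn is_tight(times: &[i64], range: TimeRange) -> bool {
    tight_range(times) == Some(range)
}
```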

jleibs · 2024-05-30T15:44:05Z

crates/re_chunk/src/shuffle.rs

+    ///
+    /// If `make_contiguous` is `true`, the underlying arrow data will be copied and shuffled in
+    /// memory in order to make it contiguous.
+    /// Otherwise, only the offsets will be shuffled.


> Otherwise, only the offsets will be shuffled.

I don't believe this is allowed for ListArray. Offsets must be monotonically increasing and dense -- the length of list n is (offset[n+1] - offset[n]).

We could, however, do this with ListView instead.

Oh yeah, nice catch. No idea why arrow2 allows it :|

We're not going to get ListView into arrow2 any time soon obviously, so I'll just remove the non-contiguous path and leave a TODO that links to our arrow-rs migration ticket.
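To illustrate the constraint discussed in this thread, here is a sketch that models the two layouts with plain slices rather than any Arrow crate. All function names are hypothetical. A `ListArray`-style layout stores `n + 1` offsets, so reordering lists means rebuilding the child buffer; a `ListView`-style layout stores independent offsets and sizes, so a shuffle can permute those alone.

```rust
/// `ListArray`-style check: offsets must start at a non-negative value, never decrease,
/// and stay within the child buffer. List `i` spans `values[offsets[i]..offsets[i + 1]]`.
pub fn list_offsets_are_valid(offsets: &[i32], values_len: usize) -> bool {
    match (offsets.first(), offsets.last()) {
        (Some(&first), Some(&last)) => {
            first >= 0
                && last as usize <= values_len
                && offsets.windows(2).all(|w| w[0] <= w[1])
        }
        _ => false, // even an empty list array needs one offset
    }
}

/// Contiguous shuffle: new list `i` is old list `order[i]`.
/// The child values are copied so that the new offsets stay monotonic.
pub fn shuffle_list_contiguous(
    offsets: &[i32],
    values: &[u8],
    order: &[usize],
) -> (Vec<i32>, Vec<u8>) {
    let mut new_offsets = Vec::with_capacity(order.len() + 1);
    let mut new_values = Vec::with_capacity(values.len());
    new_offsets.push(0);
    for &old in order {
        let (start, end) = (offsets[old] as usize, offsets[old + 1] as usize);
        new_values.extend_from_slice(&values[start..end]);
        new_offsets.push(new_values.len() as i32);
    }
    (new_offsets, new_values)
}

/// `ListView`-style shuffle: offsets and sizes are independent per list,
/// so only these two small buffers are permuted and the child values stay put.
pub fn shuffle_list_view(offsets: &[i32], sizes: &[i32], order: &[usize]) -> (Vec<i32>, Vec<i32>) {
    let new_offsets = order.iter().map(|&old| offsets[old]).collect();
    let new_sizes = order.iter().map(|&old| sizes[old]).collect();
    (new_offsets, new_sizes)
}
```

Naively permuting `ListArray` offsets, for instance turning `[0, 2, 3, 6]` into `[3, 0, 2, 6]`, fails the validity check above, which is the problem raised in the review.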


This new and improved `re_format_arrow` ™️ brings two major improvements:
- It is now designed to format standard Arrow dataframes (aka chunks or batches), i.e. a `Schema` and a `Chunk`. In particular: chunk-level and field-level schema metadata will now be rendered properly with the rest of the table.
- Tables larger than your terminal will now do their best to fit in, while making sure to still show just enough data.

E.g. here's an excerpt of a real-world Rerun dataframe from our `helix` example:
```
cargo r -p rerun-cli --no-default-features --features native_viewer -- print helix.rrd --verbose
```

before (`main`):
![image](https://github.com/rerun-io/rerun/assets/2910679/99169b2a-d972-439d-900a-8f122a4d5ca3)

and after:
![image](https://github.com/rerun-io/rerun/assets/2910679/3fe7acce-d646-4ff2-bfae-eb5073d17741)

---

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441

A `TransportChunk` is a `Chunk` that is ready for transport and/or storage. It is very cheap to go from `Chunk` to a `TransportChunk` and vice-versa.

A `TransportChunk` maps 1:1 to a native Arrow `RecordBatch`. It has a stable ABI, and can be cheaply sent across process boundaries. `arrow2` has no `RecordBatch` type; we will get one once we migrate to `arrow-rs`.

A `TransportChunk` is self-describing: it contains all the data _and_ metadata needed to index it into storage.

We rely heavily on chunk-level and field-level metadata to communicate Rerun-specific semantics over the wire, e.g. whether some columns are already properly sorted. The Arrow metadata system is fairly limited -- it's all untyped strings --, but for now that seems good enough. It will be trivial to switch to something else later, if need be.

- Fixes #1760
- Fixes #1692
- Fixes #3360
- Fixes #1696

---

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441

This is a fork of the old `DataTable` batcher, and works very similarly.

Like before, this batcher will micro-batch using both space and time thresholds. There are two main differences:
- This batcher maintains a dataframe per-entity, as opposed to the old one which worked globally.
- Once a threshold is reached, this batcher further splits the incoming batch in order to fulfill these invariants:
  ```rust
  /// In particular, a [`Chunk`] cannot:
  /// * contain data for more than one entity path
  /// * contain rows with different sets of timelines
  /// * use more than one datatype for a given component
  /// * contain more rows than a pre-configured threshold if one or more timelines are unsorted
  ```

Most of the code is the same, the real interesting piece is `PendingRow::many_into_chunks`, as well as the newly added tests.

- Fixes #4431

---

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
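The first two batcher invariants quoted above (one entity path per chunk, one set of timelines per chunk) amount to grouping rows by a composite key. The following is a minimal sketch of that grouping only; `SketchRow` and `split_into_chunks` are hypothetical names, and this is not how `PendingRow::many_into_chunks` is actually implemented. The datatype and row-count invariants are not covered.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, simplified pending row: only the fields needed for grouping.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchRow {
    pub entity_path: String,
    pub timelines: BTreeSet<String>,
    pub row_id: u64,
}

/// Splits a batch so that each output group holds a single entity path
/// and a single set of timelines. Rows keep their arrival order within a group.
pub fn split_into_chunks(rows: Vec<SketchRow>) -> Vec<Vec<SketchRow>> {
    let mut groups: BTreeMap<(String, BTreeSet<String>), Vec<SketchRow>> = BTreeMap::new();
    for row in rows {
        let key = (row.entity_path.clone(), row.timelines.clone());
        groups.entry(key).or_default().push(row);
    }
    groups.into_values().collect()
}
```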

Integrate the new chunk batcher in all SDKs, and get rid of the old one.

On the backend, we make sure to deserialize incoming chunks into the old `DataTable`s, so business can continue as usual.

Although the new batcher has a much more complicated task with all these sub-splits to manage, it is somehow already more performant than the old one 🤷‍♂️:
```bash
# this branch
cargo b -p log_benchmark --release && hyperfine --runs 15 './target/release/log_benchmark --benchmarks points3d_many_individual'
Benchmark 1: ./target/release/log_benchmark --benchmarks points3d_many_individual
  Time (mean ± σ):      4.499 s ±  0.117 s    [User: 5.544 s, System: 1.836 s]
  Range (min … max):    4.226 s …  4.640 s    15 runs

# main
cargo b -p log_benchmark --release && hyperfine --runs 15 './target/release/log_benchmark --benchmarks points3d_many_individual'
Benchmark 1: ./target/release/log_benchmark --benchmarks points3d_many_individual
  Time (mean ± σ):      4.407 s ±  0.773 s    [User: 8.423 s, System: 0.880 s]
  Range (min … max):    2.997 s …  6.148 s    15 runs
```

Notice the massive difference in user time.

---

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441

teh-cmc added the 🏹 arrow, ⛃ re_datastore, do-not-merge, include in changelog, 🔩 data model, and 🪵 Log & send APIs labels May 27, 2024

teh-cmc force-pushed the cmc/dense_chunks_1_intro branch 2 times, most recently from 403f441 to 7276a7c Compare May 27, 2024 15:21

This was referenced May 27, 2024

Client-side chunks 0: improved arrow chunk formatters #6437

Merged

Client-side chunks 2: introduce TransportChunk #6439

Merged

Client-side chunks 3: micro-batching #6440

Merged

Client-side chunks 4: integrations #6441

Merged

teh-cmc marked this pull request as ready for review May 27, 2024 16:29

teh-cmc force-pushed the cmc/dense_chunks_0_better_formatting branch from 948b430 to fd577c0 Compare May 29, 2024 07:28

teh-cmc force-pushed the cmc/dense_chunks_1_intro branch from 7276a7c to 94a5e7f Compare May 29, 2024 07:31

teh-cmc commented May 30, 2024


jleibs reviewed May 30, 2024


teh-cmc removed the do-not-merge label May 31, 2024

Base automatically changed from cmc/dense_chunks_0_better_formatting to main May 31, 2024 08:00

teh-cmc added 5 commits May 31, 2024 10:26

more sizebytes helpers for arrow

d4eedb4

introduce re_chunk and shuffling routines

8cbc815

fix unrelatedly broken bench

9fcdb87

review

a26bcd7

post-rebase lock updates

e22ff58

teh-cmc force-pushed the cmc/dense_chunks_1_intro branch from 1981f31 to e22ff58 Compare May 31, 2024 08:27

more reviewg

b3074a1

teh-cmc merged commit 6d94947 into main May 31, 2024
30 checks passed

teh-cmc deleted the cmc/dense_chunks_1_intro branch May 31, 2024 08:39
