Add FSSTView encoding: a ListView-style FSST array by joseph-isaacs · Pull Request #8167 · vortex-data/vortex

joseph-isaacs · 2026-05-30T15:45:12Z

FSSTView addresses its FSST-compressed codes with separate offsets and
sizes arrays (like ListView) instead of FSST's single monotonic offsets
array (like List/VarBin). Decoupling start from length means offsets need
not be monotonic or contiguous, so filter/take/slice become metadata-only:
they rewrite only the small offsets/sizes/lengths/validity arrays and reuse
the compressed byte heap and symbol table untouched.

This avoids the heap rewrite that plain FSST incurs on filter/take (which
delegate to VarBin), giving the same speed win ListView has over List.
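
To make the offsets+sizes idea concrete, here is a minimal sketch of a view over a shared byte heap. It is not the Vortex implementation: `ByteView` and all of its methods are hypothetical names, the heap holds plain bytes rather than FSST codes, and validity is ignored. It only illustrates that take and filter can rewrite the small metadata vectors while the heap allocation is reused.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an offsets+sizes view over a shared byte heap.
/// This is not the Vortex `FSSTView` type; all names here are illustrative.
#[derive(Clone, Debug)]
pub struct ByteView {
    heap: Arc<[u8]>,   // shared bytes, never rewritten by take/filter
    offsets: Vec<u32>, // start of each element within `heap`
    sizes: Vec<u32>,   // length of each element within `heap`
}

impl ByteView {
    pub fn new(heap: Arc<[u8]>, offsets: Vec<u32>, sizes: Vec<u32>) -> Self {
        assert_eq!(offsets.len(), sizes.len());
        Self { heap, offsets, sizes }
    }

    /// Reorder or repeat elements by rewriting only the offsets/sizes metadata.
    pub fn take(&self, indices: &[usize]) -> Self {
        Self {
            heap: Arc::clone(&self.heap), // reference-count bump, no byte copy
            offsets: indices.iter().map(|&i| self.offsets[i]).collect(),
            sizes: indices.iter().map(|&i| self.sizes[i]).collect(),
        }
    }

    /// Keep the elements whose mask entry is true, again touching metadata only.
    pub fn filter(&self, mask: &[bool]) -> Self {
        assert_eq!(mask.len(), self.offsets.len());
        let kept: Vec<usize> = (0..mask.len()).filter(|&i| mask[i]).collect();
        self.take(&kept)
    }

    /// Bytes of element `i`, located via its offset+size pair.
    pub fn get(&self, i: usize) -> &[u8] {
        let start = self.offsets[i] as usize;
        &self.heap[start..start + self.sizes[i] as usize]
    }

    pub fn len(&self) -> usize {
        self.offsets.len()
    }

    /// True when both views point at the same heap allocation.
    pub fn shares_heap_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.heap, &other.heap)
    }
}
```

After `take(&[2, 0, 2])` the offsets are neither monotonic nor unique, which a single List/VarBin-style offsets array cannot express without rewriting the heap.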

- New `vortex.fsstview` encoding in the fsst crate, reusing FSSTData for the
  symbol table + compressed byte heap. Children are declared with the
  `#[array_slots(FSSTView)]` proc macro (uncompressed_lengths, codes_offsets,
  codes_sizes, codes_validity).
- Metadata-only FilterKernel, TakeExecute, and SliceReduce.
- scalar_at decodes a single element via its offset+size slice.
- Canonicalization gathers the live codes (possibly out-of-order) and
  bulk-decompresses into a VarBinView.
- `fsstview_from_fsst` zero-copy conversion from an FSST array.
- Registered in `register_default_encodings`.
- Tests: canonical/filter/take/slice equivalence vs FSST, scalar_at, and
  filter/take/consistency conformance for nullable and non-nullable data.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>


…bench

Adds the second hop and the canonicalization decision for the FSSTView pipeline, plus a benchmark that measures the trade-off directly.

- `fsst_filter_to_view` / `fsst_take_to_view`: reinterpret an FSSTArray as an FSSTView (sharing symbols + codes bytes) and apply the metadata-only kernel, so filtering/taking an FSSTArray never rewrites the compressed byte heap.
- Canonicalization now chooses a compaction strategy (FsstViewCompaction):
  - Direct: live codes still contiguous/in-order (untouched or sliced view) -> one bulk decompress, no copy.
  - GatherBulk ("compact"): copy the scattered live codes contiguous, then one bulk decompress. Wins when strings are short/numerous (per-call overhead dominates otherwise; the gather is cheap and unlocks bulk SIMD).
  - PerElement ("no compact"): decompress each element's slice in place, no copy. Wins when strings are long/few (the gather copy dominates).

  Auto picks Direct when contiguous, else GatherBulk/PerElement by average compressed bytes/element. `canonicalize_fsstview_with` exposes each strategy for benchmarking.
- benches/fsst_view_compute.rs: calls kernels directly (no dispatch) and measures each part. filter (selective/non-selective), take (shuffle / selective / dense), and a filter+take combo, over two ~2 MiB inputs (many short strings, fewer long strings). fsst pipeline compacts into a fresh FSSTArray each step then canonicalizes; fsstview pipeline stays metadata-only then canonicalizes under each compaction strategy.
- Tests: from_fsst helpers vs canonical, and all compaction strategies agree on both contiguous and scattered views.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

The fsst_view_compute benchmark (two ~2 MiB inputs, ~12-byte and ~256-byte strings) shows GatherBulk beats PerElement across the entire tested range, not just for short strings as originally guessed. FSST's decoder has a fast 8-wide body and a slow byte-by-byte tail; PerElement pays that tail once per element while GatherBulk pays it once for the whole heap, which dominates the gather memcpy even at 256-byte strings.

Selected medians (canonicalize after the metadata-only hop):

- take few_long/shuffle: gather 459us vs per_element 623us
- take few_long/dense: gather 838us vs per_element 981us
- filter many_short/nonsel: gather 5.38ms vs per_element 5.92ms

And the metadata-only hop itself is far cheaper than compacting FSST:

- take_step many_short/shuffle: view 650us vs fsst 2.84ms (~4x)
- take_step many_short/dense: view 604us vs fsst 4.15ms (~7x)

So Auto now picks Direct when the live codes are contiguous and GatherBulk otherwise; it never selects PerElement (kept selectable for measurement, wins only in the few-very-long-strings extreme outside real columns). Drops the SHORT_STRING_THRESHOLD heuristic and updates the docs to the measured behavior.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
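
The resulting Auto rule (Direct when the live codes are contiguous and in order, GatherBulk otherwise) can be sketched as a single pass over the offsets and sizes. This is an illustration under assumptions, not the code in the PR: `Compaction`, `is_contiguous_in_order`, and `choose_auto` are hypothetical names, and validity and empty elements get no special handling.

```rust
/// Hypothetical mirror of the two strategies Auto selects between.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Compaction {
    /// Live codes form one in-order run: bulk decode in place, no copy.
    Direct,
    /// Live codes are scattered: gather them contiguously, then bulk decode.
    GatherBulk,
}

/// True when every element starts exactly where the previous one ended.
/// A sliced view (first offset > 0) still counts as contiguous.
pub fn is_contiguous_in_order(offsets: &[u32], sizes: &[u32]) -> bool {
    assert_eq!(offsets.len(), sizes.len());
    offsets
        .windows(2)
        .zip(sizes)
        // Widen to u64 so offset + size cannot overflow.
        .all(|(pair, &size)| u64::from(pair[0]) + u64::from(size) == u64::from(pair[1]))
}

/// Sketch of the Auto decision described above.
pub fn choose_auto(offsets: &[u32], sizes: &[u32]) -> Compaction {
    if is_contiguous_in_order(offsets, sizes) {
        Compaction::Direct
    } else {
        Compaction::GatherBulk
    }
}
```

An untouched or sliced view such as offsets `[3, 6]` with sizes `[3, 3]` selects `Direct`; a shuffle (`[6, 0]`) or a filter that leaves a gap (`[0, 6]`) selects `GatherBulk`.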

- fsstview_from_fsst now reuses the FSST offsets buffer for codes_offsets via a zero-copy slice of its first `len` elements, instead of re-copying into a new Vec. Only the derived sizes array is freshly allocated.
- Add chain_pipeline_{fsst,view} benches: a 5-op alternating filter/take chain ending in a canonicalize. This is where the view model is meant to win: each fsst op re-compacts the byte heap (cost compounds with chain length), while the view converts once and chains metadata-only ops, deferring the single gather+decode to the end.

Measured medians (100 samples):

- FewLong: fsst 765us -> view 481us (1.6x)
- ManyShort: fsst 14.49ms -> view 9.64ms (1.5x)

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
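
The conversion described in the first bullet can be sketched with plain slices: a monotonic offsets array of `len + 1` entries already contains every element's start in its first `len` entries, so only the sizes need computing. `split_offsets` is a hypothetical helper and borrowed slices stand in for Vortex buffers; the real `fsstview_from_fsst` may differ.

```rust
/// Split a monotonic, List-style offsets array of length `len + 1` into
/// per-element starts (borrowed, no copy) and freshly allocated sizes.
/// Hypothetical helper for illustration only.
pub fn split_offsets(offsets: &[u32]) -> (&[u32], Vec<u32>) {
    let len = offsets.len().saturating_sub(1);
    // The first `len` offsets are the element starts: reuse them as-is.
    let starts = &offsets[..len];
    // Each size is the distance to the next offset; this is the only allocation.
    let sizes = offsets.windows(2).map(|pair| pair[1] - pair[0]).collect();
    (starts, sizes)
}
```

For offsets `[0, 3, 3, 10]` this yields starts `[0, 3, 3]` and sizes `[3, 0, 7]`, with the starts pointing into the original buffer.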

… finding

Implements the "compact like a list / export paired slices into a VarBinView" idea: decode contiguous heap runs straight into a heap-ordered buffer and point VarBinView views back into it out of order, with no gather copy and duplicate dedup. Wired as FsstViewCompaction::RunCoalesce, hash-free (sort-based), handles nulls/empties/duplicates; covered by an adversarial gaps+shuffle+nullable test and the all-strategies-agree test.

Benchmark verdict: it loses to GatherBulk everywhere, badly for short strings (take many_short/shuffle ~18ms vs ~5.6ms). The random access you avoid at decode time reappears at view-build time: views are built in element order over a heap-ordered output, so make_view does N cache-missing random reads (and random inlining copies for <=12-byte strings), plus an O(N log N) sort. GatherBulk's output is element-ordered, so its view-build is sequential; the cheap sequential gather memcpy beats the scattered view construction.

So Auto keeps using Direct (contiguous) / GatherBulk (otherwise) and never picks RunCoalesce; it's retained as a selectable, measurable baseline. Docs updated with the full reasoning.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

…nches

Callgrind on a shuffle take showed the kernel's cost was dominated by running take -> fill_null -> cast -> optimize three times (offsets/sizes/lengths). The fill_null + cast is only needed when a null index could introduce a null, i.e. when the indices are nullable. For non-nullable indices (the common case) the children stay non-nullable, so we now skip fill_null entirely. Re-profiling confirms fill_null (~450K ir) and its cast (~252K ir) drop out and the take kernel falls from ~612K to ~474K instructions per call.

Also add take_op_only_view / filter_op_only_view benches that hoist the one-time FSST->view conversion out of the timed loop, isolating the metadata-only op. These show the op is constant-time regardless of size or selectivity (~457 ns filter, ~657 ns take), like a ListView op. The earlier "view loses on selective" was purely the O(n) conversion being charged to every op, which only the first op of a chain actually pays.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

…ge idea

Adds FsstViewByteStats / fsstview_byte_stats reporting, in both compressed (code) and uncompressed (decoded) space: live vs run-spanned vs whole-heap bytes, distinct spans, run count, and the dead-byte waste a gap-merged decode would carry. A byte_stats_report test prints it for a selective filter and a shuffle take (run with --nocapture).

This quantifies why merging across gaps to keep decode runs long doesn't pay:

- filter_10pct (keep ~10% of 65536):
  - runs=5945 over 6616 survivors (avg ~1.1 elem/run -> survivors are isolated)
  - compressed: live=25.8KB, heap=255KB
  - full-heap-merge waste = 89.9% (you'd decode ~10x the needed compressed bytes)
- shuffle_take (reorder all):
  - runs=1, waste=0% (RunCoalesce's ideal), yet it still loses on time to GatherBulk because the random access just moves to view-build.

So the dead-value budget the gap-merge idea needs is blown immediately on a selective filter (90% dead), and on the one input where merging is free (shuffle, 0% dead) GatherBulk still wins. There's also a hard blocker: after a filter the dead elements' uncompressed_lengths are gone and FSST decode only returns a total written count, so a single gap-merged decode can't even locate post-gap survivors.

Conclusion: GatherBulk (zero waste) / Direct (contiguous) remain the right canonicalization; the stats make the trade-off measurable.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
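
As a rough illustration of the compressed-space part of these statistics, the sketch below counts live bytes and heap-order runs and derives the full-heap-merge waste. It is not the `FsstViewByteStats` code from the PR: the struct, its fields, and `byte_stats` are hypothetical, runs are counted over heap-sorted distinct spans (an assumption inferred from the shuffle reporting runs=1), and distinct spans are assumed not to partially overlap.

```rust
/// Hypothetical summary of how much of the heap a view still references.
#[derive(Debug, PartialEq)]
pub struct ByteStats {
    pub live_bytes: u64, // bytes covered by distinct live spans
    pub heap_bytes: u64, // total bytes in the shared heap
    pub runs: usize,     // maximal groups of heap-adjacent live spans
}

impl ByteStats {
    /// Fraction of decoded input that would be dead if the whole heap were
    /// decoded in one pass instead of only the live spans.
    pub fn full_heap_merge_waste(&self) -> f64 {
        if self.heap_bytes == 0 {
            return 0.0;
        }
        1.0 - self.live_bytes as f64 / self.heap_bytes as f64
    }
}

pub fn byte_stats(offsets: &[u32], sizes: &[u32], heap_bytes: u64) -> ByteStats {
    assert_eq!(offsets.len(), sizes.len());
    // Collect non-empty [start, end) spans, then order them by heap position.
    let mut spans: Vec<(u64, u64)> = offsets
        .iter()
        .zip(sizes)
        .map(|(&o, &s)| (u64::from(o), u64::from(o) + u64::from(s)))
        .filter(|&(start, end)| end > start)
        .collect();
    spans.sort_unstable();
    spans.dedup(); // duplicates from a take count once

    let mut live_bytes = 0;
    let mut runs = 0;
    let mut prev_end = None;
    for (start, end) in spans {
        live_bytes += end - start;
        if prev_end != Some(start) {
            runs += 1; // a gap (or the first span) starts a new run
        }
        prev_end = Some(end);
    }
    ByteStats { live_bytes, heap_bytes, runs }
}
```

Plugging the filter_10pct figures into the same formula, live=25.8KB over heap=255KB gives 1 - 25.8/255 ≈ 0.899, which matches the 89.9% waste reported above.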

The only overhead GatherBulk carries over the theoretical minimum is the gather memcpy, and it was copying every element's span individually. For an order-preserving filter, surviving neighbours are still heap-adjacent, so a run of k survivors can be copied in one memcpy instead of k. The gather now accumulates a contiguous [run_start, run_end) heap range and flushes it once per run, making the copy cost proportional to the number of runs rather than the number of elements.

This is a strict win where survivors form long runs (non-selective filter: many_short/nonselective canonicalize ~5.38ms -> ~4.75ms) and a no-op for a shuffle (no adjacency -> one copy per element as before, behind a cheap branch). Combined with Direct (single contiguous run, zero copy), the export is now optimal: gather work scales with run count, then one bulk decode, then a sequential element-ordered view-build.

Correctness: spans are still emitted in element order, so the decoded buffer stays element-ordered; coalescing only fires on genuine zero-gap adjacency. Covered by the existing all-strategies-agree and gaps+shuffle+nullable tests.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
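
The run-coalescing gather can be sketched as follows. This is a simplified illustration rather than the PR's code: `gather_coalesced` is a hypothetical name, it works on plain byte slices, and it returns the number of copies only so the effect of coalescing is observable.

```rust
/// Gather spans in element order into one contiguous buffer, copying each
/// maximal run of heap-adjacent spans with a single `extend_from_slice`.
/// Returns the gathered bytes and the number of copies performed.
/// Hypothetical helper for illustration only.
pub fn gather_coalesced(heap: &[u8], offsets: &[u32], sizes: &[u32]) -> (Vec<u8>, usize) {
    assert_eq!(offsets.len(), sizes.len());
    let total: usize = sizes.iter().map(|&s| s as usize).sum();
    let mut out = Vec::with_capacity(total);
    let mut copies = 0;
    // Pending heap range [run_start, run_end) that has not been copied yet.
    let mut run: Option<(usize, usize)> = None;

    for (&offset, &size) in offsets.iter().zip(sizes) {
        let start = offset as usize;
        let end = start + size as usize;
        match run {
            // Zero-gap adjacency: extend the pending run instead of copying.
            Some((run_start, run_end)) if run_end == start => run = Some((run_start, end)),
            // Gap or reordering: flush the pending run, then start a new one.
            Some((run_start, run_end)) => {
                out.extend_from_slice(&heap[run_start..run_end]);
                copies += 1;
                run = Some((start, end));
            }
            None => run = Some((start, end)),
        }
    }
    // Flush whatever run is still pending after the last element.
    if let Some((run_start, run_end)) = run {
        out.extend_from_slice(&heap[run_start..run_end]);
        copies += 1;
    }
    (out, copies)
}
```

With heap `aaabbbcccddd`, an order-preserving filter that keeps elements 0, 1, and 3 needs two copies for three elements, while a shuffle falls back to one copy per element. In both cases the output stays in element order because runs are flushed in the order they are encountered.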

codspeed-hq · 2026-05-30T15:52:51Z

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 1 improved benchmark
❌ 1 regressed benchmark
✅ 1273 untouched benchmarks
🆕 88 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Simulation | `chunked_varbinview_canonical_into[(100, 100)]` | 307.9 µs | 273.1 µs | +12.72% |
| ❌ | Simulation | `chunked_varbinview_opt_canonical_into[(1000, 10)]` | 188.1 µs | 225.4 µs | -16.56% |
| 🆕 | Simulation | `chain_pipeline_fsst[FewLong]` | N/A | 5.1 ms | N/A |
| 🆕 | Simulation | `filter_op_only_view[few_long/selective_10pct]` | N/A | 12.1 µs | N/A |
| 🆕 | Simulation | `chain_pipeline_fsst[ManyShort]` | N/A | 45.5 ms | N/A |
| 🆕 | Simulation | `chain_pipeline_view[ManyShort]` | N/A | 36.4 ms | N/A |
| 🆕 | Simulation | `filter_op_only_view[many_short/nonselective_90pct]` | N/A | 11.5 µs | N/A |
| 🆕 | Simulation | `filter_pipeline_fsst[few_long/nonselective_90pct]` | N/A | 3.7 ms | N/A |
| 🆕 | Simulation | `filter_pipeline_fsst[few_long/selective_10pct]` | N/A | 554.4 µs | N/A |
| 🆕 | Simulation | `filter_pipeline_view[few_long/nonselective_90pct/auto]` | N/A | 3.3 ms | N/A |
| 🆕 | Simulation | `filter_pipeline_view[few_long/nonselective_90pct/per_element]` | N/A | 3.4 ms | N/A |
| 🆕 | Simulation | `filter_pipeline_view[few_long/selective_10pct/auto]` | N/A | 624.8 µs | N/A |
| 🆕 | Simulation | `chain_pipeline_view[FewLong]` | N/A | 3.1 ms | N/A |
| 🆕 | Simulation | `combo_pipeline_fsst[FewLong]` | N/A | 647.6 µs | N/A |
| 🆕 | Simulation | `combo_pipeline_fsst[ManyShort]` | N/A | 4.5 ms | N/A |
| 🆕 | Simulation | `combo_pipeline_view[FewLong]` | N/A | 671.9 µs | N/A |
| 🆕 | Simulation | `combo_pipeline_view[ManyShort]` | N/A | 7.7 ms | N/A |
| 🆕 | Simulation | `filter_op_only_view[few_long/nonselective_90pct]` | N/A | 10.6 µs | N/A |
| 🆕 | Simulation | `filter_op_only_view[many_short/selective_10pct]` | N/A | 12.5 µs | N/A |
| 🆕 | Simulation | `filter_pipeline_fsst[many_short/nonselective_90pct]` | N/A | 13.2 ms | N/A |
| ... | ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

Comparing `claude/fsstview-array-listview-TdW45` (cd88533) with `develop` (23ebab1)

claude added 8 commits May 30, 2026 09:43

joseph-isaacs closed this May 30, 2026

joseph-isaacs wants to merge 8 commits into `develop` from `claude/fsstview-array-listview-TdW45`
