Add FSSTView encoding: a ListView-style FSST array #8167

Closed
joseph-isaacs wants to merge 8 commits into develop from claude/fsstview-array-listview-TdW45

@joseph-isaacs

FSSTView addresses its FSST-compressed codes with separate offsets and
sizes arrays (like ListView) instead of FSST's single monotonic offsets
array (like List/VarBin). Decoupling start from length means offsets need
not be monotonic or contiguous, so filter/take/slice become metadata-only:
they rewrite only the small offsets/sizes/lengths/validity arrays and reuse
the compressed byte heap and symbol table untouched.

This avoids the heap rewrite that plain FSST incurs on filter/take (which
delegate to VarBin), giving the same speed win ListView has over List.
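
As a rough illustration of the offsets+sizes addressing described above, here is a minimal std-only sketch. It is not the Vortex implementation: `ToyView` and its methods are hypothetical names, plain bytes stand in for FSST codes, and validity and the symbol table are left out. It only shows that take and filter can rewrite the small per-element arrays while sharing the heap.

```rust
use std::sync::Arc;

/// Toy stand-in for a ListView-style layout (hypothetical, not a Vortex type):
/// one shared byte heap plus a per-element (offset, size) pair.
#[derive(Clone, Debug)]
struct ToyView {
    heap: Arc<[u8]>,
    offsets: Vec<usize>,
    sizes: Vec<usize>,
}

impl ToyView {
    /// Lay the items out back to back in a fresh heap.
    fn from_strs(items: &[&str]) -> Self {
        let mut heap = Vec::new();
        let mut offsets = Vec::new();
        let mut sizes = Vec::new();
        for s in items {
            offsets.push(heap.len());
            sizes.push(s.len());
            heap.extend_from_slice(s.as_bytes());
        }
        ToyView { heap: heap.into(), offsets, sizes }
    }

    /// Read one element through its offset+size slice.
    fn get(&self, i: usize) -> &[u8] {
        let start = self.offsets[i];
        &self.heap[start..start + self.sizes[i]]
    }

    /// take: only offsets/sizes are rewritten; the heap is shared
    /// (cloning the Arc bumps a refcount, it does not copy bytes).
    fn take(&self, indices: &[usize]) -> Self {
        ToyView {
            heap: Arc::clone(&self.heap),
            offsets: indices.iter().map(|&i| self.offsets[i]).collect(),
            sizes: indices.iter().map(|&i| self.sizes[i]).collect(),
        }
    }

    /// filter: expressed here as a take of the surviving positions.
    fn filter(&self, mask: &[bool]) -> Self {
        let keep: Vec<usize> = mask
            .iter()
            .enumerate()
            .filter_map(|(i, &m)| m.then_some(i))
            .collect();
        self.take(&keep)
    }
}
```

After `take(&[2, 0, 0])` the offsets are neither monotonic nor distinct, which a single monotonic offsets array could not represent without rewriting the heap.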

  • New vortex.fsstview encoding in the fsst crate, reusing FSSTData for the
    symbol table + compressed byte heap. Children are declared with the
    #[array_slots(FSSTView)] proc macro (uncompressed_lengths, codes_offsets,
    codes_sizes, codes_validity).
  • Metadata-only FilterKernel, TakeExecute, and SliceReduce.
  • scalar_at decodes a single element via its offset+size slice.
  • Canonicalization gathers the live codes (possibly out-of-order) and
    bulk-decompresses into a VarBinView.
  • fsstview_from_fsst zero-copy conversion from an FSST array.
  • Registered in register_default_encodings.
  • Tests: canonical/filter/take/slice equivalence vs FSST, scalar_at, and
    filter/take/consistency conformance for nullable and non-nullable data.
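
The conversion from a single monotonic offsets array to separate offsets and sizes can be sketched as follows. This is an assumption-laden illustration, not the body of `fsstview_from_fsst`: `split_offsets` is a hypothetical helper, `u32` is an arbitrary offset type, and the input is assumed to be monotonic with `n + 1` entries for `n` elements.

```rust
/// Split a List/VarBin-style offsets array (n + 1 monotonic entries) into
/// ListView-style (offsets, sizes) describing the same n spans.
/// Hypothetical helper for illustration only.
fn split_offsets(monotonic: &[u32]) -> (&[u32], Vec<u32>) {
    // n elements are described by n + 1 boundaries.
    let n = monotonic.len().saturating_sub(1);
    // size[i] = offsets[i + 1] - offsets[i]; underflows (and panics in debug
    // builds) if the input is not monotonic.
    let sizes = monotonic.windows(2).map(|w| w[1] - w[0]).collect();
    // The first n boundaries are already the start offsets, so they can be
    // borrowed rather than copied.
    (&monotonic[..n], sizes)
}
```

Only the sizes need a fresh allocation in this sketch; the start offsets are a prefix of the existing array.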

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

claude added 8 commits May 30, 2026 09:43
…bench

Adds the second hop and the canonicalization decision for the FSSTView
pipeline, plus a benchmark that measures the trade-off directly.

- `fsst_filter_to_view` / `fsst_take_to_view`: reinterpret an FSSTArray as an
  FSSTView (sharing symbols + codes bytes) and apply the metadata-only kernel,
  so filtering/taking an FSSTArray never rewrites the compressed byte heap.

- Canonicalization now chooses a compaction strategy (FsstViewCompaction):
  - Direct: live codes still contiguous/in-order (untouched or sliced view) ->
    one bulk decompress, no copy.
  - GatherBulk ("compact"): copy the scattered live codes contiguous, then one
    bulk decompress. Wins when strings are short/numerous (per-call overhead
    dominates otherwise; the gather is cheap and unlocks bulk SIMD).
  - PerElement ("no compact"): decompress each element's slice in place, no
    copy. Wins when strings are long/few (the gather copy dominates).
  Auto picks Direct when contiguous, else GatherBulk/PerElement by average
  compressed bytes/element. `canonicalize_fsstview_with` exposes each strategy
  for benchmarking.

- benches/fsst_view_compute.rs: calls kernels directly (no dispatch) and
  measures each part. filter (selective/non-selective), take (shuffle /
  selective / dense), and a filter+take combo, over two ~2 MiB inputs (many
  short strings, fewer long strings). fsst pipeline compacts into a fresh
  FSSTArray each step then canonicalizes; fsstview pipeline stays metadata-only
  then canonicalizes under each compaction strategy.

- Tests: from_fsst helpers vs canonical, and all compaction strategies agree
  on both contiguous and scattered views.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The fsst_view_compute benchmark (two ~2 MiB inputs, ~12-byte and ~256-byte
strings) shows GatherBulk beats PerElement across the entire tested range, not
just for short strings as originally guessed. FSST's decoder has a fast 8-wide
body and a slow byte-by-byte tail; PerElement pays that tail once per element
while GatherBulk pays it once for the whole heap, which dominates the gather
memcpy even at 256-byte strings.

Selected medians (canonicalize after the metadata-only hop):
  take few_long/shuffle:    gather 459us  vs per_element 623us
  take few_long/dense:      gather 838us  vs per_element 981us
  filter many_short/nonsel: gather 5.38ms vs per_element 5.92ms

And the metadata-only hop itself is far cheaper than compacting FSST:
  take_step many_short/shuffle: view 650us vs fsst 2.84ms (~4x)
  take_step many_short/dense:   view 604us vs fsst 4.15ms (~7x)

So Auto now picks Direct when the live codes are contiguous and GatherBulk
otherwise; it never selects PerElement (kept selectable for measurement, wins
only in the few-very-long-strings extreme outside real columns). Drops the
SHORT_STRING_THRESHOLD heuristic and updates the docs to the measured behavior.
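
A minimal sketch of the resulting Auto rule, assuming "contiguous" means each element starts exactly where the previous one ends. The enum and function names here are hypothetical, and the real check may treat nulls or empty elements differently.

```rust
/// The two strategies Auto can select after this change (toy mirror of the
/// names used above, not the crate's actual enum).
#[derive(Debug, PartialEq, Eq)]
enum Compaction {
    Direct,
    GatherBulk,
}

/// Pick Direct when the live spans are in order with no gaps, else GatherBulk.
fn choose_compaction(offsets: &[usize], sizes: &[usize]) -> Compaction {
    assert_eq!(offsets.len(), sizes.len());
    // For every adjacent pair, element i must end where element i + 1 begins.
    let contiguous = offsets
        .windows(2)
        .zip(sizes)
        .all(|(w, &size)| w[0] + size == w[1]);
    if contiguous {
        Compaction::Direct
    } else {
        Compaction::GatherBulk
    }
}
```

An untouched or sliced view passes the check; a selective filter (gaps) or a shuffle take (out of order) does not.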

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
- fsstview_from_fsst now reuses the FSST offsets buffer for codes_offsets via a
  zero-copy slice of its first `len` elements, instead of re-copying into a new
  Vec. Only the derived sizes array is freshly allocated.

- Add chain_pipeline_{fsst,view} benches: a 5-op alternating filter/take chain
  ending in a canonicalize. This is where the view model is meant to win — each
  fsst op re-compacts the byte heap (cost compounds with chain length), while
  the view converts once and chains metadata-only ops, deferring the single
  gather+decode to the end.

  Measured medians (100 samples):
    FewLong:   fsst 765us  -> view 481us  (1.6x)
    ManyShort: fsst 14.49ms -> view 9.64ms (1.5x)

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
… finding

Implements the "compact like a list / export paired slices into a VarBinView"
idea: decode contiguous heap runs straight into a heap-ordered buffer and point
VarBinView views back into it out of order, with no gather copy and with
duplicate spans deduplicated. It is wired as FsstViewCompaction::RunCoalesce, is
hash-free (sort-based), and handles nulls/empties/duplicates; it is covered by an
adversarial gaps+shuffle+nullable test and the all-strategies-agree test.

Benchmark verdict: it loses to GatherBulk everywhere, badly for short strings
(take many_short/shuffle ~18ms vs ~5.6ms). The random access you avoid at decode
time reappears at view-build time: views are built in element order over a
heap-ordered output, so make_view does N cache-missing random reads (and random
inlining copies for <=12-byte strings), plus an O(N log N) sort. GatherBulk's
output is element-ordered, so its view-build is sequential; the cheap sequential
gather memcpy beats the scattered view construction.

So Auto keeps using Direct (contiguous) / GatherBulk (otherwise) and never picks
RunCoalesce; it's retained as a selectable, measurable baseline. Docs updated
with the full reasoning.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…nches

Callgrind on a shuffle take showed the kernel's cost was dominated by running
take -> fill_null -> cast -> optimize three times (offsets/sizes/lengths). The
fill_null + cast is only needed when a null index could introduce a null, i.e.
when the indices are nullable. For non-nullable indices (the common case) the
children stay non-nullable, so we now skip fill_null entirely. Re-profiling
confirms fill_null (~450K Ir) and its cast (~252K Ir) drop out and the take
kernel falls from ~612K to ~474K instructions per call.

Also add take_op_only_view / filter_op_only_view benches that hoist the
one-time FSST->view conversion out of the timed loop, isolating the
metadata-only op. These show the op is constant-time regardless of size or
selectivity (~457 ns filter, ~657 ns take), like a ListView op. The earlier
"view loses on selective" result came purely from charging the O(n) conversion
to every op, when only the first op of a chain actually pays it.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…ge idea

Adds FsstViewByteStats / fsstview_byte_stats reporting, in both compressed (code)
and uncompressed (decoded) space: live vs run-spanned vs whole-heap bytes,
distinct spans, run count, and the dead-byte waste a gap-merged decode would
carry. A byte_stats_report test prints it for a selective filter and a shuffle
take (run with --nocapture).

This quantifies why merging across gaps to keep decode runs long doesn't pay:

  filter_10pct (keep ~10% of 65536):
    runs=5945 over 6616 survivors (avg ~1.1 elem/run -> survivors are isolated)
    compressed: live=25.8KB, heap=255KB
    full-heap-merge waste = 89.9%  (you'd decode ~10x the needed compressed bytes)

  shuffle_take (reorder all):
    runs=1, waste=0%  (RunCoalesce's ideal) -- yet it still loses on time to
    GatherBulk because the random access just moves to view-build.

So the dead-value budget the gap-merge idea needs is blown immediately on a
selective filter (90% dead), and on the one input where merging is free
(shuffle, 0% dead) GatherBulk still wins. There's also a hard blocker: after a
filter the dead elements' uncompressed_lengths are gone and FSST decode only
returns a total written count, so a single gap-merged decode can't even locate
post-gap survivors. Conclusion: GatherBulk (zero waste) / Direct (contiguous)
remain the right canonicalization; the stats make the trade-off measurable.
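
The waste figure above is simple arithmetic over the offsets and sizes. The sketch below is a simplified stand-in for the reported stats, not `fsstview_byte_stats` itself: `ToyStats` and `toy_stats` are hypothetical, only compressed-space numbers are modelled, and spans are assumed not to overlap or repeat.

```rust
/// Simplified, hypothetical version of the byte stats discussed above.
#[derive(Debug, PartialEq)]
struct ToyStats {
    live_bytes: usize,
    runs: usize,
    heap_bytes: usize,
}

impl ToyStats {
    /// Fraction of the heap a whole-heap decode would process needlessly.
    fn waste_fraction(&self) -> f64 {
        if self.heap_bytes == 0 {
            0.0
        } else {
            1.0 - self.live_bytes as f64 / self.heap_bytes as f64
        }
    }
}

/// Count live bytes and runs of heap-adjacent elements, in element order.
fn toy_stats(offsets: &[usize], sizes: &[usize], heap_bytes: usize) -> ToyStats {
    let live_bytes = sizes.iter().sum();
    let mut runs = 0;
    let mut prev_end: Option<usize> = None;
    for (&offset, &size) in offsets.iter().zip(sizes) {
        // A new run starts whenever this element does not begin exactly
        // where the previous one ended.
        if prev_end != Some(offset) {
            runs += 1;
        }
        prev_end = Some(offset + size);
    }
    ToyStats { live_bytes, runs, heap_bytes }
}
```

Plugging in the filter_10pct figures (live 25.8KB of a 255KB heap) gives 1 - 25.8/255 ≈ 0.899, matching the 89.9% reported above.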

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The only overhead GatherBulk carries over the theoretical minimum is the gather
memcpy, and it was copying every element's span individually. For an
order-preserving filter, surviving neighbours are still heap-adjacent, so a run
of k survivors can be copied in one memcpy instead of k. The gather now
accumulates a contiguous [run_start, run_end) heap range and flushes it once per
run, making the copy cost proportional to the number of runs rather than the
number of elements.
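
The run-coalescing gather can be sketched like this. It is an illustrative std-only version, not the code in this PR: the function names are hypothetical, plain bytes stand in for FSST codes, and the returned copy count exists only to make the effect observable.

```rust
/// Copy heap[start..end] into `out` if the run is non-empty.
fn flush(heap: &[u8], start: usize, end: usize, out: &mut Vec<u8>, copies: &mut usize) {
    if end > start {
        out.extend_from_slice(&heap[start..end]);
        *copies += 1;
    }
}

/// Gather live spans into a contiguous, element-ordered buffer, issuing one
/// copy per run of heap-adjacent elements rather than one per element.
/// Returns the gathered bytes and the number of copies performed.
fn gather_coalesced(heap: &[u8], offsets: &[usize], sizes: &[usize]) -> (Vec<u8>, usize) {
    let mut out = Vec::with_capacity(sizes.iter().sum());
    let mut copies = 0;
    // Current pending run [run_start, run_end); empty until the first element.
    let (mut run_start, mut run_end) = (0usize, 0usize);
    for (&offset, &size) in offsets.iter().zip(sizes) {
        if offset != run_end {
            // Not adjacent to the pending run: flush it and start a new one.
            flush(heap, run_start, run_end, &mut out, &mut copies);
            run_start = offset;
        }
        // Adjacent (or freshly started): extend the run over this element.
        run_end = offset + size;
    }
    flush(heap, run_start, run_end, &mut out, &mut copies);
    (out, copies)
}
```

Because runs are flushed in the order elements are visited, the output stays element-ordered; a shuffle simply degenerates to one copy per element.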

This is a strict win where survivors form long runs (non-selective filter:
many_short/nonselective canonicalize ~5.38ms -> ~4.75ms) and a no-op for a
shuffle (no adjacency -> one copy per element as before, behind a cheap branch).
Combined with Direct (single contiguous run, zero copy), the export is now
optimal: gather work scales with run count, then one bulk decode, then a
sequential element-ordered view-build.

Correctness: spans are still emitted in element order, so the decoded buffer
stays element-ordered; coalescing only fires on genuine zero-gap adjacency.
Covered by the existing all-strategies-agree and gaps+shuffle+nullable tests.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@codspeed-hq

codspeed-hq Bot commented May 30, 2026

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 1 improved benchmark
❌ 1 regressed benchmark
✅ 1273 untouched benchmarks
🆕 88 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| | Simulation | chunked_varbinview_canonical_into[(100, 100)] | 307.9 µs | 273.1 µs | +12.72% |
| | Simulation | chunked_varbinview_opt_canonical_into[(1000, 10)] | 188.1 µs | 225.4 µs | -16.56% |
| 🆕 | Simulation | chain_pipeline_fsst[FewLong] | N/A | 5.1 ms | N/A |
| 🆕 | Simulation | filter_op_only_view[few_long/selective_10pct] | N/A | 12.1 µs | N/A |
| 🆕 | Simulation | chain_pipeline_fsst[ManyShort] | N/A | 45.5 ms | N/A |
| 🆕 | Simulation | chain_pipeline_view[ManyShort] | N/A | 36.4 ms | N/A |
| 🆕 | Simulation | filter_op_only_view[many_short/nonselective_90pct] | N/A | 11.5 µs | N/A |
| 🆕 | Simulation | filter_pipeline_fsst[few_long/nonselective_90pct] | N/A | 3.7 ms | N/A |
| 🆕 | Simulation | filter_pipeline_fsst[few_long/selective_10pct] | N/A | 554.4 µs | N/A |
| 🆕 | Simulation | filter_pipeline_view[few_long/nonselective_90pct/auto] | N/A | 3.3 ms | N/A |
| 🆕 | Simulation | filter_pipeline_view[few_long/nonselective_90pct/per_element] | N/A | 3.4 ms | N/A |
| 🆕 | Simulation | filter_pipeline_view[few_long/selective_10pct/auto] | N/A | 624.8 µs | N/A |
| 🆕 | Simulation | chain_pipeline_view[FewLong] | N/A | 3.1 ms | N/A |
| 🆕 | Simulation | combo_pipeline_fsst[FewLong] | N/A | 647.6 µs | N/A |
| 🆕 | Simulation | combo_pipeline_fsst[ManyShort] | N/A | 4.5 ms | N/A |
| 🆕 | Simulation | combo_pipeline_view[FewLong] | N/A | 671.9 µs | N/A |
| 🆕 | Simulation | combo_pipeline_view[ManyShort] | N/A | 7.7 ms | N/A |
| 🆕 | Simulation | filter_op_only_view[few_long/nonselective_90pct] | N/A | 10.6 µs | N/A |
| 🆕 | Simulation | filter_op_only_view[many_short/selective_10pct] | N/A | 12.5 µs | N/A |
| 🆕 | Simulation | filter_pipeline_fsst[many_short/nonselective_90pct] | N/A | 13.2 ms | N/A |
| ... | ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/fsstview-array-listview-TdW45 (cd88533) with develop (23ebab1)
