Add FSSTView encoding: a ListView-style FSST array #8167
Closed
joseph-isaacs wants to merge 8 commits into
FSSTView addresses its FSST-compressed codes with separate `offsets` and `sizes` arrays (like ListView) instead of FSST's single monotonic offsets array (like List/VarBin). Decoupling start from length means offsets need not be monotonic or contiguous, so filter/take/slice become metadata-only: they rewrite only the small offsets/sizes/lengths/validity arrays and reuse the compressed byte heap and symbol table untouched.

This avoids the heap rewrite that plain FSST incurs on filter/take (which delegate to VarBin), giving the same speed win ListView has over List.

- New `vortex.fsstview` encoding in the fsst crate, reusing FSSTData for the symbol table + compressed byte heap. Children are declared with the `#[array_slots(FSSTView)]` proc macro (uncompressed_lengths, codes_offsets, codes_sizes, codes_validity).
- Metadata-only FilterKernel, TakeExecute, and SliceReduce.
- scalar_at decodes a single element via its offset+size slice.
- Canonicalization gathers the live codes (possibly out-of-order) and bulk-decompresses into a VarBinView.
- `fsstview_from_fsst` zero-copy conversion from an FSST array.
- Registered in `register_default_encodings`.
- Tests: canonical/filter/take/slice equivalence vs FSST, scalar_at, and filter/take/consistency conformance for nullable and non-nullable data.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
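To make the "metadata-only" claim concrete, here is a minimal sketch of the offsets+sizes addressing model. It is not the Vortex implementation: `ToyView` and its methods are hypothetical names, the heap holds plain bytes rather than FSST codes, and validity and uncompressed lengths are left out.

```rust
use std::sync::Arc;

// Toy stand-in for a view-addressed array. The names here are illustrative
// only and are not the Vortex API.
#[derive(Clone, Debug)]
struct ToyView {
    heap: Arc<[u8]>,   // shared byte heap; take/filter never rewrite it
    offsets: Vec<u32>, // start of each element within `heap`
    sizes: Vec<u32>,   // byte length of each element
}

impl ToyView {
    fn len(&self) -> usize {
        self.offsets.len()
    }

    // Read one element through its (offset, size) pair.
    fn get(&self, i: usize) -> &[u8] {
        let start = self.offsets[i] as usize;
        &self.heap[start..start + self.sizes[i] as usize]
    }

    // Take rewrites only the two small metadata arrays; the heap is shared.
    // Indices may repeat or go backwards, so offsets need not stay monotonic.
    fn take(&self, indices: &[usize]) -> Self {
        Self {
            heap: Arc::clone(&self.heap),
            offsets: indices.iter().map(|&i| self.offsets[i]).collect(),
            sizes: indices.iter().map(|&i| self.sizes[i]).collect(),
        }
    }

    // Filter is a take of the surviving positions.
    fn filter(&self, mask: &[bool]) -> Self {
        assert_eq!(mask.len(), self.len());
        let kept: Vec<usize> = (0..self.len()).filter(|&i| mask[i]).collect();
        self.take(&kept)
    }
}
```

With a single monotonic offsets array the same take would have to copy bytes into a new heap, because element `i` is defined as `heap[offsets[i]..offsets[i + 1]]`.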
…bench
Adds the second hop and the canonicalization decision for the FSSTView
pipeline, plus a benchmark that measures the trade-off directly.
- `fsst_filter_to_view` / `fsst_take_to_view`: reinterpret an FSSTArray as an
FSSTView (sharing symbols + codes bytes) and apply the metadata-only kernel,
so filtering/taking an FSSTArray never rewrites the compressed byte heap.
- Canonicalization now chooses a compaction strategy (FsstViewCompaction):
  - Direct: live codes still contiguous/in-order (untouched or sliced view) ->
    one bulk decompress, no copy.
  - GatherBulk ("compact"): copy the scattered live codes contiguous, then one
    bulk decompress. Wins when strings are short/numerous (per-call overhead
    dominates otherwise; the gather is cheap and unlocks bulk SIMD).
  - PerElement ("no compact"): decompress each element's slice in place, no
    copy. Wins when strings are long/few (the gather copy dominates).

  Auto picks Direct when contiguous, else GatherBulk/PerElement by average
  compressed bytes/element. `canonicalize_fsstview_with` exposes each strategy
  for benchmarking.
- benches/fsst_view_compute.rs: calls kernels directly (no dispatch) and
measures each part. filter (selective/non-selective), take (shuffle /
selective / dense), and a filter+take combo, over two ~2 MiB inputs (many
short strings, fewer long strings). fsst pipeline compacts into a fresh
FSSTArray each step then canonicalizes; fsstview pipeline stays metadata-only
then canonicalizes under each compaction strategy.
- Tests: from_fsst helpers vs canonical, and all compaction strategies agree
on both contiguous and scattered views.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
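The gather half of GatherBulk can be sketched as follows. This is only an illustration under simplifying assumptions: `gather_codes` is a hypothetical helper, it copies one span per element, and the bulk FSST decompress that would follow is omitted.

```rust
// Naive gather: copy each live element's compressed span into one contiguous
// buffer, in element order, and return fresh monotonic offsets for it.
// Hypothetical helper for illustration; not the PR's actual function.
fn gather_codes(heap: &[u8], offsets: &[u32], sizes: &[u32]) -> (Vec<u8>, Vec<u32>) {
    assert_eq!(offsets.len(), sizes.len());
    let total: usize = sizes.iter().map(|&s| s as usize).sum();
    let mut out = Vec::with_capacity(total);
    let mut new_offsets = Vec::with_capacity(offsets.len() + 1);
    new_offsets.push(0u32);
    for (&off, &size) in offsets.iter().zip(sizes) {
        let start = off as usize;
        out.extend_from_slice(&heap[start..start + size as usize]);
        // The gathered buffer is element-ordered, so offsets are monotonic again.
        new_offsets.push(out.len() as u32);
    }
    (out, new_offsets)
}
```

After the gather, the codes are contiguous and in element order, so a single bulk decode over `out` would be possible.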
The fsst_view_compute benchmark (two ~2 MiB inputs, ~12-byte and ~256-byte strings) shows GatherBulk beats PerElement across the entire tested range, not just for short strings as originally guessed. FSST's decoder has a fast 8-wide body and a slow byte-by-byte tail; PerElement pays that tail once per element while GatherBulk pays it once for the whole heap, which dominates the gather memcpy even at 256-byte strings.

Selected medians (canonicalize after the metadata-only hop):
take few_long/shuffle: gather 459us vs per_element 623us
take few_long/dense: gather 838us vs per_element 981us
filter many_short/nonsel: gather 5.38ms vs per_element 5.92ms

And the metadata-only hop itself is far cheaper than compacting FSST:
take_step many_short/shuffle: view 650us vs fsst 2.84ms (~4x)
take_step many_short/dense: view 604us vs fsst 4.15ms (~7x)

So Auto now picks Direct when the live codes are contiguous and GatherBulk otherwise; it never selects PerElement (kept selectable for measurement, wins only in the few-very-long-strings extreme outside real columns). Drops the SHORT_STRING_THRESHOLD heuristic and updates the docs to the measured behavior.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
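The resulting Auto rule reduces to a contiguity check. A rough sketch, assuming "contiguous" means each element starts exactly where the previous one ends; `Compaction`, `is_contiguous`, and `choose_compaction` are hypothetical names, and the real check may treat nulls and empty elements differently.

```rust
// Hypothetical two-way version of the strategy choice described above.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Compaction {
    Direct,     // one bulk decode over the existing heap range, no copy
    GatherBulk, // copy live codes contiguous first, then one bulk decode
}

// True when every element begins exactly where its predecessor ended, which
// also implies the elements are in heap order.
fn is_contiguous(offsets: &[u32], sizes: &[u32]) -> bool {
    assert_eq!(offsets.len(), sizes.len());
    offsets
        .windows(2)
        .zip(sizes)
        .all(|(w, &s)| w[0].checked_add(s) == Some(w[1]))
}

fn choose_compaction(offsets: &[u32], sizes: &[u32]) -> Compaction {
    if is_contiguous(offsets, sizes) {
        Compaction::Direct
    } else {
        Compaction::GatherBulk
    }
}
```

An untouched or sliced view passes the check; a shuffle or a filter that leaves gaps does not.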
- fsstview_from_fsst now reuses the FSST offsets buffer for codes_offsets via a
zero-copy slice of its first `len` elements, instead of re-copying into a new
Vec. Only the derived sizes array is freshly allocated.
- Add chain_pipeline_{fsst,view} benches: a 5-op alternating filter/take chain
ending in a canonicalize. This is where the view model is meant to win — each
fsst op re-compacts the byte heap (cost compounds with chain length), while
the view converts once and chains metadata-only ops, deferring the single
gather+decode to the end.
Measured medians (100 samples):
FewLong: fsst 765us -> view 481us (1.6x)
ManyShort: fsst 14.49ms -> view 9.64ms (1.5x)
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
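The offsets reuse in `fsstview_from_fsst` comes down to the relationship between a monotonic offsets array and the (offsets, sizes) pair. A minimal sketch on plain slices, with a hypothetical `split_offsets` helper standing in for the buffer-level code:

```rust
// Split a monotonic offsets array of length n + 1 into view-style metadata:
// the first n entries are reused as-is (a borrowed sub-slice, no copy) and
// only the sizes are freshly allocated. Hypothetical helper for illustration.
fn split_offsets(offsets: &[u32]) -> (&[u32], Vec<u32>) {
    let len = offsets.len().saturating_sub(1);
    // sizes[i] = offsets[i + 1] - offsets[i]
    let sizes = offsets.windows(2).map(|w| w[1] - w[0]).collect();
    (&offsets[..len], sizes)
}
```

For offsets `[0, 3, 5, 9]` this yields starts `[0, 3, 5]` and sizes `[3, 2, 4]`.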
… finding

Implements the "compact like a list / export paired slices into a VarBinView" idea: decode contiguous heap runs straight into a heap-ordered buffer and point VarBinView views back into it out of order, with no gather copy, and with duplicate dedup. Wired as FsstViewCompaction::RunCoalesce, hash-free (sort-based), handles nulls/empties/duplicates; covered by an adversarial gaps+shuffle+nullable test and the all-strategies-agree test.

Benchmark verdict: it loses to GatherBulk everywhere, badly for short strings (take many_short/shuffle ~18ms vs ~5.6ms). The random access you avoid at decode time reappears at view-build time: views are built in element order over a heap-ordered output, so make_view does N cache-missing random reads (and random inlining copies for <=12-byte strings), plus an O(N log N) sort. GatherBulk's output is element-ordered, so its view-build is sequential; the cheap sequential gather memcpy beats the scattered view construction.

So Auto keeps using Direct (contiguous) / GatherBulk (otherwise) and never picks RunCoalesce; it's retained as a selectable, measurable baseline. Docs updated with the full reasoning.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…nches

Callgrind on a shuffle take showed the kernel's cost was dominated by running take -> fill_null -> cast -> optimize three times (offsets/sizes/lengths). The fill_null + cast is only needed when a null index could introduce a null, i.e. when the indices are nullable. For non-nullable indices (the common case) the children stay non-nullable, so we now skip fill_null entirely. Re-profiling confirms fill_null (~450K ir) and its cast (~252K ir) drop out and the take kernel falls from ~612K to ~474K instructions per call.

Also add take_op_only_view / filter_op_only_view benches that hoist the one-time FSST->view conversion out of the timed loop, isolating the metadata-only op. These show the op is constant-time regardless of size or selectivity (~457 ns filter, ~657 ns take), like a ListView op; the earlier "view loses on selective" was purely the O(n) conversion being charged to every op, which only the first op of a chain actually pays.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…ge idea
Adds FsstViewByteStats / fsstview_byte_stats reporting, in both compressed (code)
and uncompressed (decoded) space: live vs run-spanned vs whole-heap bytes,
distinct spans, run count, and the dead-byte waste a gap-merged decode would
carry. A byte_stats_report test prints it for a selective filter and a shuffle
take (run with --nocapture).
This quantifies why merging across gaps to keep decode runs long doesn't pay:
filter_10pct (keep ~10% of 65536):
runs=5945 over 6616 survivors (avg ~1.1 elem/run -> survivors are isolated)
compressed: live=25.8KB, heap=255KB
full-heap-merge waste = 89.9% (you'd decode ~10x the needed compressed bytes)
shuffle_take (reorder all):
runs=1, waste=0% (RunCoalesce's ideal) -- yet it still loses on time to
GatherBulk because the random access just moves to view-build.
So the dead-value budget the gap-merge idea needs is blown immediately on a
selective filter (90% dead), and on the one input where merging is free
(shuffle, 0% dead) GatherBulk still wins. There's also a hard blocker: after a
filter the dead elements' uncompressed_lengths are gone and FSST decode only
returns a total written count, so a single gap-merged decode can't even locate
post-gap survivors. Conclusion: GatherBulk (zero waste) / Direct (contiguous)
remain the right canonicalization; the stats make the trade-off measurable.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
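The filter_10pct waste figure is consistent with waste = 1 - live/heap: 1 - 25.8/255 ≈ 0.899. A simplified sketch of how such numbers could be derived from the metadata alone is below. It is an assumption about the definitions, not the real `fsstview_byte_stats`: `ToyStats` and `toy_stats` are hypothetical, only compressed space is covered, runs are counted in heap order over distinct spans (which is what makes a full shuffle report one run), and distinct spans are assumed not to partially overlap.

```rust
// Toy heap-usage summary for a view. These fields are a simplification and do
// not mirror the real FsstViewByteStats layout.
#[derive(Debug, PartialEq)]
struct ToyStats {
    runs: usize,     // maximal contiguous heap ranges covered by live spans
    live_bytes: u64, // bytes covered by the distinct live spans
    heap_bytes: u64, // total size of the compressed heap
}

impl ToyStats {
    // Fraction of the heap a decode-everything merge would decode needlessly.
    fn full_heap_merge_waste(&self) -> f64 {
        if self.heap_bytes == 0 {
            return 0.0;
        }
        1.0 - self.live_bytes as f64 / self.heap_bytes as f64
    }
}

fn toy_stats(heap_bytes: u64, offsets: &[u32], sizes: &[u32]) -> ToyStats {
    assert_eq!(offsets.len(), sizes.len());
    // Collect non-empty [start, end) spans, then sort and dedup so that
    // duplicates (e.g. from a take with repeated indices) count once.
    let mut spans: Vec<(u64, u64)> = offsets
        .iter()
        .zip(sizes)
        .map(|(&o, &s)| (o as u64, o as u64 + s as u64))
        .filter(|&(start, end)| end > start)
        .collect();
    spans.sort_unstable();
    spans.dedup();

    let (mut runs, mut live_bytes) = (0usize, 0u64);
    let mut prev_end: Option<u64> = None;
    for (start, end) in spans {
        // A new run starts whenever this span is not zero-gap adjacent to the
        // previous one in heap order.
        if prev_end != Some(start) {
            runs += 1;
        }
        live_bytes += end - start;
        prev_end = Some(end);
    }
    ToyStats { runs, live_bytes, heap_bytes }
}
```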
The only overhead GatherBulk carries over the theoretical minimum is the gather memcpy, and it was copying every element's span individually. For an order-preserving filter, surviving neighbours are still heap-adjacent, so a run of k survivors can be copied in one memcpy instead of k. The gather now accumulates a contiguous [run_start, run_end) heap range and flushes it once per run, making the copy cost proportional to the number of runs rather than the number of elements.

This is a strict win where survivors form long runs (non-selective filter: many_short/nonselective canonicalize ~5.38ms -> ~4.75ms) and a no-op for a shuffle (no adjacency -> one copy per element as before, behind a cheap branch). Combined with Direct (single contiguous run, zero copy), the export is now optimal: gather work scales with run count, then one bulk decode, then a sequential element-ordered view-build.

Correctness: spans are still emitted in element order, so the decoded buffer stays element-ordered; coalescing only fires on genuine zero-gap adjacency. Covered by the existing all-strategies-agree and gaps+shuffle+nullable tests.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
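The run accumulation can be sketched as below. This is an illustrative reconstruction from the description, not the PR's code: `gather_coalesced` is a hypothetical name, empty elements are simply skipped, and the returned copy count exists only to make the effect observable.

```rust
// Gather live spans in element order, but extend a pending
// [run_start, run_end) range while the next span is zero-gap adjacent and
// flush it with a single copy when the run breaks.
// Returns the gathered bytes and the number of copies issued.
fn gather_coalesced(heap: &[u8], offsets: &[u32], sizes: &[u32]) -> (Vec<u8>, usize) {
    assert_eq!(offsets.len(), sizes.len());
    let total: usize = sizes.iter().map(|&s| s as usize).sum();
    let mut out = Vec::with_capacity(total);
    let mut copies = 0usize;
    let mut run: Option<(usize, usize)> = None; // pending [run_start, run_end)

    for (&off, &size) in offsets.iter().zip(sizes) {
        if size == 0 {
            continue; // empty elements contribute no bytes
        }
        let (start, end) = (off as usize, off as usize + size as usize);
        run = match run.take() {
            // Genuine adjacency: grow the pending run, no copy yet.
            Some((rs, re)) if re == start => Some((rs, end)),
            // Run broken: flush the pending range, start a new one.
            Some((rs, re)) => {
                out.extend_from_slice(&heap[rs..re]);
                copies += 1;
                Some((start, end))
            }
            None => Some((start, end)),
        };
    }
    if let Some((rs, re)) = run {
        out.extend_from_slice(&heap[rs..re]);
        copies += 1;
    }
    (out, copies)
}
```

Because runs are flushed in the order they are encountered, the output stays element-ordered, matching the correctness note above.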
Merging this PR will not alter performance