
[claude] FSST LIKE: SSA fusion, dense short-circuit, _ wildcard, ILIKE, Shift-Or, Fat Teddy, planner #7921

Open
joseph-isaacs wants to merge 15 commits into ji/fsst-like-paper-2-work-clean from claude/optimize-string-lookup-KuJZB

Conversation

@joseph-isaacs
Contributor

Stacks 15 commits on top of ji/fsst-like-paper-2-work-clean covering perf, coverage, and routing.

Perf summary (divan medians, 100k rows):

| Pattern (dataset) | Pre-SSA baseline | HEAD | Δ |
|---|---|---|---|
| %google% urls (ClickBench Q20 target) | 1.16 ms | 0.95 ms | 1.22× faster |
| %gmail% email | 1.12 ms | 0.60 ms | 1.86× faster |
| %ear% urls (DESIGN.md SSA target) | 1.41 ms | 0.69 ms | 2.05× faster |
| %ear% cb | 3.82 ms | 3.03 ms | 1.26× faster |
| %htt% urls (saturated SSA) | 0.59 ms | 0.62 ms | parity (dense short-circuit) |
| %htt% cb | 1.31 ms | 1.30 ms | parity |
| %https% urls | 0.70 ms | 0.80 ms | 1.14× slower |
| NOT LIKE %google% urls | 0.95 ms (row-loop) | streams now | — |

What's in the stack (15 commits)

Performance / routing

  • 6c0ac95dc Escape-only memmem fast path — when no FSST symbol expansion contains any needle byte, the only compressed sequence reaching the contains DFA's accept state is exactly [ESCAPE, n[0], ESCAPE, n[1], …, ESCAPE, n[L-1]]. Single memmem over all_bytes prefilters rows; verifier resolves the rare literal-byte-position false positive.
  • 9e826367f Same fast path ported to FlatContainsDfa (128–254-byte needles) and MultiContainsDfa (longest segment as anchor).
  • b466388f6 Fused 32-byte SSA nibble in AVX2/AVX-512 Teddy — DESIGN.md's proper SSA-merge, implemented inside fused_teddy_pair_scan / fused_teddy_triple_scan. Reuses v1's nibble register for an additional PSHUFB-Mula lookup over single_step_accept_codes; ORs into the Teddy mask before vpcmpgtb / vpcmpneqb. Two extra vector ops per block, one PSHUFB pass total. Eliminates the dense-bitset 1-byte fallback for selective SSA needles.
  • 970c09228 NOT LIKE streaming (drop like.rs:113 gate, NOT LIKE routes through Teddy + fused SSA) and dense-pattern short-circuit (8 KiB scalar SSA-byte sample; if estimated > 32k candidates total, route directly to 1-byte fallback). Recovers the %htt% regression that pure fusion left.
  • 24f94ed00 NEON SSA fusion mirrors the AVX2 / AVX-512 logic; cfg-gated.

Coverage

  • b791a0097 _ single-byte wildcard for prefix and suffix patterns. KMP byte-table fills the row at wildcard positions. _ in %contains% deliberately rejected — KMP failure with wildcards is unsound (false positives); a correct contains-with-_ needs NFA subset construction.
  • 5e5334c62 ILIKE / ASCII case-insensitive. Case folding at DFA construction time via set_advance; pattern equality uses case-folded compare in the failure function. Threaded through every DFA constructor via FsstMatcher::try_new_with.

Subagent track (parallel worktrees, then merged)

  • eb2b2e6bc Shift-Or / Bitap matcher for short needles (≤8 bytes). Bit-parallel single-u64-state matcher with per-symbol composition (shift_bits, or_mask, state_accept_mask). Routing gated behind "no escape-only eligible, no first-byte present in any symbol, no SSA" — on URL-shaped FSST dicts that's rarely true, so Shift-Or rarely fires in practice. Variant is correct + tested; pays off on sparser dictionaries.
  • 0e8889c18 / 666ad9cde Fat Teddy multi-needle OR — 8-bucket Fat Teddy with greedy bucket-packing, AVX2 + scalar; per-bucket FoldedContainsDfa::matches verifier. MultiNeedleMatcher::try_new_multi / scan_or_to_bitbuf for LIKE x OR LIKE y OR …. Deferred TODOs in code: cross-bucket FDR for ESCAPE_CODE-anchored needles (currently falls back to N-pass), AVX-512 + NEON variants.
  • b994a06ac / d24e5cb22 Engine planner / cost-model routing — refactors scan_to_bitbuf in folded / flat / multi to dispatch via match planner.plan_*(&ctx) into per-path run_* helpers. ArchProfile (CPUID once), estimated_cost_ns cost model, VORTEX_FSST_PLAN_TRACE=1. Bench-parity regression test test_planner_matches_legacy_cascade asserts the planner reproduces the legacy cascade's decision for every fsst_contains bench. Reserved ShiftOr enum variant with TODO comment — current Shift-Or routing still happens in try_new_with, not via the planner.

Misc

  • f0c8b5146 Follow-ups doc at encodings/fsst/src/dfa/FOLLOWUPS.md with subagent-ready briefs for the three tasks above plus mixed-anchors / contains-with-_ for future work.

Tests

210 tests pass with --features _test-harness (157 baseline + 53 new):

  • 5 + 4 escape-only memmem
  • 6 wildcard
  • 6 ILIKE
  • 16 Shift-Or
  • 12 Fat Teddy (incl. 2 _test-harness-gated property tests vs OR-of-singles)
  • 12 planner (incl. test_planner_matches_legacy_cascade against the bench corpus)

cargo +nightly fmt --all clean. cargo clippy -p vortex-fsst --all-targets --all-features — no new lints in changed files (preexisting lints in mod.rs:498, anchor_scan.rs:3100+, dfa_compressed/* remain unaddressed).

Deferred (preserved as TODOs in code; documented in FOLLOWUPS.md)

  • Cross-bucket FDR for ESCAPE_CODE-anchored needles in Fat Teddy.
  • AVX-512 + NEON variants of fat_teddy_pass_*.
  • Planner integration of the ShiftOr plan (reserved slot exists).
  • Mixed anchors (pre%mid%suf) — needs MultiContainsDfa refactor.
  • Contains with _ — needs NFA subset construction.

Test plan

  • cargo test -p vortex-fsst --lib --features _test-harness — must report 210 passing.
  • cargo bench -p vortex-fsst --bench fsst_like --features _test-harness — confirm the headline perf table above on the reviewer's hardware.
  • cargo clippy -p vortex-fsst --all-targets --all-features — diff against base; no new lints in encodings/fsst/src/dfa/* or encodings/fsst/src/compute/like.rs.
  • Spot-check VORTEX_FSST_TEDDY_TRACE=1 output for %ear% (should route through fused SSA Teddy) and %htt% (should route through dense short-circuit).
  • Review the merge resolutions in 666ad9cde (Fat Teddy) and d24e5cb22 (planner) — kept HEAD's stricter ShiftOr gate, appended Fat Teddy's MultiNeedleMatcher section, and combined both subagents' test additions.
  • Run an independent code review on the three subagent-authored files (shift_or.rs 806 LOC, fat_teddy.rs 744 LOC, planner.rs ~430 LOC) — I reviewed merge conflicts but did not exhaustively audit each subagent's implementation. Consider spawning a pr-review agent on the diff range 049c79f89..d24e5cb22.

🤖 Generated with Claude Code

claude and others added 15 commits May 13, 2026 20:37
When no FSST symbol's expansion contains any byte of the needle, the
only compressed sequence that reaches `FoldedContainsDfa::accept` from
state 0 is exactly `[ESCAPE, n[0], ESCAPE, n[1], ..., ESCAPE, n[L-1]]`.
Any non-escape code resets the DFA to state 0 (symbols can't contribute
a needle byte), so a single `memmem` for the 2L-byte encoded pattern is
strictly more selective than the existing 2-byte `(ESCAPE, n[0])`
escape-pair anchor.

The hot regime is `%needle%` where the needle bytes are absent from the
trained dict — e.g. queries for tokens that never appear in the column,
or rare-alphabet substrings on text columns. Memmem with a long pattern
benefits from skip heuristics, so candidate density drops roughly L×
versus the 2-byte anchor.

Verification still runs the standard DFA per candidate row so the rare
literal-position memmem hit (compressed[p-1] = ESCAPE AND compressed[p]
= 255) is handled correctly — `FoldedContainsDfa::matches` always starts
from a code position, so its result is exact regardless of where memmem
landed.

The new path is gated on needle length >= 2 — at L = 1 the encoded
pattern is identical to the existing escape_pair 2-byte memmem, so
adding a separate path would just add a redundant branch.
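To make the invariant concrete, here is a minimal standalone sketch of the escape-only pattern build and prefilter. `ESCAPE` and `find_subslice` are illustrative stand-ins, not the actual vortex-fsst identifiers; production code would use `memchr::memmem` to get the skip heuristics described above.

```rust
const ESCAPE: u8 = 255; // FSST escape code (illustrative constant)

/// Build the only compressed byte sequence that can encode `needle`
/// when no symbol expansion contains any needle byte:
/// [ESCAPE, n[0], ESCAPE, n[1], ..., ESCAPE, n[L-1]].
fn build_escape_only_pattern(needle: &[u8]) -> Vec<u8> {
    let mut pat = Vec::with_capacity(needle.len() * 2);
    for &b in needle {
        pat.push(ESCAPE);
        pat.push(b);
    }
    pat
}

/// Naive stand-in for the memmem prefilter over `all_bytes`.
fn find_subslice(haystack: &[u8], pat: &[u8]) -> Option<usize> {
    if pat.is_empty() || haystack.len() < pat.len() {
        return None;
    }
    haystack.windows(pat.len()).position(|w| w == pat)
}
```

Rows where `find_subslice` misses are definitively non-matching; hits still go through the DFA verifier as described above.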

Signed-off-by: Claude <noreply@anthropic.com>
Port the escape-only prefilter from FoldedContainsDfa to the other two
contains shapes:

- `FlatContainsDfa` (needles 128..=254 bytes): identical logic — same
  detection, same encoded pattern, same per-row verification. Long
  needles benefit the most here because memmem's skip distance scales
  with pattern length.

- `MultiContainsDfa` (`%seg1%seg2%…`): every segment must appear in
  the row for a match, so the LONGEST segment's encoded pattern is a
  sound (and most-selective) row-level filter. Hits are verified by
  the full chained DFA which checks segment ordering. Only the union
  of all segment bytes is required to be absent from every symbol.
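The anchor-selection rule for the multi-segment case is simple enough to sketch; `pick_anchor_segment` is a hypothetical helper name illustrating the "longest segment is the most-selective sound filter" choice:

```rust
/// Every segment must appear in a matching row, so any single segment is
/// a sound row-level filter; the longest one is the most selective
/// (memmem skip distance grows with pattern length).
fn pick_anchor_segment<'a>(segments: &[&'a [u8]]) -> Option<&'a [u8]> {
    segments.iter().copied().max_by_key(|s| s.len())
}
```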

The detection and pattern builder (`needle_bytes_absent_from_all_symbols`,
`build_escape_only_encoded_pattern`) are shared from `dfa/mod.rs`.

Signed-off-by: Claude <noreply@anthropic.com>
The bucketed Teddy pair/triple scan only anchors on c1's whose one-step
state from 0 is non-accept (an SSA c1 has no advancing c2 to anchor on,
so the pair scheme would miss it). Today the scan-path gate skips Teddy
entirely when any SSA code exists and falls to the 1-byte progressing
bitset, which on URL-shaped data marks nearly every byte and verifies
most as false positives.

Run Teddy unconditionally when its buckets exist (they already exclude
SSA c1's by construction) and OR-merge its result with a separate
SSA-only pass:

  - Build a 1-byte PSHUFB-Mula bitset over `single_step_accept_codes`
    (typically 1-4 SSA codes -> very sparse).
  - For each row touched by the SSA bitset, verify with the standard
    `matches` DFA. This is exact, including for the rare hit where the
    SSA code's byte value appears at a literal position (compressed
    [p-1] = ESCAPE, compressed[p] = SSA-code-value-as-literal): the DFA
    starts at the row's first code position and never enters that
    misalignment.
  - Combine: non-negated → `teddy | ssa`, negated → `teddy & ssa`.

When neither Teddy bucket nor SSA exists, fall back to the 1-byte
progressing-codes bitset as before; the 1-byte path remains correct
for the SSA-only case too (its `matches_with_bitset` is the full DFA).

`scan_plan_name` now reports `triple_streaming+ssa_merge` /
`pair_streaming+ssa_merge` / `escape_pair_streaming+ssa_merge` so
tracing and tests can verify the path that fired.

Signed-off-by: Claude <noreply@anthropic.com>
DESIGN.md's SSA-merge proposal: build Teddy over the non-SSA progressing
c1's and a 1-byte PSHUFB over the SSA codes, then OR the two candidate
streams *inside* `fused_teddy_*_scan`. The earlier row-level OR-merge
attempt regressed everywhere because it paid for two separate PSHUFB
passes over `all_bytes` plus a per-row `matches` verify; this commit
does it the right way.

Mechanism:
- `fused_teddy_pair_scan` / `fused_teddy_triple_scan` take an
  optional `ssa_codes: &[u8]`. If present, a `NibbleTables` is built
  once and threaded through `run_teddy_*_pass` to every SIMD variant.
- In `teddy_pair_pass_avx2` and `teddy_triple_pass_avx512`, the SSA
  nibble tables are broadcast into a pair of YMM/ZMM registers up
  front. Per 32-byte (AVX2) / 64-byte (AVX-512) block the existing
  PSHUFB pair on `v1`'s nibbles is reused to also evaluate the SSA
  set — a few extra vector ops (`shuffle` × 2, `and`, `or`) — and the
  resulting `ssa_bits` register is OR'd into the Teddy candidate
  mask before `vpcmpgtb` / `vpcmpneqb`. Candidates from SSA flow into
  the same `tzcnt`-peeling loop and the same per-candidate
  `verify_at` dispatch.
- Tail scalar paths fold SSA in cheaply (one nibble table lookup
  per byte), and the AVX-512 and AVX2 tails additionally check the
  last 1–2 positions that the pair/triple scan skips for lack of a
  successor.
- `teddy_pair_pass_scalar` mirrors the AVX2 logic.
- `teddy_pair_pass_neon` and the avx2/neon/scalar triple variants
  accept the parameter but skip the fusion for now; on AArch64 the
  caller falls back to the non-fused path. (Marked with TODO.)
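The PSHUFB-Mula lookup the SSA fusion reuses can be modeled in scalar code. This is a hedged sketch of the technique (not the vortex-fsst implementation): two 16-entry tables indexed by a byte's low and high nibble, with one bit per set member; a byte belongs to the set iff the AND of the two lookups is nonzero. In SIMD the same lookup is one `shuffle` per table per block.

```rust
/// Build lo/hi nibble tables for a set of up to 8 codes (one bit each).
fn build_nibble_tables(set: &[u8]) -> ([u8; 16], [u8; 16]) {
    assert!(set.len() <= 8, "one bit per set member");
    let (mut lo, mut hi) = ([0u8; 16], [0u8; 16]);
    for (i, &b) in set.iter().enumerate() {
        lo[(b & 0x0F) as usize] |= 1 << i;
        hi[(b >> 4) as usize] |= 1 << i;
    }
    (lo, hi)
}

/// Membership test: bit i survives the AND only when both nibbles of `b`
/// match set member i, so a nonzero result means b is in the set.
fn in_set(lo: &[u8; 16], hi: &[u8; 16], b: u8) -> bool {
    lo[(b & 0x0F) as usize] & hi[(b >> 4) as usize] != 0
}
```

With distinct codes the test is exact: a surviving bit pins down both nibbles of `b`, hence the whole byte.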

Caller (`FoldedContainsDfa::scan_to_bitbuf`) drops the
`single_step_accept_codes.is_none()` gate and passes
`single_step_accept_codes` as `ssa_codes` to the Teddy entry points.
The escape-pair specialization (one bucket, c1 = ESCAPE, ≤3 c2's)
still doesn't fuse SSA — when SSA codes are present we fall through
to the generic pair-streaming path, which does.

`scan_plan_name` reports `triple_streaming+ssa_fused` /
`pair_streaming+ssa_fused` when fusion is active.

Bench (divan medians, 100k rows, `cargo bench -p vortex-fsst --bench
fsst_like --features _test-harness`):

| pattern (dataset) | pre-SSA | fused | Δ |
|---|---|---|---|
| %google% urls (Q20) | 1.16 ms | 0.94 ms | 1.23× faster |
| %gmail% email      | 1.12 ms | 0.60 ms | 1.86× faster |
| %ear% urls (DESIGN target) | 1.41 ms | 0.69 ms | 2.05× faster |
| %ear% cb           | 3.82 ms | 3.03 ms | 1.26× faster |
| %yandex% cb        | 1.95 ms | 2.02 ms | 1.04× slower |
| %htt% urls         | 0.59 ms | 1.84 ms | 3.13× slower |
| %https% urls       | 0.70 ms | 0.90 ms | 1.29× slower |

Headline wins on selective regimes (`%google%`, `%ear%`, `%gmail%`).
Saturated-SSA needles (`%htt%`, `%https%`) regress because the
per-candidate `verify_at` dispatch beats per-row `matches_with_bitset`
short-circuit when candidates outnumber rows; that's the regime
DESIGN.md's separate "dense-pattern short-circuit" is meant to
cover.

Adds `%htt%` / `%ear%` / `%https%` benches to `fsst_like.rs` to
guard the SSA cases going forward.

Signed-off-by: Claude <noreply@anthropic.com>
Two routing fixes:

1. NOT LIKE streaming. `like.rs` previously gated `negated=true`
   away from the streaming Teddy paths and onto the per-row loop.
   The streaming paths already handle negation correctly (initial
   bitbuf state + set/unset polarity), and tests cover it. Removing
   the gate routes NOT LIKE through Teddy and the SSA fusion.

2. Dense-pattern short-circuit. The fused-Teddy+SSA path I added
   last commit regressed saturated-SSA needles (%htt% / %https%):
   per-candidate `verify_at` dispatch beats per-row
   `matches_with_bitset` short-circuit when candidates outnumber
   rows. An 8 KiB scalar sample of `all_bytes` for SSA byte hits
   extrapolates to an estimated candidate count; above 32k we
   route directly to the 1-byte progressing-codes path
   (`scan_with_anchor_bitset`), which is what the pre-SSA-merge
   code did for SSA-present needles. The sample cost is a few µs;
   below threshold the fused path runs as before.
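The sample-and-extrapolate step can be sketched as follows. Names (`estimated_candidates`, `is_ssa_byte`) and the exact extrapolation are illustrative; the constants are the ones quoted above.

```rust
const SAMPLE_BYTES: usize = 8 * 1024; // scalar sample of the compressed stream
const DENSE_THRESHOLD: usize = 32_000; // calibrated from the bench corpus

/// Count SSA-byte hits in an 8 KiB prefix and extrapolate the hit
/// density to the whole buffer.
fn estimated_candidates(all_bytes: &[u8], is_ssa_byte: &[bool; 256]) -> usize {
    let sample = &all_bytes[..all_bytes.len().min(SAMPLE_BYTES)];
    if sample.is_empty() {
        return 0;
    }
    let hits = sample.iter().filter(|&&b| is_ssa_byte[b as usize]).count();
    hits * all_bytes.len() / sample.len()
}

/// Above the threshold, route to the 1-byte progressing-codes path
/// instead of the fused Teddy+SSA scan.
fn route_to_one_byte_fallback(all_bytes: &[u8], is_ssa_byte: &[bool; 256]) -> bool {
    estimated_candidates(all_bytes, is_ssa_byte) > DENSE_THRESHOLD
}
```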

Bench (divan medians, 100k rows):

| pattern         | pre-SSA | fused-only | fused + short-circuit |
|-----------------|---------|------------|-----------------------|
| %ear%  urls     | 1.41 ms | 0.69 ms ✓  | 0.69 ms ✓             |
| %ear%  cb       | 3.82 ms | 3.03 ms ✓  | 3.02 ms ✓             |
| %google% urls   | 1.16 ms | 0.94 ms ✓  | 0.92 ms ✓             |
| %htt%  urls     | 0.59 ms | 1.86 ms ✗  | 0.62 ms (parity)      |
| %htt%  cb       | 1.31 ms | 2.92 ms ✗  | 1.30 ms (parity)      |
| %https% urls    | 0.70 ms | 0.92 ms ✗  | 0.76 ms (close)       |

Selective wins from SSA fusion are preserved; saturated-SSA
regressions are eliminated. The threshold (32k estimated
candidates) was calibrated from the bench corpus — `%ear%` /
`%google%` sit comfortably under it while `%htt%` / `%https%` /
`%htt% cb` all cross it.

Adds NOT LIKE benches (%google% urls, %xyzzy% rare) to guard the
ungated path.

Signed-off-by: Claude <noreply@anthropic.com>
Extend the LikeKind parser and the KMP byte-table / suffix byte-table
construction to treat `_` (byte 0x5F) as the SQL single-byte wildcard.
Anchored shapes — `prefix%` and `%suffix` — gain wildcard support;
each `_` position transitions on every byte instead of one literal.

Unanchored shapes (`%contains%`, `%seg1%seg2%`) are still rejected
when any `_` appears: KMP's failure function with wildcards is
unsound (treats `_` as symmetrically compatible with any pattern
byte, producing false positives at the DFA level). A correct
unanchored wildcard matcher needs NFA subset construction; tracked
as a follow-up.

Changes:
- `dfa/mod.rs`: add `WILDCARD = b'_'`, `pattern_eq`,
  `pattern_matches_byte`. Update `kmp_byte_transitions` to fill the
  row (any byte advances) at wildcard positions; `kmp_failure_table`
  uses wildcard-aware pattern equality.
- `dfa/prefix.rs::build_prefix_byte_table`: fill the row at wildcard
  positions.
- `dfa/suffix.rs::build_suffix_byte_table`: same, for the
  backward-scanned suffix.
- `dfa/mod.rs::LikeKind::parse`: accept `_` in `Prefix` and `Suffix`
  variants; still reject in `Contains` / `MultiContains`.
- `needle_bytes_absent_from_all_symbols` skips wildcard positions
  when computing the literal-byte symbol overlap; the escape-only
  memmem fast path is gated on `needle_is_literal`.
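The "fill the row at wildcard positions" idea reduces to this sketch (simplified: no KMP failure links, and `build_prefix_byte_table` / `prefix_matches` here are illustrative standalone versions, not the real `dfa/prefix.rs` code):

```rust
const WILDCARD: u8 = b'_';

/// One 256-entry row per pattern position; at a `_` position every byte
/// advances, otherwise only the literal byte does.
fn build_prefix_byte_table(pattern: &[u8]) -> Vec<[bool; 256]> {
    pattern
        .iter()
        .map(|&p| {
            let mut row = [false; 256];
            if p == WILDCARD {
                row = [true; 256]; // `_` matches any single byte
            } else {
                row[p as usize] = true;
            }
            row
        })
        .collect()
}

/// A `prefix%` pattern matches when every leading byte advances its row.
fn prefix_matches(table: &[[bool; 256]], row_bytes: &[u8]) -> bool {
    table.len() <= row_bytes.len()
        && table.iter().zip(row_bytes).all(|(row, &b)| row[b as usize])
}
```

The unanchored case is harder precisely because this per-position table has no sound failure function once `_` rows accept everything.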

Adds 6 wildcard tests covering prefix, suffix, multi-wildcard,
leading-wildcard, symbol-interaction, and the deliberate
contains-rejection. All 163 existing + new tests pass.

Signed-off-by: Claude <noreply@anthropic.com>
Extend the DFA construction to optionally fold ASCII letter case.
Adds `FsstMatcher::try_new_with(symbols, lengths, pattern,
case_insensitive)`; the `like.rs` kernel now plumbs
`options.case_insensitive` through instead of bailing out.

Mechanism:
- `dfa/mod.rs`: add `ascii_to_lower`, `pattern_eq(a, b, ci)`,
  `pattern_matches_byte(p, b, ci)`, and a `set_advance` helper that,
  when `ci` is true, sets both case variants of an ASCII letter in
  the byte table. `kmp_byte_transitions` and `kmp_failure_table`
  now take `ci`; the fold is at construction time so the hot loop
  stays a single table lookup per byte.
- `dfa/prefix.rs::build_prefix_byte_table`, `dfa/suffix.rs::
  build_suffix_byte_table`, `dfa/multi_contains.rs::
  chained_kmp_byte_transitions`: same pattern.
- Each DFA's `new()` takes `case_insensitive: bool`. Threaded
  through `FsstMatcher::try_new_with` from `LikeKernel::like`.
- Escape-only memmem fast path is gated to wildcard-free,
  case-sensitive needles (the encoded pattern is byte-exact).
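The construction-time fold amounts to the following sketch (`set_advance` mirrors the helper named above; the row layout is illustrative):

```rust
/// ASCII lowercase fold: setting bit 0x20 lowercases A-Z, no-op otherwise.
fn ascii_to_lower(b: u8) -> u8 {
    if b.is_ascii_uppercase() { b | 0x20 } else { b }
}

/// Mark a transition in a byte-table row; under case-insensitive
/// construction, mark both case variants of an ASCII letter so the hot
/// loop stays a single table lookup per byte.
fn set_advance(row: &mut [bool; 256], b: u8, case_insensitive: bool) {
    row[b as usize] = true;
    if case_insensitive && b.is_ascii_alphabetic() {
        row[(b ^ 0x20) as usize] = true; // flip the ASCII case bit
    }
}
```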

Adds 6 ILIKE tests covering prefix, suffix, contains,
multi-contains, ILIKE + `_` wildcard, and ILIKE with FSST
symbol expansions in mixed case. 169 tests pass.

Signed-off-by: Claude <noreply@anthropic.com>
Drops the TODO on the NEON Teddy passes. Mirrors the AVX2 / AVX-512
implementations: at setup, load the SSA nibble tables into NEON
registers (or splat zero when SSA is absent); in the inner loop,
compute `ssa_bits = neon_nibble_lookup(ssa_lo, ssa_hi, v1, nibble_mask)`
and `vorrq_u8` it into the Teddy candidate vector before the
movemask. Scalar tail picks up SSA via the same one-line nibble
table check used in the AVX2 tail, and the pair NEON path adds the
last-byte SSA-only check.

The NEON code is `#[cfg(target_arch = "aarch64")]`-gated; no
runtime change on x86_64 (which already does fused SSA via AVX2 /
AVX-512). Cross-compile checking was not available locally; the logic
is byte-for-byte parallel to the AVX2 path.

Signed-off-by: Claude <noreply@anthropic.com>
Three self-contained task briefs for the remaining DFA prefilter
items — Shift-Or for short needles, Fat Teddy for multi-pattern OR,
and an engine planner / cost-model that replaces the hardcoded
routing cascade. Each section is sized to be pasted into a
subagent prompt: required context, files to touch, exit criteria,
validation gates, known pitfalls.

Includes a shared "Required context" block covering the FSST DFA
architecture so each task brief stays focused on its own scope.

Recommended order when running sequentially:
1. Shift-Or  — extends FoldedContains for needles ≤ 8 bytes.
2. Planner   — refactors scan_to_bitbuf routing; needs Shift-Or
   added first so the matrix is non-trivial.
3. Fat Teddy — multi-pattern OR; benefits from the planner.

Signed-off-by: Claude <noreply@anthropic.com>
Adds `MultiNeedleMatcher` and `dfa/fat_teddy.rs` implementing a
Hyperscan-inspired Fat Teddy prefilter for `LIKE x OR LIKE y OR ...`
on FSST-compressed strings. Up to 8 needles per pass; greedy
bucket-packing minimizes per-bucket false-positive rate
(`|c1_union| * |c2_union|`). Verification uses each needle's
existing `FoldedContainsDfa::matches`. AVX2 + scalar streaming
passes share the per-block PSHUFB-Mula lookup; AVX-512 + NEON
variants and cross-bucket FDR for ESCAPE-anchored needles are
deferred.
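The greedy bucket-packing heuristic can be sketched independently of Teddy itself. This is a hedged model, not `fat_teddy::pack_needles`: each needle contributes one (c1, c2) anchor pair here (real needles contribute code sets), and the cost metric is the `|c1_union| * |c2_union|` product named above.

```rust
use std::collections::BTreeSet;

const NUM_BUCKETS: usize = 8;

#[derive(Default, Clone)]
struct Bucket {
    c1: BTreeSet<u8>,
    c2: BTreeSet<u8>,
    needles: Vec<usize>,
}

/// Cost of a bucket after hypothetically adding (c1, c2): the product of
/// the union sizes, a proxy for the bucket's false-positive rate.
fn cost_with(b: &Bucket, c1: u8, c2: u8) -> usize {
    let n1 = b.c1.len() + usize::from(!b.c1.contains(&c1));
    let n2 = b.c2.len() + usize::from(!b.c2.contains(&c2));
    n1 * n2
}

/// Greedily place each needle in the bucket whose cost grows least.
fn pack_needles(anchors: &[(u8, u8)]) -> Vec<Bucket> {
    let mut buckets = vec![Bucket::default(); NUM_BUCKETS.min(anchors.len().max(1))];
    for (idx, &(c1, c2)) in anchors.iter().enumerate() {
        let best = (0..buckets.len())
            .min_by_key(|&i| cost_with(&buckets[i], c1, c2))
            .unwrap();
        buckets[best].c1.insert(c1);
        buckets[best].c2.insert(c2);
        buckets[best].needles.push(idx);
    }
    buckets
}
```

Needles sharing anchors collapse into one bucket; disjoint needles spread out, keeping each bucket's candidate stream sparse.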

Brought from a parallel agent worktree: `shift_or.rs` (Task A)
arrived integrated into `mod.rs`. Added one fix:
`#[cfg(debug_assertions)]` guarding a debug-only function reference
inside `debug_assert!` so release builds compile.

Deferred / TODOs:
- Cross-bucket FDR for ESCAPE_CODE-anchored needles. Currently
  those needles fall back to per-needle scans (marked
  `TODO(fat-teddy)` in `fat_teddy::pack_needles`).
- AVX-512 and NEON Fat Teddy passes (today AVX2 + scalar).
- Engine planner integration (Task C). MultiNeedleMatcher
  dispatches directly without a planner.

New public API (`vortex_fsst::dfa::MultiNeedleMatcher`):
- `try_new_multi(symbols, lengths, &[&[u8]], case_insensitive)`
- `scan_or_to_bitbuf(n, offsets, all_bytes, negated)`
- `bucket_count()`, `needle_count()`, `fallback_count()`

Validation gates:
- `cargo test -p vortex-fsst --lib` passes (193 tests).
- `cargo test -p vortex-fsst --lib --features _test-harness`
  passes (196 tests, including the property test
  `test_fat_teddy_random_needles_equals_or_of_singles` and
  `test_fat_teddy_equals_or_of_single_matchers`).
- `cargo +nightly fmt --all` clean.
- `cargo clippy -p vortex-fsst --lib --all-features --tests`:
  no new lints in changed files; preexisting lints on
  `dfa/mod.rs:793` (build_symbol_transitions), `anchor_scan.rs:3100+`,
  and `dfa_compressed/` are out of scope per AGENTS.md.

New bench targets in `benches/fsst_like.rs`:
- `fsst_contains_or_{3,8,16}_urls` — Fat Teddy single pass.
- `fsst_contains_or_{3,8,16}_urls_npass` — N-pass baseline.

Signed-off-by: Claude <claude@anthropic.com>
Replaces the hardcoded `if let Some(...) { ... } else if ...` cascade
inside `FoldedContainsDfa::scan_to_bitbuf` (and the smaller cascades in
`FlatContainsDfa` / `MultiContainsDfa`) with a single `ScanPlanner`
that picks a `ScanPlan` up front and dispatches through one match.

New `dfa/planner.rs` (~430 lines) exposes:
- `ScanPlan` — one variant per legacy cascade branch, plus a reserved
  `ShiftOr` slot for Task A. Slot is `cfg_attr`-gated dead_code outside
  the test harness.
- `ScanContext` — borrowed inputs (n, all_bytes, ssa codes, bucket
  summaries, escape-only flag) the planner reads in O(1).
- `ScanPlanner::plan_folded` / `plan_flat_or_multi` — rules-based
  routing that replicates the legacy cascade exactly (locked in by
  `test_planner_matches_legacy_cascade` against every fsst_contains
  bench needle on every bench corpus).
- `ssa_saturated` and `escape_pair_targets` moved here as the single
  source of truth.
- `ArchProfile::detect()` runs CPUID once at `ScanPlanner::new()`; the
  arch is cached for the lifetime of the DFA.
- `ScanPlanner::estimated_cost_ns` returns approximate per-call cost.
  Calibrated from `DESIGN.md` numbers and benches/fsst_like.rs:
    * triple Teddy: AVX-512 4.28 GB/s, AVX2 2.74 GB/s, NEON 2.5, scalar 0.8
    * pair Teddy:  AVX-512 5.50, AVX2 3.30, NEON 3.0, scalar 1.0
    * 1-byte:      AVX-512 12.0, AVX2 8.0, NEON 7.0, scalar 2.0
    * memmem ~25 GB/s, row-loop ~150 ns/row
  Today the cost is diagnostic only (the routing is rules-based); the
  constants exist for VORTEX_FSST_PLAN_TRACE and to make later
  comparison-based selection mechanical.
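In the spirit of `estimated_cost_ns`, a bytes-over-throughput model with the constants quoted above looks like this (enum names and the per-row constant are illustrative; note GB/s is numerically bytes-per-ns):

```rust
#[derive(Clone, Copy)]
enum Arch { Avx512, Avx2, Neon, Scalar }

#[derive(Clone, Copy)]
enum Plan { TripleTeddy, PairTeddy, OneByte, Memmem, RowLoop }

/// Calibrated throughputs per (plan, arch), in GB/s.
fn throughput_gb_s(plan: Plan, arch: Arch) -> f64 {
    use {Arch::*, Plan::*};
    match (plan, arch) {
        (TripleTeddy, Avx512) => 4.28, (TripleTeddy, Avx2) => 2.74,
        (TripleTeddy, Neon) => 2.5,    (TripleTeddy, Scalar) => 0.8,
        (PairTeddy, Avx512) => 5.50,   (PairTeddy, Avx2) => 3.30,
        (PairTeddy, Neon) => 3.0,      (PairTeddy, Scalar) => 1.0,
        (OneByte, Avx512) => 12.0,     (OneByte, Avx2) => 8.0,
        (OneByte, Neon) => 7.0,        (OneByte, Scalar) => 2.0,
        (Memmem, _) => 25.0,
        (RowLoop, _) => f64::NAN, // costed per row, not per byte
    }
}

/// Diagnostic per-call cost estimate: bytes / (bytes per ns), except the
/// row loop, which is charged a flat ~150 ns per row.
fn estimated_cost_ns(plan: Plan, arch: Arch, total_bytes: usize, rows: usize) -> f64 {
    match plan {
        Plan::RowLoop => rows as f64 * 150.0,
        _ => total_bytes as f64 / throughput_gb_s(plan, arch),
    }
}
```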

`FoldedContainsDfa::scan_to_bitbuf` now extracts each path into a
`run_*` helper (`run_escape_only`, `run_one_byte_saturated`,
`run_triple_teddy`, `run_escape_pair`, `run_pair_teddy`,
`run_one_byte_bitset`, `run_row_loop`) and dispatches via `match
plan { ... }`. The Teddy-trace `VORTEX_FSST_TEDDY_TRACE` output is
preserved verbatim, and a new `VORTEX_FSST_PLAN_TRACE=1` prints the
planner's chosen plan plus inputs and the estimated cost.

`FlatContainsDfa` and `MultiContainsDfa` route through the same
planner (only `EscapeOnly` vs `RowLoop`) so the dispatch surface is
uniform across the three contains DFAs.

Regression guards added:
- `test_planner_matches_legacy_cascade` runs every fsst_contains
  bench's underlying call (12 corpus × needle pairs) and asserts
  `planner.plan() == legacy_path_for(...)`. Future changes can't
  silently re-route traffic.
- 11 unit tests in `planner::tests` cover each routing decision row,
  cost-model monotonicity, and `ScanPlan::name` uniqueness.

No algorithmic changes — every existing scan path is invoked under
the same conditions as before, so benches are at parity.

Checks:
- cargo test -p vortex-fsst --lib --features _test-harness: 184 passed
- cargo test -p vortex-fsst --lib: 182 passed
- cargo +nightly fmt --all: clean
- cargo clippy -p vortex-fsst --all-targets --all-features: no new
  lints in changed files (pre-existing lints in dfa_compressed/,
  anchor_scan.rs:3100+, mod.rs:498, multi_contains.rs:405 untouched).
- cargo bench -p vortex-fsst --bench fsst_like --features _test-harness:
  benches compile and `fsst_contains_htt_{cb,urls}` /
  `fsst_contains_https_urls` run inside expected timings.

Signed-off-by: Claude Agent <claude-agent@anthropic.com>
Adds a bit-parallel Shift-Or / Bitap matcher (`ShiftOrDfa`) for short
`%needle%` Contains patterns where the needle is ≤ 8 bytes and no FSST
symbol's expansion contains the needle (no SSA). The new matcher
maintains a single `u64` state and updates it via
`state = (state << shift) | or_mask` per FSST code, with a per-symbol
table composed from the decompressed-byte mask. ESCAPE pairs apply the
byte mask directly. The intermediate accept check
`(!state) & state_accept_mask[c] != 0` handles multi-byte symbol
expansions where the needle could match midway through a symbol.

Wiring:

- `dfa/shift_or.rs`: `ShiftOrDfa` with `new`, `matches`, and
  `scan_to_bitbuf`. The scan path uses the existing `anchor_scan`
  progressing-code bitset prefilter so it doesn't degrade to a
  row-by-row inner loop on sparse-needle workloads.
- `dfa/mod.rs`: `MatcherInner::ShiftOr` variant plus a conservative
  routing gate in `FsstMatcher::try_new_with`. ShiftOr is selected
  only when (a) the FoldedContains escape-only `memmem` fast path
  doesn't apply, (b) `needle[0]` is absent from every symbol's
  expansion (so FoldedContains' Teddy-2 pair scan would have no
  bucket), and (c) no symbol contains the needle outright.
- `dfa/tests.rs`: routing + end-to-end tests covering all gating
  conditions.

Tests (16 new, 185 total in `vortex-fsst --lib`):

- 1-byte / 2-byte / 8-byte needle matching with and without symbols.
- Wildcard `_` and ASCII case-insensitive matching.
- FSST symbol composition (multi-byte symbols whose expansion straddles
  the needle).
- Constructor rejects empty needles, oversized needles, and SSA cases.
- `rstest` property test comparing `ShiftOrDfa::matches` to
  `FoldedContainsDfa::matches` on random code streams over needle
  lengths 1..=8, both case-sensitive and case-insensitive.
- Routing test verifying the gate selects `shift_or` only where
  expected and falls through to `folded_contains` otherwise.

Perf (`cargo bench -p vortex-fsst --bench fsst_like --features
_test-harness`, single run, x86_64 AVX2):

| Bench                                | median   |
|--------------------------------------|----------|
| fsst_contains_short_zz_urls          | 53.35 µs |
| fsst_contains_short_zzz_urls         | 53.49 µs |
| fsst_contains_short_xy_urls          | 64.45 µs |
| fsst_contains_short_qq_urls          | 64.14 µs |
| fsst_contains_short_qq_cb            | 310.3 µs |
| fsst_contains_short_xyzz_rare        | 277.5 µs |

All parametric `fsst_contains` benches stay at parity with HEAD (within
bench-to-bench noise). ShiftOr is currently bypassed on URL-shaped
data because FoldedContains' Teddy-2 pair scan dominates whenever
`needle[0]` lives in a symbol expansion, which is the typical case for
trained FSST dictionaries. The matcher and routing are in place for a
follow-up that integrates Shift-Or into the planner alongside
Teddy-2/3 path selection.

Signed-off-by: Claude <noreply@anthropic.com>
Brings in `MultiNeedleMatcher` + `dfa/fat_teddy.rs` (8-bucket Fat
Teddy with greedy bucket-packing, AVX2 + scalar, per-bucket
`FoldedContainsDfa::matches` verifier) plus 12 new unit/property
tests and 6 multi-needle OR benches.

Conflicts resolved:
- `mod.rs`: kept HEAD's empirically-tuned ShiftOr gate (no
  escape-only eligible AND no first-byte present in any symbol AND
  no SSA), plus Task A's `first_byte_present_in_any_symbol`
  helper; appended Fat Teddy's `MultiNeedleMatcher` section
  unchanged.
- `tests.rs`: HEAD's test bodies covering the richer ShiftOr
  routing predicates; appended Fat Teddy's `MultiNeedleMatcher`
  test section (12 new tests).
- `benches/fsst_like.rs`: appended Fat Teddy's six
  `fsst_contains_or_*` benches (3-, 8-, 16-needle Fat Teddy vs
  N-pass baselines on the ClickBench URL corpus).

Deferred TODOs preserved from the subagent's commit:
- Cross-bucket FDR for ESCAPE_CODE-anchored needles (falls back
  to N-pass).
- AVX-512 and NEON variants of `fat_teddy_pass_*`.
- Planner integration (Task C, separate merge).

196 tests pass with `_test-harness`. `cargo +nightly fmt --all`
clean.
Brings in `dfa/planner.rs` (~430 LOC): `ScanPlanner`, `ScanContext`,
`ScanPlan` (with reserved `ShiftOr` variant), `ArchProfile` (CPUID
once at construction), and a calibrated `estimated_cost_ns` cost
model. Refactors `FoldedContainsDfa`, `FlatContainsDfa`,
`MultiContainsDfa::scan_to_bitbuf` to dispatch via the planner
into per-path `run_*` helpers; `ssa_saturated` /
`escape_pair_targets` consolidated into the planner module.

Adds `test_planner_matches_legacy_cascade` (12 corpus × needle
pairs from `benches/fsst_like.rs`) plus 11 unit tests covering
each routing decision row. New `VORTEX_FSST_PLAN_TRACE=1` env var
prints planner inputs + chosen plan + estimated cost.

Conflicts resolved:
- `folded_contains.rs`: kept Fat Teddy's accessor methods
  (`bucketed_pair_codes_slice`, `single_step_accept_codes_slice`)
  AND the planner's `scan_plan_name` refactor.
- `tests.rs`: kept Fat Teddy's `MultiNeedleMatcher` test section
  AND the planner's `test_planner_matches_legacy_cascade`
  bench-parity regression test.

After all three subagent merges (Shift-Or + Fat Teddy + planner),
210 tests pass with `_test-harness`. `cargo +nightly fmt --all`
clean.

Deferred TODOs preserved:
- Cross-bucket FDR for ESCAPE_CODE in Fat Teddy.
- AVX-512 / NEON variants of `fat_teddy_pass_*`.
- Planner integration of the `ShiftOr` plan (reserved slot
  exists; routing decision is still made in `try_new_with`).

Signed-off-by: Claude <noreply@anthropic.com>
@codspeed-hq

codspeed-hq Bot commented May 14, 2026

Merging this PR will not alter performance

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 58 improved benchmarks
❌ 75 regressed benchmarks
✅ 992 untouched benchmarks
🆕 19 new benchmarks
⏩ 115 skipped benchmarks [1]
🗄️ 38 archived benchmarks run [2]

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | chunked_bool_canonical_into[(10, 1000)] | 794.9 µs | 922 µs | -13.79% |
| Simulation | chunked_bool_canonical_into[(100, 100)] | 102.7 µs | 116.4 µs | -11.81% |
| Simulation | chunked_bool_canonical_into[(1000, 10)] | 46.8 µs | 59.5 µs | -21.26% |
| Simulation | chunked_constant_i32_append_to_builder[(1000, 10)] | 30.9 µs | 39.5 µs | -21.79% |
| Simulation | chunked_opt_bool_canonical_into[(10, 1000)] | 912.8 µs | 1,143.7 µs | -20.19% |
| Simulation | chunked_opt_bool_canonical_into[(100, 100)] | 205 µs | 246.8 µs | -16.96% |
| Simulation | chunked_opt_bool_into_canonical[(10, 1000)] | 1.4 ms | 1.3 ms | +10% |
| Simulation | chunked_varbinview_into_canonical[(10, 1000)] | 2.2 ms | 1.9 ms | +15.1% |
| Simulation | bench_compare_primitive[(10000, 128)] | 106.9 µs | 120.6 µs | -11.32% |
| Simulation | bench_compare_primitive[(10000, 2)] | 106 µs | 118.4 µs | -10.47% |
| Simulation | bench_compare_primitive[(10000, 32)] | 106.3 µs | 118.9 µs | -10.59% |
| Simulation | bench_compare_primitive[(10000, 4)] | 105.8 µs | 119.1 µs | -11.17% |
| Simulation | bench_compare_primitive[(10000, 8)] | 105.7 µs | 118.6 µs | -10.88% |
| Simulation | bench_compare_sliced_dict_primitive[(1000, 10000)] | 80.7 µs | 93 µs | -13.25% |
| Simulation | bench_compare_sliced_dict_primitive[(2000, 10000)] | 85.1 µs | 98.2 µs | -13.33% |
| Simulation | bench_compare_sliced_dict_primitive[(2500, 10000)] | 87.6 µs | 100.9 µs | -13.2% |
| Simulation | bench_compare_sliced_dict_primitive[(3333, 10000)] | 92.6 µs | 105.4 µs | -12.13% |
| Simulation | bench_compare_sliced_dict_primitive[(5000, 10000)] | 101.6 µs | 114 µs | -10.88% |
| Simulation | encode_varbinview[(1000, 2)] | 203.2 µs | 164.5 µs | +23.51% |
| Simulation | bench_sparse_coverage[0.01] | 366.5 µs | 439.5 µs | -16.61% |
| ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/optimize-string-lookup-KuJZB (d24e5cb) with develop (7349cd6) [3]

Open in CodSpeed

Footnotes

  1. 115 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. 38 benchmarks were run, but are now archived. If they were deleted in another branch, consider rebasing to remove them from the report. Instead if they were added back, click here to restore them.

  3. No successful run was found on ji/fsst-like-paper-2-work-clean (049c79f) during the generation of this report, so develop (7349cd6) was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@joseph-isaacs joseph-isaacs changed the title FSST LIKE: SSA fusion, dense short-circuit, _ wildcard, ILIKE, Shift-Or, Fat Teddy, planner [claude] FSST LIKE: SSA fusion, dense short-circuit, _ wildcard, ILIKE, Shift-Or, Fat Teddy, planner May 14, 2026