
[claude] FSST LIKE: SSA fusion, dense short-circuit, _ wildcard, ILIKE, Shift-Or, Fat Teddy, planner #7921

Open
joseph-isaacs wants to merge 15 commits into ji/fsst-like-paper-2-work-clean from claude/optimize-string-lookup-KuJZB

Conversation

@joseph-isaacs
Contributor

Stacks 15 commits on top of ji/fsst-like-paper-2-work-clean covering perf, coverage, and routing.

Perf summary (divan medians, 100k rows):

| Pattern (dataset) | Pre-SSA baseline | HEAD | Δ |
|---|---|---|---|
| %google% urls (ClickBench Q20 target) | 1.16 ms | 0.95 ms | 1.22× faster |
| %gmail% email | 1.12 ms | 0.60 ms | 1.86× faster |
| %ear% urls (DESIGN.md SSA target) | 1.41 ms | 0.69 ms | 2.05× faster |
| %ear% cb | 3.82 ms | 3.03 ms | 1.26× faster |
| %htt% urls (saturated SSA) | 0.59 ms | 0.62 ms | parity (dense short-circuit) |
| %htt% cb | 1.31 ms | 1.30 ms | parity |
| %https% urls | 0.70 ms | 0.80 ms | 1.14× slower |
| NOT LIKE %google% urls | 0.95 ms (row-loop) | streams now | — |

What's in the stack (15 commits)

Performance / routing

  • 6c0ac95dc Escape-only memmem fast path — when no FSST symbol expansion contains any needle byte, the only compressed sequence reaching the contains DFA's accept state is exactly [ESCAPE, n[0], ESCAPE, n[1], …, ESCAPE, n[L-1]]. Single memmem over all_bytes prefilters rows; verifier resolves the rare literal-byte-position false positive.
  • 9e826367f Same fast path ported to FlatContainsDfa (128–254-byte needles) and MultiContainsDfa (longest segment as anchor).
  • b466388f6 Fused 32-byte SSA nibble in AVX2/AVX-512 Teddy — DESIGN.md's proper SSA-merge, implemented inside fused_teddy_pair_scan / fused_teddy_triple_scan. Reuses v1's nibble register for an additional PSHUFB-Mula lookup over single_step_accept_codes; ORs into the Teddy mask before vpcmpgtb / vpcmpneqb. Two extra vector ops per block, one PSHUFB pass total. Eliminates the dense-bitset 1-byte fallback for selective SSA needles.
  • 970c09228 NOT LIKE streaming (drop like.rs:113 gate, NOT LIKE routes through Teddy + fused SSA) and dense-pattern short-circuit (8 KiB scalar SSA-byte sample; if estimated > 32k candidates total, route directly to 1-byte fallback). Recovers the %htt% regression that pure fusion left.
  • 24f94ed00 NEON SSA fusion mirrors the AVX2 / AVX-512 logic; cfg-gated.

Coverage

  • b791a0097 _ single-byte wildcard for prefix and suffix patterns. KMP byte-table fills the row at wildcard positions. _ in %contains% deliberately rejected — KMP failure with wildcards is unsound (false positives); a correct contains-with-_ needs NFA subset construction.
  • 5e5334c62 ILIKE / ASCII case-insensitive. Case folding at DFA construction time via set_advance; pattern equality uses case-folded compare in the failure function. Threaded through every DFA constructor via FsstMatcher::try_new_with.

Subagent track (parallel worktrees, then merged)

  • eb2b2e6bc Shift-Or / Bitap matcher for short needles (≤8 bytes). Bit-parallel single-u64-state matcher with per-symbol composition (shift_bits, or_mask, state_accept_mask). Routing gated behind "no escape-only eligible, no first-byte present in any symbol, no SSA" — on URL-shaped FSST dicts that's rarely true, so Shift-Or rarely fires in practice. Variant is correct + tested; pays off on sparser dictionaries.
  • 0e8889c18 / 666ad9cde Fat Teddy multi-needle OR — 8-bucket Fat Teddy with greedy bucket-packing, AVX2 + scalar; per-bucket FoldedContainsDfa::matches verifier. MultiNeedleMatcher::try_new_multi / scan_or_to_bitbuf for LIKE x OR LIKE y OR …. Deferred TODOs in code: cross-bucket FDR for ESCAPE_CODE-anchored needles (currently falls back to N-pass), AVX-512 + NEON variants.
  • b994a06ac / d24e5cb22 Engine planner / cost-model routing — refactors scan_to_bitbuf in folded / flat / multi to dispatch via match planner.plan_*(&ctx) into per-path run_* helpers. ArchProfile (CPUID once), estimated_cost_ns cost model, VORTEX_FSST_PLAN_TRACE=1. Bench-parity regression test test_planner_matches_legacy_cascade asserts the planner reproduces the legacy cascade's decision for every fsst_contains bench. Reserved ShiftOr enum variant with TODO comment — current Shift-Or routing still happens in try_new_with, not via the planner.

Misc

  • f0c8b5146 Follow-ups doc at encodings/fsst/src/dfa/FOLLOWUPS.md with subagent-ready briefs for the three tasks above plus mixed-anchors / contains-with-_ for future work.

Tests

210 tests pass with --features _test-harness (157 baseline + 53 new):

  • 5 + 4 escape-only memmem
  • 6 wildcard
  • 6 ILIKE
  • 16 Shift-Or
  • 12 Fat Teddy (incl. 2 _test-harness-gated property tests vs OR-of-singles)
  • 12 planner (incl. test_planner_matches_legacy_cascade against the bench corpus)

cargo +nightly fmt --all clean. cargo clippy -p vortex-fsst --all-targets --all-features — no new lints in changed files (preexisting lints in mod.rs:498, anchor_scan.rs:3100+, dfa_compressed/* remain unaddressed).

Deferred (preserved as TODOs in code; documented in FOLLOWUPS.md)

  • Cross-bucket FDR for ESCAPE_CODE-anchored needles in Fat Teddy.
  • AVX-512 + NEON variants of fat_teddy_pass_*.
  • Planner integration of the ShiftOr plan (reserved slot exists).
  • Mixed anchors (pre%mid%suf) — needs MultiContainsDfa refactor.
  • Contains with _ — needs NFA subset construction.

Test plan

  • cargo test -p vortex-fsst --lib --features _test-harness — must report 210 passing.
  • cargo bench -p vortex-fsst --bench fsst_like --features _test-harness — confirm the headline perf table above on the reviewer's hardware.
  • cargo clippy -p vortex-fsst --all-targets --all-features — diff against base; no new lints in encodings/fsst/src/dfa/* or encodings/fsst/src/compute/like.rs.
  • Spot-check VORTEX_FSST_TEDDY_TRACE=1 output for %ear% (should route through fused SSA Teddy) and %htt% (should route through dense short-circuit).
  • Review the merge resolutions in 666ad9cde (Fat Teddy) and d24e5cb22 (planner) — kept HEAD's stricter ShiftOr gate, appended Fat Teddy's MultiNeedleMatcher section, and combined both subagents' test additions.
  • Run an independent code review on the three subagent-authored files (shift_or.rs 806 LOC, fat_teddy.rs 744 LOC, planner.rs ~430 LOC) — I reviewed merge conflicts but did not exhaustively audit each subagent's implementation. Consider spawning a pr-review agent on the diff range 049c79f89..d24e5cb22.

🤖 Generated with Claude Code

claude and others added 15 commits May 13, 2026 20:37
When no FSST symbol's expansion contains any byte of the needle, the
only compressed sequence that reaches `FoldedContainsDfa::accept` from
state 0 is exactly `[ESCAPE, n[0], ESCAPE, n[1], ..., ESCAPE, n[L-1]]`.
Any non-escape code resets the DFA to state 0 (symbols can't contribute
a needle byte), so a single `memmem` for the 2L-byte encoded pattern is
strictly more selective than the existing 2-byte `(ESCAPE, n[0])`
escape-pair anchor.

The hot regime is `%needle%` where the needle bytes are absent from the
trained dict — e.g. queries for tokens that never appear in the column,
or rare-alphabet substrings on text columns. Memmem with a long pattern
benefits from skip heuristics, so candidate density drops roughly L×
versus the 2-byte anchor.

Verification still runs the standard DFA per candidate row so the rare
literal-position memmem hit (compressed[p-1] = ESCAPE AND compressed[p]
= 255) is handled correctly — `FoldedContainsDfa::matches` always starts
from a code position, so its result is exact regardless of where memmem
landed.

The new path is gated on needle length >= 2 — at L = 1 the encoded
pattern is identical to the existing escape_pair 2-byte memmem, so
adding a separate path would just add a redundant branch.
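To make the invariant concrete, here is a minimal standalone sketch of the escape-only pattern build and prefilter. `ESCAPE` and `find_subslice` are illustrative stand-ins, not the actual vortex-fsst identifiers; production code would use `memchr::memmem` to get the skip heuristics described above.

```rust
const ESCAPE: u8 = 255; // FSST escape code (illustrative constant)

/// Build the only compressed byte sequence that can encode `needle`
/// when no symbol expansion contains any needle byte:
/// [ESCAPE, n[0], ESCAPE, n[1], ..., ESCAPE, n[L-1]].
fn build_escape_only_pattern(needle: &[u8]) -> Vec<u8> {
    let mut pat = Vec::with_capacity(needle.len() * 2);
    for &b in needle {
        pat.push(ESCAPE);
        pat.push(b);
    }
    pat
}

/// Naive stand-in for the memmem prefilter over `all_bytes`.
fn find_subslice(haystack: &[u8], pat: &[u8]) -> Option<usize> {
    if pat.is_empty() || haystack.len() < pat.len() {
        return None;
    }
    haystack.windows(pat.len()).position(|w| w == pat)
}
```

Rows where `find_subslice` misses are definitively non-matching; hits still go through the DFA verifier as described above.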

Signed-off-by: Claude <noreply@anthropic.com>
Port the escape-only prefilter from FoldedContainsDfa to the other two
contains shapes:

- `FlatContainsDfa` (needles 128..=254 bytes): identical logic — same
  detection, same encoded pattern, same per-row verification. Long
  needles benefit the most here because memmem's skip distance scales
  with pattern length.

- `MultiContainsDfa` (`%seg1%seg2%…`): every segment must appear in
  the row for a match, so the LONGEST segment's encoded pattern is a
  sound (and most-selective) row-level filter. Hits are verified by
  the full chained DFA which checks segment ordering. Only the union
  of all segment bytes is required to be absent from every symbol.
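The anchor-selection rule for the multi-segment case is simple enough to sketch; `pick_anchor_segment` is a hypothetical helper name illustrating the "longest segment is the most-selective sound filter" choice:

```rust
/// Every segment must appear in a matching row, so any single segment is
/// a sound row-level filter; the longest one is the most selective
/// (memmem skip distance grows with pattern length).
fn pick_anchor_segment<'a>(segments: &[&'a [u8]]) -> Option<&'a [u8]> {
    segments.iter().copied().max_by_key(|s| s.len())
}
```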

The detection and pattern builder (`needle_bytes_absent_from_all_symbols`,
`build_escape_only_encoded_pattern`) are shared from `dfa/mod.rs`.

Signed-off-by: Claude <noreply@anthropic.com>
The bucketed Teddy pair/triple scan only anchors on c1's whose one-step
state from 0 is non-accept (an SSA c1 has no advancing c2 to anchor on,
so the pair scheme would miss it). Today the scan-path gate skips Teddy
entirely when any SSA code exists and falls to the 1-byte progressing
bitset, which on URL-shaped data marks nearly every byte and verifies
most as false positives.

Run Teddy unconditionally when its buckets exist (they already exclude
SSA c1's by construction) and OR-merge its result with a separate
SSA-only pass:

  - Build a 1-byte PSHUFB-Mula bitset over `single_step_accept_codes`
    (typically 1-4 SSA codes -> very sparse).
  - For each row touched by the SSA bitset, verify with the standard
    `matches` DFA. This is exact, including for the rare hit where the
    SSA code's byte value appears at a literal position (compressed
    [p-1] = ESCAPE, compressed[p] = SSA-code-value-as-literal): the DFA
    starts at the row's first code position and never enters that
    misalignment.
  - Combine: non-negated → `teddy | ssa`, negated → `teddy & ssa`.

When neither Teddy bucket nor SSA exists, fall back to the 1-byte
progressing-codes bitset as before; the 1-byte path remains correct
for the SSA-only case too (its `matches_with_bitset` is the full DFA).

`scan_plan_name` now reports `triple_streaming+ssa_merge` /
`pair_streaming+ssa_merge` / `escape_pair_streaming+ssa_merge` so
tracing and tests can verify the path that fired.

Signed-off-by: Claude <noreply@anthropic.com>
DESIGN.md's SSA-merge proposal: build Teddy over the non-SSA progressing
c1's and a 1-byte PSHUFB over the SSA codes, then OR the two candidate
streams *inside* `fused_teddy_*_scan`. The earlier row-level OR-merge
attempt regressed everywhere because it paid for two separate PSHUFB
passes over `all_bytes` plus a per-row `matches` verify; this commit
does it the right way.

Mechanism:
- `fused_teddy_pair_scan` / `fused_teddy_triple_scan` take an
  optional `ssa_codes: &[u8]`. If present, a `NibbleTables` is built
  once and threaded through `run_teddy_*_pass` to every SIMD variant.
- In `teddy_pair_pass_avx2` and `teddy_triple_pass_avx512`, the SSA
  nibble tables are broadcast into a pair of YMM/ZMM registers up
  front. Per 32-byte (AVX2) / 64-byte (AVX-512) block the existing
  PSHUFB pair on `v1`'s nibbles is reused to also evaluate the SSA
  set — a few extra vector ops (`shuffle` × 2, `and`, `or`) — and the
  resulting `ssa_bits` register is OR'd into the Teddy candidate
  mask before `vpcmpgtb` / `vpcmpneqb`. Candidates from SSA flow into
  the same `tzcnt`-peeling loop and the same per-candidate
  `verify_at` dispatch.
- Tail scalar paths fold SSA in cheaply (one nibble table lookup
  per byte), and the AVX-512 and AVX2 tails additionally check the
  last 1–2 positions that the pair/triple scan skips for lack of a
  successor.
- `teddy_pair_pass_scalar` mirrors the AVX2 logic.
- `teddy_pair_pass_neon` and the avx2/neon/scalar triple variants
  accept the parameter but skip the fusion for now; on AArch64 the
  caller falls back to the non-fused path. (Marked with TODO.)
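The PSHUFB-Mula lookup the SSA fusion reuses can be modeled in scalar code. This is a hedged sketch of the technique (not the vortex-fsst implementation): two 16-entry tables indexed by a byte's low and high nibble, with one bit per set member; a byte belongs to the set iff the AND of the two lookups is nonzero. In SIMD the same lookup is one `shuffle` per table per block.

```rust
/// Build lo/hi nibble tables for a set of up to 8 codes (one bit each).
fn build_nibble_tables(set: &[u8]) -> ([u8; 16], [u8; 16]) {
    assert!(set.len() <= 8, "one bit per set member");
    let (mut lo, mut hi) = ([0u8; 16], [0u8; 16]);
    for (i, &b) in set.iter().enumerate() {
        lo[(b & 0x0F) as usize] |= 1 << i;
        hi[(b >> 4) as usize] |= 1 << i;
    }
    (lo, hi)
}

/// Membership test: bit i survives the AND only when both nibbles of `b`
/// match set member i, so a nonzero result means b is in the set.
fn in_set(lo: &[u8; 16], hi: &[u8; 16], b: u8) -> bool {
    lo[(b & 0x0F) as usize] & hi[(b >> 4) as usize] != 0
}
```

With distinct codes the test is exact: a surviving bit pins down both nibbles of `b`, hence the whole byte.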

Caller (`FoldedContainsDfa::scan_to_bitbuf`) drops the
`single_step_accept_codes.is_none()` gate and passes
`single_step_accept_codes` as `ssa_codes` to the Teddy entry points.
The escape-pair specialization (one bucket, c1 = ESCAPE, ≤3 c2's)
still doesn't fuse SSA — when SSA codes are present we fall through
to the generic pair-streaming path, which does.

`scan_plan_name` reports `triple_streaming+ssa_fused` /
`pair_streaming+ssa_fused` when fusion is active.

Bench (divan medians, 100k rows, `cargo bench -p vortex-fsst --bench
fsst_like --features _test-harness`):

| pattern (dataset) | pre-SSA | fused | Δ |
|---|---|---|---|
| %google% urls (Q20) | 1.16 ms | 0.94 ms | 1.23× faster |
| %gmail% email      | 1.12 ms | 0.60 ms | 1.86× faster |
| %ear% urls (DESIGN target) | 1.41 ms | 0.69 ms | 2.05× faster |
| %ear% cb           | 3.82 ms | 3.03 ms | 1.26× faster |
| %yandex% cb        | 1.95 ms | 2.02 ms | 1.04× slower |
| %htt% urls         | 0.59 ms | 1.84 ms | 3.13× slower |
| %https% urls       | 0.70 ms | 0.90 ms | 1.29× slower |

Headline wins on selective regimes (`%google%`, `%ear%`, `%gmail%`).
Saturated-SSA needles (`%htt%`, `%https%`) regress because the
per-candidate `verify_at` dispatch beats per-row `matches_with_bitset`
short-circuit when candidates outnumber rows; that's the regime
DESIGN.md's separate "dense-pattern short-circuit" is meant to
cover.

Adds `%htt%` / `%ear%` / `%https%` benches to `fsst_like.rs` to
guard the SSA cases going forward.

Signed-off-by: Claude <noreply@anthropic.com>
Two routing fixes:

1. NOT LIKE streaming. `like.rs` previously gated `negated=true`
   away from the streaming Teddy paths and onto the per-row loop.
   The streaming paths already handle negation correctly (initial
   bitbuf state + set/unset polarity), and tests cover it. Removing
   the gate routes NOT LIKE through Teddy and the SSA fusion.

2. Dense-pattern short-circuit. The fused-Teddy+SSA path I added
   last commit regressed saturated-SSA needles (%htt% / %https%):
   per-candidate `verify_at` dispatch beats per-row
   `matches_with_bitset` short-circuit when candidates outnumber
   rows. An 8 KiB scalar sample of `all_bytes` for SSA byte hits
   extrapolates to an estimated candidate count; above 32k we
   route directly to the 1-byte progressing-codes path
   (`scan_with_anchor_bitset`), which is what the pre-SSA-merge
   code did for SSA-present needles. The sample cost is a few µs;
   below threshold the fused path runs as before.
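The sample-and-extrapolate step can be sketched as follows. Names (`estimated_candidates`, `is_ssa_byte`) and the exact extrapolation are illustrative; the constants are the ones quoted above.

```rust
const SAMPLE_BYTES: usize = 8 * 1024; // scalar sample of the compressed stream
const DENSE_THRESHOLD: usize = 32_000; // calibrated from the bench corpus

/// Count SSA-byte hits in an 8 KiB prefix and extrapolate the hit
/// density to the whole buffer.
fn estimated_candidates(all_bytes: &[u8], is_ssa_byte: &[bool; 256]) -> usize {
    let sample = &all_bytes[..all_bytes.len().min(SAMPLE_BYTES)];
    if sample.is_empty() {
        return 0;
    }
    let hits = sample.iter().filter(|&&b| is_ssa_byte[b as usize]).count();
    hits * all_bytes.len() / sample.len()
}

/// Above the threshold, route to the 1-byte progressing-codes path
/// instead of the fused Teddy+SSA scan.
fn route_to_one_byte_fallback(all_bytes: &[u8], is_ssa_byte: &[bool; 256]) -> bool {
    estimated_candidates(all_bytes, is_ssa_byte) > DENSE_THRESHOLD
}
```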

Bench (divan medians, 100k rows):

| pattern         | pre-SSA | fused-only | fused + short-circuit |
|-----------------|---------|------------|-----------------------|
| %ear%  urls     | 1.41 ms | 0.69 ms ✓  | 0.69 ms ✓             |
| %ear%  cb       | 3.82 ms | 3.03 ms ✓  | 3.02 ms ✓             |
| %google% urls   | 1.16 ms | 0.94 ms ✓  | 0.92 ms ✓             |
| %htt%  urls     | 0.59 ms | 1.86 ms ✗  | 0.62 ms (parity)      |
| %htt%  cb       | 1.31 ms | 2.92 ms ✗  | 1.30 ms (parity)      |
| %https% urls    | 0.70 ms | 0.92 ms ✗  | 0.76 ms (close)       |

Selective wins from SSA fusion are preserved; saturated-SSA
regressions are eliminated. The threshold (32k estimated
candidates) was calibrated from the bench corpus — `%ear%` /
`%google%` sit comfortably under it while `%htt%` / `%https%` /
`%htt% cb` all cross it.

Adds NOT LIKE benches (%google% urls, %xyzzy% rare) to guard the
ungated path.

Signed-off-by: Claude <noreply@anthropic.com>
Extend the LikeKind parser and the KMP byte-table / suffix byte-table
construction to treat `_` (byte 0x5F) as the SQL single-byte wildcard.
Anchored shapes — `prefix%` and `%suffix` — gain wildcard support;
each `_` position transitions on every byte instead of one literal.

Unanchored shapes (`%contains%`, `%seg1%seg2%`) are still rejected
when any `_` appears: KMP's failure function with wildcards is
unsound (treats `_` as symmetrically compatible with any pattern
byte, producing false positives at the DFA level). A correct
unanchored wildcard matcher needs NFA subset construction; tracked
as a follow-up.

Changes:
- `dfa/mod.rs`: add `WILDCARD = b'_'`, `pattern_eq`,
  `pattern_matches_byte`. Update `kmp_byte_transitions` to fill the
  row (any byte advances) at wildcard positions; `kmp_failure_table`
  uses wildcard-aware pattern equality.
- `dfa/prefix.rs::build_prefix_byte_table`: fill the row at wildcard
  positions.
- `dfa/suffix.rs::build_suffix_byte_table`: same, for the
  backward-scanned suffix.
- `dfa/mod.rs::LikeKind::parse`: accept `_` in `Prefix` and `Suffix`
  variants; still reject in `Contains` / `MultiContains`.
- `needle_bytes_absent_from_all_symbols` skips wildcard positions
  when computing the literal-byte symbol overlap; the escape-only
  memmem fast path is gated on `needle_is_literal`.
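The "fill the row at wildcard positions" idea reduces to this sketch (simplified: no KMP failure links, and `build_prefix_byte_table` / `prefix_matches` here are illustrative standalone versions, not the real `dfa/prefix.rs` code):

```rust
const WILDCARD: u8 = b'_';

/// One 256-entry row per pattern position; at a `_` position every byte
/// advances, otherwise only the literal byte does.
fn build_prefix_byte_table(pattern: &[u8]) -> Vec<[bool; 256]> {
    pattern
        .iter()
        .map(|&p| {
            let mut row = [false; 256];
            if p == WILDCARD {
                row = [true; 256]; // `_` matches any single byte
            } else {
                row[p as usize] = true;
            }
            row
        })
        .collect()
}

/// A `prefix%` pattern matches when every leading byte advances its row.
fn prefix_matches(table: &[[bool; 256]], row_bytes: &[u8]) -> bool {
    table.len() <= row_bytes.len()
        && table.iter().zip(row_bytes).all(|(row, &b)| row[b as usize])
}
```

The unanchored case is harder precisely because this per-position table has no sound failure function once `_` rows accept everything.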

Adds 6 wildcard tests covering prefix, suffix, multi-wildcard,
leading-wildcard, symbol-interaction, and the deliberate
contains-rejection. All 163 existing + new tests pass.

Signed-off-by: Claude <noreply@anthropic.com>
Extend the DFA construction to optionally fold ASCII letter case.
Adds `FsstMatcher::try_new_with(symbols, lengths, pattern,
case_insensitive)`; the `like.rs` kernel now plumbs
`options.case_insensitive` through instead of bailing out.

Mechanism:
- `dfa/mod.rs`: add `ascii_to_lower`, `pattern_eq(a, b, ci)`,
  `pattern_matches_byte(p, b, ci)`, and a `set_advance` helper that,
  when `ci` is true, sets both case variants of an ASCII letter in
  the byte table. `kmp_byte_transitions` and `kmp_failure_table`
  now take `ci`; the fold is at construction time so the hot loop
  stays a single table lookup per byte.
- `dfa/prefix.rs::build_prefix_byte_table`, `dfa/suffix.rs::
  build_suffix_byte_table`, `dfa/multi_contains.rs::
  chained_kmp_byte_transitions`: same pattern.
- Each DFA's `new()` takes `case_insensitive: bool`. Threaded
  through `FsstMatcher::try_new_with` from `LikeKernel::like`.
- Escape-only memmem fast path is gated to wildcard-free,
  case-sensitive needles (the encoded pattern is byte-exact).
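The construction-time fold amounts to the following sketch (`set_advance` mirrors the helper named above; the row layout is illustrative):

```rust
/// ASCII lowercase fold: setting bit 0x20 lowercases A-Z, no-op otherwise.
fn ascii_to_lower(b: u8) -> u8 {
    if b.is_ascii_uppercase() { b | 0x20 } else { b }
}

/// Mark a transition in a byte-table row; under case-insensitive
/// construction, mark both case variants of an ASCII letter so the hot
/// loop stays a single table lookup per byte.
fn set_advance(row: &mut [bool; 256], b: u8, case_insensitive: bool) {
    row[b as usize] = true;
    if case_insensitive && b.is_ascii_alphabetic() {
        row[(b ^ 0x20) as usize] = true; // flip the ASCII case bit
    }
}
```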

Adds 6 ILIKE tests covering prefix, suffix, contains,
multi-contains, ILIKE + `_` wildcard, and ILIKE with FSST
symbol expansions in mixed case. 169 tests pass.

Signed-off-by: Claude <noreply@anthropic.com>
Drops the TODO on the NEON Teddy passes. Mirrors the AVX2 / AVX-512
implementations: at setup, load the SSA nibble tables into NEON
registers (or splat zero when SSA is absent); in the inner loop,
compute `ssa_bits = neon_nibble_lookup(ssa_lo, ssa_hi, v1, nibble_mask)`
and `vorrq_u8` it into the Teddy candidate vector before the
movemask. Scalar tail picks up SSA via the same one-line nibble
table check used in the AVX2 tail, and the pair NEON path adds the
last-byte SSA-only check.

The NEON code is `#[cfg(target_arch = "aarch64")]`-gated; no
runtime change on x86_64 (which already does fused SSA via AVX2 /
AVX-512). Cross-compile checking was not available locally; the logic
is byte-for-byte parallel to the AVX2 path.

Signed-off-by: Claude <noreply@anthropic.com>
Three self-contained task briefs for the remaining DFA prefilter
items — Shift-Or for short needles, Fat Teddy for multi-pattern OR,
and an engine planner / cost-model that replaces the hardcoded
routing cascade. Each section is sized to be pasted into a
subagent prompt: required context, files to touch, exit criteria,
validation gates, known pitfalls.

Includes a shared "Required context" block covering the FSST DFA
architecture so each task brief stays focused on its own scope.

Recommended order when running sequentially:
1. Shift-Or  — extends FoldedContains for needles ≤ 8 bytes.
2. Planner   — refactors scan_to_bitbuf routing; needs Shift-Or
   added first so the matrix is non-trivial.
3. Fat Teddy — multi-pattern OR; benefits from the planner.

Signed-off-by: Claude <noreply@anthropic.com>
Adds `MultiNeedleMatcher` and `dfa/fat_teddy.rs` implementing a
Hyperscan-inspired Fat Teddy prefilter for `LIKE x OR LIKE y OR ...`
on FSST-compressed strings. Up to 8 needles per pass; greedy
bucket-packing minimizes per-bucket false-positive rate
(`|c1_union| * |c2_union|`). Verification uses each needle's
existing `FoldedContainsDfa::matches`. AVX2 + scalar streaming
passes share the per-block PSHUFB-Mula lookup; AVX-512 + NEON
variants and cross-bucket FDR for ESCAPE-anchored needles are
deferred.
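The greedy bucket-packing heuristic can be sketched independently of Teddy itself. This is a hedged model, not `fat_teddy::pack_needles`: each needle contributes one (c1, c2) anchor pair here (real needles contribute code sets), and the cost metric is the `|c1_union| * |c2_union|` product named above.

```rust
use std::collections::BTreeSet;

const NUM_BUCKETS: usize = 8;

#[derive(Default, Clone)]
struct Bucket {
    c1: BTreeSet<u8>,
    c2: BTreeSet<u8>,
    needles: Vec<usize>,
}

/// Cost of a bucket after hypothetically adding (c1, c2): the product of
/// the union sizes, a proxy for the bucket's false-positive rate.
fn cost_with(b: &Bucket, c1: u8, c2: u8) -> usize {
    let n1 = b.c1.len() + usize::from(!b.c1.contains(&c1));
    let n2 = b.c2.len() + usize::from(!b.c2.contains(&c2));
    n1 * n2
}

/// Greedily place each needle in the bucket whose cost grows least.
fn pack_needles(anchors: &[(u8, u8)]) -> Vec<Bucket> {
    let mut buckets = vec![Bucket::default(); NUM_BUCKETS.min(anchors.len().max(1))];
    for (idx, &(c1, c2)) in anchors.iter().enumerate() {
        let best = (0..buckets.len())
            .min_by_key(|&i| cost_with(&buckets[i], c1, c2))
            .unwrap();
        buckets[best].c1.insert(c1);
        buckets[best].c2.insert(c2);
        buckets[best].needles.push(idx);
    }
    buckets
}
```

Needles sharing anchors collapse into one bucket; disjoint needles spread out, keeping each bucket's candidate stream sparse.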

Brought from a parallel agent worktree: `shift_or.rs` (Task A)
arrived integrated into `mod.rs`. Added one fix:
`#[cfg(debug_assertions)]` guarding a debug-only function reference
inside `debug_assert!` so release builds compile.

Deferred / TODOs:
- Cross-bucket FDR for ESCAPE_CODE-anchored needles. Currently
  those needles fall back to per-needle scans (marked
  `TODO(fat-teddy)` in `fat_teddy::pack_needles`).
- AVX-512 and NEON Fat Teddy passes (today AVX2 + scalar).
- Engine planner integration (Task C). MultiNeedleMatcher
  dispatches directly without a planner.

New public API (`vortex_fsst::dfa::MultiNeedleMatcher`):
- `try_new_multi(symbols, lengths, &[&[u8]], case_insensitive)`
- `scan_or_to_bitbuf(n, offsets, all_bytes, negated)`
- `bucket_count()`, `needle_count()`, `fallback_count()`

Validation gates:
- `cargo test -p vortex-fsst --lib` passes (193 tests).
- `cargo test -p vortex-fsst --lib --features _test-harness`
  passes (196 tests, including the property test
  `test_fat_teddy_random_needles_equals_or_of_singles` and
  `test_fat_teddy_equals_or_of_single_matchers`).
- `cargo +nightly fmt --all` clean.
- `cargo clippy -p vortex-fsst --lib --all-features --tests`:
  no new lints in changed files; preexisting lints on
  `dfa/mod.rs:793` (build_symbol_transitions), `anchor_scan.rs:3100+`,
  and `dfa_compressed/` are out of scope per AGENTS.md.

New bench targets in `benches/fsst_like.rs`:
- `fsst_contains_or_{3,8,16}_urls` — Fat Teddy single pass.
- `fsst_contains_or_{3,8,16}_urls_npass` — N-pass baseline.

Signed-off-by: Claude <claude@anthropic.com>
Replaces the hardcoded `if let Some(...) { ... } else if ...` cascade
inside `FoldedContainsDfa::scan_to_bitbuf` (and the smaller cascades in
`FlatContainsDfa` / `MultiContainsDfa`) with a single `ScanPlanner`
that picks a `ScanPlan` up front and dispatches through one match.

New `dfa/planner.rs` (~430 lines) exposes:
- `ScanPlan` — one variant per legacy cascade branch, plus a reserved
  `ShiftOr` slot for Task A. Slot is `cfg_attr`-gated dead_code outside
  the test harness.
- `ScanContext` — borrowed inputs (n, all_bytes, ssa codes, bucket
  summaries, escape-only flag) the planner reads in O(1).
- `ScanPlanner::plan_folded` / `plan_flat_or_multi` — rules-based
  routing that replicates the legacy cascade exactly (locked in by
  `test_planner_matches_legacy_cascade` against every fsst_contains
  bench needle on every bench corpus).
- `ssa_saturated` and `escape_pair_targets` moved here as the single
  source of truth.
- `ArchProfile::detect()` runs CPUID once at `ScanPlanner::new()`; the
  arch is cached for the lifetime of the DFA.
- `ScanPlanner::estimated_cost_ns` returns approximate per-call cost.
  Calibrated from `DESIGN.md` numbers and benches/fsst_like.rs:
    * triple Teddy: AVX-512 4.28 GB/s, AVX2 2.74 GB/s, NEON 2.5, scalar 0.8
    * pair Teddy:  AVX-512 5.50, AVX2 3.30, NEON 3.0, scalar 1.0
    * 1-byte:      AVX-512 12.0, AVX2 8.0, NEON 7.0, scalar 2.0
    * memmem ~25 GB/s, row-loop ~150 ns/row
  Today the cost is diagnostic only (the routing is rules-based); the
  constants exist for VORTEX_FSST_PLAN_TRACE and to make later
  comparison-based selection mechanical.
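In the spirit of `estimated_cost_ns`, a bytes-over-throughput model with the constants quoted above looks like this (enum names and the per-row constant are illustrative; note GB/s is numerically bytes-per-ns):

```rust
#[derive(Clone, Copy)]
enum Arch { Avx512, Avx2, Neon, Scalar }

#[derive(Clone, Copy)]
enum Plan { TripleTeddy, PairTeddy, OneByte, Memmem, RowLoop }

/// Calibrated throughputs per (plan, arch), in GB/s.
fn throughput_gb_s(plan: Plan, arch: Arch) -> f64 {
    use {Arch::*, Plan::*};
    match (plan, arch) {
        (TripleTeddy, Avx512) => 4.28, (TripleTeddy, Avx2) => 2.74,
        (TripleTeddy, Neon) => 2.5,    (TripleTeddy, Scalar) => 0.8,
        (PairTeddy, Avx512) => 5.50,   (PairTeddy, Avx2) => 3.30,
        (PairTeddy, Neon) => 3.0,      (PairTeddy, Scalar) => 1.0,
        (OneByte, Avx512) => 12.0,     (OneByte, Avx2) => 8.0,
        (OneByte, Neon) => 7.0,        (OneByte, Scalar) => 2.0,
        (Memmem, _) => 25.0,
        (RowLoop, _) => f64::NAN, // costed per row, not per byte
    }
}

/// Diagnostic per-call cost estimate: bytes / (bytes per ns), except the
/// row loop, which is charged a flat ~150 ns per row.
fn estimated_cost_ns(plan: Plan, arch: Arch, total_bytes: usize, rows: usize) -> f64 {
    match plan {
        Plan::RowLoop => rows as f64 * 150.0,
        _ => total_bytes as f64 / throughput_gb_s(plan, arch),
    }
}
```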

`FoldedContainsDfa::scan_to_bitbuf` now extracts each path into a
`run_*` helper (`run_escape_only`, `run_one_byte_saturated`,
`run_triple_teddy`, `run_escape_pair`, `run_pair_teddy`,
`run_one_byte_bitset`, `run_row_loop`) and dispatches via `match
plan { ... }`. The Teddy-trace `VORTEX_FSST_TEDDY_TRACE` output is
preserved verbatim, and a new `VORTEX_FSST_PLAN_TRACE=1` prints the
planner's chosen plan plus inputs and the estimated cost.

`FlatContainsDfa` and `MultiContainsDfa` route through the same
planner (only `EscapeOnly` vs `RowLoop`) so the dispatch surface is
uniform across the three contains DFAs.

Regression guards added:
- `test_planner_matches_legacy_cascade` runs every fsst_contains
  bench's underlying call (12 corpus × needle pairs) and asserts
  `planner.plan() == legacy_path_for(...)`. Future changes can't
  silently re-route traffic.
- 11 unit tests in `planner::tests` cover each routing decision row,
  cost-model monotonicity, and `ScanPlan::name` uniqueness.

No algorithmic changes — every existing scan path is invoked under
the same conditions as before, so benches are at parity.

Checks:
- cargo test -p vortex-fsst --lib --features _test-harness: 184 passed
- cargo test -p vortex-fsst --lib: 182 passed
- cargo +nightly fmt --all: clean
- cargo clippy -p vortex-fsst --all-targets --all-features: no new
  lints in changed files (pre-existing lints in dfa_compressed/,
  anchor_scan.rs:3100+, mod.rs:498, multi_contains.rs:405 untouched).
- cargo bench -p vortex-fsst --bench fsst_like --features _test-harness:
  benches compile and `fsst_contains_htt_{cb,urls}` /
  `fsst_contains_https_urls` run inside expected timings.

Signed-off-by: Claude Agent <claude-agent@anthropic.com>
Adds a bit-parallel Shift-Or / Bitap matcher (`ShiftOrDfa`) for short
`%needle%` Contains patterns where the needle is ≤ 8 bytes and no FSST
symbol's expansion contains the needle (no SSA). The new matcher
maintains a single `u64` state and updates it via
`state = (state << shift) | or_mask` per FSST code, with a per-symbol
table composed from the decompressed-byte mask. ESCAPE pairs apply the
byte mask directly. The intermediate accept check
`(!state) & state_accept_mask[c] != 0` handles multi-byte symbol
expansions where the needle could match midway through a symbol.

Wiring:

- `dfa/shift_or.rs`: `ShiftOrDfa` with `new`, `matches`, and
  `scan_to_bitbuf`. The scan path uses the existing `anchor_scan`
  progressing-code bitset prefilter so it doesn't degrade to a
  row-by-row inner loop on sparse-needle workloads.
- `dfa/mod.rs`: `MatcherInner::ShiftOr` variant plus a conservative
  routing gate in `FsstMatcher::try_new_with`. ShiftOr is selected
  only when (a) the FoldedContains escape-only `memmem` fast path
  doesn't apply, (b) `needle[0]` is absent from every symbol's
  expansion (so FoldedContains' Teddy-2 pair scan would have no
  bucket), and (c) no symbol contains the needle outright.
- `dfa/tests.rs`: routing + end-to-end tests covering all gating
  conditions.

Tests (16 new, 185 total in `vortex-fsst --lib`):

- 1-byte / 2-byte / 8-byte needle matching with and without symbols.
- Wildcard `_` and ASCII case-insensitive matching.
- FSST symbol composition (multi-byte symbols whose expansion straddles
  the needle).
- Constructor rejects empty needles, oversized needles, and SSA cases.
- `rstest` property test comparing `ShiftOrDfa::matches` to
  `FoldedContainsDfa::matches` on random code streams over needle
  lengths 1..=8, both case-sensitive and case-insensitive.
- Routing test verifying the gate selects `shift_or` only where
  expected and falls through to `folded_contains` otherwise.

Perf (`cargo bench -p vortex-fsst --bench fsst_like --features
_test-harness`, single run, x86_64 AVX2):

| Bench                                | median   |
|--------------------------------------|----------|
| fsst_contains_short_zz_urls          | 53.35 µs |
| fsst_contains_short_zzz_urls         | 53.49 µs |
| fsst_contains_short_xy_urls          | 64.45 µs |
| fsst_contains_short_qq_urls          | 64.14 µs |
| fsst_contains_short_qq_cb            | 310.3 µs |
| fsst_contains_short_xyzz_rare        | 277.5 µs |

All parametric `fsst_contains` benches stay at parity with HEAD (within
bench-to-bench noise). ShiftOr is currently bypassed on URL-shaped
data because FoldedContains' Teddy-2 pair scan dominates whenever
`needle[0]` lives in a symbol expansion, which is the typical case for
trained FSST dictionaries. The matcher and routing are in place for a
follow-up that integrates Shift-Or into the planner alongside
Teddy-2/3 path selection.

Signed-off-by: Claude <noreply@anthropic.com>
Brings in `MultiNeedleMatcher` + `dfa/fat_teddy.rs` (8-bucket Fat
Teddy with greedy bucket-packing, AVX2 + scalar, per-bucket
`FoldedContainsDfa::matches` verifier) plus 12 new unit/property
tests and 6 multi-needle OR benches.

Conflicts resolved:
- `mod.rs`: kept HEAD's empirically-tuned ShiftOr gate (no
  escape-only eligible AND no first-byte present in any symbol AND
  no SSA), plus Task A's `first_byte_present_in_any_symbol`
  helper; appended Fat Teddy's `MultiNeedleMatcher` section
  unchanged.
- `tests.rs`: HEAD's test bodies covering the richer ShiftOr
  routing predicates; appended Fat Teddy's `MultiNeedleMatcher`
  test section (12 new tests).
- `benches/fsst_like.rs`: appended Fat Teddy's six
  `fsst_contains_or_*` benches (3-, 8-, 16-needle Fat Teddy vs
  N-pass baselines on the ClickBench URL corpus).

Deferred TODOs preserved from the subagent's commit:
- Cross-bucket FDR for ESCAPE_CODE-anchored needles (falls back
  to N-pass).
- AVX-512 and NEON variants of `fat_teddy_pass_*`.
- Planner integration (Task C, separate merge).

196 tests pass with `_test-harness`. `cargo +nightly fmt --all`
clean.
Brings in `dfa/planner.rs` (~430 LOC): `ScanPlanner`, `ScanContext`,
`ScanPlan` (with reserved `ShiftOr` variant), `ArchProfile` (CPUID
once at construction), and a calibrated `estimated_cost_ns` cost
model. Refactors `FoldedContainsDfa`, `FlatContainsDfa`,
`MultiContainsDfa::scan_to_bitbuf` to dispatch via the planner
into per-path `run_*` helpers; `ssa_saturated` /
`escape_pair_targets` consolidated into the planner module.

Adds `test_planner_matches_legacy_cascade` (12 corpus × needle
pairs from `benches/fsst_like.rs`) plus 11 unit tests covering
each routing decision row. New `VORTEX_FSST_PLAN_TRACE=1` env var
prints planner inputs + chosen plan + estimated cost.

Conflicts resolved:
- `folded_contains.rs`: kept Fat Teddy's accessor methods
  (`bucketed_pair_codes_slice`, `single_step_accept_codes_slice`)
  AND the planner's `scan_plan_name` refactor.
- `tests.rs`: kept Fat Teddy's `MultiNeedleMatcher` test section
  AND the planner's `test_planner_matches_legacy_cascade`
  bench-parity regression test.

After all three subagent merges (Shift-Or + Fat Teddy + planner),
210 tests pass with `_test-harness`. `cargo +nightly fmt --all`
clean.

Deferred TODOs preserved:
- Cross-bucket FDR for ESCAPE_CODE in Fat Teddy.
- AVX-512 / NEON variants of `fat_teddy_pass_*`.
- Planner integration of the `ShiftOr` plan (reserved slot
  exists; routing decision is still made in `try_new_with`).

Signed-off-by: Claude <noreply@anthropic.com>
@codspeed-hq

codspeed-hq Bot commented May 14, 2026

Merging this PR will not alter performance

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 58 improved benchmarks
❌ 75 regressed benchmarks
✅ 992 untouched benchmarks
🆕 19 new benchmarks
⏩ 115 skipped benchmarks [1]
🗄️ 38 archived benchmarks run [2]

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | chunked_bool_canonical_into[(10, 1000)] | 794.9 µs | 922 µs | -13.79% |
| Simulation | chunked_bool_canonical_into[(100, 100)] | 102.7 µs | 116.4 µs | -11.81% |
| Simulation | chunked_bool_canonical_into[(1000, 10)] | 46.8 µs | 59.5 µs | -21.26% |
| Simulation | chunked_constant_i32_append_to_builder[(1000, 10)] | 30.9 µs | 39.5 µs | -21.79% |
| Simulation | chunked_opt_bool_canonical_into[(10, 1000)] | 912.8 µs | 1,143.7 µs | -20.19% |
| Simulation | chunked_opt_bool_canonical_into[(100, 100)] | 205 µs | 246.8 µs | -16.96% |
| Simulation | chunked_opt_bool_into_canonical[(10, 1000)] | 1.4 ms | 1.3 ms | +10% |
| Simulation | chunked_varbinview_into_canonical[(10, 1000)] | 2.2 ms | 1.9 ms | +15.1% |
| Simulation | bench_compare_primitive[(10000, 128)] | 106.9 µs | 120.6 µs | -11.32% |
| Simulation | bench_compare_primitive[(10000, 2)] | 106 µs | 118.4 µs | -10.47% |
| Simulation | bench_compare_primitive[(10000, 32)] | 106.3 µs | 118.9 µs | -10.59% |
| Simulation | bench_compare_primitive[(10000, 4)] | 105.8 µs | 119.1 µs | -11.17% |
| Simulation | bench_compare_primitive[(10000, 8)] | 105.7 µs | 118.6 µs | -10.88% |
| Simulation | bench_compare_sliced_dict_primitive[(1000, 10000)] | 80.7 µs | 93 µs | -13.25% |
| Simulation | bench_compare_sliced_dict_primitive[(2000, 10000)] | 85.1 µs | 98.2 µs | -13.33% |
| Simulation | bench_compare_sliced_dict_primitive[(2500, 10000)] | 87.6 µs | 100.9 µs | -13.2% |
| Simulation | bench_compare_sliced_dict_primitive[(3333, 10000)] | 92.6 µs | 105.4 µs | -12.13% |
| Simulation | bench_compare_sliced_dict_primitive[(5000, 10000)] | 101.6 µs | 114 µs | -10.88% |
| Simulation | encode_varbinview[(1000, 2)] | 203.2 µs | 164.5 µs | +23.51% |
| Simulation | bench_sparse_coverage[0.01] | 366.5 µs | 439.5 µs | -16.61% |
| ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/optimize-string-lookup-KuJZB (d24e5cb) with develop (7349cd6) [3]

Open in CodSpeed

Footnotes

  1. 115 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. 38 benchmarks were run, but are now archived. If they were deleted in another branch, consider rebasing to remove them from the report. Instead if they were added back, click here to restore them.

  3. No successful run was found on ji/fsst-like-paper-2-work-clean (049c79f) during the generation of this report, so develop (7349cd6) was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@joseph-isaacs joseph-isaacs changed the title FSST LIKE: SSA fusion, dense short-circuit, _ wildcard, ILIKE, Shift-Or, Fat Teddy, planner [claude] FSST LIKE: SSA fusion, dense short-circuit, _ wildcard, ILIKE, Shift-Or, Fat Teddy, planner May 14, 2026