[claude] FSST LIKE: SSA fusion, dense short-circuit, _ wildcard, ILIKE, Shift-Or, Fat Teddy, planner (#7921)
When no FSST symbol's expansion contains any byte of the needle, the only compressed sequence that reaches `FoldedContainsDfa::accept` from state 0 is exactly `[ESCAPE, n[0], ESCAPE, n[1], ..., ESCAPE, n[L-1]]`. Any non-escape code resets the DFA to state 0 (symbols can't contribute a needle byte), so a single `memmem` for the 2L-byte encoded pattern is strictly more selective than the existing 2-byte `(ESCAPE, n[0])` escape-pair anchor.

The hot regime is `%needle%` where the needle bytes are absent from the trained dict — e.g. queries for tokens that never appear in the column, or rare-alphabet substrings on text columns. Memmem with a long pattern benefits from skip heuristics, so candidate density drops roughly L× versus the 2-byte anchor.

Verification still runs the standard DFA per candidate row, so the rare literal-position memmem hit (compressed[p-1] = ESCAPE AND compressed[p] = 255) is handled correctly — `FoldedContainsDfa::matches` always starts from a code position, so its result is exact regardless of where memmem landed.

The new path is gated on needle length >= 2 — at L = 1 the encoded pattern is identical to the existing escape_pair 2-byte memmem, so adding a separate path would just add a redundant branch.

Signed-off-by: Claude <noreply@anthropic.com>
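The encoded-pattern construction above can be sketched directly. This is an illustrative model, not the crate's code: `ESCAPE = 255` and the naive `find` are stand-ins for the real escape code and `memchr::memmem` (the real builder is the shared `build_escape_only_encoded_pattern` helper mentioned in the next commit).

```rust
const ESCAPE: u8 = 255;

/// When no symbol expansion contains any needle byte, every needle byte
/// in a match must have been emitted as an escape pair, so the only
/// compressed form of the needle is [ESCAPE, n[0], ESCAPE, n[1], ...].
fn build_escape_only_pattern(needle: &[u8]) -> Vec<u8> {
    let mut pat = Vec::with_capacity(needle.len() * 2);
    for &b in needle {
        pat.push(ESCAPE);
        pat.push(b);
    }
    pat
}

/// Naive stand-in for memmem: position of the first occurrence of `pat`.
fn find(hay: &[u8], pat: &[u8]) -> Option<usize> {
    hay.windows(pat.len()).position(|w| w == pat)
}

fn main() {
    let pat = build_escape_only_pattern(b"ab");
    assert_eq!(pat, vec![ESCAPE, b'a', ESCAPE, b'b']);
    // A row encoding "ab" through escape pairs is caught by the prefilter...
    assert!(find(&[7, ESCAPE, b'a', ESCAPE, b'b', 9], &pat).is_some());
    // ...while rows made only of symbol codes are skipped without a DFA run.
    assert!(find(&[7, 8, 9], &pat).is_none());
}
```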
Port the escape-only prefilter from FoldedContainsDfa to the other two contains shapes:

- `FlatContainsDfa` (needles 128..=254 bytes): identical logic — same detection, same encoded pattern, same per-row verification. Long needles benefit the most here because memmem's skip distance scales with pattern length.
- `MultiContainsDfa` (`%seg1%seg2%…`): every segment must appear in the row for a match, so the longest segment's encoded pattern is a sound (and most selective) row-level filter. Hits are verified by the full chained DFA, which checks segment ordering. Only the union of all segment bytes needs to be absent from every symbol.

The detection and pattern builder (`needle_bytes_absent_from_all_symbols`, `build_escape_only_encoded_pattern`) are shared from `dfa/mod.rs`.

Signed-off-by: Claude <noreply@anthropic.com>
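The longest-segment choice for the multi-contains filter is a one-liner; a minimal sketch with a hypothetical helper name (the real selection lives inside `MultiContainsDfa`'s scan setup):

```rust
/// For `%seg1%seg2%…`, every segment must appear somewhere in the row,
/// so any single segment's encoded pattern is a sound row-level filter.
/// The longest one is the most selective, since memmem's skip distance
/// grows with pattern length — so it is the one worth scanning for.
fn best_filter_segment<'a>(segments: &[&'a [u8]]) -> &'a [u8] {
    segments
        .iter()
        .copied()
        .max_by_key(|s| s.len())
        .expect("LIKE parser guarantees at least one segment")
}

fn main() {
    let segs: [&[u8]; 3] = [b"ab", b"wxyz", b"c"];
    assert_eq!(best_filter_segment(&segs), b"wxyz");
}
```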
The bucketed Teddy pair/triple scan only anchors on c1's whose one-step
state from 0 is non-accept (an SSA c1 has no advancing c2 to anchor on,
so the pair scheme would miss it). Today the scan-path gate skips Teddy
entirely when any SSA code exists and falls to the 1-byte progressing
bitset, which on URL-shaped data marks nearly every byte and verifies
most as false positives.
Run Teddy unconditionally when its buckets exist (they already exclude
SSA c1's by construction) and OR-merge its result with a separate
SSA-only pass:
- Build a 1-byte PSHUFB-Mula bitset over `single_step_accept_codes`
(typically 1-4 SSA codes -> very sparse).
- For each row touched by the SSA bitset, verify with the standard
`matches` DFA. This is exact, including for the rare hit where the
SSA code's byte value appears at a literal position (compressed
[p-1] = ESCAPE, compressed[p] = SSA-code-value-as-literal): the DFA
starts at the row's first code position and never enters that
misalignment.
- Combine: non-negated → `teddy | ssa`, negated → `teddy & ssa`.
When neither Teddy bucket nor SSA exists, fall back to the 1-byte
progressing-codes bitset as before; the 1-byte path remains correct
for the SSA-only case too (its `matches_with_bitset` is the full DFA).
`scan_plan_name` now reports `triple_streaming+ssa_merge` /
`pair_streaming+ssa_merge` / `escape_pair_streaming+ssa_merge` so
tracing and tests can verify the path that fired.
Signed-off-by: Claude <noreply@anthropic.com>
This reverts commit 5bedb1d.
DESIGN.md's SSA-merge proposal: build Teddy over the non-SSA progressing c1's and a 1-byte PSHUFB over the SSA codes, then OR the two candidate streams *inside* `fused_teddy_*_scan`. The earlier row-level OR-merge attempt regressed everywhere because it paid for two separate PSHUFB passes over `all_bytes` plus a per-row `matches` verify; this commit does it the right way.

Mechanism:
- `fused_teddy_pair_scan` / `fused_teddy_triple_scan` take an optional `ssa_codes: &[u8]`. If present, a `NibbleTables` is built once and threaded through `run_teddy_*_pass` to every SIMD variant.
- In `teddy_pair_pass_avx2` and `teddy_triple_pass_avx512`, the SSA nibble tables are broadcast into a pair of YMM/ZMM registers up front. Per 32-byte (AVX2) / 64-byte (AVX-512) block, the existing PSHUFB pair on `v1`'s nibbles is reused to also evaluate the SSA set — two extra vector ops (`shuffle` × 2, `and`, `or`) — and the resulting `ssa_bits` register is OR'd into the Teddy candidate mask before `vpcmpgtb` / `vpcmpneqb`. Candidates from SSA flow into the same `tzcnt`-peeling loop and the same per-candidate `verify_at` dispatch.
- Tail scalar paths fold SSA in cheaply (one nibble-table lookup per byte), and the AVX-512 and AVX2 tails additionally check the last 1–2 positions that the pair/triple scan skips for lack of a successor.
- `teddy_pair_pass_scalar` mirrors the AVX2 logic.
- `teddy_pair_pass_neon` and the avx2/neon/scalar triple variants accept the parameter but skip the fusion for now; on AArch64 the caller falls back to the non-fused path. (Marked with TODO.)

The caller (`FoldedContainsDfa::scan_to_bitbuf`) drops the `single_step_accept_codes.is_none()` gate and passes `single_step_accept_codes` as `ssa_codes` to the Teddy entry points. The escape-pair specialization (one bucket, c1 = ESCAPE, ≤3 c2's) still doesn't fuse SSA — when SSA codes are present we fall through to the generic pair-streaming path, which does.
`scan_plan_name` reports `triple_streaming+ssa_fused` / `pair_streaming+ssa_fused` when fusion is active.

Bench (divan medians, 100k rows, `cargo bench -p vortex-fsst --bench fsst_like --features _test-harness`):

| pattern (dataset) | pre-SSA | fused | Δ |
|---|---|---|---|
| %google% urls (Q20) | 1.16 ms | 0.94 ms | 1.23× faster |
| %gmail% email | 1.12 ms | 0.60 ms | 1.86× faster |
| %ear% urls (DESIGN target) | 1.41 ms | 0.69 ms | 2.05× faster |
| %ear% cb | 3.82 ms | 3.03 ms | 1.26× faster |
| %yandex% cb | 1.95 ms | 2.02 ms | 1.04× slower |
| %htt% urls | 0.59 ms | 1.84 ms | 3.13× slower |
| %https% urls | 0.70 ms | 0.90 ms | 1.29× slower |

Headline wins on selective regimes (`%google%`, `%ear%`, `%gmail%`). Saturated-SSA needles (`%htt%`, `%https%`) regress because per-candidate `verify_at` dispatch loses to the per-row `matches_with_bitset` short-circuit when candidates outnumber rows; that's the regime DESIGN.md's separate "dense-pattern short-circuit" is meant to cover.

Adds `%htt%` / `%ear%` / `%https%` benches to `fsst_like.rs` to guard the SSA cases going forward.

Signed-off-by: Claude <noreply@anthropic.com>
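The PSHUFB-Mula set test fused into the block scan above has a simple scalar model: two 16-entry tables indexed by a byte's low and high nibble, with membership decided by whether the looked-up masks share a bit. The names below are illustrative (the real `NibbleTables` feeds vector `shuffle` lookups); the construction is exact for sets whose bytes span at most 8 distinct high nibbles, which holds for the typical 1–4 SSA codes.

```rust
/// Build the two nibble tables: each distinct high nibble in the set is
/// assigned a bit; lo[n] collects the bits of set bytes whose low nibble
/// is n, hi[n] the bits of set bytes whose high nibble is n.
fn build_nibble_tables(set: &[u8]) -> ([u8; 16], [u8; 16]) {
    let (mut lo, mut hi) = ([0u8; 16], [0u8; 16]);
    let mut hi_bit = std::collections::HashMap::new();
    for &b in set {
        let n = hi_bit.len();
        let bit = *hi_bit.entry(b >> 4).or_insert(1u8 << n);
        lo[(b & 0xF) as usize] |= bit;
        hi[(b >> 4) as usize] |= bit;
    }
    (lo, hi)
}

/// Scalar equivalent of the per-lane PSHUFB pair + AND in the SIMD loop.
fn in_set(lo: &[u8; 16], hi: &[u8; 16], b: u8) -> bool {
    lo[(b & 0xF) as usize] & hi[(b >> 4) as usize] != 0
}

fn main() {
    let set = [0x41, 0x61, 0xFF]; // 'A', 'a', and the escape code
    let (lo, hi) = build_nibble_tables(&set);
    assert!(in_set(&lo, &hi, 0x41) && in_set(&lo, &hi, 0x61) && in_set(&lo, &hi, 0xFF));
    // 0x4F shares a high nibble with 'A' and a low nibble with 0xFF,
    // but their assigned bits differ, so the AND filters it out.
    assert!(!in_set(&lo, &hi, 0x4F) && !in_set(&lo, &hi, b'B'));
}
```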
Two routing fixes:

1. NOT LIKE streaming. `like.rs` previously gated `negated=true` away from the streaming Teddy paths and onto the per-row loop. The streaming paths already handle negation correctly (initial bitbuf state + set/unset polarity), and tests cover it. Removing the gate routes NOT LIKE through Teddy and the SSA fusion.
2. Dense-pattern short-circuit. The fused-Teddy+SSA path added in the last commit regressed saturated-SSA needles (%htt% / %https%): per-candidate `verify_at` dispatch loses to the per-row `matches_with_bitset` short-circuit when candidates outnumber rows. An 8 KiB scalar sample of `all_bytes` for SSA byte hits extrapolates to an estimated candidate count; above 32k we route directly to the 1-byte progressing-codes path (`scan_with_anchor_bitset`), which is what the pre-SSA-merge code did for SSA-present needles. The sample costs a few µs; below the threshold the fused path runs as before.

Bench (divan medians, 100k rows):

| pattern | pre-SSA | fused-only | fused + short-circuit |
|---|---|---|---|
| %ear% urls | 1.41 ms | 0.69 ms ✓ | 0.69 ms ✓ |
| %ear% cb | 3.82 ms | 3.03 ms ✓ | 3.02 ms ✓ |
| %google% urls | 1.16 ms | 0.94 ms ✓ | 0.92 ms ✓ |
| %htt% urls | 0.59 ms | 1.86 ms ✗ | 0.62 ms (parity) |
| %htt% cb | 1.31 ms | 2.92 ms ✗ | 1.30 ms (parity) |
| %https% urls | 0.70 ms | 0.92 ms ✗ | 0.76 ms (close) |

Selective wins from SSA fusion are preserved; saturated-SSA regressions are eliminated. The threshold (32k estimated candidates) was calibrated from the bench corpus — `%ear%` / `%google%` sit comfortably under it while `%htt%` / `%https%` / `%htt% cb` all cross it.

Adds NOT LIKE benches (%google% urls, %xyzzy% rare) to guard the ungated path.

Signed-off-by: Claude <noreply@anthropic.com>
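The sample-and-extrapolate gate above fits in a few lines; a sketch under the stated constants (8 KiB sample, 32k threshold), with illustrative names rather than the crate's:

```rust
/// Count SSA-byte hits in a prefix of the compressed stream and scale up
/// to the full buffer. Candidates ~ SSA-byte occurrences, so this is a
/// cheap density probe before committing to the fused Teddy path.
const SAMPLE: usize = 8 * 1024;
const MAX_CANDIDATES: usize = 32_000;

fn estimated_candidates(all_bytes: &[u8], is_ssa: &[bool; 256]) -> usize {
    let sample = &all_bytes[..all_bytes.len().min(SAMPLE)];
    if sample.is_empty() {
        return 0;
    }
    let hits = sample.iter().filter(|&&b| is_ssa[b as usize]).count();
    hits * all_bytes.len() / sample.len()
}

/// Above the threshold, route to the 1-byte progressing-codes path;
/// below it, the fused Teddy+SSA scan runs as before.
fn use_fused_path(all_bytes: &[u8], is_ssa: &[bool; 256]) -> bool {
    estimated_candidates(all_bytes, is_ssa) <= MAX_CANDIDATES
}

fn main() {
    let mut is_ssa = [false; 256];
    is_ssa[7] = true;
    // every sampled byte hits -> estimate ~= buffer length, far above 32k
    assert!(!use_fused_path(&vec![7u8; 100_000], &is_ssa));
    // no SSA bytes at all -> estimate 0, fused path stays on
    assert!(use_fused_path(&vec![0u8; 100_000], &is_ssa));
}
```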
Extend the LikeKind parser and the KMP byte-table / suffix byte-table construction to treat `_` (byte 0x5F) as the SQL single-byte wildcard. Anchored shapes — `prefix%` and `%suffix` — gain wildcard support; each `_` position transitions on every byte instead of one literal. Unanchored shapes (`%contains%`, `%seg1%seg2%`) are still rejected when any `_` appears: KMP's failure function with wildcards is unsound (it treats `_` as symmetrically compatible with any pattern byte, producing false positives at the DFA level). A correct unanchored wildcard matcher needs NFA subset construction; tracked as a follow-up.

Changes:
- `dfa/mod.rs`: add `WILDCARD = b'_'`, `pattern_eq`, `pattern_matches_byte`. Update `kmp_byte_transitions` to fill the row (any byte advances) at wildcard positions; `kmp_failure_table` uses wildcard-aware pattern equality.
- `dfa/prefix.rs::build_prefix_byte_table`: fill the row at wildcard positions.
- `dfa/suffix.rs::build_suffix_byte_table`: same, for the backward-scanned suffix.
- `dfa/mod.rs::LikeKind::parse`: accept `_` in `Prefix` and `Suffix` variants; still reject in `Contains` / `MultiContains`.
- `needle_bytes_absent_from_all_symbols` skips wildcard positions when computing the literal-byte symbol overlap; the escape-only memmem fast path is gated on `needle_is_literal`.

Adds 6 wildcard tests covering prefix, suffix, multi-wildcard, leading-wildcard, symbol-interaction, and the deliberate contains-rejection. All 163 existing + new tests pass.

Signed-off-by: Claude <noreply@anthropic.com>
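The "fill the row at wildcard positions" change is easiest to see on a simplified table. A sketch under one stated simplification: rows here are advance/no-advance booleans, whereas the real tables store next-state codes — the wildcard handling is identical.

```rust
const WILDCARD: u8 = b'_';

/// Simplified byte table for an anchored pattern: row i answers "does
/// byte b advance past pattern position i". A `_` position fills the
/// whole row (any byte advances); a literal sets a single entry.
fn build_prefix_byte_table(pattern: &[u8]) -> Vec<[bool; 256]> {
    pattern
        .iter()
        .map(|&p| {
            if p == WILDCARD {
                [true; 256] // `_` matches any single byte
            } else {
                let mut row = [false; 256];
                row[p as usize] = true;
                row
            }
        })
        .collect()
}

fn main() {
    let t = build_prefix_byte_table(b"a_c");
    assert!(t[0][b'a' as usize] && !t[0][b'b' as usize]);
    assert!(t[1].iter().all(|&adv| adv)); // wildcard row: every byte advances
    assert!(t[2][b'c' as usize] && !t[2][b'_' as usize]);
}
```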
Extend the DFA construction to optionally fold ASCII letter case. Adds `FsstMatcher::try_new_with(symbols, lengths, pattern, case_insensitive)`; the `like.rs` kernel now plumbs `options.case_insensitive` through instead of bailing out.

Mechanism:
- `dfa/mod.rs`: add `ascii_to_lower`, `pattern_eq(a, b, ci)`, `pattern_matches_byte(p, b, ci)`, and a `set_advance` helper that, when `ci` is true, sets both case variants of an ASCII letter in the byte table. `kmp_byte_transitions` and `kmp_failure_table` now take `ci`; the fold happens at construction time, so the hot loop stays a single table lookup per byte.
- `dfa/prefix.rs::build_prefix_byte_table`, `dfa/suffix.rs::build_suffix_byte_table`, `dfa/multi_contains.rs::chained_kmp_byte_transitions`: same pattern.
- Each DFA's `new()` takes `case_insensitive: bool`, threaded through `FsstMatcher::try_new_with` from `LikeKernel::like`.
- The escape-only memmem fast path is gated to wildcard-free, case-sensitive needles (the encoded pattern is byte-exact).

Adds 6 ILIKE tests covering prefix, suffix, contains, multi-contains, ILIKE + `_` wildcard, and ILIKE with FSST symbol expansions in mixed case. 169 tests pass.

Signed-off-by: Claude <noreply@anthropic.com>
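The construction-time fold can be sketched in isolation (same simplified boolean rows as the wildcard sketch above; the real `set_advance` writes next-state codes):

```rust
/// Construction-time ASCII case folding: with `ci` set, mark both case
/// variants of a letter, so the match loop needs no per-byte folding and
/// stays a single table lookup per byte.
fn set_advance(row: &mut [bool; 256], b: u8, ci: bool) {
    row[b as usize] = true;
    if ci && b.is_ascii_alphabetic() {
        row[(b ^ 0x20) as usize] = true; // 0x20 flips ASCII letter case
    }
}

fn main() {
    let mut ci_row = [false; 256];
    set_advance(&mut ci_row, b'g', true);
    assert!(ci_row[b'g' as usize] && ci_row[b'G' as usize]);

    let mut cs_row = [false; 256];
    set_advance(&mut cs_row, b'g', false);
    assert!(cs_row[b'g' as usize] && !cs_row[b'G' as usize]);

    // non-letters never get a phantom variant: '5' ^ 0x20 stays unset
    let mut digits = [false; 256];
    set_advance(&mut digits, b'5', true);
    assert!(digits[b'5' as usize] && !digits[(b'5' ^ 0x20) as usize]);
}
```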
Drops the TODO on the NEON Teddy passes. Mirrors the AVX2 / AVX-512 implementations: at setup, load the SSA nibble tables into NEON registers (or splat zero when SSA is absent); in the inner loop, compute `ssa_bits = neon_nibble_lookup(ssa_lo, ssa_hi, v1, nibble_mask)` and `vorrq_u8` it into the Teddy candidate vector before the movemask. The scalar tail picks up SSA via the same one-line nibble-table check used in the AVX2 tail, and the pair NEON path adds the last-byte SSA-only check.

The NEON code is `#[cfg(target_arch = "aarch64")]`-gated; no runtime change on x86_64 (which already does fused SSA via AVX2 / AVX-512). Cross-compile checking was not available locally; the logic is byte-for-byte parallel to the AVX2 path.

Signed-off-by: Claude <noreply@anthropic.com>
Three self-contained task briefs for the remaining DFA prefilter items — Shift-Or for short needles, Fat Teddy for multi-pattern OR, and an engine planner / cost model that replaces the hardcoded routing cascade. Each section is sized to be pasted into a subagent prompt: required context, files to touch, exit criteria, validation gates, known pitfalls. Includes a shared "Required context" block covering the FSST DFA architecture so each task brief stays focused on its own scope.

Recommended order when running sequentially:
1. Shift-Or — extends FoldedContains for needles ≤ 8 bytes.
2. Planner — refactors scan_to_bitbuf routing; needs Shift-Or added first so the matrix is non-trivial.
3. Fat Teddy — multi-pattern OR; benefits from the planner.

Signed-off-by: Claude <noreply@anthropic.com>
Adds `MultiNeedleMatcher` and `dfa/fat_teddy.rs` implementing a
Hyperscan-inspired Fat Teddy prefilter for `LIKE x OR LIKE y OR ...`
on FSST-compressed strings. Up to 8 needles per pass; greedy
bucket-packing minimizes per-bucket false-positive rate
(`|c1_union| * |c2_union|`). Verification uses each needle's
existing `FoldedContainsDfa::matches`. AVX2 + scalar streaming
passes share the per-block PSHUFB-Mula lookup; AVX-512 + NEON
variants and cross-bucket FDR for ESCAPE-anchored needles are
deferred.
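The greedy bucket-packing objective above can be sketched compactly. A simplification for illustration: each needle is summarized by a single (c1, c2) anchor pair, and `pack` is a hypothetical name — the real `fat_teddy::pack_needles` tracks full code sets and handles the ESCAPE-anchored fallback.

```rust
use std::collections::BTreeSet;

/// Place each needle into the bucket whose false-positive proxy
/// |c1_union| * |c2_union| grows the least after insertion, so needles
/// sharing anchor bytes cluster together and sparse buckets stay sparse.
fn pack(needles: &[(u8, u8)], buckets: usize) -> Vec<(BTreeSet<u8>, BTreeSet<u8>)> {
    let mut out: Vec<(BTreeSet<u8>, BTreeSet<u8>)> = vec![Default::default(); buckets];
    for &(c1, c2) in needles {
        let best = (0..buckets)
            .min_by_key(|&i| {
                let (a, b) = &out[i];
                // union sizes if this needle joined bucket i
                let na = a.len() + !a.contains(&c1) as usize;
                let nb = b.len() + !b.contains(&c2) as usize;
                na * nb
            })
            .expect("bucket count is non-zero");
        out[best].0.insert(c1);
        out[best].1.insert(c2);
    }
    out
}

fn main() {
    // two unrelated needles spread across buckets...
    let spread = pack(&[(b'a', b'b'), (b'c', b'd')], 2);
    assert!(spread[0].0.len() == 1 && spread[1].0.len() == 1);
    // ...while identical anchor pairs share one bucket for free
    let shared = pack(&[(b'a', b'b'), (b'a', b'b')], 2);
    assert!(shared[0].0.len() == 1 && shared[1].0.is_empty());
}
```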
Brought from a parallel agent worktree: `shift_or.rs` (Task A)
arrived integrated into `mod.rs`. Added one fix:
`#[cfg(debug_assertions)]` guarding a debug-only function reference
inside `debug_assert!` so release builds compile.
Deferred / TODOs:
- Cross-bucket FDR for ESCAPE_CODE-anchored needles. Currently
those needles fall back to per-needle scans (marked
`TODO(fat-teddy)` in `fat_teddy::pack_needles`).
- AVX-512 and NEON Fat Teddy passes (today AVX2 + scalar).
- Engine planner integration (Task C). MultiNeedleMatcher
dispatches directly without a planner.
New public API (`vortex_fsst::dfa::MultiNeedleMatcher`):
- `try_new_multi(symbols, lengths, &[&[u8]], case_insensitive)`
- `scan_or_to_bitbuf(n, offsets, all_bytes, negated)`
- `bucket_count()`, `needle_count()`, `fallback_count()`
Validation gates:
- `cargo test -p vortex-fsst --lib` passes (193 tests).
- `cargo test -p vortex-fsst --lib --features _test-harness`
passes (196 tests, including the property test
`test_fat_teddy_random_needles_equals_or_of_singles` and
`test_fat_teddy_equals_or_of_single_matchers`).
- `cargo +nightly fmt --all` clean.
- `cargo clippy -p vortex-fsst --lib --all-features --tests`:
no new lints in changed files; preexisting lints on
`dfa/mod.rs:793` (build_symbol_transitions), `anchor_scan.rs:3100+`,
and `dfa_compressed/` are out of scope per AGENTS.md.
New bench targets in `benches/fsst_like.rs`:
- `fsst_contains_or_{3,8,16}_urls` — Fat Teddy single pass.
- `fsst_contains_or_{3,8,16}_urls_npass` — N-pass baseline.
Signed-off-by: Claude <claude@anthropic.com>
Replaces the hardcoded `if let Some(...) { ... } else if ...` cascade
inside `FoldedContainsDfa::scan_to_bitbuf` (and the smaller cascades in
`FlatContainsDfa` / `MultiContainsDfa`) with a single `ScanPlanner`
that picks a `ScanPlan` up front and dispatches through one match.
New `dfa/planner.rs` (~430 lines) exposes:
- `ScanPlan` — one variant per legacy cascade branch, plus a reserved
`ShiftOr` slot for Task A. Slot is `cfg_attr`-gated dead_code outside
the test harness.
- `ScanContext` — borrowed inputs (n, all_bytes, ssa codes, bucket
summaries, escape-only flag) the planner reads in O(1).
- `ScanPlanner::plan_folded` / `plan_flat_or_multi` — rules-based
routing that replicates the legacy cascade exactly (locked in by
`test_planner_matches_legacy_cascade` against every fsst_contains
bench needle on every bench corpus).
- `ssa_saturated` and `escape_pair_targets` moved here as the single
source of truth.
- `ArchProfile::detect()` runs CPUID once at `ScanPlanner::new()`; the
arch is cached for the lifetime of the DFA.
- `ScanPlanner::estimated_cost_ns` returns approximate per-call cost.
Calibrated from `DESIGN.md` numbers and benches/fsst_like.rs:
* triple Teddy: AVX-512 4.28 GB/s, AVX2 2.74 GB/s, NEON 2.5, scalar 0.8
* pair Teddy: AVX-512 5.50, AVX2 3.30, NEON 3.0, scalar 1.0
* 1-byte: AVX-512 12.0, AVX2 8.0, NEON 7.0, scalar 2.0
* memmem ~25 GB/s, row-loop ~150 ns/row
Today the cost is diagnostic only (the routing is rules-based); the
constants exist for VORTEX_FSST_PLAN_TRACE and to make later
comparison-based selection mechanical.
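Since 1 GB/s is exactly one byte per nanosecond, the calibrated throughputs translate to costs with a single division. A sketch of the cost arithmetic only — the function names and the choice of constants per plan are illustrative, not the planner's API:

```rust
/// Approximate cost of a bulk scan path: 1 GB/s == 1 byte/ns, so
/// bytes divided by GB/s gives nanoseconds directly.
fn bulk_cost_ns(bytes: usize, gb_per_s: f64) -> f64 {
    bytes as f64 / gb_per_s
}

/// The row-loop path is costed per row (~150 ns/row) instead of per byte.
fn row_loop_cost_ns(rows: usize) -> f64 {
    rows as f64 * 150.0
}

fn main() {
    // 1 MiB through AVX2 pair Teddy at 3.30 GB/s ~= 318 us
    let teddy = bulk_cost_ns(1 << 20, 3.30);
    assert!((teddy - 317_750.0).abs() < 100.0);
    // memmem at ~25 GB/s is far cheaper on the same buffer...
    assert!(bulk_cost_ns(1 << 20, 25.0) < teddy);
    // ...and the row loop overtakes bulk scans only at low row counts
    assert!(row_loop_cost_ns(100) < teddy);
}
```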
`FoldedContainsDfa::scan_to_bitbuf` now extracts each path into a
`run_*` helper (`run_escape_only`, `run_one_byte_saturated`,
`run_triple_teddy`, `run_escape_pair`, `run_pair_teddy`,
`run_one_byte_bitset`, `run_row_loop`) and dispatches via `match
plan { ... }`. The Teddy-trace `VORTEX_FSST_TEDDY_TRACE` output is
preserved verbatim, and a new `VORTEX_FSST_PLAN_TRACE=1` prints the
planner's chosen plan plus inputs and the estimated cost.
`FlatContainsDfa` and `MultiContainsDfa` route through the same
planner (only `EscapeOnly` vs `RowLoop`) so the dispatch surface is
uniform across the three contains DFAs.
Regression guards added:
- `test_planner_matches_legacy_cascade` runs every fsst_contains
bench's underlying call (12 corpus × needle pairs) and asserts
`planner.plan() == legacy_path_for(...)`. Future changes can't
silently re-route traffic.
- 11 unit tests in `planner::tests` cover each routing decision row,
cost-model monotonicity, and `ScanPlan::name` uniqueness.
No algorithmic changes — every existing scan path is invoked under
the same conditions as before, so benches are at parity.
Checks:
- cargo test -p vortex-fsst --lib --features _test-harness: 184 passed
- cargo test -p vortex-fsst --lib: 182 passed
- cargo +nightly fmt --all: clean
- cargo clippy -p vortex-fsst --all-targets --all-features: no new
lints in changed files (pre-existing lints in dfa_compressed/,
anchor_scan.rs:3100+, mod.rs:498, multi_contains.rs:405 untouched).
- cargo bench -p vortex-fsst --bench fsst_like --features _test-harness:
benches compile and `fsst_contains_htt_{cb,urls}` /
`fsst_contains_https_urls` run inside expected timings.
Signed-off-by: Claude Agent <claude-agent@anthropic.com>
Adds a bit-parallel Shift-Or / Bitap matcher (`ShiftOrDfa`) for short `%needle%` Contains patterns where the needle is ≤ 8 bytes and no FSST symbol's expansion contains the needle (no SSA). The new matcher maintains a single `u64` state and updates it via `state = (state << shift) | or_mask` per FSST code, with a per-symbol table composed from the decompressed-byte mask. ESCAPE pairs apply the byte mask directly. The intermediate accept check `(!state) & state_accept_mask[c] != 0` handles multi-byte symbol expansions where the needle can match midway through a symbol.

Wiring:
- `dfa/shift_or.rs`: `ShiftOrDfa` with `new`, `matches`, and `scan_to_bitbuf`. The scan path uses the existing `anchor_scan` progressing-code bitset prefilter so it doesn't degrade to a row-by-row inner loop on sparse-needle workloads.
- `dfa/mod.rs`: `MatcherInner::ShiftOr` variant plus a conservative routing gate in `FsstMatcher::try_new_with`. ShiftOr is selected only when (a) the FoldedContains escape-only `memmem` fast path doesn't apply, (b) `needle[0]` is absent from every symbol's expansion (so FoldedContains' Teddy-2 pair scan would have no bucket), and (c) no symbol contains the needle outright.
- `dfa/tests.rs`: routing + end-to-end tests covering all gating conditions.

Tests (16 new, 185 total in `vortex-fsst --lib`):
- 1-byte / 2-byte / 8-byte needle matching with and without symbols.
- Wildcard `_` and ASCII case-insensitive matching.
- FSST symbol composition (multi-byte symbols whose expansion straddles the needle).
- Constructor rejects empty needles, oversized needles, and SSA cases.
- `rstest` property test comparing `ShiftOrDfa::matches` to `FoldedContainsDfa::matches` on random code streams over needle lengths 1..=8, both case-sensitive and case-insensitive.
- Routing test verifying the gate selects `shift_or` only where expected and falls through to `folded_contains` otherwise.
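The recurrence the matcher builds on is classic Bitap, shown here in its plain byte-at-a-time form. This sketch deliberately omits the FSST layer: the real `ShiftOrDfa` steps one FSST *code* at a time with per-symbol `(shift, or_mask)` compositions, but the bit-parallel state update is the same.

```rust
/// Byte-level Shift-Or (Bitap) for needles of 1..=8 bytes. Bit k of
/// `state` is 0 iff needle[0..=k] matches ending at the current byte;
/// `masks[b]` clears bit k iff byte b may occupy needle position k.
fn shift_or_find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    assert!((1..=8).contains(&needle.len()));
    let mut masks = [!0u64; 256];
    for (k, &b) in needle.iter().enumerate() {
        masks[b as usize] &= !(1u64 << k);
    }
    let accept = 1u64 << (needle.len() - 1);
    let mut state = !0u64;
    for (i, &b) in haystack.iter().enumerate() {
        // the whole match loop: one shift, one OR, one test per byte
        state = (state << 1) | masks[b as usize];
        if state & accept == 0 {
            return Some(i + 1 - needle.len());
        }
    }
    None
}

fn main() {
    assert_eq!(shift_or_find(b"hello world", b"world"), Some(6));
    assert_eq!(shift_or_find(b"abc", b"b"), Some(1));
    assert_eq!(shift_or_find(b"abc", b"zz"), None);
}
```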
Perf (`cargo bench -p vortex-fsst --bench fsst_like --features _test-harness`, single run, x86_64 AVX2):

| Bench | median |
|---|---|
| fsst_contains_short_zz_urls | 53.35 µs |
| fsst_contains_short_zzz_urls | 53.49 µs |
| fsst_contains_short_xy_urls | 64.45 µs |
| fsst_contains_short_qq_urls | 64.14 µs |
| fsst_contains_short_qq_cb | 310.3 µs |
| fsst_contains_short_xyzz_rare | 277.5 µs |

All parametric `fsst_contains` benches stay at parity with HEAD (within bench-to-bench noise). ShiftOr is currently bypassed on URL-shaped data because FoldedContains' Teddy-2 pair scan dominates whenever `needle[0]` lives in a symbol expansion, which is the typical case for trained FSST dictionaries. The matcher and routing are in place for a follow-up that integrates Shift-Or into the planner alongside Teddy-2/3 path selection.

Signed-off-by: Claude <noreply@anthropic.com>
Brings in `MultiNeedleMatcher` + `dfa/fat_teddy.rs` (8-bucket Fat Teddy with greedy bucket-packing, AVX2 + scalar, per-bucket `FoldedContainsDfa::matches` verifier) plus 12 new unit/property tests and 6 multi-needle OR benches.

Conflicts resolved:
- `mod.rs`: kept HEAD's empirically-tuned ShiftOr gate (no escape-only eligible AND no first byte present in any symbol AND no SSA), plus Task A's `first_byte_present_in_any_symbol` helper; appended Fat Teddy's `MultiNeedleMatcher` section unchanged.
- `tests.rs`: kept HEAD's test bodies covering the richer ShiftOr routing predicates; appended Fat Teddy's `MultiNeedleMatcher` test section (12 new tests).
- `benches/fsst_like.rs`: appended Fat Teddy's six `fsst_contains_or_*` benches (3-, 8-, 16-needle Fat Teddy vs N-pass baselines on the ClickBench URL corpus).

Deferred TODOs preserved from the subagent's commit:
- Cross-bucket FDR for ESCAPE_CODE-anchored needles (falls back to N-pass).
- AVX-512 and NEON variants of `fat_teddy_pass_*`.
- Planner integration (Task C, separate merge).

196 tests pass with `_test-harness`. `cargo +nightly fmt --all` clean.
Brings in `dfa/planner.rs` (~430 LOC): `ScanPlanner`, `ScanContext`, `ScanPlan` (with reserved `ShiftOr` variant), `ArchProfile` (CPUID once at construction), and a calibrated `estimated_cost_ns` cost model. Refactors `FoldedContainsDfa`, `FlatContainsDfa`, and `MultiContainsDfa::scan_to_bitbuf` to dispatch via the planner into per-path `run_*` helpers; `ssa_saturated` / `escape_pair_targets` are consolidated into the planner module. Adds `test_planner_matches_legacy_cascade` (12 corpus × needle pairs from `benches/fsst_like.rs`) plus 11 unit tests covering each routing decision row. New `VORTEX_FSST_PLAN_TRACE=1` env var prints planner inputs + chosen plan + estimated cost.

Conflicts resolved:
- `folded_contains.rs`: kept Fat Teddy's accessor methods (`bucketed_pair_codes_slice`, `single_step_accept_codes_slice`) AND the planner's `scan_plan_name` refactor.
- `tests.rs`: kept Fat Teddy's `MultiNeedleMatcher` test section AND the planner's `test_planner_matches_legacy_cascade` bench-parity regression test.

After all three subagent merges (Shift-Or + Fat Teddy + planner), 210 tests pass with `_test-harness`. `cargo +nightly fmt --all` clean.

Deferred TODOs preserved:
- Cross-bucket FDR for ESCAPE_CODE in Fat Teddy.
- AVX-512 / NEON variants of `fat_teddy_pass_*`.
- Planner integration of the `ShiftOr` plan (reserved slot exists; the routing decision is still made in `try_new_with`).

Signed-off-by: Claude <noreply@anthropic.com>
Merging this PR will not alter performance
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | chunked_bool_canonical_into[(10, 1000)] | 794.9 µs | 922 µs | -13.79% |
| ❌ | Simulation | chunked_bool_canonical_into[(100, 100)] | 102.7 µs | 116.4 µs | -11.81% |
| ❌ | Simulation | chunked_bool_canonical_into[(1000, 10)] | 46.8 µs | 59.5 µs | -21.26% |
| ❌ | Simulation | chunked_constant_i32_append_to_builder[(1000, 10)] | 30.9 µs | 39.5 µs | -21.79% |
| ❌ | Simulation | chunked_opt_bool_canonical_into[(10, 1000)] | 912.8 µs | 1,143.7 µs | -20.19% |
| ❌ | Simulation | chunked_opt_bool_canonical_into[(100, 100)] | 205 µs | 246.8 µs | -16.96% |
| ⚡ | Simulation | chunked_opt_bool_into_canonical[(10, 1000)] | 1.4 ms | 1.3 ms | +10% |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(10, 1000)] | 2.2 ms | 1.9 ms | +15.1% |
| ❌ | Simulation | bench_compare_primitive[(10000, 128)] | 106.9 µs | 120.6 µs | -11.32% |
| ❌ | Simulation | bench_compare_primitive[(10000, 2)] | 106 µs | 118.4 µs | -10.47% |
| ❌ | Simulation | bench_compare_primitive[(10000, 32)] | 106.3 µs | 118.9 µs | -10.59% |
| ❌ | Simulation | bench_compare_primitive[(10000, 4)] | 105.8 µs | 119.1 µs | -11.17% |
| ❌ | Simulation | bench_compare_primitive[(10000, 8)] | 105.7 µs | 118.6 µs | -10.88% |
| ❌ | Simulation | bench_compare_sliced_dict_primitive[(1000, 10000)] | 80.7 µs | 93 µs | -13.25% |
| ❌ | Simulation | bench_compare_sliced_dict_primitive[(2000, 10000)] | 85.1 µs | 98.2 µs | -13.33% |
| ❌ | Simulation | bench_compare_sliced_dict_primitive[(2500, 10000)] | 87.6 µs | 100.9 µs | -13.2% |
| ❌ | Simulation | bench_compare_sliced_dict_primitive[(3333, 10000)] | 92.6 µs | 105.4 µs | -12.13% |
| ❌ | Simulation | bench_compare_sliced_dict_primitive[(5000, 10000)] | 101.6 µs | 114 µs | -10.88% |
| ⚡ | Simulation | encode_varbinview[(1000, 2)] | 203.2 µs | 164.5 µs | +23.51% |
| ❌ | Simulation | bench_sparse_coverage[0.01] | 366.5 µs | 439.5 µs | -16.61% |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed.
Comparing `claude/optimize-string-lookup-KuJZB` (d24e5cb) with `develop` (7349cd6).³
Footnotes
1. 115 benchmarks were skipped, so the baseline results were used instead.
2. 38 benchmarks were run, but are now archived.
3. No successful run was found on `ji/fsst-like-paper-2-work-clean` (049c79f) during the generation of this report, so `develop` (7349cd6) was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.
Stacks 15 commits on top of `ji/fsst-like-paper-2-work-clean`, covering perf, coverage, and routing.

Perf summary (divan medians, 100k rows), headline benches:
- %google% urls (ClickBench Q20 target)
- %gmail% email
- %ear% urls (DESIGN.md SSA target)
- %ear% cb
- %htt% urls (saturated SSA)
- %htt% cb
- %https% urls
- NOT LIKE %google% urls

What's in the stack (15 commits)
Performance / routing
- 6c0ac95dc — Escape-only memmem fast path: when no FSST symbol expansion contains any needle byte, the only compressed sequence reaching the contains DFA's accept state is exactly `[ESCAPE, n[0], ESCAPE, n[1], …, ESCAPE, n[L-1]]`. A single `memmem` over `all_bytes` prefilters rows; the verifier resolves the rare literal-byte-position false positive.
- 9e826367f — Same fast path ported to `FlatContainsDfa` (128–254-byte needles) and `MultiContainsDfa` (longest segment as anchor).
- b466388f6 — Fused 32-byte SSA nibble in AVX2/AVX-512 Teddy: DESIGN.md's proper SSA-merge, implemented inside `fused_teddy_pair_scan` / `fused_teddy_triple_scan`. Reuses `v1`'s nibble register for an additional PSHUFB-Mula lookup over `single_step_accept_codes`; ORs into the Teddy mask before `vpcmpgtb` / `vpcmpneqb`. Two extra vector ops per block, one PSHUFB pass total. Eliminates the dense-bitset 1-byte fallback for selective SSA needles.
- 970c09228 — NOT LIKE streaming (drop the `like.rs:113` gate; NOT LIKE routes through Teddy + fused SSA) and dense-pattern short-circuit (8 KiB scalar SSA-byte sample; if the estimate exceeds 32k candidates total, route directly to the 1-byte fallback). Recovers the `%htt%` regression that pure fusion left.
- 24f94ed00 — NEON SSA fusion mirrors the AVX2 / AVX-512 logic; cfg-gated.

Coverage
- b791a0097 — `_` single-byte wildcard for prefix and suffix patterns. The KMP byte-table fills the row at wildcard positions. `_` in `%contains%` is deliberately rejected — KMP failure with wildcards is unsound (false positives); a correct contains-with-`_` needs NFA subset construction.
- 5e5334c62 — ILIKE / ASCII case-insensitive. Case folding at DFA construction time via `set_advance`; pattern equality uses a case-folded compare in the failure function. Threaded through every DFA constructor via `FsstMatcher::try_new_with`.

Subagent track (parallel worktrees, then merged)
- eb2b2e6bc — Shift-Or / Bitap matcher for short needles (≤8 bytes). Bit-parallel single-`u64`-state matcher with per-symbol composition `(shift_bits, or_mask, state_accept_mask)`. Routing gated behind "no escape-only eligible, no first byte present in any symbol, no SSA" — on URL-shaped FSST dicts that's rarely true, so Shift-Or rarely fires in practice. The variant is correct + tested; it pays off on sparser dictionaries.
- 0e8889c18 / 666ad9cde — Fat Teddy multi-needle OR: 8-bucket Fat Teddy with greedy bucket-packing, AVX2 + scalar; per-bucket `FoldedContainsDfa::matches` verifier. `MultiNeedleMatcher::try_new_multi` / `scan_or_to_bitbuf` for `LIKE x OR LIKE y OR …`. Deferred TODOs in code: cross-bucket FDR for ESCAPE_CODE-anchored needles (currently falls back to N-pass), AVX-512 + NEON variants.
- b994a06ac / d24e5cb22 — Engine planner / cost-model routing: refactors `scan_to_bitbuf` in folded / flat / multi to dispatch via `match planner.plan_*(&ctx)` into per-path `run_*` helpers. `ArchProfile` (CPUID once), `estimated_cost_ns` cost model, `VORTEX_FSST_PLAN_TRACE=1`. The bench-parity regression test `test_planner_matches_legacy_cascade` asserts the planner reproduces the legacy cascade's decision for every fsst_contains bench. Reserved `ShiftOr` enum variant with TODO comment — current Shift-Or routing still happens in `try_new_with`, not via the planner.

Misc
- f0c8b5146 — Follow-ups doc at `encodings/fsst/src/dfa/FOLLOWUPS.md` with subagent-ready briefs for the three tasks above plus mixed-anchors / contains-with-`_` for future work.

Tests
210 tests pass with `--features _test-harness` (157 baseline + 53 new), including the `_test-harness`-gated property tests vs OR-of-singles and `test_planner_matches_legacy_cascade` against the bench corpus. `cargo +nightly fmt --all` clean. `cargo clippy -p vortex-fsst --all-targets --all-features` — no new lints in changed files (preexisting lints in `mod.rs:498`, `anchor_scan.rs:3100+`, `dfa_compressed/*` remain unaddressed).

Deferred (preserved as TODOs in code; documented in FOLLOWUPS.md)
- AVX-512 / NEON variants of `fat_teddy_pass_*`.
- Planner integration of the `ShiftOr` plan (reserved slot exists).
- Mixed anchors (`pre%mid%suf`) — needs MultiContainsDfa refactor.
- Contains with `_` — needs NFA subset construction.

Test plan
- `cargo test -p vortex-fsst --lib --features _test-harness` — must report 210 passing.
- `cargo bench -p vortex-fsst --bench fsst_like --features _test-harness` — confirm the headline perf table above on the reviewer's hardware.
- `cargo clippy -p vortex-fsst --all-targets --all-features` — diff against base; no new lints in `encodings/fsst/src/dfa/*` or `encodings/fsst/src/compute/like.rs`.
- `VORTEX_FSST_TEDDY_TRACE=1` output for `%ear%` (should route through fused SSA Teddy) and `%htt%` (should route through the dense short-circuit).
- Merge commits `666ad9cde` (Fat Teddy) and `d24e5cb22` (planner) — kept HEAD's stricter ShiftOr gate, appended Fat Teddy's `MultiNeedleMatcher` section, and combined both subagents' test additions.
- Large subagent-authored files (`shift_or.rs` 806 LOC, `fat_teddy.rs` 744 LOC, `planner.rs` ~430 LOC) — I reviewed merge conflicts but did not exhaustively audit each subagent's implementation. Consider spawning a pr-review agent on the diff range `049c79f89..d24e5cb22`.

🤖 Generated with Claude Code