TurboQuant: correct L2 Norm by connortsui20 · Pull Request #8067 · vortex-data/vortex

connortsui20 · 2026-05-22T15:53:08Z

Summary

Tracking issue: #7830

TODO

Testing

Basic unit tests.

Doc-comment pass over the parts of vortex-turboquant that don't depend on the in-flight inv_direction_norms work: SplitMix64, the vector module, DecodeInputs's pre-existing fields, TurboQuant + TurboQuantMetadata + their fields, the TurboQuantMetadataProto wire-format note, validate_tq_metadata, validate_tq_storage_dtype, and the TurboQuantParsedStorage struct + existing fields. No semantic change. Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

Register a session-scoped execute-parent kernel that intercepts `L2Norm(TQDecode(_))` and returns the stored `norms` child directly, skipping the inverse SORF, dimension truncation, and per-row rescaling that `TQDecode` would otherwise run. Make decode norm-preserving in flight: after the inverse SORF, compute the L2 norm of the decoded-direction prefix, take its reciprocal (guarded by `is_normal` so denormal cancellation produces an all-zero row), and multiply through alongside the stored row norm. This keeps the kernel's fast path equal to the canonical slow path within floating-point precision: `|TQDecode(x)| == stored_norm[x]` by construction, with no extra storage child. Reject non-finite input L2 norms at encode so a finite row whose squared sum overflows f32 (or contains a `NaN`) cannot produce a `+inf` stored norm that would silently disagree with the in-flight correction at decode time. Add regression coverage for the kernel (bit-exact correctness, null/masked validity, parent-nullability coercion, cross-check against canonical at dim 128/129/257), the encode finite-norm guard, the decode L2-norm preservation invariant across (f16, f32, f64) x (128, 129, 257) dimensions, and the file-roundtrip invariant. Collapse the test-session helpers down to a single `test_session` that initializes both `vortex_tensor` and `vortex_turboquant`. Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

Apply 3 must-fixes from the cycle-1 phase-3 gauntlet review of the L2Norm-decode readthrough commit. - parse_storage_norms_only (storage.rs): execute the `codes` FSL wrapper and validate its row-level validity against the struct's, so the L2Norm(TQDecode(_)) fast path rejects malformed-validity inputs the same way the canonical TQDecode path does. The inner `u8` element buffer is still skipped, so the per-row decoding work is preserved as the fast-path's optimization. Per-byte centroid-index range is not checked on either path (kernel does not need the codes, canonical path panics on out-of-range bytes which is acceptable per existing malformed-input coverage). - TQEncode::is_fallible (encode.rs): return true. The new non-finite- norm rejection in vector::normalize is a data-dependent defined-behavior failure, so optimizers that speculatively evaluate unreferenced inputs (e.g., dictionary pushdown) must treat encode as fallible to avoid surfacing errors on rows the consumer would never reach. - Degenerate-direction decode (decode.rs): when the dequantized direction's squared L2 norm is zero or subnormal, emit [norm, 0, 0, ..., 0] instead of an all-zero row, so L2Norm(decoded_row) == stored_norm[row]. The all-zero fallback silently disagreed with the L2Norm(TQDecode(_)) kernel (kernel returns stored_norm while canonical returned 0). The kernel's fast/slow-path equivalence now holds for degenerate rows too, matching the documented `|TQDecode(x)| == stored_norm[x]` invariant. Cross-reference the is_fallible rationale at the normalize-side check so both sites are discoverable from each other. Deferred to follow-up: f16/f64 parameterization of the kernel-fast-path test (only f32 is currently exercised at the kernel level), a chunked- TurboQuant kernel test, parity test re-parameterization over dim=128, and a constructed regression test for the degenerate-direction decode path. All are should-fix coverage improvements, not behavior bugs. Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

Apply 3 must-fixes from the cycle-2 phase-3 gauntlet review of the cycle-1 fix-commit (7cf7433dd). - Kernel negative-norms safety (l2_norm.rs): canonical `L2Norm` always returns non-negative, but the kernel was returning the stored `norms` buffer verbatim. For hand-constructed TurboQuant storage carrying a negative norm value, kernel and canonical disagreed by a sign flip. The kernel now scans the parsed norms once and falls back to the canonical path (`Ok(None)`) when any value is strictly negative. Well-formed storage (encode-produced norms via `L2Norm = sqrt(sum_sq) >= 0`) is unaffected; the cost is one cache-warm pass over a buffer that was already materialized for the return value. - Pin `non-finite` error message in `encode_rejects_nan_input` test (encode_decode.rs) so a regression that fails encode for an unrelated reason cannot silently pass; matches the sibling `encode_rejects_non_finite_norms` test's pattern. - Parameterize `l2_norm_over_tq_decode_returns_stored_norms` over `PType::{F16, F32, F64}` (kernels.rs) so the kernel's per-ptype buffer-handle plumbing is exercised at non-f32 ptypes too. Closes the scope-drift gap between the cycle-0 commit message's claimed full ptype matrix and the test that actually shipped. Also replace em dashes added in cycle 1's `is_fallible` rationale comment with comma/period punctuation, per the user's standing no-em-dashes feedback. Cycle-2 should-fix items intentionally deferred: - masked-kernel test asserting numeric values (not just validity bitmap) - direct unit pin of `TQEncode::is_fallible == true` - shared-helper extraction between `parse_storage` and `parse_storage_norms_only` - chunked-TurboQuant kernel coverage - constructed regression for the degenerate-direction decode branch - parity test widened to dim 128 / 257 Cycle-2 must-fixes intentionally deferred with rationale in commit body: - per-byte centroid-index range validation in the fast path would re-introduce the O(rows * padded_dim) work the kernel exists to avoid; canonical decode already panics on out-of-range bytes, and the kernel succeeding-without-decoding vs canonical-panicking is at best a wash. - `TQEncode::is_fallible` optimizer-pushdown regression test requires dict-pushdown plumbing we do not have a simple way to construct. Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

Apply 3 fixes from the cycle-3 phase-3 gauntlet review of the cycle-2 fix-commit (8c02fe56d). - Cover `-0.0` in the negative-stored-norm guard (l2_norm.rs): cycle-2 used `*n < T::zero()` which is `false` for `-0.0` per IEEE 754, so a hand-constructed `-0.0` stored norm slipped through and the kernel returned `-0.0` while canonical computed `sqrt(sum_sq) == +0.0`. Switch to `n.is_sign_negative()`, which covers both strictly-negative values and `-0.0`. The comment now documents the IEEE 754 subtlety and the cache-warm `O(rows)` cost of the scan. - Add `l2_norm_over_tq_decode_with_negative_stored_norm_falls_back` (kernels.rs), parameterized over `-5.0` and `-0.0`, that hand-builds a TurboQuant array with a sign-negative stored norm and asserts the result is non-negative and finite (proving the kernel fell back to the canonical path, which always returns `|stored_norm|`). - Add `l2_norm_over_tq_decode_rejects_codes_validity_narrower_than_struct` (kernels.rs) that hand-builds a TurboQuant array whose `codes` child has row validity narrower than the outer struct's, mirroring the existing canonical-path `decode_rejects_child_masks_that_disagree_with_struct_validity` test. Asserts `L2Norm(TQDecode(_))` errors via `parse_storage_norms_only`'s validation, pinning the fast/slow-path validation parity that the cycle-1 fix-commit added. Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

codspeed-hq · 2026-05-22T16:05:55Z

Merging this PR will not alter performance

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 2 improved benchmarks
❌ 1 regressed benchmark
✅ 1248 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	Simulation	`chunked_varbinview_into_canonical[(100, 100)]`	358.3 µs	323.1 µs	+10.88%
⚡	Simulation	`chunked_varbinview_into_canonical[(1000, 10)]`	211.7 µs	176 µs	+20.28%
❌	Simulation	`chunked_varbinview_opt_canonical_into[(1000, 10)]`	187.6 µs	224.8 µs	-16.58%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing ct/tq-l2-norm (83901ac) with develop (90dea93)}

connortsui20 added 5 commits May 22, 2026 11:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TurboQuant: correct L2 Norm#8067

TurboQuant: correct L2 Norm#8067
connortsui20 wants to merge 5 commits into
developfrom
ct/tq-l2-norm

connortsui20 commented May 22, 2026

Uh oh!

codspeed-hq Bot commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

connortsui20 commented May 22, 2026

Summary

Testing

Uh oh!

codspeed-hq Bot commented May 22, 2026

Merging this PR will not alter performance

Performance Changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant