TurboQuant: correct L2 Norm#8067
Draft
connortsui20 wants to merge 5 commits into
Draft
Conversation
Doc-comment pass over the parts of vortex-turboquant that don't depend on the in-flight inv_direction_norms work: SplitMix64, the vector module, DecodeInputs's pre-existing fields, TurboQuant + TurboQuantMetadata + their fields, the TurboQuantMetadataProto wire-format note, validate_tq_metadata, validate_tq_storage_dtype, and the TurboQuantParsedStorage struct + existing fields. No semantic change. Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Register a session-scoped execute-parent kernel that intercepts `L2Norm(TQDecode(_))` and returns the stored `norms` child directly, skipping the inverse SORF, dimension truncation, and per-row rescaling that `TQDecode` would otherwise run. Make decode norm-preserving in flight: after the inverse SORF, compute the L2 norm of the decoded-direction prefix, take its reciprocal (guarded by `is_normal` so denormal cancellation produces an all-zero row), and multiply through alongside the stored row norm. This keeps the kernel's fast path equal to the canonical slow path within floating-point precision: `|TQDecode(x)| == stored_norm[x]` by construction, with no extra storage child. Reject non-finite input L2 norms at encode so a finite row whose squared sum overflows f32 (or contains a `NaN`) cannot produce a `+inf` stored norm that would silently disagree with the in-flight correction at decode time. Add regression coverage for the kernel (bit-exact correctness, null/masked validity, parent-nullability coercion, cross-check against canonical at dim 128/129/257), the encode finite-norm guard, the decode L2-norm preservation invariant across (f16, f32, f64) x (128, 129, 257) dimensions, and the file-roundtrip invariant. Collapse the test-session helpers down to a single `test_session` that initializes both `vortex_tensor` and `vortex_turboquant`. Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Apply 3 must-fixes from the cycle-1 phase-3 gauntlet review of the L2Norm-decode readthrough commit. - parse_storage_norms_only (storage.rs): execute the `codes` FSL wrapper and validate its row-level validity against the struct's, so the L2Norm(TQDecode(_)) fast path rejects malformed-validity inputs the same way the canonical TQDecode path does. The inner `u8` element buffer is still skipped, so the per-row decoding work is preserved as the fast-path's optimization. Per-byte centroid-index range is not checked on either path (kernel does not need the codes, canonical path panics on out-of-range bytes which is acceptable per existing malformed-input coverage). - TQEncode::is_fallible (encode.rs): return true. The new non-finite- norm rejection in vector::normalize is a data-dependent defined-behavior failure, so optimizers that speculatively evaluate unreferenced inputs (e.g., dictionary pushdown) must treat encode as fallible to avoid surfacing errors on rows the consumer would never reach. - Degenerate-direction decode (decode.rs): when the dequantized direction's squared L2 norm is zero or subnormal, emit [norm, 0, 0, ..., 0] instead of an all-zero row, so L2Norm(decoded_row) == stored_norm[row]. The all-zero fallback silently disagreed with the L2Norm(TQDecode(_)) kernel (kernel returns stored_norm while canonical returned 0). The kernel's fast/slow-path equivalence now holds for degenerate rows too, matching the documented `|TQDecode(x)| == stored_norm[x]` invariant. Cross-reference the is_fallible rationale at the normalize-side check so both sites are discoverable from each other. Deferred to follow-up: f16/f64 parameterization of the kernel-fast-path test (only f32 is currently exercised at the kernel level), a chunked- TurboQuant kernel test, parity test re-parameterization over dim=128, and a constructed regression test for the degenerate-direction decode path. All are should-fix coverage improvements, not behavior bugs. Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Apply 3 must-fixes from the cycle-2 phase-3 gauntlet review of the
cycle-1 fix-commit (7cf7433dd).
- Kernel negative-norms safety (l2_norm.rs): canonical `L2Norm` always
returns non-negative, but the kernel was returning the stored `norms`
buffer verbatim. For hand-constructed TurboQuant storage carrying a
negative norm value, kernel and canonical disagreed by a sign flip.
The kernel now scans the parsed norms once and falls back to the
canonical path (`Ok(None)`) when any value is strictly negative.
Well-formed storage (encode-produced norms via `L2Norm = sqrt(sum_sq)
>= 0`) is unaffected; the cost is one cache-warm pass over a buffer
that was already materialized for the return value.
- Pin `non-finite` error message in `encode_rejects_nan_input` test
(encode_decode.rs) so a regression that fails encode for an unrelated
reason cannot silently pass; matches the sibling
`encode_rejects_non_finite_norms` test's pattern.
- Parameterize `l2_norm_over_tq_decode_returns_stored_norms` over
`PType::{F16, F32, F64}` (kernels.rs) so the kernel's per-ptype
buffer-handle plumbing is exercised at non-f32 ptypes too. Closes
the scope-drift gap between the cycle-0 commit message's claimed
full ptype matrix and the test that actually shipped.
Also replace em dashes added in cycle 1's `is_fallible` rationale
comment with comma/period punctuation, per the user's standing
no-em-dashes feedback.
Cycle-2 should-fix items intentionally deferred:
- masked-kernel test asserting numeric values (not just validity bitmap)
- direct unit pin of `TQEncode::is_fallible == true`
- shared-helper extraction between `parse_storage` and
`parse_storage_norms_only`
- chunked-TurboQuant kernel coverage
- constructed regression for the degenerate-direction decode branch
- parity test widened to dim 128 / 257
Cycle-2 must-fixes intentionally deferred with rationale in commit body:
- per-byte centroid-index range validation in the fast path would
re-introduce the O(rows * padded_dim) work the kernel exists to
avoid; canonical decode already panics on out-of-range bytes, and
the kernel succeeding-without-decoding vs canonical-panicking is at
best a wash.
- `TQEncode::is_fallible` optimizer-pushdown regression test requires
dict-pushdown plumbing we do not have a simple way to construct.
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Apply 3 fixes from the cycle-3 phase-3 gauntlet review of the cycle-2 fix-commit (8c02fe56d). - Cover `-0.0` in the negative-stored-norm guard (l2_norm.rs): cycle-2 used `*n < T::zero()` which is `false` for `-0.0` per IEEE 754, so a hand-constructed `-0.0` stored norm slipped through and the kernel returned `-0.0` while canonical computed `sqrt(sum_sq) == +0.0`. Switch to `n.is_sign_negative()`, which covers both strictly-negative values and `-0.0`. The comment now documents the IEEE 754 subtlety and the cache-warm `O(rows)` cost of the scan. - Add `l2_norm_over_tq_decode_with_negative_stored_norm_falls_back` (kernels.rs), parameterized over `-5.0` and `-0.0`, that hand-builds a TurboQuant array with a sign-negative stored norm and asserts the result is non-negative and finite (proving the kernel fell back to the canonical path, which always returns `|stored_norm|`). - Add `l2_norm_over_tq_decode_rejects_codes_validity_narrower_than_struct` (kernels.rs) that hand-builds a TurboQuant array whose `codes` child has row validity narrower than the outer struct's, mirroring the existing canonical-path `decode_rejects_child_masks_that_disagree_with_struct_validity` test. Asserts `L2Norm(TQDecode(_))` errors via `parse_storage_norms_only`'s validation, pinning the fast/slow-path validation parity that the cycle-1 fix-commit added. Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Merging this PR will not alter performance
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ⚡ | Simulation | chunked_varbinview_into_canonical[(100, 100)] |
358.3 µs | 323.1 µs | +10.88% |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(1000, 10)] |
211.7 µs | 176 µs | +20.28% |
| ❌ | Simulation | chunked_varbinview_opt_canonical_into[(1000, 10)] |
187.6 µs | 224.8 µs | -16.58% |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing ct/tq-l2-norm (83901ac) with develop (90dea93)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Tracking issue: #7830
TODO
Testing
Basic unit tests.