Skip to content

TurboQuant: correct L2 Norm#8067

Draft
connortsui20 wants to merge 5 commits into
developfrom
ct/tq-l2-norm
Draft

TurboQuant: correct L2 Norm#8067
connortsui20 wants to merge 5 commits into
developfrom
ct/tq-l2-norm

Conversation

@connortsui20
Copy link
Copy Markdown
Contributor

Summary

Tracking issue: #7830

TODO

Testing

Basic unit tests.

Doc-comment pass over the parts of vortex-turboquant that don't depend on the in-flight
inv_direction_norms work: SplitMix64, the vector module, DecodeInputs's pre-existing fields,
TurboQuant + TurboQuantMetadata + their fields, the TurboQuantMetadataProto wire-format note,
validate_tq_metadata, validate_tq_storage_dtype, and the TurboQuantParsedStorage struct +
existing fields. No semantic change.

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Register a session-scoped execute-parent kernel that intercepts
`L2Norm(TQDecode(_))` and returns the stored `norms` child directly,
skipping the inverse SORF, dimension truncation, and per-row rescaling
that `TQDecode` would otherwise run.

Make decode norm-preserving in flight: after the inverse SORF, compute
the L2 norm of the decoded-direction prefix, take its reciprocal
(guarded by `is_normal` so denormal cancellation produces an all-zero
row), and multiply through alongside the stored row norm. This keeps
the kernel's fast path equal to the canonical slow path within
floating-point precision: `|TQDecode(x)| == stored_norm[x]` by
construction, with no extra storage child.

Reject non-finite input L2 norms at encode so a finite row whose
squared sum overflows f32 (or contains a `NaN`) cannot produce a
`+inf` stored norm that would silently disagree with the in-flight
correction at decode time.

Add regression coverage for the kernel (bit-exact correctness,
null/masked validity, parent-nullability coercion, cross-check against
canonical at dim 128/129/257), the encode finite-norm guard, the
decode L2-norm preservation invariant across (f16, f32, f64) x (128,
129, 257) dimensions, and the file-roundtrip invariant. Collapse the
test-session helpers down to a single `test_session` that initializes
both `vortex_tensor` and `vortex_turboquant`.

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Apply 3 must-fixes from the cycle-1 phase-3 gauntlet review of the
L2Norm-decode readthrough commit.

- parse_storage_norms_only (storage.rs): execute the `codes` FSL wrapper
  and validate its row-level validity against the struct's, so the
  L2Norm(TQDecode(_)) fast path rejects malformed-validity inputs the
  same way the canonical TQDecode path does. The inner `u8` element
  buffer is still skipped, so the per-row decoding work is preserved as
  the fast-path's optimization. Per-byte centroid-index range is not
  checked on either path (kernel does not need the codes, canonical
  path panics on out-of-range bytes which is acceptable per existing
  malformed-input coverage).

- TQEncode::is_fallible (encode.rs): return true. The new non-finite-
  norm rejection in vector::normalize is a data-dependent
  defined-behavior failure, so optimizers that speculatively evaluate
  unreferenced inputs (e.g., dictionary pushdown) must treat encode as
  fallible to avoid surfacing errors on rows the consumer would never
  reach.

- Degenerate-direction decode (decode.rs): when the dequantized
  direction's squared L2 norm is zero or subnormal, emit
  [norm, 0, 0, ..., 0] instead of an all-zero row, so
  L2Norm(decoded_row) == stored_norm[row]. The all-zero fallback
  silently disagreed with the L2Norm(TQDecode(_)) kernel (kernel
  returns stored_norm while canonical returned 0). The kernel's
  fast/slow-path equivalence now holds for degenerate rows too,
  matching the documented `|TQDecode(x)| == stored_norm[x]` invariant.

Cross-reference the is_fallible rationale at the normalize-side check
so both sites are discoverable from each other.

Deferred to follow-up: f16/f64 parameterization of the kernel-fast-path
test (only f32 is currently exercised at the kernel level), a chunked-
TurboQuant kernel test, parity test re-parameterization over dim=128,
and a constructed regression test for the degenerate-direction decode
path. All are should-fix coverage improvements, not behavior bugs.

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Apply 3 must-fixes from the cycle-2 phase-3 gauntlet review of the
cycle-1 fix-commit (7cf7433dd).

- Kernel negative-norms safety (l2_norm.rs): canonical `L2Norm` always
  returns non-negative, but the kernel was returning the stored `norms`
  buffer verbatim. For hand-constructed TurboQuant storage carrying a
  negative norm value, kernel and canonical disagreed by a sign flip.
  The kernel now scans the parsed norms once and falls back to the
  canonical path (`Ok(None)`) when any value is strictly negative.
  Well-formed storage (encode-produced norms via `L2Norm = sqrt(sum_sq)
  >= 0`) is unaffected; the cost is one cache-warm pass over a buffer
  that was already materialized for the return value.

- Pin `non-finite` error message in `encode_rejects_nan_input` test
  (encode_decode.rs) so a regression that fails encode for an unrelated
  reason cannot silently pass; matches the sibling
  `encode_rejects_non_finite_norms` test's pattern.

- Parameterize `l2_norm_over_tq_decode_returns_stored_norms` over
  `PType::{F16, F32, F64}` (kernels.rs) so the kernel's per-ptype
  buffer-handle plumbing is exercised at non-f32 ptypes too. Closes
  the scope-drift gap between the cycle-0 commit message's claimed
  full ptype matrix and the test that actually shipped.

Also replace em dashes added in cycle 1's `is_fallible` rationale
comment with comma/period punctuation, per the user's standing
no-em-dashes feedback.

Cycle-2 should-fix items intentionally deferred:
- masked-kernel test asserting numeric values (not just validity bitmap)
- direct unit pin of `TQEncode::is_fallible == true`
- shared-helper extraction between `parse_storage` and
  `parse_storage_norms_only`
- chunked-TurboQuant kernel coverage
- constructed regression for the degenerate-direction decode branch
- parity test widened to dim 128 / 257

Cycle-2 must-fixes intentionally deferred with rationale in commit body:
- per-byte centroid-index range validation in the fast path would
  re-introduce the O(rows * padded_dim) work the kernel exists to
  avoid; canonical decode already panics on out-of-range bytes, and
  the kernel succeeding-without-decoding vs canonical-panicking is at
  best a wash.
- `TQEncode::is_fallible` optimizer-pushdown regression test requires
  dict-pushdown plumbing we do not have a simple way to construct.

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Apply 3 fixes from the cycle-3 phase-3 gauntlet review of the cycle-2
fix-commit (8c02fe56d).

- Cover `-0.0` in the negative-stored-norm guard (l2_norm.rs): cycle-2
  used `*n < T::zero()` which is `false` for `-0.0` per IEEE 754, so a
  hand-constructed `-0.0` stored norm slipped through and the kernel
  returned `-0.0` while canonical computed `sqrt(sum_sq) == +0.0`.
  Switch to `n.is_sign_negative()`, which covers both strictly-negative
  values and `-0.0`. The comment now documents the IEEE 754 subtlety
  and the cache-warm `O(rows)` cost of the scan.

- Add `l2_norm_over_tq_decode_with_negative_stored_norm_falls_back`
  (kernels.rs), parameterized over `-5.0` and `-0.0`, that hand-builds
  a TurboQuant array with a sign-negative stored norm and asserts the
  result is non-negative and finite (proving the kernel fell back to
  the canonical path, which always returns `|stored_norm|`).

- Add `l2_norm_over_tq_decode_rejects_codes_validity_narrower_than_struct`
  (kernels.rs) that hand-builds a TurboQuant array whose `codes` child
  has row validity narrower than the outer struct's, mirroring the
  existing canonical-path
  `decode_rejects_child_masks_that_disagree_with_struct_validity` test.
  Asserts `L2Norm(TQDecode(_))` errors via `parse_storage_norms_only`'s
  validation, pinning the fast/slow-path validation parity that the
  cycle-1 fix-commit added.

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 22, 2026

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 2 improved benchmarks
❌ 1 regressed benchmark
✅ 1248 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation chunked_varbinview_into_canonical[(100, 100)] 358.3 µs 323.1 µs +10.88%
Simulation chunked_varbinview_into_canonical[(1000, 10)] 211.7 µs 176 µs +20.28%
Simulation chunked_varbinview_opt_canonical_into[(1000, 10)] 187.6 µs 224.8 µs -16.58%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing ct/tq-l2-norm (83901ac) with develop (90dea93)

Open in CodSpeed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant