feat: turboquant encoding for vectors by lwwmanning · Pull Request #7167 · vortex-data/vortex

lwwmanning · 2026-03-25T21:35:19Z

Summary

Lossy quantization for vector data (e.g., embeddings) based on TurboQuant

Closes: #000

Testing

Implement the TurboQuant algorithm (arXiv:2504.19874) as a new lossy encoding for high-dimensional vector data. This supports both the MSE-optimal and inner-product-optimal (Prod) variants at 1-4 bits per coordinate. Key components: - Max-Lloyd centroid computation on Beta(d/2,d/2) distribution - Deterministic random rotation via nalgebra QR decomposition - FastLanes BitPackedArray for index storage - QJL residual correction for unbiased inner product estimation (Prod) The encoding operates on FixedSizeList arrays of floats, which is the storage format for Vector and FixedShapeTensor extension types. Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ntegration Add a CompressorPlugin wrapper that intercepts Vector and FixedShapeTensor extension columns, applies TurboQuant encoding, and recursively compresses the resulting children (norms, codes) via the inner compressor. Expose this via WriteStrategyBuilder::with_vector_quantization(config), which composes with existing encoding modes (default, compact, cuda). TODO: restructure into BtrBlocks canonical_compressor directly (like DateTimeParts) rather than the wrapper CompressorPlugin approach. Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Move TurboQuant compression logic from a standalone CompressorPlugin wrapper into the BtrBlocks canonical compressor, following the same pattern as DateTimeParts. This gives TurboQuant access to the full BtrBlocks recursive compression pipeline for its children (norms, codes, etc.). Changes: - Add `turboquant_config: Option<TurboQuantConfig>` to BtrBlocksCompressor - Add `with_turboquant(config)` to BtrBlocksCompressorBuilder - Add tensor extension detection + compress_turboquant() in the Canonical::Extension arm of canonical_compressor - Update WriteStrategyBuilder::with_vector_quantization to configure BtrBlocks directly instead of wrapping - Remove TurboQuantCompressor wrapper and vortex-layout dep from vortex-turboquant Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add TurboQuant benchmarks to the single_encoding_throughput suite, covering compress and decompress for dim=128 and dim=768 at 2-bit and 4-bit widths. Uses 1000 random N(0,1) vectors per benchmark. Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…nsform Replace the O(d²) dense matrix rotation (previously nalgebra, then faer) with a Structured Random Hadamard Transform (SRHT) that runs in O(d log d). The SRHT applies D₃·H·D₂·H·D₁ where H is the Walsh-Hadamard transform and Dₖ are random diagonal ±1 sign matrices. This eliminates both the nalgebra and faer dependencies — the SRHT is fully self-contained with no external linear algebra library needed. Benchmark results (1000 vectors, mean throughput): | Benchmark | Before (nalgebra) | After (SRHT) | |----------------------------|---------:|----------:| | compress dim128 2-bit | 222 MB/s | 242 MB/s | | compress dim768 2-bit | 32 MB/s | 181 MB/s | | decompress dim128 2-bit | 87 MB/s | 614 MB/s | | decompress dim768 2-bit | 6 MB/s | 458 MB/s | For non-power-of-2 dimensions (e.g., 768), input is zero-padded to the next power of 2 (1024) and all padded coordinates are quantized. Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…tests Replace the loose "normalized MSE < 1.0" check with rigorous tests: - mse_within_theoretical_bound: Verifies per-vector normalized MSE is within 10x the paper's Theorem 1 bound (sqrt(3)*pi/2 / 4^b). Tests across dim={128,256} x bits={1,2,3,4}. - prod_inner_product_bias: Verifies the Prod variant produces approximately unbiased inner products by computing <query, x_hat> vs <query, x> over 500 random pairs and checking mean relative error < 0.3. - mse_decreases_with_bits: Verifies MSE monotonically decreases with increasing bit-width for both Mse and Prod variants. Total: 49 tests (up from 39). Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Hoist per-row allocations (residual, projected) out of encode_prod loop - Use BufferMut<u8> directly for sign_buf instead of Vec + copy - Remove unused num-traits dependency - Remove dead unreachable!() branch (bit_width >= 2 validated at entry) - Fix orphaned doc comment blank line - Generate public-api.lock files for new/modified crates Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Address code review findings: - Tighten SRHT roundtrip test tolerance from 1e-3 to 1e-5 (verified exact to ~4e-7 relative error across dim 32-1024). Consolidate into parameterized rstest covering power-of-2 and non-power-of-2 dims. - Rename `pd` -> `padded_dim` throughout compress.rs and decompress.rs for clarity. - Add early dimension validation (>= 2) in turboquant_encode with clear error message. - Add edge case tests: single-row roundtrip (Mse + Prod), empty array Prod variant, dimension-below-2 rejection. - Tighten norm preservation test to 1e-5 relative tolerance. Total: 59 tests (up from 49). Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ror bounds Add comprehensive crate documentation including: - Theoretical MSE bounds per bit-width from the paper's Theorem 1 - Compression ratio table for common dimensions (256-1536), accounting for power-of-2 padding overhead on non-power-of-2 dims (768, 1536) - Working doctest demonstrating encode usage and size verification Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codspeed-hq · 2026-03-25T21:39:27Z

Merging this PR will not alter performance

✅ 1106 untouched benchmarks
🆕 14 new benchmarks
⏩ 1522 skipped benchmarks¹

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
🆕	Simulation	`turboquant_compress_dim128_2bit`	N/A	15.2 ms	N/A
🆕	Simulation	`turboquant_compress_dim128_4bit`	N/A	16.1 ms	N/A
🆕	Simulation	`turboquant_decompress_dim768_2bit`	N/A	104.6 ms	N/A
🆕	Simulation	`turboquant_decompress_dim128_4bit`	N/A	10.3 ms	N/A
🆕	Simulation	`turboquant_compress_dim768_2bit`	N/A	143 ms	N/A
🆕	Simulation	`turboquant_decompress_dim128_2bit`	N/A	10.3 ms	N/A
🆕	Simulation	`turboquant_decompress_dim1536_2bit`	N/A	226.5 ms	N/A
🆕	Simulation	`turboquant_compress_dim1024_2bit`	N/A	144.4 ms	N/A
🆕	Simulation	`turboquant_decompress_dim1536_4bit`	N/A	226.8 ms	N/A
🆕	Simulation	`turboquant_decompress_dim1024_4bit`	N/A	105.2 ms	N/A
🆕	Simulation	`turboquant_compress_dim1536_4bit`	N/A	319.8 ms	N/A
🆕	Simulation	`turboquant_compress_dim1536_2bit`	N/A	307.1 ms	N/A
🆕	Simulation	`turboquant_compress_dim1024_4bit`	N/A	151.3 ms	N/A
🆕	Simulation	`turboquant_decompress_dim1024_2bit`	N/A	105.1 ms	N/A

_{Comparing claude/admiring-lichterman (de598ad) with develop (fa6931a)}

1522 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

Extend bit_width range from 1-4 to 1-8. At 8 bits (256 centroids), codes are stored as raw u8 instead of bit-packed since BitPackedArray doesn't support width >= 8. This gives ~4x compression from f32 with near-lossless quality (MSE bound 4.15e-05). Changes: - Update all validation sites (compress, array, centroids) to accept 1-8 bits (MSE) and 2-8 bits (Prod) - Skip bitpack_encode for 8-bit codes, store PrimitiveArray<u8> directly - Extend crate docs with full 1-8 bit bound/ratio tables - Add 6-bit and 8-bit test cases for roundtrip, MSE bounds, Prod bias, and monotonic MSE decrease. High bit-width tests verify MSE < 4-bit MSE and MSE < 1% (since the theoretical bound becomes unrealistically tight at 5+ bits due to SRHT finite-dimension effects) - Regenerate public-api.lock Total: 69 unit tests + 1 doctest. Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Allow Prod variant bit_width up to 9, where the MSE component uses 8-bit codes (raw u8) plus 1-bit QJL correction. The 8-bit MSE codes can be fed directly into int8 GEMM kernels on tensor cores without unpacking. - Update Prod validation to 2-9, MSE remains 1-8 - Restructure top-level validation into per-variant match - Add 9-bit roundtrip, inner product bias, and monotonicity tests - Document tensor core use case in crate docs Total: 71 unit tests + 1 doctest. Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Expand TurboQuant throughput benchmarks to cover common embedding dimensions: - dim=128 (2-bit, 4-bit) — small embeddings - dim=768 (2-bit) — BERT / sentence-transformers - dim=1024 (2-bit, 4-bit) — larger embedding models - dim=1536 (2-bit, 4-bit) — OpenAI ada-002, exercises non-power-of-2 padding overhead All benchmarks use i.i.d. N(0,1) vectors with fixed seed — a conservative worst-case for TurboQuant since real neural embeddings have structure that the SRHT exploits for better quantization. Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add methods to persist and restore SRHT rotation signs as BoolArray, eliminating the need to regenerate from seed during decompression: - `export_inverse_signs_bool_array()`: Exports 3 × padded_dim sign bits as a single BoolArray in inverse-application order [D₃|D₂|D₁] so decompression iterates sequentially. - `from_bool_array(signs, dim)`: Reconstructs RotationMatrix from stored signs without needing the seed. - `apply_inverse_srht_from_bits(buf, signs_bytes, padded_dim, norm_factor)`: Hot-path free function that applies inverse SRHT directly from raw sign bytes, avoiding intermediate Vec<f32> reconstruction. Convention: bit=1 means +1, bit=0 means -1 (negate). Tests verify: - Export→import roundtrip produces identical rotation (3 dims) - Hot-path function matches struct-based inverse_rotate exactly Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add two new cascading array types that replace the monolithic TurboQuantArray: TurboQuantMSEArray (4 children): - codes (BitPackedArray or PrimitiveArray<u8>) - norms (PrimitiveArray<f32>) - centroids (PrimitiveArray<f32>, stored codebook) - rotation_signs (BoolArray, 3 * padded_dim bits, inverse order) TurboQuantQJLArray (4 children): - mse_inner (TurboQuantMSEArray at bit_width - 1) - qjl_signs (BoolArray, num_rows * padded_dim) - residual_norms (PrimitiveArray<f32>) - rotation_signs (BoolArray, QJL rotation, inverse order) Both store all precomputed data (centroids, rotation signs) as children to eliminate recomputation during decompression. Validity is pushed down to the codes child via ValidityVTableFromChild at each level. Includes decompression implementations for both new types that use stored centroids/signs and the hot-path apply_inverse_srht_from_bits. The old TurboQuantArray and its decode paths are retained for now. Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add `turboquant_encode_mse()` and `turboquant_encode_qjl()` that produce the new cascaded array types: - turboquant_encode_mse: produces TurboQuantMSEArray with stored centroids (PrimitiveArray<f32>) and rotation signs (BoolArray) - turboquant_encode_qjl: produces TurboQuantQJLArray wrapping an inner TurboQuantMSEArray at bit_width-1, with QJL signs (BoolArray) and QJL rotation signs (BoolArray) Tests verify: - Roundtrip encode/decode for both new types at various dims/bit_widths - New MSE path matches legacy path exactly (bit-for-bit) - Edge cases: empty arrays and single-row arrays for both types Total: 90 unit tests + 1 doctest. Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update the BtrBlocks TurboQuant compressor to produce the new cascaded TurboQuantQJLArray(TurboQuantMSEArray) structure. The compressor no longer manually compresses each child — it produces the TurboQuant array and lets the layout writer's recursive descent handle child compression naturally. This removes the explicit per-child compress_canonical calls and the BtrBlocksCompressor self-reference, making the compressor stateless. Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds public API entries for TurboQuantMSE, TurboQuantMSEArray, TurboQuantQJL, TurboQuantQJLArray, and the new encode functions. Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ead code Restructure the turboquant crate to follow the fastlanes encoding pattern where each encoding type gets its own subdirectory with array/ and vtable/ subdirectories: mse/ mod.rs — marker struct + re-exports array/mod.rs — TurboQuantMSEArray struct + accessors vtable/mod.rs — VTable + ValidityChild impls qjl/ mod.rs — marker struct + re-exports array/mod.rs — TurboQuantQJLArray struct + accessors vtable/mod.rs — VTable + ValidityChild impls Delete all dead code: - Remove old monolithic array.rs (TurboQuantArray, TurboQuantVariant) - Remove old mse_array.rs, qjl_array.rs flat files - Remove old rules.rs - Remove legacy decode functions from decompress.rs - Remove TurboQuantVariant from TurboQuantConfig (now just bit_width + seed) Update all consumers: - BtrBlocks compressor (already using new API) - Benchmarks: turboquant_encode → turboquant_encode_mse - lib.rs: use glob re-exports (pub use mse::*, pub use qjl::*) - Docstring example updated for new API Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add 8 new tests addressing gaps identified in review: Validation: - qjl_rejects_dimension_below_2: QJL path also rejects dim < 2 Stored metadata verification: - stored_centroids_match_computed: stored codebook == get_centroids() - stored_rotation_signs_produce_correct_decode: stored signs match seed-derived signs bit-for-bit QJL quality: - qjl_mse_within_theoretical_bound: QJL MSE satisfies (b-1)-bit bound (3 parametrized cases: dim 128/256, bits 3-4) - high_bitwidth_qjl_is_small: 8-9 bit QJL < 4-bit QJL and < 1% MSE Also add explanatory comments for: - QJL scale factor derivation (sqrt(π/2)/padded_dim) in decompress.rs - Why QJL uses seed+1 for statistical independence in compress.rs Total: 85 unit tests + 1 doctest. Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ed signs The bit-packed apply_inverse_srht_from_bits path introduced a ~20% decode throughput regression vs the original f32 sign multiply path, because per-element bit extraction + conditional negate is hard for the compiler to autovectorize. Fix: expand the stored BoolArray signs into f32 ±1.0 vectors once at decode start via RotationMatrix::from_bool_array(), then use the original inverse_rotate() with its SIMD-friendly apply_signs() inner loop. The expansion costs 3 × padded_dim × 4 bytes of temporary memory (12KB for dim=1024), amortized over all rows. We still store signs as 1-bit BoolArray on disk (32x space savings), but recover full autovectorized throughput at decode time. The apply_inverse_srht_from_bits function is retained (with tests) for potential future use with explicit SIMD bit-extraction intrinsics. Signed-off-by: Will Manning <will@spiraldb.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…chterman

lwwmanning and others added 9 commits March 25, 2026 15:38

lwwmanning and others added 12 commits March 25, 2026 17:52

Merge remote-tracking branch 'origin/develop' into claude/admiring-li…

de598ad

…chterman

lwwmanning added the changelog/feature A new feature label Mar 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: turboquant encoding for vectors#7167

feat: turboquant encoding for vectors#7167
lwwmanning wants to merge 21 commits intodevelopfrom
claude/admiring-lichterman

lwwmanning commented Mar 25, 2026 •

edited

Loading

Uh oh!

codspeed-hq bot commented Mar 25, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lwwmanning commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

Uh oh!

codspeed-hq bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Performance Changes

Footnotes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

lwwmanning commented Mar 25, 2026 •

edited

Loading

codspeed-hq bot commented Mar 25, 2026 •

edited

Loading