Fix cosine similarity optimization bug #7724

Merged
connortsui20 merged 1 commit into develop from ct/fix-cosine-denorm-opt on Apr 29, 2026

@connortsui20
Contributor

Summary

Tracking issue: #7297

CosineSimilarity fast paths for L2Denorm used only the decoded normalized children, so lossy encodings could return a non-zero cosine for rows whose actual stored norm was 0.

This fix makes those fast paths also execute the stored norm children. They now return 0 when a stored L2Denorm norm is zero or, in the one-denorm case, when the plain-side norm is zero, while preserving the existing validity mask.
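
A rough sketch of the guard described above, for illustration only. It is not the code from this PR: the function names, the `f64` slices, and the per-row layout are all assumptions, and the validity mask is not modeled here.

```rust
/// Hypothetical both-denorm fast path. `dots[i]` is the dot product of the
/// decoded normalized rows; `norms_l` / `norms_r` are the stored norms.
/// Rows whose stored norm is zero on either side yield 0 instead of `dots[i]`.
fn cosine_both_denorm(dots: &[f64], norms_l: &[f64], norms_r: &[f64]) -> Vec<f64> {
    assert_eq!(dots.len(), norms_l.len());
    assert_eq!(dots.len(), norms_r.len());
    dots.iter()
        .zip(norms_l.iter().zip(norms_r))
        .map(|(&dot, (&nl, &nr))| if nl == 0.0 || nr == 0.0 { 0.0 } else { dot })
        .collect()
}

/// Hypothetical one-denorm fast path for a single row: one side is stored as
/// (normalized vector, norm), the other is a plain vector whose norm is
/// computed here. A zero norm on either side yields 0.
fn cosine_one_denorm_row(normalized: &[f64], stored_norm: f64, plain: &[f64]) -> f64 {
    assert_eq!(normalized.len(), plain.len());
    let plain_norm = plain.iter().map(|x| x * x).sum::<f64>().sqrt();
    if stored_norm == 0.0 || plain_norm == 0.0 {
        return 0.0;
    }
    let dot: f64 = normalized.iter().zip(plain).map(|(a, b)| a * b).sum();
    dot / plain_norm
}
```

The point of the guard is that a lossy encoding may decode the normalized child of a zero row to something slightly non-zero, so the stored norm is what the sketch consults to decide that the row is zero.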

Testing

Some regression tests.

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
connortsui20 added the changelog/fix (A bug fix) label on Apr 29, 2026
connortsui20 enabled auto-merge (squash) on April 29, 2026 21:00
T::zero()
} else {
dots[i]
}
Contributor

definitely premature optimization, but I wonder if rustc is able to see that this can be written in a branchless way.

let either_is_zero = norms_l[i] == T::zero() || norms_r[i] == T::zero();
T::from(!either_is_zero) * dots[i]
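
One thing to watch with the multiply form: it is only equivalent to the branch when `dots[i]` is finite. A minimal sketch on plain `f32`, with hypothetical function names and a `bool as u8 as f32` cast standing in for the generic `T::from`:

```rust
/// Branching form: returns exactly 0.0 whenever either norm is zero.
fn masked_branchy(dot: f32, norm_l: f32, norm_r: f32) -> f32 {
    if norm_l == 0.0 || norm_r == 0.0 {
        0.0
    } else {
        dot
    }
}

/// Multiply-by-mask form: the mask is 1.0 when both norms are non-zero and
/// 0.0 otherwise. `|` is used instead of `||` to avoid short-circuiting.
fn masked_branchless(dot: f32, norm_l: f32, norm_r: f32) -> f32 {
    let either_is_zero = (norm_l == 0.0) | (norm_r == 0.0);
    let mask = (!either_is_zero) as u8 as f32;
    mask * dot
}
```

For finite inputs the two agree, up to the sign of zero (`0.0 * -0.5` is `-0.0`). If a decoded dot product could ever be NaN or infinite on a zero-norm row, the multiply form would return NaN where the branch returns 0.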

connortsui20 merged commit 260badd into develop on Apr 29, 2026
65 of 67 checks passed
connortsui20 deleted the ct/fix-cosine-denorm-opt branch on April 29, 2026 21:06
@codspeed-hq

codspeed-hq Bot commented Apr 29, 2026

Merging this PR will degrade performance by 42.11%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

⚡ 1 improved benchmark
❌ 4 regressed benchmarks
✅ 1184 untouched benchmarks
⏩ 9 skipped benchmarks (see footnote 1)

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| WallTime | dynamic_dispatch_u32[10M] | 162.8 µs | 105.5 µs | +54.37% |
| WallTime | for[10M_u8] | 73.8 µs | 127.4 µs | -42.11% |
| WallTime | for[10M_u16] | 95.5 µs | 157.3 µs | -39.33% |
| WallTime | mix[0%_in/100%_out] | 227.3 µs | 278.5 µs | -18.39% |
| Simulation | new_bp_prim_test_between[i64, 32768] | 176.4 µs | 235.3 µs | -25.04% |
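
For reading the Efficiency column: the reported values look consistent with `BASE / HEAD - 1`, so a slower HEAD gives a negative percentage. This is inferred from the numbers in the table, not taken from CodSpeed documentation, and the helper name below is hypothetical.

```rust
/// Assumed formula for the Efficiency column: relative change of BASE over
/// HEAD, expressed in percent. Inputs are timings in the same unit.
fn efficiency_pct(base_us: f64, head_us: f64) -> f64 {
    (base_us / head_us - 1.0) * 100.0
}
```

Recomputing from the rounded timings shown in the table agrees with the reported figures to within about 0.1 percentage points; the small gap presumably comes from the timings being rounded for display.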

Comparing ct/fix-cosine-denorm-opt (3a54a19) with develop (0bb712b)

Footnotes

  1. 9 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them in CodSpeed to remove them from the performance reports.
