Fix float64 precision loss in sparse Pearson residual kernels#658
Conversation
The CSR and CSC Pearson residual kernels in _cuda/pr/kernels_pr.cuh
divided by `sqrtf`, the single-precision square-root intrinsic. Because
both kernels are templated on the element type `T`, a `T=double`
instantiation silently narrowed the variance term
`mu + mu * mu * inv_theta` to float32, evaluated the square root at
single precision, and promoted the result back to double. The float64
path of `pp.normalize_pearson_residuals` (and `pp.highly_variable_genes`
with `flavor='pearson_residuals'`) was therefore capped at ~7
significant digits regardless of the requested dtype. The dense kernel
`dense_norm_res_kernel` already used the overloaded `sqrt` and was
unaffected.
Replace `sqrtf` with the overloaded `sqrt` on both sparse paths. `sqrt`
dispatches to the single-precision root for `T=float` and the
double-precision root for `T=double`, so the float32 path is
byte-identical to before and only the float64 path changes.
Hardware verification (NVIDIA H100 80GB HBM3, CUDA 12.6, sm_90):
A standalone harness compiled the real `sparse_norm_res_csr_kernel`
verbatim and ran it on a 4000x4000 synthetic CSR count matrix against a
host float64 reference.
T=double, before fix: max relative error 8.83e-08 (~7.1 digits)
T=double, after fix: max relative error 3.97e-16 (~15.4 digits)
T=float, before/after: bit-identical (max abs diff 0.0)
The float64 path is now ~8 orders of magnitude more accurate; the
float32 path is provably unchanged.
Add `test_normalize_pearson_residuals_float64_precision` to
tests/test_normalization.py. It pins the float64 CSR/CSC output to an
analytic float64 reference at rtol/atol 1e-9 -- tight enough to fail on
a single-precision result and pass on a genuine float64 one -- across
theta in {100, inf}.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #658 +/- ##
==========================================
+ Coverage 88.13% 88.16% +0.02%
==========================================
Files 96 96
Lines 7045 7045
==========================================
+ Hits 6209 6211 +2
+ Misses 836 834 -2 |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
✅ Actions performedReview triggered.
|
|
Thanks for tracking this down. I tested it locally and want to fold in one tweak before merging. So I replaced with One small correction Thank again for the PR and the great find |
|
NP! Happy to help! |
Summary
The CSR and CSC Pearson residual kernels in
src/rapids_singlecell/_cuda/pr/kernels_pr.cuhdivide bysqrtf, the single-precision square-root intrinsic. Both kernels are templated on the element typeT, so aT=doubleinstantiation silently narrows the variance termmu + mu * mu * inv_thetatofloat32, evaluates the square root at single precision, and promotes the result back todouble.As a result, the float64 path of
pp.normalize_pearson_residuals(andpp.highly_variable_geneswithflavor='pearson_residuals') was capped at ~7 significant digits regardless of the requested dtype. The dense kerneldense_norm_res_kernelalready uses the overloadedsqrtand was unaffected.The fix replaces
sqrtfwith the overloadedsqrton both sparse paths.sqrtdispatches to the single-precision root forT=floatand the double-precision root forT=double, so the float32 path is byte-identical to before and only the float64 path changes.Verification
A standalone harness compiled the real
sparse_norm_res_csr_kernelverbatim and ran it on a 4000_4000 synthetic CSR count matrix against a host float64 reference (NVIDIA H100 80GB HBM3, CUDA 12.6,sm_90):double8.83e-08(~7.1 digits)3.97e-16(~15.4 digits)float0.0)The float64 path is now ~8 orders of magnitude more accurate; the float32 path is provably unchanged.
Changes
_cuda/pr/kernels_pr.cuh_sqrtf_sqrtinsparse_norm_res_csc_kernelandsparse_norm_res_csr_kernel.tests/test_normalization.py_ addstest_normalize_pearson_residuals_float64_precision, which pins the float64 CSR/CSC output to an analytic float64 reference atrtol/atol1e-9(tight enough to fail on a single-precision result, pass on a genuine float64 one) acrossthetain{100, inf}.docs/release-notes/0.15.1.md_ bug-fix entry.