
perf: hoist _tiecorrect out of per-group loop in vs-rest Wilcoxon#4090

Merged
ilan-gold merged 9 commits into scverse:main from zboldyga:perf-hoist-tiecorrect-vs-rest
Apr 28, 2026

Conversation

@zboldyga
Contributor

@zboldyga zboldyga commented Apr 25, 2026

Hey there!

This is a tiny change with a massive performance improvement for the vs_rest case in wilcoxon DE when tie correction is enabled.

Here are the before/after performance benchmarks on my machine (M4 Max, 16 threads) on some popular perturb-seq datasets.

| Dataset | Cells | Genes | Perts | Before | After | Speedup |
|---|---|---|---|---|---|---|
| wessels23 | 30,636 | 16,775 | 183 | 260.9 s | 25.5 s | 10.2× |
| tian21crispri | 32,300 | 24,606 | 184 | 399.7 s | 39.7 s | 10.1× |
| adamson16 | 65,286 | 20,616 | 88 | 412.5 s | 48.8 s | 8.5× |
| sunshine23 | 54,576 | 23,570 | 193 | 690.1 s | 56.5 s | 12.2× |
| frangieh21 | 57,419 | 23,287 | 232 | 916.8 s | 71.8 s | 12.8× |
  • Before: scanpy 1.12.0 (verified equivalent to current main without this patch when n_jobs is matched).
  • After: current main with this patch applied.

It's a simple premise -- for a given chunk, _tiecorrect(ranks) is repeatedly computed per-group (e.g. perturbation) but is identical each time in this case (the pool of cells is always the same in vs_rest mode). Simply moving it out of the per-group loop substantially reduces the work (once per chunk, instead of once per perturbation in the case of perturb-seq).
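Schematically, the change looks like this (a minimal NumPy sketch, not scanpy's actual code; `scipy.stats.tiecorrect` stands in here for scanpy's private `_tiecorrect`, and all shapes are made up):

```python
import numpy as np
from scipy.stats import tiecorrect  # public stand-in for scanpy's private _tiecorrect

rng = np.random.default_rng(0)
n_cells, chunk_genes, n_groups = 200, 8, 5
# Ranks of all cells within the current gene chunk; in vs-rest mode this
# rank pool is the same for every group
ranks = rng.integers(1, 40, size=(n_cells, chunk_genes)).astype(float)

# Before: tie correction recomputed for every group, though the input never changes
before = np.array(
    [[tiecorrect(ranks[:, j]) for j in range(chunk_genes)] for _ in range(n_groups)]
)

# After: compute once per chunk, then broadcast to the per-group output array
once = np.array([tiecorrect(ranks[:, j]) for j in range(chunk_genes)])
after = np.broadcast_to(once, (n_groups, chunk_genes))

assert np.allclose(before, after)  # numerically identical, ~G× less work per chunk
```

The `G×` saving is exactly the number of groups per chunk, which is why perturb-seq (hundreds of perturbations) benefits so much.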

Tests pass for me locally.

This came out of running substantial benchmarks for a perturb-seq paper. I have a few more small but impactful suggestions for wilcoxon DE speedups. One builds on this one (I'll defer that to a discussion referencing this PR). Another involves sparsity (I'll dig through the issues/PRs before raising it, as I assume others have thought about this). In total, my local benchmarks with these changes show 100-1000× speedups on perturb-seq datasets :)

  • Closes #
  • [x] Tests included or not required because: existing tests cover correctness of wilcoxon DE with tie_correct=True
  • Release notes not necessary because: no change in output or API (happy to add a release note as you see fit)

flying-sheep and others added 6 commits April 23, 2026 13:46
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Philipp A. <flying-sheep@web.de>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Philipp A. <flying-sheep@web.de>
Co-authored-by: Copilot <copilot@github.com>
The tie correction coefficient is a property of the shared ranks matrix
in vs-rest mode; it does not change across perturbation groups. Previous
code computed it once per (chunk, group), duplicating the work G times
per chunk. Compute it once per chunk and broadcast to the per-group
output array. Numerically identical; no API change.
@codecov

codecov Bot commented Apr 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.22%. Comparing base (6702787) to head (19cf5e6).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4090   +/-   ##
=======================================
  Coverage   79.22%   79.22%           
=======================================
  Files         119      119           
  Lines       12667    12667           
=======================================
+ Hits        10035    10036    +1     
+ Misses       2632     2631    -1     
| Flag | Coverage Δ |
|---|---|
| hatch-test.low-vers | 78.49% <100.00%> (-0.03%) ⬇️ |
| hatch-test.pre | 79.10% <100.00%> (ø) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| src/scanpy/tools/_rank_genes_groups.py | 92.85% <100.00%> (ø) |

... and 1 file with indirect coverage changes

@zboldyga zboldyga marked this pull request as draft April 25, 2026 02:59
@zboldyga zboldyga marked this pull request as ready for review April 25, 2026 02:59
@ilan-gold ilan-gold added this to the 1.12.2 milestone Apr 27, 2026
@ilan-gold ilan-gold enabled auto-merge (squash) April 27, 2026 12:18
@ilan-gold
Contributor

@zboldyga You should have a look at illico which we are considering bringing into our code base. It certainly would be interesting if we could get similar performance without adding a new backend.

Looking forward to seeing what you've got!

@zboldyga
Contributor Author

zboldyga commented Apr 27, 2026

@ilan-gold thanks! I noticed this in the PRs but wasn't sure of the plan. I went ahead and ran a benchmark comparing the various implementations on perturb-seq:

| dataset | scanpy 1.12.0 | this_scanpy_patch | pdex | zachs_full_scanpy_patch | illico |
|---|---|---|---|---|---|
| wessels23 | 260.85 s | 25.49 s | 214.49 s | 1.11 s | 0.34 s |
| tian21crispri | 399.74 s | 39.67 s | 331.43 s | 1.73 s | 0.88 s |
| adamson16 | 412.50 s | 48.79 s | 291.54 s | 3.05 s | 0.94 s |
| sunshine23 | 690.14 s | 56.54 s | 453.26 s | 2.10 s | 0.67 s |
| frangieh21 | 916.84 s | 71.75 s | 778.50 s | 2.99 s | 0.79 s |

So the full patch I have gets you most of the way to illico, without really doing much tbh. The other big issue the full patch fixes is avoiding densification, because perturb-seq datasets are very sparse (that could be detected and flipped on automatically when there's enough sparsity to benefit, since datasets vary, plus a way to set the strategy manually).

Using sparsity also significantly extends the dataset sizes that work practically in scanpy. The current chunking approach falls apart at cell counts above ~10M: the chunk size grows with cell count, and past ~5M cells you end up with one gene per chunk. Using sparsity at least significantly raises that limit, or circumvents it altogether (probably identical to illico here).
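As a rough illustration of what such a sparsity switch could look like (a hypothetical heuristic, not an agreed design; the `density` helper and `SPARSE_THRESHOLD` are made-up names):

```python
import numpy as np
from scipy import sparse

def density(X) -> float:
    """Fraction of nonzero entries, for dense arrays or scipy sparse matrices."""
    if sparse.issparse(X):
        return X.nnz / (X.shape[0] * X.shape[1])
    return np.count_nonzero(X) / X.size

# Hypothetical dispatch threshold; real code would also expose a manual override
SPARSE_THRESHOLD = 0.1

# Typical perturb-seq count matrices are well below this density
X = sparse.random(1_000, 200, density=0.05, format="csr", random_state=0)
use_sparse_path = density(X) < SPARSE_THRESHOLD
```

The point is only that the choice can be data-driven with a cheap check, while still letting the user force either path.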

And the third fix is a simple way to specify vs_all_perturbed_except_control. This comparison is common, and you'd want it to hit this faster path without manually filtering the AnnData or obs before running scanpy (conceptually it's the same as vs_all, and doesn't need to repeat the work for every perturbation/group). In my notes I had a simple, elegant way to do this using the group_by column; I can open a PR for that one after this one (as it's only relevant if this is accepted).

I would also add -- there might be a little more performance you could squeeze out of the python implementation with small tweaks -- I could profile that more, especially for larger datasets.

And while speculative, I think there are opportunities to make this faster on big datasets by optimizing memory reads: changing up the algo so reads aren't jumping around, something like this. It would be a bigger conceptual change, and I haven't thought it through for CSR. But it might give you another good gain with big perturb-seq datasets, possibly faster than the rust implementation as it currently stands (for big datasets, where the timing matters most).

@zboldyga
Contributor Author

@ilan-gold let me know how you'd like to proceed! I figured I'd put the other 2 changes in separate PRs, as the sparsity one is a bigger change, and the other more generally adds support for that comparison method (and kind of builds on this PR).

@ilan-gold
Contributor

@zboldyga I think in general I'd rather not duplicate illico in scanpy unless it becomes unmaintained - small perf fixes seem reasonable, but a lot of what you describe sounds like illico if not all of it (sparsity, chunking etc.).

I'd be open to just outright bringing illico into the scanpy codebase like I said on zulip/here (I think), but we've already seen some mismatches on malformed data (which arguably should be rejected anyway). So we're already battle-testing it a bit / using it as a reference (I think rapids-singlecell is poaching the tricks).

I guess the takeaway here is that small, targeted fixes are welcome, but I'm wary of wholesale changes due to duplication / potentially changing results.

Open to further discussion though!

@zboldyga
Contributor Author

zboldyga commented Apr 27, 2026

@ilan-gold yeah, that sounds reasonable. A pattern I see all across the single-cell space is that it's clunky / takes digging around and trial and error to get reasonable performance on a lot of the routine operations (or at least, the things I needed for preprocessing and metrics computation). This especially emerged on larger datasets. In other words, there are lots of ways to make small mistakes that lead to orders-of-magnitude worse performance.

So one strategy I could see working well here is to put all eggs in the illico basket and accelerate getting that to the point where it's the only implementation in scanpy? This way users don't have to know about the illico alternative / don't accidentally use something 100-1000x slower?

I'm happy to help on illico integration if you like?

@ilan-gold
Contributor

@zboldyga Exactly. I think illico will be the scanpy 2.0 default (which is coming within the year), so this problem is hopefully resolved soon.

The best thing would be to see if results match / trying it on real world data you might have. There's nothing blocking it at the moment AFAICT, but more real world testing would be amazing.

If you want to look at #4037 that would also be cool to make sure we have things right there. That will also be a scanpy 2.0 default (we need to bundle all these breaking-type changes because we are semver compliant).

Thanks!

@zboldyga
Contributor Author

zboldyga commented Apr 27, 2026

sweet ok, will hop over there and test, review, etc!

As for this PR, given the 2.0 release won't be till later this year, are you onboard with including this in the milestone you tagged (1.12.2)?

Will also open a PR for the small other change (vs_all_perturbed, excludes control cell group), which I imagine makes sense to have regardless of the underlying implementation (but we could decide relevance in that PR)

@ilan-gold
Contributor

As for this PR, given the 2.0 release won't be till later this year, are you onboard with including this in the milestone you tagged (1.12.2)?

Yeah, this is a pretty small change, no reason not to.

Will also open a PR for the small other change (vs_all_perturbed, excludes control cell group), which I imagine makes sense to have regardless of the underlying implementation (but we could decide relevance in that PR)

Nice yea! Being able to have a full view of what's on offer would be amazing.

@ilan-gold ilan-gold merged commit 20c1425 into scverse:main Apr 28, 2026
14 checks passed
ilan-gold pushed a commit that referenced this pull request Apr 28, 2026
…r-group loop in vs-rest Wilcoxon) (#4094)

Co-authored-by: Zach Boldyga <zboldyga@users.noreply.github.com>
