
perf: hoist _tiecorrect out of per-group loop in vs-rest Wilcoxon#4090

Merged
ilan-gold merged 9 commits into scverse:main from zboldyga:perf-hoist-tiecorrect-vs-rest
Apr 28, 2026

Conversation

@zboldyga
Contributor

@zboldyga zboldyga commented Apr 25, 2026

Hey there!

This is a tiny change with a massive performance improvement for the vs_rest case in wilcoxon DE when tie correction is enabled.

Here are the before/after performance benchmarks on my machine (M4 Max, 16 threads) on some popular perturb-seq datasets.

| Dataset | Cells | Genes | Perts | Before | After | Speedup |
|---|---|---|---|---|---|---|
| wessels23 | 30,636 | 16,775 | 183 | 260.9 s | 25.5 s | 10.2× |
| tian21crispri | 32,300 | 24,606 | 184 | 399.7 s | 39.7 s | 10.1× |
| adamson16 | 65,286 | 20,616 | 88 | 412.5 s | 48.8 s | 8.5× |
| sunshine23 | 54,576 | 23,570 | 193 | 690.1 s | 56.5 s | 12.2× |
| frangieh21 | 57,419 | 23,287 | 232 | 916.8 s | 71.8 s | 12.8× |
  • Before: scanpy 1.12.0 (verified equivalent to current main without this patch when n_jobs is matched).
  • After: current main with this patch applied.

It's a simple premise -- for a given chunk, _tiecorrect(ranks) is repeatedly computed per-group (e.g. perturbation) but is identical each time in this case (the pool of cells is always the same in vs_rest mode). Simply moving it out of the per-group loop substantially reduces the work (once per chunk, instead of once per perturbation in the case of perturb-seq).
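Schematically, the change looks like this (a minimal NumPy sketch, not scanpy's actual code; `scipy.stats.tiecorrect` stands in here for scanpy's private `_tiecorrect`, and all shapes are made up):

```python
import numpy as np
from scipy.stats import tiecorrect  # public stand-in for scanpy's private _tiecorrect

rng = np.random.default_rng(0)
n_cells, chunk_genes, n_groups = 200, 8, 5
# Ranks of all cells within the current gene chunk; in vs-rest mode this
# rank pool is the same for every group
ranks = rng.integers(1, 40, size=(n_cells, chunk_genes)).astype(float)

# Before: tie correction recomputed for every group, though the input never changes
before = np.array(
    [[tiecorrect(ranks[:, j]) for j in range(chunk_genes)] for _ in range(n_groups)]
)

# After: compute once per chunk, then broadcast to the per-group output array
once = np.array([tiecorrect(ranks[:, j]) for j in range(chunk_genes)])
after = np.broadcast_to(once, (n_groups, chunk_genes))

assert np.allclose(before, after)  # numerically identical, ~G× less work per chunk
```

The `G×` saving is exactly the number of groups per chunk, which is why perturb-seq (hundreds of perturbations) benefits so much.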

Tests pass for me locally.

This came out of running substantial benchmarks for a perturb-seq paper. I have a few more small but impactful suggestions for wilcoxon DE speedups. One builds on this one (I'll defer that to a discussion referencing this PR). Another involves sparsity (I'll dig through the issues/PRs before raising it, as I assume others have thought about this). In total, my local benchmarks with these changes show 100-1000× speedups on perturb-seq datasets :)

  • Closes #
  • [x] Tests included or not required because: existing tests cover correctness of wilcoxon DE with tie_correct=True
  • Release notes not necessary because: no change in output or API (happy to add a release note as you see fit)

flying-sheep and others added 6 commits April 23, 2026 13:46
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Philipp A. <flying-sheep@web.de>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Philipp A. <flying-sheep@web.de>
Co-authored-by: Copilot <copilot@github.com>
The tie correction coefficient is a property of the shared ranks matrix
in vs-rest mode; it does not change across perturbation groups. Previous
code computed it once per (chunk, group), duplicating the work G times
per chunk. Compute it once per chunk and broadcast to the per-group
output array. Numerically identical; no API change.
@codecov

codecov Bot commented Apr 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.22%. Comparing base (6702787) to head (19cf5e6).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4090   +/-   ##
=======================================
  Coverage   79.22%   79.22%           
=======================================
  Files         119      119           
  Lines       12667    12667           
=======================================
+ Hits        10035    10036    +1     
+ Misses       2632     2631    -1     
| Flag | Coverage Δ |
|---|---|
| hatch-test.low-vers | 78.49% <100.00%> (-0.03%) ⬇️ |
| hatch-test.pre | 79.10% <100.00%> (ø) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| src/scanpy/tools/_rank_genes_groups.py | 92.85% <100.00%> (ø) |

... and 1 file with indirect coverage changes

@zboldyga zboldyga marked this pull request as draft April 25, 2026 02:59
@zboldyga zboldyga marked this pull request as ready for review April 25, 2026 02:59
@ilan-gold ilan-gold added this to the 1.12.2 milestone Apr 27, 2026
@ilan-gold ilan-gold enabled auto-merge (squash) April 27, 2026 12:18
@ilan-gold
Contributor

@zboldyga You should have a look at illico which we are considering bringing into our code base. It certainly would be interesting if we could get similar performance without adding a new backend.

Looking forward to seeing what you've got!

@zboldyga
Contributor Author

zboldyga commented Apr 27, 2026

@ilan-gold thanks! I noticed this in the PRs but wasn't sure of the plan. I went ahead and ran a benchmark comparing the various implementations on perturb-seq:

| dataset | scanpy 1.12.0 | this_scanpy_patch | pdex | zachs_full_scanpy_patch | illico |
|---|---|---|---|---|---|
| wessels23 | 260.85 s | 25.49 s | 214.49 s | 1.11 s | 0.34 s |
| tian21crispri | 399.74 s | 39.67 s | 331.43 s | 1.73 s | 0.88 s |
| adamson16 | 412.50 s | 48.79 s | 291.54 s | 3.05 s | 0.94 s |
| sunshine23 | 690.14 s | 56.54 s | 453.26 s | 2.10 s | 0.67 s |
| frangieh21 | 916.84 s | 71.75 s | 778.50 s | 2.99 s | 0.79 s |

So the full patch I have gets you most of the way to illico, without really doing much tbh. The other big issue the full patch fixes is avoiding densification, because perturb-seq datasets are very sparse (that could be detected and flipped on automatically when there's enough sparsity to benefit, since datasets vary, plus a way to set the strategy manually).

Using sparsity also significantly extends the dataset sizes that work practically in scanpy. The current chunking approach falls apart at cell counts above ~10M: the chunk size grows with cell count, and past ~5M cells you end up with one gene per chunk. Using sparsity at least significantly raises that limit, or circumvents it altogether (probably identical to illico here).
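As a rough illustration of what such a sparsity switch could look like (a hypothetical heuristic, not an agreed design; the `density` helper and `SPARSE_THRESHOLD` are made-up names):

```python
import numpy as np
from scipy import sparse

def density(X) -> float:
    """Fraction of nonzero entries, for dense arrays or scipy sparse matrices."""
    if sparse.issparse(X):
        return X.nnz / (X.shape[0] * X.shape[1])
    return np.count_nonzero(X) / X.size

# Hypothetical dispatch threshold; real code would also expose a manual override
SPARSE_THRESHOLD = 0.1

# Typical perturb-seq count matrices are well below this density
X = sparse.random(1_000, 200, density=0.05, format="csr", random_state=0)
use_sparse_path = density(X) < SPARSE_THRESHOLD
```

The point is only that the choice can be data-driven with a cheap check, while still letting the user force either path.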

And the third fix is a simple way to specify vs_all_perturbed_except_control. This comparison is common, and you'd want it to hit this faster path without manually filtering the AnnData or obs before running scanpy (conceptually it's the same as vs_all, and doesn't need to repeat the work for every perturbation/group). In my notes I had a simple, elegant way to do this using the group_by column; I can open a PR for that one after this one (as it's only relevant if this is accepted).

I would also add -- there might be a little more performance you could squeeze out of the python implementation with small tweaks -- I could profile that more, especially for larger datasets.

And while speculative, I think there are opportunities to make this faster on big datasets by optimizing memory reads: changing up the algo so reads aren't jumping around, something like this. It would be a bigger conceptual change, and I haven't thought it through for CSR. But it might give you another good gain with big perturb-seq datasets, possibly faster than the rust implementation as it currently stands (for big datasets, where the timing matters most).

@zboldyga
Contributor Author

@ilan-gold let me know how you'd like to proceed! I figured I'd put the other 2 changes in separate PRs, as the sparsity one is a bigger change, and the other more generally adds support for that comparison method (and kind of builds on this PR).

@ilan-gold
Contributor

@zboldyga I think in general I'd rather not duplicate illico in scanpy unless it becomes unmaintained - small perf fixes seem reasonable, but a lot of what you describe sounds like illico if not all of it (sparsity, chunking etc.).

I'd be open to just outright bringing illico into the scanpy codebase like I said on zulip/here (I think), but we've already seen some mismatches on malformed data (which arguably should be rejected anyway). So we're already battle-testing it a bit / using it as a reference (I think rapids-singlecell is poaching the tricks).

I guess the takeaway here is that small, targeted fixes are welcome, but I'm wary of wholesale changes due to duplication / potentially changing results.

Open to further discussion though!

@zboldyga
Contributor Author

zboldyga commented Apr 27, 2026

@ilan-gold yeah, that sounds reasonable. A pattern I see all across the single-cell space is that it's clunky / takes digging around and trial and error to get reasonable performance on a lot of the routine operations (or at least, the things I needed for preprocessing and metrics computation). This especially emerged on larger datasets. In other words, there are lots of ways to make small mistakes that lead to orders-of-magnitude worse performance.

So one strategy I could see working well here is to put all eggs in the illico basket and accelerate getting that to the point where it's the only implementation in scanpy? This way users don't have to know about the illico alternative / don't accidentally use something 100-1000x slower?

I'm happy to help on illico integration if you like?

@ilan-gold
Contributor

@zboldyga Exactly. I think illico will be the scanpy 2.0 default (which is coming within the year), so this problem is hopefully resolved soon.

The best thing would be to see if results match / trying it on real world data you might have. There's nothing blocking it at the moment AFAICT, but more real world testing would be amazing.

If you want to look at #4037 that would also be cool to make sure we have things right there. That will also be a scanpy 2.0 default (we need to bundle all these breaking-type changes because we are semver compliant).

Thanks!

@zboldyga
Contributor Author

zboldyga commented Apr 27, 2026

sweet ok, will hop over there and test, review, etc!

As for this PR, given the 2.0 release won't be till later this year, are you onboard with including this in the milestone you tagged (1.12.2)?

Will also open a PR for the small other change (vs_all_perturbed, excludes control cell group), which I imagine makes sense to have regardless of the underlying implementation (but we could decide relevance in that PR)

@ilan-gold
Contributor

As for this PR, given the 2.0 release won't be till later this year, are you onboard with including this in the milestone you tagged (1.12.2)?

Yeah, this is a pretty small change, no reason not to.

Will also open a PR for the small other change (vs_all_perturbed, excludes control cell group), which I imagine makes sense to have regardless of the underlying implementation (but we could decide relevance in that PR)

Nice yea! Being able to have a full view of what's on offer would be amazing.

@ilan-gold ilan-gold merged commit 20c1425 into scverse:main Apr 28, 2026
14 checks passed
ilan-gold pushed a commit that referenced this pull request Apr 28, 2026
…r-group loop in vs-rest Wilcoxon) (#4094)

Co-authored-by: Zach Boldyga <zboldyga@users.noreply.github.com>
