feat: add illico for rank_genes_groups #4038

Draft
ilan-gold wants to merge 27 commits into ig/exp_post_agg from ig/illico

Conversation

@ilan-gold
Contributor

@ilan-gold ilan-gold commented Apr 7, 2026

TODOs:

See: https://github.com/scverse/scanpy/actions/runs/24088645078/job/70268566419?pr=4038


@ilan-gold ilan-gold changed the title from feat: add illico to feat: add illico for rank_genes_groups Apr 7, 2026
@codecov

codecov bot commented Apr 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.61%. Comparing base (53a2f74) to head (259d8f3).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@                 Coverage Diff                 @@
##           ig/exp_post_agg    #4038      +/-   ##
===================================================
- Coverage            78.63%   78.61%   -0.02%     
===================================================
  Files                  117      117              
  Lines                12738    12745       +7     
===================================================
+ Hits                 10016    10020       +4     
- Misses                2722     2725       +3     
Flag                  Coverage Δ
hatch-test.low-vers   77.95% <100.00%> (+0.01%) ⬆️
hatch-test.pre        78.54% <100.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines                 Coverage Δ
src/scanpy/_settings/presets.py          91.17% <100.00%> (ø)
src/scanpy/tools/_rank_genes_groups.py   93.71% <100.00%> (+0.95%) ⬆️

... and 3 files with indirect coverage changes

@remydubois

Hey, good catch for the OVO column ordering. I have no idea where the bug comes from, it seems to be related to group labels for PBMC that numpy sorts in a weird way. I will try to get a fix in the coming days.

@remydubois

So it seems to be due to np.argsort and np.sort not sorting the weird characters (" ", "+", "/", "#", etc) of the bulk_labels the same way. I will ship a fix sometime today or this weekend.
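A quick way to investigate a discrepancy like this (a standalone sketch, not illico's actual code) is to diff the orderings the different sorting routines produce on labels containing such characters:

```python
import numpy as np

# Labels with the "weird" characters in question (spaces, "+", "/").
labels = np.array(["CD8+ Cytotoxic T", "CD4+/CD45RO+ Memory", "CD14+ Monocyte", "CD34+"])

# Order from np.sort, the order implied by np.argsort, and Python's sorted();
# printing all three side by side localizes which routine disagrees.
by_sort = np.sort(labels).tolist()
by_argsort = labels[np.argsort(labels)].tolist()
by_python = sorted(labels.tolist())

print(by_sort == by_argsort, by_sort == by_python)
```

For plain NumPy unicode arrays all three agree (code-point order), so the reported discrepancy presumably involves an additional sorting path inside illico; the harness above is just a way to pin it down.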

NB: I spent a bit of time looking into the PBMC dataset and it seems rather unusual: a lot of values are negative, leading to undefined logfoldchanges, and on top of that there is very little value diversity in the dataset. As a result, a lot of genes end up with identical ranksums, but because illico and scanpy do not compute the z-score the exact same way (mathematically equivalent but programmatically different), gene ordering is affected. The gene ordering affects the position of NaN values in the results; the non-NaN values do match because they are very similar. I am not sure this dataset is a good test case.

NB2: details of the programmatic differences:
illico does:

U = ranksum - n_tgt * (n_tgt + 1) / 2
z = U - n_ref * n_tgt / 2

scanpy does:

z = ranksum - n_tgt * (n_tgt + 1 + n_ref) / 2

Those are mathematically equivalent but result in differences on the order of 1e-9, changing gene order when ranksums are equal.
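The two formulations can be compared numerically (a standalone sketch with made-up group sizes and ranksums, not either library's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tgt, n_ref = 137, 562
ranksum = rng.uniform(1e4, 1e5, size=1000)

# illico-style: subtract in two steps via the U statistic
U = ranksum - n_tgt * (n_tgt + 1) / 2
z_illico = U - n_ref * n_tgt / 2

# scanpy-style: one combined subtraction
z_scanpy = ranksum - n_tgt * (n_tgt + 1 + n_ref) / 2

# Mathematically identical, but the different association of the
# floating-point subtractions can leave a tiny rounding gap.
print(np.max(np.abs(z_illico - z_scanpy)))
```

Any nonzero gap is at the level of double-precision rounding, which is exactly what can flip the order of genes whose ranksums are tied.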

@ilan-gold
Copy link
Copy Markdown
Contributor Author

I am not sure this dataset is a good test case.

Interesting, and a good observation; glad we are aware of this now. This could be another target for scanpy 2.0.

I will try a different dataset but glad to have this documented.

@remydubois

Forget my previous message, which was actually kind of off topic.

This PBMC dataset was actually a very good test case.

  1. It unveiled a silent bug in illico: it seems np.argsort and np.sort do not sort the weird characters (" ", "+", "/", "#", etc.) of the bulk_labels the same way. I will ship a fix in the next few days.
  2. Because it has very limited diversity in the data (the raw values in .X are not very diverse), a lot of genes end up with identical ranksums (see the example below) and hence identical z-scores.
  3. The sorting method used by scanpy is more elaborate, as it allows the user to select only the top n genes. As a result, it does not sort identical values in the same order as illico (even if all genes are returned). That explains the gene-ordering difference I was still facing after fixing the name-sorting issue described in 1. I believe I can add the functionality to return only n_genes like scanpy does, and implement the same sorting routine as scanpy, which appears to solve the issue.
  4. Not related to PBMC, but z-scores also mismatched because scanpy casts them to float32 while illico keeps them as float64. I will fix that in the next patch as well.
  5. For the 1e-9 adjustment, I don't know what's the best way to go. Currently, illico handles this with np.where(mu_ref == 0, np.inf, mu_tgt / mu_ref), but I'm quite open to changing what's currently in illico if this 1e-9 offset is the de facto standard in other software like Seurat.
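For point 5, the two strategies can be put side by side (a sketch; the placement of the 1e-9 offset here is an assumption based on the discussion, not scanpy's exact formula):

```python
import numpy as np

mu_tgt = np.array([0.0, 0.5, 2.0])
mu_ref = np.array([0.0, 0.0, 1.0])

# illico's current guard, as quoted above: infinite ratio when the reference mean is 0
fc_guard = np.where(mu_ref == 0, np.inf, mu_tgt / mu_ref)

# epsilon-offset alternative: a small constant keeps every ratio finite
eps = 1e-9
fc_eps = (mu_tgt + eps) / (mu_ref + eps)

print(fc_guard.tolist())  # [inf, inf, 2.0]
print(fc_eps.tolist())
```

The guard maps every zero-reference gene to inf (including the 0/0 case), while the epsilon keeps ratios finite; which behaviour downstream code expects is exactly the compatibility question raised here.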

Example: let's take CD14+ Monocyte as the control group and CD34+ as the perturbed group, and look at the genes CDK6 and TCEAL8. Although they don't have exactly the same values, both CDK6 and TCEAL8 have equal ranksums and hence equal z-scores. Due to the difference in sorting methodology, they end up sorted in different orders between illico and scanpy.
From the next release on, the test suite in illico will not only test closeness of the values but also explicitly test that the gene ordering matches.
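One general way for two implementations to agree on tied scores (a sketch of the technique, not scanpy's or illico's actual routine) is a stable sort, which preserves the input order of ties:

```python
import numpy as np

scores = np.array([3.2, 5.0, 5.0, 1.1])          # two tied z-scores, like CDK6 / TCEAL8
names = np.array(["GENE_A", "CDK6", "TCEAL8", "GENE_B"])

# kind="stable" keeps tied entries in their original relative order, so any two
# implementations using the same stable descending sort rank ties identically.
order = np.argsort(-scores, kind="stable")
print(names[order].tolist())  # ['CDK6', 'TCEAL8', 'GENE_A', 'GENE_B']
```

With a non-stable sort (NumPy's default quicksort), tied entries can land in either order, which is exactly the mismatch described above.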

NB: what is the situation with the Dask issue? The current version is not dask-compatible.

@ilan-gold
Contributor Author

NB: what is the situation with the Dask issue? The current version is not dask-compatible.

That should come next; I want to keep this scoped to the in-memory stuff for now.

Another thing, if you're doing a batch of fixes this weekend: supporting cs{r,c}_array would be awesome. It should only be one line of code.

Also, for your numba types (only if you want, instead of named tuples): FAU has a plug-in for putting cs{c,r}_{array,matrix} into numba kernels: https://github.com/scverse/fast-array-utils/blob/main/src/fast_array_utils/_plugins/numba_sparse.py. But it might not be worth bringing in the dependency just for this purpose.
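On the cs{r,c}_array point: both the legacy *_matrix classes and the newer *_array classes in SciPy expose the same .data/.indices/.indptr buffers, so code that only reads those attributes can accept all four without branching (a small sketch, assuming a SciPy version recent enough to ship the *_array classes):

```python
import numpy as np
from scipy import sparse

def raw_parts(x):
    # The compressed buffers look identical across all four CSR/CSC classes,
    # which is why supporting *_array can be a one-line change.
    return x.data, x.indices, x.indptr

dense = np.array([[0.0, 1.0], [2.0, 0.0]])
for cls in (sparse.csr_matrix, sparse.csr_array, sparse.csc_matrix, sparse.csc_array):
    data, indices, indptr = raw_parts(cls(dense))
    print(cls.__name__, float(data.sum()))
```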

@ilan-gold
Contributor Author

ilan-gold commented Apr 10, 2026

For the 1e-9 adjustment, I don't know what's the best way to go. Currently, illico handles this with np.where(mu_ref == 0, np.inf, mu_tgt / mu_ref), but I'm quite open to changing what's currently in illico if this 1e-9 offset is the de facto standard in other software like Seurat.

I'm seeing some instances of https://github.com/satijalab/seurat/blob/main/R/differential_expression.R#L1089, i.e., they calculate the mean expression with this "+1" offset, so they don't need the correction.
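As I read the linked line, the idea is roughly the following (a sketch with a pseudocount of 1; the exact expression and the log1p-input assumption are mine, not Seurat's verbatim code):

```python
import numpy as np

logged_tgt = np.array([0.0, 0.2, 1.5])   # log1p-transformed expression, target group
logged_ref = np.array([0.0, 0.1, 3.0])   # log1p-transformed expression, reference group

# Adding 1 to the back-transformed group means keeps the log ratio finite even
# when a group's mean expression is exactly 0, so no tiny epsilon is needed.
lfc = np.log2((np.expm1(logged_tgt).mean() + 1) / (np.expm1(logged_ref).mean() + 1))
print(lfc)
```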

So we have three different approaches here, I guess. I think just giving the option to match scanpy (since that seems to be a project goal) is good, and then we can maybe revisit what Seurat does as a new parameter for scanpy 2.0.

Not related to PBMC, but z-scores also mismatched because scanpy casts them to float32 while illico keeps them as float64. I will fix that in the next patch as well.

For this, I would be open to also making this part of a scanpy 2.0 set of defaults, i.e., using float64, as there are other places we might benefit from it. So if you were to keep float64, I don't think that would be a blocker. Again, an option could be good so we can smooth out the transition, although this is so small it may not warrant one.

Overall, my ideal scenario would be that illico and scanpy with wilcoxon match, and then we use illico by default in scanpy 2.0. I'm pushing to minimize result changes to make the transition smoother (since changing results is never fun), and so far the changes don't seem particularly destructive.

Short of exact matches, like I said, we can just document the changes. It sounds like the biggest things that would close the gap are:

  1. argsort bug
  2. 1e-9 option
  3. The option to return n_genes
  4. cs{r,c}_array support

I think we can patch the float{32,64} stuff in a general "numerical accuracy" scanpy 2.0 preset (which we have on main right now).

Overall, this is really cool and I'm super happy that things seem to be pretty close!

@remydubois

remydubois commented Apr 12, 2026

Hey,

I just released version 0.5.0rc1 which should fix the issues identified on this test case:

  1. Perturbation (or group) names are no longer re-sorted when outputs are formatted for scanpy, ensuring they are ordered the same everywhere.
  2. Fold change is now computed by adding the same 1e-9 factor, on top of being accumulated into an f64 placeholder (regardless of the original data dtype), ensuring a better match.
  3. A new argument n_genes allows returning only the top n DE genes per perturbation (and, even when not specified, makes the gene-sorting methodology identical, which solves the issue raised by genes having identical scores).
  4. Support for cs[cr]_array. I did not explicitly add test cases for those, since the test suite is already quite heavy and illico does nothing more than access .data, .indices, and .indptr of those objects. I did test it manually and it ran with no issue.

Regarding the numerical precision question: so far I force the recarrays to have the same dtype as what's currently implemented in scanpy, see FC for instance. I believe we could change that when the need arises, as you say.

Testing locally, it seems everything now runs smoothly for PBMC. Let me know how it goes with the rest of the CI.

@ilan-gold
Contributor Author

ilan-gold commented Apr 13, 2026

@remydubois I think the issue is that you are sorting / cutting off by n_top genes globally, whereas scanpy does this per group, but that is just a guess.

In any case, I think I got a little lost in the sauce. I think matching p-values and scores should be enough, because scanpy can internally handle LFC when it wraps illico. I just pushed a commit with this change. Of course, if you want to match LFC, no argument from me, but I realized it's probably not essential for internal consistency.

I'm pushing the results of scores and pvals_adj which all match except the two cases highlighted in the tests.

With this in mind, I am going to push ahead with integrating illico. Sorry for the churn on the log fold changes, but I think we can just let scanpy do it as long as I can get scores/pvals out of illico.

BTW: https://remydubois.github.io/illico/api.html looks both incomplete and like it's leaking internals.

@ilan-gold
Contributor Author

Ok, everything is passing locally with just getting the p-value and z-score from illico. This is looking good now, at least conceptually. The last thing would be to make sure the mean-var calculation is fast, since we won't rely on illico (at least for now) to do logfc, but I have something for that in #4041.

I assume the LFC calculation (i.e., actually calculating the means) is the less computationally intensive part?

@remydubois

remydubois commented Apr 13, 2026

With this in mind, I am going to push ahead with integrating illico

Ok ! That's good news to me.

Because this was bugging me a little bit: I checked out your ig/illico branch and indeed reproduced the test failure, until I added pbmc = ad.AnnData(pbmc.X.copy(), obs=pbmc.obs.copy(), var=pbmc.var.copy()) (in order to make sure pbmc gets stripped of all of its metadata and attributes):

def test_illico(test, corr_method, exp_post_agg):
    from illico.asymptotic_wilcoxon import asymptotic_wilcoxon

    pbmc = pbmc68k_reduced()
    pbmc = ad.AnnData(pbmc.X.copy(), obs=pbmc.obs.copy(), var=pbmc.var.copy())
    # ... Rest of the test. They all pass on my machine

Then all tests passed (LFC, pvals, scores, pvals_adj, up to atol=1e-9), which I still consider good news, even bearing in mind that LFC will remain scanpy-computed (non-matching LFC might mean non-matching names, which would have been a real issue). The reason my "tests" passed locally before shipping 0.5.0rc1 for PBMC was exactly that I was stripping it entirely before running and comparing outputs, just to be sure.

By the way, wouldn't it be safer for scanpy's test suite to add an explicit check on the ordering of the genes, like I did here (which did fail before stripping pbmc)?
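Such a check could look like this (a hypothetical helper for illustration, not something in scanpy's suite today):

```python
import numpy as np

def assert_same_gene_order(names_a, names_b):
    # Value-closeness tests can pass even when tied genes swap places; comparing
    # the name arrays position-by-position catches exactly that class of mismatch.
    names_a, names_b = np.asarray(names_a), np.asarray(names_b)
    mismatch = np.flatnonzero(names_a != names_b)
    assert mismatch.size == 0, f"gene order differs at positions {mismatch.tolist()}"

assert_same_gene_order(["CDK6", "TCEAL8", "HES4"], ["CDK6", "TCEAL8", "HES4"])
print("orderings match")
```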

This is looking good now to me at least conceptually

Very happy to read it's moving forward 🚀

I assume the LFC calculation (i.e., actually calculating the means) is the less computationally intensive part?

Yes, it's negligible; I could not measure its impact on the overall runtime.

@ilan-gold
Contributor Author

until I added pbmc = ad.AnnData(pbmc.X.copy(), obs=pbmc.obs.copy(), var=pbmc.var.copy()) (in order to make sure pbmc gets stripped of all of its metadata and attributes)

Oh interesting, I wonder if there is a bug in AnnData. Prima facie, I would expect .copy to do just what you are doing: 8352445#diff-5ec943b8e30244cd1fbe2f048e8312c76626880147c4a5dce42aea518e72d24bL330. I wonder if uns or var are secretly not copied... I'll need to look closer tomorrow, but my confidence is very high; super happy. Thanks for spending some more time looking into this.

By the way, wouldn't it be safer for scanpy's test suite to add an explicit check on the ordering of the genes, like I did here (which did fail, before stripping pbmc) ?

Probably, yes. I'll look more at this tomorrow now that you seem to have whittled things down quite a bit.

@ilan-gold
Contributor Author

Ok, very simple: rank_genes_groups uses raw if it's there, and it is here. illico doesn't know that, so that is even further evidence that we should route this through scanpy at the level we do here, so that the handling of raw/layers and arguments is done by scanpy while illico focuses on the downstream result instead of handling that logic.

If you set use_raw to False, you would also have gotten completely matching results when I was testing asymptotic_wilcoxon directly!

@ilan-gold ilan-gold requested a review from flying-sheep April 14, 2026 12:58
@ilan-gold
Contributor Author

@flying-sheep I think this is ready to review at an architectural level

  1. Does the integration make sense? I think the minimalism of p-values/z-scores is good as well, i.e., we handle choosing raw/layers as well as LFC and p-value correction.
  2. Does the new argument make sense, or should we instead keep wilcoxon constant as the method and make a backend argument for illico?
  3. If you find the current implementation unreadable, i.e., creating a new AnnData object from the stripped parts in _RankGenes.__init__, I can make a separate PR for that. This might also be a good opportunity to bring in pandas-uuid instead of the RangeIndex.astype("str")!

@remydubois You may want to note that corr_method is only applied to scanpy results (and allow people to disable it, which is effectively what happens now anyway if you don't use scanpy, I think).

@remydubois

Ok very simple - rank_genes_groups uses raw if it's there

This makes sense - I've always thought that I would not make illico go too deep into the layers/raw abstractions of AnnData as I don't really understand them myself.

Well done @ilan-gold , feel free to tag me if further needed!

@azure-pipelines

Azure Pipelines:
1 pipeline(s) were filtered out due to trigger conditions.

@ilan-gold
Contributor Author

ilan-gold commented Apr 15, 2026

Ok, this is really cooking now. I have matching results on random subsets of some real-world data from illico (GSE264667_hepg2_raw_singlecell_01), obviously not running the whole thing. Looking very, very good. If it weren't for the external dependency, this would certainly pass the smell test for being purely a performance improvement, given the matching results.

@scverse scverse deleted a comment from azure-pipelines bot Apr 15, 2026
@remydubois

Also, I realized that illico does not support the groups argument (which I assume allows users to compute DE only on certain perturbations/labels).
In the short term, maybe running illico on everything and filtering the results would be enough. In the longer term, I could implement this functionality in asymptotic_wilcoxon.
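The short-term workaround could be as simple as subsetting the per-group results after the fact (a sketch with a hypothetical dict-of-results shape, not illico's real output format):

```python
# Hypothetical per-group result mapping, e.g. group name -> score list.
results = {
    "CD14+ Monocyte": [2.1, 0.4],
    "CD34+": [0.3, 1.9],
    "Dendritic": [1.7, 0.2],
}

groups = ["CD34+", "Dendritic"]             # user-requested subset
filtered = {g: results[g] for g in groups}  # compute everything, keep only these
print(sorted(filtered))  # ['CD34+', 'Dendritic']
```

This does the same amount of compute as the full run, which is acceptable only because the full run is fast; a native groups argument would avoid the wasted work.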

@ilan-gold
Contributor Author

ilan-gold commented Apr 16, 2026

@remydubois Great point - let me try adding a test parametrization here for that. I assume it would work, since we construct the AnnData object manually for illico out of the parts that have (presumably) been processed by the groups argument.

EDIT:
[screenshot, 2026-04-16: failing test output]

It looks like we might need a groups argument then.

@ilan-gold
Contributor Author

Ok @remydubois, in the meantime in this PR I will filter, since it requires the same amount of data and illico is bananas fast, as you say. Just LMK if you do add it and I'll put in a note.
