feat: add illicofor rank_genes_groups#4038
feat: add illicofor rank_genes_groups#4038ilan-gold wants to merge 27 commits intoig/exp_post_aggfrom
illicofor rank_genes_groups#4038Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## ig/exp_post_agg #4038 +/- ##
===================================================
- Coverage 78.63% 78.61% -0.02%
===================================================
Files 117 117
Lines 12738 12745 +7
===================================================
+ Hits 10016 10020 +4
- Misses 2722 2725 +3
Flags with carried forward coverage won't be shown. Click here to find out more.
|
|
Hey, good catch for the OVO column ordering. I have no idea where the bug comes from, it seems to be related to group labels for PBMC that |
|
So it seems to be due to np.argsort and np.sort not sorting the weird characters (" ", "+", "/", "#", etc) of the bulk_labels the same way. I will ship a fix sometime today or this weekend. NB: I spent a bit of time looking into the PBMC dataset and it seems rather unusual: a lot of values are negative leading to non-defined logfoldchanges, on top of which there seems to be very little value diversity in the dataset. As a result, a lot of genes end up with identical ranksum, but because illico and scanpy do not compute z-score the exact same way (mathematically equivalent but programmatically different), gene ordering is impacted. The gene ordering impacts the position of NaN values in the results, but non-NaN values do match because they are very similar. I am not sure as if this dataset is a good test case. NB2: details of the programmatic differences: U = ranksum - n_tgt * (n_tgt+ 1)/2
z = U - n_ref * n_tgt / 2scanpy does: z =
ranksum - n_tgt * (n_tgt + 1 + n_ref) / 2Those are mathematically equivalent but result in differences of the order of 1.e-9 approximately, changing gene orders when ranksums are equal. |
Interesting, good observation but glad we are aware of this now. This could be another target for scanpy 2.0. I will try a different dataset but glad to have this documented. |
|
Forget my previous message which was actually kind of off topic. This PBMC dataset was actually a very good test case.
Example: let's take NB: what is the situation with the Dask issue ? Current version is not dask-compatible. |
That should come next, I want to keep this scoped for now to the in-memory stuff. Another thing if you're doing a batch of fixes this weekend - supporting Also for your |
I'm seeing some instances of https://github.com/satijalab/seurat/blob/main/R/differential_expression.R#L1089 i.e., they calculate the mean expression with this "+1" offset so don't need the correction. So we have three different things here, I guess. I think just giving the option to match scanpy (since that seems a project goal) is good, and then we can maybe revisit what seurat does as a new parameter for scanpy 2.0.
For this, I would be open to also making this part of a scanpy 2.0 set of default i.e., using Overall, my ideal scenario would be that Short of exact matches, like I said we can just document changes. It sounds like the biggest things that would close the gap are:
I think we can patch the Overall, this is really cool and I'm super happy that things seem to be pretty close! |
|
Hey, I just released version 0.5.0rc1 which should fix the issues identified on this test case:
Regarding the numerical precision question: so far I force the Testing locally, it seems everything now runs smoothly for PMBC. Let me know how it goes with the rest of the CI. |
|
@remydubois I think the issue is that you are sorting / cutting off by In any case, I think I got a little lost-in-the-sauce. I think matching p-values and scores should be enough because scanpy can internally handle LFC when it wraps I'm pushing the results of With this in mind, I am going to push ahead with integrating BTW: https://remydubois.github.io/illico/api.html looks both incomplete and like it's leaking internals. |
|
Ok everything passing locally with just getting P-value and z-score from I assume the LFC calculation (i.e., actually calculating the means) is the less computationally intensive part? |
Ok ! That's good news to me. Because this was bugging me out a little bit: I checked out on your def test_illico(test, corr_method, exp_post_agg):
from illico.asymptotic_wilcoxon import asymptotic_wilcoxon
pbmc = pbmc68k_reduced()
pbmc = ad.AnnData(pbmc.X.copy(), obs=pbmc.obs.copy(), var=pbmc.var.copy())
# ... Rest of the test. They all pass on my machineThen, all tests passed (LFC, pvals, scores, pvals_adj, up to By the way, wouldn't it be safer for
Very happy to read it's moving forward 🚀
Yes it's neglectable, I could not measure its impact on the overall runtime. |
Oh interesting, I wonder if there is a bug in
Probably, yes. I'll look more at this tomorrow now that you seem to have whittled things down quite a bit. |
|
Ok very simple - If you use |
|
@flying-sheep I think this is ready to review at an architectural level
@remydubois You may want to note that |
This makes sense - I've always thought that I would not make illico go too deep into the layers/raw abstractions of AnnData as I don't really understand them myself. Well done @ilan-gold , feel free to tag me if further needed! |
|
Azure Pipelines: 1 pipeline(s) were filtered out due to trigger conditions. |
|
Ok this is really cooking now. I have matching results on random subsets of some real world data from |
|
Also, I realized that illico does not support the 'groups' argument (which I assume allows user to only compute DE on certain perturbations/labels). |
|
@remydubois Great point - let me try adding a test parametrization here for that. I assume it would work since we construct the anndata object manually for It looks like we might need a groups argument then. |
|
ok @remydubois in the meantime in this PR I will filter since it requires the same amount of data and |

TODOs:
See: https://github.com/scverse/scanpy/actions/runs/24088645078/job/70268566419?pr=4038
ovo"column" in scores (cc @remydubois)illico(see thefilterwarningson the new test)ovrresults are so far offexp_post_aggfeature feat: allow exponentiation post agg for log-fold-change inrank_genes_groups#4037) toillico, we will add theillicobackend (or all changes have been documented)illico#4012