Cellarity: GroupBy with sparse-matrix agg, scores #564
Conversation
Codecov Report
@@ Coverage Diff @@
## master #564 +/- ##
==========================================
+ Coverage 86.49% 87.12% +0.62%
==========================================
Files 25 26 +1
Lines 3532 3728 +196
==========================================
+ Hits 3055 3248 +193
- Misses 477 480 +3
Force-pushed from dfa1c27 to 29cbe8e
Looks excellent. I'd probably personally implement it as a method available on … Also needs some minor details like …
About the content: then we could have two classes:

    class GroupBy(ABC):
        """..."""
        def __new__(..., mask_nans: bool = False):
            return GroupByPandas(...) if mask_nans else GroupByNumpy()

        @abstractmethod
        def mean(...): ...

    class GroupByNumpy(GroupBy):
        ...

@ivirshup what do you think?
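A minimal runnable sketch of that dispatch idea, with stand-in bodies since the real implementations live in the PR (all names beyond `GroupBy` are hypothetical; the "pandas" variant here just skips NaNs in plain Python to illustrate the masking difference):

```python
from abc import ABC, abstractmethod
from math import isnan


class GroupBy(ABC):
    """Dispatch to a NaN-masking or plain implementation at construction."""

    def __new__(cls, values, mask_nans: bool = False):
        # Only dispatch when the base class itself is instantiated,
        # so the concrete subclasses can still be constructed directly.
        if cls is GroupBy:
            cls = GroupByPandas if mask_nans else GroupByNumpy
        return super().__new__(cls)

    def __init__(self, values, mask_nans: bool = False):
        self.values = list(values)

    @abstractmethod
    def mean(self) -> float: ...


class GroupByNumpy(GroupBy):
    def mean(self) -> float:
        return sum(self.values) / len(self.values)


class GroupByPandas(GroupBy):
    # Stand-in for a pandas-backed version: skips NaNs instead of propagating.
    def mean(self) -> float:
        kept = [v for v in self.values if not isnan(v)]
        return sum(kept) / len(kept)
```

Since `__new__` swaps in a concrete subclass before `object.__new__` runs, the abstract-class check never fires for callers of `GroupBy(...)`.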
Thanks for the PR! This looks really great, and like a huge performance improvement! Really excited to see Cellarity contributing back. As we discussed offline (well, sorta offline):
Plan for merging:
I think there's a lot of value here, and I think there is a (probably long) discussion to have around the API. I think we could merge as private with relatively few changes, then implement the interface in separate PRs. This would of course be used as internals for an actual GroupBy object's reduction interface. Multiple people could then hack on API implementations working from a common (and up to date) codebase.
I believe this is @jbloom22's preferred path as well. While it would be nice to have all this done in one PR, I think waiting on API decisions and implementing whatever consensus we come to is too much to ask of a non-core contributor. Sound reasonable @flying-sheep?
Naming and API
I'm going to put a short version of this here, and will open an issue with the longer version if the proposed "plan for merging" sounds good.
Naming
I don't think this quite fits the generality of groupby, since that should allow more custom access to the grouped object and doesn't require complete reduction over the groups. @gokceneraslan proposed a function with some similar functionality over in scverse/scanpy#1390 called summarized_expression_df. I think this works well for computing the reduction values. Not so sure it works for computing the test statistics.
API
What's the right way to expose this to users? Since I don't think this quite fulfills the full split-apply-combine paradigm, I don't think it should be through a GroupBy object. Two main ideas:
- Could this be a function?
- Could we split the reductions and the comparisons?
I'd be interested to hear thoughts from people who have commented on this in the past / made similar PRs.
Right now I'm thinking something like:

    stats = ad.get.summarized_df(adata, key, layer="lognorm", stats=["mean", "var", "count"])
    # could also be called `grouped_stats`
    # where stats is a dict[str, pd.DataFrame]
    # for df in stats.values():
    #     assert ((df.index == adata.var_names) & (df.columns == ["mean", "var", "count"])).all()
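For reference, a toy sketch of what such a `summarized_df` could compute (dense-only, plain numpy loop rather than the PR's sparse fast path; function and argument names are hypothetical):

```python
import numpy as np
import pandas as pd


def summarized_df(X, keys, var_names, stats=("mean", "var", "count")):
    """Sketch: per-group summary stats over a dense (cells x genes) matrix.

    Returns a dict mapping each group name to a DataFrame indexed by
    var_names, with one column per requested statistic.
    """
    keys = np.asarray(keys)
    out = {}
    for group in pd.unique(keys):
        sub = X[keys == group]  # rows belonging to this group
        cols = {}
        if "mean" in stats:
            cols["mean"] = sub.mean(axis=0)
        if "var" in stats:
            cols["var"] = sub.var(axis=0, ddof=1)  # sample variance
        if "count" in stats:
            cols["count"] = np.full(sub.shape[1], sub.shape[0])
        out[group] = pd.DataFrame(cols, index=var_names)
    return out
```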
Score calculation could then look something like:

    scores = pd.DataFrame({
        "logfold": ad.stats.fold_score(stats["group_1"], stats["group_2"]),
        "t_score": ad.stats.t_score(stats["group_1"], stats["group_2"]),
    })
    # Or scores could operate on groups
    scores = ad.get.grouped_scores(stats, "group_1", "group_2", scores=["t", "fold"])
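The score half could be as thin as a Welch t-statistic over the precomputed columns. A sketch (`t_score` here is a hypothetical free function, not an existing anndata API):

```python
import numpy as np
import pandas as pd


def t_score(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.Series:
    """Per-gene Welch t-statistic from two per-group summary DataFrames.

    Each input needs "mean", "var" and "count" columns, indexed by gene.
    """
    # Standard error of the difference of means under unequal variances
    se = np.sqrt(df1["var"] / df1["count"] + df2["var"] / df2["count"])
    return (df1["mean"] - df2["mean"]) / se
```

Because it only touches the (n_groups x n_genes) summaries, this is cheap regardless of cell count, which is what makes on-the-fly scoring of many group pairs feasible.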
Other topics (to address separately from API)
Docs
Could you add some examples to the doc strings? Would be nice to see how this is intended to be used.
pd_methods
@jbloom22, could you remind me why you left in the pd methods? IIRC it was about differences in handling null values? Could you provide some examples for cases which you know are different between the two?
Multilevel group encoding
We should have a different way to encode multilevel groupings since we can't write a column of tuples to disk. I'd propose just passing the indicator matrix directly here, since this is a reasonably common way to represent these.
This should also allow for simplifying _extract_indices down to something equivalent to sklearn.preprocessing.OneHotEncoder(sparse=True).fit_transform(adata.obs[[key]]).
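For concreteness, here's a sketch of that indicator-matrix encoding built directly with scipy (avoiding the sklearn dependency; function names are hypothetical). Grouped sums then reduce to a single sparse product:

```python
import numpy as np
import pandas as pd
from scipy import sparse


def indicator_matrix(keys):
    """Sparse (n_groups x n_obs) one-hot indicator for a vector of group keys."""
    codes, groups = pd.factorize(np.asarray(keys))
    n = len(codes)
    A = sparse.coo_matrix(
        (np.ones(n), (codes, np.arange(n))), shape=(len(groups), n)
    ).tocsr()
    return A, list(groups)


def grouped_sums(X, keys):
    # One sparse product gives the per-group column sums of X.
    A, groups = indicator_matrix(keys)
    return A @ X, groups
```

Accepting such a matrix directly would also cover multilevel groupings without writing tuples to disk, since the caller can encode any grouping scheme as rows of A.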
Null values in groups
It would be nice to have this handled. Currently, if any of the values in adata.obs[key] are null, this errors. I think the question is whether to consider this a group or not. I'd say no. I'd also say that null values shouldn't be allowed if an indicator matrix is passed, as I can't think of what that could mean.
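A sketch of the "null is never a group" behavior, assuming keys arrive as something pandas can interpret (helper name is hypothetical):

```python
import numpy as np
import pandas as pd


def drop_null_keys(X, keys):
    """Drop observations whose group key is null, so null forms no group."""
    keys = pd.Series(keys)
    keep = keys.notna().to_numpy()  # boolean mask of non-null observations
    return X[keep], keys[keep].to_numpy()
```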
Testing
I would like the test to get split up a bit so if it fails we can see all the components that are affected instead of just the first one.
- Could you break up the tests so they're parameterized?
- Could you test for inputs of different dtypes? I'd like to see that this works for integers and floats in particular.
- Could you also add a test ensuring that you're handling numerical instability? I'd hate for handling of this to break without us noticing.
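The dtype point could look something like this pytest sketch, with a tiny stand-in reference implementation so the example is self-contained (the real tests would of course call the PR's GroupBy instead of `grouped_mean`):

```python
import numpy as np
import pytest


def grouped_mean(X, keys):
    """Tiny stand-in used only so the test below is self-contained."""
    keys = np.asarray(keys)
    return {k: X[keys == k].mean(axis=0) for k in np.unique(keys)}


@pytest.mark.parametrize("dtype", [np.int32, np.int64, np.float32, np.float64])
def test_grouped_mean_dtypes(dtype):
    # One parameterized case per dtype: a failure names the offending input type.
    X = np.array([[1, 2], [3, 4], [5, 6]], dtype=dtype)
    means = grouped_mean(X, ["a", "a", "b"])
    np.testing.assert_allclose(means["a"], [2.0, 3.0])
    np.testing.assert_allclose(means["b"], [5.0, 6.0])
```

The same parameterization pattern extends naturally to the per-statistic and numerical-stability cases, so each failing combination shows up as its own test ID.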
Totally, if the API decisions do turn out to take a while. I like things as they are, except for the … methods; regarding them: from what I understood, they're skipping instead of propagating NaNs, depending on what one wants …
Might also be useful to compute num_expressed_ = _toarray(A * (X != 0)); however, this does create an additional copy of X.
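A runnable sketch of that suggestion (names hypothetical; note that `X != 0` itself stays sparse for a sparse X, so nothing is densified, though it is indeed an extra sparse copy):

```python
import numpy as np
from scipy import sparse


def num_expressed(A, X):
    """Per-group, per-gene count of cells with nonzero expression.

    A: sparse (n_groups x n_obs) indicator matrix
    X: sparse (n_obs x n_vars) expression matrix
    """
    # (X != 0) is a sparse boolean matrix sharing X's sparsity pattern,
    # so X is never densified; only the small result is made dense.
    return (A @ (X != 0).astype(np.int64)).toarray()
```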
Co-authored-by: Isaac Virshup <ivirshup@gmail.com>
I implemented some of the outstanding changes, and an idea about the API, in jbloom22#1. Alternatively we can stay with dicts here, but I'd still get rid of the … The tests fail because I used Python 3.7 stuff (since we're going to depend on Python 3.7 anyway).
Create a CountMeanVar type
My current plan for this PR: I would like to split this PR out into a separate repo. This allows iterating on the API quickly and being able to add dependencies (some likely temporary). Likely name: …
Co-authored-by: Jon Bloom <jbloom@cellarity.com>
Most of this feature has been released in scanpy 1.10 via … Thanks for getting this started, and apologies for how long it took to get out!
Cellarity benefits from anndata / scanpy open-source development and is proud to contribute a class I wrote in July 2020 to speed up our Platform.
The GroupBy class supports grouping and aggregating AnnData observations by key, per variable, and computing scores between pairs of groups.
Features:
Implementation/Performance:
So it's plenty fast to compute representations like t-scores on the fly for many pairs of groups of cells. E.g., computing t-scores for 500,000 cells over 1000 genes for 1800 pairs is reduced from 30m using scanpy.tl.rank_genes_groups to 1.9s (~1000x speedup). For reference, the included pandas implementation takes 30s, and sparse aggregation on a densified adata takes 5s.
Over the full 43K genes (which don't fit densely in RAM), t-scores take 13s for 148 pairs and 19s for 1800 pairs at finer resolution.
While this was designed to be extensible and support computing and optimizing single-cell representations as the data and models grow, I’m hopeful folks will get creative with use-cases on other data types too!