Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(feat): Aggregation via group-by in sc.get #2590

Merged
merged 105 commits into from Feb 20, 2024

Conversation

ilan-gold
Copy link
Contributor

@ilan-gold ilan-gold commented Aug 3, 2023

Modified (for scanpy) version of scverse/anndata#564. Fixes scverse/anndata#556

Big points of change:

  1. No more tuple-indices and related functionality (i.e., scoring pairwise)
  2. Allow for obs and var group-by +varm, obsm, layers as options for data to aggregate
  3. Output is AnnData object instead of DataFrame
  4. scanpy-style public API

TODO (by @ivirshup):

Necessary:

  • Docs
  • Aggregate along other axis
  • Keep grouping cols in result
  • Reconsider API for non-anndata version (maybe return a dict of arrays?)
  • Decide on naming convention for "nonzero" variations, should this be "nonzero_count" so it's a little like "nanmean"

Optional, can do later:

  • Weighted (although.... Idk, maybe can skip. Does "weights" affect "count_nonzero"?)
  • Option for keeping around unseen groups, probably needs fill_value argument for those values
  • Support for obsm, varm
  • Directly pass Series to groupby
  • More aggregation functions (mean_nonzero, min, max, std, nan* variations)
  • Mask argument
  • Dask support

@ilan-gold ilan-gold marked this pull request as draft August 3, 2023 15:47
@codecov
Copy link

codecov bot commented Aug 3, 2023

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (b4ba81d) 74.62% compared to head (0a7cf85) 74.83%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2590      +/-   ##
==========================================
+ Coverage   74.62%   74.83%   +0.20%     
==========================================
  Files         115      116       +1     
  Lines       12763    12891     +128     
==========================================
+ Hits         9525     9647     +122     
- Misses       3238     3244       +6     
Files Coverage Δ
scanpy/_utils/__init__.py 68.68% <100.00%> (+0.48%) ⬆️
scanpy/get/__init__.py 100.00% <100.00%> (ø)
scanpy/get/_aggregated.py 95.04% <95.04%> (ø)

@ilan-gold ilan-gold marked this pull request as ready for review August 4, 2023 08:03
@ilan-gold ilan-gold changed the title (feat): GroupBy Aggregation in sc.get (feat): Aggregation via groupby in sc.get Aug 4, 2023
@ilan-gold ilan-gold changed the title (feat): Aggregation via groupby in sc.get (feat): Aggregation via group-by in sc.get Aug 4, 2023
@ilan-gold
Copy link
Contributor Author

ilan-gold commented Aug 4, 2023

Ok @ivirshup I think we're ready to go here. Thanks for the guidance!

Copy link
Member

@flying-sheep flying-sheep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please either revert 0aef147 or allow both as in #2771, or wait for scverse/anndata#1245

also as I asked before: why go away from dataclasses?

@ivirshup
Copy link
Member

ivirshup commented Jan 11, 2024

also as I asked before: why go away from dataclasses?

I don't think that switching away from data classes removed any meaningful functionality here, but having to use default_factory, InitVar, and/or __post_init__ would add more complexity.

I don't think that there being some internal data classes is important here, especially since it's not user visible and may change at any time anyways. I have a few ideas for ways to change the implementation to add more methods, none of which are compatible with Aggregate being a data class.

  • One path forward just removes the class entirely, since it doesn't do much now
  • The other uses a number of cached properties, which I don't think make a ton of sense to use with dataclasses

Is there some functionality the data class was adding that I'm missing?

Copy link
Member

@flying-sheep flying-sheep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding data classes, as said in person: I tend to use them when

  • they basically just mean I can delete the __init__ function and maybe add a = field(...)
  • instances of the class actually end up as dict keys, being compared or so

The former is almost the case, at the time I wrote it, it was the case.

Let’s get the consensus axis design into this!

scanpy/get/_aggregated.py Outdated Show resolved Hide resolved
scanpy/get/_aggregated.py Outdated Show resolved Hide resolved
scanpy/get/_aggregated.py Outdated Show resolved Hide resolved
@flying-sheep
Copy link
Member

sparse_indicator doesn’t have its weights branches hit at all, maybe we should remove that? Or will this be used at some point?

@ivirshup
Copy link
Member

sparse_indicator doesn’t have its weights branches hit at all, maybe we should remove that? Or will this be used at some point?

I think it will be used at some point, but also happy to remove.

I think parameterizing test_aggregate_axis_specification is overkill for what the test does.

@flying-sheep
Copy link
Member

I think it will be used at some point, but also happy to remove.

OK, good to know! Then this PR is fine as far as I’m concerned.

@ivirshup ivirshup merged commit 383a61b into scverse:master Feb 20, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AnnData split-apply-combine
3 participants