Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Dask in highly_variable_genes #2809

Merged
merged 59 commits into from Feb 15, 2024
Merged

Support Dask in highly_variable_genes #2809

merged 59 commits into from Feb 15, 2024

Conversation

flying-sheep
Copy link
Member

@flying-sheep flying-sheep commented Jan 15, 2024

@flying-sheep flying-sheep marked this pull request as draft January 15, 2024 14:32
@flying-sheep flying-sheep mentioned this pull request Jan 15, 2024
3 tasks
Copy link

codecov bot commented Jan 15, 2024

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (e7a1251) 74.07% compared to head (54388f8) 74.11%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2809      +/-   ##
==========================================
+ Coverage   74.07%   74.11%   +0.04%     
==========================================
  Files         115      115              
  Lines       12652    12685      +33     
==========================================
+ Hits         9372     9402      +30     
- Misses       3280     3283       +3     
Files Coverage Δ
scanpy/_compat.py 50.00% <100.00%> (ø)
...preprocessing/_deprecated/highly_variable_genes.py 95.40% <100.00%> (+0.05%) ⬆️
scanpy/preprocessing/_distributed.py 82.60% <ø> (ø)
scanpy/preprocessing/_highly_variable_genes.py 95.58% <98.47%> (-0.28%) ⬇️

... and 1 file with indirect coverage changes

Comment on lines 497 to 499
filt, _ = filter_genes(
_get_obs_rep(adata_subset, layer=layer), min_cells=1, inplace=False
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this has been broken the whole time by operating on X instead of the selected layer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bet we didn't notice since the nonzero values are likely pretty similar, if not the same, for common use cases

@flying-sheep
Copy link
Member Author

flying-sheep commented Jan 19, 2024

Hmm, I found several weirdnesses in hvg:

  1. as mentioned above, this line has always ignored the layer parameter (fixed in this PR):
    filt = filter_genes(adata_subset, min_cells=1, inplace=False)[0]
  2. How does the binning in flavour='cell_ranger' work? (see n_bins not respected in highly_variable_genes(..., flavour='cell_ranger') #888, where I asked @Koncopd and @falexwolf how to address that)
  3. This line has survived since the very first commit, but what does it do? it sorts the view in-place?
    dispersion_norm[
    ::-1
    ].sort() # interestingly, np.argpartition is slightly slower

Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the comments, can we get a regression test for "cell_ranger" (e.g. generate results with an older version)? I don't think we have one in the test suite.

Sure! That’s a concrete thing I can do. I’ll do that on thursday, I did the rest of what you asked today

I don't think this has happened yet, could we add this?


There are two more lines which aren't covered, but I believe they should be unreachable (both just ValueError that the arg should be "cell_ranger" or "seurat") so it's fine.


I'm a little concerned about changing the return for inplace=False, in case anyone was relying on that. Not too concerned, but what do you think about tracking that for 2.0?

scanpy/preprocessing/_highly_variable_genes.py Outdated Show resolved Hide resolved
scanpy/preprocessing/_highly_variable_genes.py Outdated Show resolved Hide resolved
scanpy/preprocessing/_highly_variable_genes.py Outdated Show resolved Hide resolved
@ivirshup
Copy link
Member

Sorry for the accidental push, trying to see if we can get inline code coverage on github:

@flying-sheep
Copy link
Member Author

flying-sheep commented Feb 13, 2024

can we get a regression test for "cell_ranger" (e.g. generate results with an older version)? I don't think we have one in the test suite.

I’ll do that today if you can give me more info. Usually “regression test” refers to testing specific properties that were broken in a bug and subsequently fixed. What properties exactly are you looking for? Why cell_ranger and not seurat? Are you implying that we do that already for seurat?

/edit: done in #2851

There are two more lines which aren't covered, but I believe they should be unreachable (both just ValueError that the arg should be "cell_ranger" or "seurat") so it's fine.

yeah, lines like that are more defensive coding. I add them even to internal code to force us to look at everything instead of having a else: # flavor == "cell_ranger" branch or so

I'm a little concerned about changing the return for inplace=False, in case anyone was relying on that.

You mean the fact that the index makes the dataframe now actually useful? I can’t think of a way in which this breaks things in a way that isn’t immediately obvious and welcome. Of course, code can be infinitely weird, but can you think of a scenario?

@flying-sheep flying-sheep self-assigned this Feb 15, 2024
@ivirshup
Copy link
Member

You mean the fact that the index makes the dataframe now actually useful? I can’t think of a way in which this breaks things in a way that isn’t immediately obvious and welcome. Of course, code can be infinitely weird, but can you think of a scenario?

I get that it is more useful, but any code that was accessing it with .loc especially if it was relying on unique indices could run into a problem. It's minor, I'd be fine to leave it. Just we may need to revert.

@flying-sheep flying-sheep merged commit 102b4ef into master Feb 15, 2024
12 checks passed
@flying-sheep flying-sheep deleted the dask-hvg branch February 15, 2024 10:44
@flying-sheep
Copy link
Member Author

We have documented the columns returned by the data frame, not the index, so I’d say the documented behavior stayed the same

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants