-
Notifications
You must be signed in to change notification settings - Fork 582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rank_genes_groups refactoring 2nd try #1156
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huge improvement! Thanks for going thought this! I've added a few questions and thoughts on how this would be more simple. Most of the points are nitpicky, and I think I would be happy with the restructuring as is.
One important point, did you address the correctness issues for the rank test?
For the point about treating the rest
and reference
cases equivalently, I thought the _basic_stats
method could have a block like:
if self.ireference:
mask_rest = self.groups_masks[self.ireference]
X_rest = self.X[mask_rest]
m, v = _get_mean_var(X_rest)
self.means_rest = np.broadcast_to(m, (n_groups, n_genes))
self.vars_rest = np.broadcast_to(v, (n_groups, n_genes))
So you wouldn't need to do checks in downstream functions.
On the structure of the code, I like to minimize the states an object can be in when possible. This just makes it easier to reason about what you can do with an instance. For example, I probably would have tried to write this so the test was like a config for how the test should be run. Then there would be a function which applies that to each pair of arrays.
Also, a personal request for this:
Could we change the default number of top genes to be the number of genes present?
# deleting the next line causes a memory leak for some reason | ||
del X_rest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you make a small example showing this memory leak? I think I've run into something similar recently.
scanpy/tools/_rank_genes_groups.py
Outdated
# deleting the next line causes a memory leak for some reason | ||
del X_rest | ||
|
||
yield imask, mask, mask_rest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make this an iterator if you're also saving the values to the object? I think it would be more simple to exclude the yield statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed yield.
|
||
self.ireference = None | ||
if reference != 'rest': | ||
self.ireference = np.where(self.groups_order == reference)[0][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having all these conditional blocks checking for ireference
, why not just fill the have the *_rest
values just be the reference?
I would probably also rename those attributes to *_other
too.
if len(self.groups_order) <= 2: | ||
break | ||
|
||
def compute_statistics(self, method, corr_method, n_genes_user, rankby_abs, **kwds): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could these arguments get some short documentation?
I was confused for a bit about when you would be using a correlation method 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the same parameters as in the main function, so the docs are there.
scanpy/tools/_rank_genes_groups.py
Outdated
self.groups_names.append(group_name) | ||
|
||
scores_sort = np.abs(scores) if rankby_abs else scores | ||
global_indices = _select_top_n(scores_sort, n_genes_user, self.X.shape[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about deprecating this feature? I don't think it makes sense to (by default) filter differentially expressed genes to any fixed n
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I second this. All genes should be reported by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
Hi, @ivirshup . Thanks for the review. I'll address the comments soon. About the reference thing. Yes, you are right, i can use this approach of course. However, i didn't want to store the same value multiple times, it doesn't make much difference in performance, but still gives uneasy feelings, i would say. upd: oh, i see that |
Thanks so much for the amazing work @Koncopd 🎊 I think it'd be more convenient to have the same type of "keys" regardless of the provided
keys in .uns['rank_genes_groups'], if I set a reference and
if I don't. Wouldn't it be better to replace all *_rest in the code with *_reference, and calculate them based on whatever the reference is, rest or a specific group? Then we can provide
without thinking about what the reference is. Seurat reports pct1 for the first group and pct2 for the second group, for example, which is nice and simple, IMO. Also People typically define DE genes using cutoffs like pval_adj < 0.05, log2fc>1, pct>0.1. Edit: Just noticed similar comments from @ivirshup, sorry about replicating :) |
Thank you all for this! In particular, @Koncopd! I know that this will cause a little more headache, but could we consider renaming to The function has its name with a But I don't see that function ever coming within the current Scanpy main API. So, it'd be nice to have a simple and short name. As backward compat will be broken to some degree anyway in 1.5, and a small converter for legacy Please slack me if you really want my feedback. Sorry that I'm so absent. 😔 |
Instead of having Meaning, there is a class for this besides the convenience function:
And if you just want a dataframe and not store something in
After this Within the convenience wrapper |
Would you not maybe do this in stages with a |
@falexwolf |
|
I personally find MultiIndexes a bit hard to work with. Could you show how they would be used here? For example, how would I just get a dataframe for the naive T cells vs. rest? I'm also not sure I get why we'd order genes by rank, when there are multiple comparisons in the table. What operations does this make easier? Most importantly, I don't think we have support for reading and writing multi indexes in anndata. An alternative would be to just have an entry in adata.uns[key_added] = {
"params": {
"groupby": "leiden",
"reference": "rest",
"test": "wilcoxon",
"rep": "X"
},
"results": {
"1": ..., # pd.DataFrame, with index of .var_names
"2": ..., #etc
},
} |
I second the suggestion by @falexwolf to rename the function to something simpler but also to keep the previous functionality with a Deprecate message as suggested by @LuckyMD. @Koncopd The changes also requires adapting the corresponding |
scanpy/tools/_rank_genes_groups.py
Outdated
|
||
|
||
class _RankGenes: | ||
def __init__(self, adata, groups, groupby, reference, use_raw, layer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could these values get documented? At the very least, what their types should be?
@Koncopd @falexwolf, I have a proposal for this PR. This PR makes a huge improvement to the internals of the differential expression code. Could we aim to limit the scope here to just improving the internals (i.e., no API introductions or changes)? Then we can introduce and change APIs in later PRs? My interest here is getting the improved internals onto master ASAP, so that other people can hack on it. This would allow correctness improvements to happen in parallel with API design. |
Sorry for the delay, i'll return to this soon. @ivirshup |
I've restored the original output format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. This is great!
Docs are built fine locally, i don't know why readthedocs fails. |
Related to 1), 2) of this.
Replaces #1081