Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rank_genes_groups refactoring 2nd try #1156

Merged
merged 16 commits into from
Jun 16, 2020
Merged

rank_genes_groups refactoring 2nd try #1156

merged 16 commits into from
Jun 16, 2020

Conversation

Koncopd
Copy link
Member

@Koncopd Koncopd commented Apr 8, 2020

Related to 1), 2) of this.

Replaces #1081

Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huge improvement! Thanks for going thought this! I've added a few questions and thoughts on how this would be more simple. Most of the points are nitpicky, and I think I would be happy with the restructuring as is.

One important point, did you address the correctness issues for the rank test?


For the point about treating the rest and reference cases equivalently, I thought the _basic_stats method could have a block like:

        if self.ireference:
            mask_rest = self.groups_masks[self.ireference]
            X_rest = self.X[mask_rest]
            m, v = _get_mean_var(X_rest)
            self.means_rest = np.broadcast_to(m, (n_groups, n_genes))
            self.vars_rest = np.broadcast_to(v, (n_groups, n_genes))

So you wouldn't need to do checks in downstream functions.


On the structure of the code, I like to minimize the states an object can be in when possible. This just makes it easier to reason about what you can do with an instance. For example, I probably would have tried to write this so the test was like a config for how the test should be run. Then there would be a function which applies that to each pair of arrays.


Also, a personal request for this:

Could we change the default number of top genes to be the number of genes present?

scanpy/tools/_rank_genes_groups.py Outdated Show resolved Hide resolved
scanpy/tools/_rank_genes_groups.py Outdated Show resolved Hide resolved
Comment on lines +164 to +172
# deleting the next line causes a memory leak for some reason
del X_rest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make a small example showing this memory leak? I think I've run into something similar recently.

# deleting the next line causes a memory leak for some reason
del X_rest

yield imask, mask, mask_rest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make this an iterator if you're also saving the values to the object? I think it would be more simple to exclude the yield statement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed yield.


self.ireference = None
if reference != 'rest':
self.ireference = np.where(self.groups_order == reference)[0][0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having all these conditional blocks checking for ireference, why not just fill the have the *_rest values just be the reference?

I would probably also rename those attributes to *_other too.

if len(self.groups_order) <= 2:
break

def compute_statistics(self, method, corr_method, n_genes_user, rankby_abs, **kwds):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these arguments get some short documentation?

I was confused for a bit about when you would be using a correlation method 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the same parameters as in the main function, so the docs are there.

scanpy/tools/_rank_genes_groups.py Outdated Show resolved Hide resolved
self.groups_names.append(group_name)

scores_sort = np.abs(scores) if rankby_abs else scores
global_indices = _select_top_n(scores_sort, n_genes_user, self.X.shape[1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about deprecating this feature? I don't think it makes sense to (by default) filter differentially expressed genes to any fixed n.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I second this. All genes should be reported by default.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@Koncopd
Copy link
Member Author

Koncopd commented Apr 16, 2020

Hi, @ivirshup . Thanks for the review. I'll address the comments soon.
No, this currently doesn't deal with correctness, tie correction and the other things in the issues, just general structure. But i'll definitely turn to them after this.

About the reference thing. Yes, you are right, i can use this approach of course. However, i didn't want to store the same value multiple times, it doesn't make much difference in performance, but still gives uneasy feelings, i would say.

upd: oh, i see that numpy.broadcast_to creates a view, so it should avoid the problem of storing the same value multiple times.

@gokceneraslan
Copy link
Collaborator

gokceneraslan commented Apr 17, 2020

Thanks so much for the amazing work @Koncopd 🎊

I think it'd be more convenient to have the same type of "keys" regardless of the provided reference option in sc.tl.rank_genes_groups. Right now I get

['params', 'pts', 'scores', 'names', 'pvals', 'pvals_adj', 'logfoldchanges']

keys in .uns['rank_genes_groups'], if I set a reference and

['params', 'pts', 'pts_rest', 'scores', 'names', 'pvals', 'pvals_adj', 'logfoldchanges']

if I don't.

Wouldn't it be better to replace all *_rest in the code with *_reference, and calculate them based on whatever the reference is, rest or a specific group?

Then we can provide

['params', 'pts', 'pts_reference', 'scores', 'names', 'pvals', 'pvals_adj', 'logfoldchanges']

without thinking about what the reference is. Seurat reports pct1 for the first group and pct2 for the second group, for example, which is nice and simple, IMO.

Also sc.get.rank_genes_groups_df should return pct and pct_reference. It'd be also great to have pct cutoffs in the sc.get.rank_genes_groups_df function like pct_min.

People typically define DE genes using cutoffs like pval_adj < 0.05, log2fc>1, pct>0.1.

Edit: Just noticed similar comments from @ivirshup, sorry about replicating :)

@falexwolf
Copy link
Member

Thank you all for this! In particular, @Koncopd!

I know that this will cause a little more headache, but could we consider renaming to rank_genes?

The function has its name with a _groups suffix from very early considerations about distinguishing it from a second function that would rank genes by fitting a model on time or pseudotime.

But I don't see that function ever coming within the current Scanpy main API.

So, it'd be nice to have a simple and short name.

As backward compat will be broken to some degree anyway in 1.5, and a small converter for legacy h5ad (in the scanpy read function - renaming .uns['rank_genes_groups'] to .uns['rank_genes'] and making entries a dataframe instead of structured arrays). It'd be a good point in time, now.

Please slack me if you really want my feedback. Sorry that I'm so absent. 😔

@falexwolf
Copy link
Member

Instead of having self.d as the central attribute, why not having a central property self.stats and that's a DataFrame with a multi-index for the columns. The outer index is the group that you're comparing against a reference, the inner index loops over gene, score, pval, pval_adj, etc.

Meaning, there is a class for this besides the convenience function:

rg = sc.RankGenes()

And if you just want a dataframe and not store something in adata, you can do

rg.compute_stats(...)

After this rg.stats contains your results.

Within the convenience wrapper sc.tl.rank_genes, this stats attribute will be written to adata.

@LuckyMD
Copy link
Contributor

LuckyMD commented Apr 20, 2020

I know that this will cause a little more headache, but could we consider renaming to rank_genes?

Would you not maybe do this in stages with a DeprecationWarning first for a couple of releases? This change would break nearly everyone's pipelines and published notebooks. It's not a lot of work to change... but it might warrant a longer warning time.

@Koncopd
Copy link
Member Author

Koncopd commented Apr 22, 2020

@falexwolf
image
Looks fine with miltindex dataframe. However a bit awkward with this names column when n_genes is set. I will set n_genes to all by default but do we need this at all? If there is no n_genes, then names can be moved to dataframe's index as they are the same.

@Koncopd
Copy link
Member Author

Koncopd commented Apr 23, 2020

adata.rename_categories doesn't work with dataframes in uns.

@ivirshup
Copy link
Member

adata.rename_categories doesn't work with dataframes in uns.

  • What problem is this causing?
  • @falexwolf, why did this method get un-deprecated?

Looks fine with miltindex dataframe. However a bit awkward with this names column when n_genes is set. I will set n_genes to all by default but do we need this at all?

I personally find MultiIndexes a bit hard to work with. Could you show how they would be used here? For example, how would I just get a dataframe for the naive T cells vs. rest?

I'm also not sure I get why we'd order genes by rank, when there are multiple comparisons in the table. What operations does this make easier?

Most importantly, I don't think we have support for reading and writing multi indexes in anndata.

An alternative would be to just have an entry in uns that looked like:

adata.uns[key_added] = {
    "params": {
        "groupby": "leiden",
        "reference": "rest",
        "test": "wilcoxon",
        "rep": "X"
    },
    "results": {
        "1": ...,  # pd.DataFrame, with index of .var_names
        "2": ...,  #etc
    },
}

@fidelram
Copy link
Collaborator

I second the suggestion by @falexwolf to rename the function to something simpler but also to keep the previous functionality with a Deprecate message as suggested by @LuckyMD.

@Koncopd The changes also requires adapting the corresponding sc.pl.rank_genes_groups* functions. I can take over that once the PR is ready.



class _RankGenes:
def __init__(self, adata, groups, groupby, reference, use_raw, layer):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these values get documented? At the very least, what their types should be?

@ivirshup
Copy link
Member

@Koncopd @falexwolf, I have a proposal for this PR.

This PR makes a huge improvement to the internals of the differential expression code. Could we aim to limit the scope here to just improving the internals (i.e., no API introductions or changes)? Then we can introduce and change APIs in later PRs?

My interest here is getting the improved internals onto master ASAP, so that other people can hack on it. This would allow correctness improvements to happen in parallel with API design.

@Koncopd
Copy link
Member Author

Koncopd commented May 14, 2020

Sorry for the delay, i'll return to this soon.

@ivirshup
I have nothing against this suggestion, actually this was my original plan.

@Koncopd
Copy link
Member Author

Koncopd commented Jun 5, 2020

I've restored the original output format.
However, there are still a few changes to the api.
if pts=True (default False) the function also computes the fraction of cells expressing the genes for each group, and if reference='rest' then pts_rest is added with fraction of cells expressing the genes for the complement of each group.
n_genes is set to all genes by default.

Copy link
Member

@falexwolf falexwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. This is great!

@Koncopd
Copy link
Member Author

Koncopd commented Jun 10, 2020

Docs are built fine locally, i don't know why readthedocs fails.

@Koncopd Koncopd merged commit 8384401 into master Jun 16, 2020
fidelram added a commit that referenced this pull request Jun 24, 2020
@ivirshup ivirshup mentioned this pull request Jul 26, 2021
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants