rank_genes_groups refactoring 2nd try #1156

Koncopd · 2020-04-08T23:42:54Z

Related to 1), 2) of this.

Replaces #1081

ivirshup

Huge improvement! Thanks for going thought this! I've added a few questions and thoughts on how this would be more simple. Most of the points are nitpicky, and I think I would be happy with the restructuring as is.

One important point, did you address the correctness issues for the rank test?

For the point about treating the rest and reference cases equivalently, I thought the _basic_stats method could have a block like:

        if self.ireference:
            mask_rest = self.groups_masks[self.ireference]
            X_rest = self.X[mask_rest]
            m, v = _get_mean_var(X_rest)
            self.means_rest = np.broadcast_to(m, (n_groups, n_genes))
            self.vars_rest = np.broadcast_to(v, (n_groups, n_genes))

So you wouldn't need to do checks in downstream functions.

On the structure of the code, I like to minimize the states an object can be in when possible. This just makes it easier to reason about what you can do with an instance. For example, I probably would have tried to write this so the test was like a config for how the test should be run. Then there would be a function which applies that to each pair of arrays.

Also, a personal request for this:

Could we change the default number of top genes to be the number of genes present?

scanpy/tools/_rank_genes_groups.py

ivirshup · 2020-04-15T05:46:10Z

scanpy/tools/_rank_genes_groups.py

+                # deleting the next line causes a memory leak for some reason
+                del X_rest


Could you make a small example showing this memory leak? I think I've run into something similar recently.

ivirshup · 2020-04-15T05:56:24Z

scanpy/tools/_rank_genes_groups.py

+                # deleting the next line causes a memory leak for some reason
+                del X_rest
+
+            yield imask, mask, mask_rest


Why make this an iterator if you're also saving the values to the object? I think it would be more simple to exclude the yield statement.

Removed yield.

ivirshup · 2020-04-15T06:09:12Z

scanpy/tools/_rank_genes_groups.py

+
+        self.ireference = None
+        if reference != 'rest':
+            self.ireference = np.where(self.groups_order == reference)[0][0]


Instead of having all these conditional blocks checking for ireference, why not just fill the have the *_rest values just be the reference?

I would probably also rename those attributes to *_other too.

ivirshup · 2020-04-15T06:25:47Z

scanpy/tools/_rank_genes_groups.py

+            if len(self.groups_order) <= 2:
+                break
+
+    def compute_statistics(self, method, corr_method, n_genes_user, rankby_abs, **kwds):


Could these arguments get some short documentation?

I was confused for a bit about when you would be using a correlation method 😄

These are the same parameters as in the main function, so the docs are there.

scanpy/tools/_rank_genes_groups.py

ivirshup · 2020-04-15T06:49:46Z

scanpy/tools/_rank_genes_groups.py

+            self.groups_names.append(group_name)
+
+            scores_sort = np.abs(scores) if rankby_abs else scores
+            global_indices = _select_top_n(scores_sort, n_genes_user, self.X.shape[1])


What do you think about deprecating this feature? I don't think it makes sense to (by default) filter differentially expressed genes to any fixed n.

I second this. All genes should be reported by default.

Koncopd · 2020-04-16T13:34:16Z

Hi, @ivirshup . Thanks for the review. I'll address the comments soon.
No, this currently doesn't deal with correctness, tie correction and the other things in the issues, just general structure. But i'll definitely turn to them after this.

About the reference thing. Yes, you are right, i can use this approach of course. However, i didn't want to store the same value multiple times, it doesn't make much difference in performance, but still gives uneasy feelings, i would say.

upd: oh, i see that numpy.broadcast_to creates a view, so it should avoid the problem of storing the same value multiple times.

gokceneraslan · 2020-04-17T15:52:19Z

Thanks so much for the amazing work @Koncopd 🎊

I think it'd be more convenient to have the same type of "keys" regardless of the provided reference option in sc.tl.rank_genes_groups. Right now I get

['params', 'pts', 'scores', 'names', 'pvals', 'pvals_adj', 'logfoldchanges']

keys in .uns['rank_genes_groups'], if I set a reference and

['params', 'pts', 'pts_rest', 'scores', 'names', 'pvals', 'pvals_adj', 'logfoldchanges']

if I don't.

Wouldn't it be better to replace all *_rest in the code with *_reference, and calculate them based on whatever the reference is, rest or a specific group?

Then we can provide

['params', 'pts', 'pts_reference', 'scores', 'names', 'pvals', 'pvals_adj', 'logfoldchanges']

without thinking about what the reference is. Seurat reports pct1 for the first group and pct2 for the second group, for example, which is nice and simple, IMO.

Also sc.get.rank_genes_groups_df should return pct and pct_reference. It'd be also great to have pct cutoffs in the sc.get.rank_genes_groups_df function like pct_min.

People typically define DE genes using cutoffs like pval_adj < 0.05, log2fc>1, pct>0.1.

Edit: Just noticed similar comments from @ivirshup, sorry about replicating :)

falexwolf · 2020-04-20T11:16:54Z

Thank you all for this! In particular, @Koncopd!

I know that this will cause a little more headache, but could we consider renaming to rank_genes?

The function has its name with a _groups suffix from very early considerations about distinguishing it from a second function that would rank genes by fitting a model on time or pseudotime.

But I don't see that function ever coming within the current Scanpy main API.

So, it'd be nice to have a simple and short name.

As backward compat will be broken to some degree anyway in 1.5, and a small converter for legacy h5ad (in the scanpy read function - renaming .uns['rank_genes_groups'] to .uns['rank_genes'] and making entries a dataframe instead of structured arrays). It'd be a good point in time, now.

Please slack me if you really want my feedback. Sorry that I'm so absent. 😔

falexwolf · 2020-04-20T11:27:00Z

Instead of having self.d as the central attribute, why not having a central property self.stats and that's a DataFrame with a multi-index for the columns. The outer index is the group that you're comparing against a reference, the inner index loops over gene, score, pval, pval_adj, etc.

Meaning, there is a class for this besides the convenience function:

rg = sc.RankGenes()

And if you just want a dataframe and not store something in adata, you can do

rg.compute_stats(...)

After this rg.stats contains your results.

Within the convenience wrapper sc.tl.rank_genes, this stats attribute will be written to adata.

LuckyMD · 2020-04-20T12:38:11Z

I know that this will cause a little more headache, but could we consider renaming to rank_genes?

Would you not maybe do this in stages with a DeprecationWarning first for a couple of releases? This change would break nearly everyone's pipelines and published notebooks. It's not a lot of work to change... but it might warrant a longer warning time.

Koncopd · 2020-04-22T18:40:07Z

@falexwolf

Looks fine with miltindex dataframe. However a bit awkward with this names column when n_genes is set. I will set n_genes to all by default but do we need this at all? If there is no n_genes, then names can be moved to dataframe's index as they are the same.

Koncopd · 2020-04-23T21:36:57Z

adata.rename_categories doesn't work with dataframes in uns.

ivirshup · 2020-04-28T05:45:18Z

adata.rename_categories doesn't work with dataframes in uns.

What problem is this causing?
@falexwolf, why did this method get un-deprecated?

Looks fine with miltindex dataframe. However a bit awkward with this names column when n_genes is set. I will set n_genes to all by default but do we need this at all?

I personally find MultiIndexes a bit hard to work with. Could you show how they would be used here? For example, how would I just get a dataframe for the naive T cells vs. rest?

I'm also not sure I get why we'd order genes by rank, when there are multiple comparisons in the table. What operations does this make easier?

Most importantly, I don't think we have support for reading and writing multi indexes in anndata.

An alternative would be to just have an entry in uns that looked like:

adata.uns[key_added] = {
    "params": {
        "groupby": "leiden",
        "reference": "rest",
        "test": "wilcoxon",
        "rep": "X"
    },
    "results": {
        "1": ...,  # pd.DataFrame, with index of .var_names
        "2": ...,  #etc
    },
}

fidelram · 2020-05-12T15:55:14Z

I second the suggestion by @falexwolf to rename the function to something simpler but also to keep the previous functionality with a Deprecate message as suggested by @LuckyMD.

@Koncopd The changes also requires adapting the corresponding sc.pl.rank_genes_groups* functions. I can take over that once the PR is ready.

ivirshup · 2020-05-14T05:33:39Z

scanpy/tools/_rank_genes_groups.py

+
+
+class _RankGenes:
+    def __init__(self, adata, groups, groupby, reference, use_raw, layer):


Could these values get documented? At the very least, what their types should be?

ivirshup · 2020-05-14T10:25:54Z

@Koncopd @falexwolf, I have a proposal for this PR.

This PR makes a huge improvement to the internals of the differential expression code. Could we aim to limit the scope here to just improving the internals (i.e., no API introductions or changes)? Then we can introduce and change APIs in later PRs?

My interest here is getting the improved internals onto master ASAP, so that other people can hack on it. This would allow correctness improvements to happen in parallel with API design.

Koncopd · 2020-05-14T10:34:14Z

Sorry for the delay, i'll return to this soon.

@ivirshup
I have nothing against this suggestion, actually this was my original plan.

Koncopd · 2020-06-05T21:47:27Z

I've restored the original output format.
However, there are still a few changes to the api.
if pts=True (default False) the function also computes the fraction of cells expressing the genes for each group, and if reference='rest' then pts_rest is added with fraction of cells expressing the genes for the complement of each group.
n_genes is set to all genes by default.

falexwolf

Thank you. This is great!

Koncopd · 2020-06-10T09:54:09Z

Docs are built fine locally, i don't know why readthedocs fails.

Koncopd requested review from falexwolf and ivirshup April 9, 2020 11:07

ivirshup reviewed Apr 15, 2020

View reviewed changes

ivirshup reviewed May 14, 2020

View reviewed changes

fidelram mentioned this pull request May 19, 2020

filter_rank_genes_groups condition should be OR instead of AND #1213

Closed

Koncopd force-pushed the rgg_refactor2 branch from 0973359 to 78d3066 Compare June 5, 2020 21:27

falexwolf approved these changes Jun 8, 2020

View reviewed changes

Koncopd added 10 commits June 16, 2020 16:09

add class

b80592a

pts_rest

3856fbc

format

1c2da07

fix groups_order

4a282ec

codacy

03798e8

use multindex df and n_genes=all by default

50e1338

small fixes

414f7cf

don't compute pts by default

a0e6562

clean and fix

cd2750b

no yield in _basic_stats

77ae3c1

Koncopd added 5 commits June 16, 2020 16:09

standard output

2e490fa

format

3a57166

docs

155c308

eliminate zeros for sparse

c01917e

clean n_genes defaults

9a41c1a

Koncopd force-pushed the rgg_refactor2 branch from bf5d960 to 9a41c1a Compare June 16, 2020 14:10

rm ns

0c5381e

Koncopd merged commit 8384401 into master Jun 16, 2020

fidelram added a commit that referenced this pull request Jun 24, 2020

update rank_genes_groups images after #1156

943bc03

ivirshup mentioned this pull request Jul 26, 2021

diffxpy integration #1955

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rank_genes_groups refactoring 2nd try #1156

rank_genes_groups refactoring 2nd try #1156

Koncopd commented Apr 8, 2020

ivirshup left a comment •

edited

Loading

ivirshup Apr 15, 2020

ivirshup Apr 15, 2020

Koncopd Jun 5, 2020

ivirshup Apr 15, 2020

ivirshup Apr 15, 2020

Koncopd Jun 5, 2020

ivirshup Apr 15, 2020

fidelram Apr 17, 2020

Koncopd Apr 24, 2020

Koncopd commented Apr 16, 2020 •

edited

Loading

gokceneraslan commented Apr 17, 2020 •

edited

Loading

falexwolf commented Apr 20, 2020

falexwolf commented Apr 20, 2020

LuckyMD commented Apr 20, 2020

Koncopd commented Apr 22, 2020

Koncopd commented Apr 23, 2020

ivirshup commented Apr 28, 2020

fidelram commented May 12, 2020

ivirshup May 14, 2020

ivirshup commented May 14, 2020

Koncopd commented May 14, 2020 •

edited

Loading

Koncopd commented Jun 5, 2020

falexwolf left a comment

Koncopd commented Jun 10, 2020

		# deleting the next line causes a memory leak for some reason
		del X_rest



		class _RankGenes:
		def __init__(self, adata, groups, groupby, reference, use_raw, layer):

rank_genes_groups refactoring 2nd try #1156

rank_genes_groups refactoring 2nd try #1156

Conversation

Koncopd commented Apr 8, 2020

ivirshup left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Koncopd commented Apr 16, 2020 • edited Loading

gokceneraslan commented Apr 17, 2020 • edited Loading

falexwolf commented Apr 20, 2020

falexwolf commented Apr 20, 2020

LuckyMD commented Apr 20, 2020

Koncopd commented Apr 22, 2020

Koncopd commented Apr 23, 2020

ivirshup commented Apr 28, 2020

fidelram commented May 12, 2020

Choose a reason for hiding this comment

ivirshup commented May 14, 2020

Koncopd commented May 14, 2020 • edited Loading

Koncopd commented Jun 5, 2020

falexwolf left a comment

Choose a reason for hiding this comment

Koncopd commented Jun 10, 2020

ivirshup left a comment •

edited

Loading

Koncopd commented Apr 16, 2020 •

edited

Loading

gokceneraslan commented Apr 17, 2020 •

edited

Loading

Koncopd commented May 14, 2020 •

edited

Loading