Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split rank_genes_groups #1081

Closed
wants to merge 14 commits into from
Closed

Split rank_genes_groups #1081

wants to merge 14 commits into from

Conversation

Koncopd
Copy link
Member

@Koncopd Koncopd commented Mar 3, 2020

Splits rank_genes_groups into helper functions.
Related to 1), 2) of this.
I don't see any point in using dataframe internally (the second point), dict is more convenient here.

@falexwolf
Copy link
Member

Thank you! This looks great!

What do you think, @ivirshup?

Now, @Koncopd, can you also add a logg.warn("default of method has been changed to 't-test' from 't-test_overestim_var'")

And actually change the value?

Finally, can you test this on the Scanpy default tutorial and a a test based on the numerical values in addition to the image based tests? https://github.com/theislab/scanpy/blob/7e058a1a6a082e34a101d65fc7ac5e9cb6563220/scanpy/tests/notebooks/test_pbmc3k.py#L109-L111

Thank you very much!

Otherwise, this is good to be merged, IMO.

@Koncopd
Copy link
Member Author

Koncopd commented Mar 6, 2020

@falexwolf , i'll do.

@gokceneraslan
Copy link
Collaborator

Guys, now that you're dissecting rank_genes_groups, do you mind adding additional info to the results which is the fraction of cells expressing the genes (similar to what we have in dotplots) People calculate it manually ever time and it can be painful for those who are not familiar with pandas.

@Koncopd
Copy link
Member Author

Koncopd commented Mar 6, 2020

@gokceneraslan could you point me to code or an example?

@Koncopd
Copy link
Member Author

Koncopd commented Mar 6, 2020

Test fails seem unrelated to this PR.

@gokceneraslan
Copy link
Collaborator

From Seurat tutorial:

image

pct.1 and pct.2 are the ones I mentioned.

Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really needed doing, thanks for diving into it!

@Koncopd, how clean would you like to get this? I've noted a few initial things which I think would make this a bit more simple, but I'd like to hear your goals here. Are you interested in doing more with the differential expression tooling?

Comment on lines 103 to 106
if corr_method == 'benjamini-hochberg':
_, pvals_adj, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')
elif corr_method == 'bonferroni':
pvals_adj = np.minimum(pvals * n_genes, 1.0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing this in the testing function (i.e. _t_test, _wilcoxon, etc.), could this be done afterwards since it's the same regardless of test type?

This would also let us remove the corr_method argument from these functions.

Comment on lines 94 to 96
foldchanges = (expm1_func(mean_group) + 1e-9) / (
expm1_func(mean_rest) + 1e-9
) # add small value to remove 0's
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change for each testing function? If not, could it be separated?

Comment on lines 108 to 109
scores_sort = np.abs(scores) if rankby_abs else scores
global_indices = _select_top_n(scores_sort, n_genes_user, n_genes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, can these be split out from these functions?


if issparse(X):
merge = lambda tpl: vstack(tpl).todense()
adapt = lambda X: X.todense()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we avoid using np.matrix. Could these be X.toarray()?

@Koncopd
Copy link
Member Author

Koncopd commented Mar 9, 2020

@ivirshup thanks for catching these things, i'll fix them.
In addition to some splitting and cleaning i also want to improve wilcoxon implementation (avoid densification at least, i have some code for it). But this should be another step, i think.

@Koncopd
Copy link
Member Author

Koncopd commented Mar 9, 2020

@gokceneraslan i'll add these things.

) # add small value to remove 0's
if 'logfoldchanges' not in d:
d['logfoldchanges'] = {}
d['logfoldchanges'][group_name] = np.log2(foldchanges[global_indices])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet. Does this also produce logfoldchanges for logreg? sc.get.rank_genes_groups_df throws an exception now:

image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for catching this. I'll check why it happens.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry for misunderstanding. Error I posted is from stable scanpy, not from the PR. I was asking if it's fixed now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the problem with sc.get. logreg only has names and scores keys, but sc.get tries to query nonexistent 'logfoldchanges' and so on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should i add logfoldchanges for logreg?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to interrupt the PR, but I think logfoldchange and percentages should be available for every method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the percentage for the reference group be also calculated and stored?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. It might look unnecessary for one-vs-rest kind of tests but for clusterA-vs-clusterB kind of tests it'd be useful.

@Koncopd
Copy link
Member Author

Koncopd commented Mar 18, 2020

This is still very messy and very inefficient with this separate calculation of pts. I should work on it more to restructure completely...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants