add n_top_genes argument to rank_genes_groups_df #2145

Open
wants to merge 2 commits into
base: main

fbnrst (Contributor) commented Feb 18, 2022

This PR addresses https://scanpy.discourse.group/t/workflow-for-selecting-number-of-marker-genes-in-sc-queries-enrich/286

I wanted a simple interface for getting the top n marker genes. Right now, rank_genes_groups_df only allows thresholding on log fold change and p-value, but for marker genes in particular the p-value computation might not be statistically meaningful.

It adds the following kind of functionality:

import scanpy as sc

adata = sc.datasets.pbmc68k_reduced()
sc.tl.rank_genes_groups(adata, 'louvain')

print(sc.get.rank_genes_groups_df(adata, "1", n_top_genes=2))

The output is just the top 2 genes of the list:

    names     scores  logfoldchanges          pvals      pvals_adj
0  FCGR3A  47.682064        5.891937  3.275554e-141  3.579712e-139
1     FTL  45.653259        2.497682  9.003150e-208  6.887410e-205

It also works for multiple groups:

print(sc.get.rank_genes_groups_df(adata, None, n_top_genes=2))
   group    names     scores  logfoldchanges          pvals      pvals_adj
0      0     CD3D  26.250046        3.859759   4.379061e-75   2.233321e-73
1      0     LDHB  21.207499        2.134979   1.488480e-67   5.993089e-66
2      1   FCGR3A  47.682064        5.891937  3.275554e-141  3.579712e-139
3      1      FTL  45.653259        2.497682  9.003150e-208  6.887410e-205
4      2      LYZ  38.981312        5.096991  1.697105e-172  1.298285e-169
5      2     CST3  34.241749        4.388617  1.448193e-149  5.539337e-147
6      3     NKG7  34.214161        6.089183   2.356710e-55   2.575547e-53
7      3     CTSW  24.584066        5.091688   2.026294e-39   9.118324e-38
8      4    CD79A  52.583344        6.626956   4.032974e-84   7.713062e-82
9      4    CD79B  32.102913        4.990217   1.958507e-51   1.872822e-49
10     5      FTL  26.084383        1.844273   1.236398e-74   2.364611e-72
11     5     LST1  25.554073        3.170759   5.653851e-81   4.325196e-78
12     6      LYZ  31.497107        4.328516  9.041131e-106  6.916466e-103
13     6     CST3  23.850258        3.281016   2.491629e-83   9.530482e-81
14     7     CST3  33.024582        4.195395  5.768439e-136  4.412856e-133
15     7      LYZ  31.264187        4.267053  9.712334e-101   1.485987e-98
16     8     PPIB  39.260998        3.990153   7.159966e-47   3.651583e-45
17     8     MZB1  33.305500        8.979518   7.611322e-26   1.878278e-24
18     9    STMN1  27.133045        5.936039   4.998127e-18   8.312102e-17
19     9    HMGB2  15.229477        5.016804   3.184879e-12   4.060720e-11
20    10  HNRNPA1  18.405415        2.040915   1.570832e-12   1.560632e-11
21    10     NPM1  14.230449        2.183721   3.424469e-10   3.046185e-09
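
The per-group behaviour shown in the table can be sketched in plain Python. This is only an illustration of the selection logic, not the code in this PR: `top_n_per_group` is a hypothetical helper, it assumes the rows are already ranked within each group, and the third row of each group is a made-up placeholder added so that the truncation is visible.

```python
from collections import defaultdict


def top_n_per_group(rows, n_top_genes=None):
    """Keep the first `n_top_genes` rows of every group, preserving input order.

    Hypothetical helper for illustration; assumes `rows` is already ranked
    (best gene first) within each group.
    """
    if n_top_genes is None:
        return list(rows)  # no truncation requested
    seen = defaultdict(int)  # rows kept so far, per group
    kept = []
    for row in rows:
        if seen[row["group"]] < n_top_genes:
            kept.append(row)
            seen[row["group"]] += 1
    return kept


# First two rows per group are copied from the table above;
# the PLACEHOLDER rows are invented for this sketch.
rows = [
    {"group": "0", "names": "CD3D", "scores": 26.250046},
    {"group": "0", "names": "LDHB", "scores": 21.207499},
    {"group": "0", "names": "PLACEHOLDER_A", "scores": 1.0},
    {"group": "1", "names": "FCGR3A", "scores": 47.682064},
    {"group": "1", "names": "FTL", "scores": 45.653259},
    {"group": "1", "names": "PLACEHOLDER_B", "scores": 1.0},
]

top = top_n_per_group(rows, n_top_genes=2)
print([row["names"] for row in top])  # ['CD3D', 'LDHB', 'FCGR3A', 'FTL']
```

With a pandas DataFrame, the same idea would usually be written as a group-wise head, but the loop makes the assumption about pre-ranked rows explicit.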

This also extends to enrichment queries (this is what I wanted originally):

sc.queries.enrich(adata, "1", n_top_genes=10)

For enrichment queries, I added a note to the docstring that a p-value threshold of 0.05 is used. Previously, this was not obvious to me (and for cluster marker genes, this threshold might not always be sensible).
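
To make the difference between the two selection modes concrete, here is a rough sketch of how a gene list for an enrichment query could be chosen. It is not the implementation in this PR: `select_query_genes` is a hypothetical helper, and applying the p-value filter before the top-n truncation is an assumption of the sketch. The first two rows reuse values from the group "1" output above; the third row is invented.

```python
def select_query_genes(ranked_rows, pval_cutoff=0.05, n_top_genes=None):
    """Return the gene names to send to an enrichment query.

    Hypothetical helper for illustration. Assumes `ranked_rows` is sorted
    best-first and that the p-value filter runs before the top-n cut.
    """
    selected = [
        row for row in ranked_rows
        if pval_cutoff is None or row["pvals_adj"] < pval_cutoff
    ]
    if n_top_genes is not None:
        selected = selected[:n_top_genes]
    return [row["names"] for row in selected]


ranked_rows = [
    {"names": "FCGR3A", "pvals_adj": 3.579712e-139},
    {"names": "FTL", "pvals_adj": 6.887410e-205},
    {"names": "PLACEHOLDER_GENE", "pvals_adj": 0.2},  # invented row
]

# Threshold only: every gene below the cutoff is kept.
print(select_query_genes(ranked_rows))  # ['FCGR3A', 'FTL']

# Top n: the list length is bounded no matter how many genes pass.
print(select_query_genes(ranked_rows, n_top_genes=1))  # ['FCGR3A']
```

The point of the top-n mode is that the size of the query no longer depends on how many genes happen to pass the threshold.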

I haven't added anything to docs/release-notes/ yet; I first wanted to get your opinion. Is this useful, and what is still needed here?

codecov bot commented Feb 18, 2022

Codecov Report

Merging #2145 (1fc4c69) into master (b69015e) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #2145      +/-   ##
==========================================
- Coverage   71.43%   71.41%   -0.03%     
==========================================
  Files          92       92              
  Lines       11272    11274       +2     
==========================================
- Hits         8052     8051       -1     
- Misses       3220     3223       +3     
Impacted Files                        Coverage Δ
scanpy/queries/_queries.py            42.85% <ø> (ø)
scanpy/get/get.py                     92.98% <100.00%> (+0.08%) ⬆️
scanpy/plotting/_tools/__init__.py    76.09% <0.00%> (-0.55%) ⬇️
scanpy/plotting/_utils.py             54.33% <0.00%> (-0.20%) ⬇️
