add n_top_genes argument to rank_genes_groups_df #2145

fbnrst · 2022-02-18T17:12:54Z

This PR addresses https://scanpy.discourse.group/t/workflow-for-selecting-number-of-marker-genes-in-sc-queries-enrich/286

I wanted a simple interface for getting the top n marker genes. Right now, rank_genes_groups_df only allows thresholding on logfc and pval, but for marker genes in particular the pval computation might not be statistically meaningful.

This PR adds the following functionality:

import scanpy as sc

adata = sc.datasets.pbmc68k_reduced()
sc.tl.rank_genes_groups(adata, 'louvain')

print(sc.get.rank_genes_groups_df(adata, "1", n_top_genes=2))

The output is just the top 2 genes of the list:

    names     scores  logfoldchanges          pvals      pvals_adj
0  FCGR3A  47.682064        5.891937  3.275554e-141  3.579712e-139
1     FTL  45.653259        2.497682  9.003150e-208  6.887410e-205

It also works for multiple groups:

print(sc.get.rank_genes_groups_df(adata, None, n_top_genes=2))

   group    names     scores  logfoldchanges          pvals      pvals_adj
0      0     CD3D  26.250046        3.859759   4.379061e-75   2.233321e-73
1      0     LDHB  21.207499        2.134979   1.488480e-67   5.993089e-66
2      1   FCGR3A  47.682064        5.891937  3.275554e-141  3.579712e-139
3      1      FTL  45.653259        2.497682  9.003150e-208  6.887410e-205
4      2      LYZ  38.981312        5.096991  1.697105e-172  1.298285e-169
5      2     CST3  34.241749        4.388617  1.448193e-149  5.539337e-147
6      3     NKG7  34.214161        6.089183   2.356710e-55   2.575547e-53
7      3     CTSW  24.584066        5.091688   2.026294e-39   9.118324e-38
8      4    CD79A  52.583344        6.626956   4.032974e-84   7.713062e-82
9      4    CD79B  32.102913        4.990217   1.958507e-51   1.872822e-49
10     5      FTL  26.084383        1.844273   1.236398e-74   2.364611e-72
11     5     LST1  25.554073        3.170759   5.653851e-81   4.325196e-78
12     6      LYZ  31.497107        4.328516  9.041131e-106  6.916466e-103
13     6     CST3  23.850258        3.281016   2.491629e-83   9.530482e-81
14     7     CST3  33.024582        4.195395  5.768439e-136  4.412856e-133
15     7      LYZ  31.264187        4.267053  9.712334e-101   1.485987e-98
16     8     PPIB  39.260998        3.990153   7.159966e-47   3.651583e-45
17     8     MZB1  33.305500        8.979518   7.611322e-26   1.878278e-24
18     9    STMN1  27.133045        5.936039   4.998127e-18   8.312102e-17
19     9    HMGB2  15.229477        5.016804   3.184879e-12   4.060720e-11
20    10  HNRNPA1  18.405415        2.040915   1.570832e-12   1.560632e-11
21    10     NPM1  14.230449        2.183721   3.424469e-10   3.046185e-09
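
To illustrate the per-group behaviour shown in the table above, here is a minimal pandas-only sketch of top-n selection. It is not the code from this PR: `top_n_per_group` is a hypothetical helper, the column names are assumed to match the output above, and `GENE_A` / `GENE_B` are made-up placeholder genes with invented scores.

```python
import pandas as pd


def top_n_per_group(df, n_top_genes, group_col="group", score_col="scores"):
    """Hypothetical helper: keep the n highest-scoring rows of each group."""
    # Sort by group, then by descending score within each group.
    ranked = df.sort_values([group_col, score_col], ascending=[True, False])
    # head(n) on a groupby keeps the first n rows of every group.
    return ranked.groupby(group_col, sort=False).head(n_top_genes).reset_index(drop=True)


# Toy data: real gene names are taken from the table above (scores rounded);
# GENE_A and GENE_B are placeholders and not part of the dataset.
toy = pd.DataFrame(
    {
        "group": ["0", "1", "0", "1", "0", "1"],
        "names": ["GENE_A", "FTL", "CD3D", "GENE_B", "LDHB", "FCGR3A"],
        "scores": [5.0, 45.65, 26.25, 3.0, 21.21, 47.68],
    }
)

top2 = top_n_per_group(toy, n_top_genes=2)
print(top2)
```

Each group keeps only its two best-scoring genes, regardless of their p-values.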

This also extends to enrichment queries (this is what I wanted originally):

sc.queries.enrich(adata, "1", n_top_genes=10)

For enrichment queries, I added a note to the docstring that a pval threshold of 0.05 is used. Previously, this was not obvious to me (and for cluster marker genes, this threshold might not always be sensible).
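
The difference between a p-value cutoff and a top-n cutoff can be sketched on a toy table. This is only an illustration under assumptions, not scanpy's implementation: `select_genes` is a hypothetical helper, the rows are assumed to be pre-sorted by score, and `GENE_X` / `GENE_Y` with their p-values are invented.

```python
import pandas as pd


def select_genes(df, pval_cutoff=None, n_top_genes=None):
    """Hypothetical helper: filter a ranked gene table, then return gene names."""
    out = df
    if pval_cutoff is not None:
        # Keep only genes whose adjusted p-value is below the cutoff.
        out = out[out["pvals_adj"] < pval_cutoff]
    if n_top_genes is not None:
        # Rows are assumed to be ranked already, so head() gives the top n.
        out = out.head(n_top_genes)
    return out["names"].tolist()


# FCGR3A and FTL use the adjusted p-values from the example output above;
# GENE_X and GENE_Y are placeholders with invented values.
ranked = pd.DataFrame(
    {
        "names": ["FCGR3A", "FTL", "GENE_X", "GENE_Y"],
        "pvals_adj": [3.579712e-139, 6.887410e-205, 1e-3, 0.2],
    }
)

by_pval = select_genes(ranked, pval_cutoff=0.05)
by_rank = select_genes(ranked, n_top_genes=2)
print(by_pval)
print(by_rank)
```

With a 0.05 cutoff the query list grows with every gene that passes the test, while a top-n cutoff fixes its size in advance.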

I haven't added anything to docs/release-notes/ yet; I first wanted to get your opinion. Is this useful, and what is still needed here?

codecov · 2022-02-18T17:27:50Z

Codecov Report

Merging #2145 (1fc4c69) into master (b69015e) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #2145      +/-   ##
==========================================
- Coverage   71.43%   71.41%   -0.03%     
==========================================
  Files          92       92              
  Lines       11272    11274       +2     
==========================================
- Hits         8052     8051       -1     
- Misses       3220     3223       +3

Impacted Files	Coverage Δ
scanpy/queries/_queries.py	`42.85% <ø> (ø)`
scanpy/get/get.py	`92.98% <100.00%> (+0.08%)`	⬆️
scanpy/plotting/_tools/__init__.py	`76.09% <0.00%> (-0.55%)`	⬇️
scanpy/plotting/_utils.py	`54.33% <0.00%> (-0.20%)`	⬇️

fbnrst added 2 commits February 18, 2022 17:56

add n_top_genes argument to rank_genes_groups_df

74ef15d

pre-commit corrections

1fc4c69
