Rethink group IDs in rank_genes_groups

`rank_genes_groups` “returns” two recarrays, each of shape #cells×#groups. One of them stores gene IDs, the other the genes’ scores.

The problem with this is that recarrays store their column index (the field names) in the dtype, a place where only strings are accepted. However, users (and indeed both our wilcoxon example and the tests) may choose to use numeric group IDs.
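A minimal sketch of the limitation, using plain NumPy rather than anything from scanpy itself: integer field names are rejected by the dtype constructor, so numeric group IDs only fit after being cast to strings. The group IDs and array length here are made up for illustration.

```python
import numpy as np

group_ids = [0, 1, 2]  # hypothetical numeric group IDs

# Integer field names are not accepted in a structured dtype.
try:
    np.dtype([(g, "f8") for g in group_ids])
    int_names_ok = True
except (TypeError, ValueError):
    int_names_ok = False

# The workaround: stringify the IDs, which loses their original type.
scores = np.zeros(4, dtype=[(str(g), "f8") for g in group_ids]).view(np.recarray)
print(int_names_ok, scores.dtype.names)
```

Anyone reading the result back has to remember to convert `'0'`, `'1'`, `'2'` to integers again, which is the round-trip this issue would like to avoid.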

Genes with score 0 are unimportant anyway, so maybe we should return sparse data, as a long-form recarray with something like this shape (with `<group_by>` being the `rank_genes_groups` parameter of the same name):

obs | var | `<group_by>` | score
-- | -- | -- | --
0 | ENSGXXXX | 5 | 9.728
… | … | … | …

This way the three IDs can have user-defined types, and the data is easier to process via e.g. `pd.DataFrame.from_records(adata.obs['gene_ranking'])`
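As a rough sketch of the idea (not the actual scanpy implementation), the wide arrays could be flattened into a long-form recarray that skips zero scores and keeps the group IDs as integers. The column name `louvain`, the gene labels, and the scores are invented for the example, and the `obs` column from the table is left out because this toy input has nothing to fill it with.

```python
import numpy as np
import pandas as pd

group_by = "louvain"  # hypothetical value of the group_by parameter
group_ids = [0, 5]    # numeric group IDs, as a user might supply them

# Wide form as it exists today: field names have to be strings.
names = np.array(
    [("gene_a", "gene_c"), ("gene_b", "gene_a")],
    dtype=[(str(g), "U16") for g in group_ids],
)
scores = np.array(
    [(9.728, 4.2), (0.0, 1.1)],
    dtype=[(str(g), "f8") for g in group_ids],
)

# Long form: one row per (gene, group) pair with a non-zero score.
rows = [
    (names[str(g)][i], g, scores[str(g)][i])
    for g in group_ids
    for i in range(len(scores))
    if scores[str(g)][i] != 0
]
long_dtype = [("var", "U16"), (group_by, "i8"), ("score", "f8")]
gene_ranking = np.array(rows, dtype=long_dtype).view(np.recarray)

# The group column keeps its integer type after conversion.
df = pd.DataFrame.from_records(gene_ranking)
print(df)
```

Here the group ID lives in a regular column, so its dtype is whatever the user chose instead of being forced into a string field name.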

The data should probably be sorted by descending z-scores within each group, i.e. if it was a DataFrame: `return gene_ranking.sort_values([group_by, 'score'], ascending=[True, False])`
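The intended ordering can be sketched with a two-key sort, since pandas `GroupBy` objects do not offer `sort_values`. The frame below is toy data, and `louvain` is only an assumed value for `group_by`.

```python
import pandas as pd

group_by = "louvain"  # hypothetical value of the group_by parameter

gene_ranking = pd.DataFrame(
    {
        "var": ["gene_a", "gene_b", "gene_c", "gene_d"],
        group_by: [1, 0, 1, 0],
        "score": [2.0, 0.5, 3.5, 1.5],
    }
)

# Group IDs ascending, scores descending within each group.
ranked = gene_ranking.sort_values(
    [group_by, "score"], ascending=[True, False]
).reset_index(drop=True)
print(ranked)
```

With this layout the top genes of a group are simply the first rows of that group's block.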
  

Rethink group IDs in rank_genes_groups #61
