Description
rank_genes_groups “returns” two recarrays, each with the shape #cells×#groups. one of them stores gene IDs, one the genes’ scores.
the problem with this is that recarrays store their column index (names) in the dtype, in a place where only strings are accepted. however users (and indeed both our wilcoxon example and the tests) may choose to use numeric group IDs.
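a minimal sketch of the limitation, using plain NumPy rather than scanpy itself (the `scores` values and `group_ids` are made up for illustration): integer field names are rejected by the dtype, so a wide recarray can only carry numeric group IDs after they have been converted to strings.

```python
import numpy as np

# Hypothetical data: 2 rows of scores for 2 groups with numeric IDs.
scores = np.array([[9.7, 3.1], [4.2, 2.8]])
group_ids = [0, 1]

# Structured dtypes only accept string field names, so this fails.
try:
    np.dtype([(g, "f8") for g in group_ids])
    int_names_ok = True
except (TypeError, ValueError):
    int_names_ok = False

# The wide layout only works once the group IDs are stringified,
# which silently changes their type from int to str.
wide = np.rec.fromarrays(list(scores.T), names=[str(g) for g in group_ids])
print(int_names_ok, wide.dtype.names)
```

after the round trip the group IDs come back as `'0'` and `'1'`, not `0` and `1`, so code that looks groups up by their original numeric ID no longer matches.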
genes with score 0 are unimportant anyway, so maybe we should return sparse data, in the form of a long-form recarray with something like this shape (with <group_by> being the rank_genes_groups parameter of the same name):
| obs | var | <group_by> | score |
|---|---|---|---|
| 0 | ENSGXXXX | 5 | 9.728 |
| … | … | … | … |
This way the three IDs can have user-defined types, and the data is easier to process via e.g. pd.DataFrame.from_records(adata.obs['gene_ranking'])
The data should probably be sorted by descending z-scores within each group, i.e. if it was a DataFrame: return gene_ranking.sort_values([group_by, 'score'], ascending=[True, False])
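a rough sketch of what the proposed long-form layout could look like, assuming a `group_by` key called `"louvain"` and placeholder gene names and scores (all hypothetical, not output of rank_genes_groups):

```python
import numpy as np
import pandas as pd

group_by = "louvain"  # hypothetical value of the group_by parameter

# One record per (obs, var, group) entry; the group column keeps its int type.
dtype = [("obs", "i8"), ("var", "U15"), (group_by, "i8"), ("score", "f8")]
records = np.array(
    [
        (0, "gene_a", 5, 9.728),
        (1, "gene_b", 5, 2.5),
        (2, "gene_c", 3, 7.1),
        (3, "gene_d", 3, 8.4),
    ],
    dtype=dtype,
).view(np.recarray)

# Column names are taken from the record dtype.
gene_ranking = pd.DataFrame.from_records(records)

# Sort by group, then by descending score within each group.
gene_ranking = gene_ranking.sort_values(
    [group_by, "score"], ascending=[True, False]
).reset_index(drop=True)
print(gene_ranking)
```

here the group IDs stay integers in both the recarray and the DataFrame, and each group's highest-scoring gene comes first.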