Skip to content

Loophole with using scaled adata for sc.tl.score_genes_cell_cycle #3030

@kaushik-roy-physics

Description

@kaushik-roy-physics

I have a question regarding the recommendations provided in :

https://github.com/scverse/scanpy_usage/blob/master/180209_cell_cycle/cell_cycle.ipynb

to compute cell cycle scores using Scanpy. Specifically it mentions that the adata should be scaled to have zero mean and unit variance before using sc.tl.score_genes_cell_cycle. But when I look at the source code, particularly the part pasted below, I notice that unless the 'use_raw' option is set to True, obs_avg will be zero for the adata after scaling assuming gene_pool is not set.


_adata = adata.raw if use_raw else adata
_adata_subset = (
_adata[:, gene_pool] if len(gene_pool) < len(_adata.var_names) else _adata
)
# average expression of genes
if issparse(_adata_subset.X):
obs_avg = pd.Series(
np.array(_sparse_nanmean(_adata_subset.X, axis=0)).flatten(),
index=gene_pool,
)
else:
obs_avg = pd.Series(np.nanmean(_adata_subset.X, axis=0), index=gene_pool)

# Sometimes (and I don't know how) missing data may be there, with nansfor
obs_avg = obs_avg[np.isfinite(obs_avg)]

How is the selection of control_genes meaningful in this scenario? We will basically have, control_genes=min(len(s_genes, g2m_genes)) which is a randomly sampled subset of the total genes but this won't have this feature of selecting genes that are similar in expression levels to the marker gene list. I think we should find control_genes using the expression data in the raw adata, identify the control_genes using the binning procedure below and then calculate s_scores and g2m_scores for these genes using the scaled data if that is necessary (mean of these randomly selected genes in the scaled adata expression matrix).

I hope this makes sense. Any clarifications or confirmations would be really helpful.


n_items = int(np.round(len(obs_avg) / (n_bins - 1)))
obs_cut = obs_avg.rank(method="min") // n_items
control_genes = pd.Index([], dtype="string")

# now pick `ctrl_size` genes from every cut
for cut in np.unique(obs_cut.loc[gene_list]):
    r_genes: pd.Index[str] = obs_cut[obs_cut == cut].index
    if ctrl_size < len(r_genes):
        r_genes = r_genes.to_series().sample(ctrl_size).index
    control_genes = control_genes.union(r_genes.difference(gene_list))

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions