Loophole with using scaled adata for sc.tl.score_genes_cell_cycle

I have a question regarding the recommendations provided in : 

https://github.com/scverse/scanpy_usage/blob/master/180209_cell_cycle/cell_cycle.ipynb

to compute cell cycle scores using Scanpy. Specifically it mentions that the adata should be scaled to have zero mean and unit variance before using sc.tl.score_genes_cell_cycle. But when I look at the source code, particularly the part pasted below, I notice that unless the 'use_raw' option is set to True, obs_avg will be zero for the adata after scaling assuming gene_pool is not set.
___________________________________________________________________________________________
 _adata = adata.raw if use_raw else adata
    _adata_subset = (
        _adata[:, gene_pool] if len(gene_pool) < len(_adata.var_names) else _adata
    )
    # average expression of genes
    if issparse(_adata_subset.X):
        obs_avg = pd.Series(
            np.array(_sparse_nanmean(_adata_subset.X, axis=0)).flatten(),
            index=gene_pool,
        )
    else:
        obs_avg = pd.Series(np.nanmean(_adata_subset.X, axis=0), index=gene_pool)

    # Sometimes (and I don't know how) missing data may be there, with nansfor
    obs_avg = obs_avg[np.isfinite(obs_avg)]

_____________________________________________________________________________________________

How is the selection of control_genes meaningful in this scenario? We will basically have, control_genes=min(len(s_genes, g2m_genes)) which is a randomly sampled subset of the total genes but this won't have this feature of selecting genes that are similar in expression levels to the marker gene list. I think we should find control_genes using the expression data in the raw adata, identify the control_genes using the binning procedure below and then calculate s_scores and g2m_scores for these genes using the scaled data if that is necessary (mean of these randomly selected genes in the scaled adata expression matrix).

I hope this makes sense. Any clarifications or confirmations would be really helpful. 

_____________________________________________________________________________________________
n_items = int(np.round(len(obs_avg) / (n_bins - 1)))
    obs_cut = obs_avg.rank(method="min") // n_items
    control_genes = pd.Index([], dtype="string")

    # now pick `ctrl_size` genes from every cut
    for cut in np.unique(obs_cut.loc[gene_list]):
        r_genes: pd.Index[str] = obs_cut[obs_cut == cut].index
        if ctrl_size < len(r_genes):
            r_genes = r_genes.to_series().sample(ctrl_size).index
        control_genes = control_genes.union(r_genes.difference(gene_list))



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Loophole with using scaled adata for sc.tl.score_genes_cell_cycle #3030

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Loophole with using scaled adata for sc.tl.score_genes_cell_cycle #3030

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions