-
Notifications
You must be signed in to change notification settings - Fork 709
Description
I have a question regarding the recommendations provided in :
https://github.com/scverse/scanpy_usage/blob/master/180209_cell_cycle/cell_cycle.ipynb
to compute cell cycle scores using Scanpy. Specifically it mentions that the adata should be scaled to have zero mean and unit variance before using sc.tl.score_genes_cell_cycle. But when I look at the source code, particularly the part pasted below, I notice that unless the 'use_raw' option is set to True, obs_avg will be zero for the adata after scaling assuming gene_pool is not set.
_adata = adata.raw if use_raw else adata
_adata_subset = (
_adata[:, gene_pool] if len(gene_pool) < len(_adata.var_names) else _adata
)
# average expression of genes
if issparse(_adata_subset.X):
obs_avg = pd.Series(
np.array(_sparse_nanmean(_adata_subset.X, axis=0)).flatten(),
index=gene_pool,
)
else:
obs_avg = pd.Series(np.nanmean(_adata_subset.X, axis=0), index=gene_pool)
# Sometimes (and I don't know how) missing data may be there, with nansfor
obs_avg = obs_avg[np.isfinite(obs_avg)]
How is the selection of control_genes meaningful in this scenario? We will basically have, control_genes=min(len(s_genes, g2m_genes)) which is a randomly sampled subset of the total genes but this won't have this feature of selecting genes that are similar in expression levels to the marker gene list. I think we should find control_genes using the expression data in the raw adata, identify the control_genes using the binning procedure below and then calculate s_scores and g2m_scores for these genes using the scaled data if that is necessary (mean of these randomly selected genes in the scaled adata expression matrix).
I hope this makes sense. Any clarifications or confirmations would be really helpful.
n_items = int(np.round(len(obs_avg) / (n_bins - 1)))
obs_cut = obs_avg.rank(method="min") // n_items
control_genes = pd.Index([], dtype="string")
# now pick `ctrl_size` genes from every cut
for cut in np.unique(obs_cut.loc[gene_list]):
r_genes: pd.Index[str] = obs_cut[obs_cut == cut].index
if ctrl_size < len(r_genes):
r_genes = r_genes.to_series().sample(ctrl_size).index
control_genes = control_genes.union(r_genes.difference(gene_list))