score_genes reproducibility #313

Closed
coh-racng opened this issue Oct 19, 2018 · 10 comments · Fixed by #1890

@coh-racng
Contributor

In my preprocessing pipeline, I filter for the most variable genes and also regress out n_counts and cell cycle. However, every time I restart my Jupyter notebook and rerun the preprocessing pipeline, I get a different neighborhood graph and UMAP, and therefore different clustering. If I start from the same preprocessed data, I can reproduce the same PCA (using svd_solver='arpack'), UMAP, and clusters. So I was wondering whether randomness is introduced during the filtering step or in the regress_out method. What should I do to ensure reproducibility?

@coh-racng
Contributor Author

Setting a random seed before preprocessing did the trick.
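
A minimal sketch of this workaround, assuming the preprocessing steps draw from numpy's global RNG (the input path and the exact filtering/regression calls are placeholders for my actual pipeline):

import numpy as np
import scanpy as sc

np.random.seed(0)                        # seed numpy's global RNG before any preprocessing

adata = sc.read_h5ad('counts.h5ad')      # placeholder input
sc.pp.highly_variable_genes(adata)       # placeholder for the variable-gene filtering step
sc.pp.regress_out(adata, ['n_counts'])   # placeholder for regressing out n_counts / cell cycle
sc.tl.pca(adata, svd_solver='arpack')    # deterministic PCA solver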

@falexwolf
Member

Ah, that is very likely due to the cell cycle annotation via score_genes. Are you using it? There is a sampling procedure involved that might not set a random seed in the function call.

We should fix that.

@falexwolf falexwolf reopened this Oct 19, 2018
@falexwolf
Member

Hm, but it looks like https://scanpy.readthedocs.io/en/latest/api/scanpy.api.tl.score_genes.html should be reproducible with the default settings. In what you describe, only pca, louvain, and umap have stochastic elements, all of which set a default seed in the function call. I have never observed reproducibility issues there.
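
For completeness, a sketch of passing those seeds explicitly (random_state=0 is the documented default for each of these calls; adata is assumed to be an already preprocessed AnnData object):

import scanpy as sc

sc.tl.pca(adata, svd_solver='arpack', random_state=0)
sc.pp.neighbors(adata, random_state=0)
sc.tl.umap(adata, random_state=0)
sc.tl.louvain(adata, random_state=0)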

Can we narrow it down to score_genes? Or do you think it's somewhere else?

@fidelram
Collaborator

I think this can be closed as @coh-racng found a solution.

@falexwolf
Member

Yes, but I think each tool should allow setting a seed without needing to set it externally. So I'd like to know why @coh-racng had to do this.

@coh-racng
Contributor Author

coh-racng commented Oct 23, 2018

Thanks for the help. I have traced the randomness to sc.tl.score_genes_cell_cycle. Even if I explicitly set the random state to 0, two repeats of score_genes_cell_cycle on the same data give different downstream clustering. On the other hand, if I reuse a single score_genes_cell_cycle output and repeat the downstream clustering methods, the clusters are the same.
I am currently using scanpy==1.3.1.
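
For concreteness, a sketch of the kind of back-to-back comparison that exposes this (s_genes and g2m_genes are placeholder gene lists; score_genes_cell_cycle writes S_score and G2M_score into adata.obs):

import numpy as np
import scanpy as sc

a1, a2 = adata.copy(), adata.copy()
sc.tl.score_genes_cell_cycle(a1, s_genes=s_genes, g2m_genes=g2m_genes, random_state=0)
sc.tl.score_genes_cell_cycle(a2, s_genes=s_genes, g2m_genes=g2m_genes, random_state=0)
print(np.array_equal(a1.obs['S_score'].values, a2.obs['S_score'].values))
print(np.array_equal(a1.obs['G2M_score'].values, a2.obs['G2M_score'].values))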

@chris-rands
Contributor

Was there any progress on this? I also see non-reproducible UMAPs when restarting my Jupyter notebook. I have not been able to trace the exact cause, but it appears unrelated to the PCA step (I am using svd_solver='arpack'). In my case, setting random.seed did not fix the issue. However, surprisingly, combining this with disabling hash randomization by setting PYTHONHASHSEED to 0 did produce reproducible results.
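
Side note for anyone reproducing that setup: PYTHONHASHSEED is only read at interpreter startup, so it has to be set in the environment that launches the kernel (e.g. PYTHONHASHSEED=0 jupyter notebook). A sketch of the in-process half, with arbitrary seed values:

import os, random
import numpy as np

# PYTHONHASHSEED must already be set in the environment that launched Python
assert os.environ.get('PYTHONHASHSEED') == '0', 'launch the kernel with PYTHONHASHSEED=0'
random.seed(0)
np.random.seed(0)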

@Iwo-K
Contributor

Iwo-K commented Mar 15, 2021

I think I encountered this or a very similar problem, and I have a fix. In my analysis I used scanpy.tl.score_genes_cell_cycle to assign cell cycle scores, followed by scanpy.tl.regress_out and scanpy.tl.pca. On different machines I observed differences in the PCA values. Surprisingly, I could see these differences even when using the exact same Singularity container.

I analysed the functions step by step and found that the small differences were coming from the scanpy.tl.score_genes function, specifically from the computation of the means here:
X_list = np.nanmean(X_list, axis=1)
and
X_control = np.nanmean(X_control, axis=1)

I suspect this has to do with machine precision and the choice between float32 and float64 types. I managed to fix it by setting the dtype to float64 in each statement, which produced identical gene/cell-cycle scores on each machine.
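
As a rough illustration of the precision point (a toy example, not scanpy code): np.nanmean accumulates in the input dtype by default, so on float32 data the summation itself runs in single precision and is presumably more sensitive to whichever summation path numpy takes on a given machine; forcing a float64 accumulation removes that sensitivity:

import numpy as np

X = np.random.default_rng(0).random((1, 100000)).astype('float32')
m32 = np.nanmean(X, axis=1)                    # accumulates in float32
m64 = np.nanmean(X, axis=1, dtype='float64')   # accumulates in float64
print(m32[0] - m64[0])                         # typically a small but nonzero difference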

The fix (in the scanpy.tl.score_genes function):

X_list = np.nanmean(X_list, axis=1, dtype='float64')
and
X_control = np.nanmean(X_control, axis=1, dtype='float64')

And to keep the data types consistent, convert back to float32 at the end:
adata.obs[score_name] = pd.Series(np.array(score).ravel(), index=adata.obs_names).astype('float32')

I hope this will help someone. I am happy to open a pull request if you think this is worthwhile.

@WhoIsJack

Pinging this, as I've encountered it as well.

I ran into non-reproducible UMAPs when rerunning code/notebooks and systematically went through my pipeline to find the source(s) of the error, one of which was sc.tl.score_genes_cell_cycle.

Setting the random seed externally did not help, but @Iwo-K's comment got me on the right track. I am now using the following simple hack, which fixes the issue for me:

adata.X = adata.X.astype('<f8')  # Make float64 to ensure stability
sc.tl.score_genes_cell_cycle(adata, use_raw=False,
                             s_genes=cc_s_genes, g2m_genes=cc_g2m_genes,
                             random_state=0)
adata.X = adata.X.astype('<f4')  # Return to float32 for consistency

It would be great if this could be fixed internally, perhaps using @Iwo-K's solution?

@ivirshup
Member

Sorry about the late response on this!

@Iwo-K, thanks for looking into this. That fix works for me, and a PR would be welcome. I think it would be fine to always return the 64-bit values, though. As for why this occurs, this thread may have some relevant info: numpy/numpy#624 (comment)

@ivirshup ivirshup changed the title preprocessing and clustering reproducibility score_genes reproducibility Jun 16, 2021
@ivirshup ivirshup added this to the 1.8.1 milestone Jun 16, 2021
@ivirshup ivirshup linked a pull request Jun 22, 2021 that will close this issue