score_genes reproducibility #313

Closed
coh-racng opened this issue Oct 19, 2018 · 10 comments · Fixed by #1890

@coh-racng
Contributor

In my preprocessing pipeline, I filter for the most variable genes and also regress out n_counts and cell cycle. However, every time I restart my Jupyter notebook and rerun the preprocessing pipeline, I get a different neighborhood graph and UMAP, and therefore different clustering. If I start from the same preprocessed data, I can reproduce the same PCA (using svd_solver='arpack'), UMAP, and clusters. So I was wondering whether randomness is introduced during the filtering step or in the regress_out method. What should I do to ensure reproducibility?

@coh-racng
Contributor Author

Setting a random seed before preprocessing did the trick.
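
A minimal sketch of this workaround, assuming the preprocessing steps draw from numpy's global RNG (the input path and the exact filtering/regression calls are placeholders for my actual pipeline):

import numpy as np
import scanpy as sc

np.random.seed(0)                        # seed numpy's global RNG before any preprocessing

adata = sc.read_h5ad('counts.h5ad')      # placeholder input
sc.pp.highly_variable_genes(adata)       # placeholder for the variable-gene filtering step
sc.pp.regress_out(adata, ['n_counts'])   # placeholder for regressing out n_counts / cell cycle
sc.tl.pca(adata, svd_solver='arpack')    # deterministic PCA solver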

@falexwolf
Member

Ah, that is very likely due to the cell cycle annotation via score_genes. Are you using it? There is a sampling procedure involved that might not set a random seed in the function call.

We should fix that.

@falexwolf falexwolf reopened this Oct 19, 2018
@falexwolf
Member

Hm, but it looks like https://scanpy.readthedocs.io/en/latest/api/scanpy.api.tl.score_genes.html should be reproducible with the default settings. In what you describe, only pca, louvain, and umap have stochastic elements, all of which set a default seed in the function call. I have never observed reproducibility issues there.
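
For completeness, a sketch of passing those seeds explicitly (random_state=0 is the documented default for each of these calls; adata is assumed to be an already preprocessed AnnData object):

import scanpy as sc

sc.tl.pca(adata, svd_solver='arpack', random_state=0)
sc.pp.neighbors(adata, random_state=0)
sc.tl.umap(adata, random_state=0)
sc.tl.louvain(adata, random_state=0)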

Can we narrow it down to score_genes? Or do you think it's somewhere else?

@fidelram
Collaborator

I think this can be closed as @coh-racng found a solution.

@falexwolf
Member

Yes, but I think each tool should allow setting a seed without needing to set it externally. So I'd like to know why @coh-racng had to do this.

@coh-racng
Contributor Author

coh-racng commented Oct 23, 2018

Thanks for the help. I have traced the randomness to sc.tl.score_genes_cell_cycle. Even if I explicitly set the random state to 0, two repeats of score_genes_cell_cycle on the same data give different downstream clustering. On the other hand, if I reuse a single score_genes_cell_cycle output and repeat the downstream clustering methods, the clusters are the same.
I am currently using scanpy==1.3.1.
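
For concreteness, a sketch of the kind of back-to-back comparison that exposes this (s_genes and g2m_genes are placeholder gene lists; score_genes_cell_cycle writes S_score and G2M_score into adata.obs):

import numpy as np
import scanpy as sc

a1, a2 = adata.copy(), adata.copy()
sc.tl.score_genes_cell_cycle(a1, s_genes=s_genes, g2m_genes=g2m_genes, random_state=0)
sc.tl.score_genes_cell_cycle(a2, s_genes=s_genes, g2m_genes=g2m_genes, random_state=0)
print(np.array_equal(a1.obs['S_score'].values, a2.obs['S_score'].values))
print(np.array_equal(a1.obs['G2M_score'].values, a2.obs['G2M_score'].values))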

@chris-rands
Contributor

Was there any progress on this? I also see non-reproducible UMAPs when restarting my Jupyter notebook. I have not been able to trace the exact cause, but it appears unrelated to the PCA step (I am using svd_solver='arpack'). In my case, setting random.seed did not fix the issue. However, surprisingly, combining this with disabling hash randomization by setting PYTHONHASHSEED to 0 did produce reproducible results.
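
Side note for anyone reproducing that setup: PYTHONHASHSEED is only read at interpreter startup, so it has to be set in the environment that launches the kernel (e.g. PYTHONHASHSEED=0 jupyter notebook). A sketch of the in-process half, with arbitrary seed values:

import os, random
import numpy as np

# PYTHONHASHSEED must already be set in the environment that launched Python
assert os.environ.get('PYTHONHASHSEED') == '0', 'launch the kernel with PYTHONHASHSEED=0'
random.seed(0)
np.random.seed(0)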

@Iwo-K
Contributor

Iwo-K commented Mar 15, 2021

I think I encountered this or a very similar problem, and I have a fix. In my analysis I used scanpy.tl.score_genes_cell_cycle to assign cell cycle scores, followed by scanpy.tl.regress_out and scanpy.tl.pca. On different machines I observed differences in the PCA values. Surprisingly, I could see these differences even when using the exact same Singularity container.

I analysed the functions step by step and found that the small differences were coming from the scanpy.tl.score_genes function, specifically from the computation of the means here:
X_list = np.nanmean(X_list, axis=1)
and
X_control = np.nanmean(X_control, axis=1)

I suspect this has to do with machine precision and the choice between float32 and float64 types. I managed to fix it by setting the dtype to float64 in each statement, which produced identical gene/cell-cycle scores on each machine.
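
As a rough illustration of the precision point (a toy example, not scanpy code): np.nanmean accumulates in the input dtype by default, so on float32 data the summation itself runs in single precision and is presumably more sensitive to whichever summation path numpy takes on a given machine; forcing a float64 accumulation removes that sensitivity:

import numpy as np

X = np.random.default_rng(0).random((1, 100000)).astype('float32')
m32 = np.nanmean(X, axis=1)                    # accumulates in float32
m64 = np.nanmean(X, axis=1, dtype='float64')   # accumulates in float64
print(m32[0] - m64[0])                         # typically a small but nonzero difference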

The fix (in the scanpy.tl.score_genes function):

X_list = np.nanmean(X_list, axis=1, dtype='float64')
and
X_control = np.nanmean(X_control, axis=1, dtype='float64')

And to keep the data types consistent, convert back to float32 at the end:
adata.obs[score_name] = pd.Series(np.array(score).ravel(), index=adata.obs_names).astype('float32')

I hope this will help someone. I am happy to open a pull request if you think this is worthwhile.

@WhoIsJack

Pinging this, as I've encountered it as well.

I ran into non-reproducible UMAPs when rerunning code/notebooks and systematically went through my pipeline to find the source(s) of the error, one of which was sc.tl.score_genes_cell_cycle.

Setting the random seed externally did not help, but @Iwo-K's comment got me on the right track. I am now using the following simple hack, which fixes the issue for me:

adata.X = adata.X.astype('<f8')  # Make float64 to ensure stability
sc.tl.score_genes_cell_cycle(adata, use_raw=False,
                             s_genes=cc_s_genes, g2m_genes=cc_g2m_genes,
                             random_state=0)
adata.X = adata.X.astype('<f4')  # Return to float32 for consistency

It would be great if this could be fixed internally, perhaps using @Iwo-K's solution?

@ivirshup
Member

Sorry about the late response on this!

@Iwo-K, thanks for looking into this. That fix works for me, and a PR would be welcome. I think it would be fine to always return the 64-bit values, though. As for why this occurs, this thread may have some relevant info: numpy/numpy#624 (comment)

@ivirshup ivirshup changed the title preprocessing and clustering reproducibility score_genes reproducibility Jun 16, 2021
@ivirshup ivirshup added this to the 1.8.1 milestone Jun 16, 2021
@ivirshup ivirshup linked a pull request Jun 22, 2021 that will close this issue