New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
score_genes reproducibility #313
Comments
setting random seed before preprocessing did the trick. |
Ah, that should very likely be due to the cell cyle annotation via We should fix that. |
Hm, but it looks like https://scanpy.readthedocs.io/en/latest/api/scanpy.api.tl.score_genes.html should be reproducible with the default settings. In what you describe, only pca, louvain and umap have stochastic elements all of which set a default seed in the function call. I have never observed issues with reproducibility there. Can we narrow it down to score_genes? Or do you think it's somewhere else? |
I think this can be closed as @coh-racng found a solution. |
Yes, but I think that each tool should have the possibility of setting a seed without needing to set it externally. So, I'd like to know why @coh-racng had to do this. |
Thanks for the help. I have traced the randomness to sc.tl.score_genes_cell_cycle. Even if I explicitly state the random state to be 0, two repeats of score_genes_cell_cycle from the same data would give different downstream clustering. On the other hand, if I use one score_genes_cell_cycle output and repeat the downstream clustering methods, the clusters are the same. |
Was there any progress on this? I also have an issue of non-reproducible UMAPs when re-starting Jupyter notebook. I have not been able to trace the exact cause of the issue, but it appears unrelated to the PCA step (I am using |
I think I encountered this or a very similar problem and I have a fix. In my analysis I used the I analysed the functions step by step and found that small differences were coming from the I suspect this had something to do with machine precision and maybe choosing between float32 and float64 types. I managed to fix this by setting the dtype to float64 in each statement. This produced identical gene/cell cycle scores on each machine. The fix (in the
And to keep the data types consistent convert back to float32 at the end: I hope this will help someone. I am happy to open a pull request if you think this is worthwhile. |
Pinging this, as I've encountered it as well. I ran into non-reproducible UMAPs when rerunning code/notebooks and systematically went through my pipeline to find the source(s) of error, one of which was Setting the random seed externally did not help, but @Iwo-K's comment got me on the right track. I am now using the following simple hack, which fixes the issue for me: adata.X = adata.X.astype('<f8') # Make float64 to ensure stability
sc.tl.score_genes_cell_cycle(adata, use_raw=False,
s_genes=cc_s_genes, g2m_genes=cc_g2m_genes,
random_state=0)
adata.X = adata.X.astype('<f4') # Return to float32 for consistency Would be great if this would be fixed internally, perhaps using @Iwo-K's solution? |
Sorry about the late response on this! @Iwo-K, thanks for looking into this. That fix works for me, and a PR would be welcome. I think it would be fine to always return the 64 bit values though. As for why this occurs, I think this thread may have some relevant info: numpy/numpy#624 (comment) |
In my preprocessing pipeline, I filtered for the most variable genes and also regress out n_counts and cell cycle. However, it seems like every time I restart my Jupyter notebook and rerun the preprocessing pipeline, I get a different neighborhood graph and UMAP, and therefore different clustering. If I use the same preprocessed data, I can reproduce the same PCA (using svd_solver='arpack'), UMAP, and clusters. So I was wondering if there is randomness introduced during filtering process or regress_out methods? What should I do to ensure reproducibility?
The text was updated successfully, but these errors were encountered: