Inconsistent results from arpack pca implementation using VMs with different numbers of cpus

I am finding that my analysis is not perfectly reproducible across different computational platforms. I thought I was going crazy but I have since reproduced this finding using the minimal 3000 PBMC dataset clustering example. Essentially I run the same code either on a virtual machine with 8 CPUs or one with 16 CPUs and I get non-identical PCA results. It doesn't seem to matter if I use the arpack or the randomized solver even though using the randomized solver gives the warning:

`Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.`

I'd like to just attach the jupyter notebook but it won't seem to let me do that so I'm copying the code below.
...


```python
# First run on a machine with 8 CPUs
import numpy as np
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/', 
    var_names='gene_symbols',
    cache=True) 

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata.copy()
sc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.tl.pca(adata, svd_solver='arpack', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test8.h5ad', adata)
sc.tl.pca(adata, svd_solver='randomized', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test8_randomized.h5ad', adata)

# Then run on a machine with 16 CPUs
import numpy as np
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/', 
    var_names='gene_symbols',
    cache=True) 

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata.copy()
sc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.tl.pca(adata, svd_solver='arpack', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test16.h5ad', adata)
sc.tl.pca(adata, svd_solver='randomized', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test16_randomized.h5ad', adata)

# Running on a machine with 16 CPUs, evaluate the differences between the results first from the arpack solver
adata8 = sc.read('test8.h5ad')
adata16 = sc.read('test16.h5ad')
print((adata8.X != adata16.X).sum())
print((adata8.obsm['X_pca'] != adata16.obsm['X_pca']).sum())
print((adata8.uns['neighbors']['connectivities'] != adata16.uns['neighbors']['connectivities']).sum())
sc.tl.leiden(adata8, random_state=14)
sc.tl.leiden(adata16, random_state=14)
display(adata8.obs['leiden'].value_counts())
display(adata16.obs['leiden'].value_counts())

# Running on a machine with 16 CPUs, evaluate the differences between the results from the randomized solver
adata8 = sc.read('test8_randomized.h5ad')
adata16 = sc.read('test16_randomized.h5ad')
print((adata8.X != adata16.X).sum())
print((adata8.obsm['X_pca'] != adata16.obsm['X_pca']).sum())
print((adata8.uns['neighbors']['connectivities'] != adata16.uns['neighbors']['connectivities']).sum())
sc.tl.leiden(adata8, random_state=14)
sc.tl.leiden(adata16, random_state=14)
display(adata8.obs['leiden'].value_counts())
display(adata16.obs['leiden'].value_counts())

```
This outputs the following

```
0
134513
37696
0    659
1    605
2    398
3    352
4    342
5    174
6    118
7     41
8     11
Name: leiden, dtype: int64
0    527
1    484
2    398
3    324
4    320
5    301
6    174
7    109
8     52
9     11
Name: leiden, dtype: int64

0
134127
37278
0    646
1    617
2    382
3    362
4    334
5    173
6    129
7     46
8     11
Name: leiden, dtype: int64
0    646
1    631
2    408
3    349
4    334
5    170
6    106
7     45
8     11
Name: leiden, dtype: int64

```




```pytb
...
```

#### Versions:

scanpy==1.4.4.post1 anndata==0.7.1 umap==0.3.10 numpy==1.18.1 scipy==1.4.1 pandas==1.0.3 scikit-learn==0.22.2.post1 statsmodels==0.11.1 python-igraph==0.8.0 louvain==0.6.1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Inconsistent results from arpack pca implementation using VMs with different numbers of cpus #1187

Versions:

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Inconsistent results from arpack pca implementation using VMs with different numbers of cpus #1187

Description

Versions:

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions