Skip to content

Inconsistent results from arpack pca implementation using VMs with different numbers of cpus #1187

@dylkot

Description

@dylkot

I am finding that my analysis is not perfectly reproducible across different computational platforms. I thought I was going crazy but I have since reproduced this finding using the minimal 3000 PBMC dataset clustering example. Essentially I run the same code either on a virtual machine with 8 CPUs or one with 16 CPUs and I get non-identical PCA results. It doesn't seem to matter if I use the arpack or the randomized solver even though using the randomized solver gives the warning:

Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose svd_solver='arpack'. This will likely become the Scanpy default in the future.

I'd like to just attach the jupyter notebook but it won't seem to let me do that so I'm copying the code below.
...

# First run on a machine with 8 CPUs
import numpy as np
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/', 
    var_names='gene_symbols',
    cache=True) 

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata.copy()
sc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.tl.pca(adata, svd_solver='arpack', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test8.h5ad', adata)
sc.tl.pca(adata, svd_solver='randomized', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test8_randomized.h5ad', adata)

# Then run on a machine with 16 CPUs
import numpy as np
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/', 
    var_names='gene_symbols',
    cache=True) 

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata.copy()
sc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.tl.pca(adata, svd_solver='arpack', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test16.h5ad', adata)
sc.tl.pca(adata, svd_solver='randomized', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test16_randomized.h5ad', adata)

# Running on a machine with 16 CPUs, evaluate the differences between the results first from the arpack solver
adata8 = sc.read('test8.h5ad')
adata16 = sc.read('test16.h5ad')
print((adata8.X != adata16.X).sum())
print((adata8.obsm['X_pca'] != adata16.obsm['X_pca']).sum())
print((adata8.uns['neighbors']['connectivities'] != adata16.uns['neighbors']['connectivities']).sum())
sc.tl.leiden(adata8, random_state=14)
sc.tl.leiden(adata16, random_state=14)
display(adata8.obs['leiden'].value_counts())
display(adata16.obs['leiden'].value_counts())

# Running on a machine with 16 CPUs, evaluate the differences between the results from the randomized solver
adata8 = sc.read('test8_randomized.h5ad')
adata16 = sc.read('test16_randomized.h5ad')
print((adata8.X != adata16.X).sum())
print((adata8.obsm['X_pca'] != adata16.obsm['X_pca']).sum())
print((adata8.uns['neighbors']['connectivities'] != adata16.uns['neighbors']['connectivities']).sum())
sc.tl.leiden(adata8, random_state=14)
sc.tl.leiden(adata16, random_state=14)
display(adata8.obs['leiden'].value_counts())
display(adata16.obs['leiden'].value_counts())

This outputs the following

0
134513
37696
0    659
1    605
2    398
3    352
4    342
5    174
6    118
7     41
8     11
Name: leiden, dtype: int64
0    527
1    484
2    398
3    324
4    320
5    301
6    174
7    109
8     52
9     11
Name: leiden, dtype: int64

0
134127
37278
0    646
1    617
2    382
3    362
4    334
5    173
6    129
7     46
8     11
Name: leiden, dtype: int64
0    646
1    631
2    408
3    349
4    334
5    170
6    106
7     45
8     11
Name: leiden, dtype: int64

...

Versions:

scanpy==1.4.4.post1 anndata==0.7.1 umap==0.3.10 numpy==1.18.1 scipy==1.4.1 pandas==1.0.3 scikit-learn==0.22.2.post1 statsmodels==0.11.1 python-igraph==0.8.0 louvain==0.6.1

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions