race condition breaks parallel run of sc.datasets functions (e.g. in tests) #4097

@flying-sheep

Description

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of scanpy.
  • (optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

_check_datafile_present_and_download has a race condition:

e.g. pbmc3k_processed uses the pattern “check if a cached version exists in sc.settings.datasetdir. If not, download it to that location. If yes, read it”
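a minimal sketch of that pattern, to make the problem visible. `read_cached` and its `download` argument are hypothetical names for illustration, not scanpy's actual code:

```python
from pathlib import Path
from typing import Callable


def read_cached(path: Path, download: Callable[[Path], None]) -> bytes:
    """Check-then-download (hypothetical helper, not scanpy's implementation)."""
    if not path.is_file():  # check: tests existence only, not completeness
        download(path)  # act: writes straight to the final cache location
    # another process can reach this line while `download` is still running elsewhere
    return path.read_bytes()
```

nothing between the check and the read tells a second process whether the file it sees is complete.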

errors like the one under “Error output” mean we have a corrupted (truncated) file. I’m pretty sure that’s caused by running pytest with multiple processes:

  • pytest-xdist host process will figure out the list of tests and then distribute them between workers
  • pytest-xdist worker 1 starts evaluating the fixture first, doesn’t see the cache file, and starts downloading the file
  • pytest-xdist worker 2 starts evaluating the fixture too (even a session-scoped fixture is evaluated once per process), finds the partially written file (without knowing it’s not done downloading), and tries to read it → error
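the sequence above can be simulated deterministically in a single process. This is only an illustration of the suspected interleaving, not a reproduction of the actual failure; the payload and its size are made up:

```python
import tempfile
from pathlib import Path

PAYLOAD = b"x" * 1000  # stands in for the complete .h5ad file

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "pbmc3k_processed.h5ad"

    # "worker 1": has started downloading, only part of the file is on disk
    writer = cache.open("wb")
    writer.write(PAYLOAD[:100])
    writer.flush()

    # "worker 2": the existence check passes although the download is incomplete
    seen_as_cached = cache.is_file()
    bytes_visible = len(cache.read_bytes())

    # "worker 1": finishes, too late for worker 2
    writer.write(PAYLOAD[100:])
    writer.close()
    final_size = cache.stat().st_size

print(seen_as_cached, bytes_visible, final_size)
```

worker 2 sees a file that exists but is shorter than the finished one, which matches the “truncated file” message in the traceback.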

so if I’m right, we could fix that on the scanpy side by either

  • somehow setting a lock while writing, either manually (wait for lock instead of just doing path.is_file()) or by setting locking=True (see here). that’d mean that other processes wait until the one that first got there is done downloading.
  • downloading to a temporary location and renaming the downloaded file to the cache location. this would cause parallel downloads but would also work, since renaming a file is an atomic operation as long as the temporary location is on the same filesystem as the cache location
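a rough sketch of the second option, assuming the download can be expressed as a callable returning bytes. `download_atomically` and `fetch` are hypothetical names, not existing scanpy API; the temporary file is created next to the target so both are on the same filesystem:

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def download_atomically(path: Path, fetch: Callable[[], bytes]) -> Path:
    """Write to a temp file next to `path`, then rename it into place."""
    if path.is_file():
        return path  # complete, because the file only ever appears via rename
    path.parent.mkdir(parents=True, exist_ok=True)
    # same directory => same filesystem, which the atomic rename relies on
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name, suffix=".part"
    )
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(fetch())
        # readers see either no file or the whole file, never a partial one
        os.replace(tmp_name, path)
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)  # don't leave partial files behind
        raise
    return path
```

if two processes race, both download and the later `os.replace` overwrites the earlier result with identical content, so the cost is a redundant download rather than a corrupted cache. A lock-based variant would avoid the duplicate download but needs to handle stale locks.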

cc @VladimirShitov

Minimal code sample

# /// script
# requires-python = ">=3.12"
# dependencies = [
#   "scanpy@git+https://github.com/scverse/scanpy.git@main",
# ]
# ///
#
# This script automatically imports the development branch of scanpy to check for issues

import scanpy as sc
# TODO: it’s late OK

Error output

_ ERROR at setup of TestCheckAdataLoaded.test_adata_loaded_true_after_prepare_anndata_wasserstein_tsne _
[gw1] linux -- Python 3.14.4 /home/runner/.local/share/hatch/env/virtual/patpy/NbKEWlp-/hatch-test.py3.14-pre/bin/python3

    @pytest.fixture(scope="session")
    def pbmc3k_adata():
        """Preprocessed PBMC3k dataset with randomly assigned sample labels.
    
        Provides real single-cell data with X_pca embedding and louvain cell-type
        annotations, suitable for methods that require biological structure in the
        data (e.g. DiffusionEMD, GloScope, WassersteinTSNE, PILOT, MOFA).
        """
        import scanpy as sc
    
>       adata = sc.datasets.pbmc3k_processed()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/conftest.py:89: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../.local/share/hatch/env/virtual/patpy/NbKEWlp-/hatch-test.py3.14-pre/lib/python3.14/site-packages/scanpy/datasets/_utils.py:16: in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
../../../.local/share/hatch/env/virtual/patpy/NbKEWlp-/hatch-test.py3.14-pre/lib/python3.14/site-packages/scanpy/datasets/_datasets.py:452: in pbmc3k_processed
    return read(settings.datasetdir / "pbmc3k_processed.h5ad", backup_url=url)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../../.local/share/hatch/env/virtual/patpy/NbKEWlp-/hatch-test.py3.14-pre/lib/python3.14/site-packages/scanpy/readwrite.py:150: in read
    return _read(
../../../.local/share/hatch/env/virtual/patpy/NbKEWlp-/hatch-test.py3.14-pre/lib/python3.14/site-packages/scanpy/readwrite.py:838: in _read
    return read_h5ad(filename, backed=backed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../../.local/share/hatch/env/virtual/patpy/NbKEWlp-/hatch-test.py3.14-pre/lib/python3.14/site-packages/anndata/_io/h5ad.py:263: in read_h5ad
    with h5py.File(filename, "r") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^
../../../.local/share/hatch/env/virtual/patpy/NbKEWlp-/hatch-test.py3.14-pre/lib/python3.14/site-packages/h5py/_hl/files.py:555: in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../../.local/share/hatch/env/virtual/patpy/NbKEWlp-/hatch-test.py3.14-pre/lib/python3.14/site-packages/h5py/_hl/files.py:232: in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper
    ???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   OSError: Unable to synchronously open file (truncated file: eof = 3014656, sblock->base_addr = 0, stored_eof = 24653425)

h5py/h5f.pyx:106: OSError

Versions

scanpy  1.13.0.dev61+g395006786
----    ----
pytz    2026.1.post1
zarr    3.1.6
typing_extensions       4.15.0
google-crc32c   1.8.0
kiwisolver      1.5.0
donfig  0.8.1.post1
MarkupSafe      3.0.3
PyYAML  6.0.3
threadpoolctl   3.6.0
packaging       26.1
Jinja2  3.1.6
pydantic-settings       2.14.0
dask    2026.3.0
pillow  12.2.0
coverage        7.13.5
python-dotenv   1.2.2
numpy   2.4.4
numcodecs       0.16.5
pydantic        2.13.3
scverse-misc    0.0.5
natsort 8.4.0
python-dateutil 2.9.0.post0
tblib   3.2.2
sparse  0.18.0
pyarrow 23.0.1
llvmlite        0.47.0
scipy   1.17.1
typing-inspection       0.4.2
six     1.17.0
legacy-api-wrap 1.5
anndata 0.12.10
toolz   1.1.0
pandas  2.3.3
session-info2   0.4.1
annotated-types 0.7.0
cloudpickle     3.1.2
matplotlib      3.10.8
scikit-learn    1.8.0
cycler  0.12.1
joblib  1.5.3
fsspec  2026.3.0
pydantic_core   2.46.3
fast-array-utils        1.4.1
msgpack 1.1.2
psutil  7.2.2
pyparsing       3.3.2
numba   0.65.0
h5py    3.16.0
----    ----
Python  3.14.4 (main, Apr  8 2026, 17:48:49) [GCC 15.2.1 20260209]
OS      Linux-6.19.13-arch1-1-x86_64-with-glibc2.43
CPU     16/16 logical CPU cores
Updated 2026-04-28 18:19
