Skip to content

Download dataset files atomically to avoid corrupt cache (#4097) #4142

Open
gaoflow wants to merge 2 commits into scverse:main from gaoflow:fix-4097-atomic-download

@gaoflow gaoflow commented Jun 1, 2026

Fixes #4097.

Problem

_download (used by _check_datafile_present_and_download for every cached dataset) writes directly to the destination cache path:

with path.open("wb") as f:
    ...

As @flying-sheep diagnosed in #4097, this races under parallel execution (e.g. pytest-xdist workers sharing settings.datasetdir):

  1. worker 1 doesn't see the cache file and starts downloading, creating path and writing into it;
  2. worker 2 checks path.is_file(), sees the partially-written file, and reads it → corrupted file / error.
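The interleaving above can be reproduced deterministically in a single process. The following is a minimal sketch, not code from scanpy: `observe_partial_write` is a hypothetical helper that plays both workers in turn, writing half of a payload directly to the destination and then performing the same `path.is_file()` check and read that a second worker would perform.

```python
from pathlib import Path


def observe_partial_write(directory: Path, payload: bytes) -> bytes:
    """Return what a concurrent reader would see midway through a direct write."""
    path = directory / "data.bin"
    half = len(payload) // 2
    with path.open("wb") as f:  # "worker 1": the destination exists from here on
        f.write(payload[:half])
        f.flush()  # push the first half to the OS so a reader can see it
        assert path.is_file()  # "worker 2": the cache check already passes
        seen = path.read_bytes()  # ...and the read returns a truncated file
        f.write(payload[half:])
    return seen
```

With a 10-byte payload, the returned bytes are only the first 5, even though the cache check succeeded.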

Fix

Download to a NamedTemporaryFile in the same directory and atomically Path.replace() it into place once the download is complete (the second option suggested in the issue). A rename on the same filesystem is atomic, so path only ever becomes visible fully written. Concurrent downloads still work: each worker writes to its own temporary file, and whichever finishes later simply replaces the destination with another complete copy.

On failure the temporary file is removed, and path itself is left untouched — previously the error path unconditionally unlink-ed path, which under the race could delete a sibling process's already-completed download.
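For readers who have not opened the diff, the pattern looks roughly like this. It is a sketch under stated assumptions rather than the code in this PR: the function name `download_atomic`, the `.part` suffix, and the chunk size are all illustrative, and the real `_download` may differ (for example, in progress reporting).

```python
import shutil
import tempfile
from pathlib import Path
from urllib.request import urlopen


def download_atomic(url: str, path: Path, *, chunk_size: int = 1 << 16) -> None:
    """Stream `url` into a temp file next to `path`, then rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = None
    try:
        # Same directory as the destination, so the rename stays on one filesystem.
        with tempfile.NamedTemporaryFile(
            "wb", dir=path.parent, prefix=f".{path.name}.", suffix=".part", delete=False
        ) as tmp:
            tmp_path = Path(tmp.name)
            with urlopen(url) as resp:
                shutil.copyfileobj(resp, tmp, chunk_size)
        # The temp file is closed here; the destination appears only now.
        tmp_path.replace(path)
    except BaseException:
        # Remove only our own temp file; never touch `path`, which may be
        # a complete download written by another process.
        if tmp_path is not None:
            tmp_path.unlink(missing_ok=True)
        raise
```

Creating the temporary file in `path.parent` rather than the system temp directory is the important design choice: a rename across filesystems would fall back to a copy or fail, and would no longer be atomic.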

Verification

Added test_download_atomic in tests/test_datasets.py: it mocks urlopen and asserts that the destination path does not exist while bytes are still being streamed, that the final content is correct, and that no temporary file is left behind. The test fails on the current code (destination appeared before the download finished) and passes with this change. I also verified that both an immediate failure (e.g. HTTPError) and a mid-stream failure propagate the exception and leave neither a destination nor a leftover temporary file.
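The shape of that check can be sketched as follows. This is an illustration, not the test added in this PR: `fetch` is a hypothetical stand-in for the function under test (without the failure cleanup), and `SpyResponse` and `check_atomic` are made-up names. The fake response records, on every `read()`, whether the destination already exists.

```python
import io
import tempfile
import urllib.request
from pathlib import Path
from unittest import mock


def fetch(url: str, path: Path) -> None:
    # Hypothetical stand-in for the function under test (not scanpy's `_download`).
    with tempfile.NamedTemporaryFile("wb", dir=path.parent, delete=False) as tmp:
        with urllib.request.urlopen(url) as resp:
            while chunk := resp.read(4):
                tmp.write(chunk)
    Path(tmp.name).replace(path)


class SpyResponse(io.BytesIO):
    """Fake HTTP response that notes whether `dest` exists during each read."""

    def __init__(self, payload: bytes, dest: Path) -> None:
        super().__init__(payload)
        self.dest = dest
        self.seen_during_read = []

    def read(self, size: int = -1) -> bytes:
        self.seen_during_read.append(self.dest.exists())
        return super().read(size)


def check_atomic(directory: Path) -> None:
    dest = directory / "data.bin"
    resp = SpyResponse(b"0123456789", dest)
    # No network access: urlopen is replaced by the spy for the duration of the call.
    with mock.patch("urllib.request.urlopen", return_value=resp):
        fetch("https://example.invalid/data.bin", dest)
    # The destination was never visible while bytes were still streaming.
    assert resp.seen_during_read and not any(resp.seen_during_read)
    # Final content is complete and no temporary file is left behind.
    assert dest.read_bytes() == b"0123456789"
    assert [p.name for p in directory.iterdir()] == ["data.bin"]
```

Under pytest, this could be called with the `tmp_path` fixture, e.g. `check_atomic(tmp_path)`.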

gaoflow added 2 commits June 2, 2026 00:17
`_download` wrote directly to the destination cache path, so a concurrent
reader (e.g. a parallel pytest worker sharing `settings.datasetdir`) could
find the file present via `path.is_file()` and read it while it was still
being written, getting a corrupted/partial file.

Download to a temporary file in the same directory and atomically rename it
into place once complete, so the destination only ever appears fully written.
On failure the temporary file is removed and the destination is left
untouched (it may belong to another process that finished first).

codecov Bot commented Jun 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (4ba31e4) to head (53dc24b).
✅ All tests successful. No failed tests found.

❌ Your project check has failed because the head coverage (0.00%) is below the target coverage (75.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #4142       +/-   ##
==========================================
- Coverage   79.61%       0   -79.62%     
==========================================
  Files         120       0      -120     
  Lines       12786       0    -12786     
==========================================
- Hits        10179       0    -10179     
+ Misses       2607       0     -2607     
Flag Coverage Δ
hatch-test.low-vers ?
hatch-test.pre ?

Flags with carried forward coverage won't be shown. Click here to find out more.
see 120 files with indirect coverage changes

Member

Zethson commented Jun 5, 2026

Thanks!

Two comments:

  1. Could you please ensure that pre-commit, the docs build, and CI all pass? The RTD build failed with:

/home/docs/checkouts/readthedocs.org/user_builds/icb-scanpy/checkouts/4142/docs/release-notes/1.13.0.dev76+g53dc24bf0.md:36: WARNING: py:mod reference target not found: pytest-xdist [ref.mod]

  2. We're kind of moving to pooch instead of making custom requests. Would using pooch be simpler here and also solve this issue?



Development

Successfully merging this pull request may close these issues.

race condition breaks parallel run of sc.datasets functions (e.g. in tests)

2 participants