Download dataset files atomically to avoid corrupt cache (#4097) #4142
Open
gaoflow wants to merge 2 commits into
`_download` wrote directly to the destination cache path, so a concurrent reader (e.g. a parallel pytest worker sharing `settings.datasetdir`) could find the file present via `path.is_file()` and read it while it was still being written, getting a corrupted/partial file. Download to a temporary file in the same directory and atomically rename it into place once complete, so the destination only ever appears fully written. On failure the temporary file is removed and the destination is left untouched (it may belong to another process that finished first).
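The temp-file-then-rename pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `download_atomic` is a hypothetical name, and the real `_download` may differ in signature and details such as progress reporting.

```python
import shutil
import tempfile
from pathlib import Path
from urllib.request import urlopen


def download_atomic(url: str, path: Path) -> None:
    """Hypothetical helper: `path` only becomes visible once fully written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the destination so the final rename
    # stays on one filesystem, where a rename is atomic.
    tmp = tempfile.NamedTemporaryFile(
        dir=path.parent, prefix=f".{path.name}.", suffix=".part", delete=False
    )
    tmp_path = Path(tmp.name)
    try:
        # Both the temp file and the response are closed when this block exits.
        with tmp, urlopen(url) as response:
            shutil.copyfileobj(response, tmp)
        # Publish the finished file in a single step.
        tmp_path.replace(path)
    except BaseException:
        # Remove only our own temp file; `path` may belong to another
        # process that finished first, so it is left untouched.
        tmp_path.unlink(missing_ok=True)
        raise
```

A reader that polls `path.is_file()` then sees either no file or a complete one, never a partial write.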
Codecov Report
✅ All modified and coverable lines are covered by tests.
❌ Your project check has failed because the head coverage (0.00%) is below the target coverage (75.00%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## main #4142 +/- ##
==========================================
- Coverage 79.61% 0 -79.62%
==========================================
Files 120 0 -120
Lines 12786 0 -12786
==========================================
- Hits 10179 0 -10179
+ Misses 2607 0 -2607
Flags with carried forward coverage won't be shown.
Member
Thanks! Two comments:
Fixes #4097.
Problem
`_download` (used by `_check_datafile_present_and_download` for every cached dataset) writes directly to the destination cache path. As @flying-sheep diagnosed in #4097, this races under parallel execution (e.g. `pytest-xdist` workers sharing `settings.datasetdir`): one worker creates `path` and is still writing into it; another checks `path.is_file()`, sees the partially-written file, and reads it → corrupted file / error.
Fix
Download to a `NamedTemporaryFile` in the same directory and atomically `Path.replace()` it into place once the download is complete (the second option suggested in the issue). A rename on the same filesystem is atomic, so `path` only ever becomes visible fully written. Concurrent downloads still work: the loser's temp file simply replaces the winner's file in turn.
On failure the temporary file is removed, and `path` itself is left untouched. Previously the error path unconditionally `unlink`-ed `path`, which under the race could delete a sibling process's already-completed download.
Verification
Added `test_download_atomic` in `tests/test_datasets.py`: it mocks `urlopen` and asserts that the destination path does not exist while bytes are still being streamed, that the final content is correct, and that no temporary file is left behind. The test fails on the current code (the destination appeared before the download finished) and passes with this change. I also verified that both an immediate failure (e.g. `HTTPError`) and a mid-stream failure propagate the exception and leave neither a destination nor a leftover temporary file.
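For illustration, the checks described above could look roughly like the sketch below. It is not the actual `test_download_atomic`: `fetch`, `SpyResponse`, and `check_atomic_download` are hypothetical names, and the fake response is injected through an `opener` argument rather than by patching `urlopen`, purely to keep the example self-contained.

```python
import io
import shutil
import tempfile
from pathlib import Path
from urllib.request import urlopen


def fetch(url, path, opener=urlopen):
    """Stand-in for the patched downloader (hypothetical, simplified)."""
    tmp = tempfile.NamedTemporaryFile(dir=path.parent, delete=False)
    try:
        with tmp, opener(url) as response:
            shutil.copyfileobj(response, tmp)
        Path(tmp.name).replace(path)
    except BaseException:
        Path(tmp.name).unlink(missing_ok=True)
        raise


class SpyResponse(io.BytesIO):
    """Fake response that records whether `dest` existed at each read."""

    def __init__(self, payload, dest, fail_midway=False):
        super().__init__(payload)
        self.dest = dest
        self.fail_midway = fail_midway
        self.dest_seen = []

    def read(self, size=-1):
        self.dest_seen.append(self.dest.exists())
        if self.fail_midway and len(self.dest_seen) > 1:
            raise ConnectionError("simulated mid-stream failure")
        return super().read(size)


def check_atomic_download(workdir):
    # Success: destination is invisible while streaming, then complete.
    dest = workdir / "data.bin"
    ok = SpyResponse(b"payload", dest)
    fetch("https://example.invalid/data.bin", dest, opener=lambda url: ok)
    assert ok.dest_seen and not any(ok.dest_seen)
    assert dest.read_bytes() == b"payload"
    assert [p.name for p in workdir.iterdir()] == ["data.bin"]

    # Mid-stream failure: exception propagates, nothing is left behind.
    broken = workdir / "broken.bin"
    bad = SpyResponse(b"payload", broken, fail_midway=True)
    try:
        fetch("https://example.invalid/broken.bin", broken, opener=lambda url: bad)
    except ConnectionError:
        pass
    else:
        raise AssertionError("expected the failure to propagate")
    assert [p.name for p in workdir.iterdir()] == ["data.bin"]
```

Calling `check_atomic_download(Path(d))` inside a `tempfile.TemporaryDirectory()` context runs both scenarios against a scratch directory.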