cloud_storage: protect atime writes with a mutex #16648

nvartolomei · 2024-02-20T17:02:19Z

Fixes a potential cloud storage cache access time tracker file corruption which can happen during redpanda shutdown.

This corruption is negligible as the file doesn't store any important data. We ignore parsing failures during startup. Only the file header contains variable sized fields and corruption in this section of the file is impossible. The remaining contents of the file are read using fixed size buffers. At most we would have read a bogus file hash or a bogus timestamp. None of which could do harm.

cache::save_access_time_tracker can run concurrently during redpanda shutdown sequence.

cache::start starts a timer which calls
cache::maybe_save_access_time_tracker, then cache::stop call cache::save_access_time_tracker unconditionally.

It is possible that stop invokes cache::save_access_time_tracker while this method is running in a fiber started by the timer. If that's the case, in save_access_time_tracker method we didn't have any mutual exclusion mechanism when writing to the temporary file and the subsequent rename.

During a particular ordering of events, the following error gets logged:

ERROR 2024-02-20 13:47:13,188 [shard 0:main] cloud_storage -
cache_service.cc:893 - failed to save access time tracker during auto
cloud_storage::cache::stop()::(anonymous class)::operator()(auto) const
[eptr:auto = std::exception_ptr]:
std::__1::__fs::filesystem::filesystem_error (error system:2, filesystem
error: rename failed: No such file or directory
["/var/lib/redpanda/data/cloud_storage_cache/accesstime.tmp"]
["/var/lib/redpanda/data/cloud_storage_cache/accesstime"])

The best explanation is the following ordering of events:

1: f1, f2: calls save_access_time_tracker
2: f1, f2: opens accesstime.tmp
3: f1, f2: calls _save_access_time_tracker
4: f1, f2: co_await ss::rename_file(tmp_path.string(),
final_path.string())

In the step 4, only 1 fiber can succeed and the other will fail with the exception.

This is a benign race-condition. However, another less pleasant race-condition is also possible:

1: f1, f2: calls save_access_time_tracker
2: f1, f2: opens accesstime.tmp
3: f1, f2: calls _save_access_time_tracker
4: f1, f2: auto out = co_await ss::make_file_output_stream(std::move(f))
note: make_file_out_stream creates a buffered stream with 8K buffer
5: f1: co_await _access_time_tracker.write(out) note: write does acquire
a exclusive lock so nothing to worry about there
6: f2: co_await _access_time_tracker.write(out);
7: f2: co_await out.flush();
8: f1: co_await out.flush();

Notice that f1 wrote first to the file, then f2 did and flushed the in-memory buffers, then (!) f1 flushed its in-memory buffers.

It is possible that the flush invoked by f1 will overwrite up to 8K bytes at an arbitrary position in the file initially written by f2. Thus, f2 will become corrupted.

Backports Required

Release Notes

Bug Fixes

Fix a potential cloud storage cache access time tracker file corruption during shutdown.

Fixes a potential cloud storage cache access time tracker file corruption which can happen during redpanda shutdown. This corruption is negligible as the file doesn't store any important data. We ignore parsing failures during startup. Only the file header contains variable sized fields and corruption in this section of the file is impossible. The remaining contents of the file are read using fixed size buffers. At most we would have read a bogus file hash or a bogus timestamp. None of which could do harm. `cache::save_access_time_tracker` can run concurrently during redpanda shutdown sequence. `cache::start` starts a timer which calls `cache::maybe_save_access_time_tracker`, then `cache::stop` call `cache::save_access_time_tracker` unconditionally. It is possible that stop invokes `cache::save_access_time_tracker` while this method is running in a fiber started by the timer. If that's the case, in `save_access_time_tracker` method we didn't have any mutual exclusion mechanism when writing to the temporary file and the subsequent rename. During a particular ordering of events, the following error gets logged: ``` ERROR 2024-02-20 13:47:13,188 [shard 0:main] cloud_storage - cache_service.cc:893 - failed to save access time tracker during auto cloud_storage::cache::stop()::(anonymous class)::operator()(auto) const [eptr:auto = std::exception_ptr]: std::__1::__fs::filesystem::filesystem_error (error system:2, filesystem error: rename failed: No such file or directory ["/var/lib/redpanda/data/cloud_storage_cache/accesstime.tmp"] ["/var/lib/redpanda/data/cloud_storage_cache/accesstime"]) ``` The best explanation is the following ordering of events: 1: f1, f2: calls save_access_time_tracker 2: f1, f2: opens accesstime.tmp 3: f1, f2: calls _save_access_time_tracker 4: f1, f2: co_await ss::rename_file(tmp_path.string(), final_path.string()) In the step 4, only 1 fiber can succeed and the other will fail with the exception. This is a benign race-condition. However, another less pleasant race-condition is also possible: 1: f1, f2: calls save_access_time_tracker 2: f1, f2: opens accesstime.tmp 3: f1, f2: calls _save_access_time_tracker 4: f1, f2: auto out = co_await ss::make_file_output_stream(std::move(f)) note: make_file_out_stream creates a buffered stream with 8K buffer 5: f1: co_await _access_time_tracker.write(out) note: write does acquire a exclusive lock so nothing to worry about there 6: f2: co_await _access_time_tracker.write(out); 7: f2: co_await out.flush(); 8: f1: co_await out.flush(); Notice that f1 wrote first to the file, then f2 did and flushed the in-memory buffers, then (!) f1 flushed its in-memory buffers. It is possible that the flush invoked by f1 will overwrite up to 8K bytes at an arbitrary position in the file initially written by f2. Thus, f2 will become corrupted.

Lazin

LGTM

nvartolomei requested a review from Lazin February 20, 2024 17:02

github-actions bot added the area/redpanda label Feb 20, 2024

Lazin approved these changes Feb 20, 2024

View reviewed changes

nvartolomei merged commit c03827e into redpanda-data:dev Feb 20, 2024
17 checks passed

nvartolomei deleted the nv/cloud-cache-atime-mutex branch February 20, 2024 20:52

renovate bot mentioned this pull request May 4, 2024

feat(github-release)!: Update redpanda-operator to v24.1.6 otosky/home-ops#1232

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cloud_storage: protect atime writes with a mutex #16648

cloud_storage: protect atime writes with a mutex #16648

nvartolomei commented Feb 20, 2024

Lazin left a comment

cloud_storage: protect atime writes with a mutex #16648

cloud_storage: protect atime writes with a mutex #16648

Conversation

nvartolomei commented Feb 20, 2024

Backports Required

Release Notes

Bug Fixes

Lazin left a comment

Choose a reason for hiding this comment