Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage: protect atime writes with a mutex #16648

Merged

Conversation

nvartolomei
Copy link
Contributor

Fixes a potential cloud storage cache access time tracker file corruption which can happen during redpanda shutdown.

This corruption is negligible as the file doesn't store any important data. We ignore parsing failures during startup. Only the file header contains variable sized fields and corruption in this section of the file is impossible. The remaining contents of the file are read using fixed size buffers. At most we would have read a bogus file hash or a bogus timestamp. None of which could do harm.

cache::save_access_time_tracker can run concurrently during redpanda shutdown sequence.

cache::start starts a timer which calls
cache::maybe_save_access_time_tracker, then cache::stop call cache::save_access_time_tracker unconditionally.

It is possible that stop invokes cache::save_access_time_tracker while this method is running in a fiber started by the timer. If that's the case, in save_access_time_tracker method we didn't have any mutual exclusion mechanism when writing to the temporary file and the subsequent rename.

During a particular ordering of events, the following error gets logged:

ERROR 2024-02-20 13:47:13,188 [shard 0:main] cloud_storage -
cache_service.cc:893 - failed to save access time tracker during auto
cloud_storage::cache::stop()::(anonymous class)::operator()(auto) const
[eptr:auto = std::exception_ptr]:
std::__1::__fs::filesystem::filesystem_error (error system:2, filesystem
error: rename failed: No such file or directory
["/var/lib/redpanda/data/cloud_storage_cache/accesstime.tmp"]
["/var/lib/redpanda/data/cloud_storage_cache/accesstime"])

The best explanation is the following ordering of events:

1: f1, f2: calls save_access_time_tracker
2: f1, f2: opens accesstime.tmp
3: f1, f2: calls _save_access_time_tracker
4: f1, f2: co_await ss::rename_file(tmp_path.string(),
final_path.string())

In the step 4, only 1 fiber can succeed and the other will fail with the exception.

This is a benign race-condition. However, another less pleasant race-condition is also possible:

1: f1, f2: calls save_access_time_tracker
2: f1, f2: opens accesstime.tmp
3: f1, f2: calls _save_access_time_tracker
4: f1, f2: auto out = co_await ss::make_file_output_stream(std::move(f))
note: make_file_out_stream creates a buffered stream with 8K buffer
5: f1: co_await _access_time_tracker.write(out) note: write does acquire
a exclusive lock so nothing to worry about there
6: f2: co_await _access_time_tracker.write(out);
7: f2: co_await out.flush();
8: f1: co_await out.flush();

Notice that f1 wrote first to the file, then f2 did and flushed the in-memory buffers, then (!) f1 flushed its in-memory buffers.

It is possible that the flush invoked by f1 will overwrite up to 8K bytes at an arbitrary position in the file initially written by f2. Thus, f2 will become corrupted.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

Bug Fixes

  • Fix a potential cloud storage cache access time tracker file corruption during shutdown.

Fixes a potential cloud storage cache access time tracker file
corruption which can happen during redpanda shutdown.

This corruption is negligible as the file doesn't store any important
data. We ignore parsing failures during startup. Only the file header
contains variable sized fields and corruption in this section of the
file is impossible.  The remaining contents of the file are read using
fixed size buffers. At most we would have read a bogus file hash or a
bogus timestamp. None of which could do harm.

`cache::save_access_time_tracker` can run concurrently during redpanda
shutdown sequence.

`cache::start` starts a timer which calls
`cache::maybe_save_access_time_tracker`, then `cache::stop` call
`cache::save_access_time_tracker` unconditionally.

It is possible that stop invokes `cache::save_access_time_tracker` while
this method is running in a fiber started by the timer. If that's the
case, in `save_access_time_tracker` method we didn't have any mutual
exclusion mechanism when writing to the temporary file and the
subsequent rename.

During a particular ordering of events, the following error gets logged:

```
ERROR 2024-02-20 13:47:13,188 [shard 0:main] cloud_storage -
cache_service.cc:893 - failed to save access time tracker during auto
cloud_storage::cache::stop()::(anonymous class)::operator()(auto) const
[eptr:auto = std::exception_ptr]:
std::__1::__fs::filesystem::filesystem_error (error system:2, filesystem
error: rename failed: No such file or directory
["/var/lib/redpanda/data/cloud_storage_cache/accesstime.tmp"]
["/var/lib/redpanda/data/cloud_storage_cache/accesstime"])
```

The best explanation is the following ordering of events:

1: f1, f2: calls save_access_time_tracker
2: f1, f2: opens accesstime.tmp
3: f1, f2: calls _save_access_time_tracker
4: f1, f2: co_await ss::rename_file(tmp_path.string(),
   final_path.string())

In the step 4, only 1 fiber can succeed and the other will fail with the
exception.

This is a benign race-condition. However, another less pleasant
race-condition is also possible:

1: f1, f2: calls save_access_time_tracker
2: f1, f2: opens accesstime.tmp
3: f1, f2: calls _save_access_time_tracker
4: f1, f2: auto out = co_await ss::make_file_output_stream(std::move(f))
   note: make_file_out_stream creates a buffered stream with 8K buffer
5: f1: co_await _access_time_tracker.write(out) note: write does acquire
   a exclusive lock so nothing to worry about there
6: f2: co_await _access_time_tracker.write(out);
7: f2: co_await out.flush();
8: f1: co_await out.flush();

Notice that f1 wrote first to the file, then f2 did and flushed the
in-memory buffers, then (!) f1 flushed its in-memory buffers.

It is possible that the flush invoked by f1 will overwrite up to 8K
bytes at an arbitrary position in the file initially written by f2.
Thus, f2 will become corrupted.
Copy link
Contributor

@Lazin Lazin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nvartolomei nvartolomei merged commit c03827e into redpanda-data:dev Feb 20, 2024
17 checks passed
@nvartolomei nvartolomei deleted the nv/cloud-cache-atime-mutex branch February 20, 2024 20:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants