
Proposal to change the semantics of the last stable offset to always mean committed offset rather than "raft last visible index" #18244

Closed
nvartolomei opened this issue May 3, 2024 · 4 comments
Labels
area/cloud-storage Shadow indexing subsystem area/kafka area/replication kind/enhance New feature or request

Comments

nvartolomei (Contributor) commented May 3, 2024

In the write caching design we assumed that the last stable offset is always equal to or less than the committed offset, because when transactions are in use we always wait for commands to be applied to the STMs, and commands above the committed offset are never applied.

The archival/tiered storage correctness argument builds on that. Because we upper-bound the tiered storage data to the last stable offset, it should never see suffix truncations caused by data loss. Tiered storage doesn't have a concept of suffix truncation, so if one happened it would lead to violations of correctness properties and diverging logs/undefined behavior.

model::offset partition::last_stable_offset() const {
    if (_rm_stm) {
        return _rm_stm->last_stable_offset();
    }
    return high_watermark();
}

Now, after unrelated auditing of the code base, I have discovered that we didn't account for what happens when transactions are disabled, i.e. when rm_stm is not present on a partition. In that case the last stable offset is defined to be the high watermark.

That might cause problems when write caching is in use. One possible fix is to bound archival to the minimum of the last stable offset and the raft committed offset. All other potential users of the LSO should also keep this in mind.

What if we instead redefine what the LSO means when write caching is in use, regardless of whether rm_stm exists? We can make it always return the minimum of the traditional LSO (below undecided transactions) and the raft committed offset.

This would keep the partition interface lean without introducing another offset at this layer. It might come as a surprise that the LSO doesn't match the HWM when no transactions are in use, but that is already the case today when write caching is in use, or when rm_stm lags behind applying commands (ref: auto last_applied = last_applied_offset();).
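The proposed semantics can be sketched as follows (a simplified, hypothetical model; the real partition/rm_stm interfaces and offset types are not shown):

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

using offset_t = int64_t;

// Simplified model of the proposal: the LSO reported by the partition is
// the minimum of the traditional LSO (first offset of an undecided
// transaction if one exists, otherwise the high watermark) and the raft
// committed offset. Data above the committed offset can still be
// suffix-truncated, so it must never be reported as stable.
offset_t proposed_last_stable_offset(std::optional<offset_t> undecided_tx_begin,
                                     offset_t high_watermark,
                                     offset_t raft_committed) {
    offset_t traditional_lso = undecided_tx_begin.value_or(high_watermark);
    return std::min(traditional_lso, raft_committed);
}
```

With write caching and no transactions, the high watermark can run ahead of the raft committed offset; the min() clamp is what keeps archival from uploading a suffix that could later be truncated.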

JIRA Link: CORE-2766

nvartolomei (Contributor, Author)

cc @bharathv

bharathv (Contributor) commented May 3, 2024

When you say write caching, I assume you are using it as an umbrella term for acks=0/1 and actual write caching with acks=all?

That might cause problems when write caching is in use. One possible fix is to bound the archival to the minimum between last stable offset and raft committed offset. All other potential users of LSO should also keep this in mind.

TIL, that seems sketchy. I don't fully understand that code, but what happens when there is a truncation today during write caching? Always using the raft committed offset (like the STMs do) seems less controversial to me. Even with rm_stm, it still returns the high watermark equivalent if there is no acks=all data; doesn't that pose any problems?

What if we instead redefine what LSO means when write caching is in use regardless of whether rm stm exists? We can make it always return the least between the traditional LSO (below undecided transactions) and raft committed offset.

I think that slows down consumers of acks=0/1 data: they have to wait for data to be raft committed before they can consume it. I remember changing it at some point and a bunch of acks=0/1 tests failed.

nvartolomei (Contributor, Author) commented May 7, 2024

@Lazin, please read this thread. tl;dr: I believe we need to define a "safe_to_upload" offset method in cloud storage which is the minimum of the LSO and the raft committed offset. Nothing guarantees that LSO <= raft committed offset, but it seems that cloud storage correctness (wrongly) assumes it does.
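A minimal sketch of such a helper (safe_to_upload_offset is a hypothetical name; the real archival interface differs):

```cpp
#include <algorithm>
#include <cstdint>

using offset_t = int64_t;

// Hypothetical cloud-storage guard: the archiver uploads only up to the
// minimum of the last stable offset and the raft committed offset, so it
// never persists data that raft could still suffix-truncate.
offset_t safe_to_upload_offset(offset_t last_stable, offset_t raft_committed) {
    return std::min(last_stable, raft_committed);
}
```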

@nvartolomei nvartolomei added the area/cloud-storage Shadow indexing subsystem label May 7, 2024
nvartolomei added a commit to nvartolomei/redpanda that referenced this issue May 7, 2024
The archival/tiered storage correctness argument builds on the
(wrong) assumption that the LSO is monotonic. Tiered storage doesn't
have a concept of suffix truncation, so if one happened it would lead
to violations of correctness properties and diverging logs/undefined
behavior.

However, we have discovered that this property does not hold if there
are no in-progress transactions and acks=0/1 or write caching is in
use, because the LSO falls back to the "last visible index"[^1], which
can get truncated.

Ref redpanda-data#18244

[^1]: https://github.com/redpanda-data/redpanda/blob/88ac775f9f7954330732024abfa6e9ed5c9c11fd/src/v/cluster/rm_stm.cc#L1322
nvartolomei added a commit to nvartolomei/redpanda that referenced this issue May 7, 2024
Lazin (Contributor) commented May 7, 2024

The assumption was correct when it was introduced; because of that, this might not be the only place where we make it.

nvartolomei added a commit to nvartolomei/redpanda that referenced this issue May 8, 2024
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue May 10, 2024
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue May 10, 2024
Lazin pushed a commit to Lazin/redpanda that referenced this issue Jun 1, 2024
Lazin pushed a commit to Lazin/redpanda that referenced this issue Jun 12, 2024