
Proposal to change the semantics of the last stable offset to always mean committed offset rather than "raft last visible index" #18244

Closed
nvartolomei opened this issue May 3, 2024 · 4 comments
Labels
area/cloud-storage Shadow indexing subsystem area/kafka area/replication kind/enhance New feature or request

Comments

nvartolomei (Contributor) commented May 3, 2024

In the write caching design we assumed that the last stable offset is always equal to or less than the committed offset, because when transactions are in use we always wait for commands to be applied to the STMs, and commands above the committed offset are never applied.

The archival/tiered storage correctness argument builds on that. Because we upper-bound the tiered storage data to the last stable offset, it should never see suffix truncations caused by data loss. Tiered storage doesn't have a concept of suffix truncation, so if one happened it would lead to violations of correctness properties and diverging logs/undefined behavior.

model::offset partition::last_stable_offset() const {
    if (_rm_stm) {
        return _rm_stm->last_stable_offset();
    }
    return high_watermark();
}

Now, after unrelated auditing of the code base, I have discovered that we didn't account for what happens when transactions are disabled, i.e. when rm_stm is not present on a partition. In that case the last stable offset is defined to be the high watermark.

That might cause problems when write caching is in use. One possible fix is to bound archival to the minimum of the last stable offset and the raft committed offset. All other potential users of the LSO should also keep this in mind.

What if we instead redefine what the LSO means when write caching is in use, regardless of whether rm_stm exists? We can make it always return the minimum of the traditional LSO (below undecided transactions) and the raft committed offset.

This would keep the partition interface lean without introducing another offset at this layer. It might come as a surprise that the LSO doesn't match the HWM when no transactions are in use, but that is already the case today when write caching is in use, or when rm_stm lags behind applying commands (ref: auto last_applied = last_applied_offset();).
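The proposed semantics can be sketched as follows (a simplified, hypothetical model; the real partition/rm_stm interfaces and offset types are not shown):

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

using offset_t = int64_t;

// Simplified model of the proposal: the LSO reported by the partition is
// the minimum of the traditional LSO (first offset of an undecided
// transaction if one exists, otherwise the high watermark) and the raft
// committed offset. Data above the committed offset can still be
// suffix-truncated, so it must never be reported as stable.
offset_t proposed_last_stable_offset(std::optional<offset_t> undecided_tx_begin,
                                     offset_t high_watermark,
                                     offset_t raft_committed) {
    offset_t traditional_lso = undecided_tx_begin.value_or(high_watermark);
    return std::min(traditional_lso, raft_committed);
}
```

With write caching and no transactions, the high watermark can run ahead of the raft committed offset; the min() clamp is what keeps archival from uploading a suffix that could later be truncated.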

JIRA Link: CORE-2766

nvartolomei (Contributor, Author)

cc @bharathv

bharathv (Contributor) commented May 3, 2024

When you say write caching, I assume you are using it as an umbrella term for acks=0/1 and actual write caching with acks=all?

That might cause problems when write caching is in use. One possible fix is to bound the archival to the minimum between last stable offset and raft committed offset. All other potential users of LSO should also keep this in mind.

TIL, that seems sketchy. I don't fully understand that code, but what happens when there is a truncation today during write caching? Always using the raft committed offset (like the STMs do) seems less controversial to me. Even with rm_stm, it still returns the high watermark equivalent if there is no acks=all data; doesn't that pose any problems?

What if we instead redefine what LSO means when write caching is in use regardless of whether rm stm exists? We can make it always return the least between the traditional LSO (below undecided transactions) and raft committed offset.

I think that slows down consumers of acks=0/1 data: they have to wait for data to be raft committed before they can consume it. I remember changing it at some point and a bunch of acks=0/1 tests failed.

nvartolomei (Contributor, Author) commented May 7, 2024

@Lazin, please read this thread. tl;dr: I believe we need to define a "safe_to_upload" offset method in cloud storage which is the minimum of the LSO and the raft committed offset. Nothing guarantees that LSO <= raft committed offset, but it seems that cloud storage correctness (wrongly) assumes it does.
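A minimal sketch of such a helper (safe_to_upload_offset is a hypothetical name; the real archival interface differs):

```cpp
#include <algorithm>
#include <cstdint>

using offset_t = int64_t;

// Hypothetical cloud-storage guard: the archiver uploads only up to the
// minimum of the last stable offset and the raft committed offset, so it
// never persists data that raft could still suffix-truncate.
offset_t safe_to_upload_offset(offset_t last_stable, offset_t raft_committed) {
    return std::min(last_stable, raft_committed);
}
```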

@nvartolomei nvartolomei added the area/cloud-storage Shadow indexing subsystem label May 7, 2024
nvartolomei added a commit to nvartolomei/redpanda that referenced this issue May 7, 2024
The archival/tiered storage correctness argument builds on the
(wrong) assumption that the LSO is monotonic. Tiered storage doesn't
have a concept of suffix truncation, so if one happened it would lead
to violations of correctness properties and diverging logs/undefined
behavior.

However, we have discovered that this property does not hold if there
are no in-progress transactions and acks=0/1 or write caching is in
use, because the LSO falls back to the "last visible index"[^1], which
can get truncated.

Ref redpanda-data#18244

[^1]: https://github.com/redpanda-data/redpanda/blob/88ac775f9f7954330732024abfa6e9ed5c9c11fd/src/v/cluster/rm_stm.cc#L1322
nvartolomei added a commit to nvartolomei/redpanda that referenced this issue May 7, 2024
Lazin (Contributor) commented May 7, 2024

The assumption was correct when it was introduced; because of that, this might not be the only place where we make it.

nvartolomei added a commit to nvartolomei/redpanda that referenced this issue May 8, 2024
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue May 10, 2024
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue May 10, 2024
Lazin pushed a commit to Lazin/redpanda that referenced this issue Jun 1, 2024
Lazin pushed a commit to Lazin/redpanda that referenced this issue Jun 12, 2024