Fix/inconsistent flushed offset #16105

Lazin · 2024-01-15T20:21:22Z

This PR improves logging to provide more information. It also fixes a problem in the consensus::flush_log method. The method tries to detect a truncation that happens concurrently with the flush, but the existing check fails when the concurrent truncation is followed by a log append.

Backports Required

Release Notes

Bug Fixes

Fix assertion triggered by interleaving of log flush and log truncation followed by append

vbotbuildovich · 2024-01-15T22:52:39Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/43782#018d0f09-186b-49dd-af9b-32a869ef08be

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/43782#018d0f09-1865-48d4-a7bd-8f1402ee09b0

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44347#018d45c8-e672-4a85-a4c6-9757e6b09250

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44347#018d564f-e7b5-49e2-b093-d15b559c27b1

emaxerrno · 2024-01-16T06:03:02Z

@Lazin - should we add this to the opfuzz class, that was the whole point of that class, to drive sequence of events in append/truncate/flush/etc - though i suppose maybe not concurrent? curious on your take.

Lazin · 2024-01-16T09:05:39Z

> @Lazin - should we add this to the opfuzz class, that was the whole point of that class, to drive sequence of events in append/truncate/flush/etc - though i suppose maybe not concurrent? curious on your take.

Append/truncate operations often happen on the follower when it receives a message with a higher term and an offset which is smaller than the committed offset. The problem is that opfuzz runs storage code, and storage works correctly in this case. The actual issue is that consensus assumes something about the storage which is not always the case. I think that without the assertion the code would work just fine.
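The truncate-then-append sequence described above can be sketched with a toy follower log. This is only an illustration of the general Raft rule, not Redpanda's code: `entry`, `toy_log`, and `follower_append` are hypothetical names, and offsets are assumed to be contiguous and to start at 0.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy types; none of these names come from Redpanda.
struct entry {
    int64_t offset;
    int64_t term;
};

struct toy_log {
    std::vector<entry> entries; // assumption: contiguous offsets starting at 0
    uint64_t truncations = 0;   // number of suffix truncations performed

    int64_t dirty_offset() const {
        return entries.empty() ? -1 : entries.back().offset;
    }

    // Suffix truncation: drop every entry with offset >= first_removed.
    void truncate_from(int64_t first_removed) {
        while (!entries.empty() && entries.back().offset >= first_removed) {
            entries.pop_back();
        }
        ++truncations;
    }
};

// Follower-side handling of replicated entries: on the first term mismatch
// the conflicting suffix is truncated, then the leader's entries are appended.
void follower_append(toy_log& log, const std::vector<entry>& incoming) {
    for (const entry& e : incoming) {
        if (e.offset <= log.dirty_offset()) {
            const int64_t local_term
              = log.entries[static_cast<std::size_t>(e.offset)].term;
            if (local_term == e.term) {
                continue; // entry already present, nothing to do
            }
            log.truncate_from(e.offset);
        }
        log.entries.push_back(e);
    }
}
```

With a log holding offsets 0 to 5 in term 8, a request carrying offsets 4 to 6 in term 9 truncates the suffix once and then appends, so the dirty offset ends up higher than before even though a truncation happened in between.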

andrwng · 2024-01-16T23:40:15Z

src/v/raft/consensus.cc

-    if (flushed_up_to > lstats.dirty_offset) {
+    if (
+      flushed_up_to > lstats.dirty_offset
+      || flushed_offset_term < lstats.dirty_offset_term) {


Is it still possible for the term to have stayed the same, but we still see the issue? I'm thinking something like this?

                  v committed  v dirty
... term 8 offset 31] t9: [32, 35]

... term 8 offset 31] t9: [32, 35]
flush called, flushed_up_to=35, flushed_offset_term=9
here we have a scheduling point

... term 8 offset 31]
truncate after 31

                  v committed               v dirty
... term 8 offset 31] t8: [32, 32] t9: [33, 36]
append records from term 8 and term 9

                  v committed               v dirty
... term 8 offset 31] t8: [32, 32] t9: [33, 36]
flush continues, lstats.dirty_offset=36, lstats.dirty_offset_term=9
flushed_up_to(35) <= lstats.dirty_offset(36)
flushed_offset_term(9) >= lstats.dirty_offset_term(9)
... so we make it past the early returns
committed offset(31) < _flushed_offset(35)
... so we hit the assertion

I'm wondering what happens if we change the condition to if flushed_up_to > lstats.committed_offset: co_return flushed::yes (assuming the storage layer flush() works as expected)? Sure, it's a naive approach to avoiding the assertion, but it's not obvious to me why it wouldn't Just Work.

Lazin · 2024-01-17

I don't think this is possible. Why would Raft truncate mid-term?

andrwng · 2024-01-17

Thinking about this more, agreed: this shouldn't be possible with Raft. When a record is replaced via truncation, it should only be replaced with records of a higher term.

mmaslankaprv · 2024-01-24

Truncation in the same term is still possible if the request is redelivered or reordered.
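The same-term scenario from this thread can be replayed against the condition in the diff. The sketch below is a simplified model under stated assumptions: `log_stats` here is a hypothetical plain struct holding only the three values the thread mentions, and `assertion_would_fire` encodes the check as the thread describes it (committed offset smaller than the flushed offset), not the exact assertion in consensus.cc.

```cpp
#include <cstdint>

// Hypothetical stand-in for the storage offsets read after the flush resumes.
struct log_stats {
    int64_t committed_offset;
    int64_t dirty_offset;
    int64_t dirty_offset_term;
};

// Early-return condition from the diff: true means the flush concluded that
// a concurrent truncation took place.
bool truncation_detected(
  int64_t flushed_up_to, int64_t flushed_offset_term, const log_stats& after) {
    return flushed_up_to > after.dirty_offset
           || flushed_offset_term < after.dirty_offset_term;
}

// The failing check as described in the thread:
// committed offset(31) < _flushed_offset(35).
bool assertion_would_fire(int64_t flushed_up_to, const log_stats& after) {
    return after.committed_offset < flushed_up_to;
}
```

With flushed_up_to=35 and flushed_offset_term=9, the state after "truncate after 31, append t8: [32, 32] and t9: [33, 36]" is `{31, 36, 9}`: `truncation_detected` returns false and `assertion_would_fire` returns true. If the re-appended suffix ends in term 10 instead, the term comparison catches it.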


The 'flush_log' method checks the dirty offset before and after the flush. If a truncation happened concurrently, the method detects it by comparing the offsets (in this case the offset after the flush is smaller than the offset before the flush). It then uses an assertion to check that the flushed offset is not larger than the committed offset.

This may not work as expected when a concurrent truncation is followed by an append. For instance, before the truncation the dirty offset is 32. The 'flush_log' method records flushed offset 32 and calls log->flush. Before 'flush_log' is resumed, the log gets truncated and then a new value gets added. So the committed offset of the log is now 31 but the dirty offset is 32. Because of that, the method is not able to detect the concurrent truncation and triggers the assertion.

This commit fixes the problem by checking not only the dirty offset but also a truncation counter. If the counter was incremented, then the log was truncated and new data was added to the end of the log.
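The difference between the offset-only check and the counter-based check can be shown with a small synchronous model. This is a sketch, not the code from the commit: `toy_log`, `flush_snapshot`, and the helper functions are hypothetical, the flush I/O itself is not modeled, and the scheduling point is simulated by calling truncate and append between taking the snapshot and evaluating the check.

```cpp
#include <cstdint>

// Hypothetical toy log, starting in the state used in the commit message.
struct toy_log {
    int64_t committed_offset = 31;   // last offset known to be durable
    int64_t dirty_offset = 32;       // last appended offset
    uint64_t truncation_counter = 0; // bumped by every suffix truncation

    void truncate_after(int64_t offset) {
        dirty_offset = offset;
        if (committed_offset > offset) {
            committed_offset = offset;
        }
        ++truncation_counter;
    }

    void append(int64_t count) { dirty_offset += count; }
};

// Values recorded by the flush before it yields.
struct flush_snapshot {
    int64_t flushed_up_to;
    uint64_t truncation_counter;
};

flush_snapshot begin_flush(const toy_log& log) {
    return {log.dirty_offset, log.truncation_counter};
}

// Old-style check: only notices truncation if the dirty offset went down.
bool truncated_by_offset(const flush_snapshot& s, const toy_log& log) {
    return s.flushed_up_to > log.dirty_offset;
}

// Counter-based check: also notices truncation followed by an append.
bool truncated_by_counter(const flush_snapshot& s, const toy_log& log) {
    return s.flushed_up_to > log.dirty_offset
           || s.truncation_counter != log.truncation_counter;
}
```

Replaying the example: take a snapshot with `begin_flush` (flushed_up_to=32), call `truncate_after(31)` and `append(1)`, and the log is back at dirty offset 32 with committed offset 31. `truncated_by_offset` returns false, while `truncated_by_counter` returns true because the counter changed.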

Lazin · 2024-01-26T11:07:53Z

Update: added log counter and used it to detect concurrent truncation.

Lazin · 2024-01-29T16:08:56Z

CI failures: #16308 and #15679

piyushredpanda · 2024-02-06T14:03:30Z

/ci-repeat

vbotbuildovich · 2024-02-06T17:39:34Z

/backport v23.3.x

Lazin requested review from andrwng and mmaslankaprv January 15, 2024 20:21

github-actions bot added the area/redpanda label Jan 15, 2024

andrwng reviewed Jan 17, 2024

Lazin requested a review from andrwng January 24, 2024 13:04

mmaslankaprv reviewed Jan 24, 2024

src/v/raft/consensus.cc (outdated, resolved)

Lazin added 5 commits January 26, 2024 04:29

storage: Add trace logging

1aef8ff

add extra logging to flush and truncate methods

storage: Add get_log_truncation_counter method

8d781ab

Method returns the counter which is guaranteed to be incremented after suffix truncation. It's not guaranteed to be incremented exactly once per logical truncation. It can only be used to detect that truncation actually happened.
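Because the counter may move by more than one per logical truncation, callers should compare it for inequality rather than expect an exact increment. The sketch below illustrates one way such behavior could arise; it is an assumption made for illustration, not a description of Redpanda's storage layer. `segment`, `segmented_log`, and `truncated_since` are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical segment covering the inclusive offset range [base, last].
struct segment {
    int64_t base;
    int64_t last;
};

struct segmented_log {
    std::vector<segment> segments; // ordered by offset
    uint64_t truncation_counter = 0;

    // One logical suffix truncation. In this toy model the counter is bumped
    // once per segment that is removed or shortened.
    void truncate_after(int64_t offset) {
        while (!segments.empty() && segments.back().last > offset) {
            if (segments.back().base > offset) {
                segments.pop_back(); // segment lies entirely past the cut
            } else {
                segments.back().last = offset; // cut falls inside the segment
            }
            ++truncation_counter;
        }
    }
};

// Safe usage: only ask whether the counter changed since it was sampled.
bool truncated_since(uint64_t before, const segmented_log& log) {
    return log.truncation_counter != before;
}
```

Truncating a log with segments [0, 9], [10, 19], [20, 29] after offset 14 bumps the counter twice in this model, so a test written as `counter == before + 1` would fail while `truncated_since(before, log)` still reports the truncation.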

storage: Update storage_e2e_test

6a86a83

Use get_log_truncation_counter method in read_write_truncate test.

storage: Update storage_e2e_test

af0e2c2

Use get_truncation_counter method in write_truncate_compact method.

Lazin force-pushed the fix/inconsistent-flushed-offset branch from 5ef73e3 to 8a4b612 Compare January 26, 2024 11:06

Lazin requested a review from mmaslankaprv January 26, 2024 11:07

mmaslankaprv approved these changes Feb 6, 2024

Lazin merged commit 23c0ce6 into redpanda-data:dev Feb 6, 2024
17 checks passed

This was referenced Feb 6, 2024

[v23.3.x] assertion in controller log flush #16500

Closed

[v23.3.x] Fix/inconsistent flushed offset #16501

Merged
