-
Notifications
You must be signed in to change notification settings - Fork 552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix/inconsistent flushed offset #16105
Fix/inconsistent flushed offset #16105
Conversation
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/43782#018d0f09-186b-49dd-af9b-32a869ef08be ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/43782#018d0f09-1865-48d4-a7bd-8f1402ee09b0 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44347#018d45c8-e672-4a85-a4c6-9757e6b09250 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44347#018d564f-e7b5-49e2-b093-d15b559c27b1 |
@Lazin - should we add this to the opfuzz class, that was the whole point of that class, to drive sequence of events in append/truncate/flush/etc - though i suppose maybe not concurrent? curious on your take. |
The append/truncate operations are often happen on the follower when the follower receives the message with higher term and offset which is smaller than committed offset. The problem is that opfuzz runs storage code. And storage works correctly in this case. The problem is that the consensus assumes something about the storage which is not always the case. I think that without the assertion the code would work just fine. |
src/v/raft/consensus.cc
Outdated
if (flushed_up_to > lstats.dirty_offset) { | ||
if ( | ||
flushed_up_to > lstats.dirty_offset | ||
|| flushed_offset_term < lstats.dirty_offset_term) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it still possible for the term to have stayed the same, but we still see the issue? I'm thinking something like this?
v committed v dirty
... term 8 offset 31] t9: [32, 35]
... term 8 offset 31] t9: [32, 35]
flush called, flushed_up_to=35, flushed_offset_term=9
here we have a scheduling point
... term 8 offset 31]
truncate after 31
v committed v dirty
... term 8 offset 31] t8: [32, 32] t9: [33, 36]
append records from term 8 and term 9
v committed v dirty
... term 8 offset 31] t8: [32, 32] t9: [33, 36]
flush continues, lstats.dirty_offset=36, lstats.dirty_offset_term=9
flushed_up_to(35) <= lstats.dirty_offset(36)
flushed_offset_term(9) >= lstats.dirty_offset_term(9)
... so we make it past the early returns
committed offset(31) < _flushed_offset(35)
... so we hit the assertion
I'm wondering what happens if we change the condition to if flushed_up_to > lstats.committed_offset: co_return flushed::yes
(assuming the storage layer flush()
works as expected)? Sure it's a naive approach to avoiding the assertion, but it's not obvious to me why it wouldn't Just Work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is possible. Why would Raft truncate mid-term?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this more, agreed this shouldn't be possible with Raft: when a record is replaced via truncation, it should only be replaced with records of a higher term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
truncation in the same term is still possible if the request is redelivered/reordered
add extra logging to flush and truncate methods
Method returns the counter which is guaranteed to be incremented after suffix truncation. It's not guaranteed to be incremented exactly once per logical truncation. It can only be used to detect that truncation actually happened.
Use get_log_truncation_counter method in read_write_truncate test.
Use get_truncation_counter method in write_truncate_compact method.
The 'flush_log' method checks the dirty offset before and after flush. If the truncation happened concurrently it detects this by comparing offsets (the offset after flush is smaller than the offset before the flush in this case). Then it uses assertion to check if the flushed offset is not smaller than the committed offset. This may not work as expected in case if there is a concurrent truncation and append. For instance, before the truncation the dirty offset is 32. The 'flush_log' method records flushed offset 32 and calls log->flush. Before 'flush_log' is resumed the log gets truncated and then new value gets added. So the committed offset of the log is now 31 but dirty offset is 32. Because of that the method won't be able to detect the concurrent flush operation will trigger an assertion. This commit fixes this problem by checking not only dirty offset but also a truncation counter. If the counter was incremented then the log was truncated and new data was added to the end of the log.
5ef73e3
to
8a4b612
Compare
Update: added log counter and used it to detect concurrent truncation. |
/ci-repeat |
/backport v23.3.x |
Fixes #15201
This PR improves logging to provide more information. It also fixes the problem in the
consensus::flush_log
method. The method is trying to detect the situation when the truncation happens concurrently. But the existing check fails in situation when there is not just concurrent truncation but truncation + log append.Backports Required
Release Notes
Bug Fixes