write caching follow ups #17823
Conversation
/dt
new failures in https://buildkite.com/redpanda/redpanda/builds/47710#018ed0df-5e9c-4def-92d2-6d8f9be3e178:
new failures in https://buildkite.com/redpanda/redpanda/builds/47710#018ed0df-5e9e-4b98-bd48-33c179af6e46:
new failures in https://buildkite.com/redpanda/redpanda/builds/47710#018ed0df-5e99-4ba7-ba9b-446532fb1bd6:
new failures in https://buildkite.com/redpanda/redpanda/builds/47710#018ed0ff-269b-4210-8327-e536ae44a667:
new failures in https://buildkite.com/redpanda/redpanda/builds/47710#018ed0ff-2698-4763-86a4-d63139bb0b83:
new failures in https://buildkite.com/redpanda/redpanda/builds/47770#018ed946-dd6b-4674-815c-52b27c2ff231:
new failures in https://buildkite.com/redpanda/redpanda/builds/47771#018eda06-c969-4fb8-9720-faf1980d0d60:
force-pushed from 6d6181e to 0f9b2b8
/dt
force-pushed from 0eff963 to 6e44680
/dt
src/v/raft/consensus.cc (Outdated)
```cpp
const auto& conf = _configuration_manager.get_latest().current_config();
if (conf.voters.size() == 1) {
    // single participant raft group, ensuring indices advance
    auto dirty_offset = _log->offsets().dirty_offset;
    process_append_entries_reply(
      _self.id(),
      append_entries_reply{
        .target_node_id = _self,
        .node_id = _self,
        .group = _group,
        .term = _term,
        .last_flushed_log_index = _flushed_offset,
        .last_dirty_log_index = dirty_offset,
        .result = reply_result::success},
      follower_req_seq{0},
      dirty_offset);
}
```
We were chatting about making this part of replicate_stm. In this form it is incorrect: we cannot simply check the voter count in the current configuration.
Chatted offline: the check was an optimization, but we decided to make it unconditional and moved the code around a little in the stm.
rpk bits look good to me 👍 I'll leave the approval to the core team.
force-pushed from 6e44680 to eea0d47
Checking the PR, we have updates for docs on:
Is that correct?
@Deflaimun Correct, any references to configs / topic properties in write caching should be updated to use `write_caching_default`.
force-pushed from eea0d47 to 50adf9a
```cpp
maybe_update_majority_replicated_index();
maybe_update_leader_commit_idx();
```
🔥
In a single-participant raft group, the commit index is never updated with lower ack levels. Currently, commit index recomputation is only done on an append entries response, which doesn't happen with a single replica + lower ack levels. (Test added in a subsequent commit.)
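To make the failure mode concrete, here is a minimal sketch of the fix idea: with one voter, the leader's own append is already the majority, so the commit index can be recomputed on the local append path instead of waiting for a follower reply that never arrives. All names are invented for this sketch except maybe_update_leader_commit_idx(), which appears in the diff above; this is not the actual Redpanda code.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical illustration only.
struct single_replica_leader {
    int64_t _majority_replicated_index{-1};

    void maybe_update_leader_commit_idx() {
        // in the real code this recomputes the commit index from the
        // replication state; elided here
    }

    // With a single voter, advance the majority-replicated index and the
    // commit index immediately on local append, since no append_entries
    // response will ever trigger the recomputation.
    void on_local_append(int64_t new_dirty_offset) {
        _majority_replicated_index
          = std::max(_majority_replicated_index, new_dirty_offset);
        maybe_update_leader_commit_idx();
    }
};
```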
The new configuration name is write_caching_default. Accepted values = {true, false, disabled}:
- true: write caching is on by default for all topics
- false: write caching is off by default for all topics
- disabled: write caching is disabled for all topics, even when property overrides are present

Switching to true/false to be consistent with other configurations.
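A small sketch of the tri-state semantics described above; the enum and function names are invented for illustration and are not Redpanda's actual types:

```cpp
#include <optional>

// Hypothetical types for illustration only.
enum class write_caching_mode { on, off, disabled };

// Resolve the effective setting for a topic given the cluster default and
// an optional per-topic property override.
bool write_caching_enabled(
  write_caching_mode cluster_default, std::optional<bool> topic_override) {
    if (cluster_default == write_caching_mode::disabled) {
        return false; // kill switch: per-topic overrides are ignored
    }
    if (topic_override.has_value()) {
        return *topic_override; // per-topic property wins over the default
    }
    return cluster_default == write_caching_mode::on;
}
```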
The original implementation relied on a CV that used a high-resolution timer, with a timer per raft group, which proved too expensive at scale. Also, the flusher fired periodically regardless of whether there was pending unflushed data from lower acks. This commit redoes it using a deferred flush based on a lowres clock, where flushes are scheduled only if unflushed data is appended. Additionally, the timers are canceled in favor of background flushes that are scheduled when flush.bytes is hit. This removes reliance on timers as much as possible; the timers should never fire in the default acks=all use cases.
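A rough sketch of that scheduling model, using invented names and plain std::chrono in place of Seastar's lowres clock; this is an assumption for the sketch, not the actual implementation:

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <optional>

// Hypothetical sketch: arm a flush deadline only when unflushed data
// arrives, and kick a background flush early if the byte threshold is hit.
class deferred_flusher {
public:
    deferred_flusher(
      std::chrono::milliseconds flush_ms,
      size_t flush_bytes,
      std::function<void()> background_flush)
      : _flush_ms(flush_ms)
      , _flush_bytes(flush_bytes)
      , _background_flush(std::move(background_flush)) {}

    // Called on appends with relaxed acks. With acks=all every batch is
    // flushed synchronously, so neither path below ever triggers.
    void on_unflushed_append(size_t bytes) {
        _unflushed_bytes += bytes;
        if (_unflushed_bytes >= _flush_bytes) {
            fire(); // size threshold reached: flush in the background now
        } else if (!_deadline) {
            // arm a deadline only while unflushed data is pending, instead
            // of a periodic timer that fires regardless
            _deadline = std::chrono::steady_clock::now() + _flush_ms;
        }
    }

    // Polled from the event loop; flushes once the deadline passes.
    void tick() {
        if (_deadline && std::chrono::steady_clock::now() >= *_deadline) {
            fire();
        }
    }

private:
    void fire() {
        _unflushed_bytes = 0;
        _deadline.reset();
        _background_flush();
    }

    std::chrono::milliseconds _flush_ms;
    size_t _flush_bytes;
    std::function<void()> _background_flush;
    size_t _unflushed_bytes{0};
    std::optional<std::chrono::steady_clock::time_point> _deadline;
};
```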
The test waits until a barrier is attained up to the dirty offset: BOOST_REQUIRE_EQUAL(r.value(), leader_offsets.committed_offset); A barrier doesn't guarantee that out of the box; calling it may return an offset below the dirty offset, up to which all heartbeats agree. If we want a barrier up to a desired offset, we have to ensure the barrier returns >= the desired offset; this adjusts the calling code to do so. The underlying reason is that flush_log() (called from the barrier) may not always guarantee a flush: it may return right away if there is already a flush in progress (and hence nothing to flush). Meanwhile, a round of heartbeats guarantees barrier progress, and the barrier returns an earlier offset that doesn't include the dirty offset still pending a flush.
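The adjusted calling pattern, sketched with hypothetical names and a callback standing in for the real raft barrier API:

```cpp
#include <cstdint>
#include <functional>
#include <optional>

// Hypothetical sketch: retry the barrier until the offset it returns covers
// the desired offset, since a single call may return an earlier offset that
// excludes dirty data still pending a flush.
std::optional<int64_t> barrier_until(
  std::function<std::optional<int64_t>()> linearizable_barrier,
  int64_t desired_offset) {
    while (true) {
        auto r = linearizable_barrier();
        if (!r) {
            return std::nullopt; // barrier failed (e.g. leadership change)
        }
        if (*r >= desired_offset) {
            return r; // barrier now covers everything up to desired_offset
        }
        // heartbeat rounds guarantee forward progress, so try again
    }
}
```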
No logic changes; will be used in the next commit.
If the replication notification happens before wait_for_majority() is called, the waiter is never resolved. This adds an additional check before creating the waiter instance.
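The race and the fix in a hedged sketch; the names and the promise-based plumbing are invented for illustration, not the actual raft code:

```cpp
#include <cstdint>
#include <future>
#include <map>

// Hypothetical sketch of the fix: consult the already-replicated index
// before registering a waiter, so a notification that fired earlier is
// not missed forever.
struct majority_waiters {
    int64_t _majority_replicated_index{-1};
    std::multimap<int64_t, std::promise<void>> _waiters;

    std::future<void> wait_for_majority(int64_t offset) {
        std::promise<void> p;
        auto f = p.get_future();
        if (offset <= _majority_replicated_index) {
            // replication already completed before we were called; resolve
            // immediately instead of waiting for a notification that will
            // never come again
            p.set_value();
            return f;
        }
        _waiters.emplace(offset, std::move(p));
        return f;
    }

    void notify(int64_t replicated_up_to) {
        _majority_replicated_index = replicated_up_to;
        auto end = _waiters.upper_bound(replicated_up_to);
        for (auto it = _waiters.begin(); it != end; ++it) {
            it->second.set_value();
        }
        _waiters.erase(_waiters.begin(), end);
    }
};
```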
force-pushed from 50adf9a to 53b82ba
Failure unrelated: #16198
🕺
```cpp
// Truncation is sealed once the following events happen
// - There is a new entry from a different term replacing the
//   appended entries.
// - The new entry is committed.
// The second condition is important because without it the
// original entries may be reinstated after another leadership
// change. For example:
//
// 5 replicas A, B, C, D, E, leader=A, term=5
//
// A - replicate([base: 10, last: 20]) term=5
// A - append_local([10, 20]) term=5, dirty=20
// A -> B - append_entries([10, 20]) term=5, dirty=20
// A - frozen briefly, cannot send further append_entries
// (C, D, E), elect C as leader, term=6, dirty=9
// C - append_local([10]) - configuration batch - term=6
// C -> A append_entries([10]), term=6
// C - crashes
// A truncates, term=6, dirty_offset=10 -> First truncation
// (B, D, E), elect B as leader, term=7, dirty=20
// B -> A, D, E append_entries([10, 20])
// committed offset = 20
//
// In the above example, if we do not wait for the committed
// offset and stop at the first truncation event, we risk an
// incorrect truncation detection.
```
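A minimal sketch of the two-condition rule the comment describes, with invented names and types; this is an illustration of the idea, not Redpanda's actual detection code:

```cpp
#include <cstdint>

// Hypothetical illustration only.
struct truncation_probe {
    int64_t appended_term;  // term of the entries we appended
    int64_t last_appended;  // last offset of the appended range

    // Truncation of our entries is "sealed" only when both hold:
    // (1) the offset range is now owned by an entry from a different term,
    // (2) that replacing entry is committed, so it can no longer be
    //     superseded by a later leader that still carries the old entries.
    bool truncation_sealed(int64_t term_at_offset, int64_t committed) const {
        return term_at_offset != appended_term && committed >= last_appended;
    }
};
```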
🔥
This PR includes a bunch of fixes found during write caching testing.
Backports Required
Release Notes