write caching - raft implementation #17009
Conversation
/dt
new failures in https://buildkite.com/redpanda/redpanda/builds/46028#018e3032-c295-4e2e-be6d-3d91f9ce58d6:
new failures in https://buildkite.com/redpanda/redpanda/builds/46028#018e3032-c29e-4227-b554-d92aede2646f:
new failures in https://buildkite.com/redpanda/redpanda/builds/46028#018e3032-c29b-420f-83f9-a79093d8a238:
new failures in https://buildkite.com/redpanda/redpanda/builds/46028#018e3032-c298-4832-af0a-35f266447f7c:
new failures in https://buildkite.com/redpanda/redpanda/builds/46028#018e3044-542a-46d1-88d3-8e7dfc164641:
new failures in https://buildkite.com/redpanda/redpanda/builds/46028#018e3044-542d-4d78-bcd4-3083b0ee6c82:
new failures in https://buildkite.com/redpanda/redpanda/builds/46028#018e3044-5427-4079-983f-7f77a9c84816:
new failures in https://buildkite.com/redpanda/redpanda/builds/46028#018e3044-5423-4a5b-9679-9c9737240f29:
new failures in https://buildkite.com/redpanda/redpanda/builds/46063#018e33eb-ff62-42c4-a000-93eecdad7c08:
new failures in https://buildkite.com/redpanda/redpanda/builds/46063#018e33eb-ff5f-4aca-a67c-3485657ba532:
new failures in https://buildkite.com/redpanda/redpanda/builds/46063#018e33f0-f080-4819-95a0-89483d36cd13:
new failures in https://buildkite.com/redpanda/redpanda/builds/46063#018e33f0-f07d-4e0f-a584-f2cfc77ea6b6:
new failures in https://buildkite.com/redpanda/redpanda/builds/46063#018e33f0-f07a-46c6-93e2-1928a5e9e412:
new failures in https://buildkite.com/redpanda/redpanda/builds/46080#018e35bf-7186-40d9-bdb8-78ad05e30bfd:
new failures in https://buildkite.com/redpanda/redpanda/builds/46082#018e365e-88fb-48e9-96b8-ea7b73d6aeff:
new failures in https://buildkite.com/redpanda/redpanda/builds/46082#018e365e-88fb-4c71-9afd-59b94ea4b9c8:
new failures in https://buildkite.com/redpanda/redpanda/builds/46082#018e365e-88fc-4933-9ce0-71ce3c2ceedc:
new failures in https://buildkite.com/redpanda/redpanda/builds/46082#018e365e-88fc-4ccd-9019-92f82926c981:
new failures in https://buildkite.com/redpanda/redpanda/builds/46082#018e365e-88fb-4ba9-9425-21bb64a1ff0e:
new failures in https://buildkite.com/redpanda/redpanda/builds/46164#018e3bd0-08e7-4fbd-9b65-3cada65af592:
@bharathv double check if
Do we need to replace last_visible_index with committed offset here? Otherwise, I reckon, suffix truncation will not be possible if we take a snapshot above the committed offset and the other 2 replicas lose uncommitted data and form a quorum.
VERY COOL!
/dt
/dt
/ci-repeat 5
/ci-repeat 2
src/v/raft/types.h
// Callers may choose to force flush on an individual replicate request
// basis. This is useful if certain callers intend to override any
// default behavior at global/topic scope.
void set_force_flush() { _force_flush = true; }
I'm debating whether to have a separate override like this or to make it a mandatory parameter in the constructor. The latter seems better because it forces (future) developers to actually look up what force flush is and not miss it, particularly when replicating new commands to user topics.
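A minimal sketch of the two API shapes being weighed here; the option type and member names below are hypothetical illustrations, not the actual identifiers in the PR:

// Alternative 1: optional setter (as in the snippet above) -- easy to miss
// when a new call site is added.
struct replicate_opts_with_setter {
    void set_force_flush() { _force_flush = true; }
    bool force_flush() const { return _force_flush; }

private:
    bool _force_flush = false;
};

// Alternative 2: mandatory constructor parameter -- every (future) caller
// has to make an explicit decision about flushing.
enum class flush_after_append { no, yes };

struct replicate_opts_with_ctor_param {
    explicit replicate_opts_with_ctor_param(flush_after_append f)
      : _flush(f) {}

    bool force_flush() const { return _flush == flush_after_append::yes; }

private:
    flush_after_append _flush;
};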
Looks great, some nits only
ss::future<result<replicate_result>>
replicate_entries_stm::wait_for_majority_no_flush() {
    if (!_append_result) {
        co_return build_replicate_result();
    }
    auto appended_offset = _append_result->value().last_offset;
    auto appended_term = _append_result->value().last_term;

    auto stop_cond = [this, appended_offset, appended_term] {
        const auto current_committed_offset = _ptr->committed_offset();
        const auto current_majority_replicated
          = _ptr->majority_replicated_index();
        const auto replicated = current_majority_replicated >= appended_offset;
        const auto truncated = _ptr->term() > appended_term
                               && current_committed_offset
                                    > _initial_committed_offset
                               && _ptr->_log->get_term(appended_offset)
                                    != appended_term;

        return replicated || truncated;
    };
    try {
        co_await _ptr->_majority_replicated_index_updated.wait(stop_cond);
        co_return process_result(appended_offset, appended_term);
    } catch (const ss::broken_condition_variable&) {
        vlog(
          _ctxlog.debug,
          "Replication of entries with last offset: {}, flush: {} aborted - "
          "shutting down",
          _dirty_offset,
          _is_flush_required);
        co_return result<replicate_result>(
          make_error_code(errc::shutting_down));
    }
}
nit: this differs from wait_for_majority_flush only in the stop_cond. Maybe we can have a generic method that would accept the predicate as an argument?
I was thinking about this. The weird part is the cv they wait on; that differs between the two cases, so we would have to plumb it in somehow (like rolling it into the "Func" we pass), but at that point it looks borderline ugly. Open to any other suggestions.
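For illustration only, a rough sketch of what such a generic helper might look like if both the condition variable and the stop predicate were passed in. It assumes the surrounding replicate_entries_stm members from the snippet above (process_result, errc, result types); the helper name is hypothetical and this is not the PR's actual code:

// One generic wait that takes the cv and the stop predicate, which is
// roughly what unifying wait_for_majority_flush / wait_for_majority_no_flush
// would require.
template<typename StopCondition>
ss::future<result<replicate_result>> replicate_entries_stm::do_wait_for(
  ss::condition_variable& cv,
  model::offset appended_offset,
  model::term_id appended_term,
  StopCondition stop_cond) {
    try {
        // Seastar condition variables re-check the predicate on every
        // signal, so all per-case logic collapses into stop_cond.
        co_await cv.wait(std::move(stop_cond));
        co_return process_result(appended_offset, appended_term);
    } catch (const ss::broken_condition_variable&) {
        co_return result<replicate_result>(
          make_error_code(errc::shutting_down));
    }
}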
This seemingly non-controversial change actually uncovered an issue where timeouts were not passed to the batcher correctly. The timeout now passed in the "opts" is applied all the way down into the batcher.
Callers may choose to force flush on an individual replicate request basis. This is useful if certain callers intend to override any default behavior at global/topic scope, e.g. replicating an stm command to a user topic.
This race was exposed during testing. If a request already timed out before serializing batches into the append entries request, that may result in an empty append and a subsequent offset violation. The fix is to ignore such timed out requests.
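A hedged sketch of the fix described above; the types and helper name are hypothetical, the point is only that requests whose deadline has already passed are failed back instead of being serialized into the append entries request:

#include <chrono>
#include <vector>

struct pending_request {
    std::chrono::steady_clock::time_point deadline;
    // ... batches, ack promise, etc.
};

void drop_timed_out(std::vector<pending_request>& pending) {
    auto now = std::chrono::steady_clock::now();
    // Requests that timed out while queued in the batcher are dropped here,
    // so they can no longer produce an empty append.
    std::erase_if(pending, [now](const pending_request& r) {
        return r.deadline <= now;
    });
}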
With the new replicate_batcher changes, the timeout is now applied correctly. Internal users of this client (schema_registry/transforms) were not setting an explicit timeout, which resulted in a 0 timeout and requests timing out immediately. This change defaults to no timeout, represented by a timeout set to -1.
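A minimal sketch of the defaulting behavior described above, with hypothetical names; the -1 sentinel is modeled as "no deadline":

#include <chrono>
#include <optional>

using flush_clock = std::chrono::steady_clock;

// Callers that do not set an explicit timeout (or pass the -1 sentinel)
// get an effectively unbounded deadline instead of an immediate 0ms expiry.
flush_clock::time_point
effective_deadline(std::optional<flush_clock::duration> timeout) {
    if (!timeout || *timeout < flush_clock::duration{0}) {
        return flush_clock::time_point::max(); // effectively no timeout
    }
    return flush_clock::now() + *timeout;
}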
pure refactor, no logic changes.
pure refactor, no logic changes.
pure refactor, no logic changes
This commit decouples the flush decision from the acks setting. A flush happens in the following cases (sketched below):
* A force flush override is set by the caller (highest precedence), **or**
* There is a quorum_ack request in the batch **and** write caching is disabled.
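A minimal sketch of that decision, assuming illustrative parameter names rather than the actual identifiers in the PR:

bool should_flush(
  bool force_flush_requested,
  bool batch_has_quorum_ack,
  bool write_caching_enabled) {
    // Caller override has the highest precedence.
    if (force_flush_requested) {
        return true;
    }
    // Otherwise flush only for quorum_ack requests when write caching
    // is disabled.
    return batch_has_quorum_ack && !write_caching_enabled;
}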
Now that flushing is decoupled from the acks setting, reflect that in _last_write_flushed.
no logic changes.
This is the meat of the change. If a flush is not needed, replication waits for the majority replicated index to advance before acknowledging the caller.
no logic changes
The new timepoint tracks when the last successful flush happened. This is used to implement periodic flushing.
for readability.
This commit unifies the flushing logic for all cached writes (acks = 0/1 or write_caching). The background flush triggers if one of the following conditions is hit (see the sketch below):
- pending flushes have reached flush.bytes in size
- flush.ms has passed since the last flush
Prior to this commit, flush.ms was not enforced, so this commit introduces a small behavioral change.
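A short sketch of that trigger check; flush_bytes / flush_ms mirror the topic-level flush.bytes / flush.ms properties, and the remaining names are assumptions for illustration:

#include <chrono>
#include <cstddef>

// Returns true when the background flush described above should run.
bool background_flush_due(
  size_t pending_bytes,
  size_t flush_bytes,
  std::chrono::milliseconds since_last_flush,
  std::chrono::milliseconds flush_ms) {
    return pending_bytes >= flush_bytes || since_last_flush >= flush_ms;
}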
…ncy with gtest primitives
With the new parameterized tests, the total runtime in debug builds tipped over, causing the test to time out.
Force pushed to fix merge conflicts. While testing I noticed a small buggy behavior in the way we update the majority replicated index during truncations; the fix for it is orthogonal to the work here, so I'll make it a separate follow-up PR.
@@ -98,6 +101,7 @@ def inject_failures():
            node=node))
        time.sleep(5)

    (acks, wc_conf) = (-1, "on") if write_caching else (1, "off")
Does this test do anything? I don't see the cluster configuration write_caching being wired.
Found where it is plugged.
ss::future<> consensus::background_flusher() {
    while (!_bg.is_closed()) {
        try {
            co_await _background_flusher.wait(log_config().flush_ms());
This is a bit of a busy loop, especially with a high number of partitions; do we need it? We already signal the cvar when we append, so why do we need the timeout?
This implements the core of the flush.ms functionality. There is a timer per raft group for two reasons:
- flush.ms is a topic level property, so could be different for each group.
- We need to flush bytes that haven't hit flush.bytes limit but hit the flush.ms delay.
It's a 100ms-by-default timer and it only does work if there is something to be flushed, so I doubt it will have any substantial impact on CPU usage, but it is definitely something to monitor with large tiers.
Perhaps one follow-up improvement could be to have one timer per topic / core, if this turns out to be a problem.
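Expanding on the partial background_flusher snippet quoted earlier, a hedged sketch of the loop being described; the helper names has_pending_flushes and flush_log are assumptions for illustration, not the PR's actual code:

ss::future<> consensus::background_flusher() {
    while (!_bg.is_closed()) {
        try {
            // Wake up either when new data is appended (cv signal) or
            // after flush.ms elapses, whichever comes first.
            co_await _background_flusher.wait(log_config().flush_ms());
        } catch (const ss::condition_variable_timed_out&) {
            // flush.ms elapsed without a signal; fall through to the check.
        }
        // Only do work if there is actually something to flush, so the
        // periodic wakeup is cheap when a group is idle.
        if (has_pending_flushes()) {
            co_await flush_log();
        }
    }
}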
Implements the raft side of things for write caching. Summary of changes:
Next set of patches:
Backports Required
Release Notes