
write caching follow ups #17823

Merged · 7 commits · Apr 16, 2024
Conversation

@bharathv (Contributor) commented Apr 12, 2024

This PR includes a batch of fixes found during write caching (wc) testing.

  • commit index not updating in a single participant raft group with lower ack levels
  • renamed write_caching to write_caching_default, which takes {true, false, disabled} as accepted values
  • changed background flushing to use a low-res clock and to only occur when needed (reduced impact on acks=-1)
  • reduced contention on the raft op lock during background flushing
  • fixed a missing notification in the replication monitor when the waiter is enqueued after the notification happens

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@bharathv (Contributor Author):

/dt

@vbotbuildovich (Collaborator) commented Apr 12, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/47710#018ed0df-5e9c-4def-92d2-6d8f9be3e178:

"rptest.tests.cluster_config_test.ClusterConfigTest.test_rpk_export_import"

new failures in https://buildkite.com/redpanda/redpanda/builds/47710#018ed0df-5e9e-4b98-bd48-33c179af6e46:

"rptest.tests.describe_topics_test.DescribeTopicsTest.test_describe_topics_with_documentation_and_types"

new failures in https://buildkite.com/redpanda/redpanda/builds/47710#018ed0df-5e99-4ba7-ba9b-446532fb1bd6:

"rptest.tests.write_caching_test.WriteCachingMetricsTest.test_request_metrics"

new failures in https://buildkite.com/redpanda/redpanda/builds/47710#018ed0ff-269b-4210-8327-e536ae44a667:

"rptest.tests.cluster_config_test.ClusterConfigTest.test_rpk_export_import"
"rptest.tests.describe_topics_test.DescribeTopicsTest.test_describe_topics_with_documentation_and_types"

new failures in https://buildkite.com/redpanda/redpanda/builds/47710#018ed0ff-2698-4763-86a4-d63139bb0b83:

"rptest.tests.write_caching_test.WriteCachingMetricsTest.test_request_metrics"

new failures in https://buildkite.com/redpanda/redpanda/builds/47770#018ed946-dd6b-4674-815c-52b27c2ff231:

"rptest.tests.cluster_bootstrap_test.ClusterBootstrapUpgrade.test_change_bootstrap_configs_after_upgrade.empty_seed_starts_cluster=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/47771#018eda06-c969-4fb8-9720-faf1980d0d60:

"rptest.tests.write_caching_fi_test.WriteCachingFailureInjectionTest.test_unavoidable_data_loss"

@bharathv (Contributor Author):

/dt

@bharathv force-pushed the wc_followups_2 branch 2 times, most recently from 0eff963 to 6e44680, on April 13, 2024 at 19:50
@bharathv bharathv added this to the 24.1 milestone Apr 13, 2024
@bharathv (Contributor Author):

/dt

@bharathv bharathv marked this pull request as ready for review April 15, 2024 04:48
@bharathv bharathv requested review from mmaslankaprv, ztlpn and nvartolomei and removed request for twmb and gene-redpanda April 15, 2024 04:48
Comment on lines 2695 to 2711
const auto& conf = _configuration_manager.get_latest().current_config();
if (conf.voters.size() == 1) {
// single participant raft group, ensuring indices advance
auto dirty_offset = _log->offsets().dirty_offset;
process_append_entries_reply(
_self.id(),
append_entries_reply{
.target_node_id = _self,
.node_id = _self,
.group = _group,
.term = _term,
.last_flushed_log_index = _flushed_offset,
.last_dirty_log_index = dirty_offset,
.result = reply_result::success},
follower_req_seq{0},
dirty_offset);
}
Member:
We were chatting about making this part of replicate_stm. In this form it is incorrect; we cannot simply check the voter count in the current configuration.

@bharathv (Contributor Author):

Chatted offline: the check was an optimization, but we decided to make it unconditional and moved the code around a little in the stm.

@r-vasquez (Contributor) left a comment:

rpk bits look good to me 👍 I'll leave the approval to the core team.

@Deflaimun (Contributor):

Checking the PR, we have doc updates for:

  • rpk
  • topic property
  • cluster property
  • admin api

Is that correct?

@bharathv (Contributor Author):

@Deflaimun Correct. Any references to write caching configs / topic properties should be updated to use write_caching_default instead of write_caching (configuration name), and {true, false, disabled} instead of {on, off, disabled} (accepted values).

Comment on lines +2694 to +2702

maybe_update_majority_replicated_index();
maybe_update_leader_commit_idx();

Member:
🔥

In a single participant raft group, the commit index is never updated
with lower ack levels. Currently, commit index recomputation is only
done on an append entries response, which doesn't happen with a single
replica + lower ack levels. (Test added in a subsequent commit.)
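
A minimal, self-contained sketch of the failure mode (hypothetical types, not the actual Redpanda code): when commit index recomputation is tied solely to append_entries replies, a single-voter group with relaxed acks never advances unless the leader synthesizes a reply for its own local append.

#include <algorithm>
#include <cstdint>
#include <iostream>

// Hypothetical single-node model: shows why the commit index stalls when
// recomputation only runs in the reply path, and how a synthesized
// self-reply after a local append fixes it.
struct single_voter_group {
    int64_t dirty_offset = -1;   // last appended entry
    int64_t flushed_offset = -1; // last fsync'd entry
    int64_t commit_index = -1;   // only moves in process_reply()

    // In a multi-replica group this runs when a follower replies; with a
    // single voter nobody ever calls it, so commit_index never advances.
    void process_reply(int64_t last_flushed) {
        commit_index = std::max(commit_index, last_flushed);
    }

    void append_local(int64_t offset, bool self_reply) {
        dirty_offset = offset;
        flushed_offset = offset; // assume the local flush completed
        if (self_reply) {
            // The fix: treat the leader's own append as a reply so the
            // usual recomputation path runs even with no followers.
            process_reply(flushed_offset);
        }
    }
};

int main() {
    single_voter_group broken, fixed;
    broken.append_local(10, /*self_reply=*/false);
    fixed.append_local(10, /*self_reply=*/true);
    std::cout << "without self-reply: commit_index=" << broken.commit_index
              << "\nwith self-reply:    commit_index=" << fixed.commit_index
              << '\n'; // prints -1 vs 10
}
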
The new configuration name is write_caching_default.

Accepted values = {true, false, disabled}

true = write caching is on by default for all topics
false = write caching is off by default for all topics
disabled = write caching is disabled for all topics, even when
per-topic property overrides are present.

Switching to true/false to be consistent with other configurations.
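
A sketch of the tristate semantics under the rules stated above (the enum and function names here are hypothetical, not the actual Redpanda types): true/false only choose the default, while disabled acts as a kill switch that ignores per-topic overrides.

#include <optional>

// Hypothetical model of the tristate cluster property.
enum class write_caching_default { wc_true, wc_false, wc_disabled };

bool effective_write_caching(write_caching_default cluster,
                             std::optional<bool> topic_override) {
    if (cluster == write_caching_default::wc_disabled) {
        return false; // kill switch: per-topic overrides are ignored
    }
    if (topic_override.has_value()) {
        return *topic_override; // topic property wins over the default
    }
    return cluster == write_caching_default::wc_true;
}
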
The original implementation relied on a CV that used a high resolution
timer, with a timer per raft group, which proved too expensive at
scale. Also, the flusher fired periodically regardless of whether there
was pending unflushed data from lower acks.

This commit redoes it using a deferred flush on a low-res clock, and
flushes are scheduled only if unflushed data is appended. Additionally,
the timers are canceled in favor of background flushes that are
scheduled when flush.bytes is hit. This removes reliance on timers as
much as possible; the timers should never fire in default acks=all
use cases.
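
The scheduling policy, condensed into a hypothetical sketch (not the actual flusher implementation): a coarse deadline is armed only when unflushed data appears, and crossing flush.bytes triggers an immediate background flush that also clears the deadline, so timers stay idle in the acks=all path.

#include <chrono>
#include <cstddef>
#include <functional>
#include <optional>

// Hypothetical deferred flusher: one coarse deadline per group instead of a
// high resolution timer, armed only while unflushed data exists.
class deferred_flusher {
public:
    using clock = std::chrono::steady_clock;

    deferred_flusher(std::size_t flush_bytes,
                     std::chrono::milliseconds flush_interval,
                     std::function<void()> do_flush)
      : _flush_bytes(flush_bytes)
      , _interval(flush_interval)
      , _do_flush(std::move(do_flush)) {}

    // Called after a relaxed-acks append lands in the log.
    void on_unflushed_append(std::size_t bytes) {
        _pending_bytes += bytes;
        if (_pending_bytes >= _flush_bytes) {
            flush_now(); // size threshold: flush in the background, no timer
            return;
        }
        if (!_deadline) {
            // Arm a single coarse deadline; low resolution is enough here
            // and far cheaper than a high-res timer per raft group.
            _deadline = clock::now() + _interval;
        }
    }

    // Polled from a coarse, low resolution tick rather than a per-group timer.
    void on_tick() {
        if (_deadline && clock::now() >= *_deadline) {
            flush_now();
        }
    }

private:
    void flush_now() {
        _do_flush();
        _pending_bytes = 0;
        _deadline.reset(); // nothing unflushed => nothing scheduled
    }

    std::size_t _flush_bytes;
    std::chrono::milliseconds _interval;
    std::function<void()> _do_flush;
    std::size_t _pending_bytes = 0;
    std::optional<clock::time_point> _deadline;
};
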
The test waits until a barrier is attained up to the dirty offset.

BOOST_REQUIRE_EQUAL(r.value(), leader_offsets.committed_offset);

A barrier doesn't guarantee that out of the box: calling it may return
an offset below the dirty offset, up to which all heartbeats agree. If
we want a barrier up to a desired offset, we have to ensure the barrier
returns >= the desired offset; this adjusts the calling code to do so.

The underlying reason is that flush_log() (called from the barrier) may
not always guarantee a flush; it may just return right away if there is
already a flush in progress (and hence nothing to flush). Meanwhile, a
round of heartbeats guarantees barrier progress, and the barrier
returns an earlier offset which doesn't include the dirty offset that
is pending a flush.
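
The adjusted calling pattern, sketched with assumed names: since a single barrier round can settle below the target offset, loop until the returned offset covers it.

#include <cstdint>
#include <functional>

// Hypothetical helper: one barrier round may return an offset below the
// target (flush_log() can no-op while another flush is in flight, and
// heartbeats then agree on an earlier offset). Retry until the result
// covers the desired offset; heartbeat-driven progress makes this terminate.
int64_t barrier_until(const std::function<int64_t()>& run_barrier_round,
                      int64_t desired_offset) {
    int64_t reached = run_barrier_round();
    while (reached < desired_offset) {
        reached = run_barrier_round();
    }
    return reached;
}
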
No logic changes; will be used in the next commit.

If the replication notification happens before wait_for_majority() is
called, the waiter is never resolved. This adds an additional check
before creating the waiter instance.
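
A condensed illustration of the race and the fix, using standard C++ primitives rather than Redpanda's types: check the already-replicated offset before blocking, so a notification that fired earlier isn't lost.

#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical replication monitor: if notify_replicated() ran before
// wait_for_majority(), a waiter created unconditionally would block forever.
// Checking the current offset first (here via the predicate overload, which
// tests before sleeping) closes the race.
class replication_monitor {
public:
    // Notification path: the majority-replicated offset advanced.
    void notify_replicated(int64_t offset) {
        std::lock_guard lk(_m);
        _majority_offset = offset;
        _cv.notify_all();
    }

    void wait_for_majority(int64_t offset) {
        std::unique_lock lk(_m);
        // The fix: resolve immediately if the offset is already replicated,
        // instead of unconditionally enqueuing a waiter.
        _cv.wait(lk, [&] { return _majority_offset >= offset; });
    }

private:
    std::mutex _m;
    std::condition_variable _cv;
    int64_t _majority_offset = -1;
};
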
@bharathv (Contributor Author):

Failure unrelated: #16198

@nvartolomei (Contributor):

🕺

Comment on lines +124 to +149
// Truncation is sealed once the following events happen
// - There is new entry from a different term replacing the
// appended entries.
// - The new entry is committed.
// The second condition is important because without that the
// original entries may be reinstated after another leadership
// change. For example:
//
// 5 replicas A, B, C, D, E, leader=A, term=5
//
// A - replicate([base: 10, last: 20]) term=5
// A - append_local([10, 20]) term=5, dirty=20
// A -> B - append_entries([10, 20]) term=5, dirty=20
// A - frozen briefly, cannot send further append_entries
// (C, D, E), elect C as leader, term=6, dirty=9
// C - append_local([10]) - configuration batch - term=6
// C -> A append_entries([10]), term=6
// C - crashes
// A truncates, term=6, dirty_offset=10 -> First truncation
// (B, D, E), elect B as leader, term=7, dirty=20
// B -> A, D, E append_entries([10, 20])
// committed offset = 20
//
// In the above example if we do not wait for committed
// offset and stop at first truncation event, we risk an
// incorrect truncation detection.
Member:

🔥
