
write caching follow ups #17823

Merged · 7 commits · Apr 16, 2024
Conversation

@bharathv (Contributor) commented Apr 12, 2024

This PR includes a batch of fixes found during write caching (wc) testing.

  • commit index not updating in a single participant raft group with lower ack levels
  • renamed write_caching to write_caching_default, which takes {true, false, disabled} as accepted values
  • changed background flushing to use a low-res clock and to only occur when needed (reduced impact on acks=-1)
  • reduced contention on the raft op lock during background flushing
  • fixed a missing notification in the replication monitor when the waiter is enqueued after the notification happens

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@bharathv (Contributor Author):

/dt

@vbotbuildovich (Collaborator) commented Apr 12, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/47710#018ed0df-5e9c-4def-92d2-6d8f9be3e178:

"rptest.tests.cluster_config_test.ClusterConfigTest.test_rpk_export_import"

new failures in https://buildkite.com/redpanda/redpanda/builds/47710#018ed0df-5e9e-4b98-bd48-33c179af6e46:

"rptest.tests.describe_topics_test.DescribeTopicsTest.test_describe_topics_with_documentation_and_types"

new failures in https://buildkite.com/redpanda/redpanda/builds/47710#018ed0df-5e99-4ba7-ba9b-446532fb1bd6:

"rptest.tests.write_caching_test.WriteCachingMetricsTest.test_request_metrics"

new failures in https://buildkite.com/redpanda/redpanda/builds/47710#018ed0ff-269b-4210-8327-e536ae44a667:

"rptest.tests.cluster_config_test.ClusterConfigTest.test_rpk_export_import"
"rptest.tests.describe_topics_test.DescribeTopicsTest.test_describe_topics_with_documentation_and_types"

new failures in https://buildkite.com/redpanda/redpanda/builds/47710#018ed0ff-2698-4763-86a4-d63139bb0b83:

"rptest.tests.write_caching_test.WriteCachingMetricsTest.test_request_metrics"

new failures in https://buildkite.com/redpanda/redpanda/builds/47770#018ed946-dd6b-4674-815c-52b27c2ff231:

"rptest.tests.cluster_bootstrap_test.ClusterBootstrapUpgrade.test_change_bootstrap_configs_after_upgrade.empty_seed_starts_cluster=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/47771#018eda06-c969-4fb8-9720-faf1980d0d60:

"rptest.tests.write_caching_fi_test.WriteCachingFailureInjectionTest.test_unavoidable_data_loss"

@bharathv (Contributor Author):

/dt

@bharathv force-pushed the wc_followups_2 branch 2 times, most recently from 0eff963 to 6e44680, on April 13, 2024 at 19:50
@bharathv bharathv added this to the 24.1 milestone Apr 13, 2024
@bharathv (Contributor Author):

/dt

@bharathv bharathv marked this pull request as ready for review April 15, 2024 04:48
@bharathv bharathv requested review from mmaslankaprv, ztlpn and nvartolomei and removed request for twmb and gene-redpanda April 15, 2024 04:48
Comment on lines 2695 to 2711
const auto& conf = _configuration_manager.get_latest().current_config();
if (conf.voters.size() == 1) {
// single participant raft group, ensuring indices advance
auto dirty_offset = _log->offsets().dirty_offset;
process_append_entries_reply(
_self.id(),
append_entries_reply{
.target_node_id = _self,
.node_id = _self,
.group = _group,
.term = _term,
.last_flushed_log_index = _flushed_offset,
.last_dirty_log_index = dirty_offset,
.result = reply_result::success},
follower_req_seq{0},
dirty_offset);
}
Member:
We were chatting about making this part of replicate_stm. In this form it is incorrect; we cannot simply check the voter count in the current configuration.

@bharathv (Contributor Author):

Chatted offline: the check was an optimization, but we decided to make it unconditional and moved the code around a little in the stm.

@r-vasquez (Contributor) left a comment:

rpk bits look good to me 👍 I'll leave the approval to the core team.

@Deflaimun (Contributor):

Checking the PR, we have doc updates for:

  • rpk
  • topic property
  • cluster property
  • admin api

Is that correct?

@bharathv (Contributor Author):

@Deflaimun Correct. Any references to write caching configs / topic properties should be updated to use write_caching_default instead of write_caching (configuration name), and {true, false, disabled} instead of {on, off, disabled} (accepted values).

Comment on lines +2694 to +2702

maybe_update_majority_replicated_index();
maybe_update_leader_commit_idx();

Member:
🔥

In a single participant raft group, the commit index is never updated
with lower ack levels. Currently, commit index recomputation is only
done on an append entries response, which doesn't happen with a single
replica + lower ack levels. (Test added in a subsequent commit.)
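
A minimal, self-contained sketch of the failure mode (hypothetical types, not the actual Redpanda code): when commit index recomputation is tied solely to append_entries replies, a single-voter group with relaxed acks never advances unless the leader synthesizes a reply for its own local append.

#include <algorithm>
#include <cstdint>
#include <iostream>

// Hypothetical single-node model: shows why the commit index stalls when
// recomputation only runs in the reply path, and how a synthesized
// self-reply after a local append fixes it.
struct single_voter_group {
    int64_t dirty_offset = -1;   // last appended entry
    int64_t flushed_offset = -1; // last fsync'd entry
    int64_t commit_index = -1;   // only moves in process_reply()

    // In a multi-replica group this runs when a follower replies; with a
    // single voter nobody ever calls it, so commit_index never advances.
    void process_reply(int64_t last_flushed) {
        commit_index = std::max(commit_index, last_flushed);
    }

    void append_local(int64_t offset, bool self_reply) {
        dirty_offset = offset;
        flushed_offset = offset; // assume the local flush completed
        if (self_reply) {
            // The fix: treat the leader's own append as a reply so the
            // usual recomputation path runs even with no followers.
            process_reply(flushed_offset);
        }
    }
};

int main() {
    single_voter_group broken, fixed;
    broken.append_local(10, /*self_reply=*/false);
    fixed.append_local(10, /*self_reply=*/true);
    std::cout << "without self-reply: commit_index=" << broken.commit_index
              << "\nwith self-reply:    commit_index=" << fixed.commit_index
              << '\n'; // prints -1 vs 10
}
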
The new configuration name is write_caching_default.

Accepted values = {true, false, disabled}

true = write caching is on by default for all topics
false = write caching is off by default for all topics
disabled = write caching is disabled for all topics, even when
per-topic property overrides are present.

Switching to true/false to be consistent with other configurations.
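
A sketch of the tristate semantics under the rules stated above (the enum and function names here are hypothetical, not the actual Redpanda types): true/false only choose the default, while disabled acts as a kill switch that ignores per-topic overrides.

#include <optional>

// Hypothetical model of the tristate cluster property.
enum class write_caching_default { wc_true, wc_false, wc_disabled };

bool effective_write_caching(write_caching_default cluster,
                             std::optional<bool> topic_override) {
    if (cluster == write_caching_default::wc_disabled) {
        return false; // kill switch: per-topic overrides are ignored
    }
    if (topic_override.has_value()) {
        return *topic_override; // topic property wins over the default
    }
    return cluster == write_caching_default::wc_true;
}
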
The original implementation relied on a CV that used a high resolution
timer, with a timer per raft group, which proved too expensive at
scale. Also, the flusher fired periodically regardless of whether there
was pending unflushed data from lower acks.

This commit redoes it using a deferred flush on a low-res clock, and
flushes are scheduled only if unflushed data is appended. Additionally,
the timers are canceled in favor of background flushes that are
scheduled when flush.bytes is hit. This removes reliance on timers as
much as possible; the timers should never fire in default acks=all
use cases.
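
The scheduling policy, condensed into a hypothetical sketch (not the actual flusher implementation): a coarse deadline is armed only when unflushed data appears, and crossing flush.bytes triggers an immediate background flush that also clears the deadline, so timers stay idle in the acks=all path.

#include <chrono>
#include <cstddef>
#include <functional>
#include <optional>

// Hypothetical deferred flusher: one coarse deadline per group instead of a
// high resolution timer, armed only while unflushed data exists.
class deferred_flusher {
public:
    using clock = std::chrono::steady_clock;

    deferred_flusher(std::size_t flush_bytes,
                     std::chrono::milliseconds flush_interval,
                     std::function<void()> do_flush)
      : _flush_bytes(flush_bytes)
      , _interval(flush_interval)
      , _do_flush(std::move(do_flush)) {}

    // Called after a relaxed-acks append lands in the log.
    void on_unflushed_append(std::size_t bytes) {
        _pending_bytes += bytes;
        if (_pending_bytes >= _flush_bytes) {
            flush_now(); // size threshold: flush in the background, no timer
            return;
        }
        if (!_deadline) {
            // Arm a single coarse deadline; low resolution is enough here
            // and far cheaper than a high-res timer per raft group.
            _deadline = clock::now() + _interval;
        }
    }

    // Polled from a coarse, low resolution tick rather than a per-group timer.
    void on_tick() {
        if (_deadline && clock::now() >= *_deadline) {
            flush_now();
        }
    }

private:
    void flush_now() {
        _do_flush();
        _pending_bytes = 0;
        _deadline.reset(); // nothing unflushed => nothing scheduled
    }

    std::size_t _flush_bytes;
    std::chrono::milliseconds _interval;
    std::function<void()> _do_flush;
    std::size_t _pending_bytes = 0;
    std::optional<clock::time_point> _deadline;
};
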
The test waits until a barrier is attained up to the dirty offset.

BOOST_REQUIRE_EQUAL(r.value(), leader_offsets.committed_offset);

A barrier doesn't guarantee that out of the box: calling it may return
an offset below the dirty offset, up to which all heartbeats agree. If
we want a barrier up to a desired offset, we have to ensure the barrier
returns >= the desired offset; this adjusts the calling code to do so.

The underlying reason is that flush_log() (called from the barrier) may
not always guarantee a flush; it may just return right away if there is
already a flush in progress (and hence nothing to flush). Meanwhile, a
round of heartbeats guarantees barrier progress, and the barrier
returns an earlier offset which doesn't include the dirty offset that
is pending a flush.
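
The adjusted calling pattern, sketched with assumed names: since a single barrier round can settle below the target offset, loop until the returned offset covers it.

#include <cstdint>
#include <functional>

// Hypothetical helper: one barrier round may return an offset below the
// target (flush_log() can no-op while another flush is in flight, and
// heartbeats then agree on an earlier offset). Retry until the result
// covers the desired offset; heartbeat-driven progress makes this terminate.
int64_t barrier_until(const std::function<int64_t()>& run_barrier_round,
                      int64_t desired_offset) {
    int64_t reached = run_barrier_round();
    while (reached < desired_offset) {
        reached = run_barrier_round();
    }
    return reached;
}
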
No logic changes; will be used in the next commit.

If the replication notification happens before wait_for_majority() is
called, the waiter is never resolved. This adds an additional check
before creating the waiter instance.
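
A condensed illustration of the race and the fix, using standard C++ primitives rather than Redpanda's types: check the already-replicated offset before blocking, so a notification that fired earlier isn't lost.

#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical replication monitor: if notify_replicated() ran before
// wait_for_majority(), a waiter created unconditionally would block forever.
// Checking the current offset first (here via the predicate overload, which
// tests before sleeping) closes the race.
class replication_monitor {
public:
    // Notification path: the majority-replicated offset advanced.
    void notify_replicated(int64_t offset) {
        std::lock_guard lk(_m);
        _majority_offset = offset;
        _cv.notify_all();
    }

    void wait_for_majority(int64_t offset) {
        std::unique_lock lk(_m);
        // The fix: resolve immediately if the offset is already replicated,
        // instead of unconditionally enqueuing a waiter.
        _cv.wait(lk, [&] { return _majority_offset >= offset; });
    }

private:
    std::mutex _m;
    std::condition_variable _cv;
    int64_t _majority_offset = -1;
};
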
@bharathv (Contributor Author):

Failure unrelated: #16198

@nvartolomei (Contributor):

🕺

Comment on lines +124 to +149
// Truncation is sealed once the following events happen
// - There is new entry from a different term replacing the
// appended entries.
// - The new entry is committed.
// The second condition is important because without that the
// original entries may be reinstated after another leadership
// change. For example:
//
// 5 replicas A, B, C, D, E, leader=A, term=5
//
// A - replicate([base: 10, last: 20]) term=5
// A - append_local([10, 20]) term=5, dirty=20
// A -> B - append_entries([10, 20]) term=5, dirty=20
// A - frozen briefly, cannot send further append_entries
// (C, D, E), elect C as leader, term=6, dirty=9
// C - append_local([10]) - configuration batch - term=6
// C -> A append_entries([10]), term=6
// C - crashes
// A truncates, term=6, dirty_offset=10 -> First truncation
// (B, D, E), elect B as leader, term=7, dirty=20
// B -> A, D, E append_entries([10, 20])
// committed offset = 20
//
// In the above example if we do not wait for committed
// offset and stop at first truncation event, we risk an
// incorrect truncation detection.
Member:

🔥
