
storage: add cache target estimates and space management control loop skeleton #11133

Merged
merged 5 commits into redpanda-data:dev on Jun 7, 2023

Conversation

dotnwat
Member

@dotnwat dotnwat commented Jun 1, 2023

Adds a cloud cache interface for estimating the target minimum storage size, as well as a control loop skeleton that will be used to manage disk space in a subsequent PR.
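
For context, the per-partition estimate is a pair of byte targets. A minimal sketch of the shape (field names are taken from the diff reviewed below; the aggregation helper is purely illustrative):

```cpp
#include <cstddef>
#include <vector>

// Sketch only: per-partition cache sizing estimate as it appears in this
// PR's diff, plus an illustrative aggregation step a space manager could
// use to compute cluster-node-wide cache targets.
struct cache_usage {
    size_t target_min_bytes{0}; // smallest cache footprint needed to make progress
    size_t target_bytes{0};     // cache footprint we would like to have
};

cache_usage combine(const std::vector<cache_usage>& parts) {
    cache_usage total;
    for (const auto& p : parts) {
        total.target_min_bytes += p.target_min_bytes;
        total.target_bytes += p.target_bytes;
    }
    return total;
}
```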

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

  • none

@dotnwat dotnwat requested review from jcsp, andrwng and VladLazar and removed request for jcsp and andrwng June 1, 2023 02:03
@dotnwat dotnwat requested a review from andrwng June 1, 2023 02:03
@dotnwat
Member Author

dotnwat commented Jun 1, 2023

Force-push:

  • Fixed circular dependency in the shared library build.

andrwng
andrwng previously approved these changes Jun 2, 2023
Contributor

@andrwng andrwng left a comment


LGTM, nothing blocking

Comment on lines 1123 to 1136
auto [size, chunked] = seg->min_cache_cost();
if (chunked) {
return cache_usage{
.target_min_bytes = min_chunks * size,
.target_bytes = wanted_chunks * size,
};
} else {
return cache_usage{
.target_min_bytes = min_segments * size,
.target_bytes = wanted_segments * size,
};
}
Contributor


nit: consider making min_cache_cost() return the usage directly, to avoid the chunked out-bool? Especially if there's any appetite for making min_* and wanted_* configurable, it seems pretty natural to have the segment encapsulate this.
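
A rough illustration of that suggestion, sketch only (`is_chunked()`, `chunk_size()`, `file_size()`, and the count constants are placeholders, not the actual segment API):

```cpp
// Hypothetical shape of the suggested refactor: the segment returns the
// cache_usage itself rather than a (size, chunked) pair for the caller
// to interpret.
cache_usage segment::min_cache_cost() const {
    if (is_chunked()) {
        const auto size = chunk_size();
        return cache_usage{
          .target_min_bytes = min_chunks * size,
          .target_bytes = wanted_chunks * size,
        };
    }
    const auto size = file_size();
    return cache_usage{
      .target_min_bytes = min_segments * size,
      .target_bytes = wanted_segments * size,
    };
}
```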

co_await _gate.close();
}

ss::future<> disk_space_manager::controller() {
Contributor


nit: could we avoid using "controller"? Maybe run_loop or something? Especially if this at some point makes its way into cluster/

Member Author


yep. fixed

Comment on lines +1843 to +1849
, enable_storage_space_manager(
*this,
"enable_storage_space_manager",
"Enable the storage space manager that coordinates and control space "
"usage between log data and the cloud storage cache.",
{.needs_restart = needs_restart::no, .visibility = visibility::user},
true)
Contributor


I don't think the behavior you're implementing warrants doing anything differently in the face of skew, but I'm curious whether you've given thought to making this configurable/controllable at the node level. A few scenarios come to mind where space usage may vary across nodes:

  • when there is some hypothetical retention bug that consumes the space on a single node but not others
  • when there is heterogeneous hardware
  • when there is severe partition-size skew

Just thinking about whether, during an incident, it'll be desirable to have a way to flip this off for one node but not all of them.

Member Author


This makes a lot of sense. I'd even assume we'd want to be able to turn it off on a per-node basis without needing a restart. At least, I don't think we support that latter part today, so we'd need an admin endpoint.

Member Author


Good idea. I've added a ticket to track this: #11192

@@ -1101,4 +1103,79 @@ materialized_segments& remote_partition::materialized() {
return _api.materialized();
}

cache_usage remote_partition::get_cache_usage() const {
Contributor


Naming: something other than usage will be easier to understand (usually usage means how much space we're using, but this is really asking for target usage, or some other noun if we can think of one)

Member Author


done!

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
A segment may be in legacy mode or chunked mode, which affects the way it interacts with the cloud cache. This patch adds an interface for exposing this aspect of the segment's state.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Adds an interface for estimating the target minimum / wanted sizes for a
remote partition in the cloud storage cache.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Introduces a new disk space manager control loop that has access to both the cloud storage cache and log storage for the purpose of managing free space.

This commit only introduces the control loop, which prints storage statistics collected from those subsystems using the new APIs at debug level.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
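
(As a rough picture of the skeleton described in the commit above, a sketch in Seastar style; `run_loop`, `_frequency`, `_as`, the logger, and the accessors on the storage/cache handles are illustrative names, not the PR's exact code.)

```cpp
// Sketch of a periodic space-management loop: wake up on a fixed interval,
// gather usage/target estimates from log storage and the cloud cache, and,
// at this stage, only log them at debug level.
ss::future<> disk_space_manager::run_loop() {
    while (!_as.abort_requested()) {
        auto log_usage = co_await _storage->disk_usage();      // hypothetical accessor
        auto cache_target = co_await _cache->target_usage();   // hypothetical accessor
        vlog(
          rlog.debug,
          "log usage {} cache target min {} / wanted {}",
          log_usage.total_bytes,
          cache_target.target_min_bytes,
          cache_target.target_bytes);
        try {
            // sleep until the next iteration, waking early on shutdown
            co_await ss::sleep_abortable(_frequency, _as);
        } catch (const ss::sleep_aborted&) {
            // shutdown requested; loop condition will observe the abort
        }
    }
}
```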
@dotnwat
Member Author

dotnwat commented Jun 4, 2023

Force push 1:

  • Fixed naming of usage/target per John's suggestion
  • Fixed naming of the control loop per Andrew's suggestion
  • Updated tests to handle recent v2/v3 changes from Abhijat

Force push 2:

  • Forgot to squash commits

@dotnwat dotnwat requested review from jcsp and andrwng June 5, 2023 04:16
@dotnwat
Member Author

dotnwat commented Jun 7, 2023

Ping @andrwng

Contributor

@andrwng andrwng left a comment


LGTM!

@dotnwat dotnwat merged commit 63a74ca into redpanda-data:dev Jun 7, 2023