Add enable override to the CPU profiler API #14468

ballard26 · 2023-10-26T22:35:49Z

This PR adds the wait_ms parameter to the admin API for the CPU profiler. When this parameter is set, the API enables the profiler, waits until wait_ms has passed, disables the profiler, then returns the samples collected during that period.

Fixes #14069

Backports Required

Release Notes

Features

Adds wait_ms parameter to CPU profiler admin API. The API will wait for wait_ms milliseconds then return the profile samples collected during that period of time.

StephanDollberg · 2023-10-27T09:00:33Z

src/v/resource_mgmt/cpu_profiler.cc

+        on_enabled_change();
+    }
+
+    co_await ss::sleep(timeout);


Given how this works (i.e., we are not streaming yet), does it even make sense to allow wait_ms to be bigger than 2 minutes (plus some leeway)?

2 minutes is how long the internal buffers take to fill with the default sample period. If someone changes that, it could take more or less time. 15 minutes was selected to give some leeway there. I could technically make the upper bound dynamic and have it be the time it would take to fill the buffers given the current sample rate, though that may be confusing from an end user's perspective.
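To make the fixed versus dynamic bound concrete, here is a minimal Python sketch. The function names, the 100 ms sample period, and the 1200-sample buffer capacity are hypothetical, chosen only so that the product matches the 2 minutes quoted above; the real defaults are not stated in this thread. Only the 15-minute cap comes from the discussion.

```python
from datetime import timedelta

# Fixed upper bound selected in the discussion above.
MAX_WAIT = timedelta(minutes=15)


def buffer_fill_time(sample_period, buffer_capacity_samples):
    """Time until the sample buffer is full at the given sample period."""
    return sample_period * buffer_capacity_samples


def validate_wait_ms(wait_ms, max_wait=MAX_WAIT):
    """Return wait_ms as a timedelta, rejecting negative or too-large values."""
    wait = timedelta(milliseconds=wait_ms)
    if wait < timedelta(0) or wait > max_wait:
        raise ValueError("wait_ms is outside the allowed range")
    return wait


# Hypothetical numbers: 1200 samples at 100 ms per sample fill in 2 minutes.
default_fill = buffer_fill_time(timedelta(milliseconds=100), 1200)

# A dynamic bound would pass the fill time as the cap instead of MAX_WAIT.
dynamic_wait = validate_wait_ms(60_000, max_wait=default_fill)
```

With a dynamic bound, the largest accepted wait_ms would change whenever the sample period changes, which is the source of the end-user confusion mentioned above.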

StephanDollberg · 2023-10-27T09:01:56Z

src/v/resource_mgmt/tests/cpu_profiler_test.cc

+      [&]() -> auto { return busy_loop(wait_ms); });
+
+    BOOST_TEST(results[ss::this_shard_id()].samples.size() >= 1);
+


Also check that profiler is off here?

Good point, adding a check for this.

travisdowns · 2023-10-27T13:55:35Z

src/v/resource_mgmt/cpu_profiler.h

@@ -76,11 +76,18 @@ class cpu_profiler : public ss::peering_sharded_service<cpu_profiler> {
    // is called on.
    shard_samples shard_results() const;

+    ss::future<std::vector<shard_samples>> override_and_get_results(


This public function should be commented.

Added a comment

travisdowns · 2023-10-27T14:00:55Z

src/v/resource_mgmt/cpu_profiler.cc

+    if (_gate.is_closed()) {
+        co_return std::vector<shard_samples>{};
+    }
+    auto holder = _gate.hold();


If you set a 15min profile time, holding this gate will prevent the server from being shut down for 15 min, right?

Maybe it's better to drop the gate and grab it again after the sleep (which may occasionally fail if it races with shutdown, which is fine).

Good point. If the gate isn't held during the wait, though, it could cause a segfault on shutdown: the object is destroyed as part of shutting down, then the task tries to reacquire the gate.

I could make the wait abortable, then have the stop logic in the class abort any ongoing waits. This should avoid any potential segfaults as well as having to wait 15 minutes to shut down.

Good point about gate re-acquisition. Too bad gate doesn't support that scenario: safely checking whether the underlying gate object has gone away (e.g., via reference counting).

Your suggestion sounds good to me.
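The abortable-wait idea can be illustrated with a small Python asyncio analogy. This is only a sketch of the pattern, not the Seastar code in this PR: the `Profiler` class, `collect_for`, and `stop` are hypothetical stand-ins, and the choice to return an empty result on abort is an assumption.

```python
import asyncio


class Profiler:
    """Toy stand-in for the profiler service; not the real cpu_profiler."""

    def __init__(self):
        self._stopping = asyncio.Event()
        self.samples = []

    async def collect_for(self, wait_s):
        """Wait up to wait_s seconds, waking early if stop() is called."""
        if self._stopping.is_set():
            return []
        try:
            # Ends at the timeout, or as soon as stop() sets the event.
            await asyncio.wait_for(self._stopping.wait(), timeout=wait_s)
        except asyncio.TimeoutError:
            # The full profiling window elapsed: hand back the samples.
            return list(self.samples)
        # Woken by shutdown: return nothing instead of touching state
        # that is about to be torn down.
        return []

    def stop(self):
        """Abort every in-flight wait so shutdown is not delayed."""
        self._stopping.set()


async def demo():
    profiler = Profiler()
    loop = asyncio.get_running_loop()
    start = loop.time()
    task = asyncio.create_task(profiler.collect_for(900.0))  # 15-minute wait
    await asyncio.sleep(0.01)
    profiler.stop()  # shutdown arrives while the wait is in progress
    result = await task
    return result, loop.time() - start
```

Because the waiter never lets go of its reference to the object during the wait, there is no window in which it could touch a destroyed object, and `stop()` returns promptly because the wait ends as soon as the event is set.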

travisdowns · 2023-10-27T14:05:09Z

src/v/resource_mgmt/cpu_profiler.cc

+    // If other enable overrides are still on-going don't signal an enable
+    // change.
+    if (!is_enabled()) {
+        on_enabled_change();


I feel like it would be simpler if we could just call on_enabled_change() unconditionally anywhere, and inside it would determine whether anything needs to happen (i.e., whether the seastar-level enable != the redpanda enable), rather than having to carefully detect the cases where enablement may have changed in either direction and call it conditionally.

That said, I think the logic here is correct.

Calling it without any checks would work as well given the logic of on_enabled_change(). This was probably a premature optimization to avoid the unneeded system calls associated with re-enabling the profiler.

Right, I noticed that. The optimization you point out might be worthwhile, but it could be done with one check inside on_enabled_change that just checks whether the states are out of sync.

Good point, switching to checking for out-of-sync states within on_enabled_change rather than at the locations it's called.
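A minimal Python sketch of the "check for out-of-sync states inside the handler" approach follows. The `ProfilerState` class and its fields are hypothetical; `engine_toggles` merely stands in for the system calls the original optimization wanted to avoid.

```python
class ProfilerState:
    """Toy model: `wanted` is the requested state, `engine_running` the real one."""

    def __init__(self):
        self.wanted = False          # what config and overrides ask for
        self.engine_running = False  # what the underlying profiler is doing
        self.engine_toggles = 0      # counts the expensive start/stop calls

    def on_enabled_change(self):
        """Safe to call from anywhere: acts only when the states differ."""
        if self.wanted == self.engine_running:
            return
        self.engine_running = self.wanted
        self.engine_toggles += 1


state = ProfilerState()
state.on_enabled_change()  # nothing to do, states already match
state.wanted = True
state.on_enabled_change()  # starts the engine
state.on_enabled_change()  # redundant call, no extra start
```

Callers no longer need to know whether enablement actually changed, and redundant calls stay cheap because the single comparison lives in one place.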

travisdowns · 2023-10-27T14:09:28Z

src/v/resource_mgmt/cpu_profiler.h

 private:
    // Used to poll seastar at set intervals to capture all samples
    ss::timer<ss::lowres_clock> _query_timer;
    ss::gate _gate;

+    // Used by the API to override cluster config options for a set period of


nit: This is the high-level purpose, but I think this comment could be clearer on what the value represents. It's the number of currently active override requests, right?

Good catch, this comment was actually for an older mechanism I was using to implement this. Will change it to be more specific to what is happening now.

travisdowns · 2023-10-27T14:13:12Z

tests/rptest/tests/cpu_profiler_admin_api_test.py

+        except requests.exceptions.HTTPError:
+            pass
+        else:
+            assert False, "call with wait_ms > 15min should of failed"


nit: should have

Corrected my grammar.

travisdowns · 2023-10-27T14:22:15Z

src/v/resource_mgmt/tests/cpu_profiler_test.cc


    auto results = cp.shard_results();
    BOOST_TEST(results.samples.size() >= 1);
 }
+


Can you add 1 more test (or extend an existing test) which checks that nesting works, e.g., start a sample run with time 2P, then a second one with time P and when the latter completes check that the profiler is still enabled and that when the 2P run finishes it is disabled?

Then I think we'd have good coverage here.

Added a nested override test.
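The nesting scenario requested above (a run of length 2P overlapping a run of length P) can be sketched in Python asyncio. This is an illustration of the counting behaviour only, not the Boost test added in the PR; `OverrideCounter`, `run_override`, and `nested_scenario` are hypothetical names.

```python
import asyncio


class OverrideCounter:
    """Toy model of nested enable overrides; not the real cpu_profiler."""

    def __init__(self):
        self.active_overrides = 0  # number of override requests in flight

    def is_enabled(self):
        return self.active_overrides > 0

    async def run_override(self, wait_s):
        self.active_overrides += 1
        try:
            await asyncio.sleep(wait_s)
        finally:
            # Always release, even if the wait is cancelled.
            self.active_overrides -= 1


async def nested_scenario(period_s=0.05):
    profiler = OverrideCounter()
    observed = []
    long_run = asyncio.create_task(profiler.run_override(2 * period_s))
    short_run = asyncio.create_task(profiler.run_override(period_s))
    await short_run
    observed.append(profiler.is_enabled())  # the 2P run still holds an override
    await long_run
    observed.append(profiler.is_enabled())  # the last override was released
    return observed
```

Running `asyncio.run(nested_scenario())` records the enabled state after each run finishes: still enabled once the P run completes, disabled once the 2P run completes.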

travisdowns

Awesome to see this!

A few minor changes & comments.

travisdowns

obligatory review comment (hate that ctrl+enter starts a review rather than just adding the comment)

vbotbuildovich · 2023-11-02T02:49:17Z

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40270#018b8db3-1715-472a-b4a6-e88f9672086e: "rptest.tests.cluster_features_test.FeaturesSingleNodeTest.test_get_features"
"rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=node_add"
"rptest.tests.acls_test.AccessControlListTestUpgrade.test_upgrade_sasl"
"rptest.tests.upgrade_test.UpgradeFromPriorFeatureVersionTest.test_basic_upgrade"
"rptest.tests.upgrade_test.RedpandaInstallerTest.test_install_by_line"
"rptest.tests.cluster_bootstrap_test.ClusterBootstrapUpgrade.test_change_bootstrap_configs_during_upgrade.empty_seed_starts_cluster=False"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_working_node.delete_topic=False.tick_interval=3600000"
"rptest.tests.cluster_config_test.ClusterConfigLegacyDefaultTest.test_legacy_default_explicit_after_upgrade.wipe_cache=False"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_finishes_after_manual_cancellation.delete_topic=False"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_flipping_decommission_recommission.node_is_alive=False"
"rptest.tests.partition_movement_upgrade_test.PartitionMovementUpgradeTest.test_basic_upgrade"
"rptest.tests.cluster_features_test.FeaturesMultiNodeTest.test_explicit_activation"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_recommissioning_one_of_decommissioned_nodes"
"rptest.tests.cluster_features_test.FeaturesMultiNodeUpgradeTest.test_upgrade"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_dynamic.num_to_upgrade=2"
"rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback.upgrade_after_rollback=True"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_node_is_not_allowed_to_join_after_restart.new_bootstrap=True"
"rptest.tests.pandaproxy_test.BasicAuthUpgradeTest.test_upgrade_and_enable_basic_auth.base_release=.22.3.next_release=.23.1"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_invalid_destination.num_to_upgrade=2"

vbotbuildovich · 2023-11-08T03:41:08Z

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40655#018bacc3-bbf5-4677-b2a2-1f3f573ebe7c: "rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile"

travisdowns · 2024-01-08T13:18:58Z

src/v/resource_mgmt/cpu_profiler.h

-    results(std::optional<ss::shard_id> shard_id);
+    ss::future<std::vector<shard_samples>> results(
+      std::optional<ss::shard_id> shard_id,
+      std::optional<ss::lowres_clock::time_point> filter_before = std::nullopt);

    // Returns the samples and dropped samples from the shard this function
    // is called on.


Missing doc for the new filter_before option.

Updated doc comment to include filter_before.

travisdowns · 2024-01-08T13:34:38Z

src/v/resource_mgmt/cpu_profiler.h

@@ -70,12 +70,15 @@ class cpu_profiler : public ss::peering_sharded_service<cpu_profiler> {

    // Collects `shard_results()` for each shard in a node and returns
    // them as a vector.


Missing doc for the new filter_before option.

Added docs for this option.

travisdowns · 2024-01-08T13:40:18Z

src/v/resource_mgmt/cpu_profiler.cc

@@ -178,6 +186,8 @@ cpu_profiler::override_and_get_results(
    // enable the profiler if disabled pre-override.
    on_enabled_change();

+    auto polling_start_time = ss::lowres_clock::now();


I guess the doc for cpu_profiler::override_and_get_results should be updated now?

Maybe the name could even simply become something like "collect_results_for_period", since what it is really doing is collecting results starting now and running through the period. The override is just an internal detail of what needs to happen to make that work, and indeed does not always occur (if the profiler was already running).

The comments have been updated. The name definitely makes more sense. Switching to it now.
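The effect of recording a start time and passing it as filter_before can be sketched in Python. The `Sample` type, its `taken_at` field, and the choice to keep a sample taken exactly at the cutoff are assumptions for illustration; only the optional `filter_before` parameter mirrors the signature in the hunk above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    taken_at: float  # hypothetical timestamp, seconds on a monotonic clock
    backtrace: str


def results(samples: List[Sample], filter_before: Optional[float] = None) -> List[Sample]:
    """Return samples, dropping those taken before `filter_before` if set."""
    if filter_before is None:
        return list(samples)
    return [s for s in samples if s.taken_at >= filter_before]


# If the profiler was already running, older samples sit in the buffer.
buffered = [Sample(1.0, "old"), Sample(2.0, "new_a"), Sample(3.0, "new_b")]
polling_start_time = 2.0
window = results(buffered, filter_before=polling_start_time)
```

Filtering on the recorded start time is what lets a caller receive only samples from its own collection period even when the profiler was enabled beforehand.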

vbotbuildovich · 2024-01-08T20:12:17Z

new failures in https://buildkite.com/redpanda/redpanda/builds/43568#018cea77-ecce-40f2-ab88-02d3fd8dff53:

"rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile_with_override"

new failures in https://buildkite.com/redpanda/redpanda/builds/43568#018cea77-eccb-48f1-adcd-a648471be524:

"rptest.tests.recovery_mode_test.RecoveryModeTest.test_recovery_mode"
"rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile"

new failures in https://buildkite.com/redpanda/redpanda/builds/43568#018cea9f-b18f-4504-8946-113507d98804:

"rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile_with_override"

new failures in https://buildkite.com/redpanda/redpanda/builds/43742#018d001d-8317-4547-a10d-46f96c74e2d5:

"rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile"

new failures in https://buildkite.com/redpanda/redpanda/builds/43742#018d002e-292d-4fe9-8112-5de3ff7aede2:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/43855#018d19ce-e1cb-4a6b-8f69-dceb53004fa5:

"rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile"

new failures in https://buildkite.com/redpanda/redpanda/builds/43855#018d19df-7589-42bd-a7e9-be9b19439da2:

"rptest.tests.cloud_storage_chunk_read_path_test.CloudStorageChunkReadTest.test_read_chunks"

new failures in https://buildkite.com/redpanda/redpanda/builds/44211#018d3b96-7119-44a3-99b0-ab67ddfe3c1b:

"rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile"

vbotbuildovich · 2024-01-13T01:37:58Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/43742#018d002e-2930-4473-a6c2-b2e3cf9c9730

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/43855#018d19df-758e-40c4-a46d-8c749de184f9

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44211#018d3b96-7119-44a3-99b0-ab67ddfe3c1b

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44211#018d3ba7-5d21-4957-bdb8-30f833ae64f3

travisdowns

LGTM, excited about this change!

travisdowns · 2024-01-24T12:17:58Z

/ci-repeat

ballard26 · 2024-01-25T01:54:54Z

It looks like the unit tests in debug mode got stuck. The unit tests changed in this PR did pass, though, and no other unit test should be affected by the changes in this PR.

ballard26 requested review from travisdowns and StephanDollberg October 26, 2023 22:35

github-actions bot added the area/redpanda label Oct 26, 2023

ballard26 force-pushed the cpu-prof-timeout branch from 082829c to 4c1c5ad Compare October 26, 2023 22:40

StephanDollberg previously approved these changes Oct 27, 2023

travisdowns reviewed Oct 27, 2023

travisdowns reviewed Oct 27, 2023

travisdowns reviewed Oct 27, 2023

travisdowns requested changes Oct 27, 2023

travisdowns reviewed Oct 29, 2023

ballard26 dismissed StephanDollberg’s stale review via cddc766 November 2, 2023 00:43

ballard26 force-pushed the cpu-prof-timeout branch from 4c1c5ad to cddc766 Compare November 2, 2023 00:43

ballard26 requested a review from a team as a code owner November 2, 2023 00:43

github-actions bot added area/k8s k8s/tests labels Nov 2, 2023

ballard26 force-pushed the cpu-prof-timeout branch 2 times, most recently from ee1d886 to f8cffc2 Compare November 2, 2023 00:48

ballard26 requested a review from travisdowns November 2, 2023 00:49

github-actions bot removed area/k8s k8s/tests labels Nov 2, 2023

ballard26 requested review from StephanDollberg and removed request for a team November 2, 2023 00:49

ballard26 force-pushed the cpu-prof-timeout branch 2 times, most recently from db3491e to 0070bf6 Compare November 2, 2023 00:54

ballard26 force-pushed the cpu-prof-timeout branch from 21c4d6f to 69e9734 Compare January 7, 2024 17:27

resource_mgmt: ensure the seastar profiler is stopped

73a8307

ballard26 force-pushed the cpu-prof-timeout branch from 69e9734 to 438c026 Compare January 8, 2024 08:31

travisdowns reviewed Jan 8, 2024

ballard26 force-pushed the cpu-prof-timeout branch 2 times, most recently from 0d8c1ec to b98da18 Compare January 8, 2024 18:13

ballard26 force-pushed the cpu-prof-timeout branch 2 times, most recently from abe7644 to d92bd01 Compare January 12, 2024 23:11

ballard26 requested a review from travisdowns January 12, 2024 23:12

ballard26 added 7 commits January 17, 2024 17:41

resource_mgmt: add override_and_get_results method to cpu_profiler

fc4c2c5

resource_mgmt: check if underlying state has changed in config handlers

07c7971

redpanda: add wait_ms to cpu profiler admin API

b00cbbb

tests: support wait_ms in admin class

61cde40

tests: add tests for the wait_ms parameter in the cpu profiler API

98cf899

config: add l-value specialization to mock_binding

40e704b

resource_mgmt: add option to filter out results collected before a gi…

5fc9445

…ven timepoint

ballard26 force-pushed the cpu-prof-timeout branch from d92bd01 to 5fc9445 Compare January 17, 2024 22:51

travisdowns previously approved these changes Jan 24, 2024

rptest: increase timeout for progress when in debug mode

9d88029

ballard26 dismissed travisdowns’s stale review via 9d88029 January 24, 2024 21:52

StephanDollberg approved these changes Jan 25, 2024

piyushredpanda merged commit 4371eda into redpanda-data:dev Jan 25, 2024
16 of 18 checks passed

renovate bot mentioned this pull request May 4, 2024

feat(github-release)!: Update redpanda-operator to v24.1.6 otosky/home-ops#1232

