Add enable override to the CPU profiler API #14468

Merged
merged 9 commits into from
Jan 25, 2024

Conversation

Contributor

@ballard26 ballard26 commented Oct 26, 2023

This PR adds the wait_ms parameter to the admin API for the CPU profiler. When this parameter is set, the API enables the profiler, waits until wait_ms has passed, disables the profiler, then returns the samples collected during that period.

Fixes #14069

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

Features

  • Adds wait_ms parameter to CPU profiler admin API. The API will wait for wait_ms milliseconds then return the profile samples collected during that period of time.

on_enabled_change();
}

co_await ss::sleep(timeout);
Member

Given how this works (aka we are not streaming yet) does it even make sense to allow wait_ms to be bigger than 2 minutes (plus some leeway)?

Contributor Author

2 minutes is how long the internal buffers take to fill with the default sample period. If someone changes that period, it could take more or less time. 15 minutes was selected to give some leeway there. I could technically make the upper bound dynamic and have it be the time it would take to fill the buffers given the current sample rate, though that may be confusing from an end user's perspective.
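The two options discussed here, a fixed cap versus a bound derived from the buffer fill time, could be sketched roughly as follows. This is only an illustration: `max_wait`, `buffer_fill_time`, and `validate_wait` are hypothetical names, and the buffer capacity and sample period used in the usage note are made-up numbers, not Redpanda's actual values.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <optional>
#include <string>

using namespace std::chrono_literals;

// Fixed upper bound mentioned in the discussion (15 minutes).
constexpr std::chrono::milliseconds max_wait = 15min;

// Hypothetical helper: time until a buffer holding `capacity` samples fills
// when one sample is taken every `sample_period`.
constexpr std::chrono::milliseconds buffer_fill_time(
  std::size_t capacity, std::chrono::milliseconds sample_period) {
    return sample_period
           * static_cast<std::chrono::milliseconds::rep>(capacity);
}

// Returns an error message if wait_ms is out of range, std::nullopt otherwise.
inline std::optional<std::string>
validate_wait(std::chrono::milliseconds wait_ms) {
    if (wait_ms <= 0ms) {
        return "wait_ms must be positive";
    }
    if (wait_ms > max_wait) {
        return "wait_ms must not exceed 15 minutes";
    }
    return std::nullopt;
}
```

With an assumed capacity of 1200 samples and a 100 ms period, `buffer_fill_time(1200, 100ms)` comes out to two minutes; a dynamic bound would pass that value to the validation instead of the fixed `max_wait`.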

[&]() -> auto { return busy_loop(wait_ms); });

BOOST_TEST(results[ss::this_shard_id()].samples.size() >= 1);

Member

Also check that profiler is off here?

Contributor Author

Good point, adding a check for this.

@@ -76,11 +76,18 @@ class cpu_profiler : public ss::peering_sharded_service<cpu_profiler> {
// is called on.
shard_samples shard_results() const;

ss::future<std::vector<shard_samples>> override_and_get_results(
Member

This public function should be commented.

Contributor Author

Added a comment

if (_gate.is_closed()) {
    co_return std::vector<shard_samples>{};
}
auto holder = _gate.hold();
Member

If you set a 15min profile time, holding this gate will prevent the server from being shut down for 15 min, right?

Maybe it's better to drop the gate and grab it again after the sleep (which may occasionally fail if it races with shutdown, which is fine).

Contributor Author

Good point. It is possible, though, that it'll cause a segfault on shutdown if the gate isn't held during the wait, i.e., the object is destroyed as a result of shutting down, then the task tries to reacquire the gate.

Contributor Author

@ballard26 ballard26 Oct 27, 2023

I could make the wait abortable, then have the stop logic in the class abort any ongoing waits. This should avoid all potential segfaults, as well as having to wait 15 minutes to shut down.

Member

Good point about gate re-acquisition. Too bad gate doesn't support that scenario: safely checking whether the underlying gate object has gone away (e.g., via reference counting).

Your suggestion sounds good to me.
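The abortable-wait idea agreed on in this thread might look roughly like the sketch below. It is a standard-library analogue using a condition variable, not Seastar code and not the code from this PR; the class name `abortable_wait` and its members are hypothetical. The point is that the stop path can wake a long wait immediately, so the gate can stay held without delaying shutdown.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Standard-library analogue of an abortable sleep; not Seastar code.
class abortable_wait {
public:
    // Blocks for up to `timeout`. Returns true if the full timeout elapsed,
    // false if abort() was called first (or had already been called).
    bool wait_for(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(_mutex);
        // The predicate overload returns the predicate's value, so a true
        // result here means the wait ended because of an abort.
        return !_cv.wait_for(lock, timeout, [this] { return _aborted; });
    }

    // Called from the stop path: wakes every waiter immediately.
    void abort() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _aborted = true;
        }
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _aborted = false;
};
```

A caller would treat a `false` return as "shutting down" and return early instead of collecting samples.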

// If other enable overrides are still on-going don't signal an enable
// change.
if (!is_enabled()) {
on_enabled_change();
Member

I feel like it would be simpler if we could just call on_enabled_change() unconditionally anywhere and have it determine internally whether anything needs to happen, i.e., whether the seastar-level enable != the redpanda enable, rather than having to carefully detect the cases where enablement may have changed in either direction and call it conditionally.

That said, I think the logic here is correct.

Contributor Author

Calling it without any checks would work as well given the logic of on_enabled_change(). This was probably a premature optimization to avoid the unneeded system calls associated with re-enabling the profiler.

Member

> Calling it without any checks would work as well given the logic of on_enabled_change(). This was probably a premature optimization to avoid the unneeded system calls associated with re-enabling the profiler.

Right, I noticed that. The optimization you point out might be worthwhile, but it could be done with one check inside on_enabled_change that just tests whether the states are out of sync.

Contributor Author

Good point, switching to checking for out-of-sync states within on_enabled_change rather than at the locations it's called.
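The outcome of this thread, a single out-of-sync check inside the change handler, can be illustrated with a toy model. Everything here is hypothetical (`profiler_model`, `set_desired`, `backend_toggles`); it only shows why callers can then invoke `on_enabled_change()` unconditionally without paying for redundant start/stop calls.

```cpp
#include <cassert>

// Toy model, not the Redpanda class: all member names are hypothetical.
class profiler_model {
public:
    // Records the state the caller wants (config value or active override).
    void set_desired(bool enabled) { _desired = enabled; }

    // Safe to call from anywhere: it acts only when the desired state and
    // the state of the underlying profiler differ.
    void on_enabled_change() {
        if (_desired == _backend_running) {
            return; // already in sync, skip the costly call
        }
        _backend_running = _desired;
        ++_backend_toggles; // stands in for the real start/stop call
    }

    bool backend_running() const { return _backend_running; }
    int backend_toggles() const { return _backend_toggles; }

private:
    bool _desired = false;
    bool _backend_running = false;
    int _backend_toggles = 0;
};
```

Calling `on_enabled_change()` several times in a row after one `set_desired(true)` toggles the backend once; the later calls are no-ops.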

private:
// Used to poll seastar at set intervals to capture all samples
ss::timer<ss::lowres_clock> _query_timer;
ss::gate _gate;

// Used by the API to override cluster config options for a set period of
Member

nit: This is the high-level purpose, but I think this comment could be clearer on what the value represents. It's the number of currently active override requests, right?

Contributor Author

Good catch, this comment was actually for an older mechanism I was using to implement this. Will change it to be more specific to what is happening now.

except requests.exceptions.HTTPError:
    pass
else:
    assert False, "call with wait_ms > 15min should of failed"
Member

nit: should have

Contributor Author

Corrected my grammar.


auto results = cp.shard_results();
BOOST_TEST(results.samples.size() >= 1);
}

Member

Can you add 1 more test (or extend an existing test) which checks that nesting works, e.g., start a sample run with time 2P, then a second one with time P and when the latter completes check that the profiler is still enabled and that when the 2P run finishes it is disabled?

Then I think we'd have good coverage here.

Contributor Author

Added a nested override test.
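The nesting scenario requested above (a 2P run that outlives a P run) can be replayed deterministically with a counter of active overrides. The sketch below is not the test added in this PR; `override_counter`, `override_guard`, and `nested_override_scenario` are hypothetical names, and scopes stand in for the two timed runs.

```cpp
#include <cassert>

// Hypothetical stand-in for the profiler: on while any override is active.
struct override_counter {
    int active = 0;
    bool enabled() const { return active > 0; }
};

// RAII guard: one instance per in-flight request.
class override_guard {
public:
    explicit override_guard(override_counter& c)
      : _c(c) {
        ++_c.active;
    }
    ~override_guard() { --_c.active; }
    override_guard(const override_guard&) = delete;
    override_guard& operator=(const override_guard&) = delete;

private:
    override_counter& _c;
};

// Replays the 2P / P scenario; returns true if every expectation held.
inline bool nested_override_scenario() {
    override_counter c;
    bool ok = !c.enabled();
    {
        override_guard long_run(c); // the 2P run starts
        ok = ok && c.enabled();
        {
            override_guard short_run(c); // the P run starts
            ok = ok && c.enabled();
        } // the P run completes
        ok = ok && c.enabled(); // still on: the 2P run is in flight
    } // the 2P run completes
    return ok && !c.enabled();
}
```

A real test would use timed runs on the profiler itself; the counter only captures the invariant being checked, namely that the profiler turns off when the last override ends and not before.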

Member

@travisdowns travisdowns left a comment

Awesome to see this!

A few minor changes & comments.

Member

@travisdowns travisdowns left a comment

obligatory review comment (hate that ctrl+enter starts a review rather than just adding the comment)

@vbotbuildovich
Collaborator

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40270#018b8db3-1715-472a-b4a6-e88f9672086e: "rptest.tests.cluster_features_test.FeaturesSingleNodeTest.test_get_features"
"rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=node_add"
"rptest.tests.acls_test.AccessControlListTestUpgrade.test_upgrade_sasl"
"rptest.tests.upgrade_test.UpgradeFromPriorFeatureVersionTest.test_basic_upgrade"
"rptest.tests.upgrade_test.RedpandaInstallerTest.test_install_by_line"
"rptest.tests.cluster_bootstrap_test.ClusterBootstrapUpgrade.test_change_bootstrap_configs_during_upgrade.empty_seed_starts_cluster=False"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_working_node.delete_topic=False.tick_interval=3600000"
"rptest.tests.cluster_config_test.ClusterConfigLegacyDefaultTest.test_legacy_default_explicit_after_upgrade.wipe_cache=False"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_finishes_after_manual_cancellation.delete_topic=False"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_flipping_decommission_recommission.node_is_alive=False"
"rptest.tests.partition_movement_upgrade_test.PartitionMovementUpgradeTest.test_basic_upgrade"
"rptest.tests.cluster_features_test.FeaturesMultiNodeTest.test_explicit_activation"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_recommissioning_one_of_decommissioned_nodes"
"rptest.tests.cluster_features_test.FeaturesMultiNodeUpgradeTest.test_upgrade"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_dynamic.num_to_upgrade=2"
"rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback.upgrade_after_rollback=True"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_node_is_not_allowed_to_join_after_restart.new_bootstrap=True"
"rptest.tests.pandaproxy_test.BasicAuthUpgradeTest.test_upgrade_and_enable_basic_auth.base_release=.22.3.next_release=.23.1"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_invalid_destination.num_to_upgrade=2"

@vbotbuildovich
Collaborator

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40655#018bacc3-bbf5-4677-b2a2-1f3f573ebe7c: "rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile"

results(std::optional<ss::shard_id> shard_id);
ss::future<std::vector<shard_samples>> results(
std::optional<ss::shard_id> shard_id,
std::optional<ss::lowres_clock::time_point> filter_before = std::nullopt);

// Returns the samples and dropped samples from the shard this function
// is called on.
Member

Missing doc for the new filter_before option.

Contributor Author

Updated doc comment to include filter_before.

@@ -70,12 +70,15 @@ class cpu_profiler : public ss::peering_sharded_service<cpu_profiler> {

// Collects `shard_results()` for each shard in a node and returns
// them as a vector.
Member

Missing doc for the new filter_before option.

Contributor Author

Added docs for this option.
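For readers unfamiliar with the `filter_before` option under review, here is a minimal sketch of what such a filter might do. It assumes, from the parameter name only, that samples taken before the given time point are dropped and that the boundary itself is kept; the `sample` struct, `filter_samples`, and the use of `std::chrono::steady_clock` are illustrative choices, not the types used in the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <iterator>
#include <optional>
#include <string>
#include <vector>

using clock_type = std::chrono::steady_clock;

// Hypothetical sample record; the real type is not shown in the discussion.
struct sample {
    clock_type::time_point taken_at;
    std::string backtrace;
};

// Returns the samples taken at or after `filter_before`. With std::nullopt
// every sample is returned unchanged.
inline std::vector<sample> filter_samples(
  const std::vector<sample>& all,
  std::optional<clock_type::time_point> filter_before = std::nullopt) {
    if (!filter_before) {
        return all;
    }
    std::vector<sample> kept;
    std::copy_if(
      all.begin(),
      all.end(),
      std::back_inserter(kept),
      [&](const sample& s) { return s.taken_at >= *filter_before; });
    return kept;
}
```

Passing the time at which a timed collection started would then exclude samples that were already buffered before the request arrived.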

@@ -178,6 +186,8 @@ cpu_profiler::override_and_get_results(
// enable the profiler if disabled pre-override.
on_enabled_change();

auto polling_start_time = ss::lowres_clock::now();
Member

I guess the doc for cpu_profiler::override_and_get_results should be updated now?

Maybe the name could even simply become something like "collect_results_for_period", since what it is really doing is collecting results starting now and running through the period. The override is just an internal detail of what needs to happen to make that work, and indeed it does not always occur (if the profiler was already running).

Contributor Author

The comments have been updated. The name definitely makes more sense. Switching to it now.

@ballard26 ballard26 force-pushed the cpu-prof-timeout branch 2 times, most recently from 0d8c1ec to b98da18 Compare January 8, 2024 18:13
@vbotbuildovich
Collaborator

vbotbuildovich commented Jan 8, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/43568#018cea77-ecce-40f2-ab88-02d3fd8dff53:

"rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile_with_override"

new failures in https://buildkite.com/redpanda/redpanda/builds/43568#018cea77-eccb-48f1-adcd-a648471be524:

"rptest.tests.recovery_mode_test.RecoveryModeTest.test_recovery_mode"
"rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile"

new failures in https://buildkite.com/redpanda/redpanda/builds/43568#018cea9f-b18f-4504-8946-113507d98804:

"rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile_with_override"

new failures in https://buildkite.com/redpanda/redpanda/builds/43742#018d001d-8317-4547-a10d-46f96c74e2d5:

"rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile"

new failures in https://buildkite.com/redpanda/redpanda/builds/43742#018d002e-292d-4fe9-8112-5de3ff7aede2:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/43855#018d19ce-e1cb-4a6b-8f69-dceb53004fa5:

"rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile"

new failures in https://buildkite.com/redpanda/redpanda/builds/43855#018d19df-7589-42bd-a7e9-be9b19439da2:

"rptest.tests.cloud_storage_chunk_read_path_test.CloudStorageChunkReadTest.test_read_chunks"

new failures in https://buildkite.com/redpanda/redpanda/builds/44211#018d3b96-7119-44a3-99b0-ab67ddfe3c1b:

"rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile"

@ballard26 ballard26 force-pushed the cpu-prof-timeout branch 2 times, most recently from abe7644 to d92bd01 Compare January 12, 2024 23:11
travisdowns previously approved these changes Jan 24, 2024
Member

@travisdowns travisdowns left a comment

LGTM, excited about this change!

@travisdowns
Member

/ci-repeat

@ballard26
Contributor Author

It looks like the unit tests in debug mode got stuck. The unit tests changed in this PR did pass, though, and no other unit test should be affected by the changes in this PR.

@piyushredpanda piyushredpanda merged commit 4371eda into redpanda-data:dev Jan 25, 2024
16 of 18 checks passed
Development

Successfully merging this pull request may close these issues.

Add timeout parameter to CPU profiler endpoint
5 participants