-
Notifications
You must be signed in to change notification settings - Fork 552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add enable override to the CPU profiler API #14468
Conversation
082829c
to
4c1c5ad
Compare
src/v/resource_mgmt/cpu_profiler.cc
Outdated
on_enabled_change(); | ||
} | ||
|
||
co_await ss::sleep(timeout); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given how this works (aka we are not streaming yet) does it even make sense to allow wait_ms
to be bigger than 2 minutes (plus some leeway)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2mins is how long the internal buffers will take to fill with the default sample period. If someone changes that it could take more or less time. 15mins was selected to give some leeway there. I could technically make the upper bound dynamic and have it be the time it'd take to fill the buffers given the current sample rate. Though that may be confusing from an end-users perspective.
[&]() -> auto { return busy_loop(wait_ms); }); | ||
|
||
BOOST_TEST(results[ss::this_shard_id()].samples.size() >= 1); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also check that profiler is off here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, adding a check for this.
src/v/resource_mgmt/cpu_profiler.h
Outdated
@@ -76,11 +76,18 @@ class cpu_profiler : public ss::peering_sharded_service<cpu_profiler> { | |||
// is called on. | |||
shard_samples shard_results() const; | |||
|
|||
ss::future<std::vector<shard_samples>> override_and_get_results( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This public function should be commented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment
if (_gate.is_closed()) { | ||
co_return std::vector<shard_samples>{}; | ||
} | ||
auto holder = _gate.hold(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you set a 15min profile time, holding this gate will prevent the server from being shut down for 15 min, right?
Maybe it's better to drop the gate and grab it against after the sleep (which may occasionally fail if it races with shutdown, which is fine).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, it is possible that it'll cause a segfault on shutdown though if the gate isn't held during the wait. I.e, the object is destroyed as a result of shutting down then the task tries to reacquire the gate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could make the wait abort-able then have the stop logic in the class abort any on-going waits. This should avoid all potential segfaults along with having to wait 15mins to shutdown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point about gate re-acquisition. Too bad gate
doesn't support that scenario: safely checking whether the underlying gate object has gone away (e.g., via reference counting).
Your suggestion sounds good to me.
src/v/resource_mgmt/cpu_profiler.cc
Outdated
// If other enable overrides are still on-going don't signal an enable | ||
// change. | ||
if (!is_enabled()) { | ||
on_enabled_change(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like it would be simpler if we could just call on_enabled_change()
unconditionally anywhere and inside there it would determine if anything needs to happen: i.e, if the seastar level enable != the repdanda enable, rather than having to carefully detect the cases where enablement may have changed in either direction and call it unconditionally.
That said, I think the logic here is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling it without any checks would work as well given the logic of on_enabled_change()
. This was probably a premature optimization to avoid the unneeded system calls associated with re-enabling the profiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling it without any checks would work as well given the logic of
on_enabled_change()
. This was probably a premature optimization to avoid the unneeded system calls associated with re-enabling the profiler.
Right, I noticed that, though the optimization you point out might be a worthwhile optimization, but it could be done with 1 check inside on_enabled_change
which just checks if the states are out of sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, switching to checking out-of-sync states within on_enable_change
rather than at the locations its called.
src/v/resource_mgmt/cpu_profiler.h
Outdated
private: | ||
// Used to poll seastar at set intervals to capture all samples | ||
ss::timer<ss::lowres_clock> _query_timer; | ||
ss::gate _gate; | ||
|
||
// Used by the API to override cluster config options for a set period of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This is the high-level purpose, but I think this comment could be clearer on what the value represents. It's the number of currently active override requests, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, this comment was actually for an older mechanism I was using to implement this. Will change it to be more specific to what is happening now.
except requests.exceptions.HTTPError: | ||
pass | ||
else: | ||
assert False, "call with wait_ms > 15min should of failed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should have
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected my grammar.
|
||
auto results = cp.shard_results(); | ||
BOOST_TEST(results.samples.size() >= 1); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add 1 more test (or extend an existing test) which checks that nesting works, e.g., start a sample run with time 2P, then a second one with time P and when the latter completes check that the profiler is still enabled and that when the 2P run finishes it is disabled?
Then I think we'd have good coverage here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a nested override test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome to see this!
A few minor changes & comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
obligatory review comment (hate that ctrl+enter starts a review rather than just adding the comment)
if (_gate.is_closed()) { | ||
co_return std::vector<shard_samples>{}; | ||
} | ||
auto holder = _gate.hold(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point about gate re-acquisition. Too bad gate
doesn't support that scenario: safely checking whether the underlying gate object has gone away (e.g., via reference counting).
Your suggestion sounds good to me.
src/v/resource_mgmt/cpu_profiler.cc
Outdated
// If other enable overrides are still on-going don't signal an enable | ||
// change. | ||
if (!is_enabled()) { | ||
on_enabled_change(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling it without any checks would work as well given the logic of
on_enabled_change()
. This was probably a premature optimization to avoid the unneeded system calls associated with re-enabling the profiler.
Right, I noticed that, though the optimization you point out might be a worthwhile optimization, but it could be done with 1 check inside on_enabled_change
which just checks if the states are out of sync.
4c1c5ad
to
cddc766
Compare
ee1d886
to
f8cffc2
Compare
db3491e
to
0070bf6
Compare
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40270#018b8db3-1715-472a-b4a6-e88f9672086e: "rptest.tests.cluster_features_test.FeaturesSingleNodeTest.test_get_features" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40655#018bacc3-bbf5-4677-b2a2-1f3f573ebe7c: "rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile" |
21c4d6f
to
69e9734
Compare
69e9734
to
438c026
Compare
results(std::optional<ss::shard_id> shard_id); | ||
ss::future<std::vector<shard_samples>> results( | ||
std::optional<ss::shard_id> shard_id, | ||
std::optional<ss::lowres_clock::time_point> filter_before = std::nullopt); | ||
|
||
// Returns the samples and dropped samples from the shard this function | ||
// is called on. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing doc for the new filter_before
option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated doc comment to include filter_before
.
@@ -70,12 +70,15 @@ class cpu_profiler : public ss::peering_sharded_service<cpu_profiler> { | |||
|
|||
// Collects `shard_results()` for each shard in a node and returns | |||
// them as a vector. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing doc for the new filter_before
option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added docs for this option.
@@ -178,6 +186,8 @@ cpu_profiler::override_and_get_results( | |||
// enable the profiler if disabled pre-override. | |||
on_enabled_change(); | |||
|
|||
auto polling_start_time = ss::lowres_clock::now(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the doc for cpu_profiler::override_and_get_results
should be updated now?
Maybe the name could even simply get something like "collect_results_for_period" since what is really doing is collecting results starting now and running through period. The override is just an internal detail of what needs to happen to make that work and indeed does not always occur (if the profiler was already running).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comments have been updated. The name definitely makes more sense. Switching to it now.
0d8c1ec
to
b98da18
Compare
new failures in https://buildkite.com/redpanda/redpanda/builds/43568#018cea77-ecce-40f2-ab88-02d3fd8dff53:
new failures in https://buildkite.com/redpanda/redpanda/builds/43568#018cea77-eccb-48f1-adcd-a648471be524:
new failures in https://buildkite.com/redpanda/redpanda/builds/43568#018cea9f-b18f-4504-8946-113507d98804:
new failures in https://buildkite.com/redpanda/redpanda/builds/43742#018d001d-8317-4547-a10d-46f96c74e2d5:
new failures in https://buildkite.com/redpanda/redpanda/builds/43742#018d002e-292d-4fe9-8112-5de3ff7aede2:
new failures in https://buildkite.com/redpanda/redpanda/builds/43855#018d19ce-e1cb-4a6b-8f69-dceb53004fa5:
new failures in https://buildkite.com/redpanda/redpanda/builds/43855#018d19df-7589-42bd-a7e9-be9b19439da2:
new failures in https://buildkite.com/redpanda/redpanda/builds/44211#018d3b96-7119-44a3-99b0-ab67ddfe3c1b:
|
abe7644
to
d92bd01
Compare
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/43742#018d002e-2930-4473-a6c2-b2e3cf9c9730 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/43855#018d19df-758e-40c4-a46d-8c749de184f9 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44211#018d3b96-7119-44a3-99b0-ab67ddfe3c1b ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44211#018d3ba7-5d21-4957-bdb8-30f833ae64f3 |
d92bd01
to
5fc9445
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, excited about this change!
/ci-repeat |
It looks like the unit tests in debug mode got stuck. The unit tests changed in this PR though did pass and no other unit test should be effected by the changes in this PR. |
This PR adds the
wait_ms
parameter to the admin API for the CPU profiler. When this parameter is set the API will enable the profiler, wait untilwait_ms
has passed, disabled the profiler, then return the samples collected during that period.Fixes #14069
Backports Required
Release Notes
Features
wait_ms
parameter to CPU profiler admin API. The API will wait forwait_ms
milliseconds then return the profile samples collected during that period of time.