Introduce cpu_usage under CLUSTER SLOT-STATS, to query slot level cpu usage for Redis cluster #11423

Open
kyle-yh-kim opened this issue Oct 23, 2022 · 2 comments


kyle-yh-kim commented Oct 23, 2022

This issue decouples the cpu_usage implementation from the ongoing discussion of slot-level memory metrics in "Introduce slot level metrics to Redis cluster" (#10472).

What are we introducing

High level changes

cpu_usage will be tracked and exposed under the to-be-implemented CLUSTER SLOT-STATS command.

cpu_usage is calculated from the already-tracked duration, which is used to generate the INFO commandstats section.

With the introduction of cpu_usage, Redis cluster users can identify hot slots, and the hot shards that own them, based on their cpu_usage.

Low level changes

The updated CLUSTER SLOT-STATS response is shown below.

127.0.0.1:6379> CLUSTER SLOT-STATS ORDERBY CPU_USAGE LIMIT 2 DESC
1) (integer) 16381
2) 1) "key_count"
   2) (integer) 2
3) 1) "cpu_usage"
   2) (integer) 1000
4) (integer) 0
5) 1) "key_count"
   2) (integer) 3
6) 1) "cpu_usage"
   2) (integer) 987
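
To make the intended ordering semantics concrete, here is a minimal Python sketch of what ORDERBY CPU_USAGE LIMIT 2 DESC is assumed to mean: sort slots by the chosen metric, highest first, and keep the first two. The `order_slots` helper and the dictionary layout are hypothetical, and slot 42 is an invented extra entry; only the values for slots 16381 and 0 come from the response above.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory view: slot number -> {"key_count": ..., "cpu_usage": ...}
SlotStats = Dict[int, Dict[str, int]]


def order_slots(
    stats: SlotStats, metric: str, limit: int, descending: bool = True
) -> List[Tuple[int, Dict[str, int]]]:
    """Return up to `limit` (slot, stats) pairs sorted by `metric`."""
    ranked = sorted(
        stats.items(), key=lambda item: item[1][metric], reverse=descending
    )
    return ranked[:limit]


stats: SlotStats = {
    0: {"key_count": 3, "cpu_usage": 987},
    42: {"key_count": 1, "cpu_usage": 15},  # invented entry for illustration
    16381: {"key_count": 2, "cpu_usage": 1000},
}

top = order_slots(stats, "cpu_usage", limit=2)
for slot, values in top:
    print(slot, values)
```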

Implementation details

The details below are based on the most recent update in the previous thread.

How is it accumulated?

For its initial release, we can use CPU time as a proxy for CPU utilization. There is already an existing measurement, named duration, taken in call(), which is aggregated into the existing commandstats counters. The same value can simply be aggregated per slot as well.
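
The accumulation described above could look roughly like the following Python sketch. It is not the Redis implementation: `SlotCpuStats`, its `call` method, and the injectable `clock_us` timer are hypothetical names, and microseconds are assumed as the unit. The point is only that the duration already measured around each command is added to a per-slot counter.

```python
import time
from typing import Any, Callable, List

NUM_SLOTS = 16384  # number of hash slots in a Redis cluster


class SlotCpuStats:
    """Hypothetical per-slot accumulator of command durations."""

    def __init__(self, clock_us: Callable[[], int] = None) -> None:
        # Assumption: a monotonic clock stands in for the server's own timer.
        self._clock_us = clock_us or (lambda: time.monotonic_ns() // 1000)
        self.cpu_usage: List[int] = [0] * NUM_SLOTS

    def call(self, slot: int, command: Callable[[], Any]) -> Any:
        """Run `command` and add its measured duration to the slot's counter."""
        start = self._clock_us()
        result = command()
        duration = self._clock_us() - start
        self.cpu_usage[slot] += duration
        return result


slot_stats = SlotCpuStats()
slot_stats.call(16381, lambda: sum(range(1000)))
print(slot_stats.cpu_usage[16381] >= 0)
```

Injecting the clock keeps the sketch easy to test with fixed timestamps.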

How is it reset?

For its initial release, the accumulated value is reset upon either:

  1. a slot ownership change (the slot is either removed or newly added), or
  2. the CONFIG RESETSTAT command, which already exists and is documented.
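
The two reset triggers above can be sketched as follows. This is an illustrative Python model under assumed names (`SlotCpuCounters`, `on_slot_ownership_change`, `on_config_resetstat` are all hypothetical), not code from Redis: an ownership change clears one slot, while CONFIG RESETSTAT clears every slot.

```python
from typing import List

NUM_SLOTS = 16384  # number of hash slots in a Redis cluster


class SlotCpuCounters:
    """Hypothetical per-slot counters with the two proposed reset paths."""

    def __init__(self) -> None:
        self.cpu_usage: List[int] = [0] * NUM_SLOTS

    def add(self, slot: int, duration_us: int) -> None:
        self.cpu_usage[slot] += duration_us

    def on_slot_ownership_change(self, slot: int) -> None:
        # The slot was removed from, or newly added to, this node.
        self.cpu_usage[slot] = 0

    def on_config_resetstat(self) -> None:
        # CONFIG RESETSTAT clears the counters of all slots.
        self.cpu_usage = [0] * NUM_SLOTS


counters = SlotCpuCounters()
counters.add(0, 987)
counters.add(16381, 1000)
counters.on_slot_ownership_change(0)  # only slot 0 is cleared
counters.on_config_resetstat()        # everything is cleared
```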

In future iterations, we could use a trailing average as a better alternative to a hard reset. Even better, the reset mechanism could be made configurable, similar to the maxmemory-policy config.
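
The thread does not specify how a trailing average would be computed. One possible reading, shown here purely as an assumption, is an exponentially weighted moving average sampled at fixed intervals; `TrailingCpuAverage`, `alpha`, and `tick` are hypothetical names.

```python
class TrailingCpuAverage:
    """Hypothetical exponentially weighted average of per-interval CPU time."""

    def __init__(self, alpha: float = 0.5) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.average = 0.0
        self._pending = 0  # CPU time recorded since the last tick

    def record(self, duration_us: int) -> None:
        self._pending += duration_us

    def tick(self) -> float:
        """Fold the current interval into the average and start a new one."""
        self.average = self.alpha * self._pending + (1 - self.alpha) * self.average
        self._pending = 0
        return self.average


avg = TrailingCpuAverage(alpha=0.5)
avg.record(100)
first = avg.tick()   # the interval with activity raises the average
second = avg.tick()  # an idle interval lets it decay
```

With this approach, old activity fades gradually instead of being discarded at a reset point.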

madolson (Contributor) commented Oct 24, 2022

> For its initial release, we can leverage CPU time as a proxy unit for CPU utilization. There's already an existing measurement, named duration under call(), which is used to aggregate for an existing counter commandstats. The same value can simply be aggregated under slot level context.

'Initial release' implies that some future release might change how CPU usage is measured, and I'm not sure we should do that. I'm okay with permanently keeping the CPU usage the same as the CPU usage indicated in cluster metrics.

kyle-yh-kim (Author) commented

Understood. I'd like to tackle this from a different angle: the naming convention.

Instead of cpu_usage, which suggests CPU utilization in a broad sense, it may be better to name the metric cpu_time.

This way, we achieve the following:

  1. Add extensibility towards future cpu metrics as new metrics can be created under cpu_{insert_your_metric_name} namespace, without backward compatibility concerns.
  2. Remove ambiguity on "What is cpu usage measured by?"
