[dashboard][k8s] Better CPU reporting when running on K8s #14593
Conversation
One thing I'm not sure about -- (but can check experimentally tomorrow morning by launching on AWS and seeing what psutil does there)
^ 2/2 CPU is 100%
This looks great! Is it possible to add some testing to check the logic? At the very least, we can have some unit tests for the parsing/calculation logic.
dashboard/k8s_utils.py (Outdated)

```python
    return 0.0


def container_cpu_count():
```
Should we be doing this in the core too? If so, can we move the cpu count part there?
`ray.utils._get_docker_cpus` currently does min(A, B) with

- A = quota / period
- B = number of CPUs indicated by cpuset.cpus

We could add

- C = (cpu shares) / 1024

and do min(A, B, C).
What do you think?
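As a sketch of that proposal (the helper name, argument handling, and unset-value conventions here are hypothetical, not Ray's actual `_get_docker_cpus` signature):

```python
def combined_cpu_limit(quota, period, cpuset_count, cpu_shares):
    """Take the minimum of the three cgroup-derived CPU signals.

    A = quota / period      (CFS quota; unset when quota is None/<= 0)
    B = cpuset_count        (number of CPUs listed in cpuset.cpus)
    C = cpu_shares / 1024   (K8s maps a pod's CPU request onto cpu.shares
                             at 1024 shares per CPU)
    """
    candidates = []
    if quota is not None and quota > 0 and period:
        candidates.append(quota / period)
    if cpuset_count is not None:
        candidates.append(float(cpuset_count))
    if cpu_shares is not None:
        candidates.append(cpu_shares / 1024)
    # If no signal is available, report "unknown" rather than a guess.
    return min(candidates) if candidates else None
```

For example, a pod requesting 2 CPUs gets cpu.shares = 2048, so C = 2.0 even when no CPU limit (and hence no CFS quota) is set.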
I would think that's preferable since I can't think of a case in which the dashboard should think the cluster has a different number of cpus than the scheduler.
And probably after we do this, we'd want to use `ray.utils.get_num_cpus()` in `ReporterAgent` rather than `psutil.cpu_count`.
But we should probably respect what the tuple in this line is doing:

ray/dashboard/modules/reporter/reporter_agent.py, lines 116 to 117 in 153dcd3:

```python
self._cpu_counts = (psutil.cpu_count(),
                    psutil.cpu_count(logical=False))
```

Does anyone know what this pair of normal count and logical count means, and what it should mean in the context of a container running in a K8s pod?
cc @fyrestone
For the immediate moment I'll avoid dealing with it by keeping the `if IN_KUBERNETES` check.
> And probably after we do this, we'd want to use `ray.utils.get_num_cpus()` in `ReporterAgent` rather than `psutil.cpu_count`. But we should probably respect what the tuple in this line is doing:
>
> ```python
> self._cpu_counts = (psutil.cpu_count(), psutil.cpu_count(logical=False))
> ```
>
> Does anyone know what this pair of normal count and logical count means, and what it should mean in the context of a container running in a K8s pod?
>
> cc @fyrestone
Sorry, I am not familiar with K8s resources. The normal count and logical count are defined here: https://psutil.readthedocs.io/en/latest/#psutil.cpu_count
python/ray/tests/test_dashboard.py (Outdated, lines 93-96)

```python
            f"Dashboard output log: {out_log}\n")


@pytest.mark.skipif(
    sys.platform.startswith("win"), reason="No need to test on Windows.")
def test_k8s_cpu():
```
Just realized this might not be the best place for this new test.
Where in the CI does test_dashboard.py run? (It's not specified in tests/BUILD)
Also, I've meddled with
docker cpu tests are in
btw @ijrsvt can you take a look at this? after all the fun we had with docker cpus before :)
Ugh -- I don't really understand the meaning of
Just launched with the default AWS cluster launching
@wuisawesome
Actually will put the K8s logic in `ray.utils.get_num_cpus` inside an `if IN_KUBERNETES` branch.
Force-pushed from 927dd00 to 024fb05.
Did a final check on GKE. Works as expected. I think this PR is ready, pending tests.
```python
def get_k8s_cpus(cpu_share_file_name="/sys/fs/cgroup/cpu/cpu.shares") -> float:
    """Get number of CPUs available for use by this container, in terms of
    cgroup cpu shares.

    This is the number of CPUs K8s has assigned to the container based
    on pod spec requests and limits.

    Note: using cpu_quota as in _get_docker_cpus() works
    only if the user set CPU limit in their pod spec (in addition to CPU
    request). Otherwise, the quota is unset.
    """
    try:
```
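The review excerpt above cuts off at the `try:`; a minimal sketch of how such a helper presumably continues (the function name `k8s_cpus_from_shares` and the `None` fallback are mine, not necessarily what the PR does):

```python
def k8s_cpus_from_shares(cpu_share_file_name="/sys/fs/cgroup/cpu/cpu.shares"):
    """Sketch: derive a CPU count from cgroup cpu shares.

    K8s writes the pod's CPU request into cpu.shares at a rate of
    1024 shares per CPU, so dividing by 1024 recovers the request.
    """
    try:
        with open(cpu_share_file_name) as f:
            cpu_shares = int(f.read().strip())
        return cpu_shares / 1024
    except (OSError, ValueError):
        # File missing or unparsable: signal "unknown" rather than guess.
        return None
```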
Thanks for splitting this out from the usual docker logic!
Could someone merge this? (Lost track of who has merge permissions these days.)
Why are these changes needed?

I basically copied the logic used in the `docker stats` CLI command for this: https://github.com/docker/cli/blob/c0a6b1c7b30203fbc28cd619acb901a95a80e30e/cli/command/container/stats_helpers.go#L166
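For reference, the calculation in that `stats_helpers.go` file amounts to comparing the container's cumulative CPU usage delta against the host's system CPU delta between two samples; a rough Python transcription (variable names are mine, not from the PR):

```python
def calculate_cpu_percent(prev_cpu, cur_cpu, prev_system, cur_system,
                          online_cpus):
    """Approximation of docker stats' CPU percentage formula:
    (container cpu delta / system cpu delta) * number of CPUs * 100.
    """
    cpu_delta = cur_cpu - prev_cpu            # container CPU usage delta
    system_delta = cur_system - prev_system   # host system CPU usage delta
    if cpu_delta > 0 and system_delta > 0:
        return (cpu_delta / system_delta) * online_cpus * 100.0
    return 0.0
```

So a container that consumed half the host's CPU time on a 2-CPU host reports 100%, matching `docker stats` semantics where 100% means one full CPU.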
Here's what the dashboard looks like when you launch a Ray node on K8s with 2 CPUs and add 1 CPU's worth of stress with `sudo stress --cpu 1 --timeout 30`: roughly 50% usage.
Related issue number
CPU subproblem of #11172
Checks

- [x] I've run `scripts/format.sh` to lint the changes in this PR.