
[dashboard][k8s] Better CPU reporting when running on K8s #14593

Merged: 32 commits into ray-project:master on Mar 12, 2021

Conversation

DmitriGekhtman
Contributor

Why are these changes needed?

I basically copied the logic used in the docker stats cli command for this
https://github.com/docker/cli/blob/c0a6b1c7b30203fbc28cd619acb901a95a80e30e/cli/command/container/stats_helpers.go#L166

Here's what the dashboard looks like when you launch a Ray node on K8s with 2 CPUs and add 1 CPU's worth of stress with
sudo stress --cpu 1 --timeout 30
[Screenshot: dashboard CPU panel showing roughly 50% usage.]
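(The docker stats helper compares the change in the container's total CPU time against the change in the whole system's CPU time between two samples. A minimal sketch of a similar calculation against cgroup v1 counters, hypothetical code rather than the PR's, normalized so that saturating every CPU available to the container reads as 100%:)

import time

CPUACCT_USAGE = "/sys/fs/cgroup/cpuacct/cpuacct.usage"  # container CPU time in nanoseconds


def _container_cpu_ns() -> int:
    with open(CPUACCT_USAGE) as f:
        return int(f.read())


def container_cpu_percent(num_container_cpus: float, interval: float = 1.0) -> float:
    """Percent of the container's CPU allocation in use over `interval` seconds."""
    start_usage = _container_cpu_ns()
    start_time = time.monotonic()
    time.sleep(interval)
    used_cpu_seconds = (_container_cpu_ns() - start_usage) / 1e9
    elapsed = time.monotonic() - start_time
    # 1 busy CPU out of 2 container CPUs -> ~50.0, as in the screenshot.
    return 100.0 * used_cpu_seconds / (elapsed * num_container_cpus)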

Related issue number

CPU subproblem of #11172

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@DmitriGekhtman
Contributor Author

DmitriGekhtman commented Mar 10, 2021

One thing I'm not sure about -- (but can check experimentally tomorrow morning by launching on AWS and seeing what psutil does there)
if fully using 2 out of 2 available CPUs, do we call it 100% usage or 200% usage?
(Current code in this PR assumes it's 100%)

@DmitriGekhtman
Contributor Author

> One thing I'm not sure about -- (but can check experimentally tomorrow morning by launching on AWS and seeing what psutil does there)
> if fully using 2 out of 2 available CPUs, do we call it 100% usage or 200% usage?
> (Current code in this PR assumes it's 100%)

^ 2/2 CPU is 100%
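(For reference, that matches psutil's own convention; this is psutil's documented behavior, not code from this PR:)

import psutil

# The system-wide percentage is already averaged across CPUs: a 2-CPU machine
# with both CPUs saturated reports ~100.0, not 200.0.
print(psutil.cpu_percent(interval=1))               # e.g. ~50.0 with one of two CPUs stressed
print(psutil.cpu_percent(interval=1, percpu=True))  # e.g. [100.0, 0.0]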

Contributor

@edoakes edoakes left a comment


This looks great! Is it possible to add some testing to check the logic? At the very least, we can have some unit tests for the parsing/calculation logic.

@edoakes edoakes added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 10, 2021
    return 0.0


def container_cpu_count():
Contributor


Should we be doing this in the core too? If so, can we move the cpu count part there?

Contributor Author


ray.utils._get_docker_cpus currently does min(A, B) with
A = quota/period
B = number of cpus indicated by cpuset.cpus
We could add
C = (cpu shares) / 1024
and do min(A, B, C)

What do you think?
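(A hypothetical sketch of that proposal, with illustrative names and cgroup v1 paths; this is not Ray's actual _get_docker_cpus:)

def _count_cpus_in_cpuset(cpuset_str: str) -> int:
    """Count CPUs in a cpuset string such as "0-3,7" (here, 5 CPUs)."""
    count = 0
    for part in cpuset_str.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            count += int(hi) - int(lo) + 1
        else:
            count += 1
    return count


def _get_docker_cpus_sketch() -> float:
    with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
        quota = int(f.read())  # -1 when no CPU limit is set
    with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
        period = int(f.read())
    with open("/sys/fs/cgroup/cpuset/cpuset.cpus") as f:
        cpuset = f.read().strip()
    with open("/sys/fs/cgroup/cpu/cpu.shares") as f:
        shares = int(f.read())

    candidates = []
    if quota > 0:
        candidates.append(quota / period)             # A = quota / period
    candidates.append(_count_cpus_in_cpuset(cpuset))  # B = CPUs in cpuset.cpus
    candidates.append(shares / 1024)                  # C = cpu shares / 1024
    return min(candidates)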

Contributor


I would think that's preferable since I can't think of a case in which the dashboard should think the cluster has a different number of cpus than the scheduler.

Contributor Author


And probably after we do this, we'd want to use ray.utils.get_num_cpus() in ReporterAgent rather than psutil.cpu_count.

But we should probably respect what the tuple in this line is doing:

self._cpu_counts = (psutil.cpu_count(),
                    psutil.cpu_count(logical=False))

Does anyone know what this pair of counts (the normal count and the logical count) means, and what it should mean in the context of a container running in a K8s pod?
cc @fyrestone

Contributor Author


For the moment I'll avoid dealing with it by keeping the if IN_KUBERNETES check.

Contributor

@fyrestone fyrestone Mar 11, 2021


> And probably after we do this, we'd want to use ray.utils.get_num_cpus() in ReporterAgent rather than psutil.cpu_count.
>
> But we should probably respect what the tuple in this line is doing:
>
>     self._cpu_counts = (psutil.cpu_count(),
>                         psutil.cpu_count(logical=False))
>
> Does anyone know what this pair of counts (the normal count and the logical count) means, and what it should mean in the context of a container running in a K8s pod?
> cc @fyrestone

Sorry, I am not familiar with K8s resources. The normal count and logical count are defined here: https://psutil.readthedocs.io/en/latest/#psutil.cpu_count
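(For quick reference, psutil's documented behavior, not code from this PR:)

import psutil

print(psutil.cpu_count())               # logical CPUs, hyperthreads included (e.g. 8)
print(psutil.cpu_count(logical=False))  # physical cores only (e.g. 4)
# Neither call is cgroup-aware: inside a K8s pod both report the host's CPUs,
# not the pod's CPU request/limit.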

@DmitriGekhtman DmitriGekhtman removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 11, 2021
@@ -39,6 +93,64 @@ def test_dashboard(shutdown_only):
f"Dashboard output log: {out_log}\n")


@pytest.mark.skipif(
    sys.platform.startswith("win"), reason="No need to test on Windows.")
def test_k8s_cpu():
Contributor Author


Just realized this might not be the best place for this new test.
Where in the CI does test_dashboard.py run? (It's not specified in tests/BUILD)

@DmitriGekhtman
Contributor Author

Also, I've meddled with ray.utils._get_docker_cpus() -- is that tested somewhere or should I add additional unit tests for that?
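(One way such parsing logic could be unit-tested, sketched with pytest's tmp_path fixture; the import path and the shares/1024 conversion are assumptions based on the discussion above:)

from ray.utils import get_k8s_cpus  # assumed location, alongside _get_docker_cpus


def test_get_k8s_cpus(tmp_path):
    share_file = tmp_path / "cpu.shares"
    share_file.write_text("2048")
    # 2048 cgroup cpu shares should correspond to 2 CPUs.
    assert get_k8s_cpus(cpu_share_file_name=str(share_file)) == 2.0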

@wuisawesome
Contributor

btw @ijrsvt can you take a look at this? after all the fun we had with docker cpus before :)

@DmitriGekhtman
Contributor Author

Ugh -- I don't really understand the meaning of cpu.shares.

Just launched with the default AWS cluster launcher config (which runs a Docker container on a 2-CPU AWS instance); /sys/fs/cgroup/cpu/cpu.shares reads 1024, not 2048.

@wuisawesome
I'm going to revert this PR to doing the num_cpu computation for K8s outside of core.
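(For context on that value, as general Docker/K8s cgroup v1 background rather than anything established in this thread:)

# cpu.shares (cgroup v1) is a relative scheduling weight, not a CPU count.
# Docker leaves it at the cgroup default of 1024 when --cpu-shares isn't
# passed, regardless of how many CPUs the host has, which is why a plain
# docker run on a 2-CPU instance still shows 1024. Kubernetes, by contrast,
# sets it from the pod's CPU request: 1 CPU -> 1024 shares, 500m -> 512.
with open("/sys/fs/cgroup/cpu/cpu.shares") as f:
    print(int(f.read()))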

@DmitriGekhtman
Contributor Author

Actually, I'll put the K8s logic in ray.utils.get_num_cpus inside an if IN_K8S_POD branch.

@DmitriGekhtman
Contributor Author

Did a final check on GKE. Works as expected. I think this PR is ready, pending tests.

@DmitriGekhtman DmitriGekhtman added this to the Serverless Autoscaling milestone Mar 12, 2021
@DmitriGekhtman DmitriGekhtman added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Mar 12, 2021
Comment on lines +467 to +478
def get_k8s_cpus(cpu_share_file_name="/sys/fs/cgroup/cpu/cpu.shares") -> float:
    """Get number of CPUs available for use by this container, in terms of
    cgroup cpu shares.

    This is the number of CPUs K8s has assigned to the container based
    on pod spec requests and limits.

    Note: using cpu_quota as in _get_docker_cpus() works
    only if the user set CPU limit in their pod spec (in addition to CPU
    request). Otherwise, the quota is unset.
    """
    try:
Contributor


Thanks for splitting this out from the usual docker logic!
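(The diff above cuts off at the try:. A rough illustration of what the body could do, reading the shares file and falling back on error; this is a sketch based on the shares/1024 discussion, not the PR's exact code:)

import logging

logger = logging.getLogger(__name__)


def get_k8s_cpus_sketch(cpu_share_file_name="/sys/fs/cgroup/cpu/cpu.shares") -> float:
    try:
        with open(cpu_share_file_name) as f:
            cpu_shares = int(f.read())
        return float(cpu_shares) / 1024
    except Exception:
        logger.exception("Error reading cgroup cpu shares; defaulting to 1 CPU.")
        return 1.0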

@ijrsvt ijrsvt self-requested a review March 12, 2021 07:08
@DmitriGekhtman
Contributor Author

Could someone merge this? (lost track of who has merge permissions these days)

@edoakes edoakes merged commit a90cffe into ray-project:master Mar 12, 2021
@DmitriGekhtman DmitriGekhtman deleted the k8s-dashboard-cpuv2 branch March 12, 2021 20:29