
[dashboard][k8s] Better CPU reporting when running on K8s #14593

Merged: 32 commits into ray-project:master on Mar 12, 2021

Conversation

DmitriGekhtman
Contributor

Why are these changes needed?

I basically copied the logic used in the docker stats cli command for this
https://github.com/docker/cli/blob/c0a6b1c7b30203fbc28cd619acb901a95a80e30e/cli/command/container/stats_helpers.go#L166

Here's what the dashboard looks like when you launch a Ray node on K8s with 2 CPUs and add 1 CPU's worth of stress with
sudo stress --cpu 1 --timeout 30
[Screenshot: dashboard CPU panel showing roughly 50% usage.]
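(The docker stats helper compares the change in the container's total CPU time against the change in the whole system's CPU time between two samples. A minimal sketch of a similar calculation against cgroup v1 counters, hypothetical code rather than the PR's, normalized so that saturating every CPU available to the container reads as 100%:)

import time

CPUACCT_USAGE = "/sys/fs/cgroup/cpuacct/cpuacct.usage"  # container CPU time in nanoseconds


def _container_cpu_ns() -> int:
    with open(CPUACCT_USAGE) as f:
        return int(f.read())


def container_cpu_percent(num_container_cpus: float, interval: float = 1.0) -> float:
    """Percent of the container's CPU allocation in use over `interval` seconds."""
    start_usage = _container_cpu_ns()
    start_time = time.monotonic()
    time.sleep(interval)
    used_cpu_seconds = (_container_cpu_ns() - start_usage) / 1e9
    elapsed = time.monotonic() - start_time
    # 1 busy CPU out of 2 container CPUs -> ~50.0, as in the screenshot.
    return 100.0 * used_cpu_seconds / (elapsed * num_container_cpus)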

Related issue number

CPU subproblem of #11172

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@DmitriGekhtman
Contributor Author

DmitriGekhtman commented Mar 10, 2021

One thing I'm not sure about -- (but can check experimentally tomorrow morning by launching on AWS and seeing what psutil does there)
if fully using 2 out of 2 available CPUs, do we call it 100% usage or 200% usage?
(Current code in this PR assumes it's 100%)

@DmitriGekhtman
Contributor Author

> One thing I'm not sure about -- (but can check experimentally tomorrow morning by launching on AWS and seeing what psutil does there)
> if fully using 2 out of 2 available CPUs, do we call it 100% usage or 200% usage?
> (Current code in this PR assumes it's 100%)

^ 2/2 CPU is 100%
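(For reference, that matches psutil's own convention; this is psutil's documented behavior, not code from this PR:)

import psutil

# The system-wide percentage is already averaged across CPUs: a 2-CPU machine
# with both CPUs saturated reports ~100.0, not 200.0.
print(psutil.cpu_percent(interval=1))               # e.g. ~50.0 with one of two CPUs stressed
print(psutil.cpu_percent(interval=1, percpu=True))  # e.g. [100.0, 0.0]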

Contributor

@edoakes edoakes left a comment


This looks great! Is it possible to add some testing to check the logic? At the very least, we can have some unit tests for the parsing/calculation logic.

@edoakes edoakes added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 10, 2021
    return 0.0


def container_cpu_count():
Contributor


Should we be doing this in the core too? If so, can we move the cpu count part there?

Contributor Author


ray.utils._get_docker_cpus currently does min(A, B) with
A = quota/period
B = number of cpus indicated by cpuset.cpus
We could add
C = (cpu shares) / 1024
and do min(A, B, C)

What do you think?
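(A hypothetical sketch of that proposal, with illustrative names and cgroup v1 paths; this is not Ray's actual _get_docker_cpus:)

def _count_cpus_in_cpuset(cpuset_str: str) -> int:
    """Count CPUs in a cpuset string such as "0-3,7" (here, 5 CPUs)."""
    count = 0
    for part in cpuset_str.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            count += int(hi) - int(lo) + 1
        else:
            count += 1
    return count


def _get_docker_cpus_sketch() -> float:
    with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
        quota = int(f.read())  # -1 when no CPU limit is set
    with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
        period = int(f.read())
    with open("/sys/fs/cgroup/cpuset/cpuset.cpus") as f:
        cpuset = f.read().strip()
    with open("/sys/fs/cgroup/cpu/cpu.shares") as f:
        shares = int(f.read())

    candidates = []
    if quota > 0:
        candidates.append(quota / period)             # A = quota / period
    candidates.append(_count_cpus_in_cpuset(cpuset))  # B = CPUs in cpuset.cpus
    candidates.append(shares / 1024)                  # C = cpu shares / 1024
    return min(candidates)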

Contributor


I would think that's preferable since I can't think of a case in which the dashboard should think the cluster has a different number of cpus than the scheduler.

Contributor Author


And probably after we do this, we'd want to use ray.utils.get_num_cpus() in ReporterAgent rather than psutil.cpu_count.

But we should probably respect what the tuple in this line is doing:

self._cpu_counts = (psutil.cpu_count(),
                    psutil.cpu_count(logical=False))

Does anyone know what this pair of counts (the normal count and the logical count) means, and what it should mean in the context of a container running in a K8s pod?
cc @fyrestone

Contributor Author


For the moment I'll avoid dealing with it by keeping the if IN_KUBERNETES check.

Contributor

@fyrestone fyrestone Mar 11, 2021


> And probably after we do this, we'd want to use ray.utils.get_num_cpus() in ReporterAgent rather than psutil.cpu_count.
>
> But we should probably respect what the tuple in this line is doing:
>
>     self._cpu_counts = (psutil.cpu_count(),
>                         psutil.cpu_count(logical=False))
>
> Does anyone know what this pair of counts (the normal count and the logical count) means, and what it should mean in the context of a container running in a K8s pod?
> cc @fyrestone

Sorry, I am not familiar with K8s resources. The normal count and logical count are defined here: https://psutil.readthedocs.io/en/latest/#psutil.cpu_count
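(For quick reference, psutil's documented behavior, not code from this PR:)

import psutil

print(psutil.cpu_count())               # logical CPUs, hyperthreads included (e.g. 8)
print(psutil.cpu_count(logical=False))  # physical cores only (e.g. 4)
# Neither call is cgroup-aware: inside a K8s pod both report the host's CPUs,
# not the pod's CPU request/limit.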

@DmitriGekhtman DmitriGekhtman removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 11, 2021
@@ -39,6 +93,64 @@ def test_dashboard(shutdown_only):
f"Dashboard output log: {out_log}\n")


@pytest.mark.skipif(
    sys.platform.startswith("win"), reason="No need to test on Windows.")
def test_k8s_cpu():
Contributor Author


Just realized this might not be the best place for this new test.
Where in the CI does test_dashboard.py run? (It's not specified in tests/BUILD)

@DmitriGekhtman
Contributor Author

Also, I've meddled with ray.utils._get_docker_cpus() -- is that tested somewhere or should I add additional unit tests for that?
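(One way such parsing logic could be unit-tested, sketched with pytest's tmp_path fixture; the import path and the shares/1024 conversion are assumptions based on the discussion above:)

from ray.utils import get_k8s_cpus  # assumed location, alongside _get_docker_cpus


def test_get_k8s_cpus(tmp_path):
    share_file = tmp_path / "cpu.shares"
    share_file.write_text("2048")
    # 2048 cgroup cpu shares should correspond to 2 CPUs.
    assert get_k8s_cpus(cpu_share_file_name=str(share_file)) == 2.0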

@wuisawesome
Contributor

btw @ijrsvt can you take a look at this? after all the fun we had with docker cpus before :)

@DmitriGekhtman
Contributor Author

Ugh -- I don't really understand the meaning of cpu.shares.

Just launched with the default AWS cluster launcher config (which runs a Docker container on a 2-CPU AWS instance); /sys/fs/cgroup/cpu/cpu.shares reads 1024, not 2048.

@wuisawesome
I'm going to revert this PR to doing the num_cpu computation for K8s outside of core.
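(For context on that value, as general Docker/K8s cgroup v1 background rather than anything established in this thread:)

# cpu.shares (cgroup v1) is a relative scheduling weight, not a CPU count.
# Docker leaves it at the cgroup default of 1024 when --cpu-shares isn't
# passed, regardless of how many CPUs the host has, which is why a plain
# docker run on a 2-CPU instance still shows 1024. Kubernetes, by contrast,
# sets it from the pod's CPU request: 1 CPU -> 1024 shares, 500m -> 512.
with open("/sys/fs/cgroup/cpu/cpu.shares") as f:
    print(int(f.read()))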

@DmitriGekhtman
Contributor Author

Actually, I'll put the K8s logic in ray.utils.get_num_cpus inside an if IN_K8S_POD branch.

@DmitriGekhtman
Contributor Author

Did a final check on GKE. Works as expected. I think this PR is ready, pending tests.

@DmitriGekhtman DmitriGekhtman added this to the Serverless Autoscaling milestone Mar 12, 2021
@DmitriGekhtman DmitriGekhtman added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Mar 12, 2021
Comment on lines +467 to +478
def get_k8s_cpus(cpu_share_file_name="/sys/fs/cgroup/cpu/cpu.shares") -> float:
    """Get number of CPUs available for use by this container, in terms of
    cgroup cpu shares.

    This is the number of CPUs K8s has assigned to the container based
    on pod spec requests and limits.

    Note: using cpu_quota as in _get_docker_cpus() works
    only if the user set CPU limit in their pod spec (in addition to CPU
    request). Otherwise, the quota is unset.
    """
    try:
Contributor


Thanks for splitting this out from the usual docker logic!
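(The diff above cuts off at the try:. A rough illustration of what the body could do, reading the shares file and falling back on error; this is a sketch based on the shares/1024 discussion, not the PR's exact code:)

import logging

logger = logging.getLogger(__name__)


def get_k8s_cpus_sketch(cpu_share_file_name="/sys/fs/cgroup/cpu/cpu.shares") -> float:
    try:
        with open(cpu_share_file_name) as f:
            cpu_shares = int(f.read())
        return float(cpu_shares) / 1024
    except Exception:
        logger.exception("Error reading cgroup cpu shares; defaulting to 1 CPU.")
        return 1.0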

@ijrsvt ijrsvt self-requested a review March 12, 2021 07:08
@DmitriGekhtman
Contributor Author

Could someone merge this? (lost track of who has merge permissions these days)

@edoakes edoakes merged commit a90cffe into ray-project:master Mar 12, 2021
@DmitriGekhtman DmitriGekhtman deleted the k8s-dashboard-cpuv2 branch March 12, 2021 20:29