
[dashboard][kubernetes] Show container's memory info on K8s, not the physical host's. #14499

Merged: 15 commits merged into ray-project:master from k8s-dashboard-memory on Mar 9, 2021

Conversation


@DmitriGekhtman (Contributor) commented Mar 5, 2021

Why are these changes needed?

Shows container's memory info in the dashboard when using Kubernetes, not the physical host's.
K8s dashboard situation looks a little better:
[Screenshot: Ray dashboard memory view on Kubernetes]

(1 head, 2 min_workers all with 512Mi memory limit)

Related issue number

Addresses the memory subproblem of #11172.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(


@wuisawesome (Contributor) left a comment:


Why do we need the if statement? Shouldn't ray.utils.get_system_memory() already do all these checks?

If we have a good reason for deviating from how the core measures memory usage, we should carefully document it, especially now that we're auto-reporting memory resources within ray.available_resources().

Btw the issue also mentions CPU usage. Should this PR also include that?
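For context, the kind of check being discussed above — prefer the container's cgroup memory limit over the host total — looks roughly like the minimal sketch below. It assumes the standard cgroup v1 mount point and a made-up helper name; it is illustrative, not the actual `ray.utils.get_system_memory()` implementation.

```python
import psutil  # Ray already depends on psutil; used here only for the host fallback


def container_aware_total_memory() -> int:
    """Return total memory in bytes, preferring the container's cgroup limit."""
    cgroup_v1_limit = "/sys/fs/cgroup/memory/memory.limit_in_bytes"
    try:
        with open(cgroup_v1_limit) as f:
            limit = int(f.read().strip())
        # With no limit configured, the kernel reports a huge sentinel value,
        # so treat anything implausibly large as "unlimited".
        if limit < 2**48:
            return limit
    except OSError:
        pass  # No cgroup v1 memory controller at this path; fall through.
    # Fall back to the physical host's total memory.
    return psutil.virtual_memory().total
```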

@DmitriGekhtman (Contributor, Author)

> If we have a good reason for deviating from how the core measures memory usage, we should carefully document it, especially now that we're auto-reporting memory resources within ray.available_resources().

I think psutil's definition of percent is different (illustrated in the sketch below), but sure, I'll get rid of the if statement.
I think with this change the percent reported for non-Docker setups will be slightly lower.

I decided not to deal with CPU right now, as it seems slightly more subtle -- unless you (or anyone else) have a quick fix idea!
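To make the "different definition of percent" point concrete: psutil derives its percentage from the host's available memory, while a limit-based calculation typically divides bytes in use by the (container) total. A minimal sketch of the two definitions follows; the exact formula used by the dashboard may differ.

```python
import psutil

vm = psutil.virtual_memory()

# psutil's definition: the share of host memory that is not "available",
# i.e. 100 * (vm.total - vm.available) / vm.total.
psutil_percent = vm.percent

# A limit-based definition: bytes in use divided by the (container) total.
limit_based_percent = 100 * vm.used / vm.total

print(f"psutil percent: {psutil_percent:.1f}%  used/total: {limit_based_percent:.1f}%")
```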

@wuisawesome (Contributor)

ray.utils.get_num_cpus()? Or is the issue that you want to measure usage instead of the scheduler allocations?

@DmitriGekhtman (Contributor, Author)

> ray.utils.get_num_cpus()? Or is the issue that you want to measure usage instead of the scheduler allocations?

When I run on minikube with 8 CPUs allocated to minikube, attach to the head pod, and run ray.utils.get_num_cpus(), I get 8 instead of the 1-CPU request/limit I've assigned to the pod.

@DmitriGekhtman (Contributor, Author)

We do want accurate usage and percent for CPU too.

@wuisawesome (Contributor)

Isn't that a pretty big issue since it means Ray will incorrectly set num_cpus? Do we have an issue tracking that?

@DmitriGekhtman (Contributor, Author)

> Isn't that a pretty big issue since it means Ray will incorrectly set num_cpus? Do we have an issue tracking that?

KubernetesNodeProvider reads from the pod spec: https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/kubernetes/config.py#L93
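In other words, on Kubernetes the CPU count comes from the pod spec's container resources rather than from the OS. A hedged sketch of what parsing a CPU value such as "500m" out of such a spec can look like (field names follow the Kubernetes pod spec; the helper and example pod are illustrative, not the actual Ray code at the link above):

```python
import math


def cpus_from_pod_spec(pod_spec: dict) -> int:
    """Derive a CPU count from the first container's resource limits/requests."""
    resources = pod_spec["spec"]["containers"][0].get("resources", {})
    cpu = resources.get("limits", {}).get("cpu") or resources.get("requests", {}).get("cpu")
    if cpu is None:
        return 0
    cpu = str(cpu)
    if cpu.endswith("m"):  # millicpu form, e.g. "500m" means half a CPU
        return math.ceil(int(cpu[:-1]) / 1000)
    return math.ceil(float(cpu))


example_pod = {
    "spec": {
        "containers": [{
            "name": "ray-node",
            "resources": {
                "requests": {"cpu": "1000m", "memory": "512Mi"},
                "limits": {"cpu": "1", "memory": "512Mi"},
            },
        }]
    }
}
print(cpus_from_pod_spec(example_pod))  # -> 1
```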

@DmitriGekhtman (Contributor, Author)

> > ray.utils.get_num_cpus()? Or is the issue that you want to measure usage instead of the scheduler allocations?
>
> When I run on minikube with 8 CPUs allocated to minikube, attach to the head pod, and run ray.utils.get_num_cpus(), I get 8 instead of the 1-CPU request/limit I've assigned to the pod.

After looking into it more carefully: it reads 8 because the container can indeed run on all 8 CPUs available to it, but it is throttled to at most x CPUs' worth of cycles per unit time, where x is the limit specified in the pod spec.
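That throttling is what the CFS quota controls: the container sees every host CPU, but the kernel caps it at quota/period CPUs' worth of time. A container-aware CPU count therefore has to read the cgroup quota rather than count cores. Below is a minimal sketch assuming cgroup v1 paths; it is not the actual Ray code.

```python
import multiprocessing


def container_aware_cpu_count() -> float:
    """Return the CPU limit implied by the cgroup v1 CFS quota, else the host count."""
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read().strip())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read().strip())
        if quota > 0 and period > 0:
            # e.g. quota=100000, period=100000 -> 1.0 CPU, even on an 8-core host.
            return quota / period
    except OSError:
        pass  # No cgroup v1 CPU controller mounted; fall through.
    return float(multiprocessing.cpu_count())


print(container_aware_cpu_count())
```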

@DmitriGekhtman (Contributor, Author)

I think the change in this PR should be pretty uncontroversial.

I've started looking at CPUs; I'd like to isolate those changes in a different PR.

@DmitriGekhtman dismissed wuisawesome's stale review on March 8, 2021 16:31:

Got rid of the if, deferred CPU to another PR.


@wuisawesome (Contributor) left a comment:


Dealing with CPUs in a separate PR sounds good to me.

@wuisawesome merged commit 4a7d9e7 into ray-project:master on Mar 9, 2021
@DmitriGekhtman deleted the k8s-dashboard-memory branch on March 18, 2021
@AmeerHajAli added this to the Serverless Autoscaling milestone on Apr 4, 2021