Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kubernetes] [Dashboard] Remove disk data from dashboard when running on K8s. #14676

Merged

Conversation

DmitriGekhtman
Copy link
Contributor

@DmitriGekhtman DmitriGekhtman commented Mar 15, 2021

Why are these changes needed?

The dashboard currently shows incorrect/confusing information in the Disk column when using Ray on K8s.
For example, when running two Ray pods on one physical K8s node, the limit shown for each is the disk limit of the K8s node
and the limit of the K8s node gets counted twice in the "totals" section.

This PR removes the Disk field from the Dashboard when running on Kubernetes -- the Ray dashboard is not the right place to get info on K8s node disk usage. By default K8s does not limit disk usable by a container.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Tested manually, see discussion below.

@@ -209,9 +209,11 @@ const NodeInfo: React.FC<{}> = () => {
// Show GPU features only if there is at least one GPU in cluster.
const showGPUs =
nodes.map((n) => n.gpus).filter((gpus) => gpus.length !== 0).length !== 0;
// Don't show disk on K8s. K8s node disk usage should be monitored elsewhere.
const showDisk = !("KUBERNETES_SERVICE_HOST" in process.env);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this won't work because process.env is hard coded at dashboard build time, and this code will be ran in user's browser... in order to do this you have to make this variable in the dashboard backend (see reporting agent, @kathryn-zhou can help as well)

Copy link
Contributor Author

@DmitriGekhtman DmitriGekhtman Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes sense -- so basically my question is how do I get the variable IN_KUBERNETES_POD in reporter_agent.py to NodeInfo.tsx?
https://github.com/ray-project/ray/blob/master/dashboard/modules/reporter/reporter_agent.py#L29

-- by what code path does stuff exported by reporter_agent land in the frontend?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edoakes do you know how this works?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kathryn-zhou you have added metrics to this reporter agent and pop it up in the frontend before. Can you leave some pointers here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you want to do is have the environment variable check here and report a special number that you use to not display usage on the dashboard:
https://github.com/ray-project/ray/blob/master/dashboard/modules/reporter/reporter_agent.py#L238

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g., if we report 0 used and 0 available we don't even render it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds reasonable, will try that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't added metrics to the frontend before, but @DmitriGekhtman it might be worthwhile to look at how GPU metrics are shown on the dashboard. I believe they are only shown if there is a GPU in the cluster.

@rkooo567 rkooo567 removed their assignment Mar 18, 2021
@DmitriGekhtman DmitriGekhtman force-pushed the k8s-dashboard-turn-off-disk branch 2 times, most recently from 2dcccfa to ac91c90 Compare April 3, 2021 14:45
@DmitriGekhtman
Copy link
Contributor Author

This PR is ready for review.

Disk field gone on K8s when running image with these changes:

Screen Shot 2021-04-03 at 12 18 38 AM

Disk field still present on AWS when running image with these changes:

Screen Shot 2021-03-10 at 12 27 19 AM

@DmitriGekhtman DmitriGekhtman changed the title [Draft] [Kubernetes] [Dashboard] Remove disk data from dashboard when running on K8s. [Kubernetes] [Dashboard] Remove disk data from dashboard when running on K8s. Apr 3, 2021
@AmeerHajAli AmeerHajAli added this to the Serverless Autoscaling milestone Apr 4, 2021
@edoakes
Copy link
Contributor

edoakes commented Apr 5, 2021

@DmitriGekhtman looks good but lint is failing:
https://buildkite.com/ray-project/ray-builders-pr/builds/3760#f3712600-6600-458b-a523-ca3079d7d61e/412-519

Looks like you need to run Prettier to fmt the frontend code. @kathryn-zhou I think you ran into this recently?

@DmitriGekhtman
Copy link
Contributor Author

Lint fixed, good to merge.

@simon-mo simon-mo merged commit 410f768 into ray-project:master Apr 6, 2021
edoakes pushed a commit that referenced this pull request May 3, 2022
…24416)

#14676 disabled the disk usage/total display for Ray nodes on K8s, because Ray nodes on K8s are run as pods, which in general do not use up the entire machine.

However, in some situations, it is useful to run one Ray pod per K8s node and report the disk usage.

This PR adds a flag to enable displaying disk usage in those situations.
DmitriGekhtman added a commit to DmitriGekhtman/ray that referenced this pull request May 3, 2022
…ay-project#24416)

ray-project#14676 disabled the disk usage/total display for Ray nodes on K8s, because Ray nodes on K8s are run as pods, which in general do not use up the entire machine.

However, in some situations, it is useful to run one Ray pod per K8s node and report the disk usage.

This PR adds a flag to enable displaying disk usage in those situations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants