-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Kubernetes] [Dashboard] Remove disk data from dashboard when running on K8s. #14676
[Kubernetes] [Dashboard] Remove disk data from dashboard when running on K8s. #14676
Conversation
299e36f
to
ddd2b41
Compare
@@ -209,9 +209,11 @@ const NodeInfo: React.FC<{}> = () => { | |||
// Show GPU features only if there is at least one GPU in cluster. | |||
const showGPUs = | |||
nodes.map((n) => n.gpus).filter((gpus) => gpus.length !== 0).length !== 0; | |||
// Don't show disk on K8s. K8s node disk usage should be monitored elsewhere. | |||
const showDisk = !("KUBERNETES_SERVICE_HOST" in process.env); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this won't work because process.env is hard coded at dashboard build time, and this code will be ran in user's browser... in order to do this you have to make this variable in the dashboard backend (see reporting agent, @kathryn-zhou can help as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that makes sense -- so basically my question is how do I get the variable IN_KUBERNETES_POD
in reporter_agent.py
to NodeInfo.tsx
?
https://github.com/ray-project/ray/blob/master/dashboard/modules/reporter/reporter_agent.py#L29
-- by what code path does stuff exported by reporter_agent
land in the frontend?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@edoakes do you know how this works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kathryn-zhou you have added metrics to this reporter agent and pop it up in the frontend before. Can you leave some pointers here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what you want to do is have the environment variable check here and report a special number that you use to not display usage on the dashboard:
https://github.com/ray-project/ray/blob/master/dashboard/modules/reporter/reporter_agent.py#L238
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g., if we report 0
used and 0
available we don't even render it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds reasonable, will try that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't added metrics to the frontend before, but @DmitriGekhtman it might be worthwhile to look at how GPU metrics are shown on the dashboard. I believe they are only shown if there is a GPU in the cluster.
2dcccfa
to
ac91c90
Compare
@DmitriGekhtman looks good but lint is failing: Looks like you need to run |
ac91c90
to
448a78d
Compare
Lint fixed, good to merge. |
…24416) #14676 disabled the disk usage/total display for Ray nodes on K8s, because Ray nodes on K8s are run as pods, which in general do not use up the entire machine. However, in some situations, it is useful to run one Ray pod per K8s node and report the disk usage. This PR adds a flag to enable displaying disk usage in those situations.
…ay-project#24416) ray-project#14676 disabled the disk usage/total display for Ray nodes on K8s, because Ray nodes on K8s are run as pods, which in general do not use up the entire machine. However, in some situations, it is useful to run one Ray pod per K8s node and report the disk usage. This PR adds a flag to enable displaying disk usage in those situations.
Why are these changes needed?
The dashboard currently shows incorrect/confusing information in the Disk column when using Ray on K8s.
For example, when running two Ray pods on one physical K8s node, the limit shown for each is the disk limit of the K8s node
and the limit of the K8s node gets counted twice in the "totals" section.
This PR removes the Disk field from the Dashboard when running on Kubernetes -- the Ray dashboard is not the right place to get info on K8s node disk usage. By default K8s does not limit disk usable by a container.
Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.Tested manually, see discussion below.