Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dashboard] Fixes the CPU percent metrics for Dashboard process. #45124

Merged
merged 4 commits into from
May 5, 2024

Conversation

rynewang
Copy link
Contributor

@rynewang rynewang commented May 3, 2024

The metrics ray_component_cpu_percentage for Dashboard is always 0. This is because every time we create a fresh psutil.Process instance. In the document it says:

When *interval* is 0.0 or None (default) compares process times
to system CPU times elapsed since last call, returning
immediately (non-blocking). That means that the first time
this is called it will return a meaningful 0.0 value.

... so we always get a 0.0 value.

This PR changes to cache the dash proc psutil.Process instance in the head module class so the estimation can be non-0. Note: the readings are CPU percentage since the last call so it's a 5 second (METRICS_RECORD_INTERVAL_S) average.

I tested on my laptop, the value matches output of ps -p $PID -o %cpu and an external psutil python script.

The metrics `ray_component_cpu_percentage` for Dashboard is always 0. This is because every time we create a fresh `psutil.Process` instance. In the document it says:

    When *interval* is 0.0 or None (default) compares process times
    to system CPU times elapsed since last call, returning
    immediately (non-blocking). That means that the first time
    this is called it will return a meaningful 0.0 value.

... so we always get a 0.0 value. Further, in Kubernetes the CPU percentage should be based on the amount of CPUs available to the container. Both issues are already fixed in the `reporter_agent.py` file. We need to make the dashboard metrics comparable to that.

This PR moves the CPU percentage calculation to a shared function in `metrics_utils.py` and uses it in both the agent code and the head code.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
dashboard/metrics_utils.py Outdated Show resolved Hide resolved
dashboard/metrics_utils.py Outdated Show resolved Hide resolved
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Copy link
Collaborator

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do a quick check in other places to see if this bug exists?

dashboard/modules/metrics/metrics_head.py Outdated Show resolved Hide resolved
dashboard/modules/metrics/metrics_head.py Outdated Show resolved Hide resolved
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang
Copy link
Contributor Author

rynewang commented May 4, 2024

That's the only 2 metrics currently exposed by the dash (cpu, mem) both ok now. pls merge

@jjyao jjyao merged commit 027f556 into ray-project:master May 5, 2024
5 checks passed
@rynewang rynewang deleted the fix-dash-cpu branch May 5, 2024 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants