Skip to content

[Core] Deflake test_placement_group_reschedule_node_dead with ps wide output#60034

Merged
dayshah merged 1 commit intoray-project:masterfrom
Yicheng-Lu-llll:fix-test-pg-reschedule-ps-aux
Jan 12, 2026
Merged

[Core] Deflake test_placement_group_reschedule_node_dead with ps wide output#60034
dayshah merged 1 commit intoray-project:masterfrom
Yicheng-Lu-llll:fix-test-pg-reschedule-ps-aux

Conversation

@Yicheng-Lu-llll
Copy link
Member

Description

test_placement_group_reschedule_node_dead uses ps aux | grep {node_id} to find and kill a node.

Recently, I found it failed to kill the node because grep found nothing. The root cause is that ps aux truncates long command lines, causing the --node_id parameter to be cut off and grep to fail finding the process(maybe we have longer raylet command now or we changed terminal setting).

This PR use ps auxww which ensures unlimited output width (the ww option means "wide output, use twice for unlimited width")

See the CI failure: it checks after killing. It should have 3 − 1 = 2, but it always has 3 after the timeout.
https://buildkite.com/ray-project/microcheck/builds/35178/steps/canvas?jid=019ba263-a01d-4ef2-9d7e-d3bfb149e638#019ba263-a01d-4ef2-9d7e-d3bfb149e638/L191
https://buildkite.com/ray-project/microcheck/builds/35178/steps/canvas?jid=019ba332-55cb-4669-ae55-99572de4ffbf#019ba332-55cb-4669-ae55-99572de4ffbf/L1348

@Yicheng-Lu-llll Yicheng-Lu-llll requested a review from a team as a code owner January 11, 2026 08:15
@Yicheng-Lu-llll Yicheng-Lu-llll added the go add ONLY when ready to merge, run all tests label Jan 11, 2026
Signed-off-by: yicheng <yicheng@anyscale.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a test flakiness issue in test_placement_group_reschedule_node_dead by using ps auxww to prevent command line truncation. The change is correct and directly solves the problem described.

I've added one suggestion to refactor the kill_node function to use the psutil library. This would make the implementation more robust, secure, and idiomatic Python by avoiding shell=True and making the process search more specific. This is an optional improvement for better code quality.

@Yicheng-Lu-llll Yicheng-Lu-llll force-pushed the fix-test-pg-reschedule-ps-aux branch from 617f092 to e13015d Compare January 11, 2026 08:18
@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Jan 11, 2026
@dayshah
Copy link
Contributor

dayshah commented Jan 12, 2026

nice fix!! are there other tests that grep ps aux?

@dayshah dayshah merged commit d082283 into ray-project:master Jan 12, 2026
6 checks passed
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
… output (ray-project#60034)

Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
@Yicheng-Lu-llll
Copy link
Member Author

are there other tests that grep ps aux?

Yes! I found a new one:

"ps aux | grep ray::IDLE | grep -v grep",

but it greps ray::IDLE (it is the process title set via setproctitle) so it's safe.

ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
… output (ray-project#60034)

Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants