Skip to content

[core] Use the correct way to check whether an actor task is running or not#51158

Merged
edoakes merged 3 commits into
masterfrom
20250306-devbox1-tmux-3-ray7
Mar 8, 2025
Merged

[core] Use the correct way to check whether an actor task is running or not#51158
edoakes merged 3 commits into
masterfrom
20250306-devbox1-tmux-3-ray7

Conversation

@kevin85421
Copy link
Copy Markdown
Member

@kevin85421 kevin85421 commented Mar 7, 2025

Why are these changes needed?

Ray core worker processes change the process title by calling setproctitle before and after executing tasks. Then, verify_tasks_running_or_terminated uses the following logic to check whether the Ray actor task is running or not:

if psutil.pid_exists(pid) and task_name in psutil.Process(pid).name():

However, the logic is not correct in some cases. To understand the issue, we need to understand the behavior of setproctitle and psutil.Process(pid).name().

  • setproctitle:

    • It calls functions like prctl based on the OS and arch under the hood. It updates /proc/pid/task/tid/comm of the calling thread.
    • If the thread is the leader thread (i.e. main thread in Python), /proc/pid/comm will also be updated.
    • /proc/pid/cmdline and /proc/pid/task/tid/cmdline will always be updated based on my observation.
    • Note that comm has a length limitation (15 chars)
  • psutil.Process(pid).name()

If the actor task is not running on the main thread (in which case setproctitle will not update /proc/pid/comm), and if users pass an options.name to the actor that doesn't start with /proc/pid/comm, then task_name in psutil.Process(pid).name() will evaluate to false, which is incorrect.

def test_default_thread_options_name_long_comm_2(self, ray_start_regular):
    """
    `/proc/PID/comm` is truncated to 15 characters, so the process title
    is "ray::VeryLongCo". `psutil.Process.name()` doesn't return `cmdline()[0]`
    because `cmdline()[0]` doesn't start with "ray::VeryLongCo". This is the
    implementation detail of `psutil`.
    """
    task_name = "hello"

    @ray.remote(concurrency_groups={"io": 1})
    class VeryLongCommActor:
        def check_long_comm(self):
            pid = os.getpid()
            assert _is_actor_task_running(pid, task_name)
            # The first 15 characters of "ray::VeryLongCommActor"
            assert psutil.Process(pid).name() == "ray::VeryLongCo"
            assert psutil.Process(pid).cmdline()[0] == f"ray::{task_name}"
            return pid

    a = VeryLongCommActor.remote()
    pid = ray.get(a.check_long_comm.options(name=task_name).remote())
    wait_for_condition(lambda: not _is_actor_task_running(pid, task_name))

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: kaihsun <kaihsun@anyscale.com>
@kevin85421 kevin85421 force-pushed the 20250306-devbox1-tmux-3-ray7 branch from 3d77e1b to 9ef035b Compare March 7, 2025 19:33
@kevin85421 kevin85421 marked this pull request as ready for review March 7, 2025 19:59
@kevin85421
Copy link
Copy Markdown
Member Author

This is to fix #51058 (comment).

Signed-off-by: kaihsun <kaihsun@anyscale.com>
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Mar 7, 2025
@kevin85421 kevin85421 requested a review from edoakes March 7, 2025 22:57
@edoakes edoakes merged commit 4568beb into master Mar 8, 2025
@edoakes edoakes deleted the 20250306-devbox1-tmux-3-ray7 branch March 8, 2025 00:37
edoakes pushed a commit that referenced this pull request Mar 11, 2025
#51158 adds tests for process titles. However, the definition of "title"
varies across different operating systems, and the behavior of
`psutil.Process().name()` differs as well.


```
[ref]: https://github.com/dvarrazzo/py-setproctitle
> The process title is usually visible in files such as /proc/PID/cmdline,
/proc/PID/status, /proc/PID/comm, depending on the operating system and
kernel version. This information is used by user-space tools such as ps
and top.
```

For example, in Linux, the `/proc/pid/comm` can't be more than 15 chars.
However, MacOS doesn't have `/proc/pid/comm`. In the following test, the
name of the process in Mac is "python3.9", but the name on Linux is
"ray::A.check".

```python
    def test_main_thread_short_comm(self, ray_start_regular):
        """
        Test that the main thread's comm is not truncated.
        """
        @ray.remote
        class A:
            def check(self):
                pid = os.getpid()
                assert _is_actor_task_running(pid, "check")
                assert psutil.Process(pid).name() == "ray::A.check"
                assert psutil.Process(pid).cmdline()[0] == "ray::A.check"
                return pid
        a = A.remote()
        pid = ray.get(a.check.remote())
        wait_for_condition(lambda: not _is_actor_task_running(pid, "check"))
```

## Related issue number

Closes #51182
Closes #45966

On my Mac M3

<img width="1728" alt="image"
src="https://github.com/user-attachments/assets/c8c0addf-9c18-402d-bafd-135ff8f6ea7a"
/>

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
qinyiyan pushed a commit to qinyiyan/ray that referenced this pull request Mar 13, 2025
…1229)

ray-project#51158 adds tests for process titles. However, the definition of "title"
varies across different operating systems, and the behavior of
`psutil.Process().name()` differs as well.


```
[ref]: https://github.com/dvarrazzo/py-setproctitle
> The process title is usually visible in files such as /proc/PID/cmdline,
/proc/PID/status, /proc/PID/comm, depending on the operating system and
kernel version. This information is used by user-space tools such as ps
and top.
```

For example, in Linux, the `/proc/pid/comm` can't be more than 15 chars.
However, MacOS doesn't have `/proc/pid/comm`. In the following test, the
name of the process in Mac is "python3.9", but the name on Linux is
"ray::A.check".

```python
    def test_main_thread_short_comm(self, ray_start_regular):
        """
        Test that the main thread's comm is not truncated.
        """
        @ray.remote
        class A:
            def check(self):
                pid = os.getpid()
                assert _is_actor_task_running(pid, "check")
                assert psutil.Process(pid).name() == "ray::A.check"
                assert psutil.Process(pid).cmdline()[0] == "ray::A.check"
                return pid
        a = A.remote()
        pid = ray.get(a.check.remote())
        wait_for_condition(lambda: not _is_actor_task_running(pid, "check"))
```

## Related issue number

Closes ray-project#51182
Closes ray-project#45966

On my Mac M3

<img width="1728" alt="image"
src="https://github.com/user-attachments/assets/c8c0addf-9c18-402d-bafd-135ff8f6ea7a"
/>

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
…or not (ray-project#51158)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Ray core worker processes change the process title by calling
`setproctitle` before and after executing tasks. Then,
`verify_tasks_running_or_terminated` uses the following logic to check
whether the Ray actor task is running or not:

```python
if psutil.pid_exists(pid) and task_name in psutil.Process(pid).name():
```

However, the logic is not correct in some cases. To understand the
issue, we need to understand the behavior of `setproctitle` and
`psutil.Process(pid).name()`.

* `setproctitle`:
* It calls functions like
[prctl](https://github.com/dvarrazzo/py-setproctitle/blob/cacf96fafa3da1cd1a5b131b4f8b9997c01518d5/src/spt_status.c#L390)
based on the OS and arch under the hood. It updates
`/proc/pid/task/tid/comm` of the calling thread.
* If the thread is the leader thread (i.e. main thread in Python),
`/proc/pid/comm` will also be updated.
* `/proc/pid/cmdline` and `/proc/pid/task/tid/cmdline` will always be
updated based on my observation.
  * Note that `comm` has a length limitation (15 chars)

* `psutil.Process(pid).name()`
*
https://github.com/giampaolo/psutil/blob/a17550784b0d3175da01cdb02cee1bc6b61637dc/psutil/__init__.py#L664-L693
  * Try to retrieve `/proc/pid/comm`.
* If `/proc/pid/comm` is truncated, attempt to retrieve the first
element of `/proc/pid/cmdline`; otherwise, return `/proc/pid/comm`.
* If `cmdline[0]` begins with the contents of `/proc/pid/comm`, return
`cmdline[0]`; otherwise, return the truncated `/proc/pid/comm`.

If the actor task is not running on the main thread (in which case
`setproctitle` will not update `/proc/pid/comm`), and if users pass an
`options.name` to the actor that doesn't start with `/proc/pid/comm`,
then `task_name in psutil.Process(pid).name()` will evaluate to false,
which is incorrect.

```python
def test_default_thread_options_name_long_comm_2(self, ray_start_regular):
    """
    `/proc/PID/comm` is truncated to 15 characters, so the process title
    is "ray::VeryLongCo". `psutil.Process.name()` doesn't return `cmdline()[0]`
    because `cmdline()[0]` doesn't start with "ray::VeryLongCo". This is the
    implementation detail of `psutil`.
    """
    task_name = "hello"

    @ray.remote(concurrency_groups={"io": 1})
    class VeryLongCommActor:
        def check_long_comm(self):
            pid = os.getpid()
            assert _is_actor_task_running(pid, task_name)
            # The first 15 characters of "ray::VeryLongCommActor"
            assert psutil.Process(pid).name() == "ray::VeryLongCo"
            assert psutil.Process(pid).cmdline()[0] == f"ray::{task_name}"
            return pid

    a = VeryLongCommActor.remote()
    pid = ray.get(a.check_long_comm.options(name=task_name).remote())
    wait_for_condition(lambda: not _is_actor_task_running(pid, task_name))
```


## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
…1229)

ray-project#51158 adds tests for process titles. However, the definition of "title"
varies across different operating systems, and the behavior of
`psutil.Process().name()` differs as well.


```
[ref]: https://github.com/dvarrazzo/py-setproctitle
> The process title is usually visible in files such as /proc/PID/cmdline,
/proc/PID/status, /proc/PID/comm, depending on the operating system and
kernel version. This information is used by user-space tools such as ps
and top.
```

For example, in Linux, the `/proc/pid/comm` can't be more than 15 chars.
However, MacOS doesn't have `/proc/pid/comm`. In the following test, the
name of the process in Mac is "python3.9", but the name on Linux is
"ray::A.check".

```python
    def test_main_thread_short_comm(self, ray_start_regular):
        """
        Test that the main thread's comm is not truncated.
        """
        @ray.remote
        class A:
            def check(self):
                pid = os.getpid()
                assert _is_actor_task_running(pid, "check")
                assert psutil.Process(pid).name() == "ray::A.check"
                assert psutil.Process(pid).cmdline()[0] == "ray::A.check"
                return pid
        a = A.remote()
        pid = ray.get(a.check.remote())
        wait_for_condition(lambda: not _is_actor_task_running(pid, "check"))
```

## Related issue number

Closes ray-project#51182
Closes ray-project#45966

On my Mac M3

<img width="1728" alt="image"
src="https://github.com/user-attachments/assets/c8c0addf-9c18-402d-bafd-135ff8f6ea7a"
/>

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
dhakshin32 pushed a commit to dhakshin32/ray that referenced this pull request Mar 27, 2025
…or not (ray-project#51158)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Ray core worker processes change the process title by calling
`setproctitle` before and after executing tasks. Then,
`verify_tasks_running_or_terminated` uses the following logic to check
whether the Ray actor task is running or not:

```python
if psutil.pid_exists(pid) and task_name in psutil.Process(pid).name():
```

However, the logic is not correct in some cases. To understand the
issue, we need to understand the behavior of `setproctitle` and
`psutil.Process(pid).name()`.

* `setproctitle`:
* It calls functions like
[prctl](https://github.com/dvarrazzo/py-setproctitle/blob/cacf96fafa3da1cd1a5b131b4f8b9997c01518d5/src/spt_status.c#L390)
based on the OS and arch under the hood. It updates
`/proc/pid/task/tid/comm` of the calling thread.
* If the thread is the leader thread (i.e. main thread in Python),
`/proc/pid/comm` will also be updated.
* `/proc/pid/cmdline` and `/proc/pid/task/tid/cmdline` will always be
updated based on my observation.
  * Note that `comm` has a length limitation (15 chars)

* `psutil.Process(pid).name()`
*
https://github.com/giampaolo/psutil/blob/a17550784b0d3175da01cdb02cee1bc6b61637dc/psutil/__init__.py#L664-L693
  * Try to retrieve `/proc/pid/comm`.
* If `/proc/pid/comm` is truncated, attempt to retrieve the first
element of `/proc/pid/cmdline`; otherwise, return `/proc/pid/comm`.
* If `cmdline[0]` begins with the contents of `/proc/pid/comm`, return
`cmdline[0]`; otherwise, return the truncated `/proc/pid/comm`.

If the actor task is not running on the main thread (in which case
`setproctitle` will not update `/proc/pid/comm`), and if users pass an
`options.name` to the actor that doesn't start with `/proc/pid/comm`,
then `task_name in psutil.Process(pid).name()` will evaluate to false,
which is incorrect.

```python
def test_default_thread_options_name_long_comm_2(self, ray_start_regular):
    """
    `/proc/PID/comm` is truncated to 15 characters, so the process title
    is "ray::VeryLongCo". `psutil.Process.name()` doesn't return `cmdline()[0]`
    because `cmdline()[0]` doesn't start with "ray::VeryLongCo". This is the
    implementation detail of `psutil`.
    """
    task_name = "hello"

    @ray.remote(concurrency_groups={"io": 1})
    class VeryLongCommActor:
        def check_long_comm(self):
            pid = os.getpid()
            assert _is_actor_task_running(pid, task_name)
            # The first 15 characters of "ray::VeryLongCommActor"
            assert psutil.Process(pid).name() == "ray::VeryLongCo"
            assert psutil.Process(pid).cmdline()[0] == f"ray::{task_name}"
            return pid

    a = VeryLongCommActor.remote()
    pid = ray.get(a.check_long_comm.options(name=task_name).remote())
    wait_for_condition(lambda: not _is_actor_task_running(pid, task_name))
```

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: Dhakshin Suriakannu <d_suriakannu@apple.com>
dhakshin32 pushed a commit to dhakshin32/ray that referenced this pull request Mar 27, 2025
…1229)

ray-project#51158 adds tests for process titles. However, the definition of "title"
varies across different operating systems, and the behavior of
`psutil.Process().name()` differs as well.

```
[ref]: https://github.com/dvarrazzo/py-setproctitle
> The process title is usually visible in files such as /proc/PID/cmdline,
/proc/PID/status, /proc/PID/comm, depending on the operating system and
kernel version. This information is used by user-space tools such as ps
and top.
```

For example, in Linux, the `/proc/pid/comm` can't be more than 15 chars.
However, MacOS doesn't have `/proc/pid/comm`. In the following test, the
name of the process in Mac is "python3.9", but the name on Linux is
"ray::A.check".

```python
    def test_main_thread_short_comm(self, ray_start_regular):
        """
        Test that the main thread's comm is not truncated.
        """
        @ray.remote
        class A:
            def check(self):
                pid = os.getpid()
                assert _is_actor_task_running(pid, "check")
                assert psutil.Process(pid).name() == "ray::A.check"
                assert psutil.Process(pid).cmdline()[0] == "ray::A.check"
                return pid
        a = A.remote()
        pid = ray.get(a.check.remote())
        wait_for_condition(lambda: not _is_actor_task_running(pid, "check"))
```

## Related issue number

Closes ray-project#51182
Closes ray-project#45966

On my Mac M3

<img width="1728" alt="image"
src="https://github.com/user-attachments/assets/c8c0addf-9c18-402d-bafd-135ff8f6ea7a"
/>

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: Dhakshin Suriakannu <d_suriakannu@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-backlog go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants