
Allow ray stop to stop a specific ray start #12264

Closed
joshua-cogliati-inl opened this issue Nov 23, 2020 · 8 comments
Labels
enhancement Request for new feature and/or capability P3 Issue moderate in impact or severity QS Quantsight triage label

Comments

@joshua-cogliati-inl

Describe your feature request

The use case is that we have a cluster shared by multiple users, and we are trying to use Ray on it. Since it is shared, to run a program we request resources (e.g. qsub -l select=4:ncpus=2 ...) and then start Ray on the nodes we have been given. When the program is done, we need to stop Ray.
The issue is, say we request two sets of resources and get the following nodes:
Set1: node1 node2
Set2: node2 node3

Then, when we are done running the program for Set1, we run ray stop on node1 and node2, which also kills the Ray instance running on node2 for Set2. That is a problem.
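The overlap can be reproduced in miniature on a single machine. This is only a sketch: the sh -c commands and the set1-ray/set2-ray tokens are invented stand-ins for the two Ray instances sharing node2, and an indiscriminate pkill plays the role of a plain ray stop.

```shell
#!/bin/sh
# Stand-ins for the two Ray instances sharing node2. The tokens
# set1-ray / set2-ray are invented names; each stand-in traps TERM and
# kills its own child, loosely mimicking fate sharing.
sh -c 'sleep 300 & trap "kill $!" TERM; wait' set1-ray &
sh -c 'sleep 300 & trap "kill $!" TERM; wait' set2-ray &
sleep 1   # let both stand-ins install their traps

# An indiscriminate stop on the shared node (what a plain "ray stop"
# amounts to) matches every instance, so stopping Set1 also kills Set2:
pkill -f 'set[12]-ray'
```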

One way to solve this would be for ray stop to have some way of specifying which Ray instance to stop, such as a password, a head node address and port, or some other unique identifier.

This was first discussed at:
https://github.com/ray-project/ray/discussions/12103

Note that the Slurm documentation for Ray ( https://docs.ray.io/en/master/cluster/slurm.html ) suggests: "Clusters managed by Slurm may require that Ray is initialized as a part of the submitted job". This is basically where we run into the problem: if each submitted job has to start its own Ray, and two jobs share a node, how do you stop the right one?

@joshua-cogliati-inl joshua-cogliati-inl added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 23, 2020
@edoakes
Contributor

edoakes commented Nov 23, 2020

@richardliaw this seems pretty important for better Slurm support.

@wuisawesome
Contributor

I see 3 main ways we could implement this.

  1. Take more advantage of OS utilities to kill processes (e.g. pgrep -P $pid_started_by_slurm | xargs kill).

  2. Change the GCS/raylet/redis process names (everything else should kill itself via fate sharing). Not a big fan of this, since it will take many timeouts for the change to propagate.

  3. Implement an autoscaler node provider that supports namespace isolation (Docker doesn't seem to be a very popular option in the supercomputing world) and recommend people use that.
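Option 1 can be sketched with stand-in processes. Here sh -c '... wait' plays the role of the process Slurm launched, and its sleep children play the Ray daemons; the variable name is taken from the example above, everything else is invented for illustration.

```shell
#!/bin/sh
# A stand-in for the process Slurm started: it spawns two children
# (playing the Ray daemons) and waits on them.
sh -c 'sleep 300 & sleep 300 & wait' &
pid_started_by_slurm=$!
sleep 1   # give the stand-in time to spawn its children

# Option 1: lean on OS utilities to kill the children of that PID,
# then the root process itself (which may already have exited once
# its children died, hence the || true).
pgrep -P "$pid_started_by_slurm" | xargs -r kill
kill "$pid_started_by_slurm" 2>/dev/null || true
```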

@joshua-cogliati-inl
Author

As a subset of 1., ray start could possibly take a filename, write the PIDs that need to be killed into that file, and then ray stop could take the same filename, read the PIDs, and kill them. Or, if ray start could otherwise communicate a PID out, we could grab that and pass it back to ray stop. Or, if ray stop can find one process that is part of the started group, it could send it something like SIGUSR1 (or SIGTERM), and that process could trap the signal and SIGKILL/SIGTERM the rest, assuming some process keeps track of all the Ray children.
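The pid-file variant of this idea can be sketched as follows. No such ray start flag exists today, so the recording is done by hand here, and sleep processes stand in for the daemons ray start would launch.

```shell
#!/bin/sh
# Record the PIDs of everything we start into a file, then later stop
# exactly those PIDs. The sleeps are stand-ins for the Ray daemons; a
# real implementation would have ray start write this file itself.
pid_file=ray_pids.txt
: > "$pid_file"

sleep 300 &                 # stand-in for the raylet
echo $! >> "$pid_file"
sleep 300 &                 # stand-in for the GCS server
echo $! >> "$pid_file"

# The scoped "ray stop": kill only the recorded PIDs.
xargs -r kill < "$pid_file"
```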

@wuisawesome
Contributor

I believe the ray start process should be the "root" of Ray's process tree on all nodes. Could you try

ray start --head &
echo $! > ray_pid.txt

then

pgrep -P $(cat ray_pid.txt) | xargs kill

instead of ray stop, or something along those lines? (Note the trailing &: $! is only set for a command started in the background.)

I think we could implement that fairly easily (minus Windows support).

@joshua-cogliati-inl
Author

joshua-cogliati-inl commented Nov 23, 2020

That didn't work. The parent PID ends up being init:

UID         PID   PPID  C STIME TTY      STAT   TIME CMD
jjc     155887      1  0 14:26 pts/0    Sl     0:00 <snip>/ray/raylet/raylet <snip>

On the other hand, pgrep -f <redispassword> | xargs kill managed to get everything except redis-server and plasma_store_server, since the redis password is in the command line of the rest of the Ray programs.
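The pgrep -f trick generalizes to any unique token planted in the command line, since pgrep -f matches against the full /proc/<pid>/cmdline. A sketch with invented tokens and sh -c stand-ins, not real Ray processes:

```shell
#!/bin/sh
# Two "clusters" on one node, each tagged with a unique token in its
# argv. Each stand-in traps TERM and kills its child, loosely mimicking
# fate sharing among the real Ray daemons.
sh -c 'sleep 300 & trap "kill $!" TERM; wait' token-cluster-a &
sh -c 'sleep 300 & trap "kill $!" TERM; wait' token-cluster-b &
sleep 1   # let the traps get installed

# Stop only cluster A; cluster B keeps running.
pkill -f token-cluster-a
```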

@joshua-cogliati-inl
Author

Considering that ray stop already grabs the cmdline, it could check the redis password and kill only the processes that match; we just need to figure out a way to match the redis-server and plasma_store_server processes.
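One conceivable way to sweep in helpers that lack the password in their own argv is to kill each matched process together with its direct children. Whether redis-server and plasma_store_server actually are children of a password-carrying process is an assumption here; the token and sh -c stand-ins are invented, and this is a sketch of the idea, not Ray's actual shutdown logic.

```shell
#!/bin/sh
# A token-carrying parent (stand-in for a Ray process started with the
# redis password on its command line) with a child whose argv lacks the
# token (stand-in for redis-server / plasma_store_server).
sh -c 'sleep 300 & wait' token-redispw-123 &
sleep 1   # let the parent spawn its child

# Kill each match plus its direct children, catching the token-less
# helpers. Assumes the helpers really are children of a matched process.
for pid in $(pgrep -f token-redispw-123); do
    pgrep -P "$pid" | xargs -r kill
    kill "$pid" 2>/dev/null || true
done
```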

@ericl ericl added P3 Issue moderate in impact or severity and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Dec 8, 2020
@richardliaw richardliaw added the QS Quantsight triage label label Oct 11, 2022
@mattip
Contributor

mattip commented Nov 13, 2022

Related to #11509

@mattip mattip added P2 Important issue, but not time-critical P3 Issue moderate in impact or severity and removed P3 Issue moderate in impact or severity P2 Important issue, but not time-critical labels Nov 15, 2022
@richardliaw
Contributor

(redundant to #11509)
