Allow ray stop to stop a specific ray start #12264
Comments
@richardliaw this seems pretty important for better Slurm support. |
I see 3 main ways we could implement this.
|
As a subset of 1. possibly |
I believe the
then
I think we could implement that fairly easily (minus Windows support). |
That didn't work. The Parent ID ends up being init:
On the other hand, |
Considering that |
Related to #11509 |
(redundant to #11509) |
Describe your feature request
The use case is a cluster that is shared by multiple users, on which we are trying to use Ray. Because the cluster is shared, to run a program we first request resources (for example, qsub -l select=4:ncpus=2 ...) and then start Ray on the resources we have been given. When the program is done, we need to stop Ray.
The issue is this: say we request two sets of resources and get the following nodes:
Set1: node1 node2
Set2: node2 node3
Then, when we are done running the program for Set1, we run ray stop on node1 and node2, and that also kills the Ray instance running on node2 for Set2. That is a problem.
One way to solve this would be for ray stop to have some way of specifying which Ray instance to stop, such as by password, by head node and port, or by some other unique identifier.
This was first discussed at:
https://github.com/ray-project/ray/discussions/12103
Note that the Slurm documentation for Ray ( https://docs.ray.io/en/master/cluster/slurm.html ) says: "Clusters managed by Slurm may require that Ray is initialized as a part of the submitted job". That is basically where we run into the problem: if each submitted job has to start Ray itself and two jobs end up sharing a node, how do you stop only the proper instance?
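To make the Set1/Set2 scenario concrete, here is a minimal Python sketch of the selection logic being asked for. It is only an illustration, not Ray's implementation or API: the RayProcess record, the select_for_stop helper, the PIDs, and the "host:port" cluster identifiers are all hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RayProcess:
    """A Ray process on one node, tagged with the cluster it belongs to."""
    node: str
    pid: int
    cluster_id: str  # hypothetical tag, e.g. the head node's "host:port"


def select_for_stop(processes, node, cluster_id=None):
    """Return the processes that a stop command run on `node` would target.

    With cluster_id=None, every Ray process on the node is selected, which
    is the behaviour described in this issue. With a cluster_id, only the
    processes tagged with that identifier are selected.
    """
    return [
        p for p in processes
        if p.node == node and (cluster_id is None or p.cluster_id == cluster_id)
    ]


# Set1 runs on node1 and node2; Set2 runs on node2 and node3.
# PIDs and ports are invented for the example.
processes = [
    RayProcess("node1", 101, "node1:6379"),  # Set1
    RayProcess("node2", 201, "node1:6379"),  # Set1
    RayProcess("node2", 202, "node2:6380"),  # Set2
    RayProcess("node3", 301, "node2:6380"),  # Set2
]

unscoped = select_for_stop(processes, "node2")
scoped = select_for_stop(processes, "node2", cluster_id="node1:6379")

print([p.pid for p in unscoped])  # [201, 202]
print([p.pid for p in scoped])    # [201]
```

The unscoped call selects both processes on node2, so stopping Set1 would also take down Set2. The scoped call selects only the Set1 process, which is the behaviour this request is after.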