Ray cluster CRD and example CR + multi-ray-cluster operator #12098
Conversation
Force-pushed d82a776 to 696a936
Force-pushed 6827805 to 14c8749
Force-pushed 14c8749 to 1aac505
Currently debugging the following error, which occurs after starting a cluster, shutting it down, and trying to start a new cluster by applying, deleting, and re-applying a RayCluster custom resource. The error takes place immediately after the monitor initializes a StandardAutoscaler.
edit: Takes place in
edit: This probably happens because we can only support one
edit: Running
Force-pushed d872b14 to 6885a86
Running the monitor in a subprocess fixed the previous issue -- the code is now functional!
self.subprocess.terminate()
self.subprocess.join()
# Reinstantiate process with f as target and start.
self.subprocess = mp.Process(name=self.name, target=f)
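For context, here is a minimal self-contained sketch of the terminate-and-restart pattern this hunk belongs to; MonitorRunner and run_monitor below are illustrative names, not the actual operator code:

```python
import multiprocessing as mp

def run_monitor():
    # Placeholder for the per-cluster monitor/autoscaler loop.
    while True:
        pass

class MonitorRunner:
    def __init__(self, name):
        self.name = name
        self.subprocess = mp.Process(name=self.name, target=run_monitor)
        self.subprocess.start()

    def restart(self):
        # Stop the running monitor process and wait for it to exit.
        self.subprocess.terminate()
        self.subprocess.join()
        # Reinstantiate the process with the same target and start it again.
        self.subprocess = mp.Process(name=self.name, target=run_monitor)
        self.subprocess.start()
```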
Is it possible for the subprocesses to be leaked if operator.py is killed unexpectedly?
I think currently if operator.py is killed unexpectedly, the operator pod will shut down. Which reminds me -- I was going to make all of the Ray clusters managed by the operator fate-share with the operator, so then everything would go down. That's of course not optimal behavior.
Let me also experiment to see what happens if operator.py is killed unexpectedly but the pod doesn't go down.
I don't think you want to kill all the clusters when the operator goes down. It just needs to know about them when it starts.
Ah, right. I think that should work with the code as it is. (When the operator restarts, it should call create_or_update on each cluster.)
Will test.
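As a rough sketch of what that could look like on operator startup -- the CRD group/version/namespace/plural values and the create_or_update_cluster helper below are assumptions for illustration, not necessarily what this PR does:

```python
from kubernetes import client, config

def create_or_update_cluster(cluster_cr):
    # Hypothetical helper: translate the CR spec into an autoscaler config
    # and create or update the corresponding Ray cluster (stubbed out here).
    pass

def reconcile_existing_clusters():
    # On operator (re)start, list the RayCluster CRs that already exist
    # and run create_or_update on each of them.
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    # group/version/namespace/plural are assumptions for illustration.
    clusters = api.list_namespaced_custom_object(
        group="cluster.ray.io", version="v1", namespace="ray", plural="rayclusters")
    for cluster_cr in clusters["items"]:
        create_or_update_cluster(cluster_cr)
```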
I think Python does an OK job of cleaning up processes spawned by the multiprocessing module?
To check, I ran a test script that uses multiprocessing to spawn a dummy process that runs forever. After a Ctrl-C keyboard interrupt, the pid of the process is no longer present in the output of ps -ef.
I'll check that the monitor processes behave in the same way.
Let me know if there's something that can be done to ensure that the processes are cleaned up correctly (besides rewriting everything to have each cluster's autoscaler .update in a for loop, which is probably a good idea to implement sooner or later -- or sooner -- to replace this subprocess logic).
Your test sounds good enough.
Actually, you were right -- terminating operator.py did leak the monitor process. (In my test script, the child process was receiving the keyboard interrupt.)
I've now set the monitor subprocess to be a daemon, and that works -- when I run kill -SIGTERM <operator_pid>, it stops the monitor subprocess too. (If you do kill -SIGKILL <operator_pid>, then the daemon monitor subprocess still leaks.)
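For reference, the daemon change amounts to something like this (a minimal sketch with placeholder names; run_monitor stands in for the actual monitor loop):

```python
import multiprocessing as mp

def run_monitor():
    # Placeholder for the monitor loop.
    while True:
        pass

if __name__ == "__main__":
    monitor = mp.Process(name="ray-monitor", target=run_monitor)
    monitor.daemon = True  # daemonic children are terminated when the parent interpreter exits
    monitor.start()
```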
Okay, then this is still a problem we should fix in the near future.
Force-pushed 8909a0d to 5fe741e
Force-pushed 5fe741e to 8b60d3d
Force-pushed b2ac1f7 to 87c99aa
…ption if CRD missing in operator.
…delete -f is sufficient)
Force-pushed 87c99aa to a802307
Why are these changes needed?
Following up on #11929, this PR adds a Kubernetes CRD describing a Ray cluster configuration and an example Ray cluster CR. A Ray cluster CR is essentially a reformatted version of one of the current Ray cluster configs.
This PR also extends the operator so that it can manage multiple Ray clusters.
Using kubectl, users can create/update/delete clusters and check monitoring logs.
Related issue number
#11929 #11545
Checks
I've run scripts/format.sh to lint the changes in this PR.
Create, update, delete, and logging are working as expected for me locally.