[Ray] Add elasticity and fault tolerance features to jobs launched on ray cluster #572
Conversation
Include port number in parsed cluster address
… attribute nnodes_rep to AppDef to pass elastic representation of number of nodes
d4l3k left a comment
getting into a nicer shape :) I like how you used different return values as state machine inputs
torchx/specs/api.py
Outdated
name: str
image: str
nnodes_rep: Optional[str] = None
nit: min_replicas: Optional[int] = None
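For illustration, a minimal sketch of what the suggested field could look like on the dataclass quoted above; only the fields visible in the diff are shown here, and the real class in torchx/specs/api.py has more:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Role:
    name: str
    image: str
    # suggested: carry the elastic minimum directly instead of a string representation
    min_replicas: Optional[int] = None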
torchx/schedulers/ray/ray_driver.py
Outdated
self.reschedule_actor(failed_actor_id)

def parse_nnodes_rep(actors: List[RayActor]) -> Tuple[int, int]:
nit: would be nice to get rid of this parsing by passing in min_replicas instead of rep
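For context, a hedged sketch of the kind of parsing the nit wants to avoid, assuming the rep is a "min:max" (or plain "N") string; the PR's version reads the rep off the actor list, but the string handling is the same idea, and passing min_replicas as an int would remove it entirely:

from typing import Tuple

def parse_rep(nnodes_rep: str) -> Tuple[int, int]:
    # "2:4" -> (2, 4); "3" -> (3, 3)
    if ":" in nnodes_rep:
        min_s, max_s = nnodes_rep.split(":", 1)
        return int(min_s), int(max_s)
    return int(nnodes_rep), int(nnodes_rep)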
return min_nnodes, max_nnodes

def parse_actor_id_from_error(err: RayActorError) -> str:
not a big fan of this but not sure if there's any better approach
It seems actor_id is the only useful thing we can get from the exception without changing Ray's code; everything else is just error messages. https://github.com/ray-project/ray/blob/4c5c5763efff50bf9f76ae73a8c0073183dbb0cc/python/ray/exceptions.py#L233
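For illustration, a sketch of what that parsing might look like; the regex is an assumption, since the exact RayActorError message format depends on the Ray version:

import re
from ray.exceptions import RayActorError

def parse_actor_id_from_error(err: RayActorError) -> str:
    # pull the first long hex token out of the error string; the message
    # format (and therefore this pattern) is an assumption, not Ray's contract
    match = re.search(r"([0-9a-f]{16,})", str(err))
    if match is None:
        raise ValueError(f"no actor id found in RayActorError: {err}")
    return match.group(1)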
torchx/schedulers/ray/ray_driver.py
Outdated
except RayActorError as err:
    # reschedule the failed command actor (node failure)
    command_actors_count -= 1  # remove the failed actor
    failed_actor_id: str = parse_actor_id_from_error(err)
instead of parsing, can we use a map from the ObjectRef to the actor info? Seems like the ObjectRefs returned from wait should be equal to the ones passed in
I checked, but I don't think the ObjectRef itself carries any actor-related information. https://github.com/ray-project/ray/blob/4c5c5763efff50bf9f76ae73a8c0073183dbb0cc/python/ray/includes/object_ref.pxi#L38
But we could open issues upstream about this one and the exception one.
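For reference, a minimal sketch of the ObjectRef-to-info bookkeeping suggested above: ObjectRefs hash and compare by their ids, and ray.wait returns the same refs that were passed in, so a plain dict is enough (the info payload is a placeholder):

import ray
from ray.exceptions import RayActorError

def wait_one(actor_info_of_ref: dict):
    """actor_info_of_ref maps ObjectRef -> whatever is needed to reschedule."""
    ready, _pending = ray.wait(list(actor_info_of_ref.keys()), num_returns=1)
    ref = ready[0]
    info = actor_info_of_ref.pop(ref)   # no string parsing: the ref identifies the actor
    try:
        result = ray.get(ref)           # raises RayActorError if the actor died
        return "finished", info, result
    except RayActorError:
        return "failed", info, None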
torchx/schedulers/ray/ray_driver.py
Outdated
    break  # exit
else:
    raise RuntimeError(
        "Ray actor returns unkown type. This is most likely bug in torchx"
nit spelling
def init_placement_groups(self) -> None:
    """Initialize all placement groups needed for this job"""
    replica_ix_of_pg: List[int] = [0] + list(
it might be better to explicitly make the calls here. Also, we need to wait for the first placement group before creating the remainder, otherwise the smaller groups might get scheduled first (unless Ray does FIFO queuing, which, if it does, we should document).
initial = create_placement_group_async(self.replicas[0 : self.min_nodes])
pg.wait(timeout_seconds=...)
self.groups = []
for i in range(self.min_nodes, self.max_nodes):
    self.groups.append(create_placement_group_async(self.replicas[i : i + 1]))
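For reference, roughly the same "schedule the must-have group first" idea written against Ray's public placement group API; bundle shapes, strategy, and timeout are illustrative only:

import ray
from ray.util.placement_group import placement_group

ray.init()
min_nnodes, max_nnodes = 2, 4  # example values
initial_pg = placement_group([{"CPU": 1}] * min_nnodes, strategy="SPREAD")
if not initial_pg.wait(timeout_seconds=300):
    raise RuntimeError("could not schedule the minimum placement group")

# only after the minimum is secured, request one-node groups for elasticity
elastic_pgs = [
    placement_group([{"CPU": 1}], strategy="SPREAD")
    for _ in range(max_nnodes - min_nnodes)
]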
Ray doesn't do FIFO, will change.
torchx/schedulers/ray/ray_driver.py
Outdated
if (
    command_actors_count == 0
):  # all the command actors have finished
    break  # exit
this only breaks the inner loop -- is there a case where active_tasks >0 but command_actors_count == 0?
Yes, active_tasks may contain some actors that haven't been scheduled successfully in a placement group, even after the job has finished. I guess I have to use return here.
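A toy illustration of the break-vs-return point, with names simplified from the driver:

import ray

def drive(active_tasks, command_actors_count):
    while len(active_tasks) > 0:                      # outer loop: anything still pending?
        ready, active_tasks = ray.wait(active_tasks, num_returns=1)
        for _ref in ready:                            # inner loop over completed refs
            command_actors_count -= 1
            if command_actors_count == 0:
                # `break` only leaves the inner `for`; pending placement-group
                # tasks keep the outer `while` alive, so return from the method
                return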
return self.actor_info_of_id.pop(actor_id)

def reschedule_actor(self, actor_id: str) -> None:
    """Rescheule a failed actor"""
we should check the role max_retries/retry_policy here and throw an error if any of the workers exits more than N times
Actually, a max_retries here would mean how many node failures we can tolerate, which is different from max_retries in the Ray actor's context.
As long as the number of working nodes stays above min_nnodes, we should allow unlimited node failures, unless nodes are failing too frequently. That could be a retry_policy.
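A sketch of the policy described here, using hypothetical names and thresholds rather than fields that exist in the PR:

from dataclasses import dataclass

@dataclass
class NodeFailurePolicy:
    min_nnodes: int
    max_node_failures: int   # hypothetical cap for "failing too frequently"
    failures: int = 0

    def should_continue(self, running_actors: int) -> bool:
        """Called after a node failure; True means keep rescheduling."""
        self.failures += 1
        if running_actors >= self.min_nnodes:
            return True      # at or above the elastic minimum: tolerate the failure
        return self.failures <= self.max_node_failures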
with self.assertRaisesRegex(
    Exception, "RAY_ADDRESS env variable is expected"
):
    self._scheduler.list()
would be good to add some "state machine" mock tests that run through that driver loop step by step
torchx/schedulers/ray/ray_driver.py
Outdated
need_more_actors: bool = True  # if need more actors
command_actors_count: int = 0  # number of created command actors
# Await return result of remote ray function and initialize new command actors
while len(self.active_tasks) > 0:
can we split the loop out from run so it has a "_step()" and a "run()" method, so we can manually step through the behavior from a mocked unit test?
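A sketch of the split being asked for, so a mocked test can drive one event at a time; the body of _step is elided:

import ray

class RayDriver:
    def __init__(self, active_tasks) -> None:
        self.active_tasks = active_tasks

    def _step(self) -> bool:
        """Handle one completed task; return False once the job is done."""
        ready, self.active_tasks = ray.wait(self.active_tasks, num_returns=1)
        for ref in ready:
            ...  # dispatch on placement group scheduled / command actor finished / failure
        return len(self.active_tasks) > 0

    def run(self) -> None:
        while self._step():
            pass

A unit test can then patch ray.wait, call _step() once per simulated event, and assert on the driver's state between calls.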
@ljjsalt @d4l3k I have been thinking more about this, and I am wondering if ray.util.queue is a better way of implementing it. You basically create 2 actors, a PlacementGroupManager actor and a CommandManager actor, and exchange information between them using a Ray queue. The PGManager actor is responsible for creating the placement group and then putting a message in the queue for the CommandManager actor to process, which can then create the command actor. This helps us keep the knowledge of pg creation and command actor creation compartmentalized in these actors. Additionally, we can @remote these fns so both of them can run in parallel: https://docs.ray.io/en/releases-1.2.0/advanced.html#message-passing-using-ray-queue
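A minimal sketch of the proposed layout, assuming the managers exchange placement-group indices over ray.util.queue.Queue; actor and method names are illustrative, not an existing TorchX API:

import ray
from ray.util.queue import Queue

@ray.remote
class PlacementGroupManager:
    def run(self, queue: Queue, num_groups: int) -> None:
        for i in range(num_groups):
            # ... create placement group i and wait until it is scheduled ...
            queue.put(i)        # hand the scheduled group over to the other actor
        queue.put(None)         # sentinel: no more groups are coming

@ray.remote
class CommandManager:
    def run(self, queue: Queue) -> None:
        while True:
            group_ix = queue.get()
            if group_ix is None:
                break
            # ... create a command actor inside placement group group_ix ...

if __name__ == "__main__":
    ray.init()
    q = Queue()
    ray.get([
        PlacementGroupManager.remote().run.remote(q, 4),
        CommandManager.remote().run.remote(q),
    ])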
The reason we should use placement groups is: if we only use Ray actors, then even though we can rerun a failed actor after it throws a RayActorError, the job could lose the compute resources it used to have and then fail even after the node recovers. For example, suppose there are 3 nodes, each with 1 CPU. Job 1 requires a minimum of 3 nodes and is running on all 3; job 2 is launched later, requires 1 CPU, and becomes a pending task. Once a node failure happens, we have to rerun the failed actor, which becomes a pending task queued behind job 2; job 2 takes the recovered node, and job 1 fails because there aren't enough nodes to restart it. Since placement group rescheduling has the highest priority, this situation won't happen. I didn't get why we should use a queue, since we don't use any inter-process communication here. I understand your concern about using … Besides, each command actor is a finite state machine with four states (SCHEDULING, FAILING, RUNNING, TASK_COMPLETED).
yeah, not disagreeing on that. I am just thinking about how we manage the interaction between pg creation and command actor creation. I was thinking of something like the following; this is not the correct code, but more of a thought process on how the logic that creates placement groups and the logic that creates command actors could interact and stay compartmentalized:
# step 2. while loop to keep creating the rest of the PGs incrementally,
# go through the rest of the pgs one by one and keep adding them to the queue
# we can deal with the logic of pg failure here, or when to stop pg creation if needed

Actually, this step is not necessary. First, increasing PGs one by one can be a problem once the number of nodes is large. The second change: instead of creating a command actor after its placement group has been scheduled, we create all the command actors at the beginning too (with …
Codecov Report
@@            Coverage Diff             @@
##             main     #572      +/-   ##
==========================================
+ Coverage   94.85%   95.00%   +0.15%
==========================================
  Files          66       66
  Lines        4042     4144     +102
==========================================
+ Hits         3834     3937     +103
+ Misses        208      207       -1
If we did want to separate the concerns here (scheduling vs. retries), we could use the built-in task max_retries config, though that has some other implications. I'm not sure that using a queue with two actors would simplify this logic -- state machines that you can step through are nice from a testing perspective.
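For reference, the built-in knobs being referred to look roughly like this; the values are examples, tasks use max_retries and actors use max_restarts:

import ray

@ray.remote(max_retries=3)      # this task is re-submitted up to 3 times on failure
def run_command(cmd: str) -> int:
    # ... run the command ...
    return 0

@ray.remote(max_restarts=3)     # this actor is restarted up to 3 times if it dies
class CommandActor:
    def exec_command(self, cmd: str) -> int:
        # ... run the command ...
        return 0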
But we are just running schedule (which cannot go wrong) and …
Summary: Elasticity - the execution of placement groups are pending tasks that will be scheduled by GCS when resources become available. Related PR: #572
Pull Request resolved: #580
Test Plan: Mock cluster scaling with `ray.cluster_utils`.
Reviewed By: priyaramani
Differential Revision: D38838786
Pulled By: d4l3k
fbshipit-source-id: b27073fd6ad4822c121e07de729b839f6cf6291a
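For context, "mock cluster scaling with ray.cluster_utils" in a test can look roughly like this; resource counts and the assertions are placeholders:

import ray
from ray.cluster_utils import Cluster

cluster = Cluster(initialize_head=True, head_node_args={"num_cpus": 1})
ray.init(address=cluster.address)

worker = cluster.add_node(num_cpus=1)   # scale up: a new node joins the cluster
# ... assert that the driver schedules an extra replica ...

cluster.remove_node(worker)             # scale down / simulate a node failure
# ... assert that the job keeps running with at least min_nnodes replicas ...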

Two features for elastic distributed training are added in this PR to jobs launched by TorchX on a Ray cluster: elasticity and fault tolerance.

The logic of the new ray_driver.py is shown in the plot below.

Test plan:
[Note]: This PR is related to the previous PR #559. All future changes will be submitted to this PR.