
Conversation

@ntlm1686 (Contributor) commented Aug 2, 2022

Two features for Elastic Distributed Training are added to jobs launched by TorchX on a Ray cluster in this PR:

  1. Fault Tolerance - a node failure raises a RayActorError, which can be caught. Placement groups have built-in fault tolerance and can recover from node failures automatically.
  2. Elasticity - the execution of a placement group is a pending task that the GCS schedules when resources become available.

The logic of the new ray_driver.py is shown in the diagram below:

Drawing 2022-08-02 15 51 39 excalidraw
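Roughly, the loop in ray_driver.py behaves like the sketch below. The result marker classes (ActorScheduled, TaskCompleted) and the reschedule callback are illustrative stand-ins, not the exact names in the implementation:

from dataclasses import dataclass
from typing import Callable, List

import ray
from ray.exceptions import RayActorError


@dataclass
class ActorScheduled:  # stand-in: a command actor got its placement group slot
    actor_id: str


@dataclass
class TaskCompleted:  # stand-in: a command actor finished the user script
    actor_id: str


def drive(active_tasks: List[ray.ObjectRef], reschedule: Callable[[RayActorError], None]) -> None:
    """Wait on scheduling/execution tasks and react to what they return."""
    command_actors_count = 0
    while len(active_tasks) > 0:
        completed, active_tasks = ray.wait(active_tasks)
        for object_ref in completed:
            try:
                result = ray.get(object_ref)
                if isinstance(result, ActorScheduled):
                    command_actors_count += 1
                    # start exec_module on the placed actor and keep waiting on it
                elif isinstance(result, TaskCompleted):
                    command_actors_count -= 1
                    if command_actors_count == 0:
                        return  # every command actor has finished
            except RayActorError as err:
                # Node failure: the placement group recovers its resources, so
                # only the failed command actor itself needs to be rescheduled.
                command_actors_count -= 1
                reschedule(err)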

Test plan:

[Note]: This PR is related to the previous PR #559. All future changes will be submitted to this PR.

facebook-github-bot added the CLA Signed label Aug 2, 2022
@d4l3k (Member) left a comment


getting into a nicer shape :) I like how you used different return values as state machine inputs


name: str
image: str
nnodes_rep: Optional[str] = None
Member

nit: min_replicas: Optional[int] = None

self.reschedule_actor(failed_actor_id)


def parse_nnodes_rep(actors: List[RayActor]) -> Tuple[int, int]:
Member

nit: would be nice to get rid of this parsing by passing in min_replicas instead of rep

return min_nnodes, max_nnodes


def parse_actor_id_from_error(err: RayActorError) -> str:
Member

not a big fan of this but not sure if there's any better approach

@ntlm1686 (Contributor, Author) commented Aug 2, 2022

It seems actor_id is the only useful thing we can get from the exception without changing Ray's code; otherwise there are only error messages. https://github.com/ray-project/ray/blob/4c5c5763efff50bf9f76ae73a8c0073183dbb0cc/python/ray/exceptions.py#L233

except RayActorError as err:
# reschedule the failed command actor (node failure)
command_actors_count -= 1 # remove the failed actor
failed_actor_id: str = parse_actor_id_from_error(err)
Member

instead of parsing, can we use a map from the ObjectRef to the actor info? Seems like the ObjectRefs returned from wait should be equal to the ones passed in

https://github.com/ray-project/ray/blob/43aa2299e6623c8f8c7c4a1b80133459d0aa68b0/python/ray/includes/object_ref.pxi#L38
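A small sketch of that suggestion (the helper names and the ActorInfo stand-in are illustrative): keep a dict keyed by the ObjectRef handed to ray.wait, so a failed ref maps straight back to its actor without parsing the error message.

from typing import Any, Dict, List

import ray
from ray.exceptions import RayActorError

# ObjectRefs are hashable, and ray.wait returns the same refs it was given,
# so they can serve directly as dictionary keys.
actor_info_of_ref: Dict[ray.ObjectRef, Any] = {}  # Any stands in for an ActorInfo record
active_tasks: List[ray.ObjectRef] = []


def track(ref: ray.ObjectRef, info: Any) -> None:
    """Record which actor a launched task belongs to."""
    actor_info_of_ref[ref] = info
    active_tasks.append(ref)


def handle_completed(completed: List[ray.ObjectRef]) -> None:
    for object_ref in completed:
        try:
            ray.get(object_ref)
        except RayActorError:
            failed_info = actor_info_of_ref.pop(object_ref)  # no message parsing
            # ... reschedule the actor described by failed_info ...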

Contributor (Author)

I checked, but I don't think there is any actor-related information there. https://github.com/ray-project/ray/blob/4c5c5763efff50bf9f76ae73a8c0073183dbb0cc/python/ray/includes/object_ref.pxi#L38

But we can create issues for this one and for the exception one.

break # exit
else:
raise RuntimeError(
"Ray actor returns unkown type. This is most likely bug in torchx"
Member

nit spelling


def init_placement_groups(self) -> None:
"""Initialize all placement groups needed for this job"""
replica_ix_of_pg: List[int] = [0] + list(
Member

it might be better to explicitly make the calls here. + we need to wait for the first placement group before creating the remainder, otherwise the smaller groups might get scheduled first (unless ray does fifo queuing, which if it does we should document that)

initial = create_placement_group_async(self.replicas[0: self.min_nodes])
initial.wait(timeout_seconds=...)
self.groups = []
for i in range(self.min_nodes, self.max_nodes):
    self.groups.append(create_placement_group_async(self.replicas[i:i+1]))

Contributor (Author)

Ray doesn't do FIFO, will change.

if (
command_actors_count == 0
): # all the command actors have finished
break # exit
Member

this only breaks the inner loop -- is there a case where active_tasks >0 but command_actors_count == 0?

Contributor (Author)

Yes, active_tasks may contain some actors which haven't been scheduled successfully in a placement group, even after the job has finished. I guess I have to use return here.

return self.actor_info_of_id.pop(actor_id)

def reschedule_actor(self, actor_id: str) -> None:
"""Rescheule a failed actor"""
Member

we should check the role max_retries/retry_policy here and throw an error if any of the workers exits more than N times

Contributor (Author)

Actually, if we add a max_retries here, it would mean how many node failures we can tolerate, which is different from the max_retries in the Ray actor's context.

@ntlm1686 (Contributor, Author) commented Aug 2, 2022

As long as the number of working nodes is greater than min_nnodes, we should allow unlimited node failures, unless nodes are failing too frequently. That could be a retry_policy.
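A minimal sketch of how such a budget could sit inside reschedule_actor; node_failures and max_node_failures are hypothetical, and the exact policy (flat budget vs. failure-rate based, as discussed above) is still open:

def reschedule_actor(self, actor_id: str) -> None:
    """Reschedule a failed actor, giving up if nodes fail too many times."""
    self.node_failures += 1                          # hypothetical counter
    if self.node_failures > self.max_node_failures:  # hypothetical budget from a retry_policy
        raise RuntimeError(
            f"giving up after {self.node_failures} node failures"
        )
    actor_info = self.actor_info_of_id.pop(actor_id)
    # re-create the placement group / command actor for actor_info and add the
    # new scheduling task back to the driver's active tasks ...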

with self.assertRaisesRegex(
Exception, "RAY_ADDRESS env variable is expected"
):
self._scheduler.list()
Member

would be good to add some "state machine" mock tests that run through that driver loop step by step

need_more_actors: bool = True # if need more actors
command_actors_count: int = 0 # number of created command actors
# Await return result of remote ray function and initialize new command actors
while len(self.active_tasks) > 0:
Member

can we split the loop out of run so it has a "_step()" and a "run()" method, so we can manually step through the behavior from a mocked unit test?
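A rough sketch of what that split could look like; the class and method names here are illustrative, not the actual ray_driver.py API:

import ray
from ray.exceptions import RayActorError


class RayDriver:  # illustrative skeleton, not the real driver class
    def __init__(self, active_tasks):
        self.active_tasks = active_tasks
        self.command_actors_count = 0

    def _step(self) -> bool:
        """Process one ray.wait round; return False once the job should stop."""
        if not self.active_tasks:
            return False
        completed, self.active_tasks = ray.wait(self.active_tasks)
        for object_ref in completed:
            try:
                result = ray.get(object_ref)
                # ... interpret result: schedule more actors, or count a
                # finished command actor and stop when all of them are done ...
            except RayActorError:
                # node failure: reschedule the failed command actor
                self._reschedule(object_ref)
        return True

    def run(self) -> None:
        while self._step():
            pass

    def _reschedule(self, object_ref) -> None:
        ...  # placeholder for the rescheduling logic

A mock test could then patch ray.wait / ray.get and call _step() repeatedly, asserting the driver's state after each step.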

@atinsood commented Aug 3, 2022

@ljjsalt @d4l3k

I have been thinking more about this and I am wondering if ray.util.queue is a better way of implementing this.

you basically create 2 actors, PlacementGroupManager actor and CommandManager actor and exchange information between them using ray queue.

PGManager group actor is responsible for creating the placement group and then putting a message in the queue for the CommandManager actor to process which can then create the command actor. this helps us keep the knowledge of pg creation and command actor creation compartmentalized in these actors.

additionally, we can @Remote these fns so both of these can be run in parallel

https://docs.ray.io/en/releases-1.2.0/advanced.html#message-passing-using-ray-queue

@ntlm1686 (Contributor, Author) commented Aug 3, 2022

> @ljjsalt @d4l3k
>
> I have been thinking more about this and I am wondering if ray.util.queue is a better way of implementing this.
>
> you basically create 2 actors, PlacementGroupManager actor and CommandManager actor and exchange information between them using ray queue.
>
> PGManager group actor is responsible for creating the placement group and then putting a message in the queue for the CommandManager actor to process which can then create the command actor. this helps us keep the knowledge of pg creation and command actor creation compartmentalized in these actors.
>
> additionally, we can @Remote these fns so both of these can be run in parallel
>
> https://docs.ray.io/en/releases-1.2.0/advanced.html#message-passing-using-ray-queue

@atinsood

The reason we should use placement groups: if we only use ray actors here, we can rerun a failed actor after it throws a RayActorError, but the job could lose the compute resources it used to have and then fail even after the node recovers.

For example: there are 3 nodes, each with 1 CPU. Job 1 requires a minimum of 3 nodes and is running on all 3; job 2 is launched later, requires 1 CPU, and becomes a pending task. Once a node failure happens, we have to rerun job 1's actor, which becomes a pending task behind job 2; job 2 takes that node, and job 1 fails since there aren't enough nodes to restart it.

Since placement group rescheduling has the highest priority, this situation won't happen.

I didn't get why we would use a queue, since we don't use any inter-process communication here.

I understand your concern about using ray.wait here, so I wrapped the return values of the command (remote) actors in a RayResult class. If anything unexpected happens, it throws an error and we will know right away. But based on how ray.wait is supposed to work, I think it's right to use it here.

Besides, each command actor can be considered a finite state machine with four states (SCHEDULING, FAILING, RUNNING, TASK_COMPLETED).
Each command actor starts in the SCHEDULING state, and actor.schedule.remote() acts as an asynchronous step function. From SCHEDULING the next state is RUNNING; from RUNNING the next state is TASK_COMPLETED or FAILING; from FAILING the next state is SCHEDULING. A sketch of this state machine is below.
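A minimal sketch of that state machine; the enum and the transition table are illustrative, not the actual classes in this PR:

from enum import Enum, auto


class CommandActorState(Enum):
    SCHEDULING = auto()      # waiting for a placement group slot
    RUNNING = auto()         # exec_module is executing the script
    FAILING = auto()         # the node hosting the actor failed
    TASK_COMPLETED = auto()  # the script finished successfully


# Allowed transitions, driven by the results (or errors) returned by ray.wait:
TRANSITIONS = {
    CommandActorState.SCHEDULING: {CommandActorState.RUNNING},
    CommandActorState.RUNNING: {
        CommandActorState.TASK_COMPLETED,
        CommandActorState.FAILING,
    },
    CommandActorState.FAILING: {CommandActorState.SCHEDULING},
    CommandActorState.TASK_COMPLETED: set(),
}


def advance(state: CommandActorState, nxt: CommandActorState) -> CommandActorState:
    """Validate and apply a transition observed by the driver loop."""
    if nxt not in TRANSITIONS[state]:
        raise RuntimeError(f"invalid transition {state} -> {nxt}")
    return nxt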

@atinsood commented Aug 3, 2022

> The reason we should use placement groups

yeah, not disagreeing on that. I am just thinking about how we manage the interaction between pg creation and command actor creation

import ray
from ray.util.queue import Queue


@ray.remote
class PlacementGroupManager(object):
    def __init__(self, min_nodes, max_nodes, queue):
        self.queue = queue

    def run(self):
        # step 1. create a PG with min replicas

        # put the initial pg in the queue
        self.queue.put(ready)

        # step 2. while loop to keep creating the rest of the PGs incrementally

        # go through the rest of the pgs one by one and keep adding them to the queue
        # we can deal with the logic of pg failure here or when to stop pg creation if needed


@ray.remote
class CommandActorManager(object):
    def __init__(self, queue, active_workers):
        self.queue = queue
        self.active_workers = []

    def run(self):
        # some logic on when to stop the while loop, either a poison pill or a signal actor
        while True:
            self.queue.get()  # get notified that the pg was created
            # logic to create the command actor goes here

            self.active_workers.append(command_actor.exec_module.remote())  # or maybe use another queue, pretty sure there is a better way to deal with this

            # once you are done reading all the PGs from the queue or if the queue


def main() -> None:  # pragma: no cover
    actors: List[RayActor] = load_actor_json("actors.json")
    # pyre-fixme[16]: Module `worker` has no attribute `init`.
    ray.init(address="auto", namespace="torchx-ray")
    q = Queue()  # we can set the size of the queue upfront
    pg_manager = PlacementGroupManager.remote(2, 4, q)  # min, max and queue
    pg_manager.run.remote()
    cmd_actor_manager = CommandActorManager.remote(q, active_workers)
    cmd_actor_manager.run.remote()
    # Await return result of remote ray function
    while len(active_workers) > 0:
        _logger.info(f"running ray.wait on {active_workers}")

        # pyre-fixme[16]: Module `worker` has no attribute `wait`.
        completed_workers, active_workers = ray.wait(active_workers)
        # If a failure occurs the ObjectRef will be marked as completed.
        # Calling ray.get will expose the failure as a RayActorError.
        for object_ref in completed_workers:
            ray.get(object_ref)


I was thinking of something like this. This is not the correct code, but more of a thought process on how the logic that creates placement groups and the logic that creates command actors could interact while staying compartmentalized.

@ntlm1686 (Contributor, Author) commented Aug 3, 2022


> # step 2. while loop to keep creating the rest of the PGs incrementally
>
> # go through the rest of the pgs one by one and keep adding them to the queue
> # we can deal with the logic of pg failure here or when to stop pg creation if needed

Actually, this step is not necessary. First, increasing the PGs one by one can be a problem once the number of nodes is large.
Screen Shot 2022-08-03 at 12 31 14 AM
Please read the description of the X axis: creating the PGs one by one only makes the time until a PG creation event is added to the GCS pending queue longer. A sketch of creating them all up front is below.
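A rough sketch of creating every placement group up front and letting the GCS schedule them as resources arrive. The {"CPU": ...} bundle shape and the helper signature are assumptions for illustration; the real driver builds bundles from each actor's resource requirements:

from typing import List

from ray.util.placement_group import PlacementGroup, placement_group


def init_placement_groups(
    min_nnodes: int, max_nnodes: int, cpus_per_node: int = 1
) -> List[PlacementGroup]:
    # One group holding the minimum set of replicas, so the job can only start
    # once min_nnodes worth of resources exist ...
    groups = [placement_group([{"CPU": cpus_per_node}] * min_nnodes)]
    # ... plus one single-replica group per extra node, each of which the GCS
    # schedules independently whenever capacity shows up (elasticity).
    for _ in range(max_nnodes - min_nnodes):
        groups.append(placement_group([{"CPU": cpus_per_node}]))
    # No blocking wait here: each pg.ready() returns an ObjectRef that the
    # driver loop can ray.wait() on alongside the command-actor tasks.
    return groups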

The second change: instead of creating a command actor only after its placement group has been scheduled, we create all the command actors at the beginning as well (in the SCHEDULING state).

codecov bot commented Aug 3, 2022

Codecov Report

Merging #572 (3228b9c) into main (b051e3f) will increase coverage by 0.15%.
The diff coverage is 98.41%.

@@            Coverage Diff             @@
##             main     #572      +/-   ##
==========================================
+ Coverage   94.85%   95.00%   +0.15%     
==========================================
  Files          66       66              
  Lines        4042     4144     +102     
==========================================
+ Hits         3834     3937     +103     
+ Misses        208      207       -1     
Impacted Files Coverage Δ
torchx/schedulers/ray_scheduler.py 95.26% <ø> (ø)
torchx/schedulers/ray/ray_driver.py 97.10% <98.07%> (+1.26%) ⬆️
torchx/components/dist.py 96.42% <100.00%> (+7.06%) ⬆️
torchx/schedulers/ray/ray_common.py 100.00% <100.00%> (ø)
torchx/specs/api.py 98.40% <100.00%> (+<0.01%) ⬆️
torchx/runner/api.py 96.85% <0.00%> (+0.01%) ⬆️


@d4l3k (Member) commented Aug 3, 2022

If we did want to separate the concerns here with scheduling vs retries we could use the builtin task max_retries config though that has some other implications.

I'm not sure that using a queue with two actors would simplify this logic -- state machines that you can step through are nice from a testing perspective

@ntlm1686 (Contributor, Author) commented Aug 4, 2022

> If we did want to separate the concerns here with scheduling vs retries we could use the builtin task max_retries config though that has some other implications.
>
> I'm not sure that using a queue with two actors would simplify this logic -- state machines that you can step through are nice from a testing perspective

But we are just running the schedule function (which cannot go wrong) and the exec_module function (which cannot go wrong unless the script fails), so is it necessary to use max_retries here? Of course, the feature can easily be added.

@ntlm1686 requested a review from d4l3k August 4, 2022 15:16
@ntlm1686 changed the title from "[Ray] Add Elasticity and Fault tolerance features to jobs launched on ray cluster" to "[Ray] Add elasticity and fault tolerance features to jobs launched on ray cluster" Aug 4, 2022
facebook-github-bot pushed a commit that referenced this pull request Aug 23, 2022
Summary:
Elasticity - the execution of a placement group is a pending task that the GCS schedules when resources become available.

Related PR: #572

Pull Request resolved: #580

Test Plan: Mock cluster scaling with `ray.cluster_utils`.

Reviewed By: priyaramani

Differential Revision: D38838786

Pulled By: d4l3k

fbshipit-source-id: b27073fd6ad4822c121e07de729b839f6cf6291a
@d4l3k closed this Aug 23, 2022