
Elastic failure handling #23

Merged (69 commits, Dec 14, 2020)

Conversation

krfricke (Collaborator) commented Dec 7, 2020

Introduces elastic failure handling. Alive actors are not recreated on error, but stick around and don't have to load data again. We can also choose to not restart failed actors, continuing training with fewer actors.

Depends on #21.
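
For illustration, a minimal usage sketch of what elastic training enables from the caller's side. The elastic-related keyword arguments (`elastic_training`, `max_failed_actors`, `max_actor_restarts`) are assumed names, not necessarily the exact API introduced by this PR:

    # Rough sketch only; the elastic-related keyword arguments are assumed
    # names, not necessarily the exact API introduced by this PR.
    from sklearn.datasets import load_breast_cancer
    from xgboost_ray import RayDMatrix, train

    data, labels = load_breast_cancer(return_X_y=True)
    train_set = RayDMatrix(data, labels)

    bst = train(
        {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
        train_set,
        evals=[(train_set, "train")],
        evals_result={},
        num_actors=4,            # start training with 4 remote actors
        max_actor_restarts=2,    # assumed: number of recovery attempts
        elastic_training=True,   # assumed: keep training with surviving actors
        max_failed_actors=1,     # assumed: tolerate up to 1 permanently lost actor
    )
    bst.save_model("model.xgb")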

krfricke (Collaborator, Author)

This should be ready for review @amogkam

amogkam (Contributor) commented Dec 10, 2020

Thanks, I'll review it today.

amogkam (Contributor) left a comment

This is a really cool feature and looks good to me overall! The main thing is that there are a lot of different concurrency components (multiprocessing, threading, asyncio.Event, rabit_tracker). I know you have some of this in the Google doc already but it would really help to add some comments here about what each component's responsibilities are and how they interact with each other.

krfricke (Collaborator, Author)

Thanks a bunch for the thorough review. I addressed all your comments and tried to answer your questions!

richardliaw (Collaborator) commented Dec 11, 2020 via email

richardliaw (Collaborator) left a comment

This looks pretty good overall! A couple comments

  1. The architecture/design could be better commented in the code. I left some comments at places that required a bit more digging.
  2. It'd be great to document this failure model in train().
  3. We will want to do a bit of refactoring to consolidate failure handling responsibilities in core functions, but I don't think that's a blocker for this PR.

Please ping when you've addressed the remaining comments!

Comment on lines +469 to +470
    for i in range(len(actors)):
        actor = actors[i]
richardliaw (Collaborator):

for actor in actors?

krfricke (Collaborator, Author) replied Dec 13, 2020:

Since we're overwriting the array item here:

        actors[i] = None

we need the index anyway, so I'll probably keep it as is.
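
For reference, a tiny standalone sketch (not the PR's actual code, placeholder objects only) of why the index is kept: the slot of a failed actor is overwritten in place, so even an enumerate-based loop still needs i:

    # Standalone sketch with hypothetical placeholder objects.
    actors = ["actor_0", "actor_1", "actor_2"]

    # Equivalent to `for i in range(len(actors)): actor = actors[i]`,
    # but slightly more idiomatic while still exposing the index:
    for i, actor in enumerate(actors):
        if actor == "actor_1":   # stand-in for "this actor just failed"
            actors[i] = None     # mark the slot as dead, keep the rank

    assert actors == ["actor_0", None, "actor_2"]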

    # Maybe we got a new queue actor
    wait_queue = [
        actor.set_queue.remote(_queue) for actor in _actors
        if actor is not None
richardliaw (Collaborator):

can there actually be actor is None here?

krfricke (Collaborator, Author):

Yes! After an actor fails, its entry in _actors is set to None in the train() function. _train() is then called again, so entries can be None.

richardliaw (Collaborator):

Hmm, don't all the None entries get filled in lines 586 to 603?

krfricke (Collaborator, Author):

Only those that are in _failed_actor_ranks, and this set is empty if we continue with fewer actors. See the previous comment in the thread below.
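
A stripped-down sketch of that situation (no Ray involved, hypothetical names): in elastic mode the dead slot stays None and is simply skipped when the queue handle is re-broadcast:

    class FakeActor:
        """Stand-in for a RayXGBoostActor handle."""
        def __init__(self, rank):
            self.rank = rank
            self.queue = None

        def set_queue(self, queue):
            self.queue = queue

    _actors = [FakeActor(0), None, FakeActor(2)]  # rank 1 died in a previous round
    _failed_actor_ranks = set()                   # cleared: continue with fewer actors

    # Maybe we got a new queue actor: only live actors receive it, None is skipped.
    _queue = object()
    for actor in _actors:
        if actor is not None:
            actor.set_queue(_queue)

    assert [a.rank for a in _actors if a is not None] == [0, 2]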

richardliaw (Collaborator)

Actually, one question I have is: when you have 4 actors on 4 nodes and one of the nodes dies, where do you check the number of cluster resources to resume with only 3 workers?

krfricke (Collaborator, Author) commented Dec 13, 2020

Since we re-use the existing actors, we currently don't check the cluster resources at all - we just invoke the local train() function on the remaining actors.

Re-scheduling actors once resources are available again will be part of a separate PR. I don't want to make this bigger than it already is.

Thanks a lot for the review by the way! I addressed the comments and just added a bunch of in-code documentation for better understanding of the code.

Comment on lines +586 to +591
    for i in list(_failed_actor_ranks):
        if _actors[i] is not None:
            raise RuntimeError(
                f"Trying to create actor with rank {i}, but it already "
                f"exists.")
        actor = _create_actor(
richardliaw (Collaborator):

Question - from what I understand, failed actors will have their rank added to _failed_actor_ranks. When _train is called, don't these actors get recreated?

I'm just specifically looking for the situation where a node dies and the new actor creation step is skipped.

krfricke (Collaborator, Author):

That's in line 848:

                 start_actor_ranks = set() 

Here the set is cleared (I guess we could just .clear() it instead). With it being empty, no actors are restarted.
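
As a side note (general Python behavior, illustrative only, not the PR's code): rebinding the name to a fresh set and calling .clear() differ only when the set object is shared with other code, e.g. a reference held elsewhere:

    shared = {0, 3, 7}   # stand-in for start_actor_ranks
    alias = shared       # another reference to the same set object

    shared = set()       # rebinds the name only; `alias` still sees {0, 3, 7}
    assert alias == {0, 3, 7}

    alias.clear()        # mutates the object; every reference now sees an empty set
    assert alias == set()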

richardliaw (Collaborator):

Ah, got it.

richardliaw (Collaborator)

Unfortunately, my cluster yaml (and snapshot) no longer works. One last comment/question about actor recovery -- I keep re-reading the recovery code and it seems like the actors are always restarted. Can you help explain the purpose/code flow of _failed_actor_ranks?

krfricke (Collaborator, Author)

I'll run a couple of tests on a cluster later and post the results here.

krfricke (Collaborator, Author) commented Dec 14, 2020

Re: _failed_actor_ranks: This set collects the ranks of failed actors, which are restarted on a call to _train(). Note that we have the same information in the _actors list, as we could just filter for None values. The difference, however, is that we want to be able to deliberately alter _failed_actor_ranks in order to restart only specific ranks.
For instance, for elastic training we currently clear the set, so no actors are restarted. In future failure handling modes we could start only those actors that have resources available. With this object we can thus specify exactly which ranks we want to start and which we don't.

In the train() function the set is called start_actor_ranks, as it semantically describes which ranks will be started by _train(). In _train() itself it is called _failed_actor_ranks because, after starting actors, it collects the ranks of failed actors. Would it be clearer if we renamed _failed_actor_ranks to _start_actor_ranks in _train() as well?

I can also add more comments for this, e.g. annotate the internal variables passed to _train().
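
To make that flow concrete, a condensed, self-contained sketch (hypothetical helper names; the real train()/_train() signatures differ):

    class ActorDied(RuntimeError):
        """Stand-in for a Ray actor failure."""

    _calls = {"n": 0}

    def _train(actors, _failed_actor_ranks):
        # Start an actor for every rank in the set, then clear it; from now on
        # the set only collects the ranks of actors that die in this round.
        for rank in list(_failed_actor_ranks):
            actors[rank] = f"actor-{rank}"
        _failed_actor_ranks.clear()
        _calls["n"] += 1
        if _calls["n"] == 1:            # simulate a single failure in round 1
            actors[1] = None
            _failed_actor_ranks.add(1)
            raise ActorDied("rank 1 died")
        return "booster"

    def train_with_retries(num_actors=4, max_restarts=2, elastic=True):
        actors = [None] * num_actors
        start_actor_ranks = set(range(num_actors))  # first round: start everyone
        for _ in range(max_restarts + 1):
            try:
                return _train(actors, start_actor_ranks)
            except ActorDied:
                if elastic:
                    start_actor_ranks = set()   # restart nothing, keep survivors
                # non-elastic: failed ranks stay in the set and get recreated
        raise RuntimeError("Training failed after retries")

    print(train_with_retries())  # -> "booster", with actors[1] left as None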

krfricke (Collaborator, Author)

Works fine in my tests. Here is the resulting log for killing 2 nodes:

(base) root@ip-172-31-18-125:/release_tests# python benchmark_cpu_gpu.py 10 100 200
2020-12-14 10:42:37,239 INFO worker.py:651 -- Connecting to existing Ray cluster at address: 172.31.18.125:6379
2020-12-14 10:42:37,992 INFO main.py:620 -- [RayXGBoost] Created 10 new actors (10 total actors).
2020-12-14 10:42:40,938 INFO main.py:638 -- [RayXGBoost] Starting XGBoost training.
(pid=4107, ip=172.31.20.197) [10:42:40] task [xgboost.ray]:140004211236752 got new rank 2
(pid=4038, ip=172.31.25.131) [10:42:40] task [xgboost.ray]:140546107764624 got new rank 5
(pid=4005, ip=172.31.24.220) [10:42:40] task [xgboost.ray]:140103861936592 got new rank 4
(pid=4006, ip=172.31.29.248) [10:42:40] task [xgboost.ray]:140411510034704 got new rank 7
(pid=4037, ip=172.31.17.26) [10:42:40] task [xgboost.ray]:139927338557264 got new rank 0
(pid=3993, ip=172.31.20.30) [10:42:40] task [xgboost.ray]:140255472527504 got new rank 3
(pid=4074, ip=172.31.26.109) [10:42:40] task [xgboost.ray]:140038723952528 got new rank 6
(pid=3939, ip=172.31.30.64) [10:42:40] task [xgboost.ray]:140290278254992 got new rank 9
(pid=4107, ip=172.31.30.251) [10:42:40] task [xgboost.ray]:140133743529872 got new rank 8
(pid=4011, ip=172.31.19.132) [10:42:40] task [xgboost.ray]:140660999946128 got new rank 1
E1214 10:42:53.648838  3040  3055 task_manager.cc:323] Task failed: IOError: 2: user code caused exit: Type=ACTOR_TASK, Language=PYTHON, Resources: {}
, function_descriptor={type=PythonFunctionDescriptor, module_name=xgboost_ray.main, class_name=RayXGBoostActor, function_name=train, function_hash=},
task_id=019554c512c2b9b51bd49a5609000000, task_name=RayXGBoostActor.train(), job_id=09000000, num_args=10, num_returns=2, actor_task_spec={actor_id=1b
d49a5609000000, actor_caller_id=ffffffffffffffffffffffff09000000, actor_counter=4}
E1214 10:42:53.650756  3040  3055 task_manager.cc:323] Task failed: IOError: cancelling all pending tasks of dead actor: Type=ACTOR_TASK, Language=PYT
HON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=xgboost_ray.main, class_name=RayXGBoostActor, function_name=pid, f
unction_hash=}, task_id=b9fed017052010151bd49a5609000000, task_name=RayXGBoostActor.pid(), job_id=09000000, num_args=0, num_returns=2, actor_task_spec
={actor_id=1bd49a5609000000, actor_caller_id=ffffffffffffffffffffffff09000000, actor_counter=5}
2020-12-14 10:42:53,729 INFO main.py:489 -- Actor status: 9 alive, 1 dead (10 total)
2020-12-14 10:42:53,729 WARNING main.py:851 -- A Ray actor died during training. Trying to restart and continue training from last checkpoint (restart
 1 of 2). This will use 9 existing actors and start 0 new actors.Sleeping for 10 seconds for cleanup.
2020-12-14 10:43:03,750 INFO main.py:620 -- [RayXGBoost] Created 0 new actors (9 total actors).
2020-12-14 10:43:03,758 INFO main.py:638 -- [RayXGBoost] Starting XGBoost training.
(pid=4107, ip=172.31.20.197) [10:43:03] task [xgboost.ray]:140004211236752 got new rank 2
(pid=4038, ip=172.31.25.131) [10:43:03] task [xgboost.ray]:140546107764624 got new rank 5
(pid=4011, ip=172.31.19.132) [10:43:03] task [xgboost.ray]:140660999946128 got new rank 1
(pid=4037, ip=172.31.17.26) [10:43:03] task [xgboost.ray]:139927338557264 got new rank 0
(pid=4006, ip=172.31.29.248) [10:43:03] task [xgboost.ray]:140411510034704 got new rank 6
(pid=3993, ip=172.31.20.30) [10:43:03] task [xgboost.ray]:140255472527504 got new rank 3
(pid=4005, ip=172.31.24.220) [10:43:03] task [xgboost.ray]:140103861936592 got new rank 4
(pid=4107, ip=172.31.30.251) [10:43:03] task [xgboost.ray]:140133743529872 got new rank 7
(pid=3939, ip=172.31.30.64) [10:43:03] task [xgboost.ray]:140290278254992 got new rank 8
E1214 10:43:15.269685  3040  3055 task_manager.cc:323] Task failed: IOError: 2: user code caused exit: Type=ACTOR_TASK, Language=PYTHON, Resources: {}
, function_descriptor={type=PythonFunctionDescriptor, module_name=xgboost_ray.main, class_name=RayXGBoostActor, function_name=train, function_hash=},
task_id=aff64b4aedc094a1f90e593909000000, task_name=RayXGBoostActor.train(), job_id=09000000, num_args=12, num_returns=2, actor_task_spec={actor_id=f9
0e593909000000, actor_caller_id=ffffffffffffffffffffffff09000000, actor_counter=10}
E1214 10:43:15.271399  3040  3055 task_manager.cc:323] Task failed: IOError: cancelling all pending tasks of dead actor: Type=ACTOR_TASK, Language=PYT
HON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=xgboost_ray.main, class_name=RayXGBoostActor, function_name=pid, f
unction_hash=}, task_id=b55dd8e99d574a31f90e593909000000, task_name=RayXGBoostActor.pid(), job_id=09000000, num_args=0, num_returns=2, actor_task_spec
={actor_id=f90e593909000000, actor_caller_id=ffffffffffffffffffffffff09000000, actor_counter=11}
2020-12-14 10:43:15,367 INFO main.py:489 -- Actor status: 8 alive, 2 dead (10 total)
2020-12-14 10:43:15,367 WARNING main.py:851 -- A Ray actor died during training. Trying to restart and continue training from last checkpoint (restart
 2 of 2). This will use 8 existing actors and start 0 new actors.Sleeping for 10 seconds for cleanup.
2020-12-14 10:43:25,388 INFO main.py:620 -- [RayXGBoost] Created 0 new actors (8 total actors).
2020-12-14 10:43:25,394 INFO main.py:638 -- [RayXGBoost] Starting XGBoost training.
(pid=4006, ip=172.31.29.248) [10:43:25] task [xgboost.ray]:140411510034704 got new rank 5
(pid=3993, ip=172.31.20.30) [10:43:25] task [xgboost.ray]:140255472527504 got new rank 2
(pid=4005, ip=172.31.24.220) [10:43:25] task [xgboost.ray]:140103861936592 got new rank 3
(pid=3939, ip=172.31.30.64) [10:43:25] task [xgboost.ray]:140290278254992 got new rank 7
(pid=4107, ip=172.31.30.251) [10:43:25] task [xgboost.ray]:140133743529872 got new rank 6
(pid=4037, ip=172.31.17.26) [10:43:25] task [xgboost.ray]:139927338557264 got new rank 0
(pid=4011, ip=172.31.19.132) [10:43:25] task [xgboost.ray]:140660999946128 got new rank 1
(pid=4038, ip=172.31.25.131) [10:43:25] task [xgboost.ray]:140546107764624 got new rank 4
(pid=4074, ip=172.31.26.109) [10:42:40] task [xgboost.ray]:140038723952528 got new rank 6
(pid=4074, ip=172.31.26.109) 2020-12-14 10:42:53,643    ERROR worker.py:384 -- SystemExit was raised from the worker
(pid=4074, ip=172.31.26.109) Traceback (most recent call last):
(pid=4074, ip=172.31.26.109)   File "python/ray/_raylet.pyx", line 551, in ray._raylet.task_execution_handler
(pid=4074, ip=172.31.26.109)   File "python/ray/_raylet.pyx", line 438, in ray._raylet.execute_task
(pid=4074, ip=172.31.26.109)   File "python/ray/_raylet.pyx", line 477, in ray._raylet.execute_task
(pid=4074, ip=172.31.26.109)   File "python/ray/_raylet.pyx", line 481, in ray._raylet.execute_task
(pid=4074, ip=172.31.26.109)   File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
(pid=4074, ip=172.31.26.109)   File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
(pid=4074, ip=172.31.26.109)   File "/root/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 553, in actor_method_executor
(pid=4074, ip=172.31.26.109)     return method(actor, *args, **kwargs)
(pid=4074, ip=172.31.26.109)   File "/root/anaconda3/lib/python3.7/site-packages/xgboost_ray/main.py", line 390, in train
(pid=4074, ip=172.31.26.109)     time.sleep(0.1)
(pid=4074, ip=172.31.26.109)   File "/root/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 381, in sigterm_handler
(pid=4074, ip=172.31.26.109)     sys.exit(1)
(pid=4074, ip=172.31.26.109) SystemExit: 1
(pid=4107, ip=172.31.20.197) [10:42:40] task [xgboost.ray]:140004211236752 got new rank 2
(pid=4107, ip=172.31.20.197) [10:43:03] task [xgboost.ray]:140004211236752 got new rank 2
(pid=4107, ip=172.31.20.197) 2020-12-14 10:43:15,245    ERROR worker.py:384 -- SystemExit was raised from the worker
(pid=4107, ip=172.31.20.197) Traceback (most recent call last):
(pid=4107, ip=172.31.20.197)   File "python/ray/_raylet.pyx", line 551, in ray._raylet.task_execution_handler
(pid=4107, ip=172.31.20.197)   File "python/ray/_raylet.pyx", line 438, in ray._raylet.execute_task
(pid=4107, ip=172.31.20.197)   File "python/ray/_raylet.pyx", line 477, in ray._raylet.execute_task
(pid=4107, ip=172.31.20.197)   File "python/ray/_raylet.pyx", line 481, in ray._raylet.execute_task
(pid=4107, ip=172.31.20.197)   File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
(pid=4107, ip=172.31.20.197)   File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
(pid=4107, ip=172.31.20.197)   File "/root/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 553, in actor_method_executor
(pid=4107, ip=172.31.20.197)     return method(actor, *args, **kwargs)
(pid=4107, ip=172.31.20.197)   File "/root/anaconda3/lib/python3.7/site-packages/xgboost_ray/main.py", line 390, in train
(pid=4107, ip=172.31.20.197)     time.sleep(0.1)
(pid=4107, ip=172.31.20.197)   File "/root/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 381, in sigterm_handler
(pid=4107, ip=172.31.20.197)     sys.exit(1)
(pid=4107, ip=172.31.20.197) SystemExit: 1
2020-12-14 10:45:12,824 INFO main.py:731 -- [RayXGBoost] Finished XGBoost training on training data with total N=160,000,000.
TRAIN TIME TAKEN: 155.57 seconds
Final training error: 0.4951
TOTAL TIME TAKEN: 155.58 seconds (0.03 for init)
(base) root@ip-172-31-18-125:/release_tests#

Note that the last two errors are sent to the driver some time after the nodes died.

I'll push some cosmetic changes, but other than that I think we should be ready to merge.

krfricke mentioned this pull request on Dec 14, 2020
richardliaw (Collaborator)

nice!

richardliaw merged commit 403262b into ray-project:master on Dec 14, 2020
krfricke deleted the failure-handling-polling branch on December 14, 2020 20:52