Elastic failure handling #23
Conversation
Merge commit - resolved conflicts in: examples/simple.py, examples/simple_tune.py, xgboost_ray/main.py, xgboost_ray/tests/release/benchmark_cpu_gpu.py, xgboost_ray/tests/test_fault_tolerance.py, xgboost_ray/tests/utils.py
Merge commit - resolved conflicts in: xgboost_ray/main.py
Merge commit - resolved conflicts in: xgboost_ray/main.py, xgboost_ray/tests/test_fault_tolerance.py
This should be ready for review, @amogkam.
Thanks, I'll review it today.
This is a really cool feature and looks good to me overall! The main thing is that there are a lot of different concurrency components (multiprocessing, threading, asyncio.Event, rabit_tracker). I know you have some of this in the Google doc already but it would really help to add some comments here about what each component's responsibilities are and how they interact with each other.
Thanks a bunch for the thorough review. I addressed all your comments and tried to answer your questions!
I’ll review this PR later tonight!
…On Fri, Dec 11, 2020 at 9:11 AM Amog Kamsetty wrote, commenting on this pull request in xgboost_ray/main.py (#23 (comment)):

+                **kwargs)
+            result_dict.update({
+                "bst": bst,
+                "evals_result": evals_result,
+                "train_n": self._local_n
+            })
+        except XGBoostError:
+            # Silent fail, will be raised as RayXGBoostTrainingStopped
+            return
+
+        thread = threading.Thread(target=_train)
+        thread.daemon = True
+        thread.start()
+        while thread.is_alive():
+            thread.join(timeout=0)
+            if self._stop_event.is_set():

Got it, makes sense! Do you think you can add this ^ as a comment in the code?
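For readers of this thread: a minimal, self-contained sketch of the pattern the quoted snippet uses - run the blocking training call in a daemon thread and poll a stop event from the main loop. The name `stop_event` mirrors `self._stop_event` from the quote; the training body is a stand-in for the real `xgb.train()` call, so this is an illustration, not the actual implementation.

```python
import threading
import time

stop_event = threading.Event()
result_dict = {}

def _train():
    try:
        # Stand-in for the blocking xgb.train() call.
        for _ in range(50):
            if stop_event.is_set():
                raise RuntimeError("training interrupted")
            time.sleep(0.05)
        result_dict["status"] = "finished"
    except RuntimeError:
        # Silent fail; the caller reports the stop condition instead.
        return

thread = threading.Thread(target=_train)
thread.daemon = True
thread.start()
while thread.is_alive():
    thread.join(timeout=0.1)  # short poll so the stop event can be checked
    if stop_event.is_set():
        break  # e.g. another actor failed; stop waiting on this thread

print(result_dict.get("status", "stopped"))
```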
This looks pretty good overall! A couple of comments:
- The architecture/design could be better commented in the code. I left some comments at places that required a bit more digging.
- It'd be great to document this failure model in `train()`.
- We will want to do a bit of refactoring to consolidate failure handling responsibilities in core functions, but I don't think that's a blocker for this PR.
Please ping when you've addressed the remaining comments!
for i in range(len(actors)):
    actor = actors[i]
for actor in actors?
Since we're overwriting the array item here:
actors[i] = None
we need the index anyway, so I'll probably keep it as is.
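A tiny illustrative sketch of the point being made here (the `actors` list is just a stand-in for the Ray actor handles, not code from the PR):

```python
actors = ["actor_0", "actor_1", "actor_2"]  # stand-ins for Ray actor handles

for i in range(len(actors)):
    actor = actors[i]
    if actor == "actor_1":  # pretend this actor failed
        actors[i] = None    # overwrite the slot in place so ranks stay stable

print(actors)  # ['actor_0', None, 'actor_2']

# enumerate() would also provide the index without manual indexing:
# for i, actor in enumerate(actors): ...
```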
# Maybe we got a new queue actor
wait_queue = [
    actor.set_queue.remote(_queue) for actor in _actors
    if actor is not None
Can there actually be `actor is None` here?
Yes! After an actor failed, its entry in `_actors` is set to None in the `train()` function. `_train` is then called again, so entries can be None.
Hmm, don't all the None entries get filled in L586 to 603?
Only those that are in `_failed_actor_ranks` - and this set is empty if we continue with fewer actors. See previous comment below.
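To summarize this exchange, here is a small runnable sketch of the described flow. The names `_actors`, `_failed_actor_ranks`, and `_create_actor` mirror the snippets quoted in this thread; the bodies are simplified stand-ins, not the actual implementation.

```python
_actors = ["rank0_actor", None, "rank2_actor", None]  # ranks 1 and 3 died
# Only rank 1 is scheduled for restart; rank 3 stays dead, so training
# continues elastically with fewer actors.
_failed_actor_ranks = {1}

def _create_actor(rank):
    # Stand-in for scheduling a new Ray training actor.
    return f"new_rank{rank}_actor"

for i in list(_failed_actor_ranks):
    if _actors[i] is not None:
        raise RuntimeError(
            f"Trying to create actor with rank {i}, but it already exists.")
    _actors[i] = _create_actor(i)
    _failed_actor_ranks.discard(i)

# Calls that touch all actors (e.g. handing out a new queue actor) must still
# skip dead ranks, hence the `if actor is not None` filter quoted above.
alive_actors = [actor for actor in _actors if actor is not None]
print(alive_actors)  # ['rank0_actor', 'new_rank1_actor', 'rank2_actor']
```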
Actually, one question I have is: when you have 4 actors on 4 nodes and one of the nodes dies, where do you check the number of cluster resources to resume with only 3 workers?
Since we re-use the existing actors, we currently don't check the cluster resources at all - we just invoke the local training again on the remaining actors. Re-scheduling actors once resources are available again will be part of a separate PR. I don't want to make this bigger than it already is.

Thanks a lot for the review by the way! I addressed the comments and just added a bunch of in-code documentation for better understanding of the code.
for i in list(_failed_actor_ranks):
    if _actors[i] is not None:
        raise RuntimeError(
            f"Trying to create actor with rank {i}, but it already "
            f"exists.")
    actor = _create_actor(
Question - from what I understand, failed actors will have their rank added to `_failed_actor_ranks`. When `_train` is called, don't these actors get recreated? I'm just specifically looking for the situation where a node dies and the new actor creation step is skipped.
That's in line 848:
start_actor_ranks = set()
Here the set is cleared (I guess we could just `.clear()` it instead). With it being empty, no actors are restarted.
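A quick aside on the `.clear()` remark (general Python behavior, not code from this PR): rebinding the name creates a new set, while `.clear()` mutates the existing one, which only matters if other code holds a reference to the same set.

```python
ranks = {0, 1, 2}
alias = ranks        # a second reference to the same set object

ranks = set()        # rebinds the name; the original set is untouched
print(alias)         # {0, 1, 2}

ranks = alias
ranks.clear()        # mutates the shared set in place
print(alias)         # set()
```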
Ah, got it.
Unfortunately, my cluster YAML (and snapshot) no longer works. One last comment/question about actor recovery -- I keep re-reading the recovery code and it seems like the actors are always restarted. Can you help explain the purpose/code flow of the recovery logic here?
I'll run a couple of tests on a cluster later and post the results here.
Re: the recovery code flow - I can also add more comments to this, e.g. annotate the internal variables that are passed around.
Works fine in my tests. Here is the results log for killing 2 nodes:
Note that the last two errors are sent to the driver some time after the nodes died. I'll push some cosmetic changes, but other than that I think we should be ready to merge.
Nice!
Introduces elastic failure handling. Actors that are still alive are not recreated on error; they stick around and don't have to load their data again. We can also choose not to restart failed actors and continue training with fewer actors.
Depends on #21.
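For context, a hypothetical usage sketch of the behavior described above. It uses the xgboost_ray API as it appears in later releases; whether the parameter names `elastic_training`, `max_failed_actors`, and `max_actor_restarts` are exactly what this PR introduces is an assumption.

```python
import numpy as np
from xgboost_ray import RayDMatrix, RayParams, train

# Toy data; in practice this would be a larger (possibly distributed) dataset.
x = np.random.rand(1000, 10)
y = (x[:, 0] > 0.5).astype(int)
dtrain = RayDMatrix(x, y)

evals_result = {}
bst = train(
    {"objective": "binary:logistic", "eval_metric": "logloss"},
    dtrain,
    evals_result=evals_result,
    ray_params=RayParams(
        num_actors=4,
        elastic_training=True,   # keep training if some actors die
        max_failed_actors=1,     # tolerate up to one permanently lost actor
        max_actor_restarts=2,    # how often dead actors may be recreated
    ),
)
```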