
[Train] Improvements to fault tolerance #22511

Merged
16 commits merged into ray-project:master on Mar 29, 2022

Conversation

@amogkam (Contributor) commented Feb 19, 2022

Various improvements to Ray Train fault tolerance.

  • Adds more log statements for better debugging of Ray Train failure handling.
  • Fixes [Bug] [Train] Cannot reproduce fault-tolerance, script hangs upon any node shutdown #22349. The default torch process group timeout is 30 minutes, so if a failure occurs before a gradient synchronization, training hangs for 30 minutes before raising an error and triggering fault tolerance. This PR reduces the default timeout_s to 30 seconds (see the sketch after this list).
  • Adds functionality to trigger fault tolerance even if some of the alive workers are hanging, by enforcing a grace period for the alive workers once a failure is detected.
  • Simplifies fault tolerance by removing the backend-specific handle_failure. If any worker has failed, all workers are restarted and training continues from the last checkpoint.
  • Adds a test for fault tolerance with an actual torch example. When run locally, the test hangs before the fix and passes after it.
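
For reference, a minimal sketch of how a user could set the shorter timeout explicitly, assuming the Ray 1.x Trainer/TorchConfig API that this PR targets (this is not code from the PR, and train_func is just a placeholder):

# Sketch only: configuring a shorter torch process group timeout in Ray Train.
from ray.train import Trainer
from ray.train.torch import TorchConfig

def train_func():
    pass  # user training loop goes here

# timeout_s bounds how long a collective op (e.g. allreduce) may block before
# torch raises, which is what lets Ray Train detect the failure and recover.
trainer = Trainer(backend=TorchConfig(timeout_s=30), num_workers=2)
trainer.start()
trainer.run(train_func)
trainer.shutdown()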

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Resolved review threads (outdated): python/ray/train/backend.py, python/ray/train/trainer.py, python/ray/train/torch.py
worker_group.start()
logger.info("Setting up distributed backend on all workers.")
Contributor
Are these logs too verbose?

Contributor Author
Hmmm, most people don't run with log level DEBUG, and for very long-running training jobs (15+ hours) I think these messages could be useful, since it's not feasible to rerun the job just to reproduce a failure with more verbose logging.

cc @matthewdeng would also like to hear your thoughts here.

Contributor
I think INFO is fine given that this is only logged during failures, which should hopefully be rare.

Resolved review thread (outdated): python/ray/train/utils.py
if at_least_one_failed_worker and len(unfinished) > 0:
    # If at least one worker has failed, but there are still workers
    # that we are waiting on results from, these workers are hanging.
    # Treat these workers as dead workers as well.
Contributor
Do we actually need to treat these as dead? Or is it sufficient that we break out of this ray.wait call and only return the failed worker as dead?

Contributor Author
That's a good point, we don't need to treat them as dead. Will make the change.

Contributor Author
Actually no, these have to be marked as dead; otherwise new actors may not be created to replace them, and the hanging actors are no longer usable anyway since they are stuck on a particular method execution.
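
As a small illustration of why a hanging actor cannot simply be reused (a standalone sketch, not code from this PR; the Worker class and the sleep are stand-ins for a worker stuck in a collective call):

import time

import ray

@ray.remote
class Worker:
    def train_step(self):
        # Stands in for a collective call whose peer workers have died.
        time.sleep(3600)

    def ping(self):
        return "ok"

ray.init()
w = Worker.remote()
stuck_ref = w.train_step.remote()

# Actor methods execute serially by default, so ping() is queued behind the
# hanging train_step() and never runs; the only remedy is to restart the actor.
ready, _ = ray.wait([w.ping.remote()], timeout=5)
assert ready == []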

# If a failure occurs the ObjectRef will be marked as finished.
# Calling ray.get will expose the failure as a RayActorError.
for object_ref in finished:
    # Everything in finished has either failed or
Contributor
or what 😨

Contributor Author
Or completed successfully 😁...good catch, updated!
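
A standalone sketch (not the PR's code) of the behavior described above: when a worker actor dies, its pending result ref is reported as finished by ray.wait, and ray.get on it raises RayActorError.

import os

import ray
from ray.exceptions import RayActorError

@ray.remote
class Worker:
    def train_step(self):
        # Simulate the worker process crashing mid-step.
        os._exit(1)

ray.init()
w = Worker.remote()
ref = w.train_step.remote()

# The dead actor's ref counts as "finished"...
finished, unfinished = ray.wait([ref], timeout=30)
assert finished == [ref]
# ...and ray.get surfaces the failure as a RayActorError.
try:
    ray.get(ref)
except RayActorError:
    print("Worker failed; trigger fault tolerance.")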

# are alive, but hanging on collective calls because other
# workers have failed.
timeout = (
    REMAINING_WORKERS_GRACE_PERIOD_S if at_least_one_failed_worker else None
Contributor
Change REMAINING_WORKERS_GRACE_PERIOD_S to support being set as an environment variable? Is 10 seconds too short?
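
A hedged sketch of that suggestion; the environment variable name below is illustrative, not an actual Ray setting:

import os

# Hypothetical: allow the grace period to be overridden via an environment
# variable, falling back to a 10 second default.
REMAINING_WORKERS_GRACE_PERIOD_S = int(
    os.environ.get("RAY_TRAIN_REMAINING_WORKERS_GRACE_PERIOD_S", "10")
)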


stale bot commented Mar 27, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

The stale bot added the "stale" label on Mar 27, 2022 and removed it on Mar 28, 2022.
Resolved review threads (outdated): python/ray/train/backend.py, python/ray/train/utils.py
@matthewdeng (Contributor) left a comment

LGTM - I think you just need to pull in master to fix the lint issues.

return False, dead_worker_indexes
else:
return True, []
at_least_one_failed_worker = True
Contributor
nit: Just return False here? Or remove the "and not at_least_one_failed_worker" condition on line 48 if you want to print all failed workers.

Contributor Author
good point, updated!
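
Putting the pieces from the threads above together, a hedged reconstruction of what the failure-check helper could look like after this feedback (names, constants, and structure approximate the PR; this is not the merged code):

import ray
from ray.exceptions import RayActorError

REMAINING_WORKERS_GRACE_PERIOD_S = 10  # grace period once a failure is seen

def check_for_failure(remote_values):
    """Return (success, dead_worker_indexes) for one round of worker results."""
    unfinished = list(remote_values)
    dead_worker_indexes = []
    at_least_one_failed_worker = False
    while unfinished:
        # Before any failure: block until the next ref resolves. After a
        # failure: give the remaining workers only a short grace period.
        timeout = (
            REMAINING_WORKERS_GRACE_PERIOD_S if at_least_one_failed_worker else None
        )
        num_returns = len(unfinished) if at_least_one_failed_worker else 1
        finished, unfinished = ray.wait(
            unfinished, num_returns=num_returns, timeout=timeout
        )
        for object_ref in finished:
            try:
                # A dead worker's ref is finished; ray.get raises RayActorError.
                ray.get(object_ref)
            except RayActorError:
                at_least_one_failed_worker = True
                dead_worker_indexes.append(remote_values.index(object_ref))
        if at_least_one_failed_worker and unfinished and timeout is not None:
            # Grace period expired: the remaining workers are hanging on
            # collective calls and are treated as dead as well.
            dead_worker_indexes.extend(remote_values.index(r) for r in unfinished)
            break
    # Per the review nit: derive success directly from the collected indexes.
    return len(dead_worker_indexes) == 0, dead_worker_indexes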

@amogkam merged commit 0b8c219 into ray-project:master Mar 29, 2022
@amogkam deleted the train-fault-tolerance-logging branch March 29, 2022 22:36
Successfully merging this pull request may close these issues.

[Bug] [Train] Cannot reproduce fault-tolerance, script hangs upon any node shutdown
4 participants