This repository has been archived by the owner on Jan 6, 2023. It is now read-only.
Define a TrainStepRetryableException and catch it in train_loop.py #6
Labels
enhancement
New feature or request
🚀 Feature
Currently train_loop catches RuntimeError from train_step and retries, but any other Exception is not retried. This implies that RuntimeErrors are considered retryable while other Exceptions are not. Clearly define retryable vs. non-retryable exceptions for train_step, and use that distinction to decide whether the train_step should be rolled back and retried.
Motivation
A better, clearer API, exception handling, and retry logic.
Pitch
See description
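A minimal sketch of what the description proposes, under stated assumptions: only TrainStepRetryableException comes from the issue title. The run_train_loop function, its max_steps and max_retries parameters, and the deepcopy-based rollback are hypothetical stand-ins for illustration and are not the actual torchelastic train_loop.

```python
import copy


class TrainStepRetryableException(Exception):
    """Raised by train_step when the step may be rolled back and retried."""


def run_train_loop(train_step, state, max_steps, max_retries=3):
    """Hypothetical simplified loop, NOT the real torchelastic train_loop.

    Only TrainStepRetryableException triggers rollback and retry; any other
    exception is treated as non-retryable and propagates to the caller.
    """
    retries = 0  # total retry budget across the whole loop (an assumption)
    step = 0
    while step < max_steps:
        # Stand-in for a real checkpoint: copy the state before the step.
        snapshot = copy.deepcopy(state)
        try:
            state = train_step(state)
        except TrainStepRetryableException:
            if retries >= max_retries:
                raise  # retry budget exhausted, surface the error
            retries += 1
            state = snapshot  # roll back any partial mutation
            continue
        step += 1
    return state


def make_flaky_step(fail_times):
    """Build a train_step that fails `fail_times` times, then succeeds."""
    remaining = {"n": fail_times}

    def step(state):
        state["count"] += 1  # partial mutation before a possible failure
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise TrainStepRetryableException("transient failure")
        return state

    return step


final_state = run_train_loop(make_flaky_step(2), {"count": 0}, max_steps=3)
```

With this shape, a train_step author opts in to retries by raising TrainStepRetryableException explicitly, rather than relying on whether the failure happens to be a RuntimeError.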
Alternatives
N/A
Additional context
Link to code: https://github.com/pytorch/elastic/blob/master/torchelastic/train_loop.py#L123