This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

Define a TrainStepRetryableException and catch it in train_loop.py #6

Closed
kiukchung opened this issue Dec 2, 2019 · 1 comment
Labels
enhancement New feature or request

Comments

@kiukchung
Contributor

🚀 Feature

Currently, train_loop catches RuntimeError raised by train_step and retries the step, but any other Exception is not retried. This implies that RuntimeError is considered retryable while every other exception is not. We should clearly define retryable and non-retryable exceptions for train_step and use that distinction to decide whether a train_step should be rolled back and retried.
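A minimal sketch of what this could look like, not the actual torchelastic code. Only the name `TrainStepRetryableException` comes from the issue title; `run_train_loop`, its parameters, and the deep-copy rollback are hypothetical stand-ins for whatever train_loop.py really does.

```python
import copy


class TrainStepRetryableException(Exception):
    """Raised by train_step to signal that the step may be rolled back and retried."""


def run_train_loop(train_step, state, num_steps, max_retries=3):
    """Hypothetical loop: retry only on TrainStepRetryableException.

    Any other exception is treated as non-retryable and propagates to the caller.
    """
    for _ in range(num_steps):
        # Assumption: a deep copy of the state is enough to roll a step back.
        snapshot = copy.deepcopy(state)
        attempts = 0
        while True:
            try:
                state = train_step(state)
                break
            except TrainStepRetryableException:
                attempts += 1
                if attempts > max_retries:
                    raise
                # Roll back to the state from before the failed step.
                state = copy.deepcopy(snapshot)
    return state
```

With this shape, a train_step opts in to retries by raising `TrainStepRetryableException` explicitly, so the retry decision no longer depends on whether an error happens to be a RuntimeError.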

Motivation

A better, clearer API with explicit exception handling and retry logic.

Pitch

See description

Alternatives

N/A

Additional context

Link to code: https://github.com/pytorch/elastic/blob/master/torchelastic/train_loop.py#L123

@kiukchung added the enhancement label on Dec 2, 2019
@kiukchung
Contributor Author

no longer relevant in pet v0.2.0
