optional limit on total training time #2333

Closed

@jhcross jhcross commented Jul 16, 2020

Summary:
This change adds a new option (`--stop-time-hr`) which, if specified, limits the total training time to that number of hours. To stop training from within the inner training loop (after the first update that exceeds the time limit), the start time is stored on the trainer.
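
As a rough illustration of the stopping rule described above, here is a minimal sketch, not the code from this PR. The `Trainer` class, the `train` function, and the injectable `clock` parameter are hypothetical names introduced for the example; the only behavior taken from the description is that the start time lives on the trainer and that the check runs after each update, so the update that crosses the limit still completes.

```python
import time
from typing import Callable, Iterable, Optional


class Trainer:
    """Hypothetical stand-in for a trainer; not the project's actual class."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._start_time = clock()  # wall-clock time at which this run started
        self.num_updates = 0

    def train_step(self, batch) -> None:
        # Placeholder for the real forward/backward/optimizer step.
        self.num_updates += 1

    def elapsed_hours(self) -> float:
        return (self._clock() - self._start_time) / 3600.0


def train(
    trainer: Trainer,
    batches: Iterable,
    stop_time_hr: Optional[float] = None,  # assumed: None means "no limit"
) -> bool:
    """Run updates; return True if the time limit ended training early."""
    for batch in batches:
        trainer.train_step(batch)
        # Checked after the update, so training stops after the first
        # update that exceeds the limit rather than before it.
        if stop_time_hr is not None and trainer.elapsed_hours() > stop_time_hr:
            return True
    return False
```

Passing the clock in as a parameter is only a convenience for testing the sketch without waiting for real time to pass.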

In addition, to persist the training time when restoring from checkpoints (important because training runs are sometimes killed due to resource constraints), the training time already completed is stored as extra state in the checkpoints. The change remains backward compatible with existing checkpoints.
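
One way the checkpoint side of this could look is sketched below, under stated assumptions. The class `TimedTrainer`, the `extra_state` dictionary layout, and the `previous_training_time` key are illustrative names chosen for this example and may not match the actual change. The sketch shows the two properties the description calls out: elapsed time accumulates across restarts, and a checkpoint written before this field existed still loads.

```python
import time
from typing import Any, Callable, Dict


class TimedTrainer:
    """Hypothetical trainer that tracks cumulative training time across restarts."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._start_time = clock()
        self._previous_training_time = 0.0  # seconds accumulated by earlier runs

    def cumulative_training_time(self) -> float:
        """Seconds trained in earlier runs plus seconds in the current run."""
        return self._previous_training_time + (self._clock() - self._start_time)

    def state_dict(self) -> Dict[str, Any]:
        # Store the total so far; the key name is an assumption for this sketch.
        return {
            "extra_state": {
                "previous_training_time": self.cumulative_training_time(),
            }
        }

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        extra = state.get("extra_state", {})
        # Older checkpoints lack the key, so fall back to zero.
        self._previous_training_time = extra.get("previous_training_time", 0.0)
        # Restart the per-run timer so time spent down is not counted.
        self._start_time = self._clock()
```

With this layout, a time limit would be compared against `cumulative_training_time()` rather than the time since the latest restart, so a run that is killed and resumed does not get a fresh budget.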

Differential Revision: D22573166

fbshipit-source-id: d5a36c2eb57203e61bccafe29ff38a3424f77dc2
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D22573166

@facebook-github-bot

This pull request has been merged in 3655cf2.
