
Commit

Fix race condition on cleaning checkpoints when save_total_limit set to 1 (huggingface#20989)

* Update trainer.py

* fix style

Co-authored-by: Radhwane Chebaane <rchebaane.external@epo.org>
2 people authored and venkat-natchi committed Jan 22, 2023
1 parent 6e61bf7 commit 74934cf
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions src/transformers/trainer.py
@@ -1919,8 +1919,8 @@ def _inner_training_loop(
                run_dir = self._get_output_dir(trial)
                checkpoints_sorted = self._sorted_checkpoints(use_mtime=False, output_dir=run_dir)

-               # Delete the last checkpoint when save_total_limit=1 if it's different from the best checkpoint.
-               if self.state.best_model_checkpoint is not None and self.args.save_total_limit == 1:
+               # Delete the last checkpoint when save_total_limit=1 if it's different from the best checkpoint and process allowed to save.
+               if self.args.should_save and self.state.best_model_checkpoint is not None and self.args.save_total_limit == 1:
                    for checkpoint in checkpoints_sorted:
                        if checkpoint != self.state.best_model_checkpoint:
                            logger.info(f"Deleting older checkpoint [{checkpoint}] due to args.save_total_limit")
