[Train] Add a barrier in RayTrainReportCallback to ensure synchronous reporting. (#40875)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
woshiyyya committed Nov 3, 2023
1 parent f08498e commit c1e387f
Showing 1 changed file with 3 additions and 0 deletions.
3 changes: 3 additions & 0 deletions python/ray/train/lightning/_lightning_utils.py
@@ -270,6 +270,9 @@ def on_train_epoch_end(self, trainer, pl_module) -> None:
         checkpoint = Checkpoint.from_directory(tmpdir)
         train.report(metrics=metrics, checkpoint=checkpoint)
 
+        # Add a barrier to ensure all workers finished reporting here
+        torch.distributed.barrier()
+
         if self.local_rank == 0:
             shutil.rmtree(tmpdir)
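
A minimal sketch of why a barrier before the cleanup step matters, assuming the workers on a node read from one shared checkpoint directory. This is an analogy, not the Ray Train implementation: threads stand in for workers, `threading.Barrier` stands in for `torch.distributed.barrier()`, and the file read stands in for `train.report()`. The names `run_workers`, `NUM_WORKERS`, and `reported` are hypothetical.

```python
import os
import shutil
import tempfile
import threading
import time

NUM_WORKERS = 4  # hypothetical worker count, chosen only for illustration


def run_workers(num_workers: int = NUM_WORKERS) -> list:
    """Simulate workers that report from a shared directory, then clean it up."""
    tmpdir = tempfile.mkdtemp()
    ckpt_path = os.path.join(tmpdir, "checkpoint.ckpt")
    with open(ckpt_path, "w") as f:
        f.write("weights")

    # Stand-in for torch.distributed.barrier(): releases once all workers arrive.
    barrier = threading.Barrier(num_workers)
    reported = [False] * num_workers

    def worker(local_rank: int) -> None:
        # Stand-in for train.report(): higher ranks take longer to read the checkpoint.
        time.sleep(0.01 * local_rank)
        with open(ckpt_path) as f:
            reported[local_rank] = f.read() == "weights"

        # Wait until every worker has finished "reporting".
        barrier.wait()

        # Only after the barrier does local rank 0 delete the shared directory.
        if local_rank == 0:
            shutil.rmtree(tmpdir)

    threads = [
        threading.Thread(target=worker, args=(rank,)) for rank in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return reported


results = run_workers()
print(results)
```

If the `barrier.wait()` call is removed from this sketch, rank 0 can delete the directory while slower workers are still reading it, and they fail with `FileNotFoundError`.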
