[Train] LightningTrainer converts relative checkpoint dirpath to absolute path #35894
Labels
- bug: Something that is supposed to be working, but isn't
- P1: Issue that should be fixed within a few weeks
- ray-team-created: Ray Team created
- train: Ray Train Related Issue
What happened + What you expected to happen
During restoration, PyTorch Lightning expects all workers to share the same directory structure. But when a relative path is specified for `dirpath`, PyTorch Lightning creates a new folder under each worker's current working directory (the `rank_x` folder), e.g. `.../LightningTrainer_7282d_00000_0_2023-05-26_01-43-54/rank_x/{dirpath}`. This prevents the internal state of the `ModelCheckpoint` callback from being restored correctly, which leads to inconsistent NCCL operations across workers and ultimately to timeouts. We need a proper way for LightningTrainer to convert relative checkpoint dirpaths to absolute paths to eliminate this issue.
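
One plausible direction for the fix — a minimal sketch, not the actual LightningTrainer code — is to resolve the `dirpath` on the driver, before the workers switch into their per-rank working directories, so every worker receives the same absolute path. The helper name below is hypothetical:

```python
import os
from typing import Optional


def _resolve_checkpoint_dirpath(dirpath: Optional[str]) -> Optional[str]:
    # Hypothetical helper: resolve a relative dirpath against the
    # driver's working directory before it is shipped to the workers,
    # so every worker writes to the same absolute location instead of
    # creating its own rank_x/{dirpath} folder.
    if dirpath is None or os.path.isabs(dirpath):
        return dirpath
    return os.path.abspath(dirpath)
```

Resolving on the driver matters: calling `os.path.abspath` on a worker would bake that worker's `rank_x` directory into the path, reproducing the divergence.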
Versions / Dependencies
master
Reproduction script
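A minimal sketch of a reproduction, assuming the Ray 2.5-era `LightningConfigBuilder`/`LightningTrainer` API (`DummyModule` and the dataloader are hypothetical stand-ins):

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

from ray.air.config import RunConfig, ScalingConfig
from ray.train.lightning import LightningConfigBuilder, LightningTrainer


class DummyModule(pl.LightningModule):
    """Hypothetical stand-in model; any LightningModule should do."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log("loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)


loader = DataLoader(
    TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=8
)

config = (
    LightningConfigBuilder()
    .module(cls=DummyModule)
    .trainer(max_epochs=2, accelerator="cpu")
    .fit_params(train_dataloaders=loader)
    # The relative dirpath is the trigger: each worker resolves it
    # against its own rank_x working directory, so workers end up with
    # different checkpoint folders and divergent ModelCheckpoint state.
    .checkpointing(dirpath="my_checkpoints", save_top_k=2, monitor="loss")
    .build()
)

trainer = LightningTrainer(
    lightning_config=config,
    scaling_config=ScalingConfig(num_workers=2),
    run_config=RunConfig(name="relative_dirpath_repro"),
)
result = trainer.fit()
# Restoring this run is where the divergent ModelCheckpoint state
# surfaces as inconsistent NCCL collectives and eventual timeouts.
```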
Issue Severity
Medium: It is a significant difficulty but I can work around it.