Potential Deadlock During Validation with SequentialSampler in Multi-GPU Training #1960
sneakybatman started this conversation in Ideas
-
Hey 👋, thanks! Sure, feel free to open a PR 👍 I think this would need an update to all 4 training scripts that support DDP (recognition - TF / PyTorch | detection - TF / PyTorch). In the end it should work with both single-GPU and multi-GPU, where:

```python
import os

# Detect distributed setup
# the WORLD_SIZE variable is set by torchrun
world_size = int(os.environ.get("WORLD_SIZE", 1))
distributed = world_size > 1
```
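For illustration, a minimal sketch (not doctr's actual code) of how that flag could then pick the validation sampler so the same script covers both cases. The `TensorDataset` is a dummy stand-in for the real validation set, and when `distributed` is true the process group is assumed to have been initialized earlier in the script:

```python
import os

import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# WORLD_SIZE is set by torchrun; a plain single-process run keeps the default of 1
world_size = int(os.environ.get("WORLD_SIZE", 1))
distributed = world_size > 1

# Dummy tensors standing in for the real validation set used by the training scripts
val_set = TensorDataset(torch.randn(64, 3, 32, 128), torch.zeros(64, dtype=torch.long))

# Shard the validation data across ranks only when running distributed;
# single-GPU runs keep the original SequentialSampler behaviour.
val_sampler = (
    DistributedSampler(val_set, shuffle=False, drop_last=False)
    if distributed
    else SequentialSampler(val_set)
)
val_loader = DataLoader(val_set, batch_size=16, sampler=val_sampler)
```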
-
Hello,
First of all, thank you for this amazing library — I've been using it extensively for training and fine-tuning models on custom datasets.
Recently, while working with large-scale datasets, I encountered a deadlock issue during the validation step when using multiple GPUs. After some investigation, I found that the root cause lies in the usage of SequentialSampler for the validation dataloader.
The issue arises because SequentialSampler sends the same validation data to all GPUs. However, the validation loss and metrics are only computed on a single rank (typically rank 0). As a result, the other GPUs remain idle and wait indefinitely, leading to a timeout and failure during training — especially when the validation dataset is large.
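For illustration only, a self-contained sketch of that failure pattern (not doctr's code; run with e.g. `torchrun --nproc_per_node=2`, and the sleep stands in for a long validation loop on rank 0):

```python
import time

import torch.distributed as dist

# Rank 0 stands in for the rank that evaluates the whole validation set,
# while every other rank reaches the next collective immediately and waits.
dist.init_process_group(backend="gloo")  # the real training scripts would use "nccl"

if dist.get_rank() == 0:
    time.sleep(120)  # placeholder for a long validation loop over a large dataset

# With NCCL and a sufficiently large validation set, this wait exceeds the
# collective timeout and the whole job is killed mid-epoch.
dist.barrier()
dist.destroy_process_group()
```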
Proposed Fix:
I was able to resolve this by replacing SequentialSampler with DistributedSampler for the validation set when using multiple GPUs. This ensures each GPU gets a distinct shard of the validation data, avoids the deadlock, and keeps all processes in sync.
Additionally, instead of calculating validation loss and metrics on just one device, they can be computed independently on all devices and then aggregated across processes (e.g., using all_gather or reduce) before logging. This would distribute the load and improve robustness in multi-GPU setups.
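A minimal sketch of that aggregation, assuming the process group is already initialized; the metric names and the loss/accuracy bookkeeping are placeholders rather than doctr's actual API:

```python
import torch
import torch.distributed as dist


def reduce_val_metrics(val_loss: float, correct: int, total: int, device) -> tuple[float, float]:
    """Aggregate per-rank validation results so every process sees the same numbers.

    `val_loss` is assumed to be the summed loss over this rank's shard, and
    `correct` / `total` the per-rank match counts - placeholder names, not doctr's API.
    """
    stats = torch.tensor([val_loss, float(correct), float(total)], device=device)
    # Sum the per-rank partial results across all processes
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    summed_loss, summed_correct, summed_total = stats.tolist()
    return summed_loss / summed_total, summed_correct / summed_total
```

Each rank would call this after iterating its own DistributedSampler shard, so rank 0 can log global numbers without the other ranks sitting idle.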
Suggestions:
- Use DistributedSampler for the validation dataloader whenever world_size > 1, keeping SequentialSampler for single-GPU runs.
- Compute validation loss and metrics on every rank and aggregate them across processes before logging, instead of relying on rank 0 alone.
Thanks again for the great work! I'd be happy to help contribute a PR if needed.
Best regards,
Anshuman