Potential Deadlock During Validation with SequentialSampler in Multi-GPU Training #1960
sneakybatman started this conversation in Ideas
-
Hey 👋, thanks! Sure, feel free to open a PR 👍 I think this would need an update to all 4 training scripts that support DDP (recognition - TF / PyTorch | detection - TF / PyTorch). In the end it should work with both single-GPU and multi-GPU, where:

```python
import os

# Detect distributed setup
# the WORLD_SIZE variable is set by torchrun
world_size = int(os.environ.get("WORLD_SIZE", 1))
distributed = world_size > 1
```
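For illustration, a minimal sketch (not doctr's actual code) of how that flag could then pick the validation sampler so the same script covers both cases. The `TensorDataset` is a dummy stand-in for the real validation set, and when `distributed` is true the process group is assumed to have been initialized earlier in the script:

```python
import os

import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# WORLD_SIZE is set by torchrun; a plain single-process run keeps the default of 1
world_size = int(os.environ.get("WORLD_SIZE", 1))
distributed = world_size > 1

# Dummy tensors standing in for the real validation set used by the training scripts
val_set = TensorDataset(torch.randn(64, 3, 32, 128), torch.zeros(64, dtype=torch.long))

# Shard the validation data across ranks only when running distributed;
# single-GPU runs keep the original SequentialSampler behaviour.
val_sampler = (
    DistributedSampler(val_set, shuffle=False, drop_last=False)
    if distributed
    else SequentialSampler(val_set)
)
val_loader = DataLoader(val_set, batch_size=16, sampler=val_sampler)
```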
-
Hello,
First of all, thank you for this amazing library — I've been using it extensively for training and fine-tuning models on custom datasets.
Recently, while working with large-scale datasets, I encountered a deadlock issue during the validation step when using multiple GPUs. After some investigation, I found that the root cause lies in the usage of SequentialSampler for the validation dataloader.
The issue arises because SequentialSampler sends the same validation data to all GPUs. However, the validation loss and metrics are only computed on a single rank (typically rank 0). As a result, the other GPUs remain idle and wait indefinitely, leading to a timeout and failure during training — especially when the validation dataset is large.
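For illustration only, a self-contained sketch of that failure pattern (not doctr's code; run with e.g. `torchrun --nproc_per_node=2`, and the sleep stands in for a long validation loop on rank 0):

```python
import time

import torch.distributed as dist

# Rank 0 stands in for the rank that evaluates the whole validation set,
# while every other rank reaches the next collective immediately and waits.
dist.init_process_group(backend="gloo")  # the real training scripts would use "nccl"

if dist.get_rank() == 0:
    time.sleep(120)  # placeholder for a long validation loop over a large dataset

# With NCCL and a sufficiently large validation set, this wait exceeds the
# collective timeout and the whole job is killed mid-epoch.
dist.barrier()
dist.destroy_process_group()
```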
Proposed Fix:
I was able to resolve this by replacing SequentialSampler with DistributedSampler for the validation set when using multiple GPUs. This ensures each GPU gets a distinct shard of the validation data, avoids the deadlock, and keeps all processes in sync.
Additionally, instead of calculating validation loss and metrics on just one device, they can be computed independently on all devices and then aggregated across processes (e.g., using all_gather or reduce) before logging. This would distribute the load and improve robustness in multi-GPU setups.
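A minimal sketch of that aggregation, assuming the process group is already initialized; the metric names and the loss/accuracy bookkeeping are placeholders rather than doctr's actual API:

```python
import torch
import torch.distributed as dist


def reduce_val_metrics(val_loss: float, correct: int, total: int, device) -> tuple[float, float]:
    """Aggregate per-rank validation results so every process sees the same numbers.

    `val_loss` is assumed to be the summed loss over this rank's shard, and
    `correct` / `total` the per-rank match counts - placeholder names, not doctr's API.
    """
    stats = torch.tensor([val_loss, float(correct), float(total)], device=device)
    # Sum the per-rank partial results across all processes
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    summed_loss, summed_correct, summed_total = stats.tolist()
    return summed_loss / summed_total, summed_correct / summed_total
```

Each rank would call this after iterating its own DistributedSampler shard, so rank 0 can log global numbers without the other ranks sitting idle.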
Suggestions:
- Use DistributedSampler for the validation dataloader whenever world_size > 1, keeping SequentialSampler for single-GPU runs.
- Compute validation loss and metrics on every rank and aggregate them across processes before logging, instead of relying on rank 0 alone.
Thanks again for the great work! I'd be happy to help contribute a PR if needed.
Best regards,
Anshuman