DistributedSampler internal asserts if len(dataset) * 2 < number of GPUs #45324
Labels
high priority
oncall: distributed
triaged
🐛 Bug
A `DistributedSampler` crash has been reported in several threads (links below), but its cause had not been identified. The crash is due to an implementation error in `DistributedSampler`: it fails whenever `len(dataset) * 2 < num_gpus`. L101 of the following code snippet is the problematic part, as the slice `self.total_size - len(indices)` overflows `indices` in that case.

pytorch/torch/utils/data/distributed.py
Lines 99 to 105 in 03dde4c
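To illustrate why the padding slice falls short, here is a simplified sketch of the padding logic in `DistributedSampler.__iter__` (names mirror the PyTorch source, but this is an illustration, not the exact code), using the sizes from the report (4 samples, 9 GPUs):

```python
import math

dataset_size = 4
num_replicas = 9  # e.g. 9 GPUs

# Each replica draws ceil(dataset_size / num_replicas) samples.
num_samples = math.ceil(dataset_size / num_replicas)  # ceil(4 / 9) = 1
total_size = num_samples * num_replicas               # 1 * 9 = 9

indices = list(range(dataset_size))                   # [0, 1, 2, 3]
padding = total_size - len(indices)                   # 9 - 4 = 5
# The slice below can yield at most len(indices) elements, so when
# padding > len(indices) (i.e. total_size > 2 * len(indices)), only
# 4 elements are appended instead of the 5 needed:
indices += indices[:padding]
print(len(indices))  # 8, not 9 -> `assert len(indices) == self.total_size` fires
```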
To Reproduce
If `dataset_size * 2 < num_gpus` (e.g. `dataset_size = 4` and `num_gpus = 9`), the following error occurs.
cc @ezyang @gchanan @zou3519 @ssnl @VitalyFedyunin @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski