
RandomSampler / DistributedSampler does not seem really random #64986

Open
Vermeille opened this issue Sep 14, 2021 · 5 comments
Labels
module: dataloader (Related to torch.utils.data.DataLoader and Sampler), oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Comments

@Vermeille

Vermeille commented Sep 14, 2021

🐛 Bug

Training a net with DataLoader(..., shuffle=True) produces weird sawtooth artifacts in both loss and accuracy (train & test), indicating that the end of an epoch kind of looks like the beginning of the next one. The dataset needs to be big enough to observe this bias.

https://discuss.pytorch.org/t/observing-strange-loss-jumps-between-epochs/64066/15

This looks bad, as it clearly biases gradients and might interfere with momentum-based optimizers.

To Reproduce

I can reproduce this on ImageNet and can provide the code, but it just boils down to shuffle=True or RandomSampler without replacement. It does not happen with some other datasets, AFAIK.
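For reference, a minimal sketch of the kind of training loop where the sawtooth shows up; the dataset path, model, and hyperparameters below are placeholders, not the exact setup used here:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# Placeholder dataset; the actual runs were on ImageNet.
dataset = datasets.ImageFolder(
    "/path/to/imagenet/train",
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)

# shuffle=True internally builds a RandomSampler without replacement.
loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=8)

model = models.resnet18(num_classes=len(dataset.classes)).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(10):
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        opt.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        opt.step()
        # Plotting per-iteration loss shows a sawtooth aligned with epoch boundaries.
```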

Expected behavior

There shouldn't be any sawtooth shape like that.

Environment

  • PyTorch Version: latest stable
  • OS (e.g., Linux): Ubuntu
  • How you installed PyTorch (conda, pip, source): pip
  • Python version: 3.8
  • CUDA/cuDNN version: CUDA 11.0 and 11.1
  • GPU models and configuration: RTX 2080 (4x and 2x)

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @gcramer23 @ssnl @VitalyFedyunin @ejguan @cbalioglu

@facebook-github-bot added the oncall: distributed label Sep 14, 2021
@rohan-varma added the module: dataloader label Sep 20, 2021
@rohan-varma
Member

For DistributedSampler, have you made sure to call set_epoch at the beginning of each epoch to ensure shuffling is appropriately randomized? Details are here: https://pytorch.org/docs/stable/data.html.
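For illustration, a minimal sketch of the per-epoch call; the toy dataset and loop body are placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Assumes the process group is already initialized (e.g., launched via torchrun).
dataset = TensorDataset(torch.arange(1000).float())
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(5):
    # Without this call, every epoch reuses the epoch-0 permutation,
    # so each rank sees its samples in the same order every epoch.
    sampler.set_epoch(epoch)
    for (batch,) in loader:
        pass  # training step goes here
```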

cc @VitalyFedyunin @ejguan regarding RandomSampler/general data loader question.

@Vermeille
Author

That was indeed the issue with DistributedSampler. However, the issue (if any?) with RandomSampler remains. I've run many experiments, replacing torch.randperm() with Python's random.shuffle(), and got the same surprising results.

I have no idea why or how those sawtooth shapes happen, but they do happen, and that's unwanted in applications such as GANs that are sensitive to sudden changes in gradients.

@ejguan
Contributor

ejguan commented Sep 23, 2021

The logic in RandomSampler is really straightforward. Without replacement, RandomSampler simply uses torch.randperm to shuffle the indices:

```python
yield from torch.randperm(n, generator=generator).tolist()
```

And the generator used by torch.randperm is created with a different seed per epoch:

```python
if self.generator is None:
    generator = torch.Generator()
    generator.manual_seed(int(torch.empty((), dtype=torch.int64).random_().item()))
```

If it's something related to RandomSampler, I think a statistical test on the result of torch.randperm is needed.
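As a rough illustration of such a test, here is a sketch of a chi-square check on where each index lands in the permutation; the permutation length, trial count, and threshold interpretation are arbitrary choices, not an established protocol from this thread:

```python
import torch

n = 1000        # permutation length
trials = 2000   # number of permutations to sample
counts = torch.zeros(n, n)  # counts[i, j]: how often index i lands at position j

for _ in range(trials):
    perm = torch.randperm(n)
    counts[perm, torch.arange(n)] += 1

# Under a uniform shuffle, every cell has expectation trials / n.
expected = trials / n
chi2 = ((counts - expected) ** 2 / expected).sum().item()
dof = (n - 1) * (n - 1)
# For large dof, chi2 should be close to dof; a large ratio suggests positional bias.
print(f"chi2={chi2:.1f}, dof={dof}, ratio={chi2 / dof:.3f}")
```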

@ejguan
Contributor

ejguan commented Sep 23, 2021

> That was indeed the issue with DistributedSampler. However, the issue (if any?) with RandomSampler remains. I've run many experiments, replacing torch.randperm() with Python's random.shuffle(), and got the same surprising results.

For PyTorch and Python's random module, the CPU generators use the same Mersenne Twister algorithm. Could you try a NumPy generator to test your result? The new-style numpy.random.default_rng uses a different algorithm (PCG64).
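For example, a hedged sketch of a drop-in sampler that shuffles with NumPy's PCG64-based generator instead of torch.randperm; the class name and seed handling are just for illustration:

```python
import numpy as np
from torch.utils.data import Sampler


class NumpyRandomSampler(Sampler):
    """Shuffles indices with numpy.random.default_rng (PCG64) instead of torch.randperm."""

    def __init__(self, data_source, seed=None):
        self.data_source = data_source
        self.rng = np.random.default_rng(seed)

    def __iter__(self):
        # A fresh permutation every epoch, drawn from PCG64 rather than Mersenne Twister.
        yield from self.rng.permutation(len(self.data_source)).tolist()

    def __len__(self):
        return len(self.data_source)
```

It could be passed as DataLoader(dataset, batch_size=..., sampler=NumpyRandomSampler(dataset)) in place of shuffle=True; if the sawtooth persists, the shuffling algorithm itself is probably not the cause.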

@Vermeille
Author

I can, but not immediately; it will take hours of compute. I need that compute for my job and I'm pretty busy these days. I'll update you with further results when I'm able to run those experiments.
