
RandomSampler / DistributedSampler does not seem really random #64986

Open
Vermeille opened this issue Sep 14, 2021 · 5 comments
Labels
module: dataloader (Related to torch.utils.data.DataLoader and Sampler), oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Comments

@Vermeille

Vermeille commented Sep 14, 2021

🐛 Bug

Training a net with DataLoader(..., shuffle=True) produces weird sawtooth artifacts in both loss and accuracy (train & test), indicating that the end of an epoch kind of looks like the beginning of the next one. The dataset needs to be big enough to observe this bias.

https://discuss.pytorch.org/t/observing-strange-loss-jumps-between-epochs/64066/15

This looks bad, as it clearly biases gradients and might interfere with momentum-based optimizers.

To Reproduce

I can reproduce this on ImageNet and can provide the code, but it just boils down to shuffle=True or RandomSampler without replacement. It does not happen with some other datasets, AFAIK.
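For reference, a minimal sketch of the kind of training loop where the sawtooth shows up; the dataset path, model, and hyperparameters below are placeholders, not the exact setup used here:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# Placeholder dataset; the actual runs were on ImageNet.
dataset = datasets.ImageFolder(
    "/path/to/imagenet/train",
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)

# shuffle=True internally builds a RandomSampler without replacement.
loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=8)

model = models.resnet18(num_classes=len(dataset.classes)).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(10):
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        opt.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        opt.step()
        # Plotting per-iteration loss shows a sawtooth aligned with epoch boundaries.
```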

Expected behavior

There shouldn't be any sawtooth shape like that.

Environment

  • PyTorch Version: latest stable
  • OS (e.g., Linux): Ubuntu
  • How you installed PyTorch (conda, pip, source): pip
  • Python version: 3.8
  • CUDA/cuDNN version: CUDA 11.0 and 11.1
  • GPU models and configuration: RTX 2080 (4x and 2x)

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @gcramer23 @ssnl @VitalyFedyunin @ejguan @cbalioglu

@facebook-github-bot added the oncall: distributed label Sep 14, 2021
@rohan-varma added the module: dataloader label Sep 20, 2021
@rohan-varma
Member

For DistributedSampler, have you made sure to call set_epoch at the beginning of each epoch to ensure shuffling is appropriately randomized? Details are here: https://pytorch.org/docs/stable/data.html.
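For illustration, a minimal sketch of the per-epoch call; the toy dataset and loop body are placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Assumes the process group is already initialized (e.g., launched via torchrun).
dataset = TensorDataset(torch.arange(1000).float())
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(5):
    # Without this call, every epoch reuses the epoch-0 permutation,
    # so each rank sees its samples in the same order every epoch.
    sampler.set_epoch(epoch)
    for (batch,) in loader:
        pass  # training step goes here
```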

cc @VitalyFedyunin @ejguan regarding RandomSampler/general data loader question.

@Vermeille
Author

That was indeed the issue with DistributedSampler. However, the issue (if any?) with RandomSampler remains. I've run many experiments, replacing torch.randperm() with Python's random.shuffle(), and got the same surprising results.

I have no idea why or how those sawtooth shapes happen, but they do happen, and that's unwanted in applications such as GANs that are sensitive to sudden changes in gradients.

@ejguan
Contributor

ejguan commented Sep 23, 2021

The logic in RandomSampler is really straightforward. Without replacement, RandomSampler simply uses torch.randperm to shuffle the indices:

```python
yield from torch.randperm(n, generator=generator).tolist()
```

And the generator used by torch.randperm is created with a different seed per epoch:

```python
if self.generator is None:
    generator = torch.Generator()
    generator.manual_seed(int(torch.empty((), dtype=torch.int64).random_().item()))
```

If it's something related to RandomSampler, I think a statistical test on the result of torch.randperm is needed.
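As a rough illustration of such a test, here is a sketch of a chi-square check on where each index lands in the permutation; the permutation length, trial count, and threshold interpretation are arbitrary choices, not an established protocol from this thread:

```python
import torch

n = 1000        # permutation length
trials = 2000   # number of permutations to sample
counts = torch.zeros(n, n)  # counts[i, j]: how often index i lands at position j

for _ in range(trials):
    perm = torch.randperm(n)
    counts[perm, torch.arange(n)] += 1

# Under a uniform shuffle, every cell has expectation trials / n.
expected = trials / n
chi2 = ((counts - expected) ** 2 / expected).sum().item()
dof = (n - 1) * (n - 1)
# For large dof, chi2 should be close to dof; a large ratio suggests positional bias.
print(f"chi2={chi2:.1f}, dof={dof}, ratio={chi2 / dof:.3f}")
```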

@ejguan
Contributor

ejguan commented Sep 23, 2021

> That was indeed the issue with DistributedSampler. However, the issue (if any?) with RandomSampler remains. I've run many experiments, replacing torch.randperm() with Python's random.shuffle(), and got the same surprising results.

For PyTorch and Python's random module, the CPU generators use the same Mersenne Twister algorithm. Could you try a NumPy generator to test your result? The new-style numpy.random.default_rng uses a different algorithm (PCG64).
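For example, a hedged sketch of a drop-in sampler that shuffles with NumPy's PCG64-based generator instead of torch.randperm; the class name and seed handling are just for illustration:

```python
import numpy as np
from torch.utils.data import Sampler


class NumpyRandomSampler(Sampler):
    """Shuffles indices with numpy.random.default_rng (PCG64) instead of torch.randperm."""

    def __init__(self, data_source, seed=None):
        self.data_source = data_source
        self.rng = np.random.default_rng(seed)

    def __iter__(self):
        # A fresh permutation every epoch, drawn from PCG64 rather than Mersenne Twister.
        yield from self.rng.permutation(len(self.data_source)).tolist()

    def __len__(self):
        return len(self.data_source)
```

It could be passed as DataLoader(dataset, batch_size=..., sampler=NumpyRandomSampler(dataset)) in place of shuffle=True; if the sawtooth persists, the shuffling algorithm itself is probably not the cause.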

@Vermeille
Author

I can, but not immediately; it will take hours of compute. I need that compute for my job and I'm pretty busy these days. I'll update you with further results when I'm able to run those experiments.
