Dataset not shuffled (when not in DDP mode) #5622

Closed
2 tasks done
werner-duvaud opened this issue Nov 12, 2021 · 2 comments · Fixed by #5623
Labels
bug Something isn't working

Comments

@werner-duvaud (Contributor)

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training

Bug

Hi,

Thanks for the great implementation.

I think there is a bug: the shuffling of the DataLoader during training (when not in DDP mode) is not working. The dataset is sampled in index order, from 0 to the dataset length.

See the Minimal Reproducible Example: printing the index in the __getitem__ method of the LoadImagesAndLabels(Dataset) class during training shows the index going from 0 to the dataset length at each epoch.
Because of that, samples are presented to the network in the sorted order of their filenames.

Expected behaviour:
Indices are sampled randomly.

N.B.: The mosaic hyperparameter adds some ad hoc randomness to the images returned by the dataloader, but it can be disabled.

Environment

  • YOLOv5 master branch (commit def7a0f)
  • Ubuntu 20
  • Single GPU (RTX 3060)

Minimal Reproducible Example

git clone https://github.com/ultralytics/yolov5.git
# Add print(index) at this line https://github.com/ultralytics/yolov5/blob/def7a0fd19c1629903c3b073b4df265407719a07/utils/datasets.py#L553
python train.py --img 64 --batch 4 --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1

It will print the index during the epoch, going from 0 to the dataset length.
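
The same symptom can be reproduced outside YOLOv5 with a toy map-style dataset (a standalone sketch, not YOLOv5 code; IndexDataset is made up for illustration): the default DataLoader iterates indices in order unless shuffle=True is passed.

import torch
from torch.utils.data import DataLoader, Dataset

class IndexDataset(Dataset):
    # Toy dataset that simply returns the index it is asked for
    def __len__(self):
        return 8
    def __getitem__(self, index):
        return index

# Default DataLoader (shuffle defaults to False): indices come back as 0..7, in order
print([int(i) for b in DataLoader(IndexDataset(), batch_size=4) for i in b])

# With shuffle=True a RandomSampler is used and the order is permuted each epoch
print([int(i) for b in DataLoader(IndexDataset(), batch_size=4, shuffle=True) for i in b])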

Additional

Attempt to explain the bug:

It happens here:

yolov5/utils/datasets.py, lines 111 to 120 at def7a0f:

sampler = torch.utils.data.distributed.DistributedSampler(dataset) if rank != -1 else None
loader = torch.utils.data.DataLoader if image_weights else InfiniteDataLoader
# Use torch.utils.data.DataLoader() if dataset.properties will update during training else InfiniteDataLoader()
dataloader = loader(dataset,
                    batch_size=batch_size,
                    num_workers=nw,
                    sampler=sampler,
                    pin_memory=True,
                    collate_fn=LoadImagesAndLabels.collate_fn4 if quad else LoadImagesAndLabels.collate_fn)
return dataloader, dataset

We are using torch.utils.data.DataLoader, which according to the documentation has a shuffle argument that defaults to False (not a great PyTorch default).
So to have our dataset shuffled, we want to set shuffle to True. But in the distributed case we pass a custom sampler to the sampler argument, and the documentation of the sampler argument says: "If specified, shuffle must not be specified."
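
That constraint can be checked directly (a standalone sketch, not YOLOv5 code): passing an explicit sampler together with shuffle=True makes the DataLoader constructor raise a ValueError.

import torch
from torch.utils.data import DataLoader, TensorDataset, SequentialSampler

ds = TensorDataset(torch.arange(8))
try:
    # An explicit sampler together with shuffle=True is rejected
    DataLoader(ds, sampler=SequentialSampler(ds), shuffle=True)
except ValueError as err:
    print(err)  # sampler is mutually exclusive with shuffle

# shuffle=(sampler is None) sidesteps the conflict in both the DDP and non-DDP cases.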

So a solution, similar to the example at the end of the PyTorch documentation here:

sampler = DistributedSampler(dataset) if is_distributed else None
loader = DataLoader(dataset, shuffle=(sampler is None),
                    sampler=sampler)
for epoch in range(start_epoch, n_epochs):
    if is_distributed:
        sampler.set_epoch(epoch)
    train(loader)

is to pass shuffle=(sampler is None) to the dataloader.
Adding a shuffle argument to create_dataloader might be useful if we want to keep the current behaviour for the val task, as sketched below.
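
For concreteness, a simplified sketch of what the patched helper could look like (generic: it drops YOLOv5-specific parts such as InfiniteDataLoader and the custom collate_fn, and the shuffle parameter on create_dataloader is the suggestion above, not the current API):

import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def create_dataloader(dataset, batch_size, num_workers, rank=-1, shuffle=True):
    # DDP mode: DistributedSampler shuffles by default (remember to call set_epoch each epoch).
    # Non-DDP mode: no sampler, so fall back to the DataLoader's own shuffling.
    sampler = DistributedSampler(dataset) if rank != -1 else None
    dataloader = DataLoader(dataset,
                            batch_size=batch_size,
                            num_workers=num_workers,
                            sampler=sampler,
                            shuffle=shuffle and sampler is None,  # the proposed one-line change
                            pin_memory=True)
    return dataloader, dataset

The val task could then call create_dataloader(..., shuffle=False) to keep its current (unshuffled) behaviour.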

N.B.: It is also confusing that DistributedSampler has shuffle defaulting to True but DataLoader does not.

I have submitted the one-line PR (#5623).

I know that shuffling the training dataset (or not) can alter the results; I don't know whether the YOLOv5 v6 results used a shuffled dataset or not.

Maybe not shuffling the dataset is the desired behaviour, but then the DistributedSampler should not shuffle either, and adding a CLI option could be useful.

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
werner-duvaud added the bug label on Nov 12, 2021

github-actions bot commented Nov 12, 2021

👋 Hello @werner-duvaud, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python>=3.6.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

$ git clone https://github.com/ultralytics/yolov5
$ cd yolov5
$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher (Member)

@werner-duvaud thanks for the detailed analysis. Yes, I agree we probably want to shuffle by default (we already have enough command-line arguments, no reason to add more here).

The complication stems from the InfiniteDataLoader in place and the nested shuffle arguments, as you mentioned. The impact on default training is unclear, as the follow-on mosaic images are not just shuffled but randomly selected from the dataset.

Thank you for the PR! Since this impacts training we should train a model before and after the PR to quantify the effects before we can merge.

@glenn-jocher glenn-jocher linked a pull request Nov 13, 2021 that will close this issue