Dataset not shuffled (when not in DDP mode) #5622

Closed
2 tasks done
werner-duvaud opened this issue Nov 12, 2021 · 2 comments · Fixed by #5623
Labels
bug Something isn't working

Comments

@werner-duvaud (Contributor)

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training

Bug

Hi,

Thanks for the great implementation.

I think there is a bug: the shuffling of the DataLoader during training (when not in DDP mode) is not working. The dataset is sampled in index order, from 0 to the dataset length.

See the Minimal Reproducible Example: printing the index in the __getitem__ method of the LoadImagesAndLabels(Dataset) class during training shows the index going from 0 to the dataset length at each epoch.
Because of that, samples are presented to the network in the sorted order of their filenames.

Expected behaviour:
Indices are sampled randomly.

N.B.: The mosaic hyperparameter adds some ad hoc randomness to the images returned by the dataloader, but it can be disabled.

Environment

  • YOLOv5 master branch (commit def7a0f)
  • Ubuntu 20
  • Single GPU (RTX 3060)

Minimal Reproducible Example

git clone https://github.com/ultralytics/yolov5.git
# Add print(index) at this line https://github.com/ultralytics/yolov5/blob/def7a0fd19c1629903c3b073b4df265407719a07/utils/datasets.py#L553
python train.py --img 64 --batch 4 --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1

It will print the index during the epoch, going from 0 to the dataset length.
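
The same symptom can be reproduced outside YOLOv5 with a toy map-style dataset (a standalone sketch, not YOLOv5 code; IndexDataset is made up for illustration): the default DataLoader iterates indices in order unless shuffle=True is passed.

import torch
from torch.utils.data import DataLoader, Dataset

class IndexDataset(Dataset):
    # Toy dataset that simply returns the index it is asked for
    def __len__(self):
        return 8
    def __getitem__(self, index):
        return index

# Default DataLoader (shuffle defaults to False): indices come back as 0..7, in order
print([int(i) for b in DataLoader(IndexDataset(), batch_size=4) for i in b])

# With shuffle=True a RandomSampler is used and the order is permuted each epoch
print([int(i) for b in DataLoader(IndexDataset(), batch_size=4, shuffle=True) for i in b])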

Additional

Attempt to explain the bug:

It happens here:

yolov5/utils/datasets.py, lines 111 to 120 at def7a0f:

sampler = torch.utils.data.distributed.DistributedSampler(dataset) if rank != -1 else None
loader = torch.utils.data.DataLoader if image_weights else InfiniteDataLoader
# Use torch.utils.data.DataLoader() if dataset.properties will update during training else InfiniteDataLoader()
dataloader = loader(dataset,
                    batch_size=batch_size,
                    num_workers=nw,
                    sampler=sampler,
                    pin_memory=True,
                    collate_fn=LoadImagesAndLabels.collate_fn4 if quad else LoadImagesAndLabels.collate_fn)
return dataloader, dataset

We are using torch.utils.data.DataLoader, which according to the documentation has a shuffle argument that defaults to False (not a great PyTorch default).
So to have our dataset shuffled, we want to set shuffle to True. But in the distributed case we pass a custom sampler to the sampler argument, and the documentation of the sampler argument says: "If specified, shuffle must not be specified."
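
That constraint can be checked directly (a standalone sketch, not YOLOv5 code): passing an explicit sampler together with shuffle=True makes the DataLoader constructor raise a ValueError.

import torch
from torch.utils.data import DataLoader, TensorDataset, SequentialSampler

ds = TensorDataset(torch.arange(8))
try:
    # An explicit sampler together with shuffle=True is rejected
    DataLoader(ds, sampler=SequentialSampler(ds), shuffle=True)
except ValueError as err:
    print(err)  # sampler is mutually exclusive with shuffle

# shuffle=(sampler is None) sidesteps the conflict in both the DDP and non-DDP cases.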

So a solution, similar to the example at the end of the PyTorch documentation here:

sampler = DistributedSampler(dataset) if is_distributed else None
loader = DataLoader(dataset, shuffle=(sampler is None),
                    sampler=sampler)
for epoch in range(start_epoch, n_epochs):
    if is_distributed:
        sampler.set_epoch(epoch)
    train(loader)

is to pass shuffle=(sampler is None) to the dataloader.
Adding a shuffle argument to create_dataloader might be useful if we want to keep the current behaviour for the val task, as sketched below.
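
For concreteness, a simplified sketch of what the patched helper could look like (generic: it drops YOLOv5-specific parts such as InfiniteDataLoader and the custom collate_fn, and the shuffle parameter on create_dataloader is the suggestion above, not the current API):

import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def create_dataloader(dataset, batch_size, num_workers, rank=-1, shuffle=True):
    # DDP mode: DistributedSampler shuffles by default (remember to call set_epoch each epoch).
    # Non-DDP mode: no sampler, so fall back to the DataLoader's own shuffling.
    sampler = DistributedSampler(dataset) if rank != -1 else None
    dataloader = DataLoader(dataset,
                            batch_size=batch_size,
                            num_workers=num_workers,
                            sampler=sampler,
                            shuffle=shuffle and sampler is None,  # the proposed one-line change
                            pin_memory=True)
    return dataloader, dataset

The val task could then call create_dataloader(..., shuffle=False) to keep its current (unshuffled) behaviour.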

N.B.: It is also confusing that DistributedSampler has shuffle defaulting to True but DataLoader does not.

I have submitted the one-line PR (#5623).

I know that shuffling the training dataset (or not) can alter the results; I don't know whether the YOLOv5 v6 results used a shuffled dataset or not.

Maybe not shuffling the dataset is the desired behaviour, but then the DistributedSampler should not shuffle either, and adding a CLI option could be useful.

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
werner-duvaud added the bug label on Nov 12, 2021

github-actions bot commented Nov 12, 2021

👋 Hello @werner-duvaud, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python>=3.6.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

$ git clone https://github.com/ultralytics/yolov5
$ cd yolov5
$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher (Member)

@werner-duvaud thanks for the detailed analysis. Yes, I agree we probably want to shuffle by default (we already have enough command-line arguments, no reason to add more here).

The complication stems from the InfiniteDataLoader in place and the nested shuffle arguments, as you mentioned. The impact on default training is unclear, as the follow-on mosaic images are not just shuffled but randomly selected from the dataset.

Thank you for the PR! Since this impacts training we should train a model before and after the PR to quantify the effects before we can merge.

@glenn-jocher glenn-jocher linked a pull request Nov 13, 2021 that will close this issue