
Can you add a num_workers option for the Dataloader? #141

Closed
Jaiczay opened this issue Mar 20, 2019 · 13 comments

Jaiczay commented Mar 20, 2019

Can you please add a num_workers option for the Dataloader to speed up the data loading process?

I tried it myself with this tutorial https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel, but I couldn't get it to work.

I don't really get what the part at dataset.py Line 210 - 212 does.

    img_all = np.stack(img_all)[:, :, :, ::-1].transpose(0, 3, 1, 2)  # BGR to RGB and cv2 to pytorch
    img_all = np.ascontiguousarray(img_all, dtype=np.float32)
    img_all /= 255.0
glenn-jocher (Member) commented Mar 20, 2019

@Jaiczay if I had a dime for every time someone didn't know what some section does... I don't have time to be a teacher, and line 210 is already commented with an explanation, so I've simply added comments to the next two lines to avert the same question from someone else down the line.

You'd have to convert this implementation to use the PyTorch dataloader in order to access the num_workers argument. We ditched it in the past because it was too slow. I don't know if this is still the case or not, but what causes you to believe that the data loading process is a chokepoint?

Do you have profiler results to show?

Jaiczay (Author) commented Mar 20, 2019

Because my GPU runs at 30-40% utilization and only one CPU thread runs at 95%.

glenn-jocher (Member) commented Mar 20, 2019

Ah this is strange. Are you using a single GPU or multiple? Can you try something like https://github.com/spyder-ide/spyder-line-profiler to figure out exactly which areas are causing the slowdown?
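
(For a rough check without a full profiler, simply timing the data loading loop on its own can show whether it keeps pace with the GPU. A minimal, hypothetical sketch; `time_batches` is made up for illustration and not part of the repo:)

    import time

    def time_batches(loader, n_batches=20):
        # Time the data loading loop alone (no model forward/backward pass).
        start = time.time()
        count = 0
        for batch in loader:
            count += 1
            if count >= n_batches:
                break
        ms = (time.time() - start) / max(count, 1) * 1000
        print('%.1f ms per batch over %d batches' % (ms, count))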

glenn-jocher (Member) commented Mar 20, 2019

@Jaiczay I don't have access to a local GPU, but if I run one batch of default COCO training on my MacBook Pro, I see the dataloader takes about 240 ms. If that's representative, then yes, you might be correct that the dataloader is a chokepoint in the training process. As a reference, on a GCP instance with one P100 a batch nominally takes about 600 ms to process.
[Screenshot 2019-03-20 at 19 57 00]

If I dig deeper into the dataloader, it seems the slowest part of the process is simply loading the jpegs (which are compressed, of course, hence the slow speed). There's not much to be done there unless you were to decompress all the jpegs beforehand (which might be a good idea if you plan to do lots of training). np.loadtxt() might also be replaced with some faster code. Ok, so I'll look into replacing np.loadtxt() and multithreading the dataloader. I should have some answers in the next day or two.
[Screenshot 2019-03-20 at 20 00 16]

glenn-jocher (Member) commented Mar 20, 2019

Good news. I was able to replace np.loadtxt() with plain Python code, which reduced the label loading time from 19 ms to 7 ms. This shaves 12 ms off the 240 ms batch time (about 5% faster). This update is now in commit 9885903.
[Screenshot 2019-03-20 at 20 34 59]

I'll try multithreading the dataloader next, though this will surely take longer to complete.
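
(For context, the idea is just to parse the small YOLO-format label .txt files, one `class x_center y_center width height` row per object, with plain Python instead of np.loadtxt(), which carries noticeable overhead on tiny files. A minimal sketch of that kind of replacement, illustrative only and not the exact code from the commit:)

    import numpy as np

    def load_labels(path):
        # Read a YOLO-format label file: one "cls x y w h" row per object.
        # Plain string splitting avoids np.loadtxt() overhead on tiny files.
        with open(path, 'r') as f:
            rows = [line.split() for line in f.read().splitlines() if line]
        if not rows:
            return np.zeros((0, 5), dtype=np.float32)
        return np.array(rows, dtype=np.float32)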

Jaiczay (Author) commented Mar 20, 2019

Thanks for the quick answer! I got it running with the PyTorch dataloader, but the loss values are all NaN. At least I don't get an error any more, but I will try out your fix first before fixing that.
Overall it's not that important, I just wanted to get a little bit more out of my new GPU.

glenn-jocher self-assigned this Mar 21, 2019
glenn-jocher (Member) commented Mar 21, 2019

@Jaiczay I've got wonderful news. I re-added support for the PyTorch DataLoader, including the num_workers argument, and tested data loading speed in various configurations, with excellent speed improvements observed. Updates are in 70fe220.

IMPORTANT: Note that cv2.setNumThreads(0) must be called when using num_workers > 0 to prevent OpenCV from trying to multithread on its own. train.py now does this automatically:

yolov3/train.py, Lines 44 to 48 in 0fb2653:

    # Dataloader
    if num_workers > 0:
        cv2.setNumThreads(0)  # to prevent OpenCV from multithreading
    dataset = LoadImagesAndLabels(train_path, img_size=img_size, augment=True)
    dataloader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)

https://support.apple.com/kb/SP776?locale=en_US&viewlocale=en_US
Machine type: 2018 MacBook Pro (6 physical CPU cores, 12 vCPUs, 16 GB memory)
CPU platform: 2.2GHz 6-core Intel Core i7, Turbo Boost up to 4.1GHz, with 9MB shared L3 cache
GPUs: None
HDD: 256 GB SSD

| num_workers | cv2.setNumThreads(0) | DataLoader speed (ms/batch) |
| ----------- | -------------------- | --------------------------- |
| 0           | False                | 206 ms (this repo default)  |
| 0           | True                 | 291 ms                      |
| 1           | True                 | 252 ms                      |
| 2           | True                 | 131 ms                      |
| 4           | True                 | 75 ms                       |
| 6           | True                 | 57 ms                       |
| 8           | True                 | 54 ms                       |
| 10          | True                 | 52 ms                       |
| 12          | True                 | 51 ms                       |
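
(For illustration, a sketch of the kind of loop that produces numbers like these; the function name and defaults are made up, not the exact benchmark script:)

    import time
    import cv2
    from torch.utils.data import DataLoader

    def benchmark(dataset, batch_size=16, n_batches=20):
        # Compare DataLoader speed for different worker counts.
        for num_workers in (0, 2, 4, 8):
            if num_workers > 0:
                cv2.setNumThreads(0)  # keep OpenCV from competing with the workers
            loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
            start = time.time()
            for i, batch in enumerate(loader):
                if i + 1 >= n_batches:
                    break
            ms = (time.time() - start) / n_batches * 1000
            print('num_workers=%2d: %.0f ms/batch' % (num_workers, ms))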

Jaiczay (Author) commented Mar 21, 2019

Wow, that's great!

Thank you btw for the awesome repo; this is by far the best PyTorch implementation of YOLOv3!

Jaiczay (Author) commented Mar 21, 2019

@glenn-jocher I think you need to update test.py as well, because when I continue training I suddenly get a mAP of 0.5, whereas before it was around 0.94.

glenn-jocher (Member) commented Mar 21, 2019

Hmmm, yes, I think there might be a problem in train.py, maybe in the target loading order. Since the batches are coming in asynchronously now, there may be some sort of issue in assigning targets to images, so your resumed training may be bringing your mAP toward zero eventually. I'll try to sort it out later today.

test.py currently works fine, for example with yolov3.weights. But yes, I should migrate it over to the new dataloader as well for faster speed.
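
(For reference, one common way to keep targets tied to the right images when a DataLoader assembles batches is a custom collate_fn that stamps each target row with its image index within the batch. A generic sketch of that pattern, not the repo's actual fix:)

    import torch

    def collate_fn(batch):
        # batch: list of (img, targets) pairs, where targets is an (n, 5)
        # tensor of [class, x, y, w, h] rows for that image.
        imgs, targets = zip(*batch)
        labelled = []
        for i, t in enumerate(targets):
            if t.numel():
                # prepend the image index so each row stays tied to its image
                idx = torch.full((t.shape[0], 1), i, dtype=t.dtype)
                labelled.append(torch.cat((idx, t), 1))
        imgs = torch.stack(imgs, 0)
        targets = torch.cat(labelled, 0) if labelled else torch.zeros((0, 6))
        return imgs, targets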

glenn-jocher (Member) commented Mar 21, 2019

The current workaround is not to use multi-GPU.

ydixon commented Mar 22, 2019

@glenn-jocher I haven't looked into the details of the update. I just wanted to point out that if you are using the DataLoader from the PyTorch library, the worker processes might mess up the random seeds. Say you have 4 workers: you might end up with the same augmentation in all 4 of them. It's also almost impossible to get deterministic behavior (if that's a concern) without modifying the DataLoader class or simply writing your own multiprocessing dataloader. If all this has already been considered, I guess it's all good :)
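
(For reference, a minimal sketch of one common workaround: giving each worker its own numpy seed via worker_init_fn. This assumes the augmentation uses numpy's global RNG:)

    import numpy as np
    from torch.utils.data import DataLoader

    def worker_init_fn(worker_id):
        # Each forked worker inherits the same numpy RNG state, so offset the
        # seed by the worker id to get different augmentations per worker.
        np.random.seed(np.random.get_state()[1][0] + worker_id)

    # e.g. DataLoader(dataset, batch_size=16, num_workers=4,
    #                 worker_init_fn=worker_init_fn)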

longxianlei commented:

> I don't really get what the part at dataset.py Line 210 - 212 does.
>
>     img_all = np.stack(img_all)[:, :, :, ::-1].transpose(0, 3, 1, 2)  # BGR to RGB and cv2 to pytorch
>     img_all = np.ascontiguousarray(img_all, dtype=np.float32)
>     img_all /= 255.0

@Jaiczay These operations just convert BGR to RGB and rearrange the batch for PyTorch. The stacked array is batch_size x height x width x channel; the ::-1 reverses the channel dimension (BGR to RGB), and the transpose reorders it to batch_size x channel x height x width. np.ascontiguousarray then makes the result contiguous in memory as float32, and dividing by 255 scales the pixel values to [0, 1].
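
(A small self-contained demo of what those three lines do to the array shape, using dummy data rather than repo code:)

    import numpy as np

    # fake batch of four 416x416 BGR images as cv2 would return them (uint8, HWC)
    img_all = [np.random.randint(0, 255, (416, 416, 3), dtype=np.uint8) for _ in range(4)]

    batch = np.stack(img_all)                               # (4, 416, 416, 3)  N x H x W x C
    batch = batch[:, :, :, ::-1].transpose(0, 3, 1, 2)      # BGR -> RGB, then N x C x H x W
    batch = np.ascontiguousarray(batch, dtype=np.float32)   # contiguous memory, float32
    batch /= 255.0                                          # scale pixels to 0.0-1.0
    print(batch.shape)                                      # (4, 3, 416, 416)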
