
Can you add a num_workers option for the Dataloader? #141

Closed
Jaiczay opened this issue Mar 20, 2019 · 13 comments

Jaiczay commented Mar 20, 2019

Can you please add a num_workers option for the Dataloader to speed up the data loading process?

I tried it myself with this tutorial https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel, but I couldn't get it to work.

I don't really get what the part at dataset.py Line 210 - 212 does.

    img_all = np.stack(img_all)[:, :, :, ::-1].transpose(0, 3, 1, 2)  # BGR to RGB and cv2 to pytorch
    img_all = np.ascontiguousarray(img_all, dtype=np.float32)
    img_all /= 255.0
glenn-jocher (Member) commented Mar 20, 2019

@Jaiczay if I had a dime for every time someone didn't know what some section does... I don't have time to be a teacher, and line 210 is already commented with an explanation, so I've simply added comments to the next two lines to avert the same question from someone else down the line.

You'd have to convert this implementation to use the PyTorch dataloader in order to access the num_workers argument. We ditched it in the past because it was too slow. I don't know if this is still the case or not, but what causes you to believe that the data loading process is a chokepoint?

Do you have profiler results to show?

Jaiczay (Author) commented Mar 20, 2019

Because my GPU runs at 30-40% utilization and only one CPU thread runs at 95%.

glenn-jocher (Member) commented Mar 20, 2019

Ah this is strange. Are you using a single GPU or multiple? Can you try something like https://github.com/spyder-ide/spyder-line-profiler to figure out exactly which areas are causing the slowdown?
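
(For a rough check without a full profiler, simply timing the data loading loop on its own can show whether it keeps pace with the GPU. A minimal, hypothetical sketch; `time_batches` is made up for illustration and not part of the repo:)

    import time

    def time_batches(loader, n_batches=20):
        # Time the data loading loop alone (no model forward/backward pass).
        start = time.time()
        count = 0
        for batch in loader:
            count += 1
            if count >= n_batches:
                break
        ms = (time.time() - start) / max(count, 1) * 1000
        print('%.1f ms per batch over %d batches' % (ms, count))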

glenn-jocher (Member) commented Mar 20, 2019

@Jaiczay I don't have access to a local GPU, but if I run one batch of default COCO training on my MacBook Pro, I see the dataloader takes about 240 ms. If that's representative, then yes, you might be correct that the dataloader is a chokepoint in the training process. As a reference, on a GCP instance with one P100 a batch nominally takes about 600 ms to process.
[Screenshot 2019-03-20 at 19 57 00]

If I dig deeper into the dataloader, it seems the slowest part of the process is simply loading the jpegs (which are compressed, of course, hence the slow speed). There's not much to be done there unless you were to decompress all the jpegs beforehand (which might be a good idea if you plan to do lots of training). np.loadtxt() might also be replaced with some faster code. Ok, so I'll look into replacing np.loadtxt() and multithreading the dataloader. I should have some answers in the next day or two.
[Screenshot 2019-03-20 at 20 00 16]

glenn-jocher (Member) commented Mar 20, 2019

Good news. I was able to replace np.loadtxt() with plain Python code, which reduced the label loading time from 19 ms to 7 ms. This shaves 12 ms off the 240 ms batch time (about 5% faster). This update is now in commit 9885903.
[Screenshot 2019-03-20 at 20 34 59]

I'll try multithreading the dataloader next, though this will surely take longer to complete.
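
(For context, the idea is just to parse the small YOLO-format label .txt files, one `class x_center y_center width height` row per object, with plain Python instead of np.loadtxt(), which carries noticeable overhead on tiny files. A minimal sketch of that kind of replacement, illustrative only and not the exact code from the commit:)

    import numpy as np

    def load_labels(path):
        # Read a YOLO-format label file: one "cls x y w h" row per object.
        # Plain string splitting avoids np.loadtxt() overhead on tiny files.
        with open(path, 'r') as f:
            rows = [line.split() for line in f.read().splitlines() if line]
        if not rows:
            return np.zeros((0, 5), dtype=np.float32)
        return np.array(rows, dtype=np.float32)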

Jaiczay (Author) commented Mar 20, 2019

Thanks for the quick answer! I got it running with the PyTorch dataloader, but the loss values are all NaN. At least I don't get an error any more, but I will try out your fix first before fixing that.
Overall it's not that important, I just wanted to get a little bit more out of my new GPU.

glenn-jocher self-assigned this Mar 21, 2019
glenn-jocher (Member) commented Mar 21, 2019

@Jaiczay I've got wonderful news. I re-added support for the PyTorch DataLoader, including the num_workers argument, and tested data loading speed in various configurations, with excellent speed improvements observed. Updates are in 70fe220.

IMPORTANT: Note that cv2.setNumThreads(0) must be called when using num_workers > 0 to prevent OpenCV from trying to multithread on its own. train.py now does this automatically:

yolov3/train.py, Lines 44 to 48 in 0fb2653:

    # Dataloader
    if num_workers > 0:
        cv2.setNumThreads(0)  # to prevent OpenCV from multithreading
    dataset = LoadImagesAndLabels(train_path, img_size=img_size, augment=True)
    dataloader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)

https://support.apple.com/kb/SP776?locale=en_US&viewlocale=en_US
Machine type: 2018 MacBook Pro (6 physical CPU cores, 12 vCPUs, 16 GB memory)
CPU platform: 2.2GHz 6-core Intel Core i7, Turbo Boost up to 4.1GHz, with 9MB shared L3 cache
GPUs: None
HDD: 256 GB SSD

| num_workers | cv2.setNumThreads(0) | DataLoader speed (ms/batch) |
| ----------- | -------------------- | --------------------------- |
| 0           | False                | 206 ms (this repo default)  |
| 0           | True                 | 291 ms                      |
| 1           | True                 | 252 ms                      |
| 2           | True                 | 131 ms                      |
| 4           | True                 | 75 ms                       |
| 6           | True                 | 57 ms                       |
| 8           | True                 | 54 ms                       |
| 10          | True                 | 52 ms                       |
| 12          | True                 | 51 ms                       |
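
(For illustration, a sketch of the kind of loop that produces numbers like these; the function name and defaults are made up, not the exact benchmark script:)

    import time
    import cv2
    from torch.utils.data import DataLoader

    def benchmark(dataset, batch_size=16, n_batches=20):
        # Compare DataLoader speed for different worker counts.
        for num_workers in (0, 2, 4, 8):
            if num_workers > 0:
                cv2.setNumThreads(0)  # keep OpenCV from competing with the workers
            loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
            start = time.time()
            for i, batch in enumerate(loader):
                if i + 1 >= n_batches:
                    break
            ms = (time.time() - start) / n_batches * 1000
            print('num_workers=%2d: %.0f ms/batch' % (num_workers, ms))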

Jaiczay (Author) commented Mar 21, 2019

Wow, that's great!

Thank you btw for the awesome repo; this is by far the best PyTorch implementation of YOLOv3!

Jaiczay (Author) commented Mar 21, 2019

@glenn-jocher I think you need to update test.py as well, because when I continue training I suddenly get a mAP of 0.5, whereas before it was around 0.94.

glenn-jocher (Member) commented Mar 21, 2019

Hmmm, yes, I think there might be a problem in train.py, maybe in the target loading order. Since the batches are coming in asynchronously now, there may be some sort of issue in assigning targets to images, so your resumed training may be bringing your mAP toward zero eventually. I'll try to sort it out later today.

test.py currently works fine, for example with yolov3.weights. But yes, I should migrate it over to the new dataloader as well for faster speed.
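
(For reference, one common way to keep targets tied to the right images when a DataLoader assembles batches is a custom collate_fn that stamps each target row with its image index within the batch. A generic sketch of that pattern, not the repo's actual fix:)

    import torch

    def collate_fn(batch):
        # batch: list of (img, targets) pairs, where targets is an (n, 5)
        # tensor of [class, x, y, w, h] rows for that image.
        imgs, targets = zip(*batch)
        labelled = []
        for i, t in enumerate(targets):
            if t.numel():
                # prepend the image index so each row stays tied to its image
                idx = torch.full((t.shape[0], 1), i, dtype=t.dtype)
                labelled.append(torch.cat((idx, t), 1))
        imgs = torch.stack(imgs, 0)
        targets = torch.cat(labelled, 0) if labelled else torch.zeros((0, 6))
        return imgs, targets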

glenn-jocher (Member) commented Mar 21, 2019

The current workaround is not to use multi-GPU.

ydixon commented Mar 22, 2019

@glenn-jocher I haven't looked into the details of the update. I just wanted to point out that if you are using the DataLoader from the PyTorch library, the worker processes might mess up the random seeds. Say you have 4 workers: you might end up with the same augmentation in all 4 of them. It's also almost impossible to get deterministic behavior (if that's a concern) without modifying the DataLoader class or simply writing your own multiprocessing dataloader. If all this has already been considered, I guess it's all good :)
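
(For reference, a minimal sketch of one common workaround: giving each worker its own numpy seed via worker_init_fn. This assumes the augmentation uses numpy's global RNG:)

    import numpy as np
    from torch.utils.data import DataLoader

    def worker_init_fn(worker_id):
        # Each forked worker inherits the same numpy RNG state, so offset the
        # seed by the worker id to get different augmentations per worker.
        np.random.seed(np.random.get_state()[1][0] + worker_id)

    # e.g. DataLoader(dataset, batch_size=16, num_workers=4,
    #                 worker_init_fn=worker_init_fn)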

longxianlei commented:

> I don't really get what the part at dataset.py Line 210 - 212 does.
>
>     img_all = np.stack(img_all)[:, :, :, ::-1].transpose(0, 3, 1, 2)  # BGR to RGB and cv2 to pytorch
>     img_all = np.ascontiguousarray(img_all, dtype=np.float32)
>     img_all /= 255.0

@Jaiczay These operations just convert BGR to RGB and rearrange the batch for PyTorch. The stacked array is batch_size x height x width x channel; the ::-1 reverses the channel dimension (BGR to RGB), and the transpose reorders it to batch_size x channel x height x width. np.ascontiguousarray then makes the result contiguous in memory as float32, and dividing by 255 scales the pixel values to [0, 1].
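
(A small self-contained demo of what those three lines do to the array shape, using dummy data rather than repo code:)

    import numpy as np

    # fake batch of four 416x416 BGR images as cv2 would return them (uint8, HWC)
    img_all = [np.random.randint(0, 255, (416, 416, 3), dtype=np.uint8) for _ in range(4)]

    batch = np.stack(img_all)                               # (4, 416, 416, 3)  N x H x W x C
    batch = batch[:, :, :, ::-1].transpose(0, 3, 1, 2)      # BGR -> RGB, then N x C x H x W
    batch = np.ascontiguousarray(batch, dtype=np.float32)   # contiguous memory, float32
    batch /= 255.0                                          # scale pixels to 0.0-1.0
    print(batch.shape)                                      # (4, 3, 416, 416)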
