
Multi GPU training (data parallel) #28

Merged
merged 6 commits into victoresque:master on Oct 2, 2018

Conversation

SunQpark commented Oct 2, 2018

This PR handles two kinds of multiprocessing.
The first is multi-CPU for data loading, and the second is multi-GPU training (data parallel).

Multi-CPU is simply done by adding an n_cpu argument to the config file and passing it as the num_workers option of PyTorch's native DataLoader.
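A minimal sketch of that wiring, just to illustrate (the config dict and dataset here are stand-ins; the real project reads config.json):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

config = {'n_cpu': 4}  # stand-in for the value read from config.json

# dummy dataset, just to make the example self-contained
dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))

# n_cpu is forwarded as num_workers; 0 means loading happens in the main process
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=config['n_cpu'])
```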

Multi-GPU can be controlled with the n_gpu option in the config file, which replaces the previous gpu index and cuda options. Training without a GPU is still possible with "n_gpu": 0.
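Roughly, the device setup looks like this (a minimal sketch, not the PR's exact code; the model and config are stand-ins):

```python
import torch
import torch.nn as nn

config = {'n_gpu': 2}     # stand-in for config.json
model = nn.Linear(10, 2)  # stand-in for the real model

n_gpu = config['n_gpu']
device = torch.device('cuda:0' if n_gpu > 0 else 'cpu')
model = model.to(device)

if n_gpu > 1:
    # replicate the model over the first n_gpu visible devices
    model = nn.DataParallel(model, device_ids=list(range(n_gpu)))
```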

Specifying the GPU indices to use is still possible externally, with the environment variable CUDA_VISIBLE_DEVICES=0,1.
I considered putting the GPU indices in the config file instead of the n_gpu option and setting them in train.py, but that would save the GPU indices to the checkpoint, which can be problematic when resuming.
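For reference, the same restriction can also be applied from inside Python, as long as it happens before the first CUDA call (a sketch equivalent to prefixing the command with CUDA_VISIBLE_DEVICES=0,1):

```python
import os

# must be set before torch initializes CUDA
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
print(torch.cuda.device_count())  # 2; the devices are remapped to cuda:0 and cuda:1
```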

Tested on 3 machines:
my laptop: PyTorch 0.4.1, no GPU
server1: PyTorch 0.4.1, 8 * Tesla K80, CUDA 9.1
server2: PyTorch 0.4.0, 4 * GTX 1080, CUDA 9.0

It worked fine under all the conditions I tested, but a friend of mine said that giving a non-zero value to the num_workers option raised an exception on her machine. So, please tell me if anything goes wrong.

I'll update the README file later.

SunQpark added the enhancement label Oct 2, 2018
borgesa commented Oct 2, 2018

I agree with your reasoning for using "n_gpu" instead of indices. Could you add a one-sentence summary of this reasoning to README.md, along with a reference or snippet showing how the "CUDA_VISIBLE_DEVICES" prefix is to be used?

Will try running your code.

SunQpark commented Oct 2, 2018

Oh yes, I will add that, thank you for the quick response.
I just had a new idea: adding it as a command line option, such as --device 0,1,2,3.
What do you think?
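Something along these lines, perhaps (just a sketch, assuming train.py keeps using argparse; the flag would simply map onto CUDA_VISIBLE_DEVICES):

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--device', default=None, type=str,
                    help='indices of GPUs to enable (default: all)')
args = parser.parse_args()

if args.device is not None:
    # e.g. --device 0,1,2,3 becomes CUDA_VISIBLE_DEVICES=0,1,2,3
    os.environ["CUDA_VISIBLE_DEVICES"] = args.device
```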

borgesa commented Oct 2, 2018

That sounds like a nice solution. I often have to select GPUs, as more people are using the main server I work on.

Speaking of which: there seems to be a problem with that server right now, so I have not yet run your code. Currently waiting for a reply from the IT department. Will do the checks you request ASAP (but might be tomorrow) and report back.

borgesa commented Oct 2, 2018

Hi,

Did some testing. Looks good.

GPUs

The example model works well on a server with 4 x Nvidia Quadro P6000.

Played around with the settings and argparse arguments. Warning messages seem to be handled nicely (e.g., when fewer GPUs are available than config.json specifies).

CPU threads

Seems to work (although I have not monitored threads).

Maybe rename "n_cpu" to "n_cpu_threads" or "n_cpu_workers"? Perhaps the latter, as some people are familiar with "num_workers".

borgesa left a comment
Nice work.

SunQpark commented Oct 2, 2018

I'm glad the testing could be done quickly.
Let's just call it num_workers, as the original DataLoader does.
I think this PR is now stable enough to be merged.

SunQpark merged commit a6d09a9 into victoresque:master Oct 2, 2018