
Multi GPU training (data parallel) #28

Merged
merged 6 commits into victoresque:master on Oct 2, 2018

Conversation

SunQpark commented Oct 2, 2018

This PR handles two kinds of multiprocessing.
The first is multi-CPU for data loading, and the second is multi-GPU training (data parallel).

Multi-CPU is simply done by adding an n_cpu argument to the config file and passing it as the num_workers option of PyTorch's native DataLoader.
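A minimal sketch of that wiring, just to illustrate (the config dict and dataset here are stand-ins; the real project reads config.json):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

config = {'n_cpu': 4}  # stand-in for the value read from config.json

# dummy dataset, just to make the example self-contained
dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))

# n_cpu is forwarded as num_workers; 0 means loading happens in the main process
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=config['n_cpu'])
```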

Multi-GPU can be controlled with the n_gpu option in the config file, which replaces the previous gpu index and cuda options. Training without a GPU is still possible with "n_gpu": 0.
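Roughly, the device setup looks like this (a minimal sketch, not the PR's exact code; the model and config are stand-ins):

```python
import torch
import torch.nn as nn

config = {'n_gpu': 2}     # stand-in for config.json
model = nn.Linear(10, 2)  # stand-in for the real model

n_gpu = config['n_gpu']
device = torch.device('cuda:0' if n_gpu > 0 else 'cpu')
model = model.to(device)

if n_gpu > 1:
    # replicate the model over the first n_gpu visible devices
    model = nn.DataParallel(model, device_ids=list(range(n_gpu)))
```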

Specifying the GPU indices to use is still possible externally, with the environment variable CUDA_VISIBLE_DEVICES=0,1.
I considered putting the GPU indices in the config file instead of the n_gpu option and setting them in train.py, but that would save the GPU indices to the checkpoint, which can be problematic when resuming.
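For reference, the same restriction can also be applied from inside Python, as long as it happens before the first CUDA call (a sketch equivalent to prefixing the command with CUDA_VISIBLE_DEVICES=0,1):

```python
import os

# must be set before torch initializes CUDA
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
print(torch.cuda.device_count())  # 2; the devices are remapped to cuda:0 and cuda:1
```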

Tested on 3 machines:
my laptop: PyTorch 0.4.1, no GPU
server1: PyTorch 0.4.1, 8 * Tesla K80, CUDA 9.1
server2: PyTorch 0.4.0, 4 * GTX 1080, CUDA 9.0

It worked fine under all the conditions I tested, but a friend of mine said that giving a non-zero value to the num_workers option raised an exception on her machine. So, please tell me if anything goes wrong.

I'll update the README file later.

SunQpark added the enhancement label Oct 2, 2018
borgesa commented Oct 2, 2018

I agree with your reasoning for using "n_gpu" instead of indices. Could you add a one-sentence summary of this reasoning to README.md, along with a reference or snippet showing how the "CUDA_VISIBLE_DEVICES" prefix is to be used?

Will try running your code.

SunQpark commented Oct 2, 2018

Oh yes, I will add that, thank you for the quick response.
I just had a new idea: adding it as a command line option, such as --device 0,1,2,3.
What do you think?
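Something along these lines, perhaps (just a sketch, assuming train.py keeps using argparse; the flag would simply map onto CUDA_VISIBLE_DEVICES):

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--device', default=None, type=str,
                    help='indices of GPUs to enable (default: all)')
args = parser.parse_args()

if args.device is not None:
    # e.g. --device 0,1,2,3 becomes CUDA_VISIBLE_DEVICES=0,1,2,3
    os.environ["CUDA_VISIBLE_DEVICES"] = args.device
```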

borgesa commented Oct 2, 2018

That sounds like a nice solution. I often have to select GPUs, as more people are using the main server I work on.

Speaking of which: there seems to be a problem with that server right now, so I have not yet run your code. Currently waiting for a reply from the IT department. Will do the checks you request ASAP (but might be tomorrow) and report back.

borgesa commented Oct 2, 2018

Hi,

Did some testing. Looks good.

GPUs

The example model works well on a server with 4 x Nvidia Quadro P6000.

Played around with the settings and argparse arguments. Warning messages seem to be handled nicely (e.g., when fewer GPUs are available than config.json specifies).

CPU threads

Seems to work (although I have not monitored threads).

Maybe rename "n_cpu" to "n_cpu_threads" or "n_cpu_workers"? Perhaps the latter, as some people are familiar with "num_workers".

borgesa left a comment
Nice work.

SunQpark commented Oct 2, 2018

I'm glad the testing could be done quickly.
Let's just call it num_workers, as the original DataLoader does.
I think this PR is now stable enough to be merged.

SunQpark merged commit a6d09a9 into victoresque:master Oct 2, 2018