Multi GPU training (data parallel) #28
Conversation
I agree with your reasoning for using "n_gpu" instead of indices. Could you maybe add a one-sentence summary of this reasoning in README.md, maybe in addition to either a reference or snippet of how the "CUDA_VISIBLE_DEVICES" prefix is to be used? Will try running your code.
Oh yes, I will add that. Thank you for the quick response.
That sounds like a nice solution. I often have to select GPUs, as more people are using the main server I work on. Speaking of which: there seems to be a problem with that server right now, so I have not yet run your code. Currently waiting for a reply from the IT department. Will do the checks you request ASAP (but might be tomorrow) and report back.
Hi, did some testing. Looks good.

**GPUs**
The example model works well on a server with 4 x Nvidia Quadro P6000. Played around with the settings and argparse arguments. Warning messages seem to be handled nicely (e.g. when fewer GPUs are available than config.json specifies).

**CPU threads**
Seems to work (although I have not monitored threads). Maybe rename "n_cpu" to "n_cpu_threads" or "n_cpu_workers"? Maybe the latter, as some people are familiar with "num_workers".
Nice work.
I'm glad the tests could be done so fast.
This PR handles two kinds of multiprocessing.
The first is multi-CPU for data loading, and the second is multi-GPU training (data parallel).
**Multi-CPU** is simply done by adding an `n_cpu` argument in the config file and passing it as the `num_workers` option of PyTorch's native `DataLoader`.
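A minimal sketch of that wiring, assuming a plain dict stands in for the parsed config.json and a toy dataset replaces the real one:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical config fragment; the real values would come from config.json.
config = {"n_cpu": 2, "batch_size": 4}

# Toy dataset of 16 one-feature samples, standing in for the real dataset.
dataset = TensorDataset(torch.arange(16).float().unsqueeze(1))

# "n_cpu" from the config is passed straight through as num_workers.
loader = DataLoader(dataset,
                    batch_size=config["batch_size"],
                    num_workers=config["n_cpu"],
                    shuffle=True)

for (batch,) in loader:
    print(batch.shape)  # torch.Size([4, 1])
```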
**Multi-GPU** can be controlled with the `n_gpu` option in the config file, which replaces the previous gpu index and cuda options. Training without a GPU is still possible with `"n_gpu": 0`.
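A sketch of how an `n_gpu` setting could be mapped to devices and `nn.DataParallel`; the `prepare_device` helper below is illustrative, not necessarily the PR's actual code:

```python
import torch
import torch.nn as nn

def prepare_device(n_gpu_use):
    """Illustrative helper: clamp the requested GPU count to what is
    actually available, warn on mismatch, and fall back to CPU."""
    n_gpu = torch.cuda.device_count()
    if n_gpu_use > 0 and n_gpu == 0:
        print("Warning: no GPU available, training will run on CPU.")
        n_gpu_use = 0
    if n_gpu_use > n_gpu:
        print(f"Warning: only {n_gpu} GPUs available, using {n_gpu}.")
        n_gpu_use = n_gpu
    device = torch.device("cuda:0" if n_gpu_use > 0 else "cpu")
    return device, list(range(n_gpu_use))

model = nn.Linear(8, 2)
device, gpu_ids = prepare_device(2)   # e.g. "n_gpu": 2 in config.json
model = model.to(device)
if len(gpu_ids) > 1:
    model = nn.DataParallel(model, device_ids=gpu_ids)
```

On a machine with fewer than two GPUs this degrades gracefully to however many devices exist, which is the warning behavior tested above.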
Specifying the GPU indices to use is possible externally with the environment variable `CUDA_VISIBLE_DEVICES=0,1`.
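For example, launching with `CUDA_VISIBLE_DEVICES=0,1 python train.py` restricts the process to those two physical GPUs. The same effect can be approximated from inside Python, provided the variable is set before CUDA is initialized; this is a sketch of that behavior, not part of the PR:

```python
import os

# Must be set before torch initializes CUDA (i.e. before the first CUDA call).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch

# device_count() now reports at most 2 devices, and "cuda:0"
# refers to physical GPU 0 of the visible set.
print(torch.cuda.device_count())
```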
I considered adding the GPU indices to the config file instead of the `num_GPU` option and setting them in `train.py`, but that would save the GPU indices to the checkpoint, which can be problematic when resuming.

Tested on 3 machines:
- my laptop: pytorch 0.4.1, no GPU
- server1: pytorch 0.4.1, 8 * Tesla K80, cuda 9.1
- server2: pytorch 0.4.0, 4 * GTX 1080, cuda 9.0
It worked fine under all the conditions I tested, but one of my friends told me that giving a non-zero value to the `num_workers` option raised an exception on her machine. So, please tell me if anything goes wrong.

I'll update the README file later.