
[ distribution ] How to use multiple GPU on each replica ? #54

Closed
ZhuFengdaaa opened this issue Apr 26, 2016 · 26 comments
Labels: type:bug Bug in the code

@ZhuFengdaaa

The code here shows how to set up each replica with a single tower that uses one GPU. I'm wondering if there is a way to change this code slightly so that it uses multiple GPUs on one machine, like that example.

The way I currently use all the GPUs on a worker machine is to start as many workers as there are GPUs; the workers then communicate with each other as if they were not on one machine. That is slower than if I could start one worker that controls more than one GPU.
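
A minimal sketch of that "one worker process per GPU" setup, assuming the standard cluster-spec APIs; the hostnames, ports, and 4-GPUs-per-machine count are made up for illustration:

```python
# Hedged sketch of "one worker process per GPU": each process is pinned to a
# single GPU via CUDA_VISIBLE_DEVICES and joins the cluster as its own worker.
# Hostnames, ports, and the 4-GPUs-per-machine count are illustrative only.
import os
import sys
import tensorflow as tf

def main(task_id, gpus_per_machine=4):
    # Pin this process to one local GPU before any session is created,
    # so TensorFlow does not claim the other GPUs on the machine.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(task_id % gpus_per_machine)

    cluster = tf.train.ClusterSpec({
        "ps": ["host0:2222"],
        "worker": ["host0:2230", "host0:2231", "host0:2232", "host0:2233",
                   "host1:2230", "host1:2231", "host1:2232", "host1:2233"],
    })
    server = tf.train.Server(cluster, job_name="worker", task_index=task_id)

    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        pass  # build the model here, then train against server.target

if __name__ == "__main__":
    main(int(sys.argv[1]))
```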

@bhack
Contributor

bhack commented Apr 28, 2016

/cc @windreamer

@ZhuFengdaaa
Author

I'm now wondering whether my idea above is wrong; that is, whether it is better to use multiple workers, each controlling one GPU on the machine, than to use one worker that controls all the GPUs on that machine.

@ZhiyuTan88

Your idea is right. I modified the code so that training runs with each worker (replica) corresponding to one GPU; that works for this issue.

@ZhuFengdaaa
Author

@Stewarttzy So how much did the training speed improve? And could you show us how you implemented it, please?

@heibaidaolx123

I tried running 16 workers on 2 machines with K80 GPUs, plus 2 ps jobs, one on each machine.
The training is much slower than running just 2 workers.
@ZhuFengdaaa Have you solved the speed issue?

@ZhuFengdaaa
Author

@heibaidaolx123 No, I see the same problem as you do: the more workers, the slower the speed. @sguada I saw you say in another issue that there will be a performance update for TensorFlow; when will it be released?

@aselle aselle added stat:awaiting response Waiting on input from the contributor performance labels Jun 28, 2016
@aselle
Contributor

aselle commented Jun 28, 2016

Did you find a solution to your question?

@sguada
Member

sguada commented Jun 28, 2016

Running 2 PS is a bad idea: since the variables are assigned in round-robin fashion, all the weights go to one PS while all the biases go to the other. When using PS, make sure the load is balanced. You should be able to use either 1 PS or 3 PS for better balance.

The TF 0.9 release should increase the speed; we are working on the multi-GPU multi-replica case.
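
For reference, later TF versions expose a placement strategy that spreads variables by size instead of round-robin; a sketch, assuming tf.contrib.training is available (it was not in 0.9):

```python
# Sketch of size-aware variable placement across PS tasks, as an alternative
# to the default round-robin. GreedyLoadBalancingStrategy ships with the
# TF 1.x contrib line.
import tensorflow as tf

num_ps = 3   # an odd PS count also helps keep weights and biases from pairing up
task_id = 0  # this worker's index (example value)

setter = tf.train.replica_device_setter(
    ps_tasks=num_ps,
    worker_device="/job:worker/task:%d" % task_id,
    ps_strategy=tf.contrib.training.GreedyLoadBalancingStrategy(
        num_ps, tf.contrib.training.byte_size_load_fn))  # place by variable byte size

with tf.device(setter):
    # Variables created here are spread across the PS tasks by size,
    # so one PS does not end up holding all the large weight matrices.
    w = tf.get_variable("w", [4096, 4096])
    b = tf.get_variable("b", [4096])
```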

@heibaidaolx123

@sguada I've just tried TF 0.9, and training remains as slow as with TF 0.8.
I also tried using just one PS and got no improvement.
To be clear, I used 'imagenet_distributed_train.py'. I have 2 machines, each with 8 GPUs and linked by IB, and I set CUDA_VISIBLE_DEVICES to run multiple workers on a single machine.
For 16 workers and 2 PS (distributed equally over the 2 nodes), the speed is about 8.3 examples/sec per worker, about 133 examples/sec in total.
For 16 workers and 1 PS, almost the same speed as above.
For 8 workers and 1 PS on a single machine, about 11 examples/sec per worker, 88 examples/sec in total.
For 4 workers and 1 PS on a single machine, about 18 examples/sec per worker, 72 examples/sec in total.
For 2 workers and 1 PS on a single machine, about 20 examples/sec per worker, 40 examples/sec in total.
For 1 worker and 1 PS on a single machine, about 22 examples/sec.
So for distributed ImageNet training, more workers mean a lower speed per worker.

@sguada sguada self-assigned this Jul 1, 2016
@michaelisard michaelisard removed the stat:awaiting response Waiting on input from the contributor label Jul 25, 2016
@AIROBOTAI

Is there any answer to this question?
@ZhuFengdaaa You said in your question that:

... The way I currently use all the GPUs on a worker machine is to start as many workers as there are GPUs. ...

How do you make this work? Suppose there are two machines, each with 4 GPUs, so 8 GPUs overall. Do you mean you start 8 workers to employ all the GPUs? Do you still need to explicitly assign different workers to different devices in your code, e.g. using the tf.device context manager? Could you describe your approach in more detail?

@ZhuFengdaaa
Author

@AIROBOTAI Yes, I start 8 workers and assign each worker to its device explicitly using tf.device.
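
A rough sketch of that arrangement, assuming 8 workers across two 4-GPU machines and a single PS; the details below are illustrative, not ZhuFengdaaa's actual code:

```python
# Rough sketch of pinning each worker process to one local GPU with tf.device,
# while variables stay on the PS. The 4-GPUs-per-machine layout is assumed.
import tensorflow as tf

task_id = 3            # this worker's index among the 8 workers (example value)
gpu_id = task_id % 4   # which local GPU this worker owns

device_fn = tf.train.replica_device_setter(
    ps_tasks=1,
    worker_device="/job:worker/task:%d/gpu:%d" % (task_id, gpu_id))

with tf.device(device_fn):
    # Compute ops land on this worker's single GPU; variables go to the PS.
    x = tf.placeholder(tf.float32, [None, 2048])
    w = tf.get_variable("w", [2048, 1000])
    loss = tf.reduce_mean(tf.matmul(x, w))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
```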

@aselle aselle added type:bug Bug in the code and removed performance labels Jan 28, 2017
@girving

girving commented Feb 6, 2017

@sguada What's the status of this issue?

@junshi15

junshi15 commented Feb 9, 2017

Here is my understanding of https://github.com/tensorflow/models/blob/master/slim/deployment/model_deploy.py

Let's say you have 2 workers and 1 parameter server, and each worker has 4 clones (GPUs). The worker aggregates the clone gradients and then sends them to the parameter server, which updates the weights. This is all good.

The problem is that the GPUs share the weights on the PS, so each GPU fetches the weights independently for the next forward pass. This generates a lot of traffic, because every GPU communicates with the PS. It would probably be faster to limit the connections to the worker-PS level, so that individual GPUs do not talk to the PS directly; once the weights reach a worker, they are distributed among its clones internally. Distributed Caffe does exactly this kind of hierarchical broadcast, and it saves quite a bit of network bandwidth when you have multiple GPUs per worker.
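
For concreteness, the configuration described above maps roughly onto model_deploy like this; argument names are paraphrased from that file, so treat this as approximate:

```python
# Approximate shape of the 2-worker x 4-clone x 1-PS setup described above,
# using slim's model_deploy; argument names are paraphrased, not verified.
import tensorflow as tf
from deployment import model_deploy  # from the tensorflow/models slim tree

config = model_deploy.DeploymentConfig(
    num_clones=4,     # 4 GPUs (clones) inside this worker's graph
    replica_id=0,     # this worker's index
    num_replicas=2,   # 2 workers in total
    num_ps_tasks=1)   # 1 parameter server

with tf.device(config.variables_device()):
    global_step = tf.train.get_or_create_global_step()

def clone_fn(batch):
    # Per-GPU tower; model_deploy calls this once per clone and later
    # averages the clone gradients inside the worker before the PS update.
    w = tf.get_variable("w", [2048, 1000])
    return tf.reduce_mean(tf.matmul(batch, w))

# clones = model_deploy.create_clones(config, clone_fn, [batch_tensor])
```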

@AIROBOTAI

@junshi15 I think you are right about distributing the weights internally. By the way, it seems that Caffe does not have a distributed version yet.

@junshi15

junshi15 commented Feb 9, 2017

@AIROBOTAI You are correct, the official BVLC Caffe does not extend beyond a single node. At the risk of self-promotion, I was referring to the Yahoo version of it (https://github.com/yahoo/caffe/tree/master), which is part of CaffeOnSpark (https://github.com/yahoo/CaffeOnSpark). Both Ethernet and InfiniBand connections are supported.

@AIROBOTAI

@junshi15 Thanks for your clarification!

@AIROBOTAI

AIROBOTAI commented Feb 9, 2017

Hi @ZhuFengdaaa, I found your modified distributed_train.py (if that is how you use multiple GPUs on each replica) and wrote a comment there. Since distributed TF needs only one chief, I think changing is_chief = (FLAGS.task_id == 0) to is_chief = (FLAGS.task_id == 0 and FLAGS.gpu_id == 0) would be better. Could anyone comment on this?

@AIROBOTAI

@ZhuFengdaaa Sorry, I just realized that task_id is unique for each worker, so your code is right.

@AIROBOTAI

Hi @heibaidaolx123, I'd like to know whether you have rerun the speed benchmark with TF v1.0. Has the speed improved?

@manjunaths

manjunaths commented Apr 12, 2017

Hello,
Any update on this issue? Is a single worker that uses multiple GPUs in a distributed multi-node setting possible now?

Is there an example?

@weixsong

weixsong commented May 5, 2017

Hi, does anyone know how to start 2 workers where each worker controls 8 GPUs? Is there any example code to follow?

@ppwwyyxx
Contributor

ppwwyyxx commented May 5, 2017

@jke-zq

jke-zq commented Nov 26, 2017

@ppwwyyxx The benchmarks only perform "in-graph replication" across the GPUs in a single worker, with asynchronous training across workers.
Is there any way to change them to perform synchronous training across workers, as mentioned in https://stackoverflow.com/questions/39595747/tensorflow-distributed-training-hybrid-with-multi-gpu-methodology?
Any hints would be appreciated.
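
One way to get synchronous cross-worker updates is tf.train.SyncReplicasOptimizer; a minimal sketch assuming 2 workers (this is not what the benchmark scripts themselves do):

```python
# Minimal sketch of synchronous cross-worker training with
# tf.train.SyncReplicasOptimizer (TF 1.x). The benchmarks keep workers
# asynchronous, so this is an assumed modification, not their code.
import tensorflow as tf

num_workers = 2
is_chief = True  # would be (task_id == 0) in a real run

# Tiny stand-in model so the sketch builds end to end.
x = tf.placeholder(tf.float32, [None, 10])
w = tf.get_variable("w", [10, 1])
loss = tf.reduce_mean(tf.matmul(x, w))
global_step = tf.train.get_or_create_global_step()

opt = tf.train.SyncReplicasOptimizer(
    tf.train.GradientDescentOptimizer(0.1),
    replicas_to_aggregate=num_workers,   # wait for gradients from every worker
    total_num_replicas=num_workers)
train_op = opt.minimize(loss, global_step=global_step)
sync_hook = opt.make_session_run_hook(is_chief)

# In a real cluster run:
# with tf.train.MonitoredTrainingSession(master=server.target,  # server: tf.train.Server
#                                        is_chief=is_chief,
#                                        hooks=[sync_hook]) as sess:
#     sess.run(train_op, feed_dict={x: batch})
```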

@ccyjava

ccyjava commented Dec 28, 2017

@ZhuFengdaaa Can you share your example code for "I start 8 workers and assign each worker to its device explicitly using tf.device"? It is exactly the same problem I am facing. Thanks a lot.

@alextp
Contributor

alextp commented Feb 7, 2018

Closing this issue. It's straightforward to use multiple GPUs in each replica; just build a graph that assigns work to all the GPUs. There are utilities to do this (used by the inception_train file pointed to at the top).

Higher-level APIs to make this easier are being worked on.

Please reopen if you think it's too soon to close this.
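
For readers landing here later, the in-graph pattern referred to above looks roughly like this; a sketch of the usual multi-tower recipe, not the exact inception_train code:

```python
# Sketch of the in-graph multi-tower pattern: one replica builds a tower per
# local GPU, shares the variables across towers, and averages the tower
# gradients itself before applying a single update.
import tensorflow as tf

num_gpus = 4
opt = tf.train.GradientDescentOptimizer(0.1)
tower_grads = []

for i in range(num_gpus):
    with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
        # Tiny stand-in tower; a real model and input pipeline go here.
        x = tf.random_normal([32, 10])
        w = tf.get_variable("w", [10, 1])
        loss = tf.reduce_mean(tf.matmul(x, w))
        tower_grads.append(opt.compute_gradients(loss))

# Average gradients variable by variable across towers, then apply once.
avg_grads = []
for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars]
    var = grads_and_vars[0][1]
    avg_grads.append((tf.add_n(grads) / float(num_gpus), var))
train_op = opt.apply_gradients(avg_grads)
```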

@alextp alextp closed this as completed Feb 7, 2018
@cheng-wen-long

Hello, could you share the code of your inception_distribute_train.py with me? I am facing this problem too. Thanks very much @ZhuFengdaaa
