Tensorflow (TF) Serving on Multi-GPU box #311

Open
Immexxx opened this issue Feb 7, 2017 · 36 comments



@Immexxx Immexxx commented Feb 7, 2017

The issue is for a single model (not multi-model)

  1. What is the best way to serve requests from TF Serving on a multi-GPU box? TF is able to see all the GPUs but uses only one (the default Inception v3 model, from ~03/2016 I believe). (Is there some configuration in TF Serving that I am not aware of?)

Is this dependent on how the model is loaded and exported?

For inference (not training): is there an example of a saved model being loaded onto multiple GPUs (with a single CPU), where the CPU splits the load among multiple GPUs instead of using only one GPU?

  2. If the model is trained on a single-GPU machine, can it then be used for inference in TF Serving on a multi-GPU box where all GPUs are being used?

@jorgemf jorgemf commented Feb 8, 2017

TF Serving only executes the graph it has loaded. If your graph uses multiple GPUs, TF Serving will use them; if your graph uses only one GPU (the most common case), then it will run on only one GPU in TF Serving. Probably your best solution is to build a script that loads your graph once per GPU, adds some code on the CPU to split the batch data across the per-GPU graphs, and then exports the whole graph with multi-GPU support.

The only other solution I am aware of would be to replicate the same graph on every GPU inside TF Serving itself, which I guess is possible by modifying the serving code. But I don't think that will be supported anytime soon, so you would have to do it yourself.
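
For illustration, here is a minimal sketch of that replication idea in TF 1.x Python. It assumes a hypothetical build_inference_graph() holding your existing single-GPU model code, a NUM_GPUS matching your box, and a batch size divisible by NUM_GPUS; it is a sketch of the pattern, not an official TF Serving recipe.

import tensorflow as tf

NUM_GPUS = 4  # assumption: match the number of GPUs on the box

def build_inference_graph(images):
    # placeholder for your existing single-GPU model definition
    return tf.identity(images, name="logits")

with tf.device("/cpu:0"):
    images = tf.placeholder(tf.float32, [None, 299, 299, 3], name="images")
    # split the incoming batch into one shard per GPU (must divide evenly)
    shards = tf.split(images, NUM_GPUS, axis=0)

tower_outputs = []
for i in range(NUM_GPUS):
    # replicate the model on each GPU, sharing variables across towers
    with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
        tower_outputs.append(build_inference_graph(shards[i]))

with tf.device("/cpu:0"):
    outputs = tf.concat(tower_outputs, axis=0, name="outputs")

# Export images -> outputs as a SavedModel and point TF Serving at it;
# Serving then runs the multi-GPU graph as-is.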


@jlertle jlertle commented Feb 25, 2017

Using the CUDA_VISIBLE_DEVICES environment variable you can pin each Model Server process to a specific GPU. Run each of them on a separate port and put a load balancer in front.

If you are handling any significant load you probably want to run the model server with --enable_batching.
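
As a rough illustration of that setup (not an official recipe), the per-GPU servers can be launched from a small Python wrapper; the binary path, ports, model name and model path below are placeholders:

import os
import subprocess

MODEL_SERVER = "tensorflow_model_server"  # or the bazel-bin/... path
NUM_GPUS = 4
BASE_PORT = 9000

procs = []
for gpu in range(NUM_GPUS):
    # each process sees exactly one GPU
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    procs.append(subprocess.Popen(
        [MODEL_SERVER,
         "--port=%d" % (BASE_PORT + gpu),
         "--enable_batching",
         "--model_name=inception",
         "--model_base_path=/path/to/inception_model"],
        env=env))

for p in procs:
    p.wait()

A load balancer (or the client itself) then spreads requests across ports 9000-9003.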

@Immexxx Immexxx (Author) commented Mar 24, 2017

@jlertle: Thanks for the comment. But even with that, on a 4-GPU box you would have to run 4 servers, which does not seem like the most efficient way to do it. You also have to run them on 4 different ports, which in turn adds routing complexity.


@bhack bhack commented Apr 23, 2017

@Immexxx I think there are not as many resources invested in the Serving and ecosystem repositories compared to TF itself. It is understandable, since these are a little bit in conflict with selling managed services.

@Immexxx Immexxx (Author) commented Jul 5, 2017

Guessing I could close this issue based on the above responses, but it would be cool for the system to "intelligently" figure out the load (for example, detect that a graph requires only one GPU) and scale it based on available resources: if multiple GPUs are available, replicate the graph and serve it.

Could be a flag when launching TF Serving.

In summary, I will leave this open for now, in case someone (including me) is looking for an interesting project. In general, TF's handling of GPUs (grabbing all the memory of all GPUs and 'locking' them up while using only a subset for compute) does not seem like the most elegant way of doing this. A flag for "auto-replicate and serve" would be nice.

@vitalyli vitalyli (Contributor) commented Oct 17, 2017

Let's re-open this. With an 8-GPU machine, it would be really helpful if the server simply alternated requests across the N available GPUs.
For most of us without TPUs, the only way to speed things up is using GPUs; the difference can be 5x.


@zacharynevin zacharynevin commented Oct 22, 2017

+1. This would be incredibly useful. Currently I have a model that I am trying to run on a p2.16xlarge instance, but it's only utilizing one of the GPUs. It would be great if the batch could be split automatically across all the GPUs.


@zacharynevin zacharynevin commented Oct 22, 2017

In the end I simply used client-side load balancing and ran a separate TensorFlow Serving server for each GPU.
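
For anyone looking for a starting point, here is a minimal sketch of that client-side round-robin approach over gRPC. The endpoint list, model name and input tensor name are assumptions, and it requires the tensorflow-serving-api package:

import itertools
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# one TF Serving instance per GPU, each on its own port
ENDPOINTS = ["localhost:9000", "localhost:9001", "localhost:9002", "localhost:9003"]

stubs = [prediction_service_pb2_grpc.PredictionServiceStub(grpc.insecure_channel(ep))
         for ep in ENDPOINTS]
next_stub = itertools.cycle(stubs)  # simple round-robin over the servers

def predict(image_batch, model_name="inception", timeout=10.0):
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.inputs["images"].CopyFrom(tf.make_tensor_proto(image_batch))
    # each call goes to the next server, i.e. the next GPU
    return next(next_stub).Predict(request, timeout)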


@gawinghe gawinghe commented Nov 3, 2017

@zacharynevin Me too. With TensorFlow Serving, multiple GPUs go to waste.


@mattfeury mattfeury commented Jan 18, 2018

Another +1 here for the exact use case above. Running this on a p2 instance with extra GPUs would be an incredibly easy way to scale up, but right now it requires an instance per GPU.


@deadeyegoodwin deadeyegoodwin commented Apr 13, 2018

A natural way to do this would be to use the Session option visible_device_list to create a session for each GPU and then serve the same model on each GPU.

Unfortunately due to a significant TensorFlow issue (described in tensorflow/tensorflow#8136 and many related bugs), it seems that you cannot have multiple Sessions in the same process with different visible_device_list settings. There doesn't seem to be any traction to getting it fixed as all the related bugs have been closed without any code changes.
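
For reference, the per-Session pattern being discussed looks roughly like this in plain TF 1.x (outside of Serving); as noted above, creating several such Sessions with different visible_device_list values in one process is exactly what fails:

import tensorflow as tf

def session_for_gpu(gpu_id, graph):
    # this Session sees only one physical GPU, exposed to it as /gpu:0
    gpu_options = tf.GPUOptions(visible_device_list=str(gpu_id),
                                allow_growth=True)
    return tf.Session(graph=graph, config=tf.ConfigProto(gpu_options=gpu_options))

# sessions = [session_for_gpu(i, my_graph) for i in range(num_gpus)]
# is the multi-Session setup that currently trips tensorflow/tensorflow#8136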


@rundembear rundembear commented May 4, 2018

I placed a comment on one of the many closed tickets referred to by @deadeyegoodwin. The comments closing those tickets speak as if the functionality must be this way. Yet this works absolutely fine in TensorFlow 1.3! We have an app where we call list_local_devices() to get the list of available GPUs. We then create a Session per GPU, each running the same graph, and use visible_device_list to make each Session see the GPU assigned to it as GPU 0. We have this running in live production code with zero problems.
I just started trying to upgrade to TensorFlow 1.8 and ran smack into this problem. So I feel more explanation is needed beyond saying that this doesn't work. It used to work; it got broken at some point. Please return it to working, or explain what changed between 1.3 and 1.5 that necessitated this. I have an idea for a workaround which I am going to try; I will post it if it succeeds.


@rundembear rundembear commented May 4, 2018

The problem is NOT with list_local_devices!! I ran my code once, got the result we use, which is just the string description. Then I commented out the call to list_local_devices and I still get the same crash with the same error:

F tensorflow/core/common_runtime/gpu/gpu_id_manager.cc:45] Check failed: cuda_gpu_id.value() == result.first->second (1 vs. 0)Mapping the same TfGpuId to a different CUDA GPU id. TfGpuId: 0 Existing mapped CUDA GPU id: 0 CUDA GPU id being tried to map to: 1

I am going to try some earlier versions of Tensorflow to see if I can identify where this went bad.


@t27 t27 commented May 29, 2018

This issue still appears on the latest versions.

I'm using an AWS g3.8xlarge instance which has 2 GPUs.

TF Serving is able to detect both GPUs and initialise them, but while running the model it only maxes out 1 GPU.

We are on version 1.7; even though the client sends up to 32 requests in parallel, the model server only uses the first GPU (see the nvidia-smi screenshot).
[nvidia-smi screenshot: only GPU 0 shows utilisation]

Seeing that this ticket has been open for quite some time, is the external load balancer solution (that @zacharynevin suggested) the only solution at the moment?

@rickragv @cyberwillis


@qiaohaijun qiaohaijun commented Jun 6, 2018

When I use multiple GPUs in TF 1.6, I hit the same error:

Mapping the same TfGpuId to a different CUDA GPU id. TfGpuId: 0 Existing mapped CUDA GPU id: 0 CUDA GPU id being tried to map to: 1


@sugartom sugartom commented Jul 28, 2018

@jlertle @zacharynevin, hi, I would like to ask what commands you were using to run a separate TensorFlow Serving server for each GPU.

I have a machine with two 1080 Tis. My TF-Serving is able to correctly identify both GPUs when the CUDA_VISIBLE_DEVICES flag is not set. I was trying to run one TF-Serving server per GPU, so I opened two terminals and ran the following two commands:

CUDA_VISIBLE_DEVICES=0 bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=inception --model_base_path=/path/to/inception_model

CUDA_VISIBLE_DEVICES=1 bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9001 --model_name=mnist --model_base_path=/path/to/mnist_model

The first command runs fine, but the second one fails with:

terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
[1] 4021 abort (core dumped) CUDA_VISIBLE_DEVICES=1 --port=9001 --model_name=mnist

Any ideas or suggestions? Thanks in advance!


@zacharynevin zacharynevin commented Jul 28, 2018

Hi @sugartom, can you post the output of nvidia-smi?


@sugartom sugartom commented Jul 28, 2018

@zacharynevin, thanks for your reply. My output of nvidia-smi (without TF-Serving running) is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.59 Driver Version: 390.59 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 On | N/A |
| 23% 31C P8 15W / 250W | 334MiB / 11176MiB | 28% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
| 23% 34C P8 15W / 250W | 2MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1089 G /usr/lib/xorg/Xorg 225MiB |
| 0 1892 G compiz 103MiB |
| 0 5429 G /usr/lib/firefox/firefox 2MiB |
+-----------------------------------------------------------------------------+

Please let me know if you need any more information. Thanks!


@jlertle jlertle commented Jul 28, 2018

@sugartom the commands and output of nvidia-smi look good. I'm not sure what the issue could be at this time.


@sugartom sugartom commented Jul 28, 2018

@jlertle, thanks for your reply. May I ask what your TF version is? I am using v1.2...


@jlertle jlertle commented Jul 28, 2018

I'm currently on v1.6 but have used this method since ~v0.8.

Have you tried running only on the second GPU?

Perhaps try this:

export CUDA_VISIBLE_DEVICES=1 && bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9001 --model_name=mnist --model_base_path=/path/to/mnist_model


@sugartom sugartom commented Jul 28, 2018

Yes, already tried that. My TF-Serving runs perfectly on either gpu:0 or gpu:1 alone, and that's why it confused me so much that running two TF-Serving instances didn't work...

@Immexxx Immexxx (Author) commented Jul 29, 2018

@sugartom

  1. It is certainly possible to run two instances (at least on a P100 and V100)
  2. Can you try swapping the models to see if there is some issue with the model (for example, running mnist on GPU 0) to rule out model/model-environment issues?
  3. If that does not work, you can try swapping the ports / using different ports.
  4. Do you see this error when you are starting up TF Serving, or when you are trying to process an image through it? If it is the former and you are still seeing the error, you could try setting the gpu_memory_fraction to see if that helps (see the sketch below).
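
For item 4, this is roughly what limiting the memory fraction looks like at the TF session level (plain TF 1.x; how the model server surfaces this option depends on the Serving version, so treat it as a sketch):

import tensorflow as tf

# cap this process at roughly half of each visible GPU's memory,
# so a second server process can share the same card
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.45)

with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
    pass  # load and run the exported graph here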

@sugartom sugartom commented Jul 29, 2018

@Immexxx, hi, thanks for your reply!
For your suggestions:

  1. I swapped my models: my exported inception is able to run on either gpu:0 or gpu:1, and my exported mnist is also able to run on either gpu:0 or gpu:1.

  2. Tested on a few different ports, still no luck...

  3. I saw this error message when starting the TF-Serving server. Namely, the first server always starts normally (log msg: I tensorflow_serving/model_servers/main.cc:343] Running ModelServer at 0.0.0.0:9100 ...) on its gpu (either gpu:0 or gpu:1), but the second server can't finish starting on its gpu (either gpu:1 or gpu:0). So I haven't reached the image processing step yet, since the second server always reports the error.
    And I tried to limit the gpu memory as you suggested, using set_per_process_gpu_memory_fraction in main.cc. I was able to limit the gpu memory (according to nvidia-smi, both GPUs show usage around 6117MiB / 11176MiB if I set a gpu fraction of 0.5). But still, running two separate TF-Serving servers leads to the second server reporting the Resource temporarily unavailable error...

Anyway, thanks again for your suggestions :-)


@Harshini-Gadige Harshini-Gadige commented Oct 25, 2018

Is this still an issue?

@Immexxx Immexxx (Author) commented Oct 26, 2018

Yes, I still think it is (but happy to be corrected).

Currently, TF Serving does not seem to be able to exploit the compute of all GPUs on a node if the graph is set up to run on only one GPU.
Having said that, it depends on the direction of evolution of TF Serving, and if this does not make sense for TF Serving, we can close the issue.

Given the cost of GPUs and the way GPUs are being made available by vendors and cloud providers, this would be nice to have.


@wydwww wydwww commented Nov 18, 2018

@Immexxx @jorgemf

If your graph uses multiple GPUs, TF Serving will use them; if your graph uses only one GPU (the most common case), then it will run on only one GPU in TF Serving.

Do you mean that when a model is trained with model parallelism, it can then be served with multiple GPUs?
One major motivation of model parallelism is that the model is too large to fit in one GPU. If a model is trained with 1 GPU (for example), is it still necessary to use multiple GPUs in serving, compared with the current load-balancer method?


@wydwww wydwww commented Dec 8, 2018

@Immexxx
Regarding your latest post:

Currently, TF Serving does not seem to be able to exploit the compute of all GPUs on a node if the graph is set up to run on only one GPU.

According to Figure 4 of an ICML 2017 paper about device placement, utilising 2 or 4 GPUs only slightly reduces latency (from 0.1 to ~0.07) compared with a single GPU in model-parallel training.

Is this the goal of utilising multiple GPUs for one request, namely to reduce the minimum latency?

Thanks


@srikanthcognitifAI srikanthcognitifAI commented Jan 8, 2019

On the TensorRT Inference Server GitHub page they mention "Multi-GPU support. TRTIS can distribute inferencing across all system GPUs." Is this possible with TF Serving already?


@xuzheyuan624 xuzheyuan624 commented Apr 3, 2019

How can I run TensorFlow Serving with Docker on a specific GPU? I don't want it to take up all GPUs when running TF Serving. Does anybody know? In addition, my model is written with tf.keras.

@aaroey aaroey (Member) commented Apr 3, 2019

@xuzheyuan624 could you try to set environment variable CUDA_VISIBLE_DEVICES and see if it works?


@xuzheyuan624 xuzheyuan624 commented Apr 5, 2019

@aaroey I tried like this:

#!/bin/bash
export CUDA_VISIBLE_DEVICES=5 && \
docker run --runtime=nvidia -p 8500:8500 \
  --mount type=bind,source=/home/xzy/keras-yolov3/yolov3,target=/models/yolov3 \
  -e MODEL_NAME=yolov3 -t tensorflow/serving:1.12.0-gpu &

but it didn't work.


@t27 t27 commented Apr 5, 2019

You need to use the nvidia runtime for docker, and export the NVIDIA_VISIBLE_DEVICES environment variable inside the container when you run it

Sample command

docker run  --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -p 8501:8501 \
  --mount type=bind,source=/path/to/my_model/,target=/models/my_model \
  -e MODEL_NAME=my_model -t tensorflow/serving:latest-gpu

If you use a config file for tensorflow serving, you can use the following command while running the docker

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -p 8500:8500 -p 8501:8501 \
  --mount type=bind,source=/path/to/my_model/,target=/models/my_model \
  --mount type=bind,source=/path/to/my/models.config,target=/models/models.config \
  -t tensorflow/serving:latest-gpu --model_config_file=/models/models.config

Source: https://github.com/tensorflow/serving/blob/master/tensorflow_serving/g3doc/docker.md#serving-with-docker-using-your-gpu


@t27 t27 commented Apr 5, 2019

Until TensorFlow Serving incorporates multi-GPU load balancing, the best approach for multi-GPU inference with TensorFlow Serving is to run one dockerised TensorFlow Serving container per GPU (#311 (comment)) and then write your own load-balancing logic that keeps a list of all the url/port combinations and dispatches requests accordingly.


@xuzheyuan624 xuzheyuan624 commented Apr 7, 2019

@t27 thanks! It's very useful!


@troycheng troycheng commented Apr 8, 2019

Has anyone hit a throughput decrease when load balancing over multiple tf-serving instances? With only one instance, using batching with fine tuning, it can reach 500 qps; but with 8 instances it can't reach 500 * 8 = 4000 qps, and even worse, it can't even reach 500 qps. I suspect batching is the main cause, but so far I don't know why, and I haven't found a better solution to make full use of multiple GPU devices.
