
Tensorflow (TF) Serving on Multi-GPU box #311

Closed
Immexxx opened this issue Feb 7, 2017 · 42 comments

Comments

@Immexxx

Immexxx commented Feb 7, 2017

This issue is about serving a single model (not multiple models).

  1. What is the best way to serve requests with TF Serving on a multi-GPU box? TF is able to see all the GPUs but uses only one (the default Inception v3 model, from ~03/2016 I believe). Is there some TF Serving configuration I am not aware of?

Is this dependent on how the model is loaded and exported?

For inference (not training): is there an example of a saved model being loaded onto multiple GPUs (with a single CPU), where the CPU splits the load across the GPUs instead of using only one GPU?

  2. If the model is trained on a single-GPU machine, can it be served by TF Serving on a multi-GPU box with all GPUs in use?
@jorgemf

jorgemf commented Feb 8, 2017

TF Serving only executes the graph it loads. If your graph uses multiple GPUs, TF Serving will use them; if your graph uses only one GPU (the most common case), TF Serving will run it on one GPU. Probably your best solution is to build a script that loads your graph once per GPU, adds some CPU-side code to split the batch across the per-GPU copies, and finally exports the whole graph with multi-GPU support.

The only other solution I can think of is to replicate the same graph onto every GPU inside TF Serving itself, which might be possible by modifying the serving code. I don't think that will be supported anytime soon, though, so you would have to do it yourself.
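
A rough sketch of that approach, assuming the TF 1.x API of the era (NUM_GPUS and the tiny inference function below are placeholders for the real model):

import tensorflow as tf  # TF 1.x API

NUM_GPUS = 2  # assumption: two visible GPUs

def inference(x):
    # Stand-in for the real model (e.g. Inception v3); a single dense layer here.
    w = tf.get_variable("w", [299 * 299 * 3, 10])
    return tf.matmul(tf.reshape(x, [-1, 299 * 299 * 3]), w)

images = tf.placeholder(tf.float32, [None, 299, 299, 3], name="images")
splits = tf.split(images, NUM_GPUS, axis=0)  # split the incoming batch on the CPU

towers = []
for i in range(NUM_GPUS):
    # Reuse the same variables so every GPU runs an identical copy of the graph.
    with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
        towers.append(inference(splits[i]))

predictions = tf.concat(towers, axis=0, name="predictions")
# Export `predictions` (e.g. with SavedModelBuilder) and serve the resulting SavedModel.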

@jlertle

jlertle commented Feb 25, 2017

Using the CUDA_VISIBLE_DEVICES environment variable you can pin each Model Server process to a specific GPU. Run each one on a separate port and put a load balancer in front.

If you are handling any significant load, you probably want to run the model server with --enable_batching.
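
A hedged example of what that can look like (model name, paths, and ports are placeholders, and it assumes tensorflow_model_server is on the PATH):

CUDA_VISIBLE_DEVICES=0 tensorflow_model_server --port=9000 --enable_batching \
  --model_name=my_model --model_base_path=/path/to/my_model &
CUDA_VISIBLE_DEVICES=1 tensorflow_model_server --port=9001 --enable_batching \
  --model_name=my_model --model_base_path=/path/to/my_model &

A load balancer (or client-side round-robin) then spreads requests across :9000 and :9001.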

@Immexxx
Author

Immexxx commented Mar 24, 2017

@jlertle: Thanks for the comment. But even with that, on a 4-GPU box you would have to run 4 servers, which does not seem like the most efficient approach. You also have to run them on 4 different ports, which adds routing complexity.

@bhack

bhack commented Apr 23, 2017

@Immexxx I think not as many resources are invested in the Serving and ecosystem repositories as in TF itself. That is understandable, since these are somewhat in conflict with selling managed services.

@Immexxx
Author

Immexxx commented Jul 5, 2017

Guessing I could close this issue based on the above responses, but it would be cool for the system to "intelligently" figure out the load (for example, a graph requiring only one GPU) and scale it based on available resources: if multiple GPUs are available, replicate the graph and serve it.

This could be a flag when launching TF Serving.

In summary, I will leave this open for now, in case someone (including me) is looking for an interesting project. In general, TF's handling of GPUs (grabbing all the memory of all GPUs and 'locking' them up while using only a subset for compute) does not seem like the most elegant approach. A flag for "auto-replicate and serve" would be nice.

@vitalyli
Contributor

vitalyli commented Oct 17, 2017

Let's re-open this; with an 8-GPU machine, it would be really helpful if the server simply alternated requests across the N available GPUs.
For most of us without TPUs, the only way to speed things up is using GPUs. The difference can be 5x on GPUs.

@zacharynevin

+1. This would be incredibly useful. Currently I have a model that I am trying to run on a p2.16xlarge instance, but it only utilizes one of the GPUs. It would be great if the batch could be split automatically across all the GPUs.

@zacharynevin

In the end I simply used client-side load balancing and ran a separate TensorFlow Serving server for each GPU.

@gawinghe

gawinghe commented Nov 3, 2017

@zacharynevin Me too. TensorFlow Serving wastes the additional GPUs on a multi-GPU machine.

@mattfeury

Another +1 here for exactly the use case above. Running this on a p2 instance with extra GPUs would be an incredibly easy way to scale up, but right now it requires an instance per GPU.

@deadeyegoodwin

A natural way to do this would be to use the Session option visible_device_list to create a session for each GPU and then serve the same model on each GPU.

Unfortunately, due to a significant TensorFlow issue (described in tensorflow/tensorflow#8136 and many related bugs), it seems that you cannot have multiple Sessions in the same process with different visible_device_list settings. There doesn't seem to be any traction toward getting it fixed, as all the related bugs have been closed without any code changes.
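
For reference, the Session option being discussed looks like this in the TF 1.x Python API (a minimal sketch only; per the comment above, creating multiple Sessions with different settings in one process currently runs into the linked issue):

import tensorflow as tf  # TF 1.x API

config = tf.ConfigProto()
config.gpu_options.visible_device_list = "1"  # this Session sees only physical GPU 1 (as /gpu:0)
sess = tf.Session(config=config)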

@rundembear

I placed a comment on one of the many closed tickets referred to by @deadeyegoodwin. The comments closing those tickets speak as if the functionality must be this way. Yet this works absolutely fine in TensorFlow 1.3! We have an app that calls list_local_devices() to get the list of available GPUs. We then create a Session per GPU, each running the same graph, and use visible_device_list so that each Session sees the GPU assigned to it as GPU 0 (see the sketch below). We have this running in live production code with zero problems.
I just started trying to upgrade to TensorFlow 1.8 and ran smack into this problem. So I feel more explanation is needed beyond saying that this doesn't work: it used to work, and it got broken at some point. Please return it to working, or explain what changed between 1.3 and 1.5 that necessitated this change. I have an idea for a workaround which I am going to try; I will post it if it succeeds.
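
A minimal sketch of that pattern, assuming the TF 1.x API (the tiny identity graph below is only a stand-in for the real model); per the comments here it reportedly works on TF 1.3 but crashes on newer versions with the error quoted below:

import tensorflow as tf  # TF 1.x API
from tensorflow.python.client import device_lib

gpus = [d for d in device_lib.list_local_devices() if d.device_type == "GPU"]

sessions = []
for i, _ in enumerate(gpus):
    g = tf.Graph()
    with g.as_default(), tf.device("/gpu:0"):  # inside this Session, /gpu:0 is physical GPU i
        x = tf.placeholder(tf.float32, [None, 4], name="x")
        y = tf.identity(x, name="y")  # stand-in for the real inference graph
    config = tf.ConfigProto(allow_soft_placement=True)
    config.gpu_options.visible_device_list = str(i)  # each Session sees only its own GPU
    sessions.append(tf.Session(graph=g, config=config))

# Requests can then be dispatched round-robin across `sessions`.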

@rundembear

The problem is NOT with list_local_devices!! I ran my code once, got the result we use, which is just the string description. Then I commented out the call to list_local_devices and I still get the same crash with the same error:

F tensorflow/core/common_runtime/gpu/gpu_id_manager.cc:45] Check failed: cuda_gpu_id.value() == result.first->second (1 vs. 0)Mapping the same TfGpuId to a different CUDA GPU id. TfGpuId: 0 Existing mapped CUDA GPU id: 0 CUDA GPU id being tried to map to: 1

I am going to try some earlier versions of Tensorflow to see if I can identify where this went bad.

@t27

t27 commented May 29, 2018

This issue still appears on the latest versions.

I'm using an AWS g3.8xlarge instance which has 2 GPUs.

TF Serving is able to detect both GPUs and initialise them, but while running the model it only uses one GPU to its maximum.

We are on version 1.7; even though the client sends up to 32 requests in parallel, the model server only uses the first GPU (see the nvidia-smi screenshot: 06_09_21).

Seeing that this ticket has been open for quite some time, is the external load-balancer solution (that @zacharynevin suggested) the only option at the moment?

@rickragv @cyberwillis

@qiaohaijun

Mapping the same TfGpuId to a different CUDA GPU id. TfGpuId: 0 Existing mapped CUDA GPU id: 0 CUDA GPU id being tried to map to: 1

I hit this error when using multiple GPUs with TF 1.6.

@sugartom

@jlertle @zacharynevin, hi, I would like to ask what commands you used to run a separate TensorFlow Serving server for each GPU.

I have a machine with two 1080 Tis. My TF Serving is able to correctly identify both GPUs when the CUDA_VISIBLE_DEVICES flag is not set. I was trying to run one TF Serving server per GPU, so I opened two terminals and ran the following two commands:

CUDA_VISIBLE_DEVICES=0 bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=inception --model_base_path=/path/to/inception_model

CUDA_VISIBLE_DEVICES=1 bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9001 --model_name=mnist --model_base_path=/path/to/mnist_model

The first command ran fine, but the second one failed with the error:

terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
[1] 4021 abort (core dumped) CUDA_VISIBLE_DEVICES=1 --port=9001 --model_name=mnist

Any ideas or suggestions? Thanks in advance!

@zacharynevin

Hi @sugartom, can you post the output of nvidia-smi?

@sugartom

sugartom commented Jul 28, 2018

@zacharynevin , thanks for your reply, my output of nvidia-smi (without TF-Serving running) is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.59                 Driver Version: 390.59                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
| 23%   31C    P8    15W / 250W |    334MiB / 11176MiB |     28%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   34C    P8    15W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1089    G   /usr/lib/xorg/Xorg                             225MiB |
|    0      1892    G   compiz                                         103MiB |
|    0      5429    G   /usr/lib/firefox/firefox                         2MiB |
+-----------------------------------------------------------------------------+

Please let me know if you need any more information. Thanks!

@jlertle

jlertle commented Jul 28, 2018

@sugartom the commands and output of nvidia-smi look good. I'm not sure what the issue could be at this time

@sugartom

@jlertle , thanks for your reply. May I ask what's your TF version? I am using v1.2...

@jlertle

jlertle commented Jul 28, 2018

I'm currently on v1.6 but have used this method since ~v0.8.

Have you tried running only on the second GPU?

Perhaps try this:

export CUDA_VISIBLE_DEVICES=1 && bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9001 --model_name=mnist --model_base_path=/path/to/mnist_model

@sugartom

Yes, I already tried that. My TF Serving runs perfectly on either gpu:0 or gpu:1 alone, and that's why it confused me so much that running two TF Serving instances didn't work...

@Immexxx
Author

Immexxx commented Jul 29, 2018

@sugartom

  1. It is certainly possible to run two instances (at least on a P100 and a V100).
  2. Can you try swapping the models to see if there is some issue with a particular model, e.g. running mnist on GPU 0, to rule out model/model-environment issues?
  3. If that does not work, try swapping the ports / using different ports.
  4. Do you see this error when starting TF Serving, or when processing an image through it? If it is the former and you are still seeing the error, try setting the GPU memory fraction to see if that helps (see the example after this list).
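
For item 4, a hedged example of capping per-process GPU memory (this assumes a model server build that exposes a --per_process_gpu_memory_fraction flag; on older builds the equivalent may require editing main.cc, as noted in the reply below):

CUDA_VISIBLE_DEVICES=1 bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server \
  --port=9001 --per_process_gpu_memory_fraction=0.4 \
  --model_name=mnist --model_base_path=/path/to/mnist_model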

@sugartom

@Immexxx , hi, thanks for your reply!
For your suggestions:

  1. I swapped my models: my exported Inception runs on either gpu:0 or gpu:1, and my exported MNIST also runs on either gpu:0 or gpu:1.

  2. Tested on a few different ports, still no luck...

  3. I see this error message when starting the TF Serving server. The first server always starts normally (log msg: I tensorflow_serving/model_servers/main.cc:343] Running ModelServer at 0.0.0.0:9100 ...) on its GPU (either gpu:0 or gpu:1), but the second server never finishes starting on its GPU (either gpu:1 or gpu:0). So I haven't reached the image-processing step yet, since the second server always reports the error.
    I also tried to limit the GPU memory as you suggested, using set_per_process_gpu_memory_fraction in main.cc. I was able to limit the GPU memory (according to nvidia-smi, both GPUs sit at around 6117MiB / 11176MiB when I set a fraction of 0.5). But running two separate TF Serving servers still leads to the second server reporting the Resource temporarily unavailable error...

Anyway, thanks again for your suggestions :-)

@Harshini-Gadige

Is this still an issue?

@wydwww

wydwww commented Nov 18, 2018

@Immexxx @jorgemf

If your graph uses multiple GPUs, TF Serving will use them; if your graph uses only one GPU (the most common case), TF Serving will run it on one GPU.

Do you mean that when a model is trained with model parallelism, it can then be served with multiple GPUs?
One major motivation for model parallelism is that the model is too large to fit on one GPU. If a model is trained with 1 GPU, for example, is it still necessary to use multiple GPUs in serving, compared with the current load-balancer approach?

@wydwww

wydwww commented Dec 8, 2018

@Immexxx
Regarding your latest post:

Currently, TF Serving does not seem to be able to exploit the compute on all GPU nodes if the graph is set up to run on only one GPU.

According to Figure 4 of an ICML 2017 paper about device placement, using 2 or 4 GPUs only slightly reduces latency (from 0.1 to ~0.07) compared with a single GPU in model-parallel training.

Is the goal of using multiple GPUs for one request to reduce that minimum latency?

Thanks

@srikanthcognitifAI

The TensorRT Inference Server GitHub page mentions "Multi-GPU support. TRTIS can distribute inferencing across all system GPUs." Is this already possible with TF Serving?

@xuzheyuan624

xuzheyuan624 commented Apr 3, 2019

How can I run TensorFlow Serving with Docker on a specific GPU? I don't want to take up all the GPUs when running TF Serving. Does anybody know how? In addition, my model is written with tf.keras.

@aaroey
Member

aaroey commented Apr 3, 2019

@xuzheyuan624 could you try to set environment variable CUDA_VISIBLE_DEVICES and see if it works?

@xuzheyuan624

xuzheyuan624 commented Apr 5, 2019

@aaroey I tried like this:
#!/bin/bash
export CUDA_VISIBLE_DEVICES=5 && \
docker run --runtime=nvidia -p 8500:8500 \
  --mount type=bind,source=/home/xzy/keras-yolov3/yolov3,target=/models/yolov3 \
  -e MODEL_NAME=yolov3 -t tensorflow/serving:1.12.0-gpu &
but it didn't work.

@t27

t27 commented Apr 5, 2019

You need to use the nvidia runtime for docker, and export the NVIDIA_VISIBLE_DEVICES environment variable inside the container when you run it

Sample command

docker run  --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -p 8501:8501 \
  --mount type=bind,source=/path/to/my_model/,target=/models/my_model \
  -e MODEL_NAME=my_model -t tensorflow/serving:latest-gpu

If you use a config file for TensorFlow Serving, you can use the following command when running the Docker container:

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -p 8500:8500 -p 8501:8501 \
  --mount type=bind,source=/path/to/my_model/,target=/models/my_model \
  --mount type=bind,source=/path/to/my/models.config,target=/models/models.config \
  -t tensorflow/serving:latest-gpu --model_config_file=/models/models.config

Source: https://github.com/tensorflow/serving/blob/master/tensorflow_serving/g3doc/docker.md#serving-with-docker-using-your-gpu

@t27

t27 commented Apr 5, 2019

Until TensorFlow Serving incorporates multi-GPU load balancing, the best approach to multi-GPU inference with TensorFlow Serving is to run a Dockerised TF Serving container per GPU (#311 (comment)) and then write your own load-balancing logic, which keeps a list of all the URL/port-number combinations and dispatches requests accordingly (see the sketch below).
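
A minimal sketch of that client-side dispatch over the REST API (model name and host ports are placeholders; it assumes one container per GPU, each mapping its REST port to a distinct host port):

import itertools
import requests

# One TF Serving container per GPU, each exposing its REST port on a different host port.
ENDPOINTS = itertools.cycle([
    "http://localhost:8501/v1/models/my_model:predict",
    "http://localhost:8502/v1/models/my_model:predict",
])

def predict(instances):
    # Round-robin each request across the per-GPU servers.
    url = next(ENDPOINTS)
    resp = requests.post(url, json={"instances": instances})
    resp.raise_for_status()
    return resp.json()["predictions"]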

@xuzheyuan624

@t27 thanks! It's very useful!

@troycheng

troycheng commented Apr 8, 2019

Has anyone seen throughput decrease when load balancing over multiple TF Serving instances? With only one instance, batching (after some tuning) can reach 500 qps, but with 8 instances it can't reach 500 * 8 = 4000 qps; even worse, it can't even reach 500 qps. I know batching is a main cause, but so far I don't know why, and I haven't found a better way to make full use of multiple GPU devices.

@372046933

@troycheng
I ran into the same problem. See kubernetes/kubernetes#87277 and kubernetes/kubernetes#88986. Can you paste your batching_parameters_file and the CPU limit on each instance? Pay special attention to num_batch_threads: if it is bigger than the available CPUs, most of the resources end up busy-waiting on a condition_variable (see https://github.com/tensorflow/tensorflow/blob/v2.1.0/tensorflow/core/kernels/batching_util/shared_batch_scheduler.h#L497). We have mostly fixed the scale-up problem by tuning the batching parameters, but scaling is still not linear (about 0.85 * NUM_INSTANCES). I changed the wait time and am now recompiling Serving to see if there is any improvement.
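
For reference, a batching_parameters_file is a text protobuf along these lines (the values here are purely illustrative):

max_batch_size { value: 32 }
batch_timeout_micros { value: 2000 }
max_enqueued_batches { value: 100 }
num_batch_threads { value: 8 }  # keep this at or below the CPUs actually available to the instance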

@tobegit3hub
Contributor

If we only serve one model with one model version on a multi-GPU server, we have implemented functionality that automatically starts multiple Sessions to load the model, each edited to bind to a different GPU device. This can utilize all the GPU devices for one model and rebalance inference requests with simple strategies like round-robin.

However, the default servable implementation in TensorFlow Serving manages only one Session, using the TensorFlow C API to load the SavedModel. We would have to implement a new Servable or SourceAdapter to support this.

There are also open questions around multiple models or multiple model versions if we allocate all GPU resources to one model version; this is related to the usage and design of this project. We are willing to contribute what we have done once there is a clear design for multi-GPU support.

@Uncle-Yuanl

@tobegit3hub

If we only serve one model with one model version on a multi-GPU server, we have implemented functionality that automatically starts multiple Sessions to load the model, each edited to bind to a different GPU device. This can utilize all the GPU devices for one model and rebalance inference requests with simple strategies like round-robin.

Thank you for the information. Is there already a demo or related code for this in this project?
Or can you explain the details of the method above?

Thank you very much!

@UsharaniPagadala UsharaniPagadala self-assigned this Oct 21, 2021
@UsharaniPagadala

UsharaniPagadala commented Oct 21, 2021

@Immexxx

Closing this issue; please feel free to reopen if it still exists. Thanks.
Also, refer to this documentation; I hope it helps.

@Antonio-hi

This is still an issue

@raminmohammadi

raminmohammadi commented Jan 13, 2023

I am facing the same problem on TF 2.11: I have two GPUs, but it only picks one during serving.
