Tensorflow (TF) Serving on Multi-GPU box #311

Open
Immexxx opened this issue Feb 7, 2017 · 36 comments



@Immexxx Immexxx commented Feb 7, 2017

The issue is for a single model (not multi-model)

  1. What is the best way to serve requests from TF Serving on a multi-GPU box? TF is able to see all the GPUs but uses only one (the default Inception v3 model, from ~03/2016 I believe). (Is there some configuration in TF Serving that I am not aware of?)

Is this dependent on how the model is loaded and exported?

For inference (not training): is there an example of a saved model being loaded onto multiple GPUs (with a single CPU), where the CPU splits the load among multiple GPUs instead of using only one GPU?

  2. If the model is trained on a single-GPU machine, can it then be used for inference in TF Serving on a multi-GPU box where all GPUs are being used?

@jorgemf jorgemf commented Feb 8, 2017

TF Serving only executes the graph it has loaded. If your graph uses multiple GPUs, TF Serving will use them; if your graph uses only one GPU (the most common case), then it will run on only one GPU in TF Serving. Probably your best solution is to build a script that loads your graph once per GPU, adds some code on the CPU to split the batch data across the per-GPU graphs, and then exports the whole graph with multi-GPU support.

The only other solution I am aware of would be to replicate the same graph on every GPU inside TF Serving itself, which I guess is possible by modifying the serving code. But I don't think that will be supported anytime soon, so you would have to do it yourself.
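
For illustration, here is a minimal sketch of that replication idea in TF 1.x Python. It assumes a hypothetical build_inference_graph() holding your existing single-GPU model code, a NUM_GPUS matching your box, and a batch size divisible by NUM_GPUS; it is a sketch of the pattern, not an official TF Serving recipe.

import tensorflow as tf

NUM_GPUS = 4  # assumption: match the number of GPUs on the box

def build_inference_graph(images):
    # placeholder for your existing single-GPU model definition
    return tf.identity(images, name="logits")

with tf.device("/cpu:0"):
    images = tf.placeholder(tf.float32, [None, 299, 299, 3], name="images")
    # split the incoming batch into one shard per GPU (must divide evenly)
    shards = tf.split(images, NUM_GPUS, axis=0)

tower_outputs = []
for i in range(NUM_GPUS):
    # replicate the model on each GPU, sharing variables across towers
    with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
        tower_outputs.append(build_inference_graph(shards[i]))

with tf.device("/cpu:0"):
    outputs = tf.concat(tower_outputs, axis=0, name="outputs")

# Export images -> outputs as a SavedModel and point TF Serving at it;
# Serving then runs the multi-GPU graph as-is.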


@jlertle jlertle commented Feb 25, 2017

Using the CUDA_VISIBLE_DEVICES environment variable you can pin each Model Server process to a specific GPU. Run each of them on a separate port and put a load balancer in front.

If you are handling any significant load you probably want to run the model server with --enable_batching.
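
As a rough illustration of that setup (not an official recipe), the per-GPU servers can be launched from a small Python wrapper; the binary path, ports, model name and model path below are placeholders:

import os
import subprocess

MODEL_SERVER = "tensorflow_model_server"  # or the bazel-bin/... path
NUM_GPUS = 4
BASE_PORT = 9000

procs = []
for gpu in range(NUM_GPUS):
    # each process sees exactly one GPU
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    procs.append(subprocess.Popen(
        [MODEL_SERVER,
         "--port=%d" % (BASE_PORT + gpu),
         "--enable_batching",
         "--model_name=inception",
         "--model_base_path=/path/to/inception_model"],
        env=env))

for p in procs:
    p.wait()

A load balancer (or the client itself) then spreads requests across ports 9000-9003.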

@Immexxx Immexxx (Author) commented Mar 24, 2017

@jlertle: Thanks for the comment. But even with that, on a 4-GPU box you would have to run 4 servers, which does not seem like the most efficient way to do it. You also have to run them on 4 different ports, which in turn adds routing complexity.


@bhack bhack commented Apr 23, 2017

@Immexxx I think there are not as many resources invested in the Serving and ecosystem repositories compared to TF itself. It is understandable, since these are a little bit in conflict with selling managed services.

@Immexxx Immexxx (Author) commented Jul 5, 2017

Guessing I could close this issue based on the above responses, but it would be cool for the system to "intelligently" figure out the load (for example, detect that a graph requires only one GPU) and scale it based on available resources: if multiple GPUs are available, replicate the graph and serve it.

Could be a flag when launching TF Serving.

In summary, I will leave this open for now, in case someone (including me) is looking for an interesting project. In general, TF's handling of GPUs (grabbing all the memory of all GPUs and 'locking' them up while using only a subset for compute) does not seem like the most elegant way of doing this. A flag for "auto-replicate and serve" would be nice.

@vitalyli vitalyli (Contributor) commented Oct 17, 2017

Let's re-open this. With an 8-GPU machine, it would be really helpful if the server simply alternated requests across the N available GPUs.
For most of us without TPUs, the only way to speed things up is using GPUs; the difference can be 5x.


@zacharynevin zacharynevin commented Oct 22, 2017

+1. This would be incredibly useful. Currently I have a model that I am trying to run on a p2.16xlarge instance, but it's only utilizing one of the GPUs. It would be great if the batch could be split automatically across all the GPUs.


@zacharynevin zacharynevin commented Oct 22, 2017

In the end I simply used client-side load balancing and ran a separate TensorFlow Serving server for each GPU.
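
For anyone looking for a starting point, here is a minimal sketch of that client-side round-robin approach over gRPC. The endpoint list, model name and input tensor name are assumptions, and it requires the tensorflow-serving-api package:

import itertools
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# one TF Serving instance per GPU, each on its own port
ENDPOINTS = ["localhost:9000", "localhost:9001", "localhost:9002", "localhost:9003"]

stubs = [prediction_service_pb2_grpc.PredictionServiceStub(grpc.insecure_channel(ep))
         for ep in ENDPOINTS]
next_stub = itertools.cycle(stubs)  # simple round-robin over the servers

def predict(image_batch, model_name="inception", timeout=10.0):
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.inputs["images"].CopyFrom(tf.make_tensor_proto(image_batch))
    # each call goes to the next server, i.e. the next GPU
    return next(next_stub).Predict(request, timeout)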


@gawinghe gawinghe commented Nov 3, 2017

@zacharynevin Me too. With TensorFlow Serving, multiple GPUs go to waste.


@mattfeury mattfeury commented Jan 18, 2018

Another +1 here for the exact use case above. Running this on a p2 instance with extra GPUs would be an incredibly easy way to scale up, but right now it requires an instance per GPU.


@deadeyegoodwin deadeyegoodwin commented Apr 13, 2018

A natural way to do this would be to use the Session option visible_device_list to create a session for each GPU and then serve the same model on each GPU.

Unfortunately due to a significant TensorFlow issue (described in tensorflow/tensorflow#8136 and many related bugs), it seems that you cannot have multiple Sessions in the same process with different visible_device_list settings. There doesn't seem to be any traction to getting it fixed as all the related bugs have been closed without any code changes.
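
For reference, the per-Session pattern being discussed looks roughly like this in plain TF 1.x (outside of Serving); as noted above, creating several such Sessions with different visible_device_list values in one process is exactly what fails:

import tensorflow as tf

def session_for_gpu(gpu_id, graph):
    # this Session sees only one physical GPU, exposed to it as /gpu:0
    gpu_options = tf.GPUOptions(visible_device_list=str(gpu_id),
                                allow_growth=True)
    return tf.Session(graph=graph, config=tf.ConfigProto(gpu_options=gpu_options))

# sessions = [session_for_gpu(i, my_graph) for i in range(num_gpus)]
# is the multi-Session setup that currently trips tensorflow/tensorflow#8136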


@rundembear rundembear commented May 4, 2018

I placed a comment on one of the many closed tickets referred to by @deadeyegoodwin. The comments closing those tickets speak as if the functionality must be this way. Yet this works absolutely fine in TensorFlow 1.3! We have an app where we call list_local_devices() to get the list of available GPUs. We then create a Session per GPU, each running the same graph, and use visible_device_list to make each Session see the GPU assigned to it as GPU 0. We have this running in live production code with zero problems.
I just started trying to upgrade to TensorFlow 1.8 and ran smack into this problem. So I feel more explanation is needed beyond saying that this doesn't work. It used to work; it got broken at some point. Please return it to working, or explain what changed between 1.3 and 1.5 that necessitated this. I have an idea for a workaround which I am going to try; I will post it if it succeeds.


@rundembear rundembear commented May 4, 2018

The problem is NOT with list_local_devices!! I ran my code once, got the result we use, which is just the string description. Then I commented out the call to list_local_devices and I still get the same crash with the same error:

F tensorflow/core/common_runtime/gpu/gpu_id_manager.cc:45] Check failed: cuda_gpu_id.value() == result.first->second (1 vs. 0)Mapping the same TfGpuId to a different CUDA GPU id. TfGpuId: 0 Existing mapped CUDA GPU id: 0 CUDA GPU id being tried to map to: 1

I am going to try some earlier versions of Tensorflow to see if I can identify where this went bad.


@t27 t27 commented May 29, 2018

This issue still appears on the latest versions.

I'm using an AWS g3.8xlarge instance which has 2 GPUs.

TF Serving is able to detect both GPUs and initialise them, but while running the model it only maxes out 1 GPU.

We are on version 1.7; even though the client sends up to 32 requests in parallel, the model server only uses the first GPU (see the nvidia-smi screenshot).
[nvidia-smi screenshot: only GPU 0 shows utilisation]

Seeing that this ticket has been open for quite some time, is the external load balancer solution (that @zacharynevin suggested) the only solution at the moment?

@rickragv @cyberwillis


@qiaohaijun qiaohaijun commented Jun 6, 2018

When I use multiple GPUs in TF 1.6, I hit the same error:

Mapping the same TfGpuId to a different CUDA GPU id. TfGpuId: 0 Existing mapped CUDA GPU id: 0 CUDA GPU id being tried to map to: 1


@sugartom sugartom commented Jul 28, 2018

@jlertle @zacharynevin, hi, I would like to ask what commands you were using to run a separate TensorFlow Serving server for each GPU.

I have a machine with two 1080 Tis. My TF-Serving is able to correctly identify both GPUs when the CUDA_VISIBLE_DEVICES flag is not set. I was trying to run one TF-Serving server per GPU, so I opened two terminals and ran the following two commands:

CUDA_VISIBLE_DEVICES=0 bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=inception --model_base_path=/path/to/inception_model

CUDA_VISIBLE_DEVICES=1 bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9001 --model_name=mnist --model_base_path=/path/to/mnist_model

The first command runs fine, but the second one fails with:

terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
[1] 4021 abort (core dumped) CUDA_VISIBLE_DEVICES=1 --port=9001 --model_name=mnist

Any ideas or suggestions? Thanks in advance!


@zacharynevin zacharynevin commented Jul 28, 2018

Hi @sugartom, can you post the output of nvidia-smi?


@sugartom sugartom commented Jul 28, 2018

@zacharynevin, thanks for your reply. My output of nvidia-smi (without TF-Serving running) is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.59 Driver Version: 390.59 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 On | N/A |
| 23% 31C P8 15W / 250W | 334MiB / 11176MiB | 28% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
| 23% 34C P8 15W / 250W | 2MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1089 G /usr/lib/xorg/Xorg 225MiB |
| 0 1892 G compiz 103MiB |
| 0 5429 G /usr/lib/firefox/firefox 2MiB |
+-----------------------------------------------------------------------------+

Please let me know if you need any more information. Thanks!


@jlertle jlertle commented Jul 28, 2018

@sugartom the commands and output of nvidia-smi look good. I'm not sure what the issue could be at this time.


@sugartom sugartom commented Jul 28, 2018

@jlertle, thanks for your reply. May I ask what your TF version is? I am using v1.2...


@jlertle jlertle commented Jul 28, 2018

I'm currently on v1.6 but have used this method since ~v0.8.

Have you tried running only on the second GPU?

Perhaps try this:

export CUDA_VISIBLE_DEVICES=1 && bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9001 --model_name=mnist --model_base_path=/path/to/mnist_model


@sugartom sugartom commented Jul 28, 2018

Yes, already tried that. My TF-Serving runs perfectly on either gpu:0 or gpu:1 alone, and that's why it confused me so much that running two TF-Serving instances didn't work...

@Immexxx Immexxx (Author) commented Jul 29, 2018

@sugartom

  1. It is certainly possible to run two instances (at least on a P100 and V100)
  2. Can you try swapping the models to see if there is some issue with the model (for example, running mnist on GPU 0) to rule out model/model-environment issues?
  3. If that does not work, you can try swapping the ports / using different ports.
  4. Do you see this error when you are starting up TF Serving, or when you are trying to process an image through it? If it is the former and you are still seeing the error, you could try setting the gpu_memory_fraction to see if that helps (see the sketch below).
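
For item 4, this is roughly what limiting the memory fraction looks like at the TF session level (plain TF 1.x; how the model server surfaces this option depends on the Serving version, so treat it as a sketch):

import tensorflow as tf

# cap this process at roughly half of each visible GPU's memory,
# so a second server process can share the same card
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.45)

with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
    pass  # load and run the exported graph here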

@sugartom sugartom commented Jul 29, 2018

@Immexxx, hi, thanks for your reply!
For your suggestions:

  1. I swapped my models: my exported inception is able to run on either gpu:0 or gpu:1, and my exported mnist is also able to run on either gpu:0 or gpu:1.

  2. Tested on a few different ports, still no luck...

  3. I saw this error message when starting the TF-Serving server. Namely, the first server always starts normally (log msg: I tensorflow_serving/model_servers/main.cc:343] Running ModelServer at 0.0.0.0:9100 ...) on its gpu (either gpu:0 or gpu:1), but the second server can't finish starting on its gpu (either gpu:1 or gpu:0). So I haven't reached the image processing step yet, since the second server always reports the error.
    And I tried to limit the gpu memory as you suggested, using set_per_process_gpu_memory_fraction in main.cc. I was able to limit the gpu memory (according to nvidia-smi, both GPUs show usage around 6117MiB / 11176MiB if I set a gpu fraction of 0.5). But still, running two separate TF-Serving servers leads to the second server reporting the Resource temporarily unavailable error...

Anyway, thanks again for your suggestions :-)


@Harshini-Gadige Harshini-Gadige commented Oct 25, 2018

Is this still an issue?

@Immexxx Immexxx (Author) commented Oct 26, 2018

Yes, I still think it is (but happy to be corrected).

Currently, TF Serving does not seem to be able to exploit the compute of all GPUs on a node if the graph is set up to run on only one GPU.
Having said that, it depends on the direction of evolution of TF Serving, and if this does not make sense for TF Serving, we can close the issue.

Given the cost of GPUs and the way GPUs are being made available by vendors and cloud providers, this would be nice to have.


@wydwww wydwww commented Nov 18, 2018

@Immexxx @jorgemf

If your graph uses multiple GPUs, TF Serving will use them; if your graph uses only one GPU (the most common case), then it will run on only one GPU in TF Serving.

Do you mean that when a model is trained with model parallelism, it can then be served with multiple GPUs?
One major motivation of model parallelism is that the model is too large to fit in one GPU. If a model is trained with 1 GPU (for example), is it still necessary to use multiple GPUs in serving, compared with the current load-balancer method?


@wydwww wydwww commented Dec 8, 2018

@Immexxx
Regarding your latest post:

Currently, TF Serving does not seem to be able to exploit the compute of all GPUs on a node if the graph is set up to run on only one GPU.

According to Figure 4 of an ICML 2017 paper about device placement, utilising 2 or 4 GPUs only slightly reduces latency (from 0.1 to ~0.07) compared with a single GPU in model-parallel training.

Is this the goal of utilising multiple GPUs for one request, namely to reduce the minimum latency?

Thanks


@srikanthcognitifAI srikanthcognitifAI commented Jan 8, 2019

On the TensorRT Inference Server GitHub page they mention "Multi-GPU support. TRTIS can distribute inferencing across all system GPUs." Is this possible with TF Serving already?


@xuzheyuan624 xuzheyuan624 commented Apr 3, 2019

How can I run TensorFlow Serving with Docker on a specific GPU? I don't want it to take up all GPUs when running TF Serving. Does anybody know? In addition, my model is written with tf.keras.

@aaroey aaroey (Member) commented Apr 3, 2019

@xuzheyuan624 could you try to set environment variable CUDA_VISIBLE_DEVICES and see if it works?


@xuzheyuan624 xuzheyuan624 commented Apr 5, 2019

@aaroey I tried like this:

#!/bin/bash
export CUDA_VISIBLE_DEVICES=5 && \
docker run --runtime=nvidia -p 8500:8500 \
  --mount type=bind,source=/home/xzy/keras-yolov3/yolov3,target=/models/yolov3 \
  -e MODEL_NAME=yolov3 -t tensorflow/serving:1.12.0-gpu &

but it didn't work.


@t27 t27 commented Apr 5, 2019

You need to use the nvidia runtime for docker, and export the NVIDIA_VISIBLE_DEVICES environment variable inside the container when you run it

Sample command

docker run  --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -p 8501:8501 \
  --mount type=bind,source=/path/to/my_model/,target=/models/my_model \
  -e MODEL_NAME=my_model -t tensorflow/serving:latest-gpu

If you use a config file for tensorflow serving, you can use the following command while running the docker

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -p 8500:8500 -p 8501:8501 \
  --mount type=bind,source=/path/to/my_model/,target=/models/my_model \
  --mount type=bind,source=/path/to/my/models.config,target=/models/models.config \
  -t tensorflow/serving:latest-gpu --model_config_file=/models/models.config

Source: https://github.com/tensorflow/serving/blob/master/tensorflow_serving/g3doc/docker.md#serving-with-docker-using-your-gpu


@t27 t27 commented Apr 5, 2019

Until TensorFlow Serving incorporates multi-GPU load balancing, the best approach for multi-GPU inference with TensorFlow Serving is to run one dockerised TensorFlow Serving container per GPU (#311 (comment)) and then write your own load-balancing logic that keeps a list of all the url/port combinations and dispatches requests accordingly.


@xuzheyuan624 xuzheyuan624 commented Apr 7, 2019

@t27 thanks! It's very useful!


@troycheng troycheng commented Apr 8, 2019

Has anyone hit a throughput decrease when load balancing over multiple tf-serving instances? With only one instance, using batching with fine tuning, it can reach 500 qps; but with 8 instances it can't reach 500 * 8 = 4000 qps, and even worse, it can't even reach 500 qps. I suspect batching is the main cause, but so far I don't know why, and I haven't found a better solution to make full use of multiple GPU devices.
