Batching has nearly zero effect. #1312

Closed
farzaa opened this issue Apr 12, 2019 · 4 comments

Comments

@farzaa farzaa commented Apr 12, 2019

Hello! I am using batching and am running tf-serving via the Docker container tensorflow/serving:1.13.0-gpu.

My batch.config file looks like this:

max_batch_size { value: 32 }
batch_timeout_micros { value: 0 }
num_batch_threads { value: 64 }
allowed_batch_sizes: 1
allowed_batch_sizes: 2
allowed_batch_sizes: 8
allowed_batch_sizes: 32
max_enqueued_batches { value: 100000000 }

And I run everything with tensorflow_model_server --model_base_path=/models/object-detect --rest_api_port=8501 --port=8081 --enable_batching --batching_parameters_file=batch.config.

The GPU is a Tesla P100, the system has 8 CPU cores, and I am running TF 1.13.1.

When I send 1000 concurrent requests to the server, it takes around 35 seconds without batching. With batching, it takes nearly the same amount of time, about 34.5 seconds.

I know that the batch.config file needs to be fine-tuned by hand, and I have experimented with the numbers quite a bit, but nothing seems to actually affect runtimes.

I saw some other posts mention that building tf-serving from source fixes the issue, but it has not for me.

Any advice would be great!

@troycheng troycheng commented Apr 16, 2019

I also ran into this problem and did a lot of testing. I now get a real benefit from batching: throughput went from under 200 images per GPU to nearly 500 images per GPU.

I guess you are probably sending 1 image per request (or some other kind of data). If so, setting batch_timeout_micros to 0 means the server will not wait for other requests to form a batch, so it behaves exactly the same as no batching.

You can set batch_timeout_micros to a few milliseconds, e.g. batch_timeout_micros { value: 5000 } (wait at most 5 ms to merge later requests into a batch), and then fine-tune the other parameters.
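For illustration only, here is a minimal sketch of the original batch.config with that change applied. The thread only recommends raising the timeout; the thread count and queue bound below are assumptions to tune from, not values validated here:

max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }  # wait up to 5 ms for a batch to fill
num_batch_threads { value: 8 }        # assumption: roughly one per CPU core; 64 threads on 8 cores is likely excessive
allowed_batch_sizes: 1
allowed_batch_sizes: 2
allowed_batch_sizes: 8
allowed_batch_sizes: 32
max_enqueued_batches { value: 1000 }  # assumption: a bounded queue instead of 100000000

Note that when allowed_batch_sizes is specified, the final entry must equal max_batch_size, as in the original config.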

To fully use the GPU, you can form the batch on the client side, which means sending 16 or 32 images per request. That is much more efficient than forming the batch on the server side.
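As a rough Python sketch of client-side batching over the REST API: the model name "object-detect", the input tensor shape, and the use of raw pixel lists are illustrative assumptions, and the actual request format depends on the model's serving signature. Only the port matches the --rest_api_port=8501 flag from the original command.

import json
import urllib.request
import numpy as np

# Hypothetical endpoint; the model name is an assumption based on --model_base_path.
URL = "http://localhost:8501/v1/models/object-detect:predict"

# Dummy batch of 32 images; real code would decode actual images in whatever
# shape the model's serving signature expects.
images = np.random.randint(0, 255, size=(32, 300, 300, 3), dtype=np.uint8)

# One request carrying all 32 inputs, so the server receives a ready-made batch
# instead of having to merge 32 separate single-image requests.
body = json.dumps({"instances": images.tolist()}).encode("utf-8")
req = urllib.request.Request(URL, data=body,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    predictions = json.loads(resp.read())["predictions"]
print(len(predictions))  # expect one prediction per input image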

There is also a related post where you may find some help.

@Harshini-Gadige Harshini-Gadige commented Apr 16, 2019

For CPU, one can start with batch_timeout_micros set to 0 and then experiment with values in the 1-10 millisecond (1000-10000 microsecond) range.

Since your scenario involves a GPU, please try the approach below.

  1. Temporarily set batch_timeout_micros to infinity while you tune max_batch_size to achieve the desired balance between throughput and average latency. Consider values in the hundreds or thousands (a configuration sketch for this tuning phase follows this list).
  2. For online serving, tune batch_timeout_micros to rein in tail latency. The idea is that batches normally get filled to max_batch_size, but occasionally when there is a lapse in incoming requests, to avoid introducing a latency spike it makes sense to process whatever's in the queue even if it represents an underfull batch. The best value for batch_timeout_micros is typically a few milliseconds, and depends on your context and goals. Zero is a value to consider; it works well for some workloads. (For bulk processing jobs, choose a large value, perhaps a few seconds, to ensure good throughput but not wait too long for the final (and likely underfull) batch.)
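Purely as an illustration of step 1, the batching parameters during the tuning phase might look like the sketch below; the specific numbers are placeholders to sweep, not recommendations from this thread:

max_batch_size { value: 64 }              # sweep this while measuring throughput and average latency
batch_timeout_micros { value: 60000000 }  # effectively "infinite" (60 s) so batches always fill during tuning
num_batch_threads { value: 8 }            # assumption: one per CPU core
max_enqueued_batches { value: 1000 }

Once a good max_batch_size is found, batch_timeout_micros is brought back down per step 2, typically to a few milliseconds or to 0 for some workloads.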
@Harshini-Gadige Harshini-Gadige commented Apr 19, 2019

Closing this issue as it has been in "awaiting response" status for 3 days. Feel free to add your comments and we will reopen it.

@thinhlx1993 thinhlx1993 commented May 10, 2019


Did you find any way to resolve this problem?

I have the same problem.

I have tested my model, which predicts image embeddings. It takes only 0.12s for 50 images with batch size 50. But when I convert the Keras model to a TensorFlow SavedModel and serve it with TensorFlow Serving, it takes 3s to calculate the embeddings.

My batch.config file:

max_batch_size { value: 128 }
batch_timeout_micros { value: 3000 }
max_enqueued_batches { value: 1000000 }
num_batch_threads { value: 8 }
