Serving multiple models on GPU yields incorrect results under load #335
Serving is built for GPU and deployed with multiple models using the model_config_file command-line option. Many of the responses were incorrect when two models were active and being requested at the same time. This initially appeared to be an 'under-load' problem, but it now looks more like a 'concurrent-request' problem. The current theory is that GPU memory is mismanaged when multiple models are loaded. Since it is very difficult to land two requests on the server at exactly the same time, the problem was only discovered when sending many requests for separate models simultaneously.
This issue was also reproduced on an AWS server.
Code is on master, commit
Once built, the server is run with the model_config_file option pointing at ModelProtoConfigTest.txt.
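The exact command isn't captured in this excerpt; assuming a standard Bazel build of tensorflow_model_server, an invocation along these lines would match the setup described (the port and file path are placeholders, not the reporter's actual values):

bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server \
    --port=9000 \
    --model_config_file=/path/to/ModelProtoConfigTest.txt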
Note, the issue occurs whether batching is enabled or not.
Contents of ModelProtoConfigTest.txt:
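The actual file contents are not preserved in this excerpt; a typical two-model config for this setup would look roughly like the following sketch (the model names match the tests described later in the thread, the base paths are assumptions):

model_config_list: {
  config: {
    name: "TestA",
    base_path: "/serving/models/TestA",
    model_platform: "tensorflow"
  },
  config: {
    name: "TestB",
    base_path: "/serving/models/TestB",
    model_platform: "tensorflow"
  }
}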
The model is a simple multiplication model built just to test this issue. The graph is:
input = tf.placeholder(tf.float32, shape=[1, 1])
matrix1 = tf.Variable(tf.truncated_normal([1, 1024]))
matrix2 = tf.Variable(tf.truncated_normal([1024, 1024]))
matrix3 = tf.Variable(tf.truncated_normal([1024, 1024]))
matrix4 = tf.Variable(tf.truncated_normal([1024, 1]))
matmul1 = tf.matmul(input, matrix1)
matmul2 = tf.matmul(matmul1, matrix2)
matmul3 = tf.matmul(matmul2, matrix3)
matmul4 = tf.matmul(matmul3, matrix4)
Note that each time this model is exported, the variables matrix1-4 are randomly initialized with tf.truncated_normal, so every export is expected to have different weights.
The saved model is created following the steps outlined in the tutorial for exporting models.
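As a reference point, a minimal export in the spirit of that tutorial might look like the sketch below. It uses the TF 1.x SavedModelBuilder API; the export path, signature keys, and the choice of matmul4 as the serving output are assumptions, and `input` and `matmul4` refer to the graph defined above.

import tensorflow as tf

# `input` and `matmul4` come from the multiplication graph defined above.
export_path = "/serving/models/TestA/1"  # assumed layout: <base_path>/<version>

with tf.Session() as sess:
    # Variables are re-initialized on every export, so each version gets new random weights.
    sess.run(tf.global_variables_initializer())

    builder = tf.saved_model.builder.SavedModelBuilder(export_path)
    signature = tf.saved_model.signature_def_utils.build_signature_def(
        inputs={"input": tf.saved_model.utils.build_tensor_info(input)},
        outputs={"output": tf.saved_model.utils.build_tensor_info(matmul4)},
        method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME)
    builder.add_meta_graph_and_variables(
        sess, [tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature})
    builder.save()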
Once this model is exported twice, the two separate models are loaded and the server is run. When a client is run against a single model at a time, all returned answers are consistent. If two clients request different models at the same time, the answers are wrong. For this particular model, if the expected answer is -5,992.991, the returned answer is substantially different, e.g. 3,986.742.
The clients test by setting up n request tasks and issuing them all as fast as possible, to mimic production serving load.
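The actual client code isn't included above; a concurrent test client in that spirit could be sketched as follows, assuming the beta gRPC stubs from the tensorflow_serving.apis package of that era, a server on localhost:9000, and the model and signature names used in this thread:

from multiprocessing.pool import ThreadPool

import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2

channel = implementations.insecure_channel("localhost", 9000)
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

def query(model_name):
    # Send one Predict request for the given model and return the scalar output.
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.inputs["input"].CopyFrom(
        tf.contrib.util.make_tensor_proto([[1.0]], dtype=tf.float32, shape=[1, 1]))
    result = stub.Predict(request, 10.0)  # 10-second deadline
    return result.outputs["output"].float_val[0]

# Hammer both models at once. With a fixed input, each model should always
# return the same value, so any spread in the per-model result sets exposes the bug.
pool = ThreadPool(16)
answers = pool.map(query, ["TestA", "TestB"] * 100)
print("TestA results:", set(answers[0::2]))
print("TestB results:", set(answers[1::2]))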
Any help identifying where the issue is in this setup or locating a bug would be very helpful!
+1 for the incorrect results on GPU with concurrent requests.
Thanks @bnoodle for creating the simple test case. I have the same issue, but in a more complicated setup: I have two saved-model sessions, one that decodes a JPEG and a second that takes the decoded tensor as its input. Multiple client calls give inconsistent results. If Serving is built for CPU, the results are always consistent. It looks like a GPU memory handling issue. cc @chrisolston
I've updated to the latest commit
+1 thanks @bnoodle for the test case. There are two possibilities that I can think of:
It sounds like the tests thus far exercise both 1 and 2 simultaneously. One option, both as a possible mitigation and to test a theory, would be to use the BatchingParameters config proto to enable batching, and set num_batch_threads to one (1). This configures batching with exactly one thread shared across all models, causing requests on the GPU to be serialized (one model at a time). You should also set batch_timeout_micros=0 (no delay, at least for testing) and max_batch_size to something reasonable (it depends on your outer tensor dimension; typically 100-1000). If this approach works for you, you can then tune the batch size and timeout to trade off latency and throughput. Either way, please report back. Thanks!
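For concreteness, a BatchingParameters text proto along those lines might look like this (the values follow the suggestion above, with an illustrative max_batch_size; how it is wired in, e.g. via an --enable_batching flag and a batching parameters file, depends on the model server build and should be treated as an assumption):

max_batch_size { value: 128 }
batch_timeout_micros { value: 0 }
num_batch_threads { value: 1 }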
@nfiedel I can confirm that both points one and two are existing issues, if loading a model under stress is considered similar enough.
Testing 1: Start the server with version 1 of both TestA and TestB. Start hammering the server with many TestA requests and aspire version 2 of TestB. Once that is loaded, stop the TestA requests. TestB requests then permanently give consistently incorrect results, while TestA results remain consistently correct.
Testing 2: I'm unsure how to add an entirely new model to the server once it is already running, but having no versions of TestB present on startup causes the server to hang after starting TestA. After trying that, I still see these issues.
As @kinhunt mentioned, I'm unable to get the batching parameters to help in either case. Side note: this issue occurs whether batching is enabled or not.
Update: After merging the latest master commit
Side note: I am unable to get the branch to build as-is. Before building, I have to comment out the line referencing nccl_kernels (line 87 as of posting) inside the TensorFlow contrib BUILD file. This seems to be a known issue with a known workaround.
Sorry to comment on a closed issue. I just wanted to make sure that a single instance of TFS, with a model_config_file containing the definitions of multiple models, is able to invoke all of these models in parallel on the same GPU (assuming that the batching conditions are met and that enough threads are available).