
Serving multiple models on GPU yields incorrect results under load #335

Closed
bparker-github opened this issue Feb 24, 2017 · 11 comments
Labels
type:performance Performance Issue

Comments

@bparker-github

Serving is built for GPU and deployed with multiple models using the model_config_file command-line option. Many of the responses were incorrect when two models were active and being requested at the same time. This initially appeared to be an under-load problem, but it looks more like a concurrent-request problem. The current theory is that the GPU mismanages memory when multiple models are loaded. Since it is very difficult to land two requests on the server at exactly the same time, the problem was only really discovered when sending many requests for separate models at the same time.

Hardware info:

GPU: Titan X (Pascal), driver 375.26
CUDA: 8.0
cuDNN: 5.1.5
Bazel: 0.4.4

This issue was also reproduced on an AWS server:

GPU: Tesla K80, driver 367.57
CUDA: 8.0
cuDNN: 5.1.5
Bazel: 0.4.2

The code is on master at commit f100a35.
The GPU build process comes from issue #186:

  1. Change the first line in tools/bazel.rc from
     build:cuda --crosstool_top=@org_tensorflow//third_party/gpus/crosstool
     to
     build:cuda --crosstool_top=@local_config_cuda//crosstool:toolchain
  2. Run bazel clean --expunge && export TF_NEED_CUDA=1.
  3. Configure TensorFlow using python3 and defaults otherwise (with GPU options enabled).
  4. Build with bazel build -c opt --config=cuda --verbose_failures //tensorflow_serving/model_servers:tensorflow_model_server

Once built, the server is run using

bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_config_file=ModelProtoConfigTest.txt --file_system_poll_wait_seconds=600

Note: the issue occurs whether or not batching is enabled.

Contents of ModelProtoConfigTest.txt:

model_config_list: {

  config: {
    name: "TestA",
    base_path: "../Data/Test/A",
    model_platform: "tensorflow"
  },
  config: {
    name: "TestB",
    base_path: "../Data/Test/B",
    model_platform: "tensorflow"
  },

}

The model is a simple matrix-multiplication model built just to test this issue. The input tensor is named input, and the output is matmul4.

import tensorflow as tf

# Single scalar input, shaped [1, 1].
input = tf.placeholder(tf.float32, shape=[1, 1])

# Randomly initialized weights; re-drawn on every export.
matrix1 = tf.Variable(tf.truncated_normal([1, 1024]))
matrix2 = tf.Variable(tf.truncated_normal([1024, 1024]))
matrix3 = tf.Variable(tf.truncated_normal([1024, 1024]))
matrix4 = tf.Variable(tf.truncated_normal([1024, 1]))

# Chain of matmuls; matmul4 is the served output.
matmul1 = tf.matmul(input, matrix1)
matmul2 = tf.matmul(matmul1, matrix2)
matmul3 = tf.matmul(matmul2, matrix3)
matmul4 = tf.matmul(matmul3, matrix4)

Note: each time this model is exported, the variables matrix1-4 are re-initialized with tf.truncated_normal, so each export is expected to have different weights.

The saved model is created following the steps outlined in the tutorial for exporting models.
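
For reference, the export looks roughly like the sketch below. This is not the exact tutorial code; the SavedModel builder API, the signature key, and the base_path/<version> directory layout are assumptions for illustration, and it builds on the graph defined above.

# Minimal export sketch (assumptions noted above); run after building the graph.
import tensorflow as tf

export_dir = "../Data/Test/A/1"  # assumed layout: base_path/<version number>

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
    signature = tf.saved_model.signature_def_utils.build_signature_def(
        inputs={"input": tf.saved_model.utils.build_tensor_info(input)},
        outputs={"matmul4": tf.saved_model.utils.build_tensor_info(matmul4)},
        method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME)
    builder.add_meta_graph_and_variables(
        sess,
        [tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
                signature})
    builder.save()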

Once this model has been exported twice, the separate models are loaded and the server is run. When a client runs against a single model at a time, all returned answers are consistent. If two clients request different models at the same time, the answers are wrong. For this particular model, if the expected answer is -5,992.991, the returned answer is substantially different, e.g. 3,986.742.

The clients test by setting up n tasks and running them all as fast as possible, to mimic production serving load.
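
The concurrent-request pattern is roughly the sketch below. This is not the original client code; it assumes the beta gRPC stubs from this era, a server on localhost:9000, and the "input" tensor key used above.

# Minimal sketch of hammering two models concurrently (assumptions noted above).
import threading

import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2

channel = implementations.insecure_channel("localhost", 9000)
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

def hit(model_name):
    # Build a request against one model and print the response.
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.inputs["input"].CopyFrom(
        tf.contrib.util.make_tensor_proto([[1.0]], shape=[1, 1]))
    print(model_name, stub.Predict(request, 10.0))  # 10-second deadline

# Fire many requests for both models at the same time.
threads = [threading.Thread(target=hit, args=(name,))
           for name in ["TestA", "TestB"] * 50]
for t in threads:
    t.start()
for t in threads:
    t.join()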

Any help identifying where the issue is in this setup or locating a bug would be very helpful!

@dongwang218

+1 for the incorrect results on GPU with concurrent requests.

Thanks @BNoodle for creating the simple test case. I have the same issue but in a more complicated setup: I have two saved-model sessions, one for decoding JPEGs and a second that takes the decoded tensor and runs it. Multiple client calls give inconsistent results. If Serving is built for CPU, the results are always consistent. It looks like a GPU memory-handling issue. cc @chrisolston

@dongwang218

To clarify, there is no inconsistency issue on GPU with many concurrent calls if there is only one bundle. The issue only appears with multiple bundles.

@fxding

fxding commented Mar 22, 2017

Same issue here.
I get the correct result when there is only one model, but different results when running with two models.

@bparker-github
Author

I've updated to the latest commit 05ebb72 and can confirm this issue is still occurring. Any help or update would be appreciated. If this is a base TensorFlow issue I can post to that issue tracker as well, but I'm currently unsure of the root cause/location of this bug.

@nfiedel
Contributor

nfiedel commented Apr 14, 2017

+1 thanks @BNoodle for the test case. There are two possibilities that I can think of:
(1) An issue with GPUs loading multiple models (sessions) concurrently.
(2) An issue with GPUs performing computation on multiple models concurrently.

It sounds like the tests so far exercise both 1 and 2 simultaneously. One option, both as a possible mitigation and to test a theory, would be to use the BatchingParameters config proto to enable batching and set num_batch_threads to one (1). This configures batching with exactly one thread shared across all models, so any requests on the GPU are serialized (one model at a time). You should also set batch_timeout_micros=0 (no delay, at least for testing) and max_batch_size to something reasonable (it depends on your outer tensor dimension, typically 100-1000). If this approach works for you, you can then tune the batch size and timeout to trade off latency and throughput. Either way, please report back. Thanks!
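
For reference, those settings as a BatchingParameters text proto look roughly like the sketch below (field names assumed from the session bundle config of that era; depending on the build, they may instead be passed as the model server command-line flags shown in the next comment):

num_batch_threads { value: 1 }
batch_timeout_micros { value: 0 }
max_batch_size { value: 1000 }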

@kinhunt

kinhunt commented Apr 15, 2017

Same problem, unpredictable results. I tried the suggested options:
--model_config_file=/data/tf/models.conf --use_saved_model=false --batch_timeout_micros=0 --num_batch_threads=1 --enable_batching=true --max_batch_size=1000

The problem is still there.

@bparker-github
Author

@nfiedel I can confirm that both points one and two are real issues, if loading a model under stress counts as similar enough.

Testing 1: Start the server with version 1 of TestA and TestB. Start hammering the server with many TestA requests and aspire version 2 of TestB. Once that's loaded, stop the TestA requests. TestB requests will permanently give consistently incorrect results, while TestA results remain consistently correct.

Testing 2: I'm unsure how to add entirely new models to the server once it's already running, but having no versions in TestB on startup will cause the server to hang after starting TestA. After trying that, I still see these issues.

As @kinhunt mentioned, I'm unable to get the batching parameters to help in either case. Side note: this issue occurs whether batching is enabled or not.

@yves-m

yves-m commented May 17, 2017

Having the same problem here!

@nfiedel Is someone working on this issue?

Thanks.

@kirilg
Contributor

kirilg commented May 18, 2017

Yes, there's a bug in TensorFlow and a proposed fix is out for review. It should be pushed to the TF repo soon and then we'll sync TF Serving to include it as well. I would estimate it should land sometime early next week.

caisq pushed a commit to caisq/tensorflow that referenced this issue May 27, 2017
same stream id to use the same cuda stream objects. This avoids confusing
the per-device memory allocator in ways that cause memory corruption.

Fixes tensorflow/serving#335.

PiperOrigin-RevId: 157258318
@bparker-github
Author

Update: After merging the latest master commit 6c096a4396f1dfca81b245922e3066084a6a43c7 and testing both the basic test above and a more complex test I wrote, I can confirm that I am no longer seeing any incorrect/broken responses. Multiple models seem to be working as intended.

Side note: I am unable to get the branch to build as-is. Before building, I have to comment out the line referencing nccl_kernels (line 87 as of posting) inside the TensorFlow contrib BUILD file. This seems to be a known issue and workaround.

av8ramit pushed a commit to av8ramit/tensorflow that referenced this issue Jun 6, 2017
same stream id to use the same cuda stream objects. This avoids confusing
the per-device memory allocator in ways that cause memory corruption.

Fixes tensorflow/serving#335.

PiperOrigin-RevId: 157258318
@abuvaneswari

Sorry to comment on a closed issue. I just wanted to make sure that a single instance of TFS with a model_config_file containing the definitions of multiple models will be able to invoke all of these models in parallel on the same GPU (assuming the batching conditions are met and enough threads are available).
Are the batching parameters configurable per model? If so, do I configure them in the model_config_file itself or in the batching parameters file?
