Serving multiple models on GPU yields incorrect results under load #335
Comments
+1 for the incorrect results on GPU with concurrent requests. Thanks @BNoodle for creating the simple test case. I have the same issue but in a more complicated setup: I have two saved model sessions, one for decoding JPEGs and a second that takes the decoded tensor and runs it. Multiple client calls give inconsistent results. If serving is built for CPU, the results are always consistent. It looks like a GPU memory handling issue. cc @chrisolston
To clarify, there is no inconsistency on GPU with many concurrent calls if there is only one bundle. The issue only appears with multiple bundles.
Same issue here.
I've updated to the latest commit
+1, thanks @BNoodle for the test case. There are two possibilities that I can think of. It sounds like the tests thus far include both 1/2 simultaneously.

One option, both as a possible mitigation and to test a theory, would be to use the BatchingParameters config proto to enable batching, and set num_batch_threads to one (1). This configures batching with exactly one thread shared across all models, causing any requests on the GPU to be serialized (one model at a time). You should also set batch_timeout_micros=0 (for no delay, at least for testing) and max_batch_size to something reasonable (depends on your outer tensor dimension, typically 100-1000).

If this approach works for you, you can tune the batch size and timeout to trade off latency and throughput. Either way, please report back. Thanks!
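For reference, a minimal sketch of such a batching configuration as a text proto (the exact field syntax is an assumption and the max_batch_size value is only an example to tune; only the three parameters named above are taken from this comment):

```
num_batch_threads { value: 1 }      # one shared thread, so GPU requests are serialized across models
batch_timeout_micros { value: 0 }   # no delay, at least for testing
max_batch_size { value: 256 }       # depends on your outer tensor dimension (typically 100-1000)
```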
Same problem, unpredictable results. I tried the suggested options; the problem is still there.
@nfiedel I can confirm both point one and two are existing issues, if loading a model under stress is considered similar enough.

Testing 1: Start the server with version 1 of TestA and TestB. Start hammering the server with many TestA requests and aspire version 2 of TestB. Once that's loaded, stop the TestA requests. TestB requests will permanently give consistently incorrect results, while TestA results are consistently correct.

Testing 2: I'm unsure how to add entirely new models to the server once it's already running, but having no versions of TestB on startup will cause the server to hang after starting TestA.

After trying that, I still see these issues. As @kinhunt mentioned, I'm unable to get the batching parameters to help in either case. Side note: this issue occurs whether batching is enabled or not.
Having the same problem here! @nfiedel Is someone working on this issue? Thanks.
Yes, there's a bug in TensorFlow and a proposed fix is out for review. It should be pushed to the TF repo soon and then we'll sync TF Serving to include it as well. I would estimate it should land sometime early next week.
same stream id to use the same cuda stream objects. This avoids confusing the per-device memory allocator in ways that cause memory corruption. Fixes tensorflow/serving#335. PiperOrigin-RevId: 157258318
Update: After merging the latest master commit

Side note: I am unable to get the branch built as-is. Before building, I must comment out the line referencing the nccl_kernels (line 87 as of posting) inside the tensorflow contrib build file. This seems to be a known issue and workaround.
Sorry to comment on a closed issue. I just wanted to make sure that a single instance of TFS with a model_config_file containing the definition of multiple models will be able to achieve parallel invocation of all these models on the same GPU (assuming that the batching conditions are met and that enough threads are available).
Serving is built for GPU and deployed with multiple models using the model_config_file command line option. Many of the responses were incorrect when two models were active and being requested at the same time. This initially appeared to be an 'under-load' problem, but appears to be more of a 'concurrent-request' problem. The current theory is that the GPU mismanages memory for multiple models. Since it is very difficult to land two requests on the server at exactly the same time, it was only really discovered when sending many requests for separate models at the same time.
Hardware info:
This issue was also reproduced on an AWS server,
Code is on master, commit f100a35. The build process for GPU comes from issue 186.
In tools/bazel.rc, change

```
build:cuda --crosstool_top=@org_tensorflow//third_party/gpus/crosstool
```

to

```
build:cuda --crosstool_top=@local_config_cuda//crosstool:toolchain
```

then build with:

```
bazel clean --expunge && export TF_NEED_CUDA=1
bazel build -c opt --config=cuda --verbose_failures //tensorflow_serving/model_servers:tensorflow_model_server
```
Once built, the server is run using the tensorflow_model_server binary with the --model_config_file option.
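A typical invocation (the port number is an assumption; the config file name is the one used in this report) looks something like:

```
tensorflow_model_server --port=9000 --model_config_file=ModelProtoConfigTest.txt
```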
Note, the issue occurs whether batching is enabled or not.
Contents of ModelProtoConfigTest.txt:
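As an illustration only, a two-model ModelServerConfig in text-proto form (model names are the ones from this issue; base paths are placeholders) would look roughly like:

```
model_config_list: {
  config: {
    name: "TestA",
    base_path: "/serving/models/TestA",
    model_platform: "tensorflow"
  },
  config: {
    name: "TestB",
    base_path: "/serving/models/TestB",
    model_platform: "tensorflow"
  }
}
```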
The model is a simple multiplication model built just to test this issue. The input is `input`, and the output is `matmul4`. Note: each time this model is exported, the variables matrix1-4 are all randomly filled with tf.truncated_normal, and are therefore expected to have different weights on every export.
The saved model is created following the steps outlined in the tutorial for exporting models.
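A minimal sketch of what such a model and its export might look like with TF 1.x-era APIs (the matrix dimension, export path, and signature key are made up for illustration; this is not the reporter's actual code):

```python
# Hypothetical reconstruction of the test model: a chain of four matmuls against
# randomly initialized variables, exported as a SavedModel for serving.
import tensorflow as tf

DIM = 10                                  # assumed matrix dimension; not stated in the issue
EXPORT_DIR = "/serving/models/TestA/1"    # placeholder base_path/version directory

x = tf.placeholder(tf.float32, shape=[None, DIM], name="input")
out = x
for i in range(1, 5):
    matrix = tf.Variable(tf.truncated_normal([DIM, DIM]), name="matrix%d" % i)
    out = tf.matmul(out, matrix, name="matmul%d" % i)   # final op is "matmul4"

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())          # fresh random weights each export
    builder = tf.saved_model.builder.SavedModelBuilder(EXPORT_DIR)
    signature = tf.saved_model.signature_def_utils.predict_signature_def(
        inputs={"input": x}, outputs={"matmul4": out})
    builder.add_meta_graph_and_variables(
        sess, [tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature})
    builder.save()
```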
Once this model is exported twice, the separate models are loaded and the server is run. When a client is run against a single model at a time, all answers returned are consistent. If two clients request different models at the same time, the answers are wrong. For this particular model, if the expected answer is -5,992.991, the returned answer is substantially different, e.g. 3,986.742.
The clients test by setting up n tasks and running them all as fast as possible to mimic production serving load.
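The original client code isn't shown; a hypothetical client along these lines (host, port, input dimension, and task count are all assumptions) could reproduce the pattern described, using the TF Serving Predict gRPC API of that era:

```python
# Hypothetical load-test client: fires many concurrent Predict requests at one
# model name; run a second copy against the other model to trigger the issue.
import threading
import numpy
import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2

HOST, PORT = "localhost", 9000   # assumed server address
MODEL_NAME = "TestA"             # or "TestB" for the second client
DIM = 10                         # assumed input dimension
NUM_TASKS = 100                  # the "n tasks" mentioned above

def one_request(stub, results, idx):
    request = predict_pb2.PredictRequest()
    request.model_spec.name = MODEL_NAME
    data = numpy.ones((1, DIM), dtype=numpy.float32)
    request.inputs["input"].CopyFrom(
        tf.contrib.util.make_tensor_proto(data, shape=data.shape))
    response = stub.Predict(request, 10.0)  # 10-second timeout
    results[idx] = tuple(response.outputs["matmul4"].float_val)

channel = implementations.insecure_channel(HOST, PORT)
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)
results = [None] * NUM_TASKS
threads = [threading.Thread(target=one_request, args=(stub, results, i))
           for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With identical inputs, every result should be identical; the bug reported here
# is that they diverge when a second model is being requested concurrently.
print(len(set(results)), "distinct results out of", NUM_TASKS)
```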
Any help identifying where the issue is in this setup or locating a bug would be very helpful!