Serving multiple models on GPU yields incorrect results under load #335
Serving is built for GPU and deployed with multiple models using the model_config_file command-line option. Many of the responses were incorrect when two models were active and being requested at the same time. This initially appeared to be an 'under-load' problem, but it now looks more like a 'concurrent-request' problem. The current theory is that GPU memory is mismanaged when multiple models are loaded. Since it is very difficult to land two requests on the server at exactly the same time, the problem was only discovered when sending many requests for separate models simultaneously.
This issue was also reproduced on an AWS server.
Code is on master, commit
Once built, the server is run with the model_config_file option pointing at ModelProtoConfigTest.txt.
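The exact command isn't captured in this excerpt; assuming a standard Bazel build of tensorflow_model_server, an invocation along these lines would match the setup described (the port and file path are placeholders, not the reporter's actual values):

bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server \
    --port=9000 \
    --model_config_file=/path/to/ModelProtoConfigTest.txt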
Note, the issue occurs whether batching is enabled or not.
Contents of ModelProtoConfigTest.txt:
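The actual file contents are not preserved in this excerpt; a typical two-model config for this setup would look roughly like the following sketch (the model names match the tests described later in the thread, the base paths are assumptions):

model_config_list: {
  config: {
    name: "TestA",
    base_path: "/serving/models/TestA",
    model_platform: "tensorflow"
  },
  config: {
    name: "TestB",
    base_path: "/serving/models/TestB",
    model_platform: "tensorflow"
  }
}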
The model is a simple multiplication model built just to test this issue. The graph is:
input = tf.placeholder(tf.float32, shape=[1, 1])
matrix1 = tf.Variable(tf.truncated_normal([1, 1024]))
matrix2 = tf.Variable(tf.truncated_normal([1024, 1024]))
matrix3 = tf.Variable(tf.truncated_normal([1024, 1024]))
matrix4 = tf.Variable(tf.truncated_normal([1024, 1]))
matmul1 = tf.matmul(input, matrix1)
matmul2 = tf.matmul(matmul1, matrix2)
matmul3 = tf.matmul(matmul2, matrix3)
matmul4 = tf.matmul(matmul3, matrix4)
Note that each time this model is exported, the variables matrix1-4 are randomly initialized with tf.truncated_normal, so every export is expected to have different weights.
The saved model is created following the steps outlined in the tutorial for exporting models.
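As a reference point, a minimal export in the spirit of that tutorial might look like the sketch below. It uses the TF 1.x SavedModelBuilder API; the export path, signature keys, and the choice of matmul4 as the serving output are assumptions, and `input` and `matmul4` refer to the graph defined above.

import tensorflow as tf

# `input` and `matmul4` come from the multiplication graph defined above.
export_path = "/serving/models/TestA/1"  # assumed layout: <base_path>/<version>

with tf.Session() as sess:
    # Variables are re-initialized on every export, so each version gets new random weights.
    sess.run(tf.global_variables_initializer())

    builder = tf.saved_model.builder.SavedModelBuilder(export_path)
    signature = tf.saved_model.signature_def_utils.build_signature_def(
        inputs={"input": tf.saved_model.utils.build_tensor_info(input)},
        outputs={"output": tf.saved_model.utils.build_tensor_info(matmul4)},
        method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME)
    builder.add_meta_graph_and_variables(
        sess, [tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature})
    builder.save()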
Once this model is exported twice, the two separate models are loaded and the server is run. When a client is run against a single model at a time, all returned answers are consistent. If two clients request different models at the same time, the answers are wrong. For this particular model, if the expected answer is -5,992.991, the returned answer is substantially different, e.g. 3,986.742.
The clients test by setting up n request tasks and issuing them all as fast as possible, to mimic production serving load.
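The actual client code isn't included above; a concurrent test client in that spirit could be sketched as follows, assuming the beta gRPC stubs from the tensorflow_serving.apis package of that era, a server on localhost:9000, and the model and signature names used in this thread:

from multiprocessing.pool import ThreadPool

import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2

channel = implementations.insecure_channel("localhost", 9000)
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

def query(model_name):
    # Send one Predict request for the given model and return the scalar output.
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.inputs["input"].CopyFrom(
        tf.contrib.util.make_tensor_proto([[1.0]], dtype=tf.float32, shape=[1, 1]))
    result = stub.Predict(request, 10.0)  # 10-second deadline
    return result.outputs["output"].float_val[0]

# Hammer both models at once. With a fixed input, each model should always
# return the same value, so any spread in the per-model result sets exposes the bug.
pool = ThreadPool(16)
answers = pool.map(query, ["TestA", "TestB"] * 100)
print("TestA results:", set(answers[0::2]))
print("TestB results:", set(answers[1::2]))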
Any help identifying where the issue is in this setup or locating a bug would be very helpful!
+1 for the incorrect results on GPU with concurrent requests.
Thanks @bnoodle for creating the simple test case. I have the same issue, but in a more complicated setup: I have two saved-model sessions, one that decodes a JPEG and a second that takes the decoded tensor as its input. Multiple client calls give inconsistent results. If Serving is built for CPU, the results are always consistent. It looks like a GPU memory handling issue. cc @chrisolston
I've updated to the latest commit
+1 thanks @bnoodle for the test case. There are two possibilities that I can think of:
It sounds like the tests thus far exercise both 1 and 2 simultaneously. One option, both as a possible mitigation and to test a theory, would be to use the BatchingParameters config proto to enable batching, and set num_batch_threads to one (1). This configures batching with exactly one thread shared across all models, causing requests on the GPU to be serialized (one model at a time). You should also set batch_timeout_micros=0 (no delay, at least for testing) and max_batch_size to something reasonable (it depends on your outer tensor dimension; typically 100-1000). If this approach works for you, you can then tune the batch size and timeout to trade off latency and throughput. Either way, please report back. Thanks!
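For concreteness, a BatchingParameters text proto along those lines might look like this (the values follow the suggestion above, with an illustrative max_batch_size; how it is wired in, e.g. via an --enable_batching flag and a batching parameters file, depends on the model server build and should be treated as an assumption):

max_batch_size { value: 128 }
batch_timeout_micros { value: 0 }
num_batch_threads { value: 1 }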
@nfiedel I can confirm that both points one and two are existing issues, if loading a model under stress is considered similar enough.
Testing 1: Start the server with version 1 of both TestA and TestB. Start hammering the server with many TestA requests and aspire version 2 of TestB. Once that is loaded, stop the TestA requests. TestB requests then permanently give consistently incorrect results, while TestA results remain consistently correct.
Testing 2: I'm unsure how to add an entirely new model to the server once it is already running, but having no versions of TestB present on startup causes the server to hang after starting TestA. After trying that, I still see these issues.
As @kinhunt mentioned, I'm unable to get the batching parameters to help in either case. Side note: this issue occurs whether batching is enabled or not.
Update: After merging the latest master commit
Side note: I am unable to get the branch to build as-is. Before building, I have to comment out the line referencing nccl_kernels (line 87 as of posting) inside the TensorFlow contrib BUILD file. This seems to be a known issue with a known workaround.
Sorry to comment on a closed issue. I just wanted to make sure that a single instance of TFS, with a model_config_file containing the definitions of multiple models, is able to invoke all of these models in parallel on the same GPU (assuming that the batching conditions are met and that enough threads are available).