How to load model when the server starts? #453
It's surprising that the model server is adding so much overhead. The server is not reloading the model from disk with each request; it only does that once for each new version that comes in, and then keeps it in memory. Are you querying a remote server? Could it be a network issue? One useful experiment is to run the model_server and the client on the same machine and issue requests there, so that the latency numbers only include evaluation time. How did you measure 1.6ms and 100ms? The model server does very little on top of TF's session.run call, so if evaluating the model actually takes 1.6ms, model_server requests should take very close to that.
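One way to get comparable numbers is to time both paths with the same harness. A minimal sketch (the `mean_latency_ms` helper and the `predict` callable are hypothetical, not part of TF Serving):

```python
import time

def mean_latency_ms(predict, payload, warmup=10, runs=100):
    """Average wall-clock latency of predict(payload) in milliseconds."""
    for _ in range(warmup):            # warm up caches and connections first
        predict(payload)
    start = time.perf_counter()
    for _ in range(runs):
        predict(payload)
    return (time.perf_counter() - start) * 1000.0 / runs
```

Pass a closure that calls `session.run` locally for one measurement, and a closure that issues the gRPC `Predict` request for the other; running both from the same machine as the server means the gap between the two averages reflects serving overhead rather than network latency.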
I also noticed TF Serving overhead (#456) on the same machine when comparing it to the JNI interface.
Closing due to inactivity; please reopen if required.
@kirilg We finally found that this was caused by the different Linux systems we used. Initially, tf-serving was compiled on CentOS 6 without CPU optimizations. After recompiling tf-serving on CentOS 7, the speeds match.
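For reference, TF Serving can be built with CPU-specific instruction sets enabled by passing `--copt` flags to Bazel; the exact flags depend on what the host CPU supports (a sketch, assuming an AVX2/FMA-capable machine):

```shell
# Enable SIMD instruction sets that a default build may leave out
bazel build -c opt \
    --copt=-msse4.2 --copt=-mavx --copt=-mavx2 --copt=-mfma \
    //tensorflow_serving/model_servers:tensorflow_model_server
```

A binary built without these flags on an older system (such as CentOS 6) will run, but can be substantially slower for the same model.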
We have trained a neural network and want to use it for real-time inference.
We know that the average prediction time of our model on CPU is 1.6ms per instance, but TensorFlow Serving takes much longer, up to 100ms per instance.
I suspect this is because the model is loaded from disk on every gRPC call. Is there a solution for this, or do we have to keep a long-lived connection between the client and the server?
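As the maintainer's comment above describes, the server actually reads a model from disk once per version and serves every subsequent request from memory. A toy sketch of that load-once behavior (all names here are hypothetical, for illustration only):

```python
class ToyModelServer:
    """Illustrates load-once, serve-from-memory behavior (hypothetical names)."""

    def __init__(self):
        self._models = {}      # version -> in-memory model
        self.disk_loads = 0    # counts how often we touch disk

    def _load_from_disk(self, version):
        self.disk_loads += 1
        return f"model-v{version}"           # stand-in for a SavedModel

    def predict(self, version, x):
        if version not in self._models:      # only happens for a new version
            self._models[version] = self._load_from_disk(version)
        return (self._models[version], x)    # served from memory

server = ToyModelServer()
for _ in range(1000):
    server.predict(1, [0.5])
print(server.disk_loads)  # → 1: a thousand requests, one disk load
```

So per-request disk loading is not where the latency comes from; reusing a single gRPC channel and stub across requests on the client side is still good practice, but the gap is more likely network round-trips or a suboptimal server build.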