How to load model when the server starts? #453

Closed
fromradio opened this issue May 25, 2017 · 4 comments

@fromradio

fromradio commented May 25, 2017

We have trained a neural network and want to use it for real-time inference.

We know that the average prediction time of our model on CPU is 1.6 ms per instance, but TensorFlow Serving takes much longer, up to 100 ms per instance.

I think this is because the model is loaded from disk every time we make a gRPC call. Is there any solution for this, or do we have to keep a long-lived connection between the client and server?
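
A minimal sketch of the "long connection" option, using the 2017-era TF Serving beta gRPC client API. The server address (localhost:9000), model name (my_model), input tensor name (x), and input shape ([1, 10]) are placeholders, not details from this thread. The channel and stub are created once at startup and reused, so every request travels over the same long-lived connection.

```python
# Long-lived TF Serving client sketch (2017-era beta gRPC API).
# Placeholder assumptions: server on localhost:9000, model exported as
# 'my_model' with a single float input tensor 'x' of shape [1, 10].
import numpy as np
import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2

# Create the channel and stub once and reuse them for every request.
channel = implementations.insecure_channel('localhost', 9000)
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

def predict(batch):
  """Issues one Predict RPC over the shared channel."""
  request = predict_pb2.PredictRequest()
  request.model_spec.name = 'my_model'
  request.inputs['x'].CopyFrom(tf.contrib.util.make_tensor_proto(batch))
  return stub.Predict(request, 5.0)  # 5-second RPC deadline

if __name__ == '__main__':
  print(predict(np.random.rand(1, 10).astype(np.float32)))
```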

fromradio changed the title from "How to load model when start the server?" to "How to load model when the server starts?" on May 25, 2017
@kirilg
Contributor

kirilg commented May 25, 2017

It's surprising that the model server is adding so much overhead. The server does not reload the model from disk on each request; it loads each new version once, when it arrives, and then keeps it in memory.

Are you querying a remote server? Could it be a network issue? One useful experiment is to run the model_server and the client on the same machine and issue requests there. That will make the latency numbers only include evaluation time.

How did you measure 1.6ms and 100ms? The model server is doing very little on top of TF's session.run call, so if evaluating the model actually takes 1.6ms, model_server requests should take very close to that.
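
As a rough sketch of that measurement (reusing the same placeholder names as the client sketch above: my_model, input tensor x, port 9000, shape [1, 10]; none of these come from this thread): run tensorflow_model_server and this script on the same machine, send one warm-up request so one-time loading is excluded, then average many Predict calls over a reused stub and compare the result with the 1.6 ms session.run figure.

```python
# Rough warm-request latency check against a local model_server.
# Placeholder assumptions: server on localhost:9000, model 'my_model',
# float input tensor 'x' of shape [1, 10].
import time
import numpy as np
import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2

channel = implementations.insecure_channel('localhost', 9000)
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.inputs['x'].CopyFrom(
    tf.contrib.util.make_tensor_proto(
        np.random.rand(1, 10).astype(np.float32)))

stub.Predict(request, 5.0)  # warm-up: excludes any one-time loading cost

n = 100
start = time.time()
for _ in range(n):
  stub.Predict(request, 5.0)
print('average Predict latency: %.2f ms' % ((time.time() - start) / n * 1e3))
```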

@kamei86i

I also noticed TF Serving overhead (#456) on the same machine when comparing it to the JNI interface.

@sukritiramesh
Contributor

Closing due to inactivity; please reopen if required.

@fromradio
Author

@kirilg We finally found that this was caused by the different Linux systems we used. At first, tf-serving was compiled on CentOS 6 without CPU optimizations. After recompiling tf-serving on CentOS 7, the speed is now the same as running the model directly.
