HuggingFaceEmbedder model brings down the application APIs, making recovery impossible #28383
Hey @eostis, it is easier to analyze the behavior if you write more about the changes you have deployed. For example, it would help if you wrote something along the lines of:
What does this mean?
The protobuf is corrupted/incorrect, as seen from the above warning. This means that the serving container will not start. If it was already running, it will continue running with the older configuration generation until you attempt to restart it, at which point it will not start, because the current generation says it should use the corrupted file.
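The generation behavior described above can be sketched as a toy model (this is illustrative Python, not Vespa code; the class and method names are invented):

```python
class ServingContainer:
    """Toy model of how a serving container handles configuration
    generations, as described in the comment above (not Vespa code)."""

    def __init__(self, generations):
        # generations: dict mapping generation number -> loads successfully (bool)
        self.generations = generations
        self.active = None    # generation currently being served
        self.target = None    # generation the container should be on
        self.running = False

    def deploy(self, generation):
        """A new generation becomes the target; a running container keeps
        serving the previous good generation if the new one fails to load."""
        if self.generations[generation]:
            self.active = generation
            self.running = True
        # else: if already running, keep serving self.active;
        # if cold-starting, the container simply stays down.
        self.target = generation

    def restart(self):
        """After a restart there is no previous generation to fall back to:
        the container must load the target generation, or stay down."""
        self.running = False
        if self.generations[self.target]:
            self.active = self.target
            self.running = True
```

For example, with generation 1 valid and generation 2 broken, deploying 2 leaves the container serving 1, but a restart then leaves it down until the broken generation is removed.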
Thanks @jobergum. I did not write the specifics because the issue is more about recovering the failed application than about why it failed. Is there something I can do to recover from that kind of issue, presumably a corrupt model, without having to destroy the application?
Yes, remove the component configuration with the reference to the bad file and re-deploy.
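In services.xml terms, that removal could look like the following hypothetical fragment (the component id, model URLs, and container id are placeholders, not taken from this issue):

```xml
<container id="default" version="1.0">
    <!-- The failing embedder component is commented out (or deleted)
         so the next deployment produces a loadable generation: -->
    <!--
    <component id="my-embedder" type="hugging-face-embedder">
        <transformer-model url="https://example.com/model.onnx"/>
        <tokenizer-model url="https://example.com/tokenizer.json"/>
    </component>
    -->
    <search/>
    <document-api/>
</container>
```

After redeploying this configuration, the serving container no longer references the corrupt model file.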
That is my point: I cannot redeploy the services.xml file via the APIs (HTTP 404).
More precisely: "(http_request_failed) cURL error 6: Could not resolve host: node1.vespanet"
Read your last two comments and ask yourself: is it possible to decode what you write?
Vespa's APIs are no longer accessible, so it is impossible to redeploy services.xml to remove the faulty component.
This is the last comment from me on this; it is not leading anywhere. There are quite a few Vespa APIs, and you have to be specific (again, see my first response). You say that some API is returning HTTP 404, but without detailed steps to reproduce, we cannot tell which API you mean, how you got there, or what the exact response body is. The cURL failure snippet above seems to be DNS-related, but it is unclear what you wanted to reach at node1.vespanet, or why.
Absolutely all APIs are down. I cannot contact the Vespa instance over HTTP, neither for search nor for admin.
If that is because of the DNS error (Could not resolve host: node1.vespanet), then maybe check on that? The deploy API, powered by the configuration server, is not affected at all by an ONNX model failing to load on a serving container that powers the search API.
I'm closing this issue, but here is a summary of what I experienced, for those interested:

On a one-node Docker setup: this was fine; I could use the admin APIs to remove the corrupted model from services.xml and redeploy. (I couldn't reproduce it, but it also happened that the admin APIs were down.)

On a one-node Kubernetes cluster (Google Cloud): this was the bad part. I had to recreate the cluster and recreate/reindex all the indices of my Vespa demos, several times.
A component in the Vespa stateless container (example: the HF embedder) is global; it is not tied to a specific schema, because the stateless container can talk to any content cluster/schema and the same component can be re-used across schemas. When you deploy changes, there is a new configuration generation that the stateless containers will attempt to migrate to. If initialization fails, a container simply continues with the previous configuration generation. But if you restart the process, it cannot roll back and will attempt to start with the active generation, which will fail. Your Kubernetes experience sounds more like faulty configuration of the Kubernetes health probes, so that your entire deployment was wiped (and also removed from DNS); Vespa itself does not have that capability.
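On the health-probe point: a hypothetical Kubernetes probe configuration is sketched below (the port, paths, and timing values are assumptions, not taken from this issue). An aggressive liveness probe that kills the pod while the container is stuck re-loading a bad configuration generation can turn a recoverable failure into a restart loop that takes the whole deployment down:

```yaml
# Illustrative probe settings for a Vespa container pod (values are
# assumptions). The idea: give the container generous time to download
# and initialize models before the liveness probe starts killing it.
livenessProbe:
  httpGet:
    path: /state/v1/health   # container health endpoint
    port: 8080
  initialDelaySeconds: 120   # allow model download/initialization
  failureThreshold: 10
readinessProbe:
  httpGet:
    path: /state/v1/health
    port: 8080
  periodSeconds: 10
```

With a too-tight liveness probe, every failed start triggers another restart, which (per the generation behavior described above) can never succeed until the bad component is removed from services.xml.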
The application was fine until a second HuggingFace ONNX embedder was deployed.
The problem is that the application APIs now return HTTP 404, which makes recovery (such as removing the component from services.xml) impossible.
It happened both locally and on Kubernetes, with the same symptoms. Restarting Docker does not solve anything.
Logs