HuggingFaceEmbedder model brings down the application APIs, making recovery impossible #28383

Closed
eostis opened this issue Sep 4, 2023 · 12 comments

Comments


eostis commented Sep 4, 2023

The application was fine until a second HuggingFace ONNX embedder was deployed.

The problem is that the application APIs are now in 404 mode, which makes recovery impossible (for example, by removing the component from services.xml).

It happened locally, and on Kubernetes. Same issue. Restarting docker does not solve anything.

Logs

[2023-09-04 13:01:22.117] WARNING container        Container.com.yahoo.container.di.Container	Failed to set up first component graph due to error when constructing one of the components\nexception=\ncom.yahoo.container.di.componentgraph.core.ComponentNode$ComponentConstructorException: Error constructing 'wpsolr_multilingual_e5_small_onnx' of type 'ai.vespa.embedding.huggingface.HuggingFaceEmbedder': null\nCaused by: java.lang.RuntimeException: ONNX Runtime exception\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:161)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:156)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.<init>(OnnxEvaluator.java:36)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime.evaluatorOf(OnnxRuntime.java:81)\nCaused by: ai.onnxruntime.OrtException: Error code - ORT_INVALID_PROTOBUF - message: Load model from /opt/vespa/var/db/vespa/download/-2375926038234599409/contents failed:Protobuf parsing failed.\n\tat ai.onnxruntime.OrtSession.createSession(Native Method)\n\tat ai.onnxruntime.OrtSession.<init>(OrtSession.java:73)\n\tat ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:222)\n\tat ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:208)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime$1.create(OnnxRuntime.java:46)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime.acquireSession(OnnxRuntime.java:149)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:144)\n\t... 3 more\n
2023-09-04T13:01:22.124180186Z [2023-09-04 13:01:22.117] WARNING container        Container.com.yahoo.jdisc.core.ApplicationLoader	Exception thrown while activating application.\nexception=\ncom.yahoo.container.di.componentgraph.core.ComponentNode$ComponentConstructorException: Error constructing 'wpsolr_multilingual_e5_small_onnx' of type 'ai.vespa.embedding.huggingface.HuggingFaceEmbedder': null\nCaused by: java.lang.RuntimeException: ONNX Runtime exception\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:161)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:156)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.<init>(OnnxEvaluator.java:36)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime.evaluatorOf(OnnxRuntime.java:81)\nCaused by: ai.onnxruntime.OrtException: Error code - ORT_INVALID_PROTOBUF - message: Load model from /opt/vespa/var/db/vespa/download/-2375926038234599409/contents failed:Protobuf parsing failed.\n\tat ai.onnxruntime.OrtSession.createSession(Native Method)\n\tat ai.onnxruntime.OrtSession.<init>(OrtSession.java:73)\n\tat ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:222)\n\tat ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:208)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime$1.create(OnnxRuntime.java:46)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime.acquireSession(OnnxRuntime.java:149)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:144)\n\t... 3 more\n
2023-09-04T13:01:22.124206620Z [2023-09-04 13:01:22.118] INFO    container        Container.com.yahoo.container.jdisc.ConfiguredApplication	Destroy: Shutting down container now
2023-09-04T13:01:22.124212248Z [2023-09-04 13:01:22.117] INFO    container        Container.com.yahoo.container.jdisc.component.Deconstructor	Starting deconstruction of 10 components and 0 bundles from generation 3160
2023-09-04T13:01:22.124216587Z [2023-09-04 13:01:22.121] INFO    container        Container.com.yahoo.container.jdisc.ConfiguredApplication	Destroy: Finished
2023-09-04T13:01:22.124221504Z [2023-09-04 13:01:22.122] ERROR   container        Container.com.yahoo.jdisc.core.StandaloneMain	JDisc exiting: Throwable caught: \nexception=\ncom.yahoo.container.di.componentgraph.core.ComponentNode$ComponentConstructorException: Error constructing 'wpsolr_multilingual_e5_small_onnx' of type 'ai.vespa.embedding.huggingface.HuggingFaceEmbedder': null\nCaused by: java.lang.RuntimeException: ONNX Runtime exception\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:161)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:156)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.<init>(OnnxEvaluator.java:36)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime.evaluatorOf(OnnxRuntime.java:81)\nCaused by: ai.onnxruntime.OrtException: Error code - ORT_INVALID_PROTOBUF - message: Load model from /opt/vespa/var/db/vespa/download/-2375926038234599409/contents failed:Protobuf parsing failed.\n\tat ai.onnxruntime.OrtSession.createSession(Native Method)\n\tat ai.onnxruntime.OrtSession.<init>(OrtSession.java:73)\n\tat ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:222)\n\tat ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:208)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime$1.create(OnnxRuntime.java:46)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime.acquireSession(OnnxRuntime.java:149)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:144)\n\t... 3 more\n
2023-09-04T13:01:22.271596244Z [2023-09-04 13:01:22.270] INFO    config-sentinel  sentinel.sentinel.service	container: incremented restart penalty to 254.000 seconds
2023-09-04T13:01:22.271630919Z [2023-09-04 13:01:22.270] INFO    config-sentinel  sentinel.sentinel.service	container: will delay start by 241.682 seconds
@eostis eostis changed the title Bad ONNX brings down the application APIs, making recovery impossible HuggingFaceEmbedder model brings down the application APIs, making recovery impossible Sep 4, 2023
Member

jobergum commented Sep 4, 2023

Hey @eostis,

It is easier to analyze the behavior if you describe the changes you deployed in more detail. For example, it would help to write something along these lines:

  • This is my configuration and services.xml
  • Then I deploy the following using x
  • Then I observe that this is happening.

> The problem is that the application APIs are now in 404 mode

What does this mean?

> It happened locally, and on Kubernetes. Same issue. Restarting docker does not solve anything.

The protobuf is corrupted or incorrect, as seen from the warning above. This means that the serving container will not start. If it was already running, it continues running with the older configuration generation until you attempt to restart it, at which point it will not start, because the current generation says it should use the corrupted file.

Author

eostis commented Sep 4, 2023

Thanks @jobergum.

I did not write the specifics because the issue is more about recovering the failed application than why it failed.

Is there something I can do to recover from that kind of issue, presumably a corrupt model, without having to destroy the application?

Member

jobergum commented Sep 4, 2023

Yes, remove the component configuration with reference to the bad file and re-deploy.
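
As a rough illustration of "remove the component configuration", the sketch below drops one `<component>` element by id from a services.xml document. The XML shown is a hypothetical stand-in, since the reporter's real services.xml is not included in this thread; only the component id `wpsolr_multilingual_e5_small_onnx` comes from the logs above. The URLs and the `good_embedder` id are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical services.xml; the real file is not shown in this thread.
# Only the failing component id is taken from the logs.
SERVICES_XML = """<services version="1.0">
  <container id="default" version="1.0">
    <component id="good_embedder" type="hugging-face-embedder">
      <transformer-model url="https://example.com/good/model.onnx"/>
      <tokenizer-model url="https://example.com/good/tokenizer.json"/>
    </component>
    <component id="wpsolr_multilingual_e5_small_onnx" type="hugging-face-embedder">
      <transformer-model url="https://example.com/bad/model.onnx"/>
      <tokenizer-model url="https://example.com/bad/tokenizer.json"/>
    </component>
    <search/>
    <document-api/>
  </container>
</services>
"""


def remove_component(services_xml: str, component_id: str) -> str:
    """Return services.xml text with every <component id=component_id> removed."""
    root = ET.fromstring(services_xml)
    removed = 0
    # Snapshot the traversal first so removing children does not disturb it.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "component" and child.get("id") == component_id:
                parent.remove(child)
                removed += 1
    if removed == 0:
        raise ValueError(f"no component with id {component_id!r} found")
    return ET.tostring(root, encoding="unicode")


cleaned = remove_component(SERVICES_XML, "wpsolr_multilingual_e5_small_onnx")
```

The cleaned text would then be written back into the application package and the package redeployed in the usual way.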

Author

eostis commented Sep 4, 2023

That is my point: I cannot redeploy the services.xml file with the APIs (HTTP 404).

Author

eostis commented Sep 4, 2023

More precisely: "(http_request_failed) cURL error 6: Could not resolve host: node1.vespanet"

Member

jobergum commented Sep 4, 2023

Read your last two comments and ask yourself: is it possible to decode what you wrote?

> Cannot redeploy with the APIs (HTTP 404)

> (http_request_failed) cURL error 6: Could not resolve host: node1.vespanet

Author

eostis commented Sep 4, 2023

Since Vespa's APIs are no longer accessible, it is impossible to redeploy services.xml to remove the faulty component.

Member

jobergum commented Sep 4, 2023

This is the last comment from me on this; this is not leading anywhere.

There are quite a few Vespa APIs, so you have to be specific; again, see my first response. You say that some API is returning HTTP 404, but without detailed steps to reproduce, nobody can tell which API you are talking about, how you got there, or what the exact response body is. The curl failure snippet above seems to be related to DNS, but who knows why you wanted to reach node1.vespanet.

Author

eostis commented Sep 4, 2023

Absolutely all APIs are down. I cannot contact Vespa instance with http, not for search nor for admin.

> The problem is that the application APIs are now in 404 mode

Member

jobergum commented Sep 4, 2023

> I cannot contact Vespa instance with http, not for search nor for admin.

If that is because of the DNS error (Could not resolve host: node1.vespanet), then maybe check on that? The deploy API, powered by the configuration server, is not affected at all by an ONNX model failing to load on a serving container that powers the search API.
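
The two symptoms reported in this thread are different kinds of failure: an HTTP 404 means some server answered, while cURL error 6 means the host name never resolved. A minimal sketch of telling them apart, using only the Python standard library, might look like the following. The function names `classify_failure` and `probe` are made up for this example and are not part of any Vespa tooling.

```python
import socket
import urllib.error
import urllib.request


def classify_failure(exc: Exception) -> str:
    """Map an exception from urllib into a coarse failure category."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"http_{exc.code}"  # a server answered, e.g. http_404
    if isinstance(exc, urllib.error.URLError):
        reason = exc.reason
        if isinstance(reason, socket.gaierror):
            return "dns_failure"  # same class of problem as cURL error 6
        if isinstance(reason, ConnectionRefusedError):
            return "connection_refused"  # host resolved, nothing listening
        if isinstance(reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "network_error"
    return "unknown"


def probe(url: str, timeout: float = 5.0) -> str:
    """Request a URL and return 'http_<status>' or a failure category."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"http_{resp.status}"
    except Exception as exc:  # broad on purpose: this is a diagnostic helper
        return classify_failure(exc)
```

For example, `probe("http://<config-server-host>:19071/state/v1/health")` and `probe("http://<container-host>:8080/state/v1/health")` could be compared to see whether only the serving container is affected. The hosts are placeholders, and the ports and path are the commonly documented Vespa defaults, which is an assumption about this particular setup.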

Author

eostis commented Sep 4, 2023

I'm closing this issue, but here is a summary of what I experienced for those interested:

On a one-node docker

  • Deploying a schema with a corrupted model broke this schema's APIs (searching and indexing), but the admin APIs were still ok
  • Restarting docker with the corrupted model broke all other schemas' APIs (searching, indexing), but the admin APIs were still ok

This is fine, I could use the admin APIs to remove the corrupted model from services.xml and redeploy.

(I couldn't reproduce it, but at one point the admin APIs were also down.)

On a one-node Kubernetes (Google Cloud)

  • Deploying a schema with a corrupted model broke all schema APIs (searching and indexing), but also the admin APIs.

This was the bad part: I had to recreate the cluster and recreate/reindex all the indices of my Vespa demos. Several times.

@eostis eostis closed this as completed Sep 4, 2023
Member

jobergum commented Sep 5, 2023

A component in the Vespa stateless container (example: HF Embedder) is global. It's not tied to a specific schema, because the stateless container can talk to any content cluster/schema, and the same component can be reused across schemas.

When deploying changes, there is a new configuration generation that the stateless containers will attempt to migrate to. If initialization fails, the container just continues with the previous configuration generation. But if you restart the process, it cannot roll back and will attempt to start with the active generation (which will fail).
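
The difference between a live reconfiguration and a restart can be sketched with a toy model. This is not Vespa code: the `Container` class, its methods, and the generation numbers are invented for illustration, and the only behavior modeled is what is described above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ComponentConstructionError(Exception):
    """Raised when the components of a generation cannot be constructed."""


@dataclass
class Container:
    """Toy stand-in for a stateless container process (hypothetical)."""

    # Returns True if all components of the given generation can be built.
    can_construct: Callable[[int], bool]
    running_generation: Optional[int] = None

    def reconfigure(self, new_generation: int) -> Optional[int]:
        """Live reconfiguration: on failure, keep serving the old generation."""
        if self.can_construct(new_generation):
            self.running_generation = new_generation
        return self.running_generation

    def restart(self, active_generation: int) -> int:
        """Restart: the old component graph is gone, so there is no fallback."""
        self.running_generation = None
        if not self.can_construct(active_generation):
            raise ComponentConstructionError(
                f"generation {active_generation} failed; nothing to roll back to"
            )
        self.running_generation = active_generation
        return active_generation


# Generation 2 stands for the deployment that references the corrupted model.
container = Container(can_construct=lambda generation: generation != 2)
container.reconfigure(1)  # healthy deployment, now serving generation 1
container.reconfigure(2)  # fails to construct, still serving generation 1
```

In this model, calling `container.restart(2)` raises `ComponentConstructionError`, while a later `container.restart(3)` succeeds once a fixed generation 3 has been deployed.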

As for your Kubernetes experience, this sounds more like a faulty configuration of the Kubernetes health probes, which caused your entire deployment to be wiped (and removed from DNS). Vespa does not have that capability.
