HuggingFaceEmbedder model brings down the application APIs, making recovery impossible #28383

eostis · 2023-09-04T13:09:55Z

The application was fine until a second HuggingFace ONNX embedder was deployed.

The problem is that the application APIs are now in 404 mode, which makes recovery impossible (for example, removing the component from services.xml).

It happened locally, and on Kubernetes. Same issue. Restarting docker does not solve anything.

Logs

[2023-09-04 13:01:22.117] WARNING container        Container.com.yahoo.container.di.Container	Failed to set up first component graph due to error when constructing one of the components\nexception=\ncom.yahoo.container.di.componentgraph.core.ComponentNode$ComponentConstructorException: Error constructing 'wpsolr_multilingual_e5_small_onnx' of type 'ai.vespa.embedding.huggingface.HuggingFaceEmbedder': null\nCaused by: java.lang.RuntimeException: ONNX Runtime exception\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:161)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:156)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.<init>(OnnxEvaluator.java:36)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime.evaluatorOf(OnnxRuntime.java:81)\nCaused by: ai.onnxruntime.OrtException: Error code - ORT_INVALID_PROTOBUF - message: Load model from /opt/vespa/var/db/vespa/download/-2375926038234599409/contents failed:Protobuf parsing failed.\n\tat ai.onnxruntime.OrtSession.createSession(Native Method)\n\tat ai.onnxruntime.OrtSession.<init>(OrtSession.java:73)\n\tat ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:222)\n\tat ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:208)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime$1.create(OnnxRuntime.java:46)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime.acquireSession(OnnxRuntime.java:149)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:144)\n\t... 3 more\n
2023-09-04T13:01:22.124180186Z [2023-09-04 13:01:22.117] WARNING container        Container.com.yahoo.jdisc.core.ApplicationLoader	Exception thrown while activating application.\nexception=\ncom.yahoo.container.di.componentgraph.core.ComponentNode$ComponentConstructorException: Error constructing 'wpsolr_multilingual_e5_small_onnx' of type 'ai.vespa.embedding.huggingface.HuggingFaceEmbedder': null\nCaused by: java.lang.RuntimeException: ONNX Runtime exception\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:161)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:156)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.<init>(OnnxEvaluator.java:36)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime.evaluatorOf(OnnxRuntime.java:81)\nCaused by: ai.onnxruntime.OrtException: Error code - ORT_INVALID_PROTOBUF - message: Load model from /opt/vespa/var/db/vespa/download/-2375926038234599409/contents failed:Protobuf parsing failed.\n\tat ai.onnxruntime.OrtSession.createSession(Native Method)\n\tat ai.onnxruntime.OrtSession.<init>(OrtSession.java:73)\n\tat ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:222)\n\tat ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:208)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime$1.create(OnnxRuntime.java:46)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime.acquireSession(OnnxRuntime.java:149)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:144)\n\t... 3 more\n
2023-09-04T13:01:22.124206620Z [2023-09-04 13:01:22.118] INFO    container        Container.com.yahoo.container.jdisc.ConfiguredApplication	Destroy: Shutting down container now
2023-09-04T13:01:22.124212248Z [2023-09-04 13:01:22.117] INFO    container        Container.com.yahoo.container.jdisc.component.Deconstructor	Starting deconstruction of 10 components and 0 bundles from generation 3160
2023-09-04T13:01:22.124216587Z [2023-09-04 13:01:22.121] INFO    container        Container.com.yahoo.container.jdisc.ConfiguredApplication	Destroy: Finished
2023-09-04T13:01:22.124221504Z [2023-09-04 13:01:22.122] ERROR   container        Container.com.yahoo.jdisc.core.StandaloneMain	JDisc exiting: Throwable caught: \nexception=\ncom.yahoo.container.di.componentgraph.core.ComponentNode$ComponentConstructorException: Error constructing 'wpsolr_multilingual_e5_small_onnx' of type 'ai.vespa.embedding.huggingface.HuggingFaceEmbedder': null\nCaused by: java.lang.RuntimeException: ONNX Runtime exception\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:161)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:156)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.<init>(OnnxEvaluator.java:36)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime.evaluatorOf(OnnxRuntime.java:81)\nCaused by: ai.onnxruntime.OrtException: Error code - ORT_INVALID_PROTOBUF - message: Load model from /opt/vespa/var/db/vespa/download/-2375926038234599409/contents failed:Protobuf parsing failed.\n\tat ai.onnxruntime.OrtSession.createSession(Native Method)\n\tat ai.onnxruntime.OrtSession.<init>(OrtSession.java:73)\n\tat ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:222)\n\tat ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:208)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime$1.create(OnnxRuntime.java:46)\n\tat ai.vespa.modelintegration.evaluator.OnnxRuntime.acquireSession(OnnxRuntime.java:149)\n\tat ai.vespa.modelintegration.evaluator.OnnxEvaluator.createSession(OnnxEvaluator.java:144)\n\t... 3 more\n
2023-09-04T13:01:22.271596244Z [2023-09-04 13:01:22.270] INFO    config-sentinel  sentinel.sentinel.service	container: incremented restart penalty to 254.000 seconds
2023-09-04T13:01:22.271630919Z [2023-09-04 13:01:22.270] INFO    config-sentinel  sentinel.sentinel.service	container: will delay start by 241.682 seconds


jobergum · 2023-09-04T15:57:41Z

Hey @eostis,

It is easier to analyze the behavior if you describe the changes you deployed in more detail. For example, it would help if you wrote something along these lines:

This is my configuration and services.xml
Then I deploy the following using x
Then I observe that this is happening.

> The problem is that the application APIs are now in 404 mode

What does this mean?

> It happened locally, and on Kubernetes. Same issue. Restarting docker does not solve anything.

The protobuf is corrupted or incorrect, as the warning above shows. This means the serving container will not start. If it was already running, it keeps running with the older configuration generation until you restart it. After a restart it will not start, because the current generation says it should use the corrupted file.
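The thread never establishes how the file became unparseable. Since the log path points at a downloaded file, one cheap pre-deploy check is to look at the first bytes of the model file and see whether something other than a model was saved. The sketch below is only a heuristic under that assumption: the function name `describe_model_file` is hypothetical, and it does not validate the ONNX format itself.

```python
# Heuristic check of a model file before deploying it.
# Assumption: a failed download may leave a Git LFS pointer or an HTML page
# where the .onnx file should be. This does NOT validate the ONNX protobuf.

LFS_POINTER_PREFIX = b"version https://git-lfs"
HTML_PREFIXES = (b"<!doctype", b"<html")


def describe_model_file(path):
    """Return a short guess about what the file at `path` contains."""
    with open(path, "rb") as f:
        data = f.read(512)  # the first bytes are enough for these checks
    if not data:
        return "empty file"
    if data.startswith(LFS_POINTER_PREFIX):
        return "git-lfs pointer, not the model itself"
    if data.lstrip().lower().startswith(HTML_PREFIXES):
        return "HTML page, probably a wrong download URL"
    return "no obvious problem"
```

A result of "no obvious problem" only means none of these patterns matched. The file can still be truncated or otherwise invalid.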

eostis · 2023-09-04T16:10:48Z

Thanks @jobergum.

I did not write the specifics because the issue is more about recovering the failed application than why it failed.

Is there something I can do to recover from that kind of issue, presumably a corrupt model, without having to destroy the application?

jobergum · 2023-09-04T16:13:14Z

Yes, remove the component configuration with reference to the bad file and re-deploy.
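As a rough illustration of "remove the component configuration", the sketch below strips a `<component>` element with a given id from a services.xml string. The component id comes from the log above. The sample document, its URL, and the helper name `remove_component` are assumptions made for the example, not the reporter's actual configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical services.xml used only for illustration.
SAMPLE = """<services version="1.0">
  <container id="default" version="1.0">
    <component id="good_embedder" type="hugging-face-embedder"/>
    <component id="wpsolr_multilingual_e5_small_onnx" type="hugging-face-embedder">
      <transformer-model url="https://example.invalid/model.onnx"/>
    </component>
    <search/>
  </container>
</services>"""


def remove_component(services_xml, component_id):
    """Return services.xml text with every <component id=component_id> removed."""
    root = ET.fromstring(services_xml)
    removed = 0
    # ElementTree has no parent pointers, so visit each element and prune its children.
    # The list() copies avoid modifying the tree while iterating over it.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "component" and child.get("id") == component_id:
                parent.remove(child)
                removed += 1
    if removed == 0:
        raise ValueError(f"no component with id {component_id!r}")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(remove_component(SAMPLE, "wpsolr_multilingual_e5_small_onnx"))
```

ElementTree drops XML comments and the XML declaration when it rewrites the document, so for a real application package editing the file by hand may be preferable. The edited package still has to be redeployed afterwards.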

eostis · 2023-09-04T16:14:23Z

That is my point: I cannot redeploy the services.xml file with the APIs (HTTP 404).

eostis · 2023-09-04T16:16:57Z

More exactly: "(http_request_failed) cURL error 6: Could not resolve host: node1.vespanet"

jobergum · 2023-09-04T16:45:01Z

Read your two last comments and ask yourself, is it possible to decode what you write?

> Cannot redeploy with the APIs (HTTP 404)

> (http_request_failed) cURL error 6: Could not resolve host: node1.vespanet

eostis · 2023-09-04T17:13:38Z

Since Vespa's APIs are no longer accessible, it is impossible to redeploy services.xml to remove the faulty component.

jobergum · 2023-09-04T17:29:46Z

This is my last comment on this; it is not leading anywhere.

There are quite a few Vespa APIs. You have to be specific; again, see my first response. You say that some API is returning HTTP 404, but without detailed steps to reproduce we cannot tell which API you are talking about, how you got there, or what the exact response body is. The curl failure above seems to be related to DNS, but it is not clear what you were trying to reach at node1.vespanet or why.

eostis · 2023-09-04T17:33:55Z

Absolutely all APIs are down. I cannot contact Vespa instance with http, not for search nor for admin.

> The problem is that the application APIs are now in 404 mode

jobergum · 2023-09-04T17:45:15Z

> I cannot contact Vespa instance with http, not for search nor for admin.

If that is because of the DNS error (Could not resolve host: node1.vespanet), then maybe check on that? The deploy API, which is served by the configuration server, is not affected at all by an ONNX model failing to load on a serving container that powers the search API.
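To tell the two apart, one option is to probe the configuration server and the serving container separately. The sketch below assumes a single-node setup reachable on localhost with the default ports (19071 for the configuration server, 8080 for the container) and uses the `/state/v1/health` endpoint on both. The host, the ports, and the helper names are assumptions to adjust for a real deployment.

```python
import urllib.error
import urllib.request

# Assumed defaults for a single-node setup; adjust to your deployment.
CONFIG_SERVER = "http://localhost:19071"
CONTAINER = "http://localhost:8080"


def health_url(base):
    """Build the health endpoint URL for a Vespa service base URL."""
    return base.rstrip("/") + "/state/v1/health"


def probe(base, timeout=5):
    """Return the HTTP status of the health endpoint, or a short error string."""
    try:
        with urllib.request.urlopen(health_url(base), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # the server answered, but with an error status
    except OSError as e:
        # Covers URLError: DNS failures, refused connections, timeouts.
        return f"unreachable: {e}"


if __name__ == "__main__":
    print("config server:", probe(CONFIG_SERVER))
    print("container:    ", probe(CONTAINER))
```

If the configuration server answers while the container does not, a redeploy is still possible. If the host name itself cannot be resolved, the problem sits in front of Vespa, as with the `node1.vespanet` error.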

eostis · 2023-09-04T18:39:07Z

I'm closing this issue, but here is a summary of what I experienced for those interested:

On a one-node docker

Deploying a schema with a corrupted model broke this schema's APIs (searching and indexing), but the admin APIs were still OK.
Restarting docker with the corrupted model broke all other schemas' APIs (searching, indexing), but the admin APIs were still OK.

This is fine, I could use the admin APIs to remove the corrupted model from services.xml and redeploy.

(I couldn't reproduce it, but at one point the admin APIs were also down.)

On a one-node Kubernetes (Google Cloud)

Deploying a schema with a corrupted model broke all schema APIs (searching and indexing), and also the admin APIs.

This was the bad part: I had to recreate the cluster and recreate/reindex all the indices of my Vespa demos. Several times.

jobergum · 2023-09-05T07:04:30Z

A component in the Vespa stateless container (for example, the HF Embedder) is global. It is not tied to a specific schema, because the stateless container can talk to any content cluster or schema, and the same component can be reused across schemas.

When deploying changes, there is a new configuration generation that the stateless containers will attempt to migrate to. If initialization fails, the container just continues with the previous configuration generation. However, if you restart the process, it cannot roll back and will attempt to start with the active generation, which will fail.

As for your Kubernetes experience, this sounds more like a faulty configuration of the Kubernetes health probes, which caused your entire deployment to be wiped (and also removed from DNS). Vespa does not have that capability.
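The generation behavior described here can be pictured with a toy model. This is a sketch of the described semantics only, not Vespa's implementation; the class `ToyContainer` and its fields are made up for illustration.

```python
class ToyContainer:
    """Toy model of a stateless container following configuration generations."""

    def __init__(self):
        self.buildable = {}   # generation -> can its components be constructed?
        self.active = None    # generation the config system considers current
        self.serving = None   # generation actually serving, None if the process is down

    def deploy(self, generation, buildable):
        """Activate a new generation; the running process migrates only if it builds."""
        self.buildable[generation] = buildable
        self.active = generation
        if buildable:
            self.serving = generation
        # Otherwise keep serving whatever was running before.

    def restart(self):
        """A fresh process has no previous generation to fall back on."""
        self.serving = self.active if self.buildable[self.active] else None


if __name__ == "__main__":
    c = ToyContainer()
    c.deploy(1, buildable=True)
    c.deploy(2, buildable=False)   # corrupted model: construction fails
    print(c.serving)               # 1, still serving the old generation
    c.restart()
    print(c.serving)               # None, the active generation cannot start
    c.deploy(3, buildable=True)    # faulty component removed and redeployed
    print(c.serving)               # 3
```

The last step is a simplification: the model brings the container back as soon as a buildable generation is deployed, whereas in the log above config-sentinel delays the next start by a restart penalty.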

eostis changed the title ~~Bad ONNX brings down the application APIs, making recovery impossible~~ HuggingFaceEmbedder model brings down the application APIs, making recovery impossible Sep 4, 2023

eostis closed this as completed Sep 4, 2023

eostis mentioned this issue Sep 9, 2023

A checklist for WooCommerce #26694

Open
