
Newer Base Image KServe Container fails with exec /usr/local/bin/dockerd-entrypoint.sh: exec format error #3033

Open
tylertitsworth opened this issue Mar 20, 2024 · 7 comments


@tylertitsworth

🐛 Describe the bug

The public TorchServe KFS Image that was recently updated for 0.10.0 has ubuntu:20.04 as its base.

$ docker image inspect pytorch/torchserve-kfs:0.10.0 | grep "org.opencontainers.image.version"
                "org.opencontainers.image.version": "20.04"

Intel is publishing an Intel Optimized version of both the torchserve and torchserve-kfs images, which includes Intel Extension for PyTorch. However, due to Intel's Security First policies, we use ubuntu:22.04 as the base image for both containers (soon to be ubuntu:24.04).

When we deploy with the latest 0.10.0 version of torchserve on kserve, the image immediately enters the CrashLoopBackOff state due to the following error: exec /usr/local/bin/dockerd-entrypoint.sh: exec format error.

We determined that the solution to this issue was to change the base back to ubuntu:20.04; however, this means that anyone who intends to create a custom torchserve-kfs container won't be able to use the ubuntu:rolling base specified in https://github.com/pytorch/serve/blob/master/docker/Dockerfile#L19.

This issue was not present in the previous version my team published; it occurs only with the latest kserve and torchserve versions. I was also unable to reproduce it from the command line, only in my cluster.
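
An "exec format error" usually points at either an image built for a different architecture than the node, or an entrypoint script with a broken shebang or CRLF line endings. The sketch below is a hypothetical first diagnostic step, not part of the original report; the `arch_to_docker` helper and the image names are illustrative.

```shell
#!/bin/sh
# Sketch: map `uname -m` names onto Docker's architecture names so the
# node's platform can be compared to the image's. (Hypothetical helper.)
arch_to_docker() {
  case "$1" in
    x86_64)  echo amd64 ;;
    aarch64) echo arm64 ;;
    *)       echo "$1" ;;
  esac
}

# Against the real image (illustrative commands, run where docker exists):
#   docker image inspect pytorch/torchserve-kfs:0.10.0 --format '{{.Architecture}}'
#   arch_to_docker "$(uname -m)"
# and inspect the entrypoint's first bytes for a missing shebang or CRLF:
#   docker run --rm --entrypoint sh <image> -c \
#     'head -c 32 /usr/local/bin/dockerd-entrypoint.sh | od -c'
arch_to_docker "$(uname -m)"
```

If the two architecture strings differ, the CrashLoopBackOff is explained before any application log would ever be written.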

Error logs

When using ubuntu:23.10, it fails at build time:

$ ./build-image.sh
...
#11 4.706   Downloading grpcio-tools-1.48.2.tar.gz (2.2 MB)
#11 4.827      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.2/2.2 MB 18.7 MB/s eta 0:00:00
#11 5.054   Preparing metadata (setup.py): started
#11 5.230   Preparing metadata (setup.py): finished with status 'error'
#11 5.234   error: subprocess-exited-with-error
#11 5.234   
#11 5.234   × python setup.py egg_info did not run successfully.
#11 5.234   │ exit code: 1
#11 5.234   ╰─> [16 lines of output]
#11 5.234       /home/model-server/tmp/pip-install-hni50hgy/grpcio-tools_f16dab96a18c4c7b886e38061d477973/setup.py:30: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
#11 5.234         import pkg_resources
#11 5.234       Traceback (most recent call last):
#11 5.234         File "<string>", line 2, in <module>
#11 5.234         File "<pip-setuptools-caller>", line 34, in <module>
#11 5.234         File "/home/model-server/tmp/pip-install-hni50hgy/grpcio-tools_f16dab96a18c4c7b886e38061d477973/setup.py", line 180, in <module>
#11 5.234           if check_linker_need_libatomic():
#11 5.234              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#11 5.234         File "/home/model-server/tmp/pip-install-hni50hgy/grpcio-tools_f16dab96a18c4c7b886e38061d477973/setup.py", line 91, in check_linker_need_libatomic
#11 5.234           cpp_test = subprocess.Popen([cxx, '-x', 'c++', '-std=c++14', '-'],
#11 5.234                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#11 5.234         File "/usr/lib/python3.11/subprocess.py", line 1026, in __init__
#11 5.234           self._execute_child(args, executable, preexec_fn, close_fds,
#11 5.234         File "/usr/lib/python3.11/subprocess.py", line 1950, in _execute_child
#11 5.234           raise child_exception_type(errno_num, err_msg, err_filename)
#11 5.234       FileNotFoundError: [Errno 2] No such file or directory: 'c++'
#11 5.234       [end of output]
#11 5.234   
#11 5.234   note: This error originates from a subprocess, and is likely not a problem with pip.
#11 5.236 error: metadata-generation-failed
#11 5.236 
#11 5.236 × Encountered error while generating package metadata.
#11 5.236 ╰─> See above for output.
#11 5.236 
#11 5.236 note: This is an issue with the package mentioned above, not pip.
#11 5.236 hint: See above for details.
------
executor failed running [/bin/bash -c pip install -r requirements.txt]: exit code: 1
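
The `FileNotFoundError: ... 'c++'` in the log above suggests the newer base image ships no C++ toolchain: grpcio-tools compiles native code at install time and probes for a compiler literally named `c++`. A sketch of a possible fix for Debian/Ubuntu bases follows; the apt command is an assumption about the Dockerfile, and `check_cxx` merely mimics the probe that setup.py performs.

```shell
#!/bin/sh
# Possible fix (run as root in the Dockerfile, before
# `pip install -r requirements.txt`) -- illustrative, not from the report:
#
#   apt-get update && apt-get install -y --no-install-recommends g++
#
# check_cxx reproduces grpcio-tools' probe for a `c++` binary on PATH:
check_cxx() {
  if command -v c++ >/dev/null 2>&1; then echo present; else echo missing; fi
}
check_cxx
```

With `g++` installed, `c++` exists as an alternatives symlink and the `setup.py egg_info` step should get past `check_linker_need_libatomic()`.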

But I am more interested in the output with ubuntu:22.04, which fails during deployment:

$ kubectl logs vqi-predictor-00001-deployment-8f6cd7bd7-9hl84
Defaulted container "kserve-container" out of: kserve-container, queue-proxy, storage-initializer (init)
exec /usr/local/bin/dockerd-entrypoint.sh: exec format error

Installation instructions

Install TorchServe from source? No
Are you using Docker? Yes

Model Packaging

n/a

config.properties

n/a

Versions

With ubuntu:22.04 as base

$ python ts_scripts/print_env_info.py 
------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch: 

torchserve==0.10.0
torch-model-archiver==0.10.0

Python version: 3.10 (64-bit runtime)
Python executable: /home/venv/bin/python

Versions of relevant python libraries:
captum==0.7.0
intel-extension-for-pytorch==2.2.0+cpu
numpy==1.26.4
pillow==10.2.0
psutil==5.9.8
requests==2.31.0
requests-oauthlib==1.4.0
torch==2.2.0+cpu
torch-model-archiver==0.10.0
torch-workflow-archiver==0.2.12
torchaudio==2.2.0+cpu
torchdata==0.7.1
torchserve==0.10.0
torchtext==0.17.0+cpu
torchvision==0.17.0+cpu
transformers==4.38.2
wheel==0.43.0
torch==2.2.0+cpu
torchtext==0.17.0+cpu
torchvision==0.17.0+cpu
torchaudio==2.2.0+cpu

Java Version:


OS: N/A
GCC version: N/A
Clang version: N/A
CMake version: N/A

Environment:
library_path (LD_/DYLD_):

With ubuntu:20.04 as base

$ python ts_scripts/print_env_info.py 
------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch: 

torchserve==0.10.0
torch-model-archiver==0.10.0

Python version: 3.8 (64-bit runtime)
Python executable: /home/venv/bin/python

Versions of relevant python libraries:
captum==0.7.0
intel-extension-for-pytorch==2.2.0+cpu
numpy==1.24.4
pillow==10.2.0
psutil==5.9.8
requests==2.31.0
requests-oauthlib==1.4.0
torch==2.2.0+cpu
torch-model-archiver==0.10.0
torch-workflow-archiver==0.2.12
torchaudio==2.2.0+cpu
torchdata==0.7.1
torchserve==0.10.0
torchtext==0.17.0+cpu
torchvision==0.17.0+cpu
transformers==4.38.2
wheel==0.43.0
torch==2.2.0+cpu
torchtext==0.17.0+cpu
torchvision==0.17.0+cpu
torchaudio==2.2.0+cpu

Java Version:


OS: N/A
GCC version: N/A
Clang version: N/A
CMake version: N/A

Environment:
library_path (LD_/DYLD_):

Repro instructions

From https://github.com/intel/ai-containers,

  1. Clone the repository.
  2. Install docker-compose (see the main README.md).
  3. Build the Intel TorchServe container:
export REGISTRY=intel
export REPO=aiops/mlops-ci
cd pytorch
docker compose up --build torchserve
  4. Set up the KServe build:
    1. Comment out these lines: https://github.com/intel/ai-containers/blob/main/pytorch/serving/build-kfs.sh#L4-L5
    2. docker tag intel/aiops/mlops-ci:b-0-ubuntu-22.04-pip-py3.10-torchserve intel/torchserve:latest
  5. Build the KServe container:
cd serving
./build-kfs.sh
  6. Push to an internal registry.
  7. Modify the ClusterServingRuntime kserve-torchserve to use the new image.
  8. Deploy any example endpoint.
Possible Solution

No response

@tylertitsworth
Author

Before it gets asked: yes, I have tried to capture logs from within the deployed container. However, the container does not even start, so no other logs are recorded (other than the liveness probe and queue-proxy failures and all of that).

@agunapal
Collaborator

Thanks for reporting; looking into this. I am able to repro the error. Earlier we didn't move to 22.04 because the Ubuntu 22.04 runners were flaky. I will try running CI on 22.04 to see if it's resolved now.

@agunapal
Collaborator

@tylertitsworth Please pull the submodules before you build the kfs image:

git submodule update --init --recursive

I am able to build it with 22.04 after doing this:

$ docker image inspect pytorch/torchserve-kfs:latest-cpu | grep "org.opencontainers.image.version"
                "org.opencontainers.image.version": "22.04"

@tylertitsworth
Author

@agunapal In the build script I use to build this container, I already pull the submodules (https://github.com/intel/ai-containers/blob/main/pytorch/serving/build-kfs.sh#L9).

I am able to build the container, however, my issue is when it is deployed to k8s.

@tylertitsworth
Author

@agunapal any update on this? Is there any misunderstanding I can help alleviate?

@agunapal
Collaborator

agunapal commented Apr 2, 2024

Hi @tylertitsworth I understand the problem. I will get back to you this week.

@agunapal
Collaborator

agunapal commented Apr 6, 2024

On Ubuntu 22.04, I tried running the gRPC test cases; these worked:

test_gRPC_inference_api.py::test_inference_apis PASSED                                                                                                                             [ 21%]
test_gRPC_inference_api.py::test_inference_stream_apis 2024-04-06T18:20:11,945 [INFO ] W-9024-echo_stream_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9024-echo_stream_1.0-stderr
PASSED                                                                                                                      [ 21%]
test_gRPC_inference_api.py::test_inference_stream2_apis PASSED                                                                                                                     [ 22%]
test_gRPC_management_apis.py::test_management_apis PASSED        

So it may be something specific to Docker/KServe. I will try the steps you have mentioned.
