
Newer Base Image KServe Container fails with exec /usr/local/bin/dockerd-entrypoint.sh: exec format error #3033

Open
tylertitsworth opened this issue Mar 20, 2024 · 7 comments


@tylertitsworth

🐛 Describe the bug

The public TorchServe KFS Image that was recently updated for 0.10.0 has ubuntu:20.04 as its base.

$ docker image inspect pytorch/torchserve-kfs:0.10.0 | grep "org.opencontainers.image.version"
                "org.opencontainers.image.version": "20.04"

Intel is publishing an Intel Optimized version of both the torchserve and torchserve-kfs images, which includes Intel Extension for PyTorch. However, due to Intel's Security First policies, we use ubuntu:22.04 as the base image for both containers (soon to be ubuntu:24.04).

When we deploy with the latest 0.10.0 version of torchserve on kserve, the image immediately enters the CrashLoopBackOff state due to the following error: exec /usr/local/bin/dockerd-entrypoint.sh: exec format error.

We determined that the solution to this issue was to change the base back to ubuntu:20.04; however, this means that anyone who intends to create a custom torchserve-kfs container won't be able to use the ubuntu:rolling base specified in https://github.com/pytorch/serve/blob/master/docker/Dockerfile#L19.

This issue was not present in the previous version my team published; it occurs only with the latest kserve and torchserve versions. I was also unable to reproduce it from the command line, only in my cluster.
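
An "exec format error" usually points at either an image built for a different architecture than the node, or an entrypoint script with a broken shebang or CRLF line endings. The sketch below is a hypothetical first diagnostic step, not part of the original report; the `arch_to_docker` helper and the image names are illustrative.

```shell
#!/bin/sh
# Sketch: map `uname -m` names onto Docker's architecture names so the
# node's platform can be compared to the image's. (Hypothetical helper.)
arch_to_docker() {
  case "$1" in
    x86_64)  echo amd64 ;;
    aarch64) echo arm64 ;;
    *)       echo "$1" ;;
  esac
}

# Against the real image (illustrative commands, run where docker exists):
#   docker image inspect pytorch/torchserve-kfs:0.10.0 --format '{{.Architecture}}'
#   arch_to_docker "$(uname -m)"
# and inspect the entrypoint's first bytes for a missing shebang or CRLF:
#   docker run --rm --entrypoint sh <image> -c \
#     'head -c 32 /usr/local/bin/dockerd-entrypoint.sh | od -c'
arch_to_docker "$(uname -m)"
```

If the two architecture strings differ, the CrashLoopBackOff is explained before any application log would ever be written.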

Error logs

When using ubuntu:23.10, it fails at build time:

$ ./build-image.sh
...
#11 4.706   Downloading grpcio-tools-1.48.2.tar.gz (2.2 MB)
#11 4.827      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.2/2.2 MB 18.7 MB/s eta 0:00:00
#11 5.054   Preparing metadata (setup.py): started
#11 5.230   Preparing metadata (setup.py): finished with status 'error'
#11 5.234   error: subprocess-exited-with-error
#11 5.234   
#11 5.234   × python setup.py egg_info did not run successfully.
#11 5.234   │ exit code: 1
#11 5.234   ╰─> [16 lines of output]
#11 5.234       /home/model-server/tmp/pip-install-hni50hgy/grpcio-tools_f16dab96a18c4c7b886e38061d477973/setup.py:30: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
#11 5.234         import pkg_resources
#11 5.234       Traceback (most recent call last):
#11 5.234         File "<string>", line 2, in <module>
#11 5.234         File "<pip-setuptools-caller>", line 34, in <module>
#11 5.234         File "/home/model-server/tmp/pip-install-hni50hgy/grpcio-tools_f16dab96a18c4c7b886e38061d477973/setup.py", line 180, in <module>
#11 5.234           if check_linker_need_libatomic():
#11 5.234              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#11 5.234         File "/home/model-server/tmp/pip-install-hni50hgy/grpcio-tools_f16dab96a18c4c7b886e38061d477973/setup.py", line 91, in check_linker_need_libatomic
#11 5.234           cpp_test = subprocess.Popen([cxx, '-x', 'c++', '-std=c++14', '-'],
#11 5.234                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#11 5.234         File "/usr/lib/python3.11/subprocess.py", line 1026, in __init__
#11 5.234           self._execute_child(args, executable, preexec_fn, close_fds,
#11 5.234         File "/usr/lib/python3.11/subprocess.py", line 1950, in _execute_child
#11 5.234           raise child_exception_type(errno_num, err_msg, err_filename)
#11 5.234       FileNotFoundError: [Errno 2] No such file or directory: 'c++'
#11 5.234       [end of output]
#11 5.234   
#11 5.234   note: This error originates from a subprocess, and is likely not a problem with pip.
#11 5.236 error: metadata-generation-failed
#11 5.236 
#11 5.236 × Encountered error while generating package metadata.
#11 5.236 ╰─> See above for output.
#11 5.236 
#11 5.236 note: This is an issue with the package mentioned above, not pip.
#11 5.236 hint: See above for details.
------
executor failed running [/bin/bash -c pip install -r requirements.txt]: exit code: 1
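
The `FileNotFoundError: ... 'c++'` in the log above suggests the newer base image ships no C++ toolchain: grpcio-tools compiles native code at install time and probes for a compiler literally named `c++`. A sketch of a possible fix for Debian/Ubuntu bases follows; the apt command is an assumption about the Dockerfile, and `check_cxx` merely mimics the probe that setup.py performs.

```shell
#!/bin/sh
# Possible fix (run as root in the Dockerfile, before
# `pip install -r requirements.txt`) -- illustrative, not from the report:
#
#   apt-get update && apt-get install -y --no-install-recommends g++
#
# check_cxx reproduces grpcio-tools' probe for a `c++` binary on PATH:
check_cxx() {
  if command -v c++ >/dev/null 2>&1; then echo present; else echo missing; fi
}
check_cxx
```

With `g++` installed, `c++` exists as an alternatives symlink and the `setup.py egg_info` step should get past `check_linker_need_libatomic()`.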

But I am more interested in the output with ubuntu:22.04, which fails during deployment:

$ kubectl logs vqi-predictor-00001-deployment-8f6cd7bd7-9hl84
Defaulted container "kserve-container" out of: kserve-container, queue-proxy, storage-initializer (init)
exec /usr/local/bin/dockerd-entrypoint.sh: exec format error

Installation instructions

Install TorchServe from source? No
Are you using Docker? Yes

Model Packaging

n/a

config.properties

n/a

Versions

With ubuntu:22.04 as base

$ python ts_scripts/print_env_info.py 
------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch: 

torchserve==0.10.0
torch-model-archiver==0.10.0

Python version: 3.10 (64-bit runtime)
Python executable: /home/venv/bin/python

Versions of relevant python libraries:
captum==0.7.0
intel-extension-for-pytorch==2.2.0+cpu
numpy==1.26.4
pillow==10.2.0
psutil==5.9.8
requests==2.31.0
requests-oauthlib==1.4.0
torch==2.2.0+cpu
torch-model-archiver==0.10.0
torch-workflow-archiver==0.2.12
torchaudio==2.2.0+cpu
torchdata==0.7.1
torchserve==0.10.0
torchtext==0.17.0+cpu
torchvision==0.17.0+cpu
transformers==4.38.2
wheel==0.43.0
torch==2.2.0+cpu
torchtext==0.17.0+cpu
torchvision==0.17.0+cpu
torchaudio==2.2.0+cpu

Java Version:


OS: N/A
GCC version: N/A
Clang version: N/A
CMake version: N/A

Environment:
library_path (LD_/DYLD_):

With ubuntu:20.04 as base

$ python ts_scripts/print_env_info.py 
------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch: 

torchserve==0.10.0
torch-model-archiver==0.10.0

Python version: 3.8 (64-bit runtime)
Python executable: /home/venv/bin/python

Versions of relevant python libraries:
captum==0.7.0
intel-extension-for-pytorch==2.2.0+cpu
numpy==1.24.4
pillow==10.2.0
psutil==5.9.8
requests==2.31.0
requests-oauthlib==1.4.0
torch==2.2.0+cpu
torch-model-archiver==0.10.0
torch-workflow-archiver==0.2.12
torchaudio==2.2.0+cpu
torchdata==0.7.1
torchserve==0.10.0
torchtext==0.17.0+cpu
torchvision==0.17.0+cpu
transformers==4.38.2
wheel==0.43.0
torch==2.2.0+cpu
torchtext==0.17.0+cpu
torchvision==0.17.0+cpu
torchaudio==2.2.0+cpu

Java Version:


OS: N/A
GCC version: N/A
Clang version: N/A
CMake version: N/A

Environment:
library_path (LD_/DYLD_):

Repro instructions

From https://github.com/intel/ai-containers,

  1. Clone the repository.
  2. Install docker-compose (see the main README.md).
  3. Build the Intel TorchServe container:
export REGISTRY=intel
export REPO=aiops/mlops-ci
cd pytorch
docker compose up --build torchserve
  4. Set up the KServe build:
    1. Comment out these lines: https://github.com/intel/ai-containers/blob/main/pytorch/serving/build-kfs.sh#L4-L5
    2. docker tag intel/aiops/mlops-ci:b-0-ubuntu-22.04-pip-py3.10-torchserve intel/torchserve:latest
  5. Build the KServe container:
cd serving
./build-kfs.sh
  6. Push to an internal registry.
  7. Modify the ClusterServingRuntime kserve-torchserve to use the new image.
  8. Deploy any example endpoint.
Possible Solution

No response

@tylertitsworth
Author

Before it gets asked: yes, I have tried to capture logs from within the deployed container. However, the container does not even start, so no other logs are recorded (other than the liveness probe and queue-proxy failures and all of that).

@agunapal
Collaborator

Thanks for reporting; looking into this. I am able to repro the error. Earlier we didn't move to 22.04 because the Ubuntu 22.04 runners were flaky. I will try running CI on 22.04 to see if it's resolved now.

@agunapal
Collaborator

@tylertitsworth Please pull the submodules before you build the kfs image:

git submodule update --init --recursive

I am able to build it with 22.04 after doing this:

$ docker image inspect pytorch/torchserve-kfs:latest-cpu | grep "org.opencontainers.image.version"
                "org.opencontainers.image.version": "22.04"

@tylertitsworth
Author

@agunapal In the build script I use to build this container, I already pull the submodules (https://github.com/intel/ai-containers/blob/main/pytorch/serving/build-kfs.sh#L9).

I am able to build the container, however, my issue is when it is deployed to k8s.

@tylertitsworth
Author

@agunapal any update on this? Is there any misunderstanding I can help alleviate?

@agunapal
Collaborator

agunapal commented Apr 2, 2024

Hi @tylertitsworth I understand the problem. I will get back to you this week.

@agunapal
Collaborator

agunapal commented Apr 6, 2024

On Ubuntu 22.04, I tried running the gRPC test cases; these worked:

test_gRPC_inference_api.py::test_inference_apis PASSED                                                                                                                             [ 21%]
test_gRPC_inference_api.py::test_inference_stream_apis 2024-04-06T18:20:11,945 [INFO ] W-9024-echo_stream_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9024-echo_stream_1.0-stderr
PASSED                                                                                                                      [ 21%]
test_gRPC_inference_api.py::test_inference_stream2_apis PASSED                                                                                                                     [ 22%]
test_gRPC_management_apis.py::test_management_apis PASSED        

So it may be something specific to Docker/KServe. I will try the steps you have mentioned.
