Update instructions to build with nvidia cuda runtime image for ONNX #2435

agunapal · 2023-06-28T19:42:29Z

Description

TorchServe's GPU Docker image uses the NVIDIA CUDA base image.

Third-party libraries such as ONNX require the NVIDIA CUDA runtime base image to work.

Introduce a new argument, -bi, to the Docker build script to specify the base image:

./build_image.sh -bi nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04 -g -cv cu117 -t pytorch/ts_run:latest-gpu
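
A minimal sketch of how a build script could accept a -bi override while keeping a default base image. This is not the actual docker/build_image.sh: the function name resolve_base_image and the DEFAULT_BASE_IMAGE value are assumptions made for illustration.

```shell
# Hypothetical sketch, not the real docker/build_image.sh.
# DEFAULT_BASE_IMAGE is an assumed placeholder value.
DEFAULT_BASE_IMAGE="ubuntu:20.04"

# Print the base image selected by the given command-line arguments.
resolve_base_image() {
  local base_image="$DEFAULT_BASE_IMAGE"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -bi)
        # -bi expects the image reference as the next argument.
        if [ "$#" -lt 2 ]; then
          echo "Missing value for -bi" >&2
          return 1
        fi
        base_image="$2"
        shift 2
        ;;
      *)
        # Any other flag or value is ignored in this sketch.
        shift
        ;;
    esac
  done
  echo "$base_image"
}
```

The selected value would then typically be handed to the Dockerfile through docker build --build-arg, assuming the Dockerfile declares a matching ARG before its FROM line.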

Fixes #(issue)

Type of change

Bug fix (non-breaking change which fixes an issue)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
New feature (non-breaking change which adds functionality)
This change requires a documentation update

Feature/Issue validation/testing

1_docker-regression (ubuntu-20.04).txt
2_docker-regression (self-hosted, regression-test-gpu).txt

Error message when -bi and -g are used together

(torchserve) ubuntu@ip-172-31-60-100:~/serve/docker$ ./build_image.sh -bi nvidia/cuda:11.7.1-base-ubuntu20.04  -g -t py:test1
Incompatible options: -bi doesn't work with -g option
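
The rejection shown above could be implemented with a guard along these lines. This is a sketch only: the function name check_build_options and its variables are hypothetical, and only the error text is taken from the log.

```shell
# Hypothetical guard; names are illustrative, not from docker/build_image.sh.
# Returns 1 and prints the error when -bi and -g are both present.
check_build_options() {
  local has_base_image=false
  local has_gpu=false
  local arg
  for arg in "$@"; do
    case "$arg" in
      -bi) has_base_image=true ;;
      -g)  has_gpu=true ;;
    esac
  done
  if [ "$has_base_image" = true ] && [ "$has_gpu" = true ]; then
    echo "Incompatible options: -bi doesn't work with -g option"
    return 1
  fi
  return 0
}

# Example use at the top of a build script:
# check_build_options "$@" || exit 1
```

Running the check before any docker build call means an incompatible combination fails immediately instead of after a long image build.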

(torchserve) ubuntu@ip-172-31-60-100:~/serve/docker$ ./build_image.sh -bi nvidia/cuda:11.7.1-base-ubuntu20.04   -t py:test1
[+] Building 172.8s (24/24) FINISHED                                                                                                                                                      
 => [internal] load build definition from Dockerfile                                                                                                                                 0.0s
 => => transferring dockerfile: 38B                                                                                                                                                  0.0s
 => [internal] load .dockerignore                                                                                                                                                    0.0s
 => => transferring context: 2B                                                                                                                                                      0.0s
 => resolve image config for docker.io/docker/dockerfile:experimental                                                                                                                0.5s
 => CACHED docker-image://docker.io/docker/dockerfile:experimental@sha256:600e5c62eedff338b3f7a0850beb7c05866e0ef27b2d2e8c02aa468e78496ff5                                           0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04                                                                                                       0.0s
 => CACHED [compile-image 1/9] FROM docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04                                                                                                    0.0s
 => [internal] load build context                                                                                                                                                    0.0s
 => => transferring context: 80B                                                                                                                                                     0.0s
 => [compile-image 2/9] RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt     apt-get update &&     apt-get upgrade -y &&     apt-get install software-properties-common -y  111.0s
 => [compile-image 3/9] RUN python3.9 -m venv /home/venv                                                                                                                             3.0s 
 => [compile-image 4/9] RUN python -m pip install -U pip setuptools                                                                                                                  3.0s 
 => [compile-image 5/9] RUN export USE_CUDA=1                                                                                                                                        0.4s 
 => [compile-image 6/9] RUN git clone --depth 1 https://github.com/pytorch/serve.git                                                                                                 2.8s 
 => [compile-image 7/9] WORKDIR serve                                                                                                                                                0.0s 
 => [compile-image 8/9] RUN     if echo "nvidia/cuda:11.7.1-base-ubuntu20.04" | grep -q "cuda:"; then         if [ "" ]; then             python ./ts_scripts/install_dependencies  35.5s 
 => [compile-image 9/9] RUN     if echo "false" | grep -q "false"; then         python -m pip install --no-cache-dir torchserve torch-model-archiver torch-workflow-archiver;    el  2.0s 
 => CACHED [production-image 2/9] RUN --mount=type=cache,target=/var/cache/apt     apt-get update &&     apt-get upgrade -y &&     apt-get install software-properties-common -y &&  0.0s 
 => CACHED [production-image 3/9] RUN useradd -m model-server     && mkdir -p /home/model-server/tmp                                                                                 0.0s 
 => [production-image 4/9] COPY --chown=model-server --from=compile-image /home/venv /home/venv                                                                                      5.0s 
 => [production-image 5/9] COPY dockerd-entrypoint.sh /usr/local/bin/dockerd-entrypoint.sh                                                                                           0.0s 
 => [production-image 6/9] RUN chmod +x /usr/local/bin/dockerd-entrypoint.sh     && chown -R model-server /home/model-server                                                         0.3s 
 => [production-image 7/9] COPY config.properties /home/model-server/config.properties                                                                                               0.0s 
 => [production-image 8/9] RUN mkdir /home/model-server/model-store && chown -R model-server /home/model-server/model-store                                                          0.4s
 => [production-image 9/9] WORKDIR /home/model-server                                                                                                                                0.0s
 => exporting to image                                                                                                                                                               5.2s
 => => exporting layers                                                                                                                                                              5.2s
 => => writing image sha256:97e78b58035f1c1519c66a2fc85f4d17e2c303c84d854d8f22f184a8515d83ab                                                                                         0.0s
 => => naming to docker.io/library/py:test1

NVIDIA runtime

(torchserve) ubuntu@ip-172-31-2-198:~/serve/docker$ ./build_image.sh -bi nvidia/cuda:11.7.1-runtime-ubuntu20.04 -bt ci -t pytorch/torchserve:ci
[+] Building 215.2s (22/22) FINISHED                                                                                                                                                      
 => [internal] load build definition from Dockerfile                                                                                                                                 0.0s
 => => transferring dockerfile: 6.52kB                                                                                                                                               0.0s
 => [internal] load .dockerignore                                                                                                                                                    0.0s
 => => transferring context: 2B                                                                                                                                                      0.0s
 => resolve image config for docker.io/docker/dockerfile:experimental                                                                                                                0.5s
 => CACHED docker-image://docker.io/docker/dockerfile:experimental@sha256:600e5c62eedff338b3f7a0850beb7c05866e0ef27b2d2e8c02aa468e78496ff5                                           0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                 0.0s
 => [internal] load .dockerignore                                                                                                                                                    0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:11.7.1-runtime-ubuntu20.04                                                                                                    0.8s
 => [compile-image 1/9] FROM docker.io/nvidia/cuda:11.7.1-runtime-ubuntu20.04@sha256:4ff41b20a64e267e9bd9466061711b09adf4807f3d4c656d07009788a56f8178                               19.9s
 => => resolve docker.io/nvidia/cuda:11.7.1-runtime-ubuntu20.04@sha256:4ff41b20a64e267e9bd9466061711b09adf4807f3d4c656d07009788a56f8178                                              0.0s
 => => sha256:4ff41b20a64e267e9bd9466061711b09adf4807f3d4c656d07009788a56f8178 743B / 743B                                                                                           0.0s
 => => sha256:cca775f086be7b61abaf8428ac4aa71fba4a7a1d4718a5aee6cb09d7163ae604 13.17kB / 13.17kB                                                                                     0.0s
 => => sha256:79284eb3dfdfdfbd489bdfbcc675f51c192510f6d4ea5a5971876a0002f5bce1 2.21kB / 2.21kB                                                                                       0.0s
 => => sha256:ccfccf24900f621770df167b6811a582af777df108eee444c85ab551d39e9b77 7.94MB / 7.94MB                                                                                       0.4s
 => => sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 47.88MB / 47.88MB                                                                                     0.6s
 => => sha256:ba47ed5447555d2c58ee4005528f925fc538e093f9fe584e7f0b1e5198ea6b94 183B / 183B                                                                                           0.1s
 => => sha256:e5c76c058a4460c199c66189c15efca36f4a512a35fe57273a979ccca58a0176 6.88kB / 6.88kB                                                                                       0.3s
 => => sha256:06d6ff943437bee1c3ad6bb60a3b7727408e450ed129dddd3adf3a46eac22f28 1.09GB / 1.09GB                                                                                       9.2s
 => => extracting sha256:ccfccf24900f621770df167b6811a582af777df108eee444c85ab551d39e9b77                                                                                            0.2s
 => => sha256:5ba16bd606c9f26e6b2bc4c850efa5ff293e1f3bacc3d341b25dc07001712780 62.30kB / 62.30kB                                                                                     0.6s
 => => sha256:566e1b27f99d4f73a48d51e3db82d3124417ae96bf622597b796e78d6c33e700 1.52kB / 1.52kB                                                                                       0.7s
 => => sha256:c20acb837b2297cff47e8f703eb6d6e035d74701f5492b55ac37694440cd26d9 1.68kB / 1.68kB                                                                                       0.7s
 => => extracting sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3                                                                                            0.7s
 => => extracting sha256:ba47ed5447555d2c58ee4005528f925fc538e093f9fe584e7f0b1e5198ea6b94                                                                                            0.0s
 => => extracting sha256:e5c76c058a4460c199c66189c15efca36f4a512a35fe57273a979ccca58a0176                                                                                            0.0s
 => => extracting sha256:06d6ff943437bee1c3ad6bb60a3b7727408e450ed129dddd3adf3a46eac22f28                                                                                           10.4s
 => => extracting sha256:5ba16bd606c9f26e6b2bc4c850efa5ff293e1f3bacc3d341b25dc07001712780                                                                                            0.0s
 => => extracting sha256:c20acb837b2297cff47e8f703eb6d6e035d74701f5492b55ac37694440cd26d9                                                                                            0.0s
 => => extracting sha256:566e1b27f99d4f73a48d51e3db82d3124417ae96bf622597b796e78d6c33e700                                                                                            0.0s
 => [compile-image 2/9] RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt     apt-get update &&     apt-get upgrade -y &&     apt-get install software-properties-common -y  102.9s
 => [ci-image 2/6] RUN --mount=type=cache,target=/var/cache/apt     apt-get update &&     apt-get upgrade -y &&     apt-get install software-properties-common -y &&     add-apt-  127.3s
 => [compile-image 3/9] RUN python3.9 -m venv /home/venv                                                                                                                             3.0s
 => [compile-image 4/9] RUN python -m pip install -U pip setuptools                                                                                                                  3.2s
 => [compile-image 5/9] RUN export USE_CUDA=1                                                                                                                                        0.3s
 => [compile-image 6/9] RUN git clone --depth 1 https://github.com/pytorch/serve.git                                                                                                 2.2s
 => [compile-image 7/9] WORKDIR serve                                                                                                                                                0.0s
 => [compile-image 8/9] RUN     if echo "nvidia/cuda:11.7.1-runtime-ubuntu20.04" | grep -q "cuda:"; then         if [ "" ]; then             python ./ts_scripts/install_dependenc  37.7s
 => [compile-image 9/9] RUN python -m pip install --no-cache-dir torchserve torch-model-archiver torch-workflow-archiver                                                             1.9s
 => [ci-image 3/6] COPY --from=compile-image /home/venv /home/venv                                                                                                                   5.4s 
 => [ci-image 4/6] RUN python -m pip install --no-cache-dir -r https://raw.githubusercontent.com/pytorch/serve/master/requirements/developer.txt                                    26.8s 
 => [ci-image 5/6] RUN mkdir /home/serve                                                                                                                                             0.3s 
 => [ci-image 6/6] WORKDIR /home/serve                                                                                                                                               0.0s
 => exporting to image                                                                                                                                                               6.3s
 => => exporting layers                                                                                                                                                              6.3s
 => => writing image sha256:369e1a2f338827066d710963b0f4416f2057a02125f6ef19758458019f7ae23a                                                                                         0.0s
 => => naming to docker.io/pytorch/torchserve:ci                                                                                                                                     0.0s
(torchserve) ubuntu@ip-172-31-2-198:~/serve/docker$ docker run -it --gpus all -v $PWD:/home/serve pytorch/torchserve:ci

==========
== CUDA ==
==========

CUDA Version 11.7.1

(torchserve) ubuntu@ip-172-31-7-107:~/serve/docker$ ./build_image.sh -bi nvidia/cuda:11.7.0-cudnn8-runtime-ubuntu20.04 -g -cv cu117 -t pytorch/ts_run:latest-gpu
[+] Building 0.8s (26/26) FINISHED                                                                                                                                                        
 => [internal] load build definition from Dockerfile                                                                                                                                 0.0s
 => => transferring dockerfile: 4.96kB                                                                                                                                               0.0s
 => [internal] load .dockerignore                                                                                                                                                    0.0s
 => => transferring context: 2B                                                                                                                                                      0.0s
 => resolve image config for docker.io/docker/dockerfile:experimental                                                                                                                0.3s
 => CACHED docker-image://docker.io/docker/dockerfile:experimental@sha256:600e5c62eedff338b3f7a0850beb7c05866e0ef27b2d2e8c02aa468e78496ff5                                           0.0s
 => [internal] load .dockerignore                                                                                                                                                    0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                 0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:11.7.0-cudnn8-runtime-ubuntu20.04                                                                                             0.3s
 => [compile-image 1/9] FROM docker.io/nvidia/cuda:11.7.0-cudnn8-runtime-ubuntu20.04@sha256:4a4398ca4dbe0d0dbcda3bb153333f1c4d66edb0b5d4fd48eefe765ab7d83d25                         0.0s
 => [internal] load build context                                                                                                                                                    0.0s
 => => transferring context: 80B                                                                                                                                                     0.0s
 => CACHED [runtime-image 2/9] RUN --mount=type=cache,target=/var/cache/apt     apt-get update &&     apt-get upgrade -y &&     apt-get install software-properties-common -y &&     0.0s
 => CACHED [runtime-image 3/9] RUN useradd -m model-server     && mkdir -p /home/model-server/tmp                                                                                    0.0s
 => CACHED [compile-image 2/9] RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt     apt-get update &&     apt-get upgrade -y &&     apt-get install software-properties-comm  0.0s
 => CACHED [compile-image 3/9] RUN python3.9 -m venv /home/venv                                                                                                                      0.0s
 => CACHED [compile-image 4/9] RUN python -m pip install -U pip setuptools                                                                                                           0.0s
 => CACHED [compile-image 5/9] RUN export USE_CUDA=1                                                                                                                                 0.0s
 => CACHED [compile-image 6/9] RUN git clone --depth 1 https://github.com/pytorch/serve.git                                                                                          0.0s
 => CACHED [compile-image 7/9] WORKDIR serve                                                                                                                                         0.0s
 => CACHED [compile-image 8/9] RUN     if echo "nvidia/cuda:11.7.0-cudnn8-runtime-ubuntu20.04" | grep -q "cuda:"; then         if [ "cu117" ]; then             python ./ts_scripts  0.0s
 => CACHED [compile-image 9/9] RUN python -m pip install --no-cache-dir torchserve torch-model-archiver torch-workflow-archiver                                                      0.0s
 => CACHED [runtime-image 4/9] COPY --chown=model-server --from=compile-image /home/venv /home/venv                                                                                  0.0s
 => CACHED [runtime-image 5/9] COPY dockerd-entrypoint.sh /usr/local/bin/dockerd-entrypoint.sh                                                                                       0.0s
 => CACHED [runtime-image 6/9] RUN chmod +x /usr/local/bin/dockerd-entrypoint.sh     && chown -R model-server /home/model-server                                                     0.0s
 => CACHED [runtime-image 7/9] COPY config.properties /home/model-server/config.properties                                                                                           0.0s
 => CACHED [runtime-image 8/9] RUN mkdir /home/model-server/model-store && chown -R model-server /home/model-server/model-store                                                      0.0s
 => CACHED [runtime-image 9/9] WORKDIR /home/model-server                                                                                                                            0.0s
 => exporting to image                                                                                                                                                               0.0s
 => => exporting layers                                                                                                                                                              0.0s
 => => writing image sha256:988e7b1f9eba0f4b6f5b0fe3396195c80ac6877a2f251500b864a30ed04ba253                                                                                         0.0s
 => => naming to docker.io/pytorch/ts_run:latest-gpu

REPOSITORY                              TAG          IMAGE ID       CREATED          SIZE
pytorch/ts_run                          latest-gpu   988e7b1f9eba   15 minutes ago   8.29GB
pytorch/ts_base                         latest-gpu   c9ba68f62f4f   21 minutes ago   5.12GB

Checklist:

Did you have fun?
Have you added tests that prove your fix is effective or that this feature works?
Has code been commented, particularly in hard-to-understand areas?
Have you made corresponding changes to the documentation?

codecov · 2023-06-28T20:06:44Z

Codecov Report

Merging #2435 (9becc65) into master (e2cd91b) will not change coverage.
The diff coverage is n/a.

❗ Current head 9becc65 differs from pull request most recent head 344f1d0. Consider uploading reports for the commit 344f1d0 to get more accurate results

@@           Coverage Diff           @@
##           master    #2435   +/-   ##
=======================================
  Coverage   72.66%   72.66%           
=======================================
  Files          78       78           
  Lines        3669     3669           
  Branches       58       58           
=======================================
  Hits         2666     2666           
  Misses        999      999           
  Partials        4        4

msaroufim · 2023-06-28T22:03:47Z

I'd really like us to merge the change that runs the regression test inside a freshly built Docker container, so we can make sure that this works instead of relying on logs.

agunapal · 2023-06-28T23:24:04Z

@msaroufim I agree. We have to wait till #2403 is resolved and merged.

msaroufim · 2023-07-10T17:14:40Z

@agunapal LGTM please just fix lint before merge

chauhang

Thanks @agunapal for the PR. A few items:

Can we switch to using CUDA 11.8 as the default?
Please attach some tests for verification of the two cases: the image with the NVIDIA runtime, and the image with the dev build needed for DeepSpeed.

agunapal · 2023-07-21T23:53:37Z

@chauhang I attached the NVIDIA runtime logs.

The upgrade to CUDA 11.8 is happening in #2489.

Currently, users can't test DeepSpeed with Docker because of bug #2492.

I can remove the comment added in the DeepSpeed README and add it back later when it's fixed and verified. Is that fine?

feedback seems addressed

Update instructions to build with nvidia cuda runtime image for docker

b925952

agunapal requested review from msaroufim and lxning June 28, 2023 19:42

updated deepspeed documentation

8898cc9

agunapal changed the title ~~Update instructions to build with nvidia cuda runtime image for ONNX, TensorRT, DeepSpeed~~ Update instructions to build with nvidia cuda runtime image for ONNX Jun 28, 2023

agunapal added 2 commits June 28, 2023 21:14

updated deepspeed documentation

7ff353b

updated deepspeed documentation

7a488e7

lxning reviewed Jun 28, 2023

examples/large_models/deepspeed/Readme.md (outdated, resolved)

added example command

fce9d59

agunapal requested a review from lxning June 28, 2023 21:47

msaroufim approved these changes Jul 10, 2023

msaroufim reviewed Jul 10, 2023

docker/build_image.sh (resolved)

agunapal and others added 3 commits July 14, 2023 11:02

Merge branch 'master' into docs/docker_gpu_updates

2cd434c

Run regression

b804843

Merge branch 'docs/docker_gpu_updates' of https://github.com/pytorch/…

e21d50e

…serve into docs/docker_gpu_updates

dt-subaandh-krishnakumar mentioned this pull request Jul 17, 2023

Torchserve 0.8.1: ONNX GPU models not working #2425

Closed

Lint failure

b25e985

agunapal mentioned this pull request Jul 18, 2023

docker/build_image.sh BASE_IMAGE needs an update #2475

Closed

agunapal added 2 commits July 18, 2023 16:32

Merge branch 'master' into docs/docker_gpu_updates

e24b435

Merge branch 'master' into docs/docker_gpu_updates

f2b514f

agunapal requested a review from namannandan July 19, 2023 18:12

chauhang reviewed Jul 20, 2023

docker/README.md (outdated, resolved)

chauhang reviewed Jul 20, 2023

examples/large_models/deepspeed/Readme.md (outdated, resolved)

chauhang reviewed Jul 20, 2023

examples/large_models/deepspeed/Readme.md (outdated, resolved)

chauhang previously requested changes Jul 20, 2023

Merge branch 'master' into docs/docker_gpu_updates

994444f

agunapal requested a review from chauhang July 21, 2023 23:53

removed instructions to use deepspeed

34a5847

lxning reviewed Jul 24, 2023

docker/build_image.sh (outdated, resolved)

agunapal requested a review from lxning July 24, 2023 22:51

agunapal and others added 5 commits July 25, 2023 20:44

changed variable name

b4df63e

Merge branch 'master' into docs/docker_gpu_updates

c31f710

Exit if -bi and -g are specified

f8a37b6

Merge branch 'docs/docker_gpu_updates' of https://github.com/pytorch/…

85b3213

…serve into docs/docker_gpu_updates

Merge branch 'master' into docs/docker_gpu_updates

81a20f8

lxning approved these changes Jul 28, 2023

Merge branch 'master' into docs/docker_gpu_updates

344f1d0

msaroufim merged commit 35ef00f into master Jul 29, 2023
12 of 13 checks passed
