Modularize and consolidate the Docker images for downstream usage #46062

Open
seanpmorgan opened this issue Dec 30, 2020 · 21 comments
Labels
comp:gpu GPU related issues type:build/install Build and install issues type:feature Feature requests


@seanpmorgan
Member

seanpmorgan commented Dec 30, 2020

System information

  • TensorFlow version (you are using): 2.4.0+
  • Are you willing to contribute it (Yes/No): I can help, but it needs ownership from the TF Dev-Infra team.

Describe the feature and the current behavior/state.
Currently there are 3 or 4 Dockerfiles that are maintained independently and with different levels of support. This has been a time sink for the TF team and a headache for downstream consumers. It should be (relatively) easy to refactor these as Docker build targets that build from one another. The information below is for the GPU containers (though it applies to the CPU containers and other TF versions as well):

As you can see, there is a ton of duplication, modular scripts are not reused where they could be, and even with these 4 options there is still a ton of bloat in the images that should be refactored out.

Will this change the current API? How?
We should use multi-stage Docker build targets to progressively build the containers and publish their intermediate stages. There would be no need to modify tags or anything. Prototype:

# ARGs referenced in the FROM line must be declared before it (defaults illustrative).
ARG UBUNTU_VERSION=18.04
ARG ARCH=
ARG CUDA=11.0
FROM nvidia/cuda${ARCH:+-$ARCH}:${CUDA}-base-ubuntu${UBUNTU_VERSION} as base
# ARGs declared before FROM must be re-declared to stay visible in this stage.
ARG CUDA
# bash is needed for the ${CUDA/./-} string substitutions below.
SHELL ["/bin/bash", "-c"]
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        cuda-command-line-tools-${CUDA/./-} \
        libcublas-${CUDA/./-} \
        ....

RUN ln -s $(which python3) /usr/local/bin/python

COPY install/build_and_install_python.sh /install/
RUN /install/build_and_install_python.sh "3.6.9"
RUN /install/build_and_install_python.sh "3.7.7"
RUN /install/build_and_install_python.sh "3.8.2"

# Install bazel
ARG BAZEL_VERSION=3.7.2
RUN mkdir /bazel && \
    wget -O /bazel/installer.sh "https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}/bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh" && \
    wget -O /bazel/LICENSE.txt "https://raw.githubusercontent.com/bazelbuild/bazel/master/LICENSE" && \
    chmod +x /bazel/installer.sh && \
    /bazel/installer.sh && \
    rm -f /bazel/installer.sh

# -------------------------------------------------------------------
FROM base as tensorflow_gpu
ARG TF_PACKAGE=tensorflow
ARG TF_PACKAGE_VERSION=
RUN python3 -m pip install --no-cache-dir ${TF_PACKAGE}${TF_PACKAGE_VERSION:+==${TF_PACKAGE_VERSION}}

# -------------------------------------------------------------------
FROM base as tensorflow_devel_gpu

RUN apt-get update && apt-get install -y \
    openjdk-8-jdk \
    ....

RUN python3 -m pip --no-cache-dir install \
    Pillow \
    h5py \
    keras_preprocessing \
    matplotlib \
    mock \
    'numpy<1.19.0' \
    scipy \
    sklearn \
    pandas \
    future \
    portpicker \
    enum34

# -------------------------------------------------------------------
FROM base as devtoolset
ADD devtoolset/fixlinks.sh fixlinks.sh
ADD devtoolset/build_devtoolset.sh build_devtoolset.sh
ADD devtoolset/rpm-patch.sh rpm-patch.sh

# Set up a sysroot for glibc 2.12 / libstdc++ 4.4 / devtoolset-7 in /dt7.
RUN /build_devtoolset.sh devtoolset-7 /dt7
# Set up a sysroot for glibc 2.12 / libstdc++ 4.4 / devtoolset-8 in /dt8.
RUN /build_devtoolset.sh devtoolset-8 /dt8

FROM nvidia/cuda:11.0-cudnn8-devel-ubuntu18.04 as tensorflow_custom_op_gpu
COPY --from=devtoolset /dt7 /dt7
COPY --from=devtoolset /dt8 /dt8

# The install scripts need to be copied into this stage before running them.
COPY install/ /install/
RUN /install/install_bootstrap_deb_packages.sh
RUN /install/install_deb_packages.sh
RUN /install/install_clang.sh
RUN /install/install_bazel.sh
RUN /install/install_buildifier.sh
RUN /install/install_pip_packages.sh
RUN /install/install_auditwheel.sh

ENV TF_NEED_CUDA=1

# -------------------------------------------------------------------
FROM tensorflow_custom_op_gpu as tensorflow_build_manylinux2010_multipython

COPY install/install_pip_packages_by_version.sh /install/
RUN /install/install_pip_packages_by_version.sh "/usr/local/bin/pip2.7"
RUN /install/install_pip_packages_by_version.sh "/usr/local/bin/pip3.8"
RUN /install/install_pip_packages_by_version.sh "/usr/local/bin/pip3.5"
RUN /install/install_pip_packages_by_version.sh "/usr/local/bin/pip3.6"
RUN /install/install_pip_packages_by_version.sh "/usr/local/bin/pip3.7"

# -------------------------------------------------------------------
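
As a usage sketch (the tags and push destination here are only examples), each intermediate stage can then be built and published on its own with docker build --target:

# Build and tag only the devel stage from the single Dockerfile.
docker build --target tensorflow_devel_gpu -t tensorflow/tensorflow:devel-gpu .
# Build the slimmer runtime stage from the same file.
docker build --target tensorflow_gpu -t tensorflow/tensorflow:latest-gpu .
# Push like any other tag.
docker push tensorflow/tensorflow:devel-gpu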

Who will benefit from this feature?
SIGs, developers, and downstream libraries looking for well managed Docker containers to build from.

Example benefits:

  • Currently the custom_op container has a lot of installations that are not needed; see "The docker image provided to compile tensorflow custom ops is too big" #38352.
  • Currently the custom_op container gets updated only when time permits; this change would make it update alongside the rest of the containers.
  • Currently manylinux2010_multipython is the most comprehensive build container, but it has no SLA and is also unnecessarily bulky (the pip package installations for every Python version could be removed).
  • There would be a single Dockerfile to maintain (with owners for separate pieces, etc.).

Any Other info.
With this refactored properly, we could close #38352, tensorflow/addons#2326, and tensorflow/build#6.

@seanpmorgan
Member Author

Tagging some stakeholders to see if we're able to get traction on this issue. I know some team members have left Dev-Infra, but this would be a great way to consolidate work and save time down the line.

cc
TF team - @av8ramit @angerson
SIG Build - @perfinion
SIG Addons - @bhack @WindQAQ
SIG IO - @yongtang

@Saduf2019 Saduf2019 added the comp:gpu GPU related issues label Dec 30, 2020
@perfinion perfinion added the type:build/install Build and install issues label Dec 30, 2020
@bhack
Contributor

bhack commented Dec 30, 2020

Dockerfile analysis and refactoring was discussed recently in several SIG Build meetings, also in relation to the cache issue at tensorflow/build#5.
But I think that @angerson is no longer allocated to this activity.

@jvishnuvardhan jvishnuvardhan removed their assignment Jan 4, 2021
@angerson
Contributor

angerson commented Jan 5, 2021

First, thanks a lot for writing this up as a clean issue with clear goals. It's a big help for our prioritization and planning.

Docker has been challenging for us internally because of the maintenance costs and related confusion. A consolidation would be awesome:

  • Save engineering time maintaining the images
  • Help other internal teams that use TF's docker images for their own tests
  • Give us an opportunity to use containerization in our own tests
  • Let us build internally with the same environments as we offer externally

I don't think we'll be able to prioritize a dedicated project for this because of our team's current constraints (DevInfra is only a few people now, and internal maintenance is prioritized). I'd like to make better Docker support a goal beyond just our team, though, and I have some internal OKRs that would actually benefit a lot from Docker improvements. So what I think I can do for now is work on Docker while doing that, and use the results to encourage prioritization at a higher level than just me.

On to implementation: I like your suggested Dockerfile layout. I regret my decision years ago to create a complex assembler for our Dockerfiles and would like to deprecate it for something more normal, like this. I also want to move the Dockerfiles and scripts out of tensorflow/tensorflow because dealing with branches is very annoying; an officially-supported project in SIG Build should work nicely. I'll work on testing out your prototype there.

@bhack
Contributor

bhack commented Jan 6, 2021

How many Dockerfiles do we have in the tree? 67?

https://github.com/tensorflow/tensorflow/search?l=Dockerfile&q=rights&type=

@bhack
Contributor

bhack commented Jan 6, 2021

Running find tensorflow/* -iname "*.Dockerfile" | wc -l shows we currently have 107 Dockerfiles (including partial ones).

@bhack
Contributor

bhack commented Jan 6, 2021

Having 107 Dockerfiles also creates overhead when you need to create a PR just to update a Python library.

@yongtang
Member

yongtang commented Jan 6, 2021

Also, I think some of the Python installations, such as 2.7 and 3.5 (with heavy customization, etc.), may no longer be necessary, as 2.7 and 3.5 are deprecated. Some of the customizations in the script may also stem from the original Ubuntu 14.04 base where packages were missing. It may be possible to clean up a little now as we move to Ubuntu 18.04+.

@angerson
Contributor

angerson commented Jan 7, 2021

This is currently used by SIG Addons and SIG IO -- though there is no SLA for these images

Just to be clear: what, beyond an environment that is ideal for building TensorFlow, is required for your use case? I've been operating under the assumption that the custom-op container is somehow unique, but I've never known what about it is helpful. Would the multipython image completely deprecate custom-op if we supported it?

@yongtang
Member

yongtang commented Jan 7, 2021

For Addons/IO, one challenge is that the custom kernel ops are built against the C++ API, which may have subtle differences depending on the compiler and C runtime inside the OS. As a result, if the gcc version and C runtime differ from the ones TensorFlow was built with, the kernel ops may hit incompatibilities (sometimes strange seg faults, etc.).

This is the biggest challenge, as TensorFlow uses a gcc version from devtoolset and relies on an Ubuntu 16.04 C/C++ runtime (devtoolset is not natively built for Ubuntu, so any mismatch can come with surprises for custom ops). In the end we realized that it is just much easier to build the kernel ops for Addons/IO in the same devtoolset/Ubuntu environment.
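
As an illustration (a hypothetical check, not something from the current images), the flags a custom op must be compiled with can be read back from the installed TensorFlow package, which is one way to verify that a build container's toolchain actually matches:

# Print the compile/link flags custom ops must match, including the
# -D_GLIBCXX_USE_CXX11_ABI value tied to TensorFlow's devtoolset toolchain.
python3 -c "import tensorflow as tf; print(' '.join(tf.sysconfig.get_compile_flags()))"
python3 -c "import tensorflow as tf; print(' '.join(tf.sysconfig.get_link_flags()))"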

@goern

goern commented Jan 13, 2021

Cc: @sub-mod @fatherlinux don't we build TF images based on RHEL/UBI?

@sub-mod
Contributor

sub-mod commented Jan 13, 2021

@seanpmorgan @angerson Is there interest in CentOS images? PyPI is using CentOS and we can contribute those images.

@bhack
Contributor

bhack commented Jan 13, 2021

We have CentOS for the oneDNN third_party builds: https://github.com/tensorflow/tensorflow/search?l=Dockerfile&q=centos

@sub-mod
Contributor

sub-mod commented Jan 13, 2021

Thanks @bhack.
Last time I checked, multi-stage Docker builds don't work on RHEL. The Docker client is different on RHEL and Mac.

@angerson
Contributor

angerson commented Jan 14, 2021

@seanpmorgan

Currently the manylinux2010_multipython is the most comprehensive build container, but it has no SLA and is also unnecessarily bulky (can remove all the pip package installations for every py version)

I thought having all the pip packages installed in each version was a benefit. Is that incorrect?

@yongtang
Member

@angerson The manylinux2010 image is mostly needed for building .so files, so having just one Python version will be enough.

The packaging into pip wheels can be done easily with any thin Python container. For example, in IO we use the manylinux2010 container to build the .so file, then, in a second step, we use python:3.6-slim (and python:3.7-slim and python:3.8-slim) to build the pip wheels.
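
A minimal sketch of that two-step flow (the builder image, project layout, and bazel target below are illustrative assumptions, not SIG IO's actual setup):

# Step 1: build the shared library in a manylinux2010-style builder image.
FROM gcr.io/tensorflow-testing/nosla-cuda11.0-cudnn8-ubuntu18.04-manylinux2010-multipython as builder
COPY . /opt/project
WORKDIR /opt/project
# Assumed target that produces the custom-op .so files.
RUN bazel build //...

# Step 2: package the wheel in a thin Python image (repeated per Python version).
FROM python:3.8-slim
COPY --from=builder /opt/project /opt/project
WORKDIR /opt/project
RUN python3 -m pip install --no-cache-dir setuptools wheel && \
    python3 setup.py bdist_wheel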

@angerson
Contributor

Ah, I see, thank you. But it's still useful to have all of the python versions available, right?

@bhack
Contributor

bhack commented Jan 14, 2021

Ah, I see, thank you. But it's still useful to have all of the python versions available, right?

What is the size overhead for the devel images?

@angerson
Contributor

angerson commented Jan 14, 2021

Pretty big. I think Nvidia's devel containers are ~3GB and the nosla ones are ~13GB. See below. It will be difficult to make an everything-included image that's small.

image | tag | size
gcr.io/tensorflow-testing/nosla-cuda11.0-cudnn8-ubuntu18.04-manylinux2010-multipython | latest | 13.9GB
nvidia/cuda | 11.0-base-ubuntu18.04 | 110MB
nvidia/cuda | 11.0-cudnn8-devel-ubuntu18.04 | 7.41GB
nvidia/cuda | 11.0-cudnn8-runtime-ubuntu18.04 | 3.6GB

@angerson
Contributor

angerson commented Jan 14, 2021

Comparatively, the current official images are pretty small. devel-gpu is 3.22GB and nightly-gpu is 2.36GB. The non-GPU ones are much smaller.

Darn. Actually, devel-gpu is 7GB. Docker Hub only reports the compressed size.
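
For reference, the uncompressed size is what shows up locally, e.g.:

# The SIZE column reports the uncompressed image size, unlike Docker Hub's listing.
docker images tensorflow/tensorflow:devel-gpu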

@bhack
Contributor

bhack commented Nov 30, 2021

Custom-op images are not published anymore.
We have temporarily switched to gcr.io/tensorflow-testing/nosla-cuda11.2-cudnn8.1-ubuntu18.04-manylinux2010-multipython in tensorflow/addons#2598, but we don't have a CPU image anymore.

We are also trying to use the new image in tensorflow/addons#2515, but there we don't have a CPU image anymore either.

We also have a new ticket for the disk-space issue these large GPU images cause on small cloud instances: tensorflow/addons#2613.

@yarri-oss

@angerson Perhaps we can bring this issue up in SIG Build, but with the new nightly Docker images, would it make sense to support options for CUDA vs. other GPUs, as well as a CPU-only image?
