Modularize and consolidate the Docker images for downstream usage #46062

Open
seanpmorgan opened this issue Dec 30, 2020 · 21 comments
Labels
comp:gpu GPU related issues type:build/install Build and install issues type:feature Feature requests


@seanpmorgan
Member

seanpmorgan commented Dec 30, 2020

System information

  • TensorFlow version (you are using): 2.4.0+
  • Are you willing to contribute it (Yes/No): I can help, but it needs ownership from the TF Dev-Infra team.

Describe the feature and the current behavior/state.
Currently there are 3 or 4 Dockerfiles that are maintained independently and with different levels of support. This has been a time sink for the TF team and a headache for downstream consumers. It should be (relatively) easy to refactor these as Docker build targets that build from one another. The information below is for the GPU containers (though it applies to the CPU containers and other TF versions as well):

As you can see, there is a ton of duplication, modular scripts are not reused where they could be, and even with these 4 options there is still a ton of bloat in the images that should be refactored out.

Will this change the current API? How?
We should use multi-stage Docker build targets to progressively build the containers and publish their intermediate stages. There would be no need to modify tags or anything. Prototype:

# ARGs referenced in the FROM line must be declared before it (defaults illustrative).
ARG UBUNTU_VERSION=18.04
ARG ARCH=
ARG CUDA=11.0
FROM nvidia/cuda${ARCH:+-$ARCH}:${CUDA}-base-ubuntu${UBUNTU_VERSION} as base
# ARGs declared before FROM must be re-declared to stay visible in this stage.
ARG CUDA
# bash is needed for the ${CUDA/./-} string substitutions below.
SHELL ["/bin/bash", "-c"]
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        cuda-command-line-tools-${CUDA/./-} \
        libcublas-${CUDA/./-} \
        ....

RUN ln -s $(which python3) /usr/local/bin/python

COPY install/build_and_install_python.sh /install/
RUN /install/build_and_install_python.sh "3.6.9"
RUN /install/build_and_install_python.sh "3.7.7"
RUN /install/build_and_install_python.sh "3.8.2"

# Install bazel
ARG BAZEL_VERSION=3.7.2
RUN mkdir /bazel && \
    wget -O /bazel/installer.sh "https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}/bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh" && \
    wget -O /bazel/LICENSE.txt "https://raw.githubusercontent.com/bazelbuild/bazel/master/LICENSE" && \
    chmod +x /bazel/installer.sh && \
    /bazel/installer.sh && \
    rm -f /bazel/installer.sh

# -------------------------------------------------------------------
FROM base as tensorflow_gpu
ARG TF_PACKAGE=tensorflow
ARG TF_PACKAGE_VERSION=
RUN python3 -m pip install --no-cache-dir ${TF_PACKAGE}${TF_PACKAGE_VERSION:+==${TF_PACKAGE_VERSION}}

# -------------------------------------------------------------------
FROM base as tensorflow_devel_gpu

RUN apt-get update && apt-get install -y \
    openjdk-8-jdk \
    ....

RUN python3 -m pip --no-cache-dir install \
    Pillow \
    h5py \
    keras_preprocessing \
    matplotlib \
    mock \
    'numpy<1.19.0' \
    scipy \
    sklearn \
    pandas \
    future \
    portpicker \
    enum34

# -------------------------------------------------------------------
FROM base as devtoolset
ADD devtoolset/fixlinks.sh fixlinks.sh
ADD devtoolset/build_devtoolset.sh build_devtoolset.sh
ADD devtoolset/rpm-patch.sh rpm-patch.sh

# Set up a sysroot for glibc 2.12 / libstdc++ 4.4 / devtoolset-7 in /dt7.
RUN /build_devtoolset.sh devtoolset-7 /dt7
# Set up a sysroot for glibc 2.12 / libstdc++ 4.4 / devtoolset-8 in /dt8.
RUN /build_devtoolset.sh devtoolset-8 /dt8

FROM nvidia/cuda:11.0-cudnn8-devel-ubuntu18.04 as tensorflow_custom_op_gpu
COPY --from=devtoolset /dt7 /dt7
COPY --from=devtoolset /dt8 /dt8

# The install scripts need to be copied into this stage before running them.
COPY install/ /install/
RUN /install/install_bootstrap_deb_packages.sh
RUN /install/install_deb_packages.sh
RUN /install/install_clang.sh
RUN /install/install_bazel.sh
RUN /install/install_buildifier.sh
RUN /install/install_pip_packages.sh
RUN /install/install_auditwheel.sh

ENV TF_NEED_CUDA=1

# -------------------------------------------------------------------
FROM tensorflow_custom_op_gpu as tensorflow_build_manylinux2010_multipython

COPY install/install_pip_packages_by_version.sh /install/
RUN /install/install_pip_packages_by_version.sh "/usr/local/bin/pip2.7"
RUN /install/install_pip_packages_by_version.sh "/usr/local/bin/pip3.8"
RUN /install/install_pip_packages_by_version.sh "/usr/local/bin/pip3.5"
RUN /install/install_pip_packages_by_version.sh "/usr/local/bin/pip3.6"
RUN /install/install_pip_packages_by_version.sh "/usr/local/bin/pip3.7"

# -------------------------------------------------------------------
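
As a usage sketch (the tags and push destination here are only examples), each intermediate stage can then be built and published on its own with docker build --target:

# Build and tag only the devel stage from the single Dockerfile.
docker build --target tensorflow_devel_gpu -t tensorflow/tensorflow:devel-gpu .
# Build the slimmer runtime stage from the same file.
docker build --target tensorflow_gpu -t tensorflow/tensorflow:latest-gpu .
# Push like any other tag.
docker push tensorflow/tensorflow:devel-gpu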

Who will benefit from this feature?
SIGs, developers, and downstream libraries looking for well managed Docker containers to build from.

Example benefits:

  • Currently the custom_op container has a lot of installations that are not needed; see "The docker image provided to compile tensorflow custom ops is too big" #38352.
  • Currently the custom_op container gets updated only when time permits; this change would make it update alongside the rest of the containers.
  • Currently manylinux2010_multipython is the most comprehensive build container, but it has no SLA and is also unnecessarily bulky (the pip package installations for every Python version could be removed).
  • There would be a single Dockerfile to maintain (with owners for separate pieces, etc.).

Any Other info.
With this refactored properly, we could close #38352, tensorflow/addons#2326, and tensorflow/build#6.

@seanpmorgan
Member Author

Tagging some stakeholders to see if we're able to get traction on this issue. I know some team members have left Dev-Infra, but this would be a great way to consolidate work and save time down the line.

cc
TF team - @av8ramit @angerson
SIG Build - @perfinion
SIG Addons - @bhack @WindQAQ
SIG IO - @yongtang

@Saduf2019 Saduf2019 added the comp:gpu GPU related issues label Dec 30, 2020
@perfinion perfinion added the type:build/install Build and install issues label Dec 30, 2020
@bhack
Contributor

bhack commented Dec 30, 2020

Dockerfile analysis and refactoring was discussed recently in several SIG Build meetings, also in relation to the cache issue at tensorflow/build#5.
But I think that @angerson is no longer allocated to this activity.

@jvishnuvardhan jvishnuvardhan removed their assignment Jan 4, 2021
@angerson
Contributor

angerson commented Jan 5, 2021

First, thanks a lot for writing this up as a clean issue with clear goals. It's a big help for our prioritization and planning.

Docker has been challenging for us internally because of the maintenance costs and related confusion. A consolidation would be awesome:

  • Save engineering time maintaining the images
  • Help other internal teams that use TF's docker images for their own tests
  • Give us an opportunity to use containerization in our own tests
  • Let us build internally with the same environments as we offer externally

I don't think we'll be able to prioritize a dedicated project for this because of our team's current constraints (DevInfra is only a few people now, and internal maintenance is prioritized). I'd like to make better Docker support a goal beyond just our team, though, and I have some internal OKRs that would actually benefit a lot from Docker improvements. So what I think I can do for now is work on Docker while doing that, and use the results to encourage prioritization at a higher level than just me.

On to implementation: I like your suggested Dockerfile layout. I regret my decision years ago to create a complex assembler for our Dockerfiles and would like to deprecate it for something more normal, like this. I also want to move the Dockerfiles and scripts out of tensorflow/tensorflow because dealing with branches is very annoying; an officially-supported project in SIG Build should work nicely. I'll work on testing out your prototype there.

@bhack
Contributor

bhack commented Jan 6, 2021

How many Dockerfiles do we have in the tree? 67?

https://github.com/tensorflow/tensorflow/search?l=Dockerfile&q=rights&type=

@bhack
Contributor

bhack commented Jan 6, 2021

Running find tensorflow/* -iname "*.Dockerfile" | wc -l shows we currently have 107 Dockerfiles (including partial ones).

@bhack
Contributor

bhack commented Jan 6, 2021

Having 107 Dockerfiles also creates overhead when you need to create a PR just to update a Python library.

@yongtang
Member

yongtang commented Jan 6, 2021

Also, I think some of the Python installations, such as 2.7 and 3.5 (with heavy customization, etc.), may no longer be necessary, as 2.7 and 3.5 are deprecated. Some of the customizations in the script may also stem from the original Ubuntu 14.04 base where packages were missing. It may be possible to clean up a little now as we move to Ubuntu 18.04+.

@angerson
Contributor

angerson commented Jan 7, 2021

This is currently used by SIG Addons and SIG IO -- though there is no SLA for these images

Just to be clear: what, beyond an environment that is ideal for building TensorFlow, is required for your use case? I've been operating under the assumption that the custom-op container is somehow unique, but I've never known what about it is helpful. Would the multipython image completely deprecate custom-op if we supported it?

@yongtang
Member

yongtang commented Jan 7, 2021

For Addons/IO, one challenge is that the custom kernel ops are built against the C++ API, which may have subtle differences depending on the compiler and C runtime inside the OS. As a result, if the gcc version and C runtime differ from the ones TensorFlow was built with, the kernel ops may hit incompatibilities (sometimes strange seg faults, etc.).

This is the biggest challenge, as TensorFlow uses a gcc version from devtoolset and relies on an Ubuntu 16.04 C/C++ runtime (devtoolset is not natively built for Ubuntu, so any mismatch can come with surprises for custom ops). In the end we realized that it is just much easier to build the kernel ops for Addons/IO in the same devtoolset/Ubuntu environment.
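
As an illustration (a hypothetical check, not something from the current images), the flags a custom op must be compiled with can be read back from the installed TensorFlow package, which is one way to verify that a build container's toolchain actually matches:

# Print the compile/link flags custom ops must match, including the
# -D_GLIBCXX_USE_CXX11_ABI value tied to TensorFlow's devtoolset toolchain.
python3 -c "import tensorflow as tf; print(' '.join(tf.sysconfig.get_compile_flags()))"
python3 -c "import tensorflow as tf; print(' '.join(tf.sysconfig.get_link_flags()))"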

@goern

goern commented Jan 13, 2021

Cc: @sub-mod @fatherlinux don't we build TF images based on RHEL/UBI?

@sub-mod
Contributor

sub-mod commented Jan 13, 2021

@seanpmorgan @angerson Is there interest in CentOS images? PyPI is using CentOS and we can contribute those images.

@bhack
Contributor

bhack commented Jan 13, 2021

We have CentOS for the oneDNN third_party builds: https://github.com/tensorflow/tensorflow/search?l=Dockerfile&q=centos

@sub-mod
Contributor

sub-mod commented Jan 13, 2021

Thanks @bhack.
Last time I checked, multi-stage Docker builds don't work on RHEL. The Docker client is different on RHEL and Mac.

@angerson
Contributor

angerson commented Jan 14, 2021

@seanpmorgan

Currently the manylinux2010_multipython is the most comprehensive build container, but it has no SLA and is also unnecessarily bulky (can remove all the pip package installations for every py version)

I thought having all the pip packages installed in each version was a benefit. Is that incorrect?

@yongtang
Member

@angerson The manylinux2010 image is mostly needed for building .so files, so having just one Python version will be enough.

The packaging into pip wheels can be done easily with any thin Python container. For example, in IO we use the manylinux2010 container to build the .so file, then, in a second step, we use python:3.6-slim (and python:3.7-slim and python:3.8-slim) to build the pip wheels.
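
A minimal sketch of that two-step flow (the builder image, project layout, and bazel target below are illustrative assumptions, not SIG IO's actual setup):

# Step 1: build the shared library in a manylinux2010-style builder image.
FROM gcr.io/tensorflow-testing/nosla-cuda11.0-cudnn8-ubuntu18.04-manylinux2010-multipython as builder
COPY . /opt/project
WORKDIR /opt/project
# Assumed target that produces the custom-op .so files.
RUN bazel build //...

# Step 2: package the wheel in a thin Python image (repeated per Python version).
FROM python:3.8-slim
COPY --from=builder /opt/project /opt/project
WORKDIR /opt/project
RUN python3 -m pip install --no-cache-dir setuptools wheel && \
    python3 setup.py bdist_wheel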

@angerson
Contributor

Ah, I see, thank you. But it's still useful to have all of the python versions available, right?

@bhack
Contributor

bhack commented Jan 14, 2021

Ah, I see, thank you. But it's still useful to have all of the python versions available, right?

What is the size overhead for the devel images?

@angerson
Contributor

angerson commented Jan 14, 2021

Pretty big. I think Nvidia's devel containers are ~3GB and the nosla ones are ~13GB. See below. It will be difficult to make an everything-included image that's small.

image | tag | size
gcr.io/tensorflow-testing/nosla-cuda11.0-cudnn8-ubuntu18.04-manylinux2010-multipython | latest | 13.9GB
nvidia/cuda | 11.0-base-ubuntu18.04 | 110MB
nvidia/cuda | 11.0-cudnn8-devel-ubuntu18.04 | 7.41GB
nvidia/cuda | 11.0-cudnn8-runtime-ubuntu18.04 | 3.6GB

@angerson
Contributor

angerson commented Jan 14, 2021

Comparatively, the current official images are pretty small. devel-gpu is 3.22GB and nightly-gpu is 2.36GB. The non-GPU ones are much smaller.

Darn. Actually, devel-gpu is 7GB. Docker Hub only reports the compressed size.
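
For reference, the uncompressed size is what shows up locally, e.g.:

# The SIZE column reports the uncompressed image size, unlike Docker Hub's listing.
docker images tensorflow/tensorflow:devel-gpu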

@bhack
Contributor

bhack commented Nov 30, 2021

Custom-op images are not published anymore.
We have temporarily switched to gcr.io/tensorflow-testing/nosla-cuda11.2-cudnn8.1-ubuntu18.04-manylinux2010-multipython in tensorflow/addons#2598, but we don't have a CPU image anymore.

We are also trying to use the new image in tensorflow/addons#2515, but there we don't have a CPU image anymore either.

We also have a new ticket for the disk-space issue these large GPU images cause on small cloud instances: tensorflow/addons#2613.

@yarri-oss

@angerson Perhaps we can bring this issue up in SIG Build, but with the new nightly Docker images, would it make sense to support options for CUDA vs. other GPUs, as well as a CPU-only image?
