Modularize and consolidate the Docker images for downstream usage #46062
Comments
Tagging some stakeholders to see if we're able to get traction on this issue. I know some team members have left Dev-Infra, but this would be a great way to consolidate work and save time down the line. cc
Dockerfile analysis and refactoring was discussed recently in many SIG Build meetings, also regarding the cache issue at tensorflow/build#5.
First, thanks a lot for writing this up as a clean issue with clear goals. It's a big help for our prioritization and planning. Docker has been challenging for us internally because of the maintenance costs and related confusion. A consolidation would be awesome:
I don't think we'll be able to prioritize a dedicated project for this because of our team's current constraints (DevInfra is only a few people now, and internal maintenance is prioritized). I'd like to make better Docker support a goal beyond just our team, though, and I have some internal OKRs that would actually benefit a lot from Docker improvements. So what I think I can do for now is work on Docker while doing that, and use the results to encourage prioritization at a higher level than just me. On to implementation: I like your suggested Dockerfile layout. I regret my decision years ago to create a complex assembler for our Dockerfiles and would like to deprecate it for something more normal, like this. I also want to move the Dockerfiles and scripts out of tensorflow/tensorflow because dealing with branches is very annoying; an officially-supported project in SIG Build should work nicely. I'll work on testing out your prototype there. |
How many Dockerfiles do we have in the tree? 67? https://github.com/tensorflow/tensorflow/search?l=Dockerfile&q=rights&type=
Having 107 Dockerfiles also creates overhead when you need to open a PR just to update a Python library.
Also, I think some of the Python installations such as 2.7 and 3.5 (with heavy customization, etc.) may not be necessary any more, as 2.7 and 3.5 are deprecated. Some of the customizations in the script may also date from the original Ubuntu 14.04 base where packages were missing. It may be possible to clean up a little now as we move to Ubuntu 18.04+.
Just to be clear: what, beyond an environment that is ideal for building TensorFlow, is required for your use case? I've been operating under the assumption that the custom-op container is somehow unique, but I've never known what about it is helpful. Would the multipython image completely deprecate custom-op if we supported it?
For Addons/IO, one challenge is that the custom kernel ops are built against the C++ API, which may have subtle differences depending on the compiler and C runtime inside the OS. As a result, if the gcc version and C runtime differ from TensorFlow's gcc version and C runtime, the kernel ops may hit incompatibilities (sometimes strange segfaults, etc.). This is the biggest challenge, as TensorFlow uses a gcc version from devtoolset and relies on an Ubuntu 16.04 C/C++ runtime. (devtoolset is not natively built with Ubuntu, so any mismatch can come with surprises for custom ops.) In the end we realized that it is just much easier to build the kernel ops in Addons/IO with the same devtoolset/Ubuntu environment.
Cc: @sub-mod @fatherlinux Don't we build TF images based on RHEL/UBI?
@seanpmorgan @angerson Is there interest in CentOS images? PyPI is using CentOS and we can contribute those images.
We have CentOS for oneDNN third_party builds https://github.com/tensorflow/tensorflow/search?l=Dockerfile&q=centos
Thanks @bhack . |
I thought having all the pip packages installed in each version was a benefit. Is that incorrect? |
@angerson The manylinux2010 image is mostly needed for building. The packaging for pip can be done easily with any thin Python container. For example, in IO we use the manylinux2010 container to build the
Ah, I see, thank you. But it's still useful to have all of the python versions available, right? |
What is the size overhead for the dev images?
Pretty big.
Darn. Actually, devel-gpu is 7GB. Docker Hub only reports the compressed size. |
Custom-op images are not published anymore. We are also trying to use the new image at tensorflow/addons#2515, but there we don't have a CPU image anymore either. We have a new ticket for the disk-space issue on small cloud instances with these large GPU images at tensorflow/addons#2613
@angerson perhaps we can bring this issue up in SIG Build, but with the new nightly Docker images would it make sense to support options for CUDA vs other GPUs as well as a CPU-only image? |
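One way to support a CPU-only variant alongside CUDA without multiplying Dockerfiles (a hypothetical sketch; the image names and build arg are illustrative, not an agreed design) is to parameterize the base image:

```dockerfile
# Hypothetical: one Dockerfile, parameterized base image.
ARG BASE_IMAGE=ubuntu:20.04
FROM ${BASE_IMAGE} AS base
# ...common setup shared by CPU and GPU variants...

# CPU-only:
#   docker build -t tf-nightly-cpu .
# CUDA:
#   docker build --build-arg BASE_IMAGE=nvidia/cuda:11.2.2-base-ubuntu20.04 -t tf-nightly-gpu .
```

Other accelerators could slot in the same way, as long as their userland packages install on top of an arbitrary base.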
System information
Describe the feature and the current behavior/state.
Currently there are 3 or 4 Dockerfiles which are maintained independently and with different levels of support. This has been a time sink for the TF team, and a headache for downstream consumers. It should be (relatively) easy to refactor these as docker targets that build from one another. Information below is for GPU containers (though it applies to CPU and TF versions as well):
As you can see, there is a ton of duplication, modular scripts are not reused where they could be, and even with these 4 options there is still a ton of bloat in the images that should be refactored.
Will this change the current api? How?
We should use Docker targets to progressively build the containers and publish their intermediate stages. There would be no need to modify tags or anything. Prototype:
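The prototype snippet itself did not survive extraction here. As a hedged sketch only (stage names, scripts, and tags below are hypothetical), the "targets that build from one another" idea could look like:

```dockerfile
# Hypothetical multi-stage layout: each published image is an intermediate
# target of one Dockerfile, so shared layers are defined (and cached) once.
FROM nvidia/cuda:11.2.2-base-ubuntu20.04 AS base
COPY setup.packages.sh /
RUN /setup.packages.sh           # hypothetical shared setup script

FROM base AS runtime
RUN pip install tensorflow       # runtime image: TF plus minimal deps

FROM base AS devel
COPY setup.bazel.sh /
RUN /setup.bazel.sh              # devel image: toolchain for building TF

# Publish intermediate stages directly, no tag changes needed:
#   docker build --target runtime -t tensorflow/tensorflow:latest-gpu .
#   docker build --target devel   -t tensorflow/tensorflow:latest-gpu-devel .
```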
Who will benefit with this feature?
SIGs, developers, and downstream libraries looking for well managed Docker containers to build from.
Example benefits:
Any Other info.
With this being refactored properly we could close #38352, tensorflow/addons#2326 and tensorflow/build#6