Official Tensorflow Docker Image #149

ruffsl · 2015-11-11T18:55:06Z

Hello Tensorflow Community,

I just wanted to kick start a discussions on creating an official docker image for Tensorflow. So going in line with creating common framework for machine learning related researchers and developers to rally around, and given the onslaught on software containers, I think creating a common Tensorflow image would also help in the same regard.

I'd normally make a more detailed proposal as I did for a cuda docker image here, but my goal right now is just to facilitate a discussion. I know there are many technical issues thanks in part to heavy use of GPUs and driver dependencies, but it looks like Nvidia is making some progress on that front: NVIDIA/nvidia-docker.

So if you like the idea or have some ideas/drafts, please chime in.

craigcitro · 2015-11-11T19:49:48Z

Hi @ruffsl -- just to ask, by "official" do you mean having an org on Docker hub?

If so, it's definitely something we want to do; in fact, I registered the Tensorflow org on docker hub -- I just didn't know how to experiment with moving over pre-release without accidentally making the org public. 😉

If you've done this before, I'd love help/pointers.

ruffsl · 2015-11-11T20:04:53Z

Hey @craigcitro, Yes, I think making a official Docker Hub image would be a great goal!

In my haste, I didn't notice the tensorflow/tools/docker folder already in the repo. This great progress, I'll have better sift through what already exists, as I'm just starting with this project my self.

I've created and submitted official Docker Hub images before, this including one for ROS and Gazebo projects, and if you'd like any assistance, I'd love to help. If you want to PM me, I could elaborate on the processes and my experience in greater detail.

ruffsl · 2015-11-13T19:55:09Z

Relevant issue on starting cuda image that GPU enabled tag of tensorflow could build from:
NVIDIA/nvidia-docker#7

cauerego · 2015-11-13T21:45:47Z

More than having a docker image, I'd add, would be having it ready for development and, eventually, even promoting it as the default way to go #203 ! :)

As a newcomer to docker myself, I'm struggling quite a lot to get off the ground here, even after studying it and better understanding it quite well. It isn't as simple or easy as it may seem at first - but it could and should be!

ruffsl · 2015-11-14T23:52:26Z

@craigcitro , could there be a separate repo (tensorflow-docker?) under the tensorflow github org for just the Dockerfiles? Many other projects keep theses files separate to make attaching web hooks, CI and other things simpler. Its a common practice among most official images. Then we'd have a place to build up PRs, and link back to during the submission process.

Also what are the origins of the b.gcr.io/tensorflow/tensorflow base image?
I'm not clear on how it was built:

:~$ docker history b.gcr.io/tensorflow/tensorflow:latest 
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
217daf2537d2        7 days ago          /bin/sh -c #(nop) CMD ["/bin/bash"]             0 B                 
da34eb7f1273        7 days ago          /bin/sh -c #(nop) EXPOSE 8888/tcp               0 B                 
e9bc6354df37        7 days ago          /bin/sh -c #(nop) EXPOSE 6006/tcp               0 B                 
55b545f9baa4        7 days ago          /bin/sh -c #(nop) COPY file:a7af486c3e6a1a7cf   35 B                
957340752397        7 days ago          /bin/sh -c #(nop) COPY file:617470c4514ec5022   137 B               
c906b2184874        8 days ago          /bin/sh -c pip --no-cache-dir install ipykern   3.413 kB            
f67a15164dd5        8 days ago          /bin/sh -c curl -O https://bootstrap.pypa.io/   7.079 MB            
bcb5994d8a18        8 days ago          /bin/sh -c pip install https://storage.google   60.33 MB            
934fbda38a19        8 days ago          /bin/sh -c pip install         jupyter          125.2 MB            
51666ff792cc        8 days ago          /bin/sh -c apt-get update && apt-get install    272.1 MB            
fc3b69d5428a        8 days ago          /bin/sh -c #(nop) MAINTAINER Craig Citro <cra   0 B                 
1d073211c498        3 weeks ago         /bin/sh -c #(nop) CMD ["/bin/bash"]             0 B                 
5a4526e952f0        3 weeks ago         /bin/sh -c sed -i 's/^#\s*\(deb.*universe\)$/   1.895 kB            
99fcaefe76ef        3 weeks ago         /bin/sh -c echo '#!/bin/sh' > /usr/sbin/polic   194.5 kB            
c63fb41c2213        3 weeks ago         /bin/sh -c #(nop) ADD file:e97fe9bddafcfac4ca   187.7 MB

craigcitro · 2015-11-17T06:32:48Z

instructions for building the docker containers are in the README.
we can move the dockerfiles into their own repo if/when it's a problem -- right now, we're still making sure we've got syncing and external contribution flows in order, so no need to add extra complexity yet.
sadly, the CUDA image still doesn't have all the libraries we need (cudnn isn't there).

ruffsl · 2015-11-18T20:04:13Z

Ok, so once a cudnn supported tag image comes around (see NVIDIA/nvidia-docker#10), how do you envision the build image dependency structured? For some given major release of tensorflow, the tags would follow something roughly like this:

Tag	From
tensorflow:`1.0`/`latest`	ubuntu:`14.04`
tensorflow:`1.0-gpu`	cuda:`7.0-cudnn`

i.e. what tags are you expecting to host, and where would each build from?

craigcitro · 2015-11-19T18:33:08Z

@ruffsl whoa, that cudnn issue being resolved is fantastic! thanks for doing the legwork there.

currently, i was thinking we'd do three collections of containers:

Name	From	Contents
tensorflow	ubuntu	base TF install, no deps (bazel, java, etc)
tensorflow-full	tensorflow	full source install, with all deps
tensorflow-full-gpu	tensorflow-full	same as tensorflow-full, but built with GPU support

it's actually possible that we'd tweak the from lines, depending on how we build the containers (eg script vs. Dockerfile), but the net effect would be the same.

that said, I'm now curious: is multiple collections of tags the more common approach in the docker world? in particular, would we also maintain latest, latest-full, and latest-gpu? (i feel like latest gets special treatment from docker, but latest-gpu and latest-full wouldn't, right?)

ruffsl · 2015-11-19T22:39:18Z

Hmm, normally its a best practice to keep a nice linear sequential hierarchy in the tag structure to leverage the storage savings and reduce duplication of binaries by reusing image layers on disk. But from what I'm seeing currently, building the project over itself (once for CPU, then for GPU) is leading to some large image sizes:

$ docker images
REPOSITORY                                TAG                    IMAGE ID            CREATED             VIRTUAL SIZE
tensorflow                                gpu                    1ca458346ab2        16 hours ago        4.847 GB
tensorflow                                cpu                    f092a7f6f122        16 hours ago        3.967 GB
tensorflow                                latest                 6711b8288898        16 hours ago        2.406 GB
cudnn                                     v2-7.0                 80a8e0efea8d        21 hours ago        2.041 GB
cuda                                      7.0                    f8abb195d52b        21 hours ago        2.012 GB
ubuntu                                    14.04                  e9ae3c220b23        9 days ago          187.9 MB

(The above isn't really optimized yet, its just built in cascaded order with nothing flattened like in the README.md. Also, perhaps the cuda image could shed some weight, I suspect it might be rolling in a lot of odd packages)

By forking the tensorflow:full and tensorflow:full-gpu tags we could keep the disk size of each individual tag reasonable at the cost of making the sum of all tags larger. However, my expectation is that users will just deploy with one or the other so bandwidth in moving images would be the limitation. Whereas if someone has both tags locally, ether for testing or experimentation, I don't think storage would be their primary concern. The same could be done I suppose for the latest image, so the trade off is redundant looking dockerfiles and layers for smaller individual tags.

Name	From
tensorflow:`latest`	ubuntu:`14.04`
tensorflow:`full`	tensorflow:`latest` (OR ubuntu:`14.04`)
tensorflow:`full-gpu`	cuda:`7.0-cudnn`

We could play some tricks and keep the heads of each dockerfile similar as long as we could to preserve the build catch. I'm not sure how much the Docker Hub's build processes respects this though if an entire repo is churned with independent tags.

To get to your last question: Yes, latest holds a special default meaning. I haven't seen the use of :latest-foo anywhere, as the user would have to spell it out fully to use it anyway, so for brevity most just do :foo or :<release>-foo. By the way, what release is tensorflow at? Is this all still version < 1.0?

ruffsl · 2015-11-25T02:35:04Z

I've been tinkering with variants of the Dockerfiles for each tag, and with my latest modifications I'm getting virtual image sizes shown bellow given the FROM structure I tabled above:

$ docker images 
REPOSITORY           TAG                   IMAGE ID            CREATED             VIRTUAL SIZE
tensorflow           full-gpu              73ec812422ad        21 minutes ago      3.444 GB
tensorflow           full                  3c90422c6a96        About an hour ago   2.125 GB
tensorflow           latest                098cb21442b1        3 days ago          1.606 GB

Given the dependencies for a source build, I don't immediately see much more I could cut. Are there any plans for shipping a binary, built against a release of cuda? Then we could swap for the lighter cudnn-runtime tag.

craigcitro · 2015-12-03T18:32:11Z

Sorry for the lull here -- was waiting for the NVidia images to get pushed publicly, and now they're live! Adding in @ruffsl @ebrevdo @jendap for discussion as well.

I now see what you mean about multiple tags in the repo, which I think is what we want to go with. For better or worse, I think we want to go with four tags:

Tag	From	Contents
`latest`/`<release>`	`ubuntu:14.04`	minimal container with TF, no GPU support
`latest-gpu`/`<release>-gpu`	`cuda:cudnn-runtime`	minimal container with TF and GPU support
`devel`/`<release>-devel`	`ubuntu:14.04`	full image with all TF deps + source code, no GPU
`devel-gpu`/`<release>-devel-gpu`	`cuda:cudnn-devel`	full image with all TF deps + source + GPU support

We'll update these as new versions of TF get released, so that over time we'll end up with a number of older tags.
My main motivation for the two sets of images is just size: I think the latest image can be slimmed down a bit, so it'll end up ~1/2 the size of the image with GPU support. For the "I just want to kick the tires" use case, this seems valuable.
The naming scheme is also similar to the one for the CUDA images, except that we elide the runtime suffix.

Thoughts?

flx42 · 2015-12-03T18:58:16Z

If you still have strong requirements on the CUDA and cuDNN versions, you should make your gpu image use the full version name like cuda:7.0-cudnn2-runtime. The tag cuda:cudnn-runtime is currently cuDNN v2 but will be bumped to cuDNN v4 once it's available (soon!).

ruffsl · 2015-12-03T20:30:36Z

I concur with @flx42 , specifically for official images, as you'd most likely want everything nailed down so it doesn't shift underneath you unless explicitly noted through updating the Dockerfiles. A compromise of specificity is why we normally see FROM ubuntu:14.04 and not just ubuntu nor ubuntu:14.04.3, as updates in minor distros releases are normally quite stable routine security patches.

For cudnn, we should use the full name cuda:7.0-cudnn2-runtime and cuda:7.0-cudnn2-devel to lock down the version of cuda, as even cuda 7.5 support still seems to be an open issue [https://github.com//issues/20].

Change 109537918 TensorFlow pip setup: wheel >= 0.26 for python3 pip install Change 109505848 Fix distortion default value to 1.0 in fixed_unigram_candidate_sampler. This means we default to the actual provided unigram distribution, instead of to the uniform (as it is currently). Change 109470494 Bugfix in gradients calculation when the ys rely on each other. Change 109467619 Fix CIFAR-10 model to train on all the training data instead of just 80% of it. Fixes #396. Change 109467557 Replaced checkpoint file with binary GraphDef. Change 109467433 Updates to C++ tutorial section. Change 109465269 TensorFlow: update documentation for tutorials to not assume use of bazel (when possible). Change 109462916 A tutorial for image recognition to coincide with the release of the latest Inception image classification model. Change 109462342 Clear control dependencies in variable_scope.get_variable() when creating ops for the initializer. Add tests of various error conditions. Change 109461981 Various performance improvements in low-level node execution code paths. Speeds up ptb_word_lm on my desktop with a Titan X from 3638 words per second to 3751 words per second (3.1% speedup). Changes include: o Avoided many strcmp operations per node execution and extra touches of cache lines in executor.cc, by making all the various IsMerge, IsSwitch, IsSend, etc. operations instead be based on an internal enum value that is pre-computed at Node construction time, rather than doing string comparisons against node->type_string(). We were doing about 6 such comparisons per executed node. o Removed mutex_lock in executor.cc in ExecutorState::Process. The lock was not needed and the comment about the iterations array being potentially resized is not true (the iterations arrays are created with a fixed size). Checked with yuanbyu to confirm this. o Added new two-argument port::Tracing::ScopedAnnotation constructor that takes two StringPiece arguments, and only concatenates them lazily if tracing is enabled. Also changed the code in platform/tracing.{h,cc} so that the ScopedAnnotation constructor and the TraceMe constructor can be inlined. o In BaseGPUDevice::Compute, used the two-argument ScopedAnnotation constructor to avoid doing StrCat(opkernel->name(), ":", op_kernel->type_string()) on every node execution on a GPU. o Introduced a new TensorReference class that just holds a reference to an underlying TensorBuffer, and requires an explicit Unref(). o Changed the EventMgr interface to take a vector of TensorReference objects for EventMgr::ThenDeleteTensors, rather than a vector of Tensor objects. o Used TensorReference in a few places in gpu_util.cc o Minor: switched to using InlinedVectors in a few places to get better cache locality. Change 109456692 Updated the label_image example to use the latest Inception model Change 109456545 Provides classify_image which performs image recognition on a 1000 object label set. $ ./classify_image giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca (score = 0.88493) indri, indris, Indri indri, Indri brevicaudatus (score = 0.00878) lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens (score = 0.00317) custard apple (score = 0.00149) earthstar (score = 0.00127) Change 109455002 TensorFlow: make the helper libraries for various models available in the pip package so that when users type: python translate.py ... the absolute import works. This change is supposed to help make our tutorials run without the *need* to use bazel. Change 109450041 TensorFlow: remove cifar and convolutional binary copies from pip install. Adds embedding and some other models to the list. Change 109448520 Move the description of a failing invariant from a comment into the dcheck-fail message text. Change 109447577 TensorBoard has release tagging (tensorboard/TAG) Also track TensorBoard changes (tensorboard/CHANGES) Change 109444161 Added ParseSingleSequenceExample + python wrappers + unit tests. Change 109440864 Update all the TensorFlow Dockerfiles, and simplify GPU containers. This change updates all four of our Dockerfiles to match the targets discussed in #149. The most notable change here is moving the GPU images to use the NVidia containers which include cudnn and other build-time dependencies, dramatically simplifying both the build and run steps. A description of which tags exist and get pushed where will be in a follow-up. Change 109432591 Some pylint and pydoc changes in saver. Change 109430127 Remove unused hydrogen components Change 109419354 The RNN api, although moved into python/ops/, remains undocumented. It may still change at any time. Base CL: 109538006

craigcitro · 2015-12-10T20:56:10Z

OK -- I was waiting for the new release, which is imminent, so new images are pushed.

We now have b.gcr.io/tensorflow/tensorflow:<tag>, for tag all four variations mentioned above, as latest and 0.6.0 versions. I'm also busy uploading new containers to dockerhub.

I think we can close this for now, with a caveat that we need to come back in ~2 months and figure out if there are enough downloads that we want to submit a PR to become an official repo?

ruffsl · 2015-12-10T21:38:31Z

Hmm, I suppose we could also come back to this once NVIDIA/nvidia-docker#7 is finished, as this would be a prerequisite for official repo GPU support. I take it this the current docker hub repo: https://hub.docker.com/r/tensorflow/tensorflow

craigcitro · 2015-12-10T21:43:10Z

@ruffsl yep, that's the one.

so closing for now, but we can either reopen or create a new issue later.

Change 109537918 TensorFlow pip setup: wheel >= 0.26 for python3 pip install Change 109505848 Fix distortion default value to 1.0 in fixed_unigram_candidate_sampler. This means we default to the actual provided unigram distribution, instead of to the uniform (as it is currently). Change 109470494 Bugfix in gradients calculation when the ys rely on each other. Change 109467619 Fix CIFAR-10 model to train on all the training data instead of just 80% of it. Fixes #396. Change 109467557 Replaced checkpoint file with binary GraphDef. Change 109467433 Updates to C++ tutorial section. Change 109465269 TensorFlow: update documentation for tutorials to not assume use of bazel (when possible). Change 109462916 A tutorial for image recognition to coincide with the release of the latest Inception image classification model. Change 109462342 Clear control dependencies in variable_scope.get_variable() when creating ops for the initializer. Add tests of various error conditions. Change 109461981 Various performance improvements in low-level node execution code paths. Speeds up ptb_word_lm on my desktop with a Titan X from 3638 words per second to 3751 words per second (3.1% speedup). Changes include: o Avoided many strcmp operations per node execution and extra touches of cache lines in executor.cc, by making all the various IsMerge, IsSwitch, IsSend, etc. operations instead be based on an internal enum value that is pre-computed at Node construction time, rather than doing string comparisons against node->type_string(). We were doing about 6 such comparisons per executed node. o Removed mutex_lock in executor.cc in ExecutorState::Process. The lock was not needed and the comment about the iterations array being potentially resized is not true (the iterations arrays are created with a fixed size). Checked with yuanbyu to confirm this. o Added new two-argument port::Tracing::ScopedAnnotation constructor that takes two StringPiece arguments, and only concatenates them lazily if tracing is enabled. Also changed the code in platform/tracing.{h,cc} so that the ScopedAnnotation constructor and the TraceMe constructor can be inlined. o In BaseGPUDevice::Compute, used the two-argument ScopedAnnotation constructor to avoid doing StrCat(opkernel->name(), ":", op_kernel->type_string()) on every node execution on a GPU. o Introduced a new TensorReference class that just holds a reference to an underlying TensorBuffer, and requires an explicit Unref(). o Changed the EventMgr interface to take a vector of TensorReference objects for EventMgr::ThenDeleteTensors, rather than a vector of Tensor objects. o Used TensorReference in a few places in gpu_util.cc o Minor: switched to using InlinedVectors in a few places to get better cache locality. Change 109456692 Updated the label_image example to use the latest Inception model Change 109456545 Provides classify_image which performs image recognition on a 1000 object label set. $ ./classify_image giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca (score = 0.88493) indri, indris, Indri indri, Indri brevicaudatus (score = 0.00878) lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens (score = 0.00317) custard apple (score = 0.00149) earthstar (score = 0.00127) Change 109455002 TensorFlow: make the helper libraries for various models available in the pip package so that when users type: python translate.py ... the absolute import works. This change is supposed to help make our tutorials run without the *need* to use bazel. Change 109450041 TensorFlow: remove cifar and convolutional binary copies from pip install. Adds embedding and some other models to the list. Change 109448520 Move the description of a failing invariant from a comment into the dcheck-fail message text. Change 109447577 TensorBoard has release tagging (tensorboard/TAG) Also track TensorBoard changes (tensorboard/CHANGES) Change 109444161 Added ParseSingleSequenceExample + python wrappers + unit tests. Change 109440864 Update all the TensorFlow Dockerfiles, and simplify GPU containers. This change updates all four of our Dockerfiles to match the targets discussed in tensorflow/tensorflow#149. The most notable change here is moving the GPU images to use the NVidia containers which include cudnn and other build-time dependencies, dramatically simplifying both the build and run steps. A description of which tags exist and get pushed where will be in a follow-up. Change 109432591 Some pylint and pydoc changes in saver. Change 109430127 Remove unused hydrogen components Change 109419354 The RNN api, although moved into python/ops/, remains undocumented. It may still change at any time. Base CL: 109538006

Closes #149 COPYBARA_INTEGRATE_REVIEW=tensorflow/mlir#149 from kiszk:missing_links_g3doc 5f98bc279649d54ea3dcf9fe0e17be6ad6d6cb8f PiperOrigin-RevId: 271568274

ruffsl changed the title ~~Official Tensorflow Docker Image 🐳~~ Official Tensorflow Docker Image Nov 11, 2015

teamdandelion added the installation/startup label Nov 12, 2015

ebrevdo mentioned this issue Nov 15, 2015

Promote usage of docker above all other installation methods #203

Closed

craigcitro self-assigned this Nov 17, 2015

craigcitro closed this as completed Dec 10, 2015

fredtony mentioned this issue Apr 24, 2017

after type 'run' in tensorflow debugger, the terminal reappears and stucks #7774

Closed

lissyx pushed a commit to lissyx/tensorflow that referenced this issue Feb 16, 2018

[OpenCL] Registers LogicalAnd (tensorflow#149)

9247d08

tensorflow-copybara pushed a commit that referenced this issue Nov 19, 2019

Fix missing links in the documentation

e1c1ab6

Closes #149 COPYBARA_INTEGRATE_REVIEW=tensorflow/mlir#149 from kiszk:missing_links_g3doc 5f98bc279649d54ea3dcf9fe0e17be6ad6d6cb8f PiperOrigin-RevId: 271568274

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Official Tensorflow Docker Image #149

Official Tensorflow Docker Image #149

ruffsl commented Nov 11, 2015

craigcitro commented Nov 11, 2015

ruffsl commented Nov 11, 2015

ruffsl commented Nov 13, 2015

cauerego commented Nov 13, 2015

ruffsl commented Nov 14, 2015

craigcitro commented Nov 17, 2015

ruffsl commented Nov 18, 2015

craigcitro commented Nov 19, 2015

ruffsl commented Nov 19, 2015

ruffsl commented Nov 25, 2015

craigcitro commented Dec 3, 2015

flx42 commented Dec 3, 2015

ruffsl commented Dec 3, 2015

craigcitro commented Dec 10, 2015

ruffsl commented Dec 10, 2015

craigcitro commented Dec 10, 2015

Official Tensorflow Docker Image #149

Official Tensorflow Docker Image #149

Comments

ruffsl commented Nov 11, 2015

craigcitro commented Nov 11, 2015

ruffsl commented Nov 11, 2015

ruffsl commented Nov 13, 2015

cauerego commented Nov 13, 2015

ruffsl commented Nov 14, 2015

craigcitro commented Nov 17, 2015

ruffsl commented Nov 18, 2015

craigcitro commented Nov 19, 2015

ruffsl commented Nov 19, 2015

ruffsl commented Nov 25, 2015

craigcitro commented Dec 3, 2015

flx42 commented Dec 3, 2015

ruffsl commented Dec 3, 2015

craigcitro commented Dec 10, 2015

ruffsl commented Dec 10, 2015

craigcitro commented Dec 10, 2015