Distributed tensorflow on Mesos #1996

Closed
bhack opened this issue Apr 17, 2016 · 52 comments
Labels
type:feature Feature requests

Comments

@bhack
Contributor

bhack commented Apr 17, 2016

In the distributed howto: "We are working on tools for launching tasks programmatically, e.g. using a cluster manager like Kubernetes. If there are particular cluster managers for which you'd like to see support, please raise a GitHub issue."
It could be interesting to have CPU and GPU Dockerfiles ready for distributed TensorFlow that can run in a scalable way on Mesos (with Marathon and Kubernetes).

@bhack
Contributor Author

bhack commented Apr 17, 2016

/cc @mesosphere

@vrv vrv added enhancement stat:contribution welcome Status - Contributions welcome labels Apr 18, 2016
@mckelvin

We (@douban) are developing a lightweight Mesos framework (named tfmesos) to run TensorFlow (0.8+) in Docker on Mesos. It already works but may not be production ready yet. If anybody is interested, we would like to open-source the experimental code. If a better distributed implementation appears in the future, we would be happy to mark tfmesos as deprecated.

There's still a protobuf dependency conflict between TensorFlow and Mesos. We made a dirty patch to work around it, and the Mesos crew is working on it right now: https://issues.apache.org/jira/browse/MESOS-5186

@thinxer
Contributor

thinxer commented Apr 18, 2016

@mckelvin I'm interested in the implementation. Is it possible to share with the rest of us the possibilities that tfmesos opens up? For example, how does it simplify the process of running distributed TensorFlow?

@bhack
Contributor Author

bhack commented Apr 18, 2016

@mckelvin Yes. This issue was tagged "contributions welcome", so if you have something to share it could probably be turned into a pull request soon.

@bhack
Contributor Author

bhack commented Apr 18, 2016

@mrry What do you think? Do you have any feedback on how to handle this so that it would be easier to integrate into the TF repository with a PR?

@mrry
Contributor

mrry commented Apr 18, 2016

I'm agnostic as to whether this would be better as a standalone repository, or integrated into somewhere like tf.contrib. One of the concerns is that there might be version skew between TensorFlow HEAD and an external repository: although we're trying our best to keep the API stable, the distributed runtime libraries are pretty new, and we might want to change them between releases, so having it local might be better. On the other hand, I'm not sure how easy it would be to add Mesos to our test matrix, and I don't want to put our testing team on the hook for that.

Hopefully the integration can be relatively simple, and exist as a set of Python scripts somewhere (though I don't have enough experience with Mesos to say). There might be some changes required in the core, so I'll be watching this thread, and prepared to respond to feature requests.

@bhack
Contributor Author

bhack commented Apr 18, 2016

@mrry Some initial work is at https://github.com/douban/tfmesos

@mckelvin

@bhack You got it! I've pushed the initial code to https://github.com/douban/tfmesos as well as https://hub.docker.com/r/tfmesos/tfmesos/ (@windreamer is the main developer of tfmesos).

If you have Mesos + Docker installed, you can run the demo now. Before you start, you should pull the tfmesos Docker image on the Mesos server and slaves via: docker pull tfmesos/tfmesos .

Notice: there are still some unsolved issues in tfmesos and it is not production ready. For example, the tfmesos container runs as root, which is dangerous. We're still working on it. Feel free to open a PR or issue if you have any ideas or suggestions.

# coding: utf-8
# ~/demo.py

import sys
import tensorflow as tf
from tfmesos import cluster


def main(argv):
    # Ask tfmesos for 2 parameter servers and 2 workers.
    jobs_def = [
        {
            "name": "ps",
            "num": 2
        },
        {
            "name": "worker",
            "num": 2
        },
    ]
    mesos_master = argv[1]
    with cluster(jobs_def, master=mesos_master, quiet=False) as targets:
        # Place one constant on each parameter server.
        with tf.device('/job:ps/task:0'):
            a = tf.constant(10)

        with tf.device('/job:ps/task:1'):
            b = tf.constant(32)

        # Compute the sum on the second worker.
        with tf.device("/job:worker/task:1"):
            op = a + b

        # Connect the session to the first worker's gRPC target.
        with tf.Session(targets['/job:worker/task:0']) as sess:
            print(sess.run(op))


if __name__ == '__main__':
    main(sys.argv)
mckelvin@mesos1 ~ $ cat ./run.sh
#!/bin/sh
docker run \
    -e MESOS_MASTER=mesos1 \
    -e DOCKER_IMAGE=tfmesos/tfmesos:latest \
    --net=host \
    -v /home/mckelvin/demo.py:/tmp/demo.py \
    --rm \
    -it \
    tfmesos/tfmesos:latest \
    python /tmp/demo.py mesos1
mckelvin@mesos1 ~ $ ./run.sh
2016-04-18 09:59:22,804 [INFO] [tfmesos.scheduler] Tensorflow cluster registered. ( http://mesos1:5050/#/frameworks/8beedc27-4bea-4f33-85b9-b440697419bd-0293 )
2016-04-18 09:59:26,142 [INFO] [tfmesos.scheduler] Device /job:ps/task:0 activated @ grpc://mesos1:52382
2016-04-18 09:59:26,150 [INFO] [tfmesos.scheduler] Device /job:ps/task:1 activated @ grpc://mesos1:44664
2016-04-18 09:59:26,158 [INFO] [tfmesos.scheduler] Device /job:worker/task:0 activated @ grpc://mesos1:32984
2016-04-18 09:59:26,166 [INFO] [tfmesos.scheduler] Device /job:worker/task:1 activated @ grpc://mesos1:32032
42
2016-04-18 09:59:26,190 [DEBUG] [tfmesos.scheduler] exit

@bhack
Contributor Author

bhack commented Apr 18, 2016

/cc @mtamburrano

@bhack
Contributor Author

bhack commented Apr 18, 2016

@mckelvin I think the Docker image could be based on https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/docker/README.md.
Especially if we want to have GPU support.

@windreamer

The main problem is that Mesos by default does not handle GPUs as resources, so tfmesos currently focuses on building a CPU-based distributed cluster. I have no idea how to share GPU resources among tasks. @bhack, do you use GPUs on your Mesos cluster?

@bhack
Contributor Author

bhack commented Apr 19, 2016

@windreamer See also NVIDIA/nvidia-docker#60

@bhack
Contributor Author

bhack commented Apr 23, 2016

Related news: DC/OS is open source now.

@bhack
Contributor Author

bhack commented May 5, 2016

We are trying to test TensorFlow on Mesos with GPUs. If anybody is interested, see douban/tfmesos#3. We also need to think about smarter automatic device placement in cluster scenarios like Mesos. See also #2126

@jvanz

jvanz commented May 26, 2016

Hi, I'm a new Mesos contributor and I have a little experience developing Mesos frameworks. While studying the possibility of running TensorFlow as a native framework, I found this thread. ^.^

I saw that you're talking about GPU resources in Mesos. Take a look:

https://issues.apache.org/jira/browse/MESOS-4424

@windreamer

@jvanz we are trying nvidia-docker to help us allocate GPU resources, based on @3XX0's suggestion.

experimental work is here:
douban/tfmesos#3

But before all of that, I think this JIRA should be fixed first:
https://issues.apache.org/jira/browse/MESOS-5186

@bhack
Contributor Author

bhack commented May 27, 2016

@windreamer for issue 5186 I suggest you open a PR directly at https://github.com/apache/mesos/pulls

@bhack
Contributor Author

bhack commented Jun 15, 2016

@windreamer @girving How do you think this could initially be contributed to TF? I think that if you create a PR we can attract a broader user base.

@windreamer

Sure, it is an honour to contribute this small code base to TF. However, I think TF prefers k8s over Mesos or YARN.

@bhack
Contributor Author

bhack commented Jun 15, 2016

K8s is in-house, but I don't think that @mrry is against a Mesos contribution.

@windreamer

By the way, tfmesos depends on pymesos as a driver (or you can use the Mesos native Python driver, but it is a lot heavier). I do not know whether TF cares about external dependencies.

@mrry
Contributor

mrry commented Jun 15, 2016

@bhack @windreamer: We'd be delighted to see TensorFlow working on Mesos (and YARN, and other cluster managers). I don't know anything about how tfmesos is structured, but if it can exist as external code, that would be best. It shouldn't be necessary to add a dependency from the core tensorflow module to pymesos or any other cluster manager, unless I'm missing something. We could investigate putting things in tensorflow.contrib if the dependency were optional.

@windreamer

OK, we can run distributed TensorFlow training using TFMesos with or without GPUs now (thanks to the nvidia-docker project). We also provide a script, tfrun, to submit a between-graph replicated training script to a Mesos cluster (the distributed Inception model uses this mode).

Although TFMesos is still experimental, we believe this is a good start toward better integrating TensorFlow into the existing big-data ecosystem and supporting bigger and deeper models in the future.
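
(Editor's note: "between-graph replication" here means each worker process builds its own copy of the graph, with variables pinned to the ps tasks, and tfrun presumably launches one such process per Mesos task. Below is a minimal sketch of that pattern using the TensorFlow distributed API of this era; the endpoints, job name, and task index are placeholders that tfmesos/tfrun would normally supply.)

import tensorflow as tf

# Hypothetical endpoints -- in a real run these come from the cluster manager.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222"],
    "worker": ["worker0:2222", "worker1:2222"],
})

# Each replica is started with its own job name and task index.
job_name, task_index = "worker", 0
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # parameter servers just host variables
else:
    # Variables land on the ps tasks, compute ops stay on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index,
            cluster=cluster)):
        step = tf.Variable(0, name="global_step")
        train_op = tf.assign_add(step, 1)

    with tf.Session(server.target) as sess:
        sess.run(tf.initialize_all_variables())
        print(sess.run(train_op))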

@bhack
Contributor Author

bhack commented Aug 31, 2016

As we have already discussed, we need to make a decision on the new http://mesos.apache.org/documentation/latest/container-image/

@windreamer

@bhack yes, Apache Mesos 1.0 introduces a lot of new features, and I need more time to decide which is the best way.

@windreamer

and unfortunately https://issues.apache.org/jira/browse/MESOS-5186 is still unresolved...

@klueska

klueska commented Aug 31, 2016

For what it's worth, the only reliable / supported path for GPUs in Mesos going forward will be via the unified containerizer, i.e. what is discussed in the link sent by bhack before: http://mesos.apache.org/documentation/latest/container-image/

GPU support for the docker containerizer is currently under development, but using it will not be the recommended mode of operation.

Regarding MESOS-5186, what features of proto3 do you require? I know we aren't planning on bumping Mesos to protobuf 3.0 anytime soon (though there are long-term plans to do so). Barring this change, what else could be done to unblock this?


@windreamer

windreamer commented Aug 31, 2016

@klueska proto3 support is vital, or we will end up with a version conflict between mesos.interface and TensorFlow. Frankly speaking, I cannot understand why this "one-line modification" has been blocking for months... Existing users of Mesos can keep on using proto2 as usual.

I would like to try the Mesos containerization way of launching the image, but it is still a bit of a mess for me to figure out how to enable this together with GPU support. I need more time, and any suggestions and contributions are definitely welcome!

@bhack
Contributor Author

bhack commented Aug 31, 2016

Also is there a DC/OS plan?

@klueska

klueska commented Aug 31, 2016

@windreamer I think the reason the JIRA was never resolved is probably that it's not clear the proposed fix is the right one. It may happen to work for your particular use case, but many of Mesos's protobufs aren't written in a way that is compatible with proto3 clients. For example, many of Mesos's protobufs still contain required fields. We don't want people to blindly do a pip install protobuf (which installs 3.0 by default) and then start writing clients that will break in subtle ways when interacting with proto2 data coming over the wire. If you know of a general workaround for this, I'm sure it would gladly be accepted.

Regarding problems figuring out how to enable GPU support -- I can help with that. We basically mimic the functionality of nvidia-docker so that anything that runs in nvidia-docker should now be able to run in mesos as well. Consider the following example:

$ mesos-master \
      --ip=127.0.0.1 \
      --work_dir=/var/lib/mesos

$ mesos-agent \
      --master=127.0.0.1:5050 \
      --work_dir=/var/lib/mesos \
      --image_providers=docker \
      --executor_environment_variables="{}" \
      --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia"

$ mesos-execute \
      --master=127.0.0.1:5050 \
      --name=gpu-test \
      --docker_image=nvidia/cuda \
      --command="nvidia-smi" \
      --framework_capabilities="GPU_RESOURCES" \
      --resources="gpus:1"

The flags of note here are:

  mesos-agent: 
      --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia" 

  mesos-execute: 
      --resources="gpus:1" 
      --framework_capabilities="GPU_RESOURCES" 

When launching an agent, both the cgroups/devices and the gpu/nvidia isolation flags are required for Nvidia GPU support in Mesos. Likewise, the docker/runtime and filesystem/linux flags are needed to enable running docker images with the unified containerizer.

The cgroups/devices flag tells the agent to restrict access to a specific set of devices when launching a task (i.e. a subset of the devices listed in /dev). The gpu/nvidia isolation flag allows the agent to grant / revoke access to GPUs on a per-task basis. It also handles automatic injection of the Nvidia libraries / volumes into the container if the label com.nvidia.volumes.needed = nvidia_driver is present in the docker image. The docker/runtime flag allows the agent to parse docker image files and containerize them. The filesystem/linux flag says to use linux specific functionality when creating / entering the new mount namespace for the container filesystem.

In addition to these agent isolation flags, Mesos requires frameworks that want to consume GPU resources to have the GPU_RESOURCES framework capability set. Without this, the master will not send an offer to a framework if it contains GPUs. The choice to make frameworks explicitly opt-in to this GPU_RESOURCES capability was to keep legacy frameworks from accidentally consuming a bunch of non-GPU resources on any GPU-capable machines in a cluster (and thus blocking your GPU jobs from running). It's not that big a deal if all of your nodes have GPUs, but in a mixed-node environment, it can be a big problem.

Finally, the --resources="gpus:1" flag tells the framework to only accept offers that contain at least 1 GPU. This is just an example of consuming a single GPU; you can (and probably should) build your framework to do something more interesting.

Hopefully you can extrapolate things from there. Let me know if you have any questions.
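
(Editor's note: for anyone wiring this up from Python, here is a minimal sketch of the GPU_RESOURCES opt-in described above, assuming the mesos.interface protobuf bindings; the framework name and the helper function are made up for illustration.)

from mesos.interface import mesos_pb2

framework = mesos_pb2.FrameworkInfo()
framework.user = ""                  # let Mesos fill in the current user
framework.name = "tf-gpu-framework"  # hypothetical name

# Without this capability the master will never offer GPUs to this framework.
capability = framework.capabilities.add()
capability.type = mesos_pb2.FrameworkInfo.Capability.GPU_RESOURCES


def gpus_in_offer(offer):
    """Count the GPUs contained in a resource offer before accepting it."""
    return sum(r.scalar.value for r in offer.resources if r.name == "gpus")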

@yuefengz
Contributor

I played a little bit with running TensorFlow on Mesos, without GPU support.

Regarding the protobuf 2 dependency issue, I removed the protobuf 2 dependency from Mesos by not setting the protobuf 2 path in PYTHONPATH when starting the executor. It will pick up the protobuf 3 dependency if installed.

To make it production-ready, we may have to handle a few failure cases: master failure, agent failure, executor failure, network partitions, message loss, service discovery, etc. So I am looking into other Mesos-based frameworks such as Marathon to see whether they handle all the failure cases for us.

@bhack
Contributor Author

bhack commented Aug 31, 2016

@YuefengZhou Marathon? In the DC/OS flavour?

@yuefengz
Contributor

yuefengz commented Aug 31, 2016

@bhack I have an experimental Mesos/Marathon cluster set up on my machine. I haven't looked into DC/OS yet. But I guess if it can work on Marathon smoothly with fault tolerance, it is not difficult to switch to a DC/OS setup.

@klueska

klueska commented Aug 31, 2016

GPU support will be included in Marathon 1.3 (being released in the next couple of weeks). It will only be supported for the unified containerizer, though. There are still some hurdles to getting it supported all the way through DC/OS, but we plan to have those issues resolved by the DC/OS 1.9 release (mid October).


@windreamer

windreamer commented Sep 1, 2016

@klueska for unified containerization support I opened an issue, douban/tfmesos#12; we can discuss the details further there.

As for proto3, if my understanding is right, protobuf is both a compiler and a runtime library. The proto3 compiler is not backward-compatible with proto2, but once the library is generated, the proto3 runtime is backward-compatible with proto2, at least I believe so.

So which compiler is used to build the Mesos library is managed by Mesos itself. Only driver developers (such as our own pymesos) may need to pay attention to this problem. For end users, I believe both proto2 and proto3 are fine.

This is why I don't think https://issues.apache.org/jira/browse/MESOS-5186 is a big issue that needs to wait until Mesos 2.0 to be resolved.

@bhack
Contributor Author

bhack commented Sep 2, 2016

/cc @vicki-c if interested in this topic.

@aselle aselle added stat:community support Status - Community Support enhancement and removed enhancement stat:community support Status - Community Support labels Sep 16, 2016
@bhack
Contributor Author

bhack commented Sep 17, 2016

Why this rapid position change by Google? It was self-assigned to Google just 17 days ago.

@bhack
Contributor Author

bhack commented Sep 22, 2016

Unified containerization support for Apache Mesos 1.0 is merged now: douban/tfmesos#12 (comment)

@aselle
Contributor

aselle commented Sep 22, 2016

I accidentally added these tags here and not in the issue I intended to. Sorry.

@jhseu
Contributor

jhseu commented Nov 3, 2016

We have examples running on Marathon in the new repo: github.com/tensorflow/ecosystem.

Closing this issue. If you have any changes you want to make, please create pull requests or issues in the new repo.

@jhseu jhseu closed this as completed Nov 3, 2016
@aselle aselle added type:feature Feature requests and removed enhancement labels Feb 9, 2017
@haosdent
Contributor

haosdent commented Feb 18, 2017

There's still a dependency conflict on protobuf between tensorflow and mesos.

Thanks a lot to @mckelvin. After we cleaned up the compatibility issues, http://issues.apache.org/jira/browse/MESOS-5186 was committed to master just now.

@klueska

klueska commented Oct 30, 2017

I know this thread is closed, but I wanted to point out the new release of distributed TensorFlow on DC/OS that we announced today. https://mesosphere.com/blog/tensorflow-gpu-support-deep-learning/
