Distributed tensorflow on Mesos #1996

Closed
bhack opened this issue Apr 17, 2016 · 52 comments
Labels
type:feature Feature requests

Comments

@bhack
Contributor

bhack commented Apr 17, 2016

In the distributed howto: "We are working on tools for launching tasks programmatically, e.g. using a cluster manager like Kubernetes. If there are particular cluster managers for which you'd like to see support, please raise a GitHub issue."
It could be interesting to have CPU and GPU Dockerfiles ready for distributed TensorFlow that can run in a scalable way on Mesos (with Marathon and Kubernetes).

@bhack
Contributor Author

bhack commented Apr 17, 2016

/cc @mesosphere

@vrv vrv added enhancement stat:contribution welcome Status - Contributions welcome labels Apr 18, 2016
@mckelvin

We (@douban) are developing a lightweight Mesos framework (named tfmesos) to run TensorFlow (0.8+) in Docker on Mesos. It already works but may not be production ready yet. If anybody is interested, we would like to open-source the experimental code. If a better distributed implementation appears in the future, we would be happy to mark tfmesos as deprecated.

There's still a protobuf dependency conflict between TensorFlow and Mesos. We made a dirty patch to work around it, and the Mesos crew is working on it right now: https://issues.apache.org/jira/browse/MESOS-5186

@thinxer
Contributor

thinxer commented Apr 18, 2016

@mckelvin I'm interested in the implementation. Is it possible to share with the rest of us the possibilities that tfmesos opens up? For example, how does it simplify the process of running distributed TensorFlow?

@bhack
Contributor Author

bhack commented Apr 18, 2016

@mckelvin Yes. This issue was tagged "contributions welcome", so if you have something to share it could probably be turned into a pull request soon.

@bhack
Contributor Author

bhack commented Apr 18, 2016

@mrry What do you think? Do you have any feedback on how to handle this so that it would be easier to integrate into the TF repository with a PR?

@mrry
Contributor

mrry commented Apr 18, 2016

I'm agnostic as to whether this would be better as a standalone repository, or integrated into somewhere like tf.contrib. One of the concerns is that there might be version skew between TensorFlow HEAD and an external repository: although we're trying our best to keep the API stable, the distributed runtime libraries are pretty new, and we might want to change them between releases, so having it local might be better. On the other hand, I'm not sure how easy it would be to add Mesos to our test matrix, and I don't want to put our testing team on the hook for that.

Hopefully the integration can be relatively simple, and exist as a set of Python scripts somewhere (though I don't have enough experience with Mesos to say). There might be some changes required in the core, so I'll be watching this thread, and prepared to respond to feature requests.

@bhack
Contributor Author

bhack commented Apr 18, 2016

@mrry Some initial work is at https://github.com/douban/tfmesos

@mckelvin

@bhack You got it! I've pushed the initial code to https://github.com/douban/tfmesos as well as https://hub.docker.com/r/tfmesos/tfmesos/ (@windreamer is the main developer of tfmesos).

If you have Mesos + Docker installed, you can run the demo now. Before you start, you should pull the tfmesos Docker image on the Mesos server and slaves via: docker pull tfmesos/tfmesos .

Notice: there are still some unsolved issues in tfmesos and it is not production ready. For example, the tfmesos container runs as root, which is dangerous. We're still working on it. Feel free to open a PR or issue if you have any ideas or suggestions.

# coding: utf-8
# ~/demo.py

import sys
import tensorflow as tf
from tfmesos import cluster


def main(argv):
    # Ask tfmesos for 2 parameter servers and 2 workers.
    jobs_def = [
        {
            "name": "ps",
            "num": 2
        },
        {
            "name": "worker",
            "num": 2
        },
    ]
    mesos_master = argv[1]
    with cluster(jobs_def, master=mesos_master, quiet=False) as targets:
        # Place one constant on each parameter server.
        with tf.device('/job:ps/task:0'):
            a = tf.constant(10)

        with tf.device('/job:ps/task:1'):
            b = tf.constant(32)

        # Compute the sum on the second worker.
        with tf.device("/job:worker/task:1"):
            op = a + b

        # Connect the session to the first worker's gRPC target.
        with tf.Session(targets['/job:worker/task:0']) as sess:
            print(sess.run(op))


if __name__ == '__main__':
    main(sys.argv)
mckelvin@mesos1 ~ $ cat ./run.sh
#!/bin/sh
docker run \
    -e MESOS_MASTER=mesos1 \
    -e DOCKER_IMAGE=tfmesos/tfmesos:latest \
    --net=host \
    -v /home/mckelvin/demo.py:/tmp/demo.py \
    --rm \
    -it \
    tfmesos/tfmesos:latest \
    python /tmp/demo.py mesos1
mckelvin@mesos1 ~ $ ./run.sh
2016-04-18 09:59:22,804 [INFO] [tfmesos.scheduler] Tensorflow cluster registered. ( http://mesos1:5050/#/frameworks/8beedc27-4bea-4f33-85b9-b440697419bd-0293 )
2016-04-18 09:59:26,142 [INFO] [tfmesos.scheduler] Device /job:ps/task:0 activated @ grpc://mesos1:52382
2016-04-18 09:59:26,150 [INFO] [tfmesos.scheduler] Device /job:ps/task:1 activated @ grpc://mesos1:44664
2016-04-18 09:59:26,158 [INFO] [tfmesos.scheduler] Device /job:worker/task:0 activated @ grpc://mesos1:32984
2016-04-18 09:59:26,166 [INFO] [tfmesos.scheduler] Device /job:worker/task:1 activated @ grpc://mesos1:32032
42
2016-04-18 09:59:26,190 [DEBUG] [tfmesos.scheduler] exit

@bhack
Contributor Author

bhack commented Apr 18, 2016

/cc @mtamburrano

@bhack
Contributor Author

bhack commented Apr 18, 2016

@mckelvin I think the Docker image could be based on https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/docker/README.md.
Especially if we want to have GPU support.

@windreamer

The main problem is that Mesos by default does not handle GPUs as resources, so tfmesos currently focuses on building a CPU-based distributed cluster. I have no idea how to share GPU resources among tasks. @bhack, do you use GPUs on your Mesos cluster?

@bhack
Contributor Author

bhack commented Apr 19, 2016

@windreamer See also NVIDIA/nvidia-docker#60

@bhack
Contributor Author

bhack commented Apr 23, 2016

Related news: DC/OS is open source now.

@bhack
Contributor Author

bhack commented May 5, 2016

We are trying to test TensorFlow on Mesos with GPUs. If anybody is interested, see douban/tfmesos#3. We also need to think about smarter automatic device placement in cluster scenarios like Mesos. See also #2126

@jvanz

jvanz commented May 26, 2016

Hi, I'm a new Mesos contributor and I have a little experience developing Mesos frameworks. While studying the possibility of running TensorFlow as a native framework, I found this thread. ^.^

I saw that you're talking about GPU resources in Mesos. Take a look:

https://issues.apache.org/jira/browse/MESOS-4424

@windreamer

@jvanz we are trying nvidia-docker to help us allocate GPU resources, based on @3XX0's suggestion.

experimental work is here:
douban/tfmesos#3

But before all of that, I think this JIRA should be fixed first:
https://issues.apache.org/jira/browse/MESOS-5186

@bhack
Contributor Author

bhack commented May 27, 2016

@windreamer for issue 5186 I suggest you open a PR directly at https://github.com/apache/mesos/pulls

@bhack
Contributor Author

bhack commented Jun 15, 2016

@windreamer @girving How do you think this could initially be contributed to TF? I think that if you create a PR we can attract a broader user base.

@windreamer

Sure, it is an honour to contribute this small code base to TF. However, I think TF prefers k8s over Mesos or YARN.

@bhack
Contributor Author

bhack commented Jun 15, 2016

K8s is in-house, but I don't think that @mrry is against a Mesos contribution.

@windreamer

By the way, tfmesos depends on pymesos as a driver (or you can use the Mesos native Python driver, but it is a lot heavier). I do not know whether TF cares about external dependencies.

@mrry
Contributor

mrry commented Jun 15, 2016

@bhack @windreamer: We'd be delighted to see TensorFlow working on Mesos (and YARN, and other cluster managers). I don't know anything about how tfmesos is structured, but if it can exist as external code, that would be best. It shouldn't be necessary to add a dependency from the core tensorflow module to pymesos or any other cluster manager, unless I'm missing something. We could investigate putting things in tensorflow.contrib if the dependency were optional.

@windreamer

OK, we can run distributed TensorFlow training using TFMesos with or without GPUs now (thanks to the nvidia-docker project). We also provide a script, tfrun, to submit a between-graph replicated training script to a Mesos cluster (the distributed Inception model uses this mode).

Although TFMesos is still experimental, we believe this is a good start toward better integrating TensorFlow into the existing big-data ecosystem and supporting bigger and deeper models in the future.
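
(Editor's note: "between-graph replication" here means each worker process builds its own copy of the graph, with variables pinned to the ps tasks, and tfrun presumably launches one such process per Mesos task. Below is a minimal sketch of that pattern using the TensorFlow distributed API of this era; the endpoints, job name, and task index are placeholders that tfmesos/tfrun would normally supply.)

import tensorflow as tf

# Hypothetical endpoints -- in a real run these come from the cluster manager.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222"],
    "worker": ["worker0:2222", "worker1:2222"],
})

# Each replica is started with its own job name and task index.
job_name, task_index = "worker", 0
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # parameter servers just host variables
else:
    # Variables land on the ps tasks, compute ops stay on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index,
            cluster=cluster)):
        step = tf.Variable(0, name="global_step")
        train_op = tf.assign_add(step, 1)

    with tf.Session(server.target) as sess:
        sess.run(tf.initialize_all_variables())
        print(sess.run(train_op))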

@bhack
Contributor Author

bhack commented Aug 31, 2016

As we have already discussed, we need to make a decision on the new http://mesos.apache.org/documentation/latest/container-image/

@windreamer

@bhack yes, Apache Mesos 1.0 introduces a lot of new features, and I need more time to decide which is the best way.

@windreamer

and unfortunately https://issues.apache.org/jira/browse/MESOS-5186 is still unresolved...

@klueska

klueska commented Aug 31, 2016

For what it's worth, the only reliable / supported path for GPUs in Mesos going forward will be via the unified containerizer, i.e. what is discussed in the link sent by bhack before: http://mesos.apache.org/documentation/latest/container-image/

GPU support for the docker containerizer is currently under development, but using it will not be the recommended mode of operation.

Regarding MESOS-5186, what features of proto3 do you require? I know we aren't planning on bumping Mesos to protobuf 3.0 anytime soon (though there are long-term plans to do so). Barring this change, what else could be done to unblock this?


@windreamer

windreamer commented Aug 31, 2016

@klueska proto3 support is vital, or we will end up with a version conflict between mesos.interface and TensorFlow. Frankly speaking, I cannot understand why this "one-line modification" has been blocking for months... Existing users of Mesos can keep on using proto2 as usual.

I would like to try the Mesos containerization way of launching the image, but it is still a bit of a mess for me to figure out how to enable this together with GPU support. I need more time, and any suggestions and contributions are definitely welcome!

@bhack
Contributor Author

bhack commented Aug 31, 2016

Also is there a DC/OS plan?

@klueska

klueska commented Aug 31, 2016

@windreamer I think the reason the JIRA was never resolved is probably that it's not clear the proposed fix is the right one. It may happen to work for your particular use case, but many of Mesos's protobufs aren't written in a way that is compatible with proto3 clients. For example, many of Mesos's protobufs still contain required fields. We don't want people to blindly do a pip install protobuf (which installs 3.0 by default) and then start writing clients that will break in subtle ways when interacting with proto2 data coming over the wire. If you know of a general workaround for this, I'm sure it would gladly be accepted.

Regarding problems figuring out how to enable GPU support -- I can help with that. We basically mimic the functionality of nvidia-docker so that anything that runs in nvidia-docker should now be able to run in mesos as well. Consider the following example:

$ mesos-master \
      --ip=127.0.0.1 \
      --work_dir=/var/lib/mesos

$ mesos-agent \
      --master=127.0.0.1:5050 \
      --work_dir=/var/lib/mesos \
      --image_providers=docker \
      --executor_environment_variables="{}" \
      --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia"

$ mesos-execute \
      --master=127.0.0.1:5050 \
      --name=gpu-test \
      --docker_image=nvidia/cuda \
      --command="nvidia-smi" \
      --framework_capabilities="GPU_RESOURCES" \
      --resources="gpus:1"

The flags of note here are:

  mesos-agent: 
      --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia" 

  mesos-execute: 
      --resources="gpus:1" 
      --framework_capabilities="GPU_RESOURCES" 

When launching an agent, both the cgroups/devices and the gpu/nvidia isolation flags are required for Nvidia GPU support in Mesos. Likewise, the docker/runtime and filesystem/linux flags are needed to enable running docker images with the unified containerizer.

The cgroups/devices flag tells the agent to restrict access to a specific set of devices when launching a task (i.e. a subset of the devices listed in /dev). The gpu/nvidia isolation flag allows the agent to grant / revoke access to GPUs on a per-task basis. It also handles automatic injection of the Nvidia libraries / volumes into the container if the label com.nvidia.volumes.needed = nvidia_driver is present in the docker image. The docker/runtime flag allows the agent to parse docker image files and containerize them. The filesystem/linux flag says to use linux specific functionality when creating / entering the new mount namespace for the container filesystem.

In addition to these agent isolation flags, Mesos requires frameworks that want to consume GPU resources to have the GPU_RESOURCES framework capability set. Without this, the master will not send an offer to a framework if it contains GPUs. The choice to make frameworks explicitly opt-in to this GPU_RESOURCES capability was to keep legacy frameworks from accidentally consuming a bunch of non-GPU resources on any GPU-capable machines in a cluster (and thus blocking your GPU jobs from running). It's not that big a deal if all of your nodes have GPUs, but in a mixed-node environment, it can be a big problem.

Finally, the --resources="gpus:1" flag tells the framework to only accept offers that contain at least 1 GPU. This is just an example of consuming a single GPU; you can (and probably should) build your framework to do something more interesting.

Hopefully you can extrapolate things from there. Let me know if you have any questions.
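
(Editor's note: for anyone wiring this up from Python, here is a minimal sketch of the GPU_RESOURCES opt-in described above, assuming the mesos.interface protobuf bindings; the framework name and the helper function are made up for illustration.)

from mesos.interface import mesos_pb2

framework = mesos_pb2.FrameworkInfo()
framework.user = ""                  # let Mesos fill in the current user
framework.name = "tf-gpu-framework"  # hypothetical name

# Without this capability the master will never offer GPUs to this framework.
capability = framework.capabilities.add()
capability.type = mesos_pb2.FrameworkInfo.Capability.GPU_RESOURCES


def gpus_in_offer(offer):
    """Count the GPUs contained in a resource offer before accepting it."""
    return sum(r.scalar.value for r in offer.resources if r.name == "gpus")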

@yuefengz
Contributor

I played a little bit with running TensorFlow on Mesos, without GPU support.

Regarding the protobuf 2 dependency issue, I removed the protobuf 2 dependency from Mesos by not setting the protobuf 2 path in PYTHONPATH when starting the executor. It will pick up the protobuf 3 dependency if installed.

To make it production-ready, we may have to handle a few failure cases: master failure, agent failure, executor failure, network partitions, message loss, service discovery, etc. So I am looking into other Mesos-based frameworks such as Marathon to see whether they handle all the failure cases for us.

@bhack
Contributor Author

bhack commented Aug 31, 2016

@YuefengZhou Marathon? In the DC/OS flavour?

@yuefengz
Contributor

yuefengz commented Aug 31, 2016

@bhack I have an experimental Mesos/Marathon cluster set up on my machine. I haven't looked into DC/OS yet. But I guess if it can work on Marathon smoothly with fault tolerance, it is not difficult to switch to a DC/OS setup.

@klueska

klueska commented Aug 31, 2016

GPU support will be included in Marathon 1.3 (being released in the next couple of weeks). It will only be supported for the unified containerizer, though. There are still some hurdles to getting it supported all the way through DC/OS, but we plan to have those issues resolved by the DC/OS 1.9 release (mid October).


@windreamer

windreamer commented Sep 1, 2016

@klueska for unified containerization support I opened an issue, douban/tfmesos#12; we can discuss the details further there.

As for proto3, if my understanding is right, protobuf is both a compiler and a runtime library. The proto3 compiler is not backward-compatible with proto2, but once the library is generated, the proto3 runtime is backward-compatible with proto2, at least I believe so.

So which compiler is used to build the Mesos library is managed by Mesos itself. Only driver developers (such as our own pymesos) may need to pay attention to this problem. For end users, I believe both proto2 and proto3 are fine.

This is why I don't think https://issues.apache.org/jira/browse/MESOS-5186 is a big issue that needs to wait until Mesos 2.0 to be resolved.

@bhack
Contributor Author

bhack commented Sep 2, 2016

/cc @vicki-c if interested in this topic.

@aselle aselle added stat:community support Status - Community Support enhancement and removed enhancement stat:community support Status - Community Support labels Sep 16, 2016
@bhack
Contributor Author

bhack commented Sep 17, 2016

Why this rapid position change by Google? It was self-assigned to Google just 17 days ago.

@bhack
Contributor Author

bhack commented Sep 22, 2016

Unified containerization support for Apache Mesos 1.0 is merged now: douban/tfmesos#12 (comment)

@aselle
Contributor

aselle commented Sep 22, 2016

I accidentally added these tags here and not in the issue I intended to. Sorry.

@jhseu
Contributor

jhseu commented Nov 3, 2016

We have examples running on Marathon in the new repo: github.com/tensorflow/ecosystem.

Closing this issue. If you have any changes you want to make, please create pull requests or issues in the new repo.

@jhseu jhseu closed this as completed Nov 3, 2016
@aselle aselle added type:feature Feature requests and removed enhancement labels Feb 9, 2017
@haosdent
Contributor

haosdent commented Feb 18, 2017

There's still a dependency conflict on protobuf between tensorflow and mesos.

Thanks a lot to @mckelvin. After we cleaned up the compatibility issues, http://issues.apache.org/jira/browse/MESOS-5186 was committed to master just now.

@klueska

klueska commented Oct 30, 2017

I know this thread is closed, but I wanted to point out the new release of distributed TensorFlow on DC/OS that we announced today. https://mesosphere.com/blog/tensorflow-gpu-support-deep-learning/
