
Distributed Training #243

Closed
jorgemf opened this issue Dec 6, 2017 · 8 comments

@jorgemf (Contributor) commented Dec 6, 2017

Currently there is no simple way to support distributed training in TensorLayer. I am trying to encapsulate how Gcloud and TensorPort configure the distributed nodes in order to perform distributed training, but I am not sure how this should be organized within the project. I am developing the functionality in a new file: https://github.com/jorgemf/tensorlayer/blob/feat/distributed_session/tensorlayer/distributed.py
Please let me know how this is best organized and how the documentation should be written. I am planning to add an example and tests, but I haven't found any tests in the project.
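For context, cloud platforms such as Gcloud describe the cluster layout to each node through the TF_CONFIG environment variable, which is the kind of configuration this encapsulation has to parse. A minimal sketch of reading it (the helper name `parse_tf_config` is hypothetical, not the API in distributed.py):

```python
import json
import os

def parse_tf_config(env=os.environ):
    """Parse the TF_CONFIG environment variable that cloud schedulers set
    to describe a distributed TensorFlow cluster.
    Returns (cluster, job_name, task_index)."""
    config = json.loads(env.get("TF_CONFIG", "{}"))
    cluster = config.get("cluster", {})  # e.g. {"ps": [...], "worker": [...]}
    task = config.get("task", {})
    return cluster, task.get("type"), task.get("index", 0)

# Example TF_CONFIG as a scheduler might set it for the second worker:
example = {
    "cluster": {"ps": ["ps0:2222"], "worker": ["w0:2222", "w1:2222"]},
    "task": {"type": "worker", "index": 1},
}
cluster, job, index = parse_tf_config({"TF_CONFIG": json.dumps(example)})
print(job, index, len(cluster["worker"]))  # → worker 1 2
```

Each node runs the same script; only the contents of TF_CONFIG differ, which is what lets the training code stay identical across nodes.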

@zsdonghao (Member)
Hi @jorgemf , your way of organizing the distributed API is great. I also suggest importing distributed.py in __init__.py, so users can call the API easily.

Since you are only adding the distributed API, which does not affect the original APIs, for testing you can simply run tutorial_mnist_simple.py. To help others understand how to use your APIs, you can put a distributed example here.

For the online documentation, you can create a distributed.rst here and add the index here (I can help with this part if you wish); then we can have a new tag in the online documentation.

Thank you for your contribution.

@jorgemf (Contributor, Author) commented Dec 12, 2017

@zsdonghao I have made a PR, please take a look and let me know what should be changed: #245

@zsdonghao (Member)
@jorgemf thanks, I corrected the format errors and synced them to the Chinese docs.

@wagamamaz (Collaborator)
Hi all, I wonder whether TensorLayer supports training RNN models in a distributed way?

@jorgemf (Contributor, Author) commented Dec 16, 2017

Hi @wagamamaz, I have just added some methods for distributed training to master. It doesn't matter what model you want to train, because the only change is in the session you use for it. Take a look at the example: https://github.com/zsdonghao/tensorlayer/blob/master/example/tutorial_mnist_distributed.py and at the online documentation: http://tensorlayer.readthedocs.io/en/latest/modules/distributed.html

Let me know if you have any issues with it and I will fix them.
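The point that only the session wiring changes can be illustrated with a small role-selection sketch. In TensorFlow's between-graph replication, one worker (conventionally index 0, the "chief") initializes variables and writes checkpoints, while every other node runs the same training loop; the helper below is a hypothetical illustration, not the TensorLayer API:

```python
def is_chief(job_name, task_index):
    """Return True for the chief worker, which handles variable
    initialization and checkpointing; all other nodes (extra workers,
    parameter servers) run or serve the same graph without those duties."""
    return job_name == "worker" and task_index == 0

# The model-building and training code is identical on every node;
# only this role decision (and the session it configures) differs.
print(is_chief("worker", 0))  # → True
print(is_chief("worker", 1))  # → False
print(is_chief("ps", 0))      # → False
```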

@luomai (Member) commented Jan 23, 2018

@jorgemf In the distributed_inception_v3 example, you use the Kaggle or ImageNet website to download the training data. Is that data in the same format and structure as the data downloaded by the tf-slim script: https://github.com/tensorflow/models/blob/master/research/slim/datasets/download_and_convert_imagenet.sh

@jorgemf (Contributor, Author) commented Jan 23, 2018

@luomai No, I don't convert ImageNet to TFRecords; I create a text file where each line has the path to an image and its labels.
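A minimal sketch of reading such a listing file, assuming (from the description above, not from the actual example code) one image path followed by its integer labels per line:

```python
def read_listing(lines):
    """Parse lines of the form '<image_path> <label> [<label> ...]'
    into (path, labels) pairs, skipping blank lines."""
    dataset = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        dataset.append((parts[0], [int(label) for label in parts[1:]]))
    return dataset

# Two entries plus a blank line that gets skipped:
sample = [
    "imagenet/n01440764/img1.jpg 0",
    "imagenet/n01443537/img2.jpg 1 5",
    "",
]
pairs = read_listing(sample)
print(len(pairs))  # → 2
```

Unlike the TFRecord route, this keeps the images on disk as-is and only the index file needs to be regenerated when labels change.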

@zsdonghao (Member)

https://tensorlayer.readthedocs.io/en/stable/modules/distributed.html

We now have a simple way to support distributed training.

Feel free to reopen this issue if you want to discuss.
