
Distributed Training #243

Closed
jorgemf opened this issue Dec 6, 2017 · 8 comments

@jorgemf (Contributor) commented Dec 6, 2017

Currently there is no simple way to support distributed training in TensorLayer. I am trying to encapsulate how Gcloud and TensorPort configure the distributed nodes in order to perform distributed training, but I am not sure how this should be organized within the project. I am developing the functionality in a new file: https://github.com/jorgemf/tensorlayer/blob/feat/distributed_session/tensorlayer/distributed.py
Please let me know how this is best organized and how the documentation should be written. I am planning to add an example and tests, but I haven't found any tests in the project.
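For context, cloud platforms such as Gcloud describe the cluster layout to each node through the TF_CONFIG environment variable, which is the kind of configuration this encapsulation has to parse. A minimal sketch of reading it (the helper name `parse_tf_config` is hypothetical, not the API in distributed.py):

```python
import json
import os

def parse_tf_config(env=os.environ):
    """Parse the TF_CONFIG environment variable that cloud schedulers set
    to describe a distributed TensorFlow cluster.
    Returns (cluster, job_name, task_index)."""
    config = json.loads(env.get("TF_CONFIG", "{}"))
    cluster = config.get("cluster", {})  # e.g. {"ps": [...], "worker": [...]}
    task = config.get("task", {})
    return cluster, task.get("type"), task.get("index", 0)

# Example TF_CONFIG as a scheduler might set it for the second worker:
example = {
    "cluster": {"ps": ["ps0:2222"], "worker": ["w0:2222", "w1:2222"]},
    "task": {"type": "worker", "index": 1},
}
cluster, job, index = parse_tf_config({"TF_CONFIG": json.dumps(example)})
print(job, index, len(cluster["worker"]))  # → worker 1 2
```

Each node runs the same script; only the contents of TF_CONFIG differ, which is what lets the training code stay identical across nodes.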

@zsdonghao (Member)
Hi @jorgemf , your way of organizing the distributed API is great. I also suggest importing distributed.py in __init__.py, so users can call the API easily.

Since you are only adding the distributed API, which does not affect the original APIs, for testing you can simply run tutorial_mnist_simple.py. To help others understand how to use your APIs, you can put a distributed example here.

For the online documentation, you can create a distributed.rst here and add the index here (I can help with this part if you wish); then we can have a new tag in the online documentation.

Thank you for your contribution.

@jorgemf (Contributor, Author) commented Dec 12, 2017

@zsdonghao I have made a PR, please take a look and let me know what should be changed: #245

@zsdonghao (Member)
@jorgemf thanks, I corrected the format errors and synced them to the Chinese docs.

@wagamamaz (Collaborator)
Hi all, I wonder whether TensorLayer supports training RNN models in a distributed way?

@jorgemf (Contributor, Author) commented Dec 16, 2017

Hi @wagamamaz, I have just added some methods for distributed training to master. It doesn't matter what model you want to train, because the only change is in the session you use for it. Take a look at the example: https://github.com/zsdonghao/tensorlayer/blob/master/example/tutorial_mnist_distributed.py and at the online documentation: http://tensorlayer.readthedocs.io/en/latest/modules/distributed.html

Let me know if you have any issues with it and I will fix them.
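The point that only the session wiring changes can be illustrated with a small role-selection sketch. In TensorFlow's between-graph replication, one worker (conventionally index 0, the "chief") initializes variables and writes checkpoints, while every other node runs the same training loop; the helper below is a hypothetical illustration, not the TensorLayer API:

```python
def is_chief(job_name, task_index):
    """Return True for the chief worker, which handles variable
    initialization and checkpointing; all other nodes (extra workers,
    parameter servers) run or serve the same graph without those duties."""
    return job_name == "worker" and task_index == 0

# The model-building and training code is identical on every node;
# only this role decision (and the session it configures) differs.
print(is_chief("worker", 0))  # → True
print(is_chief("worker", 1))  # → False
print(is_chief("ps", 0))      # → False
```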

@luomai (Member) commented Jan 23, 2018

@jorgemf In the distributed_inception_v3 example, you use the Kaggle or ImageNet website to download the training data. Is that data in the same format and structure as the data downloaded by the tf-slim script: https://github.com/tensorflow/models/blob/master/research/slim/datasets/download_and_convert_imagenet.sh

@jorgemf (Contributor, Author) commented Jan 23, 2018

@luomai No, I don't convert ImageNet to TFRecords; I create a text file where each line has the path to an image and its labels.
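A minimal sketch of reading such a listing file, assuming (from the description above, not from the actual example code) one image path followed by its integer labels per line:

```python
def read_listing(lines):
    """Parse lines of the form '<image_path> <label> [<label> ...]'
    into (path, labels) pairs, skipping blank lines."""
    dataset = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        dataset.append((parts[0], [int(label) for label in parts[1:]]))
    return dataset

# Two entries plus a blank line that gets skipped:
sample = [
    "imagenet/n01440764/img1.jpg 0",
    "imagenet/n01443537/img2.jpg 1 5",
    "",
]
pairs = read_listing(sample)
print(len(pairs))  # → 2
```

Unlike the TFRecord route, this keeps the images on disk as-is and only the index file needs to be regenerated when labels change.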

@zsdonghao (Member)

https://tensorlayer.readthedocs.io/en/stable/modules/distributed.html

We now have a simple way to support distributed training.

Feel free to reopen this issue if you want to discuss.
