Distributed Training #243
Comments
Hi @jorgemf, your way of organizing the distributed API is great. I also suggest you to import … As you only add the distributed API, it would not affect the original APIs, so … For the online documentation, you can create a … Thank you for your contribution.
@zsdonghao I have made a PR, please take a look and let me know what should be changed: #245
@jorgemf Thanks, I corrected the format error and synced it to the Chinese docs.
Hi all, I wonder whether TensorLayer supports training RNN models in a distributed way?
Hi @wagamamaz, I have just added some methods for distributed training in master. It doesn't matter what model you want to train, because the only change is in the session you use for it. Take a look at the example: https://github.com/zsdonghao/tensorlayer/blob/master/example/tutorial_mnist_distributed.py and also at the online documentation: http://tensorlayer.readthedocs.io/en/latest/modules/distributed.html Let me know if you have any issues with it and I will fix them.
@jorgemf In the distributed_inception_v3 example, you use the Kaggle or ImageNet website to download the training data. Is this data in the same format and structure as the data downloaded by the tf-slim script: https://github.com/tensorflow/models/blob/master/research/slim/datasets/download_and_convert_imagenet.sh
@luomai No, I don't convert ImageNet to TFRecords; I create a text file where each line contains the path to an image and its labels.
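The line-based list format described above can be read with a few lines of plain Python. The sketch below is illustrative only: the tab-separated layout and the file name `train_list.txt` are assumptions, not the exact format used in the example.

```python
# Sketch of reading an image-list file where each line holds an image path
# followed by its label(s). The tab separator and the file name are
# illustrative assumptions, not the format actually used by the example.
import os
import tempfile

sample = (
    "images/n01440764/img_001.jpg\t0\n"
    "images/n01443537/img_002.jpg\t1\n"
)

def load_image_list(path):
    """Return a list of (image_path, labels) tuples parsed from the file."""
    entries = []
    with open(path) as f:
        for line in f:
            parts = line.strip().split("\t")
            if not parts or not parts[0]:
                continue  # skip blank lines
            entries.append((parts[0], [int(x) for x in parts[1:]]))
    return entries

with tempfile.TemporaryDirectory() as d:
    list_file = os.path.join(d, "train_list.txt")
    with open(list_file, "w") as f:
        f.write(sample)
    entries = load_image_list(list_file)

print(entries[0])  # → ('images/n01440764/img_001.jpg', [0])
```

One advantage of such a plain list over TFRecords is that the dataset can be inspected and filtered with ordinary text tools, at the cost of reading many small image files at training time.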
https://tensorlayer.readthedocs.io/en/stable/modules/distributed.html We now have a simple way to support distributed training. Feel free to reopen this issue if you want to discuss.
Currently, there is no simple way to support distributed training in TensorLayer. I am trying to encapsulate how Gcloud and TensorPort configure the distributed nodes in order to perform distributed training. I am not sure how this can be organized into the project. I am developing the functionality in a new file: https://github.com/jorgemf/tensorlayer/blob/feat/distributed_session/tensorlayer/distributed.py
Please let me know how this would be better organized and how to write the documentation. I am planning to add an example and also tests, but I haven't found any tests in the project.
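For context on what such an encapsulation deals with: Google Cloud ML describes the cluster to each node through the `TF_CONFIG` environment variable, a JSON object with `"cluster"` and `"task"` keys. A minimal sketch of parsing it (the host names and task values below are made up for illustration):

```python
# Minimal sketch of parsing the TF_CONFIG environment variable that Google
# Cloud ML sets on each node. The "cluster"/"task" keys follow the Cloud ML
# convention; the sample hosts and indices are illustrative assumptions.
import json
import os

# Simulate what the platform would set for worker 1 of a 1-ps/2-worker job.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "ps": ["ps0:2222"],
        "worker": ["worker0:2222", "worker1:2222"],
    },
    "task": {"type": "worker", "index": 1},
})

def parse_tf_config():
    """Return (cluster_dict, job_name, task_index) from TF_CONFIG, if set."""
    config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return cluster, task.get("type"), task.get("index")

cluster, job_name, task_index = parse_tf_config()
# By convention, worker 0 acts as the chief that initializes variables.
is_chief = job_name == "worker" and task_index == 0
print(job_name, task_index, is_chief)  # → worker 1 False
```

A helper like this is the kind of glue the proposed `distributed.py` would hide, so that user code only has to create the right session instead of parsing the environment on every platform.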