Distributed GPU Training #218
BERT equivalent: google-research/bert#568
@LifeIsStrange Thanks for the links. I already know both of them, but as you know they only support BERT and GPT, not XLNet. For my use case I am interested in XLNet. Hopefully we will have a distributed GPU version soon.
Actually you can, just set,
But the current implementation uses an old distribution technique; you will find your RAM leaks very badly.
I created a multi-GPU pretraining session for XLNet using MirroredStrategy. Instructions on how to use it and the source code are linked. Just copy-paste this code after cloning this repository. Please remove CUDA_VISIBLE_DEVICES; I put it there to limit my GPU usage. Tested on 2 Tesla V100 32GB GPUs.
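For reference, a minimal sketch of what this kind of single-node multi-GPU setup looks like, assuming the TF 1.x Estimator API this repo is built on. `xlnet_model_fn`, `train_input_fn`, and the `model_dir` path are hypothetical placeholders, not names from the repository:

```python
import tensorflow as tf

# Mirror variables across the GPUs on this single node and all-reduce gradients.
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=2)

run_config = tf.estimator.RunConfig(
    model_dir="/tmp/xlnet_pretrain",  # hypothetical output directory
    train_distribute=strategy,
    save_checkpoints_steps=1000,
)

estimator = tf.estimator.Estimator(
    model_fn=xlnet_model_fn,  # hypothetical stand-in for the repo's pretraining model_fn
    config=run_config,
)

estimator.train(input_fn=train_input_fn, max_steps=100000)  # hypothetical input pipeline
```

Note that MirroredStrategy only replicates the model across GPUs on one machine; it does not distribute training across nodes, which is the distinction the next comments raise.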
@huseinzol05 This is multi-GPU training on a single node. I am asking about distributed GPU training across multiple nodes.
Actually you just add tf_config.
Both your code and the official code use MirroredStrategy, which works for single-node multi-GPU. To make it work across multiple nodes, MultiWorkerMirroredStrategy should be used. This is also stated in the blog post you posted here: tf_config works with MultiWorkerMirroredStrategy.
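A minimal sketch of that multi-node setup, assuming the Estimator API as above. Every node runs the same script with its own `TF_CONFIG`; the hostnames, ports, `xlnet_model_fn`, and input functions here are hypothetical:

```python
import json
import os
import tensorflow as tf

# TF_CONFIG describes the cluster and this node's role in it.
# This example is for the first worker (index 0); the second node
# would set "index": 1 and otherwise run the identical script.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host0.example.com:2222", "host1.example.com:2222"]},
    "task": {"type": "worker", "index": 0},
})

# Collective all-reduce across every GPU on every worker in the cluster.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

run_config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(
    model_fn=xlnet_model_fn,  # hypothetical model_fn
    config=run_config,
)

tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=100000),  # hypothetical
    tf.estimator.EvalSpec(input_fn=eval_input_fn),                      # hypothetical
)
```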
I believe you can change it after copy-pasting? lol
Thanks for the information, but I am looking for more advanced large-scale distributed training, using Horovod for example.
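For completeness, a minimal sketch of the Horovod approach with the TF 1.x graph API, launched with something like `horovodrun -np 8 -H host0:4,host1:4 python pretrain.py`. The loss-building function is a hypothetical placeholder, and the learning rate is illustrative:

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# Pin this process to a single GPU based on its local rank on the node.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

loss = build_xlnet_pretrain_loss()  # hypothetical: builds the XLNet pretraining loss tensor

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged across all ranks via allreduce.
opt = tf.train.AdamOptimizer(learning_rate=1e-5 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())

# Broadcast initial variables from rank 0 so every worker starts identically.
hooks = [
    hvd.BroadcastGlobalVariablesHook(0),
    tf.train.StopAtStepHook(last_step=100000),
]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

Unlike the strategy-based approaches above, Horovod uses one process per GPU and MPI-style launching, which tends to scale more predictably across many nodes.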
Hello,
Any plans to have a script for training XLNet on distributed GPUs?
Maybe with Horovod or MultiWorkerMirroredStrategy?