
Distributed GPU Training #218

Open · agemagician opened this issue Aug 19, 2019 · 10 comments

@agemagician commented Aug 19, 2019

Hello,

Are there any plans for a script to train XLNet on distributed GPUs?

Perhaps with Horovod or MultiWorkerMirroredStrategy?

@LifeIsStrange

BERT equivalent: google-research/bert#568

@LifeIsStrange

@agemagician (Author)

@LifeIsStrange Thanks for the links.

I already know both of them, but as you know, they only support BERT and GPT, not XLNet.

For my use case, I am interested in XLNet. Hopefully, we will have a distributed GPU version soon.

@huseinzol05

Actually you can; just set:

--num_core_per_host=3 --train_batch_size=30
# 3 GPUs; the batch of 30 is automatically divided among them (10 per GPU)

But the current implementation uses an old distribution technique, so you will find that RAM leaks very badly.

@huseinzol05 commented Aug 25, 2019

I created a multi-GPU pretraining session for XLNet using MirroredStrategy.

Instructions on how to use it and the source code are linked. Just copy-paste this code after cloning this repository.

Please remove CUDA_VISIBLE_DEVICES; I put it there to limit my GPU usage.

Tested on 2 Tesla V100 GPUs with 32 GB VRAM.
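
For readers following along, here is a minimal sketch of single-node multi-GPU training with tf.distribute.MirroredStrategy; the toy Keras model and random data are placeholders, not the actual XLNet pretraining graph:

```python
# Minimal MirroredStrategy sketch (TensorFlow 1.14+/2.x).
# The model and data below are placeholders, not XLNet.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored on every GPU.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# The global batch (here 30) is split evenly across the replicas,
# matching the --train_batch_size behaviour described earlier.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]), tf.random.normal([1024, 1]))
).batch(30)

model.fit(dataset, epochs=1)
```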

@agemagician (Author)

@huseinzol05 This is multi-GPU training on a single node. I am asking about distributed GPU training across multiple nodes.

@huseinzol05

Actually, you just add a TF_CONFIG like this: https://lambdalabs.com/blog/tensorflow-2-0-tutorial-05-distributed-training-multi-node/
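
Concretely, the TF_CONFIG from that post looks roughly like this; the hostnames and port are placeholders, and each node must set its own task index:

```python
# Sketch of the TF_CONFIG environment variable for a two-node cluster.
# Hostnames and port are placeholders.
import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        # Every node lists the same cluster spec.
        "worker": ["node0.example.com:12345", "node1.example.com:12345"],
    },
    # Each node sets its own index: 0 on the first machine, 1 on the second.
    "task": {"type": "worker", "index": 0},
})
```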

@agemagician (Author)

Both your code and the official code use "MirroredStrategy", which works for single-node multi-GPU training. To make it work across multiple nodes, "MultiWorkerMirroredStrategy" should be used.

This is also stated in the blog post you linked: "tf_config" works with "MultiWorkerMirroredStrategy".
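
A minimal sketch of the multi-node variant, assuming TF_CONFIG is set as above on every node; again, the model is a placeholder rather than the XLNet graph:

```python
# MultiWorkerMirroredStrategy sketch (TF 1.14+/2.0-era API).
# Run the same script on every node after exporting TF_CONFIG.
import tensorflow as tf

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]), tf.random.normal([1024, 1]))
).batch(30)

# Gradients are all-reduced across all workers on every step.
model.fit(dataset, epochs=1)
```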

@huseinzol05

I believe you can change it yourself after copy-pasting? lol

@agemagician (Author)

Thanks for the information, but I am looking for more advanced large-scale distributed training, using Horovod for example.
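
For completeness, a minimal sketch of what a Horovod-based setup looks like with TensorFlow/Keras; the model is a placeholder, and an actual XLNet run would need the same init, optimizer-wrapping, and broadcast steps applied to its training loop:

```python
# Horovod sketch. Launch with, e.g.:
#   horovodrun -np 8 -H node0:4,node1:4 python train.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each process to a single GPU.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])

# Wrap the optimizer so gradients are averaged across all workers;
# scaling the learning rate by hvd.size() is the usual convention.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(optimizer=opt, loss="mse")

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]), tf.random.normal([1024, 1]))
).batch(32)

model.fit(
    dataset,
    epochs=1,
    # Broadcast rank 0's initial weights so all workers start in sync.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,
)
```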
