
Distributed GPU Training #218

Open · agemagician opened this issue Aug 19, 2019 · 10 comments

@agemagician commented Aug 19, 2019

Hello,

Are there any plans for a script to train XLNet on distributed GPUs?

Perhaps with Horovod or MultiWorkerMirroredStrategy?

@LifeIsStrange

BERT equivalent: google-research/bert#568

@LifeIsStrange

@agemagician (Author)

@LifeIsStrange Thanks for the links.

I already know both of them, but as you know, they only support BERT and GPT, not XLNet.

For my use case, I am interested in XLNet. Hopefully, we will have a distributed GPU version soon.

@huseinzol05

Actually you can; just set:

--num_core_per_host=3 --train_batch_size=30
# 3 GPUs; the batch of 30 is automatically divided among them (10 per GPU)

But the current implementation uses an old distribution technique, so you will find that RAM leaks very badly.

@huseinzol05 commented Aug 25, 2019

I created a multi-GPU pretraining session for XLNet using MirroredStrategy.

Instructions on how to use it and the source code are linked. Just copy-paste this code after cloning this repository.

Please remove CUDA_VISIBLE_DEVICES; I put it there to limit my GPU usage.

Tested on 2 Tesla V100 GPUs with 32 GB VRAM.
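
For readers following along, here is a minimal sketch of single-node multi-GPU training with tf.distribute.MirroredStrategy; the toy Keras model and random data are placeholders, not the actual XLNet pretraining graph:

```python
# Minimal MirroredStrategy sketch (TensorFlow 1.14+/2.x).
# The model and data below are placeholders, not XLNet.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored on every GPU.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# The global batch (here 30) is split evenly across the replicas,
# matching the --train_batch_size behaviour described earlier.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]), tf.random.normal([1024, 1]))
).batch(30)

model.fit(dataset, epochs=1)
```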

@agemagician (Author)

@huseinzol05 This is multi-GPU training on a single node. I am asking about distributed GPU training across multiple nodes.

@huseinzol05

Actually, you just add a TF_CONFIG like this: https://lambdalabs.com/blog/tensorflow-2-0-tutorial-05-distributed-training-multi-node/
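
Concretely, the TF_CONFIG from that post looks roughly like this; the hostnames and port are placeholders, and each node must set its own task index:

```python
# Sketch of the TF_CONFIG environment variable for a two-node cluster.
# Hostnames and port are placeholders.
import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        # Every node lists the same cluster spec.
        "worker": ["node0.example.com:12345", "node1.example.com:12345"],
    },
    # Each node sets its own index: 0 on the first machine, 1 on the second.
    "task": {"type": "worker", "index": 0},
})
```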

@agemagician (Author)

Both your code and the official code use "MirroredStrategy", which works for single-node multi-GPU training. To make it work across multiple nodes, "MultiWorkerMirroredStrategy" should be used.

This is also stated in the blog post you linked: "tf_config" works with "MultiWorkerMirroredStrategy".
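
A minimal sketch of the multi-node variant, assuming TF_CONFIG is set as above on every node; again, the model is a placeholder rather than the XLNet graph:

```python
# MultiWorkerMirroredStrategy sketch (TF 1.14+/2.0-era API).
# Run the same script on every node after exporting TF_CONFIG.
import tensorflow as tf

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]), tf.random.normal([1024, 1]))
).batch(30)

# Gradients are all-reduced across all workers on every step.
model.fit(dataset, epochs=1)
```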

@huseinzol05

I believe you can change it yourself after copy-pasting? lol

@agemagician (Author)

Thanks for the information, but I am looking for more advanced large-scale distributed training, using Horovod for example.
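
For completeness, a minimal sketch of what a Horovod-based setup looks like with TensorFlow/Keras; the model is a placeholder, and an actual XLNet run would need the same init, optimizer-wrapping, and broadcast steps applied to its training loop:

```python
# Horovod sketch. Launch with, e.g.:
#   horovodrun -np 8 -H node0:4,node1:4 python train.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each process to a single GPU.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])

# Wrap the optimizer so gradients are averaged across all workers;
# scaling the learning rate by hvd.size() is the usual convention.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(optimizer=opt, loss="mse")

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]), tf.random.normal([1024, 1]))
).batch(32)

model.fit(
    dataset,
    epochs=1,
    # Broadcast rank 0's initial weights so all workers start in sync.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,
)
```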
