
Transformer LM training issues #20

Closed
deep-speech opened this issue May 27, 2019 · 21 comments

@deep-speech

I am able to replicate the single-GPU scores with the LibriSpeech corpus, but the training is slow: 23-24 hours per epoch. I tried 4-GPU training using Horovod; after 30 epochs, train/dev perplexities are around 170. Any suggestions for improving the convergence of multi-GPU training?

@albertz
Member

albertz commented May 27, 2019

You probably mean the Librispeech text corpus, right? As far as I know, that is huge (@kazuki-irie can comment), so I guess ~24h per epoch sounds right. (Edit: Correction, that should be 24h per sub-epoch (1/10th of the full epoch).)
In RETURNN, you use the LmDataset for it, right? You probably want to use epoch_split to split it up into sub-epochs, e.g. epoch_split=25 or so. Then one sub-epoch should take less than 1h. This also means that you store the model checkpoint more often and can do the learning rate scheduling more often.

I don't know if @kazuki-irie has any experience with multi-GPU training of language models. I don't. Which horovod_reduce_type and horovod_param_sync_step do you use? That will probably impact convergence speed. So will the learning rate, of course (and it is probably different than for single-GPU training).
LmDataset might also not be optimal for multi-GPU training (I don't know). Maybe it is more efficient to use HDFDataset. See also here.
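For reference, a minimal sketch of how these options could look in the config. The sub-epoch option is called epoch_split or partition_epoch depending on the RETURNN version, the LmDataset keys and file path here are assumptions, and the Horovod values are only examples; check the published LM configs and the RETURNN documentation for the authoritative settings.

train = {
    "class": "LmDataset",
    "corpus_file": "/path/to/librispeech-lm-norm.txt.gz",  # assumed location of the LM text corpus
    "seq_ordering": "random",
    "partition_epoch": 25,  # a.k.a. epoch_split: split the full corpus into 25 sub-epochs
}

# Horovod-related options discussed in this thread (example values):
use_horovod = True
horovod_reduce_type = "param"
horovod_param_sync_step = 50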

@kazuki-irie
Contributor

You probably mean the Librispeech text corpus, right? As far as I know, that is huge (@kazuki-irie can comment), so I guess ~24h per epoch sounds right.

That's right. I confirm that the training speed is in that range for the best (large) models using a single GPU (with random sequence ordering).
If you are working with the official LibriSpeech 200K word level vocabulary or our 10K BPEs, you could also consider making use of our pre-trained models:
https://github.com/rwth-i6/returnn-experiments/tree/master/2019-lm-transformers/librispeech

I don't know if @kazuki-irie has any experience about multi-GPU training of language models.

No. It has been on my TODO list for a while, but I have not had time for it so far, so I cannot help here. Sorry.

@albertz
Member

albertz commented May 27, 2019

Closing now, as this is not really about a bug in the code. But feel free to ask further questions.

@albertz albertz closed this as completed May 27, 2019
@deep-speech
Author

deep-speech commented May 27, 2019

Yes, it's the LibriSpeech text corpus. I used the BPE-based Transformer LM config; the only changes I made were the Horovod-related flags (reduce_type='param', sync_step=50) and some changes in LmDataset to distribute the text sequences between the GPUs. Similar changes have worked well for multi-GPU training of LSTM-based LM configs. I am trying to reproduce the results so that the setup can be used to train on a larger corpus with multiple GPUs.

@albertz
Member

albertz commented May 27, 2019

Please share your experience and results if you are successful; that might be helpful.
Can you also share some details about what exactly you changed in LmDataset?

@deep-speech
Author

deep-speech commented Jun 6, 2019

Sorry for the late reply and the formatting. I have added the following changes to the _iter_text() method:

import horovod.tensorflow as hvd

hvd_rank = hvd.local_rank()  # rank of this worker (local_rank() assumes a single node)
hvd_size = hvd.size()        # total number of Horovod workers
count = -1

for line in f:
    count += 1
    # Skip lines belonging to other workers, so each GPU only reads its own shard of the corpus.
    if count % hvd_size != hvd_rank:
        continue

@albertz
Member

albertz commented Jun 6, 2019

Ah, but that should not be needed. Actually, that is probably wrong.
Check FeedDictDataProvider.get_next_batch. In the case of Horovod, you have batch_slice = slice(hvd.rank(), None, hvd.size()) there.
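To illustrate what that slice does (a standalone Python illustration, not RETURNN code): every worker builds the same batch but keeps only every hvd.size()-th sequence, starting at its own rank, so the batch is effectively sharded across the workers.

seqs = ["seq0", "seq1", "seq2", "seq3", "seq4", "seq5", "seq6", "seq7"]
num_workers = 4  # stands in for hvd.size()
for rank in range(num_workers):  # each worker's hvd.rank()
    batch_slice = slice(rank, None, num_workers)
    print(rank, seqs[batch_slice])
# rank 0 -> ['seq0', 'seq4'], rank 1 -> ['seq1', 'seq5'], rank 2 -> ['seq2', 'seq6'], rank 3 -> ['seq3', 'seq7']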

@deep-speech
Author

deep-speech commented Jun 6, 2019

Thanks for clarifying; I will try it as you suggested. But without this change, all instances load all the training sequences.

@albertz
Member

albertz commented Jun 6, 2019

Yes, that is unfortunately the case. But I do not know of a better solution (one which would work as-is with all existing datasets). Your solution is of course better, but it only works for LmDataset, and it is also wrong unless you remove that batch_slice logic.

@deep-speech
Author

With the default logic and a bigger corpus, memory becomes an issue. I will experiment with removing the batch_slice logic.

@albertz
Member

albertz commented Jun 6, 2019

Yes, I know. Btw, that is why I recommend HDFDataset for multi-GPU training. It does not load the whole data into memory, so memory should not be a real issue, and it should also be fast. You can use the tool hdf_dump.py to convert your LmDataset (or any dataset) into an HDFDataset. See the documentation about multi-GPU training.
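A hedged sketch of how that conversion might be invoked; the exact arguments of hdf_dump.py differ between RETURNN versions, so treat the file names and argument order below as assumptions and check tools/hdf_dump.py --help first.

python3 returnn/tools/hdf_dump.py lm.train.config train.hdf  # dump the train dataset defined in the config into train.hdf

The resulting file can then be used as the training dataset with something like {"class": "HDFDataset", "files": ["train.hdf"]}.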

@deep-speech
Author

Sure, will try that. Thanks a lot for the clarifications.

@deep-speech
Author

I rechecked; I had the batch_slice logic commented out in my earlier experiments, which worked well for multi-GPU training of LSTM-based LMs.

@deep-speech
Author

Is there a script for checking perplexity on test data?

@kazuki-irie
Contributor

You can add the test data in the config and directly call RETURNN:

rnn.py train.config ++task eval ++train None ++load_epoch $EPOCH ++log eval.txt ++learning_rate_file dummy_for_eval.txt

@bitterfly

Hello,
This may not be the right place, since this (again) is not a bug, but I'm going to take advantage of the

But feel free to ask further questions.

statement and ask a question regarding the training time of this particular experiment.

We've tried replicating the results from this paper, more precisely re_transfo_96_d00.2048_512.head_8.sgd.lr1.cl1.small_batch.config. We used the LibriSpeech corpus and the dictionary provided in the experiment.

According to the beginning of this discussion, one epoch is supposed to take ~24h. Using the config file (and TensorFlow 2.3.1), running one sub-epoch (with epoch_split = 10) takes around 27h on our machine, which means the entire epoch would take around 10 days. We are running the model on a single Tesla V100 GPU (CUDA 10.1). It is mentioned in another issue that the returnn-experiments were run on a GTX 1080 Ti GPU. So I was wondering what the problem with our setup could be, or whether we have misunderstood something, since the training takes 10 times longer on a (supposedly) faster GPU.

@kazuki-irie
Contributor

When I read my old response now, I realize how confusing it was...
If I remember correctly, the ~24-hour training time was for 1 sub-epoch, not 1 epoch, and there we had 10 sub-epochs = 1 epoch (just like in your setting).
This should explain the factor of 10.

@albertz
Member

albertz commented Feb 4, 2021

Hi,

To add:

It is mentioned in another issue that the returnn-experiments are run on a GTX 1080 Ti GPU

Which other issue? The experiments here were performed on many different kinds of hardware, by many different people (GTX 1080 is common, but also 980, 2080, and some of the professional cards as well). I think @kazuki-irie also often trained on a faster GPU, as far as I remember (not sure which one exactly).

@bitterfly

@albertz, I meant this particular issue.
Actually, the whole thought process was inspired by this quote from the issue:

(although normally our training times are often 1-5 days or so; in only some of the rare extreme cases you get sth like 2 weeks; all of that always on a single GTX 1080 Ti GPU)

I might be grasping at straws here, because I couldn't find much information about the training times while trying to recreate the results from the experiment.

@kazuki-irie, thanks for the reply (and edit).
If I understand the naming convention correctly, this model has been trained for 30 sub-epochs, each taking about a day... so this model takes a month to train on a single GPU? I'm not sure that the above quote applies to this particular experiment (as it takes longer than 2 weeks), so I'm not sure whether you used multiple GPUs or the model really just takes that long to train.

Sorry for wasting your time.

@kazuki-irie
Contributor

this model takes a month to train on a single GPU?

That is correct. It's a big model.

So I'm not sure whether you used multiple GPUs

No, on one GPU (at that time).

@Spotlight0xff
Contributor

Hey, maybe I can chime in a bit.
I did train a large LM on multiple GPUs (an LSTM LM though); these are my RETURNN settings for that:

use_horovod = config.bool("use_horovod", False)  # enable Horovod when requested (e.g. via the command line)
horovod_dataset_distribution = "shard"           # each worker reads its own shard of the dataset
horovod_reduce_type = "param"                    # periodically average parameters instead of reducing gradients every step
horovod_param_sync_step = 100                    # synchronize parameters every 100 steps

Haven't experimented with it a lot (one single run actually), but it seemed to work.
I would assume that similar settings would also work for the Transformer LM.

That was with 8 GPUs; one sub-epoch of ~90k steps took around 6h (on a V100).
