
Transformer LM training issues #20

Closed
deep-speech opened this issue May 27, 2019 · 21 comments

@deep-speech

I am able to replicate the single-GPU scores with the LibriSpeech corpus, but the training is slow: 23-24 hours per epoch. I tried 4-GPU training using Horovod; after 30 epochs, train/dev perplexities are around 170. Any suggestions for improving the convergence of multi-GPU training?

@albertz
Member

albertz commented May 27, 2019

You probably mean the Librispeech text corpus, right? As far as I know, that is huge (@kazuki-irie can comment), so I guess ~24h per epoch sounds right. (Edit: Correction, that should be 24h per sub-epoch (1/10th of the full epoch).)
In RETURNN, you use the LmDataset for it, right? You probably want to use epoch_split to split it up into sub-epochs, e.g. epoch_split=25 or so. Then one sub-epoch should take less than 1h. This also means that you store the model checkpoint more often and can do the learning rate scheduling more often.

I don't know if @kazuki-irie has any experience with multi-GPU training of language models. I don't. Which horovod_reduce_type and horovod_param_sync_step do you use? That will probably impact convergence speed. So will the learning rate, of course (and it is probably different than for single-GPU training).
LmDataset might also not be optimal for multi-GPU training (I don't know). Maybe it is more efficient to use HDFDataset. See also here.
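For reference, a minimal sketch of how these options could look in the config. The sub-epoch option is called epoch_split or partition_epoch depending on the RETURNN version, the LmDataset keys and file path here are assumptions, and the Horovod values are only examples; check the published LM configs and the RETURNN documentation for the authoritative settings.

train = {
    "class": "LmDataset",
    "corpus_file": "/path/to/librispeech-lm-norm.txt.gz",  # assumed location of the LM text corpus
    "seq_ordering": "random",
    "partition_epoch": 25,  # a.k.a. epoch_split: split the full corpus into 25 sub-epochs
}

# Horovod-related options discussed in this thread (example values):
use_horovod = True
horovod_reduce_type = "param"
horovod_param_sync_step = 50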

@kazuki-irie
Contributor

You probably mean the Librispeech text corpus, right? As far as I know, that is huge (@kazuki-irie can comment), so I guess ~24h per epoch sounds right.

That's right. I confirm that the training speed is in that range for the best (large) models using a single GPU (with random sequence ordering).
If you are working with the official LibriSpeech 200K word level vocabulary or our 10K BPEs, you could also consider making use of our pre-trained models:
https://github.com/rwth-i6/returnn-experiments/tree/master/2019-lm-transformers/librispeech

I don't know if @kazuki-irie has any experience about multi-GPU training of language models.

No. It has been on my TODO list for a while, but I have not had time for it so far, so I cannot help here. Sorry.

@albertz
Member

albertz commented May 27, 2019

Closing now, as this is not really about a bug in the code. But feel free to ask further questions.

@albertz albertz closed this as completed May 27, 2019
@deep-speech
Author

deep-speech commented May 27, 2019

Yes, it's the LibriSpeech text corpus. I used the BPE-based Transformer LM config; the only changes I made were the Horovod-related flags (reduce_type='param', sync_step=50) and some changes in LmDataset to distribute the text sequences between the GPUs. Similar changes have worked well for multi-GPU training of LSTM-based LM configs. I am trying to reproduce the results so that the setup can be used to train on a larger corpus with multiple GPUs.

@albertz
Member

albertz commented May 27, 2019

Please share your experience and results if you are successful; that might be helpful.
Can you also share some details about what exactly you changed in LmDataset?

@deep-speech
Author

deep-speech commented Jun 6, 2019

Sorry for the late reply and the formatting. I have added the following changes to the _iter_text() method:

import horovod.tensorflow as hvd

hvd_rank = hvd.local_rank()  # rank of this worker (local_rank() assumes a single node)
hvd_size = hvd.size()        # total number of Horovod workers
count = -1

for line in f:
    count += 1
    # Skip lines belonging to other workers, so each GPU only reads its own shard of the corpus.
    if count % hvd_size != hvd_rank:
        continue

@albertz
Member

albertz commented Jun 6, 2019

Ah, but that should not be needed. Actually, that is probably wrong.
Check FeedDictDataProvider.get_next_batch. In the case of Horovod, you have batch_slice = slice(hvd.rank(), None, hvd.size()) there.
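To illustrate what that slice does (a standalone Python illustration, not RETURNN code): every worker builds the same batch but keeps only every hvd.size()-th sequence, starting at its own rank, so the batch is effectively sharded across the workers.

seqs = ["seq0", "seq1", "seq2", "seq3", "seq4", "seq5", "seq6", "seq7"]
num_workers = 4  # stands in for hvd.size()
for rank in range(num_workers):  # each worker's hvd.rank()
    batch_slice = slice(rank, None, num_workers)
    print(rank, seqs[batch_slice])
# rank 0 -> ['seq0', 'seq4'], rank 1 -> ['seq1', 'seq5'], rank 2 -> ['seq2', 'seq6'], rank 3 -> ['seq3', 'seq7']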

@deep-speech
Author

deep-speech commented Jun 6, 2019

Thanks for clarifying; I will try it as you suggested. But without this change, all instances load all the training sequences.

@albertz
Member

albertz commented Jun 6, 2019

Yes, that is unfortunately the case. But I do not know of a better solution (one which would work as-is with all existing datasets). Your solution is of course better, but it only works for LmDataset, and it is also wrong unless you remove that batch_slice logic.

@deep-speech
Author

With the default logic and a bigger corpus, memory becomes an issue. I will experiment with removing the batch_slice logic.

@albertz
Member

albertz commented Jun 6, 2019

Yes, I know. Btw, that is why I recommend HDFDataset for multi-GPU training. It does not load the whole data into memory, so memory should not be a real issue, and it should also be fast. You can use the tool hdf_dump.py to convert your LmDataset (or any dataset) into an HDFDataset. See the documentation about multi-GPU training.
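A hedged sketch of how that conversion might be invoked; the exact arguments of hdf_dump.py differ between RETURNN versions, so treat the file names and argument order below as assumptions and check tools/hdf_dump.py --help first.

python3 returnn/tools/hdf_dump.py lm.train.config train.hdf  # dump the train dataset defined in the config into train.hdf

The resulting file can then be used as the training dataset with something like {"class": "HDFDataset", "files": ["train.hdf"]}.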

@deep-speech
Author

Sure, will try that. Thanks a lot for the clarifications.

@deep-speech
Author

I rechecked; I had the batch_slice logic commented out in my earlier experiments, which worked well for multi-GPU training of LSTM-based LMs.

@deep-speech
Author

Is there a script for checking perplexity on test data?

@kazuki-irie
Contributor

You can add the test data in the config and directly call RETURNN:

rnn.py train.config ++task eval ++train None ++load_epoch $EPOCH ++log eval.txt ++learning_rate_file dummy_for_eval.txt

@bitterfly

Hello,
This may not be the right place, since this (again) is not a bug, but I'm going to take advantage of the

But feel free to ask further questions.

statement and ask a question regarding the training time of this particular experiment.

We've tried replicating the results from this paper, more precisely re_transfo_96_d00.2048_512.head_8.sgd.lr1.cl1.small_batch.config. We used the LibriSpeech corpus and the dictionary provided in the experiment.

According to the beginning of this discussion, one epoch is supposed to take ~24h. Using the config file (and TensorFlow 2.3.1), running one sub-epoch (with epoch_split = 10) takes around 27h on our machine, which means the entire epoch would take around 10 days. We are running the model on a single Tesla V100 GPU (CUDA 10.1). It is mentioned in another issue that the returnn-experiments were run on a GTX 1080 Ti GPU. So I was wondering what the problem with our setup could be, or whether we have misunderstood something, since the training takes 10 times longer on a (supposedly) faster GPU.

@kazuki-irie
Contributor

When I read my old response now, I realize how confusing it was...
If I remember correctly, the ~24-hour training time was for 1 sub-epoch, not 1 epoch, and there we had 10 sub-epochs = 1 epoch (just like in your setting).
This should explain the factor of 10.

@albertz
Member

albertz commented Feb 4, 2021

Hi,

To add:

It is mentioned in another issue that the returnn-experiments are run on a GTX 1080 Ti GPU

Which other issue? The experiments here were performed on many different kinds of hardware, by many different people (GTX 1080 is common, but also 980, 2080, and some of the professional cards as well). I think @kazuki-irie also often trained on a faster GPU, as far as I remember (not sure which one exactly).

@bitterfly

@albertz, I meant this particular issue.
Actually, the whole thought process was inspired by this quote from the issue:

(although normally our training times are often 1-5 days or so; in only some of the rare extreme cases you get sth like 2 weeks; all of that always on a single GTX 1080 Ti GPU)

I might be grasping at straws here, because I couldn't find much information about the training times while trying to recreate the results from the experiment.

@kazuki-irie, thanks for the reply (and edit).
If I understand the naming convention correctly, this model has been trained for 30 sub-epochs, each taking about a day... so this model takes a month to train on a single GPU? I'm not sure that the above quote applies to this particular experiment (as it takes longer than 2 weeks), so I'm not sure whether you used multiple GPUs or the model really just takes that long to train.

Sorry for wasting your time.

@kazuki-irie
Contributor

this model takes a month to train on a single GPU?

That is correct. It's a big model.

So I'm not sure whether you used multiple GPUs

No, on one GPU (at that time).

@Spotlight0xff
Contributor

Hey, maybe I can chime in a bit.
I did train a large LM on multiple GPUs (an LSTM LM though); these are my RETURNN settings for that:

use_horovod = config.bool("use_horovod", False)  # enable Horovod when requested (e.g. via the command line)
horovod_dataset_distribution = "shard"           # each worker reads its own shard of the dataset
horovod_reduce_type = "param"                    # periodically average parameters instead of reducing gradients every step
horovod_param_sync_step = 100                    # synchronize parameters every 100 steps

Haven't experimented with it a lot (one single run actually), but it seemed to work.
I would assume that similar settings would also work for the Transformer LM.

That was with 8 GPUs; one sub-epoch of ~90k steps took around 6h (on a V100).
