Transformer LM training issues #20
You probably mean the Librispeech text corpus, right? As far as I know, that is huge (@kazuki-irie can comment), so I guess ~24h per epoch sounds right. (Edit: correction, that should be 24h per sub-epoch, i.e. 1/10th of the full epoch.) I don't know whether @kazuki-irie has any experience with multi-GPU training of language models; I don't. …
That's right. I confirm that the training speed is in that range for the best (large) models using a single GPU (with …)
No. It has been on my TODO list for a while, but I have had no time for it so far, so I cannot help here. Sorry.
Closing now, as this is not really about a bug in the code. But feel free to ask further questions.
Yes, it's the Librispeech text corpus. I used the BPE-based Transformer LM config; the only changes I made were the Horovod-related flags reduce_type='param' and sync_step=50, plus some changes in LMDataset to distribute the text sequences between the GPUs. Similar changes have worked well for multi-GPU training of LSTM-based LM configs. I am trying to reproduce the results so that the setup can be used to train on a larger corpus with multiple GPUs.
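Roughly, the Horovod-related part of the config looked like this (a sketch from memory; only the two values reduce_type='param' and sync_step=50 come from the actual setup, and the exact RETURNN option names should be double-checked against the version in use):

```python
# Sketch of the Horovod-related additions to the Transformer LM config.
# The option names below are the usual RETURNN Horovod settings and may
# differ between RETURNN versions; verify against the documentation.
use_horovod = True
horovod_reduce_type = "param"   # synchronize by averaging parameters instead of gradients
horovod_param_sync_step = 50    # perform the parameter sync every 50 training steps
```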
Please share your experience and results if you are successful; that might be helpful.
Sorry for the late reply & formatting. I have added the following changes to the _iter_text() method (an import horovod.tensorflow as hvd plus a modified for line in f: loop).
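The snippet above got cut off; the idea is roughly the following (a sketch, not the exact diff; the file handling and attribute names are placeholders, the point is only the rank-based filtering so that each GPU keeps its own shard of the corpus):

```python
# Sketch of the change inside LmDataset._iter_text(): each Horovod worker
# keeps only every size()-th line, offset by its own rank, instead of
# loading the full text corpus on every GPU.
import horovod.tensorflow as hvd

def _iter_text(self):
    with open(self.corpus_file) as f:  # placeholder; the real code may read gzipped files etc.
        for i, line in enumerate(f):
            if i % hvd.size() != hvd.rank():
                continue  # this line belongs to another worker
            yield line
```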
Ah, but that should not be needed. Actually that is probably wrong.
Thanks for clarifying, will try as you suggested. But without this change, all instances load all the training sequences.
Yes, that is unfortunately the case. But I did not know about a better solution (which would work as-is with all existing datasets). Your solution is of course better, but then it only works for LmDataset specifically.
With the default logic and a bigger corpus, memory becomes an issue. I will experiment with removing the batch_slice logic.
Yes, I know. Btw, that is why I recommend …
Sure, will try that. Thanks a lot for the clarifications.
I rechecked: I had the batch_slice logic commented out in my earlier experiments, which worked well for multi-GPU training of LSTM-based LMs.
Is there a script for checking ppl on test data? |
If you can add the test data in the config and directly call returnn …
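As a rough sketch of what that can look like (the dataset options and the command line below are from memory and should be checked against the RETURNN docs; paths and the epoch number are placeholders):

```python
# Sketch: add a held-out/test dataset to the LM config so RETURNN can
# report scores (and thus perplexity) on it in eval mode. Option names
# follow LmDataset as far as I remember them; verify against the docs,
# and use the same vocab/BPE options as the train/dev datasets.
eval = {
    "class": "LmDataset",
    "corpus_file": "/path/to/test.txt",   # placeholder path to the test text
    "seq_end_symbol": "</s>",
    "auto_replace_unknown_symbol": False,
}

# Then run RETURNN in eval mode on a trained epoch, e.g. something like:
#   python3 rnn.py my_lm.config ++task eval ++load_epoch 30
```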
Hello,
I would like to follow up on a statement and ask a question regarding the training time of this particular experiment. We've tried replicating the results from this paper (more precisely, …). According to the beginning of this discussion, one epoch is supposed to take ~24h. Using the config file (and TensorFlow 2.3.1), running one sub-epoch (with …
When I read my old response now, I realize how confusing it was... |
Hi, to add: …
Which other issue? The experiments here were performed on many different kinds of hardware, by many different people (GTX 1080 is common, but also 980, 2080, and some of the professional cards as well). I think @kazuki-irie also often trained on a faster GPU, as far as I remember (not sure which one exactly).
@albertz, I meant this particular issue.
I might be grasping at straws here, because I couldn't find much information about the training times while trying to recreate the results from the experiment. @kazuki-irie, thanks for the reply (and edit). Sorry for wasting your time.
That is correct. It's a big model.
No, on one GPU (at that time). |
Hey, maybe I can chime in a bit.
Haven't experimented with it a lot (one single run actually), but it seemed to work. That was with 8 GPUs; one sub-epoch with ~90k steps took around 6h (on a V100).
I am able to replicate the single-GPU scores with the Librispeech corpus, but the training is slow: 23-24 hours per epoch. I tried 4-GPU training using Horovod; after 30 epochs, the train/dev perplexities are around 170. Any suggestions for improving the convergence of multi-GPU training?