
Memory use during decoding with the big transformer on a single GPU #712

Closed
sebastian-nehrdich opened this issue Apr 16, 2018 · 3 comments
Labels: help

@sebastian-nehrdich

Hello there,
I am currently trying to train the big transformer on a not very big dataset (about 500k sentences of Sanskrit, ~30 MB in/out). The base model trains and decodes fine. The big model also trains well (though much more slowly, which I assume is expected behavior), but for decoding I have to decrease the batch size a lot, down to 10, in order to get a result at all. Unfortunately, the decoded sequences thus become very short. Is this the expected behavior when decoding the big transformer on a single GPU? It is a Maxwell Titan X with 12 GB.

@martinpopel
Contributor

I am not sure the big model will be better than the base model for such a small amount of training data. Big models need more memory (hint: this can be decreased with optimizer=Adafactor,learning_rate_schedule=rsqrt_decay), so you can only afford a smaller batch size when training. In my experiments, the default batch size for decoding (24, I think) is usually OK even for big models, and changing it resulted in exactly the same translations.
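For concreteness, here is a minimal sketch of how that hint could be applied as a custom hparams set, assuming tensor2tensor's registry and transformer modules; only the optimizer and learning_rate_schedule values come from the comment above, and the set name transformer_big_adafactor is made up for illustration.

```python
# Hypothetical hparams set applying the memory hint above (the name and the
# registration pattern are illustrative, not from this thread).
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry


@registry.register_hparams
def transformer_big_adafactor():
  hparams = transformer.transformer_big()          # start from the big model
  hparams.optimizer = "Adafactor"                  # keeps less optimizer state in memory
  hparams.learning_rate_schedule = "rsqrt_decay"   # schedule suggested together with Adafactor
  return hparams
```

The same two values can also be passed directly on the command line via --hparams='optimizer=Adafactor,learning_rate_schedule=rsqrt_decay' without registering anything.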

> Unfortunately the decoded sequences thus become very short.

I don't see how this could be related to the decode batch size. There is a max_length parameter that restricts sentence length (in subwords) during training. In my experiments, the model was not able to generalize to longer inputs (and produce long enough outputs) at decode time, so I tend to set max_length high enough (150). By default, max_length is set to the training batch_size, which is always higher than 150 subwords, so this should not be the problem.
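As a small illustration of the max_length point, a sketch assuming the transformer_big hparams set from the tensor2tensor codebase (the value 150 is the one mentioned above):

```python
# Sketch: set max_length explicitly instead of relying on the default
# fallback to the training batch_size.
from tensor2tensor.models import transformer

hparams = transformer.transformer_big()
hparams.max_length = 150  # restrict training examples to at most 150 subwords
```

Equivalently, --hparams='max_length=150' on the command line.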

@sebastian-nehrdich
Author

OK, thank you for the hints! The problem was caused by a simple confusion: I was changing the batch_size for training when I wanted to change the batch_size for decoding. That is why the model behaved in such a strange way.
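For future readers, here is a hedged sketch of the two different batch_size settings that got mixed up here, assuming the tensor2tensor APIs referenced in this thread (transformer_big and decoding.decode_hparams; the concrete values are just illustrative):

```python
# Two unrelated "batch_size" knobs in tensor2tensor (illustrative values).
from tensor2tensor.models import transformer
from tensor2tensor.utils import decoding

# 1) Training batch size: a model hparam, on GPU counted in subword tokens
#    per batch. This is what --hparams='batch_size=...' changes.
hparams = transformer.transformer_big()
hparams.batch_size = 2048

# 2) Decoding batch size: a decode hparam, counted in sentences per batch.
#    This is what --decode_hparams='batch_size=...' changes and what matters
#    for memory use during decoding.
decode_hp = decoding.decode_hparams("batch_size=10,beam_size=4,alpha=0.6")
```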

@martinpopel
Contributor

OK. So let's close this issue.
