This repository has been archived by the owner on Jul 7, 2023. It is now read-only.
Hello there,
I am currently trying to train the big transformer on a fairly small dataset (about 500k sentences of Sanskrit, ~30 MB in/out). The base model trains and decodes fine. With the big model, training works (though much slower, which I guess is expected behavior), but when decoding I have to reduce the batch size a lot, down to 10, to get any result at all. Unfortunately, the decoded sequences then become very short. Is this the expected behavior when decoding the big transformer on a single GPU? It is a Maxwell TitanX with 12 GB.
I am not sure if big will be better than base for such a small training set. Big models need more memory (hint: memory usage can be decreased with optimizer=Adafactor,learning_rate_schedule=rsqrt_decay), so you can only afford a smaller batch size when training. In my experiments, the default batch size for decoding (24, I think) is usually fine even for big models, and changing it resulted in exactly the same translations.
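For concreteness, here is a rough sketch of how those hparams can be passed on the command line, assuming the standard t2t-trainer CLI; the problem name and paths are placeholders, not from this thread:

```shell
# Sketch: training transformer_big with the memory-saving optimizer settings
# mentioned above. PROBLEM, DATA_DIR and OUT_DIR are placeholders.
t2t-trainer \
  --problem=$PROBLEM \
  --model=transformer \
  --hparams_set=transformer_big \
  --hparams='optimizer=Adafactor,learning_rate_schedule=rsqrt_decay' \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR
```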
> Unfortunately the decoded sequences thus become very short.
I don't see how this could be related to the decode batch size. There is the max_length parameter restricting the sentence length in training. In my experiments, the model was not able to generalize to longer inputs (and produce long enough outputs) at decode time, so I tend to set max_length high enough (150). By default, max_length is set to the training batch_size, but that is always higher than 150 (subwords), so this should not be the problem.
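A sketch of how these settings could be made explicit at decode time, again assuming the standard t2t-decoder CLI with placeholder problem name and paths (the batch_size here is the decode batch size, separate from the training one):

```shell
# Sketch: decoding with an explicit max_length and decode batch size.
# PROBLEM, DATA_DIR and OUT_DIR are placeholders.
t2t-decoder \
  --problem=$PROBLEM \
  --model=transformer \
  --hparams_set=transformer_big \
  --hparams='max_length=150' \
  --decode_hparams='batch_size=24' \
  --decode_from_file=input.txt \
  --decode_to_file=output.txt \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR
```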
OK, thank you for the hints! The problem was caused by a simple confusion: I was changing the batch_size for training when I actually wanted to change the batch_size for decoding. That is why the model behaved in such a strange way.