This repository has been archived by the owner on Jul 7, 2023. It is now read-only.
Hello there,
I am currently trying to train the big transformer on a fairly small dataset (about 500k sentences of Sanskrit, ~30 MB in/out). The base model trains and decodes fine. With the big model, training works (though much slower, which I guess is expected behavior), but when decoding I have to reduce the batch size a lot, down to 10, to get any result at all. Unfortunately, the decoded sequences then become very short. Is this the expected behavior when decoding the big transformer on a single GPU? It is a Maxwell TitanX with 12 GB.
I am not sure if big will be better than base for such a small training set. Big models need more memory (hint: memory usage can be decreased with optimizer=Adafactor,learning_rate_schedule=rsqrt_decay), so you can only afford a smaller batch size when training. In my experiments, the default batch size for decoding (24, I think) is usually fine even for big models, and changing it resulted in exactly the same translations.
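For concreteness, here is a rough sketch of how those hparams can be passed on the command line, assuming the standard t2t-trainer CLI; the problem name and paths are placeholders, not from this thread:

```shell
# Sketch: training transformer_big with the memory-saving optimizer settings
# mentioned above. PROBLEM, DATA_DIR and OUT_DIR are placeholders.
t2t-trainer \
  --problem=$PROBLEM \
  --model=transformer \
  --hparams_set=transformer_big \
  --hparams='optimizer=Adafactor,learning_rate_schedule=rsqrt_decay' \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR
```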
> Unfortunately the decoded sequences thus become very short.
I don't see how this could be related to the decode batch size. There is the max_length parameter restricting the sentence length in training. In my experiments, the model was not able to generalize to longer inputs (and produce long enough outputs) at decode time, so I tend to set max_length high enough (150). By default, max_length is set to the training batch_size, but that is always higher than 150 (subwords), so this should not be the problem.
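A sketch of how these settings could be made explicit at decode time, again assuming the standard t2t-decoder CLI with placeholder problem name and paths (the batch_size here is the decode batch size, separate from the training one):

```shell
# Sketch: decoding with an explicit max_length and decode batch size.
# PROBLEM, DATA_DIR and OUT_DIR are placeholders.
t2t-decoder \
  --problem=$PROBLEM \
  --model=transformer \
  --hparams_set=transformer_big \
  --hparams='max_length=150' \
  --decode_hparams='batch_size=24' \
  --decode_from_file=input.txt \
  --decode_to_file=output.txt \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR
```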
OK, thank you for the hints! The problem was caused by a simple confusion: I was changing the batch_size for training when I actually wanted to change the batch_size for decoding. That is why the model behaved in such a strange way.