
how much training time #4

Closed
yuexianghubit opened this issue Jul 14, 2020 · 12 comments
@yuexianghubit

Hi there,
Thanks for the nice code.

I didn't find early stopping in your code. So each time 'PRETRAIN_NUM_WORDS' is changed, will it run for 1000 steps/epochs? That would take many days to train the whole model.

May I know how long it took you to train the whole AV model?
Looking forward to your reply. Thanks!

@smeetrs
Owner

smeetrs commented Jul 14, 2020

Hi,
Thanks for your interest in the code.

Early stopping is implemented by manually terminating the program once the scheduler has decreased the learning rate to 1e-6 and the validation set WER has stopped decreasing. After that, you just pick the best weights based on validation set WER from the checkpoints/models directory. Most often, the model is trained for only around 200-300 steps.
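For illustration, a minimal sketch of picking the best checkpoint by validation WER. It assumes (hypothetically) that the saved filenames encode the validation WER, e.g. `step_0240-wer_0.412.pt`; adjust the pattern to whatever naming the training script actually uses:

```python
import re
from pathlib import Path

def best_checkpoint(ckpt_dir="checkpoints/models"):
    """Return the checkpoint file with the lowest validation WER.

    Assumes a hypothetical filename pattern like 'step_0240-wer_0.412.pt';
    change the regex to match the actual naming used by the training script.
    """
    pattern = re.compile(r"wer_([0-9.]+)\.pt$")
    scored = []
    for path in Path(ckpt_dir).glob("*.pt"):
        match = pattern.search(path.name)
        if match:
            scored.append((float(match.group(1)), path))
    if not scored:
        raise FileNotFoundError(f"no matching checkpoints found in {ckpt_dir}")
    return min(scored)[1]  # path of the checkpoint with the smallest WER
```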

I trained the model on a single GeForce GTX 1080 Ti GPU. The numbers of words in the curriculum learning iterations were 1, 2, 3, 5, 7, 9, 13, 17, 21, 29, 37. I trained the AO and VO models first and used them to initialize the AV model (I haven't trained the AV model from scratch). With these settings, it took about 7 days to train each model.
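As a rough sketch of that initialization step (the assumption that checkpoints store plain state dicts and that parameter names of shared sub-modules line up between the single-modality and AV models is illustrative, not taken from the repo):

```python
import torch
import torch.nn as nn

def init_av_from_ao_vo(av_model: nn.Module, ao_ckpt: str, vo_ckpt: str) -> nn.Module:
    """Initialize an AV model from separately trained AO and VO checkpoints.

    Assumes the checkpoints store plain state dicts and that parameter names
    of the shared sub-modules match between the single-modality and AV models;
    both are illustrative assumptions, not taken from the repo.
    """
    av_state = av_model.state_dict()
    for ckpt_path in (ao_ckpt, vo_ckpt):
        weights = torch.load(ckpt_path, map_location="cpu")
        for name, tensor in weights.items():
            # copy only parameters whose names and shapes match the AV model
            if name in av_state and av_state[name].shape == tensor.shape:
                av_state[name] = tensor
    av_model.load_state_dict(av_state)
    return av_model
```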

@yuexianghubit
Author

Thanks for your reply.
I noted that you implemented only the transformer-CTC architecture from the paper. I implemented the transformer-seq2seq version based on ESPnet, but the performance of the AV model is slightly worse than that of the AO model. I think it's due to the context vector fusion step.

The paper says the context vectors of the audio and video modalities are concatenated channel-wise and fed to feed-forward layers. I simply concatenated the two context vectors (e.g., both of dimension (32, 123, 512)) into (32, 123, 1024), applied a linear layer to map back to (32, 123, 512), and then fed the result to the feed-forward layers, but the performance is always slightly worse than the AO model. Do you have any idea how to implement the channel-wise concatenation?
I noted that you used a convolution layer to fuse the audio and video embeddings in the transformer-CTC. Should I use the same method to fuse the two context vectors?
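For concreteness, a minimal PyTorch sketch of the fusion I described, with the conv variant as an alternative (shapes taken from the example above; the class and argument names are just illustrative):

```python
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """Fuse audio and video context vectors by channel-wise concatenation."""

    def __init__(self, d_model=512):
        super().__init__()
        # linear fusion: (B, T, 2*d_model) -> (B, T, d_model)
        self.linear = nn.Linear(2 * d_model, d_model)
        # alternative: pointwise 1D convolution over time, analogous to the
        # conv layer used to fuse the audio/video embeddings in this repo
        self.conv = nn.Conv1d(2 * d_model, d_model, kernel_size=1)

    def forward(self, audio_ctx, video_ctx, use_conv=False):
        # audio_ctx, video_ctx: (B, T, d_model), e.g. (32, 123, 512)
        fused = torch.cat([audio_ctx, video_ctx], dim=-1)  # (B, T, 2*d_model)
        if use_conv:
            # Conv1d expects (B, C, T)
            return self.conv(fused.transpose(1, 2)).transpose(1, 2)
        return self.linear(fused)  # (B, T, d_model), fed to the FFN afterwards
```

Note that with kernel_size 1 the convolution is mathematically equivalent to the linear projection; a larger kernel would additionally mix information across neighboring time steps.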

Any idea or suggestions will be appreciated. Thanks!

@yuexianghubit
Author

Hi,
I am trying to train the video-only model. When 'PRETRAIN_NUM_WORDS' is 1, it seems that the WER on both the training and validation sets stays at 1 the whole time, with no improvement.

```
Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 182 || Tr.Loss: 3.239813 Val.Loss: 3.226672 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Epoch 183: reducing learning rate of group 0 to 1.0000e-06.
Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 183 || Tr.Loss: 3.241490 Val.Loss: 3.221430 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 184 || Tr.Loss: 3.240253 Val.Loss: 3.238177 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 185 || Tr.Loss: 3.228107 Val.Loss: 3.234346 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 186 || Tr.Loss: 3.234290 Val.Loss: 3.216766 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 187 || Tr.Loss: 3.241915 Val.Loss: 3.232590 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 188 || Tr.Loss: 3.233189 Val.Loss: 3.228462 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 189 || Tr.Loss: 3.236741 Val.Loss: 3.223365 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 190 || Tr.Loss: 3.235876 Val.Loss: 3.216625 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 191 || Tr.Loss: 3.241944 Val.Loss: 3.242806 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 192 || Tr.Loss: 3.237240 Val.Loss: 3.243809 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 193 || Tr.Loss: 3.238747 Val.Loss: 3.219588 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
```
Is this situation normal?
Thanks for your suggestions.

@smeetrs
Owner

smeetrs commented Jul 14, 2020

I have replied to your queries in issues #5 and #6 respectively. Kindly open a new issue for questions unrelated to any of the current issues. Thanks. 😃

@smeetrs smeetrs closed this as completed Jul 14, 2020
@yuexianghubit
Author

No problem.
Thanks for answering my questions and sharing so many training details and tricks.
I'll try what you suggested. Thanks.

@kasri-mids

What is the expected val WER for these pretrain stages? I got down to 0.74 with PRETRAIN_NUM_WORDS of 1 but have not gotten past 0.85 for subsequent pretrain stages. I also notice that during the pretrains the val loss does not decrease and the model seems to overfit right from the start.

@smeetrs
Owner

smeetrs commented Jul 16, 2020

That is exactly what happens. The val WER will stay around 0.85 for some iterations and then gradually start decreasing after the 13- or 17-word iteration, down to about 0.77 at the 37-word iteration. It will drop further to approximately 0.7 after training on the train set.

The train loss does go much lower than the val loss, while the latter stays almost constant (or may increase slightly). However, as long as the val WER does not increase and stays almost constant, we can still say that there is no degradation in the generalization capability of the model. I tried adding regularization to the loss function as well, but that didn't improve the performance.

@kasri-mids

Thanks! I'll try to get this running all the way through and report back.

@smeetrs
Owner

smeetrs commented Jul 25, 2020

@kasri-mids were you able to achieve the specified WER after completely training the model?

@kasri-mids

kasri-mids commented Jul 25, 2020 via email

@yuexianghubit
Author

I performed the whole pretrain schedule as you described, and I also got about 57% WER. I think that's okay.

@smeetrs
Copy link
Owner

smeetrs commented Jul 26, 2020

That's great! I am happy that you both have been able to reproduce the results.
