How much training time? #4
Hi,

Early stopping is implemented by manually terminating the program once the scheduler has decreased the learning rate to 1e-6 and the validation set WER has stopped decreasing. After that, you just pick the best weights, based on validation set WER, from the saved checkpoints.

I trained the model on a single GeForce GTX 1080 Ti GPU. The numbers of words in the curriculum learning iterations were 1, 2, 3, 5, 7, 9, 13, 17, 21, 29, 37. I trained the AO and VO models first and used them to initialize the AV model (I haven't trained AV from scratch). With these settings, it took about 7 days to train each model.
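A minimal sketch of that stopping rule, assuming a ReduceLROnPlateau-style scheduler (the helper names and the `patience` value are placeholders, not the repository's actual code; in the thread this check was done by eye):

```python
import copy

# Curriculum stages (number of words per pretrain iteration), as listed above.
PRETRAIN_NUM_WORDS = [1, 2, 3, 5, 7, 9, 13, 17, 21, 29, 37]
MIN_LR = 1e-6  # the scheduler's learning-rate floor

def should_stop(optimizer, val_wer_history, patience=10):
    """True once the LR has decayed to the floor and val WER has plateaued.

    `patience` (how many validation checks count as "stopped decreasing")
    is an assumption made for this sketch.
    """
    lr = optimizer.param_groups[0]["lr"]
    if lr > MIN_LR or len(val_wer_history) <= patience:
        return False
    # No improvement in the last `patience` checks relative to earlier history.
    return min(val_wer_history[-patience:]) >= min(val_wer_history[:-patience])

def track_best(model, val_wer, best):
    """Keep the best weights by validation WER; `best` is {"wer": ..., "state": ...}."""
    if val_wer < best["wer"]:
        best = {"wer": val_wer, "state": copy.deepcopy(model.state_dict())}
    return best
```

After stopping, you would load `best["state"]` back into the model before moving on to the next curriculum stage or to evaluation.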
Thanks for your reply. The paper says the context vectors of the audio and video modalities are concatenated channel-wise and fed to feed-forward layers. I concatenated the two context vectors (e.g., both of dimension (32, 123, 512)) into (32, 123, 1024), used a linear layer to project back to (32, 123, 512), and then fed that to the feed-forward layers, but the performance is always slightly worse than the AO model. Do you have any idea how to implement the channel-wise concatenation? Any ideas or suggestions would be appreciated. Thanks!
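For reference, here is a minimal PyTorch sketch of the fusion described above (the module name `AVFusion` and the sizes are illustrative, not taken from the repository): channel-wise concatenation followed by a linear projection back to the model dimension.

```python
import torch
import torch.nn as nn

class AVFusion(nn.Module):
    """Concatenate audio and video context vectors channel-wise, then project."""
    def __init__(self, d_model=512):
        super().__init__()
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, audio_ctx, video_ctx):
        # audio_ctx, video_ctx: (batch, time, d_model)
        fused = torch.cat([audio_ctx, video_ctx], dim=-1)  # (batch, time, 2 * d_model)
        return self.proj(fused)                            # (batch, time, d_model)

# With the shapes mentioned above:
audio_ctx = torch.randn(32, 123, 512)
video_ctx = torch.randn(32, 123, 512)
out = AVFusion()(audio_ctx, video_ctx)  # -> torch.Size([32, 123, 512])
```

If this is already what you have, the issue is presumably elsewhere (e.g., initializing the AV model from the trained AO/VO weights, as discussed above) rather than in the concatenation itself.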
Hi,

```
Step: 182 || Tr.Loss: 3.239813 Val.Loss: 3.226672 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Epoch 183: reducing learning rate of group 0 to 1.0000e-06.
Step: 184 || Tr.Loss: 3.240253 Val.Loss: 3.238177 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 185 || Tr.Loss: 3.228107 Val.Loss: 3.234346 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 186 || Tr.Loss: 3.234290 Val.Loss: 3.216766 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 187 || Tr.Loss: 3.241915 Val.Loss: 3.232590 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 188 || Tr.Loss: 3.233189 Val.Loss: 3.228462 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 189 || Tr.Loss: 3.236741 Val.Loss: 3.223365 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 190 || Tr.Loss: 3.235876 Val.Loss: 3.216625 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 191 || Tr.Loss: 3.241944 Val.Loss: 3.242806 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 192 || Tr.Loss: 3.237240 Val.Loss: 3.243809 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
Step: 193 || Tr.Loss: 3.238747 Val.Loss: 3.219588 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
```
no problem.
What is the expected val WER for these pretrain stages? I got down to 0.74 for PRETRAIN_NUM_WORDS of 1 but haven't gotten past 0.85 for the subsequent pretrain stages. I also notice that the val loss does not decrease, and the model seems to be overfitting right from the get-go.
That is exactly what happens. The val WER will stay around 0.85 for some iterations and then gradually start decreasing after the 13- or 17-word iteration, down to about 0.77 at the 37-word iteration. It will drop further, to approx. 0.7, after training on the train set. The train loss does go much lower than the val loss while the latter stays almost constant (or may increase slightly). However, as long as the val WER does not increase and stays almost constant, we can still say there is no degradation in the generalization capability of the model. I had tried adding regularization to the loss function as well, but that didn't improve the performance.
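In case it helps others, a minimal sketch of the kind of loss regularization mentioned above, here an L2 weight penalty (the choice of L2 and the coefficient are assumptions; the comment above doesn't specify the form used):

```python
# Hypothetical helper: add an L2 penalty on the model weights to the base
# training loss. The coefficient 1e-5 is illustrative, not from the repo.
def regularized_loss(base_loss, model, l2_coeff=1e-5):
    l2 = sum(p.pow(2).sum() for p in model.parameters() if p.requires_grad)
    return base_loss + l2_coeff * l2
```

As noted, this did not improve performance in these experiments.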
Thanks! Will try to get this to run through and report back.
@kasri-mids are you able to achieve the specified WER after complete training of the model? |
I got to 57% WER for the VO model, close enough to your 55%. The pretraining wasn't done as well as it should have been, due to memory issues.
I performed the whole pretrain schedule as you described, and I also got about 57% WER. I think that's okay.
That's great! I am happy that you both have been able to reproduce the results. |
Hi there,
Thanks for the nice code!
I didn't find early stopping in your code, so after changing `PRETRAIN_NUM_WORDS` each time, will it run for 1000 steps/epochs? That would take many days to train the whole model.
May I know how long it took you to train the whole AV model?
Looking forward to your reply. Thanks!