
how much training time #4

Closed
yuexianghubit opened this issue Jul 14, 2020 · 12 comments
@yuexianghubit

Hi there,
Thanks for the nice code.

I didn't find early stopping in your code. So each time 'PRETRAIN_NUM_WORDS' is changed, will it run for 1000 steps/epochs? That would take many days to train the whole model.

May I know how long it took you to train the whole AV model?
Looking forward to your reply. Thanks!

@smeetrs
Owner

smeetrs commented Jul 14, 2020

Hi,
Thanks for your interest in the code.

Early stopping is implemented by manually terminating the program once the scheduler has decreased the learning rate to 1e-6 and the validation set WER has stopped decreasing. After that, you just pick the best weights based on validation set WER from the checkpoints/models directory. Most often, the model is trained for only around 200-300 steps.
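For illustration, a minimal sketch of picking the best checkpoint by validation WER. It assumes (hypothetically) that the saved filenames encode the validation WER, e.g. `step_0240-wer_0.412.pt`; adjust the pattern to whatever naming the training script actually uses:

```python
import re
from pathlib import Path

def best_checkpoint(ckpt_dir="checkpoints/models"):
    """Return the checkpoint file with the lowest validation WER.

    Assumes a hypothetical filename pattern like 'step_0240-wer_0.412.pt';
    change the regex to match the actual naming used by the training script.
    """
    pattern = re.compile(r"wer_([0-9.]+)\.pt$")
    scored = []
    for path in Path(ckpt_dir).glob("*.pt"):
        match = pattern.search(path.name)
        if match:
            scored.append((float(match.group(1)), path))
    if not scored:
        raise FileNotFoundError(f"no matching checkpoints found in {ckpt_dir}")
    return min(scored)[1]  # path of the checkpoint with the smallest WER
```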

I trained the model on a single GeForce GTX 1080 Ti GPU. The numbers of words in the curriculum learning iterations were 1, 2, 3, 5, 7, 9, 13, 17, 21, 29, 37. I trained the AO and VO models first and used them to initialize the AV model (I haven't trained the AV model from scratch). With these settings, it took about 7 days to train each model.
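As a rough sketch of that initialization step (the assumption that checkpoints store plain state dicts and that parameter names of shared sub-modules line up between the single-modality and AV models is illustrative, not taken from the repo):

```python
import torch
import torch.nn as nn

def init_av_from_ao_vo(av_model: nn.Module, ao_ckpt: str, vo_ckpt: str) -> nn.Module:
    """Initialize an AV model from separately trained AO and VO checkpoints.

    Assumes the checkpoints store plain state dicts and that parameter names
    of the shared sub-modules match between the single-modality and AV models;
    both are illustrative assumptions, not taken from the repo.
    """
    av_state = av_model.state_dict()
    for ckpt_path in (ao_ckpt, vo_ckpt):
        weights = torch.load(ckpt_path, map_location="cpu")
        for name, tensor in weights.items():
            # copy only parameters whose names and shapes match the AV model
            if name in av_state and av_state[name].shape == tensor.shape:
                av_state[name] = tensor
    av_model.load_state_dict(av_state)
    return av_model
```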

@yuexianghubit
Author

Thanks for your reply.
I noted that you implemented only the transformer-CTC architecture from the paper. I implemented the transformer-seq2seq version based on ESPnet, but the performance of the AV model is slightly worse than that of the AO model. I think it's due to the context vector fusion step.

The paper says the context vectors of the audio and video modalities are concatenated channel-wise and fed to feed-forward layers. I simply concatenated the two context vectors (e.g., both of dimension (32, 123, 512)) into (32, 123, 1024), applied a linear layer to map back to (32, 123, 512), and then fed the result to the feed-forward layers, but the performance is always slightly worse than the AO model. Do you have any idea how to implement the channel-wise concatenation?
I noted that you used a convolution layer to fuse the audio and video embeddings in the transformer-CTC. Should I use the same method to fuse the two context vectors?
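For concreteness, a minimal PyTorch sketch of the fusion I described, with the conv variant as an alternative (shapes taken from the example above; the class and argument names are just illustrative):

```python
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """Fuse audio and video context vectors by channel-wise concatenation."""

    def __init__(self, d_model=512):
        super().__init__()
        # linear fusion: (B, T, 2*d_model) -> (B, T, d_model)
        self.linear = nn.Linear(2 * d_model, d_model)
        # alternative: pointwise 1D convolution over time, analogous to the
        # conv layer used to fuse the audio/video embeddings in this repo
        self.conv = nn.Conv1d(2 * d_model, d_model, kernel_size=1)

    def forward(self, audio_ctx, video_ctx, use_conv=False):
        # audio_ctx, video_ctx: (B, T, d_model), e.g. (32, 123, 512)
        fused = torch.cat([audio_ctx, video_ctx], dim=-1)  # (B, T, 2*d_model)
        if use_conv:
            # Conv1d expects (B, C, T)
            return self.conv(fused.transpose(1, 2)).transpose(1, 2)
        return self.linear(fused)  # (B, T, d_model), fed to the FFN afterwards
```

Note that with kernel_size 1 the convolution is mathematically equivalent to the linear projection; a larger kernel would additionally mix information across neighboring time steps.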

Any idea or suggestions will be appreciated. Thanks!

@yuexianghubit
Author

Hi,
I am trying to train the video-only model. When 'PRETRAIN_NUM_WORDS' is 1, it seems that the WER on both the training and validation sets stays at 1 the whole time, with no improvement.

```
Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 182 || Tr.Loss: 3.239813 Val.Loss: 3.226672 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Epoch 183: reducing learning rate of group 0 to 1.0000e-06.
Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 183 || Tr.Loss: 3.241490 Val.Loss: 3.221430 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 184 || Tr.Loss: 3.240253 Val.Loss: 3.238177 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 185 || Tr.Loss: 3.228107 Val.Loss: 3.234346 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 186 || Tr.Loss: 3.234290 Val.Loss: 3.216766 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 187 || Tr.Loss: 3.241915 Val.Loss: 3.232590 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 188 || Tr.Loss: 3.233189 Val.Loss: 3.228462 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 189 || Tr.Loss: 3.236741 Val.Loss: 3.223365 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 190 || Tr.Loss: 3.235876 Val.Loss: 3.216625 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 191 || Tr.Loss: 3.241944 Val.Loss: 3.242806 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 192 || Tr.Loss: 3.237240 Val.Loss: 3.243809 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000

Train: 0%| | 0/512 [00:00<?, ?it/s]Step: 193 || Tr.Loss: 3.238747 Val.Loss: 3.219588 || Tr.CER: 1.000 Val.CER: 1.000 || Tr.WER: 1.000 Val.WER: 1.000
```
Is this situation normal?
Thanks for your suggestions.

@smeetrs
Owner

smeetrs commented Jul 14, 2020

I have replied to your queries in issues #5 and #6 respectively. Kindly open a new issue for questions unrelated to any of the current issues. Thanks. 😃

@smeetrs smeetrs closed this as completed Jul 14, 2020
@yuexianghubit
Author

No problem.
Thanks for answering my questions and sharing so many training details and tricks.
I'll try what you suggested. Thanks.

@kasri-mids

What is the expected val WER for these pretrain stages? I got down to 0.74 with PRETRAIN_NUM_WORDS of 1 but have not gotten past 0.85 for subsequent pretrain stages. I also notice that during the pretrains the val loss does not decrease and the model seems to overfit right from the start.

@smeetrs
Owner

smeetrs commented Jul 16, 2020

That is exactly what happens. The val WER will stay around 0.85 for some iterations and then gradually start decreasing after the 13- or 17-word iteration, down to about 0.77 at the 37-word iteration. It will drop further to approximately 0.7 after training on the train set.

The train loss does go much lower than the val loss, while the latter stays almost constant (or may increase slightly). However, as long as the val WER does not increase and stays almost constant, we can still say that there is no degradation in the generalization capability of the model. I tried adding regularization to the loss function as well, but that didn't improve the performance.

@kasri-mids

Thanks! I'll try to get this running all the way through and report back.

@smeetrs
Owner

smeetrs commented Jul 25, 2020

@kasri-mids were you able to achieve the specified WER after completely training the model?

@kasri-mids

kasri-mids commented Jul 25, 2020 via email

@yuexianghubit
Author

I performed the whole pretrain schedule as you described, and I also got about 57% WER. I think that's okay.

@smeetrs
Copy link
Owner

smeetrs commented Jul 26, 2020

That's great! I am happy that you both have been able to reproduce the results.
