
Time spent on training #3

Closed
kunshou123 opened this issue Aug 21, 2019 · 5 comments

@kunshou123

@sailordiary hi, thank you for your code!
I want to know whether the model converges quickly. I am training a Dense3D model, but the loss does not decrease quickly. Do you know of any good lipreading models that take little time to train? Thank you!

@sailordiary
Owner

Hi,

LipNet is one of the smallest lip reading models that I know of. It converges fairly quickly; however, the GRID dataset is small, so it might take a bit longer to train to optimal performance (e.g. to match the WERs reported in the paper).

On the other hand, you can also experiment with simple 2D video encoders like VGG-M; I think those converge quickly too.
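
For anyone new to the model, here is a minimal PyTorch sketch of a LipNet-style architecture (a 3D convolutional frontend feeding a bidirectional GRU with a CTC head). The layer shapes below are illustrative approximations of the paper's configuration, not this repository's exact code:

```python
import torch
import torch.nn as nn

class LipNetSketch(nn.Module):
    """LipNet-style model: 3D conv frontend -> bidirectional GRU -> CTC head.

    Expects clips shaped (batch, 3, T, 50, 100), i.e. T frames of
    100x50 RGB mouth crops; channel counts roughly follow the paper.
    """

    def __init__(self, vocab_size: int = 28):  # e.g. characters + CTC blank
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv3d(3, 32, (3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, (3, 5, 5), stride=1, padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 96, (3, 3, 3), stride=1, padding=(1, 1, 1)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
        )
        # 96 channels x 3 x 6 spatial cells for 100x50 input crops;
        # other crop sizes change this GRU input dimension.
        self.gru = nn.GRU(96 * 3 * 6, 256, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.frontend(x)                      # (B, 96, T, 3, 6)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.gru(feats)                      # (B, T, 512)
        return self.fc(out).log_softmax(-1)           # per-frame log-probs for CTC loss
```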

@sailordiary
Owner

sailordiary commented Aug 30, 2019

@kunshou123 , some updates: I just reproduced the overlapped speakers setup in the paper. It took 91 epochs to reach the "Baseline-NoLM" results reported in the paper (notably, I used greedy decoding, not beam search decoding, so the actual performance could be even better). Training takes about 40 min per epoch, using the parameter settings in the current revision, which are taken directly from the authors.
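
For context, greedy (best-path) CTC decoding just takes the argmax class per frame, collapses consecutive repeats, and drops blanks. A minimal sketch, assuming log-probabilities of shape (T, num_classes) and blank index 0 (the decoder in this repository may differ):

```python
import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0):
    """Greedy (best-path) CTC decoding.

    log_probs: (T, num_classes) per-frame log-probabilities.
    Returns the label sequence with repeats collapsed and blanks removed.
    """
    best_path = log_probs.argmax(dim=-1).tolist()  # most likely class per frame
    decoded, prev = [], blank
    for token in best_path:
        # Emit a token only when it differs from the previous frame's
        # prediction and is not the CTC blank.
        if token != prev and token != blank:
            decoded.append(token)
        prev = token
    return decoded
```

Beam search keeps multiple prefix hypotheses instead of the single best path, which is why it can only match or improve on these greedy-decoding numbers.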

For those who happen to have dropped by, I plan to release the pre-trained checkpoints, as soon as I have the time to clean the dataset preparation code (clearly, preprocessing matters for the model to be useful).
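
Until the preparation code is released, here is a hypothetical sketch of the usual mouth-ROI cropping step using dlib's 68-point landmarks. The landmark model file and crop size here are assumptions for illustration, not this repo's actual settings:

```python
from typing import Optional

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Assumed path to the standard dlib 68-point landmark model.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_mouth(frame: np.ndarray, size: int = 100) -> Optional[np.ndarray]:
    """Crop a size x size mouth region centered on the lip landmarks."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None  # no face detected in this frame
    shape = predictor(gray, faces[0])
    # Points 48-67 are the mouth in the 68-point landmark scheme.
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    cx, cy = pts.mean(axis=0).astype(int)
    half = size // 2
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    return frame[y0:y0 + size, x0:x0 + size]
```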

@sailordiary
Owner

Here are the training curves for overlapped speakers, if anyone's interested. (The discontinuities were accidental; I restored optimizer states.)
[Image: training_curve — training curves for the overlapped speakers setup]

@kunshou123
Author

Oh! Thank you very much for your reply. I have been confused about this for a long time. I am a novice in lipreading, and my 3D CNN training results are very bad; the loss does not decrease significantly. I will try it.

@WeicongChen

> @kunshou123 , some updates: I just reproduced the overlapped speakers setup in the paper. It took 91 epochs to reach the "Baseline-NoLM" results reported in the paper (notably, I used greedy decoding, not beam search decoding, so the actual performance could be even better). Training takes about 40 min per epoch, using the parameter settings in the current revision, which are taken directly from the authors.
>
> For those who happen to have dropped by, I plan to release the pre-trained checkpoints, as soon as I have the time to clean the dataset preparation code (clearly, preprocessing matters for the model to be useful).

Hi, can you kindly share your preprocessing code? I am struggling with reproducing LipNet these days.
