
Importance of utterance lengths #156

Open
ericbolo opened this issue Nov 25, 2017 · 5 comments

Comments

@ericbolo

ericbolo commented Nov 25, 2017

The utterances in the TEDLIUM dataset roughly range from 8 to 15 seconds.

I have a dataset with shorter utterances, ~5 to 10 seconds long.

What are the optimal and minimum lengths of utterances for RNN-CTC?

@ericbolo
Author

ericbolo commented Nov 25, 2017

Related question: I know CMVN (cepstral mean and variance normalization) can suffer from short utterances.
In my current dataset I have only one utterance per speaker.

Has anyone trained on a similar dataset (short utterances, one utterance per speaker)?

Thanks, all!
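For reference, standard per-utterance CMVN estimates the mean and variance from the frames of a single utterance, so a ~5-second clip (roughly 500 frames at a 10 ms shift) gives noisier statistics than a longer one. A minimal sketch, assuming NumPy and MFCC-like features of shape `(num_frames, num_ceps)` (the names here are illustrative, not from any particular toolkit):

```python
import numpy as np

def cmvn(feats, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization.

    feats: (num_frames, num_ceps) array of cepstral features.
    Returns features with per-coefficient zero mean and unit variance,
    estimated from this utterance alone.
    """
    mean = feats.mean(axis=0)
    std = feats.std(axis=0)
    # Guard against zero variance on degenerate (e.g. silent) channels.
    return (feats - mean) / np.maximum(std, eps)

# Example: normalize a synthetic 5-second utterance (500 frames, 13 ceps).
feats = np.random.randn(500, 13) * 3.0 + 5.0
normed = cmvn(feats)
```

The shorter the utterance, the higher the variance of `mean` and `std` as estimators, which is the usual reason CMVN degrades on short clips.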

@fmetze
Contributor

fmetze commented Nov 26, 2017 via email

@ericbolo
Author

Thank you, @fmetze !

This Kaldi module applies a sliding window for CMVN computation: http://kaldi-asr.org/doc/apply-cmvn-sliding_8cc.html

However, I don't understand the advantage of sliding windows. Is it simply a kind of data augmentation?

As for running CMVN on voiced frames only, I could try using a few voice activity detection algorithms I have at hand.

I will first run the experiment with simple CMVN, then try these optimizations if needed. In any case, I'll keep you apprised. Thanks again for your prompt and detailed answer!
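On the sliding-window question: as I understand it, the point is not data augmentation but locality. The statistics for each frame come from a fixed-size window around it rather than from the whole utterance, which makes the normalization usable in online/streaming decoding and less sensitive to utterance length. A rough sketch of the idea (loosely mirroring Kaldi's apply-cmvn-sliding; the `window=600` default and the centred-window behaviour here are assumptions, not a faithful reimplementation):

```python
import numpy as np

def sliding_cmvn(feats, window=600, norm_vars=False, eps=1e-8):
    """Sliding-window CMVN sketch.

    Each frame is normalized by the mean (and optionally the standard
    deviation) of a window of frames centred on it, clipped at the
    utterance boundaries.
    """
    n = feats.shape[0]
    out = np.empty_like(feats, dtype=np.float64)
    half = window // 2
    for t in range(n):
        lo = max(0, t - half)
        hi = min(n, t + half + 1)
        chunk = feats[lo:hi]
        out[t] = feats[t] - chunk.mean(axis=0)
        if norm_vars:
            out[t] /= np.maximum(chunk.std(axis=0), eps)
    return out

# On an utterance shorter than the window, the window covers every frame,
# so this degenerates to plain per-utterance mean subtraction.
feats = np.random.randn(400, 13) + 2.0
normed = sliding_cmvn(feats)
```

For utterances much shorter than the window, the result is close to ordinary whole-utterance CMN, which may be why the choice matters less on short clips.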

@fmetze
Contributor

fmetze commented Nov 28, 2017 via email

@ericbolo
Author

ericbolo commented Dec 6, 2017

A quick update: with regular CMVN and no sliding window, the phonetic model reaches 79% token accuracy. So the model learns fairly well despite the short utterances and having only one utterance per speaker.

(edit: to be more precise, it reaches 90% token accuracy on the training set and 79% on the cross-validation set)
