Importance of utterance lengths #156
Related question: I know CMVN (cepstral mean and variance normalization) can suffer from short utterances. Has anyone trained on a similar dataset (short utterances, one utterance per speaker)? Thanks, all!
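For context, per-utterance CMVN just standardizes each feature dimension using statistics from that utterance alone, which is exactly why short utterances are problematic: few frames give noisy mean and variance estimates. A minimal sketch (assuming features as a NumPy array, not Kaldi's actual implementation):

```python
import numpy as np

def cmvn(feats):
    """Per-utterance cepstral mean and variance normalization.

    feats: (num_frames, feat_dim) array of e.g. MFCCs.
    """
    mean = feats.mean(axis=0)
    std = feats.std(axis=0) + 1e-8  # guard against zero variance
    return (feats - mean) / std
```

With only a handful of frames, `mean` and `std` are estimated from very little data, so the normalized features fluctuate a lot from utterance to utterance.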
Yes, CMVN can be sensitive to short utterances. You may want to smooth the statistics, or use a sliding window, if your data supports that.
We did some experiments in the lorelei branch (new files in the featbin directory) that used power (signal energy) to determine which frames to compute the CMVN statistics on, but they were ultimately inconclusive. The process is to get an alignment (using Kaldi in this case), compute the CMVN on the non-silence frames only, and then apply it to all frames. Alternatively, you can fake the alignment using power (signal energy) alone, or some other criterion, and determine the non-silence frames from that. The purpose is to make the CMVN calculation independent of the actual segmentation, which may be arbitrary.
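The energy-based variant described above can be sketched as follows (a toy illustration, not the lorelei-branch code; the dB threshold and feature layout are assumptions):

```python
import numpy as np

def energy_vad_cmvn(feats, energy, threshold_db=-30.0):
    """Compute CMVN statistics on high-energy (non-silence) frames only,
    then apply them to ALL frames, as described in the comment above.

    feats:  (num_frames, feat_dim) feature matrix
    energy: (num_frames,) per-frame log-energy in dB
    """
    # Frames within threshold_db of the utterance's peak energy
    # are treated as speech (a crude, energy-only "fake alignment").
    voiced = energy > energy.max() + threshold_db
    if not voiced.any():
        # Fall back to all frames for effectively silent clips.
        voiced = np.ones(len(feats), dtype=bool)
    mean = feats[voiced].mean(axis=0)
    std = feats[voiced].std(axis=0) + 1e-8  # avoid division by zero
    return (feats - mean) / std
```

A real setup would replace the energy threshold with frame labels from a forced alignment, but the normalize-on-speech, apply-to-everything structure is the same.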
Let me know if this works for you, we’d be interested in an update as well.
— Florian Metze, Carnegie Mellon University (Nov 25, 2017, via email)
Thank you, @fmetze! This Kaldi module applies a sliding window for CMVN computation: http://kaldi-asr.org/doc/apply-cmvn-sliding_8cc.html

However, I don't understand the advantage of sliding windows. Is it simply a kind of data augmentation?

As for running CMVN on voiced frames only, I could try a few voice activity detection algorithms I have at hand. I will first run the experiment with plain CMVN, then try these optimizations if needed. In any case, I'll keep you apprised. Thanks again for your prompt and detailed answer!
The sliding window should typically be a few seconds long, no? Then it just computes some local context and assumes that the speaker characteristics don't change quickly. For talks or telephony speech, this is certainly true; for meetings, it may be less so. Keep me posted - I've always wanted to look into this, too.
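The sliding-window idea above can be sketched like this (a simplified illustration, not Kaldi's apply-cmvn-sliding; the 300-frame default assumes a 10 ms frame shift, i.e. roughly a 3-second window):

```python
import numpy as np

def sliding_cmvn(feats, window=300):
    """Normalize each frame with mean/variance from a centered
    window of `window` frames (~3 s at a 10 ms frame shift),
    so normalization only assumes locally stable speaker stats.

    feats: (num_frames, feat_dim)
    """
    n = len(feats)
    out = np.empty(feats.shape, dtype=float)
    half = window // 2
    for t in range(n):
        # Window is truncated at utterance boundaries.
        lo, hi = max(0, t - half), min(n, t + half + 1)
        ctx = feats[lo:hi]
        out[t] = (feats[t] - ctx.mean(axis=0)) / (ctx.std(axis=0) + 1e-8)
    return out
```

Each frame is normalized against its own local context, so slow drifts (channel changes, speaker turns in long recordings) are tracked instead of being averaged over the whole utterance.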
A quick update: with regular CMVN and no sliding window, the phonetic model reaches 90% token accuracy on the training set and 79% on the cross-validation set. So the model learns fairly well despite the short utterances and there being only one utterance per speaker.
The utterances in the TEDLIUM dataset roughly range from 8 to 15 seconds.
I have a dataset with shorter utterances, ~5 to 10 seconds long.
What are the optimal and minimum lengths of utterances for RNN-CTC?