Importance of utterance lengths #156
Related question: I know CMVN (cepstral mean and variance normalization) can suffer from short utterances. Has anyone trained on a similar dataset (short utterances, one utterance per speaker)? Thanks, all!
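For context, per-utterance CMVN just standardizes each feature dimension using statistics from that utterance alone, which is exactly why short utterances are problematic: few frames give noisy mean and variance estimates. A minimal sketch (assuming features as a NumPy array, not Kaldi's actual implementation):

```python
import numpy as np

def cmvn(feats):
    """Per-utterance cepstral mean and variance normalization.

    feats: (num_frames, feat_dim) array of e.g. MFCCs.
    """
    mean = feats.mean(axis=0)
    std = feats.std(axis=0) + 1e-8  # guard against zero variance
    return (feats - mean) / std
```

With only a handful of frames, `mean` and `std` are estimated from very little data, so the normalized features fluctuate a lot from utterance to utterance.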
Yes, CMVN can be sensitive to short utterances. You may want to smooth the statistics, or use a sliding window, if your data supports that.
We did some experiments in the lorelei branch (new files in the featbin directory) that used power (signal energy) to determine which frames to compute the CMVN statistics on, but they were ultimately inconclusive. The process is to get an alignment (using Kaldi in this case), compute the CMVN on the non-silence frames only, and then apply it to all frames. Alternatively, you can fake the alignment using power (signal energy) alone, or some other criterion, and determine the non-silence frames from that. The purpose is to make the CMVN calculation independent of the actual segmentation, which may be arbitrary.
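The energy-based variant described above can be sketched as follows (a toy illustration, not the lorelei-branch code; the dB threshold and feature layout are assumptions):

```python
import numpy as np

def energy_vad_cmvn(feats, energy, threshold_db=-30.0):
    """Compute CMVN statistics on high-energy (non-silence) frames only,
    then apply them to ALL frames, as described in the comment above.

    feats:  (num_frames, feat_dim) feature matrix
    energy: (num_frames,) per-frame log-energy in dB
    """
    # Frames within threshold_db of the utterance's peak energy
    # are treated as speech (a crude, energy-only "fake alignment").
    voiced = energy > energy.max() + threshold_db
    if not voiced.any():
        # Fall back to all frames for effectively silent clips.
        voiced = np.ones(len(feats), dtype=bool)
    mean = feats[voiced].mean(axis=0)
    std = feats[voiced].std(axis=0) + 1e-8  # avoid division by zero
    return (feats - mean) / std
```

A real setup would replace the energy threshold with frame labels from a forced alignment, but the normalize-on-speech, apply-to-everything structure is the same.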
Let me know if this works for you, we’d be interested in an update as well.
— Florian Metze, Carnegie Mellon University (Nov 25, 2017, via email)
Thank you, @fmetze! This Kaldi module applies a sliding window for CMVN computation: http://kaldi-asr.org/doc/apply-cmvn-sliding_8cc.html

However, I don't understand the advantage of sliding windows. Is it simply a kind of data augmentation?

As for running CMVN on voiced frames only, I could try a few voice activity detection algorithms I have at hand. I will first run the experiment with plain CMVN, then try these optimizations if needed. In any case, I'll keep you apprised. Thanks again for your prompt and detailed answer!
The sliding window should typically be a few seconds long, no? Then it just computes some local context and assumes that the speaker characteristics don't change quickly. For talks or telephony speech, this is certainly true; for meetings, it may be less so. Keep me posted - I've always wanted to look into this, too.
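The sliding-window idea above can be sketched like this (a simplified illustration, not Kaldi's apply-cmvn-sliding; the 300-frame default assumes a 10 ms frame shift, i.e. roughly a 3-second window):

```python
import numpy as np

def sliding_cmvn(feats, window=300):
    """Normalize each frame with mean/variance from a centered
    window of `window` frames (~3 s at a 10 ms frame shift),
    so normalization only assumes locally stable speaker stats.

    feats: (num_frames, feat_dim)
    """
    n = len(feats)
    out = np.empty(feats.shape, dtype=float)
    half = window // 2
    for t in range(n):
        # Window is truncated at utterance boundaries.
        lo, hi = max(0, t - half), min(n, t + half + 1)
        ctx = feats[lo:hi]
        out[t] = (feats[t] - ctx.mean(axis=0)) / (ctx.std(axis=0) + 1e-8)
    return out
```

Each frame is normalized against its own local context, so slow drifts (channel changes, speaker turns in long recordings) are tracked instead of being averaged over the whole utterance.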
A quick update: with regular CMVN and no sliding window, the phonetic model reaches 90% token accuracy on the training set and 79% on the cross-validation set. So the model learns fairly well despite the short utterances and there being only one utterance per speaker.
The utterances in the TEDLIUM dataset roughly range from 8 to 15 seconds.
I have a dataset with shorter utterances, ~5 to 10 seconds long.
What are the optimal and minimum lengths of utterances for RNN-CTC?