-
Hi, can you give an example of the input waveform lengths vs. the output feature lengths of your encoder? For Q1, the downsample-rate property is useful for preparing training labels when we want frame-level classification/regression, as in speaker diarization, speech enhancement, and separation.
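As an illustration of that point, here is a minimal sketch (not s3prl's actual code) of turning per-sample labels into per-frame labels with a known downsample rate; `frame_labels` and the majority-vote rule are illustrative assumptions:

```python
import numpy as np

def frame_labels(sample_labels: np.ndarray, downsample_rate: int) -> np.ndarray:
    """Collapse per-sample labels (e.g., speaker activity at 16 kHz)
    to per-frame labels matching the encoder's output resolution."""
    num_frames = len(sample_labels) // downsample_rate
    trimmed = sample_labels[: num_frames * downsample_rate]
    # Majority vote within each frame-sized window of samples.
    frames = trimmed.reshape(num_frames, downsample_rate)
    return (frames.mean(axis=1) > 0.5).astype(np.int64)

# Example: 3 s of audio at 16 kHz with a hop of 320 samples -> 150 frames.
labels = np.zeros(48000, dtype=np.int64)
labels[16000:32000] = 1  # speaker active from 1 s to 2 s
print(frame_labels(labels, 320).shape)  # (150,)
```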
-
Hi, the output feature length is fixed: for any input sample, we always get a feature of the same length. I mentioned 1400-2000 because we have several different encoders with different output lengths, but for any one encoder the output length is fixed.
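For illustration, a hypothetical encoder (not the poster's actual model) can produce a fixed number of output frames regardless of input duration, e.g., via adaptive pooling; `FixedLengthEncoder`, `out_frames`, and the layer sizes below are all assumed:

```python
import torch
import torch.nn as nn

class FixedLengthEncoder(nn.Module):
    def __init__(self, out_frames: int = 1500, dim: int = 256):
        super().__init__()
        self.conv = nn.Conv1d(1, dim, kernel_size=10, stride=5)
        self.pool = nn.AdaptiveAvgPool1d(out_frames)  # forces a fixed length

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        x = self.conv(wav.unsqueeze(1))      # (B, dim, T'), T' depends on input
        return self.pool(x).transpose(1, 2)  # (B, out_frames, dim), always fixed

enc = FixedLengthEncoder()
print(enc(torch.randn(1, 48000)).shape)   # 3 s  -> torch.Size([1, 1500, 256])
print(enc(torch.randn(1, 320000)).shape)  # 20 s -> torch.Size([1, 1500, 256])
```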
-
I think this topic should be moved to a discussion (instead of an issue), and the (new) documentation should clarify the role of `downsample_rate`.
-
You mean that whether it is a 3-second or a 20-second utterance, your encoder always produces the same output sequence length? Then this representation would not be suitable for tasks that require variable-length representations, like ASR and diarization, right?
-
Hi, I'm adding my encoder to the upstream models, and I set downsample_rate = 320, but the length check does not run correctly (see the sketch below).
My output features are about 1400-2000 frames long, which is longer than feature_len = round(len(wav) / self.downsample_rate) (about 200-300). When I tried to have the downsample rate computed dynamically, it seemed the SUPERB code does not support this yet.
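Here is a hedged reconstruction of the kind of length check that fails; the exact code block from the original post is not shown in the thread, so the names, tolerance, and interface are illustrative assumptions rather than s3prl's actual API:

```python
# Illustrative reconstruction of the failing length check (assumed names).
def check_feature_length(wav_len: int, feature_len: int,
                         downsample_rate: int, tolerance: int = 1) -> None:
    expected = round(wav_len / downsample_rate)
    if abs(feature_len - expected) > tolerance:
        raise ValueError(f"expected ~{expected} frames, got {feature_len}")

# With downsample_rate = 320, a 3 s (48000-sample) input expects
# round(48000 / 320) = 150 frames, but the encoder emits ~1400-2000,
# so the check fails.

# For Q2: if the rate truly varies per utterance, it can be measured
# dynamically instead of being declared as a fixed constant.
def effective_downsample_rate(wav_len: int, feature_len: int) -> float:
    return wav_len / feature_len

print(effective_downsample_rate(48000, 1500))  # 32.0
```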
Q1: Why do we need a downsample rate?
Q2: Am I missing the right way to use a dynamic downsample rate?
Thank you!