-
Hi, can you give an example of the input waveform lengths vs. the output feature lengths of your encoder? For Q1, the downsample-rate property is useful for preparing training labels when we want frame-level classification/regression, as in speaker diarization, speech enhancement, and separation.
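As an illustration of that point, here is a minimal sketch (not s3prl's actual code) of turning per-sample labels into per-frame labels with a known downsample rate; `frame_labels` and the majority-vote rule are illustrative assumptions:

```python
import numpy as np

def frame_labels(sample_labels: np.ndarray, downsample_rate: int) -> np.ndarray:
    """Collapse per-sample labels (e.g., speaker activity at 16 kHz)
    to per-frame labels matching the encoder's output resolution."""
    num_frames = len(sample_labels) // downsample_rate
    trimmed = sample_labels[: num_frames * downsample_rate]
    # Majority vote within each frame-sized window of samples.
    frames = trimmed.reshape(num_frames, downsample_rate)
    return (frames.mean(axis=1) > 0.5).astype(np.int64)

# Example: 3 s of audio at 16 kHz with a hop of 320 samples -> 150 frames.
labels = np.zeros(48000, dtype=np.int64)
labels[16000:32000] = 1  # speaker active from 1 s to 2 s
print(frame_labels(labels, 320).shape)  # (150,)
```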
-
Hi, the output feature length is fixed: for any input sample, we always get a feature of the same length. I mentioned 1400-2000 because we have several different encoders with different output lengths, but for any one encoder the output length is fixed.
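For illustration, a hypothetical encoder (not the poster's actual model) can produce a fixed number of output frames regardless of input duration, e.g., via adaptive pooling; `FixedLengthEncoder`, `out_frames`, and the layer sizes below are all assumed:

```python
import torch
import torch.nn as nn

class FixedLengthEncoder(nn.Module):
    def __init__(self, out_frames: int = 1500, dim: int = 256):
        super().__init__()
        self.conv = nn.Conv1d(1, dim, kernel_size=10, stride=5)
        self.pool = nn.AdaptiveAvgPool1d(out_frames)  # forces a fixed length

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        x = self.conv(wav.unsqueeze(1))      # (B, dim, T'), T' depends on input
        return self.pool(x).transpose(1, 2)  # (B, out_frames, dim), always fixed

enc = FixedLengthEncoder()
print(enc(torch.randn(1, 48000)).shape)   # 3 s  -> torch.Size([1, 1500, 256])
print(enc(torch.randn(1, 320000)).shape)  # 20 s -> torch.Size([1, 1500, 256])
```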
-
I think this topic should be moved to a discussion (instead of an issue), and the (new) documentation should clarify the role of `downsample_rate`.
-
You mean that whether it is a 3-second or a 20-second utterance, your encoder always produces the same output sequence length? Then this representation would not be suitable for tasks that require variable-length representations, like ASR and diarization, right?
-
Hi, I'm adding my encoder to the upstream models, and I set downsample_rate = 320, but the length check does not run correctly (see the sketch below).
My output features are about 1400-2000 frames long, which is longer than feature_len = round(len(wav) / self.downsample_rate) (about 200-300). When I tried to have the downsample rate computed dynamically, it seemed the SUPERB code does not support this yet.
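Here is a hedged reconstruction of the kind of length check that fails; the exact code block from the original post is not shown in the thread, so the names, tolerance, and interface are illustrative assumptions rather than s3prl's actual API:

```python
# Illustrative reconstruction of the failing length check (assumed names).
def check_feature_length(wav_len: int, feature_len: int,
                         downsample_rate: int, tolerance: int = 1) -> None:
    expected = round(wav_len / downsample_rate)
    if abs(feature_len - expected) > tolerance:
        raise ValueError(f"expected ~{expected} frames, got {feature_len}")

# With downsample_rate = 320, a 3 s (48000-sample) input expects
# round(48000 / 320) = 150 frames, but the encoder emits ~1400-2000,
# so the check fails.

# For Q2: if the rate truly varies per utterance, it can be measured
# dynamically instead of being declared as a fixed constant.
def effective_downsample_rate(wav_len: int, feature_len: int) -> float:
    return wav_len / feature_len

print(effective_downsample_rate(48000, 1500))  # 32.0
```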
Q1: Why do we need a downsample rate?
Q2: Am I missing the right way to use a dynamic downsample rate?
Thank you!