F.vad batch consistency behaviour #1348
Comments
Thanks for bringing this up, @jcaw. Apologies for a delayed response. The current vad implementation is based on the behavior of sox and "Attempts to trim silence and quiet background sounds..." (http://sox.sourceforge.net/sox.html). Sox produces a truncated waveform, not a same-sized one filled with zeros. Since the sox implementation of vad has multi-channel support, batched input is treated as multiple channels, and the entire batch is truncated to the earliest start of voice activity in any channel. I see how the current behavior can be confusing and might require a more thorough design. For now we can disable batch consistency testing for vad and update the documentation explaining the current behavior. Thoughts, @mthrok and @vincentqb?
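To make the trimming behavior described above concrete, here is a minimal sketch (not torchaudio's or sox's actual code) of truncating every channel at the earliest voice-activity onset found in any channel. A simple per-sample magnitude threshold stands in for the real detector; the function names and the `threshold` parameter are illustrative.

```python
def first_active_index(channel, threshold=0.1):
    """Index of the first sample whose magnitude exceeds the threshold."""
    for i, sample in enumerate(channel):
        if abs(sample) > threshold:
            return i
    return len(channel)  # no activity found in this channel

def sox_style_vad_trim(channels, threshold=0.1):
    """Trim all channels at the earliest activity onset across channels."""
    start = min(first_active_index(ch, threshold) for ch in channels)
    return [ch[start:] for ch in channels]

# Two "channels": activity starts at index 2 in the first, index 4 in the second.
batch = [
    [0.0, 0.0, 0.5, 0.6, 0.7, 0.8],
    [0.0, 0.0, 0.0, 0.0, 0.9, 0.9],
]
trimmed = sox_style_vad_trim(batch)
# Both rows are cut at index 2 (the earlier onset), so the second row still
# carries leading zeros -- the batch is not trimmed per-row.
```

This illustrates why batched input can be surprising: rows are not trimmed independently, so two identical waveforms can come back different lengths depending on what else is in the batch.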
Thanks for the explanation. I do not think it makes sense to treat channels and samples the same here. For channels, we can reasonably make some strong assumptions, such as: audio tracks in different channels share the same time axis, and they record the same event, which can justify the implementation.

My opinion is that VAD as a library function should be used more for inference than training. This is because when you train a model, you typically need annotated data, and you do not want to mess with the time frame. On the other hand, at the application layer, you want VAD to determine which frame to feed to an ASR engine. For this kind of process, what you want is a detector function that tells whether an input frame is voice activity or not. So the return type is boolean, and it has to be fast. The VAD from WebRTC is one such popular implementation. https://chromium.googlesource.com/external/webrtc/+/518c683f3e413523a458a94b533274bd7f29992d/webrtc/modules/audio_processing/vad/voice_activity_detector.h IIRC, it was also used in a DeepSpeech example. https://github.com/mozilla/DeepSpeech-examples/blob/r0.9/mic_vad_streaming/mic_vad_streaming.py

Therefore I propose to remove it.
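A minimal sketch of the frame-wise detector interface described above: instead of trimming the waveform, return one boolean per frame. The energy threshold and frame length here are illustrative stand-ins, not the actual WebRTC algorithm.

```python
def frame_energies(waveform, frame_len):
    """Mean squared energy of each non-overlapping frame."""
    return [
        sum(s * s for s in waveform[i:i + frame_len]) / frame_len
        for i in range(0, len(waveform) - frame_len + 1, frame_len)
    ]

def is_speech(waveform, frame_len=4, threshold=0.01):
    """One boolean per frame: does the frame look like voice activity?"""
    return [e > threshold for e in frame_energies(waveform, frame_len)]

# Silence, then a burst of activity, then silence again.
wave = [0.0] * 8 + [0.5, -0.5, 0.4, -0.4] + [0.0] * 4
flags = is_speech(wave)  # [False, False, True, False]
```

The appeal of this shape for streaming use is that the caller decides what to do with each frame (feed it to ASR, drop it, buffer it), and the detector itself stays cheap and stateless.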
Filed the task for removing it.
Considering your points above, @mthrok, here is the suggestion:
Just to re-emphasize some of what has been said here: I agree with the comment above that we can disable the batch consistency test, and we need to make sure the docstring represents what is being done. As to how VAD should trim when presented with a batch, I'd like to get the opinion of someone who may have used this in the wild. @faroit, maybe you have an opinion on this?
Do you mean to suggest disabling batch + channel support (e.g. N x C x T)?
(I'd note that this interface would be different from apply_effects_tensor (though the latter does not apply to batches), which applies transformations from sox.)
Is there a performance cost when offsetting the application of a mask returned by VAD? I'd expect elementwise multiplication by a boolean to be fairly cheap.
yes. this way we don't have to be opinionated about multi-channel behavior. the user code can re-shape a 3D tensor that has channels into 2D, apply the transform, and then apply a resulting mask and re-shape the output in a way that makes sense for downstream task: it could be an iterator over samples trimmed to voice activity in either channel, or elementwise multiplication, or something else.
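The user-side pattern described above can be sketched as follows, assuming a hypothetical mask-returning `vad_mask` (one boolean per sample; the name and threshold are made up for illustration). A 3D batch of shape (N, C, T) is flattened to 2D, the mask is applied elementwise, and the result is reshaped back, so no trimming decision is baked into the library.

```python
def vad_mask(row, threshold=0.1):
    """Hypothetical per-sample mask: True from the first active sample onward."""
    active = False
    mask = []
    for sample in row:
        active = active or abs(sample) > threshold
        mask.append(active)
    return mask

def apply_vad_3d(batch):
    """batch: list of N items, each C x T. Mask each row independently."""
    n, c = len(batch), len(batch[0])
    flat = [row for item in batch for row in item]        # (N*C) x T
    masked = [
        [s if keep else 0.0 for s, keep in zip(row, vad_mask(row))]
        for row in flat
    ]
    return [masked[i * c:(i + 1) * c] for i in range(n)]  # back to N x C x T

batch = [[[0.0, 0.0, 0.3, 0.4],
          [0.0, 0.2, 0.2, 0.1]]]   # N=1, C=2, T=4
out = apply_vad_3d(batch)
# Each row is zeroed independently up to its own activity onset, and the
# caller chooses how to combine or trim the per-channel results afterward.
```

Because the mask is returned rather than applied destructively, the same output could instead be turned into per-row trims, an any-channel union, or whatever the downstream task needs.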
agreed. should we keep backend-specific function signatures aligned with conventions of their respective backends?
i would expect that too. offsetting the application of a mask returned by VAD is proposed to let the user implement the behavior that makes the most sense for a given task.
i'd note however that disabling batching is BC breaking, since we silently have been opinionated.
we are not bound by constraints and design decisions from other backends, and it is more important to be consistent in how we present the interfaces to the users. we can always show examples say in the docstring of how to replicate with other backends -- or even let tests document the mapping :)
the options to address the batch consistency behavior that I see are the following, and please feel free to comment if i'm missing any,
I see "changing the interface to return a mask" as addressing a different issue, since we still end up having to decide how batching would be supported by the function. It may make going for option 1 (or 3) easier. There are many "solid science-backed" ways to do VAD (and note that we have two already: the one discussed above and the one in the example). Having multiple algorithms available to the user for VAD (and even adding new ones like WebRTC's mentioned above) adds to the richness of the library. I'd love to hear from our users on which algorithms are desired (opened #1468 for this). The most important thing is to document what happens and communicate to the user how the code has been and is working; the rest can wait for strong feedback from the community. I favor option 2 (leaving this as is, but adding docs), until we hear such feedback.
@astaff and @mthrok -- I see PR #1513 merged with a mention that the PR should bring this issue to a conclusion, but there are no comments about this here. It looks like option 1 was selected through the PR. However, the warning message is missing the alternative steps the user could follow to reproduce the previous behavior. This could be done in the description of the relevant issue to make it easier for the user to follow.
The change in #1513 does not change the behavior, only adds a warning and it seems to be closer to no. 2 proposed above. |
I am not sure what you two are referring to.
I isolated a potential issue with batch consistency and F.vad in #1341:
@mthrok replied:
Opening a dedicated issue so the discussion can continue here.