F.vad batch consistency behaviour #1348

Open · jcaw opened this issue Mar 3, 2021 · 10 comments

@jcaw (Contributor) commented Mar 3, 2021

I isolated a potential issue with batch consistency and F.vad in #1341:

> [Describing the change] Tweak the VAD consistency tests by using longer samples for the random test, and lower thresholds for the file test. This now causes many of the VAD tests to fail.

This has me thinking: if the purpose of F.vad is to outright crop the input tensor (not, for example, zero it out), shouldn't we expect it to produce different results on a single sample vs. a batch? I believe my tweaks just create the right conditions for that to happen, but I might be misreading. If this is the case, should F.vad be removed from the batch consistency tests?

@mthrok replied:

> I see. I did not review the design of F.vad / T.Vad in detail, but your finding suggests that the design of T.Vad is ill-formed. (I thought that F.vad would zero out the non-vocal part.) To me it does not make sense to batch results that could each be cropped to a different length. We should check the behavior of the sox command; @astaff, do you have an idea of how it is supposed to behave?

Opening a dedicated issue so the discussion can continue here.

@astaff (Contributor) commented Apr 7, 2021

Thanks for bringing this up, @jcaw. Apologies for the delayed response.

The current vad implementation is based on the behavior of sox, which "[a]ttempts to trim silence and quiet background sounds..." (http://sox.sourceforge.net/sox.html). Sox produces a truncated waveform, not a same-sized waveform filled with zeros.

Since the sox implementation of vad has multi-channel support, batched input is treated as multiple channels, and the entire batch is truncated to the earliest start of voice activity in any channel.
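
To make the discrepancy concrete, here is a minimal sketch (the clip construction is made up for illustration, and the exact trim points depend on F.vad's parameters):

```python
import math

import torch
import torchaudio.functional as F

sample_rate = 16000
t = torch.arange(sample_rate) / sample_rate
tone = torch.sin(2 * math.pi * 440 * t)

# Two clips whose voice activity starts at different offsets.
a = torch.cat([torch.zeros(4000), tone])   # activity starts early
b = torch.cat([torch.zeros(12000), tone])  # activity starts late

# Applied per sample, each clip is trimmed to its own onset...
len_a = F.vad(a.unsqueeze(0), sample_rate).shape[-1]
len_b = F.vad(b.unsqueeze(0), sample_rate).shape[-1]

# ...but a batch is treated like a multi-channel signal, so both rows
# are trimmed to the earliest onset found in any row.
len_batched = F.vad(torch.stack([a, b]), sample_rate).shape[-1]

print(len_a, len_b, len_batched)  # len_b != len_batched in general
```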

I see how the current behavior can be confusing and might require a more thorough design. For now, we can disable batch consistency testing for vad and update the documentation to explain the current behavior.

Thoughts, @mthrok and @vincentqb?

@mthrok (Collaborator) commented Apr 12, 2021

@astaff

Thanks for the explanation.

I do not think it makes sense to treat channels and samples the same here. For channels, we can reasonably make some strong assumptions (for example, that audio tracks in different channels share the same time axis and record the same event), which can justify the implementation of F.vad. For samples, no such assumption applies, especially when training a model and randomly shuffling audio files from various recording situations.

So wrapping F.vad in T.VAD just because other functionals for filtering do so does not make sense. If we want VAD for batch processing, we need to think through the expected usage before we start designing it. VAD itself is a big topic in the audio application domain and an area of active research and production work; there are novel NN approaches published for VAD.

My opinion is that VAD as a library function should be used more for inference than for training, because when you train a model you typically have annotated data and do not want to mess with the time frame.

On the other hand, at the application layer, you want VAD to determine which frames to feed to an ASR engine. For that kind of process, what you want is a detector function that tells whether an input frame contains voice activity or not. So the return type is boolean, and it has to be fast. The VAD from WebRTC is one such popular implementation: https://chromium.googlesource.com/external/webrtc/+/518c683f3e413523a458a94b533274bd7f29992d/webrtc/modules/audio_processing/vad/voice_activity_detector.h IIRC, it was also used in the DeepSpeech example: https://github.com/mozilla/DeepSpeech-examples/blob/r0.9/mic_vad_streaming/mic_vad_streaming.py
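
For illustration, a frame-level boolean detector along these lines might look like the following sketch, using the third-party webrtcvad package (not a torchaudio API; WebRTC's VAD accepts only 10/20/30 ms frames of 16-bit mono PCM at 8/16/32/48 kHz):

```python
import webrtcvad  # pip install webrtcvad

vad = webrtcvad.Vad(3)  # aggressiveness: 0 (least) to 3 (most)
SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit samples

def voiced_frames(pcm: bytes):
    """Yield (byte_offset, is_speech) for each full frame of 16-bit mono PCM."""
    for off in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield off, vad.is_speech(pcm[off:off + FRAME_BYTES], SAMPLE_RATE)
```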

Therefore I propose to remove T.VAD from the library code. Padding the output tensor is one way to resolve the issue, but since there is no universally agreed way to apply a multi-channel VAD function to multiple samples, we should let users decide how the different samples in a mini-batch should be handled.

@mthrok (Collaborator) commented Apr 12, 2021

Filed a task for removing T.VAD from the batch consistency tests: #1449

@astaff (Contributor) commented Apr 19, 2021

Considering your points above, @mthrok, here is my suggestion:

  1. Make it an explicit expectation that VAD takes and returns tensors of the same size and shape: either T or N x T (T = number of frames/samples, N = batch size), where the input is a single channel or a batch of single channels.
  2. VAD returns a mask for voice activity.
  3. Calling code takes care of re-shaping multi-channel input and applying the returned mask in whatever way makes the most sense for the task at hand.
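
As a sketch of what this could look like (the vad_mask name and the simple energy gate standing in for a real VAD algorithm are hypothetical, not an existing torchaudio API):

```python
import torch

def vad_mask(waveform: torch.Tensor, threshold: float = 1e-3) -> torch.Tensor:
    """Hypothetical interface: (T,) or (N, T) in, boolean mask of the same
    shape out. A simple energy gate stands in for a real VAD algorithm."""
    active = waveform.abs() > threshold
    # Mark everything from the first active sample onward as voiced,
    # per row, mimicking sox-style leading-silence trimming.
    return active.cumsum(dim=-1) > 0

# The caller decides how to use the mask, e.g. a ragged per-sample trim:
batch = torch.randn(4, 16000)
trimmed = [row[m] for row, m in zip(batch, vad_mask(batch))]
```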

@vincentqb (Contributor) commented

Just to re-emphasize some of what has been said here: I agree with the comment above that we can disable the batch consistency test, and we need to make sure the docstring reflects what is actually being done.

As to how VAD should trim when presented with a batch, I'd like to get the opinion of someone who may have used this in the wild. @faroit, maybe you have an opinion on this?

> 1. Make it an explicit expectation that VAD takes and returns tensors of the same size and shape: either T or N x T (T = number of frames/samples, N = batch size), where the input is a single channel or a batch of single channels.

Do you mean to suggest disabling batch + channel support (e.g. N x C x T)?

> 2. VAD returns a mask for voice activity.

(I'd note that this interface would differ from that of apply_effects_tensor, which applies transformations from sox, though the latter does not support batches.)

> 3. Calling code takes care of re-shaping multi-channel input and applying the returned mask in whatever way makes the most sense for the task at hand.

Is there a performance cost to deferring the application of the mask returned by VAD? I'd expect elementwise multiplication by a boolean to be fairly cheap.

@astaff (Contributor) commented Apr 20, 2021

> Do you mean to suggest disabling batch + channel support (e.g. N x C x T)?

Yes. This way we don't have to be opinionated about multi-channel behavior. The user code can reshape a 3D tensor with channels into 2D, apply the transform, and then apply the resulting mask and reshape the output in whatever way makes sense for the downstream task: it could be an iterator over samples trimmed to voice activity in either channel, an elementwise multiplication, or something else.
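
A sketch of that user-side pattern, reusing the hypothetical vad_mask from the sketch above:

```python
import torch

# Reusing the hypothetical vad_mask from the earlier sketch.
def vad_mask(waveform: torch.Tensor, threshold: float = 1e-3) -> torch.Tensor:
    return (waveform.abs() > threshold).cumsum(dim=-1) > 0

n, c, t = 8, 2, 16000
batch = torch.randn(n, c, t)

# Fold channels into the batch dimension, apply, unfold.
mask = vad_mask(batch.reshape(n * c, t)).reshape(n, c, t)

# One possible policy: keep frames voiced in *either* channel, then
# trim each sample to its own voiced region (lengths become ragged).
keep = mask.any(dim=1)  # (N, T)
trimmed = [sample[:, k] for sample, k in zip(batch, keep)]
```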

> I'd note that this interface would differ from that of apply_effects_tensor, which applies transformations from sox, though the latter does not support batches.

Agreed. Should we keep backend-specific function signatures aligned with the conventions of their respective backends?

> Is there a performance cost to deferring the application of the mask returned by VAD? I'd expect elementwise multiplication by a boolean to be fairly cheap.

I would expect that too. Deferring the application of the mask returned by VAD is proposed in order to let the user implement the behavior that makes the most sense for a given task.

@vincentqb (Contributor) commented Apr 21, 2021

> Yes. This way we don't have to be opinionated about multi-channel behavior. The user code can reshape a 3D tensor with channels into 2D, apply the transform, and then apply the resulting mask and reshape the output in whatever way makes sense for the downstream task: it could be an iterator over samples trimmed to voice activity in either channel, an elementwise multiplication, or something else.

I'd note, however, that disabling batching is BC-breaking, since we have silently been opinionated.

> I'd note that this interface would differ from that of apply_effects_tensor, which applies transformations from sox, though the latter does not support batches.

> Agreed. Should we keep backend-specific function signatures aligned with the conventions of their respective backends?

We are not bound by the constraints and design decisions of other backends, and it is more important to be consistent in how we present interfaces to users. We can always show examples, say in the docstring, of how to replicate the behavior with other backends, or even let the tests document the mapping :)

> Is there a performance cost to deferring the application of the mask returned by VAD? I'd expect elementwise multiplication by a boolean to be fairly cheap.

I would expect that too. Deferring the application of the mask returned by VAD is proposed in order to let the user implement the behavior that makes the most sense for a given task.

The options I see to address the batch consistency behavior are the following; please feel free to comment if I'm missing any:

  1. Disable batching completely and suggest possible code snippets in the docstring for applying batching in different ways. This is BC-breaking.
  2. Leave this as-is (cropping is applied to the whole batch uniformly) but add explanations to the docstring.
  3. Replace batching with a for-loop that applies F.vad iteratively; this is both BC-breaking and opinionated (a sketch follows this list).
  4. Remove F.vad and T.Vad entirely. Can we confirm whether this duplicates the functionality of apply_effects_tensor? Both are torchscriptable; F.vad works on GPU, but apply_effects_tensor does not ("Clarify sox_effects.apply_effects_tensor as CPU-only" #1456). Can we confirm the performance differences?
  5. Remove T.Vad only. The original design choice for functionals and transforms, though, was that both do the same thing: one is a functional interface and the other carries state.
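
For reference, option 3 would amount to something like this (a sketch; output lengths generally differ, so the result is a list rather than a tensor):

```python
import torch
import torchaudio.functional as F

def vad_per_sample(batch: torch.Tensor, sample_rate: int) -> list:
    """Apply F.vad to each sample of an (N, T) batch independently."""
    return [F.vad(sample.unsqueeze(0), sample_rate) for sample in batch]
```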

I see "changing the interface to return a mask" as addressing a different issue since we still end up having to decide how batching would be supported by the function. it may make going for option 1 (or 3) easier.

There are many "solid, science-backed" ways to do VAD (and note that we already have two: the one discussed above and the one in the example). Making multiple algorithms available to users (even adding new ones like WebRTC's, mentioned above) adds to the richness of the library. I'd love to hear from our users which algorithms are desired (I opened #1468 for this).

The most important thing is to document what happens and communicate to users how the code has been, and is, working; the rest can wait for strong feedback from the community. I favor option 2 (leave this as-is, but add documentation) until we hear such feedback.

@vincentqb (Contributor) commented

@astaff and @mthrok -- I see that PR #1513 was merged with a mention that it should bring this issue to a conclusion, but there are no comments about that here. It looks like option 1 was selected through the PR. However, the warning message is missing the alternative steps a user could follow to reproduce the previous behavior. These could be added to the description of the relevant issue to make them easier for users to follow.

@astaff (Contributor) commented Jun 17, 2021

The change in #1513 does not change the behavior; it only adds a warning, so it seems closer to option 2 proposed above.

@mthrok (Collaborator) commented Jun 18, 2021

I am not sure which "2" you two are referring to, as both of you left comments with ordered lists.
My understanding is that #1513 is a temporary measure, as this discussion has not been concluded.
I think that if we want to retain the VADs, we should change them to return a mask, as I cannot think of a use case for the current interface. Returning a mask would make it easy to handle batched/multi-channel input.
