Port sox::vad #578
Conversation
Thanks again for working on this!
For the coverage of tests that we expect, you can see this readme that will be merged as part of #566. In particular, we need
- sox compatibility test
- jitability test
- batch test
In terms of code organization, we put the computation in a functional when possible so that the transform is more of a thin wrapper around this.
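As a sketch of that organization (the names and signatures here are illustrative placeholders, not the actual torchaudio code): the computation lives in a plain function, and the transform only stores its parameters and delegates.

```python
import torch


def vad(waveform: torch.Tensor, sample_rate: int, trigger_level: float = 7.0) -> torch.Tensor:
    # Stand-in for the actual VAD computation (illustrative only).
    return waveform


class Vad(torch.nn.Module):
    # Thin wrapper: store parameters in __init__, delegate to the functional.
    def __init__(self, sample_rate: int, trigger_level: float = 7.0):
        super().__init__()
        self.sample_rate = sample_rate
        self.trigger_level = trigger_level

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        return vad(waveform, self.sample_rate, self.trigger_level)
```

This keeps all logic testable through the functional, while the transform remains a stateless parameter holder.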
Since this is WIP, I suggest prefixing the title with [WIP] so we know when we are ready to consider merging. As an alternative to prefixing with [WIP], GitHub allows creating a draft pull request, though I don't see a way to convert an existing pull request to draft mode. Do you?
Can you give a little more detail? Is the current implementation working on a single channel? How does sox handle multiple channels? Is the output meant to be per channel?
Got it. That was my original intent in the first commit, but after I ended up with two classes representing state, I moved things to
I tested the current implementation on a single channel. I will confirm sox behavior on multiple channels and report back. Looking at the original C code, I think it triggers on voice activity in any channel and outputs all channels after the trigger.
If the VAD works per channel, then it's easy to add batching by simply reshaping the tensor from (batch, channels, time) to (batch * channels, time), and back. If sox runs on each channel and takes the union of detected regions over the channels, it may be a good idea to leave the detection per channel and let the user decide what to do from there.
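The reshaping trick described here can be sketched roughly as below (`fn` is a stand-in for any per-channel function, not the real API; note this assumes `fn` returns equal-length outputs for every channel, which a trimming VAD may not):

```python
import torch


def apply_batched(fn, x: torch.Tensor) -> torch.Tensor:
    # x: (batch, channels, time). Fold channels into the batch dimension,
    # run the per-channel function, then restore the leading dimensions.
    batch, channels, time = x.shape
    out = fn(x.reshape(batch * channels, time))
    return out.reshape(batch, channels, -1)


x = torch.randn(2, 3, 100)
y = apply_batched(lambda t: t, x)  # identity as a trivial per-channel fn
assert y.shape == (2, 3, 100)
```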
That's right. I can technically pull the state inside the computation by declaring classes inside the function; let's see how that affects readability and vectorization.
EDIT: Wait, I no longer see it anywhere.
It is also worth noting what the common practice is to use. Having said that, if we are adding a backport module as a new dependency, the following has to be corrected, in addition to the implementation.
So I see a lot of maintenance overhead in adding a backport library, while the benefit does not outweigh the overhead.
@mthrok Ugh, that's correct, sorry about that. Having a state class was mostly an artifact of moving code from C. I ended up refactoring it after I got numerical parity with SoX.
Codecov Report
@@            Coverage Diff             @@
##           master     #578      +/-   ##
==========================================
+ Coverage   87.85%   88.58%   +0.72%
==========================================
  Files          19       19
  Lines        2051     2182     +131
==========================================
+ Hits         1802     1933     +131
  Misses        249      249
==========================================
Continue to review full report at Codecov.
Awesome, you were able to remove it!
I also see that you have the JIT consistency test now. Marked the test as added :)
Batch as well. |
test/test_batch_consistency.py
Outdated
    def test_vad(self):
        filepath = common_utils.get_asset_path("vad-hello-mono-32000.wav")
        waveform, _ = torchaudio.load(filepath)
        _test_batch(F.vad, waveform, sample_rate=32000)
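For reference, a batch-consistency check of this kind typically compares the batched call against the single-item call; a minimal sketch (the real `_test_batch` helper in torchaudio may differ):

```python
import torch


def check_batch_consistency(fn, waveform: torch.Tensor, **kwargs):
    # waveform: (channels, time). Repeat it along a new batch dimension and
    # verify the batched result matches the single-item result for each item.
    batch = waveform.unsqueeze(0).repeat(3, 1, 1)  # (3, channels, time)
    expected = fn(waveform, **kwargs)
    batched = fn(batch, **kwargs)
    for item in batched:
        assert torch.allclose(item, expected)
```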
nit: is the `sample_rate=32000` different from what `waveform, sample_rate = torchaudio.load(...)` would return as `sample_rate`?
test/test_torchscript_consistency.py
Outdated
    def test_Vad(self):
        filepath = common_utils.get_asset_path("vad-hello-mono-32000.wav")
        waveform, _ = torchaudio.load(filepath)
        self._assert_consistency(T.Vad(32000), waveform)
nit: same comment about `sample_rate` as above
torchaudio/functional.py
Outdated
    noise_down_time: float = .01,
    noise_reduction_amount: float = 1.35,
    measure_freq: float = 20.0,
    measure_duration: Optional[float] = None,  # by default, twice the measurement period; i.e. with overlap.
nit: adding this comment in the documentation as you have done is enough for me
torchaudio/functional.py
Outdated
    trigger_level: float = 7.0,
    trigger_time: float = 0.25,
    search_time: float = 1.0,
    allowed_gap: float = 0.25,
    pre_trigger_time: float = 0.0,
    # Fine-tuning parameters
    boot_time: float = .35,
    noise_up_time: float = .1,
    noise_down_time: float = .01,
    noise_reduction_amount: float = 1.35,
    measure_freq: float = 20.0,
    measure_duration: Optional[float] = None,  # by default, twice the measurement period; i.e. with overlap.
    measure_smooth_time: float = .4,
    hp_filter_freq: float = 50.,
    lp_filter_freq: float = 6000.,
    hp_lifter_freq: float = 150.,
    lp_lifter_freq: float = 2000.,
nit: should we set defaults both on the functional and the transform or just the transform?
In the existing codebase there are cases of both. For example, `complex_norm` and `compute_deltas` have defaults in both `functional` and `transforms`. Some are implemented only in transforms (`Fade`), some only in functional (`Overdrive`). Personally, I think that the user experience of calling a transform should be no different from calling a functional, which means both should have defaults.
A question that is out of scope for this pull request: is there value in synthesizing transform classes out of functions, including their docstrings?
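One way such synthesis could look, purely as a sketch of the idea raised here (not an existing torchaudio mechanism; `make_transform` and `gain` are hypothetical names, and a dynamically built class like this would likely not be TorchScript-friendly):

```python
import torch


def make_transform(fn):
    # Build a Module whose __init__ captures keyword arguments for `fn`
    # and whose forward applies `fn` to the input, reusing fn's docstring.
    class Transform(torch.nn.Module):
        def __init__(self, **kwargs):
            super().__init__()
            self.kwargs = kwargs

        def forward(self, waveform):
            return fn(waveform, **self.kwargs)

    Transform.__name__ = fn.__name__.title()
    Transform.__doc__ = fn.__doc__  # docstring comes along for free
    return Transform


def gain(waveform, factor=2.0):
    """Multiply the waveform by a constant factor."""
    return waveform * factor


Gain = make_transform(gain)
```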
Besides the minor point I mentioned, this looks good to me! Can you also add it to the documentation in
LGTM, thanks for working on this!
This PR adds a Voice Activity Detector (VAD) to transforms, to further reduce the dependency on sox (see #260).
Original implementation: https://sourceforge.net/p/sox/code/ci/master/tree/src/vad.c
Notes:
- `_measure`: it is still about 2-3x slower than the original C implementation.
- Couldn't get parity with sox when `normalization=True`, as `torchaudio.load` and sox seem to normalize differently. (Is that right?)
- Need help with handling multi-channel audio correctly.
- Some variable names reflect ones in the original code, for troubleshooting purposes.