Better Voice Activity Detection (VAD) (volume threshold) algorithm? #46

WofWca · 2022-04-21T12:19:00Z

Currently we're simply calculating the loudness of incoming audio and if it's below a certain value, we say it's silence. Would be cool to find a specialized algorithm that does this better, like what voice communication apps do.
FYI in theory we can use an implementation written in a different language, such as C, C++, Rust, Go - we could compile it to WASM.

And it's not necessary to replace the current implementation, we can make an option to switch between different silence detection algorithms.
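For reference, the current loudness-based check described above can be sketched roughly like this. It is a minimal illustration, not the extension's actual code: the names `rms` and `isSilence` are hypothetical, and it assumes the audio arrives as blocks of PCM samples in the range [-1, 1].

```typescript
/** Root-mean-square level of one block of PCM samples in the range [-1, 1]. */
function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  const sumOfSquares = samples.reduce((acc, s) => acc + s * s, 0);
  return Math.sqrt(sumOfSquares / samples.length);
}

/** A block counts as silence when its RMS level is below the threshold. */
function isSilence(samples: Float32Array, volumeThreshold: number): boolean {
  return rms(samples) < volumeThreshold;
}
```

Anything that is loud enough passes this check, whether it is speech, music or noise, which is the limitation this issue is about.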

➕ Advantages:

Ignore noise, background music, only consider actual speech.
Make the "volume threshold" input less (or completely) unnecessary - make the extension simpler to use - not only if you're a beginner, but also because the best "volume threshold" value is different for different videos.
"Margin before" / "margin after" values also become the concern of the voice activity detection algorithm, not ours or user's.
May give an ability to add noise cancelation for output audio.

➖ Disadvantages:

Will probably need to do something about the chart, because it kind of makes drawing the volume threshold unnecessary (or even impossible)?
Need to make sure it's not too resource-demanding
It may have a bigger delay (i.e. more time to switch from silence speed to sounded speed). Although it's not that big of a concern when using the cloning algorithm.

Where to start (I update the list from time to time):

https://www.npmjs.com/search?q=voice%20activity%20detection
https://en.wikipedia.org/wiki/Voice_activity_detection
https://alternativeto.net/software/discord-app/?license=opensource
https://github.com/search?q=webrtc+vad+language%3AJavaScript+&type=repositories
maybe also see analogous software implementation, e.g. https://ffmpeg.org/ffmpeg-filters.html#silenceremove - they have options for RMS, peak, etc. There is https://github.com/ffmpegwasm/ffmpeg.wasm and https://github.com/Yahweasel/libav.js/ . Other software: https://alternativeto.net/software/jump-cutter/
https://github.com/solyarisoftware/webad - this project focuses on speech detection, README is very comprehensive. Also has a list of relevant projects at the bottom, worth checking out as well.
https://github.com/sapphi-red/web-noise-suppressor - noise suppression.
A simple thing to do is to add a BiquadFilterNode that would cut out frequencies that human speech is usually not associated with. See this StackOverflow answer also.
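A rough sketch of the BiquadFilterNode idea: configure one band-pass filter so that mostly the speech band reaches the loudness check. The 300–3400 Hz range is the traditional telephony voice band and is only an assumption here; the function name and the structural types (`ParamLike`, `BiquadLike`) are hypothetical and exist so the sketch does not depend on DOM typings. A real `BiquadFilterNode` has `type`, `frequency` and `Q` with these shapes.

```typescript
// Minimal structural types; a real BiquadFilterNode from the Web Audio API fits them.
interface ParamLike {
  value: number;
}
interface BiquadLike {
  type: string;
  frequency: ParamLike;
  Q: ParamLike;
}

/**
 * Turn a biquad filter into a band-pass that spans roughly lowHz..highHz.
 * For a band-pass biquad, `frequency` is the center of the band and
 * Q = center / bandwidth controls how wide the band is.
 */
function configureSpeechBandPass(filter: BiquadLike, lowHz = 300, highHz = 3400): void {
  const centerHz = Math.sqrt(lowHz * highHz); // geometric mean of the band edges
  filter.type = "bandpass";
  filter.frequency.value = centerHz;
  filter.Q.value = centerHz / (highHz - lowHz);
}
```

In the browser this would presumably be used on a node created with `audioContext.createBiquadFilter()`, placed between the media source and whatever measures the volume. A single biquad has gentle slopes, so low-frequency rumble and hiss are attenuated rather than removed, and music that overlaps the speech band still gets through.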

Also see #164, there is a good collection of various VAD algorithms

I would appreciate your advice (as always).

I also found this: https://developer.mozilla.org/en-US/docs/Web/API/MediaTrackConstraints/noiseSuppression
Idk what it is, but may help.
Also this: w3c/webrtc-extensions#76


zznidar · 2022-05-08T10:00:09Z

I was thinking about doing some frequency comparison. Maybe by getting the ratio between certain frequencies inside human voice range and outside of it. Some research would be needed, though, to find the right values. Also, non-instrumental music would probably cause some problems.
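The frequency-ratio idea could look something like the sketch below: given a magnitude spectrum of one audio frame, compute which fraction of the energy lies inside an assumed voice band. The function name, the 300–3400 Hz band and any cutoff applied to the result are assumptions that would need the research mentioned above. If the spectrum comes from `AnalyserNode.getFloatFrequencyData`, the values are in dB and would first have to be converted to linear magnitudes with `10 ** (dB / 20)`.

```typescript
/**
 * Fraction (0..1) of spectral energy that falls inside [lowHz, highHz].
 * `magnitudes[k]` is the linear magnitude of FFT bin k; the bins are
 * assumed to cover 0 Hz up to the Nyquist frequency (sampleRate / 2).
 */
function voiceBandEnergyRatio(
  magnitudes: Float32Array,
  sampleRate: number,
  lowHz = 300,
  highHz = 3400,
): number {
  const fftSize = magnitudes.length * 2;
  const binWidthHz = sampleRate / fftSize;
  let inBand = 0;
  let total = 0;
  magnitudes.forEach((magnitude, bin) => {
    const energy = magnitude * magnitude;
    const frequencyHz = bin * binWidthHz;
    total += energy;
    if (frequencyHz >= lowHz && frequencyHz <= highHz) inBand += energy;
  });
  return total === 0 ? 0 : inBand / total;
}
```

A frame would then be treated as speech only if it is loud enough and the ratio is above some cutoff. As noted, music with most of its energy in the same band would still score high.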

WofWca · 2022-05-08T11:05:51Z

Finding an existing one would be cooler, at least in other languages, or at least in maths form.

Kj-is-escape · 2022-07-03T22:29:31Z

What exactly are we looking for? Would something like webvoicesdk be it? The demo seems to do what we are looking for.

WofWca · 2022-07-04T19:21:46Z

What do you mean? The pros and cons in the issue description should give a rough idea.
Have you looked further into tech specs of the library you suggested, how it's better than what we have now? Based on the demo it seems to also get activated by music.
Anyway, thanks for the suggestion.

Kj-is-escape · 2022-07-04T20:49:40Z

I didn't look much into it.
"State of the art Recurent Neural Network that uses WebAssembly portable runtime for voice activity detection." gave me the idea that it was taking into account more than volume.
I think "finding something that doesn't get activated by music" is a good way of measuring it.

WofWca · 2022-09-17T05:57:30Z

People who use NewPipe say that it does its job well (also it doesn't require the user to specify volumeThreshold), so maybe copying their algorithm would be enough. It uses ExoPlayer, where the silence skipping feature is implemented:

https://github.com/google/ExoPlayer/blob/ab4d37f499ba49bb3cac7938eb03ebf7133ccfe6/library/core/src/main/java/com/google/android/exoplayer2/audio/SilenceSkippingAudioProcessor.java

Also see vantezzen/skip-silence#36 (adaptive (dynamic) volume threshold)
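To illustrate what an adaptive (dynamic) volume threshold might mean in general, here is a small sketch. It is not ExoPlayer's or skip-silence's implementation; the function name, the parameters (`ratio`, `riseRate`, `minFloor`) and their defaults are all made up for the example. The idea is to track an estimate of the background noise level and treat a block as silence when its level is below a multiple of that estimate, so the user does not have to pick a value per video.

```typescript
/**
 * Returns a function that classifies successive block levels (e.g. RMS values).
 * The noise floor estimate drops immediately to any quieter level and creeps
 * up slowly (by `riseRate`) when the input is louder than the estimate.
 */
function createAdaptiveSilenceDetector(
  ratio = 3,
  riseRate = 0.01,
  minFloor = 1e-4,
): (level: number) => boolean {
  let noiseFloor = Number.POSITIVE_INFINITY;
  return (level: number): boolean => {
    if (level < noiseFloor) {
      noiseFloor = level;
    } else {
      noiseFloor += riseRate * (level - noiseFloor);
    }
    // `minFloor` keeps the threshold above zero after digital silence.
    return level < Math.max(noiseFloor, minFloor) * ratio;
  };
}
```

With these defaults a level of 0.2 counts as sound after a quiet background of 0.01, but as silence after a noisy background of 0.1. One known weakness of this particular sketch is that a long stretch of uninterrupted speech slowly raises the floor, so the rates would need tuning or a different floor estimator.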

WofWca · 2024-05-06T11:32:54Z

A few thoughts about OpenAI's Whisper (also mentioned in #164). I'm not an expert, but at a glance it looks to me like VAD with extra steps? I'm looking at FUTO Voice Input and they're using both Whisper and WebRTC VAD (check the "credits" section in their app):

WebRTC - Thanks to the voice activity detection to stop recognition on silence

WofWca added the "help wanted" label Apr 21, 2022

WofWca changed the title ~~Better voice activity detection algorithm?~~ Better voice activity detection (volume threshold) algorithm? Sep 17, 2022

WofWca mentioned this issue May 28, 2023

"Easy mode" settings #158

Open

WofWca mentioned this issue Jul 20, 2023

feat: jump based on "Voice Activity Detection"/"Speech Recognition" #164

Closed

WofWca changed the title ~~Better voice activity detection (volume threshold) algorithm?~~ Better voice activity detection (VAD) (volume threshold) algorithm? May 6, 2024

WofWca changed the title ~~Better voice activity detection (VAD) (volume threshold) algorithm?~~ Better Voice Activity Detection (VAD) (volume threshold) algorithm? May 6, 2024
