Vocal-VAD in Music Scene

The best is yet to come.

A project implementing vocal VAD (Voice Activity Detection) in music scene.

Task

Input: a complete music in wav format
Output: vocal probability in per 10 ms

Dataset

MUSDB18: contains 150 music tracks (mixture) along with their isolated drums, bass, vocals and others stems.

Methods

Feature Engineering

STE: Short Time Energy, represents the energy of a frame of speech signal.
ZCC: Zero Crossing Counter, represents the number of times the time domain signal of a frame passes through zero.

In general, vocal fragments have high STE and low ZCC, while non-vocal fragments have low STE and high ZCC.

The calculation methods of STE and ZCC is optimized in implementation.

Vocal Extraction

Spleeter is a U-Net based model to extract the vocal track from an audio, implemented in tensorflow. It provides pre-trained model and can be used straight from command line.

Experiment

Reached an AUC of 0.88, a relatively high performance.

Usage

Install dependencies:
```
> pip install -r requirements.txt
```
audio_segments.py: to segment audio files in 10 ms
spleeter_process.py: to automatically run spleeter to extract vocal tracks from original audios
data_process.py: to process the extracted audio and output the VAD result
AUC.py: to compare with ground truth to get ROC curve and AUC value

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
img		img
pretrained_models/2stems		pretrained_models/2stems
spleeter		spleeter
.gitignore		.gitignore
AUC.py		AUC.py
README.md		README.md
__Parameters.py		__Parameters.py
__tag_tools.py		__tag_tools.py
audio_segment.py		audio_segment.py
data_process.py		data_process.py
requirements.txt		requirements.txt
spleeter_process.py		spleeter_process.py

Vaceee/Vocal-VAD

Folders and files

Latest commit

History

Repository files navigation

Vocal-VAD in Music Scene

Task

Dataset

Methods

Feature Engineering

Vocal Extraction

Experiment

Usage

About

Topics

Resources

Stars

Watchers

Forks

Languages