Waveformer (a DNN for low-latency audio processing)

This repository provides code for the Waveformer architecture proposed in the paper Real-Time Target Sound Extraction, presented at ICASSP 2023. Waveformer is a low-latency audio processing model that implements streaming inference: the model processes a ~10 ms input audio chunk at each time step, looking only at past chunks and never at future chunks. On a Core i5 CPU using a single thread, the real-time factors (RTFs) of the different model configurations range from 0.66 to 0.94, with an end-to-end latency of less than 20 ms.
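To make the streaming numbers concrete, here is a minimal sketch of chunk-wise causal processing and of how a real-time factor is computed (processing time divided by the duration of the audio processed, so values below 1 keep up with real time). The 44.1 kHz sample rate, the `process_chunk` placeholder, and all function names are assumptions for illustration; none of them is the repository's API.

```python
import time

SAMPLE_RATE = 44100   # assumed sample rate (not stated in this README)
CHUNK_MS = 10         # ~10 ms chunk, as described above


def chunk_size(sample_rate=SAMPLE_RATE, chunk_ms=CHUNK_MS):
    """Number of samples in one chunk."""
    return sample_rate * chunk_ms // 1000


def iter_chunks(samples, size):
    """Yield consecutive full chunks in arrival order (a trailing partial chunk is dropped)."""
    for start in range(0, len(samples) - size + 1, size):
        yield samples[start:start + size]


def process_chunk(chunk, state):
    """Hypothetical stand-in for the model: it sees only the current chunk
    and a state built from past chunks, never future audio."""
    state["chunks_seen"] = state.get("chunks_seen", 0) + 1
    return chunk  # identity placeholder, not a real separation model


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration."""
    return processing_seconds / audio_seconds


def stream(samples, sample_rate=SAMPLE_RATE):
    """Process samples chunk by chunk and report the measured RTF."""
    size = chunk_size(sample_rate)
    state, output = {}, []
    started = time.perf_counter()
    for chunk in iter_chunks(samples, size):
        output.extend(process_chunk(chunk, state))
    elapsed = time.perf_counter() - started
    audio_seconds = len(output) / sample_rate
    if audio_seconds == 0:
        return output, 0.0  # nothing was processed
    return output, real_time_factor(elapsed, audio_seconds)
```

With these assumed values a chunk holds 441 samples, and an RTF of 0.66 corresponds to spending about 6.6 ms of compute on each 10 ms chunk.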


Architecture

Non-causal Waveformer

For the purpose of comparing the Waveformer architecture with other non-causal source separation and target source extraction architectures, we provide a non-causal version of the architecture at src/training/non_causal_dcc_tf.py.

Setup

# Commands in all sections except the Dataset section are run from the repo's top-level directory
conda create --name waveformer python=3.8
conda activate waveformer
pip install -r requirements.txt

Bring Your Own Audio

You can run the model on your own audio files using the Waveformer.py script. The example commands below use the sample audio mixture provided at data/Sample.wav. When run for the first time, the script downloads the default configuration file and checkpoint to the current directory.

# Usage: python Waveformer.py [-h] [--targets TARGETS [TARGETS ...]] input output

# Single-target extraction
python Waveformer.py data/Sample.wav output_typing.wav --targets Computer_keyboard

# Multi-target extraction
python Waveformer.py data/Sample.wav output_bark_cough.wav --targets Bark Cough

The list of all possible targets can be printed with:

python Waveformer.py -h
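
For several input files, the same command line can be driven from Python. The sketch below only assembles the documented `python Waveformer.py input output --targets ...` invocation; the helper names, the output naming scheme, and the decision to require explicit targets are hypothetical choices of this example.

```python
import subprocess
import sys
from pathlib import Path


def build_command(input_wav, output_wav, targets):
    """Assemble the documented CLI call: Waveformer.py input output --targets ..."""
    if not targets:
        # The CLI marks --targets as optional; this sketch asks for explicit labels.
        raise ValueError("at least one target label is required")
    return [sys.executable, "Waveformer.py", str(input_wav), str(output_wav),
            "--targets", *targets]


def extract_all(input_dir, output_dir, targets, dry_run=True):
    """Build one extraction command per .wav file in input_dir.

    With dry_run=True the commands are only returned; otherwise each one is run.
    """
    out_dir = Path(output_dir)
    commands = []
    for wav in sorted(Path(input_dir).glob("*.wav")):
        out = out_dir / f"{wav.stem}_{'_'.join(targets)}.wav"  # hypothetical naming
        cmd = build_command(wav, out, targets)
        commands.append(cmd)
        if not dry_run:
            out_dir.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)  # run from the repo's top-level directory
    return commands
```

Calling `extract_all("my_audio", "outputs", ["Bark", "Cough"])` returns the commands without running them, which is a convenient way to review them first.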

Training and Evaluation

Dataset

We use the Scaper toolkit to synthetically generate audio mixtures. Each audio mixture is generated on the fly, during training or evaluation, using Scaper's generate_from_jams function on a .jams specification file. We provide (in the FSDSoundScapes download step below) .jams specification files for all training, validation and evaluation samples used in our experiments. The .jams specifications are generated using the FSDKaggle2018 and TAU Urban Acoustic Scenes 2019 datasets as sources for foreground and background sounds, respectively. Steps to create the dataset:

Go to the data directory:
```
 cd data
```
Download the FSDKaggle2018, TAU Urban Acoustic Scenes 2019 (Development), and TAU Urban Acoustic Scenes 2019 (Evaluation) datasets using the data/download.py script:
```
 python download.py
```
Download and uncompress the FSDSoundScapes dataset:
```
 wget https://targetsound.cs.washington.edu/files/FSDSoundScapes.zip
 unzip FSDSoundScapes.zip
```
This step creates the data/FSDSoundScapes directory, which contains the .jams specifications for the training, validation and test samples used in the paper. The training and evaluation pipelines expect the source samples (samples in the FSDKaggle2018 and TAU Urban Acoustic Scenes 2019 datasets) at specific locations relative to FSDSoundScapes. The following steps move the source samples to the appropriate locations.

Uncompress the FSDKaggle2018 dataset and create the Scaper source:

 unzip FSDKaggle2018/\*.zip -d FSDKaggle2018
 python fsd_scaper_source_gen.py FSDKaggle2018 ./FSDSoundScapes/FSDKaggle2018 ./FSDSoundScapes/FSDKaggle2018

Uncompress the TAU Urban Acoustic Scenes 2019 dataset to the FSDSoundScapes directory:
```
 unzip TAU-acoustic-sounds/\*.zip -d FSDSoundScapes/TAU-acoustic-sounds/
```
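
Before training, it can help to confirm that the source samples ended up where the steps above put them. The sketch below checks only the directories named in those commands; the function name is hypothetical, and the internal layout of each directory is not verified.

```python
from pathlib import Path

# Directories created by the steps above, relative to the data directory.
EXPECTED_DIRS = (
    "FSDSoundScapes",
    "FSDSoundScapes/FSDKaggle2018",
    "FSDSoundScapes/TAU-acoustic-sounds",
)


def missing_dataset_dirs(data_dir="data"):
    """Return the expected directories that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

Run it from the repo's top-level directory, or pass the path to the data directory explicitly.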

Training

python -W ignore -m src.training.train experiments/<Experiment dir with config.json> --use_cuda

Evaluation

Pretrained checkpoints are available at experiments.zip. These can be downloaded and uncompressed to appropriate locations using:

wget https://targetsound.cs.washington.edu/files/experiments.zip
unzip -o experiments.zip -d experiments

Run the evaluation script:

python -W ignore -m src.training.eval experiments/<Experiment dir with config.json and checkpoints> --use_cuda
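
To evaluate every downloaded configuration in turn, one option is to look for directories under experiments/ that contain a config.json and build the command above for each. This is only a sketch: the helper names are illustrative, and it assumes that every directory holding a config.json is a usable experiment with checkpoints.

```python
import sys
from pathlib import Path


def find_experiments(root="experiments"):
    """Return directories under root (at any depth) that contain a config.json."""
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(p.parent for p in base.rglob("config.json"))


def eval_command(experiment_dir, use_cuda=True):
    """Build the evaluation command shown above for one experiment directory."""
    cmd = [sys.executable, "-W", "ignore", "-m", "src.training.eval", str(experiment_dir)]
    if use_cuda:
        cmd.append("--use_cuda")
    return cmd


if __name__ == "__main__":
    # Print the commands rather than running them, so they can be reviewed first.
    for exp in find_experiments():
        print(" ".join(eval_command(exp)))
```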

Note

During sample generation, when the amplitude of a mixture sums to greater than 1, peak normalization is used to renormalize the mixture. This results in many Scaper warnings during training and evaluation. The -W ignore flag is used for cleaner console output.
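
As an illustration of the idea, peak normalization rescales a mixture so that its largest absolute sample sits at the limit. The sketch below is a generic version written on plain Python lists; it does not reproduce Scaper's exact implementation, and the function name and sample values are made up for the example.

```python
def peak_normalize(mixture, limit=1.0):
    """Scale the mixture so its largest absolute sample equals `limit`,
    but only when the peak exceeds `limit`; otherwise return a copy unchanged."""
    peak = max((abs(s) for s in mixture), default=0.0)
    if peak <= limit:
        return list(mixture)
    scale = limit / peak
    return [s * scale for s in mixture]


# Two sources that are each within [-1, 1] can still sum to a peak above 1.
foreground = [0.8, -0.6, 0.9]
background = [0.4, -0.7, 0.3]
mixture = [f + b for f, b in zip(foreground, background)]
normalized = peak_normalize(mixture)
```

Because every sample is multiplied by the same factor, the relative levels of the sources inside the mixture are preserved.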

Citation

@misc{veluri2022realtime,
  title={Real-Time Target Sound Extraction}, 
  author={Bandhav Veluri and Justin Chan and Malek Itani and Tuochao Chen and Takuya Yoshioka and Shyamnath Gollakota},
  year={2022},
  eprint={2211.02250},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}
