This repository is the official implementation of "SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation"
- Paper: [arXiv](https://arxiv.org/abs/2405.18503)
- Demo page: Audio Samples
- Checkpoints: Hugging Face (currently, only the checkpoints are available)
Contact:
- Koichi SAITO: koichi.saito@sony.com
- Download the teacher model's checkpoints and the AudioLDM-s-full checkpoint (for the VAE + vocoder part) and place them under soundctm/ckpt.
- SoundCTM checkpoint trained on AudioCaps (EMA rate 0.999, 30K training iterations).

For inference, both the AudioLDM-s-full checkpoint (for the VAE decoder + vocoder) and the SoundCTM checkpoint are used.
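As a rough guide, placing the files might look like the following sketch (the file names below are placeholders, not the actual checkpoint names; use the names from the Hugging Face page):

```bash
# Sketch only: the file names below are placeholders, not the real ones.
mkdir -p soundctm/ckpt
cp /path/to/downloads/audioldm-s-full.ckpt soundctm/ckpt/       # VAE + vocoder
cp /path/to/downloads/teacher.ckpt soundctm/ckpt/               # teacher model
cp /path/to/downloads/soundctm_ema0999_30k.ckpt soundctm/ckpt/  # SoundCTM
```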
Install Docker on your server and build the container image:

docker build -t soundctm .

Then run the scripts inside the container.
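A typical way to launch the container might look like this (the GPU flag and the mount point are assumptions; adjust them to your setup):

```bash
# Launch an interactive shell in the container with GPU access and the
# repository mounted at /workspace (the mount path is an assumption).
docker run --gpus all -it --rm -v "$(pwd)":/workspace soundctm bash
```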
Please see ctm_train.sh and ctm_train.py, and modify the folder paths depending on your environment.
Then run `bash ctm_train.sh`.
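The exact settings to edit live in the scripts themselves; purely as an illustration, the paths to adapt typically look something like the following (the variable names here are hypothetical, not necessarily the script's actual ones):

```bash
# Hypothetical excerpt of path settings inside ctm_train.sh
# (variable names are illustrative, not necessarily the real ones).
DATA_DIR=/path/to/audiocaps          # dataset location (see the dataset section)
CKPT_DIR=soundctm/ckpt               # teacher + AudioLDM-s-full checkpoints
OUTPUT_DIR=/path/to/training_output  # where training outputs are written
```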
Please see ctm_inference.sh and ctm_inference.py, and modify the folder paths depending on your environment.
Then run `bash ctm_inference.sh`.
Please see numerical_evaluation.sh and numerical_evaluation.py, and modify the folder paths depending on your environment.
Then run `bash numerical_evaluation.sh`.
Follow the instructions given in the AudioCaps repository for downloading the data.
Data locations need to be specified in ctm_train.sh. You can also see some examples in data/train.csv.
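To check the expected CSV format before pointing the script at your own data, you can inspect the bundled example (the column schema is defined by data/train.csv itself):

```bash
# Print the header and first rows of the bundled example CSV.
head -n 3 data/train.csv
```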
The training code also requires a Weights & Biases account to log the training outputs and demos. Create an account and log in with:
$ wandb login
Or you can pass an API key via the environment variable WANDB_API_KEY.
(You can obtain the API key from https://wandb.ai/authorize after logging in to your account.)

$ export WANDB_API_KEY="12345x6789y..."
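If you prefer not to export the key for the whole shell session, a one-shot form also works (assuming training is launched via ctm_train.sh as above):

```bash
# Set the key only for this single run.
WANDB_API_KEY="12345x6789y..." bash ctm_train.sh
```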
@article{saito2024soundctm,
title={SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation},
author={Koichi Saito and Dongjun Kim and Takashi Shibuya and Chieh-Hsin Lai and Zhi Zhong and Yuhta Takida and Yuki Mitsufuji},
journal={arXiv preprint arXiv:2405.18503},
year={2024}
}
Part of the code is borrowed from the following repositories. We would like to thank their authors for their contributions.