# **Tango: LLM-guided Text-to-Audio Generation and DPO-based Alignment**

TANGO is a latent diffusion model (LDM) for text-to-audio (TTA) generation. It generates realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. We use the frozen instruction-tuned LLM Flan-T5 as the text encoder and train a UNet-based diffusion model for audio generation. We perform comparably to current state-of-the-art TTA models on both objective and subjective metrics, despite training the LDM on a 63 times smaller dataset. We release our model, training and inference code, and pre-trained checkpoints for the research community.


In [1]:
!git clone https://github.com/saravananbcs/tango-colab-demo.git

Cloning into 'tango-colab-demo'...
remote: Enumerating objects: 2363, done.
remote: Counting objects: 100% (349/349), done.
remote: Compressing objects: 100% (203/203), done.
remote: Total 2363 (delta 197), reused 258 (delta 143), pack-reused 2014
Receiving objects: 100% (2363/2363), 18.96 MiB | 15.07 MiB/s, done.
Resolving deltas: 100% (939/939), done.


In [2]:
%cd /content/tango-colab-demo
!pip install -r requirements.txt

/content/tango-colab-demo
Collecting torch==1.13.1 (from -r requirements.txt (line 1))
  Downloading torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl (887.5 MB)
Collecting torchaudio==0.13.1 (from -r requirements.txt (line 2))
  Downloading torchaudio-0.13.1-cp310-cp310-manylinux1_x86_64.whl (4.2 MB)
Collecting torchvision==0.14.1 (from -r requirements.txt (line 3))
  Downloading torchvision-0.14.1-cp310-cp310-manylinux1_x86_64.whl (24.2 MB)
Collecting transformers==4.27.0 (from -r requirements.txt (line 4))
  Downloading transformers-4.27.0-py3-none-any.whl (6.8 MB)

In [1]:
# make sure to restart the runtime
!pip install jax==0.4.23
!pip install jaxlib==0.4.23

Collecting jax==0.4.23
  Downloading jax-0.4.23-py3-none-any.whl (1.7 MB)
Collecting scipy>=1.9 (from jax==0.4.23)
  Downloading scipy-1.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.6 MB)
Installing collected packages: scipy, jax
  Attempting uninstall: scipy
    Found existing installation: scipy 1.8.0
    Uninstalling scipy-1.8.0:
      Successfully uninstalled scipy-1.8.0
  Attempting uninstall: jax
    Found existing installation: jax 0.4.26
    Uninstalling jax-0.4.26:
      Successfully uninstalled jax-0.4.26
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
chex 0.1.86 requires numpy>=1.

In [2]:
%cd /content/tango-colab-demo
import IPython
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango2")

prompt = "An audience cheering and clapping"
audio = tango.generate(prompt)
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)

/content/tango-colab-demo


Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

Downloading main_config.json:   0%|          | 0.00/207 [00:00<?, ?B/s]

Downloading .gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

Downloading README.md:   0%|          | 0.00/2.00k [00:00<?, ?B/s]

Downloading pytorch_model_stft.bin:   0%|          | 0.00/8.54M [00:00<?, ?B/s]

Downloading stft_config.json:   0%|          | 0.00/141 [00:00<?, ?B/s]

Downloading pytorch_model_main.bin:   0%|          | 0.00/4.83G [00:00<?, ?B/s]

Downloading pytorch_model_vae.bin:   0%|          | 0.00/443M [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/207 [00:00<?, ?B/s]

Downloading vae_config.json:   0%|          | 0.00/326 [00:00<?, ?B/s]

  fft_window = pad_center(fft_window, filter_length)
  mel_basis = librosa_mel_fn(


Downloading (…)cheduler_config.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

UNet initialized randomly.


Downloading tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

Some weights of the model checkpoint at google/flan-t5-large were not used when initializing T5EncoderModel: ['decoder.block.3.layer.2.DenseReluDense.wo.weight', 'decoder.block.15.layer.0.SelfAttention.v.weight', 'decoder.block.20.layer.2.DenseReluDense.wo.weight', 'decoder.block.5.layer.1.EncDecAttention.q.weight', 'decoder.block.18.layer.0.SelfAttention.q.weight', 'decoder.block.12.layer.0.SelfAttention.v.weight', 'decoder.block.0.layer.0.layer_norm.weight', 'decoder.block.5.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.10.layer.1.EncDecAttention.k.weight', 'decoder.block.1.layer.0.SelfAttention.k.weight', 'decoder.block.18.layer.1.EncDecAttention.q.weight', 'decoder.block.21.layer.2.DenseReluDense.wo.weight', 'decoder.block.22.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.10.layer.0.SelfAttention.o.weight', 'decoder.block.9.layer.1.layer_norm.weight', 'decoder.block.22.layer.0.SelfAttention.q.weight', 'decoder.block.10.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.

Successfully loaded checkpoint from: declare-lab/tango2
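
The cell above uses the prompt text directly as the WAV file name. That works for this prompt, but a prompt containing characters such as `/` would not give a valid path. Below is a minimal sketch of one way to derive a safer name; `prompt_to_filename` is a hypothetical helper introduced here for illustration and is not part of the `tango` package.

```python
import re


def prompt_to_filename(prompt, extension=".wav", max_length=80):
    """Turn a free-text prompt into a conservative file name (hypothetical helper)."""
    # Replace every run of characters other than letters, digits, "-" or "_" with "_".
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", prompt).strip("_")
    # Cap the length, and fall back to a fixed stem if nothing usable is left.
    stem = stem[:max_length] or "audio"
    return stem + extension


print(prompt_to_filename("An audience cheering and clapping"))
print(prompt_to_filename("Rain / thunder: part 2?"))
```

With this helper, the save step could be written as `sf.write(prompt_to_filename(prompt), audio, samplerate=16000)`.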


In [3]:
prompt = "Rolling thunder with lightning strikes"
audio = tango.generate(prompt, steps=200)
IPython.display.Audio(data=audio, rate=16000)
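
Before playing or saving a generated clip, it can help to check its length and level. The sketch below computes the duration and peak amplitude of a mono waveform at the 16 kHz rate used throughout this notebook. The sine wave is a synthetic stand-in for `audio`, and `describe_waveform` is a hypothetical helper, not a `tango` function.

```python
import numpy as np

SAMPLE_RATE = 16000  # same rate the notebook passes to sf.write and Audio


def describe_waveform(waveform, sample_rate=SAMPLE_RATE):
    """Return (duration in seconds, peak absolute amplitude) of a mono waveform."""
    samples = np.asarray(waveform, dtype=np.float64)
    duration = samples.shape[-1] / sample_rate
    peak = float(np.max(np.abs(samples))) if samples.size else 0.0
    return duration, peak


# Stand-in for `audio`: two seconds of a 440 Hz sine at half amplitude.
t = np.arange(2 * SAMPLE_RATE) / SAMPLE_RATE
fake_audio = 0.5 * np.sin(2 * np.pi * 440 * t)

duration, peak = describe_waveform(fake_audio)
print(f"{duration:.2f} s, peak {peak:.3f}")
```

In the notebook, `describe_waveform(audio)` would apply the same check to the generated clip, assuming `audio` is a one-dimensional array.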

In [None]:
prompts = [
    "A car engine revving",
    "A dog barks and rustles with some clicking",
    "Water flowing and trickling"
]
audios = tango.generate_for_batch(prompts, samples=2)
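for audio_data in audios:
  # Display each clip explicitly: a bare Audio(...) expression inside a loop renders nothing
  IPython.display.display(IPython.display.Audio(data=audio_data, rate=16000))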
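
To keep the batch results, each clip can also be written to its own file. The sketch below does this with NumPy and the standard-library `wave` module. It assumes `audios` is a flat list of one-dimensional waveforms, which this notebook does not confirm, so check the shape of the real return value first. `save_clips` and the noise clips are hypothetical stand-ins for illustration.

```python
import os
import tempfile
import wave

import numpy as np


def save_clips(clips, out_dir, sample_rate=16000, prefix="clip"):
    """Write each 1-D waveform to a 16-bit PCM mono WAV file and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, clip in enumerate(clips):
        samples = np.asarray(clip)
        if np.issubdtype(samples.dtype, np.integer):
            # Assumed to be int16-range PCM already; only fix the byte layout.
            pcm = samples.astype("<i2")
        else:
            # Assumed to be floats in [-1, 1]; clip, then scale to int16.
            pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2")
        path = os.path.join(out_dir, f"{prefix}_{index:02d}.wav")
        with wave.open(path, "wb") as wav_file:
            wav_file.setnchannels(1)
            wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
            wav_file.setframerate(sample_rate)
            wav_file.writeframes(pcm.tobytes())
        paths.append(path)
    return paths


# Stand-ins for generated audio: three short noise bursts of different lengths.
rng = np.random.default_rng(0)
fake_clips = [0.1 * rng.standard_normal(n) for n in (8000, 16000, 24000)]

out_dir = tempfile.mkdtemp()
saved = save_clips(fake_clips, out_dir)
print([os.path.basename(p) for p in saved])
```

In the notebook, `save_clips(audios, "outputs")` would be the equivalent call under the same assumption. `sf.write`, used earlier, is an alternative when `soundfile` is installed.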

Credits to the authors.

```
@article{ghosal2023tango,
  title={Text-to-Audio Generation using Instruction Tuned LLM and Latent Diffusion Model},
  author={Ghosal, Deepanway and Majumder, Navonil and Mehrish, Ambuj and Poria, Soujanya},
  journal={arXiv preprint arXiv:2304.13731},
  year={2023}
}
```

