# **Tango: LLM-guided Text-to-Audio Generation and DPO-based Alignment**

TANGO is a latent diffusion model (LDM) for text-to-audio (TTA) generation. It can generate realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. We use the frozen instruction-tuned LLM Flan-T5 as the text encoder and train a UNet-based diffusion model for audio generation. TANGO performs comparably to current state-of-the-art TTA models on both objective and subjective metrics, despite training the LDM on a dataset 63 times smaller. We release our model, training and inference code, and pre-trained checkpoints for the research community.


In [1]:
!git clone https://github.com/saravananbcs/tango.git

Cloning into 'tango'...
remote: Enumerating objects: 2355, done.
remote: Counting objects: 100% (341/341), done.
remote: Compressing objects: 100% (195/195), done.
remote: Total 2355 (delta 194), reused 259 (delta 143), pack-reused 2014
Receiving objects: 100% (2355/2355), 18.34 MiB | 17.56 MiB/s, done.
Resolving deltas: 100% (936/936), done.


In [2]:
%cd /content/tango
!pip install -r requirements.txt

/content/tango
Collecting torch==1.13.1 (from -r requirements.txt (line 1))
  Downloading torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl (887.5 MB)
Collecting torchaudio==0.13.1 (from -r requirements.txt (line 2))
  Downloading torchaudio-0.13.1-cp310-cp310-manylinux1_x86_64.whl (4.2 MB)
Collecting torchvision==0.14.1 (from -r requirements.txt (line 3))
  Downloading torchvision-0.14.1-cp310-cp310-manylinux1_x86_64.whl (24.2 MB)
Collecting transformers==4.27.0 (from -r requirements.txt (line 4))
  Downloading transformers-4.27.0-py3-none-any.whl (6.8 MB)

In [4]:
%cd /content/tango
import IPython
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango2")

prompt = "An audience cheering and clapping"
audio = tango.generate(prompt)
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)

/content/tango


UNet initialized randomly.


Some weights of the model checkpoint at google/flan-t5-large were not used when initializing T5EncoderModel: ['decoder.block.1.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.9.layer.0.SelfAttention.o.weight', 'decoder.block.16.layer.0.SelfAttention.k.weight', 'decoder.block.9.layer.0.SelfAttention.v.weight', 'decoder.block.16.layer.1.layer_norm.weight', 'decoder.block.10.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.17.layer.0.layer_norm.weight', 'decoder.block.7.layer.2.DenseReluDense.wo.weight', 'decoder.block.11.layer.2.DenseReluDense.wo.weight', 'decoder.block.22.layer.2.DenseReluDense.wo.weight', 'decoder.block.18.layer.0.SelfAttention.q.weight', 'decoder.block.13.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.2.layer.1.EncDecAttention.o.weight', 'decoder.block.7.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.13.layer.0.layer_norm.weight', 'decoder.block.20.layer.0.SelfAttention.k.weight', 'decoder.block.0.layer.0.SelfAttention.v.weight', 'decoder.block.0.laye

Successfully loaded checkpoint from: declare-lab/tango2
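
The cell above writes the clip to `f"{prompt}.wav"`. That works for this prompt, but a prompt containing characters such as `/` or `?` can produce an invalid or awkward file name. One possible way to derive a safer name is sketched below. `prompt_to_filename` is a hypothetical helper introduced here for illustration and is not part of the Tango code.

```python
import re


def prompt_to_filename(prompt, ext=".wav", max_len=80):
    """Turn a free-text prompt into a filesystem-friendly file name.

    Hypothetical helper for illustration; not part of the Tango repository.
    """
    # Lowercase, then collapse every run of non-alphanumeric characters into "_".
    stem = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    # Cap the length and fall back to a generic name if nothing usable is left.
    stem = stem[:max_len].rstrip("_") or "audio"
    return stem + ext


print(prompt_to_filename("An audience cheering and clapping"))
print(prompt_to_filename("Rain / wind: heavy?"))
```

In the notebook, this could be used as `sf.write(prompt_to_filename(prompt), audio, samplerate=16000)`.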


In [6]:
prompt = "Rolling thunder with lightning strikes"
audio = tango.generate(prompt, steps=200)
IPython.display.Audio(data=audio, rate=16000)

In [None]:
prompts = [
    "A car engine revving",
    "A dog barks and rustles with some clicking",
    "Water flowing and trickling"
]
audios = tango.generate_for_batch(prompts, samples=2)
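# With samples=2, each entry of `audios` is expected to hold the candidate clips for one prompt.
for prompt_audios in audios:
    for audio_data in prompt_audios:
        # display() is needed here: a bare Audio(...) expression inside a loop renders nothing.
        IPython.display.display(IPython.display.Audio(data=audio_data, rate=16000))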
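
When several clips are generated per prompt, it helps to keep track of which clip belongs to which prompt. The sketch below pairs each clip with a readable label. It assumes the batch call returns either one clip per prompt or a list of candidate clips per prompt. `label_batch_outputs` is a hypothetical helper, and the strings in the demo are stand-ins for real audio arrays.

```python
def label_batch_outputs(prompts, audios):
    """Pair each generated clip with a readable label.

    Hypothetical helper for illustration. Accepts one clip per prompt, or a
    list/tuple of candidate clips per prompt (an assumption about the output).
    """
    labeled = []
    for prompt, item in zip(prompts, audios):
        # A list or tuple is treated as several candidates for the same prompt.
        candidates = item if isinstance(item, (list, tuple)) else [item]
        for k, clip in enumerate(candidates, start=1):
            labeled.append((f"{prompt} (sample {k}/{len(candidates)})", clip))
    return labeled


# Stand-in data: strings take the place of the generated waveforms.
demo_prompts = ["A car engine revving", "Water flowing and trickling"]
demo_audios = [["car_a", "car_b"], ["water_a", "water_b"]]
for label, clip in label_batch_outputs(demo_prompts, demo_audios):
    print(label, "->", clip)
```

With real outputs, each `(label, clip)` pair could be printed and then played with `IPython.display.display(IPython.display.Audio(data=clip, rate=16000))`.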

Credits to the authors.

```
@article{ghosal2023tango,
  title={Text-to-Audio Generation using Instruction Tuned LLM and Latent Diffusion Model},
  author={Ghosal, Deepanway and Majumder, Navonil and Mehrish, Ambuj and Poria, Soujanya},
  journal={arXiv preprint arXiv:2304.13731},
  year={2023}
}
```

