### Audio generation with a pipeline

Audio generation encompasses a versatile set of tasks that involve producing an audio output. The tasks that we will look into here are speech generation (also known as “text-to-speech”) and music generation. In text-to-speech, a model transforms a piece of text into lifelike spoken audio, opening the door to applications such as virtual assistants, accessibility tools for the visually impaired, and personalized audiobooks. Music generation, on the other hand, can enable creative expression and is used mostly in the entertainment and game development industries.

In 🤗 Transformers, you’ll find a pipeline that covers both of these tasks. This pipeline is called "text-to-audio", but for convenience, it also has a "text-to-speech" alias. Here we’ll use both, and you are free to pick whichever seems more applicable for your task.

Let’s explore how you can use this pipeline to start generating audio narration for texts, as well as music, with just a few lines of code.

This pipeline is new to 🤗 Transformers and comes as part of the version 4.32 release, so you’ll need version 4.32 or later. If your installed version is older, upgrade the library, for example by running pip install --upgrade transformers.
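One way to check whether an upgrade is needed is to compare the installed version against 4.32. The sketch below is illustrative only: the helper names `release_tuple` and `supports_text_to_audio` are made up for this example, and the parsing deliberately ignores pre-release suffixes such as `rc1` or `.dev0`.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile


def release_tuple(version_string):
    """Turn '4.32.0.dev0' into (4, 32, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in version_string.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def supports_text_to_audio(installed, minimum="4.32"):
    """Hypothetical helper: True if `installed` is at least the `minimum` release."""
    return release_tuple(installed) >= release_tuple(minimum)


try:
    installed = version("transformers")
    print(installed, "OK" if supports_text_to_audio(installed) else "needs upgrade")
except PackageNotFoundError:
    print("transformers is not installed")
```

If the check reports that an upgrade is needed, running `pip install --upgrade transformers` should bring in a recent enough release.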

### Generating speech

Let’s begin by exploring text-to-speech generation. First, just as was the case with audio classification and automatic speech recognition, we’ll need to define the pipeline. We’ll define a text-to-speech pipeline since it best describes our task, and use the suno/bark-small checkpoint:

In [2]:
from transformers import pipeline
pipe = pipeline("text-to-speech", model="suno/bark-small")

Device set to use cpu


The next step is as simple as passing some text through the pipeline. All the preprocessing will be done for us under the hood:

In [None]:
# English
text = "Ladybugs have had important roles in culture and religion, being associated with luck, love, fertility and prophecy. "
output = pipe(text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [4]:
output

{'audio': array([[-9.9278670e-03, -9.4693527e-03, -1.0387977e-02, ...,
         -1.2958838e-04, -1.0461944e-04, -8.4813000e-05]], dtype=float32),
 'sampling_rate': 24000}
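The dictionary above holds the raw waveform and its sampling rate, which is enough to work out how long the clip is. Here is a minimal sketch, assuming the audio samples sit on the last axis of the array (as in the output shown). The helper name `describe_audio` is hypothetical, and a zero-filled array stands in for a real pipeline result so that the snippet runs without downloading a model.

```python
import numpy as np


def describe_audio(output):
    """Summarise a dict with 'audio' and 'sampling_rate' keys."""
    audio = np.asarray(output["audio"])
    sampling_rate = output["sampling_rate"]
    num_samples = audio.shape[-1]  # assumption: samples are on the last axis
    return {
        "shape": audio.shape,
        "dtype": str(audio.dtype),
        "duration_s": num_samples / sampling_rate,
    }


# Stand-in for a real pipeline output: 2 seconds of silence at 24 kHz.
fake_output = {
    "audio": np.zeros((1, 48000), dtype=np.float32),
    "sampling_rate": 24000,
}
print(describe_audio(fake_output))
```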

In a notebook, we can use the following code snippet to listen to the result:

In [5]:
from IPython.display import Audio
Audio(output["audio"], rate=output["sampling_rate"])
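Outside a notebook, the same array can be written to disk instead. The following is a rough sketch that uses only NumPy and the standard-library `wave` module. It assumes mono float audio in the range [-1, 1]. The helper `save_wav` and the file name `tts_demo.wav` are hypothetical, and a synthetic tone stands in for the pipeline output.

```python
import os
import tempfile
import wave

import numpy as np


def save_wav(path, audio, sampling_rate):
    """Write float audio in [-1, 1] to a mono 16-bit PCM WAV file."""
    samples = np.asarray(audio, dtype=np.float32).squeeze()
    if samples.ndim != 1:
        raise ValueError("expected mono audio, e.g. shape (n,) or (1, n)")
    # Scale floats to 16-bit integers, clipping anything out of range.
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2")
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm.tobytes())


# Stand-in for output["audio"]: one second of a 440 Hz tone at 24 kHz.
sampling_rate = 24000
t = np.arange(sampling_rate) / sampling_rate
tone = (0.1 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)[np.newaxis, :]

demo_path = os.path.join(tempfile.gettempdir(), "tts_demo.wav")
save_wav(demo_path, tone, sampling_rate)

# Read the header back to confirm what was written.
with wave.open(demo_path, "rb") as wav_file:
    saved_rate = wav_file.getframerate()
    saved_frames = wav_file.getnframes()
print(saved_rate, saved_frames)
```

With a real pipeline result, the call would look like `save_wav("speech.wav", output["audio"], output["sampling_rate"])`.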

The model that we’re using with the pipeline, Bark, is actually multilingual, so we can easily substitute the initial text with a text in, say, French, and use the pipeline in the exact same way. It will pick up on the language all by itself:

In [None]:
# French
fr_text = "Contrairement à une idée répandue, le nombre de points sur les élytres d'une coccinelle ne correspond pas à son âge, ni en nombre d'années, ni en nombre de mois. "
output = pipe(fr_text)



In [7]:
Audio(output["audio"], rate=output["sampling_rate"])

In [8]:
# Portuguese
pt_text = "Olá, seja bem vindo ao Portal de Negociações da Hash Technology. "
output = pipe(pt_text)



In [9]:
Audio(output["audio"], rate=output["sampling_rate"])

Not only is this model multilingual, it can also generate audio with non-verbal communication and singing. Here’s how you can make it sing, by wrapping the lyrics in ♪ symbols:

In [10]:
song = "♪ In the jungle, the mighty jungle, the ladybug was seen. ♪ "
output = pipe(song)



In [11]:
Audio(output["audio"], rate=output["sampling_rate"])

We’ll dive deeper into Bark specifics in the later unit dedicated to text-to-speech, and will also show how you can use other models for this task. Now, let’s generate some music!

### Generating music

Just as before, we’ll begin by instantiating a pipeline. For music generation, we’ll define a text-to-audio pipeline, and initialise it with the pretrained checkpoint facebook/musicgen-small:

In [12]:
music_pipe = pipeline("text-to-audio", model="facebook/musicgen-small")

Device set to use cpu


Let’s create a text description of the music we’d like to generate:

In [13]:
text = "90s rock song with electric guitar and heavy drums"

We can control the length of the generated output by passing an additional max_new_tokens parameter to the model. The pipeline forwards it to the model through the forward_params dictionary:

In [14]:
forward_params = {"max_new_tokens": 512}
output = music_pipe(text, forward_params=forward_params)
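To pick a value for max_new_tokens, it helps to translate between tokens and seconds. The sketch below assumes roughly 50 generated tokens per second of audio, which is the figure the MusicGen documentation uses (256 tokens for about 5 seconds). Treat it as an approximation. The constant and helper names are hypothetical.

```python
import math

# Assumption: MusicGen produces about 50 tokens per second of audio.
MUSICGEN_TOKENS_PER_SECOND = 50


def tokens_for_seconds(seconds, tokens_per_second=MUSICGEN_TOKENS_PER_SECOND):
    """Approximate max_new_tokens needed for a clip of the given length."""
    return math.ceil(seconds * tokens_per_second)


def seconds_for_tokens(num_tokens, tokens_per_second=MUSICGEN_TOKENS_PER_SECOND):
    """Approximate clip length produced by a given max_new_tokens."""
    return num_tokens / tokens_per_second


print(seconds_for_tokens(512))  # estimated seconds for the setting used above
print(tokens_for_seconds(5))    # estimated tokens for a 5-second clip
```

Under this assumption, the 512 tokens requested above correspond to a clip of about 10 seconds.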



In [15]:
Audio(output["audio"][0], rate=output["sampling_rate"])