<a href="https://colab.research.google.com/github/weedge/doraemon-nb/blob/main/Bark_HuggingFace_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bark in 🤗 Transformers

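Bark has been available in 🤗 Transformers since v4.31.0!

In this notebook, we demonstrate how to use the Bark model from the 🤗 Transformers library, covering unconditional generation, generation with speaker prompts, and advanced text prompts for controllable generation.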

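## Bark architecture

Bark is a Transformer-based text-to-speech model proposed by Suno AI in [suno-ai/bark](https://github.com/suno-ai/bark).

Bark is made of four main models:

- `BarkSemanticModel` (also called the "text" model): a causal autoregressive Transformer that takes tokenized text as input and predicts semantic text tokens that capture the meaning of the text.
- `BarkCoarseModel` (also called the "coarse acoustics" model): a causal autoregressive Transformer that takes the output of `BarkSemanticModel` as input and predicts the first two audio codebooks required by EnCodec.
- `BarkFineModel` (the "fine acoustics" model): a non-causal autoencoder Transformer that iteratively predicts the remaining codebooks from the sum of the embeddings of the previous codebooks.
- `EncodecModel`: once all codebook channels have been predicted, Bark uses it to decode the output audio array.

Note that each of the first three modules can take a conditional speaker embedding, which conditions the output on a specific predefined voice.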

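## Set up the environment

Let's make sure we're connected to a GPU to run this notebook. To get a GPU, click `Runtime` -> `Change runtime type`, then change `Hardware accelerator` from `None` to `GPU`. We can verify that we've been assigned a GPU and view its specifications with the `nvidia-smi` command: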

In [1]:
!nvidia-smi

Tue May 21 09:53:09 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   63C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

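Here we've been assigned a Tesla T4 GPU (nominally 16 GB; `nvidia-smi` reports 15360 MiB), although this may vary depending on GPU availability and Colab's allocation.

Next, we install the 🤗 Transformers package from the main branch: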

In [2]:
!pip install --upgrade --quiet pip
!pip install --quiet git+https://github.com/huggingface/transformers.git
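
Because Bark requires 🤗 Transformers v4.31.0 or later, it can be useful to confirm which version ended up installed. The following is a minimal sketch that uses only the standard library. The helper names `parse_release`, `has_bark_support` and `installed_transformers_version` are hypothetical, and the parsing is deliberately simplified: it only looks at the leading numeric components, so a development build from the main branch is compared by its base release number.

```python
import re
from importlib import metadata

MINIMUM_VERSION = "4.31.0"  # first release with Bark, as stated in the introduction


def parse_release(version):
    """Return the first three numeric components, e.g. '4.42.0.dev0' -> (4, 42, 0)."""
    numbers = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric piece such as 'dev0'
        numbers.append(int(match.group()))
    # Pad short versions such as '4.31' so tuples compare component by component.
    return tuple((numbers + [0, 0, 0])[:3])


def has_bark_support(installed, minimum=MINIMUM_VERSION):
    """True if the installed version is at least the minimum version."""
    return parse_release(installed) >= parse_release(minimum)


def installed_transformers_version():
    """Return the installed transformers version string, or None if it is absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None
```

In a notebook cell, `has_bark_support(installed_transformers_version() or "0")` would then report whether the environment is recent enough.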


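## Load the model

The pre-trained Bark small and large checkpoints can be loaded from the [pre-trained weights](https://huggingface.co/suno/bark) on the Hugging Face Hub. Change the repo id to match the checkpoint size you want to use.

The cell below loads the small checkpoint, `"suno/bark-small"`, which is faster at inference. For better quality at the cost of slower inference, use the large checkpoint, `"suno/bark"`, instead.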

In [4]:
from transformers import BarkModel

model = BarkModel.from_pretrained("suno/bark-small")


In [6]:
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

In [7]:
model_million_params = sum(p.numel() for p in model.parameters())/1e6
print(model)
print(f"{model_million_params}M parameters")


BarkModel(
  (semantic): BarkSemanticModel(
    (input_embeds_layer): Embedding(129600, 768)
    (position_embeds_layer): Embedding(1024, 768)
    (drop): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-11): 12 x BarkBlock(
        (layernorm_1): BarkLayerNorm()
        (layernorm_2): BarkLayerNorm()
        (attn): BarkSelfAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (att_proj): Linear(in_features=768, out_features=2304, bias=False)
          (out_proj): Linear(in_features=768, out_features=768, bias=False)
        )
        (mlp): BarkMLP(
          (in_proj): Linear(in_features=768, out_features=3072, bias=False)
          (out_proj): Linear(in_features=3072, out_features=768, bias=False)
          (dropout): Dropout(p=0.0, inplace=False)
          (gelu): GELU(approximate='none')
        )
      )
    )
    (layernorm_final): BarkLayerNorm()
    (lm_head): Linear(in_features=76
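
The cell above prints a single parameter total. To see how that total is distributed over Bark's sub-models, a breakdown per direct child module can help. The sketch below is an illustration, not part of the original notebook: `count_parameters_by_submodule` is a hypothetical helper that relies only on the standard `torch.nn.Module` methods `named_children()` and `parameters()`.

```python
def count_parameters_by_submodule(model):
    """Return {child_name: parameter count in millions} for each direct child.

    `model` is assumed to behave like a torch.nn.Module, i.e. it exposes
    `named_children()` and each child exposes `parameters()`.
    """
    return {
        name: sum(p.numel() for p in child.parameters()) / 1e6
        for name, child in model.named_children()
    }
```

With the model loaded above, `count_parameters_by_submodule(model)` should return one entry per child module, such as the `semantic` entry visible in the printout. Parameters registered directly on the top-level module, if any, are not included in this breakdown.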

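## Generate speech

Bark is a highly controllable text-to-speech model, meaning you can generate speech with a variety of settings, as we'll see below.

First, load a `BarkProcessor` to pre-process the inputs.

The processor plays two roles here:
1. It tokenizes the input text, i.e. it cuts it into small pieces that the model can understand.
2. It stores the speaker embeddings, i.e. voice presets that can condition the generation.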

In [8]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("suno/bark")


### Unconditional generation

Let's start by generating speech in the simplest way, without any extra settings.

In [16]:
# prepare the inputs
# English version of the prompt:
# text_prompt = "Let's try generating speech, with Bark, a text-to-speech model"
# Chinese version of the same prompt:
text_prompt = "让我们试着用Bark， 一个文本到语音的模型生成语音"

inputs = processor(text_prompt)

# generate speech
speech_output = model.generate(**inputs.to(device))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


The audio output is a two-dimensional Torch tensor of shape `(batch_size, sequence_length)`. To listen to the generated audio samples, you can play them directly in the notebook:

In [18]:
from IPython.display import Audio

sampling_rate = model.generation_config.sample_rate
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

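You can also save them as a `.wav` file using a third-party library such as `scipy` (note that indexing with `[0]` selects the first sample of the batch, so a one-dimensional array is written):

In [17]:
import scipy.io.wavfile

scipy.io.wavfile.write("bark_out.wav", rate=sampling_rate, data=speech_output[0].cpu().numpy())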

In [19]:
Audio("bark_out.wav")
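
`scipy.io.wavfile.write` stores a float array as a floating-point WAV file, which some players may not handle. As an alternative, here is a minimal sketch that writes 16-bit PCM using only the standard library. `write_pcm16_wav` is a hypothetical helper, and it assumes a mono waveform with values in `[-1.0, 1.0]`; anything outside that range is clipped.

```python
import array
import sys
import wave


def write_pcm16_wav(path, samples, sample_rate):
    """Write mono float samples as 16-bit PCM and return the duration in seconds."""
    # Clip to [-1.0, 1.0] (an assumption about the waveform range) and scale to int16.
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples)
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data must be little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16 bits
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm) / sample_rate
```

With the variables from the cells above, a call could look like `write_pcm16_wav("bark_out_pcm16.wav", speech_output[0].cpu().numpy(), sampling_rate)`. The return value is the clip length in seconds, which is simply the number of samples divided by the sampling rate.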

In [21]:
# prepare the inputs
text_prompt = "Let's try generating speech, with Bark, a text-to-speech model"

inputs = processor(text_prompt)

# generate speech
speech_output = model.generate(**inputs.to(device))



In [22]:
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

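### Conditional generation

The Suno AI team provides a [library of voice presets](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c) for conditioning the generated speech. In other words, the generated speech sounds as if it were spoken by the predefined voice.

The processor can load these speaker prompts automatically while tokenizing the input text.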
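
The preset names used in this notebook follow a pattern such as `v2/en_speaker_6`: an optional version prefix, a language code, and a speaker index. The sketch below builds such names and catches typos early. It is only an illustration: `voice_preset_name` is a hypothetical helper, and both the language list and the 0 to 9 speaker range are assumptions based on the preset library linked above, not something the processor exposes under these names.

```python
# Assumed language codes, taken from the preset library; adjust if it changes.
SUPPORTED_LANGUAGES = (
    "de", "en", "es", "fr", "hi", "it", "ja", "ko", "pl", "pt", "ru", "tr", "zh",
)


def voice_preset_name(language, speaker, version="v2"):
    """Build a preset name such as 'v2/en_speaker_6'."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unknown language code: {language!r}")
    if not 0 <= speaker <= 9:  # assumed speaker range
        raise ValueError(f"speaker index out of range: {speaker}")
    prefix = f"{version}/" if version else ""
    return f"{prefix}{language}_speaker_{speaker}"
```

The result can be passed as `voice_preset` to the processor, for example `processor(text_prompt, voice_preset=voice_preset_name("en", 6))`.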

Let's try a voice preset:

In [20]:
voice_preset = "v2/en_speaker_6"

# prepare the inputs
text_prompt = "Let's try generating speech, with Bark, a text-to-speech model"
inputs = processor(text_prompt, voice_preset=voice_preset)

# generate speech
speech_output = model.generate(**inputs.to(device))

# let's hear it
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)



Great, let's try another voice preset:

In [23]:
voice_preset = "v2/en_speaker_3"

# prepare the inputs
text_prompt = "Let's try generating speech, with Bark, a text-to-speech model"
inputs = processor(text_prompt, voice_preset=voice_preset)

# generate speech
speech_output = model.generate(**inputs.to(device))

# let's hear it
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)





In [25]:
voice_preset = "v2/zh_speaker_8"

# prepare the inputs
text_prompt = "Let's try generating speech, with Bark, a text-to-speech model"
inputs = processor(text_prompt, voice_preset=voice_preset)

# generate speech
speech_output = model.generate(**inputs.to(device))

# let's hear it
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)



In [26]:
voice_preset = "v2/zh_speaker_8"

# prepare the inputs (Chinese text taken from the introduction to this section)
text_prompt = "Suno AI 团队提供了一个预设声音库，用于调节生成的语音。换句话说，它生成的语音看起来是由预定义的条件声音生成的。处理器可以在标记化输入文本时自动加载这些发言者提示"
inputs = processor(text_prompt, voice_preset=voice_preset)

# generate speech
speech_output = model.generate(**inputs.to(device))

# let's hear it
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)



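### More advanced generation techniques

All of the generations above used the default sampling mode (`do_sample=True`), but you can also use [more advanced generation techniques](https://huggingface.co/docs/transformers/generation_strategies), such as `beam_search`, to get better quality.

You can also set generation parameters for a specific sub-model by prefixing the parameter name with `semantic_`, `coarse_` or `fine_`.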
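
To make the prefix convention concrete, the sketch below shows one way prefixed and unprefixed keyword arguments could be routed to the three sub-models. It illustrates the idea only and is not the library's implementation: `split_generation_kwargs` is a hypothetical helper, and it assumes that an unprefixed argument applies to every sub-model unless a prefixed one overrides it.

```python
SUBMODEL_PREFIXES = ("semantic", "coarse", "fine")


def split_generation_kwargs(**kwargs):
    """Route generation kwargs to sub-models according to their prefix."""
    routed = {name: {} for name in SUBMODEL_PREFIXES}
    shared = {}
    for key, value in kwargs.items():
        for name in SUBMODEL_PREFIXES:
            prefix = name + "_"
            if key.startswith(prefix):
                # A prefixed key goes only to its sub-model, without the prefix.
                routed[name][key[len(prefix):]] = value
                break
        else:
            shared[key] = value
    # Unprefixed keys fill in whatever a sub-model did not set explicitly.
    for name in SUBMODEL_PREFIXES:
        for key, value in shared.items():
            routed[name].setdefault(key, value)
    return routed
```

For instance, `split_generation_kwargs(num_beams=4, temperature=0.1, semantic_temperature=0.8)` gives the semantic sub-model a temperature of `0.8`, while the coarse and fine sub-models keep `0.1`. All three receive `num_beams=4`.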

Let's apply this to the previous `text_prompt`.

In [27]:
speech_output = model.generate(**inputs.to(device), num_beams=4, temperature=0.1, semantic_temperature=0.8)

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)



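**Tip:** Chinese support is weak. The output sounds like Chinese spoken with a foreign accent, and mixing Chinese and English text produces bad cases 😞

### Multilingual speech

Bark can also generate multilingual speech, for example in French and Chinese.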

In [28]:
# Multilingual speech - simplified Chinese
inputs = processor("惊人的！我会说中文")

# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)



In [29]:
# Multilingual speech - French - let's use a voice_preset as well
inputs = processor("Je peux générer du son facilement avec ce modèle.", voice_preset="fr_speaker_3")

# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)



### **Non-verbal** communication

The model can also produce **non-verbal communication** such as laughing, sighing and crying.
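
Non-verbal cues are written as bracketed tags inside the text prompt. The sketch below assembles such prompts. It is an illustration under stated assumptions: `cue` and `with_cues` are hypothetical helpers, and the tag list is an assumption drawn from the cues documented in the suno-ai/bark repository, so other tags may or may not have an effect.

```python
# Assumed cue names, based on the tags listed in the suno-ai/bark README.
KNOWN_CUES = ("laughter", "laughs", "sighs", "music", "gasps", "clears throat")


def cue(name):
    """Return a bracketed cue tag such as '[laughter]'."""
    if name not in KNOWN_CUES:
        raise ValueError(f"unknown cue: {name!r}")
    return f"[{name}]"


def with_cues(text, before=(), after=()):
    """Surround `text` with cue tags placed before and after it."""
    parts = [cue(name) for name in before] + [text] + [cue(name) for name in after]
    return " ".join(parts)
```

Calling `with_cues("Hello uh ..., my dog is cute", before=["clears throat"], after=["laughter"])` reproduces the prompt used in the next cell.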

In [30]:
# Adding non-speech cues to the input text
inputs = processor("[clears throat] Hello uh ..., my dog is cute [laughter]")


# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)



### More applications

Bark can also generate music. You can help it along by adding music notes around your lyrics.
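
The prompts in this section are plain strings with a light convention: `♪` around lyrics, and upper-case speaker labels such as `WOMAN:` and `MAN:` for dialogue. A minimal sketch of two formatting helpers follows. `as_song` and `as_dialogue` are hypothetical names, and they only build the text; whether Bark follows the convention for a given prompt still has to be checked by listening.

```python
def as_song(lyrics):
    """Wrap lyrics in music notes, e.g. '♪ la la la ♪'."""
    return f"♪ {lyrics.strip()} ♪"


def as_dialogue(turns):
    """Format (speaker, line) pairs as 'SPEAKER: line', one turn per line."""
    return "\n".join(f"{speaker.upper()}: {line.strip()}" for speaker, line in turns)
```

For example, `processor(as_song("In the jungle, the mighty jungle, the lion barks tonight"))` builds the same prompt as the first cell below.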

In [31]:
inputs = processor("♪ In the jungle, the mighty jungle, the lion barks tonight ♪")

# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)



In [32]:
# more advanced prompts!

text_prompt = """
    WOMAN: I would like an oatmilk latte please.
    MAN: Wow, that's expensive!
"""

inputs = processor(text_prompt)

# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)



## Conclusion

Bark is a versatile model. Try it out to discover more of its capabilities and limitations!