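## Notebook 4: Generating the Podcast

Now that our script is fully prepared, we can generate the podcast audio.

In this notebook, we first learn how to generate audio with the [lucasnewman/f5-tts-mlx](https://huggingface.co/lucasnewman/f5-tts-mlx) model.

Then we use the content processed in Notebook 3 to generate our podcast audio.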

In [1]:
import IPython.display as ipd
from f5_tts_mlx.generate import generate, SAMPLE_RATE
from tqdm import tqdm


### Testing audio generation

Let's generate some audio with the model to see how it works.

#### F5 TTS MLX model

First, generate a short audio clip with the model.

In [2]:
TEST_AUDIO_FILE = "./resources/f5_tts_mlx_test_audio.wav"
MODEL = "lucasnewman/f5-tts-mlx"

In [3]:
# Define the text to synthesize
text_prompt = """静夜思
唐·李白
床前明月光，
疑是地上霜。
举头望明月，
低头思故乡。
"""

generate(
    generation_text=text_prompt,
    model_name=MODEL,
    output_path=TEST_AUDIO_FILE,
    speed=0.5  # the default is 0.8, which makes Chinese speech very fast; 0.5 slows it down
)

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/94/vx2ydz_56c7_v5lml34d4g9w0000gn/T/jieba.cache


Got reference audio with duration: 5.33 seconds


Loading model cost 0.421 seconds.
Prefix dict has been built successfully.


Got duration of 1288 frames (6.9285688400268555 secs) for generated speech.
Generated speech in 0:00:19.757508
Generated 8.44 seconds of audio in 0:00:19.757712.
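
The last log line reports both the length of the generated audio and the wall-clock time spent producing it. Their ratio (often called the real-time factor) gives a rough idea of how long a full episode will take. The sketch below is a minimal illustration that parses a line in the format shown above; the helper name `real_time_factor` and the regular expression are assumptions made for this example, not part of f5-tts-mlx.

```python
import re

# Matches lines such as "Generated 8.44 seconds of audio in 0:00:19.757712."
LOG_PATTERN = re.compile(
    r"Generated (\d+(?:\.\d+)?) seconds of audio in (\d+):(\d+):(\d+(?:\.\d+)?)"
)

def real_time_factor(log_line):
    """Return wall-clock seconds spent per second of generated audio."""
    match = LOG_PATTERN.search(log_line)
    if match is None:
        raise ValueError("unrecognized log line")
    audio_seconds = float(match.group(1))
    hours = int(match.group(2))
    minutes = int(match.group(3))
    seconds = float(match.group(4))
    wall_seconds = hours * 3600 + minutes * 60 + seconds
    return wall_seconds / audio_seconds

line = "Generated 8.44 seconds of audio in 0:00:19.757712."
rtf = real_time_factor(line)
print(f"real-time factor: {rtf:.2f}")
```

A value above 1 means generation is slower than playback, which is the case for the run logged here.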


In [4]:
# Play the audio
ipd.Audio(TEST_AUDIO_FILE, rate=SAMPLE_RATE)

## Putting it together: making the podcast

Now let's generate the podcast audio for real.

In [5]:
import pickle

with open('./resources/podcast_ready_data.pkl', 'rb') as file:
    PODCAST_TEXT = pickle.load(file)

In [6]:
outputs = []
speed = 0.5

Audio generation function for Speaker 1

In [7]:
def generate_speaker1_audio(text, output_path):
    generate(
        generation_text=text,
        model_name=MODEL,
        output_path=output_path,
        speed=speed
    )

Audio generation function for Speaker 2

In [8]:
def generate_speaker2_audio(text, output_path):
    generate(
        generation_text=text,
        model_name=MODEL,
        output_path=output_path,
        ref_audio_path="../resources/test_en_2_ref_short.wav",
        ref_audio_text="Some call me nature, others call me mother nature.",
        speed=speed
    )

In [9]:
PODCAST_TEXT

'[\n    ("Speaker 1", "欢迎收听我们的播客，让我们深入探讨一个充满魔法和奇迹的话题：知识蒸馏。我是你们的引导者，今天我们将讨论如何让开源的大规模语言模型如LLaMa和Mistral变得更强壮、更聪明。我们将会用一些有趣的比喻和生动的例子来解释这一复杂的主题。"),\n    ("Speaker 2", "哦，听起来好神奇啊！什么是知识蒸馏呢？"),\n    ("Speaker 1", "哈哈，你很兴奋嘛！知识蒸馏就像是给一个初学者魔术师提供了一份详细的表演指南。具体来说，就是如何将一个高级模型的知识和技巧倾囊相授，让一个低级模型也能像高级模型一样出色。想象一下，如果一个顶级魔术大师(Jo RooGan模型)把他的所有技巧传授给一个初学者(JaLaMa模型)，这个初学者能否学会用最专业的方式表演魔术。"),\n    ("Speaker 2", "嗯，这听起来就像是在给初学者一个详细的指南，而不是直接告诉他怎么做。那高级模型的技巧如何被传授给初学者呢？"),\n    ("Speaker 1", "没错，这就像是老师在旁边一步步示范教学。高级模型会通过一系列的任务和问题来训练初学者。就像一个高级魔术师在表演过程中，会不断展示各种技巧和手法，让初学者学习如何通过一系列的动作完成复杂的表演。"),\n    ("Speaker 2", "哈哈，就像是在看魔术表演学习一样！那这种知识转移对初学者模型有什么具体帮助呢？"),\n    ("Speaker 1", "很好，你已经触到了关键之处！通过这些一步步的学习，初学者模型可以说在理解能力和应对复杂任务的能力上都有了显著提升。就像是初学者慢慢从只能做简单的魔术，到能够表演最复杂的魔术一样。"),\n    ("Speaker 2", "那如果我做一个类比，高级魔术师的表演就像是知识蒸馏的过程，而初学者的训练就像是提高了数据增强吗？"),\n    ("Speaker 1", "哈哈，你又洞察到了关键点。数据增强不仅仅是添加更多的训练数据，而更多是通过知识蒸馏，让模型理解和掌握更深层次的知识。就像一个高级魔术师通过不断的实践，学习各种技巧，然后通过这些技巧来完成最复杂的表演。"),\n    ("Speaker 2", "那么，知识蒸馏的过程对于实现实体例的突变，是不是就像是高级魔术师使用魔法道具一样强

Now we parse the text into a list of (speaker, text) tuples.

In [10]:
import ast

ast.literal_eval(PODCAST_TEXT)

[('Speaker 1',
  '欢迎收听我们的播客，让我们深入探讨一个充满魔法和奇迹的话题：知识蒸馏。我是你们的引导者，今天我们将讨论如何让开源的大规模语言模型如LLaMa和Mistral变得更强壮、更聪明。我们将会用一些有趣的比喻和生动的例子来解释这一复杂的主题。'),
 ('Speaker 2', '哦，听起来好神奇啊！什么是知识蒸馏呢？'),
 ('Speaker 1',
  '哈哈，你很兴奋嘛！知识蒸馏就像是给一个初学者魔术师提供了一份详细的表演指南。具体来说，就是如何将一个高级模型的知识和技巧倾囊相授，让一个低级模型也能像高级模型一样出色。想象一下，如果一个顶级魔术大师(Jo RooGan模型)把他的所有技巧传授给一个初学者(JaLaMa模型)，这个初学者能否学会用最专业的方式表演魔术。'),
 ('Speaker 2', '嗯，这听起来就像是在给初学者一个详细的指南，而不是直接告诉他怎么做。那高级模型的技巧如何被传授给初学者呢？'),
 ('Speaker 1',
  '没错，这就像是老师在旁边一步步示范教学。高级模型会通过一系列的任务和问题来训练初学者。就像一个高级魔术师在表演过程中，会不断展示各种技巧和手法，让初学者学习如何通过一系列的动作完成复杂的表演。'),
 ('Speaker 2', '哈哈，就像是在看魔术表演学习一样！那这种知识转移对初学者模型有什么具体帮助呢？'),
 ('Speaker 1',
  '很好，你已经触到了关键之处！通过这些一步步的学习，初学者模型可以说在理解能力和应对复杂任务的能力上都有了显著提升。就像是初学者慢慢从只能做简单的魔术，到能够表演最复杂的魔术一样。'),
 ('Speaker 2', '那如果我做一个类比，高级魔术师的表演就像是知识蒸馏的过程，而初学者的训练就像是提高了数据增强吗？'),
 ('Speaker 1',
  '哈哈，你又洞察到了关键点。数据增强不仅仅是添加更多的训练数据，而更多是通过知识蒸馏，让模型理解和掌握更深层次的知识。就像一个高级魔术师通过不断的实践，学习各种技巧，然后通过这些技巧来完成最复杂的表演。'),
 ('Speaker 2', '那么，知识蒸馏的过程对于实现实体例的突变，是不是就像是高级魔术师使用魔法道具一样强大？'),
 ('Speaker 1',
  '简直就是！知识蒸馏
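
Because the script comes from a language model, it can help to validate its shape before spending minutes on synthesis. The following is a minimal sketch of the parsing step above with basic checks added; `parse_script` and the sample string are hypothetical names introduced for illustration, and only `ast.literal_eval` comes from the notebook.

```python
import ast

def parse_script(raw_text):
    """Parse a script string into a list of (speaker, text) tuples.

    ast.literal_eval only evaluates Python literals, so it does not
    execute arbitrary code embedded in the text.
    """
    parsed = ast.literal_eval(raw_text)
    if not isinstance(parsed, list):
        raise ValueError("expected a list of (speaker, text) tuples")
    segments = []
    for item in parsed:
        is_pair = isinstance(item, tuple) and len(item) == 2
        if not (is_pair and all(isinstance(part, str) for part in item)):
            raise ValueError(f"malformed segment: {item!r}")
        speaker, text = item
        segments.append((speaker.strip(), text.strip()))
    return segments

# Hypothetical two-line script in the same format as PODCAST_TEXT
sample = '[("Speaker 1", "Welcome to the show."), ("Speaker 2", "Glad to be here!")]'
segments = parse_script(sample)
print(segments)
```

Failing early on a malformed entry is cheaper than discovering it halfway through the generation loop.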

#### Generating the final podcast audio

Finally, we iterate over the list and use our helper functions to generate the audio for each segment.

In [11]:
segments = ast.literal_eval(PODCAST_TEXT)

# Number the segment files from 1 so they can be merged in order later
for i, (speaker, text) in enumerate(
    tqdm(segments, desc="Generating podcast segments", unit="segment"), start=1
):
    output_path = f"./resources/segments/_podcast_segment_{i}.wav"
    if speaker == "Speaker 1":
        generate_speaker1_audio(text, output_path)
    else:  # Speaker 2
        generate_speaker2_audio(text, output_path)

Generating podcast segments:   0%|          | 0/25 [00:00<?, ?segment/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 5.33 seconds
Got duration of 2219 frames (11.933658599853516 secs) for generated speech.


Generating podcast segments:   4%|▍         | 1/25 [00:38<15:34, 38.94s/segment]

Generated speech in 0:00:37.576497
Generated 18.37 seconds of audio in 0:00:37.576647.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 3.94 seconds
Got duration of 934 frames (5.024754047393799 secs) for generated speech.


Generating podcast segments:   8%|▊         | 2/25 [00:54<09:35, 25.04s/segment]

Generated speech in 0:00:13.924289
Generated 6.06 seconds of audio in 0:00:13.924451.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 5.33 seconds
Got duration of 2398 frames (12.89692211151123 secs) for generated speech.


Generating podcast segments:  12%|█▏        | 3/25 [01:37<12:15, 33.45s/segment]

Generated speech in 0:00:41.770920
Generated 20.28 seconds of audio in 0:00:41.771136.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 3.94 seconds
Got duration of 1237 frames (6.655817985534668 secs) for generated speech.


Generating podcast segments:  16%|█▌        | 4/25 [01:57<09:48, 28.01s/segment]

Generated speech in 0:00:18.527986
Generated 9.29 seconds of audio in 0:00:18.528189.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 5.33 seconds
Got duration of 1810 frames (9.73123550415039 secs) for generated speech.


Generating podcast segments:  20%|██        | 5/25 [02:28<09:41, 29.07s/segment]

Generated speech in 0:00:30.006554
Generated 14.01 seconds of audio in 0:00:30.006716.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 3.94 seconds
Got duration of 1112 frames (5.980731964111328 secs) for generated speech.


Generating podcast segments:  24%|██▍       | 6/25 [02:45<07:56, 25.07s/segment]

Generated speech in 0:00:16.184001
Generated 7.96 seconds of audio in 0:00:16.184171.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 5.33 seconds
Got duration of 1808 frames (9.722575187683105 secs) for generated speech.


Generating podcast segments:  28%|██▊       | 7/25 [03:15<08:01, 26.74s/segment]

Generated speech in 0:00:28.804278
Generated 13.98 seconds of audio in 0:00:28.804446.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 3.94 seconds
Got duration of 1157 frames (6.22338342666626 secs) for generated speech.


Generating podcast segments:  32%|███▏      | 8/25 [03:34<06:48, 24.04s/segment]

Generated speech in 0:00:17.207085
Generated 8.44 seconds of audio in 0:00:17.207246.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 5.33 seconds
Got duration of 1868 frames (10.044198036193848 secs) for generated speech.


Generating podcast segments:  36%|███▌      | 9/25 [04:05<07:02, 26.38s/segment]

Generated speech in 0:00:30.297740
Generated 14.62 seconds of audio in 0:00:30.297942.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 3.94 seconds
Got duration of 1159 frames (6.233282566070557 secs) for generated speech.


Generating podcast segments:  40%|████      | 10/25 [04:23<05:57, 23.81s/segment]

Generated speech in 0:00:16.929675
Generated 8.46 seconds of audio in 0:00:16.929840.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 5.33 seconds
Got duration of 2036 frames (10.947996139526367 secs) for generated speech.


Generating podcast segments:  44%|████▍     | 11/25 [04:58<06:20, 27.16s/segment]

Generated speech in 0:00:33.253204
Generated 16.42 seconds of audio in 0:00:33.253465.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 3.94 seconds
Got duration of 1374 frames (7.38759708404541 secs) for generated speech.


Generating podcast segments:  48%|████▊     | 12/25 [05:20<05:31, 25.49s/segment]

Generated speech in 0:00:20.388493
Generated 10.75 seconds of audio in 0:00:20.388642.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 5.33 seconds
Got duration of 2049 frames (11.019887924194336 secs) for generated speech.


Generating podcast segments:  52%|█████▏    | 13/25 [05:55<05:42, 28.54s/segment]

Generated speech in 0:00:34.015169
Generated 16.56 seconds of audio in 0:00:34.015357.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 3.94 seconds
Got duration of 1187 frames (6.386105537414551 secs) for generated speech.


Generating podcast segments:  56%|█████▌    | 14/25 [06:13<04:39, 25.42s/segment]

Generated speech in 0:00:17.030449
Generated 8.76 seconds of audio in 0:00:17.030613.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 5.33 seconds
Got duration of 2338 frames (12.575190544128418 secs) for generated speech.


Generating podcast segments:  60%|██████    | 15/25 [06:55<05:02, 30.29s/segment]

Generated speech in 0:00:40.403572
Generated 19.64 seconds of audio in 0:00:40.403723.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 3.94 seconds
Got duration of 1317 frames (7.082666873931885 secs) for generated speech.


Generating podcast segments:  64%|██████▍   | 16/25 [07:16<04:07, 27.51s/segment]

Generated speech in 0:00:19.238829
Generated 10.14 seconds of audio in 0:00:19.238987.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 5.33 seconds
Got duration of 1690 frames (9.087151527404785 secs) for generated speech.


Generating podcast segments:  68%|██████▊   | 17/25 [07:44<03:41, 27.64s/segment]

Generated speech in 0:00:26.592357
Generated 12.73 seconds of audio in 0:00:26.592510.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 3.94 seconds
Got duration of 1243 frames (6.68318510055542 secs) for generated speech.


Generating podcast segments:  72%|███████▏  | 18/25 [08:04<02:56, 25.22s/segment]

Generated speech in 0:00:18.099418
Generated 9.35 seconds of audio in 0:00:18.099561.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 5.33 seconds
Got duration of 1843 frames (9.911506652832031 secs) for generated speech.


Generating podcast segments:  76%|███████▌  | 19/25 [08:34<02:40, 26.79s/segment]

Generated speech in 0:00:29.253455
Generated 14.36 seconds of audio in 0:00:29.253657.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 3.94 seconds
Got duration of 1258 frames (6.767056941986084 secs) for generated speech.


Generating podcast segments:  80%|████████  | 20/25 [08:54<02:04, 24.87s/segment]

Generated speech in 0:00:18.590813
Generated 9.51 seconds of audio in 0:00:18.591068.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 5.33 seconds
Got duration of 1855 frames (9.973958015441895 secs) for generated speech.


Generating podcast segments:  84%|████████▍ | 21/25 [09:26<01:47, 26.86s/segment]

Generated speech in 0:00:30.247227
Generated 14.49 seconds of audio in 0:00:30.247388.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 3.94 seconds
Got duration of 985 frames (5.298693656921387 secs) for generated speech.


Generating podcast segments:  88%|████████▊ | 22/25 [09:42<01:11, 23.73s/segment]

Generated speech in 0:00:15.166220
Generated 6.60 seconds of audio in 0:00:15.166424.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 5.33 seconds
Got duration of 1680 frames (9.037110328674316 secs) for generated speech.


Generating podcast segments:  92%|█████████▏| 23/25 [10:09<00:49, 24.75s/segment]

Generated speech in 0:00:26.055403
Generated 12.62 seconds of audio in 0:00:26.055567.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 3.94 seconds
Got duration of 1141 frames (6.1369948387146 secs) for generated speech.


Generating podcast segments:  96%|█████████▌| 24/25 [10:27<00:22, 22.57s/segment]

Generated speech in 0:00:16.244240
Generated 8.27 seconds of audio in 0:00:16.244391.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Got reference audio with duration: 5.33 seconds
Got duration of 1512 frames (8.129292488098145 secs) for generated speech.


Generating podcast segments: 100%|██████████| 25/25 [10:51<00:00, 26.06s/segment]

Generated speech in 0:00:22.913903
Generated 10.83 seconds of audio in 0:00:22.914062.





In [18]:
# Merge the audio segments in ./resources/segments
import re
import os
import numpy as np
import soundfile as sf

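# Keep only files that match the segment naming pattern, sorted by segment number
segment_pattern = re.compile(r'segment_(\d+)\.wav$')
audio_files = sorted(
    (f"./resources/segments/{file}" for file in os.listdir("./resources/segments")
     if segment_pattern.search(file)),
    key=lambda x: int(segment_pattern.search(x).group(1)),
)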

print("audio_files -> ", audio_files)

audio_data = []
for file in audio_files:
    data, rate = sf.read(file)
    audio_data.append(data)

audio_data = np.concatenate(audio_data)

audio_files ->  ['./resources/segments/_podcast_segment_1.wav', './resources/segments/_podcast_segment_2.wav', './resources/segments/_podcast_segment_3.wav', './resources/segments/_podcast_segment_4.wav', './resources/segments/_podcast_segment_5.wav', './resources/segments/_podcast_segment_6.wav', './resources/segments/_podcast_segment_7.wav', './resources/segments/_podcast_segment_8.wav', './resources/segments/_podcast_segment_9.wav', './resources/segments/_podcast_segment_10.wav', './resources/segments/_podcast_segment_11.wav', './resources/segments/_podcast_segment_12.wav', './resources/segments/_podcast_segment_13.wav', './resources/segments/_podcast_segment_14.wav', './resources/segments/_podcast_segment_15.wav', './resources/segments/_podcast_segment_16.wav', './resources/segments/_podcast_segment_17.wav', './resources/segments/_podcast_segment_18.wav', './resources/segments/_podcast_segment_19.wav', './resources/segments/_podcast_segment_20.wav', './resources/segments/_podcast_s
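
The merge above joins the segments back to back. A variation worth trying is to check that every segment shares the same sample rate and to insert a short pause between speakers. The sketch below illustrates this with synthetic arrays; `merge_segments`, the 0.3 s pause, and the 24 kHz rate are assumptions chosen for the example, not values taken from the notebook, and mono (1-D) audio is assumed.

```python
import numpy as np

def merge_segments(segments, sample_rate, pause_seconds=0.3):
    """Concatenate mono audio arrays with a short silence between them.

    `segments` is a list of (data, rate) pairs, for example the values
    returned by soundfile.read for each segment file.
    """
    pause = np.zeros(int(round(sample_rate * pause_seconds)), dtype=np.float32)
    pieces = []
    for index, (data, rate) in enumerate(segments):
        if rate != sample_rate:
            raise ValueError(
                f"segment {index} has rate {rate}, expected {sample_rate}"
            )
        if index > 0:
            pieces.append(pause)
        pieces.append(np.asarray(data, dtype=np.float32))
    if not pieces:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(pieces)

# Synthetic stand-ins for two decoded segments (1 s and 0.5 s of samples)
rate = 24_000
first = np.zeros(rate, dtype=np.float32)
second = np.ones(rate // 2, dtype=np.float32)
merged = merge_segments([(first, rate), (second, rate)], sample_rate=rate)
print(f"merged duration: {len(merged) / rate:.2f} s")
```

Checking the rate before concatenating avoids silently producing audio that plays at the wrong speed if a segment was generated with different settings.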

### Exporting the podcast audio

Now we save it as a WAV file.

In [19]:
sf.write("./resources/_podcast.wav", audio_data, SAMPLE_RATE)

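### Final suggestions

- Modify the prompt: try changing SYSTEM_PROMPT so that it generates the style you want.
- Try extending the script to more speakers.
- Try other TTS models.
- Try adding a speech enhancement model as a fifth step.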
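
One possible way to support more speakers is to replace the per-speaker helper functions with a lookup table of voice settings. The sketch below is illustrative only: `VOICES`, `synthesize_segment`, and the reference paths and transcripts are hypothetical placeholders, while the keyword names (`generation_text`, `output_path`, `speed`, `ref_audio_path`, `ref_audio_text`) mirror the `generate` calls used earlier in this notebook. The generation function is passed in as a parameter so the dispatch logic can be tried without loading the model.

```python
# Hypothetical voice table: speaker label -> extra keyword arguments.
# The reference paths and transcripts are placeholders, not files
# shipped with the notebook.
VOICES = {
    "Speaker 1": {},  # no overrides: use the model's default reference voice
    "Speaker 2": {
        "ref_audio_path": "./resources/speaker2_ref.wav",
        "ref_audio_text": "Transcript of the Speaker 2 reference clip.",
    },
    "Speaker 3": {
        "ref_audio_path": "./resources/speaker3_ref.wav",
        "ref_audio_text": "Transcript of the Speaker 3 reference clip.",
    },
}

def synthesize_segment(generate_fn, speaker, text, output_path, speed=0.5):
    """Call generate_fn with the voice settings registered for `speaker`."""
    if speaker not in VOICES:
        raise KeyError(f"no voice configured for {speaker!r}")
    generate_fn(
        generation_text=text,
        output_path=output_path,
        speed=speed,
        **VOICES[speaker],
    )

# Dry run with a stand-in that records its arguments instead of synthesizing audio
calls = []
synthesize_segment(
    lambda **kwargs: calls.append(kwargs), "Speaker 3", "Hello!", "segment_1.wav"
)
print(calls[0])
```

In the notebook, `generate_fn` could be something like `functools.partial(generate, model_name=MODEL)`. An unknown speaker label raises an error instead of silently falling back to another voice.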

In [14]:
#fin