## Notebook 4: TTS Workflow

The podcast transcript is now ready, so we can generate the audio for the podcast.

In this notebook, we will first learn how to generate audio using the [lucasnewman/f5-tts-mlx](https://huggingface.co/lucasnewman/f5-tts-mlx) model.

After that, we will use the output from Notebook 3 to generate our complete podcast.

In [3]:
import IPython.display as ipd
from tqdm import tqdm
import requests



## Bringing it together: Making the Podcast

Now that we understand the individual pieces, we can use the complete pipeline to generate the entire podcast.

Let's load in our pickle file from earlier and proceed:

In [5]:
import pickle

with open('./resources/podcast_ready_data.pkl', 'rb') as file:
    PODCAST_TEXT = pickle.load(file)

We will collect the generated audio segments, along with their sampling rates, since we need both to assemble the final audio.

In [6]:
outputs = []

Function to generate audio for Speaker 1:

In [7]:
def generate_speaker1_audio_web(text, output_path):
    response = requests.post(url='http://localhost:8080/v1/audio/speech_gpt',json={'input':text,'speaker':'speaker1'})
    # Fail loudly on an HTTP error instead of saving the error body as a .wav file
    response.raise_for_status()
    with open(output_path, 'wb') as file:
        file.write(response.content)

Function to generate audio for Speaker 2:

In [8]:
def generate_speaker2_audio_web(text, output_path):
    response = requests.post(url='http://localhost:8080/v1/audio/speech_gpt',json={'input':text,'speaker':'speaker2'})
    # Fail loudly on an HTTP error instead of saving the error body as a .wav file
    response.raise_for_status()
    with open(output_path, 'wb') as file:
        file.write(response.content)
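The two functions above differ only in the `speaker` value, so they could be folded into one. The following is a minimal sketch, not part of the original notebook: the name `generate_speaker_audio_web`, the `post` parameter (which lets a stub stand in for `requests.post`), and the 300-second `timeout` are all assumptions chosen for illustration.

```python
TTS_URL = "http://localhost:8080/v1/audio/speech_gpt"  # endpoint used in the cells above


def generate_speaker_audio_web(text, speaker, output_path, post=None, timeout=300):
    """Request speech for `text` in the given speaker voice and save it to `output_path`.

    `post` and `timeout` are hypothetical parameters added for this sketch.
    """
    if post is None:
        import requests  # imported lazily so a stub can be passed in instead
        post = requests.post
    response = post(url=TTS_URL, json={"input": text, "speaker": speaker}, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors before writing the file
    with open(output_path, "wb") as file:
        file.write(response.content)
```

With this helper, a call such as `generate_speaker_audio_web(text, "speaker1", output_path)` would do the same work as `generate_speaker1_audio_web(text, output_path)`.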

Let's inspect the transcript we loaded from the pickle file:

In [9]:
PODCAST_TEXT

'[\n    ("Speaker 1", "Alright folks, welcome to our podcast, where we dive deep into the cutting-edge world of Large Language Models (LLMs) and knowledge distillation. Today, we\'re exploring how we can enhance the capabilities of smaller models by transferring knowledge from larger, proprietary models. Imagine if we could make your humble assistant bot as smart as the mighty GPT-4! That\'s what this talk is all about. So, let\'s get started!"),\n    ("Speaker 2", "Wow, that sounds amazing! So, are we talking about making open-source models smarter just like the fancy proprietary ones?"),\n    ("Speaker 1", "Exactly! You got it. Open-source models are incredibly accessible, but they often don\'t have the depth and breadth of knowledge that proprietary models have. So, our goal is to bridge that gap and make them more powerful without needing to pay the hefty price tag."),\n    ("Speaker 2", "That\'s awesome! How do we do that? Is it like transferring skills or something?"),\n    ("Spe

We often argue that data structures aren't very useful in everyday work. This time, however, the knowledge comes in handy.

The pickle file holds the transcript as a single string, so we will parse it into a list of `(speaker, text)` tuples with the help of `ast.literal_eval()`.
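As a small standalone illustration of what `ast.literal_eval()` does here, the sketch below parses a made-up two-line transcript. The sample string is invented for this example and is not taken from the pickle file.

```python
import ast

# A made-up transcript in the same shape as PODCAST_TEXT: a string holding a list of tuples
raw = '[("Speaker 1", "Welcome to the show!"), ("Speaker 2", "Glad to be here.")]'

# literal_eval only accepts Python literals (strings, numbers, tuples, lists, ...),
# so unlike eval() it will not execute arbitrary code found in the string
segments = ast.literal_eval(raw)

for speaker, text in segments:
    print(f"{speaker}: {text}")
```

Because the result is an ordinary list of tuples, each entry can be unpacked directly into a speaker label and the line to be spoken.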

In [10]:
import ast

ast.literal_eval(PODCAST_TEXT)

[('Speaker 1',
  "Alright folks, welcome to our podcast, where we dive deep into the cutting-edge world of Large Language Models (LLMs) and knowledge distillation. Today, we're exploring how we can enhance the capabilities of smaller models by transferring knowledge from larger, proprietary models. Imagine if we could make your humble assistant bot as smart as the mighty GPT-4! That's what this talk is all about. So, let's get started!"),
 ('Speaker 2',
  'Wow, that sounds amazing! So, are we talking about making open-source models smarter just like the fancy proprietary ones?'),
 ('Speaker 1',
  "Exactly! You got it. Open-source models are incredibly accessible, but they often don't have the depth and breadth of knowledge that proprietary models have. So, our goal is to bridge that gap and make them more powerful without needing to pay the hefty price tag."),
 ('Speaker 2',
  "That's awesome! How do we do that? Is it like transferring skills or something?"),
 ('Speaker 1',
  "Yes, it'

#### Generating the Final Podcast

Finally, we can loop over the list of tuples and use our helper functions to generate the audio for each segment:

In [None]:
segments = ast.literal_eval(PODCAST_TEXT)

# Number the segments from 1 so the files can later be sorted into playback order
for i, (speaker, text) in enumerate(tqdm(segments, desc="Generating podcast segments", unit="segment"), start=1):
    output_path = f"./resources/segments/_podcast_segment_{i}.wav"
    if speaker == "Speaker 1":
        generate_speaker1_audio_web(text, output_path)
    else:  # Speaker 2
        generate_speaker2_audio_web(text, output_path)
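The loop above writes into `./resources/segments`, which has to exist before the first file is opened. A minimal sketch of a path helper that creates the folder on demand is shown below; `segment_path` is a hypothetical name introduced here, not something defined elsewhere in the notebooks.

```python
from pathlib import Path


def segment_path(index, directory="./resources/segments"):
    """Return the WAV path for segment `index`, creating the folder if it is missing."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
    return folder / f"_podcast_segment_{index}.wav"
```

The returned `Path` can be passed wherever a string path is accepted by `open()`.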

In [1]:
# Combine the segments in ./resources/segments
import re
import os
import numpy as np
import soundfile as sf

# Sort by the numeric segment index so that segment_10 follows segment_9, not segment_1
audio_files = sorted([f"./resources/segments/{file}" for file in os.listdir("./resources/segments")],
                     key=lambda x: int(re.search(r'segment_(\d+)\.wav', x).group(1)))

print("audio_files -> ", audio_files)
audio_data = []
for file in audio_files:
    data, rate = sf.read(file)  # rate is the sample rate of this segment
    audio_data.append(data)

# Join all segments into one continuous array
audio_data = np.concatenate(audio_data)

audio_files ->  ['./resources/segments/_podcast_segment_1.wav', './resources/segments/_podcast_segment_2.wav', './resources/segments/_podcast_segment_3.wav', './resources/segments/_podcast_segment_4.wav', './resources/segments/_podcast_segment_5.wav', './resources/segments/_podcast_segment_6.wav', './resources/segments/_podcast_segment_7.wav', './resources/segments/_podcast_segment_8.wav', './resources/segments/_podcast_segment_9.wav', './resources/segments/_podcast_segment_10.wav', './resources/segments/_podcast_segment_11.wav', './resources/segments/_podcast_segment_12.wav', './resources/segments/_podcast_segment_13.wav', './resources/segments/_podcast_segment_14.wav', './resources/segments/_podcast_segment_15.wav', './resources/segments/_podcast_segment_16.wav', './resources/segments/_podcast_segment_17.wav', './resources/segments/_podcast_segment_18.wav', './resources/segments/_podcast_segment_19.wav', './resources/segments/_podcast_segment_20.wav', './resources/segments/_podcast_s
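The numeric sort key matters because a plain string sort would place segment 10 before segment 2. The sketch below demonstrates the difference on three sample file names; `segment_index` is a hypothetical helper that wraps the same regular expression used in the cell above.

```python
import re


def segment_index(path):
    """Extract the integer segment number from a file name like _podcast_segment_7.wav."""
    match = re.search(r"segment_(\d+)\.wav$", path)
    if match is None:
        raise ValueError(f"not a segment file: {path}")
    return int(match.group(1))


names = ["_podcast_segment_10.wav", "_podcast_segment_2.wav", "_podcast_segment_1.wav"]

lexicographic = sorted(names)               # string order: 1, 10, 2
numeric = sorted(names, key=segment_index)  # playback order: 1, 2, 10

print(lexicographic)
print(numeric)
```

Raising an error on a non-matching name also makes stray files in the folder easier to spot than the `AttributeError` that `re.search(...).group(1)` would produce.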

### Output the Podcast

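We can now save the combined audio as a WAV file. `SAMPLE_RATE` must match the sample rate of the segment files; otherwise the podcast will play back at the wrong speed and pitch.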

In [4]:
SAMPLE_RATE = 32000
sf.write("./resources/_podcast.wav", audio_data, SAMPLE_RATE)
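The sample rate above is hard-coded, while `sf.read(file)` already returns each segment's rate as its second value. One way to guard against a mismatch is to collect those rates and check that they agree before writing. The following is a sketch under that assumption; `common_sample_rate` is a hypothetical helper, not part of the original notebook.

```python
def common_sample_rate(rates):
    """Return the single sample rate shared by all segments, or raise if they differ."""
    unique = set(rates)
    if len(unique) != 1:
        raise ValueError(f"expected exactly one sample rate, found: {sorted(unique)}")
    return unique.pop()


# Example with made-up rates, as they might be collected from `data, rate = sf.read(file)`
print(common_sample_rate([24000, 24000, 24000]))
```

The returned value could then be passed to `sf.write` in place of a constant, so the output always matches what the TTS server actually produced.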

### Suggested Next Steps:

- Experiment with the prompts: try different values for `SYSTEM_PROMPT` in the notebooks
- Extend the workflow beyond two speakers
- Test other TTS models
- Experiment with speech enhancer models as a fifth step

In [None]:
#fin