# Podcast Voice Synthesis
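This notebook generates a podcast-style conversation with AI-generated voices: two Kokoro-82M voices read a scripted dialogue, and the resulting clips are stitched into a single WAV file.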

In [1]:
# Install dependencies. apt-get exists only on Debian/Ubuntu systems (for example Colab);
# on macOS, espeak has to be installed another way, such as Homebrew.
!pip install -q phonemizer torch transformers scipy munch pydub
!apt-get install -y espeak-ng
!apt-get install -y festival festvox-kallpc16k


[notice] A new release of pip is available: 24.3.1 -> 25.0.1
[notice] To update, run: pip install --upgrade pip
zsh:1: command not found: apt-get
zsh:1: command not found: apt-get
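
The two `command not found` lines show that `apt-get` is not available in this shell. A minimal sketch of choosing the install command by platform might look like the following. `pick_espeak_install` is a hypothetical helper, and the Homebrew formula name `espeak` is an assumption inferred from the Cellar path used later in this notebook.

```python
import shutil

def pick_espeak_install(available):
    """Return an install command for espeak based on which package managers exist.

    `available` is a set of executable names found on PATH.
    The command strings are assumptions, not verified on every platform.
    """
    if "apt-get" in available:
        return "apt-get install -y espeak-ng"  # Debian/Ubuntu, as in the cell above
    if "brew" in available:
        return "brew install espeak"           # assumed Homebrew formula name
    return None                                # unknown platform: install manually

# Detect the package managers actually present on this machine
found = {name for name in ("apt-get", "brew") if shutil.which(name)}
print(pick_espeak_install(found))
```

Passing the set of detected tools in as an argument keeps the decision logic separate from the machine it runs on, so it can be checked without either package manager installed.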


# Install Git LFS (Large File Storage)
!git lfs install

# Clone the model repository
!git clone https://huggingface.co/hexgrad/Kokoro-82M

# Change directory to the cloned model
%cd Kokoro-82M


In [10]:
!pip install kokoro==0.7.16


In [3]:
import os

# Set the environment variable
os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = "/opt/homebrew/Cellar/espeak/1.48.04_1/lib/libespeak.dylib"

# Verify if it's set correctly
print(os.environ.get("PHONEMIZER_ESPEAK_LIBRARY"))

/opt/homebrew/Cellar/espeak/1.48.04_1/lib/libespeak.dylib
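
The path above is tied to one espeak version, and printing the variable only confirms that it was set, not that the library exists. A hedged sketch of checking the file before exporting the variable is shown here. `first_existing` is a hypothetical helper, and the glob pattern is an assumption about how Homebrew lays out other espeak versions.

```python
import glob
import os

def first_existing(paths):
    """Return the first path that exists on disk, or None."""
    for path in paths:
        if os.path.exists(path):
            return path
    return None

# The path from the cell above, followed by any other version under the Cellar (assumed layout)
candidates = ["/opt/homebrew/Cellar/espeak/1.48.04_1/lib/libespeak.dylib"]
candidates += sorted(glob.glob("/opt/homebrew/Cellar/espeak/*/lib/libespeak.dylib"))

library = first_existing(candidates)
if library is None:
    print("libespeak not found; PHONEMIZER_ESPEAK_LIBRARY left unchanged")
else:
    os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = library
    print(library)
```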


In [1]:
import torch
import numpy as np
from pydub import AudioSegment
from IPython.display import display, Audio
# generate() is expected to come from kokoro.py in the cloned Kokoro-82M directory (run this cell from there)
from kokoro import generate
from phonemizer.backend import EspeakBackend

ImportError: cannot import name 'generate' from 'kokoro' (c:\Users\Vamshi Krishna Gundu\AppData\Local\Programs\Python\Python311\Lib\site-packages\kokoro\__init__.py)
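
The traceback shows that `kokoro` resolved to a package in `site-packages`, which has no `generate`. One possible cause, an assumption rather than something this notebook confirms, is that the pip-installed `kokoro` package is found instead of a `kokoro.py` module in the cloned repository, for example because the kernel is not running from the `Kokoro-82M` directory. The sketch below uses `importlib.util.find_spec` to report which file a module name would be loaded from; `module_origin` is a hypothetical helper.

```python
import importlib.util

def module_origin(name):
    """Return the file a top-level module name would be imported from, or None."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec is not None else None

# Shows which file `import kokoro` would load in the current environment
print(module_origin("kokoro"))
```

If the printed path points into `site-packages` rather than the cloned directory, the import is picking up the installed package.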

In [7]:
# Check device
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load model
from models import build_model
MODEL = build_model('kokoro-v0_19.pth', device)

# Define voices
voice_list = [
    'af', 'af_bella', 'af_sarah', 'am_adam', 'am_michael',
    'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
    'af_nicole', 'af_sky'
]
voice_3 = voice_list[3]    # 'am_adam' (no script line uses this voice)
voice_2 = voice_list[2]    # 'af_sarah': first speaker
voice_10 = voice_list[10]  # 'af_sky': second speaker


In [8]:
# Load voices
voicepack_2 = torch.load(f'voices/{voice_2}.pt', weights_only=True).to(device)
voicepack_10 = torch.load(f'voices/{voice_10}.pt', weights_only=True).to(device)

# Define podcast script
script = [
("Hello everyone! Welcome to the Employee Platforms & Employee Experience podcast insights! We’re so excited to dive into something truly special today—a first-of-its-kind event! That’s right, we’re talking about the Employee Experience Expo 2025, hosted by JPMorgan Chase in Bengaluru on March 10th! Trust me, you don’t want to miss this!", voice_2),

("That’s right! And let me tell you—this event was huge! We had a staggering 3,500 attendees, gathered over 1,200 product feedback responses, and received more than 70 pieces of expo feedback. Just wow! What an incredible success!", voice_10),

("So, what made it such a big hit? Three major things: First, an eye-catching setup right in the courtyard—impossible to miss! Second, the showcased products were right there—visible, accessible, and super engaging! And third… come on, who doesn’t love goodies and giveaways? They definitely added to the fun!", voice_2),

("Absolutely! But you know what really had people talking? The passport and stamping concept—it was a game-changer! Seriously, such a cool and interactive idea. And let’s not forget the AI-related product showcases—there was so much buzz around them!", voice_10),

("Of course, even the best events can have some room for improvement. A couple of key takeaways? Better kiosk management to cut down on sound interference and, oh, definitely stronger pre-event marketing to get people even more excited before they arrive!", voice_2),

("Totally agree! But overall? This first edition of the Employee Experience Expo set the bar high! It’s laid the foundation for even bigger, better, and more unforgettable events in the future!", voice_10),

("And that’s a wrap! Thanks for tuning in to Employee Platforms & Employee Experience podcast insights. Stay tuned for more exciting updates—we’ll see you next time!", voice_2),
]
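
Each script entry pairs a line of dialogue with a voice name. One way to turn that name into the right voicepack is a dictionary lookup, which also fails loudly if a line names a voice that was never loaded. This is a minimal sketch: `resolve_voicepacks` is a hypothetical helper, and the placeholder strings stand in for the tensors returned by `torch.load`.

```python
def resolve_voicepacks(script, voicepacks):
    """Pair each script line with the voicepack for its voice name.

    Raises KeyError if a line names a voice that has not been loaded.
    """
    return [(text, voicepacks[voice]) for text, voice in script]

# Placeholder strings stand in for the tensors loaded with torch.load
demo_voicepacks = {"af_sarah": "<af_sarah tensor>", "af_sky": "<af_sky tensor>"}
demo_script = [("Hello everyone!", "af_sarah"), ("That's right!", "af_sky")]

for text, pack in resolve_voicepacks(demo_script, demo_voicepacks):
    print(pack, text)
```

A lookup keyed by name scales to more than two speakers without a chain of conditionals.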

In [9]:

# Initialize phonemizer dictionary
phonemizers = {'en-us': EspeakBackend('en-us')}

# Generate audio clips and transcript
audio_clips = []
transcript = []

# Debugging: Check available functions in MODEL
print("Available functions in MODEL:", dir(MODEL))

for idx, (text, voice) in enumerate(script):
    # Pick the voicepack for this line's speaker; only voicepack_2 and voicepack_10 are loaded
    voicepack = voicepack_2 if voice == voice_2 else voicepack_10
    print(f"Generating voice {idx}: {voice}")

    # Test if generate() returns valid audio
    test_audio = generate(MODEL, "Hello, this is a test.", voicepack)
    print(f"Output Type: {type(test_audio)}")
    if isinstance(test_audio, (list, tuple)):
        print(f"First Element Type: {type(test_audio[0])}")

    # Call generate and ensure audio output is correctly formatted
    audio = generate(MODEL, text, voicepack)

Available functions in MODEL: ['bert', 'bert_encoder', 'decoder', 'predictor', 'text_encoder']
Generating voice 0: af_sarah
Output Type: <class 'tuple'>
First Element Type: <class 'numpy.ndarray'>
Generating voice 1: af_sky
Output Type: <class 'tuple'>
First Element Type: <class 'numpy.ndarray'>
Generating voice 2: af_sarah
Output Type: <class 'tuple'>
First Element Type: <class 'numpy.ndarray'>
Generating voice 3: af_sky
Output Type: <class 'tuple'>
First Element Type: <class 'numpy.ndarray'>
Generating voice 4: af_sarah
Output Type: <class 'tuple'>
First Element Type: <class 'numpy.ndarray'>
Generating voice 5: af_sky
Output Type: <class 'tuple'>
First Element Type: <class 'numpy.ndarray'>
Generating voice 6: af_sarah
Output Type: <class 'tuple'>
First Element Type: <class 'numpy.ndarray'>


In [11]:
# Initialize phonemizer dictionary
phonemizers = {'en-us': EspeakBackend('en-us')}

# Generate audio clips and transcript
audio_clips = []
transcript = []

# Debugging: Check available functions in MODEL
print("Available functions in MODEL:", dir(MODEL))

for idx, (text, voice) in enumerate(script):
    # Pick the voicepack for this line's speaker; only voicepack_2 and voicepack_10 are loaded
    voicepack = voicepack_2 if voice == voice_2 else voicepack_10
    print(f"Generating voice {idx}: {voice}")

    # Call generate and extract audio from tuple output
    audio, phonetic_text = generate(MODEL, text, voicepack)  # Unpack tuple
    print(f"Phonetic transcription: {phonetic_text}")

    # Flatten the float samples and clip to [-1, 1] so the int16 conversion cannot overflow
    audio = np.clip(np.array(audio, dtype=np.float32).flatten(), -1.0, 1.0)

    transcript.append(f"Voice {idx % 2}: {text}")

    # Wrap the samples as 16-bit mono PCM at 24 kHz
    audio_clips.append(AudioSegment(
        (audio * 32767).astype(np.int16).tobytes(),
        frame_rate=24000,
        sample_width=2,
        channels=1
    ))

Available functions in MODEL: ['bert', 'bert_encoder', 'decoder', 'predictor', 'text_encoder']
Generating voice 0: af_sarah
Phonetic transcription: həlˈoʊ ˈɛvɹɪwˌʌn! wˈɛlkʌm tə ðɪ ɛmplˈɔɪiː plˈætfɔːɹmz ænd ɛmplˈɔɪiː ɛkspˈiəɹɪəns pˈɑːdkæst ˈɪnsaɪts! wɪɹ sˌoʊ ɛksˈaɪɾᵻd tə dˈaɪv ˌɪntʊ sˈʌmθɪŋ tɹˈuːli spˈɛʃəl tədˈeɪ—ɐ fˈɜːstʌvɪtskˈaɪnd ɪvˈɛnt! ðæts ɹˈaɪt, wɪɹ tˈɔːkɪŋ ɐbˌaʊt ðɪ ɛmplˈɔɪiː ɛkspˈiəɹɪəns ˈɛkspoʊ twˈɛnti twˈɛntifˈaɪv, hˈoʊstᵻd baɪ dʒˌeɪpˈiː mˈɔːɹɡən tʃˈeɪs ɪn bˈɛŋɡɐlˌʊɹuː ˌɑːn mˈɑːɹtʃ tˈɛnθ! tɹˈʌst mˌiː, juː dˈoʊnt wˈɑːnt tə mˈɪs ðˈɪs!
Generating voice 1: af_sky
Phonetic transcription: ðæts ɹˈaɪt! ænd lˈɛt mˌiː tˈɛl juː—ðɪs ɪvˈɛnt wʌz hjˈuːdʒ! wiː hɐd ɐ stˈæɡɚɹɪŋ θɹˈiː θˈaʊzənd fˈaɪvhˈʌndɹəd ɐtˈɛndiːz, ɡˈæðɚd ˌoʊvɚ wˈʌn θˈaʊzənd tˈuːhˈʌndɹəd pɹˈɑːdʌkt fˈiːdbæk ɹɪspˈɑːnsᵻz, ænd ɹɪsˈiːvd mˈoːɹ ðɐn sˈɛvənti pˈiːsᵻz ʌv ˈɛkspoʊ fˈiːdbæk. dʒˈʌst wˈaʊ! wˌʌt ɐn ɪnkɹˈɛdɪbəl səksˈɛs!
Generating voice 2: af_sarah
Phonetic transcription: sˈoʊ, wˌʌt mˌeɪd ɪt sˈʌtʃ ɐ bˈɪɡ hˈɪt? θɹˈiː mˈeɪdʒɚ 
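
The loop above turns float samples into 16-bit PCM by scaling them by 32767. A sample outside [-1, 1] would not fit in a signed 16-bit integer, so clipping before scaling is a reasonable safeguard. The sketch below shows the same conversion in pure Python; `float_to_pcm16` is a hypothetical helper, and it assumes the samples are meant to lie in [-1, 1].

```python
import struct

def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes.

    Out-of-range values are clipped before scaling so they cannot wrap around.
    """
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

pcm = float_to_pcm16([0.0, 0.5, -1.0, 2.0])
print(struct.unpack("<4h", pcm))  # (0, 16383, -32767, 32767)
```

The last input, 2.0, is clipped to 1.0 and maps to the maximum value instead of wrapping to a negative number.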

In [12]:
# Stitch audio files together
stitched_audio = sum(audio_clips)
stitched_audio.export("podcast.wav", format="wav")

<_io.BufferedRandom name='podcast.wav'>
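
`sum(audio_clips)` concatenates the clips in order, and `export` writes the result as a WAV file. For comparison, a similar result can be sketched with the standard-library `wave` module, assuming each clip is already raw 16-bit mono PCM at 24 kHz. `write_wav` is a hypothetical helper and `stitched_demo.wav` is an arbitrary file name.

```python
import wave

def write_wav(path, pcm_chunks, sample_rate=24000):
    """Concatenate 16-bit mono PCM chunks and write them to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        for chunk in pcm_chunks:
            wav.writeframes(chunk)

# Two short silent clips (100 and 200 samples) stand in for generated speech
clips = [b"\x00\x00" * 100, b"\x00\x00" * 200]
write_wav("stitched_demo.wav", clips)

with wave.open("stitched_demo.wav", "rb") as wav:
    print(wav.getnframes(), wav.getframerate())  # 300 24000
```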

In [13]:
# Display transcript
print("\nGenerated Podcast Transcript:")
for line in transcript:
    print(line)


Generated Podcast Transcript:
Voice 0: Hello everyone! Welcome to the Employee Platforms & Employee Experience podcast insights! We’re so excited to dive into something truly special today—a first-of-its-kind event! That’s right, we’re talking about the Employee Experience Expo 2025, hosted by JPMorgan Chase in Bengaluru on March 10th! Trust me, you don’t want to miss this!
Voice 1: That’s right! And let me tell you—this event was huge! We had a staggering 3,500 attendees, gathered over 1,200 product feedback responses, and received more than 70 pieces of expo feedback. Just wow! What an incredible success!
Voice 0: So, what made it such a big hit? Three major things: First, an eye-catching setup right in the courtyard—impossible to miss! Second, the showcased products were right there—visible, accessible, and super engaging! And third… come on, who doesn’t love goodies and giveaways? They definitely added to the fun!
Voice 1: Absolutely! But you know what really had people talking? T

In [14]:
# Play stitched audio
display(Audio("podcast.wav"))  # rate is only needed for raw arrays; a WAV file carries its own sample rate