## Multi-Accent and Multi-Lingual Voice Clone Demo with MeloTTS

In [21]:
import os
import torch
from openvoice import se_extractor
from openvoice.api import ToneColorConverter
from IPython.display import Audio, display

### Initialization

In this example, we will use the checkpoints from OpenVoiceV2. OpenVoiceV2 is trained with more aggressive augmentations and thus demonstrates better robustness in some cases.

In [22]:
ckpt_converter = 'checkpoints_v2/converter'
device = "cuda:0" if torch.cuda.is_available() else "cpu"
output_dir = 'outputs_v2'

tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

os.makedirs(output_dir, exist_ok=True)

Loaded checkpoint 'checkpoints_v2/converter/checkpoint.pth'
missing/unexpected keys: [] []


### Obtain Tone Color Embedding
We only extract the tone color embedding for the target speaker. The source tone color embeddings can be loaded directly from the `checkpoints_v2/base_speakers/ses` folder.
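
As a minimal sketch of how a speaker key maps to its embedding file, the helper below applies the same normalization the later cells use (lowercase, `_` replaced by `-`). The function names `source_se_path` and `available_source_ses` are hypothetical and not part of OpenVoice or MeloTTS; the folder path is the one used by the code cells in this notebook.

```python
from pathlib import Path

# Assumption: base-speaker embeddings live here, as in the cells of this notebook.
SES_DIR = Path("checkpoints_v2/base_speakers/ses")


def source_se_path(speaker_key: str, ses_dir: Path = SES_DIR) -> Path:
    """Map a MeloTTS speaker key to the path of its tone color embedding file."""
    # Speaker keys such as 'EN_INDIA' are stored as 'en-india.pth'.
    normalized = speaker_key.lower().replace("_", "-")
    return ses_dir / f"{normalized}.pth"


def available_source_ses(ses_dir: Path = SES_DIR) -> list[str]:
    """List the embedding names found on disk (empty if the folder is missing)."""
    if not ses_dir.is_dir():
        return []
    return sorted(p.stem for p in ses_dir.glob("*.pth"))
```

The returned path can then be passed to `torch.load(..., map_location=device)`, as the cells below do.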

In [23]:

reference_speaker = 'resources/demo_speaker0.mp3' # This is the voice you want to clone (use a male voice for male output)
target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, vad=True)

OpenVoice version: v2


Python(39574) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[(0.0, 19.278375)]
after vad: dur = 19.27798185941043


Python(39589) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(39590) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
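
The `huggingface/tokenizers` warning in the output above names the `TOKENIZERS_PARALLELISM` environment variable as one way to silence it. A minimal sketch, assuming it is acceptable to disable tokenizer parallelism for this demo, is to set the variable at the top of the notebook before the libraries that use tokenizers are imported:

```python
import os

# Must run before any library that uses Hugging Face tokenizers is imported or used.
# "false" disables tokenizer parallelism, which is what the warning falls back to anyway.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```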


#### Use MeloTTS as Base Speakers

MeloTTS is a high-quality multi-lingual text-to-speech library by MyShell.ai. It supports English (American, British, Indian, Australian, Default), Spanish, French, Chinese, Japanese, and Korean. In the following example, we will use the MeloTTS models as the base speakers.

In [24]:
from melo.api import TTS

texts = {
    'EN_NEWEST': "Did you ever hear a folk tale about a giant turtle?",  # The newest English base speaker model
}


src_path = f'{output_dir}/tmp.wav'

# Speed is adjustable
speed = 1.0

for language, text in texts.items():
    model = TTS(language=language, device=device)
    speaker_ids = model.hps.data.spk2id
    
    for speaker_key in speaker_ids.keys():
        speaker_id = speaker_ids[speaker_key]
        speaker_key = speaker_key.lower().replace('_', '-')
        
        source_se = torch.load(f'checkpoints_v2/base_speakers/ses/{speaker_key}.pth', map_location=device)
        if torch.backends.mps.is_available() and device == 'cpu':
            torch.backends.mps.is_available = lambda: False
        model.tts_to_file(text, speaker_id, src_path, speed=speed)
        save_path = f'{output_dir}/output_v2_{speaker_key}.wav'

        # Run the tone color converter
        encode_message = "@MyShell"
        tone_color_converter.convert(
            audio_src_path=src_path, 
            src_se=source_se, 
            tgt_se=target_se, 
            output_path=save_path,
            message=encode_message)

  WeightNorm.apply(module, name, dim)


 > Text split to sentences.
Did you ever hear a folk tale about a giant turtle?


100%|██████████| 1/1 [00:00<00:00,  1.05it/s]


In [30]:

from melo.api import TTS

print("=" * 60)
print("ALL AVAILABLE SPEAKERS BY LANGUAGE")
print("=" * 60)

# List of all supported languages
languages = ['EN', 'ES', 'FR', 'ZH', 'JP', 'KR']

all_speakers_dict = {}

for lang in languages:
    try:
        model = TTS(language=lang, device='cpu')
        speaker_ids = model.hps.data.spk2id
        
        print(f"\n{lang}:")
        if (lang == 'EN'):
            speaker_ids_global = speaker_ids
        print("-" * 40)
        for idx, (speaker_key, speaker_id) in enumerate(speaker_ids.items(), 1):
            print(f"  {idx}. {speaker_key} (ID: {speaker_id})")
        
        all_speakers_dict[lang] = speaker_ids
    except Exception as e:
        print(f"\n{lang}: Error loading - {str(e)}")

print("\n" + "=" * 60)
print("SUMMARY")
print("=" * 60)
for lang, speakers in all_speakers_dict.items():
    print(f"{lang}: {len(speakers)} speaker(s)")

print("\n✓ Speaker list loaded. Use 'all_speakers_dict' to access speakers programmatically.")

speaker_ids_global


ALL AVAILABLE SPEAKERS BY LANGUAGE


  WeightNorm.apply(module, name, dim)



EN:
----------------------------------------
  1. EN-US (ID: 0)
  2. EN-BR (ID: 1)
  3. EN_INDIA (ID: 2)
  4. EN-AU (ID: 3)
  5. EN-Default (ID: 4)

ES:
----------------------------------------
  1. ES (ID: 0)

FR:
----------------------------------------
  1. FR (ID: 0)

ZH:
----------------------------------------
  1. ZH (ID: 1)

JP:
----------------------------------------
  1. JP (ID: 0)

KR:
----------------------------------------
  1. KR (ID: 0)

SUMMARY
EN: 5 speaker(s)
ES: 1 speaker(s)
FR: 1 speaker(s)
ZH: 1 speaker(s)
JP: 1 speaker(s)
KR: 1 speaker(s)

✓ Speaker list loaded. Use 'all_speakers_dict' to access speakers programmatically.


{'EN-US': 0, 'EN-BR': 1, 'EN_INDIA': 2, 'EN-AU': 3, 'EN-Default': 4}

In [38]:
selected_speaker = 'EN-Default'  # Change this to your preferred speaker

# Determine the language from the speaker name prefix (e.g. 'EN-US' -> 'EN', 'EN_INDIA' -> 'EN')
language = selected_speaker.replace('_', '-').split('-')[0]
if language not in all_speakers_dict:
    language = 'EN'  # Default to English

# Look the speaker up among that language's speakers, not only the English ones,
# so that non-English speakers such as 'ES' or 'JP' can actually be selected
language_speakers = all_speakers_dict[language]
if selected_speaker not in language_speakers:
    print(f"\nWarning: '{selected_speaker}' not found. Using first available speaker.")
    selected_speaker = list(language_speakers.keys())[0]

print(f"\n✓ Selected speaker: {selected_speaker}")
speaker_id = language_speakers[selected_speaker]
speaker_key_normalized = selected_speaker.lower().replace('_', '-')

# Initialize the TTS model for the selected language
print(f"Initializing TTS model for language: {language}")
model = TTS(language=language, device=device)


# Generate TTS with selected speaker
text = "Hi, I’m Woddy, your AI career coach. I’m here to help you grow your career with clarity, intention, and confidence. Let's update your C V to continue"
src_path = f'{output_dir}/tmp.wav'
# ===== Naturalness Parameters =====
# Adjust these to reduce robotic/glitchy artifacts and make voice more human-like

# Speed control for energy level:
# - 0.8-0.9 = slower, calmer
# - 1.0 = normal
# - 1.1-1.2 = faster, more energetic (recommended for energetic voice)
# - 1.3-1.5 = very fast, very energetic
speed = 1.0

# Noise scale: Controls randomness in voice generation
# - Lower (0.4-0.5) = more deterministic, but can sound robotic
# - Higher (0.7-0.9) = more natural variation, less robotic (recommended)
noise_scale = 0.8  # Default: 0.6, try 0.7-0.8 for more natural sound

# Noise scale for duration: Controls pacing variation
# - Lower (0.6-0.7) = more consistent timing
# - Higher (0.9-1.0) = more natural pacing variation
noise_scale_w = 0.9  # Default: 0.8, try 0.9-1.0 for more natural pacing

# SDP ratio: Balance between stochastic and deterministic duration prediction
# - Lower (0.1-0.15) = more consistent, but can sound robotic
# - Higher (0.3-0.4) = more natural variation in speech rhythm
sdp_ratio = 0.3  # Default: 0.2, try 0.3-0.35 for more natural rhythm

# Tau: temperature-like parameter of the tone color converter
# (to our understanding it scales the random noise used during conversion,
# rather than directly setting how much of the reference voice is applied)
# - Lower (0.2-0.25) = less variation, often fewer artifacts
# - Higher (0.4-0.5) = more variation, may introduce artifacts
# - Default: 0.3
tau = 0.2  # Try 0.2-0.3 and compare the results by ear

# Load source speaker embedding
source_se = torch.load(f'checkpoints_v2/base_speakers/ses/{speaker_key_normalized}.pth', map_location=device)

if torch.backends.mps.is_available() and device == 'cpu':
    torch.backends.mps.is_available = lambda: False

# Generate TTS with naturalness parameters
print(f"Generating speech with {selected_speaker}...")
model.tts_to_file(
    text, 
    speaker_id, 
    src_path, 
    speed=speed,
    noise_scale=noise_scale,
    noise_scale_w=noise_scale_w,
    sdp_ratio=sdp_ratio
)

# Convert tone color to match your reference voice
save_path = f'{output_dir}/output_v2_{speaker_key_normalized}.wav'
encode_message = "@MyShell"
tone_color_converter.convert(
    audio_src_path=src_path, 
    src_se=source_se, 
    tgt_se=target_se, 
    output_path=save_path,
    tau=tau,  # Apply tau parameter for better tone blending
    message=encode_message)

# Display the generated audio
print(f"\n✓ Generated: {save_path}")
display(Audio(save_path))


✓ Selected speaker: EN-Default
Initializing TTS model for language: EN
Generating speech with EN-Default...
 > Text split to sentences.
Hi, I'm Woddy, your AI career coach. I'm here to help you grow your career with clarity, intention, and confidence. Let's update your C V to continue


100%|██████████| 1/1 [00:02<00:00,  2.63s/it]



✓ Generated: outputs_v2/output_v2_en-default.wav
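
Because the naturalness parameters above are best tuned by ear, it can help to generate several variants side by side. The following is only a sketch of how such a comparison could be organized: `sweep_settings` and its file naming scheme are hypothetical, and the function merely builds the settings without calling MeloTTS or OpenVoice.

```python
from itertools import product


def sweep_settings(taus, noise_scales, output_dir="outputs_v2", speaker_key="en-default"):
    """Build one settings dict per (tau, noise_scale) pair, each with its own output path."""
    settings = []
    for tau, noise_scale in product(taus, noise_scales):
        # Hypothetical naming scheme so that the runs do not overwrite each other.
        name = f"output_v2_{speaker_key}_tau{tau:.2f}_ns{noise_scale:.2f}.wav"
        settings.append({
            "tau": tau,
            "noise_scale": noise_scale,
            "save_path": f"{output_dir}/{name}",
        })
    return settings


# Print the output paths for a small 2 x 2 grid of values.
for s in sweep_settings(taus=[0.2, 0.3], noise_scales=[0.6, 0.8]):
    print(s["save_path"])
```

Each dict can then drive one pass of the cell above: `noise_scale` goes to `model.tts_to_file`, while `tau` and `save_path` go to `tone_color_converter.convert` as `tau` and `output_path`.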
