## Multi-Accent and Multi-Lingual Voice Clone Demo with MeloTTS

In [1]:
import os
import torch
from openvoice import se_extractor
from openvoice.api import ToneColorConverter



Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.



### Initialization

In this example, we use the checkpoints from OpenVoiceV2. OpenVoiceV2 is trained with more aggressive augmentations and thus demonstrates better robustness in some cases.

In [3]:
root_path = r"D:\OpenVoice"
wav_relative_p = "./models--M4869--WavMark/snapshots/0ad3c7b74f641bddb61f6b85cdf2de0d93a5bfef/step59000_snr39.99_pesq4.35_BERP_none0.30_mean1.81_std1.81.model.pkl"

ckpt_converter = 'checkpoints_v2/converter'
device = "cuda:0" if torch.cuda.is_available() else "cpu"
output_dir = 'outputs_v2'

tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device, wav_dir=os.path.join(root_path, ".cache", wav_relative_p))
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

os.makedirs(output_dir, exist_ok=True)



Loaded checkpoint 'checkpoints_v2/converter/checkpoint.pth'
missing/unexpected keys: [] []
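
The initialization cell builds the config and checkpoint paths with f-strings, so a missing file only surfaces once the loader runs. A minimal sketch of an early check is shown below. The helper name `resolve_converter_files` is hypothetical and not part of OpenVoice; the file names `config.json` and `checkpoint.pth` are taken from the cell above.

```python
from pathlib import Path


def resolve_converter_files(ckpt_dir):
    """Return (config_path, checkpoint_path) for a converter directory.

    Hypothetical helper, not part of OpenVoice. It raises
    FileNotFoundError with the missing path in the message instead of
    letting the failure surface later inside the model loader.
    """
    ckpt_dir = Path(ckpt_dir)
    config_path = ckpt_dir / "config.json"
    checkpoint_path = ckpt_dir / "checkpoint.pth"
    for path in (config_path, checkpoint_path):
        if not path.is_file():
            raise FileNotFoundError(f"Missing converter file: {path}")
    return config_path, checkpoint_path
```

With this in place, `resolve_converter_files(ckpt_converter)` could be called before constructing `ToneColorConverter`, passing `str(config_path)` and `str(checkpoint_path)` where the f-strings are used now.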


### Obtain Tone Color Embedding
We extract a tone color embedding only for the target speaker. The source tone color embeddings are precomputed and can be loaded directly from the `checkpoints_v2/base_speakers/ses` folder.

In [12]:
reference_speaker = 'resources/demo_speaker0.mp3' # This is the voice you want to clone
target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, vad=False, cache_dir=os.path.join(root_path, ".cache"))

Estimating duration from bitrate, this may be inaccurate


OpenVoice version: v2


#### Use MeloTTS as Base Speakers

MeloTTS is a high-quality multi-lingual text-to-speech library by @MyShell.ai. It supports English (American, British, Indian, Australian, Default), Spanish, French, Chinese, Japanese, and Korean. In the following example, we use the MeloTTS models as the base speakers.

In [14]:
from MeloTTS.melo.api import TTS

# texts = {
#     'EN_NEWEST': "Did you ever hear a folk tale about a giant turtle?",  # The newest English base speaker model
#     'EN': "Did you ever hear a folk tale about a giant turtle?",
#     'ES': "El resplandor del sol acaricia las olas, pintando el cielo con una paleta deslumbrante.",
#     'FR': "La lueur dorée du soleil caresse les vagues, peignant le ciel d'une palette éblouissante.",
#     'ZH': "在这次vacation中，我们计划去Paris欣赏埃菲尔铁塔和卢浮宫的美景。",
#     'JP': "彼は毎朝ジョギングをして体を健康に保っています。",
#     'KR': "안녕하세요! 오늘은 날씨가 정말 좋네요.",
# }

texts = {
    'ZH': "在这次vacation中，我们计划去Paris欣赏埃菲尔铁塔和卢浮宫的美景。你好，我是老方，欢迎光临我的“方自学堂”！本堂课我要分享的是“如何讲好一个故事”。有关“如何讲好一个故事”，我是从人物之间的对话模式中总结出了四种比较实用的结构框架。分别是信息交换式、情感表达式、 揭示秘密式和冲突与解决式。",
}

src_path = f'{output_dir}/tmp.wav'

# Speed is adjustable
speed = 1.0

for language, text in texts.items():
    model = TTS(language=language, device=device)
    speaker_ids = model.hps.data.spk2id
    
    for speaker_key, speaker_id in speaker_ids.items():
        # Normalize the key to match the embedding file name
        speaker_key = speaker_key.lower().replace('_', '-')
        
        source_se = torch.load(f'checkpoints_v2/base_speakers/ses/{speaker_key}.pth', map_location=device)
        model.tts_to_file(text, speaker_id, src_path, speed=speed)
        save_path = f'{output_dir}/output_v2_{speaker_key}.wav'

        # Run the tone color converter
        encode_message = "@MyShell"
        tone_color_converter.convert(
            audio_src_path=src_path, 
            src_se=source_se, 
            tgt_se=target_se, 
            output_path=save_path,
            message=encode_message)

 > Text split to sentences.
在这次vacation中,
我们计划去Paris欣赏埃菲尔铁塔和卢浮宫的美景.
你好, 我是老方, 欢迎光临我的“方自学堂”.
本堂课我要分享的是“如何讲好一个故事”.
有关“如何讲好一个故事”,
我是从人物之间的对话模式中总结出了四种比较实用的结构框架.
分别是信息交换式、情感表达式、 揭示秘密式和冲突与解决式.


100%|██████████| 7/7 [00:01<00:00,  4.33it/s]
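
The loop above derives two file names from each speaker key: the source embedding under `checkpoints_v2/base_speakers/ses` and the converted output under `output_dir`. A small sketch of that naming rule as standalone helpers follows. The names `normalize_speaker_key` and `speaker_files` are hypothetical, and `EN_INDIA` is used only as an illustrative key.

```python
from pathlib import Path

# Folder the loop above loads source embeddings from.
SES_DIR = Path("checkpoints_v2/base_speakers/ses")


def normalize_speaker_key(speaker_key):
    """Lowercase the key and replace underscores with hyphens."""
    return speaker_key.lower().replace("_", "-")


def speaker_files(speaker_key, output_dir="outputs_v2", ses_dir=SES_DIR):
    """Return (embedding_path, output_wav_path) for one base speaker.

    Hypothetical helper that mirrors the f-strings in the loop above.
    """
    key = normalize_speaker_key(speaker_key)
    embedding_path = Path(ses_dir) / f"{key}.pth"
    output_wav_path = Path(output_dir) / f"output_v2_{key}.wav"
    return embedding_path, output_wav_path


print(normalize_speaker_key("EN_INDIA"))  # prints "en-india"
```

Keeping the rule in one place makes it easier to check that an embedding file exists for every key in `spk2id` before any audio is synthesized.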


In [15]:
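from typing import Any

def synthesis_toncvtr(
    model: TTS,
    tone_color_converter: ToneColorConverter,
    source_se: Any,
    target_se: Any,
    texts: str,
    speed: float = 1.12,
    output_path: str = r"D:\OpenVoice\outputs_v2\nihao.wav",
):
    """Synthesize `texts` with the first base speaker, then convert it to the target tone color."""
    speaker_ids = model.hps.data.spk2id

    # No output path is given, so the waveform is used in memory
    audio = model.tts_to_file(texts, list(speaker_ids.values())[0], speed=speed)
    audio = torch.tensor(audio).float()

    # Run the tone color converter.
    # Note: passing a tensor as audio_src_path assumes a converter that accepts
    # in-memory audio; the convert call in the loop above is given a file path.
    encode_message = "@MyShell"
    audio = tone_color_converter.convert(
        audio_src_path=audio,
        src_se=source_se,
        tgt_se=target_se,
        message=encode_message,
        output_path=output_path,
    )

    return audio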

In [16]:
synthesis_toncvtr(
        model=model,
        tone_color_converter=tone_color_converter,
        source_se=source_se,
        target_se=target_se,
        texts="在这次vacation中，我们计划去Paris欣赏埃菲尔铁塔和卢浮宫的美景。你好，我是老方，欢迎光临我的“方自学堂”！本堂课我要分享的是“如何讲好一个故事”。有关“如何讲好一个故事”，我是从人物之间的对话模式中总结出了四种比较实用的结构框架。分别是信息交换式、情感表达式、 揭示秘密式和冲突与解决式。"
    )

 > Text split to sentences.
在这次vacation中,
我们计划去Paris欣赏埃菲尔铁塔和卢浮宫的美景.
你好, 我是老方, 欢迎光临我的“方自学堂”.
本堂课我要分享的是“如何讲好一个故事”.
有关“如何讲好一个故事”,
我是从人物之间的对话模式中总结出了四种比较实用的结构框架.
分别是信息交换式、情感表达式、 揭示秘密式和冲突与解决式.


100%|██████████| 7/7 [00:01<00:00,  4.36it/s]
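
The function above hands an in-memory tensor to `convert`, whereas the earlier loop passes a file path. A variant that routes the synthesized audio through a temporary WAV file, so that only the path-based calls already shown in the loop are used, might look like the sketch below. The function name `synthesize_and_convert` is hypothetical; `tts` and `converter` are assumed to expose `hps.data.spk2id`, `tts_to_file`, and `convert` with the same arguments as in the loop.

```python
import os
import tempfile


def synthesize_and_convert(tts, converter, source_se, target_se, text,
                           output_path, speed=1.0, message="@MyShell"):
    """Synthesize `text`, convert its tone color, and return `output_path`.

    Hypothetical helper. It only relies on the calls used in the loop
    earlier in this notebook: tts_to_file(text, speaker_id, path, speed=...)
    and convert(audio_src_path=..., src_se=..., tgt_se=..., output_path=...,
    message=...).
    """
    # Use the first base speaker, as synthesis_toncvtr does.
    speaker_id = next(iter(tts.hps.data.spk2id.values()))

    # Reserve a temporary file for the intermediate base-speaker audio.
    fd, tmp_path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    try:
        tts.tts_to_file(text, speaker_id, tmp_path, speed=speed)
        converter.convert(
            audio_src_path=tmp_path,
            src_se=source_se,
            tgt_se=target_se,
            output_path=output_path,
            message=message,
        )
    finally:
        # Remove the intermediate file whether or not conversion succeeded.
        os.remove(tmp_path)
    return output_path
```

Taking `output_path` as an argument avoids the hard-coded file name, so repeated calls with different paths do not overwrite each other.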
