## Voice Style Control Demo


In [1]:
import os
import torch
from openvoice import se_extractor
from openvoice.api import BaseSpeakerTTS, ToneColorConverter



Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.



### Initialization


In [2]:
ckpt_base = "checkpoints/base_speakers/EN"
ckpt_converter = "checkpoints/converter"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
output_dir = "outputs"

base_speaker_tts = BaseSpeakerTTS(f"{ckpt_base}/config.json", device=device)
base_speaker_tts.load_ckpt(f"{ckpt_base}/checkpoint.pth")

tone_color_converter = ToneColorConverter(
    f"{ckpt_converter}/config.json", device=device
)
tone_color_converter.load_ckpt(f"{ckpt_converter}/checkpoint.pth")

os.makedirs(output_dir, exist_ok=True)

  WeightNorm.apply(module, name, dim)
  checkpoint_dict = torch.load(ckpt_path, map_location=torch.device(self.device))


Loaded checkpoint 'checkpoints/base_speakers/EN/checkpoint.pth'
missing/unexpected keys: [] []


  checkpoint = torch.load(resume_path, map_location=torch.device('cpu'))


Loaded checkpoint 'checkpoints/converter/checkpoint.pth'
missing/unexpected keys: [] []


### Obtain Tone Color Embedding


The `source_se` is the tone color embedding of the base speaker.
It is an average over multiple sentences generated by the base speaker. We provide the precomputed result here, but
readers are free to extract `source_se` themselves.
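
As a rough illustration of what "an average over multiple sentences" means, the sketch below takes the element-wise mean of several embedding vectors. Plain Python lists stand in for tensors so the example runs without any model, and `mean_embedding` is a hypothetical helper, not part of OpenVoice. How the per-sentence embeddings are extracted is not shown in this notebook.

```python
from typing import List, Sequence


def mean_embedding(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized embedding vectors."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same length")
    n = len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


# Toy 3-dimensional vectors standing in for per-sentence embeddings.
per_sentence = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
averaged = mean_embedding(per_sentence)
print(averaged)
```

With real tensors of identical shape, `torch.stack(list_of_embeddings).mean(dim=0)` would compute the same element-wise mean.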


In [3]:
source_se = torch.load(f"{ckpt_base}/en_default_se.pth").to(device)



The `reference_speaker.mp3` below points to the short audio clip of the reference whose voice we want to clone. We provide an example here. If you use your own reference speakers, please **make sure each speaker has a unique filename.** The `se_extractor` will save the `targeted_se` using the filename of the audio and **will not automatically overwrite.**
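
Since the extracted embedding is keyed by the audio filename and is not overwritten, one way to keep filenames unique is to derive them from the clip's content. The sketch below is only an illustration: `unique_reference_name` is a hypothetical helper (not an OpenVoice function), and the paths and bytes are placeholders.

```python
import hashlib
import re
from pathlib import Path


def unique_reference_name(audio_path: str, content: bytes) -> str:
    """Return '<slug>_<hash><ext>': a filesystem-friendly, content-specific filename."""
    path = Path(audio_path)
    # Collapse anything that is not a letter, digit, or hyphen into one underscore.
    slug = re.sub(r"[^A-Za-z0-9-]+", "_", path.stem).strip("_").lower()
    # A short content hash keeps two different clips with similar names apart.
    digest = hashlib.sha256(content).hexdigest()[:8]
    return f"{slug}_{digest}{path.suffix.lower()}"


# Placeholder bytes; in practice read them with Path(audio_path).read_bytes().
name_a = unique_reference_name("resources/My Speaker (take 1).mp3", b"clip A")
name_b = unique_reference_name("resources/My Speaker (take 1).mp3", b"clip B")
print(name_a)
print(name_b)
```

Copying or renaming the reference clip to such a name before calling `se_extractor.get_se` means a re-recorded clip gets a fresh embedding instead of silently reusing the cached one.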


In [4]:
reference_speaker = "resources/5 wild new AI tools you can try right now - Fireship (youtube).mp3"  # This is the voice you want to clone
print(f"Cloning voice from {reference_speaker}")
print(os.path.abspath("resources/example_reference.mp3"))
print("Current working directory:", os.getcwd())

Cloning voice from resources/5 wild new AI tools you can try right now - Fireship (youtube).mp3
d:\Dev\ariolas-tech\rnd\OpenVoice\resources\example_reference.mp3
Current working directory: d:\Dev\ariolas-tech\rnd\OpenVoice


In [5]:
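%pip install openai-whisper
# Note: the PyPI package named `whisper` is an unrelated project; OpenAI's Whisper is published as `openai-whisper`.
# ffmpeg must also be installed on your system.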

Note: you may need to restart the kernel to use updated packages.


In [5]:
target_se, audio_name = se_extractor.get_se(
    reference_speaker, tone_color_converter, target_dir="processed", vad=True
)

OpenVoice version: v1
[(0.0, 165.33), (166.19, 168.21), (169.422, 254.304)]
after vad: dur = 252.232


Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:878.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]


### Inference


In [25]:
save_path = f"{output_dir}/output_en_sample.wav"

# Run the base speaker tts
text = """
    This audio is generated by OpenVoice.
    It features voice style control such as friendly, cheerful,
    sad, angry, terrified, shouting, whispering, or default speakers.
    The voice speed can also be adjusted.
    OpenAI text-to-speech can be used as base speaker to generate the audio for different languages.
    """
# text = """
#     Hello, I am Claude, your personal assistant. My audio is cloned as reference and generated this audio using OpenVoice V1.
#     A custom text can be passed to generate the audio in the desired tone.
#     """

print("Text resource: ", text)

src_path = f"{output_dir}/tmp.wav"
base_speaker_tts.tts(text, src_path, speaker="default", language="English", speed=1.0)

# Run the tone color converter
encode_message = "@MyShell"
tone_color_converter.convert(
    audio_src_path=src_path,
    src_se=source_se,
    tgt_se=target_se,
    output_path=save_path,
    message=encode_message,
)

Text resource:  
    This audio is generated by OpenVoice.
    It features voice style control such as friendly, cheerful,
    sad, angry, terrified, shouting, whispering, or default speakers.
    The voice speed can also be adjusted.
    OpenAI text-to-speech can be used as base speaker to generate the audio for different languages.
    
 > Text splitted to sentences.
This audio is generated by OpenVoice. It features voice style control such as friendly,
cheerful, sad, angry, terrified, shouting, whispering, or default speakers. The voice speed can also be adjusted.
OpenAI text-to-speech can be used as base speaker to generate the audio for different languages.
ðɪs ˈɑdiˌoʊ ɪz ˈdʒɛnəɹˌeɪtɪd baɪ ˈoʊpən vɔɪs. ɪt ˈfitʃəɹz vɔɪs staɪɫ kənˈtɹoʊɫ sətʃ ɛz ˈfɹɛndli,
 length:96
 length:96
ˈtʃɪɹfəɫ, sæd, ˈæŋgɹi, ˈtɛɹəˌfaɪd, ˈʃaʊtɪŋ, ˈwɪspəɹɪŋ, əɹ dɪˈfɔɫt ˈspikəɹz. ðə vɔɪs spid kən ˈɔlsoʊ bi əˈdʒəstɪd.
 length:113
 length:113
ˈoʊpən eɪaɪ text-to-speech* kən bi juzd ɛz beɪs ˈspikəɹ tɪ ˈdʒɛnəɹˌeɪt ð

In [7]:
save_path = f"{output_dir}/output_en_default.wav"

# Run the base speaker tts
# text = "This audio is generated by OpenVoice."
text_src = "resources/texts/en.txt"


def read_text_from_file(file_path):
    with open(file_path, "r", encoding="utf-8") as file:
        return file.read()


text = read_text_from_file(text_src)
print("Text resource: ", text)

src_path = f"{output_dir}/tmp.wav"
base_speaker_tts.tts(text, src_path, speaker="default", language="English", speed=1.0)

# Run the tone color converter
encode_message = "@MyShell"
tone_color_converter.convert(
    audio_src_path=src_path,
    src_se=source_se,
    tgt_se=target_se,
    output_path=save_path,
    message=encode_message,
)

Text resource:  One year ago, this unbelievable video of Will Smith eating spaghetti took the world by storm. We humans joked about it. We could easily tell that it was fake, and at that point, nobody was really afraid. But fast forward one year later, and generative AI tech has taken another huge leap forward. Will Smith eating spaghetti in 2024 is nothing to joke around about. If it doesn't plateau soon, it could put our Hollywood idols out of business, and there would be no one left to brainwash us. In today's video, we'll descend further into Uncanny Valley, and look at five new generative AI tools that you can actually use today. By the end of this video, you'll be able to fire your human photographer, videographer, sound engineer, and programmer. It is June 17th, 2024, and you're watching The Code Report. A few months ago, OpenAI previewed Sora, and teased us with a bunch of AI videos. Google later followed that up with Veeo, which was also quite impressive, but just this week, t

OutOfMemoryError: CUDA out of memory. Tried to allocate 850.00 MiB. GPU 0 has a total capacity of 6.00 GiB of which 0 bytes is free. Of the allocated memory 10.43 GiB is allocated by PyTorch, and 1.15 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
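
The error above appears when a long text is processed in a single pass on a 6 GiB GPU. One possible workaround, assuming memory use grows with the amount of audio handled per call, is to split the text into smaller chunks and process them one at a time. The sketch below shows only the chunking step; `chunk_text` and the `max_chars` value are hypothetical choices, not part of OpenVoice.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 400) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    # Split after ., !, or ? followed by whitespace; the punctuation is kept.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [s.strip() for s in parts if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = (
    "We humans joked about it. "
    "We could easily tell that it was fake. "
    "Nobody was really afraid."
)
for i, chunk in enumerate(chunk_text(sample, max_chars=60)):
    print(i, chunk)
```

Each chunk could then go through `base_speaker_tts.tts(...)` and `tone_color_converter.convert(...)` with its own output path, with the resulting WAV files concatenated afterwards. Whether this avoids the error depends on the hardware and on the chunk size chosen.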

**Try different styles and speeds.** The style is controlled by the `speaker` parameter of the `base_speaker_tts.tts` method. Available choices: friendly, cheerful, excited, sad, angry, terrified, shouting, whispering. Note that the tone color embedding needs to be updated to match the style (the next cell loads `en_style_se.pth` instead of `en_default_se.pth`). The speed is controlled by the `speed` parameter. Let's try whispering with speed 0.9.
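
To make the pairing between the `speaker` value and the source embedding explicit, here is a minimal sketch. It assumes that every non-default style uses `en_style_se.pth`; this notebook only demonstrates that for `whispering`. `STYLES` and `source_se_path` are hypothetical names introduced for illustration.

```python
# Style names as listed in this notebook.
STYLES = (
    "friendly", "cheerful", "excited", "sad",
    "angry", "terrified", "shouting", "whispering",
)


def source_se_path(ckpt_base: str, speaker: str = "default") -> str:
    """Return the embedding file to load for a given `speaker` value."""
    if speaker == "default":
        return f"{ckpt_base}/en_default_se.pth"
    if speaker in STYLES:
        # Assumption: all styled speakers share one style embedding file.
        return f"{ckpt_base}/en_style_se.pth"
    raise ValueError(
        f"unknown style {speaker!r}; expected 'default' or one of {STYLES}"
    )


print(source_se_path("checkpoints/base_speakers/EN", "whispering"))
```

Rejecting unknown names early gives a clearer error than passing a misspelled style on to the TTS call.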


In [26]:
source_se = torch.load(f"{ckpt_base}/en_style_se.pth").to(device)
save_path = f"{output_dir}/output_whispering.wav"

# Run the base speaker tts
# text = "This audio is generated by OpenVoice."

text_src = "resources/texts/en.txt"


def read_text_from_file(file_path):
    with open(file_path, "r", encoding="utf-8") as file:
        return file.read()


text = read_text_from_file(text_src)
print("Text resource: ", text)

src_path = f"{output_dir}/tmp.wav"
base_speaker_tts.tts(
    text, src_path, speaker="whispering", language="English", speed=0.9
)

# Run the tone color converter
encode_message = "@MyShell"
tone_color_converter.convert(
    audio_src_path=src_path,
    src_se=source_se,
    tgt_se=target_se,
    output_path=save_path,
    message=encode_message,
)



Text resource:  ### English Translation:
One year ago, this unbelievable video of Will Smith eating spaghetti took the world by storm. We humans joked about it. We could easily tell that it was fake, and at that point, nobody was really afraid. But fast forward one year later, and generative AI tech has taken another huge leap forward. Will Smith eating spaghetti in 2024 is nothing to joke around about. If it doesn't plateau soon, it could put our Hollywood idols out of business, and there would be no one left to brainwash us. In today's video, we'll descend further into Uncanny Valley, and look at five new generative AI tools that you can actually use today. By the end of this video, you'll be able to fire your human photographer, videographer, sound engineer, and programmer. It is June 17th, 2024, and you're watching The Code Report.

A few months ago, OpenAI previewed Sora, and teased us with a bunch of AI videos. Google later followed that up with Veeo, which was also quite impress

**Try different languages.** OpenVoice can achieve multilingual voice cloning by simply replacing the base speaker. We provide an example with a Chinese base speaker here, and we encourage readers to try `demo_part2.ipynb` for a detailed demo.


In [12]:
ckpt_base = "checkpoints/base_speakers/ZH"
base_speaker_tts = BaseSpeakerTTS(f"{ckpt_base}/config.json", device=device)
base_speaker_tts.load_ckpt(f"{ckpt_base}/checkpoint.pth")

source_se = torch.load(f"{ckpt_base}/zh_default_se.pth").to(device)
save_path = f"{output_dir}/output_chinese.wav"

# Run the base speaker tts
text = "今天天气真好，我们一起出去吃饭吧。"

src_path = f"{output_dir}/tmp.wav"
base_speaker_tts.tts(text, src_path, speaker="default", language="Chinese", speed=1.0)

# Run the tone color converter
encode_message = "@MyShell"
tone_color_converter.convert(
    audio_src_path=src_path,
    src_se=source_se,
    tgt_se=target_se,
    output_path=save_path,
    message=encode_message,
)

  WeightNorm.apply(module, name, dim)
  checkpoint_dict = torch.load(ckpt_path, map_location=torch.device(self.device))
  source_se = torch.load(f"{ckpt_base}/zh_default_se.pth").to(device)
Building prefix dict from the default dictionary ...


Loaded checkpoint 'checkpoints/base_speakers/ZH/checkpoint.pth'
missing/unexpected keys: [] []
 > Text splitted to sentences.
今天天气真好, 我们一起出去吃饭吧.


Dumping model to file cache C:\Users\Taniyow\AppData\Local\Temp\jieba.cache
Loading model cost 0.570 seconds.
Prefix dict has been built successfully.


tʃ⁼in→tʰjɛn→tʰjɛn→tʃʰi↓ ts`⁼ən→ xɑʊ↓↑,  wo↓↑mən i↓tʃʰi↓↑ ts`ʰu→tʃʰɥ↓ ts`ʰɹ`→fan↓ p⁼a.
 length:85
 length:85


**Tech for good.** For people who will deploy OpenVoice for public use: we offer the option to add a watermark to deter potential misuse. Please see the `ToneColorConverter` class. **MyShell reserves the ability to detect whether an audio clip was generated by OpenVoice**, regardless of whether the watermark is added.
