<a href="https://colab.research.google.com/github/trunear/background-removal-js/blob/main/notebooks/RealTimeVoiceCloning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Real-Time Voice Cloning

This is a colab demo notebook using the open source project [CorentinJ/Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning)
to clone a voice.

For other deep-learning Colab notebooks, visit [tugstugi/dl-colab-notebooks](https://github.com/tugstugi/dl-colab-notebooks).


Original issue: https://github.com/tugstugi/dl-colab-notebooks/issues/18

## Setup CorentinJ/Real-Time-Voice-Cloning

In [4]:
#@title Setup CorentinJ/Real-Time-Voice-Cloning

#@markdown * clone the project
#@markdown * download pretrained models
#@markdown * initialize the voice cloning models

# Remove the line below as TensorFlow 1.x is no longer supported in Colab
# %tensorflow_version 1.x
import os
from os.path import exists, join, basename, splitext

# Install required Python packages
git_repo_url = 'https://github.com/CorentinJ/Real-Time-Voice-Cloning.git'
project_name = splitext(basename(git_repo_url))[0]
if exists(project_name):
  # clone and install
  !git clone -q --recursive {git_repo_url}
  # install dependencies
  !cd {project_name} && pip install -q -r requirements.txt
  !pip install -q --upgrade gdown
  !apt-get install -qq libportaudio2
  !pip install -q https://github.com/tugstugi/dl-colab-notebooks/archive/colab_utils.zip

  # Install the missing unidecode library in the main environment
  !apt-get install -y libsndfile1
  !pip install webrtcvad unidecode

  # download pretrained model
  #!cd {project_name} && wget https://github.com/blue-fish/Real-Time-Voice-Cloning/releases/download/v1.0/pretrained.zip && unzip -o pretrained.zip
  !cd {project_name} && mkdir -p saved_models/default/
  !cd {project_name}/saved_models/default/ && gdown https://drive.google.com/uc?id=1q8mEGwCkFy23KZsinbuvdKAQLqNKbYf1
  !cd {project_name}/saved_models/default/ && gdown https://drive.google.com/uc?id=1EqFMIbvxffxtjiVrtykroF6_mUh-5Z3s
  !cd {project_name}/saved_models/default/ && gdown https://drive.google.com/uc?id=1cf2NO6FtI0jDuy8AV3Xgn6leO6dHjIgu

import sys
sys.path.append(project_name)

from IPython.display import display, Audio, clear_output
from IPython.utils import io
import ipywidgets as widgets
import numpy as np
from dl_colab_notebooks.audio import record_audio, upload_audio

from synthesizer.inference import Synthesizer
from encoder import inference as encoder
from vocoder import inference as vocoder
from pathlib import Path

!ls
encoder.load_model(project_name / Path("saved_models/default/encoder.pt"))
synthesizer = Synthesizer(project_name / Path("saved_models/default/synthesizer.pt"))
vocoder.load_model(project_name / Path("saved_models/default/vocoder.pt"))

fatal: destination path 'Real-Time-Voice-Cloning' already exists and is not an empty directory.
  Preparing metadata (setup.py) ... [?25l[?25hdone
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Preparing metadata (setup.py) ... [?25l[?25hdone
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  [1;31merror[0m: [1msubprocess-exited-with-error[0m
  
  [31m×[0m [32mPreparing metadata [0m[1;32m([0m[32mpyproject.toml[0m[1;32m)[0m did not run successfully.
  [31m│[0m exit code: [1;36m1[0m
  [31m╰─>[0m See above for output.
  
  [1;35mnote[0m: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (pyproject.toml) ... [?25l[?25herror
[1;31merror[0m: [1mmetadata-generation-failed[0m

[31m×[0m Encountered error while generating packa

ModuleNotFoundError: No module named 'synthesizer.inference'

In [12]:
#@title Record or Upload
#@markdown * Either record audio from microphone or upload audio from file (.mp3 or .wav)

from encoder import inference as encoder
from IPython.display import Audio, display, clear_output
from google.colab import files
import io
import librosa
import numpy as np

SAMPLE_RATE = 22050
record_or_upload = "Upload (.mp3 or .wav)" #@param ["Record", "Upload (.mp3 or .wav)"]
record_seconds = 10  #@param {type:"number", min:1, max:10, step:1}

embedding = None

def _compute_embedding(wav, sr):
  display(Audio(wav, rate=sr, autoplay=True))
  global embedding
  processed = encoder.preprocess_wav(wav)
  embedding = encoder.embed_utterance(processed)
  print("✅ Voice embedding computed.")

# Use dl_colab_notebooks or your own audio recorder if needed
def _record_audio(b):
  clear_output()
  audio = record_audio(record_seconds, sample_rate=SAMPLE_RATE)
  _compute_embedding(audio, SAMPLE_RATE)

def _upload_audio(b=None):
  clear_output()
  print("📤 Please upload a .wav or .mp3 file...")
  uploaded = files.upload()
  if not uploaded:
    print("❌ No file uploaded.")
    return
  filename = next(iter(uploaded))
  audio_bytes = uploaded[filename]
  audio_stream = io.BytesIO(audio_bytes)
  wav, sr = librosa.load(audio_stream, sr=SAMPLE_RATE)
  _compute_embedding(wav, sr)

# Show the appropriate widget or trigger upload
import ipywidgets as widgets

if record_or_upload == "Record":
  button = widgets.Button(description="🎙️ Record Your Voice")
  button.on_click(_record_audio)
  display(button)
else:
  _upload_audio()


📤 Please upload a .wav or .mp3 file...


Saving Hey there in this video 2 (4).wav to Hey there in this video 2 (4) (12).wav


✅ Voice embedding computed.


In [13]:
#@title Synthesize a text { run: "auto" }
text = "One of the two people who tested positive for the novel coronavirus in the United Kingdom is a student at the University of York in northern England." #@param {type:"string"}

def synthesize(embed, text):
  print("Synthesizing new audio...")
  #with io.capture_output() as captured:
  specs = synthesizer.synthesize_spectrograms([text], [embed])
  generated_wav = vocoder.infer_waveform(specs[0])
  generated_wav = np.pad(generated_wav, (0, synthesizer.sample_rate), mode="constant")
  clear_output()
  display(Audio(generated_wav, rate=synthesizer.sample_rate, autoplay=True))

if embedding is None:
  print("first record a voice or upload a voice file!")
else:
  synthesize(embedding, text)

Synthesizing new audio...


NameError: name 'synthesizer' is not defined