# Real-Time Voice Cloning

This is a Colab demo notebook that uses the open-source project [CorentinJ/Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning)
to clone a voice.

For other deep-learning Colab notebooks, visit [tugstugi/dl-colab-notebooks](https://github.com/tugstugi/dl-colab-notebooks).


Original issue: https://github.com/tugstugi/dl-colab-notebooks/issues/18

## Setup CorentinJ/Real-Time-Voice-Cloning

1. Mount Drive
2. Install/Import dependencies
3. Load encoder, synthesizer, vocoder

In [0]:
from google.colab import drive
drive.mount('/content/drive')

%tensorflow_version 1.x
import os
from os.path import exists, join, basename, splitext

git_repo_url = 'https://github.com/CorentinJ/Real-Time-Voice-Cloning.git'
project_name = splitext(basename(git_repo_url))[0]
if not exists(project_name):
  # clone and install
  !git clone -q --recursive {git_repo_url}
  # install dependencies
  !cd {project_name} && pip install -q -r requirements.txt
  !pip install -q gdown
  !apt-get install -qq libportaudio2
  !pip install -q https://github.com/tugstugi/dl-colab-notebooks/archive/colab_utils.zip

  # download pretrained models (URL quoted so the shell does not treat "?" as a glob)
  !cd {project_name} && gdown "https://drive.google.com/uc?id=1n1sPXvT34yXFLT47QZA6FIRGrwMeSsZc" && unzip pretrained.zip

import sys
sys.path.append(project_name)

from IPython.display import display, Audio, clear_output
from IPython.utils import io
import ipywidgets as widgets
import numpy as np
from dl_colab_notebooks.audio import upload_audio

from synthesizer.inference import Synthesizer
from encoder import inference as encoder
from vocoder import inference as vocoder
from pathlib import Path

encoder.load_model(project_name / Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(project_name / Path("synthesizer/saved_models/logs-pretrained/taco_pretrained"))
vocoder.load_model(project_name / Path("vocoder/saved_models/pretrained/pretrained.pt"))

TensorFlow 1.x selected.
ERROR: tensorflow 1.15.2 has requirement gast==0.2.2, but you'll have gast 0.3.3 which is incompatible.
ERROR: tensorflow 1.15.2 has requirement tensorboard<1.16.0,>=1.15.0, but you'll have tensorboard 1.14.0 which is incompatible.

Loaded encoder "pretrained.pt" trained to step 1564501
Found synthesizer "pretrained" trained to step 278000
Building Wave-RNN
Trainable Parameters: 4.481M
Loading model weights at Real-Time-Voice-Cloning/vocoder/saved_models/pretrained/pretrained.pt
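
If one of the three load calls fails, a likely cause is that the pretrained archive was not downloaded or unzipped. The sketch below checks whether the checkpoint paths used in the setup cell exist. The helper name `missing_checkpoints` is hypothetical and not part of Real-Time-Voice-Cloning; the relative paths are copied from the setup cell.

```python
from pathlib import Path

# Relative checkpoint locations, copied from the load calls in the setup cell.
CHECKPOINTS = [
    "encoder/saved_models/pretrained.pt",
    "synthesizer/saved_models/logs-pretrained/taco_pretrained",
    "vocoder/saved_models/pretrained/pretrained.pt",
]


def missing_checkpoints(project_dir):
    """Return the checkpoint paths under project_dir that do not exist.

    Hypothetical helper for illustration; not part of the cloned project.
    """
    root = Path(project_dir)
    return [root / rel for rel in CHECKPOINTS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints("Real-Time-Voice-Cloning")
    if missing:
        print("Missing checkpoints:")
        for path in missing:
            print(" ", path)
    else:
        print("All pretrained checkpoints found.")
```

Running this before the load calls gives a clearer message than a stack trace from inside the model loaders.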


4. Encode voice embedding
5. Save voice embedding

In [0]:
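SAMPLE_RATE = 22050
embedding = None

def _compute_embedding(audio):
  # Preprocess the raw audio and store its speaker embedding in the global.
  global embedding
  embedding = encoder.embed_utterance(encoder.preprocess_wav(audio, SAMPLE_RATE))

def _upload_audio(b):
  # `b` is unused; any value can be passed.
  audio = upload_audio(sample_rate=SAMPLE_RATE)
  _compute_embedding(audio)

_upload_audio("")

# Save the embedding to Drive so that it can be reloaded later.
np.save('/content/drive/My Drive/Colab Notebooks/CS485/embedding_8.npy', embedding)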

Saving freeman_train_8.wav to freeman_train_8.wav
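
Once several embeddings have been saved this way, one simple way to compare them is cosine similarity. The following is a minimal sketch, assuming each saved embedding is a one-dimensional float vector. The function name `cosine_similarity` and the 256-element random vectors in the demo are placeholders for illustration, not values taken from this notebook.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two vectors (illustrative helper)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)


if __name__ == "__main__":
    # Stand-in vectors; in the notebook these would come from np.load(...)
    # on two saved embedding files.
    rng = np.random.default_rng(0)
    emb_a = rng.random(256)
    emb_b = emb_a + 0.05 * rng.random(256)
    print("similarity:", cosine_similarity(emb_a, emb_b))
```

Values close to 1 indicate vectors pointing in nearly the same direction. How well that tracks perceived speaker similarity depends on the encoder and is not established by this sketch.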


6. Example synthesis:

In [0]:
embedding = np.load('/content/drive/My Drive/Colab Notebooks/CS485/embedding_8.npy')

text = "He disappeared into his bedroom, and returned in a few minutes in the character of an amiable and simple-minded Nonconformist clergyman."
  
def synthesize(embed, text):
  print("Synthesizing new audio...")
  # Text + speaker embedding -> spectrogram, then spectrogram -> waveform.
  specs = synthesizer.synthesize_spectrograms([text], [embed])
  generated_wav = vocoder.infer_waveform(specs[0])
  # Append one second of silence (sample_rate zero samples) to the end.
  generated_wav = np.pad(generated_wav, (0, synthesizer.sample_rate), mode="constant")
  clear_output()
  display(Audio(generated_wav, rate=synthesizer.sample_rate, autoplay=True))

if embedding is None:
  print("First record a voice or upload a voice file!")
else:
  synthesize(embedding, text)
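
The cell above only plays the generated audio. To keep the result, the waveform can be written to a WAV file. The sketch below uses only the standard-library `wave` module and NumPy, and assumes the waveform is a mono float array with values roughly in [-1, 1]. `save_wav` is a hypothetical helper, and using it here would require `synthesize` to return `generated_wav`, which the cell above does not do.

```python
import os
import tempfile
import wave

import numpy as np


def save_wav(path, samples, sample_rate):
    """Write a mono float waveform to a 16-bit PCM WAV file.

    Illustrative helper; values outside [-1, 1] are clipped.
    """
    samples = np.clip(np.asarray(samples, dtype=np.float32), -1.0, 1.0)
    pcm = (samples * 32767.0).astype("<i2")  # little-endian 16-bit integers
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)               # mono
        wav_file.setsampwidth(2)               # 2 bytes = 16 bits per sample
        wav_file.setframerate(int(sample_rate))
        wav_file.writeframes(pcm.tobytes())


if __name__ == "__main__":
    # Placeholder signal: one second of a 440 Hz tone at an assumed 16 kHz rate.
    rate = 16000
    t = np.arange(rate) / rate
    tone = 0.3 * np.sin(2 * np.pi * 440.0 * t)
    with tempfile.TemporaryDirectory() as tmp_dir:
        out_path = os.path.join(tmp_dir, "demo_tone.wav")
        save_wav(out_path, tone, rate)
        print("wrote", os.path.getsize(out_path), "bytes")
```

In the notebook, the sample rate passed to such a helper would be `synthesizer.sample_rate`, the same value already passed to `Audio`.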