<a href="https://colab.research.google.com/github/thennal10/zeroshot/blob/main/Zeroshot_w_Gradio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Zero-shot and few-shot TTS using YourTTS and Gradio
A quick demo of YourTTS using Gradio. Run the setup code blocks first, then run **either** the 'Synthesize from d-vectors' blocks or the 'Finetune and synthesize' blocks. When the last cell of that section finishes, click the public URL it prints to open the interface.

Make sure to change your runtime to GPU if you're doing finetuning.
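
Before starting a finetuning run, it can help to confirm that the runtime actually exposes a GPU. The following is a minimal sketch, assuming an NVIDIA runtime where the `nvidia-smi` tool is on the PATH; the helper name `gpu_visible` is made up for this example.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
  """Return True if `nvidia-smi` is on PATH and lists at least one GPU."""
  if shutil.which("nvidia-smi") is None:
    return False
  # `nvidia-smi -L` prints one line per device, e.g. "GPU 0: ..."
  result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
  return result.returncode == 0 and "GPU" in result.stdout


print("GPU runtime detected" if gpu_visible() else "No GPU detected")
```

If no GPU is reported, switch the runtime type and rerun the setup blocks.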

## Download and install TTS and models

In [1]:
!git clone https://github.com/coqui-ai/TTS TTS
!pip install -q -e TTS/
!pip install -q torchaudio==0.9.0
!pip install -q gradio

Cloning into 'TTS'...
remote: Enumerating objects: 23791, done.
remote: Counting objects: 100% (191/191), done.
remote: Compressing objects: 100% (138/138), done.
remote: Total 23791 (delta 85), reused 93 (delta 52), pack-reused 23600
Receiving objects: 100% (23791/23791), 127.81 MiB | 17.23 MiB/s, done.
Resolving deltas: 100% (17223/17223), done.
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done

In [2]:
!wget https://coqui.gateway.scarf.sh/v0.5.0_models/tts_models--multilingual--multi-dataset--your_tts.zip
!unzip tts_models--multilingual--multi-dataset--your_tts.zip

--2022-02-19 09:54:51--  https://coqui.gateway.scarf.sh/v0.5.0_models/tts_models--multilingual--multi-dataset--your_tts.zip
Resolving coqui.gateway.scarf.sh (coqui.gateway.scarf.sh)... 18.193.247.98, 3.64.83.114
Connecting to coqui.gateway.scarf.sh (coqui.gateway.scarf.sh)|18.193.247.98|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/coqui-ai/TTS/releases/download/v0.5.0_models/tts_models--multilingual--multi-dataset--your_tts.zip [following]
--2022-02-19 09:54:51--  https://github.com/coqui-ai/TTS/releases/download/v0.5.0_models/tts_models--multilingual--multi-dataset--your_tts.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/265612440/06b726fc-6cd2-4f94-9a17-378b9303dd15?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIW

## Set up TTS

In [3]:
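# Workaround: reinstall numpy so a fresh build is picked up
# (presumably needed because of a version conflict left by the installs above)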
!pip uninstall -y numpy
!pip install numpy

Found existing installation: numpy 1.19.5
Uninstalling numpy-1.19.5:
  Successfully uninstalled numpy-1.19.5
Collecting numpy
  Downloading numpy-1.21.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Installing collected packages: numpy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.8.0 requires tf-estimator-nightly==2.8.0.dev2021122109, which is not installed.
torchvision 0.11.1+cu111 requires torch==1.10.0, but you have torch 1.9.0 which is incompatible.
torchtext 0.11.0 requires torch==1.10.0, but you have torch 1.9.0 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully inst

In [4]:
import sys
TTS_PATH = "TTS/"

# add libraries into environment
sys.path.append(TTS_PATH) # set this if TTS is not installed globally

from TTS.config import load_config
from TTS.trainer import Trainer, TrainingArgs
from TTS.tts.models.vits import Vits
from TTS.tts.utils.speakers import SpeakerManager
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.languages import LanguageManager
from TTS.tts.utils.synthesis import synthesis

In [5]:
import torch

OUT_PATH = "/content/output"

parent = "/content/tts_models--multilingual--multi-dataset--your_tts"
# model vars 
MODEL_PATH = f"{parent}/model_file.pth.tar"
CONFIG_PATH = f"{parent}/config.json"
TTS_LANGUAGES = f"{parent}/language_ids.json"
TTS_SPEAKERS = "/content/speakers.json"
CONFIG_SE_PATH = f"{parent}/config_se.json"
CHECKPOINT_SE_PATH = f"{parent}/model_se.pth.tar"

USE_CUDA = torch.cuda.is_available()
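
If the download or unzip step failed, the later cells fail with less obvious errors. The sketch below checks that the files named in the cell above exist. It assumes the same `/content/...` layout; the helper `missing_files` is hypothetical and not part of TTS.

```python
from pathlib import Path


def missing_files(paths):
  """Return the subset of `paths` that do not exist on disk."""
  return [p for p in paths if not Path(p).exists()]


# Same directory as `parent` in the cell above.
parent = "/content/tts_models--multilingual--multi-dataset--your_tts"
expected = [
  f"{parent}/model_file.pth.tar",
  f"{parent}/config.json",
  f"{parent}/language_ids.json",
  f"{parent}/config_se.json",
  f"{parent}/model_se.pth.tar",
]

missing = missing_files(expected)
if missing:
  print("Missing files:", missing)
else:
  print("All model files found.")
```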

In [6]:
# load the config
C = load_config(CONFIG_PATH)

# load the audio processor
ap = AudioProcessor(**C.audio, verbose=False)
C.output_path = OUT_PATH
C.use_language_weighted_sampler = False

C.test_sentences = []
C.min_seq_len = 0
C.max_seq_len = 500000

speaker_manager = SpeakerManager(
    encoder_model_path=CHECKPOINT_SE_PATH, 
    encoder_config_path=CONFIG_SE_PATH,
    use_cuda=USE_CUDA)
language_manager = LanguageManager()
language_manager.set_language_ids_from_config(C)

 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400


## Synthesize from d-vectors

In [7]:
C.model_args.use_speaker_encoder_as_loss = False
model = Vits(C, speaker_manager, language_manager)
model.load_checkpoint(C, MODEL_PATH, eval=True)

 > initialization of language-embedding layers.


In [8]:
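def synthesis_dvec(text, filepath):
  """Speak `text` in the voice of the recording at `filepath` and return the output path."""
  out_path = "/content/output.wav"
  # compute a speaker embedding (d-vector) from the reference recording
  d_vector = speaker_manager.compute_d_vector_from_clip(filepath)

  wav, alignment, _, _ = synthesis(
    model,
    text,
    C,
    False,  # use_cuda: the model is never moved to the GPU in this section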
    ap,
    speaker_id=None,
    d_vector=d_vector,
    style_wav=None,
    language_id=0,
    enable_eos_bos_chars=C.enable_eos_bos_chars,
    use_griffin_lim=True,
    do_trim_silence=False,
    ).values()
  ap.save_wav(wav, out_path)
  return out_path
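
`synthesis` returns a dict, and the function above unpacks it with `.values()`, which relies on the order in which the keys were inserted. A slightly more defensive variant reads the entries by key. The sketch below uses a stand-in dict; the key names `wav`, `alignments`, `text_inputs` and `outputs` are an assumption about the installed TTS version and should be checked against it.

```python
def unpack_synthesis(outputs):
  """Pick the waveform and alignment out of a synthesis result by key."""
  return outputs["wav"], outputs["alignments"]


# Stand-in for the dict returned by `synthesis`; the real values are arrays or tensors.
fake_outputs = {
  "wav": [0.0, 0.1, -0.1],
  "alignments": [[1.0]],
  "text_inputs": [3, 7],
  "outputs": {},
}

wav, alignment = unpack_synthesis(fake_outputs)
print(len(wav))
```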

In [9]:
import gradio as gr

iface = gr.Interface(
  synthesis_dvec, 
  ["text", gr.inputs.Audio(source="microphone", type="filepath", label="Say anything!")], 
  "audio",
  allow_screenshot=False,
  allow_flagging="never"
)

iface.launch(share=True, debug=False)

Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
Your interface requires microphone or webcam permissions - this may cause issues in Colab. Use the External URL in case of issues.
Running on public URL: https://38991.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces (https://huggingface.co/spaces)


(<fastapi.applications.FastAPI at 0x7f6ddec7dc10>,
 'http://127.0.0.1:7860/',
 'https://38991.gradio.app')

## Finetune and synthesize

In [10]:
!wget https://downloads.tatoeba.org/exports/per_language/eng/eng_sentences.tsv.bz2
!bzip2 -d eng_sentences.tsv.bz2

--2022-02-19 09:57:22--  https://downloads.tatoeba.org/exports/per_language/eng/eng_sentences.tsv.bz2
Resolving downloads.tatoeba.org (downloads.tatoeba.org)... 94.130.77.194
Connecting to downloads.tatoeba.org (downloads.tatoeba.org)|94.130.77.194|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18158301 (17M) [application/octet-stream]
Saving to: ‘eng_sentences.tsv.bz2’


2022-02-19 09:57:22 (69.1 MB/s) - ‘eng_sentences.tsv.bz2’ saved [18158301/18158301]



In [11]:
import csv
import random
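# each row is: sentence id, language code, text; QUOTE_NONE keeps quotation marks literal
with open("eng_sentences.tsv", encoding="utf-8") as f:
  read_tsv = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
  # pick 10 random short sentences to use as recording prompts
  sentences = random.sample([t[2] for t in read_tsv if len(t[2]) < 70], 10)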
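
To see what the filter in this cell keeps, here is a small sketch that runs the same logic on three made-up rows held in memory instead of the downloaded file. The rows and the helper `short_sentences` are illustrative only, and `quoting=csv.QUOTE_NONE` is a choice made here so that sentences starting with a quotation mark are read literally.

```python
import csv
import io
import random

# Three made-up rows in the same layout: sentence id, language code, text.
tsv = io.StringIO(
  '1\teng\tHello there.\n'
  '2\teng\t"No," she said.\n'
  '3\teng\t' + 'A very long sentence. ' * 5 + '\n'
)


def short_sentences(handle, max_len=70):
  """Return the text column of rows whose text is shorter than `max_len`."""
  reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
  return [row[2] for row in reader if len(row) > 2 and len(row[2]) < max_len]


pool = short_sentences(tsv)  # the third row is too long and is dropped
random.seed(0)  # fixed seed so the sample is repeatable
print(random.sample(pool, 2))
```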

In [12]:
C.use_speaker_embedding = C.model_args.use_speaker_embedding = True
C.use_d_vector_file = C.model_args.use_d_vector_file = False
C.model_args.speaker_encoder_model_path = C.speaker_encoder_model_path = CHECKPOINT_SE_PATH
C.model_args.speaker_encoder_config_path = C.speaker_encoder_config_path = CONFIG_SE_PATH

model = Vits(C, speaker_manager, language_manager)

 > initialization of language-embedding layers.


In [13]:
import librosa
import soundfile as sf

def train_and_synth(text, train, epochs, *args):
  if train:
    C.epochs = epochs
    # resample each recording in place to 16 kHz, the rate the audio processor expects
    for path in args:
      audio, sr = librosa.load(path, sr=16000)
      sf.write(path, audio, sr)

    # one sample per recording: [prompt text, audio path, speaker name, language]
    samples = [[sentences[i], p, 'temp', 'en'] for i, p in enumerate(args)]

    trainer = Trainer(
      TrainingArgs(restore_path=MODEL_PATH),
      C,
      OUT_PATH,
      model=model,
      train_samples=samples[:9],
      eval_samples=samples[9:],
      training_assets={"audio_processor": ap}
    )

    trainer.fit()

  out_path = "/content/output.wav"

  wav, alignment, _, _ = synthesis(
    model,
    text,
    C,
    USE_CUDA,
    ap,
    speaker_id=None,
    style_wav=None,
    language_id=0,
    enable_eos_bos_chars=C.enable_eos_bos_chars,
    use_griffin_lim=True,
    do_trim_silence=False,
  ).values()
  ap.save_wav(wav, out_path)
  return out_path
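
The training branch above pairs each recording with the prompt that was read, then keeps the first nine samples for training and the last one for evaluation. A minimal sketch of that pairing and split, using placeholder prompts and paths; `build_samples` and `split_samples` are hypothetical helpers, not TTS functions.

```python
def build_samples(prompts, clip_paths, speaker="temp", language="en"):
  """Pair each recorded clip with the prompt that was read aloud."""
  return [[text, path, speaker, language] for text, path in zip(prompts, clip_paths)]


def split_samples(samples, n_eval=1):
  """Hold out the last `n_eval` samples (n_eval >= 1) for evaluation."""
  return samples[:-n_eval], samples[-n_eval:]


# Placeholder prompts and clip paths, for illustration only.
prompts = [f"Prompt number {i}." for i in range(10)]
clips = [f"/tmp/clip_{i}.wav" for i in range(10)]

train_samples, eval_samples = split_samples(build_samples(prompts, clips))
print(len(train_samples), len(eval_samples))  # prints: 9 1
```

With ten recordings this reproduces the `samples[:9]` and `samples[9:]` slices used above.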

In [14]:
import gradio as gr


iface = gr.Interface(
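  train_and_synth, 
  # inputs: text to speak, whether to finetune first, number of epochs
  # (slider from 0 to 1000, step 1, default 150), then one recording per prompt sentence
  ["text", "checkbox", gr.inputs.Slider(0, 1000, 1, 150)] +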
  [gr.inputs.Audio(source="microphone", type="filepath", label=s, optional=False) for s in sentences], 
  "audio",
  allow_screenshot=False,
  allow_flagging="never"
)

iface.launch(share=True, debug=False)

Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
Your interface requires microphone or webcam permissions - this may cause issues in Colab. Use the External URL in case of issues.
Running on public URL: https://58934.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces (https://huggingface.co/spaces)


(<fastapi.applications.FastAPI at 0x7f6ddec7dc10>,
 'http://127.0.0.1:7861/',
 'https://58934.gradio.app')