<a href="https://colab.research.google.com/github/vutt-ai-models/notebook_tutorials/blob/main/colab_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bark text-to-speech voice cloning
Clone voices to create speaker history prompt files (.npz) for [Bark text-to-speech](https://github.com/suno-ai/bark).
This version of the notebook is made for Google Colab. Make sure your runtime hardware accelerator is set to GPU.
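
Before running the cells below, it can help to confirm that the GPU runtime is actually active. The following is a minimal sketch, assuming PyTorch is present in the runtime (if it is not, the helper falls back to CPU). `pick_device` is a hypothetical helper for illustration and is not part of the repository.

```python
def pick_device() -> str:
    """Return 'cuda' when a CUDA GPU is visible to PyTorch, else 'cpu'."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed (yet); fall back to CPU.
        return 'cpu'
    return 'cuda' if torch.cuda.is_available() else 'cpu'


device_name = pick_device()
print(f'Using device: {device_name}')
```

If this reports `cpu` on Colab, the accelerator is probably not set to GPU. Change the runtime type and rerun the notebook from the top.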

## Google Colab: Clone the repository

In [3]:
!git clone https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer/
%cd bark-voice-cloning-HuBERT-quantizer

Cloning into 'bark-voice-cloning-HuBERT-quantizer'...
remote: Enumerating objects: 1832, done.
remote: Counting objects: 100% (197/197), done.
remote: Compressing objects: 100% (95/95), done.
remote: Total 1832 (delta 115), reused 174 (delta 97), pack-reused 1635
Receiving objects: 100% (1832/1832), 319.74 MiB | 33.57 MiB/s, done.
Resolving deltas: 100% (116/116), done.
/content/bark-voice-cloning-HuBERT-quantizer


## Install packages

In [4]:
%pip install -r requirements.txt
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Ignoring soundfile: markers 'platform_system == "Windows"' don't match your environment
Collecting audiolm-pytorch (from -r requirements.txt (line 1))
  Downloading audiolm_pytorch-1.2.6-py3-none-any.whl (38 kB)
Collecting fairseq (from -r requirements.txt (line 2))
  Downloading fairseq-0.12.2.tar.gz (9.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.6/9.6 MB 73.8 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting huggingface-hub (from -r requirements.txt (line 3))
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m236.8/236.8 kB[0m [31m26.1 MB/s[0m eta [36m0:0

## Load models

In [5]:
large_quant_model = False  # Set to True to use the larger pretrained quantizer model
device = 'cuda'  # 'cuda', 'cpu', 'cuda:0', 0, -1, torch.device('cuda')

import numpy as np
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio
from hubert.hubert_manager import HuBERTManager
from hubert.pre_kmeans_hubert import CustomHubert
from hubert.customtokenizer import CustomTokenizer

# (remote model file name, local file name) for the selected quantizer size
model = (
    ('quantifier_V1_hubert_base_ls960_23.pth', 'tokenizer_large.pth')
    if large_quant_model
    else ('quantifier_hubert_base_ls960_14.pth', 'tokenizer.pth')
)

print('Loading HuBERT...')
hubert_model = CustomHubert(HuBERTManager.make_sure_hubert_installed(), device=device)
print('Loading Quantizer...')
quant_model = CustomTokenizer.load_from_checkpoint(HuBERTManager.make_sure_tokenizer_installed(model=model[0], local_file=model[1]), device)
print('Loading Encodec...')
encodec_model = EncodecModel.encodec_model_24khz()
encodec_model.set_target_bandwidth(6.0)
encodec_model.to(device)

print('Downloaded and loaded models!')

Loading HuBERT...
Downloading HuBERT base model
Downloaded HuBERT
Loading Quantizer...
Downloading HuBERT custom tokenizer


Downloading (…)rt_base_ls960_14.pth:   0%|          | 0.00/104M [00:00<?, ?B/s]

Downloaded tokenizer
Loading Encodec...


Downloading: "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th" to /root/.cache/torch/hub/checkpoints/encodec_24khz-d7cc33bc.th
100%|██████████| 88.9M/88.9M [00:00<00:00, 105MB/s]


Downloaded and loaded models!
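
The `set_target_bandwidth(6.0)` call above determines how many codebooks EnCodec emits per frame. The following is a worked arithmetic sketch, assuming the published figures for the 24 kHz EnCodec model (75 frames per second and 1024-entry codebooks, so 10 bits per codebook per frame). The constants and the function name are assumptions made for illustration; they are not read from the loaded model.

```python
import math

FRAME_RATE_HZ = 75      # assumed: frames per second of the 24 kHz EnCodec model
CODEBOOK_SIZE = 1024    # assumed: number of entries in each codebook


def codebooks_for_bandwidth(bandwidth_kbps: float) -> int:
    """Number of codebooks that fit into the given target bandwidth."""
    bits_per_codebook = math.log2(CODEBOOK_SIZE)          # 10 bits per frame
    bits_per_second = FRAME_RATE_HZ * bits_per_codebook   # 750 bit/s per codebook
    return int(bandwidth_kbps * 1000 // bits_per_second)


print(codebooks_for_bandwidth(6.0))
```

Under these assumptions, 6.0 kbps corresponds to 8 codebooks, which matches the prompt layout used later in this notebook: the fine prompt keeps all codebook rows and the coarse prompt keeps the first 2.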


## Load wav and create speaker history prompt

In [7]:
wav_file = 'voicevn.wav'  # Path to the audio file of the speaker to clone.
out_file = 'speaker.npz'  # Path where the cloned speaker prompt will be saved.

wav, sr = torchaudio.load(wav_file)

wav_hubert = wav.to(device)

if wav_hubert.shape[0] > 1:  # Downmix stereo (or multi-channel) audio to mono
    wav_hubert = wav_hubert.mean(0, keepdim=True)

print('Extracting semantics...')
semantic_vectors = hubert_model.forward(wav_hubert, input_sample_hz=sr)
print('Tokenizing semantics...')
semantic_tokens = quant_model.get_token(semantic_vectors)
print('Creating coarse and fine prompts...')
wav = convert_audio(wav, sr, encodec_model.sample_rate, 1).unsqueeze(0)

wav = wav.to(device)

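with torch.no_grad():
    encoded_frames = encodec_model.encode(wav)
# Join the codebook indices of all frames along time: shape (n_codebooks, T)
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1).squeeze()

# Move everything to the CPU before saving
codes = codes.cpu()
semantic_tokens = semantic_tokens.cpu()

# The coarse prompt is the first two codebooks of the fine prompt
np.savez(out_file,
         semantic_prompt=semantic_tokens,
         fine_prompt=codes,
         coarse_prompt=codes[:2, :]
         )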

print('Done!')

Extracting semantics...
Tokenizing semantics...
Creating coarse and fine prompts...
Done!
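
To sanity-check a saved prompt, the file can be reloaded with NumPy and its arrays inspected. The sketch below writes a dummy file with the same three keys so that it runs on its own. The array sizes and values are placeholders, not real output, and `describe_prompt` is a hypothetical helper rather than part of the repository.

```python
import os
import tempfile

import numpy as np


def describe_prompt(path: str) -> dict:
    """Return {key: shape} for every array stored in a speaker prompt file."""
    with np.load(path) as data:
        return {key: data[key].shape for key in data.files}


# Placeholder data standing in for a real prompt (sizes are arbitrary).
rng = np.random.default_rng(0)
codes = rng.integers(0, 1024, size=(8, 150))
semantic = rng.integers(0, 10000, size=(100,))

demo_path = os.path.join(tempfile.mkdtemp(), 'demo_speaker.npz')
np.savez(demo_path,
         semantic_prompt=semantic,
         fine_prompt=codes,
         coarse_prompt=codes[:2, :])

shapes = describe_prompt(demo_path)
print(shapes)
```

Pointing `describe_prompt` at the real `speaker.npz` should show a one-dimensional `semantic_prompt` and a `coarse_prompt` with the same number of columns as `fine_prompt` but only two rows. The file is meant to be passed to Bark as a speaker history prompt; see the Bark repository for the exact call.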
