<a href="https://colab.research.google.com/github/charactr-platform/vocos/blob/main/notebooks/Bark%2BVocos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text-to-Audio Synthesis using Bark and Vocos

In this notebook, we use the [Bark](https://github.com/suno-ai/bark) generative model to turn a text prompt into EnCodec audio tokens. These tokens are then passed separately to two decoders, EnCodec and Vocos, each of which reconstructs the audio waveform. Compare the results to hear the differences in audio quality and characteristics.

Make sure you have Bark and Vocos installed:

In [1]:
!pip install git+https://github.com/suno-ai/bark.git
!pip install vocos


Download and load the Bark models:

In [None]:
from bark import preload_models

preload_models()

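Download and load Vocos. We start with the imports and two small filesystem helpers; the pretrained model is loaded in the following cell: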

In [12]:
import os

import librosa
import numpy as np
import soundfile as sf
import torch
import torchaudio
from vocos import Vocos


def file_pathname(target_dir, target_suffix=".wav"):
    """Recursively collect (directory, file stem) pairs for files with target_suffix."""
    find_res = []
    for root_path, _dirs, files in os.walk(target_dir):
        for file in files:
            file_name, suffix_name = os.path.splitext(file)
            if suffix_name == target_suffix:
                find_res.append((os.path.normpath(root_path), file_name))
    return find_res


def check_path(path1):
    """Create path1 (or every path in a list) if missing; return True if all already existed."""
    if isinstance(path1, list):
        flag = True
        for p in path1:
            # Evaluate every entry so that each missing directory gets created
            flag_tmp = check_path(p)
            flag = flag and flag_tmp
        return flag
    flag = os.path.isdir(path1)
    if not flag:
        os.makedirs(path1)
        print(path1 + ' has been created')
    return flag
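
For comparison, the recursive search performed by `file_pathname` can also be expressed with `pathlib`. This is only a sketch and not part of the original notebook: `list_audio_files` is a hypothetical name, and we assume an exact, case-sensitive suffix match is what is wanted.

```python
from pathlib import Path


def list_audio_files(target_dir, target_suffix=".wav"):
    """Return sorted (directory, stem) pairs for files under target_dir with the given suffix."""
    pairs = []
    for path in Path(target_dir).rglob("*"):
        # Path.suffix is the final extension including the dot, e.g. ".wav"
        if path.is_file() and path.suffix == target_suffix:
            pairs.append((str(path.parent), path.stem))
    return sorted(pairs)
```

Sorting the result makes the processing order reproducible across runs, which `os.walk` alone does not guarantee.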


In [16]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz").to(device)

input_dir = r'/data3/tansongbin/vocoder_projects/dataset/input'
output_dir = r'/data3/tansongbin/vocoder_projects/dataset/output/vocos'
check_path(output_dir)  # create the output directory once, before the loop

for p, n in file_pathname(input_dir, '.wav'):
    y, sr = torchaudio.load(os.path.join(p, n + '.wav'))
    if y.size(0) > 1:  # mix to mono
        y = y.mean(dim=0, keepdim=True)
    y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000).to(device)
    y_g_hat = vocos(y)  # copy-synthesis: waveform -> mel features -> waveform
    audio = y_g_hat.squeeze().cpu().numpy()
    audio = librosa.resample(y=audio, orig_sr=24000, target_sr=16000)
    peak = np.max(np.abs(audio))
    if peak > 1.0:  # rescale only if the signal would otherwise clip
        audio /= peak
    output_file = os.path.join(output_dir, n + '_generated.wav')
    sf.write(output_file, audio, 16000)
    print(output_file)


/data3/tansongbin/vocoder_projects/dataset/output/vocos/9139459052777997503-359_A1F021wWnWMyn8iR#-40-4+65354297751687267-781_A1D141ceIFmJhMlV#-16-3+8894785898129003698-97_A44121wyOt2SQSsX#-13-4_generated.wav
/data3/tansongbin/vocoder_projects/dataset/output/vocos/test_audio_1_generated.wav
/data3/tansongbin/vocoder_projects/dataset/output/vocos/test_audio_2_generated.wav
/data3/tansongbin/vocoder_projects/dataset/output/vocos/9120253738173975542-20_A2C240MtQXHWxUcS#-6-3+4457760131197602548-324_A2A221203uEoWjmk#-35-4+4315090185852769861-237_A3Z121hkTwz0e0Vg#-21-4_generated.wav
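
The clipping guard used in the loop above can be isolated into a small helper. This is a minimal sketch, assuming floating-point audio in a NumPy array; `normalize_if_clipping` is a hypothetical name and not part of Vocos, librosa, or soundfile.

```python
import numpy as np


def normalize_if_clipping(audio: np.ndarray) -> np.ndarray:
    """Scale audio so its peak is 1.0, but only when the peak exceeds 1.0."""
    peak = float(np.max(np.abs(audio))) if audio.size else 0.0
    if peak > 1.0:
        return audio / peak
    return audio
```

Signals that already fit in [-1, 1] are returned unchanged, so quiet recordings keep their original level instead of being boosted to full scale.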


We are going to reuse `text_to_semantic` from the Bark API, but to reconstruct the audio waveform with a custom vocoder, we need to slightly redefine the API to return `fine_tokens`.

In [None]:
from typing import Optional, Union, Dict

import numpy as np
from bark.generation import generate_coarse, generate_fine


def semantic_to_audio_tokens(
    semantic_tokens: np.ndarray,
    history_prompt: Optional[Union[Dict, str]] = None,
    temp: float = 0.7,
    silent: bool = False,
    output_full: bool = False,
):
    coarse_tokens = generate_coarse(
        semantic_tokens, history_prompt=history_prompt, temp=temp, silent=silent, use_kv_caching=True
    )
    fine_tokens = generate_fine(coarse_tokens, history_prompt=history_prompt, temp=0.5)

    if output_full:
        full_generation = {
            "semantic_prompt": semantic_tokens,
            "coarse_prompt": coarse_tokens,
            "fine_prompt": fine_tokens,
        }
        return full_generation
    return fine_tokens
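
Before decoding, it can help to sanity-check the returned array. The sketch below is not part of Bark; `describe_audio_tokens` is a hypothetical helper. It assumes the fine tokens are laid out as (codebooks, frames) with 8 codebooks of 1024 entries each at 75 frames per second, which matches our understanding of Bark's defaults but should be verified against the installed version.

```python
import numpy as np

N_CODEBOOKS = 8        # assumed number of fine codebooks
CODEBOOK_SIZE = 1024   # assumed number of entries per codebook
FRAME_RATE_HZ = 75     # assumed token frames per second


def describe_audio_tokens(tokens: np.ndarray) -> dict:
    """Validate the token layout and estimate the duration of the decoded audio."""
    if tokens.ndim != 2 or tokens.shape[0] != N_CODEBOOKS:
        raise ValueError(f"expected shape ({N_CODEBOOKS}, T), got {tokens.shape}")
    if tokens.size and (tokens.min() < 0 or tokens.max() >= CODEBOOK_SIZE):
        raise ValueError("token values fall outside the codebook range")
    n_frames = tokens.shape[1]
    return {"n_frames": n_frames, "duration_s": n_frames / FRAME_RATE_HZ}
```

Under these assumptions, an array of shape `(8, 150)` corresponds to two seconds of audio.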

Let's create a text prompt and generate audio tokens:

In [11]:
from bark import text_to_semantic

history_prompt = None
text_prompt = "So, you've heard about neural vocoding? [laughs] We've been messing around with this new model called Vocos."
semantic_tokens = text_to_semantic(text_prompt, history_prompt=history_prompt, temp=0.7, silent=False,)
audio_tokens = semantic_to_audio_tokens(
    semantic_tokens, history_prompt=history_prompt, temp=0.7, silent=False, output_full=False,
)


Reconstruct audio waveform with EnCodec:

In [None]:
import torch
import torchaudio
from bark.generation import codec_decode
from IPython.display import Audio

encodec_output = codec_decode(audio_tokens)

# Upsample to 44100 Hz for better reproduction on audio hardware
encodec_output = torchaudio.functional.resample(torch.from_numpy(encodec_output), orig_freq=24000, new_freq=44100)
Audio(encodec_output, rate=44100)

Reconstruct with Vocos:

In [None]:
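# Decoding EnCodec tokens requires the EnCodec variant of Vocos; the mel-spectrogram
# checkpoint loaded earlier does not support codes_to_features.
vocos_encodec = Vocos.from_pretrained("charactr/vocos-encodec-24khz").to(device)

audio_tokens_torch = torch.from_numpy(audio_tokens).to(device)
features = vocos_encodec.codes_to_features(audio_tokens_torch)
vocos_output = vocos_encodec.decode(features, bandwidth_id=torch.tensor([2], device=device))  # 6 kbps
# Upsample to 44100 Hz for better reproduction on audio hardware
vocos_output = torchaudio.functional.resample(vocos_output, orig_freq=24000, new_freq=44100).cpu()
Audio(vocos_output.numpy(), rate=44100)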
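
The `# 6 kbps` comment follows from simple arithmetic: a codebook with 1024 entries carries 10 bits per frame, so 8 codebooks at 75 frames per second give 6000 bits per second. The sketch below reproduces that calculation. It assumes `bandwidth_id` indexes into the bandwidth table `[1.5, 3.0, 6.0, 12.0]` kbps, which is our reading of the Vocos EnCodec model and should be checked against its documentation; the function names are hypothetical.

```python
import math

BANDWIDTHS_KBPS = [1.5, 3.0, 6.0, 12.0]  # assumed table indexed by bandwidth_id
CODEBOOK_SIZE = 1024                     # entries per codebook
FRAME_RATE_HZ = 75                       # token frames per second


def bitrate_kbps(n_codebooks: int) -> float:
    """Bitrate implied by transmitting n_codebooks codes per frame."""
    bits_per_code = math.log2(CODEBOOK_SIZE)  # 10 bits for 1024 entries
    return n_codebooks * bits_per_code * FRAME_RATE_HZ / 1000


def bandwidth_id_for(n_codebooks: int) -> int:
    """Index of the matching bandwidth; raises ValueError if there is none."""
    return BANDWIDTHS_KBPS.index(bitrate_kbps(n_codebooks))
```

With these assumptions, `bandwidth_id_for(8)` evaluates to `2`, the value passed to `decode` above.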

Optionally, save both outputs as MP3 files:

In [None]:
torchaudio.save("encodec.mp3", encodec_output[None, :], 44100, compression=128)
torchaudio.save("vocos.mp3", vocos_output, 44100, compression=128)