<a href="https://colab.research.google.com/github/uucad/aigc_LLM_engineering/blob/main/soft_vc_demo_cy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Soft Speech Units for Improved Voice Conversion

Demo for the paper: [A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion](https://ieeexplore.ieee.org/abstract/document/9746484).

- [Companion webpage](https://bshall.github.io/soft-vc/)
- [Home repo](https://github.com/bshall/soft-vc)
- [HuBERT content encoders](https://github.com/bshall/hubert)
- [Acoustic Models](https://github.com/bshall/acoustic-model)
- [HiFiGAN vocoder](https://github.com/bshall/hifigan)

In [None]:
import torch, torchaudio
import requests
import IPython.display as display

Download the HuBERT content encoder (either hubert_soft or hubert_discrete):

In [None]:
hubert = torch.hub.load("bshall/hubert:main", "hubert_soft").cuda()

Downloading: "https://github.com/bshall/hubert/zipball/main" to /root/.cache/torch/hub/main.zip
Downloading: "https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt" to /root/.cache/torch/hub/checkpoints/hubert-soft-0d54a1f4.pt
100%|██████████| 361M/361M [00:03<00:00, 103MB/s]


Download the acoustic model (either hubert_soft or hubert_discrete):

In [None]:
acoustic = torch.hub.load("bshall/acoustic-model:main", "hubert_soft").cuda()

Downloading: "https://github.com/bshall/acoustic-model/zipball/main" to /root/.cache/torch/hub/main.zip
Downloading: "https://github.com/bshall/acoustic-model/releases/download/v0.1/hubert-soft-0321fd7e.pt" to /root/.cache/torch/hub/checkpoints/hubert-soft-0321fd7e.pt
100%|██████████| 71.8M/71.8M [00:01<00:00, 64.7MB/s]


Download the vocoder (either hifigan_hubert_soft or hifigan_hubert_discrete):

In [None]:
hifigan = torch.hub.load("bshall/hifigan:main", "hifigan_hubert_soft").cuda()

Downloading: "https://github.com/bshall/hifigan/zipball/main" to /root/.cache/torch/hub/main.zip
Downloading: "https://github.com/bshall/hifigan/releases/download/v0.1/hifigan-hubert-discrete-bbad3043.pt" to /root/.cache/torch/hub/checkpoints/hifigan-hubert-discrete-bbad3043.pt
100%|██████████| 54.9M/54.9M [00:00<00:00, 88.0MB/s]


Removing weight norm...


Download an example utterance:

In [None]:
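response = requests.get("https://drive.google.com/uc?export=preview&id=1Y3KuPAhB5VcsmIaokBVKu3LUEZOfhSu8")
# Fail loudly on an HTTP error instead of saving an error page as example.wav
response.raise_for_status()
with open("example.wav", "wb") as file:
    file.write(response.content)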
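
Shared-drive links can occasionally return an HTML page (for example a confirmation or quota notice) rather than the file itself. A minimal sketch of a sanity check, assuming the example is a standard RIFF/WAVE file; `looks_like_wav` is a hypothetical helper and not part of the original demo:

```python
def looks_like_wav(data: bytes) -> bool:
    """Return True if the bytes start with a RIFF/WAVE header."""
    # A canonical WAV file begins with b"RIFF", a 4-byte size, then b"WAVE".
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"


# Possible use with the download above (hypothetical check):
# with open("example.wav", "rb") as f:
#     assert looks_like_wav(f.read(12)), "download is not a WAV file"
```

Only the first 12 bytes are inspected, so the check is cheap but does not validate the rest of the file.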

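Or upload your own recording. Colab saves it under its original filename, so load that file in the next step instead of "example.wav":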

In [None]:
from google.colab import files

uploaded = files.upload()
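
`files.upload()` returns a dictionary keyed by the uploaded filenames. A small sketch of picking the name to load, assuming a single upload; `first_uploaded_name` and the sample dictionary are hypothetical and stand in for the real Colab result:

```python
def first_uploaded_name(uploaded: dict) -> str:
    """Return the first filename from the mapping returned by files.upload()."""
    if not uploaded:
        raise ValueError("no file was uploaded")
    return next(iter(uploaded))


# Stand-in for the dictionary Colab returns (filename -> file bytes).
example_upload = {"my_voice.wav": b""}
path = first_uploaded_name(example_upload)
```

The resulting `path` can then be passed to `torchaudio.load` in place of "example.wav".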

Load the source audio (and resample to 16 kHz if necessary):

In [None]:
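source, sr = torchaudio.load("example.wav")
# Resample from the file's sample rate to 16 kHz
source = torchaudio.functional.resample(source, sr, 16000)
# Add a batch dimension: (channels, samples) -> (1, channels, samples)
source = source.unsqueeze(0).cuda()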
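
The cell above keeps every channel of the recording, so a stereo file yields a tensor of shape (1, 2, samples). Assuming the content encoder expects a single channel (an assumption about the model's input, not something stated in this notebook), one way to downmix before adding the batch dimension is sketched below; `to_mono_batch` is a hypothetical helper:

```python
import torch


def to_mono_batch(wav: torch.Tensor) -> torch.Tensor:
    """Average channels and add a batch dimension.

    Takes a (channels, samples) tensor, as returned by torchaudio.load,
    and returns a (1, 1, samples) tensor.
    """
    mono = wav.mean(dim=0, keepdim=True)  # (1, samples)
    return mono.unsqueeze(0)              # (1, 1, samples)


stereo = torch.zeros(2, 16000)  # one second of silent stereo audio at 16 kHz
batch = to_mono_batch(stereo)
```

For a mono file the mean over a single channel leaves the samples unchanged, so the helper is safe to apply in both cases.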

Convert to the target speaker:

In [None]:
with torch.inference_mode():
    # Extract speech units
    units = hubert.units(source)
    # Generate target spectrogram
    mel = acoustic.generate(units).transpose(1, 2)
    # Generate audio waveform
    target = hifigan(mel)
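
To get a feel for the tensor sizes at each stage, the arithmetic below estimates them from the input length. The rates are assumptions not stated in this notebook: roughly one unit per 320 input samples (50 units per second at 16 kHz), two spectrogram frames per unit, and 160 output samples per frame. Check them against the linked repositories before relying on them; `pipeline_lengths` is a hypothetical helper.

```python
def pipeline_lengths(
    num_samples: int,
    samples_per_unit: int = 320,   # assumed content-encoder stride
    frames_per_unit: int = 2,      # assumed acoustic-model upsampling factor
    samples_per_frame: int = 160,  # assumed vocoder hop length
) -> dict:
    """Estimate sequence lengths through the units -> mel -> waveform pipeline."""
    units = num_samples // samples_per_unit
    frames = units * frames_per_unit
    output_samples = frames * samples_per_frame
    return {"units": units, "mel_frames": frames, "output_samples": output_samples}


# Three seconds of 16 kHz audio
lengths = pipeline_lengths(3 * 16000)
```

Comparing these estimates with `units.shape`, `mel.shape`, and `target.shape` after running the conversion cell is a quick way to confirm or correct the assumed rates.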

Let's listen to the results!

The source:

In [None]:
display.Audio(source.squeeze().cpu(), rate=16000)

And the converted utterance:

In [None]:
display.Audio(target.squeeze().cpu(), rate=16000)