# Voice cloning
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sensein/senselab/blob/main/tutorials/audio/voice_cloning.ipynb)

This tutorial shows how to use the `clone_voices` function from the `senselab` library to convert one person's speech into another person's voice. `senselab` currently integrates all `coqui TTS` voice cloning models, including `KNNVC` and `FREEVC`. The sections below walk through how to use them.

## Importing necessary classes and methods
First, we need to import the necessary modules and classes from the `senselab` package.

In [1]:
%pip install senselab

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os

from senselab.audio.data_structures import Audio
from senselab.audio.tasks.plotting.plotting import play_audio
from senselab.audio.tasks.preprocessing import extract_segments, resample_audios
from senselab.audio.tasks.voice_cloning import clone_voices
from senselab.utils.data_structures import CoquiTTSModel, DeviceType



## Initializations

In [3]:
# Specify the device type for model inference
device = DeviceType.CPU

# Specify the model
model = CoquiTTSModel(path_or_uri="voice_conversion_models/multilingual/multi-dataset/knnvc")

## Loading and preparing the source and target audio clips
We will load an audio file and resample it to 16 kHz, which ensures compatibility with the voice cloning model.
We will then extract two segments from the audio: one to serve as the source voice and one as the target voice.
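
To make the two preprocessing steps concrete, here is a minimal sketch of the arithmetic behind them, written in plain Python with no `senselab` calls. The helper names `resampled_length` and `seconds_to_samples` are hypothetical and exist only for this illustration; `senselab` may handle rounding at segment boundaries differently.

```python
def resampled_length(num_samples: int, orig_rate: int, new_rate: int) -> int:
    """Approximate number of samples after resampling (assumption: simple rounding)."""
    return round(num_samples * new_rate / orig_rate)


def seconds_to_samples(start_s: float, end_s: float, sampling_rate: int) -> tuple[int, int]:
    """Convert a (start, end) window in seconds into integer sample indices."""
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    return round(start_s * sampling_rate), round(end_s * sampling_rate)


# One second of 48 kHz audio becomes one third as many samples at 16 kHz.
print(resampled_length(48_000, orig_rate=48_000, new_rate=16_000))

# The two windows used in this tutorial, expressed as sample ranges at 16 kHz.
for start_s, end_s in [(0.0, 1.0), (3.2, 4.9)]:
    start, end = seconds_to_samples(start_s, end_s, sampling_rate=16_000)
    print(f"{start_s}-{end_s} s -> samples {start}:{end} ({end - start} samples)")
```

Resampling first means the segment boundaries are computed against the 16 kHz signal, so both extracted clips already match the rate the model expects.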

In [4]:
!mkdir -p tutorial_audio_files
!wget -O tutorial_audio_files/audio_48khz_mono_16bits.wav https://github.com/sensein/senselab/raw/main/src/tests/data_for_testing/audio_48khz_mono_16bits.wav


In [5]:
audio = Audio(filepath=os.path.abspath("tutorial_audio_files/audio_48khz_mono_16bits.wav"))

# Resample the audio to 16kHz
audio = resample_audios([audio], 16000)[0]

# Extract segments from the audio (example segments: 0.0-1.0s and 3.2-4.9s)
chunks = extract_segments([(audio, [(0.0, 1.0), (3.2, 4.9)])])[0]
audio1 = chunks[0]
audio2 = chunks[1]

# Play the extracted audio segments
play_audio(audio1)
play_audio(audio2)




## Cloning the voices
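Now we can perform the voice cloning by passing the source and target audios to `clone_voices`. The source audio supplies the spoken content, and the target audio supplies the voice that the output should sound like.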

In [6]:
source_audios = [audio1]
target_audios = [audio2]

# knnvc
cloned_output = clone_voices(
    source_audios=source_audios,
    target_audios=target_audios,
    model=model,
    device=device
)

# Play the cloned output
play_audio(cloned_output[0])

We can also try with different models.

In [7]:
# freevc24
cloned_output = clone_voices(
    source_audios=source_audios,
    target_audios=target_audios,
    model=CoquiTTSModel(path_or_uri="voice_conversion_models/multilingual/vctk/freevc24"),
    device=device
)

# Play the cloned output
play_audio(cloned_output[0])


In [8]:
# sparc (no Coqui model is passed in this call)
cloned_output = clone_voices(
    source_audios=source_audios,
    target_audios=target_audios,
    model=None,
    device=device
)

# Play the cloned output
play_audio(cloned_output[0])



## Objective evaluation
To ensure the quality and effectiveness of the voice cloning, we can perform several evaluations:
- Speaker Verification: Use an automatic speaker verification tool to determine if the original speaker, the target speaker, and the cloned speaker can be distinguished from each other.
- Speech Intelligibility: Use an automatic speech recognition system to verify that the content remains unchanged and intelligible.
- Emotion Preservation: Assess if the emotion in the original speech is preserved in the cloned voice.

To run all of these analyses, you can use `senselab`.
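
As a rough, library-independent illustration of the metrics behind two of the checks above, the sketch below computes a word error rate (a common intelligibility score) and a cosine similarity (a common way to compare speaker embeddings). It does not use the `senselab` API: the function names, the transcripts, and the embedding vectors are all made up for the example, and in practice the transcripts would come from a speech recognition model and the embeddings from a speaker verification model.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("vectors must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / norm


# Hypothetical transcripts of the source clip and of the cloned clip.
print(word_error_rate("the quick brown fox", "the quick fox"))

# Hypothetical speaker embeddings: a successful clone should score closer
# to the target speaker than to the source speaker.
cloned, target, source = [0.9, 0.1, 0.2], [1.0, 0.0, 0.3], [0.1, 1.0, 0.0]
print(cosine_similarity(cloned, target) > cosine_similarity(cloned, source))
```

A low word error rate suggests the content survived the conversion, while a higher similarity to the target than to the source suggests the voice identity was transferred.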