# Speaker Embeddings Extraction Tutorial

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sensein/senselab/blob/main/tutorials/audio/extract_speaker_embeddings.ipynb)


## Introduction

This tutorial demonstrates how to use the `extract_speaker_embeddings_from_audios` function to extract speaker embeddings from audio files. Speaker embeddings are fixed-dimensional vector representations that capture the unique characteristics of a speaker's voice, which can be used for various tasks such as speaker identification, verification, and diarization.
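To make the verification use case concrete, here is a minimal plain-Python sketch of how two embeddings might be compared. The helper names, the toy 4-dimensional vectors, and the `0.5` threshold are all hypothetical choices for illustration; real embeddings are much longer, and a usable threshold has to be tuned on held-out data for the model at hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(a, b, threshold=0.5):
    """Accept the pair when the similarity reaches the (hypothetical) threshold."""
    return cosine_similarity(a, b) >= threshold


# Toy stand-ins for embeddings (assumed values, not model output).
enrolled = [0.9, 0.1, -0.3, 0.4]
same_voice = [0.8, 0.2, -0.2, 0.5]
other_voice = [-0.5, 0.7, 0.6, -0.1]

print(same_speaker(enrolled, same_voice))   # vectors point the same way
print(same_speaker(enrolled, other_voice))  # vectors point in opposing directions
```

Cosine similarity ignores vector length and looks only at direction, which is why it is a common choice for comparing embeddings.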

## Setup
First, let's import the necessary libraries and the function we'll be using.

In [1]:
%pip install 'senselab[audio]'

Note: you may need to restart the kernel to use updated packages.


In [2]:
from typing import List

import matplotlib.pyplot as plt
import numpy as np
import torch
import os

from senselab.audio.data_structures import Audio
from senselab.audio.tasks.preprocessing import downmix_audios_to_mono, resample_audios
from senselab.audio.tasks.speaker_embeddings import extract_speaker_embeddings_from_audios
from senselab.utils.data_structures import DeviceType, SpeechBrainModel



## Loading Audio Files
Next, let's load the audio files and preprocess them with senselab's built-in tools.

In [3]:
!mkdir -p tutorial_audio_files
!wget -O tutorial_audio_files/audio_48khz_mono_16bits.wav https://github.com/sensein/senselab/raw/main/src/tests/data_for_testing/audio_48khz_mono_16bits.wav
!wget -O tutorial_audio_files/audio_48khz_stereo_16bits.wav https://github.com/sensein/senselab/raw/main/src/tests/data_for_testing/audio_48khz_stereo_16bits.wav

audio1 = Audio(filepath=os.path.abspath("tutorial_audio_files/audio_48khz_mono_16bits.wav"))
audio2 = Audio(filepath=os.path.abspath("tutorial_audio_files/audio_48khz_stereo_16bits.wav"))

# Downmix to mono
audio2 = downmix_audios_to_mono([audio2])[0]

# Resample both audios to 16kHz
audios = resample_audios([audio1, audio2], 16000)


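The two preprocessing steps above can be illustrated with a small plain-Python sketch. It assumes that downmixing means averaging the channels sample by sample (the usual convention; senselab's exact implementation is not shown in this tutorial), and it only computes the sample count expected after resampling. A real resampler also applies a low-pass filter to avoid aliasing, which this sketch does not attempt. Both helper names are hypothetical.

```python
def downmix_to_mono(channels):
    """Average equally long channels sample by sample (assumed convention)."""
    n_channels = len(channels)
    return [sum(samples) / n_channels for samples in zip(*channels)]


def resampled_length(num_samples, orig_rate, target_rate):
    """Number of samples expected after changing the sampling rate."""
    return round(num_samples * target_rate / orig_rate)


# Toy stereo signal: one list of samples per channel.
stereo = [[0.5, 0.25, -1.0], [0.0, 0.75, -0.5]]
mono = downmix_to_mono(stereo)
print(mono)

# One second of audio at 48 kHz becomes one second at 16 kHz.
print(resampled_length(48000, 48000, 16000))
```

The duration stays the same after resampling; only the number of samples per second changes, here by a factor of three.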


## Extracting Speaker Embeddings

Now, let's use the `extract_speaker_embeddings_from_audios` function to extract embeddings from the preprocessed audio files. We use the ECAPA-TDNN model here, but feel free to use any SpeechBrain-compatible model.

In [4]:
model = SpeechBrainModel(path_or_uri="speechbrain/spkrec-ecapa-voxceleb", revision="main")
device = DeviceType.CUDA if torch.cuda.is_available() else DeviceType.CPU
embeddings = extract_speaker_embeddings_from_audios(audios, model, device)

print(f"Number of embeddings: {len(embeddings)}")
print(f"Embedding size for file 1: {embeddings[0].shape}")
print(f"Embedding size for file 2: {embeddings[1].shape}")



Number of embeddings: 2
Embedding size for file 1: torch.Size([192])
Embedding size for file 2: torch.Size([192])


## Visualizing Embeddings
That's pretty much it! The extracted speaker embeddings can now be used in whatever downstream task we need.

For example, we can visualize the embeddings in two ways: directly as a heatmap, and as a cosine similarity matrix that quantifies how close the two audio files are. Both visualizations show that the embeddings of the two audio files are nearly identical.

In [5]:
from senselab.utils.tasks.cosine_similarity import compute_cosine_similarity


# DIRECTLY PLOT THE EMBEDDINGS FOR THE TWO FILES
def plot_embedding_heatmap(embeddings: List[torch.Tensor], titles: List[str]) -> None:
    """Plot a heatmap of a list of speaker embeddings."""
    fig, axes = plt.subplots(len(embeddings), 1, figsize=(10, 5 * len(embeddings)))
    if len(embeddings) == 1:
        axes = [axes]
    
    for ax, embedding, title in zip(axes, embeddings, titles):
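        # Detach and move to CPU so matplotlib can read the tensor even if it was computed on a GPU
        im = ax.imshow(embedding.detach().cpu().unsqueeze(0), aspect='auto', cmap='viridis')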
        ax.set_title(f"Speaker Embedding: {title}")
        ax.set_xlabel("Embedding Dimension")
        fig.colorbar(im, ax=ax)
    
    plt.tight_layout()
    plt.show()

plot_embedding_heatmap(embeddings, ["file 1", "file 2"])


# PLOT THE COSINE SIMILARITY MATRIX FOR THE TWO FILES
def plot_similarity_matrix(embeddings: List[torch.Tensor], labels: List[str]) -> None:
    """Plot a similarity matrix for a list of embeddings."""
    n = len(embeddings)
    similarity_matrix = np.zeros((n, n))
    
    for i in range(n):
        for j in range(n):
            # Convert the result to a Python float before storing it in the NumPy array
            similarity_matrix[i, j] = float(compute_cosine_similarity(embeddings[i], embeddings[j]))
    
    fig, ax = plt.subplots(figsize=(8, 6))
    im = ax.imshow(similarity_matrix, cmap='coolwarm', vmin=-1, vmax=1)
    
    ax.set_xticks(np.arange(n))
    ax.set_yticks(np.arange(n))
    ax.set_xticklabels(labels)
    ax.set_yticklabels(labels)
    
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
    
    for i in range(n):
        for j in range(n):
            ax.text(j, i, f"{similarity_matrix[i, j]:.2f}", ha="center", va="center", color="black")
    
    ax.set_title("Cosine Similarity Between Speaker Embeddings")
    fig.colorbar(im)
    plt.tight_layout()
    plt.show()

plot_similarity_matrix(embeddings, ["file 1", "file 2"])
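The nested loop above is fine for two files. For larger collections, the same matrix can be computed in one step by normalizing each embedding to unit length and taking a matrix product. The following is a sketch in NumPy under that idea; `cosine_similarity_matrix` is a hypothetical helper, not part of senselab, and the toy vectors are made up.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Return the (n, n) matrix of pairwise cosine similarities for (n, d) input."""
    x = np.asarray(embeddings, dtype=np.float64)
    # Scale every row to unit length; dot products of unit vectors are cosines.
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T


# Toy 3-dimensional vectors standing in for real embeddings.
toy = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 3.0, 0.0]])
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 3))
```

A list of PyTorch embeddings could presumably be passed in after `torch.stack(embeddings).cpu().numpy()`, assuming every embedding has the same shape.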



When you have a large number of embeddings, another common approach is to apply a dimensionality reduction technique and plot the result, which makes clusters and other structure in the data easier to spot. Please see the dimensionality reduction tutorial for more information on how to do this within senselab.
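As a rough illustration of the idea, the sketch below projects synthetic 192-dimensional vectors onto their first two principal components with plain NumPy. PCA is used here only as one simple example of dimensionality reduction; it is not necessarily the method used in senselab's tutorial. The data are random stand-ins for two speakers, not real embeddings, and `pca_2d` is a hypothetical helper.

```python
import numpy as np


def pca_2d(embeddings):
    """Project (n, d) data onto its first two principal components."""
    x = np.asarray(embeddings, dtype=np.float64)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
# Synthetic stand-ins: two "speakers" with 20 vectors each (assumed data).
speaker_a = rng.normal(loc=0.0, scale=0.1, size=(20, 192))
speaker_b = rng.normal(loc=1.0, scale=0.1, size=(20, 192))

points = pca_2d(np.vstack([speaker_a, speaker_b]))
print(points.shape)  # (40, 2)
```

Plotting `points[:, 0]` against `points[:, 1]` with `plt.scatter` would then show the two synthetic groups as separate clusters.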

## Conclusion

This tutorial demonstrated how to use the `extract_speaker_embeddings_from_audios` function to extract speaker embeddings from audio files. We visualized the embeddings and compared them using cosine similarity. These embeddings can be used for various speaker recognition tasks, such as speaker identification, verification, and diarization.

Remember that the performance of these embeddings can vary depending on the specific dataset, task, and evaluation protocol used. Always refer to the most recent literature for up-to-date benchmarks and best practices in speaker recognition tasks.