# Conversational data exploration tutorial

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sensein/senselab/blob/main/tutorials/audio/conversational_data_exploration.ipynb)


In this tutorial we demonstrate how to perform **conversational data exploration** using the `senselab` library. The goal is to load raw audio samples, prepare them for analysis, perform speaker diarization (to understand who speaks and when), transcribe speech segments (to learn what was said), extract acoustic features, speaker embeddings and self-supervised model embeddings (to describe each speech segment), and finally assemble the results into a convenient JSON format for further analysis.

We will make use of the following `senselab` utilities:

* `read_audios` – load audio files from disk into `Audio` objects;
* `downmix_audios_to_mono` and `resample_audios` – convert the audio to mono and a common sampling rate (16 kHz);
* `diarize_audios` – run the NVIDIA diarization model `nvidia/diar_sortformer_4spk-v1` to label each speaker in the conversation;
* `extract_segments` – cut the audio into segments based on diarization timestamps;
* `transcribe_audios` – perform automatic speech recognition (ASR) using the models `openai/whisper-tiny` and `openai/whisper-small`;
* `extract_features_from_audios` – compute OpenSMILE, Praat/Parselmouth and SQUIM features for each segment;
* `extract_speaker_embeddings_from_audios` – compute a fixed‑length speaker embedding for each segment using `speechbrain/spkrec-ecapa-voxceleb` and `speechbrain/spkrec-resnet-voxceleb`;
* `extract_ssl_embeddings_from_audios` – compute self-supervised model embeddings for each segment using `microsoft/wavlm-base`; we then mean-pool them over time to obtain one fixed‑length vector per segment.


The final output will be a list of JSON objects, one per speaker turn, containing the transcript, extracted features, embeddings and the speaker identifier. We intend this as a starting point for further analyses. 

## Installation

In [None]:
%pip install senselab

## Data download

The example conversation used in this tutorial was generated automatically with Higgs audio v2 by Boson AI. See [here](https://huggingface.co/spaces/smola/higgs_audio_v2) for details.

In [1]:
!mkdir -p tutorial_audio_files
!wget -O tutorial_audio_files/english_conversation_higgs_audio_v2.wav https://github.com/sensein/senselab/raw/main/src/tests/data_for_testing/english_conversation_higgs_audio_v2.wav

--2025-11-11 22:08:02--  https://github.com/sensein/senselab/raw/main/src/tests/data_for_testing/english_conversation_higgs_audio_v2.wav
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/sensein/senselab/main/src/tests/data_for_testing/english_conversation_higgs_audio_v2.wav [following]
--2025-11-11 22:08:02--  https://raw.githubusercontent.com/sensein/senselab/main/src/tests/data_for_testing/english_conversation_higgs_audio_v2.wav
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1031084 (1007K) [audio/wav]
Saving to: ‘tutorial_audio_files/english_conversation_higgs_audio_v2.wav’




## Processing

In [2]:
# Necessary libraries
import json
import os

import torch

# Audio data structure
from senselab.audio.data_structures import Audio

# Feature extraction API
from senselab.audio.tasks.features_extraction import extract_features_from_audios
from senselab.audio.tasks.input_output.utils import read_audios

# Pre‑processing functions
from senselab.audio.tasks.preprocessing.preprocessing import (
    downmix_audios_to_mono,
    extract_segments,
    resample_audios,
)

# Speaker diarization API
from senselab.audio.tasks.speaker_diarization import diarize_audios

# Speaker embeddings extraction API
from senselab.audio.tasks.speaker_embeddings import extract_speaker_embeddings_from_audios

# Automatic Speech Recognition API
from senselab.audio.tasks.speech_to_text import transcribe_audios

# Self-supervised embeddings extraction API
from senselab.audio.tasks.ssl_embeddings import extract_ssl_embeddings_from_audios

# Utility classes for specifying models and devices
from senselab.utils.data_structures import HFModel, SpeechBrainModel



In [3]:
# Note: it's recommended to use absolute paths for file operations
# to avoid issues with how Pydra (a Python dependency used under the hood) caches and retrieves files.
file_paths = [os.path.abspath("tutorial_audio_files/english_conversation_higgs_audio_v2.wav")]
print("Audio files to process:", file_paths)

Audio files to process: ['/Users/fabiocat/git/senselab/tutorials/audio/tutorial_audio_files/english_conversation_higgs_audio_v2.wav']


In [4]:
# Load the audio files into senselab `Audio` objects. Each `Audio` object loads its data lazily.
audios = read_audios(file_paths)

In [5]:
# Speaker diarization and ASR models typically expect a single channel and a 16kHz sampling rate.

# Convert the audio files to mono. The `downmix_audios_to_mono` function averages multiple channels.
audios = downmix_audios_to_mono(audios)

# Resample the audio to the target sampling rate required by downstream models.
target_sr = 16000
audios = resample_audios(audios, resample_rate=target_sr)



In [6]:
# Perform speaker diarization on the pre‑processed audio.
# We use the NVIDIA Sortformer model from Hugging Face. You can change the model path
# if you want to experiment with other diarization models.
# Note: "nvidia/diar_sortformer_4spk-v1" generally performs well, but it can detect at most 4 speakers.
diar_model = HFModel(path_or_uri="nvidia/diar_sortformer_4spk-v1")

# Run the diarization.  The output is a list (one per audio file) of lists of
# `ScriptLine` objects.  Each ScriptLine has a `speaker` label and start/end times (in seconds).
diarization_results = diarize_audios(
    audios=audios,
    model=diar_model,
)



In [7]:
# Build a list of (Audio, list_of_segment_times) pairs.
from typing import List, Tuple

# Each entry corresponds to one source audio file and contains all diarization intervals for that file.
# We convert the ScriptLine objects into (start, end) tuples.
segments_info = []
for audio, script_lines in zip(audios, diarization_results):
    times: List[Tuple[float, float]] = []
    for line in script_lines:
        times.append((line.start, line.end))
    segments_info.append((audio, times))

# Use `extract_segments` to slice the audio into individual speaker turns.  This returns
# a list (one per audio file) of lists of `Audio` segments.  Each segment retains the same
# sampling rate and metadata as the resampled recording.
segmented_audios_list = extract_segments(segments_info)
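
To make the slicing step concrete, here is a minimal sketch of how `(start, end)` times in seconds can be mapped to sample index ranges at a given sampling rate. It is not how `extract_segments` is implemented. The helper name `times_to_sample_ranges`, the rounding rule and the clamping to the recording length are all assumptions made for illustration, and the second interval is made up.

```python
from typing import List, Tuple


def times_to_sample_ranges(
    times: List[Tuple[float, float]],
    sampling_rate: int,
    num_samples: int,
) -> List[Tuple[int, int]]:
    """Map (start, end) times in seconds to half-open sample index ranges.

    Hypothetical helper: rounding to the nearest sample and clamping to the
    recording length are illustrative choices, not senselab behaviour.
    """
    ranges = []
    for start, end in times:
        first = max(0, int(round(start * sampling_rate)))
        last = min(num_samples, int(round(end * sampling_rate)))
        if last > first:  # skip empty or inverted intervals
            ranges.append((first, last))
    return ranges


# A 5-second recording at 16 kHz has 80000 samples.
ranges = times_to_sample_ranges([(0.0, 3.28), (3.5, 6.0)], 16000, 80000)
print(ranges)
```

The second interval ends after the recording does, so its end index is clamped to the last available sample.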

In [8]:
# Populate metadata from file_paths (per file) and diarization_results (per segment)
for i, (segments, lines) in enumerate(zip(segmented_audios_list, diarization_results)):
    src_path = file_paths[i] if i < len(file_paths) else None
    if len(segments) != len(lines):
        print(f"[warn] file #{i}: {len(segments)} segments vs {len(lines)} diarization entries — zipping to shortest.")
    n = min(len(segments), len(lines))
    for j in range(n):
        seg = segments[j]
        line = lines[j]

        # choose the container
        md = seg.metadata

        # fill fields from order-aligned sources
        md["original_file"] = src_path
        md["sampling_rate"] = seg.sampling_rate
        md["speaker_id"] = getattr(line, "speaker", None)
        md["segment_start"] = getattr(line, "start", None)
        md["segment_end"] = getattr(line, "end", None)

# Flatten to a single list of Audio objects
flattened_segments = [seg for group in segmented_audios_list for seg in group]
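
Once every segment carries `speaker_id`, `segment_start` and `segment_end`, simple conversational statistics follow directly from the metadata. The sketch below sums speaking time per speaker over plain dictionaries with those same keys. The helper name `speaking_time_per_speaker` and the example records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional


def speaking_time_per_speaker(records: Iterable[dict]) -> Dict[Optional[str], float]:
    """Sum (segment_end - segment_start) for each speaker_id."""
    totals = defaultdict(float)
    for rec in records:
        start = rec.get("segment_start")
        end = rec.get("segment_end")
        if start is None or end is None:
            continue  # skip segments without timing information
        totals[rec.get("speaker_id")] += end - start
    return dict(totals)


# Made-up records that mimic the metadata fields filled in above.
records = [
    {"speaker_id": "speaker_0", "segment_start": 0.0, "segment_end": 3.0},
    {"speaker_id": "speaker_1", "segment_start": 3.5, "segment_end": 5.0},
    {"speaker_id": "speaker_0", "segment_start": 5.5, "segment_end": 7.0},
]
totals = speaking_time_per_speaker(records)
print(totals)
```

In the notebook, the same function could be applied to `[seg.metadata for seg in flattened_segments]`.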

In [9]:
# Transcribe the segments using the OpenAI Whisper ASR models.
asr_whisper_tiny_model = HFModel(path_or_uri="openai/whisper-tiny")
asr_whisper_small_model = HFModel(path_or_uri="openai/whisper-small")

asr_whisper_tiny_out = transcribe_audios(audios=flattened_segments, model=asr_whisper_tiny_model)
asr_whisper_small_out = transcribe_audios(audios=flattened_segments, model=asr_whisper_small_model)

for seg, tiny, small in zip(flattened_segments, asr_whisper_tiny_out, asr_whisper_small_out):
    seg.metadata["transcripts"] = {"openai/whisper-tiny": tiny.text, "openai/whisper-small": small.text}

Device set to use cpu
2025-11-11 22:08:40,735 - senselab - INFO - Time taken to initialize the hugging face ASR pipeline: 1.03 seconds
Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure Whispe
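
Since each segment now holds two transcripts, a quick way to spot segments where the models disagree is to compare them word by word. The sketch below uses `difflib` from the standard library after lowercasing and stripping punctuation. The helper names `normalize` and `word_agreement` are hypothetical, and the score is a rough similarity ratio, not a word error rate.

```python
import difflib
import string
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, strip ASCII punctuation and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_agreement(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two transcripts."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


tiny = "I can't believe you did that without even asking me first."
small = "I can't believe you did that without even asking me first!"
score = word_agreement(tiny, small)
print(score)
```

The two example strings are the transcripts of the first segment. They differ only in the final punctuation mark, so they agree fully after normalization.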

In [10]:
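# Extract features from the audio segments: OpenSMILE, Praat/Parselmouth and SQUIM are enabled,
# while the plain torchaudio features are disabled.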
feat_out = extract_features_from_audios(
    audios=flattened_segments, opensmile=True, parselmouth=True, torchaudio=False, torchaudio_squim=True
)
for seg, feats in zip(flattened_segments, feat_out):
    seg.metadata["features"] = feats

In [11]:
# Extract speaker embeddings from the audio segments
spk_ecapatdnn_model = SpeechBrainModel(path_or_uri="speechbrain/spkrec-ecapa-voxceleb")
spk_resnet_model = SpeechBrainModel(path_or_uri="speechbrain/spkrec-resnet-voxceleb")

spk_ecapatdnn_out = extract_speaker_embeddings_from_audios(audios=flattened_segments, model=spk_ecapatdnn_model)
spk_resnet_out = extract_speaker_embeddings_from_audios(audios=flattened_segments, model=spk_resnet_model)

for seg, emb_ecapa, emb_resnet in zip(flattened_segments, spk_ecapatdnn_out, spk_resnet_out):
    seg.metadata["speaker_embeddings"] = {
        "speechbrain/spkrec-ecapa-voxceleb": emb_ecapa.tolist(),
        "speechbrain/spkrec-resnet-voxceleb": emb_resnet.tolist(),
    }
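
A common way to use fixed-length speaker embeddings is to compare them with cosine similarity, for example to check whether two segments labelled with the same speaker really sound alike. The sketch below works on plain Python lists, as stored in the metadata via `.tolist()`. The helper name `cosine_similarity` and the toy vectors are assumptions for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equally long, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors: the first pair points in the same direction, the second is orthogonal.
same_direction = cosine_similarity([1.0, 2.0, 2.0], [2.0, 4.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Embeddings from different models (here ECAPA and ResNet) live in different spaces, so only compare vectors produced by the same model.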



In [12]:
# Extract self-supervised learning (SSL) embeddings from the audio segments
wavlm_model = HFModel(path_or_uri="microsoft/wavlm-base")

wavlm_ssl_out = extract_ssl_embeddings_from_audios(audios=flattened_segments, model=wavlm_model)

mean_pooled_wavlm_ssl_out = [torch.mean(tensor, dim=1) for tensor in wavlm_ssl_out]
for seg, emb in zip(flattened_segments, mean_pooled_wavlm_ssl_out):
    seg.metadata["ssl_embeddings"] = {
        "microsoft/wavlm-base": emb.squeeze().tolist(),
    }
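
The mean pooling above turns a variable-length sequence of frame embeddings into one vector whose size does not depend on the segment duration. As a simplified illustration in plain Python, assume the frames form a `[time][dim]` matrix. The real tensors returned by the model may carry extra dimensions, and the helper name `mean_pool` is hypothetical.

```python
from typing import List


def mean_pool(frames: List[List[float]]) -> List[float]:
    """Average a [time][dim] matrix over the time axis."""
    if not frames:
        raise ValueError("expected at least one frame")
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must have the same dimensionality")
    n_frames = len(frames)
    # zip(*frames) groups the values of each embedding dimension across time.
    return [sum(column) / n_frames for column in zip(*frames)]


# Three frames of a 2-dimensional embedding collapse into a single 2-d vector.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(frames)
print(pooled)
```

Mean pooling discards temporal order. It is a simple baseline, and other pooling strategies may suit some analyses better.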

In [13]:
def build_json_from_segment(seg: Audio) -> dict:
    """Build a JSON-serializable dictionary from a segment's metadata.

    Args:
        seg: The audio segment whose metadata is exported.

    Returns:
        A dictionary with the segment's source file, timing, speaker,
        transcripts, features and embeddings.
    """
    md = getattr(seg, "metadata", {}) or {}

    # transcripts
    transcripts = md.get("transcripts")

    # speaker embeddings
    speaker_embeddings = md.get("speaker_embeddings")

    # self-supervised embeddings
    ssl_embeddings = md.get("ssl_embeddings")

    # features
    features = md.get("features")

    # times
    segment_start = md.get("segment_start")
    segment_end = md.get("segment_end")

    # sampling rate
    segment_sampling_rate = md.get("sampling_rate")

    return {
        "original_file": md.get("original_file"),
        "sampling_rate": segment_sampling_rate,
        "speaker_id": md.get("speaker_id"),
        "start": segment_start,
        "end": segment_end,
        "transcripts": transcripts,
        "features": features,
        "speaker_embeddings": speaker_embeddings,
        "ssl_embeddings": ssl_embeddings,
    }


# Build the list of JSON objects (one per segment)
output_json = [build_json_from_segment(seg) for seg in flattened_segments]

# Optional: pretty-print first item
if output_json:
    print(json.dumps(output_json[0], indent=2))

{
  "original_file": "/Users/fabiocat/git/senselab/tutorials/audio/tutorial_audio_files/english_conversation_higgs_audio_v2.wav",
  "sampling_rate": 16000,
  "speaker_id": "speaker_0",
  "start": 0.0,
  "end": 3.28,
  "transcripts": {
    "openai/whisper-tiny": "I can't believe you did that without even asking me first.",
    "openai/whisper-small": "I can't believe you did that without even asking me first!"
  },
  "features": {
    "opensmile": {
      "F0semitoneFrom27.5Hz_sma3nz_amean": 42.10844039916992,
      "F0semitoneFrom27.5Hz_sma3nz_stddevNorm": 0.08829638361930847,
      "F0semitoneFrom27.5Hz_sma3nz_percentile20.0": 38.28285217285156,
      "F0semitoneFrom27.5Hz_sma3nz_percentile50.0": 42.84529495239258,
      "F0semitoneFrom27.5Hz_sma3nz_percentile80.0": 44.90517807006836,
      "F0semitoneFrom27.5Hz_sma3nz_pctlrange0-2": 6.622325897216797,
      "F0semitoneFrom27.5Hz_sma3nz_meanRisingSlope": 60.14593505859375,
      "F0semitoneFrom27.5Hz_sma3nz_stddevRisingSlope": 34.8432

Pretty cool, isn't it? Even better, we wrapped this entire pipeline in a single function called `explore_conversation`. All you have to do is call it and specify the models, tools and configs you want to use for speaker diarization, transcription, feature extraction, speaker embeddings and SSL embeddings.

In [14]:
from senselab.audio.workflows.explore_conversation import explore_conversation

explore_conversation(
    audio_file_paths=file_paths,
    speaker_diarization_model=diar_model,
    transcription_models=[asr_whisper_tiny_model],
    features_config={"opensmile": True, "parselmouth": True, "torchaudio": False, "torchaudio_squim": True},
    speaker_embeddings_models=[spk_ecapatdnn_model, spk_resnet_model],
    ssl_embeddings_models=[wavlm_model],
)

2025-11-11 22:09:29,824 - senselab - INFO - Time taken to initialize the hugging face ASR pipeline: 0.00 seconds
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.

[[{'original_file': '/Users/fabiocat/git/senselab/tutorials/audio/tutorial_audio_files/english_conversation_higgs_audio_v2.wav',
   'sampling_rate': 16000,
   'speaker_id': 'speaker_0',
   'start': 0.0,
   'end': 3.28,
   'transcripts': {'openai/whisper-tiny': "I can't believe you did that without even asking me first."},
   'features': {'opensmile': {'F0semitoneFrom27.5Hz_sma3nz_amean': 42.10844039916992,
     'F0semitoneFrom27.5Hz_sma3nz_stddevNorm': 0.08829638361930847,
     'F0semitoneFrom27.5Hz_sma3nz_percentile20.0': 38.28285217285156,
     'F0semitoneFrom27.5Hz_sma3nz_percentile50.0': 42.84529495239258,
     'F0semitoneFrom27.5Hz_sma3nz_percentile80.0': 44.90517807006836,
     'F0semitoneFrom27.5Hz_sma3nz_pctlrange0-2': 6.622325897216797,
     'F0semitoneFrom27.5Hz_sma3nz_meanRisingSlope': 60.14593505859375,
     'F0semitoneFrom27.5Hz_sma3nz_stddevRisingSlope': 34.84321594238281,
     'F0semitoneFrom27.5Hz_sma3nz_meanFallingSlope': 35.41069412231445,
     'F0semitoneFrom27.5Hz_s
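
The printed result is a nested list: one inner list per input file, each holding one dictionary per speaker turn. To keep it for later analysis, one option is to flatten it and write one JSON object per line (JSON Lines). The sketch below uses only the standard library. The helper names `write_jsonl` and `read_jsonl`, the file name and the second example record are made up, and the code assumes every value is JSON-serializable, as the plain floats and strings in the output above suggest.

```python
import json
import os
import tempfile
from typing import List


def write_jsonl(nested_records: List[List[dict]], path: str) -> int:
    """Write one JSON object per line and return the number of records."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for per_file in nested_records:  # one inner list per audio file
            for record in per_file:  # one record per speaker turn
                fh.write(json.dumps(record) + "\n")
                count += 1
    return count


def read_jsonl(path: str) -> List[dict]:
    """Read the records back as a flat list of dictionaries."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Made-up stand-in for the nested result (one file, two speaker turns).
results = [[
    {"speaker_id": "speaker_0", "start": 0.0, "end": 3.28},
    {"speaker_id": "speaker_1", "start": 3.5, "end": 6.0},
]]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "conversation.jsonl")
    n_written = write_jsonl(results, out_path)
    loaded = read_jsonl(out_path)

print(n_written, loaded[0]["speaker_id"])
```

JSON Lines lets you stream large results one record at a time instead of loading one large JSON array into memory.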