# CallHome Preprocessing: Step-by-Step

This notebook demonstrates a step-by-step pipeline for processing the CallHome English dataset. The pipeline is divided into several cells so that you can inspect outputs after each stage:

1. **Dataset Loading & Inspection:** Load the dataset and inspect its structure.
2. **Extracting Overlapping Speech:** Define and test a function that detects overlapping speech from the annotated timestamps and splits overlaps longer than 10 seconds.
3. **Simulating Overlapping Speech:** Define and test a function that creates simulated overlapping segments by pairing segments from different speakers.
4. **Processing the Dataset:** Apply both functions to every sample in the dataset.
5. **Saving Processed Data:** Save the preprocessed segments to a file.

In [1]:
# Step 1: Load the dataset and inspect a few samples
from datasets import load_dataset

# Load the CallHome English dataset (each sample is expected to carry parallel
# 'timestamps_start', 'timestamps_end', and 'speakers' lists)
dataset = load_dataset("talkbank/callhome", "eng", split="data")

print(f"Total number of samples in dataset: {len(dataset)}")

# Inspect the first sample to see its structure
first_sample = dataset[0]
print(first_sample)

# Print all keys in the first sample
print("First sample keys:", list(first_sample.keys()))

# Check for the expected annotation features
expected_features = ["timestamps_start", "timestamps_end", "speakers"]
present_features = [key for key in expected_features if key in first_sample]
if present_features:
    print("Annotation features found:", present_features)
else:
    print("No expected annotation features found in the first sample.")


Total number of samples in dataset: 140
{'audio': {'path': None, 'array': array([-0.00692749,  0.0284729 ,  0.04934692, ..., -0.0005188 ,
       -0.00088501, -0.00076294]), 'sampling_rate': 16000}, 'timestamps_start': [0.0, 6.070000000000164, 7.710000000000036, 10.259999999999991, 11.070000000000164, 12.710000000000036, 15.860000000000127, 16.600000000000136, 18.5300000000002, 23.920000000000073, 25.6400000000001, 27.360000000000127, 28.5, 30.470000000000027, 31.710000000000036, 33.340000000000146, 34.0, 37.97000000000003, 41.50999999999999, 42.2800000000002, 42.90000000000009, 43.61000000000013, 43.66000000000008, 45.090000000000146, 46.37000000000012, 46.600000000000136, 50.190000000000055, 53.66000000000008, 55.61000000000013, 60.7800000000002, 62.29000000000019, 62.5, 64.3900000000001, 66.43000000000006, 68.30000000000018, 69.79000000000019, 71.31000000000017, 75.45000000000005, 80.69000000000005, 84.83000000000015, 89.43000000000006, 90.2800000000002, 93.54000000000019, 97.5200000

## Step 2: Define Overlapping Speech Extraction Function

This function builds one annotation per entry in the sample's `timestamps_start`, `timestamps_end`, and `speakers` lists, then compares every pair of annotations to find overlapping speech. Any overlap longer than `max_segment_length` (10 seconds by default) is split into equal-length subsegments.
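The overlap test reduces to interval arithmetic: two annotations overlap when the shared region, running from the larger start to the smaller end, has positive length. The sketch below illustrates that rule and the splitting of long overlaps on plain `(start, end)` tuples. The helper names `overlap_interval` and `split_interval` are hypothetical and exist only for illustration; they are not part of the notebook.

```python
import math

def overlap_interval(a, b):
    """Return the (start, end) shared by two (start, end) intervals, or None."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return (start, end) if end > start else None

def split_interval(start, end, max_length=10.0):
    """Split [start, end] into equal pieces, each no longer than max_length."""
    duration = end - start
    pieces = max(1, math.ceil(duration / max_length))
    step = duration / pieces
    return [(start + k * step, start + (k + 1) * step) for k in range(pieces)]

# Toy intervals: the shared region is (5.0, 30.0), which is 25 s long
shared = overlap_interval((0.0, 30.0), (5.0, 40.0))
# 25 s exceeds the 10 s limit, so it is split into ceil(25 / 10) = 3 pieces
parts = split_interval(*shared)
print(shared, len(parts))
```

Splitting into `ceil(duration / max_length)` equal pieces keeps every piece under the limit without leaving a short remainder at the end.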

In [2]:
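import math

def extract_overlapping_segments(sample, sample_index, max_segment_length=10.0):
    """Return the overlapping regions between annotation pairs in one sample.

    Overlaps longer than max_segment_length seconds are split into equal parts.
    """
    overlapping_segments = []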
    if not all(key in sample for key in ['timestamps_start', 'timestamps_end', 'speakers']):
        return overlapping_segments

    # Create a list of annotation dictionaries from the parallel lists
    annotations = []
    for start, end, speaker in zip(sample['timestamps_start'], sample['timestamps_end'], sample['speakers']):
        annotations.append({
            "timestamp_start": start,
            "timestamp_end": end,
            "speaker": speaker
        })

    # Sort annotations by start time
    annotations = sorted(annotations, key=lambda ann: ann["timestamp_start"])
    n = len(annotations)

    # Iterate over each pair to detect overlapping segments
    for i in range(n):
        for j in range(i + 1, n):
            ann1 = annotations[i]
            ann2 = annotations[j]
            if ann2["timestamp_start"] < ann1["timestamp_end"]:
                overlap_start = max(ann1["timestamp_start"], ann2["timestamp_start"])
                overlap_end = min(ann1["timestamp_end"], ann2["timestamp_end"])
                duration = overlap_end - overlap_start
                if duration > 0:
                    if duration > max_segment_length:
                        num_subsegments = math.ceil(duration / max_segment_length)
                        subsegment_duration = duration / num_subsegments
                        for k in range(num_subsegments):
                            sub_start = overlap_start + k * subsegment_duration
                            sub_end = sub_start + subsegment_duration
                            overlapping_segments.append({
                                "timestamp_start": sub_start,
                                "timestamp_end": sub_end,
                                "speakers": [ann1["speaker"], ann2["speaker"]],
                                "source": sample_index,  # store sample index here
                                "type": "extracted"
                            })
                    else:
                        overlapping_segments.append({
                            "timestamp_start": overlap_start,
                            "timestamp_end": overlap_end,
                            "speakers": [ann1["speaker"], ann2["speaker"]],
                            "source": sample_index,  # store sample index here
                            "type": "extracted"
                        })
    return overlapping_segments


In [3]:
extracted_test = extract_overlapping_segments(first_sample, 0)
print(f"Extracted overlapping segments in first sample: {len(extracted_test)}")
print(extracted_test)

Extracted overlapping segments in first sample: 27
[{'timestamp_start': 6.070000000000164, 'timestamp_end': 6.410000000000082, 'speakers': ['A', 'B'], 'source': 0, 'type': 'extracted'}, {'timestamp_start': 11.070000000000164, 'timestamp_end': 11.150000000000091, 'speakers': ['B', 'A'], 'source': 0, 'type': 'extracted'}, {'timestamp_start': 43.66000000000008, 'timestamp_end': 44.02000000000021, 'speakers': ['A', 'B'], 'source': 0, 'type': 'extracted'}, {'timestamp_start': 46.600000000000136, 'timestamp_end': 47.930000000000064, 'speakers': ['A', 'B'], 'source': 0, 'type': 'extracted'}, {'timestamp_start': 50.190000000000055, 'timestamp_end': 50.440000000000055, 'speakers': ['B', 'A'], 'source': 0, 'type': 'extracted'}, {'timestamp_start': 62.29000000000019, 'timestamp_end': 62.450000000000045, 'speakers': ['A', 'B'], 'source': 0, 'type': 'extracted'}, {'timestamp_start': 62.5, 'timestamp_end': 63.54000000000019, 'speakers': ['B', 'A'], 'source': 0, 'type': 'extracted'}, {'timestamp_star

## Step 3: Define Simulated Overlapping Speech Function

This function simulates overlapping speech by pairing segments from different speakers. For each pair of speakers it picks one random segment per speaker and records a simulated overlap whose duration is drawn uniformly between 5 and 10 seconds, with the timestamp reset to start at 0 and the speaker order preserved. It produces segment metadata only; no audio is mixed at this stage.

In [4]:
import random
from collections import defaultdict

def simulate_overlapping_speech(sample, segment_length_range=(5, 10)):
    """
    Simulate artificial overlapping speech by merging segments from different speakers.
    The simulated segment will have a duration randomly chosen between the values in segment_length_range.
    
    Adjusts the annotations so that the simulated audio is considered to start at 0 while preserving speaker order.
    
    Assumes that the sample has the following keys:
      - 'timestamps_start': a list of start times
      - 'timestamps_end': a list of end times
      - 'speakers': a list of speaker labels corresponding to each segment
    """
    simulated_segments = []
    # Check that all required keys exist
    if not all(key in sample for key in ['timestamps_start', 'timestamps_end', 'speakers']):
        return simulated_segments

    # Build annotation dictionaries from the parallel lists
    annotations = []
    for start, end, speaker in zip(sample['timestamps_start'], sample['timestamps_end'], sample['speakers']):
        annotations.append({
            "timestamp_start": start,
            "timestamp_end": end,
            "speaker": speaker
        })

    # Group annotations by speaker
    segments_by_speaker = defaultdict(list)
    for ann in annotations:
        segments_by_speaker[ann["speaker"]].append(ann)

    speakers = list(segments_by_speaker.keys())
    # Need at least two speakers for simulating overlapping speech
    if len(speakers) < 2:
        return simulated_segments

    # Arbitrarily pair segments from different speakers to simulate overlap
    for i in range(len(speakers)):
        for j in range(i + 1, len(speakers)):
            seg1 = random.choice(segments_by_speaker[speakers[i]])
            seg2 = random.choice(segments_by_speaker[speakers[j]])
            # Choose a random duration between the given range
            duration = random.uniform(segment_length_range[0], segment_length_range[1])
            simulated_segments.append({
                "timestamp_start": 0.0,  # Reset start time for simulated segment
                "timestamp_end": duration,
                "speakers": [speakers[i], speakers[j]],
                "source_segments": [seg1, seg2],
                "type": "simulated"
            })
    return simulated_segments

# Example test on a sample:
# Assuming `first_sample` is already defined (e.g., from your dataset)
simulated_test = simulate_overlapping_speech(first_sample, segment_length_range=(5, 10))
print(f"Simulated overlapping segments in first sample: {len(simulated_test)}")
print(simulated_test)


Simulated overlapping segments in first sample: 1
[{'timestamp_start': 0.0, 'timestamp_end': 9.362687820780735, 'speakers': ['A', 'B'], 'source_segments': [{'timestamp_start': 223.17000000000007, 'timestamp_end': 228.6300000000001, 'speaker': 'A'}, {'timestamp_start': 211.98000000000002, 'timestamp_end': 212.97000000000003, 'speaker': 'B'}], 'type': 'simulated'}]
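`simulate_overlapping_speech` draws from Python's global `random` state, so the chosen source segments and the duration change from run to run. If reproducibility matters, one option is to draw from a seeded generator instead. The sketch below shows the idea on a toy pairing function; `pick_pair`, its `seed` parameter, and the toy segment lists are hypothetical stand-ins, not part of the notebook.

```python
import random

def pick_pair(segments_a, segments_b, length_range=(5, 10), seed=None):
    """Pick one segment per speaker and a random duration, optionally seeded."""
    rng = random.Random(seed)  # private generator; the global state is untouched
    seg_a = rng.choice(segments_a)
    seg_b = rng.choice(segments_b)
    duration = rng.uniform(*length_range)
    return seg_a, seg_b, duration

# Toy (start, end) segments for two speakers
speaker_a = [(0.0, 2.0), (4.0, 6.0), (8.0, 10.0)]
speaker_b = [(2.0, 4.0), (6.0, 8.0)]

first = pick_pair(speaker_a, speaker_b, seed=0)
second = pick_pair(speaker_a, speaker_b, seed=0)
print(first == second)  # the same seed repeats the same draw
```

A simpler alternative is to call `random.seed(...)` once before the simulation cell, at the cost of affecting every other use of the global generator.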


## Step 4: Process the Entire Dataset

Now, we apply the extraction and simulation functions to every sample in the dataset. You can inspect the total number of extracted and simulated segments.

In [5]:
all_extracted = []
all_simulated = []

# Process each sample in the dataset
for i, sample in enumerate(dataset):
    # Extract overlapping segments, passing the sample index for a valid "source" identifier
    extracted = extract_overlapping_segments(sample, sample_index=i, max_segment_length=10.0)
    # Simulate overlapping speech segments
    simulated = simulate_overlapping_speech(sample, segment_length_range=(5, 10))
    
    all_extracted.extend(extracted)
    all_simulated.extend(simulated)

print(f"Total extracted overlapping segments: {len(all_extracted)}")
print(f"Total simulated overlapping segments: {len(all_simulated)}")


Total extracted overlapping segments: 13358
Total simulated overlapping segments: 170
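The extracted segments carry only timestamps, so recovering the corresponding audio means converting seconds into sample indices. A minimal sketch follows, assuming the 16 kHz `sampling_rate` shown in the first sample's output and assuming timestamps are measured from the start of the audio array. `segment_to_slice` is a hypothetical helper, and a plain list stands in for `sample['audio']['array']`.

```python
def segment_to_slice(segment, sampling_rate=16000):
    """Convert a segment's timestamps (seconds) into a slice of sample indices."""
    start_idx = int(round(segment["timestamp_start"] * sampling_rate))
    end_idx = int(round(segment["timestamp_end"] * sampling_rate))
    return slice(start_idx, end_idx)

# Toy stand-in for the audio array: 2 seconds of silence at 16 kHz
audio = [0.0] * 32000
segment = {"timestamp_start": 0.5, "timestamp_end": 1.25}

clip = audio[segment_to_slice(segment)]
print(len(clip))  # 12000 samples, i.e. 0.75 s at 16 kHz
```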


In [None]:
all_simulated

## Step 5: Save the Preprocessed Data

Finally, we store the preprocessed overlapping segments (both extracted and simulated) in a pickle file so that you can use them later.

In [30]:
import pickle

preprocessed_data = {
    "extracted_segments": all_extracted,
    "simulated_segments": all_simulated
}

with open("preprocessed_callhome_data.pkl", "wb") as f:
    pickle.dump(preprocessed_data, f)

print("Preprocessed data saved to 'preprocessed_callhome_data.pkl'")

Preprocessed data saved to 'preprocessed_callhome_data.pkl'
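To reuse the saved segments in a later session, the file can be read back with `pickle.load`. The sketch below shows the round trip on toy data written to a temporary path, so it does not touch the real output file; `toy_data` and the file name `toy_callhome_data.pkl` are illustrative stand-ins.

```python
import os
import pickle
import tempfile

# Toy data with the same top-level keys as the saved dictionary
toy_data = {
    "extracted_segments": [
        {"timestamp_start": 0.0, "timestamp_end": 1.0, "type": "extracted"}
    ],
    "simulated_segments": [],
}

# Write to a temporary path so the real output file is not overwritten
path = os.path.join(tempfile.mkdtemp(), "toy_callhome_data.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_data, f)

# Read the dictionary back; files must be opened in binary mode
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded.keys() == toy_data.keys(), len(loaded["extracted_segments"]))
```

Unpickling can execute arbitrary code, so only load pickle files from sources you trust.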
