# Prepare the Dataset for Fine-Tuning Whisper and Wav2Vec2

The raw data includes files of two types: 

1. `.wav` audio of speech by one or more speakers
2. `.eaf` (ELAN Annotation Format) transcriptions of speech. [ELAN](https://archive.mpi.nl/tla/elan) stands for EUDICO Linguistic Annotator, where [EUDICO](https://www.mpi.nl/world/tg/lapp/eudico/eudico.html) is the European Distributed Corpora Project.

## Audio Files

The audio files can be used as they are, provided that they use the sample rate expected by the model. I'll check that and resample if needed.

## Transcription Files

The transcription files require some preprocessing. The first step is to manually rename them to match the names of the corresponding `wav` files. 

The next step is to identify the correct "tier" to use. The `eaf` format is a flavor of XML. Fortunately, there is a Python library for reading and parsing ELAN files: [pympi](https://github.com/dopefishh/pympi). It will be used here to extract the information needed for fine-tuning.
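
Because an `.eaf` file is plain XML, the tier names can also be listed with the standard library alone, which is handy for a quick look at a file. The sketch below is only an illustration and not part of the original workflow: the function name `list_tier_ids` and the sample document are made up, and it assumes that each tier is stored as a `TIER` element carrying a `TIER_ID` attribute.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up EAF-like document used only for illustration.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="transcript_XX" LINGUISTIC_TYPE_REF="default-lt"/>
  <TIER TIER_ID="translation_XX" LINGUISTIC_TYPE_REF="default-lt"/>
</ANNOTATION_DOCUMENT>
"""


def list_tier_ids(eaf_xml: str) -> list[str]:
    """Return the TIER_ID of every TIER element in an EAF XML string."""
    root = ET.fromstring(eaf_xml)
    return [tier.get("TIER_ID") for tier in root.iter("TIER")]


print(list_tier_ids(SAMPLE_EAF))
```

For real files, `pympi` remains the more convenient route, since it also resolves time slots and referring annotations.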

## Goal

The goal is to produce a dataset in the [JSONL](https://jsonlines.org/) format:

```json
{"audio": "path/to/file1.wav", "text": "transcription1 text"}
{"audio": "path/to/file2.wav", "text": "transcription2 text"}
…
```
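
To make the format concrete, here is a minimal sketch of writing and reading JSONL with the standard library. The helper names `write_jsonl` and `read_jsonl` are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import io
import json


def write_jsonl(entries, handle):
    """Write one JSON object per line."""
    for entry in entries:
        handle.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_jsonl(handle):
    """Parse each non-empty line as its own JSON object."""
    return [json.loads(line) for line in handle if line.strip()]


entries = [
    {"audio": "path/to/file1.wav", "text": "transcription1 text"},
    {"audio": "path/to/file2.wav", "text": "transcription2 text"},
]

buffer = io.StringIO()  # stands in for a file on disk
write_jsonl(entries, buffer)
buffer.seek(0)
round_tripped = read_jsonl(buffer)
print(round_tripped == entries)
```

Passing `ensure_ascii=False` keeps accented characters readable in the output file instead of escaping them.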

## Load the Libraries

The following libraries are required for reading, parsing, formatting, and writing the dataset:

- `json`: For writing the dataset to JSON format
- `os`: For interacting with the computer's operating system
- `pydub`: For working with audio files (<https://github.com/jiaaro/pydub>)
- `pympi`: For reading and parsing ELAN files (<https://github.com/dopefishh/pympi>)
- `shutil`: For high-level file operations like copying and removing files
- `subprocess`: For running external commands (here, `ffmpeg`)
- `wave`: For reading and writing WAV files

In [1]:
import os
import pympi
from pydub import AudioSegment
import json
import shutil
import subprocess
import wave

Note that this notebook also makes use of a command-line utility known as `ffmpeg` for manipulating sound files. It is available at <https://ffmpeg.org/>.
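
Since the resampling step depends on `ffmpeg` being installed, it may be worth checking for it up front. The following is a small sketch using `shutil.which`; the helper name `find_tool` is hypothetical.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


ffmpeg_path = find_tool("ffmpeg")
if ffmpeg_path is None:
    print("ffmpeg not found on PATH; install it before running the resampling cell.")
else:
    print(f"Using ffmpeg at: {ffmpeg_path}")
```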

## Make Sure the WAV Files Use the Same Sampling Rate

Whisper expects the audio files to use a sampling rate of 16,000 Hz. In this section, I'll inspect the sampling rates of the files and, if necessary, move the originals to a backup directory and write resampled versions in their place.

### Check the Sample Rate

In [2]:

data_dir = "../data/have_transcripts"

for file in sorted(os.listdir(data_dir)):
    if file.endswith(".wav"):
        path = os.path.join(data_dir, file)
        with wave.open(path, 'rb') as w:
            rate = w.getframerate()
            print(f"{file}: {rate} Hz")

2019-08-16-MR-narracion.wav: 96000 Hz
2019-08-18-MRR-narracion.wav: 96000 Hz
2019-08-24-ER-narracion.wav: 96000 Hz
2019-08-24-LF-narracion.wav: 96000 Hz
2019-08-24-LM-narracion.wav: 96000 Hz
2019-08-24-MM-narracion.wav: 96000 Hz
2019-08-24-TF-narracion.wav: 96000 Hz
2019-09-11-SSA-narracion-part1.wav: 96000 Hz
2019-09-11-SSA-narracion-part2.wav: 96000 Hz
2019-09-11-SSA-narracion-part3.wav: 96000 Hz
2019-09-13-CA-narracio-part1.wav: 96000 Hz
2019-09-13-CA-narracio-part2.wav: 96000 Hz
2019-09-24-PA-narracion.wav: 96000 Hz
2019-10-03-TR-BT.wav: 48000 Hz
2019-10-03-TR-HM.wav: 48000 Hz
2019-10-3-TR-AR.wav: 48000 Hz
2019-10-3-TR-OM.wav: 48000 Hz
FF_Casanillo_01.wav: 48000 Hz
FF_Casanillo_02.wav: 48000 Hz
FF_Casanillo_03.wav: 48000 Hz


None of the existing `wav` files uses the expected sampling rate. I'll move the originals to a backup directory, resample them using `ffmpeg`, and save the resampled versions in the original location.

In [None]:
# Set the directory containing the original .wav files
data_dir = "../data/have_transcripts"

# === Make backup directory ===
backup_dir = os.path.join(data_dir, "backup_originals")
os.makedirs(backup_dir, exist_ok=True)

# === Process all .wav files ===
for filename in sorted(os.listdir(data_dir)):
    if not filename.lower().endswith(".wav"):
        continue

    original_path = os.path.join(data_dir, filename)
    backup_path = os.path.join(backup_dir, filename)

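    # Step 1: Move original to backup (skip files that were already backed up,
    # so re-running this cell never overwrites an original with a resampled copy)
    if os.path.exists(backup_path):
        print(f"Backup already exists, skipping: {filename}")
        continue
    shutil.move(original_path, backup_path)
    print(f"Moved to backup: {filename}")

    # Step 2: Resample and write to original location
    result = subprocess.run([
        "ffmpeg", "-y",
        "-i", backup_path,
        "-ar", "16000",
        "-ac", "1",
        original_path
    ], stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True)

    # Report failures instead of silently claiming success
    if result.returncode != 0:
        print(f"ffmpeg failed for {filename}: {result.stderr[-200:]}")
        continue

    print(f"Resampled to 16kHz mono: {filename}")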

print("\nAll .wav files backed up and resampled.")
print(f"Original files are in: {backup_dir}")


Moved to backup: 2019-08-16-MR-narracion.wav
Resampled to 16kHz mono: 2019-08-16-MR-narracion.wav
Moved to backup: 2019-08-18-MRR-narracion.wav
Resampled to 16kHz mono: 2019-08-18-MRR-narracion.wav
Moved to backup: 2019-08-24-ER-narracion.wav
Resampled to 16kHz mono: 2019-08-24-ER-narracion.wav
Moved to backup: 2019-08-24-LF-narracion.wav
Resampled to 16kHz mono: 2019-08-24-LF-narracion.wav
Moved to backup: 2019-08-24-LM-narracion.wav
Resampled to 16kHz mono: 2019-08-24-LM-narracion.wav
Moved to backup: 2019-08-24-MM-narracion.wav
Resampled to 16kHz mono: 2019-08-24-MM-narracion.wav
Moved to backup: 2019-08-24-TF-narracion.wav
Resampled to 16kHz mono: 2019-08-24-TF-narracion.wav
Moved to backup: 2019-09-11-SSA-narracion-part1.wav
Resampled to 16kHz mono: 2019-09-11-SSA-narracion-part1.wav
Moved to backup: 2019-09-11-SSA-narracion-part2.wav
Resampled to 16kHz mono: 2019-09-11-SSA-narracion-part2.wav
Moved to backup: 2019-09-11-SSA-narracion-part3.wav
Resampled to 16kHz mono: 2019-09-11-
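
After resampling, it can be reassuring to confirm that every file really is 16 kHz mono. The sketch below is one way to do that with the `wave` module; the function name `find_nonconforming` is hypothetical, and the demo runs on two throwaway silent files rather than the real data.

```python
import os
import tempfile
import wave


def find_nonconforming(wav_dir, rate=16000, channels=1):
    """Return (filename, rate, channels) for each .wav that misses the target."""
    problems = []
    for name in sorted(os.listdir(wav_dir)):
        if not name.lower().endswith(".wav"):
            continue
        with wave.open(os.path.join(wav_dir, name), "rb") as w:
            found = (w.getframerate(), w.getnchannels())
        if found != (rate, channels):
            problems.append((name, *found))
    return problems


# Demo on a throwaway directory holding two short silent files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name, demo_rate in [("ok.wav", 16000), ("too_fast.wav", 48000)]:
        with wave.open(os.path.join(demo_dir, name), "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)  # 16-bit samples
            w.setframerate(demo_rate)
            w.writeframes(b"\x00\x00" * 100)
    demo_problems = find_nonconforming(demo_dir)
    print(demo_problems)
```

On the real data, calling `find_nonconforming(data_dir)` should return an empty list once the resampling has succeeded.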

## Identify the Correct Tier(s) of the ELAN Files for Transcription Data

ELAN files have tiers for keeping track of different types of information (e.g., speaker, transcription, translation) related to an audio file. All we care about here is the transcription, so we need to extract just the transcription tier as plain text. To do that, we first need to identify what the tiers are. The following cell does that.

In [4]:
# Path to the directory containing the .eaf files
eaf_dir = "../data/have_transcripts"

# Find all .eaf files in the directory
eaf_files = [f for f in os.listdir(eaf_dir) if f.endswith(".eaf")]

# Loop through each .eaf file and print the tier names and their annotations
for filename in sorted(eaf_files):
    eaf_path = os.path.join(eaf_dir, filename)
    print(f"\nFile: {filename}")

    try:
        eaf = pympi.Elan.Eaf(eaf_path)
    except Exception as e:
        print(f"Error reading {filename}: {e}")
        continue

    # Get the tier names from the EAF file
    tier_names = eaf.get_tier_names()
    if not tier_names:
        print("No tiers found.")
        continue

    print("Available Tiers:")
    for tier in tier_names:
        annotations = eaf.get_annotation_data_for_tier(tier)
        print(f"Tier: {tier} - {len(annotations)} annotations")
        # Print the first few annotations for each tier
        for annotation in annotations[:3]:
            if len(annotation) == 3:
                start, end, value = annotation
                print(f"       {start}-{end} ms: {value}")
            else:
                print(f"       (referring annotation): {annotation}")


File: 2019-08-16-MR-narracion.eaf
Available Tiers:
Tier: transcript_MR - 101 annotations
       9975-11425 ms: Manolo
       12800-14400 ms: Manolo Romero
       18800-22025 ms: aca vivo en Nuevo Union, Pozo Amarillo
Tier: translation_MR - 101 annotations
       (referring annotation): (9975, 11425, 'Manolo', 'Manolo')
       (referring annotation): (12800, 14400, 'Manolo Romero', 'Manolo Romero')
       (referring annotation): (18800, 22025, 'aca vivo en Nuevo Union, Pozo Amarillo', 'aca vivo en Nuevo Union, Pozo Amarillo')
Tier: transcript_RH - 9 annotations
       640-1540 ms: por favor
       2030-6570 ms: sek apvesai'a?
       14500-17275 ms: perfecto, y de donde es?

File: 2019-08-18-MRR-narracion.eaf
Available Tiers:
Tier: Transcript - 188 annotations
       100-680 ms: alhnakho
       20290-21020 ms: Miguel
       24010-24760 ms: Romero
Tier: Translation - 188 annotations
       (referring annotation): (100, 680, 'perfecto', 'alhnakho')
       (referring annotation):

After reviewing the output of that script, I can assemble a Python list of the tiers that contain the transcriptions. I'll loop over this and use it to identify the correct tier to extract from each `eaf` file.

In [5]:
TRANSCRIPTION_TIERS = ['transcript_MR',
                       'Transcript',
                       'transcript_ER',
                       'transcript_LF',
                       'transcript_LM',
                       'transcript_MM',
                       'transcript_TF',
                       'transcript_SSA',
                       'transcripcion_CA',
                       'transcript_PA',
                       'Transcripcion',
                       'AR_transcripcion',
                       'transcripcion',
                       'FF_Transcripcion',
                       'Transcripcion_FF'
                       ]
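
Because this list is assembled by hand, a quick sanity check can catch files whose transcription tier was missed or that match more than one entry. The sketch below is hypothetical (the helper names and the example tier inventory are made up). In the notebook, the inventory could be built from `eaf.get_tier_names()` for each file.

```python
def tiers_to_extract(tier_names, allowed):
    """Return the tiers of one file that appear in the allowed list."""
    allowed_set = set(allowed)
    return [name for name in tier_names if name in allowed_set]


def check_coverage(tiers_by_file, allowed):
    """Split files into those with no usable tier and those with several."""
    missing, multiple = [], []
    for filename, tier_names in sorted(tiers_by_file.items()):
        matches = tiers_to_extract(tier_names, allowed)
        if not matches:
            missing.append(filename)
        elif len(matches) > 1:
            multiple.append((filename, matches))
    return missing, multiple


# Made-up tier inventory for illustration only.
example_tiers = {
    "a.eaf": ["transcript_MR", "translation_MR", "transcript_RH"],
    "b.eaf": ["Notes"],
    "c.eaf": ["Transcript", "transcripcion"],
}
example_allowed = ["transcript_MR", "Transcript", "transcripcion"]

missing, multiple = check_coverage(example_tiers, example_allowed)
print("No transcription tier:", missing)
print("More than one transcription tier:", multiple)
```

A file reported under "more than one" is not necessarily an error, but it is worth a look, since every matching tier will contribute segments to the dataset.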

## Create the Dataset

I'll iterate over the `../data/have_transcripts` directory where all the `eaf` and `wav` files are located, identify the file pairs (I manually renamed them to be easily paired), use the `TRANSCRIPTION_TIERS` list to extract the correct tier and process it, then write the dataset in `JSONL` format.
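
The core of the processing step is cutting the audio at each annotation's start and end time, which ELAN stores in milliseconds. The notebook relies on `pydub` for this. The sketch below shows only the underlying arithmetic using the standard `wave` module: the helper names `ms_to_frame` and `slice_wav_bytes` are hypothetical, and the demo uses generated silence rather than real recordings.

```python
import io
import wave


def ms_to_frame(ms, frame_rate):
    """Convert a millisecond offset to a frame index (rounding down)."""
    return ms * frame_rate // 1000


def slice_wav_bytes(wav_bytes, start_ms, end_ms):
    """Return a new in-memory WAV holding only [start_ms, end_ms)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        # Clamp both ends so annotations past the end of the audio are safe.
        start = min(ms_to_frame(start_ms, params.framerate), params.nframes)
        end = min(ms_to_frame(end_ms, params.framerate), params.nframes)
        src.setpos(start)
        frames = src.readframes(max(end - start, 0))
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate)
        dst.writeframes(frames)
    return out.getvalue()


# Demo: one second of 16 kHz mono silence, sliced from 250 ms to 750 ms.
source = io.BytesIO()
with wave.open(source, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

segment_bytes = slice_wav_bytes(source.getvalue(), 250, 750)
with wave.open(io.BytesIO(segment_bytes), "rb") as seg:
    segment_frames = seg.getnframes()
print(segment_frames)  # 0.5 s * 16000 Hz = 8000 frames
```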

In [6]:
# === Configuration ===
data_dir = "../data/have_transcripts"  # Set your path here
output_audio_dir = os.path.join(data_dir, "segments")
output_jsonl_path = os.path.join(data_dir, "dataset.jsonl")

# === Prep output directory ===
os.makedirs(output_audio_dir, exist_ok=True)

# === Find all EAF + (resampled) WAV file pairs ===
eaf_files = [f for f in os.listdir(data_dir) if f.endswith(".eaf")]
wav_files = {os.path.splitext(f)[0]: os.path.join(data_dir, f)
             for f in os.listdir(data_dir) if f.endswith(".wav")}

# === Process each EAF file ===
# Initialize a list to hold the data entries
data_entries = []
# Initialize a counter for segment filenames
segment_counter = 0

# Loop through each EAF file
for eaf_file in sorted(eaf_files):
    base_name = os.path.splitext(eaf_file)[0]
    eaf_path = os.path.join(data_dir, eaf_file)

    # Try to find matching WAV file
    matching_wav = None
    for wav_base, wav_path in wav_files.items():
        if base_name.startswith(wav_base):
            matching_wav = wav_path
            break

    if not matching_wav:
        print(f"No matching .wav for {eaf_file}")
        continue

    print(f"\nProcessing {eaf_file} with audio {os.path.basename(matching_wav)}")

    # Load the EAF file and the corresponding audio file
    try:
        eaf = pympi.Elan.Eaf(eaf_path)
        audio = AudioSegment.from_wav(matching_wav)
    except Exception as e:
        print(f"Error loading file {eaf_file}: {e}")
        continue

    # Process each tier in the EAF file
    for tier in eaf.get_tier_names():
        if tier not in TRANSCRIPTION_TIERS:
            continue

        annotations = eaf.get_annotation_data_for_tier(tier)
        print(f"Using tier: {tier} ({len(annotations)} annotations)")
        # Extract annotations and save audio segments
        for start_ms, end_ms, value in annotations:
            value = value.strip()
            if not value:
                continue
            # Slice the audio by the annotation's start and end times (in ms)
            segment = audio[start_ms:end_ms]
            segment_filename = f"seg_{segment_counter:05d}.wav"
            segment_path = os.path.join(output_audio_dir, segment_filename)
            segment.export(segment_path, format="wav")
            
            # Add entry to the dataset
            data_entries.append({
                "audio": segment_path,
                "text": value
            })
            
            # Update the segment counter
            segment_counter += 1

# === Save dataset to JSONL ===
with open(output_jsonl_path, "w", encoding="utf-8") as f:
    for entry in data_entries:
        json.dump(entry, f, ensure_ascii=False)
        f.write("\n")

print(f"\nDone! Extracted {len(data_entries)} segments.")
print(f"Audio segments saved to: {output_audio_dir}")
print(f"Dataset saved as: {output_jsonl_path}")



Processing 2019-08-16-MR-narracion.eaf with audio 2019-08-16-MR-narracion.wav
Using tier: transcript_MR (101 annotations)

Processing 2019-08-18-MRR-narracion.eaf with audio 2019-08-18-MRR-narracion.wav
Using tier: Transcript (188 annotations)

Processing 2019-08-24-ER-narracion.eaf with audio 2019-08-24-ER-narracion.wav
Using tier: transcript_ER (180 annotations)

Processing 2019-08-24-LF-narracion.eaf with audio 2019-08-24-LF-narracion.wav
Using tier: transcript_LF (73 annotations)

Processing 2019-08-24-LM-narracion.eaf with audio 2019-08-24-LM-narracion.wav
Using tier: transcript_LM (106 annotations)

Processing 2019-08-24-MM-narracion.eaf with audio 2019-08-24-MM-narracion.wav
Using tier: transcript_MM (127 annotations)

Processing 2019-08-24-TF-narracion.eaf with audio 2019-08-24-TF-narracion.wav
Using tier: transcript_TF (140 annotations)

Processing 2019-09-11-SSA-narracion-part1.eaf with audio 2019-09-11-SSA-narracion-part1.wav
Using tier: transcript_SSA (1253 annotations)
Us