# Prepare the Dataset for Fine-Tuning Whisper and Wav2Vec2

The raw data includes files of two types: 

1. `.wav` audio of speech by one or more speakers
2. `.eaf` (ELAN Annotation Format) transcriptions of that speech. [ELAN](https://archive.mpi.nl/tla/elan) stands for EUDICO Linguistic Annotator, where [EUDICO](https://www.mpi.nl/world/tg/lapp/eudico/eudico.html) is the European Distributed Corpora Project.

## Audio Files

The audio files can be used as they are, provided that they use the sample rate expected by the model. I'll check that and resample if needed.
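
A minimal sketch of that check, assuming the files are uncompressed PCM WAV (the only kind the standard-library `wave` module reads) and that the target rate is 16 kHz, the rate this notebook casts the audio to later. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical, not part of the notebook.

```python
import wave


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=16000):
    """Return True when the file's rate differs from the target rate.

    target_rate=16000 is an assumption based on the cast applied later on.
    """
    return wav_sample_rate(path) != target_rate
```

Calling `needs_resampling` on each `.wav` file in the data directory would flag the recordings that have to be resampled before feature extraction.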

## Transcription Files

The transcription files require some preprocessing. The first step is to manually rename them to match the names of the corresponding `.wav` files, because the extraction code pairs each `.eaf` file with the `.wav` file that shares its basename.
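
After renaming, it can help to confirm that no transcription is left without audio. The sketch below is one way to do that with the standard library; `unmatched_transcriptions` is a hypothetical helper, and it assumes all files sit in a single flat directory.

```python
import os
from glob import glob


def unmatched_transcriptions(data_dir):
    """List .eaf basenames that have no .wav file with the same basename."""
    wav_names = {
        os.path.splitext(os.path.basename(path))[0]
        for path in glob(os.path.join(data_dir, "*.wav"))
    }
    eaf_names = [
        os.path.splitext(os.path.basename(path))[0]
        for path in glob(os.path.join(data_dir, "*.eaf"))
    ]
    return sorted(name for name in eaf_names if name not in wav_names)
```

An empty list means every `.eaf` file has a matching `.wav` file.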

The next step is to identify the correct "tier" to use. The `.eaf` format is a flavor of XML in which annotations are grouped into named tiers. Fortunately, there is a Python library for reading and parsing ELAN files: [pympi](https://github.com/dopefishh/pympi). It is used here to extract the information needed for fine-tuning.
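
Because `.eaf` files are XML, the tiers can also be inspected with the standard library alone. The sketch below is not how this notebook reads the files (it uses `pympi`); it assumes the usual ELAN layout of `TIER` elements carrying a `TIER_ID` attribute and containing `ANNOTATION` children. `list_tiers` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def list_tiers(eaf_path):
    """Map each tier ID in an .eaf file to its number of annotations."""
    root = ET.parse(eaf_path).getroot()
    return {
        tier.get("TIER_ID"): len(tier.findall("ANNOTATION"))
        for tier in root.iter("TIER")
    }
```

Tiers with many annotations and a name resembling "transcript" are the likely candidates for the transcription tier.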

## The Goal

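The goal is to produce a dataset with one row per transcribed segment, pairing an audio path with the segment's text. The intermediate JSONL file also stores each segment's `start` and `end` times in seconds, but those columns are removed before feature extraction. The final rows look like this: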

```json
{"audio": "../new-data/2019-08-16-MR-narracion.wav", "text": "Manolo"}
{"audio": "../new-data/2019-08-16-MR-narracion.wav", "text": "Manolo Romero"}
{"audio": "../new-data/2019-08-16-MR-narracion.wav", "text": "aca vivo en Nuevo Union, Pozo Amarillo"}
{"audio": "../new-data/2019-08-16-MR-narracion.wav", "text": "yo tengo 39 anos"}
{"audio": "../new-data/2019-08-16-MR-narracion.wav", "text": "si"}
{"audio": "../new-data/2019-08-16-MR-narracion.wav", "text": "si"}
{"audio": "../new-data/2019-08-16-MR-narracion.wav", "text": "asma'ak altenama kolha ka"}
```

## Load the Libraries

The following libraries are required for reading, parsing, formatting, and writing the dataset:

- `codecarbon`: For tracking the estimated carbon emissions of the preprocessing run
- `datasets`: For loading, splitting, and saving the dataset
- `glob`: For finding files that match a pattern
- `json`: For writing the dataset in JSON Lines format
- `numpy`: For numerical arrays (seeded here for reproducibility)
- `os`: For building paths and creating directories
- `pympi`: For reading and parsing ELAN files (<https://github.com/dopefishh/pympi>)
- `random`: For seeding Python's random number generator
- `torch`: For seeding PyTorch's random number generator
- `transformers`: For the Whisper processor (feature extractor and tokenizer)

In [None]:
import pympi
import json
from glob import glob
import os
import random
import numpy as np
import torch
from datasets import load_dataset, DatasetDict
from datasets import Audio
from transformers import WhisperProcessor, WhisperFeatureExtractor
from codecarbon import EmissionsTracker

In [None]:
# Set random seed for reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

In [None]:
# Configuration
data_dir = '../new-data'
output_jsonl = '../new-data/whisper_finetune_dataset.jsonl'
output_dir = "../whisper_finetune_cpu"
log_dir = "../logs"

# If directories do not exist, create them
os.makedirs(output_dir, exist_ok=True)
os.makedirs(log_dir, exist_ok=True)

TRANSCRIPTION_TIERS = [
    'transcript_MR', 'Transcript', 'transcript_ER', 'transcript_LF',
    'transcript_LM', 'transcript_MM', 'transcript_TF', 'transcript_SSA',
    'transcripcion_CA', 'transcript_PA', 'Transcripcion', 'AR_transcripcion',
    'transcripcion', 'FF_Transcripcion', 'Transcripcion_FF'
]
min_duration_sec = 0.5

# Find all .eaf files
eaf_paths = sorted(glob(os.path.join(data_dir, '*.eaf')))

samples = []

for eaf_path in eaf_paths:
    basename = os.path.splitext(os.path.basename(eaf_path))[0]
    wav_path = os.path.join(data_dir, basename + '.wav')
    if not os.path.exists(wav_path):
        print(f"⚠️ Missing WAV file for: {basename}")
        continue

    eaf = pympi.Elan.Eaf(eaf_path)
    tier_names = eaf.get_tier_names()

    # Try all known transcription tiers for this file
    matching_tiers = [tier for tier in TRANSCRIPTION_TIERS if tier in tier_names]
    if not matching_tiers:
        print(f"⚠️ No matching tier in {basename}")
        continue

    for tier in matching_tiers:
        for start_ms, end_ms, value in eaf.get_annotation_data_for_tier(tier):
            duration = (end_ms - start_ms) / 1000.0
            if duration < min_duration_sec or not value.strip():
                continue
            samples.append({
                "audio": wav_path,
                "start": start_ms / 1000.0,
                "end": end_ms / 1000.0,
                "text": value.strip()
            })

print(f"✅ Extracted {len(samples)} segments from {len(eaf_paths)} EAF files.")

# Save to JSONL
with open(output_jsonl, 'w', encoding='utf-8') as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + '\n')

print(f"📄 Saved dataset to {output_jsonl}")
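
Before moving on, the JSONL file can be sanity-checked. The sketch below assumes the row layout written above (`audio`, `start`, `end`, `text`) and reports, per recording, how many segments were extracted and how many seconds they cover. `summarize_segments` is a hypothetical helper.

```python
import json
from collections import defaultdict


def summarize_segments(jsonl_path):
    """Return {audio path: (segment count, total seconds)} for a JSONL file."""
    counts = defaultdict(int)
    seconds = defaultdict(float)
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            counts[row["audio"]] += 1
            seconds[row["audio"]] += row["end"] - row["start"]
    return {path: (counts[path], seconds[path]) for path in counts}
```

A recording with very few segments or very little transcribed time may point to a tier name missing from `TRANSCRIPTION_TIERS`.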

In [None]:
# Path to your new JSONL dataset
jsonl_path = "../new-data/whisper_finetune_dataset.jsonl"

# Load the dataset (entire dataset as a single split)
dataset = load_dataset("json", data_files=jsonl_path, split="train")

# Split into train/validation/test (80/10/10)
split_dataset = dataset.train_test_split(test_size=0.2, seed=seed)
val_test = split_dataset['test'].train_test_split(test_size=0.5, seed=seed)

# Assemble into a DatasetDict
whisper_dataset = DatasetDict({
    "train": split_dataset['train'],
    "validation": val_test['train'],
    "test": val_test['test']
})

# Optional: Preview
print(whisper_dataset)
print(whisper_dataset["train"][0])

In [None]:
whisper_dataset = whisper_dataset.remove_columns(["start", "end"])

In [None]:
# Load the processor, which bundles the feature extractor and the tokenizer
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
# Reuse the processor's tokenizer to encode the target text
tokenizer = processor.tokenizer

In [None]:
# Decode the audio on access and resample it to the 16 kHz rate Whisper expects
whisper_dataset = whisper_dataset.cast_column("audio", Audio(sampling_rate=16000))

In [None]:
print(whisper_dataset["train"][0])

In [None]:
def prepare_dataset(batch):
    # Accessing the "audio" column decodes the file and resamples it to 16 kHz
    # (the rate set by cast_column above)
    audio = batch["audio"]

    # Compute log-Mel input features from the audio array
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]

    # Encode the target text to label IDs
    batch["labels"] = tokenizer(batch["text"]).input_ids
    return batch
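
Each row stores the path of the full recording, so the array decoded in `prepare_dataset` covers the whole file rather than a single annotation. Whisper's feature extractor also pads or truncates its input to 30 seconds. If per-segment audio is wanted, one option is to keep the `start` and `end` columns and slice the decoded array before extracting features. The sketch below shows only the slicing step; `slice_segment` is a hypothetical helper and is not part of this notebook.

```python
def slice_segment(samples, sampling_rate, start_sec, end_sec):
    """Return the samples between start_sec and end_sec.

    Works on any sliceable sequence (list or NumPy array). Indices are
    clamped so that times outside the recording do not raise an error.
    """
    start = max(0, int(round(start_sec * sampling_rate)))
    end = min(len(samples), int(round(end_sec * sampling_rate)))
    return samples[start:end]
```

Inside `prepare_dataset`, this would be applied as `slice_segment(audio["array"], audio["sampling_rate"], batch["start"], batch["end"])`, assuming those two columns have not been removed.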

In [None]:
tracker = EmissionsTracker(
    project_name="whisper-enenlhet-cpu",
    output_dir=log_dir,
    output_file="whisper-prepare-data-emissions-cpu.csv"
)

In [None]:
# Start emissions tracking
tracker.start()
try:
    for split in whisper_dataset:
        whisper_dataset[split] = whisper_dataset[split].map(
            prepare_dataset,
            remove_columns=whisper_dataset[split].column_names,
            num_proc=4
        )
finally:
    # Stop emissions tracking even if preprocessing fails
    tracker.stop()
# Save the processed dataset to disk
whisper_dataset.save_to_disk("whisper_prepared_dataset")
print("✅ Dataset prepared and saved to 'whisper_prepared_dataset' directory.")