# Prepare the Dataset for Fine-Tuning Whisper and Wav2Vec2

The raw data consists of two types of files:

1. `.wav` audio of speech by one or more speakers
2. `.eaf` (ELAN Annotation Format) transcriptions of the speech. [ELAN](https://archive.mpi.nl/tla/elan) stands for EUDICO Linguistic Annotator, where [EUDICO](https://www.mpi.nl/world/tg/lapp/eudico/eudico.html) is the European Distributed Corpora Project.

The audio files can be used as they are, provided that they use the sample rate expected by the model. They will be "chunked" by the Hugging Face tools.
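As a quick check on the sample rate, the sketch below reads the header of a `.wav` file with Python's standard-library `wave` module and compares it against a target rate. The 16 kHz target is an assumption based on the rate that published Whisper and Wav2Vec2 checkpoints are typically trained on; confirm it against the feature extractor of the checkpoint being fine-tuned. The names `get_sample_rate`, `has_expected_sample_rate`, and `TARGET_SAMPLE_RATE` are hypothetical and not part of this notebook.

```python
import wave

# Assumed target rate; check the feature extractor of your chosen checkpoint.
TARGET_SAMPLE_RATE = 16_000


def get_sample_rate(wav_path):
    """Return the sample rate (Hz) stored in a PCM .wav file's header."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate()


def has_expected_sample_rate(wav_path, expected=TARGET_SAMPLE_RATE):
    """Return True if the file's sample rate matches the expected rate."""
    return get_sample_rate(wav_path) == expected
```

The `wave` module only reads uncompressed PCM files, so other encodings would need a different reader. Files that fail the check could be resampled before training, for example with pydub's `AudioSegment.set_frame_rate`.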

The transcription files require some preprocessing. The `.eaf` format is XML-based. Fortunately, there is a Python library for reading and parsing ELAN files: [pympi](https://github.com/dopefishh/pympi). It is used here to extract the information needed for fine-tuning.

This notebook will produce a dataset in the format:

```json
{"audio": "path/to/file.wav", "text": "transcription text"}
```
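To make the target format concrete, the sketch below writes a list of such records to a file with one JSON object per line (JSON Lines). Whether the final dataset is stored as one object per line or as a single JSON array is not stated here, so the layout, the helper name `write_records`, and the example paths are all assumptions.

```python
import json


def write_records(records, out_path):
    """Write one {"audio": ..., "text": ...} JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as out_file:
        for record in records:
            # ensure_ascii=False keeps accented characters readable in the file
            out_file.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical records, for illustration only
example_records = [
    {"audio": "path/to/clip_0001.wav", "text": "transcription text"},
    {"audio": "path/to/clip_0002.wav", "text": "more transcription text"},
]
```

Calling `write_records(example_records, "dataset.jsonl")` would produce a two-line file, each line matching the format shown above.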

## Load the Libraries

The following libraries are required for reading, parsing, formatting, and writing the dataset:

- `json`: For writing the dataset to JSON format
- `os`: For interacting with the computer's operating system
- `pydub`: For working with audio files (<https://github.com/jiaaro/pydub>)
- `pympi`: For reading and parsing ELAN files (<https://github.com/dopefishh/pympi>)

In [1]:
import os
import pympi
from pydub import AudioSegment
import json

## Identify the Correct Tier(s) of the ELAN Files for Transcription Data

ELAN files have tiers for keeping track of different types of information (e.g., speaker, transcription, translation, etc.) related to an audio file. All we care about here is the transcription, so we need to extract just the transcription tier as plain text. To do that, we need to identify what the tiers are. The following cells will do that.

In [3]:
# Path to the directory containing the .eaf files
eaf_dir = "../data/have_transcripts"

# Find all .eaf files in the directory
eaf_files = [f for f in os.listdir(eaf_dir) if f.endswith(".eaf")]

# Loop through each .eaf file and print the tier names and their annotations
for filename in sorted(eaf_files):
    eaf_path = os.path.join(eaf_dir, filename)
    print(f"\nFile: {filename}")

    try:
        eaf = pympi.Elan.Eaf(eaf_path)
    except Exception as e:
        print(f"Error reading {filename}: {e}")
        continue

    tier_names = eaf.get_tier_names()
    if not tier_names:
        print("No tiers found.")
        continue

    print("Available Tiers:")
    for tier in tier_names:
        annotations = eaf.get_annotation_data_for_tier(tier)
        print(f"Tier: {tier} — {len(annotations)} annotations")
        for annotation in annotations[:3]:  # show first 3 examples
            if len(annotation) == 3:
                start, end, value = annotation
                print(f"       {start}-{end} ms: {value}")
            else:
                print(f"       (referring annotation): {annotation}")




File: 2019-08-16-MR-narracion.eaf
Available Tiers:
Tier: transcript_MR — 101 annotations
       9975-11425 ms: Manolo
       12800-14400 ms: Manolo Romero
       18800-22025 ms: aca vivo en Nuevo Union, Pozo Amarillo
Tier: translation_MR — 101 annotations
       (referring annotation): (9975, 11425, 'Manolo', 'Manolo')
       (referring annotation): (12800, 14400, 'Manolo Romero', 'Manolo Romero')
       (referring annotation): (18800, 22025, 'aca vivo en Nuevo Union, Pozo Amarillo', 'aca vivo en Nuevo Union, Pozo Amarillo')
Tier: transcript_RH — 9 annotations
       640-1540 ms: por favor
       2030-6570 ms: sek apvesai'a?
       14500-17275 ms: perfecto, y de donde es?

File: 2019-08-16-MRR-narracion.eaf
Available Tiers:
Tier: Transcript — 188 annotations
       100-680 ms: alhnakho
       20290-21020 ms: Miguel
       24010-24760 ms: Romero
Tier: Translation — 188 annotations
       (referring annotation): (100, 680, 'perfecto', 'alhnakho')
       (referring annotation): (20290, 2
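
The listing above shows two tuple shapes coming back from `get_annotation_data_for_tier`: three items `(start, end, value)` for the transcript tiers, and four items for the referring translation tiers. A small helper can normalize both into one record shape. This is only a sketch: the name `annotation_to_segment` and the dictionary keys are hypothetical, and it assumes the first three items are always the start time, the end time, and the tier's own value, as in the output shown.

```python
def annotation_to_segment(annotation):
    """Convert a 3- or 4-item annotation tuple into a dict."""
    if len(annotation) not in (3, 4):
        raise ValueError(f"Unexpected annotation shape: {annotation!r}")
    # Any fourth item (the value of the annotation being referred to) is ignored
    start_ms, end_ms, text = annotation[:3]
    return {"start_ms": start_ms, "end_ms": end_ms, "text": text.strip()}
```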

In [None]:
# Tier names that hold the transcriptions to extract (naming varies across files)
TRANSCRIPTION_TIERS = ['transcript_MR',
                       'Transcript',
                       'transcript_ER',
                       'transcript_LF',
                       'transcript_LM',
                       'transcript_MM',
                       'transcript_TF',
                       'transcript_SSA',
                       'transcripcion_CA',
                       'transcript_PA',
                       'Transcripcion',
                       'AR_transcripcion',
                       'transcripcion',
                       'FF_Transcripcion',
                       'Transcripcion_FF'
                       ]
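
With the list defined, one way to use it is to keep only those tiers of a file whose names appear in it. The sketch below works on plain lists of names so it can be tried without any `.eaf` file; in the notebook the names would come from `eaf.get_tier_names()`. The function name `select_transcription_tiers` is hypothetical, and the shortened list is for illustration only.

```python
def select_transcription_tiers(tier_names, wanted):
    """Return the tier names that are in `wanted`, preserving their order."""
    wanted_set = set(wanted)
    return [name for name in tier_names if name in wanted_set]


# Shortened copy of the list above, for illustration
example_wanted = ["transcript_MR", "Transcript"]
example_tiers = ["transcript_MR", "translation_MR", "transcript_RH"]
print(select_transcription_tiers(example_tiers, example_wanted))
# prints ['transcript_MR']
```

The match is exact and case-sensitive, which is why variants such as `Transcripcion` and `transcripcion` each need their own entry in the list.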