In [1]:
import pympi
import re
import pydub
from pydub import AudioSegment
from pathlib import Path

This example shows how to open an ELAN file and cut individual utterances into their own audio clips. The Python package [pydub](https://github.com/jiaaro/pydub) is used for this, although there are many alternatives. We start by reading the ELAN file into a list of dictionaries, in the same way as in the last lecture. The script below could also be modified to loop over several ELAN files.
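
Looping over several ELAN files, as mentioned above, could be sketched roughly like this. The helper name `find_elan_audio_pairs` is hypothetical and only for illustration; it assumes that each wav file sits next to its ELAN file and shares its name, which is the convention used in this notebook.

```python
from pathlib import Path


def find_elan_audio_pairs(corpus_dir):
    """Return (eaf_path, wav_path) pairs for ELAN files with a same-named wav file.

    Hypothetical helper: assumes lower-case .eaf and .wav extensions
    and that both files are in the same directory.
    """
    pairs = []
    for eaf_path in sorted(Path(corpus_dir).glob("*.eaf")):
        wav_path = eaf_path.with_suffix(".wav")
        # Skip ELAN files without a matching audio file instead of failing later
        if wav_path.exists():
            pairs.append((eaf_path, wav_path))
    return pairs
```

Each pair returned by `find_elan_audio_pairs("corpus")` could then go through the same reading and cutting steps that this notebook applies to a single file.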

Note that we expect the wav file to have the same name as the ELAN file, with both file extensions in lower case. Please adjust this to your own conventions! It would also be possible to read the location of the linked audio file from the ELAN file itself.
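
Reading the linked audio file from the ELAN file could look something like the sketch below. It parses the XML directly with the standard library rather than relying on a particular pympi call. The function name `linked_wav_path` is hypothetical, and the sketch assumes that the `MEDIA_DESCRIPTOR` element in the file header carries a `RELATIVE_MEDIA_URL` attribute, which is not guaranteed for every ELAN file.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def linked_wav_path(eaf_path):
    """Return the path of the first linked wav file, or None if none is found.

    Hypothetical helper: only RELATIVE_MEDIA_URL is considered, and it is
    resolved relative to the directory of the ELAN file.
    """
    eaf_path = Path(eaf_path)
    root = ET.parse(eaf_path).getroot()
    for descriptor in root.iter("MEDIA_DESCRIPTOR"):
        relative_url = descriptor.get("RELATIVE_MEDIA_URL", "")
        if relative_url.lower().endswith(".wav"):
            return eaf_path.parent / relative_url
    return None
```

If the function returns `None`, falling back to the same-name convention described above is a reasonable default.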

The audio file is read with pydub's `AudioSegment.from_wav` function.

In [2]:
elan_path = "corpus/kpv_csys19570000-291_1a-Mezador.eaf"

audio = AudioSegment.from_wav(elan_path.replace('.eaf', '.wav'))

Reading the annotations is essentially the same as last time.

In [3]:
elan_data = []

elan = pympi.Elan.Eaf(elan_path)

tiers = elan.get_tier_ids_for_linguistic_type("orthT")

for tier in tiers:

    annotations = elan.get_annotation_data_for_tier(tier)

    for annotation in annotations:

        a = {}

        a['start_ms'] = annotation[0]
        a['end_ms'] = annotation[1]
        a['utterance'] = annotation[2]
        a['reference'] = annotation[3]
        a['participant'] = tier.split("@")[1]
        a['gender'] = a['participant'].split("-")[1]
        a['birthyear'] = a['participant'].split("-")[2]
        # The dialect code follows the "kpv_" prefix in the file name
        a['dialect'] = re.search(r"(?<=kpv_)(izva|csys|udo|uvyc|vym)", str(elan_path)).group(0)
        a['filename'] = str(elan_path)

        elan_data.append(a)

Parsing unknown version of ELAN spec... This could result in errors...


The data now looks like this. In practice, all we need is `start_ms`, `end_ms` and `utterance`, and these can be retrieved from an ELAN file no matter which tier template is used.

In [4]:
elan_data[0]

{'start_ms': 696,
 'end_ms': 3168,
 'utterance': 'Täällä on Keski-Sysolan murre.',
 'reference': 'kpv_csys19570000-291_1a-01',
 'participant': 'EEI-M-1913',
 'gender': 'M',
 'birthyear': '1913',
 'dialect': 'csys',
 'filename': 'corpus/kpv_csys19570000-291_1a-Mezador.eaf'}
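
Since only the start time, end time and text are needed, a version that does not depend on our tier template could be sketched as below. The function name `collect_annotations` is hypothetical. It assumes an object that behaves like pympi's `Eaf`, offering `get_tier_names()` and `get_annotation_data_for_tier()`, and it assumes that the first three items of each annotation are start, end and text, as in the cell above.

```python
def collect_annotations(eaf, tier_names=None):
    """Collect start, end and text from tiers of a pympi-like Eaf object.

    Hypothetical helper: if tier_names is None, every tier is read,
    so no particular tier template is assumed.
    """
    if tier_names is None:
        tier_names = list(eaf.get_tier_names())
    rows = []
    for tier in tier_names:
        for annotation in eaf.get_annotation_data_for_tier(tier):
            # Ignore any extra items (such as a reference value) after the first three
            start_ms, end_ms, utterance = annotation[:3]
            rows.append({
                "tier": tier,
                "start_ms": start_ms,
                "end_ms": end_ms,
                "utterance": utterance,
            })
    return rows
```

In practice one would usually pass an explicit list of tier names, because reading every tier also returns translations, notes and other annotations that should not become clips.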

Let's create a directory for the clips and then loop through each utterance.

In [5]:
! mkdir clips

mkdir: clips: File exists
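
The message above appears because the directory already existed. A variant that stays silent in that case and does not depend on a shell command could look like this, as a small sketch using only the standard library:

```python
from pathlib import Path

# Create the output directory; do nothing if it is already there
clips_dir = Path("clips")
clips_dir.mkdir(parents=True, exist_ok=True)
```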


In [6]:
for annotation in elan_data:
    
    clip_name = f"clips/{Path(annotation['filename']).stem}-{annotation['start_ms']}-{annotation['end_ms']}"
    
    # This is where we specifically select a part of the clip, using milliseconds as the unit
    audio_segment = audio[annotation['start_ms']:annotation['end_ms']]
    
    audio_segment.export(f"{clip_name}.wav", format="wav")
    
    # Write the transcription next to the clip; UTF-8 keeps non-ASCII characters intact
    with open(f"{clip_name}.txt", "w", encoding="utf-8") as text_file:
        text_file.write(annotation['utterance'])
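
If the loop above is to be reused for several files, it could be wrapped in a function along these lines. The name `export_clips` is hypothetical. The sketch assumes that `audio` behaves like a pydub `AudioSegment`, meaning that it can be sliced by milliseconds and that the slice has an `export` method. Skipping annotations with zero or negative length is an added assumption that is not part of the original loop.

```python
from pathlib import Path


def export_clips(audio, annotations, out_dir="clips"):
    """Write one wav clip and one txt file per annotation; return the wav paths.

    Hypothetical helper: annotations are dictionaries with the keys
    'filename', 'start_ms', 'end_ms' and 'utterance', as built earlier.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for annotation in annotations:
        start_ms, end_ms = annotation["start_ms"], annotation["end_ms"]
        if end_ms <= start_ms:
            # Assumption: empty annotations are skipped rather than exported
            continue
        stem = f"{Path(annotation['filename']).stem}-{start_ms}-{end_ms}"
        clip_path = out_dir / f"{stem}.wav"
        # Slicing uses milliseconds as the unit
        audio[start_ms:end_ms].export(str(clip_path), format="wav")
        (out_dir / f"{stem}.txt").write_text(annotation["utterance"], encoding="utf-8")
        written.append(clip_path)
    return written
```

With the variables from this notebook, the call would be `export_clips(audio, elan_data)`.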