## Author: René Uhliar

### The `exportDataCSV` function was originally written by Koray. I adapted it to extract phonemes instead of words, and to read only one specific section of each CGN data file rather than the whole file.

<p>This notebook reads the CGN .awd files in the folder given by `awdFolderPath` and writes one .csv file per .awd file. Each .csv file contains a header row followed by one data row per phoneme, with these columns: the begin time of the phoneme, the end time of the phoneme, the phoneme itself, and the absolute path to the corresponding .wav file. For example, the CGN file fn00000.awd contains the raw data:</p>

<ul>
    <li>"0.167"</li>
    <li>"0.179"</li>
    <li>"a"</li>
</ul>

<p>We then transform these data into 0.167,0.179,a,`wavFolderPath`/fn00000.wav and write that line to the corresponding .csv file.</p>
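<p>The transformation of one raw triple into one .csv line can be sketched as follows. This is a minimal illustration, not the notebook's own code: the helper name `awd_triple_to_csv_line` and the folder `/data/wav/` are hypothetical, and the surrounding quotes are stripped only because the example above shows the values quoted.</p>

```python
import csv
import io


def awd_triple_to_csv_line(begin, end, phoneme, wav_folder, basename):
    """Format one (begin, end, phoneme) triple as a single CSV line.

    Hypothetical helper for illustration; wav_folder is assumed to end
    with a path separator, as wavFolderPath does in this notebook.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([
        float(begin.strip('"')),   # begin time of the phoneme
        float(end.strip('"')),     # end time of the phoneme
        phoneme.strip('"'),        # the phoneme label
        wav_folder + basename + ".wav",
    ])
    return buf.getvalue().rstrip("\n")


# "/data/wav/" is a made-up folder used only for this example.
line = awd_triple_to_csv_line('"0.167"', '"0.179"', '"a"', "/data/wav/", "fn00000")
print(line)
```

<p>Going through the `csv` module rather than joining strings by hand means a label containing a comma or a quote would still be escaped correctly.</p>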

<p><b>Note</b> that every .awd file contains several IntervalTier annotations. In each .awd file we are only interested in the data under the <i>third</i> IntervalTier annotation, and we skip the first four lines under that annotation. The variables `skippedRows` and `intervalTierCounter` in the code implement this.</p>
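<p>The tier selection described above can be sketched on an in-memory list of lines. This is a simplified illustration under stated assumptions: the function name `third_tier_triples` and its parameters are hypothetical, and the sample lines are invented to mimic the layout described here (a tier marker, four header lines, then begin/end/label triples); real .awd files may contain more detail. Unlike the export function, it strips only the quotes from the label.</p>

```python
def third_tier_triples(lines, tier_marker="IntervalTier", tier_index=3, header_rows=4):
    """Yield (begin, end, label) triples from the tier_index-th tier.

    Hypothetical helper: collects the lines that follow the selected tier
    marker, drops the first header_rows of them, and groups the rest by three.
    """
    tiers_seen = 0
    body = []
    for line in lines:
        if tier_marker in line:
            tiers_seen += 1
            if tiers_seen > tier_index:
                break          # a later tier starts here, stop reading
            continue
        if tiers_seen == tier_index:
            body.append(line.strip())

    data = body[header_rows:]
    # Step through complete triples only; a trailing partial group is ignored.
    for i in range(0, len(data) - 2, 3):
        yield float(data[i]), float(data[i + 1]), data[i + 2].strip('"')


# Invented sample with four tiers; only the third one holds phonemes.
sample = [
    '"IntervalTier"', '"t1"', '0', '1', '1', '0', '1', '"word"',
    '"IntervalTier"', '"t2"', '0', '1', '1', '0', '1', '"word2"',
    '"IntervalTier"', '"t3"', '0', '1', '2',
    '0.167', '0.179', '"a"',
    '0.179', '0.25', '"b"',
    '"IntervalTier"', '"t4"', '0', '1', '1', '0', '1', '"x"',
]
rows = list(third_tier_triples(sample))
print(rows)
```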

In [47]:
import glob, os, csv, re

In [36]:
# Return the paths of all files in a folder (folderpath must end with a path separator).
def get_files(folderpath):
    files = glob.glob(folderpath + '*')
    return files

In [37]:
finalPath = '/datb/aphasia/languagedata/corpus/final_foneem/'
wavFolderPath = '/datb/aphasia/languagedata/corpus/transform/clean/'
awdFolderPath = '/datb/aphasia/CGN_2.0.3/data/annot/text/awd/comp-k/nl/'

In [64]:
# For exporting data from the Corpus Gesproken Nederlands data set to a CSV
def exportDataCSV(awdFolderPath, wavFolderPath, finalpath, fieldnames):

    files = get_files(awdFolderPath)
    
    for filepath in files:
        with open(filepath, encoding = "ISO-8859-1") as toRead:
            read_data = toRead.readline()

            filename = filepath.split('/')[-1].split('.')[0]
            wavfile = wavFolderPath + filename + '.wav'

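            # Use the finalpath parameter (not the global finalPath); newline='' is what the csv module expects
            with open(finalpath + filename + '.csv', 'w', newline='') as writeTo: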
                    writer = csv.DictWriter(writeTo, fieldnames=fieldnames)
                    writer.writeheader()

                    skippedRows = 0
                    subCount = 0

                    begin = 0
                    end = 0
                    phoneme = ''
                    intervalTierCounter = 0
                    intervalTier = 'IntervalTier'

                    while read_data:

                        # Just continue searching the file until an "IntervalTier" has been found
                        if intervalTier not in read_data and intervalTierCounter < 3:
                            read_data = toRead.readline()
                            continue
                        
                        # We found an "IntervalTier"
                        if intervalTier in read_data and intervalTierCounter < 3:
                            read_data = toRead.readline()
                            intervalTierCounter += 1
                            continue
                        
                        # Some files contain more than three "IntervalTier" sections. Stop at the
                        # fourth one: it contains words, not phonemes.
                        if intervalTier in read_data and intervalTierCounter >= 3:
                            break
                        
                        if skippedRows < 4:
                            read_data = toRead.readline()
                            skippedRows += 1
                            continue
                        
                        # Data lines now come in groups of three: begin time, end time, phoneme
                        subCount += 1
                        
                        if subCount == 1:
                            begin = float(read_data)
                        elif subCount == 2:
                            end = float(read_data)
                        elif subCount == 3:
                            phoneme = re.sub('[\n\r".?]', '', read_data)

                            writer.writerow({'begin': begin, 'end': end, 'phoneme': phoneme, 'audiopath':wavfile})

                            begin = 0
                            end = 0
                            phoneme = ''
                            subCount = 0
                        
                        read_data = toRead.readline()

In [65]:
exportDataCSV(awdFolderPath, wavFolderPath, finalPath, ['begin', 'end', 'phoneme', 'audiopath'])