Author: René Uhliar

## Common Dutch words extraction
In this part, we extract the words in the data set that match our list of common Dutch words (the matching words). The data set is the Corpus Gesproken Nederlands (CGN) data that my team colleagues processed earlier. Each match is extracted as a line in the format
<br>
<b>start_time_of_word,end_time_of_word,the_word,absolute_path_to_wav_file_of_the_word</b>
<br>
These lines are saved into a new file after they have been extracted from the CGN. We will later use them to cut the corresponding audio segments out of the wav files.
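
To make the line format concrete, here is a minimal sketch of how one such line could be split into typed fields. The `WordLine` tuple and the `parse_word_line` helper are hypothetical names introduced only for this illustration, and the times are assumed to be in seconds.

```python
from typing import NamedTuple


class WordLine(NamedTuple):
    start: float   # start time of the word (assumed to be seconds)
    end: float     # end time of the word (assumed to be seconds)
    word: str      # the spoken word, lowercased
    wav_path: str  # absolute path to the audio file containing the word


def parse_word_line(line: str) -> WordLine:
    # Split on the first three commas only, so a comma inside the path
    # would stay part of the path field.
    start, end, word, wav_path = line.rstrip("\n").split(",", 3)
    return WordLine(float(start), float(end), word.lower(), wav_path)


# Example line taken from the comments in the notebook code.
example = parse_word_line(
    "0.197,0.736,het,/datb/aphasia/languagedata/corpus/transform/wavfiles/fn001023.csv\n"
)
print(example.word, example.start, example.end)
```

Keeping the fields in a named tuple avoids index lookups such as `parts[2]` later on, which makes the downstream code easier to read.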

In [1]:
import os, re

filepath = "/datb/aphasia/languagedata/corpus/final/"
matching_words_path = "/datb/aphasia/aphasia-shared/pannous_matching_words.csv"
save_filepath = "/datb/aphasia/aphasia-shared/cgn_matching_words_lines.txt"

In [15]:
files = [filepath + name for name in os.listdir(filepath)]


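def get_matching_words_from_csv(path, delimiter):
    with open(path, "r") as file:
        # Strip surrounding whitespace (e.g. the newline at the end of the file) from every word.
        return [word.strip() for word in file.read().split(delimiter)]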


def get_matching_file_paths(files, matching_words_arr):
    lines = []
    for idx, file_path in enumerate(files):
        # file_path example: /datb/aphasia/languagedata/corpus/final/fn001023.csv

        # For testing purposes we only take the first 15 csv files to
        # load lines from, so that the script runs fast.
        #
        # Comment this check out when you want to load lines from ALL csvs
        # (this makes the next cells take long to process).
        if idx >= 15:
            break
        with open(file_path, "r") as reading_file:
            for line in reading_file:
                # line example: 0.197,0.736,het,/datb/aphasia/languagedata/corpus/transform/wavfiles/fn001023.csv
                # The word is the third comma-separated field. It does not depend
                # on common_word, so it is extracted once per line.
                word = line.split(',')[2].lower()

                # The 'f' prefix before a string enables string interpolation (f-string).
                # The \b anchors make the pattern match {word} only as a whole word,
                # not as a part of a longer word that also contains {word}.
                # re.escape keeps regex metacharacters in the word from being read as pattern syntax.
                pattern = f"\\b{re.escape(word)}\\b"
                for common_word in matching_words_arr:
                    # common_word example: het
                    if re.search(pattern, common_word) is not None:
                        lines.append(line)
                        # Stop at the first match so that a line is saved only once.
                        break
    return lines
    
matching_words = get_matching_words_from_csv(matching_words_path, ',')

# save the found words into a new file
def save_lines_to_file(save_file_path, matching_file_paths):
    with open(save_file_path, "w") as save_file:
        for line in matching_file_paths:
            save_file.write(line)

save_lines_to_file(save_filepath, get_matching_file_paths(files, matching_words))
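
The whole-word matching used in the cell above can be tried out in isolation. The sketch below is only an illustration: `is_whole_word` is a hypothetical helper, and the set-based lookup at the end is an alternative that assumes every entry in the matching-words list is a single word.

```python
import re


def is_whole_word(word, text):
    # \b matches at a word boundary, so the pattern only matches when
    # `word` is not surrounded by further word characters.
    # re.escape treats characters such as '.' or '*' in `word` literally.
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, text) is not None


print(is_whole_word("het", "het"))         # exact word
print(is_whole_word("het", "hetzelfde"))   # part of a longer word
print(is_whole_word("het", "dat is het"))  # word inside a sentence

# Alternative under the single-word assumption: a set lookup needs no regex
# and checks one word against the whole list in a single step.
common_words = {"het", "een", "niet"}
print("het" in common_words)
```

When the list only contains single words, the set lookup gives the same answer as the regex and avoids looping over every common word for every line.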

## Process the audio from the saved lines into new single-word audio files
The first cell below defines a dictionary. It maps each common Dutch word (key) to the label (to be used in training) that represents the word.

In the second cell, we use the file we created earlier, which contains the 
<b>start_time_of_word,end_time_of_word,the_word,absolute_path_to_wav_file_of_the_word</b> lines. For each line, we want to cut out the part of the (whole) audio file in which only the word we are looking for is spoken. We save that part to a new file, convert the saved audio file to 8-bit precision and save the result as well. To achieve this, we do the following for each line read from the saved file:
<ol>
    <li>Extract a line from the earlier saved file</li>
    <li>Clean the comma separated parts to make them usable</li>
    <li>Load the whole audio file at the path and cut just the part where the word occurs (offset &amp; duration)</li>
    <li>Save the new audio file containing just the one spoken word</li>
    <li>Reduce the bit depth (from the original 24-bit) to 8-bit precision and save this file as well</li>
</ol>
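
The bookkeeping behind steps 3 to 5 can be sketched without touching any audio. The helper below is hypothetical: its name, its parameters and the example directories are assumptions made for this illustration. It only computes the offset, the duration, the output paths and the SoX argument list, where `-b 8` placed before the output file sets the output bit depth.

```python
import os


def plan_extraction(start, end, word, index, labels, out_dir, out_dir_8bit):
    # Offset and duration describe the segment of the source file to load.
    offset = float(start)
    duration = float(end) - offset

    # The label of the word becomes the first part of the file name.
    file_name = f"{labels[word.lower()]}_spreker_{index}.wav"
    full_path = os.path.join(out_dir, file_name)
    reduced_path = os.path.join(out_dir_8bit, file_name)

    # Argument list for a SoX call that writes an 8-bit copy of the file.
    sox_command = ["sox", full_path, "-b", "8", reduced_path]
    return {
        "offset": offset,
        "duration": duration,
        "file_name": file_name,
        "sox_command": sox_command,
    }


plan = plan_extraction(0.197, 0.736, "het", 0, {"het": 6},
                       "/tmp/words/", "/tmp/words/new/")
print(plan["file_name"])
print(plan["sox_command"])
```

Separating this planning step from the actual loading and saving makes it possible to check file names and commands before processing the whole corpus.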
In the end we use the 8-bit precision wav files, because the library we use calls `wave.open()`, which did not work with our 24-bit audio files. The library's examples also work with 8-bit precision audio files, so we stick to this precision.
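
One way to confirm that a converted file really has 8-bit samples is to inspect it with the standard-library `wave` module. The sketch below is an illustration under assumptions: `sample_width_bits` is a hypothetical helper, and the demo file is a small synthetic one written to a temporary directory so that the example runs on its own.

```python
import os
import tempfile
import wave


def sample_width_bits(path):
    """Return the bit depth of a PCM wav file as reported by the wave module."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getsampwidth() * 8


# Write a short synthetic 8-bit mono file so the check can run stand-alone.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_8bit.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(1)      # 1 byte per sample = 8-bit
    wav_file.setframerate(8000)   # arbitrary rate chosen for the demo
    # 8-bit wav samples are unsigned, so 128 represents silence.
    wav_file.writeframes(bytes([128] * 800))

print(sample_width_bits(demo_path))
```

Running the same helper on a file from the 8-bit output directory would show whether the conversion produced the expected precision.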

In [13]:
import librosa
from subprocess import call

cgn_augio_dir = "/datb/aphasia/languagedata/corpus/transform/wavfiles/"
save_extracted_words_dir = "/datb/aphasia/aphasia-shared/extracted_words_audio_test/"
save_extracted_words_8bit_dir = save_extracted_words_dir + "new/"

# Labels for the words. The label is the first part of the wav file name, e.g.
# the number 8 in 8_spreker_214 belongs to the word "een" from the dict.
word_labels_dict = {"zijn":0, "ja":1, "dat":2, "de":3, "ik":4, "en":5,
                    "het":6, "uh":7, "een":8, "hebben":9, "die":10, "van":11, "maar":12, "in":13, "niet":14,}

In [16]:
with open(save_filepath, "r") as read_file:

    index = 0
    for line in read_file.readlines():
        # line example: 0.197,0.736,het,/datb/aphasia/languagedata/corpus/transform/wavfiles/fn001023.csv
        line_parts_arr = line.split(',')
        
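        # Remove the newline character from the file path part in line_parts_arr.
        # rstrip only removes a newline if there is one, so a last line without it keeps its full path.
        line_parts_arr[3] = line_parts_arr[3].rstrip("\n")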
        start_seq = float(line_parts_arr[0])
        end_seq = float(line_parts_arr[1])
        
        # Lowercase the word, so that we can match it with the word in the word_labels_dict
        word = line_parts_arr[2].lower()
        path = line_parts_arr[3]

        # Get the duration of the spoken word.
        duration = end_seq - start_seq
        
        # Load only the segment from start_seq to end_seq (just the one word).
        # Note: by default librosa.load resamples to 22050 Hz and mixes down to mono.
        wav, sr = librosa.load(path, 
                               offset=start_seq, 
                               duration=duration)
        
        new_file_name = str(word_labels_dict[word]) + "_spreker_" + str(index) + ".wav"
        
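        # Save location for the extracted audio part.
        # Note: librosa.output.write_wav was removed in librosa 0.8.0, so this call needs an older version.
        save_location = save_extracted_words_dir + new_file_name
        librosa.output.write_wav(save_location, wav, sr)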
        
        # The audio has to be converted to <= 16-bit precision for the wave library
        # to be able to open the wav file in this case, see:
        # https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/issues/57
        # We convert to 8-bit precision, because the pannous digit example code also used 8-bit wav files.
        # We save the new (8-bit precision) wav file to a different directory than the extracted wav files.
        new_save_location = save_extracted_words_8bit_dir + new_file_name
        call(["sox", save_location, "-b8", new_save_location])
        
        index += 1
