# Prepare Manually Aligned Training Data
This notebook prepares Prak training data from a corpus that was manually time-aligned at the phone level.
While it is possible to train the Prak acoustic model from just recordings and their corresponding
texts (e.g. using a CommonVoice corpus), adding some manually aligned phones
helps to teach the model exactly where we want the phone boundaries to be (this is rather
subjective, so we have to provide an example of what we want).

For example, [Fonetický ústav FFUK](https://fonetika.ff.cuni.cz/) invested a great deal of
human labor into preparing such a manually aligned corpus.
If you use models trained on their data, please give FÚ due credit.

This notebook converts TextGrid files to a time-aligned tsv file usable in Prak training.
The format of this file is the same as the format of the intermediate tsv files produced during
Prak training on the CommonVoice data (which is time-aligned only at the sentence level).
The difference is that this notebook assigns phone labels to 10 ms intervals in the audio
based on a human decision (from the TextGrid file), while the tsv produced during
CommonVoice training has this assignment done by the partially trained NN itself.
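
To make the per-frame representation concrete, here is a minimal sketch that collapses a target string back into time intervals, which can help when checking the resulting tsv by eye. It assumes one character per 10 ms frame and `|` as the silence filler (both taken from the code later in this notebook); the helper name `targets_to_intervals` is hypothetical and not part of Prak.

```python
from itertools import groupby

FRAME_SEC = 0.01  # assumption: one target character per 10 ms frame


def targets_to_intervals(targets, silence="|"):
    """Collapse a per-frame label string into (start, end, phone) runs."""
    intervals = []
    frame = 0
    for label, run in groupby(targets):
        n = len(list(run))  # number of consecutive frames with this label
        if label != silence:
            start = round(frame * FRAME_SEC, 2)
            end = round((frame + n) * FRAME_SEC, 2)
            intervals.append((start, end, label))
        frame += n
    return intervals


# Two silent frames, three frames of 'a', one frame of 'b', one silent frame.
example = targets_to_intervals("||aaab|")
print(example)
```

Silence runs are dropped from the output, so gaps between the returned intervals correspond to `|` frames.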

Human-aligned, UTF-8 encoded files should be named ```*.TextGrid``` and have
corresponding ```*.wav``` files in the same directory. Any directory
structure is allowed, but file names (without path) should be unique across the full data set.
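
The layout requirements above can be checked up front. The following is only a sketch, independent of the cells below: the function name `check_corpus_layout` is hypothetical, and the small temporary directory tree exists purely for demonstration.

```python
import tempfile
from collections import Counter
from pathlib import Path


def check_corpus_layout(root):
    """Return (missing_wavs, duplicate_names) for all *.TextGrid files under root."""
    textgrids = sorted(Path(root).rglob("*.TextGrid"))
    # TextGrid files without a sibling .wav in the same directory
    missing_wavs = [tg for tg in textgrids if not tg.with_suffix(".wav").is_file()]
    # file names (without path) which occur more than once in the tree
    counts = Counter(tg.name for tg in textgrids)
    duplicate_names = sorted(name for name, n in counts.items() if n > 1)
    return missing_wavs, duplicate_names


# Demonstration on a throwaway tree (not real corpus data):
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a").mkdir()
    (root / "b").mkdir()
    for rel in ["a/x.TextGrid", "a/x.wav", "b/x.TextGrid", "b/y.TextGrid", "b/y.wav"]:
        (root / rel).touch()
    missing, duplicates = check_corpus_layout(root)
    missing_names = [p.relative_to(root).as_posix() for p in missing]

print(missing_names, duplicates)
```

On a real corpus you would pass the directory configured as `full_data` and inspect both returned lists before running the conversion.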

TextGrid files should contain:
* interval tier named ```phone``` or ```Phone``` with individual phones in [Czech SAMPA](https://www.phon.ucl.ac.uk/home/sampa/czech-uni.htm)
* interval tier named ```word``` or ```Word``` which will be used as transcript of the recording
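
As a quick sanity check of the tier requirements, the sketch below lists interval tier names found in TextGrid text and tests whether both required tiers are present. It assumes the long (verbose) Praat text format, where each tier has `class = "IntervalTier"` followed by `name = "..."`; short-format or binary TextGrids are not handled. The helper names are hypothetical, and the notebook itself reads TextGrids with `read_interval_tiers_from_textgrid_file` instead.

```python
import re

# Assumption: long-format TextGrid, tier header lines in this order.
TIER_RE = re.compile(r'class\s*=\s*"IntervalTier"\s*name\s*=\s*"([^"]*)"')


def interval_tier_names(textgrid_text):
    """Return names of all interval tiers, in file order."""
    return TIER_RE.findall(textgrid_text)


def has_required_tiers(names):
    """True if a phone/Phone tier and a word/Word tier are both present."""
    present = set(names)
    return bool(present & {"phone", "Phone"}) and bool(present & {"word", "Word"})


# Tiny hand-written sample, not taken from any real corpus:
sample = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 0.3
tiers? <exists>
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "Word"
        xmin = 0
        xmax = 0.3
        intervals: size = 1
        intervals [1]:
            xmin = 0
            xmax = 0.3
            text = "ano"
    item [2]:
        class = "IntervalTier"
        name = "phone"
        xmin = 0
        xmax = 0.3
        intervals: size = 1
        intervals [1]:
            xmin = 0
            xmax = 0.3
            text = "a"
'''

names = interval_tier_names(sample)
print(names, has_required_tiers(names))
```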

The ```word``` tier is not strictly needed but makes tsv easier to check and could be used to fix little errors in the ```phone``` tier (e.g. phones which are out of the Czech SAMPA set can be replaced with a prediction made by Prak). We are in fact interested in the ```phrase``` tier but it is not consistently present in the FÚ data. Also, ```word``` tends to have corrections where reader diverged from the prompt, unlike ```phrase```.

When the manually aligned TextGrid contains phones outside the expected set, we use Prak alignment data (made by a CV-trained model) as a backup.
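
The backup idea can be sketched in isolation: wherever the manual frame labels contain a marker for an unexpected phone, take the label for that frame from the automatic alignment instead. The markers `#` and `@` are the two characters the conversion cell later in this notebook checks for; the function name `patch_targets` and the example strings are hypothetical.

```python
def patch_targets(manual, backup, bad="#@"):
    """Replace frames marked as bad in `manual` with the same frames from `backup`."""
    if len(manual) != len(backup):
        raise ValueError("manual and backup targets must cover the same frames")
    return "".join(b if m in bad else m for m, b in zip(manual, backup))


# One frame of the manual labels is marked '#'; the backup alignment says 'a' there.
patched = patch_targets("||aa#bb|", "||aaabb|")
print(patched)
```

Note that this patches frame by frame, so the exact manual time boundary of the replaced phone is not preserved.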

In [None]:
# config cell - edit paths as needed

# Where is the corpus:
full_data = "/data/ada/000_cleanTG"

# Where is our test subset of the full corpus (to be excluded from train):
test_subset = "/home/hanzl/test-prak/ref2/repair"

In [None]:
import pandas as pd
import sys
if sys.path[0] != '..':
    sys.path[0:0] = ['..'] # prepend main Prak directory
from acmodel.praat_ifc import (
    read_interval_tiers_from_textgrid_file,
    rename_prune_tiers,
    desampify_phone_tier)
from acmodel.nn_acmodel import (
    load_nn_acoustic_model,
    b_log_corrections,
    triple_hmm_states,
    mfcc_make_speaker_vector,
    mfcc_win_view,
    mfcc_add_sideview,
    b_set,
    align_hmm)
from prongen.hmm_pron import HMM

In [None]:
full_wav_paths = !find {full_data} -name "*.wav"
test_wav_paths = !find {test_subset} -name "*.wav"
len(full_wav_paths), len(test_wav_paths)

In [None]:
wav2path = {path.split("/")[-1]: path for path in full_wav_paths}
assert len(wav2path)==len(full_wav_paths) # make sure file names are all different

In [None]:
test_wav_file_set = {path.split("/")[-1] for path in test_wav_paths}
assert len(test_wav_file_set)==len(test_wav_paths)

In [None]:
full_wav_file_set = set(wav2path) # get just keys, i.e. file names

In [None]:
train_wav_file_set = full_wav_file_set-test_wav_file_set
assert len(train_wav_file_set)==len(full_wav_file_set)-len(test_wav_file_set)
# The assert above is not strictly necessary, but it reveals
# any test files that are not present in the full set. Comment it out
# if your test set also contains files coming from elsewhere.
len(train_wav_file_set)

In [None]:
def phone_tier_to_phone_targets(frames, phone_tier):
    """
    Create targets for individual 10 ms frames from a list
    of phone intervals. Be permissive when phone times do not
    fit: frames outside the recording are silently skipped.
    """
    targets = ['|']*frames
    # fill in targets for individual phones:
    for (b, e, p) in phone_tier:
        if p=='':
            continue # likely silence, we have it pre-filled already
        #     +----------------+   <-phone
        #+-----+=====+=====+=====+-----+-----+-----+   == phone's frames
        #0    0.01  0.02  0.03      time axis
        #  
        b_frame = int(b*100+0.5) # first frame which belongs to this phone
        e_frame = int(e*100+0.501) # first frame which does NOT belong to this phone
        # 0.5 vs 0.501 makes sure we do not leave frame empty due to rounding noise
        for f in range(b_frame, e_frame):
            if f>=0 and f<frames:
                targets[f]=p
    return ''.join(targets)

#phone_tier_to_phone_targets(7, []) # '|||||||'
#phone_tier_to_phone_targets(7, [(0.00, 0.01, 's')]) # 's||||||'
#phone_tier_to_phone_targets(7, [(0.00, 0.014999, 's'), (0.0150001, 0.03, 'z')]) # 'ssz||||'
#phone_tier_to_phone_targets(7, [(-3, 10, 's')]) # 'sssssss'

In [None]:
# Prepare the CV-trained model for possible patching of problematic spots
# where the human annotator used a phone outside the expected set.

model = load_nn_acoustic_model("half", mid_size=100, varstates=False)
b_log_corr = b_log_corrections("half.tsv")

In [None]:
# prepare tsv columns
c_wav = []
c_sentence = []
c_targets = []
fixes = 0
for wav in sorted(train_wav_file_set):
    path = wav2path[wav]
    tg_path = path[:-len(".wav")]+".TextGrid"
    #print(tg_path)
    in_tiers_all = read_interval_tiers_from_textgrid_file(tg_path)
    in_tiers = rename_prune_tiers(in_tiers_all, ["word", "Word:word", "phone", "Phone:phone"])
    assert set(in_tiers.keys())=={"phone", "word"}
    txt = " ".join(x for (b, e, w) in in_tiers["word"] if (x:=w.strip())!="")
    in_tiers["phone"] = desampify_phone_tier(in_tiers["phone"])
    c_wav.append(path)
    c_sentence.append(txt)
    # compute MFCC from wav, just to find out number of segments:
    hmm = HMM(txt, path, derivatives=0)
    frames = len(hmm.mfcc)
    targets = phone_tier_to_phone_targets(frames, in_tiers["phone"])
    
    if '#' in targets or '@' in targets:  # 394 files out of 1435 have #
        fixes += 1
        #print(f"replacing targets {targets}")
        
        # finish automatic alignment setup:
        triple_hmm_states(hmm)
        hmm.speaker_vector = mfcc_make_speaker_vector(hmm.mfcc)
        hmm.mfcc = mfcc_win_view(mfcc_add_sideview(hmm.mfcc))
    
        targets = list(targets) # make characters assignable
        alp = align_hmm(hmm, model, b_set, b_log_corr=b_log_corr)
        for i in range(frames):
            if targets[i] in '#@':
                targets[i] = hmm.targets[i] # STILL LOSING EXACT TIME BOUNDARY...
        targets = ''.join(targets) # back to a string
        
        #print(f" with new targets {targets}")
    
    c_targets.append(targets)
    
fixes

In [None]:
df = pd.DataFrame(list(zip(c_wav, c_sentence, c_targets)), columns=['wav', 'sentence', 'targets'])
df

In [None]:
df.to_csv("manual_train.tsv", sep="\t", index=False)

In [None]:
!head manual_train.tsv