# Compare Prak aligner to Prague Labeller
Prague Labeller ('PP' in the code below) is a common aligner that was used previously. Running it is a bit tricky:
we ran it manually (on a different computer, under Windows) and copied the
resulting TextGrids here. We also have the corresponding manually made labels,
which we corrected further ourselves.

Parts of the code below are commented out to avoid accidentally overwriting manually corrected data. It can be used as follows:
* align your reference data using an older aligner (e.g. Prague Labeller, 'PP')
* manually copy the data to the places suggested below, or edit the paths if you used other places
* uncomment the Prak command line below and align using Prak
* evaluate both aligners (comment/uncomment the lines with compare_tiers_detailed() to select the aligner)
* uncomment the two lines "with open(out_file ...) as f: f.write(tg_txt)" and write TextGrids for an additional hand-check
* do the additional hand-check using Praat (it took us a day for 20k phones)
* COMMENT OUT those two lines again, you do not want to lose this day of work
* evaluate the aligners on the double-checked data

In short, first read the code below. DO NOT ATTEMPT TO JUST BLINDLY RUN IT.
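
One way to reduce the risk of overwriting hand-edited files is to open the output in exclusive-creation mode, so that an existing file is never truncated. The helper below is a minimal sketch of that idea; `write_textgrid_once` is a hypothetical name and is not part of Prak or of this notebook.

```python
from pathlib import Path


def write_textgrid_once(out_file, tg_txt):
    """Write tg_txt to out_file only if the file does not exist yet.

    Hypothetical helper (not part of Prak).
    Returns True if the file was written, False if an existing file was kept.
    """
    path = Path(out_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Mode 'x' raises FileExistsError instead of truncating an existing file.
        with open(path, 'x', encoding='utf-8') as f:
            f.write(tg_txt)
    except FileExistsError:
        return False
    return True
```

With such a helper, re-running a cell that writes TextGrids would leave already hand-edited files untouched, although commenting the write out remains the safest option.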

In [None]:
import os
home = os.getenv("HOME")

In [None]:
# where are phrase textgrids and wavs sent to PP aligner
# this set has corrected phrase tier (made from [Ww]ords tiers, fixing {} issues)
# and is authoritative regarding the files used for the test (problematic files are deleted here)
test_input_dir = home+'/test-prak/compare_pp/test_pp'

# where are manually corrected textgrids
# (there may be additional textgrids here for files excluded above as problematic)
# We want just [Pp]hone tiers from these (there are more tiers and also some point tiers)
# (We do NOT want phrase nor [Ww]ord tiers from this place!)
man_aligned_dir = home+'/test-prak/compare_pp/orig_tg'

# where are textgrids aligned by PP aligner:
#pp_aligned_dir = '~/test-prak/compare_pp/nastrelene_pp'
pp_aligned_dir = home+'/test-prak/compare_pp/nastrelene_pp_new'

# where to put textgrids aligned by Prak, containing also all the other info:
prak_aligned_dir = home+'/test-prak/compare_pp/nastrelene_prak'

# alternative Prak alignment with known pronunciations of foreign words:
#prak_aligned_dir = home+'/test-prak/compare_pp/nastrelene_prak_foreign'
!mkdir -p {prak_aligned_dir}

# where hand-edit via Praat will be done:
ref_repair_dir = home+'/test-prak/repair_ref'

In [None]:
# Use full width and waste less space on prompts on the left:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
display(HTML("<style>.prompt_container{width: 11ex !important; }</style>"))
display(HTML("<style>div.prompt{min-width: 11ex; }</style>"))

In [None]:
!find {test_input_dir} -name '*.wav'|wc
!find {test_input_dir} -name '*.TextGrid'|wc
!find {man_aligned_dir} -name '*.TextGrid'|wc
!find {pp_aligned_dir} -name '*.TextGrid'|wc
!find {prak_aligned_dir} -name '*.TextGrid'|wc

In [None]:
testlist = !find {test_input_dir} -name '*.wav'
testlist = ["/".join(t.split("/")[-2:])[:-4] for t in testlist]
len(testlist)

In [None]:
for t in testlist:
    # UNCOMMENT BELOW TO REPEAT PRAK ALIGNMENT
    #!~/f-w/prak/prak -i {test_input_dir}/{t}.TextGrid -w {test_input_dir}/{t}.wav -o {prak_aligned_dir}/{t}.TextGrid -f -e ~/prak/exceptions.txt --merge-in {pp_aligned_dir}/{t}.TextGrid phone:pp_phone :: {man_aligned_dir}/{t}.TextGrid phone:man_phone Phone:man_phone
    pass

In [None]:
import sys

if sys.path[0] != '..':
    sys.path[0:0] = ['..'] # prepend main Prak directory

from acmodel.praat_ifc import *
from acmodel.evaluate import *

In [None]:
# Take all TextGrids with combined data and evaluate
total = Accumulator()  # any newly used attributes are auto-initialized to 0
max_misplace = 0.1
for t in testlist:
    tg_file = prak_aligned_dir+'/'+t+'.TextGrid'
    tg = read_interval_tiers_from_textgrid_file(tg_file)
    
    man = desampify_phone_tier(tg['man_phone']) # WILL BE REPLACED BELOW BY HAND-FIXED REF
    pp = desampify_phone_tier(tg['pp_phone'])
    ours = desampify_phone_tier(tg['phone'])
    
    # get hand-fixed references (comment this out if you do not have these yet):
    fixed_ref_file = ref_repair_dir+'/'+t+'.TextGrid'
    fixed_tg = read_interval_tiers_from_textgrid_file(fixed_ref_file)
    man = fixed_tg['fix-phone'] # GET HAND-FIXED REF INSTEAD OF THE ORIGINAL ONE
    
    total.man_phones += len(man)
    
    #compare_tiers_detailed(man, pp, total, t)
    compare_tiers_detailed(man, ours, total, t, max_misplace)

print(f"{total=}")
print("Summary results:")
print(f'{total.man_phones} phones, {100*total.dif/total.man_phones:.2f}% mismatched, {100*total.misplaced/total.man_phones:.2f}% misplaced more than {max_misplace}s')

In [None]:
#                        phn err                   Middle shift                         misbeg misend
#                                  100ms   200ms     50ms    30ms   20ms   10ms         100ms
# ours: 20303  586  73    2.89     0.36    0.09      2.40    8.44   17.22  41.75        198 184 !!
# pp:   20303 1327 872    6.54     4.29    3.22      6.26    9.31   14.53  31.68        970 823

# fixed refs:
#              100ms 200ms
# ours:   1.88 0.36  0.09
# pp:     6.61 4.28  3.22

# with all foreign words in exceptions:
# 1.63 0.36

# Create TextGrids highlighting suspicious reference labels
Run this to create input for our additional hand-check.
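
To illustrate what "suspicious" might mean here, the sketch below flags reference intervals whose label or midpoint disagrees with an automatic alignment. It is only an assumption-laden illustration: `flag_suspicious` is a hypothetical function, tiers are assumed to be equally long lists of `(start, end, label)` tuples paired by position, and the actual `prune_tiers_to_suspicious_intervals()` used below may work quite differently.

```python
def flag_suspicious(ref, hyp, max_shift=0.1):
    """Return the ref intervals that disagree with hyp.

    Hypothetical sketch, not the Prak implementation.
    ref, hyp: lists of (start, end, label) tuples, assumed to be paired
    by position. An interval is flagged if the labels differ or if the
    interval midpoints are more than max_shift seconds apart.
    """
    suspicious = []
    for (r_start, r_end, r_label), (h_start, h_end, h_label) in zip(ref, hyp):
        ref_mid = (r_start + r_end) / 2
        hyp_mid = (h_start + h_end) / 2
        if r_label != h_label or abs(ref_mid - hyp_mid) > max_shift:
            suspicious.append((r_start, r_end, r_label))
    return suspicious
```

Keeping only the flagged intervals in a separate tier lets the person doing the hand-check jump between likely errors in Praat instead of listening to every phone.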

In [None]:
# Hardlink test wavs to directory structure where repair TextGrid files will go for hand edit via Praat:
subdirs = !ls {test_input_dir}
for x in subdirs:
    !mkdir -p {ref_repair_dir}/{x}
    #!cp -l {test_input_dir}/{x}/*.wav {ref_repair_dir}/{x}

In [None]:
# Take all TextGrids with all the combined data, prepare TextGrid files for manual edit:
for t in testlist:
    tg_file = prak_aligned_dir+'/'+t+'.TextGrid'
    tg = read_interval_tiers_from_textgrid_file(tg_file)
    man = desampify_phone_tier(tg['man_phone'])
    pp = desampify_phone_tier(tg['pp_phone'])
    ours = desampify_phone_tier(tg['phone'])
    word = tg['word']
    phrase = tg['phrase']

    man_s, our_s = prune_tiers_to_suspicious_intervals(man, ours)
    man_spp, pp_s = prune_tiers_to_suspicious_intervals(man, pp)

    out_tg = {"fix-phone": man, "man_s": man_s, "our-s": our_s, "man_spp": man_spp, "pp-s": pp_s, "ours":ours, "word": word, "phrase": phrase, "phone-sampa": tg["man_phone"], "pp-sampa": tg["pp_phone"]}
    out_file = ref_repair_dir+'/'+t+'.TextGrid'
    tg_txt = textgrid_file_text(out_tg)
    # TEXTGRIDS WERE HAND-EDITED, NEVER OVERWRITE THEM AGAIN, LEAVE BOTH LINES BELOW COMMENTED OUT!!!
    #with open(out_file, 'w', encoding='utf-8') as f:
    #    f.write(tg_txt)