We'll use the [High quality TTS data for Khmer (OpenSLR 42)](https://www.openslr.org/42/) dataset to train the acoustic model.
conda create -n aligner python=3.8 --yes
conda activate aligner
conda install -c conda-forge montreal-forced-aligner --yes
# audio dataset
wget -O km_kh_male.zip https://www.openslr.org/resources/42/km_kh_male.zip
# pronouncing dictionary
wget -O lexicon.txt https://github.com/seanghay/khmer-acoustic-model-mfa/raw/main/lexicon.txt
# uncompress
unzip km_kh_male.zip
Create a transcription file for each audio file.
python preprocess.py
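Here is a minimal sketch of what preprocess.py does, assuming the corpus ships a line_index.tsv file mapping each file ID to its transcript (check the unzipped folder for the actual file name and layout); MFA picks up a .lab file placed next to each .wav:

import csv
from pathlib import Path

wav_dir = Path("km_kh_male/wavs")

# line_index.tsv: one tab-separated row per utterance (file ID, transcript)
with open("km_kh_male/line_index.tsv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        file_id, text = row[0].strip(), row[-1].strip()
        # write the transcript next to its .wav so MFA can find it
        (wav_dir / f"{file_id}.lab").write_text(text, encoding="utf-8")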
mfa train --clean --speaker_characters 8 km_kh_male/wavs lexicon.txt khm_model.zip
This will take quite some time. Once it's done, there will be a khm_model.zip
file, which you can then use for forced alignment.
Each audio file name looks like this: khm_0308_0011865648.
MFA requires speaker labels for speaker-adapted training (SAT), so the --speaker_characters 8
argument tells MFA to take the first 8 characters of each file name (khm_0308_0011865648) as the speaker ID.
It looks like this in Python
speaker_characters = 8
file_name = "khm_0308_0011865648"
speaker_id = file_name[0:speaker_characters]
print(f"{speaker_id=}")
# => speaker_id='khm_0308'
The pronunciation dictionary lexicon.txt
has a limited number of words, which leads to out-of-vocabulary (OOV) errors for missing words, so to be able to generate pronunciations for unseen words we have to train a G2P (grapheme-to-phoneme) model.
mfa train_g2p --phonetisaurus lexicon.txt khm_g2p.zip
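Once trained, the G2P model can generate pronunciations for any missing words, which you can then append to lexicon.txt. The file names below are placeholders, and the positional argument order has changed between MFA releases, so confirm with mfa g2p --help:
mfa g2p oov_words.txt khm_g2p.zip oov_dict.txt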
Now run the alignment. The output files will be in Praat TextGrid format.
mfa align --clean --speaker_characters 8 km_kh_male/wavs lexicon.txt khm_model.zip outputs
For some reason, the program crashes without --beam 100
(most likely the default beam width is too narrow for this data).
mfa align --clean --g2p_model_path khm_g2p.zip sample_audio lexicon.txt khm_model.zip sample_audio --beam 100
This will create a TextGrid file for each input audio file, e.g. ./sample_audio/audio.TextGrid
Preview the result in [Praat: doing phonetics by computer](https://www.praat.org/).
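To consume the alignment programmatically instead, here is a minimal sketch using the third-party textgrid package (pip install textgrid); MFA typically emits a words tier and a phones tier, but verify the tier names in your own output:

import textgrid

tg = textgrid.TextGrid.fromFile("sample_audio/audio.TextGrid")
for tier in tg.tiers:  # typically "words" and "phones"
    print(f"== {tier.name} ==")
    for interval in tier:
        if interval.mark:  # skip empty/silence intervals
            print(f"{interval.minTime:.3f}\t{interval.maxTime:.3f}\t{interval.mark}")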