A basic forced aligner using Kaldi and gruut for multiple human languages.
git clone https://github.com/rhasspy/kaldi-align
cd kaldi-align
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip3 install --upgrade pip
pip3 install --upgrade wheel setuptools
pip3 install -r requirements.txt
kaldi-align --version
You will also need some system libraries for Kaldi and ffmpeg to convert audio:
sudo apt-get install libopenblas-base libgfortran5 ffmpeg
Create a JSONL alignment file from WAV files and a CSV file with the format id|text:
kaldi-align \
--model en-us \
--metadata /path/to/metadata.csv \
--audio-files <(find /path/to/wavs -name '*.wav' -type f) \
--output-file alignments.jsonl
Text from your metadata.csv file will be automatically cleaned using gruut (punctuation and non-words removed). You can save the cleaned metadata by providing a path to --clean-metadata (optional).
If your metadata CSV file has the format id|speaker|text, pass --has-speaker to kaldi-align.
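The two metadata layouts described above (id|text and id|speaker|text) can be illustrated with a small parser. This is only a sketch of the file format, not kaldi-align's own loader: the function name `parse_metadata_line` is hypothetical, and keeping any extra `|` characters inside the text field is an assumption of this example.

```python
from typing import Optional, Tuple


def parse_metadata_line(line: str, has_speaker: bool = False) -> Tuple[str, Optional[str], str]:
    """Split one metadata row into (id, speaker, text); speaker is None for id|text rows."""
    expected = 3 if has_speaker else 2
    # Limit the split so any extra '|' characters stay inside the text field
    # (an assumption of this sketch, not documented kaldi-align behavior).
    parts = line.rstrip("\r\n").split("|", maxsplit=expected - 1)
    if len(parts) != expected:
        raise ValueError(f"expected {expected} '|'-separated fields: {line!r}")
    if has_speaker:
        utt_id, speaker, text = parts
        return utt_id, speaker, text
    utt_id, text = parts
    return utt_id, None, text


print(parse_metadata_line("utt1|Hello, world!"))
print(parse_metadata_line("utt2|alice|How are you?", has_speaker=True))
```

Running a check like this over a metadata file before alignment can catch rows that are missing a field.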
With the alignment JSONL file, you can create:
- A CSV file with phoneme ids using gruut that is suitable for training a Larynx
- Trimmed versions of your WAV files with silence removed from front and back
Create a CSV file with the format id|P P P, where each P is a phoneme id:
align2csv \
--language <LANG> \
--alignments alignments.jsonl \
--phoneme-ids /path/to/phonemes.txt \
> /path/to/phonemes.csv
where <LANG> is one of gruut's supported languages. The align2csv script runs python3 -m kaldi_align.align2csv under the hood.
The --phoneme-ids path is optional, but recommended. When it is given, align2csv writes a text file with the map between IPA text phonemes and the integer ids used in the CSV output.
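To make the id|P P P output format concrete, here is a minimal reader for one line of the phoneme CSV. It relies only on what is stated above (an id, a `|`, then space-separated integer phoneme ids); the function name `parse_phoneme_line` and the sample ids are made up for illustration.

```python
from typing import List, Tuple


def parse_phoneme_line(line: str) -> Tuple[str, List[int]]:
    """Split one 'id|P P P' row into the utterance id and its integer phoneme ids."""
    utt_id, phoneme_field = line.rstrip("\r\n").split("|", maxsplit=1)
    # Phoneme ids are separated by whitespace; int() rejects anything non-numeric.
    return utt_id, [int(p) for p in phoneme_field.split()]


# Illustrative row only; real ids depend on the phoneme map written by --phoneme-ids.
print(parse_phoneme_line("utt1|12 5 33"))
```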
Trim silence from WAV files:
align2wavs \
--metadata /path/to/metadata.csv \
--audio-files <(find /path/to/wavs -name '*.wav' -type f) \
--alignments alignments.jsonl \
--output-dir /path/to/aligned/wavs/
Trimmed versions of all WAV files with at least one word will be written to --output-dir along with the metadata from --metadata. The align2wavs script runs python3 -m kaldi_align.align2wavs under the hood.
Note that --audio-files accepts the path to a file that lists one audio file path per line. These paths should not contain spaces.
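If process substitution is not available in your shell, the list file can be prepared ahead of time. The sketch below collects WAV paths, refuses any path containing a space (per the note above), and writes one path per line. All function names here are hypothetical helpers, not part of kaldi-align.

```python
from pathlib import Path
from typing import Iterable, List


def paths_with_spaces(paths: Iterable[str]) -> List[str]:
    """Return the paths that contain a space character."""
    return [p for p in paths if " " in p]


def format_audio_list(paths: Iterable[str]) -> str:
    """Render one audio file path per line."""
    return "".join(f"{p}\n" for p in paths)


def write_audio_list(wav_dir: str, list_path: str) -> List[str]:
    """Find *.wav files under wav_dir and write them to list_path, one per line."""
    wav_paths = sorted(str(p) for p in Path(wav_dir).rglob("*.wav") if p.is_file())
    bad = paths_with_spaces(wav_paths)
    if bad:
        raise ValueError(f"audio paths must not contain spaces: {bad}")
    Path(list_path).write_text(format_audio_list(wav_paths), encoding="utf-8")
    return wav_paths
```

The resulting file can then be passed as the --audio-files argument in place of the `<(find ...)` expression.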
If your metadata CSV file has the format id|speaker|text, pass --has-speaker to align2wavs.
Kaldi models will be automatically downloaded on first use and stored in $HOME/.local/share/kaldi_align. You may also manually download them.
- Czech (cs-cz)
- German (de-de)
- English (en-us)
- Spanish (es-es)
- Persian/Farsi (fa)
- French (fr-fr)
- Italian (it-it)
- Dutch (nl)
- Russian (ru-ru)
- Swedish (sv-se)
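To see what has already been downloaded, you can inspect the model directory mentioned above. This is a minimal sketch: the helper names are hypothetical, and it makes no assumption about how kaldi-align lays out files inside that directory; it simply lists whatever entries exist.

```python
from pathlib import Path
from typing import List, Optional


def model_root(home: Optional[str] = None) -> Path:
    """Return $HOME/.local/share/kaldi_align for the given (or current) home directory."""
    base = Path(home) if home is not None else Path.home()
    return base / ".local" / "share" / "kaldi_align"


def list_entries(root: Path) -> List[str]:
    """List entry names under root, or an empty list if the directory does not exist yet."""
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir())


root = model_root()
print(root, list_entries(root))
```

An empty list simply means no model has been fetched yet on this machine.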
- gruut
- ffmpeg
- pydub
- kaldi
  - Automatically downloaded on first use