# Convert Clipper-formatted data to MFA-formatted data
This file walks through three pre-processing steps:
1. Cleaning the data,
2. Converting the data for tool use, and
3. Generating phone-level alignments.

The goal here is to run Montreal Forced Aligner (MFA) on Clipper's clips. Clipper's files are FLAC audio files with word-level transcripts. MFA takes 16 kHz WAV files and word-level transcripts as input, and it outputs phoneme-level transcripts. The `datapipes` module in `src/` converts Clipper's files into MFA-compatible input.

First step: do a dry run to check for errors in the Clipper files we have. Sometimes there's a filename mismatch, a missing character name, a missing transcript file, or similar. While this is running, you'll see the `In [ ]` on the left-hand side change to `In [*]`. When it's complete, you'll see it change to a number such as `In [1]`. The number in brackets tells you the order in which commands on this page were executed.

In [None]:
!(cd ../src; python -m datapipes \
    --input /home/celestia/data/clipper-samples `# clipper-formatted directory` \
    --output /home/celestia/data/mfa-inputs `# mfa-formatted directory` \
    --delta `# ignore files already processed` \
    --dry-run `# don't create any output files`)

If there are any errors, make sure to fix them and re-run the above command. Repeat until there are no errors, then run the next command to generate the mfa-formatted data. If you're running this on all of Clipper's data, this might take an hour to complete.

In [None]:
!(cd ../src; python -m datapipes \
    --input /home/celestia/data/clipper-samples \
    --output /home/celestia/data/mfa-inputs \
    --delta)
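
Once the conversion finishes, it can be worth spot-checking the output before aligning. The sketch below is not part of `datapipes`. It assumes each clip under `mfa-inputs` is a `.wav` file with a same-named `.textgrid` transcript (as the filenames further down this page suggest) and that the audio should be 16 kHz. The names `find_unpaired` and `find_wrong_rate` are hypothetical helpers made up for this example.

```python
import wave
from pathlib import Path


def find_unpaired(root):
    """Return sorted clip paths (without extension) missing either the .wav or the .textgrid."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    wavs = {p.with_suffix("") for p in files if p.suffix.lower() == ".wav"}
    grids = {p.with_suffix("") for p in files if p.suffix.lower() == ".textgrid"}
    # Symmetric difference: clips that have one of the two files but not the other.
    return sorted(str(p) for p in wavs ^ grids)


def find_wrong_rate(root, expected_hz=16000):
    """Return (path, rate) for each .wav whose sample rate is not expected_hz.

    rate is None when the stdlib wave module cannot read the file
    (it only understands uncompressed PCM).
    """
    bad = []
    for path in sorted(Path(root).rglob("*.wav")):
        try:
            with wave.open(str(path), "rb") as clip:
                rate = clip.getframerate()
        except wave.Error:
            rate = None
        if rate != expected_hz:
            bad.append((str(path), rate))
    return bad


if __name__ == "__main__":
    inputs = "/home/celestia/data/mfa-inputs"  # same output path as the cell above
    if Path(inputs).is_dir():
        print("unpaired clips:", find_unpaired(inputs))
        print("unexpected sample rates:", find_wrong_rate(inputs))
```

Both functions return empty lists when everything looks consistent, so anything printed is worth a second look before starting the aligner.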

The two files below (the audio and transcript for a single clip) are known to cause problems for montreal-forced-aligner. Remove them before running the forced aligner.

In [None]:
# this one screws with montreal-forced-aligner for some reason
!rm /home/celestia/data/mfa-inputs/Pinkie-Pie/S5_s5e19_00_02_05_Pinkie_Neutral__A.wav
!rm /home/celestia/data/mfa-inputs/Pinkie-Pie/S5_s5e19_00_02_05_Pinkie_Neutral__A.textgrid

Finally, run montreal-forced-aligner with the following command to generate phoneme-level transcripts. Note that, due to quirks with IPython, this command won't produce intermediate output, so you won't be able to monitor progress here. If you're running this on all of Clipper's data, this command might take a few hours to complete. You can monitor progress by watching the `data/mfa-alignments` directory.

In [None]:
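%%bash
# Run the aligner once per character directory in mfa-inputs.
ls /home/celestia/data/mfa-inputs | while read -r line;
do
    # -p: don't fail if the output directory already exists (e.g. on a re-run)
    mkdir -p "/home/celestia/data/mfa-alignments/$line"

    # "yes n" answers "n" to any prompt from mfa_align so the loop never blocks
    yes n | mfa_align -v `# continue even with an incomplete dictionary` \
        "/home/celestia/data/mfa-inputs/$line" `# input directory` \
        /opt/mfa/pronunciations_dicts/english.dict.txt \
        /opt/mfa/pretrained_models/english.zip \
        "/home/celestia/data/mfa-alignments/$line" `# output directory`
done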
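
Since the alignment cell prints nothing while it runs, one way to watch progress is to compare the number of input clips with the number of alignment files written so far. This is a rough sketch, assuming the aligner writes one TextGrid file per clip into the matching character directory. `count_files` and `alignment_progress` are hypothetical helpers, not part of MFA. Because the notebook kernel is busy during alignment, you would run this from a separate notebook or terminal.

```python
from pathlib import Path


def count_files(directory, suffix):
    """Count files under directory whose extension matches suffix (case-insensitive)."""
    directory = Path(directory)
    if not directory.is_dir():
        return 0
    return sum(
        1 for p in directory.rglob("*")
        if p.is_file() and p.suffix.lower() == suffix
    )


def alignment_progress(inputs_root, alignments_root):
    """Map each character directory name to (alignments written, input clips)."""
    progress = {}
    for char_dir in sorted(Path(inputs_root).iterdir()):
        if not char_dir.is_dir():
            continue
        total = count_files(char_dir, ".wav")
        done = count_files(Path(alignments_root) / char_dir.name, ".textgrid")
        progress[char_dir.name] = (done, total)
    return progress


if __name__ == "__main__":
    inputs = "/home/celestia/data/mfa-inputs"
    outputs = "/home/celestia/data/mfa-alignments"
    if Path(inputs).is_dir():
        for name, (done, total) in alignment_progress(inputs, outputs).items():
            print(f"{name}: {done}/{total}")
```

Clips containing words the aligner cannot handle may never produce an output file, so a directory can finish with fewer alignments than inputs.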

In each of the directories within `mfa-alignments`, you'll find an `oovs_found.txt` file. This file contains a list of words that could not be processed because they don't exist in the pronunciation dictionary. You can find the current pronunciation dictionary in `/opt/mfa/pronunciations_dicts/english.dict.txt`. If you end up adding the pronunciations of any missing words, make sure to post them to the thread. I can update the Docker image so everyone can benefit from it.

In [None]:
!cat /home/celestia/data/mfa-alignments/Twilight-Sparkle/oovs_found.txt
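
To see which missing words matter most across all characters, the per-directory lists can be merged. This is a minimal sketch, assuming each `oovs_found.txt` holds one word per line. `collect_oovs` is a hypothetical helper, not part of MFA or `datapipes`.

```python
from collections import Counter
from pathlib import Path


def collect_oovs(alignments_root):
    """Count how many character directories each out-of-vocabulary word appears in."""
    counts = Counter()
    for oov_file in sorted(Path(alignments_root).glob("*/oovs_found.txt")):
        lines = oov_file.read_text(encoding="utf-8").splitlines()
        # Use a set so a word listed twice in one file is counted once per directory.
        words = {line.strip() for line in lines}
        words.discard("")
        counts.update(words)
    return counts


if __name__ == "__main__":
    root = "/home/celestia/data/mfa-alignments"
    if Path(root).is_dir():
        for word, n_dirs in collect_oovs(root).most_common(20):
            print(f"{word}\t{n_dirs}")
```

Words that show up in many directories are probably the most useful ones to add to the pronunciation dictionary first.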