# You do not need to do any of this
This file walks you through the preprocessing steps for:
1. Cleaning the data,
2. Aligning transcripts to utterances at the phoneme level, and
3. Packaging the data for data science and machine learning uses.

**This is a slow, painful, and iterative process. If you're just interested in data science / machine learning, I recommend skipping this and downloading the end results.**

If you want to help out with programmatic data cleaning, then this is for you. There's a lot of work to be done here.

# Preprocessing parameters

In [6]:
samples_type = 'clipper' # options: "vctk", "clipper"
samples_dir = '/home/celestia/data/clipper-samples'
preproc_dir = '/home/celestia/data/clipper-preproc'
dictionary_dir = '/home/celestia/data/dictionaries'

# Convert clipper-formatted data to mfa-formatted data
The goal here is to run Montreal Forced Aligner (MFA) through Clipper's clips. Clipper's files are flac files and word-level transcripts. MFA takes in 16khz wave files and word-level transcripts, and it outputs phoneme-level transcripts. The `datapipes` module in `src/` can convert Clipper's files into MFA-compatible input.

First step: do a dry-run to check for any errors in the Clipper files we have. Sometimes there's a filename mismatch, a missing character name, missing transcript file, or similar. While running this, you'll see the `In [ ]` on the left-hand side change to `In [*]`. When it's complete, you'll see it change to `In [1]`. The number `[1]` tells you the order in which commands on this page were executed.

In [4]:
!(cd ../src; python -m datapipes --mfa-inputs \
    --input "{samples_dir}" `# clipper-formatted directory` \
    --output "{preproc_dir}/mfa-inputs" `# mfa-formatted directory` \
    --dry-run `# don't create any output files`)

2020-03-08 21:02:12.551126: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-03-08 21:02:12.552440: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
Done


Second step: build the necessary dictionaries. We could technically do this step later, but this is a good way to catch typos in the transcriptions, so we'll do it early. The dictionaries are used to determine how words are pronounced using Arpabet. No single dictionary contains all of the pronunciations we need, so we use two standard dictionaries (LibriSpeech and CMU Dict) plus a custom dictionary (Horsewords) that contains show-specific pronunciations. The standard dictionaries are built on the assumption that the speaker uses the "correct" enunciations, so the Horsewords dictionary also contains "messy" pronunciations, including stutters and slurred speech.

The following command validates and merges the three dictionaries, then checks to make sure all words in Clipper's transcriptions have a pronunciation.

In [15]:
!(cd ../src; python -m datapipes --dictionary \
     --include "{dictionary_dir}"/librispeech.txt \
         "{dictionary_dir}"/cmudict-0.7b.txt \
         "{dictionary_dir}"/horsewords.txt \
         "{dictionary_dir}"/gothicwords.txt \
         "{dictionary_dir}"/whovianwords.txt \
     --clipper-path "{samples_dir}" \
     --output-path "{dictionary_dir}")

2020-03-08 21:23:46.027661: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-03-08 21:23:46.028926: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
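
To make the merge-and-check idea concrete, here is a minimal sketch of what "merge the dictionaries, then look for words without a pronunciation" can look like. It is not the `datapipes` implementation. The function names are made up for illustration, and it assumes each dictionary line looks like `WORD PH1 PH2 ...` (the cmudict layout, with `;;;` comment lines and `WORD(1)` for alternate pronunciations).

```python
import re
from collections import defaultdict


def parse_dictionary(lines):
    """Parse 'WORD PH1 PH2 ...' lines into {word: set of pronunciations}."""
    entries = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blanks and cmudict comments
            continue
        word, *phonemes = line.split()
        word = re.sub(r"\(\d+\)$", "", word).upper()  # "READ(1)" -> "READ"
        if phonemes:
            entries[word].add(" ".join(phonemes))
    return entries


def merge_dictionaries(*dictionaries):
    """Union the pronunciations of every word across all dictionaries."""
    merged = defaultdict(set)
    for dictionary in dictionaries:
        for word, pronunciations in dictionary.items():
            merged[word] |= pronunciations
    return merged


def missing_words(transcript, dictionary):
    """Return the sorted words in a transcript that have no pronunciation."""
    words = re.findall(r"[A-Za-z']+", transcript.upper())
    return sorted({word for word in words if word not in dictionary})
```

Running `missing_words` over every transcript is a cheap way to surface typos: a misspelled word almost never has a dictionary entry.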


If either of the above two commands reports errors, fix them and re-run the commands. Repeat until there are no errors.

The next command generates the mfa-formatted data. If you're running this on all of Clipper's data, this might take an hour to complete. The script also accepts a `--delta` flag, which tells it to only process new files. It does NOT tell the script to update existing files that have changed. The command below doesn't pass `--delta`: it moves the previous output to a `.bak` directory and rebuilds everything from scratch.

In [None]:
%%bash

rm -r /home/celestia/data/mfa-inputs.bak
mv /home/celestia/data/mfa-inputs /home/celestia/data/mfa-inputs.bak

(cd ../src; python -m datapipes --mfa-inputs \
    --input "/home/celestia/data/clipper-samples" \
    --output /home/celestia/data/mfa-inputs)
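
The `--delta` behavior described above boils down to a set difference on names, with no comparison of contents or timestamps. The sketch below only illustrates that idea. `pending_inputs` is a hypothetical helper, not something in `datapipes`, and how the real script matches inputs to outputs is an assumption here.

```python
def pending_inputs(input_names, existing_output_names):
    """Return inputs with no matching output yet, in their original order.

    An input that changed after its output was written is NOT returned,
    because only names are compared.
    """
    done = set(existing_output_names)
    return [name for name in input_names if name not in done]
```

This is why an edited transcript won't be picked up in delta mode: its output already exists, so it is skipped.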

Finally, run montreal-forced-aligner with the following command to generate phoneme-level transcripts. Note that, due to quirks with IPython, this command won't produce intermediate output, so you won't be able to monitor progress here. If you're running this on all of Clipper's data, this command might take a few hours to complete. You can monitor progress by watching the `data/mfa-alignments` directory.

In [None]:
%%bash

rm -r /home/celestia/data/mfa-alignments.bak
mv /home/celestia/data/mfa-alignments /home/celestia/data/mfa-alignments.bak
rm -r /home/celestia/Documents/MFA

function mfa() {
    mkdir -p "/home/celestia/data/mfa-alignments/$1" || true  # -p: the parent was just moved to .bak
    yes n | mfa_align -v `# continue even with an incomplete dictionary` \
        /home/celestia/data/mfa-inputs/$1 `# input directory` \
        /home/celestia/data/dictionaries/normalized.dict.txt \
        /opt/mfa/pretrained_models/english.zip \
        /home/celestia/data/mfa-alignments/$1 `# output directory` \
        || true
}

export -f mfa

ls /home/celestia/data/mfa-inputs | xargs -L1 -P16 bash -c 'mfa $@' _

It's extremely likely that MFA failed on some inputs. There are three ways in which it can fail:
1. MFA found a word it didn't recognize and logged both the missing word and corresponding utterance.
2. MFA failed in some unexpected way while doing preprocessing for a character, and it borked its own configuration files.
3. MFA couldn't figure out how to align the transcript to an utterance.

For the first kind of failure, you can find an `oovs_found.txt` file in each of the directories within `mfa-alignments`. This file contains a list of words that could not be processed because they don't exist in the pronunciation dictionary. The alignment command above reads its pronunciations from `/home/celestia/data/dictionaries/normalized.dict.txt`; another dictionary is available at `/opt/mfa/pronunciations_dicts/english.dict.txt`. If you end up adding the pronunciations of any missing words, make sure to post them to the thread. I can update the Docker image so everyone can benefit from it.

In [None]:
%%bash
cat /home/celestia/data/mfa-alignments/*/oovs_found.txt
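
If the `cat` output gets long, it can help to deduplicate it and see which words are missing for the most characters. A small sketch, assuming `oovs_found.txt` holds one word per line (the helper name `collect_oovs` is made up for illustration):

```python
from collections import Counter
from pathlib import Path


def collect_oovs(alignments_dir):
    """Count, for each out-of-vocabulary word, how many listings mention it."""
    counts = Counter()
    for oov_file in Path(alignments_dir).glob("*/oovs_found.txt"):
        for line in oov_file.read_text(encoding="utf-8").splitlines():
            word = line.strip()
            if word:  # ignore blank lines
                counts[word] += 1
    return counts
```

`collect_oovs("/home/celestia/data/mfa-alignments").most_common(20)` would then list the twenty most widespread missing words, which are the ones worth adding to the dictionary first.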

For the second and third kinds of failure, you can find out which characters MFA failed to process by searching for the empty directories in `mfa-alignments`.

In [None]:
%%bash
find /home/celestia/data/mfa-alignments -type d -empty

If the above command produces any output, it's very likely that MFA stochastically borked something during its own preprocessing stage. The easiest way to handle this is to remove its character-specific cache directory and try again.

The following script does exactly that for the case where MFA fails on Applejack's files. In my case, I needed to run this for Apple-Bloom, Applejack, Cadance, and Rainbow-Dash the most recent time, but MFA's failures are pretty stochastic.

In [None]:
%%bash

retry_character="Applejack"

rm -r "/home/celestia/Documents/MFA/$retry_character"

yes n | mfa_align -v \
        /home/celestia/data/mfa-inputs/$retry_character \
        /opt/mfa/pronunciations_dicts/english.dict.txt \
        /opt/mfa/pretrained_models/english.zip \
        /home/celestia/data/mfa-alignments/$retry_character
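
Since the set of failing characters changes from run to run, it can be handy to list them programmatically instead of reading the `find` output by hand. A minimal sketch (the helper name is hypothetical, and unlike `find -type d -empty` it only looks at the top-level character directories):

```python
from pathlib import Path


def characters_to_retry(alignments_dir):
    """Return the names of character directories that contain nothing at all."""
    return sorted(
        entry.name
        for entry in Path(alignments_dir).iterdir()
        if entry.is_dir() and not any(entry.iterdir())
    )
```

Each returned name can be plugged into `retry_character` in the script above.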

The last type of MFA failure is the only one that's complicated to handle. If you run the following command, you can see a list of transcripts that MFA failed to align.

In [None]:
%%bash

function get_textgrids() {
    (cd "$1"
    find -iname '*.textgrid' |
        sed 's/\.textgrid$//gI' |
        sort)
}

diff <(get_textgrids /home/celestia/data/mfa-inputs) <(get_textgrids /home/celestia/data/mfa-alignments) || true

If you didn't complete the above steps for handling whole-character issues, you'll notice that some characters have a huge number of utterances listed. If you did complete the above steps, none of the characters should have _that_ many failures. For me, the worst offender is Pinkie Pie with 77 failures, followed by Fluttershy and Twilight Sparkle both with around 35. If a character has a huge number of utterances listed, it's likely that MFA crashed at some point. You can read through its logs in `/home/celestia/Documents/MFA/` to try to figure out why. 

You can try playing some of the listed files to figure out why MFA might be failing on them. I've found that it's often because either (1) the character is speaking in a very excited or abnormal way, (2) the clip is noisy or muffled, or (3) the utterance contains a lot of out-of-dictionary words.

Eventually, we'll want to find a way to make use of these utterances to generate more realistic speech in niche cases, but for now we can ignore them.

# Package the data


Package the audio and label data into a tarfile. We're using an uncompressed format because it's much faster to load the uncompressed version, and because compression doesn't save much space with these audio files.

The audio format and sampling rate of the saved files can be modified to anything `pysoundfile` can handle. For now, we're using the most common sampling rate (48khz) and file format (wav) used by Clipper.

This first command does a dry run to make sure there are no errors.


In [None]:
%%bash

(cd ../src; python -m datapipes --audio-tar \
    --input-audio /home/celestia/data/clipper-samples `# clipper-formatted directory` \
    --input-alignments /home/celestia/data/mfa-alignments `# mfa-formatted directory` \
    --output /home/celestia/data/audio-tar `# output per-character tar files here` \
    --audio-format 'wav' \
    --sampling-rate 48000 \
    --dry-run `# don't create any output files` \
    --verbose)

This second command creates the tar archive files with the audio and label files. This is what most people use in the Colab notebooks.

In [None]:
%%bash

rm -r /home/celestia/data/audio-tar.bak
mv /home/celestia/data/audio-tar /home/celestia/data/audio-tar.bak

(cd ../src; python -m datapipes --audio-tar \
    --input-audio /home/celestia/data/clipper-samples \
    --input-alignments /home/celestia/data/mfa-alignments \
    --output /home/celestia/data/audio-tar \
    --audio-format 'wav' \
    --sampling-rate 48000 \
    --verbose)
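
For anyone consuming these archives, here is a rough sketch of reading one with Python's standard `tarfile` module. It assumes each clip is a directory inside the tar holding an `audio.wav` and a `label.json`, which is the layout the trimming scripts further down rely on. `iter_clips` is a made-up name, and the sketch buffers the whole archive in memory, which is fine for a quick look but not for the largest characters.

```python
import json
import tarfile
from pathlib import PurePosixPath


def iter_clips(tar_path):
    """Yield (clip_name, label_dict, wav_bytes) for each complete clip in the tar."""
    clips = {}
    with tarfile.open(tar_path, "r:") as archive:  # "r:" accepts uncompressed tars only
        for member in archive:
            if not member.isfile():
                continue
            path = PurePosixPath(member.name)
            if path.name in ("audio.wav", "label.json"):
                data = archive.extractfile(member).read()
                clips.setdefault(path.parent.name, {})[path.name] = data
    for clip_name, files in sorted(clips.items()):
        if len(files) == 2:  # skip clips missing either file
            yield clip_name, json.loads(files["label.json"]), files["audio.wav"]
```

The raw `wav_bytes` can be wrapped in `io.BytesIO` and handed to whatever audio library you prefer.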

# Package extra phonetics data

There are some sound features that are commonly used in phonetics research. This includes pitch, volume, and formant information. Glottal Closure Instants are also commonly used to identify whether a speech segment is voiced or unvoiced. The following command extracts these features from the audio-tar files created above and writes them to lzma-compressed tar (txz) files.

In [None]:
%%bash

rm -r /home/celestia/data/audio-info.bak
mv /home/celestia/data/audio-info /home/celestia/data/audio-info.bak

function generate_info() {
    source="$(readlink -f $1)"
    target="/home/celestia/data/audio-info/$(basename $source .tar).txz"
    (cd ../src; python -m datapipes --audio-info \
        --input-tar "$source" \
        --output-txz "$target" \
        --verbose) || true
}

export -f generate_info

ls /home/celestia/data/audio-tar/* | xargs -L1 -P16 bash -c 'generate_info $@' _

When working with TensorFlow, packaging data into TFRecord files simplifies the process of loading data, especially when working with multiple GPUs. The following command packages all of the above label files (Clipper's labels, MFA phoneme transcriptions, and phonetics features) into TFRecord files. This can be used to create label embeddings.

It's recommended practice to only store 100MB to 200MB of data per file. The generated TFRecord files will contain information on up to 24,000 clips, a number that's heuristically estimated to contain 150MB of data. Each generated file will end with ".tfrecordN", where N is a number starting from 0. Currently, none of the characters have more than 24,000 clips, so every character's file ends with ".tfrecord0".

You can use the same command to generate a set of 100MB-200MB files containing all clips all mixed together. This would make more sense for training cross-character models.
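
The file-naming rule above can be written down in a few lines. This is only an illustration of the arithmetic (150MB over 24,000 clips works out to roughly 6KB of label data per clip), not the code `datapipes` uses. `shard_filenames` and its parameters are hypothetical.

```python
def shard_filenames(base_name, clip_count, clips_per_file=24000):
    """Return the '.tfrecordN' file names needed to hold clip_count clips."""
    shard_count = -(-clip_count // clips_per_file)  # ceiling division
    return [f"{base_name}.tfrecord{n}" for n in range(shard_count)]
```

A character with 24,000 clips or fewer gets a single `.tfrecord0` file; one more clip would add a `.tfrecord1`.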

# Project-specific packaging

## Tensorflow dataset

In [None]:
%%bash

function generate_tfrecords() {
    audio="$(readlink -f $1)"
    info="/home/celestia/data/audio-info/$(basename $audio .tar).txz"
    tfrecord="/home/celestia/data/labels-tfrecord/$(basename $audio .tar).tfrecord"
    (cd ../src; python -m datapipes --labels-tfrecord \
        --input-labels "$audio" \
        --input-info "$info" \
        --output-tfrecord "$tfrecord" \
        --verbose) || true
}

export -f generate_tfrecords

ls /home/celestia/data/audio-tar/* | xargs -L1 -P16 bash -c 'generate_tfrecords $@' _

## Cookie-normalized data
Cookie's (initial) scripts require the clips to be 22khz and trimmed. The following script was used before to package clips as 48khz files. We'll use the same script to create the 22khz files.

In [None]:
%%bash

rm -r /home/celestia/data/audio-tar-22khz.bak
mv /home/celestia/data/audio-tar-22khz /home/celestia/data/audio-tar-22khz.bak

(cd ../src; python -m datapipes --audio-tar \
    --input-audio /home/celestia/data/clipper-samples \
    --input-alignments /home/celestia/data/mfa-alignments \
    --output /home/celestia/data/audio-tar-22khz \
    --audio-format 'wav' \
    --sampling-rate 22050 \
    --verbose)

We can trim the 22khz audio with the following `sox` command. (Thanks, Cookie!)

In [None]:
%%bash

rm -r /home/celestia/data/audio-trimmed.bak
mv /home/celestia/data/audio-trimmed /home/celestia/data/audio-trimmed.bak

function trim_character() {
    input_tar="$(readlink -f $1)"
    output_folder="/home/celestia/data/audio-trimmed/$(basename $input_tar .tar)"
    echo "writing to $output_folder"
    mkdir -p "$output_folder"
    (cd "$output_folder"
        tar xf "$input_tar";
        for clip in `ls`; do
            cp "$clip/audio.wav" "tmp.wav";
            sox "tmp.wav" "$clip/audio.wav" silence 1 0.05 0.1% reverse silence 1 0.05 0.1% reverse;
        done
        
        rm "tmp.wav"
    )
}

export -f trim_character

ls /home/celestia/data/audio-tar-22khz/* | xargs -L1 -P16 bash -c 'trim_character $@' _
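
If the `sox` incantation looks cryptic: `silence 1 0.05 0.1%` drops everything before the first 0.05 seconds of audio louder than 0.1% of full scale, and the `reverse ... reverse` pair applies the same rule to the end of the clip. The sketch below mimics that on a plain list of float samples in the range -1 to 1. It is a simplification I'm assuming is close enough for illustration (it checks individual samples, while sox's detection is more involved), and the function names are made up.

```python
def first_sound(samples, threshold, min_run):
    """Return the index where the first run of min_run loud samples begins."""
    run = 0
    for index, sample in enumerate(samples):
        if abs(sample) > threshold:
            run += 1
            if run >= min_run:
                return index - min_run + 1
        else:
            run = 0
    return len(samples)  # no sound found at all


def trim_silence(samples, sample_rate, threshold=0.001, duration=0.05):
    """Strip leading and trailing silence from a list of samples."""
    min_run = max(1, int(duration * sample_rate))
    start = first_sound(samples, threshold, min_run)
    # Scanning the reversed clip finds how much silence sits at the end.
    end = len(samples) - first_sound(samples[::-1], threshold, min_run)
    return samples[start:end] if start < end else samples[:0]
```

Requiring a run of loud samples, not just one, keeps a single click in otherwise silent audio from being treated as the start of speech.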

And finally package the results into `tar` files.

In [None]:
%%bash

(cd /home/celestia/data/audio-trimmed
 pwd
 for character in `ls`; do
     (cd "$character/"
      tar -cf "../$character.tar" */audio.wav */label.json)
 done)

rm -r /home/celestia/data/audio-trimmed-22khz.bak
mv /home/celestia/data/audio-trimmed-22khz /home/celestia/data/audio-trimmed-22khz.bak

mkdir /home/celestia/data/audio-trimmed-22khz
mv /home/celestia/data/audio-trimmed/*.tar /home/celestia/data/audio-trimmed-22khz/

# Test the packages

In [2]:
%%bash

(cd ../; pytest --rootdir=src/ tests/test_datafiles.py)

platform linux -- Python 3.6.9, pytest-5.3.2, py-1.8.1, pluggy-0.13.1
rootdir: /home/celestia/synthbot/src
plugins: cov-2.8.1
collected 6 items

src ......                                                               [100%]



# Push the data to Google Drive


In [None]:
%%bash

local="/home/celestia/data"
remote_hidden="/home/celestia/drive/soundtools-alpha/"

rm $remote_hidden/audio-tar.new/*
cp $local/audio-tar/* $remote_hidden/audio-tar.new/

rm $remote_hidden/audio-info.new/*
cp $local/audio-info/* $remote_hidden/audio-info.new/

rm $remote_hidden/audio-trimmed-22khz.new/*
cp $local/audio-trimmed-22khz/* $remote_hidden/audio-trimmed-22khz.new/