# Loading the speech corpus

Before starting this notebook, make sure you've worked through `notebooks/Preprocess data.ipynb`.

This notebook will walk you through the main synthbot components for working with utterances. The goal here is to generate a 'hello world' utterance using clips of Twilight speaking. This requires:
1. Loading Twilight's speech corpus,
2. Finding relevant sound clips, and
3. Piecing them together into a single utterance.

Let's start by running the tests to make sure we have a working installation.

In [None]:
!../run-tests.sh

If you see a red bar at the bottom stating that some tests failed, something's wrong. If you see a green bar stating the number of tests that passed, you're good to go.

Import the relevant modules:

In [None]:
import sys
sys.path.append('../src')
import IPython.display as ipd
from ponysynth.soundsheaf import *
from ponysynth.speechcorpus import *

Walking through the imports:
* Everything in this notebook is running in an IPython interpreter. The `sys.path.append('../src')` tells IPython where to find the `ponysynth` module. We're currently in `synthbot/notebooks`, and we need to add `synthbot/src` to the Python path.
* We'll be using `IPython.display` to play the audio we generate.
* `ponysynth.soundsheaf` contains the basic sound data structures that `ponysynth` uses. It also contains utility functions to work with `SoundSheaf`s, including one to piece together audio.
* `ponysynth.speechcorpus` contains the basic corpus management data structures and utility functions.

We'll be using `load_character_corpus` from `ponysynth.speechcorpus` to load audio and phoneme-level transcripts for Twilight. These are the same phoneme-level transcripts that we generated in `notebooks/Preprocess data.ipynb`.

In [None]:
twilight = load_character_corpus(
    audio_folder='/home/celestia/data/mfa-inputs/Twilight-Sparkle',
    transcripts_folder='/home/celestia/data/mfa-alignments/Twilight-Sparkle')

`load_character_corpus` returns an object of type `SpeechCorpus`. Right now, the only thing `SpeechCorpus` does is index audio files so phones (the sounds associated with phonemes) are easy to find. We can use that to find utterances that sound like parts of "hello world". Note that to find the phonemes for "hello world", I looked them up in the pronunciation dictionary mentioned in `notebooks/Preprocess data.ipynb`.
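To make the lookup concrete, here is a minimal sketch of how the phoneme list and its adjacent pairs could be derived. It assumes the pronunciation dictionary uses CMUdict-style lines (a word followed by its phonemes, with alternates marked like `HELLO(1)`); the `SAMPLE_DICT` excerpt and the `parse_pronunciations` helper are hypothetical and are not part of `ponysynth`.

```python
# Hypothetical excerpt in CMUdict-style format: WORD followed by its phonemes.
SAMPLE_DICT = """\
HELLO  HH AH0 L OW1
HELLO(1)  HH EH0 L OW1
WORLD  W ER1 L D
"""

def parse_pronunciations(text):
    """Map each word to a list of pronunciations (each a list of phonemes)."""
    pronunciations = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(';;;'):
            continue  # skip blank lines and comment lines
        word, *phonemes = line.split()
        word = word.split('(')[0]  # drop alternate markers such as "(1)"
        pronunciations.setdefault(word, []).append(phonemes)
    return pronunciations

pronunciations = parse_pronunciations(SAMPLE_DICT)

# Use the second listed pronunciation of "hello" and the only one of "world".
phonemes = pronunciations['HELLO'][1] + pronunciations['WORLD'][0]

# Pair each phoneme with its successor to get the diphonemes.
diphonemes = list(zip(phonemes[:-1], phonemes[1:]))

print(phonemes)
print(diphonemes[:2])
```

Eight phonemes give seven overlapping pairs. The first and last phonemes each appear in only one pair, which is presumably why the search below also looks up `'HH'` and `'D'` on their own.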

The following will find a set of phones and diphones sufficient to say "hello world":

In [None]:
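# Phonemes for "hello world", taken from the pronunciation dictionary.
target_phonemes = ['HH', 'EH0', 'L', 'OW1', 'W', 'ER1', 'L', 'D']
# Pair each phoneme with its successor: ('HH', 'EH0'), ('EH0', 'L'), ...
target_diphonemes = list(zip(target_phonemes[:-1], target_phonemes[1:]))

# For each diphone, take the first matching utterance in the corpus.
# Note: next(x, None) yields None when nothing matches, in which case the
# attribute access below raises AttributeError.
available_diphones = [twilight.find_utterances(x) for x in target_diphonemes]
diphone_selections = [next(x, None).diphone_sequence for x in available_diphones]

# A single phone to lead into the first diphone...
available_pre = twilight.find_utterances(['HH'])
pre_selection = next(available_pre, None)

# ...and a single phone to close out the last one.
available_post = twilight.find_utterances(['D'])
post_selection = next(available_post, None)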

The result is a list of utterances, each of which contains one part of "hello world". The following will merge these utterances together into a single "sound":

In [None]:
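from functools import reduce  # reduce is not a builtin in Python 3

# Chain the diphones left to right, then attach the leading and trailing phones.
chained_diphones = reduce(merge_diphseq, diphone_selections)
hello_utterance = merge_utterances(pre_selection, chained_diphones, post_selection)
hello_soundsheaf = hello_utterance.get_sheaf()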

A note on terminology: a sheaf is spatial (or temporal) data. A "soundsheaf" deals specifically with temporal PCM (frame) data. A soundsheaf is defined by (1) a time interval, and (2) an assignment of PCM data to each point in that time interval. I'm using the word "soundsheaf" because the word "sound" is ambiguous. Sometimes "sound" refers to PCM data, sometimes it refers to frequencies, and sometimes it refers to a soundsheaf.
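As a rough illustration of that definition, the sketch below models a soundsheaf as a start time, a frame rate, and a list of PCM frames. `ToySoundSheaf` and its fields are hypothetical names chosen for this example; the real `SoundSheaf` class in `ponysynth.soundsheaf` may be organized quite differently.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ToySoundSheaf:
    """Hypothetical stand-in for illustration; not ponysynth's SoundSheaf."""
    start: float       # start of the time interval, in seconds
    rate: int          # frames per second
    frames: List[int]  # one PCM value per frame

    @property
    def end(self) -> float:
        """End of the time interval, in seconds."""
        return self.start + len(self.frames) / self.rate

    def frame_at(self, t: float) -> int:
        """Return the PCM value assigned to time t within [start, end)."""
        if not (self.start <= t < self.end):
            raise ValueError(f"{t} is outside [{self.start}, {self.end})")
        return self.frames[int((t - self.start) * self.rate)]

# Half a second of placeholder frame values at 16 kHz, starting at t = 1.0 s.
clip = ToySoundSheaf(start=1.0, rate=16000, frames=list(range(8000)))
print(clip.end)
print(clip.frame_at(1.25))
```

The point of the interval is that two clips can be compared or aligned by time rather than by list index, which matters once clips from different recordings are pieced together.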

An utterance is a soundsheaf with content (e.g., words, phonemes). Each constituent soundsheaf in `hello_soundsheaf` contains at least one phone or diphone of content. When we merge soundsheaves with `merge_diphseq` and `merge_utterances`, we're hoping that the result sounds natural, but it might not. The reason is that, in natural speech, adjacent sounds need to abide by some rules (e.g., they should be spoken in similar voices), and our naive `merge_*` implementations might break those rules.
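Here is a small sketch of one way a naive merge can go wrong. It uses a simpler rule than voice similarity, namely that a waveform should not jump abruptly between adjacent frames, and it assumes a merge that just places one clip after the other. This is an illustration only; `sine_clip` and `largest_step` are hypothetical helpers, and `merge_diphseq` and `merge_utterances` may behave differently.

```python
import math

RATE = 16000  # frames per second, matching the playback rate used in this notebook

def sine_clip(freq_hz, n_frames):
    """Generate n_frames of a unit-amplitude sine wave starting at phase zero."""
    return [math.sin(2 * math.pi * freq_hz * i / RATE) for i in range(n_frames)]

def largest_step(frames):
    """Largest absolute change between two adjacent frames."""
    return max(abs(b - a) for a, b in zip(frames, frames[1:]))

# A 100 Hz sine has a period of 160 frames, so frame 40 sits at a peak (about 1.0).
clip_a = sine_clip(100, 41)
# The second clip starts again from phase zero (value 0.0).
clip_b = sine_clip(100, 160)

# Naive merge: place one clip directly after the other.
naive = clip_a + clip_b

print(largest_step(clip_a))  # small: the sine changes smoothly
print(largest_step(naive))   # large: the join jumps from about 1.0 to 0.0
```

Each clip is smooth on its own, but the join introduces a jump that would be heard as a click. Mismatched pitch, loudness, or speaking style at a boundary are analogous problems at a larger scale.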

One of our eventual goals will be to adjust the merged soundsheaves during the merging process so those rules don't get broken. For now, we have:

In [None]:
ipd.Audio(hello_soundsheaf.get_image(), rate=16000)