# Seminar: Diphone Synthesis
In this seminar we will build the simplest possible synthesis system: a diphone model.
<img src="concat-scheme.png">
We will use part of the LJSpeech dataset.
Your task will be to design the search for and concatenation of the units.
Preprocessing has already been done for the test samples (your home assignment will be to create a small g2p for the CMU English phoneset).

## Alignment
The first and very important part of data preparation is alignment: we need to determine the timings of the phonemes that our utterance consists of.
Even though concatenative synthesis is no longer used in production, alignment is still an important stage for upsampling-based parametric acoustic models (e.g. FastSpeech).

### Montreal Forced Aligner
To process the audio we will use MFA.

At the alignment stage we run a xent-trained TDNN ASR system with the output text fixed, and determine the most probable positions of the phonemes on the timeline.

In [None]:
import sys
if 'google.colab' in sys.modules:
    !wget -q https://raw.githubusercontent.com/yandexdataschool/speech_course/main/week_09/wavs_need.txt
    !wget -q https://raw.githubusercontent.com/yandexdataschool/speech_course/main/week_09/test_phones.txt
    !wget -q https://raw.githubusercontent.com/yandexdataschool/speech_course/main/week_09/fallback_rules.txt

In [None]:
%%writefile install_mfa.sh
#!/bin/bash

## a script to install Montreal Forced Aligner (MFA)

root_dir=${1:-/tmp/mfa}
mkdir -p $root_dir
cd $root_dir

# download miniconda3
wget -q --show-progress https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $root_dir/miniconda3 -f

# create py38 env
$root_dir/miniconda3/bin/conda create -n aligner -c conda-forge openblas python=3.8 openfst pynini ngram baumwelch -y
source $root_dir/miniconda3/bin/activate aligner

# install mfa, download kaldi
pip install montreal-forced-aligner praat-textgrids  # install requirements
pip install git+https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner.git # install latest updates

mfa thirdparty download

echo -e "\n======== DONE =========="
echo -e "\nTo activate MFA, run: source $root_dir/miniconda3/bin/activate aligner"
echo -e "\nTo delete MFA, run: rm -rf $root_dir"
echo -e "\nSee: https://montreal-forced-aligner.readthedocs.io/en/latest/aligning.html to know how to use MFA"

In [None]:
# download and install mfa
INSTALL_DIR="/tmp/mfa" # path to install directory

!bash ./install_mfa.sh {INSTALL_DIR}

In [None]:
!source {INSTALL_DIR}/miniconda3/bin/activate aligner; mfa align --help

### LJSpeech data subset
Here we will download the dataset.
However, we don't need the whole of LJSpeech for diphone synthesis (and processing all of it would take quite a while).
We will take about 1/10 of the dataset, which is more than enough for diphone TTS.

In [None]:
!echo "download and unpack ljs dataset"
!mkdir -p ./ljs; cd ./ljs; wget -q --show-progress https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
!cd ./ljs; tar xjf LJSpeech-1.1.tar.bz2

In [None]:
# We need sox to convert audio to 16kHz (the format alignment works with)
!sudo apt install -q -y sox
!sudo apt install -q -y libopenblas-dev

In [None]:
!mkdir -p ./wav
!cat wavs_need.txt | xargs -I F -P 30 sox --norm=-3 ./ljs/LJSpeech-1.1/wavs/F.wav -r 16k -c 1 ./wav/F.wav
!echo "Number of clips" $(ls ./wav/ | wc -l)

There should be 1273 clips here.

In [None]:
with open('wavs_need.txt') as ifile:
    wavs_need = {l.strip() for l in ifile}

In [None]:
# metadata to transcripts
with open('./ljs/LJSpeech-1.1/metadata.csv', 'r') as ifile:
    lines = ifile.readlines()
for line in lines:
    fn, _, transcript = line.strip().split('|')
    if fn in wavs_need:
        with open(f'./wav/{fn}.txt', 'w') as ofile:
            ofile.write(transcript)

!echo "Number of transcripts" $(ls ./wav/*.txt | wc -l)

Let's download the artifacts for alignment.

For phoneme ASR we need an acoustic model and a lexicon (a word => phonemes mapping) produced by some other g2p.

In [None]:
!wget -q --show-progress https://github.com/MontrealCorpusTools/mfa-models/raw/master/acoustic/english.zip
!wget -q --show-progress http://www.openslr.org/resources/11/librispeech-lexicon.txt

Finally, we come to the alignment.

Aligning our subset will take about 15-17 minutes.

In [None]:
!source {INSTALL_DIR}/miniconda3/bin/activate aligner; \
mfa align -t ./temp -c -j 4 ./wav librispeech-lexicon.txt ./english.zip ./ljs_aligned
!echo "See output files at ./ljs_aligned"

In [None]:
!ls ljs_aligned/|wc -l 

In [None]:
import IPython.display
from IPython.display import display

def display_audio(data):
    display(IPython.display.Audio(data, rate=22050))

In [None]:
import numpy as np
from scipy.io import wavfile
import textgrids
import glob

The alignment outputs are TextGrid files (the Praat annotation format): a structure with tiers for phonemes and words, each interval carrying its timings.
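
To make the structure concrete, here is a minimal sketch that uses hypothetical stand-in data instead of a real alignment file. The `Interval` namedtuple, the `toy_grid` dict, and the `to_samples` helper are illustrative assumptions; only the attribute names `text`, `xmin`, and `xmax` mirror what this notebook reads from the parsed grids. The timings are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for one interval of a tier; the attribute names
# mirror the ones this notebook reads later (text, xmin, xmax).
Interval = namedtuple("Interval", ["text", "xmin", "xmax"])

# Made-up timings (in seconds) for the word "hello"; not taken from the dataset.
toy_grid = {
    "words": [Interval("hello", 0.0, 0.75)],
    "phones": [
        Interval("HH", 0.0, 0.125),
        Interval("AH0", 0.125, 0.25),
        Interval("L", 0.25, 0.5),
        Interval("OW1", 0.5, 0.75),
    ],
}


def to_samples(interval, sample_rate=22050):
    """Convert an interval's boundaries from seconds to sample indices."""
    return int(interval.xmin * sample_rate), int(interval.xmax * sample_rate)


# One (label, first_sample, end_sample) triple per phoneme.
phone_spans = [(ph.text, *to_samples(ph)) for ph in toy_grid["phones"]]
```

Because the timings are stored in seconds, the same intervals can be mapped onto audio at any sample rate, which is why an alignment computed on 16 kHz audio can still be used to cut 22050 Hz recordings.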

In [None]:
alignment = {f.split("/")[-1].split(".")[0][4:]: textgrids.TextGrid(f) for f in glob.iglob('ljs_aligned/*')}

In [None]:
wavs = {f.split("/")[-1].split(".")[0]: wavfile.read(f)[1] for f in glob.iglob('./ljs/LJSpeech-1.1/wavs/*.wav')}

In [None]:
allphones = {
    ph.text for grid in alignment.values() for ph in grid["phones"]
}
# let's exclude special symbols: silence, spoken noise, non-spoken noise
allphones = {ph for ph in allphones if ph == ph.upper()}
assert len(allphones) == 69

Here your part begins:
You need to create a `diphone index`: a mapping structure that lets you find the original utterance, and the position within it, by diphone text id.

E.g.:
`index[(PH1, PH2)] -> (utt_id, phoneme_index)`
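
As an illustration of the mapping shape only, here is a sketch on toy data, where each utterance is a plain list of phoneme labels rather than a parsed grid. The names `toy_phones` and `build_diphone_index` are hypothetical, and keeping the first occurrence of each diphone is an arbitrary choice made for this sketch.

```python
# Toy stand-in for `alignment`: utterance id -> list of phoneme labels.
# The ids and labels are made up; lowercase "sil" stands for a special symbol.
toy_phones = {
    "utt_a": ["HH", "AH0", "L", "OW1"],
    "utt_b": ["sil", "AH0", "L", "D"],
}


def build_diphone_index(phones_by_utt):
    """Map (ph1, ph2) -> (utt_id, position of ph1), keeping the first occurrence."""
    index = {}
    for utt_id in sorted(phones_by_utt):
        labels = phones_by_utt[utt_id]
        for i, pair in enumerate(zip(labels[:-1], labels[1:])):
            # Skip pairs that touch a special (non-uppercase) symbol.
            if all(ph == ph.upper() for ph in pair):
                index.setdefault(pair, (utt_id, i))
    return index


toy_index = build_diphone_index(toy_phones)
```

Other selection criteria are possible when a diphone occurs several times, for example preferring longer or cleaner realizations.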

In [None]:
diphone_index = dict()
# !!!!!!!!!!!!!!!!!!!!!!#
# INSERT YOUR CODE HERE #
# !!!!!!!!!!!!!!!!!!!!!!#

In [None]:
# check yourself
for a, b in [('AH0', 'P'), ('P', 'AH0'), ('AH0', 'L')]:
    k, i = diphone_index[(a,b)]
    assert a == alignment[k]['phones'][i].text
    assert b == alignment[k]['phones'][i+1].text

In concatenative TTS, some diphones may be missing from the dataset.
If the missing ones are infrequent, this is not a big problem,
but we still need to provide a mechanism to replace missing units.

In [None]:
with open("fallback_rules.txt") as ifile:
    lines = [l.strip().split() for l in ifile]
    fallback_rules = {l[0]: l[1:] for l in lines}

The dict `fallback_rules` holds possible replacements for every phone
(listed in order of decreasing similarity).

E.g. `a stressed` -> `a unstressed`  | `o stressed` | `o unstressed`

Here is also some work for you:
You need to build diphone fallbacks from the phoneme ones:

`diphone_fallbacks[(Ph1, Ph2)] -> (some_other_pair_of_phones_present_in_dataset)`

and also, if `diphone_fallbacks[(a, b)] = (c, d)` then all of the following hold:
* c = a or c $\in$ fallback_rules[a]
* d = b or d $\in$ fallback_rules[b]
* (c, d) $\neq$ (a, b)
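
One possible way to search for a replacement pair is sketched below on made-up rules. The names `toy_rules`, `toy_available`, and `find_fallback` are hypothetical, and ranking candidates by the total number of substitution steps is an assumption of this sketch, not a requirement of the task.

```python
from itertools import product

# Made-up phoneme-level rules: phone -> replacements, most similar first.
toy_rules = {"Z": ["S", "ZH"], "AY1": ["AY0", "AA1"]}
# Made-up set of diphones that are available in the index.
toy_available = {("S", "AY1"), ("Z", "AA1"), ("ZH", "AY0")}


def find_fallback(diphone, rules, available):
    """Return the first available pair built from the phones or their replacements."""
    a, b = diphone
    left = [a] + rules.get(a, [])
    right = [b] + rules.get(b, [])
    # Rank candidates by the total number of substitution steps, so the pairs
    # closest to the original diphone are tried first.
    candidates = sorted(
        product(enumerate(left), enumerate(right)),
        key=lambda lr: lr[0][0] + lr[1][0],
    )
    for (_, c), (_, d) in candidates:
        if (c, d) != diphone and (c, d) in available:
            return c, d
    return None  # no replacement could be found


toy_fallback = find_fallback(("Z", "AY1"), toy_rules, toy_available)
```

Here `("Z", "AY0")` is tried before `("S", "AY1")`, but only the latter is available, so it is the one returned.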


In [None]:
diphone_fallbacks = dict()
# !!!!!!!!!!!!!!!!!!!!!!#
# INSERT YOUR CODE HERE #
# !!!!!!!!!!!!!!!!!!!!!!#

In [None]:
# check yourself
for a, b in [('Z', 'Z'), ('Z', 'AY1'), ('Z', 'EY0')]:
    assert (a, b) in diphone_fallbacks
    r1, r2 = diphone_fallbacks[(a, b)]
    assert r1 in fallback_rules[a] or r1 == a
    assert r2 in fallback_rules[b] or r2 == b
    assert r1 != a or r2 != b

In [None]:
# some helping constants
SAMPLE_RATE = 22050
WAV_TYPE = np.int16

A little DSP related to concatenative synthesis:

to prevent a disturbing "clicking" sound (a difference in volume) when concatenating fragments from different utterances, we need to perform a `cross-fade`: smoothing around the concatenation point.

If we concatenate $wav_1$ and $wav_2$ at points $M_1$ and $M_2$ respectively, we perform a crossfade with an overlap of $2 V$:

$$\forall i \in [-V; V]:~output[M_1+i] = (1-\alpha) \cdot wav_1[M_1+i] + \alpha \cdot wav_2[M_2+i]$$
Where $$\alpha = \frac{i+V}{2 V}$$

And for $i < -V:~ output[M_1+i] = wav_1[M_1+i]$

for $i > V:~output[M_1+i] = wav_2[M_2+i]$


But it is not OK if the overlap extends beyond the junction phoneme.

So, if the junction phoneme starts and ends at positions $B_1$ and $E_1$ (in the first wav) and $B_2$ and $E_2$ (in the second one),
the exact formula for the overlap zone will be:
$$\forall i \in [-L; R]:~output[M_1+i] = (1-\alpha) \cdot wav_1[M_1+i] + \alpha \cdot wav_2[M_2+i]$$
Where:
$$\alpha = \frac{i+L}{L+R},~L = \min(M_1-B_1, M_2 - B_2, V), ~R = \min(E_1-M_1, E_2-M_2, V)$$
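
The clipped formula can be sketched on toy arrays as follows. This is one possible reading, not a reference solution: the function name `crossfade_sketch` is hypothetical, the inputs are assumed to hold only the junction phoneme (so $B = 0$ and $E$ is the array length), and the overlap is taken as the half-open range $[-L; R)$ so that indices never run past the end of either array.

```python
import numpy as np


def crossfade_sketch(lcenter, ldata, rcenter, rdata, halfoverlap):
    """Blend two recordings of the junction phoneme around their centers."""
    # L and R follow the clipped formula with B = 0 and E = len(data).
    left = min(lcenter, rcenter, halfoverlap)
    right = min(len(ldata) - lcenter, len(rdata) - rcenter, halfoverlap)
    head = ldata[:lcenter - left].astype(np.float64)   # pure wav_1 part
    tail = rdata[rcenter + right:].astype(np.float64)  # pure wav_2 part
    if left + right == 0:
        # Nothing to blend: fall back to a plain cut at the centers.
        return np.concatenate([head, tail])
    # Assumption: half-open overlap [-L, R), so alpha goes from 0 up to
    # (L + R - 1) / (L + R) and the tail continues with pure wav_2.
    alpha = np.arange(left + right) / (left + right)
    lpart = ldata[lcenter - left:lcenter + right]
    rpart = rdata[rcenter - left:rcenter + right]
    blended = (1.0 - alpha) * lpart + alpha * rpart
    return np.concatenate([head, blended, tail])


# Toy check with constant signals: the output ramps from 0 towards 100.
toy = crossfade_sketch(5, np.zeros(10), 5, np.full(10, 100.0), 2)
```

The sketch returns floating-point samples, so the result would still need to be cast to the target integer type before playback.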
    

In [None]:
def crossfade(lcenter, ldata, rcenter, rdata, halfoverlap):
    """
    ldata, rdata - 1d numpy arrays containing only the junction phoneme (so, B1 = 0, E1 = ldata.shape[0])
    lcenter = M1
    rcenter = M2
    
    should return the cross-faded version of the junction phoneme (as a 1d numpy array)
    """
    # !!!!!!!!!!!!!!!!!!!!!!#
    # INSERT YOUR CODE HERE #
    # !!!!!!!!!!!!!!!!!!!!!!#

In [None]:
def get_data(k, i):
    phoneme = alignment[k]['phones'][i]
    left = phoneme.xmin
    right = phoneme.xmax
    center = (left+right) * .5
    
    left = int(left * SAMPLE_RATE)
    center = int(center * SAMPLE_RATE)
    right = int(right * SAMPLE_RATE)
    return center - left, wavs[k][left:right]

In [None]:
# check yourself
cf = crossfade(*get_data('LJ050-0241', 3), *get_data('LJ038-0067', 56), 300)
assert np.abs(cf.shape[0] - 1764) < 10
assert np.abs(cf.mean() - 11) < 0.1

In [None]:
HALF_OVERLAP_CROSSFADE = 300

def synthesize(phonemes):
    diphones = []
    for ph1, ph2 in zip(phonemes[:-1], phonemes[1:]):
        diphone = (ph1, ph2)
        if diphone in diphone_index:
            k, i = diphone_index[diphone]
        else:
            k, i = diphone_index[diphone_fallbacks[diphone]]
            
        diphones.append((get_data(k, i), get_data(k, i+1)))
    output = []
    
    # Here you need to construct the resulting utterance with crossfades
    # NB: the border phonemes (the first and the last) do not require any crossfade and can just be copied
    # !!!!!!!!!!!!!!!!!!!!!!#
    # INSERT YOUR CODE HERE #
    # !!!!!!!!!!!!!!!!!!!!!!#
    # need to return wav as 1d numpy array of type WAV_TYPE

Check yourself:

If everything is correct, you should hear 'hello world'.

In [None]:
display_audio(synthesize(['HH', 'AH0', 'L', 'OW1', 'W', 'ER1', 'L', 'D']))

In [None]:
# load additional test texts
with open("test_phones.txt") as ifile:
    test_phones = []
    for l in ifile:
        test_phones.append(l.strip().split())

Here you should hear a small part of the GLADOS song.

In [None]:
output = []
pause = np.zeros([int(0.1 * SAMPLE_RATE)], dtype=WAV_TYPE)
for test in test_phones:
    output.append(synthesize(test))
    output.append(pause)
    
display_audio(np.concatenate(output[:-1]))