## 1. Data Preprocessing

The raw data must be preprocessed into a form the model can consume, i.e. raw data --> mel features.

### 1.1 Raw data format

LJSpeech-1.1    
- metadata.csv
- wavs  
    * LJ001-0001.wav    
    * LJ001-0002.wav
    * ...

The LJ Speech Dataset

Version 1.0
July 5, 2017
https://keithito.com/LJ-Speech-Dataset
-----------------------------------------------------------------------------

OVERVIEW

This is a public domain speech dataset consisting of 13,100 short audio clips
of a single speaker reading passages from 7 non-fiction books. A transcription
is provided for each clip. Clips vary in length from 1 to 10 seconds and have
a total length of approximately 24 hours.

The texts were published between 1884 and 1964, and are in the public domain.
The audio was recorded in 2016-17 by the LibriVox project and is also in the
public domain.

FILE FORMAT

Metadata is provided in metadata.csv. This file consists of one record per
line, delimited by the pipe character (0x7c). The fields are:

  1. ID: this is the name of the corresponding .wav file
  2. Transcription: words spoken by the reader (UTF-8)
  3. Normalized Transcription: transcription with numbers, ordinals, and
     monetary units expanded into full words (UTF-8).

Each audio file is a single-channel 16-bit PCM WAV with a sample rate of
22050 Hz.



STATISTICS

Total Clips            13,100      
Total Words            225,715       
Total Characters       1,308,674       
Total Duration         23:55:17       
Mean Clip Duration     6.57 sec       
Min Clip Duration      1.11 sec       
Max Clip Duration      10.10 sec       
Mean Words per Clip    17.23         
Distinct Words         13,821         

MISCELLANEOUS

The audio clips range in length from approximately 1 second to 10 seconds.
They were segmented automatically based on silences in the recording. Clip
boundaries generally align with sentence or clause boundaries, but not always.

The text was matched to the audio manually, and a QA pass was done to ensure
that the text accurately matched the words spoken in the audio.

The original LibriVox recordings were distributed as 128 kbps MP3 files. As a
result, they may contain artifacts introduced by the MP3 encoding.

The following abbreviations appear in the text. They may be expanded as
follows:

     Abbreviation   Expansion
     --------------------------
     Mr.            Mister
     Mrs.           Misess (*)
     Dr.            Doctor
     No.            Number
     St.            Saint
     Co.            Company
     Jr.            Junior
     Maj.           Major
     Gen.           General
     Drs.           Doctors
     Rev.           Reverend
     Lt.            Lieutenant
     Hon.           Honorable
     Sgt.           Sergeant
     Capt.          Captain
     Esq.           Esquire
     Ltd.           Limited
     Col.           Colonel
     Ft.            Fort

     * there's no standard expansion of "Mrs."


19 of the transcriptions contain non-ASCII characters (for example, LJ016-0257
contains "raison d'être").

For more information or to report errors, please email kito@kito.us.



LICENSE

This dataset is in the public domain in the USA (and likely other countries as
well). There are no restrictions on its use. For more information, please see:
https://librivox.org/pages/public-domain.


CHANGELOG

* 1.0 (July 8, 2017):
  Initial release

* 1.1 (Feb 19, 2018):
  Version 1.0 included 30 .wav files with no corresponding annotations in
  metadata.csv. These have been removed in version 1.1. Thanks to Rafael Valle
  for spotting this.


CREDITS

This dataset consists of excerpts from the following works:

* Morris, William, et al. Arts and Crafts Essays. 1893.
* Griffiths, Arthur. The Chronicles of Newgate, Vol. 2. 1884.
* Roosevelt, Franklin D. The Fireside Chats of Franklin Delano Roosevelt.
  1933-42.
* Harland, Marion. Marion Harland's Cookery for Beginners. 1893.
* Rolt-Wheeler, Francis. The Science - History of the Universe, Vol. 5:
  Biology. 1910.
* Banks, Edgar J. The Seven Wonders of the Ancient World. 1916.
* President's Commission on the Assassination of President Kennedy. Report
  of the President's Commission on the Assassination of President Kennedy.
  1964.

Recordings by Linda Johnson. Alignment and annotation by Keith Ito. All text,
audio, and annotations are in the public domain.

There's no requirement to cite this work, but if you'd like to do so, you can
link to: https://keithito.com/LJ-Speech-Dataset

or use the following:
@misc{ljspeech17,
  author       = {Keith Ito},
  title        = {The LJ Speech Dataset},
  howpublished = {\url{https://keithito.com/LJ-Speech-Dataset/}},
  year         = 2017
}
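
The record layout described in the README above (three pipe-delimited fields per line) can be parsed with a plain string split. The sketch below is only an illustration: `Record`, `parse_metadata_line`, and the sample line are hypothetical names and made-up data, not part of the dataset or of the preprocessing code that follows.

```python
from typing import NamedTuple


class Record(NamedTuple):
    clip_id: str      # name of the corresponding .wav file, without extension
    text: str         # transcription as spoken
    normalized: str   # transcription with numbers, ordinals, etc. expanded


def parse_metadata_line(line: str) -> Record:
    """Split one pipe-delimited metadata.csv record into its three fields."""
    clip_id, text, normalized = line.rstrip("\n").split("|")
    return Record(clip_id, text, normalized)


# Made-up record for illustration only; it is not copied from the dataset.
sample = "LJ000-0000|He paid $5 to Dr. Smith.|He paid five dollars to Doctor Smith.\n"
record = parse_metadata_line(sample)
wav_name = record.clip_id + ".wav"
print(wav_name, "->", record.normalized)
```

The preprocessing code below reads the third field (`parts[2]`), so the model is trained on the normalized transcription.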


In [None]:
import argparse
import os
from multiprocessing import cpu_count
from tqdm import tqdm
from datasets import ljspeech
from hparams import hparams

def preprocess_ljspeech(args):
    in_dir = os.path.join(args.base_dir, 'LJSpeech-1.1')
    out_dir = os.path.join(args.base_dir, args.output)
    os.makedirs(out_dir, exist_ok=True) #./training
    metadata = ljspeech.build_from_path(in_dir, out_dir, args.num_workers, tqdm=tqdm)
    write_metadata(metadata, out_dir)
    
def write_metadata(metadata, out_dir):
    with open(os.path.join(out_dir, 'train.txt'), 'w', encoding='utf-8') as f:
        for m in metadata:
            f.write('|'.join([str(x) for x in m]) + '\n')
    frames = sum([m[2] for m in metadata])
    hours = frames * hparams.frame_shift_ms / (3600 * 1000)
    print('Wrote %d utterances, %d frames (%.2f hours)' % (len(metadata), frames, hours))
    print('Max input length:  %d' % max(len(m[3]) for m in metadata))
    print('Max output length: %d' % max(m[2] for m in metadata))
    
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--base_dir', default=os.path.expanduser('~/tacotron'))
    parser.add_argument('--output', default='training')
    parser.add_argument('--dataset', required=True, choices=['ljspeech'])
    parser.add_argument('--num_workers', type=int, default=cpu_count())
    args = parser.parse_args()
    preprocess_ljspeech(args)

if __name__ == "__main__":
    main()
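
As a rough check of what `write_metadata` produces, the sketch below parses lines in the `spectrogram_filename|mel_filename|n_frames|text` layout and repeats the frames-to-hours arithmetic. `parse_train_line`, `total_hours`, and the sample lines are hypothetical, and the 12.5 ms frame shift is an assumption standing in for `hparams.frame_shift_ms`.

```python
FRAME_SHIFT_MS = 12.5  # assumption: stands in for hparams.frame_shift_ms


def parse_train_line(line):
    """Split one train.txt line into (spec_file, mel_file, n_frames, text)."""
    # maxsplit=3 keeps the text field intact even if it contained a pipe.
    spec_file, mel_file, n_frames, text = line.rstrip("\n").split("|", 3)
    return spec_file, mel_file, int(n_frames), text


def total_hours(frame_counts, frame_shift_ms=FRAME_SHIFT_MS):
    """Same arithmetic as write_metadata: frames * shift_ms / (3600 * 1000)."""
    return sum(frame_counts) * frame_shift_ms / (3600 * 1000)


# Made-up lines for illustration only.
lines = [
    "ljspeech-spec-00001.npy|ljspeech-mel-00001.npy|773|hello world\n",
    "ljspeech-spec-00002.npy|ljspeech-mel-00002.npy|512|another sentence\n",
]
examples = [parse_train_line(line) for line in lines]
hours = total_hours(n_frames for _, _, n_frames, _ in examples)
print("%d utterances, %.6f hours" % (len(examples), hours))
```

With a 12.5 ms shift, 288,000 frames correspond to exactly one hour of audio.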

In [None]:
# ljspeech.py
from concurrent.futures import ProcessPoolExecutor
from functools import partial
import numpy as np
import os
from util import audio

def build_from_path(in_dir, out_dir, num_workers=1, tqdm=lambda x: x):
    '''Preprocesses the LJ Speech dataset from a given input path into a given output directory.
        Args:
          in_dir: The directory where you have downloaded the LJ Speech dataset
          out_dir: The directory to write the output into
        Returns:
          A list of tuples describing the training examples. This should be written to train.txt
      '''
    executor = ProcessPoolExecutor(max_workers=num_workers)
    futures = []
    index = 1
    with open(os.path.join(in_dir, 'metadata.csv'), encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('|')
            wav_path = os.path.join(in_dir, 'wavs', '%s.wav' % parts[0])
            text = parts[2]
            futures.append(executor.submit(partial(_process_utterance, out_dir, index, wav_path, text)))
            index += 1
    return [future.result() for future in tqdm(futures)]

def _process_utterance(out_dir, index, wav_path, text):
    '''Preprocesses a single utterance audio/text pair.
        This writes the mel and linear scale spectrograms to disk and returns a tuple to write
        to the train.txt file.
        Args:
            out_dir: The directory to write the spectrograms into
            index: The numeric index to use in the spectrogram filenames.
            wav_path: Path to the audio file containing the speech input
            text: The text spoken in the input audio file
        Returns:
            A (spectrogram_filename, mel_filename, n_frames, text) tuple to write to train.txt
    '''
    # Load the audio to a numpy array:
    wav = audio.load_wav(wav_path)  #(193101,)

    # Compute the linear-scale spectrogram from the wav:
    spectrogram = audio.spectrogram(wav).astype(np.float32) #(1025,773)
    n_frames = spectrogram.shape[1] #773

    # Compute a mel-scale spectrogram from the wav:
    mel_spectrogram = audio.melspectrogram(wav).astype(np.float32) 

    # Write the spectrograms to disk:
    spectrogram_filename = 'ljspeech-spec-%05d.npy' % index
    mel_filename = 'ljspeech-mel-%05d.npy' % index
    np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)
    np.save(os.path.join(out_dir, mel_filename), mel_spectrogram.T, allow_pickle=False)

    # Return a tuple describing this training example:
    return (spectrogram_filename, mel_filename, n_frames, text)

### 1.2 STFT

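librosa is a widely used Python library for audio processing. Its `stft` function computes the short-time Fourier transform (STFT) of an audio signal.

Taking the magnitude of the STFT yields the linear spectrogram. Applying a mel-scale weighted sum to the linear spectrogram yields the mel spectrogram commonly used in speech recognition and speech synthesis.

The STFT first splits the audio into frames and then applies a Fourier transform to each frame separately.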

#### From DFT to STFT

#### Framing the audio

Apply a windowing function to the signal: $x_w(k)=x(k) \cdot w(k)$

![1606734418841-4a09b34c-976c-4748-99da-10ad102fe11d.png](attachment:40026cfe-a833-47ff-b2ca-db8c12ccec02.png)
![1606734367106-f099dae3-2b9f-473b-ab02-dcc4c10ea7c3.png](attachment:5ca21a16-ab3b-4763-8f2a-b83cfb8f4588.png)

Overlapping frames:  

![1606734455370-1efc0a84-c75a-4ad0-8c3a-d53617d5bc4c.png](attachment:2ef3dc20-2919-4469-9aab-e0cc3af76f3a.png)

$$\hat{x}(k)=\sum_{n=0}^{N-1}x(n)\cdot e^{-i2\pi n \frac{k}{N}}$$

$$S(m,k)=\sum_{n=0}^{N-1}x(n+mH)\cdot w(n)\cdot e^{-i2\pi n \frac{k}{N}}$$

m: frame index, n: sample index within the frame, N: frame size, H: hop size, k: frequency bin index, w(n): windowing function

Frame size (window size): typically 256, 512, 1024, 2048, or 4096 samples

Hop size: typically 1/2, 1/4, or 1/8 of the frame size

Windowing function: Hann window
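
To make the roles of frame size, hop size, and window concrete, here is a minimal dependency-free sketch that evaluates $S(m,k)$ directly from the definition above. It is not how librosa computes the STFT (librosa uses an FFT and pads the signal by default); `hann` and `naive_stft` are hypothetical helper names, and the result is indexed `S[m][k]` (frames first), the transpose of the (#frequency bins, #frames) layout used elsewhere in this section.

```python
import cmath
import math


def hann(N):
    """Periodic Hann window of length N."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]


def naive_stft(x, frame_size, hop_size):
    """Evaluate S(m, k) from the definition, for k = 0 .. frame_size // 2.

    No padding is applied, so only frames that fit entirely inside x are used.
    """
    w = hann(frame_size)
    n_frames = 1 + (len(x) - frame_size) // hop_size
    n_bins = frame_size // 2 + 1
    S = []
    for m in range(n_frames):
        # Windowed frame: x(n + mH) * w(n)
        frame = [x[n + m * hop_size] * w[n] for n in range(frame_size)]
        S.append([
            sum(frame[n] * cmath.exp(-2j * math.pi * n * k / frame_size)
                for n in range(frame_size))
            for k in range(n_bins)
        ])
    return S


# Test signal: a cosine that completes exactly 4 cycles per 32-sample frame.
x = [math.cos(2 * math.pi * 4 * n / 32) for n in range(128)]
S = naive_stft(x, frame_size=32, hop_size=8)  # hop = 1/4 of the frame size
peak_bin = max(range(len(S[0])), key=lambda k: abs(S[0][k]))
print(len(S), "frames x", len(S[0]), "bins; peak at bin", peak_bin)
```

The cosine's energy lands in bin 4, and the Hann window spreads part of it into the two neighbouring bins.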

#### Outputs

We get one Fourier coefficient for each of the frequency components into which the original signal is decomposed. The result is a one-dimensional array, i.e. a vector.

**DFT**

- Spectral vector (#frequency bins)

- N complex Fourier coefficients

**STFT**

We get a complex Fourier coefficient for each frequency bin in each frame.

- Spectral matrix (#frequency bins, #frames)    
    - frequency bins=num_freq
   
- Complex Fourier coefficients

![Screen Shot 2021-10-25 at 7.38.38 PM.png](attachment:17cd51d7-ffa4-4bc5-9e92-edd0a9fa61ef.png)

![1606217847062-f690cc5a-b70e-427e-a425-57524e5b6ae9.png](attachment:6d4047dd-491c-42ff-ac77-371d02249e89.png)
![1606218009633-21a942c7-6fd1-46cb-a425-d44515d975c5 (1).png](attachment:9c4e559b-407e-4b03-bfbe-ebd1f3e4ed33.png)
![1606218069994-955a0887-d1dd-4e51-8d56-e6f7ae6ece4c.png](attachment:a7032bc1-fef4-461e-8680-186b6c80bf21.png) 

![1606219211354-15307c10-e091-416a-b3ba-5012ba9b4a1e.png](attachment:8c1f553e-4990-4b1b-be32-fee1dde6218a.png)
![1606219183209-deb39be2-f088-428f-8a20-20f7b2ecdfe4.png](attachment:d015a13f-7ddd-4ea3-8d6a-a656fbd5146f.png)

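$num\_freq=\frac{n\_fft}{2}+1$ because only half of the FFT output needs to be analyzed: the spectrum of a real-valued signal is conjugate-symmetric, so the full FFT result is redundant.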
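
The redundancy can be checked numerically. The sketch below uses a naive DFT written from the definition (the helper `dft` and the sample values are hypothetical) and verifies that, for a real-valued input, bin $N-k$ is the complex conjugate of bin $k$, so only $\frac{N}{2}+1$ bins carry independent information.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) DFT, written directly from the definition."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * n * k / N) for n in range(N))
        for k in range(N)
    ]


# Arbitrary real-valued samples, for illustration only.
x = [0.5, -1.0, 2.0, 0.25, 0.0, 1.5, -0.75, 3.0]
N = len(x)
X = dft(x)

# For real input, X[N - k] equals the complex conjugate of X[k].
symmetric = all(abs(X[N - k] - X[k].conjugate()) < 1e-9 for k in range(1, N))

num_freq = N // 2 + 1  # independent bins: k = 0 .. N/2
print("conjugate-symmetric:", symmetric, "| independent bins:", num_freq)
```

The same rule with `n_fft = 2048` gives `2048 // 2 + 1 = 1025` bins, matching the first dimension of the `(1025, 773)` spectrogram shape noted in the code comments.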

In [None]:
import librosa
import numpy as np
import scipy.signal
from hparams import hparams

def preemphasis(x):
    # First-order high-pass filter: y[n] = x[n] - hparams.preemphasis * x[n-1]
    return scipy.signal.lfilter([1, -hparams.preemphasis], [1], x) #(193101,)

def spectrogram(y): #(1025,773)  
    D = _stft(preemphasis(y)) # complex matrix D(f, t)
    # np.abs(D) is the magnitude of each frequency component; np.angle(D) would give its phase
    S = _amp_to_db(np.abs(D)) - hparams.ref_level_db
    return _normalize(S)

def _stft(y):
    '''
    Args (passed to librosa.stft):
        n_fft: FFT size
        hop_length: frame shift in samples; librosa defaults to win_length // 4
        win_length: window length in samples; the window is zero-padded to match n_fft; librosa defaults to n_fft
    '''
    n_fft, hop_length, win_length = _stft_parameters()
    return librosa.stft(y=y, n_fft=n_fft, hop_length=hop_length, win_length=win_length)

def _stft_parameters():
    n_fft = (hparams.num_freq - 1) * 2  # 2048
    hop_length = int(hparams.frame_shift_ms / 1000 * hparams.sample_rate) #12.5/1000*20000=250
    win_length = int(hparams.frame_length_ms / 1000 * hparams.sample_rate) 
    return n_fft, hop_length, win_length
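
The arithmetic in `_stft_parameters` can be reproduced without `hparams`. In the sketch below, the 20000 Hz sample rate, 12.5 ms frame shift, and `num_freq = 1025` follow from the comments in the code above, while the 50 ms frame length is an assumption because its value is not shown here. The frame count uses the rule that applies when `librosa.stft` is left at its default `center=True`.

```python
SAMPLE_RATE = 20000     # Hz, from the "12.5/1000*20000=250" comment
NUM_FREQ = 1025         # implied by n_fft = 2048
FRAME_SHIFT_MS = 12.5   # from the same comment
FRAME_LENGTH_MS = 50    # assumption: not stated in this section


def stft_parameters(num_freq, frame_shift_ms, frame_length_ms, sample_rate):
    """Same formulas as _stft_parameters, with explicit arguments."""
    n_fft = (num_freq - 1) * 2
    hop_length = int(frame_shift_ms / 1000 * sample_rate)
    win_length = int(frame_length_ms / 1000 * sample_rate)
    return n_fft, hop_length, win_length


n_fft, hop_length, win_length = stft_parameters(
    NUM_FREQ, FRAME_SHIFT_MS, FRAME_LENGTH_MS, SAMPLE_RATE)

# With center=True the signal is padded so that frame t is centred on sample
# t * hop_length, which gives 1 + n_samples // hop_length frames.
n_samples = 193101  # length of the example waveform in the comments above
n_frames = 1 + n_samples // hop_length

print(n_fft, hop_length, win_length, n_frames)
```

Under these assumptions the numbers agree with the `(1025, 773)` shape annotated on `spectrogram`.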

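Computing decibels from amplitude or power:  
- From a voltage (amplitude) value:
    $$dB = 20\cdot \log_{10}(Amp)$$
- From a power value:
    $$dB = 10\cdot \log_{10}(Power)$$
   
Recovering the amplitude from decibels:   
$$Amp = 10^{\frac{dB}{20}}$$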

In [None]:
def _amp_to_db(x):
    return 20 * np.log10(np.maximum(1e-5, x))

def _db_to_amp(x):
    return np.power(10.0, x * 0.05)
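
A scalar version of the two helpers makes the conversions easy to check by hand. This is a sketch in plain Python, with the hypothetical names `amp_to_db` and `db_to_amp`, that mirrors the NumPy code above, including the `1e-5` floor that keeps the logarithm finite.

```python
import math


def amp_to_db(amp, floor=1e-5):
    """20 * log10(amplitude), with the amplitude clamped from below."""
    return 20 * math.log10(max(floor, amp))


def db_to_amp(db):
    """Inverse mapping: 10 ** (dB / 20)."""
    return 10.0 ** (db * 0.05)


for amp in (1.0, 0.1, 0.25, 0.0):
    print("amp=%.5f -> %8.3f dB" % (amp, amp_to_db(amp)))

# Power is proportional to amplitude squared, so 10 * log10(amp ** 2)
# yields the same decibel value as 20 * log10(amp).
same_db = abs(10 * math.log10(0.25 ** 2) - amp_to_db(0.25)) < 1e-9
```

An amplitude of 1 maps to 0 dB, each factor of 10 in amplitude changes the level by 20 dB, and anything at or below the floor maps to -100 dB.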

In [None]:
def _normalize(S):
    return np.clip((S - hparams.min_level_db) / -hparams.min_level_db, 0, 1)  
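
`_normalize` maps decibel values linearly from `[min_level_db, 0]` onto `[0, 1]` and clips anything outside that range. The scalar sketch below assumes `min_level_db = -100`, a value not given in this section; `normalize` is a hypothetical stand-in for the NumPy version.

```python
MIN_LEVEL_DB = -100  # assumption: stands in for hparams.min_level_db


def normalize(s_db, min_level_db=MIN_LEVEL_DB):
    """Scale a dB value from [min_level_db, 0] to [0, 1], clipping outliers."""
    scaled = (s_db - min_level_db) / -min_level_db
    return min(1.0, max(0.0, scaled))


for s_db in (-120, -100, -50, 0, 10):
    print("%5d dB -> %.2f" % (s_db, normalize(s_db)))
```

Values below `min_level_db` clip to 0 and values above 0 dB clip to 1. In `spectrogram`, `hparams.ref_level_db` is subtracted before this step.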