This repository has been archived by the owner on Oct 1, 2021. It is now read-only.

Training (and for other languages)

Kramarenko Vladislav edited this page Aug 27, 2019 · 3 revisions

Datasets

You need a lot of memory (>512 GB).

Ideally, all source datasets are stored in the same <datasets_root> directory (by default, the "d" folder). By default, all preprocessing scripts write the cleaned data to a new directory, "d/SV2TTS". Inside it, a separate directory is created for each model: the encoder, the synthesizer and the vocoder.
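
As a rough illustration of the layout described above, the following sketch computes where each model's cleaned data would land. The subdirectory names `encoder`, `synthesizer` and `vocoder` are assumptions derived from the model names, and `output_dirs` is a hypothetical helper; check the preprocessing scripts for the exact names they use.

```python
from pathlib import Path

# Model names taken from the text; the exact folder names are an assumption.
MODELS = ("encoder", "synthesizer", "vocoder")


def output_dirs(datasets_root):
    """Map each model name to its assumed output directory under SV2TTS."""
    clean_root = Path(datasets_root) / "SV2TTS"
    return {model: clean_root / model for model in MODELS}


if __name__ == "__main__":
    # "d" is the default datasets root mentioned in the text.
    for model, path in output_dirs("d").items():
        print(f"{model}: {path}")
```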

| Name | Language | Link | Comments | My link | My comments |
| Phoneme dictionary | En, Ru | En, Ru | Phoneme dictionary | link | Combined the Russian and English phoneme dictionaries |
| LibriSpeech | En | link | 300 speakers, 360h clean speech | | |
| VoxCeleb | En | link | 7000 speakers, many hours bad speech | | |
| M-AILABS | Ru | link | 3 speakers, 46h clean speech | | |
| open_tts, open_stt | Ru | open_tts, open_stt | Many speakers, many hours bad speech | link | Cleaned 4 hours of speech from one speaker. Fixed the annotation and split it into segments of up to 7 seconds |
| Voxforge+audiobook | Ru | link | Many speakers, 25h various quality | link | Selected the good files. Split them into segments. Added audiobooks from the internet. The result is 200 speakers with a couple of minutes each |
| RUSLAN | Ru | link | One speaker, 40h good speech | link | Re-encoded to 16 kHz |
| Mozilla | Ru | link | 50 speakers, 30h good speech | link | Re-encoded to 16 kHz and sorted the different users into folders |
| Russian Single | Ru | link | One speaker, 9h good speech | link | Re-encoded to 16 kHz |

The g2p model needs a phoneme dictionary for your language, in which each entry is a line such as "stockham с т А к h a м": the word, followed by its phonemes.
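
A minimal sketch of reading entries in that format is shown below. It assumes that the first whitespace-separated token is the word and the remaining tokens are phonemes, which matches the example entry but is not confirmed for the whole dictionary; `parse_entry` and `load_dictionary` are hypothetical helper names, not part of the g2p code.

```python
def parse_entry(line):
    """Split one dictionary line into (word, [phonemes])."""
    word, *phonemes = line.split()
    return word, phonemes


def load_dictionary(lines):
    """Build a word -> phoneme-list mapping, skipping blank lines."""
    entries = {}
    for line in lines:
        if line.strip():
            word, phonemes = parse_entry(line)
            entries[word] = phonemes
    return entries


if __name__ == "__main__":
    # The example entry from the text.
    word, phonemes = parse_entry("stockham с т А к h a м")
    print(word, phonemes)
```

`load_dictionary` accepts any iterable of lines, so an open text file can be passed to it directly.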

For the encoder you need a LOT of audio, with each speaker placed in a separate folder. Fortunately, you can use untranscribed, noisy data. If you do not have enough data for your language, you can use another language, for example English. The language is not that important to the encoder.
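
Before preprocessing, it can help to check how much audio each speaker folder actually holds. The sketch below counts audio files per speaker folder; the set of extensions and the function name `utterances_per_speaker` are assumptions for illustration, and the real preprocessing scripts may expect a different layout for specific datasets.

```python
from pathlib import Path

# Extensions to count as audio; adjust to your dataset (assumption).
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3", ".m4a"}


def utterances_per_speaker(dataset_dir):
    """Return {speaker_folder_name: number_of_audio_files} for one dataset."""
    counts = {}
    for speaker_dir in sorted(Path(dataset_dir).iterdir()):
        if not speaker_dir.is_dir():
            continue  # ignore loose files next to the speaker folders
        counts[speaker_dir.name] = sum(
            1
            for f in speaker_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in AUDIO_EXTENSIONS
        )
    return counts
```

Speakers with very few files can then be spotted by sorting the returned dictionary by value.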

The synthesizer requires a lot of clean, well-annotated audio from different speakers.

The vocoder operates on the synthesized mel spectrograms, so it should preferably also be trained on clean, well-prepared data.

If you want to build a model for several languages at once, think about the number of phonemes. The more of them there are, the harder it is for the models to learn. But if there are too few, the model will have an accent. Think about how the phonemes in your languages sound. And do not forget to mark stressed vowels with separate characters. For English, secondary stress plays a small role, and I would single it out.
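
To get a feel for the size of a combined phoneme set, the symbols in a dictionary can simply be counted. This is a sketch under two assumptions: the dictionary is a word to phoneme-list mapping, and stressed vowels are written as upper-case symbols (as the "А" in the example entry suggests). The toy entries and the helper names are hypothetical.

```python
from collections import Counter


def phoneme_inventory(dictionary):
    """Count how often each phoneme symbol occurs across all entries."""
    counts = Counter()
    for phonemes in dictionary.values():
        counts.update(phonemes)
    return counts


def stressed_symbols(inventory):
    """Return the symbols written in upper case (assumed to mark stress)."""
    return sorted(s for s in inventory if s.isupper())


if __name__ == "__main__":
    # Toy entries for illustration only, not real dictionary data.
    toy = {"word1": ["k", "A", "t"], "word2": ["t", "a", "k"]}
    inventory = phoneme_inventory(toy)
    print(len(inventory), "distinct symbols")
    print("stressed:", stressed_symbols(inventory))
```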

G2P

  1. Open g2p/train.py and edit the Hparams class (for other languages)
  2. Copy the dictionary into the g2p folder
  3. Run python g2p

Encoder

For training, the encoder uses visdom. You can disable it with --no_visdom, but it's nice to have.

It is not necessary to train from scratch (even for other languages). Start from the pre-trained model.

  1. Run python encoder_preprocess.py <datasets_root> to preprocess the data
  2. Run "visdom" in a separate CLI/process to start your visdom server
  3. Run python encoder_train.py my_run <datasets_root> to train the encoder

Synthesizer

  1. Open synthesizer/hparams.py and adjust it to your setup (especially if your audio is sampled at 16 kHz or you run into OOM errors)
  2. Open synthesizer/utils/symbols.py and edit _characters for your language
  3. Run python synthesizer_preprocess_audio.py <datasets_root> to create the processed audio and spectrograms
  4. Run python synthesizer_preprocess_embeds.py <datasets_root> to encode the audio (obtain the voice characteristics)
  5. Run python synthesizer_train.py my_run <datasets_root> to train the synthesizer
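
The sample rate mentioned in step 1 matters because frame sizes expressed in samples scale with it. The sketch below shows only that arithmetic. The 12.5 ms frame shift and 50 ms window length are illustrative defaults, and `frame_params` is a hypothetical helper, not something defined in synthesizer/hparams.py; use the values your hparams file actually expects.

```python
def frame_params(sample_rate, shift_ms=12.5, window_ms=50.0):
    """Convert frame shift and window length from milliseconds to samples."""
    hop_size = int(round(sample_rate * shift_ms / 1000))
    win_size = int(round(sample_rate * window_ms / 1000))
    return hop_size, win_size


if __name__ == "__main__":
    # For 16 kHz audio with the illustrative defaults above.
    print(frame_params(16000))  # (200, 800)
```

If the audio is resampled to a different rate, the sample-based values have to be recomputed together, otherwise the spectrogram frames no longer cover the intended durations.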

Vocoder

  1. Run python vocoder_preprocess.py <datasets_root> to synthesize the mel spectrograms
  2. Run python vocoder_train.py <datasets_root> to train the vocoder
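
The encoder, synthesizer and vocoder steps above can be collected into one ordered list, which makes it harder to skip a preprocessing stage. This is only a sketch: `training_commands` is a hypothetical helper, the commands are copied as written on this page, and the visdom server from the encoder section still has to be started separately.

```python
import shlex


def training_commands(datasets_root, run_id="my_run"):
    """Return the commands from the steps above, in order, as argument lists."""
    return [
        ["python", "encoder_preprocess.py", datasets_root],
        ["python", "encoder_train.py", run_id, datasets_root],
        ["python", "synthesizer_preprocess_audio.py", datasets_root],
        ["python", "synthesizer_preprocess_embeds.py", datasets_root],
        ["python", "synthesizer_train.py", run_id, datasets_root],
        ["python", "vocoder_preprocess.py", datasets_root],
        ["python", "vocoder_train.py", datasets_root],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; each training step can
    # take a long time and is usually launched by hand.
    for cmd in training_commands("d"):
        print(shlex.join(cmd))
```

To execute a step from Python instead of printing it, each argument list can be passed to `subprocess.run(cmd, check=True)`.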