
Learning to Speak from Text for Low-Resource TTS

Implementation for our paper "Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining" to appear in IJCAI 2023.
This repository is standalone but highly dependent on ESPnet.

Abstract:
While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. Then we train this model with paired data in a supervised manner, while freezing a language-aware embedding layer. This allows inference even for languages not included in the paired data but present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language. All experiments were conducted using public datasets and the implementation will be made available for reproducibility.

Environment setup

$ cd tools
$ ./setup_anaconda.sh ${output-dir-name|default=venv} ${conda-env-name|default=root} ${python-version|default=none}
# e.g.
$ ./setup_anaconda.sh miniconda zmtts 3.8

Then install ESPnet.

$ make TH_VERSION=${pytorch-version} CUDA_VERSION=${cuda-version}
# e.g.
$ make TH_VERSION=1.10.1 CUDA_VERSION=11.3

You can also set up a system Python environment. For other options, refer to the ESPnet installation guide.
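
After installation, activate the environment before running any recipe; setup_anaconda.sh generates an activation script under tools/ (the path below assumes the default output directory name used above).

$ . tools/activate_python.sh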

Data preparation

  1. Prepare a root directory (referred to as ${db_root}) for the multilingual TTS corpora and text-only data. Scripts to run our model are in egs2/masmultts. While this README assumes CSS10 as the TTS corpus and VoxPopuli as the text-only data, you can use other multilingual datasets by modifying the data preparation scripts.

  2. Download CSS10 and place it in ${db_root}/css10/ as the TTS training data. Downsample the audio from 22.05 kHz to 16 kHz in advance.

  3. Create a TSV file (${db_root}/css10.tsv) to compile the data for TTS. Each line of the TSV file has the following format.

utt_name<tab>path_to_wav_file<tab>lang_name<tab>speaker_name<tab>utterance_text
...

You can make the TSV file by running egs2/masmultts/make_css10_tsv.py; a sketch of what such a script does is shown below.
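
As a rough illustration, here is a minimal Python sketch of building css10.tsv for one language. It is not the repository's make_css10_tsv.py; the transcript file name (transcript.txt), its pipe-separated field layout, and the placeholder speaker name are assumptions, so verify them against your copy of the corpus.

# Minimal sketch, not the repository's make_css10_tsv.py.
# Assumes each CSS10 language directory ships a pipe-separated
# transcript.txt (wav_path|text|normalized_text|duration); verify
# against your copy of the corpus.
import pathlib

db_root = pathlib.Path("/path/to/db_root")  # your ${db_root}
lang = "german"

with open(db_root / "css10.tsv", "w") as out, \
        open(db_root / "css10" / lang / "transcript.txt") as f:
    for line in f:
        wav, _, text, _ = line.rstrip("\n").split("|")
        utt_name = pathlib.Path(wav).stem
        # CSS10 is single-speaker per language, so a placeholder
        # speaker name derived from the language is used here.
        fields = [utt_name, str(db_root / "css10" / lang / wav),
                  lang, f"{lang}_spk", text]
        out.write("\t".join(fields) + "\n")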

If you use IPA symbols, you also need to dump them to ${db_root}/css10_phn.tsv in the same format. You can do this with egs2/masmultts/tts_pretrain_1/g2p.py as follows, after installing phonemizer.

$ pip3 install phonemizer
$ python3 g2p.py --in_path ${db_root}/css10.tsv --data_type css10
  4. Since runtime multilingual G2P is not implemented in ESPnet, IPA symbols must be dumped in advance. Replace utterance_text in the TSV file with IPA symbols and add the _phn suffix to the file name.

  5. Place text datasets for the unsupervised text pretraining. Download VoxPopuli and put a list of utterance texts in ${db_root}/voxp_text/lm_data/${lang}/sentences.txt. Each sentences.txt looks like:

utterance_text
...

If you use IPA symbols, dump them to ${db_root}/voxp_text/lm_data/${lang}/sentences_phn.txt in the same format, again using egs2/masmultts/tts_pretrain_1/g2p.py; a sketch of the underlying conversion follows the command below.

$ python3 g2p.py --in_path ${db_root}/voxp_text/lm_data/de/sentences.txt --data_type voxp
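
For reference, here is a minimal Python sketch of the per-line text-to-IPA conversion that such a G2P script performs with phonemizer. It is not the repository's g2p.py; the espeak backend and the language code ("de") are assumptions to adapt to your data.

# Minimal sketch, not the repository's g2p.py.
from phonemizer import phonemize

with open("sentences.txt") as f, open("sentences_phn.txt", "w") as out:
    for line in f:
        # Convert one utterance to IPA; requires an espeak backend
        # (e.g., espeak-ng) to be installed on the system.
        ipa = phonemize(line.strip(), language="de", backend="espeak")
        out.write(ipa.strip() + "\n")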

As a result, the root directory layout looks like the following.

- css10/
  |- german
  |- spanish
  ...
- voxp_text/
  |- lm_data
     |- de
        |- sentences.txt
        |- sentences_phn.txt (optional)
     |- es
     ...
- css10.tsv
- css10_phn.tsv (optional)

If you want to use other TTS corpora such as M-AILABS, please see TTS data prep and Pretraining data prep for details.

  6. Add the path to the root directory in db.sh as MASMULTTS=${db_root}.
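
For example, the entry in db.sh looks like the following (the path is a placeholder for your own ${db_root}).

MASMULTTS=/path/to/db_root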

Unsupervised text pretraining

Please see egs2/masmultts/tts_pretrain_1.

TTS training and inference

Please see the TTS recipe under egs2/masmultts.

Work in progress

  • Providing the full implementation of the paper in this standalone repo.
  • Preparing a script to automate the data preparation pipeline.
  • Integrating the implementation into ESPnet.
  • Providing pretrained models through ESPnet.

Citation

@article{saeki2023learning,
  title={Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining},
  author={Saeki, Takaaki and Maiti, Soumi and Li, Xinjian and Watanabe, Shinji and Takamichi, Shinnosuke and Saruwatari, Hiroshi},
  journal={arXiv preprint arXiv:2301.12596},
  year={2023}
}
