tiro-is/tiro-tts

Tiro talgervill / Tiro TTS

Tiro TTS is a text-to-speech API server which works with various TTS backends.

The service accepts either unnormalized text or an SSML document and responds with audio (MP3, Ogg Vorbis or raw 16-bit PCM) or with speech marks, which indicate the byte and time offset of each synthesized word in the request.
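
To make the speech-mark output concrete, here is a minimal parsing sketch. It assumes the marks arrive as newline-delimited JSON objects using the field names of Amazon Polly (`time` in milliseconds, `start` and `end` as byte offsets into the request text, `value` as the word). That layout, the `SpeechMark` class and the sample payload are assumptions made for illustration, so check the OpenAPI documentation for the actual response schema.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class SpeechMark:
    time_ms: int  # time offset of the word in the audio, in milliseconds
    start: int    # byte offset where the word starts in the request text
    end: int      # byte offset just past the end of the word
    value: str    # the word itself


def parse_speech_marks(payload: str) -> List[SpeechMark]:
    """Parse newline-delimited JSON speech marks (assumed Polly-style layout)."""
    marks = []
    for line in payload.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        marks.append(
            SpeechMark(
                time_ms=int(record["time"]),
                start=int(record["start"]),
                end=int(record["end"]),
                value=record["value"],
            )
        )
    return marks


# Hypothetical sample payload, made up for illustration.
SAMPLE = (
    '{"time": 0, "type": "word", "start": 0, "end": 6, "value": "Halló"}\n'
    '{"time": 420, "type": "word", "start": 7, "end": 13, "value": "heimur"}\n'
)
```

Because the offsets count bytes rather than characters, slice the UTF-8 encoded request text (not the Python string) when mapping a mark back to its word; "ó" occupies two bytes, which is why the first word ends at offset 6.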

The full API documentation in OpenAPI 2 format is available online at tts.tiro.is. The documentation is auto-generated from src/schema.py.

Voices

The models are configured with a SynthesisSet protobuf message in text format. The file to load is set via the environment variable TIRO_TTS_SYNTHESIS_SET_PB. See conf/synthesis_set.local.pbtxt for an example.

There are currently eight voices accessible at tts.tiro.is:

  • Diljá: Female voice developed by Reykjavík University (FastSpeech2 + MelGAN).
  • Diljá v2: Female voice developed by Reykjavík University (ESPnet2 FastSpeech2 + Multiband MelGAN).
  • Álfur: Male voice developed by Reykjavík University (FastSpeech2 + MelGAN).
  • Álfur v2: Male voice developed by Reykjavík University (ESPnet2 FastSpeech2 + Multiband MelGAN).
  • Bjartur: Male voice developed by Reykjavík University (ESPnet2 FastSpeech2 + Multiband MelGAN).
  • Rósa: Female voice developed by Reykjavík University (ESPnet2 FastSpeech2 + Multiband MelGAN).
  • Karl: Male voice on Amazon Polly.
  • Dóra: Female voice on Amazon Polly.

Supported backends

The supported voice backends are described in voice.proto. There are three backends: Fastspeech2MelganBackend, Espnet2Backend and PollyBackend, which is a proxy for AWS Polly.

Model preparation for Fastspeech2MelganBackend

The backend tiro.tts.Fastspeech2MelganBackend uses models created with cadia-lvl/FastSpeech2 and a vocoder created with seungwonpark/melgan. Both the FastSpeech2 and MelGAN models have to be converted to TorchScript models before use. The converted models can also be downloaded:

Converting the MelGAN vocoder

To convert the vocoder to TorchScript you need access to the trained model and the audio files used to train it. Two scripts are necessary for the conversion: //:melgan_preprocess and //:melgan_convert.

For the Diljá voice models from Reykjavík University (yet to be published), the steps to prepare the TorchScript MelGAN vocoder are:

Download the recordings:

mkdir wav
wget https://repository.clarin.is/repository/xmlui/bitstream/handle/20.500.12537/104/dilja.zip
unzip dilja.zip -d wav

Generate the input features:

bazel run :melgan_preprocess -- -c $PWD/src/lib/fastspeech/melgan/config/default.yaml -d $PWD/wav/c

Convert the vocoder model:

bazel run :melgan_convert -- -p $PATH_TO_ORIGINAL_MODEL -o $PWD/melgan_jit.pt -i $PWD/wav/c/audio

Then set melgan_uri in conf/synthesis_set.local.pbtxt to the path of melgan_jit.pt.

Converting the FastSpeech2 acoustic model

The model is converted to TorchScript using scripting, so no recordings are necessary. The script //:fastspeech_convert can be used to convert the model:

bazel run :fastspeech_convert -- -p $PATH_TO_ORIGINAL_MODEL -o $PWD/fastspeech_jit.pt

Then set fastspeech2_uri in conf/synthesis_set.local.pbtxt to the path of fastspeech_jit.pt.

Normalization

There are two types of normalization referenced in voice.proto: BasicNormalizer and GrammatekNormalizer. BasicNormalizer runs locally and only strips punctuation, whereas GrammatekNormalizer calls a gRPC service that implements com.grammatek.tts_frontent.TTSFrontend, such as grammatek/tts-frontend-service.
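
To illustrate what punctuation-only normalization amounts to, here is a minimal sketch. It is not the project's BasicNormalizer; the function name `strip_punctuation` and the choice to drop every character in a Unicode punctuation category are assumptions made for illustration.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop Unicode punctuation characters and collapse runs of whitespace."""
    kept = (
        ch for ch in text
        # Unicode general categories starting with "P" are punctuation.
        if not unicodedata.category(ch).startswith("P")
    )
    # split() without arguments also trims leading and trailing whitespace.
    return " ".join("".join(kept).split())
```

For example, `strip_punctuation("Halló, heimur!")` yields `"Halló heimur"`, while `strip_punctuation("Klukkan er 10:30.")` yields `"Klukkan er 1030"`. The second case shows the limit of this approach: digits and abbreviations are left unexpanded, which is the kind of work a fuller normalization front end would be expected to take on.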

Configuration

The voices are configured using a Protobuf text file whose schema is defined in voice.proto. By default it is loaded from conf/synthesis_set.pbtxt, but this can be changed by setting the environment variable TIRO_TTS_SYNTHESIS_SET_PB. See src/config.py for a complete list of possible environment variables.
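
The default-with-override behaviour described above can be sketched as follows. This is not the code in src/config.py; the helper name `synthesis_set_path` is hypothetical, and treating an empty variable the same as an unset one is an assumption.

```python
import os
from typing import Mapping, Optional

DEFAULT_SYNTHESIS_SET = "conf/synthesis_set.pbtxt"


def synthesis_set_path(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured SynthesisSet path, falling back to the default."""
    env = os.environ if env is None else env
    # An unset or empty variable falls back to the default location.
    return env.get("TIRO_TTS_SYNTHESIS_SET_PB") or DEFAULT_SYNTHESIS_SET
```

Passing the environment as an argument keeps the lookup easy to test: `synthesis_set_path({})` returns the default path, and `synthesis_set_path({"TIRO_TTS_SYNTHESIS_SET_PB": "conf/synthesis_set.local.pbtxt"})` returns the override.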

Building and running

The project requires Python 3.8 and uses Bazel for building. To build and run a local development server, use the script ./run.sh.

Docker can also be used to build the project:

docker build -t tiro-tts .

and then to run the server:

docker run -v DIR_WITH_MODELS:/models -v PATH_TO_SYNTHESIS_SET:/app/conf/synthesis_set.pbtxt \
           -p 8000:8000 tiro-tts
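
With the container listening on port 8000, a client request could be assembled roughly as follows. This is a sketch only: the `/v0/speech` path, the Polly-style JSON field names and the `Alfur` voice identifier are assumptions not stated in this README, so confirm them against the OpenAPI documentation before relying on them. The function only builds the request; nothing is sent.

```python
import json
import urllib.request


def build_speech_request(
    text: str,
    voice_id: str = "Alfur",  # assumed voice identifier
    output_format: str = "mp3",
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build (but do not send) a synthesis request for a local server."""
    body = {
        # Field names are assumed to mirror Amazon Polly's SynthesizeSpeech.
        "Text": text,
        "TextType": "text",
        "VoiceId": voice_id,
        "OutputFormat": output_format,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v0/speech",  # assumed endpoint path
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. If the assumptions hold, the response body would be expected to contain the audio in the requested format.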

License

Tiro TTS is licensed under the Apache License, Version 2.0. See LICENSE for more details. Some individual files may be licensed under different licenses, according to their headers.

Acknowledgments

This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.