Middleware module for our speech synthesis systems.
Many common tags are assumed implicitly. Read this for an overview of the SSML specification.
- Sentence level:
  - `<prosody>` with `rate`, `pitch`, and `volume` attributes.
  - `<phoneme>` with `ipa` attribute.
  - `<voice>` with `gender` and `name` attributes.
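As an illustration of how these tags compose, here is a hedged sketch that builds and inspects such markup with the Python standard library. The SSML string itself is a hypothetical example, not taken from the library's test suite:

```python
import xml.etree.ElementTree as ET

# Hypothetical SSML input combining the supported sentence-level tags.
ssml = (
    "<speak>"
    "<voice gender='female' name='excited'>"
    "<prosody rate='1.3' volume='10'>hello world</prosody>"
    "</voice>"
    "<phoneme ipa='təˈmɑːtəʊ'>tomato</phoneme>"
    "</speak>"
)

root = ET.fromstring(ssml)
prosody = root.find(".//prosody")
print(prosody.get("rate"))                 # attribute values arrive as strings
print(root.find(".//phoneme").get("ipa"))  # IPA transcription for the word
```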
- Install `sox`.
- Run `pip install tts-middleware`.
For full-featured inference, simply wrap your TTS function (text to audio) with the decorator like this:
```python
from tts_middleware.core import tts_middleware, Audio
import numpy as np

@tts_middleware
def tts(text: str, language_code: str) -> Audio:
    # Do requests and return audio
    ...

# Now calls to `tts` will support SSML with all features enabled.
```
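A backend for the decorator is just a plain function from text to audio. The sine-wave synthesizer below is a hypothetical stand-in for a real TTS engine, assuming the audio is returned as a (samples, sample rate) pair; it only illustrates the expected shape of such a function:

```python
import numpy as np

SAMPLE_RATE = 22050  # assumed sample rate for this sketch

def dummy_tts(text: str, language_code: str):
    """Hypothetical backend: returns a quiet 440 Hz tone, one second per word."""
    duration = max(len(text.split()), 1)  # seconds
    t = np.linspace(0, duration, duration * SAMPLE_RATE, endpoint=False)
    samples = 0.1 * np.sin(2 * np.pi * 440.0 * t)
    return samples, SAMPLE_RATE

audio, sr = dummy_tts("hello world", "en")
print(audio.shape, sr)  # two words -> two seconds of samples
```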
Attributes for SSML tags are described next:

- `<prosody rate='1.3'>hello world</prosody>`.
- `<prosody pitch='2'>hello world</prosody>`. The parameter is the number of semitones, as described in pysox here.
- `<prosody volume='10'>hello world</prosody>`. The parameter is gain in dB, similar to pysox here.
- `<voice gender="female" name="excited"> hello world! </voice>`. The `voice` element supports two attributes: `gender` and `name`.
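The units behind these attributes follow standard audio conventions: a pitch shift of n semitones multiplies frequency by 2^(n/12), and a gain of g dB multiplies amplitude by 10^(g/20). A quick sketch of that math (plain Python, independent of pysox):

```python
import math

def semitones_to_frequency_ratio(n: float) -> float:
    """Pitch shift of n semitones as a frequency multiplier (equal temperament)."""
    return 2.0 ** (n / 12.0)

def db_to_amplitude_ratio(g: float) -> float:
    """Gain of g decibels as a linear amplitude multiplier."""
    return 10.0 ** (g / 20.0)

print(semitones_to_frequency_ratio(12))  # one octave -> 2.0
print(db_to_amplitude_ratio(20))         # +20 dB -> 10x amplitude
```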
There is a Streamlit app which you can use to try the API by doing the following:
```shell
# You need to install espeak for this.
poetry install
poetry run streamlit run ./examples/app.py
```
There are three major components here, all of which can be used in isolation:

- For implicit normalization of SSML-marked text. No normalization-level tags are supported at the moment, so this only touches raw text.
- For converting normalized and `<phoneme>`-marked text into phone symbols. This can also be used independently for pre-processing training data.
- For applying signal-level post-processing steps (mostly `rate` and `volume` attributes) to generated audio.
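As an illustration of what a signal-level transform amounts to, here is a hedged numpy sketch, not the library's implementation, of applying a `volume` gain in dB and a naive `rate` change via linear-interpolation resampling:

```python
import numpy as np

def apply_volume(samples: np.ndarray, gain_db: float) -> np.ndarray:
    """Scale amplitude by a dB gain (factor 10^(dB/20))."""
    return samples * (10.0 ** (gain_db / 20.0))

def apply_rate(samples: np.ndarray, rate: float) -> np.ndarray:
    """Naive tempo change by resampling.

    Note: unlike a proper tempo effect (e.g. sox's), this also shifts pitch.
    """
    n_out = int(round(len(samples) / rate))
    positions = np.linspace(0, len(samples) - 1, n_out)
    return np.interp(positions, np.arange(len(samples)), samples)

# One second of a 440 Hz tone as demo input.
audio = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 22050, endpoint=False))
louder = apply_volume(audio, 6.0)  # +6 dB, roughly 2x amplitude
faster = apply_rate(audio, 1.3)    # about 30% shorter
```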