Cainteoir Text-to-Speech Engine
The Cainteoir Text-to-Speech engine is a library that provides text-to-speech functionality for reading and recording different document formats.
In order to build cainteoir-engine, you need:
- a functional autotools system;
- a functional C++ compiler;
- the Python YAML parser library;
- the zlib development libraries;
- the shared mime info package.
Optionally, you need:
- the pulseaudio development library to enable pulseaudio output;
- the alsa development libraries to enable alsa audio output;
- the vorbis encoder development libraries for ogg/vorbis support;
- the espeak development libraries for espeak support;
- the pico development libraries for svox pico support;
- the poppler development libraries for pdf support.
If you want ePub 3.0 Media Overlay support, you need FFmpeg or libav v9 or later, with the following libraries installed:
- libavresample — for converting differently sampled audio files to the one used by the TTS voice.
To build the documentation, you need:
- the kramdown program to build the general documentation;
- the doxygen program to build the api documentation;
- the documentation generator project (https://github.com/rhdunn/documentation-generator).
The Cainteoir Engine supports the standard GNU autotools build system. The source code does not contain the generated configure files, so to build it you need to run:
    ./autogen.sh
    ./configure --prefix=/usr
    make
The tests can be run by using:
    make check
The program can be installed using:
sudo make install
Source tarballs can be generated by running:
    make dist
To support building the documentation, you need to inform the build where the documentation-generator project is located. This can be done by:
    cd ..
    git clone git://github.com/rhdunn/documentation-generator.git
    cd cainteoir-engine
    ./configure --with-docgen=../documentation-generator
The documentation can be built by running:
Alternatively, just the API documentation can be built by running:
NOTE: You need a recent version of doxygen (such as 1.8.5) that supports C++11 constructs, specifically scoped enumerations.
Comparison With eSpeak
The Cainteoir Text-to-Speech engine has support for using eSpeak to synthesize text. Its architecture differs from eSpeak's in the following ways:
The document processing phase is separate from the text processing phase.
This means that Cainteoir Text-to-Speech can support a wider variety of text formats such as ePub, PDF and RTF, as well as providing support for ePub 3 Media Overlays. The HTML and SSML processing in eSpeak is tied to the text processing module which makes it difficult to test and maintain.
This allows Cainteoir TTS to correctly detect and process an HTML page that just contains an email document complete with MIME tags, so that the email is processed instead of being treated as plain text.
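The separation described above can be illustrated with a minimal Python sketch. It is not the engine's actual C++ API: the `READERS` registry, the reader functions and `process_text` are hypothetical names, and the HTML handling is deliberately crude. The point is only that the text phase never needs to know which document format the input came from.

```python
import html
import re


def read_plain_text(data):
    # Plain text needs no document-level processing.
    return data


def read_html(data):
    # Crude tag stripping followed by entity decoding (illustration only).
    return html.unescape(re.sub(r"<[^>]+>", " ", data))


# Hypothetical registry: one document reader per supported format.
READERS = {"text/plain": read_plain_text, "text/html": read_html}


def process_text(text):
    # The text phase is format-agnostic: it only ever sees extracted text.
    return text.lower().split()


def tokens_for_document(mimetype, data):
    # Document phase first, then the shared text phase.
    return process_text(READERS[mimetype](data))
```

Under this arrangement, supporting a new format means adding a reader, and the text phase can be tested with plain strings.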
Numbers are converted to words instead of phonemes.
This allows the numeric and word forms to be pronounced consistently and can more easily follow accent variations. In eSpeak they can be inconsistent; for example, 16 and sixteen have a different stress placement.
Currently, the number handling in Cainteoir TTS is not as powerful as that in eSpeak: it cannot handle the masculine, feminine and neuter forms found in German and other languages, or other language-specific variations such as those found in Czech and Irish Gaelic.
Support for large numbers.
Numbers up to 10^99 can be supported in the current engine, and it is easy to support even higher numbers. eSpeak, by contrast, reads "thousand billion" in both US and UK English instead of "trillion" for US English. For higher numbers, eSpeak speaks each digit individually.
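As a rough illustration of converting numbers to words rather than directly to phonemes, the following Python sketch spells out non-negative integers using short-scale (US English) group names. It is an assumption-laden toy: the tables, the function names and the limit of six scale names are chosen for brevity and do not reflect how the engine stores or extends its number data.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
# Short-scale group names: SCALES[i] names 10**(3*i).
SCALES = ["", "thousand", "million", "billion", "trillion", "quadrillion"]


def _below_thousand(n):
    # Spell out a value in the range 1..999 as a list of words.
    words = []
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n:
        words.append(ONES[n])
    return words


def number_to_words(n):
    if n < 0:
        raise ValueError("negative numbers are not handled in this sketch")
    if n == 0:
        return "zero"
    groups = []  # three-digit groups, least significant first
    while n:
        groups.append(n % 1000)
        n //= 1000
    if len(groups) > len(SCALES):
        raise ValueError("number too large for this sketch")
    words = []
    for i in reversed(range(len(groups))):
        if groups[i]:
            words += _below_thousand(groups[i])
            if SCALES[i]:
                words.append(SCALES[i])
    return " ".join(words)
```

Because the output is ordinary words, the same dictionary and rules that pronounce "sixteen" in running text also pronounce the digits 16, and supporting larger numbers is a matter of extending the `SCALES` table.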
Future work and evolution of Cainteoir Text-to-Speech:
Separation of language from voice.
In eSpeak, language and voice are the same thing. Thus, it is difficult to use a different voice for a given language (eSpeak has a hard-coded map of supported MBROLA voices). It is also difficult to get a voice to support different languages (some of the MBROLA voices eSpeak supports have English variants, but there are no mappings to other languages).
For Cainteoir, a voice is just a phoneme synthesizer, a language is a mapping from words to phonemes and an accent is a mapping from phonemes to phonemes. Thus, any of these can be mixed and matched as long as they have compatible phoneme sets.
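The three roles described above can be sketched in Python as two lookup tables and a function. Everything here is hypothetical: the tiny word list, the accent table and `toy_voice` (which returns a label string instead of audio) exist only to show how the pieces compose when they share a phoneme set.

```python
# Hypothetical language: a mapping from words to phonemes.
LANGUAGE_EN = {
    "bath": ["b", "ɑː", "θ"],
    "cat": ["k", "æ", "t"],
}

# Hypothetical accent: a mapping from phonemes to phonemes.
ACCENT_SHORT_A = {"ɑː": "a"}


def to_phonemes(words, language, accent):
    # Language step (word -> phonemes) followed by accent step
    # (phoneme -> phoneme); unmapped phonemes pass through unchanged.
    phonemes = []
    for word in words:
        for phoneme in language[word]:
            phonemes.append(accent.get(phoneme, phoneme))
    return phonemes


def toy_voice(phonemes):
    # Stand-in for a phoneme synthesizer: produces a label, not audio.
    return "-".join(phonemes)
```

Swapping `ACCENT_SHORT_A` for an empty mapping changes the pronunciation without touching the language table or the voice, which is the kind of mixing and matching the paragraph describes.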
Support for different dictionaries and pronunciation rules.
The dictionary and rule set architecture is such that it is easy to support different formats for these, so Cainteoir should support as many as possible. It is possible to mix and match dictionaries and pronunciation rules.
User and document dictionaries.
Users will want to provide customized pronunciations for words that are mispronounced or are not in the user's accent. Likewise, a document such as ePub may reference a pronunciation dictionary containing pronunciations of words a Text-to-Speech engine is likely to get wrong (like Latin, foreign or made up names).
In eSpeak, a user must build their pronunciation changes into the main eSpeak dictionary file. With Cainteoir, it should be possible to add user-level and document-level pronunciations.
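One way to picture user-level and document-level pronunciations is as layered dictionaries consulted in order. The Python sketch below uses `collections.ChainMap` for this; the entries, the phoneme strings and the lookup order (document first, then user, then the main dictionary) are assumptions made for illustration, not a description of the engine's behaviour.

```python
from collections import ChainMap

# Hypothetical entries; the phoneme strings are illustrative only.
main_dictionary = {"read": "r iː d", "gif": "g ɪ f"}
user_dictionary = {"gif": "dʒ ɪ f"}
document_dictionary = {"hermione": "h ɜː m aɪ ə n i"}

# Assumed lookup order: document entries override user entries,
# which in turn override the main dictionary.
pronunciations = ChainMap(document_dictionary, user_dictionary, main_dictionary)
```

With this layering, neither the user nor the document needs to modify or rebuild the main dictionary to change how a word is spoken.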
Precise phonetic pronunciations.
The sounds supported by a voice should use precise articulation based on the phonetic features of that phoneme, not its meaning for a speaker in a given language. That is, each sound should be as narrow and specific as possible.
Disambiguation of allophones (e.g. aspirated plosive at the start of a word) is done after the conversion of text to phonemes, but before synthesis. Thus, allophones are a feature of the language/accent, not the voice.
As such, it should be possible to use the phoneme transcription schemes to parse phonemes transcribed in IPA, ASCII-IPA or another scheme and feed them directly to the voice to be synthesized, without the synthesizer modifying the sounds as happens with eSpeak.
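The allophone step can be pictured as a small transformation that runs on the phonemes of each word before they reach the voice. The Python sketch below marks a word-initial voiceless plosive as aspirated; the rule, the plosive set and the function name are simplifying assumptions for an English-like accent, not the engine's rule format.

```python
VOICELESS_PLOSIVES = {"p", "t", "k"}


def apply_aspiration(word_phonemes):
    # Return a new list in which a word-initial voiceless plosive
    # carries the IPA aspiration mark; other phonemes are unchanged.
    if word_phonemes and word_phonemes[0] in VOICELESS_PLOSIVES:
        return [word_phonemes[0] + "ʰ"] + list(word_phonemes[1:])
    return list(word_phonemes)
```

Since this runs as part of the language or accent processing, a voice that receives `pʰ` only has to produce that precise sound, and phonemes fed directly from a transcription bypass the rule entirely.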
Generate an exception dictionary from a reference dictionary.
In eSpeak the generation of the exception dictionary is done by hand. This results in some words that can be pronounced from the letter-to-phoneme rules being present in the dictionary. Also, changes in the letter-to-phoneme rules can cause the pronunciation of some words to regress.
To resolve those issues, a reference dictionary should be created with a pronunciation of words that is known to be valid. From this, any word that is not pronounceable via the rule set should be added to the exception dictionary. This ensures that the exception dictionary is as small as possible and that the words in the reference dictionary are pronounced correctly.
It also makes it easier to test, analyse and experiment with the rule sets. For example, statistics can be run to see how many times a given rule was matched and how many times it did not match.
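The generation step can be sketched in a few lines of Python. The one-letter-per-phoneme rule table and the two-word reference dictionary below are toy assumptions; real letter-to-phoneme rules are context-sensitive, but the comparison against the reference pronunciation works the same way.

```python
def build_exceptions(reference, pronounce_by_rules):
    # Keep only the words whose rule-based pronunciation differs
    # from the known-good reference pronunciation.
    return {
        word: phonemes
        for word, phonemes in reference.items()
        if pronounce_by_rules(word) != phonemes
    }


# Toy rule set (hypothetical): each letter maps to exactly one phoneme.
LETTER_RULES = {"c": "k", "a": "æ", "t": "t", "o": "ɒ", "n": "n", "e": "ɛ"}


def toy_rules(word):
    return [LETTER_RULES.get(letter, letter) for letter in word]


REFERENCE = {
    "cat": ["k", "æ", "t"],
    "one": ["w", "ʌ", "n"],
}

exceptions = build_exceptions(REFERENCE, toy_rules)
```

Here only "one" ends up in the exception dictionary, because "cat" is already pronounced correctly by the rules. Re-running the function after a rule change regenerates the exceptions, so regressions show up as new entries.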
Support selecting which language to use to synthesize language scripts.
The Cainteoir engine detects which script (Latin, Greek, Cyrillic, etc.) each section of text is written in. It should be possible to configure what language each script should be read in to provide a more fluent experience.
The eSpeak engine has limited support for this where, for example, the Bulgarian language will use Bulgarian for Cyrillic and English for Latin text. For other languages, it speaks the language name and then the word "letter", which is not useful when, for example, a user is trying to learn Japanese.
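A configurable script-to-language mapping might look like the following Python sketch. It approximates the script by taking the first word of each character's Unicode name via the standard `unicodedata` module, which is an assumption made for brevity and not how the engine detects scripts; the `SCRIPT_LANGUAGE` table and its language codes are likewise hypothetical.

```python
import unicodedata

# Hypothetical user configuration: which language reads which script.
SCRIPT_LANGUAGE = {
    "LATIN": "en",
    "CYRILLIC": "bg",
    "GREEK": "el",
    "HIRAGANA": "ja",
}


def script_of(character):
    # Approximate the script by the first word of the Unicode name,
    # e.g. "CYRILLIC SMALL LETTER DE" -> "CYRILLIC".
    return unicodedata.name(character, "UNKNOWN").split()[0]


def language_for_text(text, default="en"):
    # Use the first alphabetic character to choose the language.
    for character in text:
        if character.isalpha():
            return SCRIPT_LANGUAGE.get(script_of(character), default)
    return default
```

A learner of Japanese could then map the relevant scripts to a Japanese voice while leaving Latin text in their own language.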
Support different voice synthesizers.
In eSpeak, support for MBROLA phonemes is mixed in with its support for spectral, klatt and wave phonemes. This is done at the wrong level and makes it difficult to maintain.
There should be a voice/phoneme synthesizer interface that takes the phonemes and associated prosodic information and generates audio from that. Supported synthesizers could include:
- MBROLA — use the MBROLA diphone synthesizer
- eSpeak — use the eSpeak spectral, klatt and wave synthesizer
- Klatt — use the Klatt synthesizer with Holmes' vocal tract parameters (as is done by rsynth)
- diphone concatenation synthesizer — combining recorded samples of phonemes from a speaker reading some text
- vocal tract synthesizer — modelling the acoustic behaviour of the vocal tract including the vocal cords, mouth, tongue and nose
When loading the voice, the voice data is passed to the voice synthesizer. This tells the voice synthesizer how to handle the different phonemes and generate the audio for those. For example, an MBROLA voice will provide the MBROLA voice name (e.g. de1) and a mapping of phonemes to the voice's phoneme set including any prosody transformations (e.g. for voices that do not have different phonemes for the long and short variants of a given vowel).
The voice/phoneme synthesizer object is then passed to the engine so the engine can feed phonemes to the synthesizer after the text to phoneme and prosodic analysis phases.
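The interface described above could be sketched in Python as an abstract base class plus one implementation configured from voice data. All names here (`Synthesizer`, `MappedVoice`, `run_engine`) are hypothetical, the output is a list of symbolic `(phoneme, duration)` pairs instead of audio, and the example mapping of a long vowel to a short phoneme with doubled duration is an invented prosody transformation, not real MBROLA voice data.

```python
from abc import ABC, abstractmethod


class Synthesizer(ABC):
    # Hypothetical interface: phonemes with prosody in, audio (here: symbols) out.
    @abstractmethod
    def synthesize(self, phonemes):
        ...


class MappedVoice(Synthesizer):
    def __init__(self, name, phoneme_map):
        # phoneme_map: engine phoneme -> (voice phoneme, duration scale)
        self.name = name
        self.phoneme_map = phoneme_map

    def synthesize(self, phonemes):
        output = []
        for phoneme, duration_ms in phonemes:
            mapped, scale = self.phoneme_map.get(phoneme, (phoneme, 1.0))
            output.append((mapped, int(duration_ms * scale)))
        return output


def run_engine(synthesizer, phonemes):
    # The engine only depends on the Synthesizer interface.
    return synthesizer.synthesize(phonemes)


# Voice data for a voice without a separate long-vowel phoneme (assumed).
voice = MappedVoice("de1", {"aː": ("a", 2.0)})
```

Because `run_engine` accepts any `Synthesizer`, adding a Klatt or diphone back end would not require changes to the text-to-phoneme or prosody stages.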
Report bugs to the cainteoir-engine issues page on GitHub.
The Cainteoir Text-to-Speech Engine is released under the GPL version 3 or later license.
Cainteoir is a registered trademark of Reece Dunn.
W3C is a trademark (registered in numerous countries) of the World Wide Web Consortium; marks of W3C are registered and held by its host institutions MIT, ERCIM, and Keio.
All trademarks are property of their respective owners.