Brainless concatenative text to speech.
JoinTTS is a simple off-line (on-premise) concatenative TTS nodejs API.
jointts is the formal name of this project, but you can simply call it joint, which is also the command line program alias. Ah! There's a funny double meaning in the name.🙄
The goal is to build a very simple, efficient concatenative speech synthesizer that, at run-time, concatenates prerecorded local audio files without any cloud access.
The system is suitable for applications with a small grammar (a limited set of sentences/words) and semi-static speech generation.
An example application is an embedded-system TTS whose output consists mainly of fixed sentences, with a small number of variable/dynamic parts, such as entities (codes, names) in template literals.
The target environment is therefore any sort of embedded system (on-premise/off-line) with limited CPU resources that needs real-time, responsive speech output.
The speech is produced by concatenating prepared audio source files for letters, words, template literals, and entire phrases. All the audio file "chunks" needed are prepared offline, so that they are available at run-time for fast concatenative audio generation.
Text-to-speech output consists of audio files or in-memory binary blobs (nodejs buffers) in a specific audio codec, such as PCM or OPUS.
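As a rough illustration of the two output codecs mentioned above, the sketch below builds ffmpeg argument arrays that convert an audio file to OPUS or to raw PCM. The convertArgs helper is hypothetical and is not part of the jointts API; how jointts actually invokes ffmpeg is an assumption here.

```javascript
// Hypothetical helper: ffmpeg arguments to convert one file to a target codec.
function convertArgs(inputFile, outputFile, codec) {
  if (codec === 'opus') {
    // Encode with libopus (requires an ffmpeg build with libopus enabled).
    return ['-y', '-i', inputFile, '-c:a', 'libopus', outputFile];
  }
  if (codec === 'pcm') {
    // Raw signed 16-bit little-endian PCM, written without a container.
    return ['-y', '-i', inputFile, '-f', 's16le', '-c:a', 'pcm_s16le', outputFile];
  }
  throw new Error(`unsupported codec: ${codec}`);
}

console.log(['ffmpeg', ...convertArgs('ABC123.mp3', 'ABC123.opus', 'opus')].join(' '));
console.log(['ffmpeg', ...convertArgs('ABC123.mp3', 'ABC123.pcm', 'pcm')].join(' '));
```

The file names are placeholders; any chunk or generated speech file could be converted the same way.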
Audio recordings can be sourced in two alternative ways:
- Real human voice recordings (by voice actors)
  This is especially useful, for example, in language education apps, for special purposes such as syllable pronunciation.
- Synthetic voice recordings
  You can, for example, use Google Translate TTS, or any TTS of your choice, to prepare speech files/buffers.
💡 Note that using a cloud-based TTS to generate audio chunks is more of a test setup, a workaround for when real human voice recordings are not available. Please read the disclaimer section for details.
Speech generation is language-dependent.
JoinTTS can be configured to manage many natural languages. See Multi-language doc.
Input texts can be handled at several levels:
- Static phrases
- Words concatenation
- Character-by-character spelling
- Template literals
See Text segmentation doc.
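To make the levels listed above more concrete, here is a minimal sketch that splits a text into chunks by trying a static phrase first, then known words, and finally falling back to character-by-character spelling. Template literals are left out for brevity. The phrases and words sets and the segment function are hypothetical and are not the jointts API; the actual rules are described in the Text segmentation doc.

```javascript
// Hypothetical vocabularies; in jointts these would come from the
// language-dependent configuration files (assumed, not the real format).
const phrases = new Set(['your code is']);
const words = new Set(['your', 'code', 'is']);

// Split a text into chunks: a whole phrase, known words, or single characters.
function segment(text) {
  const normalized = text.trim().toLowerCase();
  if (phrases.has(normalized)) {
    return [{ type: 'phrase', value: normalized }];
  }
  const chunks = [];
  for (const token of normalized.split(/\s+/).filter(Boolean)) {
    if (words.has(token)) {
      chunks.push({ type: 'word', value: token });
    } else {
      // Unknown token: spell it character by character.
      for (const ch of token) {
        chunks.push({ type: 'character', value: ch });
      }
    }
  }
  return chunks;
}

console.log(segment('Your code is'));
console.log(segment('code ABC123'));
```

Each resulting chunk would then be mapped to one prerecorded audio file, so a code such as ABC123 is rendered as six spelled characters.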
All required audio files are generated according to the configuration file settings, either from user voice recordings or from third-party (synthetic voice) sources to be downloaded.
The configuration files are:
characters.json
words.json
phrases.json
templates.json
They specify which file has to be used for the target concatenation.
Configuration files are language-dependent:
config/it/*.json
config/en/*.json
config/de/*.json
- ...
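The per-language layout above lends itself to a simple path lookup. The sketch below builds the configuration paths and an audio chunk path for a language code. The helper names are hypothetical, and the one-MP3-per-chunk convention (audio/it/a.mp3) is an assumption based on the diagrams in this README, not a documented API.

```javascript
// Configuration file names listed in this README.
const CONFIG_FILES = ['characters.json', 'words.json', 'phrases.json', 'templates.json'];

// Hypothetical helper: configuration paths for one language, e.g. 'it'.
function configPaths(language) {
  return CONFIG_FILES.map((name) => `config/${language}/${name}`);
}

// Hypothetical helper: audio chunk path, assuming one MP3 file per chunk name.
function audioPath(language, chunkName) {
  return `audio/${language}/${chunkName}.mp3`;
}

console.log(configPaths('it'));
console.log(audioPath('it', 'a'));
```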
+-------------------+
|                   |
|    joinTTS CLI    |
|                   |
+---------+---------+
          |
+---------v---------+
|                   |
| language grammar  |
| config generator  |
|                   |
+---------+---------+
          |
          v
   config/it/*.json
   config/en/*.json
   config/de/*.json
          |
          v
Audio source files can be made in 2 different ways:
- 🎙 Human voice recordings
  For a personalized voice experience, a voice actor can record all required audio files.
  🛠 TOCOMPLETE
- 🩹 Synthetic voice files
  Audio files are generated by a cloud-based TTS, such as Amazon Polly, Google Cloud Platform Text-to-Speech, etc., and downloaded as files.
  joinTTS uses, as an example only, the Google Translate Speech library. With the jointts (or joint) command line utility, speech MP3 files (containing the Google Translate synthetic voice) can be generated from texts:
  $ jointts download gt
                    +------------------+
                    |                  |
                    |    joinTTS CLI   |
                    |                  |
                    +---------+--------+
config/it/*.json              |
config/en/*.json              |
config/de/*.json              |
   |     |          +---------v--------+
   |     +---------->                  |
   |                |   audio files    |
   |                |   production     |
   |                |                  |
   |                +--------+---------+
   |                         |
   |                         v
   |                  audio/it/a.mp3
   |                  audio/it/b.mp3
   |                  audio/it/c.mp3
   |                  ...
   |                         |
   v                         v
At run-time, the main program calls the jointts run-time engine, which generates audio speech files on the fly by concatenating the available audio chunks.
           config/it/*.json
           config/en/*.json
           config/de/*.json
                   |              audio/it/a.mp3
                   |              audio/it/b.mp3
                   |              audio/it/c.mp3
                   |              ...
                   |                     |
         +---------v---------------------v----------+
         |                                          |
text --> |           joinTTS run-time API           | --> audio file
'ABC123' |                                          |     ABC123.mp3
         +------------------------------------------+
         |                  ffmpeg                  |
         +------------------------------------------+
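One possible way to perform the concatenation step with ffmpeg is its concat demuxer, which reads a text file listing the inputs. The sketch below only builds the list file content and the ffmpeg argument array. Whether jointts uses the concat demuxer internally is an assumption, and the function names are hypothetical.

```javascript
// Build the content of an ffmpeg concat-demuxer list file.
// Each line has the form: file '<path>'
function concatList(audioFiles) {
  return audioFiles
    // A single quote inside a path is written as '\'' in the list file.
    .map((file) => `file '${file.replace(/'/g, "'\\''")}'`)
    .join('\n') + '\n';
}

// Build ffmpeg arguments that join the listed files without re-encoding.
// '-c copy' works only if all chunks share the same codec and parameters.
function concatArgs(listFile, outputFile) {
  return ['-y', '-f', 'concat', '-safe', '0', '-i', listFile, '-c', 'copy', outputFile];
}

const files = ['audio/it/a.mp3', 'audio/it/b.mp3', 'audio/it/c.mp3'];
console.log(concatList(files));
console.log(['ffmpeg', ...concatArgs('list.txt', 'ABC.mp3')].join(' '));
```

After writing the list to disk, the argument array could be passed to Node's child_process.execFile('ffmpeg', args). Relative paths in the list are resolved relative to the list file's location.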
See functions documentation:
- 📦 Install ffmpeg
ffmpeg is used as the backend engine for all audio file conversion, playback, and concatenation.
sudo apt install ffmpeg
Optionally, to use OPUS codecs:
sudo apt install libopus0 opus-tools
- 📦 Install jointts
The package contains the command line program jointts, so you must install the npm package globally.
Clone this GitHub repo and link it:
$ git clone https://github.com/solyarisoftware/jointts
$ cd jointts && npm link
Or install it from the npm registry:
$ npm install -g jointts
Listen here to examples of spelled-out audio rendering for alphanumeric codes.
WORK-IN-PROGRESS / DRAFT.
So far, the project is a proof of concept in a pre-alpha stage, with 60% of features implemented. Smart high-level usage has yet to be defined.
JoinTTS run-time usage is intended to run in a private environment. You are responsible for managing the privacy, permissions, and licenses of all your files.
If you use cloud-based TTS platforms (such as Amazon Polly, Google TTS, etc.) to download synthetic voice files in the preparation step, it's your responsibility not to break any license or copyright.
In the same way, if you use voice recordings of other people, please make sure you have permission to do so.
MIT (c) Giorgio Robino