This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Adding support for Windows SAPI5 implementation #40

Open
king-dahmanus opened this issue Nov 14, 2021 · 8 comments
Labels
enhancement New feature or request

Comments

@king-dahmanus

Hey there developers! I found this repo while exploring, and I'd like to make some requests.
Firstly: releasing a Windows SAPI5 version of the TTS engine, compatible with all the voices that are available, with the necessary encoders integrated so that synthesis is fast and responsive. Details below.
I am a blind person who uses a screen reader to operate the computer. Blind people like me need a responsive speech synthesizer so we can receive the requested information without unnecessary delays, and quite a large portion of us need very fast speech output that doesn't produce weird voice artifacts such as those of natural-sounding TTS voices. If I didn't appreciate how much work it would take, I would ask you to make an NVDA add-on containing the synthesizer along with a way to download the voices, but a more mainstream, Windows-integrated option like SAPI5 would perhaps be a little easier?
Anyway, I know that this project is aimed at Raspberry Pi/command-line usage, but the currently available voices attracted someone like me who would prefer a more practical option for, say, daily usage. I look forward to your response. This is just a request from me; if it can't be done, it can't be done. So thanks, and have a good time.

@synesthesiam synesthesiam added the enhancement New feature or request label Nov 14, 2021
@synesthesiam
Contributor

Hi @king-dahmanus, thanks for your feedback! I would definitely be interested in adding SAPI5 support for Windows in order to make Larynx more accessible to everyone. I'll have to look into what it would take to implement a TTS engine interface.

I've experimented with getting the voices much more responsive in my Glow Speak project, which runs a daemon and caches all of the WAV files it produces (it also uses eSpeak to turn text into phonemes). As you mentioned, though, there are weird artifacts for short phrases, especially single words. I believe this is largely a problem with the datasets I have; none of them feature single word utterances, and many of them have sentences split across multiple utterances (so no pauses at the beginning or end).
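
For illustration, here is a minimal sketch of that caching approach, assuming a hypothetical `synthesize_wav()` call in place of the real model; it is not Glow Speak's actual code, only the general idea of phonemizing with eSpeak NG and caching the resulting WAV files:

```python
# Sketch only: synthesize_wav() is a stand-in for the actual TTS model call.
import hashlib
import subprocess
from pathlib import Path

CACHE_DIR = Path("wav_cache")
CACHE_DIR.mkdir(exist_ok=True)

def phonemize(text: str, voice: str = "en-us") -> str:
    """Use eSpeak NG to turn text into IPA phonemes (espeak-ng must be installed)."""
    result = subprocess.run(
        ["espeak-ng", "-q", "--ipa", "-v", voice, text],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def speak(text: str) -> Path:
    """Return a cached WAV for the text, synthesizing only on a cache miss."""
    phonemes = phonemize(text)
    key = hashlib.sha256(phonemes.encode("utf-8")).hexdigest()
    wav_path = CACHE_DIR / f"{key}.wav"
    if not wav_path.exists():
        wav_path.write_bytes(synthesize_wav(phonemes))  # hypothetical model call
    return wav_path
```

Keying the cache on the phoneme string rather than the raw text means different spellings that phonemize identically reuse the same audio, which helps responsiveness for a screen reader's short, repetitive utterances.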

Do you know of any public audio datasets that contain only complete sentences and single spoken words? If not, would you be interested in collaborating to create one?

@king-dahmanus
Author

king-dahmanus commented Nov 14, 2021 via email

@synesthesiam
Contributor

Do you, or anyone you know, have a pleasant voice, a good microphone, and a lot of patience? 🙂

I've worked with several people to create text to speech datasets. I use an algorithm to select a (relatively) small set of phonetically diverse sentences from a public domain book or corpus. Here, I would also make sure that we have a diversity of single spoken words.
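
As a rough illustration of that kind of selection (a guess at the general idea, not necessarily the algorithm the author uses), a greedy pass can keep adding the sentence that covers the most phoneme pairs not yet seen:

```python
# Sketch only: greedy selection of phonetically diverse sentences.
# `phonemize` is any callable that maps a sentence to its phoneme string.
from typing import Callable, Iterable, List, Set, Tuple

def select_diverse(sentences: Iterable[str],
                   phonemize: Callable[[str], str],
                   max_count: int = 500) -> List[str]:
    """Greedily pick sentences that add the most not-yet-covered diphones."""
    candidates: List[Tuple[str, Set[Tuple[str, str]]]] = []
    for sentence in sentences:
        phonemes = phonemize(sentence)
        candidates.append((sentence, set(zip(phonemes, phonemes[1:]))))

    covered: Set[Tuple[str, str]] = set()
    chosen: List[str] = []
    while candidates and len(chosen) < max_count:
        # Take the sentence contributing the most new diphones.
        sentence, diphones = max(candidates, key=lambda c: len(c[1] - covered))
        if not diphones - covered:
            break  # nothing new left to cover
        chosen.append(sentence)
        covered |= diphones
        candidates.remove((sentence, diphones))
    return chosen
```

Ensuring a diversity of single spoken words, as mentioned above, would just mean adding isolated words to the candidate pool and treating each one as a one-item "sentence" in the same loop.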

@king-dahmanus
Author

king-dahmanus commented Nov 17, 2021 via email

@king-dahmanus
Author

king-dahmanus commented Nov 17, 2021 via email

@synesthesiam
Contributor

The Common Voice datasets are excellent, but not ideal for a text to speech voice. For text to speech, you want a lot of high quality data from very few speakers (no noise, if possible). For speech to text, however, Common Voice is great -- lots of noisy data from many speakers.

Let me look around a bit more before asking you to do any recording. A lot of the text to speech datasets are derived from LibriVox, and I'm hoping there will be a book there where the author reads out lists of items so we can get isolated spoken words.

@king-dahmanus
Author

king-dahmanus commented Nov 19, 2021 via email

@king-dahmanus
Author

Hey, what's new? Are you working on something yet, Michael? I mean to tell you something: for the moment we could set the dataset issue aside and concentrate on making this thing support SAPI5 on Windows. Also, the speed I'm talking about isn't about pronouncing words with the right intonation; it's about being able to speak at very fast speech rates without producing weird artifacts, and being responsive, with no lag or delay before speaking. Maybe this is already accomplished since it's designed for the Raspberry Pi, but still. Thanks, and have a good time.
