
TTS Does not handle numbers in text #78

Closed · nikito opened this issue Jun 1, 2023 · 24 comments

@nikito (Contributor) commented Jun 1, 2023

I tried testing the TTS using a generated text response from my HA instance as follows:
Currently, the weather is 55 degrees with partly cloudy skies. Under present weather conditions, the temperature feels like 55 degrees. In the next few hours you can expect more of the same, with a temperature of 55 degrees.

What I noticed is that the TTS generated silence for the number 55 but spoke all the other text. It seems it does not know how to handle numeric values?

I noticed it behaved similarly when reporting times, such as 8:55 AM. I haven't tried it yet, but I imagine it may have similar trouble handling date strings as well. Maybe there's a way to have it handle these specific numeric formats?

EDIT: Just tried the string "Today is Thursday, June 01 2023." and it was silent on all the numbers. Also tried "Today is Thursday, June 1st 2023."; it says "st" for the "1st" part and is silent on all the other numbers.

@kristiankielhofner (Contributor)

This is a known limitation of the SpeechT5 model we're using. We have an open issue to use a completely different model that would be superior in every regard (including something as basic as speaking numbers):

#60

We have a wisng branch that will be merged shortly with a new TTS engine. We're still evaluating engines, but this issue will be addressed then.

@nikito (Contributor, Author) commented Jun 1, 2023

Saw the other issue and suspected that'd be the case, but figured I'd mention it just in case. Looking forward to trying out the new engine! 😃

@kristiankielhofner (Contributor)

While I've implemented many of the candidate engines, I'm a little frozen by the "paradox of choice" at this point. There are so many options in terms of voices that I'm trying to determine what the community generally would find most pleasing. If you have any input on the various options and voices, I'd love to hear it!

@nikito (Contributor, Author) commented Jun 1, 2023

Guess that depends on the engine 🤣 I find it interesting that Coqui seems to let you clone a voice based on an audio sample, which could be useful for letting the community forge their own voices. Barring that, I imagine the community would probably want something like Jarvis, i.e. a British-sounding male voice, or, on the female side, something akin to the Alexa/Google voices (I think studies show female voices tend to sound better due to the way their tonalities are processed). Personally I'm trying to find a voice with an Irish accent (something akin to F.R.I.D.A.Y. from the Marvel movies 😄), as my wife is Irish and likes the idea of our voice assistant having that kind of accent.

The other side of this is, of course, multi-language support, which widens the field of choices even further; I imagine people will want the voice to sound good while also pronouncing words correctly in their native languages.

Are we trying to settle on a general voice for now, and later give people the ability to pick or even create their own?

@nikito (Contributor, Author) commented Jun 1, 2023

Another option could be to choose several voices that sound good, then make a poll with samples and let the community vote for their favorite? 😃

In terms of the systems out there, I admittedly haven't played with a lot of them; the last two I tried were Mimic3 and, more recently, Piper. I don't think either of those leverages GPGPU, though, so they may not be as performant as you're looking for. Other than that, I've been playing around with the Coqui.ai console and the different voices you can generate there, and it seems pretty neat. I may try to spin up a local instance and see what can be done with the TTS voices they include with the distro, and try out the cloning to see how that works as well.

@kristiankielhofner (Contributor)

The good (and bad) news is that SpeechT5 is so poor we almost can't go wrong with ANYTHING else. I selected it because the other contender projects are significantly more involved: SpeechT5 in total was a few lines for an HF Transformers model. The other projects are... not.

At this point I'd be fine with input from another engaged community member as well as the internal team on selecting a voice. It just needs to get done, and if we don't nail it this go-around we'll revisit it (or make it modular, etc.). My strong preference is for the highest-quality generic American female voice as the priority/default. As you note, this is a fairly well-researched field, and generally speaking, studies show that people tend to prefer it.

Custom voices are actually fairly straightforward and even supported with SpeechT5 now. Look at /docs (or /api/docs with wisng) for the speaker management APIs. Be advised, though: much like the rest of SpeechT5, the results aren't fantastic.

We still primarily target GPGPU for all of the reasons noted in the README and elsewhere, with the ability to also run CPU-only with speed tradeoffs.

Coqui is really easy to spin up locally - integrating it in WIS is another story, but it's certainly something we can do. If you want to install it locally (their docker implementation is solid) and provide feedback that would be great!

@nikito (Contributor, Author) commented Jun 2, 2023

I actually played with some of the Coqui voices last night on Hugging Face, and found the Jenny TTS voice to be really good! I'm going to try to spin up a local instance and see how it does on my GTX 1070. 😃

@kristiankielhofner (Contributor)

Great! You can also try some of the Tortoise voices; they're generally pretty highly regarded.

@nikito (Contributor, Author) commented Jun 2, 2023

Just spun up the Docker image with CUDA enabled. Performance on the GTX 1070 seems really good; I'm getting an RTF of 0.15, which is really fast! I'm also running this at the same time as WIS, and both are sharing the GPU just fine. Here's a screenshot of nvidia-smi:
[nvidia-smi screenshot]
Plenty of space even with all of the WIS models loaded as well as TTS.

I'm quite pleased with the Jenny voice personally, exactly what I was looking for. I'll play around with some of the other voices just to see how those are as well.

EDIT: Adding on here so as not to spam the thread 😆
So I tried to use the Tortoise model but got an error when spinning up the server; maybe I'm using it wrong somehow.
I also tried the vctk/vits models with several different voices. They all sounded very high quality, and speed was extremely fast as well. So far I think going with Coqui is not a bad choice based on what I'm seeing: performance is very good, it plays nicely with WIS running on the same GPU, and there are lots of options in terms of voices. I still like the Jenny one most so far, though I'll admit my personal taste is a bit at play there 😄
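
If anyone wants to poke at the same models, here's a minimal sketch of driving them through Coqui's Python API. The model ids and speaker names below are examples and can vary between TTS releases, so verify them with list_models() first:

```python
# Minimal sketch using Coqui TTS's Python API. Model/speaker names are
# examples and may differ between TTS releases.
from TTS.api import TTS

# See which model ids your installed release actually ships.
print(TTS.list_models())

# Multi-speaker English VITS model trained on VCTK, loaded on the GPU.
tts = TTS(model_name="tts_models/en/vctk/vits", gpu=True)
print(tts.speakers[:5])  # VCTK speaker ids look like "p225", "p226", ...

tts.tts_to_file(
    text="Currently, the weather is 55 degrees with partly cloudy skies.",
    speaker="p225",
    file_path="vctk_sample.wav",
)

# The single-speaker Jenny model takes no speaker argument.
jenny = TTS(model_name="tts_models/en/jenny/jenny", gpu=True)
jenny.tts_to_file(text="Today is Thursday, June first.", file_path="jenny_sample.wav")
```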

@kristiankielhofner (Contributor)

Great to see!

The good news is wisng has extremely performant caching of TTS responses via nginx (and we use Cloudflare tiered caching with cache reserve for our hosted instances), so speed is less of a concern; for the time being, most TTS output is very repetitive.

@kristiankielhofner (Contributor) commented Jun 2, 2023

Another positive for Jenny

TLDR - There is initial support for ONNX export of VITS models. With the ONNX CUDA execution provider this speeds up VITS by 20% (or so), and on newer GPUs with tensor cores the ONNX TensorRT execution provider (likely) speeds it up dramatically (as tends to be the case with tensor cores).

We would need to add onnxruntime (easy enough), and we already have code to detect tensor core availability, so it would be pretty straightforward to pull this off.
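
As a rough illustration of the idea (not WIS's actual detection code), the provider selection could look something like this, using PyTorch for device introspection:

```python
# Illustrative sketch only - not WIS's actual tensor-core detection code.
import torch

def pick_onnx_providers() -> list[str]:
    providers = ["CPUExecutionProvider"]
    if torch.cuda.is_available():
        providers.insert(0, "CUDAExecutionProvider")
        # Tensor cores arrived with Volta (compute capability 7.0+);
        # those are the parts where the TensorRT EP should shine.
        major, _minor = torch.cuda.get_device_capability(0)
        if major >= 7:
            providers.insert(0, "TensorrtExecutionProvider")
    return providers

print(pick_onnx_providers())
```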

@lordratner

Is there somewhere to listen to the voice options? I'd be happy to add another vote to the mix.

Just got my 1070 added to the server, so I should have a WIS instance spun up soon.

@nikito (Contributor, Author) commented Jun 3, 2023

I personally just spun up their CUDA-based Docker container and played around with the voices on that. They detail how to do that here: https://tts.readthedocs.io/en/latest/docker_images.html

Make sure to use the GPU version. 😊

@kristiankielhofner (Contributor) commented Jun 4, 2023

I added support in wisng to convert numbers to words with SpeechT5, so that will at least get you support for numbers. You can use it with the wisng branch.

I've also used the TTS Docker containers extensively. My general sense is that TTS would be fairly difficult to integrate directly into WIS; it has a TON of Python dependencies pinned to specific versions, and it would be tough to integrate them cleanly.

I spent some time getting CPU, CUDA, and TensorRT working with an ONNX export of the VITS TTS model. With TensorRT on my 3090 it can do TTS at approximately 10x realtime (as a first pass). That's a substantial boost with TensorRT, and while it will only work on newer GPUs, even with the CUDA runtime it's still a bit faster than the default GPU implementation with PyTorch.

With an ONNX export it should also be fairly straightforward to extract the relevant onnxruntime support for VITS from TTS and use it directly with minimal additional dependencies.

Still needs more work to even validate the approach, but I like what I'm seeing so far.
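
To make the execution-provider piece concrete, here's a minimal sketch of loading a VITS ONNX export with onnxruntime (the file name is hypothetical, and the model's inputs are introspected rather than assumed):

```python
import onnxruntime as ort

# Hypothetical export file name. onnxruntime falls back through this
# list, so the TensorRT EP is used only where it's actually available.
providers = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]
sess = ort.InferenceSession("vits_en.onnx", providers=providers)

# Confirm which provider actually got selected before trusting RTF numbers.
print("active providers:", sess.get_providers())
for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)
```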

@nikito (Contributor, Author) commented Jun 4, 2023

Sounds promising! One other thing I noticed: the Coqui Docker server exposes a few API endpoints, one of which is /api/tts, which takes a text parameter. It appears to work pretty much the same as the WIS TTS endpoint currently exposed. I know that's not the same as a direct, WIS-integrated implementation, but figured it was worth mentioning. 🙂
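
For example, something like this, assuming the TTS server's default port of 5002:

```python
# Quick sketch of calling the Coqui TTS server's endpoint; 5002 is the
# server's default port. Multi-speaker models also take a speaker_id param.
import requests

resp = requests.get(
    "http://localhost:5002/api/tts",
    params={"text": "Currently, the weather is 55 degrees."},
)
resp.raise_for_status()

with open("tts_out.wav", "wb") as f:
    f.write(resp.content)
```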

@kristiankielhofner (Contributor)

wisng already uses nginx to cache TTS responses, so there is a specific route match for TTS requests. I'm just not very keen on managing yet another Docker container that exists completely outside of the current inference support in WIS today - it will have a different container base, potentially with different versions of CUDA that depend on different drivers. Their TTS today is also quite slow compared to even SpeechT5, and we'd have to either maintain a fork of TTS or attempt to upstream a bunch of Willow-specific changes that they'd likely be reluctant to accept (I know I would be).

We're also working towards a concept of full conversational flow management, with support for things like audio in -> STT -> do something -> TTS. This would get pretty messy (not to mention slower) with an additional API call to what would be an external endpoint for TTS.

I only just started on getting Coqui (in some form or fashion) into WIS. It's not impossible; it will just take a little bit of creativity because, as I mentioned, the dependency management as-is would make WIS a mess (and a nightmare to maintain).

@nikito (Contributor, Author) commented Jun 4, 2023

Fair points, totally get it. Between the dependency nightmare and the architectural concerns (the risk of some API changing down the line, or some other shift), I can understand the difficulty there. I'll give Coqui a rest for now and play around with the ST5 stuff once my S3 box finally arrives in a couple of weeks 😂

Thanks for the insight!

EDIT: I'll also try out the wisng branch for the numbers fix. 🙂 If you are looking for any input on any other TTS stuff, I'm willing to experiment!

@kristiankielhofner (Contributor)

Any testing of wisng would be great! BTW, there is a WebRTC endpoint that loads a page to do ASR directly in your browser, which you can test with.

@nikito (Contributor, Author) commented Jun 5, 2023

Did some testing on the number handling; it works now! I used this text:
Currently the weather is rainy, with a temperature of 52 degrees. Under present weather conditions the temperature feels like 51 degrees. In the next few hours the weather will be more of the same, with a temperature of 52 degrees.

One thing I noticed: there's a slight pause when speaking compound numbers like 52. For instance, instead of saying "fifty two" it says "fifty [pause] two". Not sure if there's a way to improve that? Otherwise it's working great! 😃

@kristiankielhofner (Contributor)

SpeechT5 has pretty aggressive pausing on spaces, commas, hyphens, etc. I'll look into "speeding it up", potentially with a URI parameter.

@nikito (Contributor, Author) commented Jun 5, 2023

Was just about to comment: I looked at the library being used (num2words), and I think the issue is that when it converts a number like 52 you end up with "fifty-two", and the hyphen seems to make T5 pause for a moment. I tested this directly on the TTS server and saw the same behavior. I then removed the hyphen so the text was "fiftytwo", and it spoke the number correctly without a pause. Maybe it's possible to do a string replace to remove the hyphen from the num2words output to avoid that pause?
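
Roughly what I mean, as a sketch (the regex and replacement here are illustrative, not the actual wisng code):

```python
# Illustrative sketch of the preprocessing idea, not the wisng code.
import re

from num2words import num2words

def numbers_to_words(text: str) -> str:
    def repl(match: re.Match) -> str:
        # num2words(52) -> "fifty-two"; dropping the hyphen entirely
        # ("fiftytwo") avoided the SpeechT5 pause in my test above.
        return num2words(int(match.group())).replace("-", "")
    return re.sub(r"\d+", repl, text)

print(numbers_to_words("a temperature of 52 degrees"))
# -> "a temperature of fiftytwo degrees"
```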

@kristiankielhofner (Contributor)

@nikito With the latest commit and your example.

@nikito (Contributor, Author) commented Jun 5, 2023

Awesome, thanks for the quick turnaround! :)
Pulled and tested; I can confirm the pauses are gone.

@kristiankielhofner (Contributor)

Great!
