This issue was moved to a discussion.

Gradio demo #50

Closed
AK391 opened this issue Nov 21, 2023 · 16 comments
Labels
help wanted Extra attention is needed

Comments

@AK391

AK391 commented Nov 21, 2023

Hi, congrats on StyleTTS2! It would be great to set up a Gradio demo for it on Hugging Face. You can see the guide to get started here: https://huggingface.co/docs/hub/spaces-sdks-gradio and here is a recent example: https://huggingface.co/spaces/coqui/xtts, @yvrjsharma

@yl4579 yl4579 added the help wanted Extra attention is needed label Nov 21, 2023
@yl4579
Owner

yl4579 commented Nov 21, 2023

I'm not familiar with Gradio. I did try it for StyleTTS but had no success. I will take a look at it later when I have time, but if anyone is interested in making a demo in the meantime, feel free to contribute!

@yl4579
Owner

yl4579 commented Nov 22, 2023

Someone is already working on it (#53), and we are figuring out some of the details.
I will let you know when it is ready.

@fakerybakery
Contributor

fakerybakery commented Nov 22, 2023

Hi @AK391. I’ve released a Gradio demo here with voice cloning, multi-speaker support, and LJSpeech support.

@yl4579
Owner

yl4579 commented Nov 22, 2023

@fakerybakery For the default voices, I think it would be great if you could find all of each speaker's audio samples in the training data, compute the style of each sample, take the average, and save it as the speaker embedding. This is probably more efficient than computing the style every time the demo runs, and it is also a more accurate reflection of the speaker.
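A minimal stdlib-only sketch of the averaging idea, assuming each style is a flat list of floats. `compute_style` is a hypothetical stand-in for whatever extracts a style vector from one clip with the model's style encoder, and the JSON storage format is an assumption for illustration, not what the demo actually uses.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def average_style(styles: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized style vectors."""
    if not styles:
        raise ValueError("need at least one style vector")
    dim = len(styles[0])
    if any(len(s) != dim for s in styles):
        raise ValueError("all style vectors must have the same length")
    return [sum(values) / len(styles) for values in zip(*styles)]


def build_speaker_embeddings(
    clips_by_speaker: Dict[str, Sequence[str]],
    compute_style: Callable[[str], Vector],  # hypothetical style extractor
) -> Dict[str, Vector]:
    """Compute one averaged style per speaker, once, ahead of time."""
    return {
        speaker: average_style([compute_style(path) for path in paths])
        for speaker, paths in clips_by_speaker.items()
    }


def save_embeddings(embeddings: Dict[str, Vector], path: Path) -> None:
    # Plain floats in JSON carry no device information, so the file loads
    # the same way on a CPU-only machine or a GPU machine.
    path.write_text(json.dumps(embeddings))


def load_embeddings(path: Path) -> Dict[str, Vector]:
    return json.loads(path.read_text())
```

At startup the demo would then only call `load_embeddings` instead of re-running style extraction for every default voice.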

@fakerybakery
Contributor

fakerybakery commented Nov 22, 2023

Yes, you’re probably right. No wonder starting the demo took so long each time! Thank you, I’ll push a fix tomorrow :)

@fakerybakery
Contributor

Hi, someone asked here whether I would release a local Gradio GUI (the comment was later deleted for some reason, but it was still in my inbox).

I am planning to eventually release it and perhaps make a PR to the main repository, but the code quality is currently pretty... low. I'm going to clean it up a bit and then try to release it.

@fakerybakery
Contributor

fakerybakery commented Nov 22, 2023

Thanks to @AK391 for posting this solution on X/Twitter! I just realized you can run any Hugging Face Space with Docker:

docker run -it -p 7860:7860 --platform=linux/amd64 --gpus all \
	registry.hf.space/styletts2-styletts2:latest python app.py

@yl4579
Owner

yl4579 commented Nov 22, 2023

A few more features that could be added:

  1. Currently the limit is 300 characters, but the max_length of the BERT encoder is 512, so I think a better way to enforce the limit is to phonemize the input first, then call len() on the phonemized text and make sure it is less than 512.
  2. We can probably add duration control and pitch control as well. To control duration, we can do what is described in StyleTTS#3 ("Is it possible to control the speed of the speech?"). To control pitch, we can do the same but scale the F0 instead. Another, more natural way is to shift the pitch of the reference audio, use that to sample a style, and then combine it with the original style.
  3. We can add emotion control with style transfer, though the effect may not be very obvious with the LibriTTS dataset because the data itself is not very emotional.
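A rough sketch of the knobs in item 2 on plain Python lists. All function names are hypothetical; in the real model these operations would act on the predicted per-phoneme durations, the F0 curve, and the style vectors before decoding, and whether simple division matches exactly what StyleTTS#3 describes is an assumption. The blend is a plain linear interpolation between two styles.

```python
from typing import List, Sequence


def scale_durations(durations: Sequence[float], speed: float) -> List[int]:
    """Divide predicted per-phoneme durations (in frames) by a speed factor.

    speed > 1 speaks faster, speed < 1 slower. Each phoneme keeps at least
    one frame so no phoneme is dropped entirely.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return [max(1, round(d / speed)) for d in durations]


def semitones_to_ratio(semitones: float) -> float:
    # Equal temperament: 12 semitones per octave, i.e. a doubling of F0.
    return 2.0 ** (semitones / 12.0)


def scale_f0(f0: Sequence[float], ratio: float) -> List[float]:
    """Multiply every F0 frame by ratio; unvoiced frames (0 Hz) stay 0."""
    return [hz * ratio for hz in f0]


def blend_styles(a: Sequence[float], b: Sequence[float], alpha: float) -> List[float]:
    """Linear interpolation between two styles: alpha * a + (1 - alpha) * b."""
    if len(a) != len(b):
        raise ValueError("style vectors must have the same length")
    return [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]
```

For the "more natural" variant, one would pitch-shift the reference audio, compute its style, and pass it with the original style to `blend_styles`, with `alpha` controlling how strongly the shifted style is applied.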

Thanks again for your help in making the demo!

@fakerybakery
Contributor

Hi, I can try to implement this. 1 and 2 seem doable, but 3 seems a bit harder. I'll look into this later today! Thanks for the suggestions!

@MariasStory
Copy link

Can you please remove the "Access code" from the "Long Text" feature?
It is a problem when running the Docker image locally.

@fakerybakery
Contributor

fakerybakery commented Nov 23, 2023

OK, in a couple of minutes I'll either remove the long text feature or add a character limit.

@fakerybakery
Copy link
Contributor

Hi @yl4579, a couple things:

  1. I tried saving the speaker embeddings with pickle, but ran into issues switching between CPU and GPU. Do you have any tips for resolving this?
  2. For len(), do you mean the number of phonemes, or the number of tokens?

@kotaxyz

kotaxyz commented Nov 23, 2023

@fakerybakery Thanks a lot for your reply! I'm looking forward to the local version. I tested the Hugging Face demo and it looks awesome!

@yl4579
Owner

yl4579 commented Nov 23, 2023

@fakerybakery

  1. You can save it on the CPU (or even as a NumPy array) and then call .to('cuda') after loading it.
  2. Technically you should count tokens, but since each character is a single token, len() on the phonemized text should be fine too.
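A minimal sketch of the length check discussed here, relying on the point that one phonemized character equals one token. The `phonemize` callable is injected so the sketch runs without espeak-ng; in the demo it would presumably wrap phonemizer's espeak backend. `check_input_length` and `MAX_TOKENS` are hypothetical names.

```python
from typing import Callable

MAX_TOKENS = 512  # max_length of the BERT encoder, as stated in the thread


def check_input_length(
    text: str,
    phonemize: Callable[[str], str],
    max_tokens: int = MAX_TOKENS,
) -> str:
    """Phonemize text and reject it if it would not fit the encoder.

    Each phoneme character maps to one token, so len() of the phonemized
    string serves as the token count. Returns the phonemes on success.
    """
    phonemes = phonemize(text)
    if len(phonemes) >= max_tokens:
        raise ValueError(
            f"input is {len(phonemes)} phoneme characters; "
            f"it must be fewer than {max_tokens}"
        )
    return phonemes
```

Checking after phonemization means a short input that expands into many phonemes is still caught, which a fixed 300-character limit on the raw text cannot guarantee.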

@yvrjsharma

Would you like to start by making a local copy of the current HF demo and then iterating on it to improve it? @fakerybakery

@fakerybakery
Copy link
Contributor

Yeah, I'll start doing that. However, I'm using macOS and can't figure out how to install espeak-ng for phonemizer (I tried MacPorts, but it didn't work; maybe I'll develop it in a VM).

Repository owner locked and limited conversation to collaborators Nov 29, 2023
@yl4579 yl4579 converted this issue into discussion #110 Nov 29, 2023