Is it possible to control the speed of the speech? #3
Comments
Yes, just scale the predicted duration. In Inference_LJSpeech.ipynb, change the line `duration = model.predictor.duration_proj(x)` to `duration = model.predictor.duration_proj(x) / speed`, where `speed` is the desired speed factor. You can also change the speed more naturally by giving slower or faster reference audio, which is a main selling point of StyleTTS.
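To make the effect of dividing by `speed` concrete, here is a minimal, model-free sketch. The helper `scale_durations`, the sample duration values, and the round-then-floor-at-one-frame step are illustrative assumptions, not code taken from the notebook.

```python
def scale_durations(durations, speed):
    """Divide per-token durations (in frames) by `speed`.

    Rounding and the 1-frame floor are assumptions about what an inference
    script would typically do before expanding tokens into frames.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return [max(1, round(d / speed)) for d in durations]


# Made-up predicted durations, in frames, for four tokens.
predicted = [4.2, 7.9, 3.1, 12.0]

normal = scale_durations(predicted, 1.0)
faster = scale_durations(predicted, 2.0)  # shorter durations, faster speech
slower = scale_durations(predicted, 0.5)  # longer durations, slower speech

# Total frame counts: faster < normal < slower.
print(sum(faster), sum(normal), sum(slower))
```

Under this convention a `speed` above 1 shortens every duration and a `speed` below 1 lengthens it. Multiplying by `speed` instead of dividing would invert that behaviour.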
@yl4579 Thanks for the reply!
Yes, it is always necessary. But you can precompute a style from a reference and add some noise to the style to synthesize diverse speech.
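One way to read "add some noise to the style" is as a small Gaussian perturbation of a cached style vector. The sketch below assumes the style is a flat list of floats; `perturb_style`, `noise_scale`, and the sample values are hypothetical names and numbers chosen for illustration, and the thread does not say which noise distribution or scale works well.

```python
import random


def perturb_style(style, noise_scale=0.1, seed=None):
    """Return a copy of `style` with independent Gaussian noise added.

    `noise_scale` is a hypothetical knob: 0.0 returns the style unchanged,
    larger values move further away from the reference style.
    """
    rng = random.Random(seed)
    return [s + noise_scale * rng.gauss(0.0, 1.0) for s in style]


# Stand-in for a style vector computed once from a reference utterance.
reference_style = [0.31, -1.20, 0.05, 0.88]

# Three different perturbations of the same cached style.
variants = [perturb_style(reference_style, noise_scale=0.1, seed=k) for k in range(3)]
for v in variants:
    print([round(x, 3) for x in v])
```

Each variant would then be passed to the synthesizer in place of the original style, so the reference audio only needs to be encoded once.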
Hi @yl4579, could we also control the pitch of the generated audio, i.e., generate different audio variations of a speaker from a single text?
@yl4579, is there any way to control the pitch? Changing the predicted F0 values before sending them to the decoder seems to make no difference.
@vik-rant @sinhprous1 Changing the F0 curve should change the pitch. Can you give me some examples where changing the F0 curve doesn't change the pitch?
@yl4579, sharing some samples with the pitch shifted by various values.
@vik-rant Shifting F0 doesn't change the absolute pitch, because the input F0 is normalized. The absolute pitch depends on the reference instead: to get synthesized speech with F0 shifted by 100 Hz, you need to shift the reference by 100 Hz. If you change only part of the F0 curve, however, the F0 will change. StyleTTS controls everything through the reference audio, and F0 is treated as a curve rather than an absolute value, which is why it can also do voice conversion. To get absolute F0, you can retrain the model with the instance normalization removed from the F0 part.
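The point about normalization can be checked on a toy curve. The sketch below is not StyleTTS code: it assumes instance normalization amounts to subtracting the mean and dividing by the standard deviation over the time axis of a single F0 curve, and the F0 values are made up.

```python
from statistics import fmean, pvariance


def instance_norm(curve, eps=1e-5):
    """Normalize one F0 curve to zero mean and unit variance over time."""
    mean = fmean(curve)
    var = pvariance(curve, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in curve]


def max_diff(a, b):
    """Largest absolute element-wise difference between two curves."""
    return max(abs(x - y) for x, y in zip(a, b))


f0 = [110.0, 120.0, 135.0, 128.0, 115.0]  # made-up F0 values in Hz

shifted = [v + 100.0 for v in f0]                # whole curve moved up by 100 Hz
reshaped = f0[:3] + [v + 100.0 for v in f0[3:]]  # only the last two frames moved

# A constant shift is removed by the normalization...
print(max_diff(instance_norm(f0), instance_norm(shifted)))
# ...while changing only part of the curve survives it.
print(max_diff(instance_norm(f0), instance_norm(reshaped)))
```

The first difference is zero up to floating-point error, which matches the observation that a uniform F0 shift makes no audible difference. The second is clearly non-zero, because changing only part of the curve alters its shape.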
@yl4579, got it, thanks.
@chiaki-luo What do you mean by directly listening to the mel produced in the training stage, and by it sounding worse than inference? Maybe you can open a new issue if it is not related to this thread? What do you mean by using the output of inference as ground truth? I don't really understand your question.
I found that using it this way does not control the audio speed as expected: when I set speed to 3.0, the speech actually became slower.
Oh, I found that changing the code here with
A question before I dive in. Given that
Hello,
It truly is an impressive work!
I wonder if it is possible to control the speed of the output speech.
Thank you!