
Is it possible to control the speed of the speech? #3

Closed
godspirit00 opened this issue Jan 15, 2023 · 13 comments

Comments

@godspirit00

Hello,
It truly is impressive work!
I wonder if it is possible to control the speed of the output speech.
Thank you!

@yl4579 (Owner) commented Jan 17, 2023

Yes, just scale the predicted duration. Simply change the line

duration = model.predictor.duration_proj(x)

in Inference_LJSpeech.ipynb to

duration = model.predictor.duration_proj(x) / speed

where speed is your scaling factor; for example, 1.25 makes the speech 1.25× (i.e. 25%) faster.

You can also change the speed more naturally by giving slower or faster reference audio, which is a main selling point of StyleTTS.
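
For context, a minimal sketch of where that scaling sits in the notebook's inference code (the surrounding lines and the variable names x, model, and pred_dur follow the usual Inference_LJSpeech.ipynb structure and should be checked against your copy; speed is the scaling factor introduced above):

    import torch

    speed = 1.25  # > 1.0 speeds the speech up, < 1.0 slows it down

    # scale the duration predictor output as suggested above
    duration = model.predictor.duration_proj(x) / speed
    duration = torch.sigmoid(duration).sum(axis=-1)           # per-phoneme durations in frames
    pred_dur = torch.round(duration.squeeze()).clamp(min=1)   # at least one frame per phoneme

    # pred_dur then drives the alignment expansion (pred_aln_trg) exactly as in
    # the unmodified notebook; nothing else needs to change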

yl4579 closed this as completed Jan 17, 2023
@godspirit00 (Author) commented Jan 17, 2023

@yl4579 Thanks for the reply!
Just one more question: is reference speech always needed for synthesis?

@yl4579 (Owner) commented Jan 17, 2023

Yes, it is always necessary. But you can precompute a style from a reference and add some noise to the style to synthesize diverse speech.
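
A rough sketch of that idea (the compute_style call below is a hypothetical placeholder for however your inference script turns a reference into a style vector, and the noise scale is just something to tune by ear):

    import torch

    # precompute a style vector once from a reference utterance
    # (compute_style is a placeholder, not an actual function in this repo)
    s_ref = compute_style(model, "reference.wav")   # shape: (1, style_dim)

    def sample_style(s_ref, noise_scale=0.1):
        # perturb the precomputed style to get diverse but similar-sounding speech
        return s_ref + noise_scale * torch.randn_like(s_ref)

    # each call gives a slightly different prosody/timbre for the same text
    s1 = sample_style(s_ref)
    s2 = sample_style(s_ref)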

@sinhprous1

Hi @yl4579, could we also control the pitch of the generated audio, i.e. generate different audio variations of a speaker from a single text?

@vik-rant commented May 9, 2023

@yl4579, is there any way to control the pitch? Changing the predicted F0 values before sending them to the decoder seems to make no difference.

@yl4579 (Owner) commented May 11, 2023

@vik-rant @sinhprous1 Changing the F0 curve should change the pitch. Can you give me some examples where changing the F0 curve doesn't change the pitch?

@vik-rant commented May 11, 2023

@yl4579, sharing some samples with the pitch shifted by various values:
https://drive.google.com/drive/folders/1au9JGH-WhGTxkeErTT0wNNBWU3UfBEAD?usp=sharing

        # attempt to change the pitch by adding a constant offset to the predicted F0
        F0_pred, N_pred = model.predictor.F0Ntrain(en, s)
        F0_pred += 100
        print('F0:')
        print(F0_pred.shape)
        print(F0_pred)

        out = model.decoder((t_en @ pred_aln_trg.unsqueeze(0).to(device)),
                            F0_pred, N_pred, ref.squeeze().unsqueeze(0))

@yl4579 (Owner) commented May 11, 2023

@vik-rant Shifting the whole F0 curve by a constant doesn't change the absolute pitch, because the input F0 is normalized; the absolute pitch depends on the reference instead. You would need to shift the reference by 100 Hz to get synthesized speech with F0 shifted by 100 Hz. If you only change part of the F0 curve, however, the output F0 will change. StyleTTS controls everything with the reference audio, and F0 is just a curve rather than an absolute value, which is also why it can do voice conversion. You can retrain the model with the instance normalization removed from the F0 path to get absolute F0.
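
If the goal is to shift the absolute pitch, one generic option (not something provided by this repo) is to pitch-shift the reference audio before the style and F0 reference are computed from it. A sketch using librosa; the mean-F0 estimate and the conversion from a Hz offset to semitones are assumptions to adapt to your speaker:

    import librosa
    import numpy as np

    wav, sr = librosa.load("reference.wav", sr=None)

    # express the desired shift as a Hz offset around the speaker's mean F0
    mean_f0_hz = 200.0   # assumption: estimate this for your reference speaker
    shift_hz = 100.0

    # pitch_shift works in semitones, so convert the Hz offset
    n_steps = 12.0 * np.log2((mean_f0_hz + shift_hz) / mean_f0_hz)

    shifted = librosa.effects.pitch_shift(wav, sr=sr, n_steps=n_steps)
    # use `shifted` as the reference audio when computing the style / reference mel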

@vik-rant

@yl4579, got it, thanks.

@yl4579 (Owner) commented Oct 28, 2023

@chiaki-luo What do you mean by directly listening to the mel produced during training and it sounding worse than at inference? Maybe you can open a new issue if it is not related to this thread. Also, what do you mean by using the output of inference as ground truth? I don't really understand your question.

@bobo-paopao

> Yes, just scale the predicted duration. Simply change the line duration = model.predictor.duration_proj(x) in Inference_LJSpeech.ipynb to duration = model.predictor.duration_proj(x) / speed [...]

I found that this does not control the speed as expected: I set speed to 3.0 and the audio actually became slower.

@bobo-paopao

Oh, I found that changing the line to duration = torch.sigmoid(duration).sum(axis=-1) / speed does change the speed.
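
That behaviour makes sense if duration_proj outputs pre-sigmoid logits, as the sigmoid in that line suggests: dividing the logits just pulls every bin toward sigmoid(0) = 0.5, which can lengthen short phonemes, whereas dividing the summed durations scales the total length linearly. A side-by-side sketch, assuming the notebook code quoted above:

    import torch

    logits = model.predictor.duration_proj(x)

    # (a) scaling before the sigmoid is nonlinear and can even slow things down,
    #     which matches the speed = 3.0 observation above
    dur_a = torch.sigmoid(logits / speed).sum(axis=-1)

    # (b) scaling after the sigmoid-and-sum divides every duration by `speed`,
    #     so speed = 3.0 really is about 3x faster
    dur_b = torch.sigmoid(logits).sum(axis=-1) / speed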

@vishalsantoshi commented Mar 24, 2024

A question before I dive in.

Given that duration = torch.sigmoid(duration).sum(axis=-1) is available, and assuming we have a target duration, is it possible to compute the predicted (potential) duration up front, so the speed factor can be calculated as the ratio of the target duration to the predicted duration? It seems I cannot simply use sum(duration). I have been doing a double pass (generate the audio, then measure its duration) to get the speed factor and then regenerating with it applied, which seems redundant.
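
One way to avoid the double pass, sketched under the assumption that the predicted durations are in mel frames and that frames_per_second = sample_rate / hop_length (read both values from your preprocessing config; 24000 / 300 = 80 is only an example):

    import torch

    frames_per_second = 80.0  # assumption: sample_rate / hop_length from your config

    duration = model.predictor.duration_proj(x)
    duration = torch.sigmoid(duration).sum(axis=-1)
    pred_dur = torch.round(duration.squeeze()).clamp(min=1)

    # predicted length of the utterance without synthesizing it
    predicted_seconds = pred_dur.sum().item() / frames_per_second

    target_seconds = 3.5  # the length you want to hit
    speed = predicted_seconds / target_seconds

    # apply the speed factor directly; no second synthesis pass needed
    pred_dur = torch.round(duration.squeeze() / speed).clamp(min=1)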
