
Is it possible to control the speed of the speech? #3

Closed
godspirit00 opened this issue Jan 15, 2023 · 13 comments

Comments

@godspirit00

Hello,
It truly is impressive work!
I wonder if it is possible to control the speed of the output speech.
Thank you!

@yl4579 (Owner) commented Jan 17, 2023

Yes, just scale the predicted duration. Simply change the line

duration = model.predictor.duration_proj(x)

in Inference_LJSpeech.ipynb to

duration = model.predictor.duration_proj(x) / speed

where speed is your scaling factor; for example, 1.25 makes the speech 1.25× (i.e. 25%) faster.

You can also change the speed more naturally by giving slower or faster reference audio, which is a main selling point of StyleTTS.
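
For context, a minimal sketch of where that scaling sits in the notebook's inference code (the surrounding lines and the variable names x, model, and pred_dur follow the usual Inference_LJSpeech.ipynb structure and should be checked against your copy; speed is the scaling factor introduced above):

    import torch

    speed = 1.25  # > 1.0 speeds the speech up, < 1.0 slows it down

    # scale the duration predictor output as suggested above
    duration = model.predictor.duration_proj(x) / speed
    duration = torch.sigmoid(duration).sum(axis=-1)           # per-phoneme durations in frames
    pred_dur = torch.round(duration.squeeze()).clamp(min=1)   # at least one frame per phoneme

    # pred_dur then drives the alignment expansion (pred_aln_trg) exactly as in
    # the unmodified notebook; nothing else needs to change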

yl4579 closed this as completed Jan 17, 2023
@godspirit00 (Author) commented Jan 17, 2023

@yl4579 Thanks for the reply!
Just one more question: is reference speech always needed for synthesis?

@yl4579 (Owner) commented Jan 17, 2023

Yes, it is always necessary. But you can precompute a style from a reference and add some noise to the style to synthesize diverse speech.
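
A rough sketch of that idea (the compute_style call below is a hypothetical placeholder for however your inference script turns a reference into a style vector, and the noise scale is just something to tune by ear):

    import torch

    # precompute a style vector once from a reference utterance
    # (compute_style is a placeholder, not an actual function in this repo)
    s_ref = compute_style(model, "reference.wav")   # shape: (1, style_dim)

    def sample_style(s_ref, noise_scale=0.1):
        # perturb the precomputed style to get diverse but similar-sounding speech
        return s_ref + noise_scale * torch.randn_like(s_ref)

    # each call gives a slightly different prosody/timbre for the same text
    s1 = sample_style(s_ref)
    s2 = sample_style(s_ref)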

@sinhprous1

Hi @yl4579, could we also control the pitch of the generated audio, i.e. generate different audio variations of a speaker from a single text?

@vik-rant commented May 9, 2023

@yl4579, is there any way to control the pitch? Changing the predicted F0 values before sending them to the decoder seems to make no difference.

@yl4579 (Owner) commented May 11, 2023

@vik-rant @sinhprous1 Changing the F0 curve should change the pitch. Can you give me some examples where changing the F0 curve doesn't change the pitch?

@vik-rant commented May 11, 2023

@yl4579, sharing some samples with the pitch shifted by various values:
https://drive.google.com/drive/folders/1au9JGH-WhGTxkeErTT0wNNBWU3UfBEAD?usp=sharing

        # attempt to change the pitch by adding a constant offset to the predicted F0
        F0_pred, N_pred = model.predictor.F0Ntrain(en, s)
        F0_pred += 100
        print('F0:')
        print(F0_pred.shape)
        print(F0_pred)

        out = model.decoder((t_en @ pred_aln_trg.unsqueeze(0).to(device)),
                            F0_pred, N_pred, ref.squeeze().unsqueeze(0))

@yl4579 (Owner) commented May 11, 2023

@vik-rant Shifting the whole F0 curve by a constant doesn't change the absolute pitch, because the input F0 is normalized; the absolute pitch depends on the reference instead. You would need to shift the reference by 100 Hz to get synthesized speech with F0 shifted by 100 Hz. If you only change part of the F0 curve, however, the output F0 will change. StyleTTS controls everything with the reference audio, and F0 is just a curve rather than an absolute value, which is also why it can do voice conversion. You can retrain the model with the instance normalization removed from the F0 path to get absolute F0.
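
If the goal is to shift the absolute pitch, one generic option (not something provided by this repo) is to pitch-shift the reference audio before the style and F0 reference are computed from it. A sketch using librosa; the mean-F0 estimate and the conversion from a Hz offset to semitones are assumptions to adapt to your speaker:

    import librosa
    import numpy as np

    wav, sr = librosa.load("reference.wav", sr=None)

    # express the desired shift as a Hz offset around the speaker's mean F0
    mean_f0_hz = 200.0   # assumption: estimate this for your reference speaker
    shift_hz = 100.0

    # pitch_shift works in semitones, so convert the Hz offset
    n_steps = 12.0 * np.log2((mean_f0_hz + shift_hz) / mean_f0_hz)

    shifted = librosa.effects.pitch_shift(wav, sr=sr, n_steps=n_steps)
    # use `shifted` as the reference audio when computing the style / reference mel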

@vik-rant

@yl4579, got it, thanks.

@yl4579 (Owner) commented Oct 28, 2023

@chiaki-luo What do you mean by directly listening to the mel produced during training and it sounding worse than at inference? Maybe you can open a new issue if it is not related to this thread. Also, what do you mean by using the output of inference as ground truth? I don't really understand your question.

@bobo-paopao

> Yes, just scale the predicted duration. Simply change the line duration = model.predictor.duration_proj(x) in Inference_LJSpeech.ipynb to duration = model.predictor.duration_proj(x) / speed [...]

I found that this does not control the speed as expected: I set speed to 3.0 and the audio actually became slower.

@bobo-paopao

Oh, I found that changing the line to duration = torch.sigmoid(duration).sum(axis=-1) / speed does change the speed.
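
That behaviour makes sense if duration_proj outputs pre-sigmoid logits, as the sigmoid in that line suggests: dividing the logits just pulls every bin toward sigmoid(0) = 0.5, which can lengthen short phonemes, whereas dividing the summed durations scales the total length linearly. A side-by-side sketch, assuming the notebook code quoted above:

    import torch

    logits = model.predictor.duration_proj(x)

    # (a) scaling before the sigmoid is nonlinear and can even slow things down,
    #     which matches the speed = 3.0 observation above
    dur_a = torch.sigmoid(logits / speed).sum(axis=-1)

    # (b) scaling after the sigmoid-and-sum divides every duration by `speed`,
    #     so speed = 3.0 really is about 3x faster
    dur_b = torch.sigmoid(logits).sum(axis=-1) / speed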

@vishalsantoshi commented Mar 24, 2024

A question before I dive in.

Given that duration = torch.sigmoid(duration).sum(axis=-1) is available, and assuming we have a target duration, is it possible to compute the predicted (potential) duration up front, so the speed factor can be calculated as the ratio of the target duration to the predicted duration? It seems I cannot simply use sum(duration). I have been doing a double pass (generate the audio, then measure its duration) to get the speed factor and then regenerating with it applied, which seems redundant.
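
One way to avoid the double pass, sketched under the assumption that the predicted durations are in mel frames and that frames_per_second = sample_rate / hop_length (read both values from your preprocessing config; 24000 / 300 = 80 is only an example):

    import torch

    frames_per_second = 80.0  # assumption: sample_rate / hop_length from your config

    duration = model.predictor.duration_proj(x)
    duration = torch.sigmoid(duration).sum(axis=-1)
    pred_dur = torch.round(duration.squeeze()).clamp(min=1)

    # predicted length of the utterance without synthesizing it
    predicted_seconds = pred_dur.sum().item() / frames_per_second

    target_seconds = 3.5  # the length you want to hit
    speed = predicted_seconds / target_seconds

    # apply the speed factor directly; no second synthesis pass needed
    pred_dur = torch.round(duration.squeeze() / speed).clamp(min=1)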
