
WaveGlow synthesis result #14

Closed
tingyang01 opened this issue Jun 17, 2019 · 8 comments

Comments

@tingyang01

I tried TTS with WaveGlow as follows, but I got a noisy result.
Could you explain the reason?
def synthesis_waveglow(text_seq, model, waveglow, alpha=1.0, mode=""):
    denoiser = Denoiser(waveglow)
    text = text_to_sequence(text_seq, hp.text_cleaners)
    text = text + [0]
    text = np.stack([np.array(text)])
    text = torch.from_numpy(text).long().to(device)

    pos = torch.stack([torch.Tensor([i + 1 for i in range(text.size(1))])])
    pos = pos.long().to(device)

    model.eval()
    with torch.no_grad():
        _, mel_postnet = model(text, pos, alpha=alpha)
    with torch.no_grad():
        # wav = waveglow.infer(mel_postnet, sigma=0.666)
        wav = waveglow.infer(
            torch.transpose(mel_postnet, 1, 2).type(torch.cuda.HalfTensor),
            sigma=0.666)
    print("Wav Have Been Synthesized.")

    if not os.path.exists("results"):
        os.mkdir("results")

    wav_denoised = denoiser(wav, strength=0.01)[:, 0]
    # audio.save_wav(wav[0].data.cpu().numpy(), os.path.join(
    #     "results", text_seq + mode + ".wav"))
    audio.save_wav(wav_denoised[0].cpu().numpy(), os.path.join(
        "results", text_seq + mode + ".wav"))

Thank you

@TakoYuxin

> I tried TTS with WaveGlow as follows, but I got a noisy result. Could you explain the reason? […]

I also got noisy results. I think the reason is that WaveGlow uses a slightly different audio-processing pipeline, so the two models are trained on mel spectrograms with different scales and are therefore not compatible. I tried to convert the mel spectrogram predicted by FastSpeech to the scale used by WaveGlow, but it failed. If you happen to get good results, please let me know.
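The mismatch can be seen numerically: a keithito-style FastSpeech audio.py normalizes a dB-scale mel into roughly [0, 1], while WaveGlow's TacotronSTFT takes the natural log of the clamped linear amplitude. A minimal sketch of the two scales, assuming typical keithito defaults (min_level_db = -100, ref_level_db = 20), which this thread does not confirm:

```python
import numpy as np

amp = np.array([0.5])  # hypothetical linear-amplitude mel value

# FastSpeech-style (keithito audio.py): dB scale, ref-level shift, [0, 1] normalization
min_level_db, ref_level_db = -100.0, 20.0  # assumed defaults
db = 20.0 * np.log10(np.maximum(1e-5, amp)) - ref_level_db
fastspeech_mel = np.clip((db - min_level_db) / -min_level_db, 0.0, 1.0)

# WaveGlow-style (NVIDIA TacotronSTFT): natural log of the clamped amplitude
waveglow_mel = np.log(np.clip(amp, 1e-5, None))

print(fastspeech_mel, waveglow_mel)  # very different numeric ranges
```

Feeding a tensor on the first scale to a vocoder trained on the second is consistent with getting noise out.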

@angelorobo
Copy link

@TakoYuxin Can you please share how you "convert the mel spectrogram predicted by FastSpeech into the same scale used by WaveGlow"?

@TakoYuxin

TakoYuxin commented Jun 19, 2019

> @TakoYuxin Can you please share how you "convert the mel spectrogram predicted by FastSpeech into the same scale used by WaveGlow"?

Here is what I added after FastSpeech predicted mel_postnet, but it didn't work.

# denormalize FastSpeech's mel to dB, undo the ref-level shift, then back to linear amplitude
waveglow_npy = audio._db_to_amp(audio._denormalize(mel_postnet) + hp.ref_level_db)
waveglow_npy = torch.from_numpy(waveglow_npy)
waveglow_npy = torch.log(torch.clamp(waveglow_npy, min=1e-5))  # WaveGlow-style log compression
torch.save(waveglow_npy, os.path.join(mode + '.pt'))

You can find FastSpeech's audio-processing functions in audio.py, and WaveGlow's in NVIDIA/tacotron2's layers.py and audio_processing.py.
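The conversion above can also be written as a standalone function. This is a sketch under the assumption that FastSpeech uses keithito-style constants (min_level_db = -100, ref_level_db = 20); check your own hparams.py before relying on it:

```python
import numpy as np

# Assumed keithito-style constants; verify against your hparams.py.
MIN_LEVEL_DB = -100.0
REF_LEVEL_DB = 20.0

def fastspeech_to_waveglow(mel_norm):
    """Map a [0, 1]-normalized FastSpeech mel onto WaveGlow's log-amplitude scale."""
    db = np.clip(mel_norm, 0.0, 1.0) * -MIN_LEVEL_DB + MIN_LEVEL_DB  # denormalize to dB
    db = db + REF_LEVEL_DB                                           # undo ref-level shift
    amp = np.power(10.0, db / 20.0)                                  # dB -> linear amplitude
    return np.log(np.clip(amp, 1e-5, None))                          # WaveGlow's log-clamp

mel = np.full((80, 10), 0.74)  # dummy normalized mel: 80 bins, 10 frames
log_mel = fastspeech_to_waveglow(mel)
```

Even with the scales aligned, the two pipelines may still differ in STFT parameters (hop length, window, mel filterbank), so this alone may not be enough.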

@tingyang01
Author

Thank you @TakoYuxin.
I got some results, but the voice quality is not good.
My result is at epoch 1000.
Should I train the FastSpeech model longer?
Could you explain?

@TakoYuxin

I haven't gotten any good results either >_< I trained the model for 198k steps, but the generated voice was so unclear that I could barely understand what it was saying, and a few words were skipped. I don't know exactly how to fix this, but continuing to train for more epochs sounds like a plan. We should probably wait for the author's answer.

@enamoria

Has anyone gotten good results with WaveGlow? I tried Tako's denormalization and got some audible results, but still can't get anything better. Not even close to Griffin-Lim.

@xcmyz
Owner

xcmyz commented Nov 5, 2019

The newest repo has an audio example synthesized by WaveGlow in results.

@aguazul

aguazul commented Mar 28, 2020

I am also having this issue: I trained to 164,000 steps, but the wav file was just silent background noise.

I also tried running synthesis.py on checkpoints from earlier training steps (2,000, 9,000, and 130,000), and the only one that sounded remotely like speech was 2,000. By 9,000 steps it was just an empty, silent wav file.

I saw that @xcmyz said the batch size needs to be 32 or more, but if I set it to 32 I get an out-of-memory error despite running on a good GPU. Reducing the batch size to 16 lets training run without error.
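One way to keep an effective batch of 32 without the OOM is gradient accumulation: run two micro-batches of 16 and step the optimizer once. This is a generic PyTorch sketch, not code from this repo; the toy Linear model and the accum_steps name are placeholders:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)          # placeholder for the FastSpeech model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
accum_steps = 2                        # 2 micro-batches of 16 -> effective batch 32

w_before = model.weight.detach().clone()
opt.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(16, 4)             # stand-in for a micro-batch of real features
    y = torch.randn(16, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()    # scale so gradients average over the full 32
opt.step()                             # one optimizer step for the effective batch
opt.zero_grad()
```

Dividing the loss by accum_steps makes the accumulated gradient an average over all 32 samples, matching what a single batch of 32 would produce (up to batch-norm statistics, if any).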

I am not sure why the audio wav files are basically silent noise.

@xcmyz xcmyz closed this as completed Jul 20, 2020