How to perform text to speech #4
LPCNet is basically one half of a TTS system. It takes an acoustic feature vector every 10 ms and outputs speech samples. For TTS, you also need a network that takes in characters and outputs these acoustic feature vectors.
@jmvalin Hi, I have trained a taco2 model to predict the 18-band Bark-scale cepstral coefficients and the 2 pitch parameters.
@changeforan To compute the LPC coefficients, look for the _celt_lpc() function in denoise.c. The process starts from Ex, computed by compute_band_energy(), so you'd need to invert a few more steps, but that shouldn't be too hard.
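In rough numpy terms, that inversion might look like the sketch below. It is only a sketch: the constants and band layout are assumptions about the C code, DCT scale factors are glossed over, and this is not the repository's implementation.

```python
import numpy as np

NB_BANDS = 18       # Bark-scale bands, matching the 18 cepstral dims
LPC_ORDER = 16
FREQ_SIZE = 161     # assumed number of spectral bins (20 ms window at 16 kHz)
# assumed FFT-bin centres of the 18 bands; check the tables in the C code
BAND_BINS = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32, 40,
                      48, 56, 64, 80, 96, 112, 136, 160])

def lpc_from_cepstrum(cepstrum):
    """Sketch: cepstrum -> band energies Ex -> power spectrum
    -> autocorrelation -> LPC via Levinson-Durbin."""
    # 1. undo the DCT to recover log band energies, exponentiate -> Ex
    n = np.arange(NB_BANDS)
    idct_mat = np.cos(np.pi / NB_BANDS * np.outer(n + 0.5, n))
    Ex = 10.0 ** (idct_mat @ cepstrum)       # scale factors glossed over

    # 2. spread the band energies across the linear-frequency bins
    power = np.interp(np.arange(FREQ_SIZE), BAND_BINS, Ex)

    # 3. inverse FFT of the power spectrum gives the autocorrelation
    ac = np.fft.irfft(power, 2 * (FREQ_SIZE - 1))[:LPC_ORDER + 1]
    ac[0] *= 1.0001                          # tiny noise floor for stability

    # 4. Levinson-Durbin recursion: autocorrelation -> prediction coefficients
    a = np.zeros(LPC_ORDER + 1)
    a[0] = 1.0
    err = ac[0]
    for i in range(1, LPC_ORDER + 1):
        k = -(ac[i] + np.dot(a[1:i], ac[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return -a[1:]   # so that x_hat[n] = sum_j lpc[j] * x[n - 1 - j]
```

Called on features[0:18] of a frame, this would return the 16 coefficients that sit in the LPC slots of the feature vector.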
@jmvalin Thanks for your quick response, but I am still confused.
The LPCs are the same as if they'd been computed on features[0:18]. The spectrum on which they're computed in the C code is the same one that's used to compute the cepstrum, and the operation is reversible.
SO IT WORKS. Here are my samples.
So, am I right? But why does it run so slowly? If I'm right, I'll try connecting Tacotron-2 and LPCNet. Or would it be a better choice to use something else instead of Tacotron-2?
Well, the way it's normally supposed to work is that you train Tacotron (or whatever network) to directly output features that LPCNet can use. No need to run the synthesis twice (though in this case I guess it was easier for testing purposes).
Thanks for your response. Yes, it works. Of course I've synthesized the sound from Tacotron 2 to demonstrate the result (so to say, show progress). I tested LPCNet for Korean and Russian. The results are impressive. I will develop an implementation of Tacotron 2 for a closer connection with LPCNet to make an end-to-end TTS system. If Tacotron 2 runs on the server (without the WaveNet vocoder) and LPCNet runs on the clients, it solves many problems and reduces server load up to 10 times.
@gosha20777 What acoustic features did you use when you trained the TTS model? I've trained with both the 55 dimension features and the 21 dimension features; however, the results are not good.
I got the features from an English multi-speaker dataset, about 8 hours.
With the original 55 dimension features or other features?
Hmm, I'm not sure... but in my opinion it was the 20 dim features. Try training for a LONG TIME. I trained it for about 5 days on 2x Nvidia 1080 Ti, using the horovod library to parallelize it.
I can give you a pretrained model if you want.
I can't understand what the 120 dim features are and how you extract them. I'd appreciate some explanation. As far as I can tell, the paper claims 20 dim features, while the code actually seems to use 55 dim features.
Oh no! Not 120 dim but 20 dim! I'm so sorry :)
In the code, it seems like 21 dim features rather than 20 dim. I've tried to predict the 21 dim features; however, the results sound unstable. My backbone model is not of the taco series, but a traditional RNN model.
@attitudechunfeng I have reviewed the code and found that features[18:36] are set to zero, features[36] and features[37] are the pitch parameters, features[38] is not used at all, and features[39:55] are the LPC coefficients.
So it means that I only need to predict features[0:18] and features[36:38], 20 dims in total? Do you have good results using these features? @changeforan
With a Taco2 model, yes.
FYI, I don't think features[38] is useful for anything. OTOH, features[18:36] could potentially be useful for TTS.
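Summarizing the layout that emerges from this exchange, as hypothetical Python constants (this is the thread's reading of the code, not documentation):

```python
# Assumed 55-dim frame layout, pieced together in this thread:
CEPSTRUM     = slice(0, 18)    # Bark-scale cepstral coefficients
ZEROED       = slice(18, 36)   # written as zeros, but could carry info for TTS
PITCH_PERIOD = 36              # pitch parameter 1
PITCH_CORR   = 37              # pitch parameter 2
UNUSED       = 38              # not useful for anything
LPC          = slice(39, 55)   # 16 LPC coefficients, recomputable from the cepstrum
```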
@attitudechunfeng the 21st dim is not to be predicted
@hdmjdp what do you mean? Can you explain in more detail?
@attitudechunfeng it means you don't need to predict the period, so the net outputs 20 dims.
I tried to predict the lpcnet parameters directly using a tacotron model. The generated voice is not very good, and the attention looked very strange. Here are some attention plots and samples (in Chinese). Has anyone else run into this situation and knows how to explain it?
Are you training end-to-end or are you just learning the LPCNet features from text? Also, make sure that the LPC features are not predicted, but rather computed directly from the predicted cepstral features.
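A minimal sketch of that advice, reusing the hypothetical lpc_from_cepstrum from the sketch earlier in this thread (the layout is the one assumed above, not the repo's API):

```python
import numpy as np

def assemble_frame(pred20):
    """Hypothetical glue: expand a 20-dim network output into the 55-dim
    file layout, recomputing the LPC rather than predicting it."""
    frame = np.zeros(55, dtype=np.float32)
    frame[0:18] = pred20[:18]                      # Bark cepstrum
    frame[36:38] = pred20[18:20]                   # pitch period, pitch correlation
    frame[39:55] = lpc_from_cepstrum(pred20[:18])  # derived from the cepstrum
    return frame
```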
@candlewill maybe you used the wrong features, as jmvalin said; my alignment is very good, and compared to the mel spectrogram it is much easier to get the alignment.
Thanks @jmvalin and @azraelkuan, I predict all of the 55d features when doing end-to-end training. I will try changing the features to predict.
@azraelkuan Which LPCNet acoustic features did you use?
Feature: the 20-dim concatenated feature; I do not split them. I cannot share the samples, sorry.
I tried to combine tacotron with LPCNet; it succeeded on a big dataset but failed on a small one. (Feature extraction over the dataset only takes one round.) The tacotron output may have a period greater than 3.1, which I think will cause problems in training the LPCNet network (although training does not report an error). So I plan to normalize the cepstrum and pitch parameters.
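A normalization along those lines might look like the following sketch. The filename is a placeholder, and the column choice assumes the 18-cepstral + 2-pitch layout discussed above:

```python
import numpy as np

# placeholder filename; assumes a dump_data-style 55-dim float32 feature file
feats = np.fromfile("train_features.f32", dtype=np.float32).reshape(-1, 55)
cols = np.r_[0:18, 36:38]                  # the 18 cepstral + 2 pitch dims
mean = feats[:, cols].mean(axis=0)
std = feats[:, cols].std(axis=0) + 1e-8    # avoid division by zero

def normalize(x):
    """x: (frames, 20) training targets for the acoustic model."""
    return (x - mean) / std

def denormalize(y):
    """Invert the scaling before handing frames to LPCNet."""
    return y * std + mean
```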
@jmvalin Hi, in your makefile you provide a compile option for the A53. Does this mean that this repo can run in real time on an A53 chip? We find it runs much slower than real time. Why?
LPCNet is not yet real-time on the A53. That's a pretty slow chip. We've managed real-time performance on an iPhone6 though. So it should run in real-time on most modern smartphones. Just not on RaspberryPi yet. That may eventually be achievable, but that's not what we're working on atm.
@jmvalin thanks, we tested lpcnet on a phone with an A73 chip; it cannot run in real time yet. I will try training it with 32x1 sparse blocks, so it can use 17 registers. What do you think?
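For reference, one-shot magnitude pruning with 32x1 blocks can be sketched as below. This is a toy illustration of the block-sparsity idea only; LPCNet's actual sparsification is applied gradually during training rather than in one shot:

```python
import numpy as np

def block_sparsify(W, keep_fraction, block=(32, 1)):
    """Zero out the lowest-magnitude blocks of a weight matrix,
    keeping roughly keep_fraction of the 32x1 blocks."""
    bh, bw = block
    h, w = W.shape
    assert h % bh == 0 and w % bw == 0
    # magnitude of each (bh x bw) block
    mags = np.abs(W).reshape(h // bh, bh, w // bw, bw).sum(axis=(1, 3))
    k = max(1, int(keep_fraction * mags.size))
    thresh = np.sort(mags, axis=None)[-k]          # k-th largest block magnitude
    mask = (mags >= thresh).astype(W.dtype)
    return W * np.repeat(np.repeat(mask, bh, axis=0), bw, axis=1)
```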
Hi, @candlewill.
Hi Team (@candlewill or @azraelkuan, if you can help out that would be amazing): given a predicted 80-dimensional mel-spectrogram from, say, DeepVoice or Tacotron, what are the steps to post-process it so that it can be fed directly as input (18 Bark-scale cepstral coefficients and 2 pitch params) to LPCNet? Goal: numpy array (.npy file) from TTS -> features.32, without generating a waveform and converting that to a raw audio header file to be fed into LPCNet. Assuming that my base TTS model is not trained e2e for LPCNet features, let's say I use the function below:

```python
import numpy as np

def reduce_dim(features):
    """ reduce dimension from 55d to 20d
    keep features[0:18] and features[36:38] only
    :param features: 55d
    :return: 20d
    """
    N, D = features.shape
    assert D == 55, "Dimension error. %sx%s" % (N, D)
    features = np.concatenate((features[:, 0:18], features[:, 36:38]), axis=1)
    assert features.shape[1] == 20, "Dimension error. %s" % str(features.shape)
    return features
```

to reduce my predicted 80-dimensional mel-spectrogram down to 20d. Where in this repo should I start to generate a features.32 file?
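For what it's worth, a very crude approximation of the 18 Bark cepstral dims from a mel envelope could be sketched like this (the filterbank layouts are assumptions, and the two pitch dims cannot be recovered this way; they would need a separate estimator):

```python
import numpy as np
from scipy.fftpack import dct

SR = 16000                 # assumed sample rate
N_MEL, N_BARK = 80, 18

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

# assumed centre frequencies for both filterbanks
mel_pts = np.linspace(0.0, hz_to_mel(SR / 2), N_MEL)
mel_centres = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
bark_pts = np.linspace(0.0, 6.0 * np.arcsinh((SR / 2) / 600.0), N_BARK)
bark_centres = 600.0 * np.sinh(bark_pts / 6.0)

def mel_to_bark_cepstrum(mel_frame):
    """Resample a linear-amplitude mel envelope at Bark centres, then take
    the DCT of the log energies. Lossy; pitch dims must come from elsewhere."""
    bark_energy = np.interp(bark_centres, mel_centres, mel_frame)
    return dct(np.log10(np.maximum(bark_energy, 1e-9)), type=2, norm='ortho')
```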
@pgmbayes why not just predict the LPCNet features directly?
Hi @candlewill, how many epochs would it take to get samples like e2e_lpcnet_samples.zip? Thank you.
@HallidayReadyOne I trained it for 120 epochs, which is the default parameter. What's more important is that, before using lpcnet, you should make sure your end2end model can predict the lpcnet-used features well.
@candlewill Thank you for the kind reply. Yep, the text2feature model is important. I have trained a tacotron model to predict the lpcnet-used features. The attention alignment is quite good now. However, the sample from lpcnet (about 18 epochs) is unstable.
Hi, @candlewill, how many steps and what batch size did you use when you trained the end2end model to predict the lpcnet-used features? Thanks!
Can you share the pretrained model?
@changeforan
@cahuja1992

```python
import numpy as np

# load the predicted features and dump them as a raw binary file
npy_data = np.load("mel_220k_0.npy")
npy_data = npy_data.reshape((-1,))
npy_data.tofile("mel_220k_0.s32")
```
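One caveat with this conversion (an editorial note, not from the original comment): tofile writes raw bytes in the array's current dtype, so if the .npy was saved as float64 it should be cast with astype(np.float32) first to produce a valid 32-bit float file.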
From Tacotron we should be using model.mel_outputs, which comes out with dimensions (1, 1000, 80) for an audio clip. In order to match the dimensions for LPCNet, what should the parameters of Tacotron 2 be? The default parameters are as follows:
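Whatever the defaults, to line up with LPCNet's 10 ms, 20-dim frames the settings would need to be roughly as follows (hypothetical names following common Tacotron-2 hparams conventions, not this repo's settings):

```python
# Hypothetical Tacotron-2 hparams to match LPCNet's frame rate and dimensionality
hparams = dict(
    num_mels=20,         # predict 18 cepstral + 2 pitch dims instead of 80 mels
    sample_rate=16000,   # LPCNet operates on 16 kHz speech
    hop_size=160,        # 10 ms frame shift = one LPCNet feature frame
    outputs_per_step=1,  # one 10 ms frame per decoder step
)
```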
Hi, @candlewill, I have listened to your samples. They are better than those I generated. I used tacotron 2 to predict the 20 dim features and trained LPCNet with my own data. It seems that the samples with predicted features have pitch problems compared to samples generated with ground-truth features. I would like to ask if you could share some information with me about the training of tacotron 2, for example, the loss function.
@jmvalin Trying the steps below to generate features from Tacotron and using them to generate speech.

Training Tacotron for LPCNet:
4. Checkpoints are created every 1000 iterations.
5. Now, using metadata.csv of LJSpeech, generate the sentences array for eval.py.
6. Modify synthesizer.py: wav = self.session.run(self.wav_output, feed_dict=feed_dict)
7. Run eval.py.

Training LPCNet using features generated from Tacotron:
1. Here we have to use the concatenated pcm file and mel_op.npy.
2. make dump_data taco=1

Usage
@hdmjdp Any progress with the 32x1 sparse blocks? I've tried with an A73 chip; when increasing the sparsity it can reach about 1.0x+ real-time speed, but it's still a little slow.
Hi @azraelkuan, for the tacotron model, what did you use as the input? Phones, pinyin, or English words? Thanks!
@candlewill did you train lpcnet with the 55 dim features? And were the 55 dim features generated with lpcnet's dump_data without any other processing?
Hi! Have you resolved this problem?
You should try the "Text Speaker" app. This is the best text-to-speech app. It has so many natural-sounding voices to choose from. It is useful for listening to study files and much more. It can even extract text from scanned pages and websites and read them out loud. I use it most often to create mp3 files of my study files so I can listen to them on the go. Great product. https://www.deskshare.com/text-to-speech-software.aspx
@Ben654987 Please stop pasting your ad here.
@jmvalin Thanks for hosting this interesting project. As part of the use cases for LPCNet you mention TTS (Text-To-Speech). How do we synthesize speech from text using test_lpcnet.py?
If this is not the approach to implementing TTS, do you have any recommendations on where to start with LPCNet for implementing an end-to-end TTS system?