SqueezeWave Implementation #53
Comments
@sujeendran If you can prove that SqueezeWave performs better than MB-MelGAN, I will implement it :)). There is no reason to add a new model to this framework that is neither faster nor stronger than what is already here. Having listened to the audio samples and skimmed the paper, I don't think SqueezeWave beats MB-MelGAN on either inference time or quality.
@dathudeptrai I will need some time to test MB-MelGAN on my target platform. I suggested SqueezeWave mostly for its speed and the possibility of running on CPU on resource-restricted edge devices; TFLite and TF Micro are favorable for such solutions. In my case, I was able to run a combined FastSpeech + SqueezeWave synthesis on the Jetson Nano platform in 0.5 seconds with PyTorch. The quality was not bad, but could have been better. I will update here if I'm successful with MB-MelGAN.
@dathudeptrai @sujeendran It is fast, but the audio quality is not so good.
Example 1: `taskset --cpu-list 1 python3 synthesis.py "Fastspeech with Squeezewave vocoder in pytorch , very fast inference on cpu"`
Speech synthesis time: 1.7220683097839355
soxi out:
Example 2:
@sujeendran Any update?
@dathudeptrai Hi, I haven't worked on SqueezeWave for a while, as I am working on TFLite C++ inference of FastSpeech and MB-MelGAN. As Manmay noted, the quality of SqueezeWave is not as good as MB-MelGAN, but in my tests on Jetson Nano it is definitely faster on CPU/GPU with PyTorch than running FastSpeech + MB-MelGAN on CPU/GPU with TensorFlow 2.x. On Jetson, the GPU takes 2+ seconds (even after warmup) for tiny sentences with TensorFlow 2.x in the above pipeline (the CPU runs faster, but inference time increases linearly with sentence length), whereas the PyTorch GPU implementation of FastSpeech + SqueezeWave can do it in ~0.5 seconds irrespective of sentence length and with no warmup.
@sujeendran On Jetson I think you can run inference directly by installing our framework, without converting to pb or TFLite; I noticed that running inference with @tf.function and an input_signature needs no warmup, compared with pb. Overall, I think FastSpeech + MB-MelGAN is fast enough to run in real time in streaming mode. BTW, did you use 8-bit or 32-bit for TFLite? And is the Jetson Nano ARM?
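The @tf.function + input_signature trick mentioned above can be sketched as follows. This is a minimal illustration, not the framework's actual inference code: with a fixed input_signature, the graph is traced once for that signature, so variable-length inputs reuse the same concrete function instead of retracing (which is what causes the warmup cost).

```python
import tensorflow as tf

# Fixing the signature with a dynamic (None) length means one trace
# serves all sequence lengths; without it, each new shape retraces.
@tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.int32)])
def infer(input_ids):
    # Placeholder computation standing in for the real TTS model.
    return tf.reduce_sum(input_ids)

# Both calls reuse the same traced graph despite different lengths:
print(infer(tf.constant([1, 2, 3], dtype=tf.int32)).numpy())      # 6
print(infer(tf.constant([1, 2, 3, 4], dtype=tf.int32)).numpy())   # 10
```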
@dathudeptrai You are right about using @tf.function directly on Jetson for faster inference, but I was trying to reduce the size taken by the model files, to avoid keeping the source code on the target device. And GPU inference still takes 2+ seconds; I need something below 1 second.
@sujeendran Use TFLITE_BUILTINS_INT8 as the op set during TFLite conversion. Also, can you share your C++ inference code?
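A hedged sketch of what that conversion setting looks like with the TFLite converter. The toy graph below is a stand-in (not the thread's FastSpeech/MB-MelGAN models), and the representative dataset here yields random calibration samples; for a real model the samples should match real input distributions.

```python
import numpy as np
import tensorflow as tf

# Toy graph with a constant weight matrix, standing in for the real model.
@tf.function(input_signature=[tf.TensorSpec([1, 8], tf.float32)])
def toy_model(x):
    w = tf.constant(np.random.rand(8, 4).astype(np.float32))
    return tf.matmul(x, w)

converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [toy_model.get_concrete_function()])
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Force full-integer kernels, as suggested above:
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

def representative_dataset():
    # Calibration samples used to estimate activation ranges.
    for _ in range(100):
        yield [np.random.rand(1, 8).astype(np.float32)]

converter.representative_dataset = representative_dataset
tflite_model = converter.convert()
```

Full-integer quantization requires a representative dataset for calibration, which relates to the question asked later in the thread.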
@manmay-nakhashi Thanks for the tip, I will try that. I'm not at liberty to share the complete C++ code, but I can share a minimal MB-MelGAN inference code sample for once the interpreter is loaded. The same pattern can be used for FastSpeech; you just need to set the other input tensor buffers too, and the input tensor will be of type int32_t.
EDIT: I just removed the kTfLiteOk checks for the allocate and invoke calls; they were part of an error-check function call I forgot to remove before posting.
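The actual C++ snippet was not preserved in this copy of the thread, but the allocate/set-input/invoke/read-output pattern it describes can be sketched in Python with tf.lite.Interpreter (the tiny model below is a placeholder, not the real vocoder; the C++ equivalents are AllocateTensors(), typed_input_tensor, and Invoke()).

```python
import numpy as np
import tensorflow as tf

# Placeholder graph standing in for the vocoder; it just doubles its input.
@tf.function(input_signature=[tf.TensorSpec([1, 8], tf.float32)])
def toy_model(x):
    return x * 2.0

tflite_bytes = tf.lite.TFLiteConverter.from_concrete_functions(
    [toy_model.get_concrete_function()]).convert()

interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
interpreter.allocate_tensors()                       # AllocateTensors() in C++
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Copy data into the input buffer, run, read the output buffer.
interpreter.set_tensor(inp["index"], np.ones((1, 8), np.float32))
interpreter.invoke()                                 # Invoke() in C++
audio = interpreter.get_tensor(out["index"])
print(audio.shape)  # (1, 8)
```

For FastSpeech, as noted above, the remaining input tensors would be set the same way before invoking.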
@manmay-nakhashi Can you show your code for INT8 conversion of the FastSpeech model? I tried several configurations but couldn't get INT8 to work. Did you provide a representative dataset while converting, and how is the inference quality with INT8?
Hi,
Would it be possible to add a TF2 implementation of the SqueezeWave vocoder to this system? The performance is really fast and promising. I'm working on the same, but I'm not well versed in TF2 yet. I had quite a struggle trying to train the authors' PyTorch implementation on my custom dataset, even though it had almost the same characteristics as LJSpeech but double the size. I believe TF2 is more suitable for post-training optimization and deployment.
Original Repo: https://github.com/tianrengao/SqueezeWave