
SqueezeWave Implementation #53

Closed
sujeendran opened this issue Jun 21, 2020 · 12 comments

@sujeendran

Hi,
Will it be possible to add a TF2 implementation of the SqueezeWave vocoder to this system? The performance is really fast and promising. I'm working on it myself, but I'm not well versed in TF2 yet. I had quite a struggle trying to train the authors' PyTorch implementation on my custom dataset, even though it has almost the same characteristics as LJSpeech but is double the size. I believe TF2 is more suitable for post-training optimization and deployment.
Original Repo: https://github.com/tianrengao/SqueezeWave

@dathudeptrai
Collaborator

dathudeptrai commented Jun 22, 2020

@sujeendran If you can show that SqueezeWave performs better than MB-MelGAN, I will implement it :)). There is no reason to add a new model to this framework that is neither faster nor stronger than what is already there. Having listened to the audio samples and glanced at the paper, I don't think SqueezeWave beats MB-MelGAN on either inference time or quality.

@dathudeptrai dathudeptrai self-assigned this Jun 22, 2020
@dathudeptrai dathudeptrai added Discussion 😁 Discuss new feature Feature Request 🤗 Feature support labels Jun 22, 2020
@dathudeptrai
Collaborator

@sujeendran

@dathudeptrai dathudeptrai added the stat:awaiting response ☏ Waiting Response label Jun 22, 2020
@sujeendran
Author

sujeendran commented Jun 22, 2020

@dathudeptrai I will need some time to test MB-MelGAN on my target platform. I suggested SqueezeWave mostly for the speed and the possibility of running on CPU on resource-restricted edge devices; TFLite and TF Micro are favorable for such solutions. In my case, I was able to run a combined FastSpeech + SqueezeWave synthesis on the Jetson Nano platform in 0.5 seconds with PyTorch. The quality was not bad, but could have been better. I will update here if I'm successful with MB-MelGAN.

@manmay-nakhashi

@dathudeptrai @sujeendran It is fast, but the audio quality is not so good. Tested on:
Intel® Core™ i5-6300U CPU

example 1

taskset --cpu-list 1 python3 synthesis.py "Fastspeech with Squeezewave vocoder in pytorch , very fast inference on cpu"

Speech synthesis time: 1.7220683097839355

soxi out:
Input File : 'results/Fastspeech with Squeezewave vocoder in pytorch , very fast inference on cpu_112000_squeezewave.wav'
Channels : 1
Sample Rate : 22050
Precision : 16-bit
Duration : 00:00:05.96 = 131328 samples ~ 446.694 CDDA sectors
File Size : 263k
Bit Rate : 353k
Sample Encoding: 16-bit Signed Integer PCM
Approx. 6 sec of audio output generated in 1.72 sec on a single CPU core.

example 2
taskset --cpu-list 0 python3 synthesis.py "How are you"
Speech synthesis time: 0.3431851863861084
soxi out:
Input File : 'results/How are you _112000_squeezewave.wav'
Channels : 1
Sample Rate : 22050
Precision : 16-bit
Duration : 00:00:00.85 = 18688 samples ~ 63.5646 CDDA sectors
File Size : 37.4k
Bit Rate : 353k
Sample Encoding: 16-bit Signed Integer PCM
0.85 sec of audio output generated in 0.34 sec on a single CPU core.

@dathudeptrai
Collaborator

@sujeendran Any update?

@sujeendran
Author

sujeendran commented Jul 15, 2020

@dathudeptrai Hi, I haven't worked on SqueezeWave for a while, as I am working on TFLite C++ inference of FastSpeech and MB-MelGAN. As manmay noted, the quality of SqueezeWave is not as good as MB-MelGAN, but in my tests it is definitely faster: running FastSpeech + SqueezeWave on the Jetson Nano CPU/GPU with PyTorch beats running FastSpeech + MB-MelGAN on CPU/GPU with TensorFlow 2.x. On the Jetson, the TensorFlow 2.x GPU pipeline above takes 2+ seconds (even after warmup) for tiny sentences (the CPU runs faster, but its inference time increases linearly with sentence length), whereas the PyTorch GPU implementation of FastSpeech + SqueezeWave does this in ~0.5 seconds irrespective of sentence length and with no warmup.

@dathudeptrai
Collaborator

@sujeendran On Jetson, I think you can run inference directly by installing our framework, without converting to pb or TFLite; I noticed that running inference with @tf.function and an input_signature needs no warmup, compared with a pb. Overall, I think FastSpeech + MB-MelGAN is fast enough to run in real time in streaming mode. BTW, did you use 8-bit or 32-bit for TFLite? And is the Jetson Nano ARM?
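To illustrate what I mean by @tf.function with an input_signature, here is a minimal sketch (the mb_melgan stand-in and the [1, None, 80] mel shape are placeholders, not the exact framework API):

import tensorflow as tf

# Stand-in for a real vocoder: any callable mapping [1, T, 80] mels to [1, T, 1] samples.
mb_melgan = tf.keras.layers.Conv1D(1, 3, padding="same")

# Tracing once with a fixed input_signature lets variable-length mels reuse the same
# concrete function, so there is no per-shape retrace ("warmup") at inference time.
@tf.function(input_signature=[tf.TensorSpec(shape=[1, None, 80], dtype=tf.float32)])
def vocoder_infer(mel):
    return mb_melgan(mel)

audio = vocoder_infer(tf.random.uniform([1, 250, 80]))  # any length hits the same traced graph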

@sujeendran
Author

@dathudeptrai You are right about using @tf.function directly on Jetson for faster inference, but I was trying to reduce the size taken by the model files and avoid keeping the source code on the target device. GPU inference is still 2+ seconds at least, and I need something below 1 second.
In the case of TFLite, allowing the supported type tf.float16 increased the speed by around 16x, I would say. But I couldn't do the same with FastSpeech: the conversion to TFLite failed when I set the supported type to tf.float16. The Jetson Nano is ARM64. Can you help me out with 8-bit TFLite as you mentioned?
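For reference, the float16 path I'm describing is roughly the standard TFLite post-training float16 conversion (a sketch only; the SavedModel path and output filename are placeholders, not my exact export):

import tensorflow as tf

saved_model_dir = "path/to/mb_melgan_saved_model"  # placeholder path

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # store weights as float16
tflite_model = converter.convert()

with open("mb_melgan_fp16.tflite", "wb") as f:
    f.write(tflite_model)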

@manmay-nakhashi

manmay-nakhashi commented Jul 20, 2020

@sujeendran Use TFLITE_BUILTINS_INT8 as the op set during TFLite conversion. Also, can you share your C++ inference code?
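In code form that is roughly the following (only a sketch; the SavedModel path and the representative inputs are placeholders, and FastSpeech actually takes several inputs, each of which would need an entry in the yielded list):

import numpy as np
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/fastspeech_saved_model")  # placeholder
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

def representative_dataset():
    # Feed a handful of realistic inputs so the converter can calibrate activation ranges.
    for _ in range(100):
        yield [np.random.randint(0, 100, size=(1, 50)).astype(np.int32)]  # e.g. token ids

converter.representative_dataset = representative_dataset
tflite_model = converter.convert()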

@sujeendran
Author

sujeendran commented Jul 21, 2020

@manmay-nakhashi Thanks for the tip, I will try that. I'm not at liberty to share the complete C++ code, but I can share a minimal MB-MelGAN inference sample for after the interpreter is loaded. The same pattern can be used for FastSpeech; you just need to set the other input tensor buffers too, and inputtensor will be of type int32_t.
Hope this helps!

// MB-MelGAN
// Input signature  -> [1 -1 80] float32
// Output signature -> [1 -1  1] float32
// Assumes interpreter, inputs (interpreter->inputs()), outputs (interpreter->outputs())
// and currentDim are members initialized when the model was loaded.
void infer(float *inputtensor, int N, float *&output, int &outsize)
{
  // Resize and reallocate tensor buffers only if the input dimension has changed.
  if (currentDim != N)
  {
    const std::vector<int> newDim{1, N, 80};
    interpreter->ResizeInputTensor(0, newDim);
    // Allocate tensor buffers for the new shape.
    interpreter->AllocateTensors();
    currentDim = N;  // remember the new length so we only reallocate when it changes
  }

  // Fill the input buffer with the [1, N, 80] mel spectrogram.
  float *inputptr = interpreter->typed_tensor<float>(inputs[0]);
  memcpy((void *)inputptr, inputtensor, sizeof(float) * N * 80);

  // Run inference
  interpreter->Invoke();

  // Read the output buffer; the output shape is [1, T, 1], so the sample count
  // is the second-to-last dimension.
  TfLiteIntArray *output_dims = interpreter->tensor(outputs[0])->dims;
  int output_size = output_dims->data[output_dims->size - 2];
  printf("Output shape: [1 %d 1]\n", output_size);

  float *outputptr = interpreter->typed_tensor<float>(outputs[0]);
  output = outputptr;
  outsize = output_size;
}

EDIT: I just removed the kTfLiteOk checks on the AllocateTensors and Invoke calls. They were part of an error-check function call I forgot to remove before posting.

@sujeendran
Author

@manmay-nakhashi Can you show your code for the INT8 conversion of the FastSpeech model? I tried several configurations but couldn't get INT8 to work. Did you provide a representative dataset while converting? And how is the inference quality with INT8?
