
Whisper inference support in cpp runtime #2320

Merged: 12 commits, merged into wenet-e2e:main on Jan 25, 2024

Conversation

zhr1201 (Contributor) commented on Jan 23, 2024:

We were trying to reproduce the steps to load whisper in wenet and finetune it for streaming on LibriSpeech, following https://github.com/wenet-e2e/wenet/tree/main/examples/aishell/whisper. Finetuning did seem to work, but when we tried to load the model in the cpp runtime, it only produced empty results. After further inspection, it turns out whisper's feature extraction is very different from the one implemented in the wenet runtime, mainly regarding:

  • STFT window: wenet uses a Povey window, whisper a periodic Hanning window.
  • Mel scale: there are two commonly used versions: (1) the HTK formula, which wenet and Kaldi use, and (2) the Slaney formula from the MATLAB Auditory Toolbox, which librosa implements and which was used to train whisper (see the sketch below).
  • Mel filter weights: Kaldi computes the triangular weights from frequencies on the mel scale, whereas whisper training uses weights computed on the original linear frequency scale.
  • Normalization after computing the PSD: the base of the log operation and how the output is scaled both differ.
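
To make the mel-scale difference concrete, here is a minimal sketch of the two conversion formulas, assuming the standard HTK constants (used by Kaldi/wenet) and the Slaney constants (used by librosa, and hence whisper); the snippet is illustrative, not the PR's actual code:

```cpp
#include <cmath>

enum class MelType { HTK, Slaney };

static inline float MelScale(float freq, MelType mel_type) {
  if (mel_type == MelType::HTK) {
    // HTK/Kaldi convention: logarithmic everywhere.
    return 1127.0f * logf(1.0f + freq / 700.0f);
  }
  // Slaney convention (librosa default): linear below 1 kHz, log above.
  const float f_sp = 200.0f / 3.0f;             // ~66.67 Hz per mel
  const float min_log_hz = 1000.0f;
  const float min_log_mel = min_log_hz / f_sp;  // 15 mel at 1 kHz
  const float logstep = logf(6.4f) / 27.0f;     // log spacing above 1 kHz
  if (freq >= min_log_hz) {
    return min_log_mel + logf(freq / min_log_hz) / logstep;
  }
  return freq / f_sp;
}
```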

Examples showing it works:

decoder_main using whisper (screenshot: whisper_cli)

The CLI is still compatible with the original code; the default behavior doesn't change (screenshot: compatible_cli).

There are still a couple of places that are not perfect:

  1. The token list in the above example is exported directly from the whisper tokenizer, so the output contains things like <|notimestamp|> and <|space|>, which doesn't look nice. Maybe this can be solved purely by cleaning up the token list, but it's also possible that we need some runtime code changes to make the output look good. If we do, I think that's a separate problem and deserves a separate PR; however, if there are simple fixes, I am happy to do them in this PR.
  2. The computed feature is still a bit different from the one computed in python by directly calling the whisper code. I suspect the difference mainly comes from the FFT computation; more details in the comments below.

I know these are a lot of changes, so I am more than happy to restructure them into whatever style you prefer if you think there is a better way.

zhr1201 commented on Jan 23, 2024:

The Hanning window generated in the runtime versus the one from torch; the differences are very small. (screenshot: window)

zhr1201 commented on Jan 23, 2024:

The mel filters generated by the cpp runtime are also very close to the ones from librosa. (screenshot: filters)

zhr1201 commented on Jan 23, 2024:

The computed feature is pretty similar, but NOT THE SAME! Notice the small difference in the max and min values. (screenshot)

I dumped the PSD after the STFT as well and compared it with the one computed using torch; I think this might be the source of the difference. (screenshot: psd_difference)

Diff excerpt — the Slaney weight computation:

```cpp
        weight = (right_mel - mel) / (right_mel - center_mel);
      } else if (mel_type == MelType::Slaney) {
        if (mel <= center_mel) {
          weight = (InverseMelScale(mel, mel_type) -
```
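
For reference, a minimal sketch of a Slaney-style triangular weight computed directly on the linear frequency scale (as librosa does with norm="slaney"); the helper name and signature are hypothetical, not the PR's code:

```cpp
// Hypothetical helper: the triangle rises and falls linearly in Hz, then
// gets librosa's "slaney" area normalization of 2 / bandwidth.
float SlaneyWeight(float freq_hz, float left_hz, float center_hz,
                   float right_hz) {
  float w = 0.0f;
  if (freq_hz > left_hz && freq_hz < right_hz) {
    w = (freq_hz <= center_hz)
            ? (freq_hz - left_hz) / (center_hz - left_hz)
            : (right_hz - freq_hz) / (right_hz - center_hz);
  }
  return w * 2.0f / (right_hz - left_hz);
}
```

Kaldi instead evaluates the same triangle on mel-scaled frequencies, which yields slightly different weights.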

Diff excerpt — the periodic Hanning window:

```cpp
      window_[i] = pow(0.5 - 0.5 * cos(a * i), 0.85);
    } else if (window_type == WindowType::Hanning) {
      // periodic hanning window
      double a = M_2PI / (frame_length_);
```
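
As an aside, the periodic/symmetric distinction is just the denominator; a minimal illustrative sketch (not the PR's code), where the periodic variant matches the default of torch.hann_window used by whisper:

```cpp
#include <cmath>
#include <vector>

std::vector<double> HanningWindow(int n, bool periodic) {
  // periodic: denominator n (matches torch.hann_window's default);
  // symmetric: denominator n - 1 (the classic textbook definition).
  const double denom = periodic ? n : n - 1;
  std::vector<double> w(n);
  for (int i = 0; i < n; ++i) {
    w[i] = 0.5 - 0.5 * std::cos(2.0 * M_PI * i / denom);
  }
  return w;
}
```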

Diff excerpt — MelScale gains a mel-type parameter:

```diff
-static inline float MelScale(float freq) {
-  return 1127.0f * logf(1.0f + freq / 700.0f);
+static inline float MelScale(float freq, MelType mel_type = MelType::HTK) {
```

Diff excerpt — InverseMelScale also takes a mel type:

```cpp
    }
  }

  static inline float InverseMelScale(float mel_freq,
```

zhr1201 commented on Jan 24, 2024 (inline review):

This can be further optimized if needed; there are a lot of repeated computations, but the compiler may already optimize them away through constant propagation.


Diff excerpt — optional input scaling:

```cpp
if (scaled_float_as_input_) {
  for (int j = 0; j < frame_length_; ++j) {
    data[j] = data[j] / S16_TO_FLOAT_SCALE;
```

zhr1201 (inline review): The data fed into this pipeline is integer audio converted to float without scaling, whereas the whisper training code loads audio as floats between -1 and 1.

zhr1201 changed the title from "Whisper inference support in runtime" to "Whisper inference support in cpp runtime" on Jan 23, 2024
zhr1201 marked this pull request as ready for review on January 23, 2024 at 23:25
robin1001 (Collaborator) commented:

Great job, it's clear.

xingchensong (Member) commented on Jan 24, 2024:

Great job!

I think kaldifeat's implementation for computing whisper features could be beneficial for our work.

xingchensong commented:

BTW, why does the CTC result contain <|notimestamp|>? The label provided to the CTC loss function doesn't include this tag (only the label given to the CE loss has it), so <|notimestamp|> shouldn't appear during CTC decoding.

zhr1201 commented on Jan 24, 2024:

Thanks for referencing this implementation, it's really helpful! Yeah, they are doing basically the same thing we want to do. I think we have two ways to support whisper inference in wenet:

  1. Create another FBank implementation, like kaldifeat did, copy over their code, and implement an interface with Compute so we can integrate it into wenet's feature pipeline.

  2. Update Fbank with more options, like what we are doing in this PR.

For 1, since the Fbank computation is not a lot of code, I think it makes sense to have a separate WhisperFbankComputer even without reusing the current FBank code. However, for 1 I don't like that (a) they hard-coded the mel weights (https://github.com/csukuangfj/kaldifeat/blob/master/kaldifeat/csrc/whisper-mel-bank.h), which is less flexible, and (b) they compute the STFT using torch. Since wenet supports runtimes other than torch, we probably don't want to depend on torch in the feature extraction part. That said, if we do think 1 is the preferred structure, we can reuse the current wenet FFT so (b) is not a problem, and we can reuse the code in this PR to generate the filters so (a) is not a problem either.

Based on the above, I think it's mostly a style question: we need to decide whether to reuse Fbank or create a separate WhisperFbank.
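
For illustration, option 2 might look roughly like this at the call site; the setter names here are hypothetical, not the actual API of this PR:

```cpp
// Hypothetical usage sketch of option 2: one Fbank class with extra options.
Fbank fbank(/*num_bins=*/80, /*sample_rate=*/16000,
            /*frame_length=*/400, /*frame_shift=*/160);
fbank.set_window_type(WindowType::Hanning);  // periodic hanning, like whisper
fbank.set_mel_type(MelType::Slaney);         // librosa-style mel scale
fbank.set_log_base(10.0);                    // whisper normalizes with log10
```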

zhr1201 commented on Jan 24, 2024:

> BTW, why does the CTC result contain <|notimestamp|>? The label provided to the CTC loss function doesn't include this tag (only the label given to the CE loss has it), so <|notimestamp|> shouldn't appear during CTC decoding.

Very good question, I am curious as well. I checked again and it does look like those special whisper tokens are not added in the CTC loss calculation, so this shouldn't be possible. Maybe something is wrong with my training setup?

Currently my only hypothesis is that our model didn't converge well (and it generalized very badly on out-of-domain data). It looks like we did make a mistake in training by not changing

language="zh",

when finetuning for English, but I guess that still doesn't explain why we get this <|notimestamp|> token. I will try finetuning again, but it would be great if you could confirm whether the same thing happens with the AISHELL whisper recipe.

Diff excerpt:

```diff
@@ -24,13 +24,43 @@
 #include "frontend/fft.h"
 #include "utils/log.h"

+#define S16_ABS_MAX (2 << 15)
```
zhr1201 commented on Jan 24, 2024 (inline review):
Another concern: this probably shouldn't be hard-coded here, as scaling the input should be the responsibility of the wav reader rather than of the feature extraction pipeline. Moving it to the wav reader would also add flexibility for audio encoded as pcm_s32 or pcm_s8 instead of fixing it to pcm_s16.

However, doing that would also require changes in the http server, websocket server, and grpc server code, and possibly other places like the jni bindings for android. That feels like a decision for the main maintainers of wenet.

(We could do it in a hacky way, e.g. letting the cli take a scale factor and making it a parameter of the feature extraction pipeline, but that doesn't feel right: if the cli takes a list of wav files encoded with different bit depths, a single scale factor won't work, since different samples require different scaling factors.)

I will leave this as it is for now, but let me know your thoughts if you want to update it in this PR; if you would rather fix it in a separate PR, that also works.
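
A minimal sketch of what scaling inside the wav reader could look like, assuming the bit depth is taken from the wav header; the names and signature are illustrative:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical: normalize integer PCM to floats in [-1, 1) based on the
// bit depth from the wav header, instead of hard-coding pcm_s16.
std::vector<float> NormalizePcm(const std::vector<int32_t>& raw,
                                int bits_per_sample) {
  const float scale = 1.0f / static_cast<float>(1u << (bits_per_sample - 1));
  std::vector<float> out(raw.size());
  for (size_t i = 0; i < raw.size(); ++i) {
    out[i] = raw[i] * scale;  // pcm_s16 -> /32768, pcm_s8 -> /128, ...
  }
  return out;
}
```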

robin1001 (Collaborator) commented:
> Based on the above, I think it's mostly a style question: we need to decide whether to reuse Fbank or create a separate WhisperFbank.

Yes, I think the current implementation is okay. There is a lot of hard-coded mel weight data for the whisper fbank in kaldifeat, which is not preferred.

zhr1201 commented on Jan 25, 2024:

Regarding the STFT difference, I think there is no way to make them match exactly. The reason is that when fft_length != window_length, torch's STFT pads the window on both the left and the right: https://github.com/pytorch/pytorch/blob/2d7a360911fb7b27be82c51ca86b4b34b6f1b087/aten/src/ATen/native/SpectralOps.cpp#L936. Normally FFT energy doesn't depend on how you pad the input, but this is how torch splits the audio into frames: https://github.com/pytorch/pytorch/blob/2d7a360911fb7b27be82c51ca86b4b34b6f1b087/aten/src/ATen/native/SpectralOps.cpp#L949. Because of this, padding the window in different places makes different parts of the raw signal get multiplied by different parts of the window, resulting in a different PSD.

But I think it's probably fine; the result will be the same if we shift the sequence by (fft_length - window_length) / 2:

```python
import torch
import torch.nn.functional as F

# Assumed defined elsewhere:
#   wav: 1-D waveform tensor in [-1, 1]
#   filters_512: slaney mel filterbank for n_fft=512,
#                e.g. from librosa.filters.mel(sr=16000, n_fft=512, n_mels=80)
window = torch.hann_window(400)

# pad the input so the window aligns with the audio the same way wenet does;
# 56 = (fft_length - window_length) / 2 = (512 - 400) / 2
padded_wav = F.pad(wav, (56, 56), "constant", 0)
stft = torch.stft(padded_wav,
                  512,
                  160,
                  window=window,
                  center=False,  # this is another trivial source of difference
                  win_length=400,
                  return_complex=True)
magnitudes = stft[..., :-1].abs() ** 2
mel_spec_512 = filters_512 @ magnitudes
log_spec_before_norm_512 = torch.clamp(mel_spec_512, min=1e-10).log10()
log_spec_before_norm_512 = torch.maximum(log_spec_before_norm_512,
                                         log_spec_before_norm_512.max() - 8.0)
log_spec_after_norm_512 = (log_spec_before_norm_512 + 4.0) / 4.0
```

and we get almost the same result. (screenshot)

I think this is a feature, not a bug: the ASR result should not change when the input is shifted by some number of samples.

robin1001 previously approved these changes on Jan 25, 2024
robin1001 merged commit baaa27a into wenet-e2e:main on Jan 25, 2024
6 checks passed
zhr1201 commented on Jan 25, 2024:

Thanks everyone for the review and approval. May wenet keep growing stronger and gaining more users!

robin1001 (Collaborator) commented:

Open source depends on everyone. Thanks for the contribution!

zhr1201 commented on Feb 2, 2024:

For anyone confused about <|notimestamps|>: the token list I used was wrong. That token actually corresponds to the CTC blank token, which won't appear in the final transcript.

Related issue: #2329
