
KWS with CTC loss training and CTC prefix beam search detection. #135

Merged (14 commits, Aug 16, 2023)

Conversation

@duj12 (Contributor) commented Jun 3, 2023

This PR adds KWS training with CTC loss and detection with CTC prefix beam search.
It aims to improve the robustness of the KWS model and to support customized keywords with limited data.
For now, only the hi_xiaowen data has running scripts.

I redid the experiment of ds_tcn with max-pooling loss.
Then I added the ds_tcn model with CTC loss, and also added an FSMN backbone.
Finally, I added a streaming scoring script to simulate the real detection case of a CTC model (see the sketch below for the training objective itself).
All results can be found in README.md.
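For readers new to it, a minimal sketch of what training with CTC loss looks like in PyTorch; the shapes, batch size, and vocabulary size here are hypothetical, and the PR's actual training loop lives in the wekws scripts:

```python
import torch

# Hypothetical shapes: 100 frames, batch of 4, 410 CTC output units.
logits = torch.randn(100, 4, 410, requires_grad=True)  # (frames, batch, units)
log_probs = logits.log_softmax(-1)
targets = torch.randint(1, 410, (4, 6))   # keyword token ids per utterance
input_lengths = torch.full((4,), 100, dtype=torch.long)
target_lengths = torch.full((4,), 6, dtype=torch.long)

ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back into the acoustic model
```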

Note: the CTC model can be exported to ONNX (the output is the softmax of the logits), but the runtime does not support it yet.
I have decided to first develop a Python script that runs in a streaming fashion (doing online feature extraction and so on), and then the ONNX C++ and pybind parts...
When the runtime is ready, I will create a new PR.
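For illustration, a minimal sketch of that export step (wrapping the model with a softmax so the ONNX output is per-frame posteriors rather than raw logits). The stand-in model, the wrapper, and the shapes are assumptions for the sketch, not this PR's actual export script:

```python
import torch

# Stand-in for the trained acoustic model (hypothetical): maps
# 80-dim fbank frames to 410 CTC output units.
trained_model = torch.nn.Linear(80, 410)

class CTCExportWrapper(torch.nn.Module):
    """Wrap the model so the exported graph emits per-frame
    posteriors (softmax of logits) instead of raw logits."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, feats):
        logits = self.model(feats)            # (batch, frames, units)
        return torch.softmax(logits, dim=-1)  # per-frame posteriors

dummy = torch.randn(1, 100, 80)  # (batch, frames, feature_dim)
torch.onnx.export(CTCExportWrapper(trained_model), dummy, 'kws_ctc.onnx',
                  input_names=['feats'], output_names=['posteriors'],
                  dynamic_axes={'feats': {1: 'frames'},
                                'posteriors': {1: 'frames'}})
```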

@mlxu995 (Collaborator) commented Jun 4, 2023

Good job! Thank you so much for this valuable update. We would appreciate it if you could help fix the lint errors, such as the trailing whitespace.

@duj12 (Contributor, Author) commented Jun 28, 2023

Here is a demo. All models are trained with CTC loss, and in this demo the detection is performed in a streaming fashion.
https://www.modelscope.cn/studios/thuduj12/KWS_Nihao_Xiaojing/summary
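For intuition about how CTC-based detection can score a keyword from streaming posteriors, here is a simplified sketch that computes the keyword's CTC forward score; this is a plain forward-algorithm scorer, not the PR's prefix beam search, and the posteriors and token ids are hypothetical:

```python
import numpy as np

def ctc_keyword_score(posteriors, keyword_ids, blank=0):
    """Log-probability that the frames emit exactly `keyword_ids`
    under CTC (standard forward algorithm with interleaved blanks)."""
    T, V = posteriors.shape
    ext = [blank]
    for k in keyword_ids:
        ext += [k, blank]                 # blank-interleaved label sequence
    S = len(ext)
    logp = np.log(posteriors + 1e-12)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = logp[0, ext[0]]
    if S > 1:
        alpha[0, 1] = logp[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]     # stay on the same state
            if s > 0:
                cands.append(alpha[t - 1, s - 1])   # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])   # skip a blank
            alpha[t, s] = np.logaddexp.reduce(cands) + logp[t, ext[s]]
    # End in the final label or the final blank.
    return np.logaddexp(alpha[-1, -1], alpha[-1, -2])

# Hypothetical posteriors: 50 frames over 5 CTC units (unit 0 = blank).
post = np.random.dirichlet(np.ones(5), size=50)
print(ctc_keyword_score(post, keyword_ids=[1, 3, 2]))
```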

@jingyonghou (Collaborator) commented Jun 29, 2023 via email

```
--checkpoint $score_checkpoint \
--score_file $result_dir/score.txt \
--num_workers 8 \
--keywords 嗨小问,你好问问 \
```
Collaborator: It would be better to replace the Chinese characters with Latin characters in the code.

Contributor Author: This has been done in the latest commit.

```
--lexicon_file data/lexicon.txt

python wekws/bin/compute_det_ctc.py \
    --keywords 嗨小问,你好问问 \
```
Collaborator: It would be better to replace the Chinese characters with Latin characters in the code.

Contributor Author: Done.

```
| DS_TCN(spec_aug) | CTC | 0.056574 | 0.056856 |

Comparison between DS_TCN (pretrained with WenetSpeech, 23 epochs)
```
Collaborator: Is it possible to release the pre-trained model, so that people can reproduce the experiment?

Contributor Author: Pretrained models are released in the latest commit (see README).

@duj12 duj12 requested a review from jingyonghou July 24, 2023 09:34
@duj12 (Contributor, Author) commented Jul 24, 2023

Here I did some more experiments on a private dataset:

|       | positive (hello_xiaojing) | negative (noise) |
|-------|---------------------------|------------------|
| train | 18 speakers, 2219 segments | 55 hours |
| dev   | 2 speakers, 248 segments   | 12 hours |
| test  | 4 speakers, 474 segments   | 24 hours |

The results are as follows:

| backbone | loss | 1-FRR(%) | FAR(/24h) | Threshold |
|----------|------|----------|-----------|-----------|
| ds_tcn   | maxpooling | 81.1 | 2 | 0.88 |
| ds_tcn   | ctc        | 89.7 | 1 | 0.02 |
| fsmn     | ctc        | 93.3 | 2 | 0.018 |

As we can see, the CTC-KWS model outperforms the max-pooling model with the DS_TCN backbone.
Since the FSMN and DS_TCN use different pretraining data (and also different feature pipelines and numbers of epochs), it is hard to say which backbone is better.
But the pretrained FSMN (released by Alibaba at https://modelscope.cn/models/damo/speech_charctc_kws_phone-xiaoyun/summary ) is better than the DS_TCN model I pretrained (23.pt at https://modelscope.cn/datasets/thuduj12/mobvoi_kws_transcription/files ).
So I recommend using the pretrained FSMN model to train your models.
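To be clear about the metrics: 1-FRR is the percentage of positive test segments on which the keyword was detected, and FAR(/24h) is the number of false alarms over the 24-hour negative test set at the given threshold. A minimal sketch of how these relate to raw counts (the detection count is hypothetical; wekws/bin/compute_det_ctc.py computes the full curve across thresholds):

```python
# Hypothetical counts from a single detection run at a fixed threshold.
# (474 positive segments and 24 negative hours match the test set above;
# the detection count itself is made up for illustration.)
num_positive = 474
num_detected = 425
num_false_alarms = 2
negative_hours = 24.0

one_minus_frr = 100.0 * num_detected / num_positive      # "1-FRR(%)"
far_per_24h = num_false_alarms * 24.0 / negative_hours   # "FAR(/24h)"
print(f"1-FRR = {one_minus_frr:.1f}%, FAR = {far_per_24h:.0f}/24h")
```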

@haha010508 commented:

```python
import librosa
from stream_kws_ctc import KeyWordSpotter

def kws_process(wav_path):
    kws = KeyWordSpotter(
        ckpt_path='model/avg_30.pt',
        config_path='model/config.yaml',
        token_path='model/tokens.txt',
        lexicon_path='model/lexicon.txt',
        threshold=0.02,
        min_frames=5,
        max_frames=250,
        interval_frames=50,
        score_beam=3,
        path_beam=20,
        gpu=-1,
        is_jit_model=False,
    )
    kws.set_keywords("你好小镜")

    # wav_path = '/path/to/your/wave'
    # NOTE: the model supports 16k sample_rate
    y, _ = librosa.load(wav_path, sr=16000, mono=True)
    # Convert float samples to 16-bit PCM bytes.
    wav = (y * (1 << 15)).astype("int16").tobytes()

    # We run inference every 0.3 seconds, in a streaming fashion
    # (2 bytes per int16 sample).
    interval = int(0.3 * 16000) * 2
    for i in range(0, len(wav), interval):
        chunk_wav = wav[i: min(i + interval, len(wav))]
        result = kws.forward(chunk_wav)
        print(result)

if __name__ == '__main__':
    file = r'./chentianbo_ceqianfang_anjing_0001.wav'
    kws_process(file)
```

Even the bundled audio file cannot trigger the wake word. Is there something wrong?

@duj12 (Contributor, Author) commented Aug 14, 2023

> Even the bundled audio file cannot trigger the wake word. Is there something wrong?

You can check the model you used. In kws_demo, I gave two models: one is for "Hi_XiaoWen" and the other is for "Nihao_Xiaojing"; check that first. This demo also has a web server, so you can verify it at https://www.modelscope.cn/studios/thuduj12/KWS_Nihao_Xiaojing/summary

@haha010508 commented:

> You can check the model you used. In kws_demo, I gave two models: one is for "Hi_XiaoWen" and the other is for "Nihao_Xiaojing"; check that first.

I got two models: one is 23.pt (keyword 你好问问 or 嗨小问), and the other is avg_30.pt (keyword 你好晓静). Am I right? I want to reproduce the result from your web server on my computer, so please help. Thanks a lot!

@duj12 (Contributor, Author) commented Aug 14, 2023

> I got two models: one is 23.pt (keyword 你好问问 or 嗨小问), and the other is avg_30.pt (keyword 你好晓静). Am I right? I want to reproduce the result from your web server on my computer, so please help. Thanks a lot!

Here are two models you can use: https://github.com/duj12/kws_demo/tree/master/model (the code is also there).
The model (23.pt) you mentioned is a pretrained model without specific keywords; you cannot use that pretrained model to do KWS directly.

@robin1001 robin1001 merged commit b233d46 into wenet-e2e:main Aug 16, 2023
4 checks passed
@robin1001 (Contributor) commented:

Great job! Let's merge and refine it in the future.

@Dapannnnn commented:

> Here I did some more experiments on a private dataset [...] So I recommend using the pretrained FSMN model to train your models.

Hi~ When I run run_fsmn_ctc.sh, I encounter the following problem:

[screenshot of the error]

This error is reported when exporting to ONNX. I think it is related to the input settings, but I don't know how to set the inputs correctly for the FSMN model.

@duj12 (Contributor, Author) commented Sep 18, 2023 via email

@teinhonglo commented Dec 20, 2023

Hi! Could you point to the paper that is implemented in this pull request?
Oh! Is it this one?
Chen, Mengzhe, et al. "Compact Feedforward Sequential Memory Networks for Small-footprint Keyword Spotting." Interspeech. 2018.

I have another question: how can I transfer it to the Google Speech Commands dataset?
Could you give a quick hint?

Thanks in advance.
TH

@duj12 (Contributor, Author) commented Jan 9, 2024

> Hi! Could you point to the paper that is implemented in this pull request? [...] How can I transfer it to the Google Speech Commands dataset?

For now, the lexicon.txt and tokens.txt in Hi_XiaoWen's fsmn_CTC model only support Chinese KWS. You need to build a new lexicon and token dict; maybe you can use some resources from English ASR projects. tokens.txt lists the CTC model's output units, and lexicon.txt maps words to sequences of those output units.
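To make that relationship concrete, here is a hypothetical fragment of what English versions of the two files could look like; the phone-based unit inventory and exact layout are illustrative assumptions, not the files shipped with the model:

```
# tokens.txt: one CTC output unit per line, with its index
<blank> 0
HH 1
EH 2
L 3
OW 4

# lexicon.txt: each word followed by its sequence of output units
HELLO HH EH L OW
```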

@AWangji commented Mar 22, 2024

> For now, the lexicon.txt and tokens.txt in Hi_XiaoWen's fsmn_CTC model only support Chinese KWS. You need to build a new lexicon and token dict [...]

Hi, could you please tell us the detailed steps for creating new customized keywords?

@lin-xiaosheng commented:

> Hi, could you please tell us the detailed steps for creating new customized keywords?

Yes, I also want to know how to quickly add custom Chinese keywords based on the kws_demo. It would be great if the author could briefly explain that.
