
KWS with CTC loss training and CTC prefix beam search detection. #135

Merged (14 commits, Aug 16, 2023)

Conversation

@duj12 (Contributor) commented Jun 3, 2023

This PR adds KWS training with CTC loss and detection with CTC prefix beam search.
It aims to improve the robustness of the KWS model and to support customized keywords with limited data.
For now, only the hi_xiaowen data has running scripts.

I redid the experiment of ds_tcn with max-pooling loss.
Then I added the ds_tcn model with CTC loss, and also added an FSMN backbone.
Finally, I added a streaming scoring script to simulate the real detection case of a CTC model (see the sketch below for the training objective itself).
All results can be found in README.md.
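For readers new to it, a minimal sketch of what training with CTC loss looks like in PyTorch; the shapes, batch size, and vocabulary size here are hypothetical, and the PR's actual training loop lives in the wekws scripts:

```python
import torch

# Hypothetical shapes: 100 frames, batch of 4, 410 CTC output units.
logits = torch.randn(100, 4, 410, requires_grad=True)  # (frames, batch, units)
log_probs = logits.log_softmax(-1)
targets = torch.randint(1, 410, (4, 6))   # keyword token ids per utterance
input_lengths = torch.full((4,), 100, dtype=torch.long)
target_lengths = torch.full((4,), 6, dtype=torch.long)

ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back into the acoustic model
```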

Note: the CTC model can be exported to ONNX (the output is the softmax of the logits), but the runtime does not support it yet.
I have decided to first develop a Python script that runs in a streaming fashion (doing online feature extraction and so on), and then the ONNX C++ and pybind parts...
When the runtime is ready, I will create a new PR.
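For illustration, a minimal sketch of that export step (wrapping the model with a softmax so the ONNX output is per-frame posteriors rather than raw logits). The stand-in model, the wrapper, and the shapes are assumptions for the sketch, not this PR's actual export script:

```python
import torch

# Stand-in for the trained acoustic model (hypothetical): maps
# 80-dim fbank frames to 410 CTC output units.
trained_model = torch.nn.Linear(80, 410)

class CTCExportWrapper(torch.nn.Module):
    """Wrap the model so the exported graph emits per-frame
    posteriors (softmax of logits) instead of raw logits."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, feats):
        logits = self.model(feats)            # (batch, frames, units)
        return torch.softmax(logits, dim=-1)  # per-frame posteriors

dummy = torch.randn(1, 100, 80)  # (batch, frames, feature_dim)
torch.onnx.export(CTCExportWrapper(trained_model), dummy, 'kws_ctc.onnx',
                  input_names=['feats'], output_names=['posteriors'],
                  dynamic_axes={'feats': {1: 'frames'},
                                'posteriors': {1: 'frames'}})
```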

@mlxu995 (Collaborator) commented Jun 4, 2023

Good job! Thank you so much for this valuable update. We would appreciate it if you could help fix the lint errors, such as the trailing whitespace.

@duj12 (Contributor, Author) commented Jun 28, 2023

Here is a demo. All models are trained with CTC loss, and in this demo the detection is performed in a streaming fashion.
https://www.modelscope.cn/studios/thuduj12/KWS_Nihao_Xiaojing/summary
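For intuition about how CTC-based detection can score a keyword from streaming posteriors, here is a simplified sketch that computes the keyword's CTC forward score; this is a plain forward-algorithm scorer, not the PR's prefix beam search, and the posteriors and token ids are hypothetical:

```python
import numpy as np

def ctc_keyword_score(posteriors, keyword_ids, blank=0):
    """Log-probability that the frames emit exactly `keyword_ids`
    under CTC (standard forward algorithm with interleaved blanks)."""
    T, V = posteriors.shape
    ext = [blank]
    for k in keyword_ids:
        ext += [k, blank]                 # blank-interleaved label sequence
    S = len(ext)
    logp = np.log(posteriors + 1e-12)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = logp[0, ext[0]]
    if S > 1:
        alpha[0, 1] = logp[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]     # stay on the same state
            if s > 0:
                cands.append(alpha[t - 1, s - 1])   # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])   # skip a blank
            alpha[t, s] = np.logaddexp.reduce(cands) + logp[t, ext[s]]
    # End in the final label or the final blank.
    return np.logaddexp(alpha[-1, -1], alpha[-1, -2])

# Hypothetical posteriors: 50 frames over 5 CTC units (unit 0 = blank).
post = np.random.dirichlet(np.ones(5), size=50)
print(ctc_keyword_score(post, keyword_ids=[1, 3, 2]))
```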

@jingyonghou (Collaborator) commented Jun 29, 2023 via email

```
--checkpoint $score_checkpoint \
--score_file $result_dir/score.txt \
--num_workers 8 \
--keywords 嗨小问,你好问问 \
```
Collaborator: It would be better to replace the Chinese characters with Latin characters in the code.

Contributor Author: This has been done in the latest commit.

```
--lexicon_file data/lexicon.txt

python wekws/bin/compute_det_ctc.py \
    --keywords 嗨小问,你好问问 \
```
Collaborator: It would be better to replace the Chinese characters with Latin characters in the code.

Contributor Author: Done.

```
| DS_TCN(spec_aug) | CTC | 0.056574 | 0.056856 |

Comparison between DS_TCN (pretrained with WenetSpeech, 23 epochs)
```
Collaborator: Is it possible to release the pre-trained model, so that people can reproduce the experiment?

Contributor Author: Pretrained models are released in the latest commit (see README).

@duj12 duj12 requested a review from jingyonghou July 24, 2023 09:34
@duj12 (Contributor, Author) commented Jul 24, 2023

Here I did some more experiments on a private dataset:

|       | positive (hello_xiaojing) | negative (noise) |
|-------|---------------------------|------------------|
| train | 18 speakers, 2219 segments | 55 hours |
| dev   | 2 speakers, 248 segments   | 12 hours |
| test  | 4 speakers, 474 segments   | 24 hours |

The results are as follows:

| backbone | loss | 1-FRR(%) | FAR(/24h) | Threshold |
|----------|------|----------|-----------|-----------|
| ds_tcn   | maxpooling | 81.1 | 2 | 0.88 |
| ds_tcn   | ctc        | 89.7 | 1 | 0.02 |
| fsmn     | ctc        | 93.3 | 2 | 0.018 |

As we can see, the CTC-KWS model outperforms the max-pooling model with the DS_TCN backbone.
Since the FSMN and DS_TCN use different pretraining data (and also different feature pipelines and numbers of epochs), it is hard to say which backbone is better.
But the pretrained FSMN (released by Alibaba at https://modelscope.cn/models/damo/speech_charctc_kws_phone-xiaoyun/summary ) is better than the DS_TCN model I pretrained (23.pt at https://modelscope.cn/datasets/thuduj12/mobvoi_kws_transcription/files ).
So I recommend using the pretrained FSMN model to train your models.
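To be clear about the metrics: 1-FRR is the percentage of positive test segments on which the keyword was detected, and FAR(/24h) is the number of false alarms over the 24-hour negative test set at the given threshold. A minimal sketch of how these relate to raw counts (the detection count is hypothetical; wekws/bin/compute_det_ctc.py computes the full curve across thresholds):

```python
# Hypothetical counts from a single detection run at a fixed threshold.
# (474 positive segments and 24 negative hours match the test set above;
# the detection count itself is made up for illustration.)
num_positive = 474
num_detected = 425
num_false_alarms = 2
negative_hours = 24.0

one_minus_frr = 100.0 * num_detected / num_positive      # "1-FRR(%)"
far_per_24h = num_false_alarms * 24.0 / negative_hours   # "FAR(/24h)"
print(f"1-FRR = {one_minus_frr:.1f}%, FAR = {far_per_24h:.0f}/24h")
```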

@haha010508 commented:

```python
import librosa
from stream_kws_ctc import KeyWordSpotter

def kws_process(wav_path):
    kws = KeyWordSpotter(
        ckpt_path='model/avg_30.pt',
        config_path='model/config.yaml',
        token_path='model/tokens.txt',
        lexicon_path='model/lexicon.txt',
        threshold=0.02,
        min_frames=5,
        max_frames=250,
        interval_frames=50,
        score_beam=3,
        path_beam=20,
        gpu=-1,
        is_jit_model=False,
    )
    kws.set_keywords("你好小镜")

    # wav_path = '/path/to/your/wave'
    # NOTE: the model supports 16k sample_rate
    y, _ = librosa.load(wav_path, sr=16000, mono=True)
    # Convert float samples to 16-bit PCM bytes.
    wav = (y * (1 << 15)).astype("int16").tobytes()

    # We run inference every 0.3 seconds, in a streaming fashion
    # (2 bytes per int16 sample).
    interval = int(0.3 * 16000) * 2
    for i in range(0, len(wav), interval):
        chunk_wav = wav[i: min(i + interval, len(wav))]
        result = kws.forward(chunk_wav)
        print(result)

if __name__ == '__main__':
    file = r'./chentianbo_ceqianfang_anjing_0001.wav'
    kws_process(file)
```

Even the bundled audio file cannot trigger the wake word. Is there something wrong?

@duj12 (Contributor, Author) commented Aug 14, 2023

> Even the bundled audio file cannot trigger the wake word. Is there something wrong?

You can check the model you used. In kws_demo, I gave two models: one is for "Hi_XiaoWen" and the other is for "Nihao_Xiaojing"; check that first. This demo also has a web server, so you can verify it at https://www.modelscope.cn/studios/thuduj12/KWS_Nihao_Xiaojing/summary

@haha010508 commented:

> You can check the model you used. In kws_demo, I gave two models: one is for "Hi_XiaoWen" and the other is for "Nihao_Xiaojing"; check that first.

I got two models: one is 23.pt (keyword 你好问问 or 嗨小问), and the other is avg_30.pt (keyword 你好晓静). Am I right? I want to reproduce the result from your web server on my computer, so please help. Thanks a lot!

@duj12 (Contributor, Author) commented Aug 14, 2023

> I got two models: one is 23.pt (keyword 你好问问 or 嗨小问), and the other is avg_30.pt (keyword 你好晓静). Am I right? I want to reproduce the result from your web server on my computer, so please help. Thanks a lot!

Here are two models you can use: https://github.com/duj12/kws_demo/tree/master/model (the code is also there).
The model (23.pt) you mentioned is a pretrained model without specific keywords; you cannot use that pretrained model to do KWS directly.

@robin1001 robin1001 merged commit b233d46 into wenet-e2e:main Aug 16, 2023
4 checks passed
@robin1001 (Contributor) commented:

Great job! Let's merge and refine it in the future.

@Dapannnnn commented:

> Here I did some more experiments on a private dataset [...] So I recommend using the pretrained FSMN model to train your models.

Hi~ When I run run_fsmn_ctc.sh, I encounter the following problem:

[screenshot of the error]

This error is reported when exporting to ONNX. I think it is related to the input settings, but I don't know how to set the inputs correctly for the FSMN model.

@duj12 (Contributor, Author) commented Sep 18, 2023 via email

@teinhonglo commented Dec 20, 2023

Hi! Could you point to the paper that is implemented in this pull request?
Oh! Is it this one?
Chen, Mengzhe, et al. "Compact Feedforward Sequential Memory Networks for Small-footprint Keyword Spotting." Interspeech. 2018.

I have another question: how can I transfer it to the Google Speech Commands dataset?
Could you give a quick hint?

Thanks in advance.
TH

@duj12 (Contributor, Author) commented Jan 9, 2024

> Hi! Could you point to the paper that is implemented in this pull request? [...] How can I transfer it to the Google Speech Commands dataset?

For now, the lexicon.txt and tokens.txt in Hi_XiaoWen's fsmn_CTC model only support Chinese KWS. You need to build a new lexicon and token dict; maybe you can use some resources from English ASR projects. tokens.txt lists the CTC model's output units, and lexicon.txt maps words to sequences of those output units.
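To make that relationship concrete, here is a hypothetical fragment of what English versions of the two files could look like; the phone-based unit inventory and exact layout are illustrative assumptions, not the files shipped with the model:

```
# tokens.txt: one CTC output unit per line, with its index
<blank> 0
HH 1
EH 2
L 3
OW 4

# lexicon.txt: each word followed by its sequence of output units
HELLO HH EH L OW
```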

@AWangji commented Mar 22, 2024

> For now, the lexicon.txt and tokens.txt in Hi_XiaoWen's fsmn_CTC model only support Chinese KWS. You need to build a new lexicon and token dict [...]

Hi, could you please tell us the detailed steps for creating new customized keywords?

@lin-xiaosheng commented:

> Hi, could you please tell us the detailed steps for creating new customized keywords?

Yes, I also want to know how to quickly add custom Chinese keywords based on the kws_demo. It would be great if the author could briefly explain that.
