Voice Activity Controller #39

Closed
rodrigoGA opened this issue Dec 1, 2023 · 9 comments

Comments

@rodrigoGA (Contributor)

Hello, I found your project interesting. Good job.

I believe there is an incorrect use of VAD. The get_speech_timestamps function used by faster-whisper is a copy of the function from Silero, which is intended for complete audio files. When working with streaming, however, the audio arrives as fragments. Silero already includes a utility for this case: https://github.com/snakers4/silero-vad/blob/5e7ee10ee065ab2b98751dd82b28e3c6360e19aa/utils_vad.py#L428
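For reference, the silero-vad README shows that streaming utility (VADIterator) being used roughly as follows; the file name is illustrative:

import torch

# Load the silero VAD model and its bundled helpers via torch.hub.
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

vad_iterator = VADIterator(model)
wav = read_audio('example.wav', sampling_rate=16000)  # illustrative file name

window_size_samples = 512  # samples per streamed chunk at 16 kHz
for i in range(0, len(wav), window_size_samples):
    chunk = wav[i: i + window_size_samples]
    if len(chunk) < window_size_samples:
        break
    speech_dict = vad_iterator(chunk, return_seconds=True)
    if speech_dict:           # {'start': ...} or {'end': ...} at speech boundaries
        print(speech_dict)
vad_iterator.reset_states()   # reset internal state between independent streams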

I have forked your project to test this: https://github.com/rodrigoGA/whisper_streaming/tree/main
Changing the way VAD is used seemed to improve the results.

One of the main drawbacks I found is the delay in obtaining the transcription, which gives an unpleasant feeling, especially when the conversation ends and no transcription is received for a few seconds. Therefore, I created a VAD-based class that flushes the buffer once it detects that the user has not spoken for 0.5 seconds: https://github.com/rodrigoGA/whisper_streaming/blob/main/voice_activity_controller.py
In this file, you can find an example that transcribes from the microphone: https://github.com/rodrigoGA/whisper_streaming/blob/main/mic_test_whisper_streaming.py
It greatly improves the feeling of real-time transcription; perhaps a similar idea can be applied. I say "feeling" because I haven't done any serious performance testing.
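The flush-on-silence idea could be sketched like this (a hypothetical illustration, not the actual voice_activity_controller.py; the class name and thresholds are assumptions):

import numpy as np
import torch

SAMPLING_RATE = 16000
SILENCE_LIMIT_S = 0.5   # flush after this much trailing silence

class SilenceFlusher:
    """Accumulates audio and flushes it once the VAD reports enough silence."""

    def __init__(self, vad_model):
        self.vad = vad_model      # silero VAD model: returns a speech probability per chunk
        self.buffer = []
        self.silence_s = 0.0

    def feed(self, chunk: np.ndarray):
        """chunk: fixed-size float32 mono audio (e.g. 512 samples at 16 kHz).
        Returns the buffered audio when it is time to flush, else None."""
        prob = self.vad(torch.from_numpy(chunk), SAMPLING_RATE).item()
        self.buffer.append(chunk)
        if prob < 0.5:            # low speech probability -> count as silence
            self.silence_s += len(chunk) / SAMPLING_RATE
        else:
            self.silence_s = 0.0
        if self.silence_s >= SILENCE_LIMIT_S:
            audio = np.concatenate(self.buffer)
            self.buffer, self.silence_s = [], 0.0
            return audio          # caller transcribes this promptly
        return None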

I've also created a simple example that transcribes when the user stops talking to compare results: https://github.com/rodrigoGA/whisper_streaming/blob/main/mic_test_whisper_simple.py

Another point I think you should consider is the tokens you are using. In languages like Spanish, questions are enclosed in question marks at the beginning and end, and can have other punctuation marks in the middle, for example: "¿Cuál es la capital de Francia, y por qué es conocida por su arquitectura?" However, in some situations your approach has transcribed it as: "cual es la capital de Francia, ¿por qué es conocida por su arquitectura?" It might be a problem with Whisper, but I think it is caused by the tokens you have applied.

@Gldkslfmsd (Collaborator)

Wow, thank you, @rodrigoGA! This is very interesting feedback. I want to review and test your approach and possibly merge the useful parts later, when I have time.
Thanks!

@rodrigoGA (Contributor, Author)

If the suggestion is integrated, I would also suggest changing the way the transcription is returned. All streaming systems indicate in some way whether a result is partial or final. That way, what is in the buffer could be returned as partial, and the user would get more realistic feedback of what is being said, with the understanding that a partial result can still change.

@Gldkslfmsd (Collaborator)

Yes, an option for |||-separated partial output is possible. But in any case, I don't want a more complicated output protocol. Plaintext is enough.
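For illustration only (this format is hypothetical, not an implemented protocol), such plaintext output might look like:

this is the committed transcript so far ||| a partial hypothesis that may still change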

@rodrigoGA (Contributor, Author) commented Dec 4, 2023

I understand the idea of keeping it simple. However, this is the standard in streaming ASR. You can check how NVIDIA uses 'is_final' for all streaming models supported by the Riva platform (https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/protos/protos.html#_CPPv428SpeechRecognitionAlternative), or the companies that sell the model as a service through streaming APIs (https://www.assemblyai.com/docs/guides/real-time-streaming-transcription).
All of them use the same concept. As a consumer of these services, I can tell you that this is very useful for knowing when the user is speaking and for getting feedback on what is happening, even though the transcription has not finished. Imagine you want to use ASR in a real-world use case, for example transcribing a phone call. You would need to know when the user stops speaking and the transcription is final in order to do something with the text. Otherwise, you would have to wait until the call ends to consider the transcription complete, which would lose the real-time aspect.
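The partial/final distinction boils down to a result record like the following sketch (field and type names are hypothetical, loosely modeled on the Riva/AssemblyAI responses cited above):

from dataclasses import dataclass

@dataclass
class TranscriptResult:
    text: str
    is_final: bool   # False: hypothesis may still change; True: segment is committed

def consume(results):
    """Illustrative consumer loop over a stream of TranscriptResult objects."""
    for r in results:
        if r.is_final:
            print("FINAL:  ", r.text)   # safe to act on (e.g. hand to an LLM)
        else:
            print("partial:", r.text)   # show as live feedback only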

Gldkslfmsd added a commit that referenced this issue on Jan 3, 2024:
"it works. Reproducing #39"

Gldkslfmsd changed the title from "Feedback" to "Voice Activity Controller" on Feb 6, 2024
@Gldkslfmsd (Collaborator)

@rodrigoGA, thank you very much again. I integrated your VAC in https://github.com/ufal/whisper_streaming/tree/vad-streaming. It seems to work well, but the code needs to be reviewed and made clearer and simpler. Then I can merge it.

@SaddamBInSyed

@Gldkslfmsd @rodrigoGA

I tried to run the mic_test_whisper_streaming.py code (on a GPU-enabled machine), and I am getting the exception below.

Can you advise what I am doing wrong?

/home//miniconda3/envs/test_rag/bin/python /home//Downloads/whisper_streaming/mic_test_whisper_streaming.py
ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm_route.c:877:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib confmisc.c:160:(snd_config_get_card) Invalid field card
ALSA lib pcm_usb_stream.c:482:(_snd_pcm_usb_stream_open) Invalid card 'card'
ALSA lib confmisc.c:160:(snd_config_get_card) Invalid field card
ALSA lib pcm_usb_stream.c:482:(_snd_pcm_usb_stream_open) Invalid card 'card'
Using cache found in /home/.cache/torch/hub/snakers4_silero-vad_master
Traceback (most recent call last):
File "/home//Downloads/whisper_streaming/voice_activity_controller.py", line 54, in apply_vad
x = torch.Tensor(x)
TypeError: new(): invalid data type 'bytes'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/Downloads/whisper_streaming/mic_test_whisper_streaming.py", line 28, in
for iter in vad.detect_user_speech(microphone_stream): # processing loop:
File "/home/Downloads/whisper_streaming/voice_activity_controller.py", line 110, in detect_user_speech
yield self.detect_speech_iter(data, audio_in_int16)
File "/home/Downloads/whisper_streaming/voice_activity_controller.py", line 83, in detect_speech_iter
voice_audio, speech_in_wav, last_silent_in_wav = self.apply_vad(wav)
File "/home/Downloads/whisper_streaming/voice_activity_controller.py", line 56, in apply_vad
raise TypeError("Audio cannot be casted to tensor. Cast it manually")
TypeError: Audio cannot be casted to tensor. Cast it manually
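A plausible cause, judging from the traceback (an assumption, not a confirmed diagnosis): the microphone callback delivers raw PCM bytes, while the VAD expects a float tensor. Converting int16 PCM bytes to a normalized float32 tensor before calling the VAD would avoid the cast error:

import numpy as np
import torch

def bytes_to_tensor(raw: bytes) -> torch.Tensor:
    """Convert raw 16-bit PCM bytes to a float32 tensor scaled to [-1.0, 1.0]."""
    audio_int16 = np.frombuffer(raw, dtype=np.int16)
    audio_float32 = audio_int16.astype(np.float32) / 32768.0
    return torch.from_numpy(audio_float32)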

@Gldkslfmsd (Collaborator)

Hi, this branch is a work in progress. I'm aware that mic_test_whisper_streaming.py may not be working. I plan to remove it from this repo because I can't maintain other people's mics.

I recommend using another way of inputting audio, such as through a file, through stdin, or through the server and client.

@SaddamBInSyed commented Jul 31, 2024

@Gldkslfmsd

Well noted, thanks for your reply.

While testing the app, whenever I speak a sentence such as "Can you tell me about the highest paid sports player?", I receive the output as split sentences, like {'Can you tell me'} and {'about the highest paid sports player'}. Since I am passing this to an LLM, I need the full sentence before any silence occurs. Could you guide me on how to handle this scenario?

@Gldkslfmsd (Collaborator)

Thanks again for your suggestion, @rodrigoGA. I cleaned up the code and merged it now.
