Voice Activity Controller #39

Closed
rodrigoGA opened this issue Dec 1, 2023 · 9 comments

Comments

@rodrigoGA (Contributor)

Hello, I found your project interesting. Good job.

I believe there is an incorrect use of VAD. The get_speech_timestamps function used by faster-whisper is a copy of the function from Silero, which is intended for complete audio files. When working with streaming, however, the audio arrives as fragments. Silero already includes a utility for this case: https://github.com/snakers4/silero-vad/blob/5e7ee10ee065ab2b98751dd82b28e3c6360e19aa/utils_vad.py#L428
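For reference, the silero-vad README shows that streaming utility (VADIterator) being used roughly as follows; the file name is illustrative:

import torch

# Load the silero VAD model and its bundled helpers via torch.hub.
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

vad_iterator = VADIterator(model)
wav = read_audio('example.wav', sampling_rate=16000)  # illustrative file name

window_size_samples = 512  # samples per streamed chunk at 16 kHz
for i in range(0, len(wav), window_size_samples):
    chunk = wav[i: i + window_size_samples]
    if len(chunk) < window_size_samples:
        break
    speech_dict = vad_iterator(chunk, return_seconds=True)
    if speech_dict:           # {'start': ...} or {'end': ...} at speech boundaries
        print(speech_dict)
vad_iterator.reset_states()   # reset internal state between independent streams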

I have forked your project to test this: https://github.com/rodrigoGA/whisper_streaming/tree/main
Changing the way VAD is used seemed to improve the results.

One of the main drawbacks I found is the delay in obtaining the transcription, which gives an unpleasant feeling, especially when the conversation ends and no transcription is received for a few seconds. Therefore, I created a VAD-based class that flushes the buffer once it detects that the user has not spoken for 0.5 seconds: https://github.com/rodrigoGA/whisper_streaming/blob/main/voice_activity_controller.py
In this file, you can find an example that transcribes from the microphone: https://github.com/rodrigoGA/whisper_streaming/blob/main/mic_test_whisper_streaming.py
It greatly improves the feeling of real-time transcription; perhaps a similar idea can be applied. I say "feeling" because I haven't done any serious performance testing.
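The flush-on-silence idea could be sketched like this (a hypothetical illustration, not the actual voice_activity_controller.py; the class name and thresholds are assumptions):

import numpy as np
import torch

SAMPLING_RATE = 16000
SILENCE_LIMIT_S = 0.5   # flush after this much trailing silence

class SilenceFlusher:
    """Accumulates audio and flushes it once the VAD reports enough silence."""

    def __init__(self, vad_model):
        self.vad = vad_model      # silero VAD model: returns a speech probability per chunk
        self.buffer = []
        self.silence_s = 0.0

    def feed(self, chunk: np.ndarray):
        """chunk: fixed-size float32 mono audio (e.g. 512 samples at 16 kHz).
        Returns the buffered audio when it is time to flush, else None."""
        prob = self.vad(torch.from_numpy(chunk), SAMPLING_RATE).item()
        self.buffer.append(chunk)
        if prob < 0.5:            # low speech probability -> count as silence
            self.silence_s += len(chunk) / SAMPLING_RATE
        else:
            self.silence_s = 0.0
        if self.silence_s >= SILENCE_LIMIT_S:
            audio = np.concatenate(self.buffer)
            self.buffer, self.silence_s = [], 0.0
            return audio          # caller transcribes this promptly
        return None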

I've also created a simple example that transcribes when the user stops talking to compare results: https://github.com/rodrigoGA/whisper_streaming/blob/main/mic_test_whisper_simple.py

Another point I think you should consider is the tokens you are using. In languages like Spanish, questions are enclosed in question marks at the beginning and end, and can have other punctuation marks in the middle, for example: "¿Cuál es la capital de Francia, y por qué es conocida por su arquitectura?" However, in some situations your approach has transcribed it as: "cual es la capital de Francia, ¿por qué es conocida por su arquitectura?" It might be a problem with Whisper, but I think it is caused by the tokens you have applied.

@Gldkslfmsd (Collaborator)

Wow, thank you, @rodrigoGA! This is very interesting feedback. I want to review and test your approach and possibly merge the useful parts later, when I have time.
Thanks!

@rodrigoGA (Contributor, Author)

If the suggestion is integrated, I would also suggest changing the way the transcription is returned. All streaming systems indicate in some way whether a result is partial or final. That way, what is in the buffer could be returned as partial, and the user would get more realistic feedback of what is being said, with the understanding that a partial result can still change.

@Gldkslfmsd (Collaborator)

Yes, an option for |||-separated partial output is possible. But in any case, I don't want a more complicated output protocol. Plaintext is enough.
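For illustration only (this format is hypothetical, not an implemented protocol), such plaintext output might look like:

this is the committed transcript so far ||| a partial hypothesis that may still change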

@rodrigoGA (Contributor, Author) commented Dec 4, 2023

I understand the idea of keeping it simple. However, this is the standard in streaming ASR. You can check how NVIDIA uses 'is_final' for all streaming models supported by the Riva platform (https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/protos/protos.html#_CPPv428SpeechRecognitionAlternative), or the companies that sell the model as a service through streaming APIs (https://www.assemblyai.com/docs/guides/real-time-streaming-transcription).
All of them use the same concept. As a consumer of these services, I can tell you that this is very useful for knowing when the user is speaking and for getting feedback on what is happening, even though the transcription has not finished. Imagine you want to use ASR in a real-world use case, for example transcribing a phone call. You would need to know when the user stops speaking and the transcription is final in order to do something with the text. Otherwise, you would have to wait until the call ends to consider the transcription complete, which would lose the real-time aspect.
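The partial/final distinction boils down to a result record like the following sketch (field and type names are hypothetical, loosely modeled on the Riva/AssemblyAI responses cited above):

from dataclasses import dataclass

@dataclass
class TranscriptResult:
    text: str
    is_final: bool   # False: hypothesis may still change; True: segment is committed

def consume(results):
    """Illustrative consumer loop over a stream of TranscriptResult objects."""
    for r in results:
        if r.is_final:
            print("FINAL:  ", r.text)   # safe to act on (e.g. hand to an LLM)
        else:
            print("partial:", r.text)   # show as live feedback only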

Gldkslfmsd added a commit that referenced this issue on Jan 3, 2024:
"it works. Reproducing #39"

Gldkslfmsd changed the title from "Feedback" to "Voice Activity Controller" on Feb 6, 2024
@Gldkslfmsd (Collaborator)

@rodrigoGA, thank you very much again. I integrated your VAC in https://github.com/ufal/whisper_streaming/tree/vad-streaming. It seems to work well, but the code needs to be reviewed and made clearer and simpler. Then I can merge it.

@SaddamBInSyed

@Gldkslfmsd @rodrigoGA

I tried to run the mic_test_whisper_streaming.py code (on a GPU-enabled machine), and I am getting the exception below.

Can you advise what I am doing wrong?

/home//miniconda3/envs/test_rag/bin/python /home//Downloads/whisper_streaming/mic_test_whisper_streaming.py
ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm_route.c:877:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib confmisc.c:160:(snd_config_get_card) Invalid field card
ALSA lib pcm_usb_stream.c:482:(_snd_pcm_usb_stream_open) Invalid card 'card'
ALSA lib confmisc.c:160:(snd_config_get_card) Invalid field card
ALSA lib pcm_usb_stream.c:482:(_snd_pcm_usb_stream_open) Invalid card 'card'
Using cache found in /home/.cache/torch/hub/snakers4_silero-vad_master
Traceback (most recent call last):
File "/home//Downloads/whisper_streaming/voice_activity_controller.py", line 54, in apply_vad
x = torch.Tensor(x)
TypeError: new(): invalid data type 'bytes'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/Downloads/whisper_streaming/mic_test_whisper_streaming.py", line 28, in
for iter in vad.detect_user_speech(microphone_stream): # processing loop:
File "/home/Downloads/whisper_streaming/voice_activity_controller.py", line 110, in detect_user_speech
yield self.detect_speech_iter(data, audio_in_int16)
File "/home/Downloads/whisper_streaming/voice_activity_controller.py", line 83, in detect_speech_iter
voice_audio, speech_in_wav, last_silent_in_wav = self.apply_vad(wav)
File "/home/Downloads/whisper_streaming/voice_activity_controller.py", line 56, in apply_vad
raise TypeError("Audio cannot be casted to tensor. Cast it manually")
TypeError: Audio cannot be casted to tensor. Cast it manually
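A plausible cause, judging from the traceback (an assumption, not a confirmed diagnosis): the microphone callback delivers raw PCM bytes, while the VAD expects a float tensor. Converting int16 PCM bytes to a normalized float32 tensor before calling the VAD would avoid the cast error:

import numpy as np
import torch

def bytes_to_tensor(raw: bytes) -> torch.Tensor:
    """Convert raw 16-bit PCM bytes to a float32 tensor scaled to [-1.0, 1.0]."""
    audio_int16 = np.frombuffer(raw, dtype=np.int16)
    audio_float32 = audio_int16.astype(np.float32) / 32768.0
    return torch.from_numpy(audio_float32)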

@Gldkslfmsd (Collaborator)

Hi, this branch is a work in progress. I'm aware that mic_test_whisper_streaming.py may not be working. I plan to remove it from this repo because I can't maintain other people's mics.

I recommend using another way of inputting audio, such as through a file, through stdin, or through the server and client.

@SaddamBInSyed commented Jul 31, 2024

@Gldkslfmsd

Well noted, thanks for your reply.

While testing the app, whenever I speak a sentence such as "Can you tell me about the highest paid sports player?", I receive the output as split sentences, like {'Can you tell me'} and {'about the highest paid sports player'}. Since I am passing this to an LLM, I need the full sentence before any silence occurs. Could you guide me on how to handle this scenario?

@Gldkslfmsd (Collaborator)

Thanks again for your suggestion, @rodrigoGA. I cleaned up the code and merged it now.
