Voice Activity Controller #39
Comments
Wow, thank you, @rodrigoGA! This is very interesting feedback. I want to review and test your approach and possibly merge the useful parts later, when I have time.
Should the suggestion be integrated, I would also suggest changing the way the transcription is returned. All streaming systems indicate in some way whether a result is a partial or a final transcription. That way, what is in the buffer could be returned as partial, and the user would get more realistic feedback of what is being said, with the understanding that a partial result can still change.
Yes, an option for |||-separated partial output is possible. But I don't want a more complicated output protocol. Plain text is enough.
I understand the idea of keeping it simple. However, this is the standard in streaming ASR. You can check how NVIDIA uses 'is_final' for all streaming models supported by the Riva platform (https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/protos/protos.html#_CPPv428SpeechRecognitionAlternative), or how companies that sell the model as a service expose it in their streaming APIs (https://www.assemblyai.com/docs/guides/real-time-streaming-transcription).
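For illustration only, here is a minimal sketch of what an 'is_final'-style record could look like layered on top of a plain-text line protocol. The field names and serialization below are hypothetical and not part of whisper_streaming; they just mirror the partial/final distinction used by streaming ASR APIs such as NVIDIA Riva.

```python
# Hypothetical sketch of a partial/final output record (not part of whisper_streaming).
from dataclasses import dataclass

@dataclass
class TranscriptUpdate:
    beg_ms: int        # start of the segment in milliseconds
    end_ms: int        # end of the segment in milliseconds
    text: str          # transcribed text
    is_final: bool     # False while the hypothesis may still change

def format_update(update: TranscriptUpdate) -> str:
    # A plain-text serialization close to the current "beg end text" lines,
    # with one extra field marking partial vs. final output.
    tag = "final" if update.is_final else "partial"
    return f"{update.beg_ms} {update.end_ms} {tag} {update.text}"

if __name__ == "__main__":
    print(format_update(TranscriptUpdate(0, 1200, "can you tell me", is_final=False)))
    print(format_update(TranscriptUpdate(0, 3400, "Can you tell me about the highest paid sports player?", is_final=True)))
```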
@rodrigoGA, thank you very much again. I integrated your VAC in https://github.com/ufal/whisper_streaming/tree/vad-streaming. It seems to work well, but the code needs to be reviewed and made clearer and simpler. Then I can merge it.
I tried to run the mic_test_whisper_streaming.py code (on a GPU-enabled machine), and I am getting the exception below. Can you advise what I am doing wrong?
/home//miniconda3/envs/test_rag/bin/python /home//Downloads/whisper_streaming/mic_test_whisper_streaming.py
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
Hi, this branch is a work in progress. I'm aware that mic_test_whisper_streaming.py may not be working. I plan to remove it from this repo because I can't maintain other people's mics. I recommend using another way of inputting audio, such as through a file, through stdin, or through the server and client.
Well noted, thanks for your reply. While testing the app, whenever I speak a sentence such as 'Can you tell me about the highest paid sports player?', I receive the output as split sentences, like {'Can you tell me'} and {'about the highest paid sports player'}. Since I am passing this to an LLM, I need the full sentence before any silence occurs. Could you guide me on how to handle this scenario?
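For illustration, one possible consumer-side workaround is to aggregate the committed chunks and only hand a sentence to the LLM once it looks complete or the stream has gone quiet. The class, punctuation set, and threshold below are hypothetical and not part of this repo; they assume the committed chunks arrive in order.

```python
# Illustrative consumer-side buffering: collect committed transcript chunks and
# release a sentence only when it ends in punctuation or the stream has been quiet.
import time

SENTENCE_END = (".", "?", "!", "…")

class SentenceAggregator:
    def __init__(self, silence_flush_s: float = 1.0):
        self.parts = []
        self.last_update = time.monotonic()
        self.silence_flush_s = silence_flush_s

    def add_chunk(self, text: str):
        """Append a committed transcript chunk, e.g. 'Can you tell me'."""
        self.parts.append(text.strip())
        self.last_update = time.monotonic()

    def pop_sentence(self):
        """Return the accumulated sentence if it ends with sentence-final
        punctuation or nothing new has arrived for silence_flush_s seconds;
        otherwise return None and keep buffering."""
        if not self.parts:
            return None
        joined = " ".join(self.parts)
        quiet = time.monotonic() - self.last_update > self.silence_flush_s
        if joined.endswith(SENTENCE_END) or quiet:
            self.parts.clear()
            return joined
        return None

# Usage: add_chunk("Can you tell me"); add_chunk("about the highest paid sports player?");
# pop_sentence() then returns the full question in one piece.
```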
Thanks again for your suggestion, @rodrigoGA. I cleaned the code and merged it now.
Hello, I have found your project interesting, good job.
I believe there is an incorrect use of VAD. The get_speech_timestamps function used by faster-whisper is a copy of the function from Silero, which is intended for complete audio files. However, when working with streaming, audio fragments are received one at a time. Silero already includes a utility for this case: https://github.com/snakers4/silero-vad/blob/5e7ee10ee065ab2b98751dd82b28e3c6360e19aa/utils_vad.py#L428
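For reference, a minimal sketch of how that streaming utility (VADIterator) is used on fixed-size chunks, following the silero-vad examples. Here 'example.wav' only stands in for a live audio stream, and the chunk size should be checked against what the model version in use expects.

```python
# Minimal sketch of Silero's streaming VADIterator, as opposed to
# get_speech_timestamps, which expects the whole audio at once.
import torch

model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

vad_iterator = VADIterator(model, sampling_rate=16000)

wav = read_audio('example.wav', sampling_rate=16000)  # stand-in for a live stream
window = 512  # samples per chunk at 16 kHz (check the size your model version expects)

for i in range(0, len(wav), window):
    chunk = wav[i:i + window]
    if len(chunk) < window:
        break
    event = vad_iterator(chunk, return_seconds=True)
    if event is not None:
        print(event)  # {'start': ...} or {'end': ...} in seconds

vad_iterator.reset_states()
```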
I have forked your project to test this: https://github.com/rodrigoGA/whisper_streaming/tree/main
Changing the way VAD is used seemed to improve the results.
One of the main drawbacks I found is the delay in obtaining the transcription, which gives an unpleasant feeling, especially when the speaker stops, since no transcription is received for a few seconds. Therefore, I created a class based on VAD that flushes the buffer once it detects that the user has not spoken for 0.5 seconds: https://github.com/rodrigoGA/whisper_streaming/blob/main/voice_activity_controller.py
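For illustration, a rough sketch of the flush-on-silence idea described above; this is not the actual voice_activity_controller.py, and the chunk duration and threshold are assumptions.

```python
# Rough sketch of flush-on-silence: feed one VAD decision per audio chunk and
# signal a buffer flush once ~0.5 s of continuous non-speech has accumulated.
class SilenceFlusher:
    def __init__(self, max_silence_s: float = 0.5, chunk_s: float = 0.032):
        self.max_silence_s = max_silence_s
        self.chunk_s = chunk_s          # assumed duration of one audio chunk
        self.silence_s = 0.0

    def update(self, is_speech: bool) -> bool:
        """Return True when the accumulated silence exceeds the threshold
        and the ASR buffer should be finalized and flushed."""
        if is_speech:
            self.silence_s = 0.0
            return False
        self.silence_s += self.chunk_s
        if self.silence_s >= self.max_silence_s:
            self.silence_s = 0.0
            return True
        return False
```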
You can find an example that transcribes from the microphone in this file: https://github.com/rodrigoGA/whisper_streaming/blob/main/mic_test_whisper_streaming.py
It greatly improves the feeling of real-time transcription; perhaps a similar idea can be applied here. I say "feeling" because I haven't done any serious performance testing.
I've also created a simple example that transcribes when the user stops talking to compare results: https://github.com/rodrigoGA/whisper_streaming/blob/main/mic_test_whisper_simple.py
Another point I think you should consider is the tokens you are using. In languages like Spanish, questions are enclosed in question marks at the beginning and end and can have other punctuation marks in the middle, for example: "¿Cuál es la capital de Francia, y por qué es conocida por su arquitectura?" However, in some situations your approach has transcribed it as: "cual es la capital de Francia, ¿por qué es conocida por su arquitectura?" It might be a problem with Whisper, but I think it is due to the tokens you have applied.
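For illustration only, a hypothetical splitter that treats the inverted marks "¿" and "¡" as sentence-initial rather than sentence-final, to show the kind of token handling being suggested; it is not code from either repository.

```python
# Illustrative only: end a sentence on ., ?, ! or …, never on the opening ¿ / ¡,
# so a Spanish question stays in one piece.
import re

def split_sentences_es(text: str):
    return [s.strip() for s in re.split(r'(?<=[.?!…])\s+', text) if s.strip()]

print(split_sentences_es(
    "¿Cuál es la capital de Francia, y por qué es conocida por su arquitectura? Es París."
))
# ['¿Cuál es la capital de Francia, y por qué es conocida por su arquitectura?', 'Es París.']
```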