
questions about features supported by Resemblyzer #44

Open
shubhamshukla7 opened this issue Nov 21, 2020 · 1 comment

shubhamshukla7 commented Nov 21, 2020

Hi there - I am new to Speaker Diarization and was exploring the repo as I have a few questions. I looked at the diarization demo here: demo02_diarization.py

Use live audio stream instead of static audio files:
I see that the demo uses a static mp3 file, although in my use case I will be working with a real-time audio stream. Does Resemblyzer support streaming input for speaker diarization? If so, is there a resource or sample code I could look into for reference?

Number of speakers unknown at the beginning of the audio stream:
Unlike the demo code, where the total number of speakers is decided in advance, in my use case I will be streaming audio from a live meeting, so the total number of speakers may not be known ahead of time (we know how many people were invited, but not all of them will necessarily join). In that case, how can I get Resemblyzer to not only detect when a particular speaker is talking, but also detect that a new speaker is talking who has not spoken before? Does Resemblyzer support that? Where can I find a reference for it?

Pre-trained English model for diarization:
I want to work with an existing model and am fine using a pre-trained diarization model, as long as it can detect a new speaker in real time. Where can I find pre-trained diarization models that work out of the box, so I can see how well they perform?

Thank you for your time and have a good one.

@CorentinJ
Contributor

Use live audio stream instead of static audio files

Yes, it's possible, and I had plans to implement it but no time to allocate to it (it's not trivial). It might be something I do in the future, but no promises.
See the offline (i.e. non-streaming) code here. To make it online (i.e. streaming), you'll need the following:

  • A threading or async mechanism to continuously record audio while deriving embeddings. From an implementation point of view, this is the hardest bit. You'll want to keep recording audio without interruption and, at the same time, periodically generate partial embeddings. Your implementation will depend on whatever library you use for recording audio and on the mechanism it provides to tell you the length of the audio recorded so far. I can recommend sounddevice, although I am unsure of the specifics for this problem yet.
  • A rewrite of compute_partial_slices that yields partial segments in an online fashion. The good news is that your function should be simpler than mine, because you don't have to care about the coverage of the last partial. You probably won't need to return the wav partials either, unless you need them for debugging or visualizations.
  • Finally, you can use the remainder of embed_utterance to compute your embeddings. Your batch size should be proportional to the maximum latency you can afford: if your application can tolerate up to a 1 s delay, your batch size can be the number of partials in one second. The lowest-latency solution has a batch size of 1, but it will be more compute-intensive. You should also determine whether you want to run inference on CPU or GPU. The voice encoder is a fairly light model, so CPU inference may even be faster than GPU inference for small batch sizes (due to the time it takes to move the data), but doing inference on the CPU may be problematic if you're recording audio at the same time.
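To make the second point concrete, here is a minimal sketch of what an online replacement for compute_partial_slices might look like. The class name, window length, and hop size are my own assumptions (roughly matching Resemblyzer's ~1.6 s partials at 16 kHz), not part of the library:

```python
class OnlinePartialSlicer:
    """Hypothetical online counterpart to compute_partial_slices: yields
    (start, end) sample slices for each partial that has become complete
    as the recorded buffer grows. Unlike the offline version, it never
    pads or back-covers the last partial."""

    def __init__(self, partial_samples=25600, hop_samples=8000):
        # Assumed defaults: ~1.6 s partials with a ~0.5 s hop at 16 kHz
        self.partial_samples = partial_samples
        self.hop_samples = hop_samples
        self.next_start = 0

    def update(self, buffer_len):
        """Call with the total number of samples recorded so far; returns
        the slices that became complete since the last call."""
        slices = []
        while self.next_start + self.partial_samples <= buffer_len:
            slices.append((self.next_start,
                           self.next_start + self.partial_samples))
            self.next_start += self.hop_samples
        return slices
```

A recording callback (e.g. from sounddevice) would append samples to a buffer and call update() with the new buffer length; each returned slice is a wav partial ready to be batched and embedded.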

Number of speakers unknown at the beginning of the audio stream

See #10, a very similar thread.
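For readers who don't want to dig through that thread: one common approach is to keep a centroid per known speaker and compare each new utterance embedding against them, opening a new speaker whenever no similarity clears a threshold. The sketch below illustrates that idea; it is not code from Resemblyzer, and the function name and threshold value are assumptions you would tune on your own data. Resemblyzer's embeddings are L2-normalized, so a dot product gives cosine similarity:

```python
import numpy as np

NEW_SPEAKER_THRESHOLD = 0.75  # assumed value; tune on your own recordings

def assign_speaker(embed, centroids, threshold=NEW_SPEAKER_THRESHOLD):
    """Match a unit-norm utterance embedding against known speaker
    centroids. Returns (speaker_index, centroids); a new centroid is
    appended when no similarity clears the threshold."""
    embed = embed / np.linalg.norm(embed)
    if centroids:
        # Cosine similarity reduces to a dot product for unit vectors
        sims = [float(np.dot(embed, c)) for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            return best, centroids
    # No existing speaker is close enough: register a new one
    centroids.append(embed)
    return len(centroids) - 1, centroids
```

In a streaming loop you would call this once per partial embedding; a more robust version would also update each centroid with a running average of the embeddings assigned to it.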

Pre-trained english model for diarization.

Unfortunately, diarization in general is not in my scope; I don't know what the SOTA is.
