
Compute embeddings from stream & unsupervised diarization #10

Closed
shashankpr opened this issue Sep 23, 2019 · 12 comments

@shashankpr

Hi, great work and great repo really. Your code and examples helped me understand the flow very easily.
I am currently working on a speaker identification task where I want to detect "who spoke when" with low latency. There are two problems I need to solve, and I was wondering whether you have already worked on them or plan to in the future. If not, I would be glad to contribute to your repo as a PR. The tasks are as follows:

  1. How can I use the partial embeddings to identify speaker changes if I do not have pre-defined speaker embeddings (unlike the speaker diarization example that you gave)?
  2. Can the embeddings be computed from a streaming input, e.g. by reading wav bytes directly from a microphone and computing them on the fly?

I know that these can be done with a few tweaks, but I would like your insight if you have already worked on them or have ideas about them.
Thanks!

@CorentinJ
Contributor

I've investigated these areas but haven't implemented anything for them yet, though I am considering it.

  1. You would have to cluster the partial embeddings of the audio (generated at a moderately high rate; I'd use 4 partials per second). There must be papers out there on how to do this, but you could try some intuitive approaches too. For example, you could use a clustering algorithm that creates n + 1 clusters (where n is the known number of speakers) and hope that it assigns embeddings to the right clusters, keeping the extra cluster as a bin. You might be able to separate embeddings of clear speech from a single person from those computed over noise, silence, or overlapping speakers.

You might also be able to work with similarity. E.g. if you add these lines in demo 2 after having computed the continuous embedding:

import matplotlib.pyplot as plt
plt.imshow(cont_embeds @ cont_embeds.T)
plt.show()

You will get this:
[image: heatmap of the similarity matrix cont_embeds @ cont_embeds.T]

Clearly you can detect some speakers there by looking for patterns of high similarity:
[images: the same heatmap with the blocks of high similarity marked]

  2. This is definitely achievable. The sounddevice module can record audio and stream it in real time to numpy arrays, so you can work with that. You can then decompose the embed_utterance function to achieve your goal. Define a maximum duration for your audio (it can be an order of magnitude higher than necessary, that's not a problem) and compute the wav slices based on that length: https://github.com/resemble-ai/Resemblyzer/blob/master/resemblyzer/voice_encoder.py#L141. From the wav slices, you will know when you can grab a partial wav from the numpy array being streamed to. For each partial wav, compute a single spectrogram and forward it (with a batch size of 1), and you will have a partial embedding. Keep doing this while the audio is being recorded.
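A minimal sketch of the clustering idea in point 1, assuming the partial embeddings are already available as an (N, 256) array (synthetic data stands in here; with Resemblyzer they would be the cont_embeds from demo 2). Agglomerative clustering stands in for whatever algorithm you pick, and n + 1 clusters leaves one bin cluster for noise and overlap:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Stand-in for cont_embeds: 60 partial embeddings of dim 256, drawn
# around two speaker centroids plus some scattered noise frames.
centroids = rng.normal(size=(2, 256))
partials = np.concatenate([
    centroids[0] + 0.05 * rng.normal(size=(25, 256)),
    centroids[1] + 0.05 * rng.normal(size=(25, 256)),
    rng.normal(size=(10, 256)),   # noise / silence frames
])
# Resemblyzer embeddings are L2-normalized; do the same here so that
# euclidean (ward) clustering behaves much like cosine clustering.
partials /= np.linalg.norm(partials, axis=1, keepdims=True)

n_speakers = 2
labels = AgglomerativeClustering(n_clusters=n_speakers + 1).fit_predict(partials)
# labels[i] is the cluster id of partial embedding i; ideally one cluster
# collects the noise frames and the others the two speakers.
```

On real audio the speaker clusters are far less separable than this toy data, so a bin cluster and some filtering of low-confidence partials will matter more.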

@CorentinJ
Contributor

This is a demo I meant to make too, but it's certainly more work than the other 5. Hope we'll get there.
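The streaming loop described above (precompute the wav slices for a generous maximum duration, then embed each partial as soon as its slice is fully buffered) can be sketched with numpy stand-ins. In real code, sounddevice's InputStream callback would supply the chunks and Resemblyzer's VoiceEncoder the spectrogram and forward pass; the slice arithmetic below is a simplified version of compute_partial_slices:

```python
import numpy as np

SAMPLING_RATE = 16000
PARTIAL_SAMPLES = 160 * 160   # 160 mel frames x 10 ms hop = 1.6 s per partial

def partial_wav_slices(max_duration_s, rate=4.0):
    """Wav slices for partial embeddings, `rate` partials per second."""
    hop = int(SAMPLING_RATE / rate)
    last_start = int(max_duration_s * SAMPLING_RATE) - PARTIAL_SAMPLES
    return [slice(s, s + PARTIAL_SAMPLES) for s in range(0, last_start + 1, hop)]

# Preallocate a buffer for the maximum duration (generously large is fine).
max_duration_s = 10.0
buffer = np.zeros(int(max_duration_s * SAMPLING_RATE), dtype=np.float32)
slices = partial_wav_slices(max_duration_s)

written = 0     # samples received so far
partials = []   # partial wavs ready to be embedded

def on_audio(chunk):
    """Called per chunk; sounddevice's InputStream callback would land here."""
    global written
    buffer[written:written + len(chunk)] = chunk
    written += len(chunk)
    # Embed every partial whose slice is now fully buffered.
    while len(partials) < len(slices) and slices[len(partials)].stop <= written:
        partial_wav = buffer[slices[len(partials)]]
        # Real code: mel spectrogram of partial_wav, then a forward pass
        # through VoiceEncoder with batch size 1 -> one partial embedding.
        partials.append(partial_wav)

# Simulate 3 s of microphone audio arriving in 0.5 s chunks.
rng = np.random.default_rng(0)
for _ in range(6):
    on_audio(rng.standard_normal(SAMPLING_RATE // 2).astype(np.float32))
```

With a 4 Hz rate, each 0.25 s of new audio completes at most one new slice, so the per-chunk work stays small and latency stays close to the 1.6 s partial window.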

@shashankpr
Author

Thanks for your detailed explanations.

  1. I agree with you. I have been reading about the spectral clustering method, which has been used in a couple of papers for similar diarization tasks. I will follow your suggestion and try it out.
  2. When you mention a batch size of 1, does that mean the partial embedding output will have a shape of (number_of_partials, embedding_size)?

@CorentinJ
Contributor

I mean that at this point in the function: https://github.com/resemble-ai/Resemblyzer/blob/master/resemblyzer/voice_encoder.py#L151, the variable mels has shape (N, 160, 40), where N is the batch size. You will probably end up with a mel of shape (160, 40), so you will have to add an extra dimension (e.g. by doing mels[None, ...]) before forwarding the mel.
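The shape fix above, illustrated with numpy (inside voice_encoder.py, mels is a torch tensor, but the indexing is identical):

```python
import numpy as np

mel = np.zeros((160, 40))   # one partial spectrogram: (mel frames, mel channels)
mels = mel[None, ...]       # prepend the batch axis -> shape (1, 160, 40)
```

The forward pass then returns a (1, embedding_size) array, so one partial embedding per call; stacking these over time gives the (number_of_partials, embedding_size) shape asked about above.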

@shashankpr
Author

Got it! Thank you very much for clearing these doubts up. I will close this and update here when I make significant progress with unsupervised and streaming diarization.
Great work once again!

@CorentinJ
Contributor

Sure, it's fine if you leave it open until we figure it out.

@nikitalpopov

Hi, @shashankpr
Any progress on this task?

@lonniehartley

lonniehartley commented Apr 8, 2020 via email

@shashankpr
Author

Hi @nikitalpopov ,
I have been doing some experiments around this but haven't really had the time to implement something solid. I am going to start working on it this week and will update you if I make any progress.

@nikitalpopov

@shashankpr
Could I help you with something?

@nikitalpopov

nikitalpopov commented May 4, 2020

@CorentinJ @shashankpr
I tried to implement it myself, but the results are horrible (DER never gets below 60%). Could you please check my test notebook? https://github.com/nikitalpopov/master/blob/dev/demo.ipynb
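For reference, the DER mentioned here can be approximated at the frame level as the fraction of frames whose hypothesis label, under the best one-to-one speaker mapping, disagrees with the reference. This is a simplified sketch (no forgiveness collar, no overlap handling, hypothesis labels assumed to be relabelings of the reference set), not the full metric used in diarization benchmarks:

```python
import numpy as np
from itertools import permutations

def frame_der(ref, hyp):
    """Frame-level diarization error rate: share of frames mislabeled under
    the best one-to-one mapping of hypothesis labels onto reference labels."""
    ref = np.asarray(ref)
    hyp = np.asarray(hyp)
    labels = sorted(set(hyp.tolist()))
    best = len(ref)
    # Brute-force the label mapping; fine for a handful of speakers.
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        mapped = np.array([mapping[h] for h in hyp.tolist()])
        best = min(best, int(np.sum(mapped != ref)))
    return best / len(ref)
```

Since the mapping is optimized, consistently swapped speaker labels cost nothing; a 60% DER therefore means genuine confusion between speakers, not just a label permutation.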

@RubenPants

RubenPants commented Oct 12, 2021

Writing my solution here, since I've been trying to implement a way of embedding during streaming. In my use case, streaming happens by pushing bytes of audio segments:

import io
import numpy as np
import soundfile as sf
from resemblyzer import VoiceEncoder

encoder = VoiceEncoder()

def embed(chunk_bytes: bytes) -> np.ndarray:
    """Embed a chunk of raw 16 kHz mono PCM-16 audio bytes."""
    # Decode the raw bytes into a float waveform the encoder accepts.
    data, _ = sf.read(
        io.BytesIO(chunk_bytes),
        samplerate=16000,
        channels=1,
        format='RAW',
        subtype='PCM_16',
        endian='FILE',
    )
    return encoder.embed_utterance(data)

An example of this code's result (after PCA) is shown below:
[image: 2-D PCA projection of the chunk embeddings]
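One way to put these chunk embeddings to work for "who spoke when": flag a speaker change whenever the cosine similarity between consecutive chunk embeddings drops below a threshold. The threshold (0.75) and the synthetic embeddings below are illustrative assumptions; the real vectors would come from embed() above:

```python
import numpy as np

rng = np.random.default_rng(2)
spk_a, spk_b = rng.normal(size=(2, 256))
# Fake per-chunk embeddings: three chunks of speaker A, then two of speaker B.
chunks = np.stack([spk_a, spk_a, spk_a, spk_b, spk_b])
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)

# Cosine similarity between each pair of consecutive chunk embeddings.
sims = np.sum(chunks[:-1] * chunks[1:], axis=1)
# Chunk indices at which a new speaker starts.
changes = np.flatnonzero(sims < 0.75) + 1
```

Short chunks give noisy embeddings, so in practice a moving average over a few chunks before thresholding tends to be more stable than comparing raw neighbors.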
