Get rid of ONNX WeSpeaker in favor of its pytorch implementation #1537

Closed
hbredin opened this issue Nov 9, 2023 · 11 comments

Comments

@hbredin (Member) commented Nov 9, 2023

Since its introduction in pyannote.audio 3.x, the ONNX dependency seems to cause lots of problems for pyannote users: #1526 #1523 #1517 #1510 #1508 #1481 #1478 #1477 #1475

WeSpeaker does provide a pytorch implementation of its pretrained ResNet models.

Let's use this!

@hbredin (Member, Author) commented Nov 10, 2023

Among the people who gave this issue a thumbs up, does anyone want to take care of it?

@wsstriving commented:

Hi, I am the initiator of WeSpeaker, thanks for your interest in our toolkit!
We will update wespeaker very soon so that it can be installed as a package and can load the pytorch model via something like "model = wespeaker.load_pytorch_model" (currently we support wespeaker.load_model, but it's ONNX). Then I will open a PR to "pipelines/speaker_verification".

@hbredin (Member, Author) commented Nov 13, 2023

Thanks @wsstriving! I worked on this a few days ago and already have a working prototype.

Instead of adding one more dependency to pyannote.audio, I was planning to copy the relevant parts of WeSpeaker into a new pyannote.audio.models.embedding.wespeaker module.

I am just stuck on the fact that WeSpeaker uses the Apache-2.0 license, while pyannote uses the MIT license. Both are permissive, but I am not quite sure where and how to mention the WeSpeaker license in the pyannote codebase. Would putting it at the top of the pyannote.audio.models.embedding.wespeaker directory be enough?

Another option I am considering is adding an embedding entrypoint to pyannote.audio, so that any external library can provide embeddings usable in pyannote as long as it follows the API. What do you think?
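
For illustration, here is a rough sketch of what such an entrypoint contract could look like; the EmbeddingProtocol name and exact signature below are hypothetical, not an existing pyannote.audio API:

# Hypothetical sketch of an "embedding entrypoint" contract, NOT an existing
# pyannote.audio API: any external library exposing an object with this shape
# could plug its embeddings into the diarization pipeline.
from typing import Optional, Protocol

import torch


class EmbeddingProtocol(Protocol):
    sample_rate: int  # expected input sample rate (e.g. 16000)
    dimension: int    # size of the returned embedding vectors

    def __call__(
        self,
        waveforms: torch.Tensor,               # (batch, channel, num_samples)
        masks: Optional[torch.Tensor] = None,  # (batch, num_samples) speech masks
    ) -> torch.Tensor:                         # (batch, dimension)
        ...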

@wsstriving commented:

Hi Bredin, I think the first option is just fine. We have implemented CLI support; you can check it here: https://github.com/wenet-e2e/wespeaker/blob/master/docs/python_package.md

It is now easy to use the wespeaker model in pytorch:

import wespeaker

model = wespeaker.load_model('english')  # load the pretrained English model
model.set_gpu(0)                         # move it to GPU 0
print(model.model)                       # the underlying torch.nn.Module

# forward precomputed fbank features through the raw module:
# model.model(feats)

Check https://github.com/wenet-e2e/wespeaker/blob/master/wespeaker/cli/speaker.py#L63 for more details on how to use it.
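
For reference, the linked docs also describe higher-level helpers on top of the raw module; a minimal sketch (method names taken from the linked CLI documentation, so treat the exact signatures as assumptions to double-check in speaker.py):

import wespeaker

model = wespeaker.load_model('english')

# extract a speaker embedding from a wav file
embedding = model.extract_embedding('speaker1.wav')
print(embedding.shape)

# cosine similarity between two utterances
score = model.compute_similarity('speaker1.wav', 'speaker2.wav')
print(score)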

@hbredin (Member, Author) commented Nov 13, 2023

Quick update:

Could any of you (who gave this a thumbs up) try the following:

@stygmate commented Nov 14, 2023

@hbredin

I ran a quick test; I haven't checked the results and I'm unsure about the pipeline definition.

What I ran:

from pyannote.audio.pipelines import SpeakerDiarization
from pyannote.audio.pipelines.utils.hook import ProgressHook
import torch

pipeline = SpeakerDiarization(segmentation="pyannote/segmentation-3.0",embedding="pyannote/wespeaker-voxceleb-resnet34-LM")


pipeline.instantiate({
    "segmentation": {
        "min_duration_off": 0.0,
    },
    "clustering": {
        "method": "centroid",
        "min_cluster_size": 12,
        "threshold": 0.7045654963945799,
    },
})

pipeline.to(torch.device("mps"))

with ProgressHook() as hook:
    diarization = pipeline("./download/test.wav", hook=hook)

I got this warning: "Model was trained with pyannote.audio 2.1.1, yours is 3.0.1. Bad things might happen unless you revert pyannote.audio to 2.x."

Seems to work on CPU.

For GPU (Mac M1 Max) I got this error: NotImplementedError: The operator 'aten::upsample_linear1d.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
After setting the env var it works, but with a mix of GPU and CPU.
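
For anyone else hitting this, a minimal sketch of opting into the fallback from Python (the variable is read when torch initializes, so it has to be set before torch is imported):

import os

# Must be set before importing torch, otherwise it is ignored.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # noqa: E402

# Ops without an MPS kernel (e.g. aten::upsample_linear1d) now fall back to
# CPU instead of raising NotImplementedError; everything else stays on MPS.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")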

@hbredin (Member, Author) commented Nov 15, 2023

Thanks @stygmate for the feedback.

To use the same setup as pyannote/speaker-diarization-3.0, one should use the following:

from pyannote.audio.pipelines import SpeakerDiarization
from pyannote.audio.pipelines.utils.hook import ProgressHook
from pyannote.audio import Audio
import torch

pipeline = SpeakerDiarization(
    segmentation="pyannote/segmentation-3.0",
    segmentation_batch_size=32,
    embedding="pyannote/wespeaker-voxceleb-resnet34-LM",
    embedding_exclude_overlap=True,
    embedding_batch_size=32)

# other values of `*_batch_size` may lead to faster processing;
# larger is not necessarily faster.

pipeline.instantiate({
    "segmentation": {
        "min_duration_off": 0.0,
    },
    "clustering": {
        "method": "centroid",
        "min_cluster_size": 12,
        "threshold": 0.7045654963945799,
    },
})

# send the pipeline to your preferred device (keep only the line you need)
device = torch.device("cpu")
device = torch.device("cuda")
device = torch.device("mps")
pipeline.to(device)

# load audio in memory (usually leads to faster processing)
io = Audio(mono='downmix', sample_rate=16000)
waveform, sample_rate = io("audio.wav")  # path to your audio file
file = {"waveform": waveform, "sample_rate": sample_rate}

# process the audio 
with ProgressHook() as hook:
    diarization = pipeline(file, hook=hook)

I'd love to get feedback from you all regarding possible algorithmic or speed regressions.
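
In case it helps structure that feedback, here is a minimal sketch of measuring both aspects; it reuses `pipeline`, `file` and `hook` from the snippet above, and `reference` is a placeholder for a pyannote.core.Annotation you load yourself (e.g. from an RTTM file):

import time

from pyannote.metrics.diarization import DiarizationErrorRate

# speed: wall-clock time of a full pipeline run
start = time.perf_counter()
diarization = pipeline(file, hook=hook)
print(f"processing took {time.perf_counter() - start:.1f}s")

# accuracy: diarization error rate against a reference annotation
metric = DiarizationErrorRate()
print(f"DER = {100 * metric(reference, diarization):.1f}%")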

@stygmate commented:

@hbredin Give me a wav file to process and I will send you the results.

@hbredin (Member, Author) commented Nov 16, 2023

Closing, as the latest version no longer relies on the ONNX runtime.
Please update to pyannote.audio 3.1 and pyannote/speaker-diarization-3.1 (and open new issues if needed).

hbredin closed this as completed Nov 16, 2023
@magicse commented Nov 16, 2023

It works OK, but I use torch 1.xx:
segmentation ---------------------------------------- 100% 0:00:09
speaker_counting ---------------------------------------- 100% 0:00:00
embeddings ---------------------------------------- 100% 0:06:39
discrete_diarization ---------------------------------------- 100% 0:00:00

And I made some changes for compatibility with torch 1.xx and torch 2.xx
in pyannote/audio/models/embedding/wespeaker/__init__.py:

if torch.__version__ >= "2.0.0":
    # Use torch.vmap for torch 2.0 or newer
    from torch import vmap
else:
    # Use functorch.vmap for torch 1.12 or older
    from functorch import vmap

And change
features = torch.vmap(self._fbank)(waveforms.to(fft_device)).to(device)
to
features = vmap(self._fbank)(waveforms.to(fft_device)).to(device)

Same changes in pyannote/audio/models/blocks/pooling.py.
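
Side note: comparing torch.__version__ lexicographically happens to work for the 1.x/2.x split, but string comparison of versions is fragile in general (e.g. "1.10" sorts before "1.9" as a string). A more robust sketch using packaging (assumed to be installed, which it usually is alongside torch):

import torch
from packaging.version import Version

# Handles suffixes like "2.0.1+cu118" and pre-releases like "2.1.0a0" correctly.
if Version(torch.__version__) >= Version("2.0.0"):
    from torch import vmap
else:
    from functorch import vmap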

@hbredin (Member, Author) commented Nov 16, 2023

Thanks for the feedback (and the PR!).
However, I don't plan to support torch 1.x in the future.
