Pyannote is 10 times slower than WhisperX with GPU utilization 10%: expected behavior or misconfiguration? #1652

Open
chubin opened this issue Feb 19, 2024 · 5 comments

Comments

@chubin

chubin commented Feb 19, 2024

Tested versions

pyannote.audio==3.1.1
pyannote.core==5.0.0
pyannote.database==5.0.1
pyannote.metrics==3.2.1
pyannote.pipeline==3.0.1

System information

Ubuntu 22.04, NVIDIA RTX A6000

Issue description

I am not sure if it is a bug, so please feel free to close it if it is expected behavior.

I am trying to diarize a large recording (approximately 60 minutes), and the
diarization process takes 8.5 minutes:

real    8m40,982s
user    8m12,687s
sys     1m21,703s

Here is my code:

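import torch
from pyannote.audio import Pipeline

# hf_token: Hugging Face access token, defined elsewhere
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
)

# Move the pipeline to the GPU
pipeline.to(torch.device("cuda"))

# Pass the file path directly to the pipeline
diarization = pipeline("audio.wav")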

The GPU is used during diarization, but utilization is low (~10%),
while one CPU core stays at 100% the whole time.
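
For anyone who wants to reproduce the utilization figure, here is a minimal sketch that polls nvidia-smi and averages the readings. The helper names (parse_utilization, sample_utilization) are made up for illustration, and the sketch assumes nvidia-smi is on the PATH and reports a numeric value for the first GPU.

```python
import subprocess
import time
from statistics import mean
from typing import List

# Ask nvidia-smi for GPU utilization only, as bare numbers (one line per GPU).
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(output: str) -> List[int]:
    """Parse nvidia-smi output: one integer percentage per non-empty line."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def sample_utilization(samples: int = 10, interval_s: float = 1.0) -> float:
    """Poll nvidia-smi and return the mean utilization (%) of the first GPU."""
    readings = []
    for _ in range(samples):
        out = subprocess.run(
            QUERY, capture_output=True, text=True, check=True
        ).stdout
        readings.append(parse_utilization(out)[0])
        time.sleep(interval_s)
    return mean(readings)
```

Running sample_utilization() in a second terminal while the pipeline is working should give a number comparable to the ~10% reported here.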

When doing the diarization with whisperx, though, it takes just a minute,
and GPU utilization is at full capacity.

However, the quality of diarization is slightly worse in this case (approximately 5% of text
is attributed to wrong/non-existent speakers).

           duration   GPU-usage
pyannote   520.5s     10%
whisperx    75.0s     100%
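
As a quick sanity check on the table above, the slowdown and the real-time factor (processing time divided by audio length) can be worked out directly. The audio length is taken as exactly 60 minutes here, which is only the approximate figure given earlier.

```python
# Durations reported in the table above, in seconds.
durations_s = {"pyannote": 520.5, "whisperx": 75.0}

# Assumption: the recording is exactly 60 minutes (the post says "approximately").
audio_s = 60 * 60

# How many times longer pyannote takes than whisperx.
slowdown = durations_s["pyannote"] / durations_s["whisperx"]

# Real-time factor: processing time per second of audio (lower is faster).
rtf = {name: d / audio_s for name, d in durations_s.items()}

print(f"slowdown: {slowdown:.2f}x")
for name, value in rtf.items():
    print(f"{name} real-time factor: {value:.3f}")
```

With these numbers the slowdown comes out at about 6.9x, somewhat less than the 10x in the title.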

Pyannote diarization quality is just brilliant, but it takes an order of magnitude more time.

I suppose that I am doing something wrong, but I don't know what exactly.

Could you please point me in the right direction,
or just say that it is exactly as it should be, and the behavior is expected.

[Screenshot: GPU utilization while using pure pyannote]

[Screenshot: GPU utilization when using whisperX]

Minimal reproduction example (MRE)

(not applicable)

@hbredin
Member

hbredin commented Feb 19, 2024

Would you mind sharing a link to a Google Colab that one can just click and run to reproduce the issue?

@chubin
Author

chubin commented Feb 19, 2024

Unfortunately, I have no access to Google Colab from my Google Account (I can create a new account if needed),
but as you can see the code is trivial.

I noticed that the problem disappears when I load the audio file using Audio:

from pyannote.audio import Audio
io = Audio(mono='downmix', sample_rate=16000)
waveform, sample_rate = io("audio.mp3")

diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

instead of loading audio.wav directly. The WAV file (audio.wav) has the same sample rate (16000 Hz), though.
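
To compare the two code paths on the same file, a small timing wrapper may help. This is only a sketch: timed and diarize_preloaded are hypothetical helper names, and the pipeline and loader are passed in as plain callables so the snippet itself does not depend on pyannote. The thread does not establish why preloading is faster, so the sketch only measures the difference.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Call fn() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def diarize_preloaded(pipeline: Callable, loader: Callable, path: str) -> Any:
    """Load the audio into memory first, then pass the waveform to the pipeline."""
    waveform, sample_rate = loader(path)
    return pipeline({"waveform": waveform, "sample_rate": sample_rate})
```

With pyannote, loader would be the Audio(mono='downmix', sample_rate=16000) object and pipeline the pretrained pipeline from the snippets above. Then timed(lambda: pipeline("audio.wav")) and timed(lambda: diarize_preloaded(pipeline, io, "audio.wav")) can be compared on the same file, which also avoids mixing the .wav and .mp3 inputs.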

@hbredin
Member

hbredin commented Feb 21, 2024

The code might be "trivial" but the whole point of sharing a Google Colab is for pyannote maintainers to avoid wasting time on problems that are not reproducible.

For instance, two files with two different extensions (.wav and .mp3) are mentioned here.
It is not clear which one works and which one fails.

Preparing a Google Colab will definitely increase your chances of having someone look at your issue. It might also happen that the mere preparation of the Google Colab makes you realize that the problem is on your side (I am not saying that this is the case here but it happened in the past).

@DerEchteFeuerpfeil

+1 for this issue

Thanks for the note, @chubin. I have used your solution with

io = Audio(mono='downmix', sample_rate=16000)
waveform, sample_rate = io("audio.mp3")

diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

and got much faster inference 👍

@ahmetkipkip

Unfortunately, I have no access to Google Colab from my Google Account (I can create a new account if needed), but as you can see the code is trivial.

I noticed that the problem disappears, when I load the audio file using Audio:

from pyannote.audio import Audio
io = Audio(mono='downmix', sample_rate=16000)
waveform, sample_rate = io("audio.mp3")

diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

instead of loading audio.wav directly. The wav file (audio.wav) has the same sample rate (16000) though.

Wow, after updating from 2.x to 3.x I had performance issues. Now it's better than with the old code. I really don't get what caused that, though.

Thanks
