DiCoW (Diarization-Conditioned Whisper) enhances OpenAI’s Whisper ASR model by integrating speaker diarization for multi-speaker transcription. The app leverages BUT-FIT/diarizen-wavlm-large-s80-mlc
to segment speakers and provides diarization-conditioned transcription for long-form audio inputs.
The training and inference source code can be found here: TS-ASR-Whisper
Note: For the original v1 model, see the v1 branch.
- Multi-Speaker ASR: Handles multi-speaker audio using diarization-aware transcription.
- Flexible Input Sources:
  - Microphone: Record and transcribe live audio.
  - Audio File Upload: Upload pre-recorded audio files for transcription.
  - Folder Batch Processing: Process multiple .wav files from a directory via the command line.
- Diarization Support: Powered by BUT-FIT/diarizen-wavlm-large-s80-mlc for accurate speaker segmentation.
- Built with 🤗 Transformers: Uses the latest Whisper checkpoints for robust transcription.
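Conceptually, the diarization conditioning labels every audio frame, per target speaker, with one of the four classes described in the DiCoW paper: silence, target, non-target, and overlap. The pure-Python sketch below is illustrative only; the segment format and the helper name are assumptions for this example, and the real model consumes such labels as frame-level masks rather than strings.

```python
# Illustrative sketch (not the app's actual code): derive per-frame
# conditioning labels for one target speaker from diarization output.
# Segment format (speaker, start_s, end_s) is an assumption for this example.

def frame_labels(segments, target, num_frames, frame_s=0.02):
    """Label each frame as 'silence', 'target', 'non-target', or 'overlap'."""
    labels = []
    for i in range(num_frames):
        t = i * frame_s
        active = {spk for spk, s, e in segments if s <= t < e}
        if not active:
            labels.append("silence")
        elif active == {target}:
            labels.append("target")
        elif target in active:
            labels.append("overlap")      # target speaks together with someone else
        else:
            labels.append("non-target")
    return labels

segs = [("A", 0.0, 1.0), ("B", 0.5, 2.0)]
print(frame_labels(segs, "A", 4, frame_s=0.5))
# → ['target', 'overlap', 'non-target', 'non-target']
```

Each target speaker gets its own label sequence, so a two-speaker recording is transcribed in two conditioned passes.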
Run the app directly in your browser with the hosted Gradio app.
Before running the app, ensure you have the following installed:
- Python 3.11
- FFmpeg: Required for audio processing.
- Python Libraries:
  - gradio
  - transformers
  - pyannote.audio
  - torch
  - librosa
  - soundfile
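As a quick sanity check before installing, a small hypothetical helper like the one below can verify the Python version and that FFmpeg is on the PATH (the function name is made up for this example; it is not part of the repository):

```python
import shutil
import sys

def check_prereqs():
    """Return a list of human-readable problems; an empty list means ready to install."""
    problems = []
    if sys.version_info[:2] != (3, 11):
        problems.append(
            f"Python 3.11 required, found {sys.version_info[0]}.{sys.version_info[1]}"
        )
    if shutil.which("ffmpeg") is None:
        problems.append("FFmpeg not found on PATH (required for audio processing)")
    return problems

for problem in check_prereqs():
    print("MISSING:", problem)
```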
- Clone the repository:

  ```bash
  git clone https://github.com/BUTSpeechFIT/DiCoW.git
  cd DiCoW
  ```

- Set up the dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Clone the DiariZen submodule:

  ```bash
  git submodule init
  git submodule update
  ```

- Install the DiariZen dependencies:

  ```bash
  cd DiariZen/pyannote-audio
  pip install -e .
  cd ../..
  ```
Run the application locally:
python app.py
Once the server is running, access the app in your browser at http://localhost:7860.
To process multiple .wav files at once, run:
python inference.py --input-folder /path/to/wav/files
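Internally, batch mode only needs to enumerate the WAV files in the folder; a minimal sketch of that step is below (the helper name is hypothetical, and the actual script may enumerate files differently):

```python
from pathlib import Path

def find_wavs(folder):
    """Collect .wav files (case-insensitive) from a folder in a stable, sorted order."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".wav"
    )

# Each returned path would then be fed to the diarization + transcription pipeline.
```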
If you want to run the demo in the background, it is best to register it as a system service, since some distributions kill background jobs when the user logs out, which would also kill the demo.
To register the demo as a service, first edit ./run_server.sh and ./DiCoW-background.service and set the proper paths and user. It is important to set up conda correctly in ./run_server.sh, because the service is started outside the user environment (.profile is not sourced).
Then register and start the service (run as root):
```bash
systemctl enable ./DiCoW-background.service   # register the service
systemctl start DiCoW-background.service      # start it
systemctl status DiCoW-background.service     # check that it is running
systemctl stop DiCoW-background.service       # stop it
systemctl disable DiCoW-background.service    # no longer start on reboot
```
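For reference, a unit file such as DiCoW-background.service might look roughly like the sketch below; the user name and paths are placeholders you must adapt, and the repository ships its own version of the file:

```ini
[Unit]
Description=DiCoW Gradio demo
After=network.target

[Service]
Type=simple
# Placeholder user and paths -- adapt these to your machine.
User=dicow
WorkingDirectory=/opt/DiCoW
ExecStart=/opt/DiCoW/run_server.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
```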
- Microphone: Use your device's microphone for live transcription.
- Audio File Upload: Upload pre-recorded audio files for diarization-conditioned transcription.
- Folder Batch Processing: Process multiple WAV files from the command line for automated workflows.
We welcome contributions! If you’d like to add features or improve the app, please open an issue or submit a pull request.
This project is licensed under the Apache License 2.0.
If you use our model or code, please cite:
@article{POLOK2026101841,
title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
journal = {Computer Speech & Language},
volume = {95},
pages = {101841},
year = {2026},
issn = {0885-2308},
doi = {10.1016/j.csl.2025.101841},
url = {https://www.sciencedirect.com/science/article/pii/S088523082500066X},
author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation},
}
@INPROCEEDINGS{10887683,
author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Target Speaker ASR with Whisper},
year={2025},
volume={},
number={},
pages={1-5},
keywords={Transforms;Signal processing;Transformers;Acoustics;Speech processing;target-speaker ASR;diarization conditioning;multi-speaker ASR;Whisper},
doi={10.1109/ICASSP49660.2025.10887683}
}
@misc{polok2025mlcslmchallenge,
title={BUT System for the MLC-SLM Challenge},
author={Alexander Polok and Jiangyu Han and Dominik Klement and Samuele Cornell and Jan Černocký and Lukáš Burget},
year={2025},
eprint={2506.13414},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2506.13414},
}
For more information, feel free to contact us: ipoloka@fit.vut.cz, xkleme15@vutbr.cz.