Batch Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper

Speaker Diarization Using OpenAI Whisper

Functionality

  1. batch_diarize_audio(input_audios, model_name="medium.en", stemming=False): Takes a list of input audio files, processes each one, and generates a speaker-aware transcript and an SRT file per file. It keeps speaker numbering consistent across all files in the batch and labels the most-spoken speaker as the 'instructor'.

  2. diarize_audio(input_audio, model_name="medium.en", stemming=False): Processes a single input audio file, extracting speaker-wise word mappings and sentence mappings and generating a speaker-aware transcript and SRT file for it (see the sketch after this list).

  3. Helper functions: The code also includes several helper functions for processing the audio files, extracting speaker information, and generating the output files.
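
For a single file, diarize_audio can be called directly. A minimal sketch, assuming the function is exported from the same batch_diarize module (check the repository source for the exact module path and return value):

from batch_diarize import diarize_audio  # module path assumed; see the repo source

# Processes one file and writes the speaker-aware transcript (.txt)
# and subtitles (.srt) alongside the input audio.
diarize_audio("audio1.wav", model_name="medium.en", stemming=False)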

Usage

  1. Import the necessary libraries and functions:
from batch_diarize import batch_diarize_audio
  2. Prepare a list of input audio files:
input_audios = ["audio1.wav", "audio2.wav", "audio3.wav"]
  3. Call the batch_diarize_audio function with the list of input audio files:
results = batch_diarize_audio(input_audios)
  4. The results variable will contain a list of tuples, where each tuple contains the following information for each input audio file:

    • input_audio: The input audio file name
    • wsm: Speaker-wise word mappings
    • ssm: Speaker-wise sentence mappings
    • instructor_speaker_number: The speaker number assigned to the instructor
    • instructor_embeddings: The speaker embeddings of the instructor
  5. The code will also generate output files with a speaker-aware transcript (in TXT format) and subtitles (in SRT format) for each input audio file. The output files have the same names as the input audio files, with the corresponding file extensions (.txt and .srt), as sketched below.
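
As a small illustration of the expected output naming, assuming outputs are written next to the inputs:

from pathlib import Path

for audio in ["audio1.wav", "audio2.wav", "audio3.wav"]:
    stem = Path(audio).with_suffix("")
    print(f"{stem}.txt")  # speaker-aware transcript
    print(f"{stem}.srt")  # subtitles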

Example

from batch_diarize import batch_diarize_audio

input_audios = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = batch_diarize_audio(input_audios)

# Each result is a tuple:
# (input_audio, wsm, ssm, instructor_speaker_number, instructor_embeddings)
for input_audio, wsm, ssm, instructor_speaker_number, instructor_embeddings in results:
    print(f"Input audio: {input_audio}")
    print(f"Instructor speaker number: {instructor_speaker_number}")
    print(f"Instructor embeddings: {instructor_embeddings}")
    print("Speaker-wise sentence mappings:")
    for sentence_mapping in ssm:
        print(sentence_mapping)

This example demonstrates how to use the batch_diarize_audio function to process a list of input audio files and generate speaker-aware transcripts and SRT files. It also prints the instructor speaker number, instructor embeddings, and speaker-wise sentence mappings for each input audio file.

Forked From

Speaker Diarization pipeline based on OpenAI Whisper. I'd like to thank @m-bain for the Wav2Vec2 forced alignment and @mu4farooqi for the punctuation realignment algorithm.

This work is based on OpenAI's Whisper, Nvidia NeMo, and Facebook's Demucs.

Please star the project on GitHub (see top-right corner) if you appreciate my contribution to the community!

What is it

This repository combines Whisper's ASR capabilities with Voice Activity Detection (VAD) and speaker embedding to identify the speaker of each sentence in the transcription generated by Whisper. First, the vocals are extracted from the audio to improve speaker-embedding accuracy. The transcription is then generated using Whisper, and its timestamps are corrected and aligned using WhisperX to minimize diarization error due to time shift. The audio is then passed into MarbleNet for VAD and segmentation to exclude silences, and TitaNet is used to extract speaker embeddings that identify the speaker of each segment. The result is associated with the timestamps generated by WhisperX to detect the speaker of each word, and finally realigned using punctuation models to compensate for minor time shifts.
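
For the Whisper transcription stage alone, a minimal sketch using the openai-whisper package (the vocal separation, WhisperX alignment, and NeMo diarization stages described above conceptually wrap around this call inside diarize.py):

import whisper

# Load the same model size the batch functions default to.
model = whisper.load_model("medium.en")

# Transcribe the (ideally vocals-only) audio; the diarization pipeline
# aligns and speaker-labels this output in the later stages.
result = model.transcribe("audio1.wav")
print(result["text"])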

Whisper, WhisperX, and NeMo parameters are hard-coded in diarize.py and helpers.py; CLI arguments to change them will be added later.

Usage

python diarize.py -a AUDIO_FILE_NAME

Known Limitations

  • Only tested on English, though several other languages are supported
  • Overlapping speakers are not yet handled; a possible approach would be to split the audio and isolate one speaker at a time before feeding it into the pipeline, but this would require much more computation
  • There may be some errors; please raise an issue if you encounter any

Future Improvements

  • Implement a maximum length per sentence for SRT
  • Use Whisper word-level timestamps for languages that are not in WhisperX
  • Improve performance using Faster Whisper or Batched Inference

Acknowledgements

Special thanks to @adamjonas for supporting this project
