Batch Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper

Speaker Diarization Using OpenAI Whisper

Functionality

  1. batch_diarize_audio(input_audios, model_name="medium.en", stemming=False): Takes a list of input audio files, processes each one, and generates a speaker-aware transcript and an SRT file per file. It keeps speaker numbering consistent across all files in the batch and labels the most-spoken speaker as the 'instructor'.

  2. diarize_audio(input_audio, model_name="medium.en", stemming=False): Processes a single input audio file, extracting speaker-wise word mappings and sentence mappings and generating a speaker-aware transcript and SRT file for it (see the sketch after this list).

  3. Helper functions: The code also includes several helper functions for processing the audio files, extracting speaker information, and generating the output files.
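
For a single file, diarize_audio can be called directly. A minimal sketch, assuming the function is exported from the same batch_diarize module (check the repository source for the exact module path and return value):

from batch_diarize import diarize_audio  # module path assumed; see the repo source

# Processes one file and writes the speaker-aware transcript (.txt)
# and subtitles (.srt) alongside the input audio.
diarize_audio("audio1.wav", model_name="medium.en", stemming=False)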

Usage

  1. Import the necessary libraries and functions:
from batch_diarize import batch_diarize_audio
  2. Prepare a list of input audio files:
input_audios = ["audio1.wav", "audio2.wav", "audio3.wav"]
  3. Call the batch_diarize_audio function with the list of input audio files:
results = batch_diarize_audio(input_audios)
  4. The results variable will contain a list of tuples, where each tuple contains the following information for each input audio file:

    • input_audio: The input audio file name
    • wsm: Speaker-wise word mappings
    • ssm: Speaker-wise sentence mappings
    • instructor_speaker_number: The speaker number assigned to the instructor
    • instructor_embeddings: The speaker embeddings of the instructor
  5. The code will also generate output files with a speaker-aware transcript (in TXT format) and subtitles (in SRT format) for each input audio file. The output files have the same names as the input audio files, with the corresponding file extensions (.txt and .srt), as sketched below.
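
As a small illustration of the expected output naming, assuming outputs are written next to the inputs:

from pathlib import Path

for audio in ["audio1.wav", "audio2.wav", "audio3.wav"]:
    stem = Path(audio).with_suffix("")
    print(f"{stem}.txt")  # speaker-aware transcript
    print(f"{stem}.srt")  # subtitles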

Example

from batch_diarize import batch_diarize_audio

input_audios = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = batch_diarize_audio(input_audios)

# Each result is a tuple:
# (input_audio, wsm, ssm, instructor_speaker_number, instructor_embeddings)
for input_audio, wsm, ssm, instructor_speaker_number, instructor_embeddings in results:
    print(f"Input audio: {input_audio}")
    print(f"Instructor speaker number: {instructor_speaker_number}")
    print(f"Instructor embeddings: {instructor_embeddings}")
    print("Speaker-wise sentence mappings:")
    for sentence_mapping in ssm:
        print(sentence_mapping)

This example demonstrates how to use the batch_diarize_audio function to process a list of input audio files and generate speaker-aware transcripts and SRT files. It also prints the instructor speaker number, instructor embeddings, and speaker-wise sentence mappings for each input audio file.

Forked From

Speaker Diarization pipeline based on OpenAI Whisper. I'd like to thank @m-bain for the Wav2Vec2 forced alignment and @mu4farooqi for the punctuation realignment algorithm.

This work is based on OpenAI's Whisper, Nvidia NeMo, and Facebook's Demucs.

Please star the project on GitHub (see top-right corner) if you appreciate my contribution to the community!

What is it

This repository combines Whisper's ASR capabilities with Voice Activity Detection (VAD) and speaker embedding to identify the speaker of each sentence in the transcription generated by Whisper. First, the vocals are extracted from the audio to improve speaker-embedding accuracy. The transcription is then generated using Whisper, and its timestamps are corrected and aligned using WhisperX to minimize diarization error due to time shift. The audio is then passed into MarbleNet for VAD and segmentation to exclude silences, and TitaNet is used to extract speaker embeddings that identify the speaker of each segment. The result is associated with the timestamps generated by WhisperX to detect the speaker of each word, and finally realigned using punctuation models to compensate for minor time shifts.
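
For the Whisper transcription stage alone, a minimal sketch using the openai-whisper package (the vocal separation, WhisperX alignment, and NeMo diarization stages described above conceptually wrap around this call inside diarize.py):

import whisper

# Load the same model size the batch functions default to.
model = whisper.load_model("medium.en")

# Transcribe the (ideally vocals-only) audio; the diarization pipeline
# aligns and speaker-labels this output in the later stages.
result = model.transcribe("audio1.wav")
print(result["text"])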

Whisper, WhisperX, and NeMo parameters are hard-coded in diarize.py and helpers.py; CLI arguments to change them will be added later.

Usage

python diarize.py -a AUDIO_FILE_NAME

Known Limitations

  • Only tested on English, though several other languages are supported
  • Overlapping speakers are not yet handled; a possible approach would be to split the audio and isolate one speaker at a time before feeding it into the pipeline, but this would require much more computation
  • There may be some errors; please raise an issue if you encounter any

Future Improvements

  • Implement a maximum length per sentence for SRT
  • Use Whisper word-level timestamps for languages that are not in WhisperX
  • Improve performance using Faster Whisper or Batched Inference

Acknowledgements

Special thanks to @adamjonas for supporting this project
