Tool to create written transcripts for podcasts and other interview style audio.
See an example podcast transcript here.
Podcasts are becoming one of the most popular forms of long-form media due to their ease of creation and consumption. Originally built to allow for revision and note-taking, the tool's summarisation features now also make it useful for getting a quick understanding of a podcast's contents.
- Audio-to-Text transcription including speaker detection (diarization).
- Automatic topic generation.
- Topic and full-text summarisation.
- Front-end transcript display.
- Easy-to-read transcript with summaries.
- Episode audio player with word-level seeking.
Uses free, publicly available AI models that can be run on a local instance.
- Transcription: whisperx
  - Open-source transcription pipeline combining OpenAI's Whisper speech-recognition model with forced alignment to generate accurate word-level timestamps.
- Diarization: pyannote
  - Open-source model to diarize the transcript (assign speaker labels to segments).
- Topic Modelling:
  - Clustering of embeddings from a public sentence-embedding model, combined with cosine positional embeddings.
- Summarisation:
  - Use of Meta's Llama 2 LLM to generate titles and summaries for topics.
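The topic-modelling step above can be illustrated with a minimal sketch. This is not the repo's exact code: random vectors stand in for the sentence-embedding model's output, the position weighting factor is an assumption, and a tiny hand-rolled k-means stands in for whatever clustering the pipeline actually uses.

```python
import numpy as np

# Stand-in for embeddings from a public sentence-embedding model
# (random vectors here, purely for illustration).
rng = np.random.default_rng(0)
n_sentences, dim = 30, 64
sentence_embs = rng.normal(size=(n_sentences, dim))
sentence_embs /= np.linalg.norm(sentence_embs, axis=1, keepdims=True)

# Cosine positional embedding over the episode timeline, so that
# sentences close together in time tend to land in the same topic.
pos = np.arange(n_sentences) / n_sentences
pos_embs = np.stack([np.cos(np.pi * pos), np.sin(np.pi * pos)], axis=1)

# Concatenate content and position; the 2.0 weight is an assumed knob
# trading topical similarity against temporal locality.
features = np.hstack([sentence_embs, 2.0 * pos_embs])

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means (numpy only) as a stand-in for the clustering step."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each sentence to its nearest centroid.
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(0)
    return labels

topics = kmeans(features, k=3)
print(topics)  # one topic label per sentence
```

Because position is part of the feature vector, the resulting clusters tend to be contiguous runs of sentences, which is what a transcript topic should look like.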
Run the main.py script with the following parameters:
- url
- episode_name
- media_type (podcast/youtube)
- n_speakers (optional)
This will run the pipeline of processes described above, downloading audio from the specified URL and outputting the transcript as an HTML file that can be viewed in a browser.
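An invocation using the parameters above might look like the following. The flag syntax and the placeholder URL are assumptions; check `python main.py --help` for the exact interface.

```shell
# Hypothetical example: transcribe a two-speaker podcast episode.
python main.py \
    --url "https://example.com/feed/episode-1.mp3" \
    --episode_name "episode-1" \
    --media_type podcast \
    --n_speakers 2
```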
Developed first in Jupyter notebooks (see each notebook for additional info and design choices).
These notebooks are each converted into Python modules with the nbdev
library, simply by tagging the required cells with the #|export
directive.
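An exported notebook cell looks like ordinary Python with the directive as its first line; nbdev copies such cells into the generated module. The helper below is hypothetical, shown only to illustrate the tagging.

```python
#|export
def format_timestamp(seconds: float) -> str:
    """Render a float number of seconds as HH:MM:SS for transcript display.

    Hypothetical helper for illustration; any cell tagged #|export is
    written into the module nbdev generates from the notebook.
    """
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"
```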
pip install git+https://github.com/stephankostov/transcriber.git
Whisper also requires ffmpeg and Rust to be installed. See the installation instructions in the Whisper repo for details.
- Pyannote diarization model performs poorly with overlapping speech.
  - Look into other diarization models such as NVIDIA NeMo.
- Speech segments are often wrongly split mid-sentence.
  - Split speech segments on sentence boundaries rather than on words.
- Topic grouping is rather arbitrary.
  - Topics are usually introduced by the interviewer, so account for this in the topic splitting.
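The sentence-boundary split proposed above could be sketched as follows. The word-dict shape is an assumption mimicking word-level alignment output, not the repo's actual data model, and terminal punctuation is used as a crude sentence detector.

```python
import re

def split_on_sentences(words):
    """Regroup word-level timings into sentence-level speech segments.

    Illustrative sketch: a segment ends whenever a word ends with
    sentence-terminating punctuation (., !, ?).
    """
    segments, current = [], []
    for w in words:
        current.append(w)
        if re.search(r"[.!?]$", w["word"]):
            segments.append({
                "text": " ".join(x["word"] for x in current),
                "start": current[0]["start"],
                "end": current[-1]["end"],
            })
            current = []
    if current:  # keep any trailing words with no terminal punctuation
        segments.append({
            "text": " ".join(x["word"] for x in current),
            "start": current[0]["start"],
            "end": current[-1]["end"],
        })
    return segments

# Hypothetical aligned-word output for a short exchange.
words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "there.", "start": 0.4, "end": 0.8},
    {"word": "How", "start": 1.0, "end": 1.2},
    {"word": "are", "start": 1.2, "end": 1.4},
    {"word": "you?", "start": 1.4, "end": 1.7},
]
print(split_on_sentences(words))
```

A production version would want a real sentence tokenizer (abbreviations, decimals, and quotes all defeat the punctuation regex), but the regrouping logic would stay the same.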