Incredibly fun and powerful to string along a series of simple tools: parsing a podcast xml, downloading and reencoding the files with ffmpeg, transcribed with whisper-cpp, simple html written by concatenating strings in python, scp'd into a nearly-free speech-host.
Browse at cristobal.arquipelago.org/entitled-opinions
Download the Entitled Opinions podcast RSS feed and save it as opinions.xml.
curl https://entitled-opinions.com/feed/podcast > opinions.xml
1_parse_xml.py will extract the key information from the XML using the Python
built-in html.minidom and save it as opinions.json.
2_download_and_transcribe.py will read the saved json and download the audio
files for each episode. It will then convert the .mp3 files into 16 kHz
.wav to be processed with whisper-cpp.
whisper-cpp must be installed somewhere in system. The path to the binary is
hard-coded in the python file above. The fastest and lowest quality model was
used, though higher quality can be done by changing the model size and being
patient.
This generates a local file structure with the episode transcripts. The recording Unix timestamp is used as a unique identifier for each episode.
├── 1_parse_xml.py
├── 2_download_and_transcribe.py
├── 3_generate_html.py
│
├── entitled-opinions.xml
├── entitled-opinions.json
│
└── episodes
├── 1126670400
│ ├── audio.mp3
│ ├── audio.wav
│ └── transcript-tiny.vtt
├── 1126929600
│ ├── audio.mp3
3_generate_html.py then reads this file structure and creates an index.html
file for each episode, as well as an homepage (index.html) and reference
index (index2.html). For this reference index, we ignore common words present
in 10000-most-common-words.txt.
├── index.html
├── index2.html
└── episodes
├── 1126670400
│ ├── index.html
│ ├── audio.mp3
│ ├── audio.wav
│ └── transcript-tiny.vtt
├── 1126929600
│ ├── index.html
The output can be previewed locally by running a web-server, e.g.
python3 -m http.server
Alternatively, it can be hosted on a web provider, e.g. Nearly Free Speech