<a href="https://colab.research.google.com/github/riya-Sharma2802/listenwise-notes/blob/main/automated_podcastgeneraor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
#%% [markdown]
# Automated Podcast Show Notes Generator
# Colab-ready end-to-end pipeline
#
# Features:
# - Audio preprocessing (ffmpeg)
# - Speaker diarization (pyannote recommended; the fallback used here is MFCC features + agglomerative clustering)
# - Transcription (openai-whisper or WhisperX)
# - Alignment of text to speakers and timestamps
# - Topic segmentation + chapter timestamps
# - Key moment detection (salient sentences + audio energy peaks)
# - Show notes output in Markdown (summary, chapters, takeaways, guest info)
# - Enhancements: link extraction, quote highlighting, SEO meta, social posts
#
# Instructions:
# 1. Upload podcast audio files to Colab (left-side Files panel) or mount Google Drive.
# 2. Install dependencies (next cell).
# 3. Set environment variables for OPENAI_API_KEY and HF_TOKEN if available.
# 4. Run pipeline: call `process_episode(path_to_audio, out_dir=...)`.
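
Step 3 can be done in a cell that runs before the imports. The following is a minimal sketch, assuming the keys are typed in interactively; `set_secret` is a hypothetical helper for illustration and is not part of this notebook.

```python
import os
from getpass import getpass

def set_secret(name, prompt=getpass):
    """Set os.environ[name] from a prompt unless it is already set.

    Hypothetical helper, shown for illustration only.
    Returns True when the variable ends up non-empty.
    """
    if not os.environ.get(name):
        value = prompt(f"{name} (leave blank to skip): ").strip()
        if value:
            os.environ[name] = value
    return bool(os.environ.get(name))

# In Colab, uncomment to be prompted for each key:
# set_secret("OPENAI_API_KEY")
# set_secret("HF_TOKEN")
```

Both keys are optional in this pipeline, so leaving a prompt blank simply keeps the variable unset.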


In [5]:
# Install dependencies
!pip install --quiet --upgrade pip
!pip install --quiet ffmpeg-python pydub openai transformers sentence-transformers scikit-learn numpy scipy librosa matplotlib tqdm nltk spacy yake python-dotenv

# Whisper ASR
!pip install --quiet git+https://github.com/openai/whisper.git

# WhisperX for diarization + alignment
!pip install --quiet whisperx

# Download spaCy English model
!python -m spacy download en_core_web_sm






In [7]:
#%%python
import os
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")  # optional
# If you want to provide HF token later for improved models, set HF_TOKEN env var.
HF_TOKEN = os.environ.get("HF_TOKEN")
print("OPENAI_API_KEY set:", bool(OPENAI_API_KEY))
print("HF_TOKEN set:", bool(HF_TOKEN))

# Imports
import io, json, math, datetime, re
from pathlib import Path
from tqdm import tqdm
import numpy as np
import librosa
from pydub import AudioSegment
import soundfile as sf
import torch



OPENAI_API_KEY set: False
HF_TOKEN set: False




In [8]:
#%%python
def ensure_dir(p):
    Path(p).mkdir(parents=True, exist_ok=True)

def to_wav_mono_16k(in_path, out_path):
    audio = AudioSegment.from_file(in_path)
    audio = audio.set_frame_rate(16000).set_channels(1).set_sample_width(2)
    audio.export(out_path, format="wav")
    return out_path

def format_timestamp(seconds):
    # seconds -> H:MM:SS
    return str(datetime.timedelta(seconds=int(seconds)))
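
A quick check of what `format_timestamp` produces may help when reading the chapter list later. This sketch repeats the helper so it runs on its own; the sample inputs are made up for illustration.

```python
import datetime

def format_timestamp(seconds):
    # Same logic as the notebook helper: truncate to whole seconds, then H:MM:SS
    return str(datetime.timedelta(seconds=int(seconds)))

print(format_timestamp(7.8))     # 0:00:07
print(format_timestamp(3725.4))  # 1:02:05
print(format_timestamp(90000))   # 1 day, 1:00:00
```

Fractions of a second are dropped rather than rounded, and anything past 24 hours picks up a "1 day," prefix.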


In [9]:
#%%python
import whisper
import whisperx

# Load Whisper model once (choose size: tiny/base/small/medium/large)
def load_whisper(model_name="small"):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model(model_name, device=device)
    return model, device

def transcribe_and_align(wav_path, whisper_model, device):
    # 1) Transcribe with openai-whisper (fp16 is only supported on GPU)
    print("Whisper transcribing (this may take a while)...")
    res = whisper_model.transcribe(wav_path, beam_size=5, fp16=(device == "cuda"))
    segments = res.get("segments", [])
    # 2) WhisperX forced alignment to get word-level timestamps
    print("Running WhisperX alignment...")
    # load_align_model needs the detected language and returns (model, metadata)
    align_model, metadata = whisperx.load_align_model(language_code=res["language"], device=device)
    # align the Whisper segments directly instead of transcribing a second time
    aligned = whisperx.align(segments, align_model, metadata, wav_path, device)
    # aligned["word_segments"] is a list of dicts with start, end, word
    return segments, aligned.get("word_segments", [])


In [10]:
#%%python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

embed_model = SentenceTransformer("all-MiniLM-L6-v2")

def diarize_fallback(wav_path, window_sec=1.5, step_sec=1.0, n_speakers=2):
    y, sr = librosa.load(wav_path, sr=16000)
    wlen = int(window_sec * sr)
    slen = int(step_sec * sr)
    frames = []
    times = []
    for start in range(0, max(1, len(y)-wlen), slen):
        end = start + wlen
        chunk = y[start:end]
        times.append((start/sr, end/sr))
        # represent chunk by MFCC mean
        mf = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=20)
        frames.append(np.mean(mf, axis=1))
    X = np.array(frames)
    if len(X) < n_speakers:
        n_speakers = max(1, len(X))
    cl = AgglomerativeClustering(n_clusters=n_speakers).fit(X)
    labels = cl.labels_
    # merge consecutive segments with same label to produce diarization list
    segments = []
    if len(labels)==0:
        return segments
    cur_label = labels[0]
    cur_start = times[0][0]
    cur_end = times[0][1]
    for i in range(1, len(labels)):
        if labels[i] == cur_label:
            cur_end = times[i][1]
        else:
            segments.append((cur_start, cur_end, f"Speaker {cur_label+1}"))
            cur_label = labels[i]
            cur_start, cur_end = times[i]
    segments.append((cur_start, cur_end, f"Speaker {cur_label+1}"))
    return segments
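
The merge step at the end of `diarize_fallback` can be tried on toy data without any audio. This is a sketch using `itertools.groupby` in place of the explicit loop; `merge_runs` is a hypothetical name, and the window times and labels are invented for illustration.

```python
from itertools import groupby

def merge_runs(times, labels):
    """Collapse consecutive windows that share a label into (start, end, label)."""
    segments = []
    idx = 0
    for label, run in groupby(labels):
        n = len(list(run))                 # number of windows in this run
        start = times[idx][0]              # start of the first window in the run
        end = times[idx + n - 1][1]        # end of the last window in the run
        segments.append((start, end, f"Speaker {label + 1}"))
        idx += n
    return segments

# 1.5 s windows with a 1.0 s step, as in the defaults above
times = [(0.0, 1.5), (1.0, 2.5), (2.0, 3.5), (3.0, 4.5)]
labels = [0, 0, 1, 0]
print(merge_runs(times, labels))
# [(0.0, 2.5, 'Speaker 1'), (2.0, 3.5, 'Speaker 2'), (3.0, 4.5, 'Speaker 1')]
```

Because the windows overlap by `window_sec - step_sec`, neighbouring segments overlap by the same amount (0.5 s with the defaults).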



In [11]:
#%%python
from nltk.tokenize import sent_tokenize
import nltk
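nltk.download('punkt', quiet=True)
# newer NLTK releases look for 'punkt_tab' instead; requesting both is harmless
nltk.download('punkt_tab', quiet=True)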

def assign_speakers_by_overlap(transcript_segments, word_segments, diarization_segments):
    """
    transcript_segments: whisper segments (each has 'start','end','text')
    word_segments: whisperx word-level segments [{'start','end','word'}...]
    diarization_segments: list of (start,end,label)
    Output: assigned_sentences list of dicts: start,end,speaker,text
    """
    # Look up each word's speaker by the midpoint of its time span
    word_speakers = []
    for w in word_segments:
        # WhisperX can omit timestamps for tokens it could not align (e.g. digits); skip those
        if 'start' not in w or 'end' not in w:
            continue
        mid = (w['start'] + w['end'])/2.0
        spk = "Unknown"
        for dstart, dend, label in diarization_segments:
            if dstart <= mid <= dend:
                spk = label
                break
        word_speakers.append({"start": w['start'], "end": w['end'], "word": w['word'], "speaker": spk})
    # Reconstruct sentences per whisper transcript segments, grouping words by sentence boundaries
    assigned = []
    for seg in transcript_segments:
        sstart, send, text = seg['start'], seg['end'], seg['text']
        s_words = [w for w in word_speakers if w['start'] >= sstart - 1e-3 and w['end'] <= send + 1e-3]
        if not s_words:
            # fallback: assign whole segment to overlapping diarization label with max overlap
            best = "Unknown"; best_ov=0
            for dstart,dend,label in diarization_segments:
                ov = max(0, min(send,dend)-max(sstart,dstart))
                if ov>best_ov:
                    best_ov=ov; best=label
            assigned.append({"start": sstart, "end": send, "speaker": best, "text": text.strip()})
            continue
        # group words into sentences by punctuation heuristics
        words_text = [w['word'] for w in s_words]
        joined = " ".join(words_text).strip()
        # split the joined words into sentences with NLTK
        sents = sent_tokenize(joined)
        # walk through the words in order: each sentence takes the next len(st.split()) words,
        # its timestamps come from those words, and its speaker is their majority label
        word_idx = 0
        for st in sents:
            wcount = len(st.split())
            if wcount == 0:
                continue
            # collect next wcount words or until end
            slice_words = s_words[word_idx: word_idx + wcount]
            word_idx += wcount
            if not slice_words:
                continue
            seg_start = slice_words[0]['start']
            seg_end = slice_words[-1]['end']
            # majority speaker
            spks = [w['speaker'] for w in slice_words]
            from collections import Counter
            spk = Counter(spks).most_common(1)[0][0] if spks else "Unknown"
            assigned.append({"start": seg_start, "end": seg_end, "speaker": spk, "text": st.strip()})
    return assigned
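
The word-to-speaker lookup above can be checked in isolation. A minimal sketch with made-up word timings and diarization segments; `speaker_at` is a hypothetical helper that mirrors the inner loop of `assign_speakers_by_overlap`.

```python
def speaker_at(mid, diarization_segments):
    """Return the label of the first diarization segment containing `mid`."""
    for dstart, dend, label in diarization_segments:
        if dstart <= mid <= dend:
            return label
    return "Unknown"

# Toy data (invented): two overlapping diarization segments and four words
diar = [(0.0, 2.5, "Speaker 1"), (2.0, 3.5, "Speaker 2")]
words = [
    {"word": "Hello", "start": 0.2, "end": 0.6},
    {"word": "there", "start": 2.1, "end": 2.4},   # midpoint 2.25 lies in both segments
    {"word": "thanks", "start": 2.8, "end": 3.2},
    {"word": "bye", "start": 5.0, "end": 5.4},     # outside every segment
]

for w in words:
    mid = (w["start"] + w["end"]) / 2.0
    print(w["word"], "->", speaker_at(mid, diar))
```

Where segments overlap, the first match in list order wins, so a word in the overlap region goes to the earlier segment.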


In [12]:
#%%python
from sklearn.cluster import KMeans
import yake
from sklearn.metrics.pairwise import cosine_similarity

kw_extractor = yake.KeywordExtractor(lan="en", top=5)
embed_sent = embed_model  # from earlier

def make_chapters(assigned_sentences, max_chapters=6, min_length_sec=20):
    texts = [s['text'] for s in assigned_sentences]
    if not texts:
        return []
    embs = embed_sent.encode(texts, convert_to_numpy=True, show_progress_bar=False)
    k = min(max_chapters, max(1, len(texts)//4))
    kmeans = KMeans(n_clusters=k, random_state=42).fit(embs)
    labels = kmeans.labels_
    # merge consecutive same label into chapters
    chapters = []
    cur_label = labels[0]
    cur_start = assigned_sentences[0]['start']
    cur_end = assigned_sentences[0]['end']
    texts_in = [assigned_sentences[0]['text']]
    for i in range(1, len(labels)):
        if labels[i]==cur_label:
            cur_end = assigned_sentences[i]['end']
            texts_in.append(assigned_sentences[i]['text'])
        else:
            chapters.append((cur_start, cur_end, cur_label, texts_in[:]))
            cur_label = labels[i]
            cur_start = assigned_sentences[i]['start']
            cur_end = assigned_sentences[i]['end']
            texts_in = [assigned_sentences[i]['text']]
    chapters.append((cur_start, cur_end, cur_label, texts_in[:]))
    # merge small chapters
    merged=[]
    for ch in chapters:
        if not merged:
            merged.append(ch)
        else:
            last = merged[-1]
            if ch[1]-ch[0] < min_length_sec:
                merged[-1] = (last[0], ch[1], last[2], last[3]+ch[3])
            else:
                merged.append(ch)
    out=[]
    for start,end,lab,texts in merged:
        combined = " ".join(texts)
        kws = kw_extractor.extract_keywords(combined)
        top_terms = [k for k,_ in kws[:3]]
        out.append({"start":start, "end":end, "topic_terms": top_terms, "excerpt": texts[:3]})
    return out

def extract_key_moments(assigned_sentences, top_k=6):
    texts = [s['text'] for s in assigned_sentences]
    if not texts:
        return []
    embs = embed_sent.encode(texts, convert_to_numpy=True, show_progress_bar=False)
    centroid = np.mean(embs, axis=0, keepdims=True)
    sims = cosine_similarity(embs, centroid).flatten()
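    # YAKE scores are lower-is-better, so invert them, then min-max scale to [0, 1]
    # so the 0.6/0.4 weights below combine values on a comparable scale
    yake_scores = []
    for t in texts:
        kws = kw_extractor.extract_keywords(t)
        score = kws[0][1] if kws else 1.0
        yake_scores.append(1.0/(score+1e-6))
    yake_scores = np.array(yake_scores)
    span = yake_scores.max() - yake_scores.min()
    yake_norm = (yake_scores - yake_scores.min())/span if span > 0 else np.zeros_like(yake_scores)
    final = 0.6*sims + 0.4*yake_norm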
    idxs = np.argsort(final)[::-1][:top_k]
    km = []
    for i in sorted(idxs):
        s = assigned_sentences[i]
        km.append({"start": s['start'], "end": s['end'], "speaker": s['speaker'], "text": s['text'], "score": float(final[i])})
    return km
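
The "merge small chapters" rule in `make_chapters` is easier to follow on toy data. This is a sketch of that rule alone, with invented chapter tuples in the same `(start, end, label, texts)` shape; `merge_short` is a hypothetical name.

```python
def merge_short(chapters, min_length_sec=20):
    """Fold any chapter shorter than min_length_sec into the chapter before it."""
    merged = []
    for start, end, label, texts in chapters:
        if merged and end - start < min_length_sec:
            p_start, _, p_label, p_texts = merged[-1]
            # extend the previous chapter; it keeps its own start and label
            merged[-1] = (p_start, end, p_label, p_texts + texts)
        else:
            merged.append((start, end, label, texts))
    return merged

chapters = [
    (0.0, 7.0, 2, ["intro"]),        # short, but first: nothing to merge into
    (7.8, 100.6, 0, ["a", "b"]),
    (101.0, 110.0, 1, ["c"]),        # 9 s: folded into the previous chapter
    (110.5, 180.0, 0, ["d"]),
]
for ch in merge_short(chapters):
    print(ch)
```

A short opening chapter survives because there is no earlier chapter to absorb it, which is why a run can start with a chapter only a few seconds long.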


In [13]:
#%%python
from transformers import pipeline
import spacy
nlp_spacy = spacy.load("en_core_web_sm")

# local summarizer (BART). On GPU this will be faster. May require significant RAM for long transcripts.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=0 if torch.cuda.is_available() else -1)

def summarize_long_text(text, max_length=140):
    # chunk long text into sizes manageable by the summarizer;
    # truncation=True guards against inputs longer than the model's token limit
    words = text.split()
    if len(words) < 700:
        out = summarizer(text, max_length=max_length, min_length=40, do_sample=False, truncation=True)
        return out[0]['summary_text']
    chunks = []
    chunk_size = 600
    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i+chunk_size])
        out = summarizer(chunk, max_length=120, min_length=30, do_sample=False, truncation=True)
        chunks.append(out[0]['summary_text'])
    # second pass: summarize the concatenated chunk summaries
    combined = " ".join(chunks)
    out = summarizer(combined, max_length=max_length, min_length=40, do_sample=False, truncation=True)
    return out[0]['summary_text']

def summarize_openai(text):
    if not OPENAI_API_KEY:
        raise ValueError("OPENAI_API_KEY not set")
    # openai>=1.0 removed openai.ChatCompletion; use the client interface instead
    from openai import OpenAI
    client = OpenAI(api_key=OPENAI_API_KEY)
    prompt = f"Summarize the following podcast transcript into a concise paragraph (3-5 sentences):\n\n{text}"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role":"user","content":prompt}],
        max_tokens=250,
        temperature=0.2
    )
    return resp.choices[0].message.content.strip()
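
The chunking that `summarize_long_text` applies to long transcripts is plain word slicing. A small sketch of just that step, with no model involved; `chunk_words` is a hypothetical helper and the input text is synthetic.

```python
def chunk_words(text, chunk_size=600):
    """Split text into consecutive chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Synthetic 1500-word "transcript"
text = " ".join(f"w{i}" for i in range(1500))
chunks = chunk_words(text)
print([len(c.split()) for c in chunks])  # [600, 600, 300]
```

Chunks are cut on word counts, not sentence boundaries, so a sentence can be split across two chunks before each is summarized.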




In [14]:
#%%python
def build_markdown(title, summary, chapters, key_moments, assigned_sentences, guests=None, links=None):
    md=[]
    md.append(f"# {title}\n")
    md.append(f"**Summary:** {summary}\n")
    duration = format_timestamp(max([s['end'] for s in assigned_sentences]) if assigned_sentences else 0)
    md.append(f"**Duration:** {duration}\n\n")
    md.append("## Chapters\n")
    for ch in chapters:
        md.append(f"- {format_timestamp(ch['start'])} — {format_timestamp(ch['end'])}: {', '.join(ch['topic_terms'])}")
        if ch.get("excerpt"):
            md.append(f"  > {' '.join(ch['excerpt'])[:220]}...")
    md.append("\n## Key Moments\n")
    for km in key_moments:
        md.append(f"- {format_timestamp(km['start'])} — **{km['speaker']}**: {km['text']}")
    md.append("\n## Transcript (sample)\n")
    for s in assigned_sentences[:200]:
        md.append(f"- {format_timestamp(s['start'])} — **{s['speaker']}**: {s['text']}")
    if guests:
        md.append("\n## Guests / People Mentioned\n")
        for p,c in guests.items():
            md.append(f"- {p} (mentioned {c} times)")
    if links:
        md.append("\n## Links & Resources\n")
        for u in links:
            md.append(f"- {u}")
    return "\n".join(md)


In [19]:
def transcribe_and_align(wav_path, whisper_model, device):
    import whisperx

    print("Running WhisperX alignment (CPU safe)...")

    # Load WhisperX model in int8 mode (avoids FP16 error)
    model = whisperx.load_model(
        "small",               # or whisper_size if passed separately
        device="cpu",
        compute_type="int8"
    )

    # Transcribe
    result = model.transcribe(
        wav_path,
        batch_size=8
    )

    # Load alignment model for the detected language
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"],
        device="cpu"
    )

    # Align
    result_aligned = whisperx.align(
        result["segments"],
        align_model,
        metadata,
        wav_path,
        device="cpu"
    )

    # Return aligned segments
    return result_aligned["segments"], result_aligned["word_segments"]


In [20]:
from collections import Counter

def process_episode(audio_path, out_dir="outputs", whisper_size="small", diarization_speakers=2, use_openai_summary=False):
    ensure_dir(out_dir)
    audio_path = str(audio_path)
    stem = Path(audio_path).stem
    wav_out = os.path.join(out_dir, f"{stem}_16k.wav")
    print("Converting to WAV 16k mono...")
    to_wav_mono_16k(audio_path, wav_out)

    whisper_model, device = load_whisper(whisper_size)

    print("Running diarization fallback...")
    diar = diarize_fallback(wav_out, n_speakers=diarization_speakers)
    print(f"Diarization segments: {len(diar)}")

    # CPU-safe WhisperX transcribe + alignment
    w_segments, word_segments = transcribe_and_align(wav_out, whisper_model, device)

    assigned = assign_speakers_by_overlap(w_segments, word_segments, diar)
    print(f"Assigned {len(assigned)} sentence-level units.")

    chapters = make_chapters(assigned, max_chapters=6)
    key_moments = extract_key_moments(assigned, top_k=8)

    full_text = " ".join([s['text'] for s in assigned])

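    try:
        # use the OpenAI summary when requested, otherwise the local BART summarizer
        summary = summarize_openai(full_text) if use_openai_summary else summarize_long_text(full_text)
    except Exception as e:
        # fall back to a plain excerpt if summarization fails
        print("Summarization failed, using excerpt:", e)
        summary = full_text[:150] + "..."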

    doc = nlp_spacy(full_text)
    persons = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    guest_counts = dict(Counter(persons).most_common(10))

    urls = re.findall(r'(https?://[^\s]+)', full_text)

    title = stem
    md = build_markdown(title, summary, chapters, key_moments, assigned, guests=guest_counts, links=list(set(urls)))

    md_path = os.path.join(out_dir, f"{stem}_shownotes.md")
    with open(md_path, "w", encoding="utf-8") as f:
        f.write(md)

    meta = {
        "assigned": assigned,
        "chapters": chapters,
        "key_moments": key_moments,
        "summary": summary,
        "guests": guest_counts,
        "links": list(set(urls))
    }

    json_path = os.path.join(out_dir, f"{stem}_metadata.json")
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(meta, f, indent=2)

    print("Saved:", md_path, json_path)
    return {"md_path": md_path, "json_path": json_path, "summary": summary, "chapters": chapters}
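
The URL regex in `process_episode` keeps any punctuation that directly follows a link, and `list(set(urls))` returns links in arbitrary order. One possible refinement is sketched below; `extract_urls` is a hypothetical helper, not part of the pipeline above, and the sample text is invented.

```python
import re

def extract_urls(text):
    """Find http(s) URLs, strip trailing punctuation, and de-duplicate in first-seen order."""
    raw = re.findall(r"https?://[^\s]+", text)
    cleaned = [u.rstrip(".,;:!?)\"'") for u in raw]
    # dict.fromkeys removes duplicates while preserving insertion order
    return list(dict.fromkeys(cleaned))

text = ("Visit https://example.com/show, then https://example.com/show. "
        "Also http://example.org/a?b=1).")
print(extract_urls(text))
# ['https://example.com/show', 'http://example.org/a?b=1']
```

Stripping trailing characters is a heuristic: a URL that legitimately ends in `)` or `.` would be shortened too.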


In [22]:
audio_file = "/content/Communication - Basics and Importance.mp4"

result = process_episode(
    audio_file,
    out_dir="/content/outputs",
    whisper_size="small",
    diarization_speakers=2
)

print(result)



Converting to WAV 16k mono...
Running diarization fallback...
Diarization segments: 71
Running WhisperX alignment (CPU safe)...
2025-11-16 16:36:47 - whisperx.asr - INFO - No language specified, language will be detected for each audio file (increases inference time)
2025-11-16 16:36:47 - whisperx.vads.pyannote - INFO - Performing voice activity detection using Pyannote...


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.6. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../usr/local/lib/python3.12/dist-packages/whisperx/assets/pytorch_model.bin`
  torchaudio.list_audio_backends()


Model was trained with pyannote.audio 0.0.1, yours is 3.4.0. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.8.0+cu126. Bad things might happen unless you revert torch to 1.x.
2025-11-16 16:37:35 - whisperx.asr - INFO - Detected language: en (0.99) in first 30s of audio
Assigned 78 sentence-level units.


Your max_length is set to 140, but your input_length is only 115. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=57)


Saved: /content/outputs/Communication - Basics and Importance_shownotes.md /content/outputs/Communication - Basics and Importance_metadata.json
{'md_path': '/content/outputs/Communication - Basics and Importance_shownotes.md', 'json_path': '/content/outputs/Communication - Basics and Importance_metadata.json', 'summary': 'Our emotions also affect our communication. How we interpret what we hear is affected by the thoughts that come to our mind when we are listening. For example, if your boss asks if the task that was assigned to you has been completed, you are likely to respond in anger.', 'chapters': [{'start': 0.031, 'end': 6.963, 'topic_terms': ['basics and importance', 'basics', 'importance'], 'excerpt': ['Communication, basics and importance.', 'In this video, we will learn what communication is.']}, {'start': 7.805, 'end': 100.624, 'topic_terms': ['personal and professional', 'communication', 'communication skills'], 'excerpt': ['We will also learn the importance of communication