# An Implementation of Notebook LM's PDF to Podcast

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/PDF_to_Podcast.ipynb)

### Introduction

In this notebook we will see how to create a podcast from a PDF input!


To run the notebook in JupyterLab, you may need the setup below. Run the following steps in your terminal.

### Step 1: Create a virtual environment
python3.12 -m venv ~/venvs/pdf2podcast

### Step 2: Activate it
source ~/venvs/pdf2podcast/bin/activate

### Step 3: Install ipykernel (and anything else you need)
pip install ipykernel

### Step 4: Register the kernel with Jupyter
python -m ipykernel install --user --name=python312 --display-name "Python 3.12"

If you already have a compatible kernel and prefer to run the notebook in a code editor, skip steps 5-7.

### Step 5: Install JupyterLab
pip3 install jupyterlab

### Step 6: Start JupyterLab
jupyter lab

### Step 7: Select the kernel you just registered in the JupyterLab interface (top right corner)

### Other steps if you haven't done them already:
- Install ffmpeg (for example, `brew install ffmpeg` on macOS)
- Set the `TOGETHER_API_KEY` environment variable
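
Before launching the notebook, a quick preflight check can confirm both prerequisites. The sketch below is one hedged way to do it: `missing_prerequisites` is a hypothetical helper name, not part of the cookbook, and its `env` and `which` parameters exist only so the check can be exercised without touching the real environment.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # ffmpeg must be on PATH because the notebook shells out to it
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    # The Together client and the TTS request both read this variable
    if not env.get("TOGETHER_API_KEY"):
        problems.append("TOGETHER_API_KEY is not set")
    return problems


# Report anything that is missing on this machine
for problem in missing_prerequisites():
    print("Missing:", problem)
```

An empty result means both ffmpeg and the API key were found.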

This cookbook is inspired by [Notebook LM's](https://notebooklm.google/) podcast generation feature and by a recent open source implementation, [Open Notebook LM](https://github.com/gabrielchua/open-notebooklm). It walks through how you can build a PDF to podcast pipeline.

Given any PDF we will generate a conversation between a host and a guest discussing and explaining the contents of the PDF.

In doing so we will learn the following:
1. How we can use JSON mode and structured generation with open models like Llama 3.1 70B to extract a script for the podcast given text from the PDF.
2. How we can use TTS models to bring this script to life as a conversation.

In [5]:
%pip install pypdf pydantic together cartesia ffmpeg-python alive-progress tqdm pygame requests

Collecting pydub>=0.25.1 (from cartesia)
  Using cached pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Using cached pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1
Note: you may need to restart the kernel to use updated packages.


In [8]:
import os
import time
import subprocess
from pathlib import Path
from typing import List, Literal

import requests
import ffmpeg
import pygame
from alive_progress import alive_bar
from pydantic import BaseModel, ValidationError
from pypdf import PdfReader
from together import Together
from tqdm import tqdm

pygame 2.6.1 (SDL 2.28.4, Python 3.12.9)
Hello from the pygame community. https://www.pygame.org/contribute.html


In [10]:
# Load your Together AI API key from the environment. The same key is used for Together's Cartesia TTS endpoint
api_key = os.getenv("TOGETHER_API_KEY") # or just set api_key = "your_api_key"

# Note: set the variable before launching the notebook; otherwise close the notebook and relaunch, or set the key here manually
if api_key is None:
    raise ValueError("TOGETHER_API_KEY is not set in your environment.")


### Define Dialogue Schema with Pydantic

We need a way of telling the LLM what the structure of the podcast script between the guest and host will look like. We will do this using `pydantic` models.

Below we define the required classes. 

- The overall conversation consists of lines said by either the host or the guest. The `LineItem` class specifies the structure of these lines.
- The full script is a sequence of lines performed by the speakers. Here we also include a `scratchpad` field that lets the LLM ideate and brainstorm the overall flow of the script before generating the lines, and a `name_of_guest` field. The `Script` class specifies this.

In [11]:
class LineItem(BaseModel):
    speaker: Literal["Host (Jane)", "Guest"]
    text: str

class Script(BaseModel):
    scratchpad: str
    name_of_guest: str
    script: List[LineItem]
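
To see what this schema enforces, the sketch below validates a small hand-written payload against the same two models and shows a payload with an unknown speaker being rejected. The sample payloads are made up for illustration, and the calls used (`model_validate_json`, `model_json_schema`) assume Pydantic v2.

```python
import json
from typing import List, Literal

from pydantic import BaseModel, ValidationError


class LineItem(BaseModel):
    speaker: Literal["Host (Jane)", "Guest"]
    text: str


class Script(BaseModel):
    scratchpad: str
    name_of_guest: str
    script: List[LineItem]


# Hypothetical payloads, shaped like what the LLM is asked to return
good = json.dumps({
    "scratchpad": "open with a hook",
    "name_of_guest": "Alex",
    "script": [
        {"speaker": "Host (Jane)", "text": "Welcome!"},
        {"speaker": "Guest", "text": "Thanks, Jane."},
    ],
})
bad = json.dumps({
    "scratchpad": "",
    "name_of_guest": "Alex",
    "script": [{"speaker": "Narrator", "text": "Hi"}],
})

parsed = Script.model_validate_json(good)
print(len(parsed.script), parsed.script[0].speaker)

try:
    Script.model_validate_json(bad)
    rejected = False
except ValidationError:
    # "Narrator" is not one of the two allowed Literal values
    rejected = True
print("bad payload rejected:", rejected)

# The JSON schema passed to the model lists the three top-level fields
print(sorted(Script.model_json_schema()["properties"]))
```

Restricting `speaker` with `Literal` is what lets later code pick a voice by comparing against a fixed string.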

In [12]:
# Adapted and modified from https://github.com/gabrielchua/open-notebooklm
SYSTEM_PROMPT = """
You are a world-class podcast producer tasked with transforming the provided input text into an engaging and informative podcast script. The input may be unstructured or messy, sourced from PDFs or web pages. Your goal is to extract the most interesting and insightful content for a compelling podcast discussion.

# Steps to Follow:

1. **Analyze the Input:**
   Carefully examine the text, identifying key topics, points, and interesting facts or anecdotes that could drive an engaging podcast conversation. Disregard irrelevant information or formatting issues.

2. **Brainstorm Ideas:**
   In the `<scratchpad>`, creatively brainstorm ways to present the key points engagingly. Consider:
   - Analogies, storytelling techniques, or hypothetical scenarios to make content relatable
   - Ways to make complex topics accessible to a general audience
   - Thought-provoking questions to explore during the podcast
   - Creative approaches to fill any gaps in the information

3. **Craft the Dialogue:**
   Develop a natural, conversational flow between the host (Jane) and the guest speaker (the author or an expert on the topic). Incorporate:
   - The best ideas from your brainstorming session
   - Clear explanations of complex topics
   - An engaging and lively tone to captivate listeners
   - A balance of information and entertainment

   Rules for the dialogue:
   - The host (Jane) always initiates the conversation and interviews the guest
   - Include thoughtful questions from the host to guide the discussion
   - Incorporate natural speech patterns, including occasional verbal fillers (e.g., "Uhh", "Hmmm", "um," "well," "you know")
   - Allow for natural interruptions and back-and-forth between host and guest - this is very important to make the conversation feel authentic
   - Ensure the guest's responses are substantiated by the input text, avoiding unsupported claims
   - Maintain a PG-rated conversation appropriate for all audiences
   - Avoid any marketing or self-promotional content from the guest
   - The host concludes the conversation

4. **Summarize Key Insights:**
   Naturally weave a summary of key points into the closing part of the dialogue. This should feel like a casual conversation rather than a formal recap, reinforcing the main takeaways before signing off.

5. **Maintain Authenticity:**
   Throughout the script, strive for authenticity in the conversation. Include:
   - Moments of genuine curiosity or surprise from the host
   - Instances where the guest might briefly struggle to articulate a complex idea
   - Light-hearted moments or humor when appropriate
   - Brief personal anecdotes or examples that relate to the topic (within the bounds of the input text)

6. **Consider Pacing and Structure:**
   Ensure the dialogue has a natural ebb and flow:
   - Start with a strong hook to grab the listener's attention
   - Gradually build complexity as the conversation progresses
   - Include brief "breather" moments for listeners to absorb complex information
   - For complicated concepts, reasking similar questions framed from a different perspective is recommended
   - End on a high note, perhaps with a thought-provoking question or a call-to-action for listeners

IMPORTANT RULE: Each line of dialogue should be no more than 100 characters (e.g., can finish within 5-8 seconds)

Remember: Always reply in valid JSON format, without code blocks. Begin directly with the JSON output.
"""

### Call the LLM to Generate Podcast Script

Below we call `Llama-3.1-70B` to generate a script for our podcast. We will also be able to read its `scratchpad` and see how it structured the overall conversation.

In [14]:
client = Together(api_key=api_key) #set above 

def call_llm(system_prompt: str, text: str, schema_class):
    response = client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text}
        ],
        model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
        response_format={
            "type": "json_object",
            "schema": schema_class.model_json_schema(),
        },
    )
    return response

In [13]:
def generate_script(system_prompt: str, input_text: str, output_model):
    """Get the dialogue from the LLM."""
    # Load as python object
    try:
        response = call_llm(system_prompt, input_text, output_model)
        dialogue = output_model.model_validate_json(
            response.choices[0].message.content
        )
    except ValidationError as e:
        error_message = f"Failed to parse dialogue JSON: {e}"
        system_prompt_with_error = f"{system_prompt}\n\nPlease return a VALID JSON object. This was the earlier error: {error_message}"
        response = call_llm(system_prompt_with_error, input_text, output_model)
        dialogue = output_model.model_validate_json(
            response.choices[0].message.content
        )
    return dialogue
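
The function above retries once. The same pattern generalizes to a bounded loop that feeds each validation error back into the next attempt. The sketch below is stdlib-only so it can run without an API key: `generate_with_retries`, `parse_script`, and `fake_llm` are hypothetical stand-ins for `call_llm` and the Pydantic validation, not part of the cookbook.

```python
import json
from typing import Callable


def generate_with_retries(
    call: Callable[[str], str],
    parse: Callable[[str], dict],
    system_prompt: str,
    max_attempts: int = 3,
) -> dict:
    """Call the model until parse() succeeds or the attempts run out."""
    prompt = system_prompt
    last_error = None
    for _ in range(max_attempts):
        raw = call(prompt)
        try:
            return parse(raw)
        except ValueError as e:  # json.JSONDecodeError subclasses ValueError
            last_error = e
            # Rebuild from the original prompt so old errors do not pile up
            prompt = (
                f"{system_prompt}\n\nPlease return a VALID JSON object. "
                f"This was the earlier error: {e}"
            )
    raise RuntimeError(f"No valid output after {max_attempts} attempts") from last_error


def parse_script(raw: str) -> dict:
    """Stand-in validator: parse JSON and require a 'script' field."""
    data = json.loads(raw)
    if "script" not in data:
        raise ValueError("missing 'script' field")
    return data


# Fake model: fails on the first call, succeeds once it sees the error hint
calls = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    if "earlier error" in prompt:
        return '{"script": []}'
    return "not json"


result = generate_with_retries(fake_llm, parse_script, "You are a producer.")
print(result, len(calls))
```

Capping the attempts keeps a persistently malformed response from looping forever and from running up API costs.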

### Load in PDF of Choice

Here we load an academic paper that proposes using many open source language models collaboratively to outperform much larger proprietary models!

In [8]:
#https://arxiv.org/abs/2406.04692

# !wget https://arxiv.org/pdf/2406.04692
# !mv 2406.04692 MoA.pdf
# use above if you prefer wget
!curl -L https://arxiv.org/pdf/2406.04692 -o MoA.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1130k  100 1130k    0     0  5733k      0 --:--:-- --:--:-- --:--:-- 5737k


In [15]:
def get_PDF_text(file : str):
    text = ''

    # Read the PDF file and extract text
    try:
        with Path(file).open("rb") as f:
            reader = PdfReader(f)
            # Fall back to '' for pages that yield no text
            text = "\n\n".join([page.extract_text() or '' for page in reader.pages])
    except Exception as e:
        # Raise a real exception: raising a plain string is a TypeError in Python 3
        raise ValueError(f"Error reading the PDF file: {e}") from e

    # Check whether the PDF has more than ~131,072 characters.
    # The context length limit of the model is 131,072 tokens, so capping the
    # character count at the same number is a conservative proxy.
    if len(text) > 131072:
        raise ValueError("The PDF is too long. Please upload a PDF with fewer than ~131072 characters.")

    return text

In [16]:
# Helper function - PDF processing. Converts a PDF into text!
def get_pdf_text(file_path: str) -> str:
    with Path(file_path).open("rb") as f:
        reader = PdfReader(f)
        text = "\n\n".join([page.extract_text() or '' for page in reader.pages])
    if len(text) > 400000:
        raise ValueError("PDF is too long")
    return text
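
Both versions cap the input by character count, while the model's limit is expressed in tokens. As a rough sketch, one can estimate tokens from characters and truncate instead of raising. The 4-characters-per-token ratio below is a common rule of thumb for English text, not a property of the Llama tokenizer, and `estimate_tokens` and `truncate_to_token_budget` are hypothetical helpers.

```python
CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not the real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count from character count (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def truncate_to_token_budget(text: str, max_tokens: int) -> str:
    """Cut text so its estimated token count fits within max_tokens."""
    return text[: max_tokens * CHARS_PER_TOKEN]


sample = "a" * 1000
print(estimate_tokens(sample))

short = truncate_to_token_budget(sample, max_tokens=100)
print(len(short), estimate_tokens(short))
```

For an exact count, the model's own tokenizer would be needed. The estimate is only meant to catch inputs that are clearly too large.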

### Generate Script

Below we generate the script and print out the lines.

In [17]:
# we use alive bar to show a progress bar and it looks nice!
def generate_script(system_prompt, input_text, schema_class):
    try:
        with alive_bar(spinner='dots_waves', title='Generating podcast script...') as bar:
            response = call_llm(system_prompt, input_text, schema_class)
            parsed = schema_class.model_validate_json(response.choices[0].message.content)
    except ValidationError as e:
        print("Retrying due to validation error:", e)
        system_prompt += f"\n\nPrevious error: {e}"
        with alive_bar(spinner='dots_waves', title='Retrying generation...') as bar:
            response = call_llm(system_prompt, input_text, schema_class)
            parsed = schema_class.model_validate_json(response.choices[0].message.content)
    return parsed

In [18]:
filepath = 'MoA.pdf' #whatever pdf path you want to use
pdf_text = get_pdf_text(filepath)
script = generate_script(SYSTEM_PROMPT, pdf_text, Script)
# Print the script
for line in script.script:
    print(f"{line.speaker}: {line.text}")

Generating podcast script... |████████████████████████████████████████| 0 in 25.8s (0.00/s) 
Host (Jane): Welcome to our podcast, where we explore the world of artificial intelligence and its applications. Today, we have Junlin Wang, a researcher from Duke University, who has made significant contributions to the field of large language models. Junlin, welcome to the show!
Guest: Thank you, Jane, for having me. I'm excited to share our research on large language models and how we can harness their collective strengths to improve their capabilities.
Host (Jane): Let's dive right in. Your research proposes a new approach called Mixture-of-Agents, or MoA. Can you explain what that is and how it works?
Guest: MoA is a methodology that leverages the collective strengths of multiple large language models to improve their reasoning and language generation capabilities. We construct a layered MoA architecture, where each layer comprises multiple LLM agents. Each agent takes all the outputs fro
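
The sample output above contains lines well beyond the 100 characters requested in the system prompt, which shows that a prompt instruction is not a hard guarantee. As a hedged sketch, a post-processing step could split any overlong line before it is sent to TTS. `split_long_line` is a hypothetical helper, and wrapping at whitespace with `textwrap` is one simple choice among several (splitting at sentence boundaries would be another).

```python
import textwrap


def split_long_line(text: str, max_chars: int = 100) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at whitespace."""
    return textwrap.wrap(text, width=max_chars)


line = (
    "MoA is a methodology that leverages the collective strengths of multiple "
    "large language models to improve their reasoning and language generation "
    "capabilities."
)
chunks = split_long_line(line)
for chunk in chunks:
    print(len(chunk), chunk)
```

Each chunk could then be synthesized with the same voice as the original line, so the speaker does not change mid-sentence.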

### Generate Podcast Using TTS

Below we read through the script and choose the TTS voice depending on the speaker. We define a host voice ID and a guest voice ID.

We loop through the lines in the script and generate each one with a call to the TTS model, passing the voice that matches the speaker. Each line is saved as its own temporary `wav` file, and once the script finishes we concatenate these files with ffmpeg into a single `podcast.wav`, ready to be played.



In [19]:
# Code for actually generating the audio... saves as a bunch of smaller files then concats 
def generate_audio_with_together(text, voice, model_id):
    api_key = os.getenv("TOGETHER_API_KEY")
    if api_key is None:
        raise ValueError("TOGETHER_API_KEY is not set in your environment.")
    
    url = "https://api.together.ai/v1/audio/generations"
    
    headers = {"Authorization": f"Bearer {api_key}"}
    
    data = {
        "input": text,
        "voice": voice,
        "response_format": "wav",  # Changed from mp3 to wav
        "sample_rate": 44100,
        "stream": False,
        "model": model_id,
    }
    
    response = requests.post(url, headers=headers, json=data)
    
    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        raise Exception(f"API request failed: {response.text}")
        
    return response.content

In [20]:
host_id = "wise man" # host ... look at together api for other voices
guest_id = "sarah" # Guest voice ... look at together api for other voices
model_id = "cartesia/sonic-2" # look at together api for other models
print("Starting audio generation...")
audio_files = []

for i, line in enumerate(tqdm(script.script, desc="Generating audio")):
    voice = host_id if line.speaker == "Host (Jane)" else guest_id
    
    audio_content = generate_audio_with_together(line.text, voice, model_id)
    
    # save each line as separate wav file
    filename = f"temp_audio_{i}.wav"
    with open(filename, "wb") as f:
        f.write(audio_content)
    audio_files.append(filename)

concat_file = "concat_list.txt"
with open(concat_file, "w") as f:
    for file in audio_files:
        f.write(f"file '{file}'\n")

# Save as wav (-y overwrites an existing podcast.wav; check=True raises if ffmpeg fails)
subprocess.run([
    "ffmpeg", "-y", "-f", "concat", "-safe", "0",
    "-i", concat_file, "-c", "copy", "podcast.wav"
], check=True)

# Clean up temp files
for file in audio_files:
    os.remove(file)
os.remove(concat_file)
print("Podcast generated successfully!")

Starting audio generation...


Generating audio: 100%|██████████| 16/16 [01:48<00:00,  6.79s/it]

Podcast generated successfully!



ffmpeg version 9c33b2f Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 9.3.0 (crosstool-NG 1.24.0.133_b0863d8_dirty)
  configuration: --prefix=/home/roy/miniconda3/envs/pdf2podcast --cc=/home/conda/feedstock_root/build_artifacts/ffmpeg_1627813612080/_build_env/bin/x86_64-conda-linux-gnu-cc --disable-doc --disable-openssl --enable-avresample --enable-gnutls --enable-gpl --enable-hardcoded-tables --enable-libfreetype --enable-libopenh264 --enable-libx264 --enable-pic --enable-pthreads --enable-shared --enable-static --enable-version3 --enable-zlib --enable-libmp3lame --pkg-config=/home/conda/feedstock_root/build_artifacts/ffmpeg_1627813612080/_build_env/bin/pkg-config
  libavutil      56. 51.100 / 56. 51.100
  libavcodec     58. 91.100 / 58. 91.100
  libavformat    58. 45.100 / 58. 45.100
  libavdevice    58. 10.100 / 58. 10.100
  libavfilter     7. 85.100 /  7. 85.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  7.100 /  5.  7.100
  libswresample   3.  
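
If ffmpeg is unavailable, the concatenation step can be approximated with Python's standard `wave` module. This is a sketch under the assumption that every clip is uncompressed PCM with identical channel count, sample width, and sample rate, which should hold when all clips come from the same TTS request settings. `concat_wavs` is a hypothetical helper, and the silent clips exist only to make the example runnable.

```python
import os
import tempfile
import wave


def concat_wavs(paths, out_path):
    """Concatenate PCM WAV files that share the same audio parameters."""
    with wave.open(out_path, "wb") as out:
        for i, path in enumerate(paths):
            with wave.open(path, "rb") as clip:
                if i == 0:
                    # Copy channels, sample width and rate from the first clip
                    out.setnchannels(clip.getnchannels())
                    out.setsampwidth(clip.getsampwidth())
                    out.setframerate(clip.getframerate())
                out.writeframes(clip.readframes(clip.getnframes()))


def write_silence(path, n_frames, rate=44100):
    """Write a mono 16-bit silent clip (test fixture only)."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * n_frames)


tmp = tempfile.mkdtemp()
a, b, merged = (os.path.join(tmp, n) for n in ("a.wav", "b.wav", "out.wav"))
write_silence(a, 1000)
write_silence(b, 500)
concat_wavs([a, b], merged)

# The merged file should hold the frames of both clips back to back
with wave.open(merged, "rb") as w:
    total_frames = w.getnframes()
print(total_frames)
```

In the notebook this would be called as `concat_wavs(audio_files, "podcast.wav")` before the temporary files are removed.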

NOTE: There are many options for playing the audio file. Here we use `pygame` to play the file since it avoids a lot of compiling and installation issues. Make sure your system also has an audio output.

In [None]:
# Play the podcast 
# Play audio (optional - only if on host device) -> if your jupyter kernel is being weird you can run this script locally
play_audio = True #set true if you want to play the audio
if play_audio:
    try:
        import pygame
        pygame.init()
        pygame.mixer.init()
        pygame.mixer.music.load("podcast.wav")
        pygame.mixer.music.play()

        # Wait until playback finishes
        while pygame.mixer.music.get_busy():
            pygame.time.Clock().tick(10)
        print("Podcast finished playing...")
    except ImportError:
        print("pygame not available - try another audio output method")