# Document to Podcast

Source code: https://github.com/mozilla-ai/document-to-podcast

Docs: https://mozilla-ai.github.io/document-to-podcast/

This notebook walks through the process of transforming documents into engaging podcast episodes. The process combines pre-processing, LLM-powered transcript generation, and text-to-speech generation.

For educational purposes, the "low level" API is used.

For simpler usage, see the [Command Line Interface](https://mozilla-ai.github.io/document-to-podcast/cli/).

## GPU Check

First, you'll need to enable GPUs for the notebook:

- Navigate to `Edit`→`Notebook Settings`
- Select T4 GPU from the Hardware Accelerator section
- Click `Save` and accept.

Next, we'll confirm that we can connect to the GPU:

In [None]:
import torch

if not torch.cuda.is_available():
    raise RuntimeError("GPU not available")
else:
    print("GPU is available!")

## Installing dependencies

In [None]:
%pip install --quiet https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp310-cp310-linux_x86_64.whl
%pip install --quiet document-to-podcast

## Uploading data

In [None]:
from google.colab import files

uploaded = files.upload()

## Loading and cleaning data

[Docs for this Step](https://mozilla-ai.github.io/document-to-podcast/step-by-step-guide/#step-1-document-pre-processing)

In [None]:
from pathlib import Path
from document_to_podcast.preprocessing import DATA_CLEANERS, DATA_LOADERS

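# `uploaded` maps file names to contents; use the first uploaded file.
input_file = next(iter(uploaded))
suffix = Path(input_file).suffix

# Pick the loader and cleaner registered for this file extension.
data_loader = DATA_LOADERS[suffix]
data_cleaner = DATA_CLEANERS[suffix]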
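
The cell above selects a loader and a cleaner by file suffix. The sketch below illustrates that dispatch pattern with an explicit check for unsupported file types. `EXAMPLE_LOADERS`, `EXAMPLE_CLEANERS`, `load_txt`, `clean_whitespace`, and `pick_handlers` are hypothetical names used for illustration only; they are not part of `document_to_podcast`, and the real registries may behave differently.

```python
from pathlib import Path


# Hypothetical stand-ins for the library's loaders and cleaners.
def load_txt(path: str) -> str:
    """Read a plain-text file as UTF-8."""
    return Path(path).read_text(encoding="utf-8")


def clean_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return " ".join(text.split())


# Registries keyed by file suffix (illustrative, not the library's own).
EXAMPLE_LOADERS = {".txt": load_txt}
EXAMPLE_CLEANERS = {".txt": clean_whitespace}


def pick_handlers(file_name: str):
    """Return (loader, cleaner) for a file, or raise a readable error."""
    suffix = Path(file_name).suffix.lower()
    if suffix not in EXAMPLE_LOADERS:
        supported = ", ".join(sorted(EXAMPLE_LOADERS))
        raise ValueError(
            f"Unsupported file type {suffix!r}; supported: {supported}"
        )
    return EXAMPLE_LOADERS[suffix], EXAMPLE_CLEANERS[suffix]
```

With a check like this, an unsupported upload fails with a readable message instead of a bare `KeyError`.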

In [None]:
raw_text = data_loader(input_file)
print(f"Number of characters before cleaning: {len(raw_text)}")
print(raw_text[:200])

In [None]:
clean_text = data_cleaner(raw_text)
print(f"Number of characters after cleaning: {len(clean_text)}")
print(clean_text[:200])

## Downloading and loading models

[Docs for this Step](https://mozilla-ai.github.io/document-to-podcast/step-by-step-guide/#step-2-podcast-script-generation)

For this demo, we are using the following models:

- [OLMoE-1B-7B-0924-Instruct](https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct-GGUF) as the text model that writes the script
- [OuteAI/OuteTTS-0.2-500M-GGUF/OuteTTS-0.2-500M-FP16.gguf](https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF) as the text-to-speech model

You can check the [Customization Guide](https://mozilla-ai.github.io/document-to-podcast/customization/) for more information on how to use different models.

In [None]:
from document_to_podcast.inference.model_loaders import (
    load_llama_cpp_model,
    load_tts_model,
)

text_model = load_llama_cpp_model(
    "allenai/OLMoE-1B-7B-0924-Instruct-GGUF/olmoe-1b-7b-0924-instruct-q8_0.gguf"
)
speech_model = load_tts_model("OuteAI/OuteTTS-0.2-500M-GGUF/OuteTTS-0.2-500M-FP16.gguf")

In [None]:
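# Rough character budget: context size in tokens times ~4 characters per token
# (a heuristic, not an exact token count).
max_characters = text_model.n_ctx() * 4
if len(clean_text) > max_characters:
    print(
        f"Input text is too long ({len(clean_text)} characters)."
        f" Using only the first {max_characters} characters."
    )
    clean_text = clean_text[:max_characters]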
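
The check above caps the input at `n_ctx() * 4` characters, which reads as a rough characters-per-token estimate. The sketch below expresses the same idea as a reusable function. It assumes an average of about 4 characters per token, which is only an approximation and varies by tokenizer and language; `truncate_to_context` and `chars_per_token` are hypothetical names, not part of the library.

```python
def truncate_to_context(text: str, n_ctx: int, chars_per_token: int = 4) -> str:
    """Trim `text` to a character budget derived from a token budget.

    n_ctx: context window size in tokens.
    chars_per_token: assumed average characters per token (heuristic).
    """
    max_characters = n_ctx * chars_per_token
    if len(text) <= max_characters:
        return text  # already fits, leave untouched
    return text[:max_characters]


# Example: a 10-token context allows roughly 40 characters.
shortened = truncate_to_context("a" * 100, n_ctx=10)
```

Because the budget is an estimate, the truncated text can still exceed the context window once tokenized; counting tokens with the model's own tokenizer would be more precise.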

## Podcast generation

[Docs for this Step](https://mozilla-ai.github.io/document-to-podcast/step-by-step-guide/#step-3-audio-podcast-generation)

### Speaker configuration

In [None]:
from document_to_podcast.config import Speaker

speakers = [
    {
        "id": 1,
        "name": "Laura",
        "description": "The main host. She explains topics clearly using anecdotes and analogies, teaching in an engaging and captivating way.",
        "voice_profile": "female_1",
    },
    {
        "id": 2,
        "name": "Jon",
        "description": "The co-host. He keeps the conversation on track, asks curious follow-up questions, and reacts with excitement or confusion, often using interjections like hmm or umm.",
        "voice_profile": "male_1",
    },
]

speakers_str = "\n".join(
    str(Speaker.model_validate(speaker))
    for speaker in speakers
    if all(speaker.get(x, None) for x in ["name", "description", "voice_profile"])
)
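
The generator expression above silently skips any speaker that is missing a `name`, `description`, or `voice_profile`, or that has an empty value for one of them. The sketch below shows that filter on plain dictionaries, without the `Speaker` model; `is_complete`, `REQUIRED_FIELDS`, and the sample entries are hypothetical.

```python
REQUIRED_FIELDS = ("name", "description", "voice_profile")


def is_complete(speaker: dict) -> bool:
    """True when every required field is present and non-empty."""
    return all(speaker.get(field) for field in REQUIRED_FIELDS)


candidates = [
    {"id": 1, "name": "Laura", "description": "Host", "voice_profile": "female_1"},
    {"id": 2, "name": "Jon", "description": "", "voice_profile": "male_1"},
    {"id": 3, "name": "Ana", "description": "Guest"},  # no voice_profile
]

# Only the ids of speakers that pass the filter.
kept_ids = [speaker["id"] for speaker in candidates if is_complete(speaker)]
```

Here only speaker 1 passes, so a typo in a field name removes a speaker from the prompt without raising an error.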

### Prompt configuration

In [None]:
PROMPT = """
You are a podcast scriptwriter generating engaging and natural-sounding conversations in JSON format.
The script features the following speakers:
{SPEAKERS}
Instructions:
- Write dynamic, easy-to-follow dialogue.
- Include natural interruptions and interjections.
- Avoid repetitive phrasing between speakers.
- Format output as a JSON conversation.
Example:
{
  "Speaker 1": "Welcome to our podcast! Today, we're exploring...",
  "Speaker 2": "Hi! I'm excited to hear about this. Can you explain...",
  "Speaker 1": "Sure! Imagine it like this...",
  "Speaker 2": "Oh, that's cool! But how does..."
}
"""
system_prompt = PROMPT.replace("{SPEAKERS}", speakers_str)
print(system_prompt)
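
The template is filled with `str.replace` rather than `str.format`. One likely reason, sketched below, is that the prompt contains literal JSON braces, which `str.format` would try to interpret as replacement fields. `TEMPLATE` is a shortened, hypothetical stand-in for `PROMPT`, and `example_speakers` is an invented rendering of the speaker list.

```python
TEMPLATE = """Speakers:
{SPEAKERS}
Example:
{
  "Speaker 1": "Welcome!"
}
"""

example_speakers = "Speaker 1: Laura\nSpeaker 2: Jon"  # hypothetical rendering

# str.format treats the JSON braces as a replacement field and fails.
try:
    TEMPLATE.format(SPEAKERS=example_speakers)
    format_error = None
except (KeyError, ValueError, IndexError) as err:
    format_error = err

# str.replace only touches the exact placeholder and leaves the JSON intact.
filled = TEMPLATE.replace("{SPEAKERS}", example_speakers)
```

To use `str.format` instead, the literal braces in the example would have to be doubled (`{{` and `}}`).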

### Model inference

In [None]:
import re

from document_to_podcast.inference.text_to_speech import text_to_speech
from document_to_podcast.inference.text_to_text import text_to_text_stream
from IPython.display import display, Audio

podcast_audio = []
podcast_script = ""
text = ""  # buffers streamed chunks until a full dialogue line is available
for chunk in text_to_text_stream(
    clean_text, text_model, system_prompt=system_prompt.strip()
):
    text += chunk
    # Only act once the line is complete and names a numbered speaker.
    match = re.search(r"Speaker (\d+)", text)
    if text.endswith("\n") and match:
        podcast_script += text
        print(text)

        speaker_id = match.group(1)
        voice_profile = next(
            speaker["voice_profile"]
            for speaker in speakers
            if speaker["id"] == int(speaker_id)
        )
        # Synthesize only the part that follows the speaker tag.
        speech = text_to_speech(
            text.split(f'"Speaker {speaker_id}":')[-1],
            speech_model,
            voice_profile,
        )
        podcast_audio.append(speech)
        display(Audio(speech, rate=speech_model.sample_rate))
        text = ""  # reset the buffer for the next line
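
The loop above buffers streamed chunks until a complete dialogue line arrives, then extracts the speaker number and the spoken text. The sketch below isolates that buffering logic so it can be run without any model. `iter_speaker_lines` and `fake_stream` are hypothetical; unlike the notebook cell, this version also strips the surrounding quotes and trailing comma from the spoken text.

```python
import re
from typing import Iterable, Iterator, Tuple


def iter_speaker_lines(chunks: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (speaker_id, spoken_text) each time a full dialogue line arrives."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if not buffer.endswith("\n"):
            continue  # line still incomplete, keep buffering
        match = re.search(r'"Speaker (\d+)":\s*(.*)', buffer)
        if match is None:
            continue  # e.g. the opening "{" line, keep buffering
        # Drop the trailing comma and the enclosing double quotes.
        spoken = match.group(2).strip().rstrip(",").strip('"')
        yield int(match.group(1)), spoken
        buffer = ""


# Simulated model output arriving in arbitrary pieces.
fake_stream = [
    "{\n",
    '  "Speaker 1": "Wel',
    'come!",\n',
    '  "Speaker 2": "Hi',
    ' there."\n',
    "}",
]
lines = list(iter_speaker_lines(fake_stream))
```

Each yielded pair could then be passed to a text-to-speech call with the matching voice profile. Note that text left in the buffer when the stream ends without a final newline is not yielded.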

## Save the results

The cells below write `podcast.txt` and `podcast.wav` to the notebook's working directory. You can download both files from the file explorer.

In [None]:
# Write the script as UTF-8 so non-ASCII characters are saved consistently.
with open("podcast.txt", "w", encoding="utf-8") as f:
    f.write(podcast_script)

In [None]:
import numpy as np
import soundfile as sf

sf.write(
    "podcast.wav",
    np.concatenate(podcast_audio),
    samplerate=speech_model.sample_rate,
)