# Creating Audiobooks with Gemini 2.5 TTS
Previously, I [experimented](exploring_gemini_25_tts.ipynb) with the Gemini 2.5 Flash Preview TTS model by generating speech from simple text in a markdown file. Encouraged by the experience, I wanted to create another notebook that reads an ePub file instead of a markdown file and converts the entire ePub into speech. In this notebook, we will go through the process of creating "audiobooks" with the Gemini 2.5 Flash Preview TTS model.

Every quarter, I receive a couple of magazines that I do not often get a chance to read. They arrive in ePub, a popular open standard for encoding the data and metadata of complex texts such as magazines and books. In this notebook, we will explore how to load these ePub files in Python, use the Gemini 2.0 Flash model to extract metadata from parts of each file, generate speech for the text, and finally save all the generated audio, along with chapter markers, to a file. The audiobook can then be played in a supporting app such as *Book Player* on the iPhone.

First, let us load the required packages, set our secrets, and create a `genai` client. We can use the `uv` tool to install the packages and create the necessary Python environment.

In [None]:
import json
import os
import time
from typing import Dict, List, Tuple

import ebooklib
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from ebooklib import epub
from google import genai
from google.genai import types
from pydub import AudioSegment

## Create Gemini client
We now load our API key and then create the client.

In [None]:
load_dotenv("../secrets.env")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=GEMINI_API_KEY)
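
If the key is missing from `secrets.env`, `os.getenv` silently returns `None`. A small guard can make that obvious right away. The helper name `require_env` below is hypothetical and not part of any library; it is only a sketch of the idea.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper for this notebook.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty.")
    return value
```

With this in place, the key could be loaded as `GEMINI_API_KEY = require_env("GEMINI_API_KEY")`.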

## Define prompts
In this notebook, we use prompts for two scenarios:
1. Use a Gemini 2.0 Flash model to tell whether a given text contains a genuine magazine article or other content like advertisements, crosswords, etc. If the text contains an article, extract its title, sub-headline, author, and the starting point of the body of the text.
2. Instruct Gemini 2.5 Flash Preview TTS on how to read the given text.

In [None]:
GEMINI_CHAPTER_CLASSIFY_AND_EXTRACT_PROMPT = """
You are an assistant for audiobook creation. Given the first 20 lines of a 
chapter from an ePub, do the following:
1. Determine if the content is an article (story, essay, or feature) or 
something else (advertisement, crossword, non-story, etc).
2. If it is an article, extract and return:
    - The headline (main title)
    - The sub-headline (subtitle, if present, else empty string)
    - The author (if present, else empty string)
    - The line number (0-based) from which the actual article body starts 
    (i.e., after headline, sub-headline, and author)
3. If it is not an article, do not extract any details.
Respond in the following JSON format:
{
    "is_article": true/false,
    "headline": "...",
    "sub_headline": "...",
    "author": "...",
    "body_start_line": int
}
If it is not an article, set is_article to false and leave other fields 
empty or null.
Here are the first 20 lines of the chapter (one per line):
"""

PROMPT_FOR_READING = "Read in an even tone with a West London accent."

## Read ePub
An ePub file is an open, XML-based format that packs data and metadata into a single file. It is widely used for books, magazines, and other kinds of reports. In an ePub file, chapters or sections are denoted in two ways:
1. Each chapter is stored as its own HTML file inside the archive that makes up the ePub file.
2. The file contains a table of contents which links to individual chapters or sections.
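
To make the "archive of HTML files" idea concrete, here is a minimal sketch that relies only on the fact that an ePub is a ZIP archive. The archive built below is a made-up stand-in, not a valid ePub, and `list_html_documents` is a hypothetical helper name.

```python
import io
import zipfile


def list_html_documents(epub_bytes: bytes) -> list[str]:
    """List the HTML/XHTML members of an ePub, which is a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        return [
            name
            for name in archive.namelist()
            if name.lower().endswith((".html", ".xhtml", ".htm"))
        ]


# Build a tiny, made-up archive in memory to stand in for a real ePub.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("OEBPS/chapter1.xhtml", "<html><body><h1>One</h1></body></html>")
    archive.writestr("OEBPS/chapter2.xhtml", "<html><body><h1>Two</h1></body></html>")
    archive.writestr("OEBPS/toc.ncx", "<ncx/>")

print(list_html_documents(buffer.getvalue()))
```

Only the two chapter files are listed; the table of contents lives in a separate, non-HTML member of the archive.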

Using the `ebooklib` package, we define a function that can read the ePub file, iterate over its chapters, and then use the Gemini API to get further metadata about each chapter.

In [None]:
def extract_body(text_lines: list, body_start: int) -> str:
    """
    Extract the body text from the chapter, starting from the specified start line.
    Args:
        text_lines (list): List of non-empty lines from the chapter.
        body_start (int): The line number to start the body text extraction.
    Returns:
        str: The extracted body text.
    """
    return "\n".join(text_lines[body_start:]).strip()


def gemini_classify_and_extract_chapter(client, lines):
    """
    Use Gemini to classify and extract article metadata from the first lines
    of a chapter.

    Args:
        client: Gemini client.
        lines (list[str]): First 20 lines of the chapter.
    Returns:
        dict: {is_article, headline, sub_headline, author, body_start_line}
    """
    prompt = GEMINI_CHAPTER_CLASSIFY_AND_EXTRACT_PROMPT + "\n" + "\n".join(lines[:20])
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
        config=types.GenerateContentConfig(temperature=0.0, max_output_tokens=256),
    )

    # Remove code block markers if present
    text = response.candidates[0].content.parts[0].text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[-1]
    if text.endswith("```"):
        text = text.rsplit("\n", 1)[0]

    try:
        result = json.loads(text)
    except json.JSONDecodeError as exc:
        # Keep the original parsing error attached for easier debugging
        raise ValueError(f"Could not parse Gemini response: {text}") from exc

    return result


def read_epub_by_chapters(epub_path: str) -> Dict[str, str]:
    """
    Reads an ePub file and returns a dictionary of chapter headlines to cleaned text,
    extracting headline, sub-headline, author, and body using Gemini.
    Args:
        epub_path (str): Path to the ePub file.
    Returns:
        dict: {headline: chapter_text}
    """
    book = epub.read_epub(epub_path)
    candidate_chapters = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        soup = BeautifulSoup(item.get_content(), "html.parser")
        for tag in soup(["img", "script", "style"]):
            tag.decompose()

        text_lines = [
            line.strip() for line in soup.get_text("\n").split("\n") if line.strip()
        ]
        if not text_lines:
            continue

        # Use Gemini to classify and extract metadata
        meta = gemini_classify_and_extract_chapter(client, text_lines)
        if not meta.get("is_article"):
            continue
        headline = meta.get("headline") or ""
        sub_headline = meta.get("sub_headline") or ""
        author = meta.get("author") or ""
        body_start = meta.get("body_start_line") or 0
        body = extract_body(text_lines, body_start)

        if headline and body:
            chapter_text = headline + "\n\n"
            if sub_headline:
                chapter_text += sub_headline + "\n\n"
            if author:
                chapter_text += author + "\n\n"
            chapter_text += body
            candidate_chapters.append((headline, chapter_text))

    filtered_chapters = {}
    for headline, chapter_text in candidate_chapters:
        filtered_chapters[headline] = chapter_text

    return filtered_chapters
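
The fence-stripping step in `gemini_classify_and_extract_chapter()` can be tried out without calling the API. The sketch below repeats that logic in a hypothetical standalone helper, `parse_model_json`, and runs it on a made-up reply in the format the prompt asks for.

```python
import json

FENCE = "`" * 3  # built this way to keep literal fences out of the example


def parse_model_json(raw: str) -> dict:
    """Strip optional Markdown code fences and parse the remainder as JSON."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, including a language tag such as "json"
        text = text.split("\n", 1)[-1]
    if text.endswith(FENCE):
        # Drop the closing fence line
        text = text.rsplit("\n", 1)[0]
    return json.loads(text)


# A made-up reply, wrapped in a code fence as models sometimes do.
sample_reply = (
    f"{FENCE}json\n"
    '{"is_article": true, "headline": "A Title", "sub_headline": "", '
    '"author": "Jane Doe", "body_start_line": 2}\n'
    f"{FENCE}"
)
meta = parse_model_json(sample_reply)
print(meta["headline"], meta["body_start_line"])
```

A reply without fences passes through both checks unchanged and is parsed directly.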

## Split text
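Since the Gemini TTS model only accepts up to 8000 input tokens and each token is approximately 4 characters long, we define a function that splits text into chunks of up to 7000 tokens (about 28,000 characters), adding two empty lines at the end of each chunk. The token count is only estimated from the character count; the function does not run a tokenizer.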

In [None]:
def split_text_to_chunks(
    text: str, max_tokens: int = 7000, token_chars: int = 4
) -> List[str]:
    """
    Split the input text into chunks, each with a maximum number of tokens.

    Args:
        text (str): The input text to split.
        max_tokens (int): Maximum number of tokens per chunk (default: 7000).
        token_chars (int): Number of characters per token (default: 4).

    Returns:
        List[str]: A list of text chunks, each ending with two newlines.
    """
    max_chars = max_tokens * token_chars
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunk = text[start:end].rstrip() + "\n\n"
        chunks.append(chunk)
        start = end
    return chunks
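
Because `split_text_to_chunks()` slices at a fixed character offset, a chunk can end in the middle of a word or sentence. One possible alternative, sketched here as an assumption rather than what this notebook uses, packs whole paragraphs into each chunk and falls back to word boundaries for oversized paragraphs. The name `split_text_on_paragraphs` is hypothetical, and the default of 28,000 characters is simply 7000 tokens times 4 characters.

```python
from typing import List


def split_text_on_paragraphs(text: str, max_chars: int = 28000) -> List[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    The limit applies before the trailing blank line is added. A paragraph
    longer than max_chars is split on whitespace, so no word is cut in half
    (unless a single word is itself longer than max_chars).
    """
    pieces: List[str] = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            pieces.append(paragraph)
            continue
        # Oversized paragraph: pack words up to the limit.
        current = ""
        for word in paragraph.split():
            candidate = f"{current} {word}" if current else word
            if len(candidate) > max_chars and current:
                pieces.append(current)
                current = word
            else:
                current = candidate
        if current:
            pieces.append(current)

    # Pack the pieces into chunks, each ending with two newlines.
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current}\n\n{piece}" if current else piece
        if len(candidate) > max_chars and current:
            chunks.append(current + "\n\n")
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current + "\n\n")
    return chunks
```

The trade-off is that chunks become uneven in length, so a chapter may need slightly more requests than with fixed-size slicing.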

## Create audiobook
Finally, we define the function that creates the audiobook. First, it groups the audio by chapter and joins all the audio chunks that belong to a given chapter. It then creates a file with the `_chapters.txt` suffix that holds the start and end timestamps of each article. To play the audiobook with chapters on a mobile device, we need to use `ffmpeg`, so the function prints a terminal command that should make it easy to generate the desired audiobook.

In [None]:
def group_pcm_by_chapter(
    pcm_chunks: List[bytes],
    chapters: List[str],
) -> Tuple[List[str], Dict[str, List[bytes]]]:
    """
    Groups consecutive PCM audio chunks by chapter name.

    Args:
        pcm_chunks (List[bytes]): Flat list of PCM audio data for all chapter chunks.
        chapters (List[str]): Flat list of chapter names, same length as pcm_chunks.
        A name repeats consecutively, once for every chunk of that chapter.

    Returns:
        Tuple[List[str], Dict[str, List[bytes]]]:
            - List of chapter names in order of appearance (no duplicates).
            - Dictionary mapping chapter name to list of PCM chunks
            (consecutive for that chapter).
    """
    chapter_audio_dict: Dict[str, List[bytes]] = {}
    chapter_order: List[str] = []
    prev_chapter = None
    for ch, pcm in zip(chapters, pcm_chunks):
        if ch != prev_chapter:
            chapter_order.append(ch)
            chapter_audio_dict[ch] = []
        chapter_audio_dict[ch].append(pcm)
        prev_chapter = ch
    return chapter_order, chapter_audio_dict


def join_chapter_audio(
    chapter_order: List[str],
    chapter_audio_dict: Dict[str, List[bytes]],
    sample_width: int,
    rate: int,
    channels: int,
) -> List[Tuple[str, "AudioSegment"]]:
    """
    Joins PCM chunks for each chapter into a single AudioSegment.

    Args:
        chapter_order (List[str]): List of chapter names in order.
        chapter_audio_dict (Dict[str, List[bytes]]): Dict mapping chapter name to list of PCM chunks.
        sample_width (int): Sample width in bytes.
        rate (int): Sample rate in Hz.
        channels (int): Number of audio channels.

    Returns:
        List[Tuple[str, AudioSegment]]: List of (chapter name, joined AudioSegment) tuples in order.
    """
    chapter_audio_segments: List[Tuple[str, "AudioSegment"]] = []
    for ch in chapter_order:
        segs = [
            AudioSegment(
                data=pcm, sample_width=sample_width, frame_rate=rate, channels=channels
            )
            for pcm in chapter_audio_dict[ch]
        ]
        chapter_audio = sum(segs)
        chapter_audio_segments.append((ch, chapter_audio))
    return chapter_audio_segments


def write_chapters_txt(
    filename: str, chapter_audio_segments: List[Tuple[str, "AudioSegment"]]
) -> str:
    """
    Writes a chapters.txt file for ffmpeg using chapter start/end times and titles.

    Args:
        filename (str): Output audiobook filename (used to derive chapters.txt name).
        chapter_audio_segments (List[Tuple[str, AudioSegment]]): List of
        (chapter name, AudioSegment) tuples.

    Returns:
        str: Path to the generated chapters.txt file.
    """
    chapter_times: List[int] = []
    current_time = 0
    for _, seg in chapter_audio_segments:
        chapter_times.append(current_time)
        current_time += len(seg)
    full_audio_len = current_time
    chapters_txt = filename.rsplit(".", 1)[0] + "_chapters.txt"
    with open(chapters_txt, "w", encoding="utf-8") as f:
        # The ffmetadata format is documented as starting with this header line.
        f.write(";FFMETADATA1\n")
        for i, (start_ms, (ch, _)) in enumerate(
            zip(chapter_times, chapter_audio_segments)
        ):
            f.write(f"[CHAPTER]\nTIMEBASE=1/1000\nSTART={start_ms}\n")
            end_ms = (
                chapter_times[i + 1] if i + 1 < len(chapter_times) else full_audio_len
            )
            f.write(f"END={end_ms}\nTITLE={ch}\n\n")
    print(f"Chapters file saved as {chapters_txt}.")
    return chapters_txt


def save_audiobook(
    filename: str,
    pcm_chunks: List[bytes],
    chapters: List[str],
    channels: int = 1,
    rate: int = 24000,
    sample_width: int = 2,
) -> None:
    """
    Save concatenated PCM audio data to an M4B (AAC) audiobook file and generate
    a chapters.txt file for ffmpeg. Consecutive identical chapter names are grouped,
    their PCM chunks joined, and one entry per chapter is created in chapters.txt.

    Args:
        filename (str): Output M4B file name.
        pcm_chunks (List[bytes]): Flat list of PCM audio data for all chapter chunks.
        chapters (List[str]): Flat list of chapter names, same length as pcm_chunks.
        channels (int, optional): Number of audio channels. Defaults to 1.
        rate (int, optional): Sample rate in Hz. Defaults to 24000.
        sample_width (int, optional): Sample width in bytes. Defaults to 2.
    """
    if not filename.lower().endswith(".m4b"):
        filename = filename.rsplit(".", 1)[0] + ".m4b"
    chapter_order, chapter_audio_dict = group_pcm_by_chapter(pcm_chunks, chapters)
    chapter_audio_segments = join_chapter_audio(
        chapter_order, chapter_audio_dict, sample_width, rate, channels
    )
    # Export as M4B (AAC in M4B container)
    full_audio = sum(seg for _, seg in chapter_audio_segments)
    full_audio.export(filename, format="mp4", codec="aac")
    print(f"Audiobook saved as {filename}.")
    chapters_txt = write_chapters_txt(filename, chapter_audio_segments)
    print("To mux chapters into the M4B, use:")
    print(
        f"ffmpeg -i '{filename}' -f ffmetadata -i '{chapters_txt}' -map_metadata 1 -codec copy '{filename.rsplit('.', 1)[0]}_with_chapters.m4b'"
    )
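
FFmpeg's metadata file format is documented as starting with a `;FFMETADATA1` header, and the characters `=`, `;`, `#`, `\` and newline must be escaped with a backslash when they appear in a value. Article titles often contain such characters, so the following sketch shows one way to render chapter entries safely. The helper names `escape_ffmetadata` and `format_ffmetadata` are hypothetical, and the sample title is made up.

```python
from typing import List, Tuple


def escape_ffmetadata(value: str) -> str:
    """Backslash-escape the characters the ffmetadata format treats as special."""
    # The backslash goes first so that later escapes are not escaped again.
    for char in ("\\", "=", ";", "#", "\n"):
        value = value.replace(char, "\\" + char)
    return value


def format_ffmetadata(chapters: List[Tuple[str, int, int]]) -> str:
    """Render (title, start_ms, end_ms) tuples as ffmetadata text."""
    lines = [";FFMETADATA1"]
    for title, start_ms, end_ms in chapters:
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",
            f"START={start_ms}",
            f"END={end_ms}",
            f"TITLE={escape_ffmetadata(title)}",
            "",
        ]
    return "\n".join(lines)


print(format_ffmetadata([("Cost = Value?", 0, 1500), ("Part #2", 1500, 4000)]))
```

The timestamps are in milliseconds because `TIMEBASE=1/1000` is used, which matches the millisecond lengths that `len()` returns for a pydub `AudioSegment`.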

## Prepare chunks
Having defined all the relevant functions, we now load the ePub file and convert its contents into chunks of up to 7000 tokens.

In [None]:
EPUB_FILE_PATH = "<INPUT FILE>.epub"
chapters = read_epub_by_chapters(EPUB_FILE_PATH)
chapter_chunks = {}
for title, text in chapters.items():
    chapter_chunks[title] = split_text_to_chunks(text)
print(f"Chapters found: {list(chapter_chunks.keys())}")
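
Before starting the generation loop, it can help to estimate how long it will run. The loop in this notebook pauses 120 seconds after every chunk, so the pauses alone give a lower bound. The sketch below uses a hypothetical helper, `estimate_runtime_minutes`, and made-up chunk data.

```python
from typing import Dict, List


def estimate_runtime_minutes(
    chapter_chunks: Dict[str, List[str]], pause_seconds: int = 120
) -> float:
    """Lower bound on wall-clock time: one pause per chunk, ignoring API latency."""
    total_chunks = sum(len(chunks) for chunks in chapter_chunks.values())
    return total_chunks * pause_seconds / 60


# Made-up example: two chapters with three chunks in total.
example_chunks = {"First article": ["...", "..."], "Second article": ["..."]}
print(estimate_runtime_minutes(example_chunks))  # 3 chunks * 120 s = 6.0 minutes
```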

## Generate audio
We are now at the crux of the notebook, where we iterate over each chunk of text and send it to Gemini 2.5 Flash Preview TTS to convert it to speech. After attaching a chapter name to each chunk, we use `save_audiobook()` to save the speech chunks to an `m4b` file and then print a command that will attach the chapter metadata correctly to the `m4b` file using `ffmpeg`.

In [None]:
all_pcm_chunks = []
chapter_titles = []
for chapter_title, chunks in chapter_chunks.items():
    for idx, chunk in enumerate(chunks):
        print(f"Generating TTS for {chapter_title} - chunk {idx + 1}/{len(chunks)}")
        response = client.models.generate_content(
            model="gemini-2.5-flash-preview-tts",
            contents=f"{PROMPT_FOR_READING}: {chunk}",
            config=types.GenerateContentConfig(
                response_modalities=["AUDIO"],
                speech_config=types.SpeechConfig(
                    voice_config=types.VoiceConfig(
                        prebuilt_voice_config=types.PrebuiltVoiceConfig(
                            voice_name="Iapetus",
                        )
                    )
                ),
            ),
        )
        all_pcm_chunks.append(response.candidates[0].content.parts[0].inline_data.data)
        chapter_titles.append(chapter_title)
        time.sleep(120)  # Sleep to avoid hitting rate limits
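
Each response carries raw PCM bytes rather than a ready-made audio file, so the duration of a chunk follows directly from its size. The sketch below assumes the same audio parameters as the defaults of `save_audiobook()` (24,000 Hz, 2 bytes per sample, one channel), which makes one second of audio 48,000 bytes. The helper name `pcm_duration_ms` is hypothetical.

```python
def pcm_duration_ms(
    pcm: bytes, rate: int = 24000, sample_width: int = 2, channels: int = 1
) -> float:
    """Duration of raw PCM audio in milliseconds."""
    bytes_per_second = rate * sample_width * channels
    return len(pcm) / bytes_per_second * 1000


# 48,000 zero bytes stand in for one second of silence.
print(pcm_duration_ms(bytes(48000)))  # 1000.0
```

Summing this value over `all_pcm_chunks` gives a quick check of the total audiobook length before anything is exported.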

In [None]:
AUDIOBOOK_FILE = "<OUTPUT FILE>.m4b"
save_audiobook(AUDIOBOOK_FILE, all_pcm_chunks, chapter_titles)
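
The printed `ffmpeg` command wraps file names in single quotes, which can break if a name itself contains an apostrophe. One way around this, sketched here with a hypothetical helper `build_mux_command`, is to build the same command as an argument list and let `shlex.join` handle the quoting. The file names in the example are made up.

```python
import shlex
from pathlib import Path
from typing import List


def build_mux_command(audio_path: str, chapters_txt: str) -> List[str]:
    """Build the ffmpeg argument list that copies audio and attaches chapters."""
    source = Path(audio_path)
    output = source.with_name(f"{source.stem}_with_chapters.m4b")
    return [
        "ffmpeg",
        "-i", audio_path,
        "-f", "ffmetadata",
        "-i", chapters_txt,
        "-map_metadata", "1",
        "-codec", "copy",
        str(output),
    ]


command = build_mux_command("winter issue.m4b", "winter issue_chapters.txt")
print(shlex.join(command))
```

The same list can be passed to `subprocess.run(command, check=True)` to run the step from the notebook, assuming `ffmpeg` is installed and on the `PATH`.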