# Exploring Text To Speech with Gemini 2.5
At Google I/O 2025, Google announced the latest iteration of its Text to Speech (TTS) capability, built on the Gemini 2.5 model. After playing around with it in Google AI Studio, I decided to explore it further in a notebook, using the Gemini API with longer texts. During the initial experiments, two features of the new model stood out. First, across diverse genres of text spanning fiction and non-fiction, the model generated human-like speech that was easy to understand and paused in the right places. Second, the manner in which the text should be read can now be specified with plain-text prompts, which makes it much easier to give the model additional context.

To get started, we create a Gemini API key and save it to `secrets.env`. We also need to install the `google-genai` package using the `uv` package manager, along with `python-dotenv` and `pydub`, which the imports below rely on.

In [None]:
from dotenv import load_dotenv
from typing import Union, List
import time
import os

from google import genai
from google.genai import types
from pydub import AudioSegment

In [None]:
load_dotenv("../secrets.env")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")

Next, we define a function that writes the audio returned by Gemini to an MP3 file. Note that `pydub` relies on `ffmpeg` for MP3 export, so it must be installed on the system.

In [None]:
def save_mp3_file(
    filename: str,
    pcm: Union[List[bytes], bytes],
    channels: int = 1,
    rate: int = 24000,
    sample_width: int = 2,
):
    """
    Save raw PCM audio data to an MP3 file.

    If pcm is a list of bytes, concatenate all parts before saving.

    Args:
        filename (str): The name of the output MP3 file.
        pcm (bytes or list of bytes): The raw PCM audio data to write.
        channels (int, optional): Number of audio channels. Defaults to 1 (mono).
        rate (int, optional): Sample rate in Hz. Defaults to 24000.
        sample_width (int, optional): Sample width in bytes. Defaults to 2 (16-bit audio).
    """
    if isinstance(pcm, list):
        pcm = b"".join(pcm)
    audio = AudioSegment(
        data=pcm,
        sample_width=sample_width,
        frame_rate=rate,
        channels=channels
    )
    audio.export(filename, format="mp3")
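
If `ffmpeg` is not available, one alternative is to write the raw PCM data to a WAV file with Python's standard `wave` module. The following is a minimal sketch, not part of the original notebook: `save_wav_file` is a hypothetical helper, and it assumes the same audio format as the defaults above (24 kHz, 16-bit, mono).

```python
import wave


def save_wav_file(filename, pcm, channels=1, rate=24000, sample_width=2):
    """Hypothetical helper: write raw PCM bytes (or a list of them) to a WAV file."""
    # Concatenate the parts if several PCM snippets were passed in
    if isinstance(pcm, list):
        pcm = b"".join(pcm)
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(channels)      # 1 = mono
        wf.setsampwidth(sample_width)  # 2 bytes = 16-bit samples
        wf.setframerate(rate)          # samples per second
        wf.writeframes(pcm)
```

WAV files are uncompressed, so they will be considerably larger than the equivalent MP3.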

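We now define a simple function that estimates the cost of generating audio for a given text. The estimate is rough: it assumes about four characters per input token and a speaking rate of about two words per second.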

In [None]:
# Constants for Gemini 2.5 Flash TTS API pricing and token/audio calculations
INPUT_TOKEN_CHARS = 4  # Average number of characters per input token
INPUT_COST_PER_MILLION = 0.5  # USD per 1M input tokens
OUTPUT_TOKENS_PER_SECOND = 32  # Output tokens per second of audio
WORDS_PER_SECOND = 2  # Average spoken words per second
OUTPUT_COST_PER_MILLION = 10  # USD per 1M output tokens


def cost_estimator(text: str) -> dict:
    """
    Estimate the input, output, and total costs for using the Gemini 2.5 Flash
    TTS API.

    Args:
        text (str): The input text to be converted to speech.

    Returns:
        dict: A dictionary with numeric values for 'input', 'output', and
        'total' costs.
            - input: Estimated cost for input tokens (USD).
            - output: Estimated cost for output tokens (USD).
            - total: Sum of input and output costs (USD).
    """
    # Input cost calculation
    num_chars = len(text)
    num_input_tokens = num_chars / INPUT_TOKEN_CHARS
    input_cost = (num_input_tokens / 1_000_000) * INPUT_COST_PER_MILLION

    # Output cost calculation
    num_words = len(text.split())
    audio_seconds = num_words / WORDS_PER_SECOND
    num_output_tokens = audio_seconds * OUTPUT_TOKENS_PER_SECOND
    output_cost = (num_output_tokens / 1_000_000) * OUTPUT_COST_PER_MILLION

    total_cost = input_cost + output_cost

    return {"input": input_cost, "output": output_cost, "total": total_cost}
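
To make the arithmetic concrete, here is a worked example for a hypothetical text of 4,000 characters and 600 words. It repeats the constants from the cell above so that it runs on its own; the text size is an assumption chosen purely for illustration.

```python
# Same assumed constants as in the cost estimator above
INPUT_TOKEN_CHARS = 4
INPUT_COST_PER_MILLION = 0.5
OUTPUT_TOKENS_PER_SECOND = 32
WORDS_PER_SECOND = 2
OUTPUT_COST_PER_MILLION = 10

# Hypothetical text size, for illustration only
num_chars, num_words = 4_000, 600

# Input side: 4000 chars / 4 chars per token = 1000 tokens
input_tokens = num_chars / INPUT_TOKEN_CHARS
input_cost = input_tokens / 1_000_000 * INPUT_COST_PER_MILLION

# Output side: 600 words / 2 words per second = 300 s of audio = 9600 tokens
audio_seconds = num_words / WORDS_PER_SECOND
output_tokens = audio_seconds * OUTPUT_TOKENS_PER_SECOND
output_cost = output_tokens / 1_000_000 * OUTPUT_COST_PER_MILLION

total_cost = input_cost + output_cost
print(f"Input: ${input_cost:.4f}, Output: ${output_cost:.4f}, Total: ${total_cost:.4f}")
```

In this example the output (audio) tokens account for nearly all of the estimated cost, so the assumed speaking rate matters far more than the characters-per-token approximation.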

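Since the Gemini TTS API has a limit of 8000 input tokens, we define a function to split the input text into chunks of at most 7000 tokens, again approximating a token as four characters. Note that the split is purely by character count, so a chunk boundary can fall in the middle of a sentence or word.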

In [None]:
def split_text_to_chunks(
    text: str, max_tokens: int = 7000, token_chars: int = 4
) -> List[str]:
    """
    Split the input text into chunks, each with a maximum number of tokens.

    Args:
        text (str): The input text to split.
        max_tokens (int): Maximum number of tokens per chunk (default: 7000).
        token_chars (int): Number of characters per token (default: 4).

    Returns:
        List[str]: A list of text chunks, each ending with two newlines.
    """
    max_chars = max_tokens * token_chars
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunk = text[start:end].rstrip() + "\n\n"
        chunks.append(chunk)
        start = end
    return chunks
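
If mid-word cuts turn out to be audible, one option is to split on paragraph boundaries instead. The sketch below is a hypothetical variant, not the function used in the rest of this notebook: `split_text_on_paragraphs` assumes paragraphs are separated by blank lines, and it falls back to fixed-size slicing only for a paragraph that is itself too long. Unlike the function above, it does not append trailing newlines to each chunk.

```python
from typing import List


def split_text_on_paragraphs(
    text: str, max_tokens: int = 7000, token_chars: int = 4
) -> List[str]:
    """Hypothetical variant: pack whole paragraphs into size-limited chunks."""
    max_chars = max_tokens * token_chars
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # A single oversized paragraph falls back to fixed-size slicing
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        # Try to add the paragraph to the chunk being built
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Chunk is full: close it and start a new one with this paragraph
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

The character budget is the same approximation as before, so the chunks are only estimated to stay under the token limit.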

We now read the text to be converted to speech and split it into chunks.

In [None]:
TEXT_FILE_PATH = "<FILE>.md"
with open(TEXT_FILE_PATH, "r", encoding="utf-8") as f:
    content = f.read()

In [None]:
content_by_chunks = split_text_to_chunks(content)
len(content_by_chunks)

We now define the prompt that tells Gemini how to read the text, along with any aspects of the text that need particular care.

In [None]:
# No trailing period: a colon is appended when the request is built
PROMPT_FOR_READING = "Read in an even tone with a North London accent"

We now create a Gemini `client` that will allow us to interact with the API.

In [None]:
client = genai.Client(api_key=GEMINI_API_KEY)

Now to the exciting part! We generate speech for our input text, one chunk at a time. We add a 2-minute sleep between API calls to avoid hitting [rate limits](https://ai.google.dev/gemini-api/docs/rate-limits). We also print the estimated costs at the beginning to get an indication of how much the run will cost.

In [None]:
estimated_costs = cost_estimator(content)
print(f"Estimated costs for TTS:\nInput: ${estimated_costs['input']:.4f}, "
      f"Response: ${estimated_costs['output']:.4f}"
      f", Total: ${estimated_costs['total']:.4f}")

tts_responses = []
for idx, chunk in enumerate(content_by_chunks):
    print(f"Generating TTS for chunk: {idx}")
    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-tts",
        contents=f"{PROMPT_FOR_READING}: {chunk}",
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name="Algieba",
                    )
                )
            ),
        ),
    )

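    # The raw PCM audio is returned as inline data in the first part of the response
    tts_responses.append(response.candidates[0].content.parts[0].inline_data.data)

    # Sleep between calls to avoid hitting rate limits (not needed after the last chunk)
    if idx < len(content_by_chunks) - 1:
        time.sleep(120)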
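
A fixed sleep reduces the chance of hitting a rate limit but does not handle a call that fails anyway. One possible complement is a small retry wrapper with exponential backoff. The sketch below is an assumption on my part and not part of the original notebook: `call_with_retries` is a hypothetical helper, and it catches the generic `Exception` because the specific exception raised by the SDK on a rate-limit error is not assumed here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Hypothetical helper: call fn, retrying with exponential backoff on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Give up and re-raise once the last attempt has failed
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

In the loop above, the API call could then be wrapped in a `lambda` and passed to this helper, so that a transient failure on one chunk does not discard the audio already generated for earlier chunks.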

Finally, we concatenate the generated audio snippets and save them to a `.mp3` file.

In [None]:
file_name = "<FILE>.mp3"  # Name of the output file
save_mp3_file(file_name, tts_responses)  # Saves the file to current directory
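
As a quick sanity check before listening, the duration of the generated audio can be computed from the raw PCM byte count. This is a minimal sketch with a hypothetical helper, `pcm_duration_seconds`, and it assumes the same format as the defaults of `save_mp3_file` (24 kHz, 16-bit, mono).

```python
def pcm_duration_seconds(pcm, channels=1, rate=24000, sample_width=2):
    """Hypothetical helper: duration in seconds of raw PCM bytes (or a list of them)."""
    if isinstance(pcm, list):
        pcm = b"".join(pcm)
    # Each second of audio takes rate * sample_width * channels bytes
    bytes_per_second = rate * sample_width * channels
    return len(pcm) / bytes_per_second
```

Calling `pcm_duration_seconds(tts_responses)` and comparing the result with the duration implied by the cost estimate (words divided by two) gives a rough indication of whether the assumed speaking rate was reasonable.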