# Audio Transcription with Gemini

This notebook shows how to use Google's Gemini model for automatic speech recognition using the `google-genai` SDK.

## Import Libraries and Setup Client

Make sure to have your `.env` file ready:

```bash
cp env.example .env
```
Then open `.env` in a text editor and set your `GEMINI_API_KEY`.

In [None]:
import os
from dotenv import load_dotenv
from google import genai
from google.genai import types
from IPython.display import Audio, display
import textwrap

# Load environment variables from .env file
assert load_dotenv(), "Could not find credentials from .env file"
assert os.getenv(
    "GEMINI_API_KEY"
), "GEMINI_API_KEY not found. Make sure it is set in the .env file"

# Only needed on the Udacity workspace. Comment this out if running on another system.
os.environ['HF_HOME'] = '/voc/data/huggingface'
os.environ['OLLAMA_MODELS'] = '/voc/data/ollama/cache'
os.environ['HF_HUB_OFFLINE'] = '1'
os.environ['PATH'] = f"/voc/data/ollama/bin:/voc/data/ffmpeg/bin:{os.environ.get('PATH', '')}"
os.environ['LD_LIBRARY_PATH'] = f"/voc/data/ollama/lib:/voc/data/ffmpeg/lib:{os.environ.get('LD_LIBRARY_PATH', '')}"

# Initialize the Gemini client
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

print("Gemini client initialized")

Gemini client initialized


## Simple Audio Transcription

Let's start with basic transcription using a publicly available audio file:

In [2]:
from datasets import load_dataset

# Load a sample audio file from the LibriSpeech dataset
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
audio_sample = dataset[0]["audio"]

# Extract the audio array and sample rate
audio_array = audio_sample["array"]
sampling_rate = audio_sample["sampling_rate"]

All we have to do is encode the audio as WAV bytes, ask the model to transcribe it, and include the bytes in the request:

In [None]:
from io import BytesIO
import soundfile as sf


buffer = BytesIO()
sf.write(buffer, audio_array, sampling_rate, format="WAV")
buffer.seek(0)
audio_bytes = buffer.read()

response = client.models.generate_content(
    # NOTE: any other model of the Gemini family would work the same way
    model="gemini-2.5-flash-lite",
    contents=[
        "Please transcribe this audio file.",
        types.Part.from_bytes(
            data=audio_bytes,
            mime_type="audio/wav",
        ),
    ],
)

# NOTE: for longer files, you will need to save the sample,
# upload it to Google, then pass the uploaded file in the prompt.
# Please note that this is not supported in the Udacity environment:
#
#     import audiofile as af
#
#     af.write("sample.ogg", audio_array, sampling_rate)
#     audio_file = client.files.upload(file="./sample.ogg")
#
#     response = client.models.generate_content(
#         model="gemini-2.5-flash-lite",
#         contents=[
#             "Please transcribe this audio file.",
#             audio_file,
#         ],
#     )

display(Audio(audio_bytes))
print("Transcription:")
print(textwrap.fill(response.text, width=80))

Transcription:
Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his
gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells
us that at this festive season of the year, with Christmas and roast beef
looming before us, similes drawn from eating and its results occur most readily
to the mind. He has grave doubts whether Sir Frederick Leighton's work is really
Greek after all, and can discover in it but little of rocky Ithaca. Linell's
pictures are a sort of Up guards and Adam paintings, and Mason's exquisite
idylls are as national as a jingo poem. Mr. Birkett Foster's landscapes smile at
one much in the same way that Mr. Carker used to flash his teeth. And Mr. John
Collier gives his sitter a tearful slap on the back, before he says, like a
shampooer in a Turkish bath, Next man.
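
The note in the cell above says that longer files should be uploaded rather than sent inline. A minimal sketch of that decision is shown here. The helper name `should_upload` is hypothetical, and the 20 MB default is an assumption based on the total request size figure the Gemini documentation has quoted for inline data, so check the current documentation before relying on it.

```python
# Assumed inline request limit; verify against the current Gemini API docs.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def should_upload(audio_bytes: bytes, prompt: str, limit: int = INLINE_LIMIT_BYTES) -> bool:
    """Return True when audio plus prompt exceed the assumed inline limit."""
    # Count the raw audio payload and the UTF-8 encoded prompt text.
    total_size = len(audio_bytes) + len(prompt.encode("utf-8"))
    return total_size > limit


# Usage: a tiny payload can be sent inline.
print(should_upload(b"\x00" * 1024, "Please transcribe this audio file."))
```

When the function returns `True`, the upload path from the commented-out code in the cell above is the one to use.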


## Transcription with Timestamps

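To get timestamps, we ask for them in the prompt and spell out the exact format, so that the response can be parsed with a regular expression. The model may not follow the requested format exactly, so check the returned values against the audio.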

In [None]:
import re


# Transcribe with timestamps
response = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents=[
        """
        Transcribe this audio with timestamps in the format [HH:MM:SS] text, 
        where HH is the hour, MM is the minute, and SS is the second.
        """,
        types.Part.from_bytes(
            data=audio_bytes,
            mime_type="audio/wav",
        ),
    ],
)

print("Transcription with timestamps:")
display(Audio(audio_array, rate=sampling_rate))

# Split the response into {timestamp: text} pairs
time_stamps = {
    match.group(1): match.group(2).strip()
    for match in re.finditer(
        # This regular expression matches the timestamps and the following text
        r"\[(\d{2}:\d{2}:\d{2})\]\s*([^[]*?)(?=\[\d{2}:\d{2}:\d{2}\]|$)", response.text
    )
}
for timestamp, text in time_stamps.items():
    print(f"[{timestamp}] {textwrap.fill(text, width=80)}")

Transcription with timestamps:


[00:00:59] Mr. Quilter is the apostle of the middle classes and we're glad to welcome his
gospel.
[00:06:00] Nor is Mr. Quilter's manner less interesting than his matter.
[00:11:00] He tells us that at this festive season of the year, with Christmas and roast
beef looming before us, similes drawn from eating and its results occur most
readily to the mind.
[00:23:00] He has grave doubts whether Sir Frederick Leighton's work is really Greek after
all, and can discover in it but little of rocky Ithaca.
[00:33:00] Lannel's pictures are a sort of upguards and Adam paintings, and Mason's
exquisite idylls are as national as a jingo poem.
[00:44:00] Mr. Birket Foster's landscapes smile at one much in the same way that Mr.
Carruthers used to flash his teeth.
[00:52:00] And Mr. John Collier gives his sitter a cheerful slap on the back before he
says, like a shampooer in a Turkish bath, next man.
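
The dictionary built in the cell above keeps timestamps as strings and would silently overwrite an entry if the model repeated a timestamp. As a sketch of an alternative, the hypothetical helper `parse_segments` below returns an ordered list of `(start_seconds, text)` pairs instead. It assumes the response follows the `[HH:MM:SS] text` format requested in the prompt; text containing a literal `[` would be cut short.

```python
import re

# Matches "[HH:MM:SS]" followed by text up to the next "[" or the end of the string.
TIMESTAMP_RE = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\]\s*([^\[]*)")


def parse_segments(transcript: str) -> list[tuple[int, str]]:
    """Return (start_seconds, text) pairs in the order they appear."""
    segments = []
    for match in TIMESTAMP_RE.finditer(transcript):
        hours, minutes, seconds = (int(match.group(i)) for i in (1, 2, 3))
        # Collapse line breaks and repeated spaces inside a segment.
        text = " ".join(match.group(4).split())
        segments.append((hours * 3600 + minutes * 60 + seconds, text))
    return segments


# Made-up sample text, not model output.
sample = "[00:00:00] Hello there.\n[00:00:04] General\nKenobi.\n[00:01:10] Bye."
print(parse_segments(sample))
```

Working in seconds makes it straightforward to sort segments, compute durations, or seek within the audio.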


## Translation

Gemini can also translate speech directly. Here we send an Italian recording and ask for the English translation:

In [12]:
# Translate Italian speech to English
# The expected translation is:
# It is very easy to use. Continue listening.
with open("italian.mp3", "rb") as f:
    audio_bytes = f.read()


response = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents=[
        "Translate this audio to English",
        types.Part.from_bytes(
            data=audio_bytes,
            mime_type="audio/mp3",
        )
    ]
)

display(Audio(audio_bytes))
print("Translation:")
print(textwrap.fill(response.text, width=80))

Translation:
It's very easy to use. Keep listening.
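
The cells above hard-code `mime_type` as `"audio/wav"` or `"audio/mp3"`. When reading files of different types from disk, a small lookup can pick the value from the file extension. This is a sketch: the helper name `audio_mime_type` is hypothetical, and the mapping lists MIME types that the Gemini documentation is assumed to accept for audio, so verify it against the current documentation.

```python
from pathlib import Path

# Assumed set of audio MIME types accepted by Gemini; verify before use.
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aiff": "audio/aiff",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}


def audio_mime_type(path: str) -> str:
    """Look up the MIME type for an audio file by its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in AUDIO_MIME_TYPES:
        raise ValueError(f"Unsupported audio extension: {suffix!r}")
    return AUDIO_MIME_TYPES[suffix]


print(audio_mime_type("italian.mp3"))
```

The result can be passed as the `mime_type` argument of `types.Part.from_bytes`, as in the cells above.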
