<a href="https://colab.research.google.com/github/03sarath/google-ai-studio-text-gen/blob/main/Audio_capabilities.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Copyright 2025 Psitron Technologies Pvt Ltd



# **Explore audio capabilities with the Gemini API**
Gemini can respond to prompts about audio. For example, Gemini can:

*   Describe, summarize, or answer questions about audio content.
*   Provide a transcription of the audio.
*   Provide answers or a transcription about a specific segment of the audio.

**Note: You can't generate audio output with the Gemini API.**

This guide demonstrates different ways to interact with audio files and audio content using the Gemini API.

# Supported audio formats
Gemini supports the following audio format MIME types:

*   WAV - audio/wav
*   MP3 - audio/mp3
*   AIFF - audio/aiff
*   AAC - audio/aac
*   OGG Vorbis - audio/ogg
*   FLAC - audio/flac
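
When audio is passed inline, the MIME type has to be supplied explicitly. As a minimal sketch, the list above can be turned into a lookup table keyed by file extension. The helper name `guess_audio_mime_type` and the extension spellings used as keys (for example `.aif`) are assumptions for illustration and are not part of the Gemini API.

```python
from pathlib import Path

# MIME types copied from the list of supported formats above.
# The file extensions used as keys are an assumption for illustration.
SUPPORTED_AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aiff": "audio/aiff",
    ".aif": "audio/aiff",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}


def guess_audio_mime_type(path):
    """Return the MIME type for a supported audio file, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    try:
        return SUPPORTED_AUDIO_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported audio extension: {suffix!r}") from None


print(guess_audio_mime_type("/content/mlops_masterclass.mp3"))  # audio/mp3
```

Failing early on an unknown extension keeps an unsupported file from being sent with a guessed MIME type.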

# Technical details about audio
Gemini imposes the following rules on audio:

*   Gemini represents each second of audio as 25 tokens; for example, one minute of audio is represented as 1,500 tokens.
*   Gemini can only infer responses to English-language speech.
*   Gemini can "understand" non-speech components, such as birdsong or sirens.
*   The maximum supported length of audio data in a single prompt is 9.5 hours. Gemini doesn't limit the number of audio files in a single prompt; however, the total combined length of all audio files in a single prompt cannot exceed 9.5 hours.
*   Gemini downsamples audio files to a 16 Kbps data resolution.
*   If the audio source contains multiple channels, Gemini combines those channels down to a single channel.
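
The token rate and the length limit above lend themselves to a quick back-of-the-envelope check. The sketch below assumes the figures stated in this list (25 tokens per second, 9.5 hours per prompt); the function names are hypothetical, and the count actually reported by the API for a given model may differ from this estimate.

```python
TOKENS_PER_SECOND = 25          # rate stated in the list above
MAX_AUDIO_SECONDS = 9.5 * 3600  # 9.5-hour limit per prompt, in seconds


def estimate_audio_tokens(duration_seconds, tokens_per_second=TOKENS_PER_SECOND):
    """Estimate the token count for audio of the given length."""
    return round(duration_seconds * tokens_per_second)


def fits_in_one_prompt(durations_seconds):
    """Check whether the combined audio length stays within the limit."""
    return sum(durations_seconds) <= MAX_AUDIO_SECONDS


print(estimate_audio_tokens(60))                 # 1500
print(estimate_audio_tokens(MAX_AUDIO_SECONDS))  # 855000
print(fits_in_one_prompt([3600] * 9))            # True: 9 hours in total
print(fits_in_one_prompt([3600] * 10))           # False: 10 hours in total
```

The rate is a parameter rather than a hard-coded value so the estimate can be adjusted if a model uses a different tokens-per-second figure.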








# Before you begin: Set up your project and API key
**Install the Gemini API library**

Using Python 3.9+, install the google-genai package using the following pip command:

In [1]:
pip install -q -U google-genai

# Get and secure your API key
You need an API key to call the Gemini API. If you don't already have one, create a key in Google AI Studio.

Using Colab Secrets, you can ensure that your Gemini API key is managed securely and persistently in your Google Colab notebooks.

In [3]:
from google import genai
from google.colab import userdata  # Import for Colab Secrets

# Retrieve the API key from Colab Secrets
try:
    api_key = userdata.get('GOOGLE_API_KEY')
    print("API key successfully retrieved from Colab Secrets.")  # Confirmation
except userdata.SecretNotFoundError:
    # Raised by Colab when no secret with this name exists
    raise ValueError(
        "GOOGLE_API_KEY not found in Colab Secrets. Please configure your API key in Colab Secrets."
    )

API key successfully retrieved from Colab Secrets.


# Upload an audio file and generate content
You can use the File API to upload an audio file of any size. Always use the File API when the total request size (including the files, text prompt, system instructions, etc.) is larger than 20 MB.

Call client.files.upload to upload a file using the File API. The following code uploads an audio file and then uses the file in a call to client.models.generate_content.

(Note: To run this code in Google Colab, first upload the audio file to the Colab Files folder. The file is available only during the runtime in which it was uploaded and is deleted when that runtime is terminated.)

In [4]:
#Upload an audio file and generate content
from google import genai

client = genai.Client(api_key=api_key)

myfile = client.files.upload(file='/content/mlops_masterclass.mp3')

response = client.models.generate_content(
  model='gemini-2.0-flash',
  contents=['Describe this audio clip', myfile]
)

print(response.text)

This audio clip features Sharath Kumar from Cytron Technologies, advertising a two-day, complete hands-on masterclass on MLOps (Machine Learning Operations), scheduled for March 11th and 12th. The class covers how to build, train, deploy, and monitor machine learning models in a production-ready MLOps pipeline. He emphasizes the live, hands-on nature of the projects, where participants replicate what they see on the instructor's screen. The class includes two real-time projects, focusing on AWS, specifically SageMaker, on the first day to understand machine learning, and then incorporating a complete MLOps pipeline on the second day to deploy in a real-time production environment. This automated ML pipeline includes automated data preparation, retraining, model deployment, and further support through a community group where interview questions, resume preparation, and optimization are discussed. A discount is being offered.


# Get metadata for a file
You can verify the API successfully stored the uploaded file and get its metadata by calling files.get.

In [5]:
#Get metadata for a file
myfile = client.files.upload(file='/content/mlops_masterclass.mp3')
file_name = myfile.name
myfile = client.files.get(name=file_name)
print(myfile)

name='files/e83d2apyd03v' display_name=None mime_type='audio/mpeg' size_bytes=1246316 create_time=datetime.datetime(2025, 5, 5, 14, 10, 28, 439923, tzinfo=TzInfo(UTC)) expiration_time=datetime.datetime(2025, 5, 7, 14, 10, 28, 404168, tzinfo=TzInfo(UTC)) update_time=datetime.datetime(2025, 5, 5, 14, 10, 28, 439923, tzinfo=TzInfo(UTC)) sha256_hash='YTk3ZGQzMTRmNDZlNzgyNGU5ZWEzNDcyY2UxYmZkMzViMmE4N2U1MzU5ZWRjZTM5ODQ1NDhiZDY5ZjE0NzQwYQ==' uri='https://generativelanguage.googleapis.com/v1beta/files/e83d2apyd03v' download_uri=None state=<FileState.ACTIVE: 'ACTIVE'> source=<FileSource.UPLOADED: 'UPLOADED'> video_metadata=None error=None
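
The metadata includes both a creation time and an expiration time. As a small worked example, the sketch below computes the gap between the two timestamps printed in the sample output above. The helper name `hours_until_expiration` is hypothetical, and the result describes this sample run only.

```python
from datetime import datetime, timezone

# Timestamps copied from the sample output above.
create_time = datetime(2025, 5, 5, 14, 10, 28, 439923, tzinfo=timezone.utc)
expiration_time = datetime(2025, 5, 7, 14, 10, 28, 404168, tzinfo=timezone.utc)


def hours_until_expiration(expires_at, now):
    """Return the number of hours between `now` and the expiration time."""
    return (expires_at - now).total_seconds() / 3600


retention_hours = hours_until_expiration(expiration_time, create_time)
print(round(retention_hours))  # 48
```

With a file object returned by files.get, the same helper could be called with myfile.expiration_time and datetime.now(timezone.utc) to see how long the upload remains available.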


# Provide the audio file as inline data in the request
Instead of uploading an audio file, you can pass audio data in the same call that contains the prompt.

Note the following about providing audio as inline data:

*   The maximum request size is 20 MB, which includes text prompts, system instructions, and files provided inline. If your file's size would make the total request size exceed 20 MB, use the File API to upload the file instead.
*   If you're using an audio sample multiple times, it is more efficient to use the File API.

Read the audio file as bytes, then pass it along with the prompt to Gemini:

In [7]:
#Provide the audio file as inline data in the request
from google.genai import types

# Read the raw audio bytes from disk
with open('/content/mlops_masterclass.mp3', 'rb') as f:
    audio_bytes = f.read()

response = client.models.generate_content(
  model='gemini-2.0-flash',
  contents=[
    'Describe this audio clip',
    types.Part.from_bytes(
      data=audio_bytes,
      mime_type='audio/mp3',
    )
  ]
)

print(response.text)

The audio clip features Sharad Kumar from Cytron Technologies promoting an upcoming two-day hands-on masterclass on Machine Learning Operations (MLOps). The masterclass, scheduled for March 11th and 12th, will teach participants how to build, train, deploy, and monitor machine learning models in a production-ready MLOps pipeline.  He emphasizes that it's a live, hands-on experience where attendees will replicate projects on their own screens.  The masterclass will include building two real-time projects.  On day one, AWS (especially SageMaker) will be used to understand how to build, train, and deploy machine learning models.  Day two focuses on incorporating the built model into a complete MLOps pipeline and deploying it in a real-time production environment, automating processes like data preparation, retraining, and model deployment. He mentions an affordable price with a 50% discount.  In addition to the masterclass itself, participants will receive access to recordings, slides, an
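
The choice between inline data and the File API depends on the 20 MB request limit described above. A minimal sketch of that check follows. The helper name `should_use_file_api`, the `overhead_bytes` parameter, and the interpretation of 20 MB as 20 * 1024 * 1024 bytes are assumptions; the sketch compares raw sizes only and leaves any encoding overhead to the caller.

```python
import os
import tempfile

# 20 MB request limit from the text; the exact byte count is an assumption.
MAX_INLINE_REQUEST_BYTES = 20 * 1024 * 1024


def should_use_file_api(audio_path, prompt, overhead_bytes=0):
    """Return True when the request would exceed the inline size limit."""
    request_bytes = (
        os.path.getsize(audio_path)
        + len(prompt.encode("utf-8"))
        + overhead_bytes  # e.g. system instructions or other inline files
    )
    return request_bytes > MAX_INLINE_REQUEST_BYTES


# Demo with a 1 KB placeholder file (not real audio).
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
    tmp.write(b"\0" * 1024)

print(should_use_file_api(tmp.name, "Describe this audio clip"))  # False
os.remove(tmp.name)
```

For an audio sample that is reused across several prompts, the File API remains the more efficient option regardless of size.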

# Get a transcript of the audio file
To get a transcript, just ask for it in the prompt. For example:

In [8]:
#To get a transcript, just ask for it in the prompt
myfile = client.files.upload(file='/content/mlops_masterclass.mp3')
prompt = 'Generate a transcript of the speech.'

response = client.models.generate_content(
  model='gemini-2.0-flash',
  contents=[prompt, myfile]
)

print(response.text)

Hello this is sharad kumar from syton technologies I hope you are having a great day thanks for considering us and here is a quick audio message where I am going to provide a quick brief about our upcoming master class on machine learning operations so this is a two days complete hands on master class on mlops which is scheduled for 11th and 12th of march so where you going to learn how to build train deploy and monitor your machine learning models in a production ready mlops pipelines okay so this is a two days master class it's a complete live hands on master class so why do I specify this word live hands on so each and every project the hands ons that we go that we are going to do in this master class it's going to be live so whatever the projects that I will be replicating on my screen so literally you will be replicating on your screen that's how it works so that's the difference okay so in this particular master class we're going to build two real-time projects and on a day one w

# Get the duration of the audio file
To get the duration, just ask for it in the prompt. For example:

In [9]:
# Create a prompt to analyze the duration of the file.
myfile = client.files.upload(file='/content/mlops_masterclass.mp3')
prompt = "analyze and tell me how many minutes is the audio file"

response = client.models.generate_content(
  model='gemini-2.0-flash',
  contents=[prompt, myfile]
)

print(response.text)

The audio file is approximately 2 minutes and 18 seconds long.


# Refer to timestamps in the audio file
A prompt can specify timestamps of the form MM:SS to refer to particular sections in an audio file. For example, the following prompt requests a transcript that:

*   Starts from the beginning of the file.
*   Ends at 1 minute 9 seconds from the beginning of the file.

In [11]:
# Create a prompt containing timestamps.
myfile = client.files.upload(file='/content/mlops_masterclass.mp3')
prompt = 'Generate a transcript of the audio from the beginning of the audio file to 1 minute 9 seconds of the file.'

response = client.models.generate_content(
  model='gemini-2.0-flash',
  contents=[prompt, myfile]
)

print(response.text)

Hello, this is Sharath Kumar from Cyton Technologies. I hope you're having a great day. Thanks for considering us and here is a quick audio message where I'm going to provide a quick brief about our upcoming Masterclass on Machine Learning Operations. So, this is a two days complete hands on master class on ML-Ops, which is scheduled for 11th and 12th of March. So, we are going to learn how to build, train, deploy, and monitor your machine learning models in a production ready ML-Ops pipelines. Okay, so this is a two days master class, it's a complete live hands on master class. So, why do I specify this word live hands on. So, each and every projects the hands on that we go that we going to do in this master class it's going to be live. So, whatever the projects that I will be replicating on my screen, so literally you will be replicating on your screen, that's how it works. So, that's a difference. Okay, so in this particular master class, we're going to build two real time projects 
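
Timestamped prompts like the one above can also be built programmatically. The sketch below formats second offsets as MM:SS and assembles a transcript request for a segment. The function names `to_mmss` and `transcript_prompt` and the prompt wording are assumptions for illustration, not a format required by the Gemini API.

```python
def to_mmss(total_seconds):
    """Format a number of seconds as an MM:SS timestamp."""
    minutes, seconds = divmod(int(total_seconds), 60)
    return f"{minutes:02d}:{seconds:02d}"


def transcript_prompt(start_seconds, end_seconds):
    """Build a prompt asking for a transcript of one segment of the audio."""
    if not 0 <= start_seconds < end_seconds:
        raise ValueError("expected 0 <= start_seconds < end_seconds")
    return (
        "Generate a transcript of the audio from "
        f"{to_mmss(start_seconds)} to {to_mmss(end_seconds)}."
    )


print(transcript_prompt(0, 69))
# Generate a transcript of the audio from 00:00 to 01:09.
```

The resulting string can be passed as the prompt in the contents list in the same way as the hand-written prompt.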

# Count tokens
Call the client.models.count_tokens method to get a count of the number of tokens in the audio file. For example:

In [12]:
#Call the countTokens method to get a count of the number of tokens in the audio file.
myfile = client.files.upload(file='/content/mlops_masterclass.mp3')
response = client.models.count_tokens(
  model='gemini-2.0-flash',
  contents=[myfile]
)

print(response)

total_tokens=4986 cached_content_token_count=None
