<div align="center">

  <a href="https://ultralytics.com/yolo" target="_blank">
    <img width="1024" src="https://raw.githubusercontent.com/ultralytics/assets/main/yolov8/banner-yolov8.png"></a>

  [中文](https://docs.ultralytics.com/zh/) | [한국어](https://docs.ultralytics.com/ko/) | [日本語](https://docs.ultralytics.com/ja/) | [Русский](https://docs.ultralytics.com/ru/) | [Deutsch](https://docs.ultralytics.com/de/) | [Français](https://docs.ultralytics.com/fr/) | [Español](https://docs.ultralytics.com/es/) | [Português](https://docs.ultralytics.com/pt/) | [Türkçe](https://docs.ultralytics.com/tr/) | [Tiếng Việt](https://docs.ultralytics.com/vi/) | [العربية](https://docs.ultralytics.com/ar/)

  <a href="https://github.com/ultralytics/ultralytics/actions/workflows/ci.yml"><img src="https://github.com/ultralytics/ultralytics/actions/workflows/ci.yml/badge.svg" alt="Ultralytics CI"></a>
  <a href="https://colab.research.google.com/github/ultralytics/notebooks/blob/main/notebooks/how-to-use-google-gemini-models-for-object-detection-image-captioning-and-ocr.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
  
  <a href="https://ultralytics.com/discord"><img alt="Discord" src="https://img.shields.io/discord/1089800235347353640?logo=discord&logoColor=white&label=Discord&color=blue"></a>
  <a href="https://community.ultralytics.com"><img alt="Ultralytics Forums" src="https://img.shields.io/discourse/users?server=https%3A%2F%2Fcommunity.ultralytics.com&logo=discourse&label=Forums&color=blue"></a>
  <a href="https://reddit.com/r/ultralytics"><img alt="Ultralytics Reddit" src="https://img.shields.io/reddit/subreddit-subscribers/ultralytics?style=flat&logo=reddit&logoColor=white&label=Reddit&color=blue"></a>
  
  This notebook demonstrates how to use <a href="https://ai.google.dev/gemini-api/docs/models">Google Gemini models</a>, including Gemini 2.5 Pro (released in March 2025), with Ultralytics <a href="https://github.com/ultralytics/ultralytics">YOLO</a> utilities for object detection, reasoning-based localization, image captioning, and OCR.
  
  We aim to provide resources that help you maximize the potential of the Gemini family. If you need assistance, feel free to raise an issue on <a href="https://github.com/ultralytics/ultralytics">GitHub</a> or join our <a href="https://ultralytics.com/discord">Discord</a> community for discussions and support!

</div>

# What is Google Gemini?

Google Gemini is a family of multimodal AI models designed to help you process and understand various data types, including text, images, audio, video, and code. The suite includes both Large Language Models (LLMs) and Vision-Language Models (VLMs), enabling you to build versatile AI applications across domains.

In March 2025, Google released `Gemini 2.5 Pro Experimental`, which brings enhanced reasoning capabilities, improved code generation, and stronger multimodal understanding, making it a powerful tool for vision-based workflows.

<img src="https://github.com/ultralytics/notebooks/releases/download/v0.0.0/gemini-2.5-pro-exp-benchmark.jpg" alt="Gemini 2.5 Pro Experimental Benchmarks" />

## Setup

To get started, we need to install the `ultralytics` and `google-genai` libraries. 🚀

Install `ultralytics` and its [dependencies](https://github.com/ultralytics/ultralytics/blob/main/pyproject.toml) with pip, then run `ultralytics.checks()` to verify your software and hardware.

[![PyPI - Version](https://img.shields.io/pypi/v/ultralytics?logo=pypi&logoColor=white)](https://pypi.org/project/ultralytics/) [![Downloads](https://static.pepy.tech/badge/ultralytics)](https://www.pepy.tech/projects/ultralytics) [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/ultralytics?logo=python&logoColor=gold)](https://pypi.org/project/ultralytics/)

In [1]:
!pip install -U -q google-genai ultralytics

import json

import cv2
import ultralytics
from google import genai
from google.genai import types
from PIL import Image
from ultralytics.utils.downloads import safe_download
from ultralytics.utils.plotting import Annotator, colors

ultralytics.checks()


## Inference function

Let’s configure the Gemini client to accept an image and perform tasks based on your text prompts. Find more information about [Gemini models](https://ai.google.dev/gemini-api/docs/models). To get started, generate your API key by logging into <a href="https://aistudio.google.com/">Google AI Studio</a>. 🚀

The inference function will be used throughout the notebook to perform various operations using the Gemini model.

In [6]:
# Initialize the Gemini client; replace "api_key" with your own key from Google AI Studio
client = genai.Client(api_key="api_key")


def inference(image, prompt, temp=0.5):
    """
    Run inference with a Google Gemini model on an image and a text prompt.

    Args:
        image (PIL.Image.Image): The input image, here a PIL image as returned by `read_image`.
        prompt (str): A text prompt to guide the model's response.
        temp (float, optional): Sampling temperature for response randomness. Default is 0.5.

    Returns:
        str: The text response generated by the Gemini model based on the prompt and image.
    """
    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-05-20",  # or "gemini-2.5-pro-exp-03-25"
        contents=[prompt, image],  # Provide both the text prompt and image as input
        config=types.GenerateContentConfig(
            temperature=temp,  # Controls creativity vs. determinism in output
        ),
    )

    return response.text  # Return the generated textual response
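
Hardcoding an API key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the helper below reads the key from an environment variable instead. The helper name `load_api_key` and the variable name `GEMINI_API_KEY` are assumptions made for illustration; they are not part of the original notebook.

```python
import os


def load_api_key(var_name="GEMINI_API_KEY"):
    """Return the API key stored in an environment variable (hypothetical helper)."""
    # Read the variable and strip accidental whitespace from copy-pasting
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable to your Gemini API key.")
    return key
```

With this in place, the client could be created as `genai.Client(api_key=load_api_key())`, so the key never appears in the notebook itself.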

## Download and read the image

For testing, we'll fetch sample images such as `gemini-image1.jpg` from the [Ultralytics](https://ultralytics.com/) [notebooks assets](https://github.com/ultralytics/notebooks/releases/tag/v0.0.0) and use them for tasks like object detection, image captioning, and OCR. Feel free to use any image of your choice.

In [3]:
def read_image(filename=None):
    """Download a sample image and return it as a PIL image with its width and height."""
    image_name = filename if filename is not None else "bus.jpg"  # or "zidane.jpg"

    # Download the image from the Ultralytics notebooks release assets
    safe_download(f"https://github.com/ultralytics/notebooks/releases/download/v0.0.0/{image_name}")

    # Read the image with OpenCV (BGR order) and convert it to RGB
    image = cv2.cvtColor(cv2.imread(image_name), cv2.COLOR_BGR2RGB)

    # Extract height and width
    h, w = image.shape[:2]

    # Convert the array to PIL format and return it with its dimensions
    return Image.fromarray(image), w, h

![Input image for testing gemini-2.5-pro model](https://github.com/ultralytics/notebooks/releases/download/v0.0.0/gemini-inference-image.jpg)

## Results formatting

You can use this function to clean the raw string output by removing Markdown formatting (like ```json), so it can be safely parsed as JSON for bounding box extraction and plotting. 🧼

In [4]:
def clean_results(results):
    """Clean the results for visualization."""
    return results.strip().removeprefix("```json").removesuffix("```").strip()
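
`clean_results` assumes the response is wrapped in a fence tagged `json`. As a hedged alternative, the sketch below also accepts a fence with no language tag or no fence at all, and parses the result in one step. The name `parse_fenced_json` is hypothetical and not part of the notebook or of any library.

```python
import json
import re

# Matches an optional language tag after the opening fence, then the fenced body
_FENCE = re.compile(r"^```[a-zA-Z]*[ \t]*\n(.*?)\n?[ \t]*```$", re.DOTALL)


def parse_fenced_json(raw):
    """Strip an optional Markdown code fence and parse the remaining text as JSON."""
    text = raw.strip()
    match = _FENCE.match(text)
    if match:
        text = match.group(1).strip()
    return json.loads(text)  # Raises json.JSONDecodeError if the text is not valid JSON


sample = '```json\n[{"box_2d": [100, 200, 300, 400], "label": "bus"}]\n```'
items = parse_fenced_json(sample)
```

Letting `json.loads` raise on invalid input keeps failures visible instead of silently plotting nothing.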

## Object detection

Gemini models support object detection, helping you efficiently identify and recognize multiple objects within an image. 😀

In [None]:
# Define the text prompt
prompt = """
Detect the 2d bounding boxes of objects in image.
"""

# Fixed output instruction; the plotting code below depends on this format.
output_prompt = "Return just box_2d and labels, no additional text."

image, w, h = read_image("gemini-image1.jpg")  # Read the image and extract its width and height

results = inference(image, prompt + output_prompt)  # Perform inference

cln_results = json.loads(clean_results(results))  # Clean the results and parse them into a list

annotator = Annotator(image)  # Initialize the Ultralytics annotator

for idx, item in enumerate(cln_results):
    # By default, Gemini models return boxes with the y coordinates first: [y1, x1, y2, x2].
    # Scale the normalized box coordinates (0–1000) to the image dimensions.
    y1, x1, y2, x2 = item["box_2d"]
    y1 = y1 / 1000 * h
    x1 = x1 / 1000 * w
    y2 = y2 / 1000 * h
    x2 = x2 / 1000 * w

    if x1 > x2:
        x1, x2 = x2, x1  # Swap x-coordinates if needed
    if y1 > y2:
        y1, y2 = y2, y1  # Swap y-coordinates if needed

    annotator.box_label([x1, y1, x2, y2], label=item["label"], color=colors(idx, True))

Image.fromarray(annotator.result())  # display the output

![Object detection with gemini-2.5-pro model](https://github.com/ultralytics/notebooks/releases/download/v0.0.0/gemini-inference-image-processed.jpg)
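
The coordinate handling in the loop above (reorder from `[y1, x1, y2, x2]`, scale from the 0–1000 range to pixels, and swap inverted corners) can be collected into one function. This is a minimal sketch; the name `scale_box` is hypothetical and is not part of Ultralytics or the Gemini SDK.

```python
def scale_box(box_2d, width, height):
    """Convert a [y1, x1, y2, x2] box on a 0-1000 scale to pixel [x1, y1, x2, y2]."""
    y1, x1, y2, x2 = box_2d
    # Scale each axis to pixels; sorting puts the smaller coordinate first
    xs = sorted((x1 * width / 1000, x2 * width / 1000))
    ys = sorted((y1 * height / 1000, y2 * height / 1000))
    return [xs[0], ys[0], xs[1], ys[1]]


print(scale_box([100, 200, 300, 400], width=1000, height=500))  # [200.0, 50.0, 400.0, 150.0]
```

The returned list could be passed directly to `annotator.box_label`, which would reduce the loop body to a single call per item.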

## Reasoning capabilities

With Gemini models, you can tackle complex tasks using advanced reasoning that understands context and delivers more precise results. 🧠

In [None]:
# Define the text prompt
prompt = """
Detect the 2d bounding box around:
highlight the area of morning light +
notebook on PC table
potted plant near mirror.
"""

# Fixed output instruction; the plotting code below depends on this format.
output_prompt = "Return just box_2d and labels, no additional text."

image, w, h = read_image("gemini-image2.jpg")  # Read image and extract width, height

results = inference(image, prompt + output_prompt)

# Clean the results and load results in list format
cln_results = json.loads(clean_results(results))

annotator = Annotator(image)  # initialize Ultralytics annotator

for idx, item in enumerate(cln_results):
    # By default, Gemini models return boxes with the y coordinates first: [y1, x1, y2, x2].
    # Scale the normalized box coordinates (0–1000) to the image dimensions.
    y1, x1, y2, x2 = item["box_2d"]
    y1 = y1 / 1000 * h
    x1 = x1 / 1000 * w
    y2 = y2 / 1000 * h
    x2 = x2 / 1000 * w

    if x1 > x2:
        x1, x2 = x2, x1  # Swap x-coordinates if needed
    if y1 > y2:
        y1, y2 = y2, y1  # Swap y-coordinates if needed

    annotator.box_label([x1, y1, x2, y2], label=item["label"], color=colors(idx, True))

Image.fromarray(annotator.result())  # display the output

<img src="https://github.com/ultralytics/notebooks/releases/download/v0.0.0/gemini-inference-image-reasoning.png">
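
The plotting loop expects every item to carry a four-number `box_2d` and a `label`. If a response omits a field or returns a box of the wrong length, the loop would raise a `KeyError` or `ValueError`. As a hedged sketch, the hypothetical helper `keep_valid_items` below (not part of the notebook) filters such items out before plotting.

```python
def keep_valid_items(items):
    """Keep only dicts with a 4-number "box_2d" and a string "label" (hypothetical helper)."""
    valid = []
    for item in items:
        if not isinstance(item, dict):
            continue
        box, label = item.get("box_2d"), item.get("label")
        is_box = (
            isinstance(box, (list, tuple))
            and len(box) == 4
            # bool is a subclass of int, so exclude it explicitly
            and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in box)
        )
        if is_box and isinstance(label, str):
            valid.append(item)
    return valid
```

Calling `keep_valid_items(cln_results)` before the loop would skip malformed entries rather than stopping at the first one.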

## Image captioning  

You can use Gemini models for image captioning to generate meaningful text descriptions that summarize the content of an image. 📝

In [179]:
# Define the text prompt
prompt = """
What's inside the image, generate a detailed captioning in the form of short
story, Make 4-5 lines and start each sentence on a new line.
"""

image, _, _ = read_image("gemini-image4.jpg")  # Read the image; width and height are not needed here

print(inference(image, prompt))  # Display the results

Downloading https://ultralytics.com/assets/gemini-image4.jpg to 'gemini-image4.jpg'...


Sunlight spilled across the wooden desk, illuminating the quiet workspace.
A laptop sat open, flanked by a steaming red mug and a waiting tablet.
Nearby, a notebook held potential ideas, bathed in the warm morning glow.
Potted plants added touches of green, bringing the outside indoors.
It was a peaceful setup, ready for a day of focused creativity.


<img src="https://github.com/ultralytics/notebooks/releases/download/v0.0.0/gemini-inference-image-captioning.jpg">

## OCR

Gemini models also support Optical Character Recognition (OCR), helping you detect and extract text from images with speed and accuracy. 🚀

In [None]:
# Define the text prompt
prompt = """
Extract the text from the image
"""

# Fixed output instruction; the plotting code below depends on this format.
output_prompt = """
Return just box_2d which will be location of detected text areas + label"""

image, w, h = read_image("gemini-image3.png")  # Read image and extract width, height

results = inference(image, prompt + output_prompt)

# Clean the results and load results in list format
cln_results = json.loads(clean_results(results))

annotator = Annotator(image)  # initialize Ultralytics annotator

for idx, item in enumerate(cln_results):
    # By default, Gemini models return boxes with the y coordinates first: [y1, x1, y2, x2].
    # Scale the normalized box coordinates (0–1000) to the image dimensions.
    y1, x1, y2, x2 = item["box_2d"]
    y1 = y1 / 1000 * h
    x1 = x1 / 1000 * w
    y2 = y2 / 1000 * h
    x2 = x2 / 1000 * w

    if x1 > x2:
        x1, x2 = x2, x1  # Swap x-coordinates if needed
    if y1 > y2:
        y1, y2 = y2, y1  # Swap y-coordinates if needed

    annotator.box_label([x1, y1, x2, y2], label=item["label"], color=colors(idx, True))

Image.fromarray(annotator.result())  # display the output

![OCR with gemini-2.5-pro model](https://github.com/ultralytics/notebooks/releases/download/v0.0.0/gemini-inference-image-ocr.png)
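
Besides drawing the boxes, you may want the extracted text itself. The sketch below assumes each item's `label` holds the recognized text, as the OCR prompt above requests, and orders the labels top to bottom, then left to right. The name `labels_in_reading_order` is hypothetical, and this simple sort does not group boxes into lines, so it is only a rough approximation of reading order.

```python
def labels_in_reading_order(items):
    """Return item labels sorted by the top-left corner of their [y1, x1, y2, x2] boxes."""

    def top_left(item):
        y1, x1, y2, x2 = item["box_2d"]
        # Use min() so inverted corners do not affect the ordering
        return (min(y1, y2), min(x1, x2))

    return [item["label"] for item in sorted(items, key=top_left)]


sample_items = [
    {"box_2d": [500, 100, 560, 400], "label": "second line"},
    {"box_2d": [100, 300, 160, 600], "label": "title, right part"},
    {"box_2d": [100, 50, 160, 280], "label": "title, left part"},
]
ordered = labels_in_reading_order(sample_items)
```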

## Additional Resources  

✅ Learn more about Gemini 2.5: [here](https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/)  
✅ Ultralytics Annotator: [here](https://docs.ultralytics.com/reference/utils/plotting/)

🌟 Explore the [Ultralytics Notebooks](https://github.com/ultralytics/notebooks/) and give them a star to boost your AI journey! 🚀

Built with 💙 by [Ultralytics](https://ultralytics.com/)  