<a href="https://colab.research.google.com/github/rubaahmedkhan/Gemini-Experiments/blob/main/vision.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Explore vision capabilities with the Gemini API**

Before you use the File API, you need to install the Gemini API SDK package and configure an API key. This section describes how to complete these setup steps.

# **Install the Python SDK and import packages**

In [None]:
!pip install -U -q google-generativeai google-genai

Import the necessary packages.



In [None]:
import google.generativeai as genai
from IPython.display import Markdown

**Set up your API key**

In [None]:
from google.colab import userdata
import os

os.environ['GOOGLE_API_KEY'] = userdata.get('GEMINI_API_KEY')
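
If the secret is missing, later API calls tend to fail with errors that do not point at the real cause. As a minimal sketch, a check like the following can surface the problem early. The helper name `require_api_key` is hypothetical and not part of any SDK.

```python
import os


def require_api_key(env=os.environ, name="GOOGLE_API_KEY"):
    """Return the key stored under `name`, or raise a readable error."""
    key = env.get(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set; add it to Colab secrets or export it first."
        )
    return key
```

Calling `require_api_key()` right after the cell above would stop the notebook before any request is made.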

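In [None]:
# Optional: the newer google-genai SDK. It is imported under an alias so that it
# does not shadow the `genai` module used by the rest of this notebook.
from google import genai as google_genai

In [None]:
client = google_genai.Client()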


# **Prompting with images**



**Base64 encoded images**

You can prompt with an image hosted at a public URL by fetching it and sending its bytes as a Base64-encoded payload. The following example uses the httpx library to fetch the image:



In [None]:
import httpx
import base64

# Retrieve an image
image_path = "https://www.opportunitiescircle.com/wp-content/uploads/2024/12/Generative-AI-for-Everyone-Free-Course-2025.jpg"
image = httpx.get(image_path)

# Choose a Gemini model
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

# Create a prompt
prompt = "Caption this image."
response = model.generate_content(
    [
        {
            "mime_type": "image/jpeg",
            "data": base64.b64encode(image.content).decode("utf-8"),
        },
        prompt,
    ]
)

Markdown(">" + response.text)

>Here's a caption for the image:

**Option 1 (Concise):**

> Learn Generative AI for FREE in 2025! This fully funded course is open to all countries, with no deadline.  [Link to website] #generativeAI #freeCourse #AICourse #DeepLearning


**Option 2 (More detailed):**

> Level up your AI skills with this fully funded Generative AI course, brought to you by DeepLearning.AI and Opportunities Circle!  No application deadline – learn at your own pace. Open to participants worldwide.  Enroll now! [Link to website] #AI #GenerativeAI #FreeEducation #OnlineCourse


**Option 3 (Focus on urgency/scarcity - even though there's no deadline):**

> Don't miss out on this incredible opportunity!  A completely FREE, fully funded Generative AI course is now available to everyone, regardless of location. Learn the skills of the future – enroll today! [Link to website] #AICourse #FreeLearning #GenerativeAI #LimitedTimeOffer (although not technically limited time)


**Remember to replace "[Link to website]" with the actual URL from the image.** Choose the option that best suits your target audience and platform.
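
The inline image dictionary used in the request above can be built by a small helper. This is a sketch only: the function name `inline_image_part` is hypothetical, and guessing the MIME type from the file name (with `image/jpeg` as fallback) is an assumption that will not suit every URL.

```python
import base64
import mimetypes


def inline_image_part(data: bytes, source_name: str, default_mime: str = "image/jpeg") -> dict:
    """Build the {"mime_type", "data"} dict from raw image bytes."""
    # Guess the MIME type from the extension; fall back when it is unknown.
    guessed, _ = mimetypes.guess_type(source_name)
    if guessed and guessed.startswith("image/"):
        mime_type = guessed
    else:
        mime_type = default_mime
    return {
        "mime_type": mime_type,
        "data": base64.b64encode(data).decode("utf-8"),
    }
```

With the variables from the cell above, `inline_image_part(image.content, image_path)` would produce the same dictionary that was written out by hand.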


# **Multiple images**

To prompt with multiple images in Base64 encoded format, you can do the following:

In [None]:
import httpx
import base64

# Retrieve two images
image_path_1 = "https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcSg1y2smnVz6-RMLN8eL2Tef1VakEV8U08StG2NNt7yZD1n6hPI"
image_path_2 = "https://images.olx.com.pk/thumbnails/279059287-240x180.jpeg"

image_1 = httpx.get(image_path_1)
image_2 = httpx.get(image_path_2)

# Create a prompt
prompt = "Generate a list of all the objects contained in both images."

response = model.generate_content(
    [
        {
            "mime_type": "image/jpeg",
            "data": base64.b64encode(image_1.content).decode("utf-8"),
        },
        {
            "mime_type": "image/jpeg",
            "data": base64.b64encode(image_2.content).decode("utf-8"),
        },
        prompt,
    ]
)

Markdown(response.text)

Here's a list of the objects or concepts present in both images:

* **Machine Learning:** This is explicitly mentioned in both images as a core component of AI and a subject of study.

* **Deep Learning:**  Also explicitly named in both;  the first image identifies it as a part of AI, and the second shows it as a learning component alongside machine learning.

* **Python:**  Appears in the second image as a programming language used in AI/ML related studies.  While not explicitly named in the first image, Python is a very common language used in the AI fields described there, implying its presence implicitly.

* **Data (implicitly):**  Both images strongly imply the importance of data. The first image discusses AI components dealing with data (machine learning, computer vision), and the second explicitly mentions "Data Visualization" and "Data Science."


It is important to note that the other items mentioned in the images (Neural Networks, Natural Language Processing, Computer Vision, Data Structures, OOP, etc.) are either components *of* the above items or closely related concepts within the broader field of AI.  They don't appear as explicit commonalities across both images in the same way as the ones listed above.


# **Upload one or more locally stored image files**

Alternatively, you can upload one or more locally stored image files.

First, download the image files and save them to your local directory.

Then click Files on the left sidebar. For each file, click the Upload button, then navigate to that file's location and upload it:

In [None]:
import PIL.Image

sample_file_2 = PIL.Image.open('demo1.jpeg')
sample_file_3 = PIL.Image.open('demo2.webp')
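
If `PIL.Image.open` raises `FileNotFoundError`, a quick check like the one below can show which uploads are missing. This is a sketch using only the standard library; the helper name `missing_files` is hypothetical.

```python
from pathlib import Path


def missing_files(names, directory="."):
    """Return the names from `names` that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


# File names taken from the cell above.
missing = missing_files(["demo1.jpeg", "demo2.webp"])
if missing:
    print("Upload these files before continuing:", ", ".join(missing))
```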

In [None]:
import google.generativeai as genai

# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-pro-latest")

# Create a prompt.
prompt = "What do you see in both images? Explain in two lines."

response = model.generate_content([sample_file_2, sample_file_3, prompt])

Markdown(response.text)

Both images relate to machine learning. The first is an advertisement for tutoring services, while the second is a cheat sheet summarizing various machine learning algorithms categorized by learning style.

# **Large image payloads**

**Upload an image file using the File API**

In [None]:
!curl -o jetpack.jpg https://storage.googleapis.com/generativeai-downloads/images/jetpack.jpg


In [None]:
# Upload the file and print a confirmation.
sample_file = genai.upload_file(path="jetpack.jpg",
                            display_name="Jetpack drawing")

print(f"Uploaded file '{sample_file.display_name}' as: {sample_file.uri}")

Uploaded file 'Jetpack drawing' as: https://generativelanguage.googleapis.com/v1beta/files/ui7oqyac3jl9


In [None]:
file = genai.get_file(name=sample_file.name)
print(f"Retrieved file '{file.display_name}' as: {file.uri}")

Retrieved file 'Jetpack drawing' as: https://generativelanguage.googleapis.com/v1beta/files/ui7oqyac3jl9
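
Whether an image counts as a large payload depends on the request size limit. The sketch below decides between inline Base64 data and the File API from the file size. The 20 MB threshold is an assumption for illustration and should be checked against the current API documentation; the helper name `needs_file_api` is hypothetical.

```python
import os

# Assumption: 20 MB is used as the ceiling for inline requests.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def needs_file_api(path, limit_bytes=INLINE_LIMIT_BYTES):
    """Return True when the file alone would exceed the assumed inline limit."""
    return os.path.getsize(path) > limit_bytes
```

Base64 encoding inflates a payload by roughly a third, so a more conservative threshold may be appropriate when sending inline data.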


**Prompt with the uploaded image and text**

After uploading the file, you can make GenerateContent requests that reference the File API URI. Select the generative model and provide it with a text prompt and the uploaded image.

In [None]:
# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

# Prompt the model with text and the previously uploaded image.
response = model.generate_content([sample_file, "Describe how this product might be manufactured."])

Markdown(response.text)

Here's a breakdown of how the Jetpack Backpack, as conceptually designed, might be manufactured, keeping in mind that some aspects are highly speculative due to the lack of detailed specifications in the original image.

**1. Backpack Shell:**

* **Materials:**  Likely a durable, lightweight material like nylon or ripstop fabric.  This would need to be water-resistant or water-proof for weather protection.  For the internal structure, potentially a combination of reinforced plastics or lightweight metals for structural support.
* **Manufacturing Process:**  The shell would likely be cut and sewn from the fabric, with potentially some sections being heat-sealed or welded for added durability. Internal frame components (if any) could be injection-molded or die-cast.

**2.  Strap Support:**

* **Materials:**  High-density foam padding covered in a durable fabric would be used for comfortable shoulder straps.
* **Manufacturing Process:**  Foam is typically cut to shape, covered with fabric, and sewn to the backpack shell.

**3.  USB-C Charging System:**

* **Components:**  This would require a battery pack (possibly multiple smaller ones for weight distribution), charging circuitry, and a USB-C output port.
* **Manufacturing Process:**  The battery pack would be assembled from individual cells or procured as a pre-built unit.  The circuitry would need to be carefully integrated, likely on a small printed circuit board (PCB), and securely housed within the backpack.  The port would be fixed to the external shell.

**4.  Retractable Boosters:**

* **This is the most challenging aspect to manufacture:** The image implies a steam-powered system.  Building a safe, lightweight, and efficient system for generating steam, controlling its release, and retracting the mechanism would be extremely complex.
* **Potential Components:**  A miniature steam boiler (possibly using a rapid heating element), valves, pressure regulators, and a mechanism for retracting the nozzles.
* **Manufacturing Process:**  This would involve precise machining of metal parts, specialized welding or brazing, and sophisticated control systems for regulating steam pressure and nozzle deployment.  Miniaturization is crucial for practical implementation.

**5.  Laptop Compartment:**

* **Materials:**  Padding for laptop protection and possibly a dedicated, rigid internal frame or shell for extra security.
* **Manufacturing Process:**  Would involve cutting and sewing, with the padding materials attached to form the compartment within the larger backpack.

**6.  Assembly:**

*  Once all components are manufactured, the entire backpack would be assembled. This would likely involve skilled labor due to the complex integration of the different parts, especially the steam-powered system (if actually implemented).  Quality control would be a critical part of the process.

**Challenges:**

* **Steam System Miniaturization:**  Creating a safe, lightweight, and efficient steam-powered propulsion system is extremely difficult.
* **Safety Regulations:**  A steam-powered backpack would require stringent safety certifications and testing to ensure it's not dangerous.
* **Battery Life and Power:**  Achieving a 15-minute battery life for a steam-powered system would be a significant engineering challenge.
* **Cost:**  The advanced engineering and precision manufacturing required would likely make the backpack very expensive.

The design, as presented, is highly imaginative, but translating it into a real product would face immense technical and logistical hurdles.  It's more likely a conceptual idea for now, than a readily manufacturable product.


# **Capabilities**

This section outlines specific vision capabilities of the Gemini model, including object detection and bounding box coordinates.



## **Get bounding boxes**
Gemini models are trained to return bounding box coordinates as relative widths or heights in the range of [0, 1]. These values are then scaled by 1000 and converted to integers. Effectively, the coordinates represent the bounding box on a 1000x1000 pixel version of the image. Therefore, you'll need to convert these coordinates back to the dimensions of your original image to accurately map the bounding boxes.

In [None]:
# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

# Create a prompt to detect bounding boxes.
prompt = "Return a bounding box for each of the objects in this image in [ymin, xmin, ymax, xmax] format."
response = model.generate_content([sample_file_2, prompt])

Markdown(response.text)

Here are the bounding boxes for the objects in the image.  Note that some of the text is split across multiple lines, so the bounding boxes encompass the entire text phrase.  Accuracy is limited by the image resolution and OCR quality.

* **Python, AI, Machine Learning Tutoring:** [274, 36, 402, 272]
* **Machine Learning:** [145, 396, 204, 546]
* **Supervised, Unsupervised, Reinforcement:** [212, 396, 246, 602]
* **Deep Learning:** [145, 655, 204, 796]
* **ANN, CNN, Computer Vision, PyTorch:** [212, 655, 272, 845]
* **Data Visualization:** [359, 521, 418, 695]
* **Python Programming:** [487, 350, 559, 511]
* **Data Structures, OOP:** [561, 350, 584, 468]
* **Data Science:** [593, 768, 653, 863]
* **Predictive Analytics:** [656, 768, 691, 867]


Please note that the OCR had some errors ("Unsupe" instead of "Unsupervised", "Camper Van" instead of "Computer Vision").  The bounding boxes reflect the actual text as captured by the OCR.
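
The conversion back to original image dimensions described above can be sketched as follows. The helper names `parse_boxes` and `to_pixels` are hypothetical, and the 2000x1000 image size is an assumed value for illustration. The parser assumes the model prints each box as four integers in square brackets, as in the output above.

```python
import re


def parse_boxes(text):
    """Extract every [ymin, xmin, ymax, xmax] group of four integers from text."""
    pattern = r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]"
    return [tuple(int(v) for v in match) for match in re.findall(pattern, text)]


def to_pixels(box, width, height):
    """Scale a 0-1000 normalized box to pixel (left, top, right, bottom)."""
    ymin, xmin, ymax, xmax = box
    return (
        round(xmin * width / 1000),
        round(ymin * height / 1000),
        round(xmax * width / 1000),
        round(ymax * height / 1000),
    )


boxes = parse_boxes("* **Machine Learning:** [145, 396, 204, 546]")
print(to_pixels(boxes[0], width=2000, height=1000))  # (792, 145, 1092, 204)
```

The result is ordered (left, top, right, bottom), which differs from the [ymin, xmin, ymax, xmax] order the model returns, because that is the order most drawing APIs expect.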


# **Prompting with video**
In this tutorial, you will upload a video using the File API and generate content based on that video.

## Upload a video file to the File API

In [None]:
!wget https://storage.googleapis.com/generativeai-downloads/images/GreatRedSpot.mp4

--2025-02-09 09:26:41--  https://storage.googleapis.com/generativeai-downloads/images/GreatRedSpot.mp4
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.141.207, 173.194.210.207, 173.194.212.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.141.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 238090979 (227M) [video/mp4]
Saving to: ‘GreatRedSpot.mp4’


2025-02-09 09:26:42 (155 MB/s) - ‘GreatRedSpot.mp4’ saved [238090979/238090979]



Upload the video to the File API and print the URI.

In [None]:
video_file_name = "GreatRedSpot.mp4"

print("Uploading file...")
video_file = genai.upload_file(path=video_file_name)
print(f"Completed upload: {video_file.uri}")

Uploading file...
Completed upload: https://generativelanguage.googleapis.com/v1beta/files/s0siccetja03


# **Verify file upload and check state**
Verify that the API has successfully received the file by calling genai.get_file().

NOTE: Video files have a State field in the File API. When a video is uploaded, it will be in the PROCESSING state until it is ready for inference. Only ACTIVE files can be used for model inference.

In [None]:
import time

# Check whether the file is ready to be used.
while video_file.state.name == "PROCESSING":
    print('.', end='')
    time.sleep(10)
    video_file = genai.get_file(video_file.name)

if video_file.state.name == "FAILED":
    raise ValueError(video_file.state.name)

.
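
The polling loop above runs for as long as the file stays in the PROCESSING state. A variant with an upper bound on the wait might look like the sketch below. The helper `wait_until_active` is hypothetical; it takes any callable that returns the current state name, so it does not depend on a particular SDK.

```python
import time


def wait_until_active(get_state, poll_seconds=10, timeout_seconds=600, sleep=time.sleep):
    """Poll get_state() until it stops returning "PROCESSING" or time runs out."""
    waited = 0
    state = get_state()
    while state == "PROCESSING":
        if waited >= timeout_seconds:
            raise TimeoutError(f"File still PROCESSING after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        state = get_state()
    if state == "FAILED":
        raise ValueError("File processing failed")
    return state
```

In this notebook, the callable could be `lambda: genai.get_file(video_file.name).state.name`, which repeats the lookup used in the loop above.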

# **Prompt with a video and text**
Once the uploaded video is in the ACTIVE state, you can make GenerateContent requests that specify the File API URI for that video. Select the generative model and provide it with the uploaded video and a text prompt.

In [None]:
# Create the prompt.
prompt = "Summarize this video. Then create a quiz with answer key based on the information in the video."

# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

# Make the LLM request.
print("Making LLM inference request...")
response = model.generate_content([video_file, prompt],
                                  request_options={"timeout": 600})

# Print the response, rendering any Markdown
Markdown(response.text)

Making LLM inference request...


Here is a summary of the video, followed by a short quiz.

The video describes Jupiter's Great Red Spot, a gigantic, centuries-old storm.  Data from NASA missions, including Voyager, Hubble, and Juno, reveal that the storm is shrinking and becoming rounder. While scientists initially expected wind speeds within the spot to increase as it shrank (like a figure skater spinning faster when they pull in their arms), data show the storm is actually getting taller.  The Great Red Spot, once large enough to fit three Earths, is now only slightly larger than one.

**Quiz:**

1.  **True or False:** The Great Red Spot is a hurricane.
2.  **What three NASA missions contributed data to this research?**
3.  **What is the analogy used in the video to explain the initial expectation of what would happen to the storm?**
4.  **What is actually happening to the Great Red Spot, according to the data?**
5.  **How many Earths could fit inside the Great Red Spot today?**

**Answer Key:**

1.  False (It's an anticyclone)
2. Voyager, Hubble, Juno
3. A figure skater pulling in their arms to spin faster.
4. It is getting taller.
5. A little more than one.

# **Refer to timestamps in the content**
You can use timestamps of the form MM:SS to refer to specific moments in the video.



In [None]:
# Create the prompt.
prompt = "What are the examples given at 01:05 and 01:19 supposed to show us?"

# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

# Make the LLM request.
print("Making LLM inference request...")
response = model.generate_content([prompt, video_file],
                                  request_options={"timeout": 600})
Markdown(response.text)

Making LLM inference request...


Here's an explanation of what the examples at 01:05 and 01:19 are meant to show.

At **01:05**, the video shows a figure skater pulling in her arms while spinning. This is used as an analogy for the shrinking Great Red Spot.  As the skater brings her arms closer to her body, her spin increases due to the conservation of angular momentum. The expectation was that the Great Red Spot would behave similarly. As it shrinks, its rotation speed would increase.

The example at **01:19** shows a potter shaping a lump of clay on a spinning wheel. As the potter works the clay, it becomes taller and more narrow, but it doesn't spin faster. The video uses this to illustrate that the Great Red Spot, unlike the ice skater, is getting taller as it shrinks, not spinning faster.
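
When the moments of interest are stored as seconds, the MM:SS strings used in the prompt can be generated rather than typed by hand. A minimal sketch, with hypothetical helper names `to_mmss` and `from_mmss`:

```python
def to_mmss(total_seconds: int) -> str:
    """Format a non-negative number of seconds as MM:SS."""
    minutes, seconds = divmod(int(total_seconds), 60)
    return f"{minutes:02d}:{seconds:02d}"


def from_mmss(stamp: str) -> int:
    """Convert an MM:SS string back to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


moments = [65, 79]  # seconds into the video, matching the prompt above
stamps = " and ".join(to_mmss(s) for s in moments)
prompt = f"What are the examples given at {stamps} supposed to show us?"
print(prompt)
# What are the examples given at 01:05 and 01:19 supposed to show us?
```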

# **Transcribe video and provide visual descriptions**
The Gemini models can transcribe and provide visual descriptions of video content by processing both the audio track and visual frames. For visual descriptions, the model samples the video at a rate of 1 frame per second. This sampling rate may affect the level of detail in the descriptions, particularly for videos with rapidly changing visuals.



In [None]:
# Create the prompt.
prompt = "Transcribe the audio from this video, giving timestamps for salient events in the video. Also provide visual descriptions."

# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

# Make the LLM request.
print("Making LLM inference request...")
response = model.generate_content([video_file, prompt],
                                  request_options={"timeout": 600})
Markdown(response.text)

Making LLM inference request...


Here's a transcription of the audio from the video, along with timestamps and visual descriptions.

[00:00:00 to 00:00:15] The video opens with a dark space background speckled with stars. A partially illuminated planet Jupiter is in the center of the screen. The narrator begins to speak, "Jupiter is the largest and oldest planet in our solar system. Its history spans 4.5 billion years. This gas giant is made of the same elements as a star, but it did not grow massive enough to ignite."

[00:00:16 to 00:00:31] The camera focuses on the right half of Jupiter. The rings around Jupiter are visible. The narrator says, "Jupiter's appearance is the result of its swirling interior of gases and liquids producing a tapestry of colorful cloud bands, as well as the iconic Great Red Spot."  The camera zooms in on the Great Red Spot.

[00:00:32 to 00:00:44] The Great Red Spot is shown in more detail.  A smaller, darker, circular storm is also shown near the Great Red Spot. The narrator says, "The Great Red Spot is a gigantic storm. It's an anticyclone, and with no land mass on the planet to slow it down, the Great Red Spot has raged for over a century."

[00:00:45 to 00:01:04] The camera shows a view of Jupiter from a slightly different angle. The narrator says, "But scientists studying the spot have noticed that it has been changing over time. The color is deepening, and it's actually shrinking and getting rounder. Those studying it expected to therefore see the wind speeds inside the Great Red Spot increasing as the storm shrinks."

[00:01:05 to 00:01:11]  A black-and-white video clip of a figure ice skating is shown. The skater spins faster as she pulls her arms in. The narrator says, "Like an ice skater who spins faster as she pulls in her arms. But this isn't the case."

[00:01:12 to 00:01:24] The camera again focuses on the Great Red Spot. The narrator states, "Data reveals the storm isn't spinning faster; it's actually getting taller. You can think of it like working with pottery. As the wide lump of clay spins, forces within are driving it taller."  A split-screen appears, showing a graph of the height of the Great Red Spot over time, and a video clip of someone working with clay on a pottery wheel.

[00:01:25 to 00:01:35] The video returns to a view of Jupiter.  Images of the Great Red Spot from 1995, 2009, and 2015 are shown, highlighting its shrinking size.  The narrator comments, "So, from our perspective looking down on the clouds, we see the spot getting smaller and rounder. The Great Red Spot used to be big enough to fit three Earths. Now it's just a little over one."  An image of Earth is superimposed on the Great Red Spot to illustrate the comparison.

[00:01:36 to 00:01:46]  A 3D rendering of the Voyager spacecraft is shown approaching Jupiter. The narrator explains, "These discoveries were made by analyzing data from numerous NASA missions, including Voyager, Hubble, and most recently, Juno."  Images of the Juno spacecraft are shown.

[00:01:47 to 00:01:57] The video returns to the opening scene of Jupiter in space. The narrator says, "And through more investigations, scientists hope to unlock more secrets of the mysterious Great Red Spot."  Close up images of the Great Red Spot are again shown.

[00:01:58 to 00:02:06]  The video ends with a final shot of the Juno spacecraft, followed by the NASA Goddard Space Flight Center logo and website address.
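
At the sampling rate of 1 frame per second mentioned earlier, the number of frames the model sees grows with the video's length, and anything shorter than the sampling interval may fall between frames. The arithmetic can be sketched as follows; both helper names are hypothetical, and the 126-second duration is read from the final timestamp in the transcript above.

```python
def sampled_frames(duration_seconds: float, frames_per_second: float = 1.0) -> int:
    """Approximate number of frames sampled from a video of the given length."""
    return int(duration_seconds * frames_per_second)


def may_be_missed(event_seconds: float, frames_per_second: float = 1.0) -> bool:
    """Return True when an event is shorter than the gap between sampled frames."""
    return event_seconds < 1.0 / frames_per_second


print(sampled_frames(126))  # 126
print(may_be_missed(0.25))  # True
```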

# **List files**
You can list all uploaded files and their URIs using genai.list_files().

In [None]:
# List all files
for file in genai.list_files():
    print(f"{file.display_name}, URI: {file.uri}")

GreatRedSpot.mp4, URI: https://generativelanguage.googleapis.com/v1beta/files/s0siccetja03
Jetpack drawing, URI: https://generativelanguage.googleapis.com/v1beta/files/ui7oqyac3jl9


# **Delete files**
Files are automatically deleted after 2 days. You can also manually delete them using genai.delete_file().

In [None]:
genai.delete_file(video_file.name)
print(f'Deleted file {video_file.uri}')

Deleted file https://generativelanguage.googleapis.com/v1beta/files/s0siccetja03
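
Given the 2-day retention period stated above, the time at which an uploaded file disappears can be estimated from its upload time. The sketch below uses only the standard library. The helper names are hypothetical, and the upload timestamp is an illustrative value rather than one returned by the API.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=2)  # retention period stated in this section


def expires_at(uploaded_at: datetime) -> datetime:
    """Return the estimated moment at which the file is deleted automatically."""
    return uploaded_at + RETENTION


def is_expired(uploaded_at: datetime, now: datetime = None) -> bool:
    """Return True when the retention period has already elapsed."""
    now = now or datetime.now(timezone.utc)
    return now >= expires_at(uploaded_at)


uploaded = datetime(2025, 2, 9, 9, 26, tzinfo=timezone.utc)  # illustrative value
print(expires_at(uploaded).isoformat())  # 2025-02-11T09:26:00+00:00
```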
