# Images and Vision

## Basic Connection and Packages

### Importing OpenAI and Initializing the Client

To begin, we'll import the `OpenAI` class from the `openai` library, which allows us to interact with the OpenAI API. Next, we initialize a client instance, which we'll use to send requests and receive responses from the OpenAI models.

In [1]:
"""
A simple example of using the OpenAI API.
It uses the OpenAI Python client library to create a client for the OpenAI API.
The client reads the OPENAI_API_KEY environment variable to authenticate.
"""
from openai import OpenAI

client = OpenAI()
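
Because `OpenAI()` is called without arguments, authentication depends entirely on the `OPENAI_API_KEY` environment variable. The sketch below is one way to check for it before creating the client. The helper name `api_key_is_configured` is hypothetical and not part of the `openai` library.

```python
import os


def api_key_is_configured(env=None):
    """Return True if a non-empty OPENAI_API_KEY is present.

    `env` defaults to the real process environment; passing a plain dict
    makes the check easy to try out without touching os.environ.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not api_key_is_configured():
    # Hypothetical warning text; the client is expected to fail to
    # authenticate when the variable is missing.
    print("OPENAI_API_KEY is not set; set it before creating the client.")
```

Running this cell first gives a clearer message than an authentication failure later in the notebook.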

### Base64
Next, we'll import Python's built-in `base64` library. This module encodes binary data (such as images or files) into an ASCII text representation and decodes it back, which is often required when sending images in API requests or reading them from responses.


In [14]:
import base64
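
To make the encoding step concrete, here is a small round trip using only the standard library. The sample bytes are the fixed 8-byte signature that starts every PNG file, chosen purely for illustration.

```python
import base64

# The 8-byte signature found at the start of every PNG file.
raw = b"\x89PNG\r\n\x1a\n"

# b64encode returns bytes; decode to get an ordinary text string.
encoded = base64.b64encode(raw).decode("ascii")
print(encoded)  # iVBORw0KGgo=

# Decoding restores the original bytes exactly.
decoded = base64.b64decode(encoded)
print(decoded == raw)  # True
```

The encoded text is roughly a third larger than the raw bytes, which is worth keeping in mind when sending large images inline.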

## Passing an Image URL
In the following code cell, we'll use **"gpt-4o-mini"** to analyze an image referenced by URL. The model examines the picture and generates a descriptive response, which we then print. This demonstrates how the model can interpret visual content alongside text-based instructions.


<img src="https://upload.wikimedia.org/wikipedia/commons/5/53/202412_Taiwan_Railway_Haifeng_EMU500_Tourist_Train_at_Houlong_Station.jpg" width="512" height="512">



In [35]:

response = client.responses.create(
    model="gpt-4o-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Tell me what is in this image."},
            {
                "type": "input_image",
                "image_url": "https://upload.wikimedia.org/wikipedia/commons/5/53/202412_Taiwan_Railway_Haifeng_EMU500_Tourist_Train_at_Houlong_Station.jpg",
            },
        ],
    }],
)

print(response.output_text)

The image features a train at a station. The train has a distinct mint green color with two large windows in the front and several lights. There are also various details such as a nameplate, likely indicating the train's name or route. The surroundings include a fence next to the railway and some buildings in the background, with a clear sky above.
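
The remote image is passed as a plain string, so a typo or an accidental local path would likely only surface once the request is sent. A minimal pre-flight check might look like the sketch below. `is_http_image_url` is a hypothetical helper, not part of the OpenAI library, and it only inspects the string; it does not confirm that the URL is reachable or points at an image.

```python
from urllib.parse import urlparse


def is_http_image_url(value):
    """Return True if `value` looks like an absolute http(s) URL."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


url = (
    "https://upload.wikimedia.org/wikipedia/commons/5/53/"
    "202412_Taiwan_Railway_Haifeng_EMU500_Tourist_Train_at_Houlong_Station.jpg"
)
print(is_http_image_url(url))                                # True
print(is_http_image_url("./artifacts/mystery_gathering.jpg"))  # False
```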


## Passing a Base64 Encoded Image
In the following code cell, we'll use **"gpt-4o-mini"** to analyze a local image. Instead of a web URL, we read the file, encode it as Base64, and pass it in the `image_url` field as a `data:` URL. The model examines the picture and generates a descriptive response, which we then print.

<img src="./artifacts/mystery_gathering.jpg" width="512" height="512">



In [36]:
# Helper function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to our image
image_path = "./artifacts/mystery_gathering.jpg"

# Getting the Base64 string
base64_image = encode_image(image_path)


response = client.responses.create(
    model="gpt-4o-mini",
    input=[
        {
            "role": "user",
            "content": [
                { "type": "input_text", "text": "Tell me what is in this image." },
                {
                    "type": "input_image",
                    "image_url": f"data:image/jpeg;base64,{base64_image}",
                },
            ],
        }
    ],
)

print(response.output_text)

The image depicts a busy street scene with many people gathered. A person is seen jumping onto the roof of a parked black car, while others appear to be running or engaging in various activities around them. The environment looks urban, with buildings in the background and some construction or barriers visible. The overall atmosphere seems chaotic, as a large group is present, suggesting excitement or some sort of event.
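
The cell above hard-codes `image/jpeg` in the data URL, which is correct for this file but not for a PNG or GIF. As a sketch, the MIME type could be guessed from the file extension with the standard `mimetypes` module. The helper names below are hypothetical, and the JPEG fallback is an assumption made for this example.

```python
import base64
import mimetypes
from pathlib import Path


def guess_image_mime(image_path, default="image/jpeg"):
    """Guess an image MIME type from the file extension."""
    mime, _ = mimetypes.guess_type(str(image_path))
    if mime is None or not mime.startswith("image/"):
        return default  # assumed fallback for unknown extensions
    return mime


def bytes_to_data_url(data, mime):
    """Wrap raw bytes in a Base64 data URL."""
    encoded = base64.b64encode(data).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


def file_to_data_url(image_path):
    """Read a file and return it as a data URL with a guessed MIME type."""
    data = Path(image_path).read_bytes()
    return bytes_to_data_url(data, guess_image_mime(image_path))
```

With these helpers, the `image_url` value in the request could be written as `file_to_data_url(image_path)`.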


## Detail: Low vs High Resolution
The `detail` parameter tells the model what level of detail to use when processing the image: `low`, `high`, or `auto` (let the model decide). If you omit the parameter, the model uses `auto`.

### Low Detail
You can save tokens and speed up responses by using `"detail": "low"`. The model then receives a low-resolution 512px x 512px version of the image and processes it with a budget of 85 tokens. This is fine if your use case doesn't require high-resolution detail (for example, if you're asking about the dominant shape or color in the image).

In [None]:
# Helper function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to our image
image_path = "./artifacts/mystery_gathering.jpg"

# Getting the Base64 string
base64_image = encode_image(image_path)


response = client.responses.create(
    model="o1",
    input=[
        {
            "role": "user",
            "content": [
                { "type": "input_text", "text": "Tell me all the details you see in this image." },
                {
                    "type": "input_image",
                    "image_url": f"data:image/jpeg;base64,{base64_image}",
                    "detail": "low",
                },
            ],
        }
    ],
)

print(response.output_text)

1. A crowd of people is gathered on a city street, with one individual jumping on top of a parked car amidst the commotion.  
2. A turquoise train awaits at a platform, surrounded by buildings and greenery under a soft, early morning light.


### High Detail
You can give the model more detail to work with by using `"detail": "high"`. The model first sees the low-resolution image (85 tokens) and then receives detailed crops of the image, at 170 tokens for each 512px x 512px tile.
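
The token figures above can be turned into a rough cost estimate. The sketch below uses the 85-token base and 170 tokens per 512px tile from the text. The two resize steps (fit within 2048 x 2048, then shrink so the shortest side is at most 768px) are an assumption based on the procedure OpenAI has published for GPT-4o-class models. Other models may count image tokens differently, so treat the result as an approximation.

```python
import math


def estimate_image_tokens(width, height, detail="high"):
    """Roughly estimate the image token cost for a given detail level."""
    if detail == "low":
        return 85  # flat budget, independent of image size
    if detail != "high":
        raise ValueError("detail must be 'low' or 'high' for this estimate")

    # Assumed step 1: shrink (never enlarge) to fit within 2048 x 2048.
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)

    # Assumed step 2: shrink so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)

    # Count the 512px tiles needed to cover the resized image.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1024, 1024, "low"))   # 85
print(estimate_image_tokens(1024, 1024, "high"))  # 765
print(estimate_image_tokens(2048, 4096, "high"))  # 1105
```

For the 1024 x 1024 case, the image is resized to 768 x 768, which needs four tiles: 85 + 4 x 170 = 765.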

In [31]:
# Helper function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to our image
image_path = "./artifacts/mystery_gathering.jpg"

# Getting the Base64 string
base64_image = encode_image(image_path)


response = client.responses.create(
    model="o1",
    input=[
        {
            "role": "user",
            "content": [
                { "type": "input_text", "text": "Tell me all the details you see in this image." },
                {
                    "type": "input_image",
                    "image_url": f"data:image/jpeg;base64,{base64_image}",
                    "detail": "high",
                },
            ],
        }
    ],
)

print(response.output_text)

It appears to be a busy urban street scene with multiple people gathered around a black sedan parked (or stopped) along the curb. Here’s a detailed, factual description of what can be observed:

• Car in foreground:  
  – A black four-door sedan with its rear license plate covered or edited out.  
  – The roof and trunk area of the car appear intact, but a person is currently leaping onto it or over it.  

• Person in red pants:  
  – This individual is captured in midair above the car’s roof, wearing bright red pants, light-colored shoes, a gray or light-colored top, and a backpack with both shoulder straps on.  
  – Their legs are bent upward, suggesting a jump or leap over (or possibly onto) the car.  

• Surrounding crowd:  
  – Many people, mostly on foot, are gathered in the street around the vehicle.  
  – Several individuals seem to be in motion, with some gesturing or moving toward the car.  
  – Their clothing varies, including T-shirts, sweatpants, and jackets in different c

## Multiple Image Inputs
The Responses API can take in and process multiple image inputs. The model processes each image and uses the information to answer questions about all images or each image independently.

In [None]:
# Helper function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to our image
image_path = "./artifacts/mystery_gathering.jpg"

# Getting the Base64 string
base64_image = encode_image(image_path)


response = client.responses.create(
    model="gpt-4o-mini",
    input=[
        {
            "role": "user",
            "content": [
                { "type": "input_text", "text": "Give me one sentence describing each image you see." },
                {
                    "type": "input_image",
                    "image_url": f"data:image/jpeg;base64,{base64_image}",
                    "detail": "low",
                },
                {
                    "type": "input_image",
                    "image_url": "https://upload.wikimedia.org/wikipedia/commons/5/53/202412_Taiwan_Railway_Haifeng_EMU500_Tourist_Train_at_Houlong_Station.jpg",
                },
            ],
        }
    ],
)

print(response.output_text)
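
When the number of images varies, writing each `input_image` entry by hand becomes repetitive. The sketch below builds the same `input` structure used in the cell above from a list of image references. `build_multi_image_input` is a hypothetical helper, and each image is given as an `(image_url, detail)` pair, where `detail` may be `None` to leave the field out.

```python
def build_multi_image_input(prompt, images):
    """Build a Responses API `input` list with one prompt and many images.

    `images` is an iterable of (image_url, detail) pairs. `image_url` may be
    a web URL or a Base64 data URL; `detail` is "low", "high", "auto" or None.
    """
    content = [{"type": "input_text", "text": prompt}]
    for image_url, detail in images:
        part = {"type": "input_image", "image_url": image_url}
        if detail is not None:
            part["detail"] = detail  # only include the field when requested
        content.append(part)
    return [{"role": "user", "content": content}]


# Placeholder references for illustration only.
example_input = build_multi_image_input(
    "Give me one sentence describing each image you see.",
    [
        ("data:image/jpeg;base64,AAAA", "low"),
        ("https://example.com/train.jpg", None),
    ],
)
print(len(example_input[0]["content"]))  # 3
```

The result could then be passed as `input=example_input` to `client.responses.create(...)` in place of the hand-written list.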