
# 1. OpenAI VLM (GPT) - Basics
This section demonstrates the basic usage of OpenAI's Vision Language Model (VLM) capabilities with the GPT-4.1 family (the code calls `gpt-4.1-mini`).
We will use the OpenAI API to analyze an image and return a detailed textual description.

**Support Material**

- https://platform.openai.com/docs/quickstart
- https://platform.openai.com/docs/guides/text
- https://platform.openai.com/docs/guides/images-vision?api-mode=chat
- https://platform.openai.com/docs/guides/structured-outputs


In [33]:
import openai
from dotenv import load_dotenv  
import base64
import json
import textwrap

# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')


load_dotenv()
openAIclient = openai.OpenAI()


# Path to your image
img = "images/street_scene.jpg"
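
The calls below hard-code `image/jpeg` in the data URL. As a minimal sketch, assuming you may also want to send PNG or other formats, the MIME type can be guessed from the file name with the standard library. The helper name `to_data_url` and the fallback type are assumptions made for illustration, not part of the OpenAI SDK.

```python
import base64
import mimetypes


def to_data_url(image_bytes, filename):
    """Build a base64 data URL, guessing the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        # Fallback chosen for this sketch (an assumption)
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# A tiny fake payload stands in for the bytes of a real image file
print(to_data_url(b"abc", "street_scene.jpg"))
```

With a real file, the bytes would come from `open(image_path, "rb").read()`, as in `encode_image` above.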




In [13]:
# Basic call to GPT with a prompt and an image

completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(img)}",
                        #"detail": "low"
                    }
                },
            ],
        }
    ],
)


# Wrap the text to a specified width

response = str(completion.choices[0].message.content)
print(textwrap.fill(response, width=120))


The image depicts a busy urban street scene at a crosswalk. Several people are engaged in different activities: one
person is sitting on the ground using a tablet or phone, another person is lying down on the pavement nearby, a man is
playing a guitar while walking across the street, and others are sitting on a wooden bench reading or contemplating.
There are pigeons scattered around the area near the bench. Vehicles, including a motorcycle, a scooter, and cars, are
moving along the road. The backdrop features tall buildings under a bright sky, and there is a traffic light overhead.
The setting seems to be a lively city environment during the day.
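
The call above leaves the `detail` option commented out. The OpenAI vision guide lists `"low"`, `"high"`, and `"auto"` as values for it. The sketch below shows one way to build the user message with an optional `detail` setting; the helper name `build_image_message` is hypothetical and only assembles a plain dictionary.

```python
def build_image_message(prompt, data_url, detail=None):
    """Assemble a chat message that carries a text prompt and one image."""
    image_url = {"url": data_url}
    if detail is not None:
        # Only add the key when a value was requested
        image_url["detail"] = detail
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": image_url},
        ],
    }


# Placeholder data URL, not a real image
message = build_image_message(
    "What's in this image?", "data:image/jpeg;base64,AAAA", detail="low"
)
print(message["content"][1]["image_url"])
```

The resulting dictionary can be passed as an element of `messages` in `chat.completions.create`.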



# 1.1 Structured Output
Here, we expand upon the VLM example to request structured outputs. This approach allows for extracting 
well-organized information from images in a machine-readable format, such as JSON.

**Support Material**:
- https://platform.openai.com/docs/guides/structured-outputs


In [27]:
completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "you are a careful observer. the response should be in json format"},
        {"role": "user", "content": [
                {"type": "text", "text": "Describe the image in detail"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(img)}",
                        #"detail": "low"
                    }
                },
            ]}
    ],
    response_format={ "type": "json_object" },# NEW!!
    temperature = 0
)

returnValue = completion.choices[0].message.content


We parse the JSON string into a Python dict:

In [28]:
output = json.loads(returnValue)

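Now we can try to access specific fields. Note that the `json_object` mode only guarantees valid JSON, not particular keys, so a guessed key may not exist: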

In [29]:
output["description"]["foreground"]

KeyError: 'description'
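
Since the keys are chosen by the model in this mode, it helps to inspect them before indexing. The sketch below uses a hypothetical response string (the real keys vary from run to run) and shows a lookup that returns `None` instead of raising `KeyError`.

```python
import json

# Hypothetical model output; the actual keys depend on the run
return_value = '{"scene": {"foreground": "two people near a crosswalk"}}'
output = json.loads(return_value)

# Look at the top-level keys first
print(list(output.keys()))

# Chained .get() calls avoid a KeyError when a key is missing
foreground = output.get("description", {}).get("foreground")
print(foreground)
```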


# JSON Schema for Controlled Structured Outputs
In this section, we define a JSON schema to get a more controlled and specific output from the model.
With this schema, we can ensure the model adheres to predefined data types and structures while describing images. In this case we provide the JSON schema directly in the request.



In [21]:
completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "you are a careful observer. the response should be in json format"},
        {"role": "user", "content": [
                {"type": "text", "text": "Describe the image in detail"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(img)}",
                        #"detail": "low"
                    }
                },
            ]}
    ],
    response_format={
                "type": "json_schema",    
                "json_schema": {
                    "name": "img_extract",
                    "schema": {
                    "type": "object",
                    "properties": {
                        "numberOfPeople": {
                        "type":"integer",
                        "description": "The total number of people in the environment",
                        "minimum": 0
                        },
                        "atmosphere": {
                        "type": "string",
                        "description": "Description of the atmosphere, e.g., calm, lively, etc."
                        },
                        "hourOfTheDay": {
                        "type": "integer",
                        "description": "The hour of the day in 24-hour format",
                        "minimum": 0,
                        "maximum": 23
                        },
                        "people": {
                        "type": "array",
                        "description": "List of people and their details",
                        "items": {
                            "type": "object",
                            "properties": {
                            "position": {
                                "type": "string",
                                "description": "Position of the person in the environment, e.g., standing, sitting, etc."
                            },
                            "age": {
                                "type": "integer",
                                "description": "Age of the person",
                                "minimum": 0
                            },
                            "activity": {
                                "type": "string",
                                "description": "Activity the person is engaged in, e.g., reading, talking, etc."
                            },
                            "gender": {
                                "type": "string",
                                "description": "Gender of the person",
                                "enum": ["male", "female", "non-binary", "other", "prefer not to say"]
                            }
                            },
                            "required": ["position", "age", "activity", "gender"]
                        }
                        }
                    },
                    "required": ["numberOfPeople", "atmosphere", "hourOfTheDay", "people"]
                    }}},
    temperature = 0
)

returnValue = completion.choices[0].message.content


In [22]:
output_image_extraction = json.loads(returnValue)

In [23]:
output_image_extraction["people"]

[{'position': 'sitting on the sidewalk',
  'age': 16,
  'activity': 'using a smartphone',
  'gender': 'male'},
 {'position': 'lying on the sidewalk',
  'age': 18,
  'activity': 'resting or sleeping',
  'gender': 'male'},
 {'position': 'sitting on a bench',
  'age': 65,
  'activity': 'thinking or resting',
  'gender': 'male'},
 {'position': 'sitting on a bench',
  'age': 25,
  'activity': 'reading a newspaper',
  'gender': 'female'},
 {'position': 'walking on the sidewalk',
  'age': 20,
  'activity': 'using a smartphone',
  'gender': 'female'},
 {'position': 'riding a motorcycle',
  'age': 30,
  'activity': 'driving',
  'gender': 'male'},
 {'position': 'walking on the street',
  'age': 28,
  'activity': 'playing guitar',
  'gender': 'male'},
 {'position': 'riding a scooter',
  'age': 27,
  'activity': 'driving',
  'gender': 'female'},
 {'position': 'walking on the sidewalk',
  'age': 35,
  'activity': 'walking',
  'gender': 'female'},
 {'position': 'walking on the sidewalk',
  'age': 40
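
Once the output follows a known schema, ordinary Python can work on it. The sketch below is an illustration only: it copies the first three entries of the list above into a small dictionary (the `numberOfPeople` value is set to match for the example), then finds the youngest person and checks that the reported count agrees with the list length. The function names are hypothetical.

```python
# First three entries from the output above; numberOfPeople is set for this example
extraction = {
    "numberOfPeople": 3,
    "people": [
        {"position": "sitting on the sidewalk", "age": 16,
         "activity": "using a smartphone", "gender": "male"},
        {"position": "lying on the sidewalk", "age": 18,
         "activity": "resting or sleeping", "gender": "male"},
        {"position": "sitting on a bench", "age": 65,
         "activity": "thinking or resting", "gender": "male"},
    ],
}


def youngest_person(people):
    """Return the entry with the smallest age."""
    return min(people, key=lambda person: person["age"])


def count_is_consistent(data):
    """Check that numberOfPeople agrees with the length of the people list."""
    return data["numberOfPeople"] == len(data["people"])


print(youngest_person(extraction["people"])["activity"])
print(count_is_consistent(extraction))
```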

Alternatively:


The OpenAI SDKs for Python and JavaScript also make it easy to define object schemas using Pydantic and Zod, respectively. Below, you can see how to extract information from the image so that it conforms to a schema defined in code.


In [19]:
from pydantic import BaseModel


class Person(BaseModel):
    position: str 
    age: int 
    activity: str 
    gender: str


class ImageExtraction(BaseModel):
    number_of_people: int 
    atmosphere: str 
    hour_of_the_day: int 
    people: list[Person] 

completion = openAIclient.beta.chat.completions.parse(
    model="gpt-4.1-mini",
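    messages=[
        {"role": "system", "content": "you are a careful observer. the response should be in json format"},
        # The image must be part of the user message; text alone gives the model nothing to describe
        {"role": "user", "content": [
                {"type": "text", "text": "Describe the image in detail"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(img)}",
                    }
                },
            ]}
    ],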
    response_format=ImageExtraction,
)

output_image_extraction = completion.choices[0].message.parsed

We can then include the extracted information, in full or in part, in a new prompt for a further extraction:

In [5]:
# Alert service prompt

alert_sys_prompt = "You are an experienced first-aid paramedic."
alert_prompt = """From the following scene analysis, given to you in JSON format, determine
whether anyone might be in danger and whether the children's hospital or the regular hospital should be alerted.
Give a concise answer.
The situation is given to you in this object: """ + str(output_image_extraction)


In [8]:

completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        # The system prompt sets the role; the user prompt carries the scene analysis
        {"role": "system", "content": alert_sys_prompt},
        {"role": "user", "content": alert_prompt}
    ],
)


# Wrap the text to a specified width

response = str(completion.choices[0].message.content)
print(textwrap.fill(response, width=120))

No one appears to be in immediate danger based on the given information. No hospital, child or normal, needs to be
alerted.


In [26]:
completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {
            "role": "user",
            "content": [
                # output_image_extraction is the Pydantic model from the extraction above, so we use attribute access
                {"type": "text", "text": f"Considering this list of people: {output_image_extraction.people}. Identify the youngest in the picture I provide and give me back their coordinates. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(img)}",
                        #"detail": "low"
                    }
                },
            ],
        }
    ],
)


# Wrap the text to a specified width

response = str(completion.choices[0].message.content)
print(textwrap.fill(response, width=120))

The youngest person in the list is the 16-year-old male sitting on the sidewalk using a smartphone.  In the image, the
person fitting this description is the one seated on the lower left corner, near the crosswalk, interacting with a
smartphone or handheld device.  Their coordinates normalized to 0-1000 in box_2d format [ymin, xmin, ymax, xmax] are
approximately: [720, 320, 940, 470]
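
The box is returned as `[ymin, xmin, ymax, xmax]` on a 0-1000 scale, so it has to be rescaled before it can be used to crop or draw on the image. The sketch below does that conversion; the function name and the 1200x800 image size are assumptions for illustration, since the real size of `street_scene.jpg` is not stated here.

```python
def box_to_pixels(box_2d, image_width, image_height):
    """Convert [ymin, xmin, ymax, xmax] on a 0-1000 scale to (left, top, right, bottom) pixels."""
    ymin, xmin, ymax, xmax = box_2d
    left = round(xmin * image_width / 1000)
    top = round(ymin * image_height / 1000)
    right = round(xmax * image_width / 1000)
    bottom = round(ymax * image_height / 1000)
    return left, top, right, bottom


# Box from the response above; the image size is a made-up example
print(box_to_pixels([720, 320, 940, 470], image_width=1200, image_height=800))
# → (384, 576, 564, 752)
```

The `(left, top, right, bottom)` order matches what image libraries such as Pillow expect for crop and rectangle operations.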



# 2. Google VLM (Gemini)
This section demonstrates the use of Google's Vision Language Model, Gemini. 
We explore basic text generation as well as its ability to analyze images and provide relevant outputs.

**Support Material**:
- https://ai.google.dev/gemini-api/docs/quickstart
- https://ai.google.dev/gemini-api/docs/text-generation
- https://ai.google.dev/gemini-api/docs/image-understanding
- https://ai.google.dev/gemini-api/docs/structured-output?example=recipe

In [3]:
%matplotlib inline
from dotenv import load_dotenv  
from google import genai
from PIL import Image
import textwrap

import json


load_dotenv()
client = genai.Client()

# Path to your image
img = "images/street_scene.jpg"

Basic call:

In [4]:
response = client.models.generate_content(
    model="gemini-2.5-flash", contents="Explain how AI works to a 90-year-old, in a few words"
)

print(textwrap.fill(response.text, width=120))

It's like a very smart, helpful assistant in a computer. It learns from lots of examples, just like a person learns from
experience. Then, it uses what it learned to figure things out and help you.


And with images:

In [30]:
im = Image.open(img)

response = client.models.generate_content(model="gemini-2.5-flash",
                                          contents=[im, "Describe the scene in detail\n"],
                                          )

print(textwrap.fill(response.text, width=120))


This image captures a vibrant and bustling urban street scene, likely during the golden hour of either a late afternoon
or early morning, judging by the warm, directional light and long shadows. The overall impression is one of dynamic city
life, with a mix of motion and moments of repose.  **Foreground (Bottom of the image):** At the very bottom left, a
wooden planter filled with bright red flowers sits on the pavement. To its right, a young person with short brown hair
is sitting cross-legged on the sidewalk, engrossed in a tablet or smartphone. Adjacent to them, another young person is
lying flat on their back on the pavement, head resting on their left arm, gazing upwards or towards the viewer. They are
wearing a red top and blue jeans. Several pigeons are scattered around this area, some pecking at the ground, others
standing.  Towards the center and right foreground, a wooden park bench with black metal supports is visible. An older
man in a dark suit sits on the left side of the

Structured output can be requested here as well. Gemini actually prefers Pydantic syntax, so let's see what happens when we reuse the same schema as before. For the limitations, check https://ai.google.dev/gemini-api/docs/structured-output?example=recipe

In [32]:
json_schema = {
                    "name": "img_extract",
                    "schema": {
                    "type": "object",
                    "properties": {
                        "numberOfPeople": {
                        "type":"integer",
                        "description": "The total number of people in the environment",
                        "minimum": 0
                        },
                        "atmosphere": {
                        "type": "string",
                        "description": "Description of the atmosphere, e.g., calm, lively, etc."
                        },
                        "hourOfTheDay": {
                        "type": "integer",
                        "description": "The hour of the day in 24-hour format",
                        "minimum": 0,
                        "maximum": 23
                        },
                        "people": {
                        "type": "array",
                        "description": "List of people and their details",
                        "items": {
                            "type": "object",
                            "properties": {
                            "position": {
                                "type": "string",
                                "description": "Position of the person in the environment, e.g., standing, sitting, etc."
                            },
                            "age": {
                                "type": "integer",
                                "description": "Age of the person",
                                "minimum": 0
                            },
                            "activity": {
                                "type": "string",
                                "description": "Activity the person is engaged in, e.g., reading, talking, etc."
                            },
                            "gender": {
                                "type": "string",
                                "description": "Gender of the person",
                                "enum": ["male", "female", "non-binary", "other", "prefer not to say"]
                            }
                            },
                            "required": ["position", "age", "activity", "gender"]
                        }
                        }
                    },
                    "required": ["numberOfPeople", "atmosphere", "hourOfTheDay", "people"]}}



config={
        "response_mime_type": "application/json",
        "response_json_schema": json_schema,
    }


response = client.models.generate_content(model="gemini-2.5-flash",
                                          contents=[im, "Describe the scene in detail, following exactly the given json schema\n"],
                                          config=config
                                          )



print(response.text)

{
  "description": "A bustling urban street scene unfolds under the warm, golden light of what appears to be late afternoon or early morning. The foreground is dominated by a wide pedestrian crosswalk with black and white stripes. On the left side of the crosswalk, a young person with short hair sits cross-legged on the sidewalk, looking down at a tablet in their hands. Beside them, a large wooden pot with vibrant red flowers adds a splash of color. In front of the crosswalk, a young man in a red jacket and blue jeans is lying flat on his back on the sidewalk, looking upwards. Several pigeons are scattered on the pavement near these individuals and further along. Vehicles are in motion on the street, indicated by motion blur: a silver car with a yellow sign on its roof (possibly a taxi) is prominently visible crossing the street from left to right, and an orange car is partially visible behind it, moving in the same direction. In the midground, a man on a motorcycle, wearing a black le

Let's try to use Gemini to detect an object in the image and get its coordinates:


In [30]:
prompt = "Identify the youngest in the picture and give me back their coordinates. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."


config={"response_mime_type": "application/json"}

response = client.models.generate_content(model="gemini-2.5-flash",
                                          # Pass the image itself; img alone is only the file path string
                                          contents=[Image.open(img), prompt],
                                          config=config
                                          )

bounding_boxes = json.loads(response.text)
print(bounding_boxes)


{'box_2d': [542, 771, 644, 814]}


Gemini 2.0 and later models were trained specifically for object detection and segmentation tasks. More details: https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Spatial_understanding.ipynb


## Extract Structured Information from a Handwritten Note - GPT & Gemini

Let's now try to extract structured information from a handwritten note, using both models. Consider the file images/prescription1.jpg and have a look at it.

Let's define a JSON schema for it:

In [4]:
json_schema_prescription = {
 "name": "prescription_extract",
"schema": {
  "type": "object",
  "properties": {
    "doctor_name": { "type": "string" },
    "patient_name": { "type": "string" },
    "patient_dob": { "type": "string" },
    "meds": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "dose": { "type": "string" },
          "frequency": { "type": "string" },
          "instructions": { "type": "string" }
        },
        "required": ["name"]
      }
    },
    "signature": { "type": "boolean" }
  },
  "required": ["doctor_name", "patient_name", "meds"]
}}

Extract the structured information using Gemini:

In [28]:
im = Image.open("images/prescription1.jpg")

config={
        "response_mime_type": "application/json",
        "response_json_schema": json_schema_prescription,
    }


response = client.models.generate_content(model="gemini-2.5-flash",
                                          contents=[im, "Extract the information from the image, following the given json schema.\n"],
                                          config=config
                                          )



print(response.text)

{
"doctor_name": "Dr. Markus Müller",
"doctor_id": null,
"patient_name": "Claudie Fischer",
"patient_dob": "1.4.1978",
"patient_gender": "f",
"medication": "Ibuprofen",
"dosage": "400mg",
"frequency": "3x",
"instructions": "nach dem Essen",
"signature": "Dr. Markus Müller's signature"
}
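
The keys in the output above differ from those in `json_schema_prescription` (there is no `meds` array, for example). One possible cause, stated here as an assumption and not verified, is that the dictionary still carries the OpenAI-style `name`/`schema` wrapper, while the Gemini documentation passes the JSON Schema itself as `response_json_schema`. The sketch below unwraps the schema before building the config; the helper name `gemini_config_from` is hypothetical.

```python
def gemini_config_from(schema_dict):
    """Build a Gemini config from either a bare JSON Schema or an OpenAI-style wrapper."""
    # If the dict has a "schema" key, use the inner schema; otherwise use it as given
    schema = schema_dict.get("schema", schema_dict)
    return {
        "response_mime_type": "application/json",
        "response_json_schema": schema,
    }


# Shortened stand-in for json_schema_prescription
wrapped = {
    "name": "prescription_extract",
    "schema": {
        "type": "object",
        "properties": {"doctor_name": {"type": "string"}},
        "required": ["doctor_name"],
    },
}

config = gemini_config_from(wrapped)
print(config["response_json_schema"]["type"])
```

The resulting `config` can be passed to `client.models.generate_content` in place of the one above to see whether the output then follows the schema.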


In this case the JSON text may need to be cleaned up before it can be loaded into a Python dict. The function below is one example of how to do that.
(Since the response object returned by the Gemini SDK is actually a Pydantic model, one could also try methods from the Pydantic library; we avoid that here for general compatibility.)

In [16]:
import re
import json

def parse_json_in_output(output):
    """
    Extracts JSON data from the given text output and converts it to a Python dictionary.

    Args:
        output (str): The text output containing the JSON data.

    Returns:
        dict | str: The parsed JSON data, or an error message if no valid JSON is found.
    """
    # Greedy match from the first "{" to the last "}" so nested objects stay whole
    json_match = re.search(r"\{.*\}", output, re.DOTALL)
    if not json_match:
        return "No JSON data found in the given output."
    json_str = json_match.group(0)
    try:
        # Try the text as-is first: valid JSON may contain apostrophes inside strings
        return json.loads(json_str)
    except json.JSONDecodeError:
        pass
    try:
        # Fallback for Python-style output: replace single quotes with double quotes
        return json.loads(json_str.replace("'", '"'))
    except json.JSONDecodeError:
        return "The extracted JSON is still not valid after formatting."

In [17]:

print(parse_json_in_output(response.text))

{'doctor_name': 'Dr. Markus Hütter', 'patient_name': 'Claude Fischer', 'patient_dob': '1.4.1978', 'patient_gender': 'f', 'medication': 'Ibuprofen', 'dosage': '400mg', 'frequency': '3x', 'instructions': 'nach dem Essen', 'doctor_signature': 'Kopfhülle'}
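
As an alternative to a regular expression, the standard library's `json.JSONDecoder.raw_decode` can parse the first JSON value that starts at a given position and ignore any text after it. The sketch below scans for the first `{` that begins a valid object; the function name `first_json_object` and the sample text are made up for illustration.

```python
import json


def first_json_object(text):
    """Return the first valid JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid from this brace; try the next one
            start = text.find("{", start + 1)
    return None


sample = 'Here is the result: {"meds": [{"name": "Ibuprofen"}], "signature": true} Done.'
print(first_json_object(sample))
```

This handles nested objects and surrounding prose, but unlike the function above it does not try to repair single-quoted, Python-style output.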


Now let's do the same with GPT:

In [34]:
im = "images/prescription1.jpg"

completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "you are a careful observer. the response should be in json format"},
        {"role": "user", "content": [
                {"type": "text", "text": "Describe the image in detail"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(im)}",
                        #"detail": "low"
                    }
                },
            ]}
    ],
    response_format={
                "type": "json_schema",   "json_schema": json_schema_prescription},
    temperature = 0
)

returnValue = completion.choices[0].message.content

In [35]:
returnValue

'{"doctor_name":"Dr. Markus Müller","patient_name":"Claudia Fischer","patient_dob":"1.4.1978","meds":[{"name":"Ibuprofen","dose":"400 mg","frequency":"3x","instructions":"nach dem Essen"}],"signature":true}'

No extra parsing is needed this time. We load the JSON into a Python dict with `json.loads`:

In [37]:
print(json.loads(returnValue))

{'doctor_name': 'Dr. Markus Müller', 'patient_name': 'Claudia Fischer', 'patient_dob': '1.4.1978', 'meds': [{'name': 'Ibuprofen', 'dose': '400 mg', 'frequency': '3x', 'instructions': 'nach dem Essen'}], 'signature': True}
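
To compare how well each output follows the schema, a small check can list the required keys that are missing. The sketch below is a minimal illustration: it looks only at the top-level `required` list, uses shortened copies of the two outputs shown above, and the function name `missing_required` is hypothetical.

```python
def missing_required(result, schema):
    """List the top-level required keys of the schema that are absent from result."""
    return [key for key in schema.get("required", []) if key not in result]


# Top-level required keys of the prescription schema
schema = {"required": ["doctor_name", "patient_name", "meds"]}

# Shortened copies of the two outputs shown in this section
gpt_result = {
    "doctor_name": "Dr. Markus Müller",
    "patient_name": "Claudia Fischer",
    "meds": [{"name": "Ibuprofen", "dose": "400 mg"}],
}
gemini_result = {
    "doctor_name": "Dr. Markus Müller",
    "patient_name": "Claudie Fischer",
    "medication": "Ibuprofen",
}

print(missing_required(gpt_result, schema))
print(missing_required(gemini_result, schema))
```

A full validation, including nested items and types, would need a JSON Schema validator; this check only covers the top level.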
