
# 1. OpenAI VLM (GPT) - Basics
This section demonstrates the basic usage of OpenAI's Vision Language Model (VLM) capabilities using GPT-4.1 mini (`gpt-4.1-mini`).
We will use the OpenAI API to analyze an image and provide detailed textual insights.

**Support Material**

- https://platform.openai.com/docs/quickstart
- https://platform.openai.com/docs/guides/text
- https://platform.openai.com/docs/guides/images-vision?api-mode=chat
- https://platform.openai.com/docs/guides/structured-outputs


In [21]:
import openai
from dotenv import load_dotenv  
import base64
import json
import textwrap

# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')


load_dotenv()
openAIclient = openai.OpenAI()


# Path to your image
img = "images/street_scene.jpg"
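The calls below hard-code `data:image/jpeg` in the data URL. As a minimal sketch for images that are not JPEGs, the MIME type can be guessed from the file extension. The helper name `to_data_url` and its `default_mime` parameter are illustrative assumptions, not part of the OpenAI SDK.

```python
import base64
import mimetypes


def to_data_url(image_path, default_mime="image/jpeg"):
    """Return a base64 data URL for the image, guessing the MIME type from the file name."""
    # guess_type returns (None, None) when the extension is unknown
    mime, _ = mimetypes.guess_type(image_path)
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime or default_mime};base64,{encoded}"
```

The result can be used wherever the cells below build the `url` field by hand.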




In [22]:
#basic call to gpt with prompt and image

completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(img)}",
                        #"detail": "low"
                    }
                },
            ],
        }
    ],
)


# Wrap the text to a specified width

response = str(completion.choices[0].message.content)
print(textwrap.fill(response, width=120))


The image depicts a lively urban street scene at a pedestrian crossing in a city with tall buildings and a mix of old
and modern architecture. Various people are engaged in different activities:  - In the foreground, a person is sitting
on the sidewalk using a smartphone or tablet. - Nearby, another person is lying on the pavement, wearing casual clothes.
- On a wooden bench, an older man in a suit appears deep in thought, while a woman next to him is reading a newspaper. -
Several pigeons are scattered around on the pavement near the seated and lying individuals. - In the background, there
are people crossing the street and vehicles, including cars, a motorcycle, and a scooter, moving past. - A person
playing the guitar is walking along the pedestrian crossing. - One woman is strolling while looking at her phone.  The
scene captures a moment of urban life with a mix of calm and movement, various modes of transportation, and people
engaged in everyday activities. The lighting suggests 
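The request above leaves the `"detail"` option commented out. The sketch below shows one way to build the message content with that option switched on or off. The helper name `build_vision_content` is hypothetical, and the accepted values (`"low"`, `"high"`, `"auto"`) are taken from the OpenAI vision guide linked above.

```python
def build_vision_content(prompt, data_url, detail=None):
    """Build the `content` list of a user message holding one text part and one image part."""
    image_url = {"url": data_url}
    if detail is not None:
        if detail not in ("low", "high", "auto"):
            raise ValueError(f"unsupported detail level: {detail!r}")
        # Only send the field when the caller asked for a specific level
        image_url["detail"] = detail
    return [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": image_url},
    ]
```

The returned list can be passed as the `"content"` of the user message; a lower detail level generally means fewer image tokens.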


# 1.1 Structured Output
Here, we expand upon the VLM example to request structured outputs. This approach allows for extracting 
well-organized information from images in a machine-readable format, such as JSON.

**Support Material**:
- https://platform.openai.com/docs/guides/structured-outputs


In [23]:
completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "you are a careful observer. the response should be in json format"},
        {"role": "user", "content": [
                {"type": "text", "text": "Describe the image in detail"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(img)}",
                        #"detail": "low"
                    }
                },
            ]}
    ],
    response_format={ "type": "json_object" },# NEW!!
    temperature = 0
)

returnValue = completion.choices[0].message.content


In [24]:
returnValue

'{\n  "scene": "Urban city street intersection during daytime",\n  "background": {\n    "buildings": [\n      {\n        "style": "Brick facade with green awnings",\n        "windows": "Multiple, reflecting sunlight",\n        "shops": "Visible with warm lighting inside"\n      },\n      {\n        "style": "Modern glass skyscrapers",\n        "height": "Tall, extending into the sky",\n        "reflection": "Sky and surrounding buildings"\n      },\n      {\n        "style": "Historic church-like building",\n        "features": "Steeple with a pointed roof",\n        "location": "Center background"\n      }\n    ],\n    "traffic_light": {\n      "color": "Yellow",\n      "position": "Hanging over the intersection"\n    },\n    "street": {\n      "crosswalk": "Wide zebra stripes",\n      "vehicles": [\n        {\n          "type": "Taxi",\n          "color": "White",\n          "motion": "Blurred, indicating movement"\n        },\n        {\n          "type": "SUV",\n          "color": 

We parse the JSON string into a Python dict:

In [25]:
output = json.loads(returnValue)
# json.loads() converts a JSON string into Python objects
print(output)

{'scene': 'Urban city street intersection during daytime', 'background': {'buildings': [{'style': 'Brick facade with green awnings', 'windows': 'Multiple, reflecting sunlight', 'shops': 'Visible with warm lighting inside'}, {'style': 'Modern glass skyscrapers', 'height': 'Tall, extending into the sky', 'reflection': 'Sky and surrounding buildings'}, {'style': 'Historic church-like building', 'features': 'Steeple with a pointed roof', 'location': 'Center background'}], 'traffic_light': {'color': 'Yellow', 'position': 'Hanging over the intersection'}, 'street': {'crosswalk': 'Wide zebra stripes', 'vehicles': [{'type': 'Taxi', 'color': 'White', 'motion': 'Blurred, indicating movement'}, {'type': 'SUV', 'color': 'Gray', 'motion': 'Moving'}, {'type': 'Sedan', 'color': 'Orange', 'motion': 'Moving'}]}}, 'foreground': {'people': [{'position': 'Left side, sitting on the ground', 'activity': 'Using a tablet', 'clothing': 'Green jacket, shorts, sneakers'}, {'position': 'Center, lying on the groun

In [26]:
list(output.keys())

['scene', 'background', 'foreground', 'lighting', 'mood']

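We can now access specific fields. Note that in plain JSON mode the model chooses the keys itself, so a field we expect (here `location`) may be missing: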

In [27]:
print(output["background"]["street"]["vehicles"][1]["location"])


KeyError: 'location'
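The `KeyError` above comes from asking for a key the model never emitted. A small defensive accessor, sketched here under the assumption that a missing field should simply yield a default value, avoids the exception. The name `dig` is illustrative.

```python
def dig(data, *path, default=None):
    """Walk nested dicts/lists along `path`; return `default` if any step is missing."""
    current = data
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Shortened sample mirroring the structure printed above
sample = {"background": {"street": {"vehicles": [
    {"type": "Taxi", "color": "White"},
    {"type": "SUV", "color": "Gray", "motion": "Moving"},
]}}}

print(dig(sample, "background", "street", "vehicles", 1, "color"))
print(dig(sample, "background", "street", "vehicles", 1, "location", default="unknown"))
```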


# JSON Schema for Controlled Structured Outputs
In this section, we define a JSON schema for a more controlled and specific output from the model. 
Using this schema, we can ensure the model adheres to predefined data types and structures while describing images. In this case, we provide the JSON schema directly.



In [28]:
completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "you are a careful observer. the response should be in json format"},
        {"role": "user", "content": [
                {"type": "text", "text": "Describe the image in detail"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(img)}",
                        #"detail": "low"
                    }
                },
            ]}
    ],
    response_format={
                "type": "json_schema",    
                "json_schema": {
                    "name": "img_extract",
                    "schema": {
                    "type": "object",
                    "properties": {
                        "numberOfPeople": {
                        "type":"integer",
                        "description": "The total number of people in the environment",
                        "minimum": 0
                        },
                        "atmosphere": {
                        "type": "string",
                        "description": "Description of the atmosphere, e.g., calm, lively, etc."
                        },
                        "hourOfTheDay": {
                        "type": "integer",
                        "description": "The hour of the day in 24-hour format",
                        "minimum": 0,
                        "maximum": 23
                        },
                        "people": {
                        "type": "array",
                        "description": "List of people and their details",
                        "items": {
                            "type": "object",
                            "properties": {
                            "position": {
                                "type": "string",
                                "description": "Position of the person in the environment, e.g., standing, sitting, etc."
                            },
                            "age": {
                                "type": "integer",
                                "description": "Age of the person",
                                "minimum": 0
                            },
                            "activity": {
                                "type": "string",
                                "description": "Activity the person is engaged in, e.g., reading, talking, etc."
                            },
                            "gender": {
                                "type": "string",
                                "description": "Gender of the person",
                                "enum": ["male", "female", "non-binary", "other", "prefer not to say"]
                            }
                            },
                            "required": ["position", "age", "activity", "gender"]
                        }
                        }
                    },
                    "required": ["numberOfPeople", "atmosphere", "hourOfTheDay", "people"]
                    }}},
    temperature = 0
)

returnValue = completion.choices[0].message.content


In [29]:
returnValue

'{"numberOfPeople":12,"atmosphere":"busy urban with a mix of calm and activity","hourOfTheDay":17,"people":[{"position":"sitting on the ground near a flower pot","age":16,"activity":"using a smartphone","gender":"male"},{"position":"lying on the ground","age":18,"activity":"resting or sleeping","gender":"male"},{"position":"sitting on a bench","age":65,"activity":"reading a newspaper","gender":"female"},{"position":"sitting on a bench","age":70,"activity":"thinking or resting with hand on face","gender":"male"},{"position":"walking on the sidewalk near the bench","age":20,"activity":"looking at a smartphone","gender":"female"},{"position":"riding a motorcycle","age":30,"activity":"driving","gender":"male"},{"position":"walking on the street playing guitar","age":25,"activity":"playing guitar","gender":"male"},{"position":"riding a scooter","age":28,"activity":"driving","gender":"female"},{"position":"walking on the sidewalk in the background","age":30,"activity":"walking","gender":"fem

In [30]:
output_image_extraction = json.loads(returnValue)
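Even when the response follows the schema, its values can still be inconsistent with each other. The sketch below runs a few sanity checks using only the standard library. The function name `check_extraction` is hypothetical, and comparing `numberOfPeople` with the length of `people` is an added assumption, not something the schema requires.

```python
def check_extraction(data):
    """Return a list of problems found in the extraction; an empty list means all checks passed."""
    problems = []
    for key in ("numberOfPeople", "atmosphere", "hourOfTheDay", "people"):
        if key not in data:
            problems.append(f"missing key: {key}")
    if problems:
        return problems
    if not 0 <= data["hourOfTheDay"] <= 23:
        problems.append("hourOfTheDay outside 0-23")
    if data["numberOfPeople"] != len(data["people"]):
        problems.append(
            f"numberOfPeople={data['numberOfPeople']} but {len(data['people'])} people listed"
        )
    for i, person in enumerate(data["people"]):
        missing = {"position", "age", "activity", "gender"} - person.keys()
        if missing:
            problems.append(f"people[{i}] missing: {sorted(missing)}")
    return problems
```

Calling `check_extraction(output_image_extraction)` would then list any mismatches, for example when the reported count and the number of listed people disagree.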


In [31]:
output_image_extraction["people"]

[{'position': 'sitting on the ground near a flower pot',
  'age': 16,
  'activity': 'using a smartphone',
  'gender': 'male'},
 {'position': 'lying on the ground',
  'age': 18,
  'activity': 'resting or sleeping',
  'gender': 'male'},
 {'position': 'sitting on a bench',
  'age': 65,
  'activity': 'reading a newspaper',
  'gender': 'female'},
 {'position': 'sitting on a bench',
  'age': 70,
  'activity': 'thinking or resting with hand on face',
  'gender': 'male'},
 {'position': 'walking on the sidewalk near the bench',
  'age': 20,
  'activity': 'looking at a smartphone',
  'gender': 'female'},
 {'position': 'riding a motorcycle',
  'age': 30,
  'activity': 'driving',
  'gender': 'male'},
 {'position': 'walking on the street playing guitar',
  'age': 25,
  'activity': 'playing guitar',
  'gender': 'male'},
 {'position': 'riding a scooter',
  'age': 28,
  'activity': 'driving',
  'gender': 'female'},
 {'position': 'walking on the sidewalk in the background',
  'age': 30,
  'activity': '

Alternatively: 


OpenAI SDKs for Python and JavaScript also make it easy to define object schemas using Pydantic and Zod respectively. Below, you can see how the same kind of extraction can be requested with a schema defined in code.


```python
from pydantic import BaseModel


class Person(BaseModel):
    position: str
    age: int
    activity: str
    gender: str


class ImageExtraction(BaseModel):
    number_of_people: int
    atmosphere: str
    hour_of_the_day: int
    people: list[Person]


completion = openAIclient.beta.chat.completions.parse(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "you are a careful observer. the response should be in json format"},
        {"role": "user", "content": [
            {"type": "text", "text": "describe the image in detail"},
            # Attach the image, as in the earlier cells
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(img)}"}},
        ]},
    ],
    response_format=ImageExtraction,
)

# `parsed` is an ImageExtraction instance, not a plain dict (use .people rather than ["people"])
output_image_extraction = completion.choices[0].message.parsed
```


We can then integrate the extracted information, in full or in part, into a new prompt for a further extraction step.

In [32]:
#alert service prompt 

alert_sys_prompt = "You are an experienced first-aid paramedic."
alert_prompt = """From the following scene analysis, given to you in JSON format, determine
whether anyone might be in danger and whether the Child Hospital or the normal Hospital should be alerted.
Give a concise answer.
The situation is given to you in this object: """ + str(output_image_extraction)


In [33]:

completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": alert_sys_prompt},
        {"role": "user", "content": alert_prompt}
    ],
)


# Wrap the text to a specified width

response = str(completion.choices[0].message.content)
print(textwrap.fill(response, width=120))

No one appears to be in immediate danger. The person lying on the ground (age 18) seems to be resting or sleeping, not
injured. No hospital alert is necessary. If any medical facility should be informed, it would be a normal hospital, not
a child hospital, since the youngest individual is 16 and no signs of distress are present.
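The alert prompt says the scene is given in JSON format, but `str()` on a dict produces Python's repr (single quotes, `True`/`None`). A hedged alternative is to serialize with `json.dumps`. The helper name `build_alert_prompt` and the exact prompt wording below are illustrative assumptions.

```python
import json


def build_alert_prompt(extraction):
    """Build the alert prompt with the extraction embedded as real JSON text."""
    # ensure_ascii=False keeps characters such as umlauts readable for the model
    scene_json = json.dumps(extraction, ensure_ascii=False, indent=2)
    return (
        "From the following scene analysis, given to you in JSON format, determine "
        "whether anyone might be in danger and which hospital should be alerted. "
        "Give a concise answer.\n"
        "Scene analysis:\n" + scene_json
    )
```

Whether this changes the model's answer in practice is not shown here; it only makes the prompt match its own description.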


In [34]:
completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Considering this list of people: " + str(output_image_extraction["people"]) + ". Identify the youngest in the picture I provide and give me back their coordinates. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(img)}",
                        #"detail": "low"
                    }
                },
            ],
        }
    ],
)


# Wrap the text to a specified width

response = str(completion.choices[0].message.content)
print(textwrap.fill(response, width=120))

The youngest person in the provided list is the 16-year-old male sitting on the ground near a flower pot using a
smartphone.  In the image, this person corresponds to the young man sitting on the ground in the foreground, wearing a
green jacket and looking at a phone.  Based on the image and normalized to 0-1000 in [ymin, xmin, ymax, xmax] format,
his approximate 2D bounding box coordinates are:  [710, 230, 960, 480]
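To use such a box on the actual image, the normalized values have to be scaled to pixels. The following is a minimal sketch, assuming the `[ymin, xmin, ymax, xmax]` order and the 0-1000 range requested in the prompt. The function name `box_to_pixels` is illustrative.

```python
def box_to_pixels(box_2d, width, height):
    """Convert [ymin, xmin, ymax, xmax] normalized to 0-1000 into (left, top, right, bottom) pixels."""
    ymin, xmin, ymax, xmax = box_2d
    left = round(xmin * width / 1000)
    top = round(ymin * height / 1000)
    right = round(xmax * width / 1000)
    bottom = round(ymax * height / 1000)
    return left, top, right, bottom


# Box reported above, applied to a hypothetical 2000x1000 pixel image
print(box_to_pixels([710, 230, 960, 480], width=2000, height=1000))
```

The `(left, top, right, bottom)` order matches what Pillow's `Image.crop` expects, so the result can be used to cut out or outline the detected person.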



# 2. Google VLM (Gemini)
This section demonstrates the use of Google's Vision Language Model, Gemini. 
We explore basic text generation as well as its ability to analyze images and provide relevant outputs.

**Support Material**:
- https://ai.google.dev/gemini-api/docs/quickstart
- https://ai.google.dev/gemini-api/docs/text-generation
- https://ai.google.dev/gemini-api/docs/image-understanding
- https://ai.google.dev/gemini-api/docs/structured-output?example=recipe

In [35]:
%matplotlib inline
from dotenv import load_dotenv  
from google import genai
from PIL import Image
import textwrap

import json


load_dotenv()
client = genai.Client()

# Path to your image
img = "images/street_scene.jpg"

Basic call:

In [36]:
response = client.models.generate_content(
    model="gemini-2.5-flash", contents="Explain how AI works to a 90-year-old, in a few words"
)

print(textwrap.fill(response.text, width=120))

ClientError: 400 INVALID_ARGUMENT. {'error': {'code': 400, 'message': 'API key expired. Please renew the API key.', 'status': 'INVALID_ARGUMENT', 'details': [{'@type': 'type.googleapis.com/google.rpc.ErrorInfo', 'reason': 'API_KEY_INVALID', 'domain': 'googleapis.com', 'metadata': {'service': 'generativelanguage.googleapis.com'}}, {'@type': 'type.googleapis.com/google.rpc.LocalizedMessage', 'locale': 'en-US', 'message': 'API key expired. Please renew the API key.'}]}}

And with images:

In [None]:
im = Image.open(img)

response = client.models.generate_content(model="gemini-2.5-flash",
                                          contents=[im, "Describe the scene in detail\n"],
                                          )

print(textwrap.fill(response.text, width=120))


This bustling urban scene captures a moment in a vibrant city during what appears to be late afternoon or early evening,
bathed in the warm, golden light of a low sun. The image is rich with activity, showcasing a blend of architectural
styles and diverse individuals going about their day.  In the **foreground**, a wide pedestrian crosswalk with bold
black and white stripes diagonally cuts across the bottom left. On the sidewalk to the left, a small pot of red
geraniums sits. Next to it, a young person with short brown hair sits cross-legged, engrossed in a tablet or phone.
Further to the right and slightly in front, another young person, wearing a red hoodie and blue jeans, lies casually on
their back on the pavement, looking upwards. Several pigeons are scattered on the sidewalk and crosswalk, pecking at the
ground, adding to the authentic urban feel.  On the right side of the foreground, a classic wooden park bench is
occupied by two individuals. An older man in a dark suit sits on 

Here, too, we can extract structured output. Gemini actually prefers Pydantic syntax, so let's see what happens when we pass the same JSON schema as before. Check the limitations at https://ai.google.dev/gemini-api/docs/structured-output?example=recipe

In [None]:
json_schema = {
                    "name": "img_extract",
                    "schema": {
                    "type": "object",
                    "properties": {
                        "numberOfPeople": {
                        "type":"integer",
                        "description": "The total number of people in the environment",
                        "minimum": 0
                        },
                        "atmosphere": {
                        "type": "string",
                        "description": "Description of the atmosphere, e.g., calm, lively, etc."
                        },
                        "hourOfTheDay": {
                        "type": "integer",
                        "description": "The hour of the day in 24-hour format",
                        "minimum": 0,
                        "maximum": 23
                        },
                        "people": {
                        "type": "array",
                        "description": "List of people and their details",
                        "items": {
                            "type": "object",
                            "properties": {
                            "position": {
                                "type": "string",
                                "description": "Position of the person in the environment, e.g., standing, sitting, etc."
                            },
                            "age": {
                                "type": "integer",
                                "description": "Age of the person",
                                "minimum": 0
                            },
                            "activity": {
                                "type": "string",
                                "description": "Activity the person is engaged in, e.g., reading, talking, etc."
                            },
                            "gender": {
                                "type": "string",
                                "description": "Gender of the person",
                                "enum": ["male", "female", "non-binary", "other", "prefer not to say"]
                            }
                            },
                            "required": ["position", "age", "activity", "gender"]
                        }
                        }
                    },
                    "required": ["numberOfPeople", "atmosphere", "hourOfTheDay", "people"]}}



config={
        "response_mime_type": "application/json",
        "response_json_schema": json_schema,
    }


response = client.models.generate_content(model="gemini-2.5-flash",
                                          contents=[im, "Describe the scene in detail, following exactly the given JSON schema\n"],
                                          config=config
                                          )



print(response.text)

{
  "visual_description": "A dynamic street scene in a bustling city, featuring a mix of people, vehicles, and architecture under a soft, golden hour light. The foreground shows a wide crosswalk where several individuals are engaged in various activities, while cars and motorcycles move past. Tall buildings, ranging from classic brick structures to modern skyscrapers, line the street, creating a deep urban perspective. Pigeons are scattered on the sidewalk, and a single potted plant adds a touch of nature.",
  "elements": [
    "crosswalk",
    "traffic light",
    "street lamps",
    "buildings",
    "skyscrapers",
    "cars",
    "motorcycle",
    "scooter",
    "benches",
    "pigeons",
    "potted plant with red flowers",
    "sidewalk"
  ],
  "people": [
    "A man in a dark jacket and helmet riding a motorcycle across the crosswalk.",
    "A man in a dark jacket and hat walking and playing a guitar across the crosswalk.",
    "A woman riding a scooter in the background, to the ri

Does it match your schema?
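One way to answer this programmatically is to compare the response keys with the schema's `required` list. One possible explanation for a mismatch, offered as an assumption rather than a confirmed diagnosis, is that the dict passed above still carries the `name`/`schema` wrapper used in OpenAI's `response_format`, whereas the JSON Schema itself starts at the inner `schema` object. Both helper names below are hypothetical.

```python
def unwrap_schema(schema_or_wrapper):
    """Return the bare JSON Schema when given an OpenAI-style {"name": ..., "schema": ...} wrapper."""
    if "schema" in schema_or_wrapper and "type" not in schema_or_wrapper:
        return schema_or_wrapper["schema"]
    return schema_or_wrapper


def missing_required(schema, data):
    """List the top-level required keys of `schema` that are absent from `data`."""
    return [key for key in schema.get("required", []) if key not in data]


# Shortened stand-ins for the schema and the response shown above
wrapper = {
    "name": "img_extract",
    "schema": {
        "type": "object",
        "required": ["numberOfPeople", "atmosphere", "hourOfTheDay", "people"],
    },
}
response_data = {"visual_description": "...", "elements": [], "people": []}

print(missing_required(unwrap_schema(wrapper), response_data))
```

If the wrapper is indeed the cause, passing `unwrap_schema(json_schema)` as `response_json_schema` would be the thing to try.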

Let's try to use Gemini to detect an object in the image and get its coordinates:


In [None]:
prompt = "Identify the youngest in the picture and give me back their coordinates. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000."


config={"response_mime_type": "application/json"}

response = client.models.generate_content(model="gemini-2.5-flash",
                                          contents=[im, prompt],  # pass the opened image, not the path string
                                          config=config
                                          )

bounding_boxes = json.loads(response.text)
print(bounding_boxes)


{'box_2d': [664, 461, 794, 513]}


Gemini 2+ models were trained specifically for object detection / segmentation tasks. More details: https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Spatial_understanding.ipynb
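Since both models were asked for the same person, their boxes can be compared with intersection over union (IoU). This is a sketch using the two boxes printed in this notebook; it assumes both use the same `[ymin, xmin, ymax, xmax]` convention, and the function name `iou` is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two [ymin, xmin, ymax, xmax] boxes in the same coordinate system."""
    ya1, xa1, ya2, xa2 = box_a
    yb1, xb1, yb2, xb2 = box_b
    # Overlap along each axis, clamped at zero when the boxes do not touch
    inter_h = max(0, min(ya2, yb2) - max(ya1, yb1))
    inter_w = max(0, min(xa2, xb2) - max(xa1, xb1))
    intersection = inter_h * inter_w
    area_a = (ya2 - ya1) * (xa2 - xa1)
    area_b = (yb2 - yb1) * (xb2 - xb1)
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0


gpt_box = [710, 230, 960, 480]     # from the GPT answer above
gemini_box = [664, 461, 794, 513]  # from the Gemini answer above
print(round(iou(gpt_box, gemini_box), 3))
```

A value close to 0 means the boxes barely overlap, so at least one of the two answers does not mark the same person; checking against the image itself is needed to tell which.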


## 3. Extract Structured Information from a Handwritten Note - GPT & Gemini

Let’s now try to extract structured information from a handwritten note (e.g., `prescription1.jpg`) using **both models**.

Consider the file: `images/prescription1.jpg`.  
Have a look at it.

### JSON Schema
Let’s define a JSON schema for the extraction task:


In [None]:
json_schema_prescription = {
 "name": "prescription_extract",
"schema": {
  "type": "object",
  "properties": {
    "doctor_name": { "type": "string" },
    "patient_name": { "type": "string" },
    "patient_dob": { "type": "string" },
    "meds": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "dose": { "type": "string" },
          "frequency": { "type": "string" },
          "instructions": { "type": "string" }
        },
        "required": ["name"]
      }
    },
    "signature": { "type": "boolean" }
  },
  "required": ["doctor_name", "patient_name", "meds"]
}}

Extract structured information using Gemini:

In [None]:
im = Image.open("images/prescription1.jpg")

config={
        "response_mime_type": "application/json",
        "response_json_schema": json_schema_prescription,
    }


response = client.models.generate_content(model="gemini-2.5-flash",
                                          contents=[im, "Extract information from the image, following the given JSON schema.\n"],
                                          config=config
                                          )



print(response.text)

{
  "doctor": "Dr. Markus Müller",
  "patient": "Claudle Fischer",
  "dateOfBirth": "01.04.1978",
  "gender": "Female",
  "medication": [
    "Ibuprofen",
    "3x 400mg",
    "nach dem Essen"
  ],
  "signature": "Reptuller"
}


If the output is **not valid JSON**, for example because it contains extra text around the JSON, the JSON part must be **extracted** before it can be loaded into a Python dict.  
Below is an example helper function that does this.

> **Note:** When a Pydantic model is passed as the schema, the Gemini SDK can also return the parsed object (`response.parsed`), so you *could* use Pydantic methods to handle parsing.  
> We avoid that here to keep the workflow generally compatible across models.


In [None]:
import re
import json

def parse_json_in_output(output):
    """
    Extracts JSON-like data from the given text output and converts it to a Python dictionary.

    Args:
        output (str): The text output containing the JSON data.

    Returns:
        dict | str: The parsed JSON data, or an error message if no valid JSON is found.
    """
    # Greedy match from the first "{" to the last "}" so nested objects stay intact
    json_match = re.search(r"\{.*\}", output, re.DOTALL)
    if not json_match:
        return "No JSON data found in the given output."
    json_str = json_match.group(0)
    try:
        # Try the extracted text as-is first
        return json.loads(json_str)
    except json.JSONDecodeError:
        pass
    try:
        # Fallback: swap single quotes for double quotes (this breaks on apostrophes inside values)
        return json.loads(json_str.replace("'", '"'))
    except json.JSONDecodeError:
        return "The extracted JSON is still not valid after formatting."
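A regex cannot reliably find where a JSON object ends when braces are nested or appear in the surrounding text. As an alternative sketch, the standard library's `json.JSONDecoder.raw_decode` can be tried at each `{` until one position parses. The function name `first_json_object` is hypothetical, and returning `None` on failure is a design choice made for this example.

```python
import json


def first_json_object(text):
    """Return the first JSON object embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        return obj
    return None


noisy = 'Sure! Here is the result:\n```json\n{"meds": [{"name": "Ibuprofen"}]}\n```\nDone.'
print(first_json_object(noisy))
```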

In [None]:
#print(parse_json_in_output(response.text))


In [None]:
json.loads(response.text)

{'patientName': 'Claudie Fischer',
 'doctorName': 'Dr. Markus Müller',
 'medications': ['Ibuprofen', '400mg', '3x', 'nach dem Essen'],
 'dateOfBirth': '01.04.1978',
 'gender': 'f',
 'diagnosis': None,
 'signature': 'Reichmüller'}

Now let's do the same with GPT:

In [None]:
im = "images/prescription1.jpg"

completion = openAIclient.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "you are a careful observer. the response should be in json format"},
        {"role": "user", "content": [
                {"type": "text", "text": "Describe the image in detail"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(im)}",
                        # "detail": "low" -> the lower the detail level, the fewer tokens are used
                    }
                },
            ]}
    ],
    response_format={
                "type": "json_schema",   "json_schema": json_schema_prescription},
    temperature = 0
)

returnValue = completion.choices[0].message.content

In [None]:
returnValue

'{"doctor_name":"Dr. Markus Müller","patient_name":"Claudia Fischer","patient_dob":"1.4.1978","meds":[{"name":"Ibuprofen","dose":"400 mg","frequency":"3x","instructions":"nach dem Essen"}],"signature":true}'

Are there any differences between Gemini's output and your schema?

No need for extra parsing now. We load the JSON into a Python dict with `json.loads`:

In [None]:
print(json.loads(returnValue))

{'doctor_name': 'Dr. Markus Müller', 'patient_name': 'Claudia Fischer', 'patient_dob': '1.4.1978', 'meds': [{'name': 'Ibuprofen', 'dose': '400 mg', 'frequency': '3x', 'instructions': 'nach dem Essen'}], 'signature': True}
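The two models wrote the date of birth differently (`1.4.1978` and `01.04.1978`). To compare or store such values, they can be normalized to ISO format. This sketch assumes a day-first `DD.MM.YYYY` date, as is usual on German prescriptions; the function name `normalize_dob` is illustrative.

```python
from datetime import datetime


def normalize_dob(text, fmt="%d.%m.%Y"):
    """Parse a day-first date string and return it in ISO 8601 form (YYYY-MM-DD)."""
    # strptime accepts day and month with or without a leading zero
    return datetime.strptime(text.strip(), fmt).date().isoformat()


print(normalize_dob("1.4.1978"))
print(normalize_dob("01.04.1978"))
```

A string that does not match the format raises `ValueError`, which is a useful signal that the model misread the handwriting.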
