
# Multimodal Prompting

**What you'll explore**
- How to prompt *vision models* with **image URLs** and **local files**.
- Examples:
    - scene description
    - shelf stock checks
    - extracting info from receipts
    - interpreting sales charts
    - filtering menus / labels

**Exercises**

At the end of the notebook, you’ll find 4 exercises to practice these skills.

In [2]:

from openai import OpenAI
import base64, os, json

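# Optional: set your key here for local testing (never commit a real key).
# Assigning an empty string would overwrite a key already set in your environment,
# so the line is left commented out and uses setdefault.
# os.environ.setdefault("OPENAI_API_KEY", "sk-...")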

client = OpenAI()
MODEL_VISION = "gpt-4o"          # Vision-capable
MODEL_FAST = "gpt-4o-mini"       # Cheaper/faster text model (used later if needed)


## 1) Describe a scene (Image URL)
Send an image URL together with a text instruction and get back a detailed visual description.

In [6]:
from IPython.display import Image, display

# ---------- 1) Show the image inline in a notebook ----------
# Set the image URL you want to display
image_url = "https://upload.wikimedia.org/wikipedia/commons/1/1b/Cycling_12.jpg"

# Display the image inside the notebook (so you can visually confirm it)
display(Image(url=image_url))  


# ---------- 2) Send the image to the model with a text instruction ----------
# We call the Vision-capable model (MODEL_VISION must support image inputs).
# The `messages` array contains one user message with *two parts*:
#   - a text instruction ("Describe the image in detail")
#   - the actual image (via image_url)
resp = client.chat.completions.create(
    model=MODEL_VISION,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image in detail"},
            {"type": "image_url", "image_url": {"url": image_url}}
        ]
    }],
    temperature=0.2   # Lower temp = more deterministic description
)


# ---------- 3) Read the model’s description ----------
# The model’s answer comes back as natural language text
print(resp.choices[0].message.content)


The image depicts a scenic outdoor setting with two cyclists riding on a paved road. The foreground features a cyclist wearing a helmet, a white and red cycling jersey, and black shorts. The cyclist is riding a road bike and is positioned on the left side of the image. A number is visible on the bike, suggesting participation in an event or race.

In the background, another cyclist is visible further down the road. The road curves gently to the right and is bordered by grass and small plants. The landscape is characterized by rolling hills with patches of grass and rocky outcrops. A stone wall runs along the hillside on the left.

The sun is shining brightly, casting long shadows of the cyclists on the road. The sky is clear, contributing to the overall bright and serene atmosphere of the scene. The lighting suggests it might be early morning or late afternoon.
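
Every example in this notebook repeats the same request shape: one user message with a text part and an image part. A minimal sketch of a reusable wrapper follows. The names `build_vision_messages` and `ask_vision` are hypothetical helpers, not part of the OpenAI SDK. The optional `detail` field (`"low"`, `"high"`, or `"auto"`) on the image part trades image resolution against token cost; `"auto"` is assumed here as the default.

```python
def build_vision_messages(prompt, image_url, detail="auto"):
    """Build a single user message holding a text part and an image part."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url, "detail": detail}},
        ],
    }]


def ask_vision(client, model, prompt, image_url, detail="auto", temperature=0.2):
    """Send one text + image request and return the reply text (hypothetical helper)."""
    resp = client.chat.completions.create(
        model=model,
        messages=build_vision_messages(prompt, image_url, detail),
        temperature=temperature,
    )
    return resp.choices[0].message.content
```

With the `client` and `MODEL_VISION` defined earlier, a call would look like `ask_vision(client, MODEL_VISION, "Describe the image in detail", image_url)`.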


## 2) Shelf Check (Image URL)
Analyze a real-world scene for specific conditions: identify **low/empty** spots on a shelf and name the product categories if visible.

In [7]:
image_url = "https://upload.wikimedia.org/wikipedia/commons/f/f0/Milk_shelf_at_Singapore_supermarket.png"
# Alternate demo image:
# image_url = "https://gibsonretail.com.au/wp-content/uploads/2023/03/Shelving-Systems_Outrigger-Shelving_Intro-Image-1.jpg"

# Display the chosen image inline
display(Image(url=image_url))  

# Ask the vision model to check shelves for low/empty stock
resp = client.chat.completions.create(
    model=MODEL_VISION,
    messages=[{
        "role":"user",
        "content":[
            {"type":"text","text":"Identify any low-stock or empty areas and name the product categories if visible. Keep it to 3 bullets."},
            {"type":"image_url","image_url":{"url": image_url}}
        ]
    }],
    temperature=0.1
)

# Print the model response
print(resp.choices[0].message.content)


- **Bottom Shelf (Right Side):** The area appears empty, likely meant for juice or milk products.
- **Middle Shelf (Right Side):** Low stock of bottled drinks, possibly flavored milk or coffee.
- **Bottom Shelf (Left Side):** Low stock of large milk cartons or bottles.
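
Free-text bullets are easy to read but awkward to feed into downstream code such as a restocking alert. One option is to ask the model for a JSON array instead and parse the reply defensively, since models sometimes wrap JSON in a markdown code fence. The sketch below assumes a hypothetical helper `parse_json_reply`, and the sample reply is invented to mirror the bullets above; it is not real model output.

```python
import json
import re

FENCE = "`" * 3  # markdown code-fence marker


def parse_json_reply(text):
    """Parse a model reply as JSON, tolerating an optional markdown code fence."""
    cleaned = text.strip()
    match = re.match(r"^`{3}(?:json)?\s*(.*?)\s*`{3}$", cleaned, flags=re.DOTALL)
    if match:
        cleaned = match.group(1)
    return json.loads(cleaned)


# Hypothetical reply, shaped like the shelf-check bullets above
sample_reply = (
    FENCE + "json\n"
    '[{"location": "bottom shelf, right", "status": "empty"},\n'
    ' {"location": "middle shelf, right", "status": "low"}]\n'
    + FENCE
)

for spot in parse_json_reply(sample_reply):
    print(f'{spot["location"]}: {spot["status"]}')
```

A prompt for this might end with "Return only a JSON array of objects with keys `location` and `status`." If parsing still fails, `json.loads` raises `json.JSONDecodeError`, which the caller can catch and retry.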


## 3) Receipt Extraction (Local File)
Send a **local receipt** image inline as a base64 data URL and extract the **store, location, date, items, and total**.

In [9]:
local_image_path = "receipt_1.jpeg"  # <- replace with a real file on your machine
local_image_path_1 = "receipt_2.jpeg"  # <- handwritten receipt (alternate input)

try:
    # Display the local image in the notebook (filename=, not url=, for files on disk)
    display(Image(filename=local_image_path))

    # Convert image bytes → base64 → data URL (so it can be passed inline)
    with open(local_image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    data_url = f"data:image/jpeg;base64,{b64}"

    # Ask the vision model to parse key details from the receipt
    resp2 = client.chat.completions.create(
        model=MODEL_VISION,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract store name, city/state if visible, date, items ordered and total amount."},
                {"type": "image_url", "image_url": {"url": data_url}}
            ]
        }],
        temperature=0.2
    )
    print(resp2.choices[0].message.content)

# Handle the case where the file path is invalid or the image is missing
except FileNotFoundError:
    print("⚠️ Set local_image_path to a real image before running.")


- **Store Name:** The Tack Room
- **City/State:** Lincoln, MA
- **Date:** 4/8/24
- **Items Ordered:**
  1. BBQ Potato Chips - $7.00
  2. Diet Coke - $3.00
  3. Trillium Fort Point - $10.00
  4. Fried Chicken Sandwich - $14.00
  5. Famous Duck Grilled Cheese - $25.00
  6. Mac & Cheese - $12.00
  7. Burger of the Moment - $16.00
- **Total Amount:** $124.53
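
The cell above hardcodes `image/jpeg` in the data URL, which mislabels a PNG or WebP receipt. A minimal sketch of a more general encoder follows, assuming the file extension is a reliable hint of the image type. The name `image_to_data_url` is a hypothetical helper.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode a local image file as a data URL, guessing the MIME type from its extension."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be passed as the `url` value of an `image_url` part, exactly like `data_url` above. A missing file still raises `FileNotFoundError`, so the existing `except` clause keeps working.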


## 4) Menu/Label Filter (Image URL)
Query an image while applying constraints, for example listing menu items that are vegetarian and under a given price (here, under $6).

In [12]:
menu_img = "https://renderer.mhmcdn.com/design/thumbnail/34913cd4-8611-42ac-b942-b4ab1c42e2f4?width=500&update=1756292135305"  # example menu image

# Display the menu image in the notebook
display(Image(url=menu_img))  

# Ask the vision model to read the menu and apply a logical filter
resp4 = client.chat.completions.create(
    model=MODEL_VISION,
    messages=[{
        "role":"user",
        "content":[
            {"type":"text","text":"From this menu, list vegetarian items priced under $6."},
            {"type":"image_url","image_url":{"url": menu_img}}
        ]
    }],
    temperature=0.2  
)

# Print the model’s filtered list of menu items
print(resp4.choices[0].message.content)


The vegetarian items priced under $6 are:

- Sweet Potato Fries - $4.95
- Breadstick & Sauce - $4.95
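
When the model reads and filters in one step, there is no way to check which items it skipped or why. An alternative is to have the model extract every item as structured data and then apply the filter in code, where the rule is explicit and repeatable. The sketch below assumes the extraction step has already produced a list of dicts; the first two entries come from the output above, and the other two are hypothetical rows added to exercise the filter.

```python
from decimal import Decimal


def filter_menu(items, max_price, vegetarian_only=True):
    """Return items strictly cheaper than max_price, optionally vegetarian only."""
    limit = Decimal(str(max_price))
    return [
        item for item in items
        if Decimal(item["price"]) < limit
        and (item["vegetarian"] or not vegetarian_only)
    ]


menu_items = [
    {"name": "Sweet Potato Fries", "price": "4.95", "vegetarian": True},
    {"name": "Breadstick & Sauce", "price": "4.95", "vegetarian": True},
    {"name": "Hypothetical Veggie Wrap", "price": "7.50", "vegetarian": True},      # invented row
    {"name": "Hypothetical Chicken Bites", "price": "5.50", "vegetarian": False},   # invented row
]

for item in filter_menu(menu_items, max_price=6):
    print(f'{item["name"]} - ${item["price"]}')
```

Prices are kept as strings and compared as `Decimal` to avoid floating-point surprises near the threshold.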


## 5) Sales Chart Interpretation (Image URL)
Use a chart image as input and have the model summarize the visible **trend** in 5 concise bullets. Vision models can interpret graphs and plots too.

In [13]:
chart_url = "https://www.thesaascfo.com/wp-content/uploads/2017/05/Committed-Monthly-Recurring-Revenue-Chart.png"

# Display the sales chart in the notebook
display(Image(url=chart_url))  

# Ask the vision model to interpret the revenue trend in concise bullet points
resp3 = client.chat.completions.create(
    model=MODEL_VISION,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain the revenue trend in 5 concise bullets (<=15 words each)."},
            {"type": "image_url", "image_url": {"url": chart_url}}
        ]
    }],
    temperature=0.3  # slight randomness for varied phrasing
)

# Print the model’s summary of the chart trend
print(resp3.choices[0].message.content)


- Steady MRR growth from $1,055,000 to $2,348,000 over 12 months.
- New business consistently contributes to revenue increase each month.
- Upsells enhance revenue, maintaining upward trend.
- Minimal impact from downgrades and churn on overall growth.
- CMRR shows a positive, consistent upward trajectory.
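
Numbers a model reads off a chart are worth a quick arithmetic check before they are quoted anywhere. The sketch below takes the two MRR figures from the first bullet and computes the total growth they imply. The figures are the model's reading, not verified values, so they should be compared against the chart itself. The function name `total_growth_pct` is a hypothetical helper.

```python
def total_growth_pct(start, end):
    """Percentage change from start to end."""
    return (end - start) / start * 100


# Figures as reported by the model in the first bullet (unverified)
start_mrr = 1_055_000
end_mrr = 2_348_000

print(f"Total growth: {total_growth_pct(start_mrr, end_mrr):.1f}%")
```

If the two figures are right, MRR slightly more than doubled over the period, which is consistent with the "steady growth" wording in the summary.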


# 🖼 Exercises — Multimodal Prompting

### Exercise 1: Chart Analysis  
Use the revenue chart provided - https://www.thesaascfo.com/wp-content/uploads/2017/05/Committed-Monthly-Recurring-Revenue-Chart.png.  
- Summarize the trend in **2 bullets**: one on growth, one on driver composition.  
- Then, explain the same chart to a **10-year-old in 2 simple sentences**.  

---

### Exercise 2: GST Invoice Extraction  
You’re given a GST tax invoice image - https://www.outputbooks.com/wp-content/themes/outputbooks/images/oub_GST_Invoice_Format.png.  
- Extract the **invoice number, invoice date, and due date**.  
- Extract the **seller details** (company, address, GSTIN) and **buyer details** (name, address, GSTIN, contact).  
- Extract all **line items** (name, manufacturing date, quantity, rate, tax, amount).  
- Extract the **subtotal and grand total**.  
- Capture the **amount in words** if present.  

---

### Exercise 3: Spot the Difference — Shelf Changes  
You’re given a **before** and **after** shelf image.  
- Compare the two.  
- Highlight the changes in **3 concise bullets**.  
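
As a starting point, a single user message can carry more than one image part, so both shelf photos can go into one request. The sketch below only builds the message payload and labels each image with a short text part so the model knows which is which. The function name `build_comparison_messages` and the example URLs are placeholders, not part of the notebook's data.

```python
def build_comparison_messages(prompt, before_url, after_url):
    """Build one user message containing an instruction and two labeled images."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "text", "text": "Image 1 (before):"},
            {"type": "image_url", "image_url": {"url": before_url}},
            {"type": "text", "text": "Image 2 (after):"},
            {"type": "image_url", "image_url": {"url": after_url}},
        ],
    }]


# Placeholder URLs; replace with your own before/after shelf images
messages = build_comparison_messages(
    "Compare the two shelf images and list the changes in 3 concise bullets.",
    "https://example.com/shelf_before.jpg",
    "https://example.com/shelf_after.jpg",
)
```

The resulting `messages` list can be passed to `client.chat.completions.create` in the same way as the single-image examples.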

---

### Exercise 4: Design Your Own Multimodal Scenario  
Think of a situation where combining **image + text prompting** could help.  
- Write your own prompt.  
- Test it with a relevant image.  
- Share your result.  
