# Using the OpenAI GPT-4o API for Image Text Extraction and Question Answering

This notebook demonstrates how to use the OpenAI API with the GPT-4o model to extract text from an image of a document and return it as structured fields. It also shows how to ask questions about the content of the image, such as a specific value recorded in the document.

## Installation

First, we need to install the required libraries. You can install them using pip.

In [None]:
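# Install the libraries used in this notebook
# (openpyxl is needed by pandas to read and write .xlsx files)
!pip install openai pillow pandas openpyxl tqdm torchmetrics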

## Get your API key

1. Visit this [link](https://platform.openai.com/settings/profile?tab=api-keys) to get your API key.
2. Press "+ Create new secret key" to create a new API key.
3. Enter a name for the key and, optionally, configure its permissions as shown in the image below. Then press "Create" to create the key.
   
<img src="../assets/tutorials/create_new_secret_key_dialog.png" alt="Create secret key" width="500" />

4. Copy the API key. WARNING: This is the only time you will be able to see the key. Make sure to save it in a secure location.

<img src="../assets/tutorials/save_your_key_dialog.png" alt="Save your key" width="500" />

## Initialize OpenAI Client

You need an OpenAI API key to call the API. Replace the `"sk-..."` placeholder below with your actual key.

In [None]:
from openai import OpenAI

OPENAI_API_KEY = "sk-..."

# Initialize the OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)
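Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has been exported as an environment variable named `OPENAI_API_KEY`; the helper name `load_api_key` is hypothetical and not part of the `openai` package:

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage (not run here): client = OpenAI(api_key=load_api_key())
```

Failing early with a clear message is usually easier to debug than an authentication error from the first API call.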

## Preview your document

We will load the image from a specified path and display it.

In [None]:
from PIL import Image

# Load the image from a path
image_path = "/path/to/image.jpg"
image = Image.open(image_path)

display(image)

## Encode the image as a base64 data URL
To send the image to the API, we encode it as a base64 string and then wrap that string in a data URL of the form `data:image/jpeg;base64,...`.

In [None]:
import base64
import io

def encode_image(image: Image.Image) -> str:
    """Encode an image as a base64 JPEG string."""
    buffered = io.BytesIO()
    # JPEG has no alpha channel, so convert RGBA or palette images to RGB first
    image.convert("RGB").save(buffered, format="JPEG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")

def create_link(base64_image: str) -> str:
    """Create a data URL from a base64-encoded JPEG image."""
    return f"data:image/jpeg;base64,{base64_image}"


# Encode the image
encoded_image = encode_image(image)
# Print only a prefix: the full base64 string can be very long
print("Base64 image:", encoded_image[:80], "...")
image_link = create_link(encoded_image)
print("Image link:", image_link[:80], "...")
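To make the data URL format concrete, here is a small standard-library sketch that builds a data URL from raw bytes and splits it back apart. The helper names `to_data_url` and `from_data_url` are hypothetical, and the sample bytes are placeholders rather than a real JPEG:

```python
import base64


def to_data_url(raw: bytes, mime: str = "image/jpeg") -> str:
    """Wrap raw bytes in a base64 data URL."""
    payload = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{payload}"


def from_data_url(url: str) -> tuple[str, bytes]:
    """Split a base64 data URL back into its MIME type and raw bytes."""
    header, payload = url.split(",", 1)
    mime = header[len("data:"):].split(";", 1)[0]
    return mime, base64.b64decode(payload)


sample = b"\xff\xd8\xff\xe0 placeholder bytes, not a real JPEG"
url = to_data_url(sample)
mime, decoded = from_data_url(url)
print(url[:40])
print(mime, decoded == sample)
```

A round trip like this is a quick way to confirm that nothing was lost in encoding before spending an API call on the image.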

## Asking Questions about the Image

We send a question together with the image to the GPT-4o model and get back an answer about the image's content.

In [None]:
def extract_information(image_path: str, question: str) -> str:
    """Extract information from the image and return the answer."""
    # Load the image
    image = Image.open(image_path)
    # Encode the image
    encoded_image = encode_image(image)
    image_link = create_link(encoded_image)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant designed to extract information from the input document and user question. Please always answer the question based on the information extracted from the document and in a concise manner.",
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": question,
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": image_link,
                        },
                    },
                ],
            },
        ],
    )

    answer = response.choices[0].message.content
    return answer


# Thai for "What is the gross vehicle weight?"
question = "น้ำหนักรถรวมเท่าไหร่"
answer = extract_information(image_path, question)
print("Question:", question)
print("Answer:", answer)

# Extract Information from the document

## Create question prompt

In [None]:
extract_question = """You are provided with a scanned or photographed image of a Thai vehicle registration book (สมุดทะเบียนรถ). Your task is to extract the following information from the image. The extracted value is typically located on the right side of the key in the document.
Extract these details:

1. วันจดทะเบียน (date_of_registration)
2. เลขทะเบียน (registration_no)
3. จังหวัด (car_province)
4. ประเภท (vehicle_use)
5. รย. (type)
6. ลักษณะ (body_style)
7. ยี่ห้อรถ (manufacturer)
8. แบบ (model)
9. รุ่นปี คศ (year)
10. สี (color)
11. เลขตัวรถ (chassis_number)
12. อยู่ที่ (chassis_location)
13. ยี่ห้อเครื่องยนต์ (engine_manufacturer)
14. เลขเครื่องยนต์ (engine_number)
15. อยู่ที่ (engine_location)
16. เชื้อเพลิง (fuel_type)
17. เลขถังแก๊ส (fuel_tank_number)
18. จำนวน (cylinders)
19. ซีซี (cubic_capacity)
20. แรงม้า (horse_power)
21. จำนวนเพลาและล้อ (axles_wheels_no)
22. น้ำหนักรถ (unladen_weight)
23. น้ำหนักบรรทุก/น้ำหนักเพลา (load_capacity)
24. น้ำหนักรวม (gross_weight)
25. ที่นั่ง (seats)

Instructions:

Carefully examine the image and locate each piece of information.
If a particular field is not visible or not present in the image, use the value "N/A" for that field.
Ensure all text extracted from the image is in its original language (Thai or English) as it appears in the document.
Return the extracted information in a JSON format, using the English key names provided in parentheses.
Only return the JSON output, without any additional explanation or text.

Example of expected output format:
{
  "date_of_registration": "1 ม.ค. 2566",
  "registration_no": "กข 1234",
  "car_province": "กรุงเทพมหานคร",
  ...
  "seats": "4"
}
"""

In [None]:
answer = extract_information(image_path, extract_question)
print("Answer:")
print(answer)

Because the model's output is sometimes not valid JSON, we define a cleanup prompt that asks the model to reformat the output.

In [None]:
def clean_json(json_answer: str) -> str:
    """Ask the model to repair a malformed JSON string and return the result."""
    # Clean prompt
    clean_prompt = """
You are a JSON formatting assistant. Your task is to take a potentially malformed or incomplete JSON string and return a properly formatted, valid JSON object. Follow these steps:

1. Analyze the input text for JSON-like structure.
2. Identify and correct common JSON formatting errors such as:
   - Missing closing braces or brackets
   - Trailing commas
   - Unquoted keys
   - Missing values
3. If a value is missing or incomplete, use "N/A" as the value.
4. Ensure all keys and string values are properly quoted with double quotes.
5. Remove any extraneous text before or after the JSON object.
6. Format the JSON with proper indentation for readability.

Input: {}

Instructions:
- Return only the corrected JSON object, without any additional explanation or text.
- Ensure the output is a complete, valid JSON object that can be parsed by Python's json.loads() function.
- Preserve the original data as much as possible, only making changes necessary for valid JSON formatting.

Example of expected output format:
{{
  "key1": "value1",
  "key2": "value2",
  "key3": "N/A"
}}
"""
    # Make the API call
    response = client.chat.completions.create(
        model="gpt-4o",  # or another suitable model
        messages=[
            {"role": "system", "content": "You are a JSON formatting assistant."},
            # Fill the "Input: {}" placeholder with the text to be repaired
            {"role": "user", "content": clean_prompt.format(json_answer)},
        ],
    )

    # Extract the cleaned JSON from the response
    cleaned_json_str = response.choices[0].message.content
    return cleaned_json_str
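Before spending a second API call on cleanup, a cheap local pass may already recover many replies, since a common failure is valid JSON wrapped in extra prose. The sketch below assumes the reply contains a single JSON object; `extract_json_object` is a hypothetical helper, not part of any library:

```python
import json


def extract_json_object(text: str) -> dict:
    """Parse the outermost {...} span found in text."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in the text.")
    # json.JSONDecodeError is a subclass of ValueError, so callers can catch one type
    return json.loads(text[start : end + 1])


reply = 'Here is the result: {"seats": "4", "color": "ขาว"} Let me know if you need more.'
print(extract_json_object(reply))
```

If this raises `ValueError`, the model-based `clean_json` step remains available as a fallback.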

In [None]:
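# Try to parse the JSON; if that fails, clean it with the model and parse again
import json


def strip_code_fences(text: str) -> str:
    """Remove Markdown code fences such as ```json that the model may add."""
    return text.replace("```json", "").replace("```", "").strip()


def parse_json(answer: str) -> dict:
    """Parse a JSON string and return a dictionary."""
    answer = strip_code_fences(answer)
    try:
        return json.loads(answer)
    except json.JSONDecodeError:
        # Parsing failed, so ask the model to repair the JSON and try once more
        cleaned = strip_code_fences(clean_json(answer))
        return json.loads(cleaned)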


# Parse the JSON
answer = parse_json(answer)
print("Parsed JSON:")
print(answer)
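Valid JSON is not necessarily complete: the model may omit a field or add an unexpected one. A minimal sketch of a post-processing step, assuming every value should be a string and missing fields should become "N/A" as the prompt requests. `normalize_record` is a hypothetical helper and `EXPECTED_KEYS` is shortened here for illustration:

```python
# Shortened for illustration; the full list would hold all 25 field names
EXPECTED_KEYS = ["date_of_registration", "registration_no", "seats"]


def normalize_record(record: dict, expected_keys: list[str]) -> dict:
    """Keep only the expected keys, as strings, filling gaps with "N/A"."""
    normalized = {}
    for key in expected_keys:
        value = record.get(key)
        text = "" if value is None else str(value).strip()
        normalized[key] = text if text else "N/A"
    return normalized


raw = {"registration_no": " กข 1234 ", "seats": 4, "note": "not requested"}
print(normalize_record(raw, EXPECTED_KEYS))
```

Normalizing at this point keeps the columns of the results table consistent across images, which simplifies the later comparison against annotations.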

## Let's run on all images!

In [None]:
import json
from tqdm.auto import tqdm
from glob import glob
from pathlib import Path
import pandas as pd

extracted_values = []
error_paths = []
paths = glob("/path/to/folder/*.jpg")
for path in tqdm(paths):
    try:
        # Keep the API call inside the try block so one failure does not stop the batch
        answer = extract_information(image_path=path, question=extract_question)
        answer = parse_json(answer)
        answer["image_path"] = str(Path(path).stem)
        extracted_values.append(answer)
        print(f"path {path} is processed")
    except Exception as exc:
        error_paths.append(path)
        print(f"Error processing {path}: {exc}")
extracted_values_df = pd.DataFrame(extracted_values)
extracted_values_df.to_excel("predicted_results_chatgpt.xlsx", index=False)
# Preview the DataFrame
display(extracted_values_df)
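When processing many images, transient failures such as rate limits or network errors are likely. One common pattern is to retry with exponential backoff. The sketch below is generic and not specific to the OpenAI client; `call_with_retries` is a hypothetical helper and the default delays are arbitrary:

```python
import time


def call_with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next try
            time.sleep(base_delay * 2 ** attempt)
```

Inside the loop above, the call could then be wrapped as `call_with_retries(lambda: extract_information(image_path=path, question=extract_question))`, so that only images that fail repeatedly end up in `error_paths`.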

# Evaluation

In [None]:
import numpy as np
from torchmetrics.text import CharErrorRate

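def calculate_cer(preds: list, targets: list) -> float:
    """Return the character error rate (CER) of preds against targets."""
    cer = CharErrorRate()  # Initialize the CharErrorRate metric
    cer_val = cer(preds, targets)  # CER over all pairs, as a scalar tensor
    return cer_val.item()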
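For intuition, the character error rate is the number of character edits (insertions, deletions, substitutions) needed to turn each prediction into its reference, divided by the total number of reference characters. The following pure-Python sketch illustrates that definition; it is not the torchmetrics implementation, and the function names are hypothetical:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_error_rate(preds: list[str], targets: list[str]) -> float:
    """Total edit distance divided by total reference length."""
    errors = sum(edit_distance(p, t) for p, t in zip(preds, targets))
    total = sum(len(t) for t in targets)
    return errors / total if total else float("nan")


# One substitution over four reference characters
print(char_error_rate(["abcd"], ["abce"]))  # 0.25
```

A CER of 0 means every prediction matches its reference exactly, and the value can exceed 1 when predictions are much longer than the references.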

In [None]:
annotated_df = pd.read_excel('annotated_results.xlsx', dtype=str).fillna("")
predicted_df = pd.read_excel('predicted_results_chatgpt.xlsx', dtype=str).fillna("")

columns_of_interest = [
    'date_of_registration', 'registration_no', 'car_province', 'vehicle_use', 'type', 'body_style',
    'manufacturer', 'model', 'year', 'color', 'chassis_number', 'chassis_location', 'engine_manufacturer',
    'engine_number', 'engine_location', 'fuel_type', 'fuel_tank_number', 'cylinders', 'cubic_capacity',
    'horse_power', 'axles_wheels_no', 'unladen_weight', 'load_capacity', 'gross_weight', 'seats'
]
merged_df = pd.merge(annotated_df, predicted_df, on='image_path', suffixes=('_annotation', '_prediction'))

In [None]:
eval_list = []
for col in columns_of_interest:
    if f"{col}_annotation" in merged_df.columns and f"{col}_prediction" in merged_df.columns:
        # CharErrorRate already aggregates over the whole column, so this is a single score
        avg_cer = calculate_cer(merged_df[f"{col}_prediction"].tolist(), merged_df[f"{col}_annotation"].tolist())
        avg_accuracy = (merged_df[f"{col}_prediction"] == merged_df[f"{col}_annotation"]).mean() * 100
        eval_list.append({
            "column_name": col,
            "cer": avg_cer,
            "accuracy": avg_accuracy
        })
eval_df = pd.DataFrame(eval_list)

In [None]:
display(eval_df)