# Benchmarking OCR services from remote clouds 

| OCR Model                     | Provider       | Accuracy | Handwriting | Tables/Forms | Languages | Pricing Model          | Best For                          |
|-------------------------------|---------------|----------|-------------|--------------|-----------|------------------------|-----------------------------------|
| **Google Cloud Vision**        | Google Cloud  | ★★★★★    | Yes         | Yes          | 50+       | $1.50/1K pages         | Cloud scaling, multilingual docs  |
| **Amazon Textract**            | AWS           | ★★★★☆    | Limited     | Yes          | 50+       | $1.50/1K pages         | AWS-based doc automation          |
| **Google Document AI**         | Google Cloud  | ★★★★★    | Yes         | Yes          | 50+       | $1.50/1K pages         | Enterprise OCR, PDF, tables       |
| **Azure Document Intelligence**| Microsoft     | ★★★★☆    | Yes         | Yes          | 100+      | €8.85/1K pages         | Layout, tables, key-value pairs   |
| **Meta Llama 4 Maverick**      | Meta/Groq     | ★★★★☆    | No          | Yes (layout) | 20+       | $0.80/1M tokens         | Layout-preserving OCR, VLM        |
| **Meta Llama 4 Scout**         | Meta/Groq     | ★★★★☆    | No          | Yes          | 20+       | $0.80/1M tokens         | Fast, layout-aware OCR            |
| **Mistral OCR**                | Mistral       | ★★★☆☆    | No          | Yes          | 20+       | $0.80/1M tokens         | Fast, markdown output, no coords  |
| **Qwen2.5-VL-72B-Instruct**    | Alibaba       | ★★★★☆    | No          | Yes (coords) | 20+       | $0.80/1M tokens         | Coordinates, layout, VLM OCR      |
| **Gemini 2.5 Flash**           | Google Vertex | ★★★★☆    | Yes         | Yes          | 30+       | $2.80/1K pages          | Fast, tables, coordinates, PDF    |

## Method 1: meta-llama/llama-4-scout-17b-16e-instruct

In [4]:
%pip install groq

Note: you may need to restart the kernel to use updated packages.


In [5]:

import base64
import os

from groq import Groq

# Collect every PNG and JPEG page image from the templates folder
image_files = [
    os.path.join('task1c_reportsTemplates', f)
    for f in sorted(os.listdir('task1c_reportsTemplates'))
    if f.endswith('.png') or f.endswith('.jpeg')
]

def encode_image(image_path):
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

client = Groq(
    # This is the default and can be omitted
    api_key=os.environ.get("GROQ_API_KEY"),
)
results = []

for img_path in image_files:
    base64_image = encode_image(img_path)
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert OCR assistant specialized in extracting text from images. "
                    "Follow these rules strictly:\n"
                    "1. Extract ALL text verbatim, including numbers, symbols, and formatting.\n"
                    "2. Preserve line breaks, spacing, and indentation.\n"
                    "3. Do not correct spelling or modify the text in any way.\n"
                    "4. If text is unclear, mark as '[UNREADABLE]'.\n"
                    "5. Return ONLY the raw extracted text, no explanations."
                )
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract the text from this image exactly as it appears:"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": "high"
                        },
                    },
                ],
            }
        ],

        model="meta-llama/llama-4-scout-17b-16e-instruct",
        temperature=0.0,
        max_tokens=4096,
    )
    results.append(chat_completion.choices[0].message.content)

# Print all results at once
print('\n\n'.join(results))


 
LABORATOIRE DE BIOLOGIE MÉDICALE

  
N° FINESS : 34 3 2
 -  - 
Tél :   Fax : 
  - Biologiste(s) Médical(aux)

Docteur    
CABINET MEDICAL " "
 
Copie à : Docteur    , DR 
X Demande n° 01/02/ -LABO--TP

Madame   

  (100)
Edité le, lundi 1 février 2021
Copie à : Docteur    , DR 

Patient né(e)   le 
FSE Tiers payant  - 
Prélèvements effectués par le laboratoire le 01/02/2021 à 10H27

Vos résultats sur internet : Accès sécurisé, rapide, gratuit, pratique, écoresponsable
1) Communiquez votre mail au laboratoire  2) Recevez un email dès que vos résultats sont disponibles  3) Cliquez sur le lien

INFORMATION COVID-19
Rendez-vous sur notre site internet dédié pour connaître notre organisation : https:// .fr/depistage-covid-19/

Hématologie

                        Valeurs de référence     Antériorités
Hémogramme
(Sang total - Variation d'impédance, photométrie, cytométrie en flux)  - 

Hématies .................................................... 4,94 Téra/L    3,80 à 5,90    4,97
Hémoglob

Both Meta Llama 4 models are excellent. This one shows more respect for the layout. It performs worse on the PDF OCR, though: the table on page 3 of the highlighted output PDF wasn't captured well. It comes out split vertically in two, so the result is not entirely valid.
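
One detail in the cell above: the file filter accepts both `.png` and `.jpeg`, but the data URL always declares `image/jpeg`. Whether the API tolerates that mismatch was not checked here. A minimal sketch of a helper that derives the MIME type from the file extension instead, using only the standard library (`build_data_url` is a hypothetical name, not part of the Groq SDK):

```python
import base64
import mimetypes


def build_data_url(image_path: str) -> str:
    """Return a base64 data URL whose MIME type matches the file extension."""
    # guess_type returns (type, encoding); type is None for unknown extensions
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Unsupported image type: {image_path}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string could then be passed as the `url` value of the `image_url` entry in place of the hard-coded f-string.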

## Method 2: meta-llama/llama-4-maverick-17b-128e-instruct 

In [2]:
%pip install groq

Collecting groq
  Downloading groq-0.28.0-py3-none-any.whl.metadata (15 kB)
Collecting distro<2,>=1.7.0 (from groq)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Downloading groq-0.28.0-py3-none-any.whl (130 kB)
Downloading distro-1.9.0-py3-none-any.whl (20 kB)
Installing collected packages: distro, groq
Successfully installed distro-1.9.0 groq-0.28.0
Note: you may need to restart the kernel to use updated packages.


In [5]:

from groq import Groq
import os
import base64
image_files = [
    os.path.join('task1c_reportsTemplates', f)
    for f in sorted(os.listdir('task1c_reportsTemplates'))
    if f.endswith('.png') or f.endswith('.jpeg')
]

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

client = Groq(
    # This is the default and can be omitted
    api_key=os.environ.get("GROQ_API_KEY"),
)
results = []

for img_path in image_files:
    base64_image = encode_image(img_path)
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert OCR assistant specialized in extracting text from images. "
                    "Follow these rules strictly:\n"
                    "1. Extract ALL text verbatim, including numbers, symbols, and formatting.\n"
                    "2. Preserve line breaks, spacing, and indentation.\n"
                    "3. Do not correct spelling or modify the text in any way.\n"
                    "4. If text is unclear, mark as '[UNREADABLE]'.\n"
                    "5. Return ONLY the raw extracted text, no explanations."
                )
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract the text from this image exactly as it appears:"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": "high"
                        },
                    },
                ],
            }
        ],
        model="meta-llama/llama-4-maverick-17b-128e-instruct",
        temperature=0.0,
        max_tokens=4096,
    )
    results.append(chat_completion.choices[0].message.content)

# Print results for each image
for i, text in enumerate(results, 1):
    print(f"\n--- Page {i} ---\n{text}\n")


--- Page 1 ---
``` 
  
N° FINESS : 34 37302 2                 LABORATOIRE DE BIOLOGIE MÉDICALE
 -  -                   
  - Biologiste(s) Médical(aux)

Docteur                            Madame   
CABINET MEDICAL " "                   
                                     (100)

Copie à : Docteur    , DR 
X Demande n° 01/02/ - LABO--TP       Edité le, lundi 1 février 2021
                                                    Copie à : Docteur    , DR
                                                    

Patient né(e)   le 
FSE Tiers payant  - 
Prélèvements effectués par le laboratoire le 01/02/21 à 10H27

Vos résultats sur internet : Accès sécurisé, rapide, gratuit, pratique, écoresponsable
1) Communiquez votre mail au laboratoire 2) Recevez un email dès que vos résultats sont disponibles 3) Cliquez sur le lien

INFORMATION COVID-19
Rendez-vous sur notre site internet dédié pour connaître notre organisation : https:// .fr/depistage-covid-19/

Hématologie

Hémogramme
(Sang total - Variat


Quite good, but it does not keep coordinates. It is probably worth combining both approaches: there are libraries that can match the text from a coordinate-aware OCR engine against the VLM output, so that each word gets a location. That would be the ideal setup. One candidate is Python's `difflib`: https://docs.python.org/3/library/difflib.html
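
A minimal sketch of that idea, assuming the OCR engine returns `(word, box)` pairs and the VLM returns plain text for the same page. `transfer_boxes` is a hypothetical helper and the boxes in the example are made-up values; only `difflib.SequenceMatcher` comes from the standard library. Words that match exactly inherit the OCR box, and so do equal-length substitutions, on the assumption that those are OCR misreads of the same words:

```python
from difflib import SequenceMatcher


def transfer_boxes(ocr_words, vlm_text):
    """Attach OCR bounding boxes to the words of a VLM transcription.

    ocr_words: list of (word, box) pairs from a coordinate-aware OCR engine.
    vlm_text:  plain text produced by the VLM for the same page.
    Returns a list of (vlm_word, box_or_None) pairs.
    """
    vlm_words = vlm_text.split()
    ocr_tokens = [word for word, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_tokens, b=vlm_words, autojunk=False)

    boxes = [None] * len(vlm_words)
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        same_length = (a_end - a_start) == (b_end - b_start)
        if tag == "equal" or (tag == "replace" and same_length):
            # Copy boxes one to one across the aligned run
            for offset in range(a_end - a_start):
                boxes[b_start + offset] = ocr_words[a_start + offset][1]
    return list(zip(vlm_words, boxes))


# Hypothetical OCR output: the accent on the last word was lost
ocr_words = [
    ("LABORATOIRE", (445, 49, 560, 71)),
    ("DE", (565, 49, 590, 71)),
    ("BIOLOGIE", (595, 49, 670, 71)),
    ("MEDICALE", (675, 49, 744, 71)),
]
aligned = transfer_boxes(ocr_words, "LABORATOIRE DE BIOLOGIE MÉDICALE")
```

Words that only the VLM produced keep `None` as their box, which makes it easy to see where the two sources disagree.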


Better than the other Meta model, and one of the few models in this document that actually respects the layout to perfection.

| Llama 4 Maverick 17Bx128E | Input Token Price (per 1M) | Output Token Price (per 1M) | Combined (1M input + 1M output) |
|---------------------------|----------------------------|-----------------------------|---------------------------------|
| Price                     | $0.20                      | $0.60                       | $0.80                           |
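
Because input and output tokens are billed at different rates, the cost of a page depends on how many tokens of each kind it uses. A small sketch of the arithmetic, using the two per-million prices from the table as defaults (`estimate_cost_usd` is a hypothetical helper, and the token counts in the example are invented for illustration, not measured):

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_m: float = 0.20,
                      output_price_per_m: float = 0.60) -> float:
    """Estimate the cost in USD of a request from its token counts."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical page: 3,000 input tokens (image + prompt), 1,000 output tokens
page_cost = estimate_cost_usd(3_000, 1_000)
```

With those invented counts the page comes to about $0.0012. The real counts are reported in the `usage` field of each chat completion response.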


## Method 3: google gemma-3-27b-it

In [None]:
%pip install -U transformers
%pip install python-dotenv

In [None]:
from huggingface_hub import InferenceClient
import os 
import base64
from dotenv import load_dotenv
load_dotenv()
client = InferenceClient(
    provider="nebius",
    api_key=os.environ.get("HF_PLAYGROUND")
)
image_files = [
    os.path.join('task1c_reportsTemplates', f)
    for f in sorted(os.listdir('task1c_reportsTemplates'))
    if f.endswith('.png') or f.endswith('.jpeg')
]
def image_to_base64(image_path):
    """Convert image file to base64 string"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
    
for image in image_files:
    base64_image=image_to_base64(image)
    # Create the message payload
    messages = [
        {
            "role": "system",
            "content": "You are an expert OCR assistant. Extract ALL text exactly as shown, preserving all formatting, symbols and spacing. Mark unclear parts as [UNREADABLE]."
        },
        {
            "role": "user",
            "content": [{
          "type": "text",
          "text": "Can you describe this image?",
        }, {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
          },
        }, ],
        }
    ]
    completion = client.chat.completions.create(
        model="google/gemma-3-27b-it",
        messages=messages,
        temperature=0.1,
        max_tokens=16384,
        top_p=0.7,
    )

    print(completion.choices[0].message)

***I tried to run this from the notebook, but the request was not allowed, so I had to use the Hugging Face playground instead.***

## Method 4: Qwen2.5-VL-72B-Instruct

In [None]:
%pip install --upgrade huggingface_hub

In [6]:
from huggingface_hub import InferenceClient
import base64
import os
image_files = [
    os.path.join('task1c_reportsTemplates', f)
    for f in sorted(os.listdir('task1c_reportsTemplates'))
    if f.endswith('.png') or f.endswith('.jpeg')
]
# Initialize client
client = InferenceClient(
    provider="hyperbolic",
    api_key=os.environ.get("HF_PLAYGROUND")
)

def image_to_base64(image_path):
    """Convert image file to base64 string"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

for image in image_files:
    base64_image=image_to_base64(image)
    # Create the message payload
    messages = [
        {
            "role": "system",
            "content": "You are an expert OCR assistant. Extract ALL text exactly as shown, preserving all formatting, symbols and spacing. Mark unclear parts as [UNREADABLE]."
        },
        {
            "role": "user",
            "content": [{
          "type": "text",
          "text": "Can you describe this image?",
        }, {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
          },
        }, ],
        }
    ]

    # Make the request
    completion = client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-72B-Instruct",
        messages=messages,
        temperature=0.1,
        max_tokens=16384,
        top_p=0.7
    )
    print(completion.choices[0].message.content)

Copie électronique
LABORATOIRE DE BIOLOGIE MÉDICALE
  
N° FINESS : 3432
 -  -  ☎:  📧: 
  - Biologiste(s) Médical(aux)

Docteur    
CABINET MEDICAL " "
 
Copie à : Docteur    , DR 
X Demande n° 01/02/ -LABO--TP
Patient né(e)   le 
FSE Tiers payant  - 
Prélèvements effectués par le laboratoire le 01/02/21 à 10H27

Madame   

  (100)
Edité le, lundi 1 février 2021
Copie à : Docteur    , DR


Vos résultats sur internet : Accès sécurisé, rapide, gratuit, pratique, écoresponsable
1) Communiquez votre mail au laboratoire 2) Recevez un email dès que vos résultats sont disponibles 3) Cliquez sur le lien

INFORMATION COVID-19
Rendez-vous sur notre site internet dédié pour connaître notre organisation : https:// .fr/depistage-covid-19/

Hématologie
Hémogramme
(Sang total - Variation d'impédance, photométrie, cytométrie en flux)  - 
Hématies ................................................ 4,94 Tera/L 3,80 à 5,90  4,97
Hémoglobine ........................................ 13,6 g/dL 11,5 à 17,5 13,8

The spacing is good, but not good enough. Llama 4 Maverick is more reliable in output quality. The advantage of this model is that it provides coordinates and Llama 4 Maverick doesn't; even so, its line-by-line reading is less thorough.

In [10]:
for item in response["Blocks"]:
    if "BoundingBox" in item.get("Geometry", {}):
        box = item["Geometry"]["BoundingBox"]
        text = item.get("Text", "")
        print(f"Text: '{text}' | Left: {box['Left']}, Top: {box['Top']}, Width: {box['Width']}, Height: {box['Height']}")

Text: '' | Left: 0.0, Top: 9.537294317851774e-06, Width: 1.0, Height: 0.9999904632568359
Text: 'Copie électronique' | Left: 0.7135121822357178, Top: 0.007279777433723211, Width: 0.13882844150066376, Height: 0.011959744617342949
Text: 'LABORATOIRE DE BIOLOGIE MÉDICALE' | Left: 0.5724204778671265, Top: 0.028493167832493782, Width: 0.3976910710334778, Height: 0.01365593634545803
Text: ' ' | Left: 0.10864158719778061, Top: 0.041259318590164185, Width: 0.22936062514781952, Height: 0.02035103552043438
Text: '  ' | Left: 0.21428395807743073, Top: 0.0891493484377861, Width: 0.19549520313739777, Height: 0.011410328559577465
Text: 'N° FINESS 3432' | Left: 0.2138015329837799, Top: 0.10613875836133957, Width: 0.15449538826942444, Height: 0.0073406146839261055
Text: '    8 ' | Left: 0.21407967805862427, Top: 0.11628034710884094, Width: 0.5963878631591797, Height: 0.008507906459271908
Text: '  Biologiste(s) Médical(aux)' | Left: 0.2132289558649063, Top: 0.12795408070087433, Width: 0.2884592711925506
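
The values printed above all lie between 0 and 1, so they appear to be fractions of the page size rather than pixels. A minimal sketch of converting such a box to pixel corners, assuming the `Left`/`Top`/`Width`/`Height` keys shown in the output and an image size that you supply yourself (`to_pixel_box` is a hypothetical helper):

```python
def to_pixel_box(box: dict, image_width: int, image_height: int) -> tuple:
    """Convert a normalized Left/Top/Width/Height box to pixel corners.

    Returns (x_min, y_min, x_max, y_max) in pixels.
    """
    x_min = round(box["Left"] * image_width)
    y_min = round(box["Top"] * image_height)
    x_max = round((box["Left"] + box["Width"]) * image_width)
    y_max = round((box["Top"] + box["Height"]) * image_height)
    return x_min, y_min, x_max, y_max


# Example with an assumed 1000 x 2000 pixel page image
pixel_box = to_pixel_box(
    {"Left": 0.5, "Top": 0.25, "Width": 0.25, "Height": 0.5}, 1000, 2000
)
```

The corner format matches what `cv2.rectangle` expects, so the result can be drawn directly on the page image.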

## Method 5: Mistral OCR

> Mistral OCR does not keep track of text coordinates; getting them needs a separate call for bounding box annotations.


In [11]:
%pip install mistralai

Note: you may need to restart the kernel to use updated packages.


In [12]:
import base64
import os
from mistralai import Mistral
from dotenv import load_dotenv
load_dotenv()
def encode_image(image_path):
    """Encode the image to base64."""
    try:
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')
    except FileNotFoundError:
        print(f"Error: The file {image_path} was not found.")
        return None
    except Exception as e:  # Added general exception handling
        print(f"Error: {e}")
        return None

image_files = [
    os.path.join('task1c_reportsTemplates', f)
    for f in sorted(os.listdir('task1c_reportsTemplates'))
    if f.endswith('.png') or f.endswith('.jpeg')
]
# Create the client once instead of once per image
api_key = os.environ.get("MISTRAL_API_KEY")
client = Mistral(api_key=api_key)

full_response = []
for image_path in image_files:
    # Getting the base64 string; skip files that could not be read
    base64_image = encode_image(image_path)
    if base64_image is None:
        continue

    ocr_response = client.ocr.process(
        model="mistral-ocr-latest",
        document={
            "type": "image_url",
            "image_url": f"data:image/jpeg;base64,{base64_image}"
        },
        include_image_base64=True
    )
    full_response.append(ocr_response)


In [13]:
def clean_ocr_text(ocr_responses):
    """Extracts and cleans text from OCR responses, removing base64 and markdown formatting."""
    clean_texts = []
    
    for response in ocr_responses:
        for page in response.pages:
            # Remove markdown formatting symbols
            clean_page = page.markdown.replace('$\\checkmark$', '✓')  # Replace checkmarks
            clean_page = clean_page.replace('\\', '')  # Remove LaTeX escapes
            clean_page = clean_page.replace('$', '')  # Remove math symbols
            clean_page = clean_page.replace('*', '')  # Remove markdown emphasis
            clean_page = clean_page.replace('#', '')  # Remove markdown headers
            
            # Remove image tags and base64 strings
            clean_page = '\n'.join(
                line for line in clean_page.split('\n') 
                if not line.startswith('![') and 'base64' not in line
            )
            
            clean_texts.append(clean_page)
    
    return '\n\n'.join(clean_texts)

# Example usage:
clean_content = clean_ocr_text(full_response)
print(clean_content)

# Or to print with document/page structure:
for i, resp in enumerate(full_response, 1):
    print(f"\n--- Document {i} ---")
    for page in resp.pages:
        clean_page = page.markdown.replace('$\\checkmark$', '✓')
        clean_page = '\n'.join(
            line for line in clean_page.split('\n') 
            if not line.startswith('![') and 'base64' not in line
        )
        print(f"\n--- Page {page.index + 1} ---\n")
        print(clean_page)

  

 LABORATOIRE DE BIOLOGIE MÉDICALE

   

Nº FINESS :   -  -  :  -    - Biologiste(s) Médical(aux)

 Docteur    

 CABINET MEDICAL " "

  

Copie à : Docteur    , DR 

X Demande n° 01/02/ -LABO--TP

Patient né(e)   le 

FSE Tiers payant  - 

Prélèvements effectués par le laboratoire le 01/02/21 à 10H27

 Madame   

 

   (100)

Edité le, lundi 1 février 2021 Copie à : Docteur    , DR 

 Vos résultats sur internet : Accès sécurisé, rapide, gratuit, pratique, écoresponsable

1) Communiquez votre mail au laboratoire 2) Recevez un email dès que vos résultats sont disponibles 3) Cliquez sur le lien

 INFORMATION COVID-19

Rendez-vous sur notre site internet dédié pour connaître notre organisation : https:// .fr/depistage-covid-19/

 Hématologie

|   | Valeurs de référence | Antériorités  |
| --- | --- | --- |
|  Hémogramme (Sang total - Variation d'impédance, photométrie, cytométrie en flux)  -  |  |   |
|  Hématies | 4,94 Téra/L | 3,80 à 5,90  |
|  Hémoglobine | 13,6 g/dL | 11,5 à 17,5  

The output is not good enough; the Markdown formatting doesn't always work well, sadly. Stick with Maverick, given that neither of the two can detect coordinates.


# Bounding boxes



## Pytesseract bounding boxes

In [1]:
%pip install pytesseract
%pip install pytesseract opencv-python pillow


Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.13
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [3]:
import cv2
import pytesseract
image = cv2.imread('task1c_reportsTemplates/pdf2image_denoised_page_1.png')

# Run Tesseract OCR on the image
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

# Draw bounding boxes around the detected text
for i, word in enumerate(data['text']):
    if word.strip():
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

# Display the image with bounding boxes
cv2.imshow('Image with Bounding Boxes', image)
cv2.waitKey(0)
cv2.destroyAllWindows()
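
The cell above draws one box per word. `image_to_data` also returns `page_num`, `block_num`, `par_num` and `line_num` for every entry, so word boxes can be merged into one box per text line. A minimal sketch that works on the same `data` dictionary (`merge_words_into_lines` is a hypothetical helper, not part of pytesseract):

```python
from collections import defaultdict


def merge_words_into_lines(data: dict) -> list:
    """Merge word-level boxes from image_to_data into one box per text line."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # skip entries without text (page, block, line rows)
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append(i)

    merged = []
    for key in sorted(lines):
        idxs = lines[key]
        # The line box is the smallest rectangle covering all its word boxes
        x_min = min(data["left"][i] for i in idxs)
        y_min = min(data["top"][i] for i in idxs)
        x_max = max(data["left"][i] + data["width"][i] for i in idxs)
        y_max = max(data["top"][i] + data["height"][i] for i in idxs)
        text = " ".join(data["text"][i] for i in idxs)
        merged.append({"text": text, "bbox": (x_min, y_min, x_max, y_max)})
    return merged
```

Calling `merge_words_into_lines(data)` after the `image_to_data` call gives entries in the same `text`/`bbox` shape used for the model outputs elsewhere in this notebook, which makes the results easier to compare.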


## Method 6: OpenAI GPT-4.1 (vision)


In [2]:
%pip install openai
%pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [3]:
import base64
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "task1c_reportsTemplates/pdf2image_denoised_page_1.png"

# Getting the Base64 string
base64_image = encode_image(image_path)


response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": (
                        "You are an OCR assistant. "
                        "Given the image, extract all visible text and, for each text segment, provide its bounding box coordinates "
                        "in the format [x_min, y_min, x_max, y_max], where coordinates refer to the top-left and bottom-right corners of the text region in the image. "
                        "Present the results as a JSON array, where each entry contains: "
                        "- \"text\": the recognized string, "
                        "- \"bbox\": the bounding box as a list of four integers. "
                        "Only output the JSON array, with no extra commentary. "
                        "Example: "
                        "["
                        "  {\"text\": \"Sample Title\", \"bbox\": [120, 35, 480, 80]},"
                        "  {\"text\": \"Date: 2025-06-18\", \"bbox\": [50, 120, 250, 150]}"
                        "]"
                    ),
                },
                {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{base64_image}",
                },
            ],
        }
    ],
)
print(response.output_text)

[
  {"text": " ", "bbox": [59, 43, 178, 78]},
  {"text": "Copie électronique", "bbox": [596, 24, 726, 41]},
  {"text": "LABORATOIRE DE BIOLOGIE MÉDICALE", "bbox": [445, 49, 744, 71]},
  {"text": "  ", "bbox": [59, 91, 226, 107]},
  {"text": "N° FINESS : ", "bbox": [59, 110, 241, 126]},
  {"text": " -  - ", "bbox": [59, 128, 397, 143]},
  {"text": "  - Biologiste(s) Médical(aux)", "bbox": [58, 145, 365, 161]},
  {"text": "Docteur    ", "bbox": [59, 174, 274, 191]},
  {"text": "CABINET MEDICAL \" \"", "bbox": [59, 192, 330, 209]},
  {"text": " ", "bbox": [59, 212, 198, 225]},
  {"text": "Copie à : Docteur    , DR ", "bbox": [59, 228, 456, 241]},
  {"text": "X Demande n° 01/02/ -LABO--TP", "bbox": [59, 244, 388, 264]},
  {"text": "Madame   ", "bbox": [435, 174, 699, 191]},
  {"text": "", "bbox": [435, 193, 627, 210]},
  {"text": "  (100)", "bbox": [436, 214, 572, 229]},
  {"text": "Edité le, lundi 1 février 2021", "bbox": [436, 233, 627, 247]},
  {"text": "Copie à : Docteur    , DR ", "bb

| Model   | Input | Cached Input | Output | Blended Pricing* |
|---------|-------|--------------|--------|------------------|
| gpt-4.1 | $2.00 | $0.50        | $8.00  | $1.84            |

*Prices are per 1M tokens


In [8]:
import json
import re

import cv2
import numpy as np

A4_WIDTH = 2480   # 210 mm at 300 DPI
A4_HEIGHT = 3508  # 297 mm at 300 DPI

# Create a blank white canvas with A4 size to draw on
image = 255 * np.ones((A4_HEIGHT, A4_WIDTH, 3), dtype=np.uint8)

raw_output = response.output_text

# First attempt: parse the model output as a JSON array
try:
    ocr_results = json.loads(raw_output)
except json.JSONDecodeError as e:
    print("Error parsing JSON from response.output_text:", e)
    ocr_results = []

# Draw bounding boxes and text from the parsed JSON
font = cv2.FONT_HERSHEY_SIMPLEX
font_scale = 0.7
thickness = 2
for entry in ocr_results:
    if "bbox" in entry and "text" in entry:
        x1, y1, x2, y2 = entry["bbox"]
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
        # Put text above the box (or just below its top edge if there is not enough space)
        # Note: the Hershey fonts only cover ASCII, so accented characters are not rendered correctly
        text = entry["text"]
        text_size, _ = cv2.getTextSize(text, font, font_scale, thickness)
        text_x = x1
        text_y = y1 - 10 if y1 - 10 > 10 else y1 + text_size[1] + 5
        cv2.putText(image, text, (text_x, text_y), font, font_scale, (0, 0, 255), thickness, cv2.LINE_AA)

# Fallback: pull every complete "bbox": [x1, y1, x2, y2] out of the raw string,
# so the boxes are still drawn when the JSON is truncated or invalid
bbox_pattern = re.compile(
    r'"bbox":\s*\[\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*\]'
)
for match in bbox_pattern.finditer(raw_output):
    x1, y1, x2, y2 = map(int, match.groups())
    cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)

cv2.imshow('Annotated Image', image)
cv2.waitKey(0)
cv2.destroyAllWindows()


There is no good way to show it, but the coordinates are actually quite accurate. It separates the boxes too much, though, and captures only individual words. Not good enough, and also very expensive.

## Method 7: Azure AI Document Intelligence

In [None]:
%pip install azure-ai-documentintelligence

In [16]:
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import DocumentAnalysisFeature, AnalyzeResult
from dotenv import load_dotenv
import os
def _in_span(word, spans):
    for span in spans:
        if word.span.offset >= span.offset and (word.span.offset + word.span.length) <= (span.offset + span.length):
            return True
    return False

def _format_bounding_region(bounding_regions):
    if not bounding_regions:
        return "N/A"
    return ", ".join(
        f"Page #{region.page_number}: {_format_polygon(region.polygon)}" for region in bounding_regions
    )

def _format_polygon(polygon):
    if not polygon:
        return "N/A"
    return ", ".join([f"[{polygon[i]}, {polygon[i + 1]}]" for i in range(0, len(polygon), 2)])
load_dotenv()
endpoint = os.environ.get("AZURE_ENDPOINT_KEY")
key = os.environ.get("AZURE_API_KEY")

path_to_sample_documents= "highlighted_output.pdf"

document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))
with open(path_to_sample_documents, "rb") as f:
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout",
        body=f,
        features=[DocumentAnalysisFeature.KEY_VALUE_PAIRS],
    )
result: AnalyzeResult = poller.result()

if result.styles:
    for style in result.styles:
        if style.is_handwritten:
            print("Document contains handwritten content: ")
            print(",".join([result.content[span.offset : span.offset + span.length] for span in style.spans]))

print("----Key-value pairs found in document----")
if result.key_value_pairs:
    for kv_pair in result.key_value_pairs:
        if kv_pair.key:
            print(
                f"Key '{kv_pair.key.content}' found within "
                f"'{_format_bounding_region(kv_pair.key.bounding_regions)}' bounding regions"
            )
        if kv_pair.value:
            print(
                f"Value '{kv_pair.value.content}' found within "
                f"'{_format_bounding_region(kv_pair.value.bounding_regions)}' bounding regions\n"
            )

for page in result.pages:
    print(f"----Analyzing document from page #{page.page_number}----")
    print(f"Page has width: {page.width} and height: {page.height}, measured with unit: {page.unit}")

    if page.lines:
        for line_idx, line in enumerate(page.lines):
            words = []
            if page.words:
                for word in page.words:
                    print(f"......Word '{word.content}' has a confidence of {word.confidence}")
                    if _in_span(word, line.spans):
                        words.append(word)
            print(
                f"...Line #{line_idx} has {len(words)} words and text '{line.content}' within "
                f"bounding polygon '{_format_polygon(line.polygon)}'"
            )

    if page.selection_marks:
        for selection_mark in page.selection_marks:
            print(
                f"Selection mark is '{selection_mark.state}' within bounding polygon "
                f"'{_format_polygon(selection_mark.polygon)}' and has a confidence of "
                f"{selection_mark.confidence}"
            )

if result.tables:
    for table_idx, table in enumerate(result.tables):
        print(f"Table # {table_idx} has {table.row_count} rows and {table.column_count} columns")
        if table.bounding_regions:
            for region in table.bounding_regions:
                print(
                    f"Table # {table_idx} location on page: {region.page_number} is {_format_polygon(region.polygon)}"
                )
        for cell in table.cells:
            print(f"...Cell[{cell.row_index}][{cell.column_index}] has text '{cell.content}'")
            if cell.bounding_regions:
                for region in cell.bounding_regions:
                    print(
                        f"...content on page {region.page_number} is within bounding polygon '{_format_polygon(region.polygon)}'\n"
                    )
print("----------------------------------------")

Document contains handwritten content: 
G
----Key-value pairs found in document----
Key 'Nº FINESS :' found within 'Page #1: [1.7606, 1.2299], [2.3457, 1.2307], [2.3456, 1.3416], [1.7605, 1.3409]' bounding regions
Value '
du  de
-  - ' found within 'Page #1: [2.0499, 1.2314], [4.3839, 1.2277], [4.3843, 1.477], [2.0503, 1.4807]' bounding regions

Key ':' found within 'Page #1: [4.4827, 1.3485], [4.6342, 1.3488], [4.6339, 1.4777], [4.4824, 1.4773]' bounding regions
Value '' found within 'Page #1: [4.6924, 1.3485], [5.5574, 1.347], [5.5576, 1.477], [4.6926, 1.4784]' bounding regions

Key ':' found within 'Page #1: [5.6497, 1.3468], [5.7822, 1.347], [5.7821, 1.4762], [5.6496, 1.476]' bounding regions
Value '' found within 'Page #1: [5.8386, 1.3472], [6.7232, 1.3488], [6.723, 1.4788], [5.8384, 1.4772]' bounding regions

Key 'Copie à :' found within 'Page #1: [0.3893, 2.7645], [0.821, 2.7616], [0.8218, 2.8848], [0.3902, 2.8877]' bounding regions
Value 'Docteur    , DR ' found within 'Page #1

| Feature                          | Pricing Tier         | Price (per 1,000 pages) | Notes                                                        |
|-----------------------------------|----------------------|-------------------------|--------------------------------------------------------------|
| Prebuilt Layout (Single/Batch)    | All Prebuilt Models  | €8.848                 | Includes lines, words, tables, key-value pairs, selection marks, etc. |

Good coordinate tracking and an option worth considering; the setup documentation was great too. Table recognition is good, but it struggles with the table lines and sometimes splits cells that should stay whole.
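
The polygons printed above are four corner points per element. As a minimal sketch, assuming the polygon is available as a flat `[x1, y1, x2, y2, ...]` list (the helper name `polygon_to_bbox` is hypothetical and not part of the Azure SDK), they can be reduced to an axis-aligned box that is easier to compare across services:

```python
from typing import Sequence, Tuple


def polygon_to_bbox(polygon: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (left, top, right, bottom) for a flat [x1, y1, x2, y2, ...] polygon."""
    if len(polygon) < 2 or len(polygon) % 2 != 0:
        raise ValueError("polygon must be a flat list of x, y pairs")
    xs = polygon[0::2]  # every even index is an x coordinate
    ys = polygon[1::2]  # every odd index is a y coordinate
    return min(xs), min(ys), max(xs), max(ys)


# Polygon of the 'Nº FINESS :' key from the output above, in the units reported by the service.
finess_key = [1.7606, 1.2299, 2.3457, 1.2307, 2.3456, 1.3416, 1.7605, 1.3409]
print(polygon_to_bbox(finess_key))
```

If your SDK version returns point objects instead of a flat list, flatten them first.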

## Method 8: Google Cloud Document AI

In [None]:
%pip install google-cloud-documentai

First, we need to create a processor. 

In [3]:

from google.api_core.client_options import ClientOptions
from google.cloud import documentai  # type: ignore

# TODO(developer): Replace these values with your own before running the sample.
project_id = 'grounded-block-463311-d3'
location = 'eu' # Format is 'us' or 'eu'
processor_display_name = 'My Processor' # Must be unique per project, e.g.: 'My Processor'
processor_type = 'OCR_PROCESSOR' # Use fetch_processor_types to get available processor types


def create_processor_sample(
    project_id: str, location: str, processor_display_name: str, processor_type: str
) -> None:
    # You must set the api_endpoint if you use a location other than 'us'.
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the location
    # e.g.: projects/project_id/locations/location
    parent = client.common_location_path(project_id, location)

    # Create a processor
    processor = client.create_processor(
        parent=parent,
        processor=documentai.Processor(
            display_name=processor_display_name, type_=processor_type
        ),
    )

    # Print the processor information
    print(f"Processor Name: {processor.name}")
    print(f"Processor Display Name: {processor.display_name}")
    print(f"Processor Type: {processor.type_}")

create_processor_sample(project_id,location,processor_display_name,processor_type)

Processor Name: projects/1009810802339/locations/eu/processors/1ebe6ab85734cb87
Processor Display Name: My Processor
Processor Type: OCR_PROCESSOR


We fill in the project details from the Cloud console and send the request. It runs surprisingly fast. Credentials also need to be configured; see the [guides](https://cloud.google.com/vision/product-search/docs/auth).

In [15]:
from typing import Optional, Sequence
from google.api_core.client_options import ClientOptions
from google.cloud import documentai

def main():
    # Replace these with your actual values
    project_id = 'grounded-block-463311-d3'
    location = "eu"  # Format is "us" or "eu"
    processor_id = "1ebe6ab85734cb87"  # ID printed when the processor was created above
    processor_version = "rc"  # Or your specific version
    file_name = "highlighted_output.pdf"  # Document in same folder as script
    mime_type = "application/pdf"  # Or appropriate type for your file

    process_document_ocr_sample(
        project_id=project_id,
        location=location,
        processor_id=processor_id,
        processor_version=processor_version,
        file_path=file_name,
        mime_type=mime_type,
    )

def process_document_ocr_sample(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
) -> None:
    # Optional: Additional configurations for Document OCR Processor.
    process_options = documentai.ProcessOptions(
        ocr_config=documentai.OcrConfig(
            enable_native_pdf_parsing=True,
            enable_image_quality_scores=True,
            enable_symbol=True,
            premium_features=documentai.OcrConfig.PremiumFeatures(
                compute_style_info=True,
                enable_math_ocr=False,
                enable_selection_mark_detection=True,
            ),
        )
    )
    
    document = process_document(
        project_id,
        location,
        processor_id,
        processor_version,
        file_path,
        mime_type,
        process_options=process_options,
    )

    text = document.text
    print(f"Full document text: {text}\n")
    print(f"There are {len(document.pages)} page(s) in this document.\n")

    for page in document.pages:
        print(f"Page {page.page_number}:")
        print_page_dimensions(page.dimension)
        print_detected_languages(page.detected_languages)

        print_blocks(page.blocks, text)
        print_paragraphs(page.paragraphs, text)
        print_lines(page.lines, text)
        print_tokens(page.tokens, text)

        if page.symbols:
            print_symbols(page.symbols, text)

        if page.image_quality_scores:
            print_image_quality_scores(page.image_quality_scores)

        if page.visual_elements:
            print_visual_elements(page.visual_elements, text)

def print_bounding_box(layout: documentai.Document.Page.Layout, element_name: str = "Element") -> None:
    """Print bounding box information for a layout element"""
    if layout.bounding_poly:
        vertices = layout.bounding_poly.vertices
        normalized_vertices = layout.bounding_poly.normalized_vertices
        
        print(f"        {element_name} Bounding Box:")
        print(f"          Absolute coordinates:")
        for i, vertex in enumerate(vertices):
            print(f"            Vertex {i+1}: ({vertex.x}, {vertex.y})")
        
        if normalized_vertices:
            print(f"          Normalized coordinates (0-1):")
            for i, vertex in enumerate(normalized_vertices):
                print(f"            Vertex {i+1}: ({vertex.x:.3f}, {vertex.y:.3f})")

def print_page_dimensions(dimension: documentai.Document.Page.Dimension) -> None:
    print(f"    Width: {str(dimension.width)}")
    print(f"    Height: {str(dimension.height)}")

def print_detected_languages(
    detected_languages: Sequence[documentai.Document.Page.DetectedLanguage],
) -> None:
    print("    Detected languages:")
    for lang in detected_languages:
        print(f"        {lang.language_code} ({lang.confidence:.1%} confidence)")

def print_blocks(blocks: Sequence[documentai.Document.Page.Block], text: str) -> None:
    print(f"    {len(blocks)} blocks detected:")
    first_block_text = layout_to_text(blocks[0].layout, text)
    print(f"        First text block: {repr(first_block_text)}")
    print_bounding_box(blocks[0].layout, "First Block")
    
    last_block_text = layout_to_text(blocks[-1].layout, text)
    print(f"        Last text block: {repr(last_block_text)}")
    print_bounding_box(blocks[-1].layout, "Last Block")

def print_paragraphs(
    paragraphs: Sequence[documentai.Document.Page.Paragraph], text: str
) -> None:
    print(f"    {len(paragraphs)} paragraphs detected:")
    first_paragraph_text = layout_to_text(paragraphs[0].layout, text)
    print(f"        First paragraph text: {repr(first_paragraph_text)}")
    print_bounding_box(paragraphs[0].layout, "First Paragraph")
    
    last_paragraph_text = layout_to_text(paragraphs[-1].layout, text)
    print(f"        Last paragraph text: {repr(last_paragraph_text)}")
    print_bounding_box(paragraphs[-1].layout, "Last Paragraph")

def print_lines(lines: Sequence[documentai.Document.Page.Line], text: str) -> None:
    print(f"    {len(lines)} lines detected:")
    first_line_text = layout_to_text(lines[0].layout, text)
    print(f"        First line text: {repr(first_line_text)}")
    print_bounding_box(lines[0].layout, "First Line")
    
    last_line_text = layout_to_text(lines[-1].layout, text)
    print(f"        Last line text: {repr(last_line_text)}")
    print_bounding_box(lines[-1].layout, "Last Line")

def print_tokens(tokens: Sequence[documentai.Document.Page.Token], text: str) -> None:
    print(f"    {len(tokens)} tokens detected:")
    first_token_text = layout_to_text(tokens[0].layout, text)
    first_token_break_type = tokens[0].detected_break.type_.name
    print(f"        First token text: {repr(first_token_text)}")
    print(f"        First token break type: {repr(first_token_break_type)}")
    print_bounding_box(tokens[0].layout, "First Token")
    
    if tokens[0].style_info:
        print_style_info(tokens[0].style_info)

    last_token_text = layout_to_text(tokens[-1].layout, text)
    last_token_break_type = tokens[-1].detected_break.type_.name
    print(f"        Last token text: {repr(last_token_text)}")
    print(f"        Last token break type: {repr(last_token_break_type)}")
    print_bounding_box(tokens[-1].layout, "Last Token")
    
    if tokens[-1].style_info:
        print_style_info(tokens[-1].style_info)

def print_symbols(
    symbols: Sequence[documentai.Document.Page.Symbol], text: str
) -> None:
    print(f"    {len(symbols)} symbols detected:")
    first_symbol_text = layout_to_text(symbols[0].layout, text)
    print(f"        First symbol text: {repr(first_symbol_text)}")
    print_bounding_box(symbols[0].layout, "First Symbol")
    
    last_symbol_text = layout_to_text(symbols[-1].layout, text)
    print(f"        Last symbol text: {repr(last_symbol_text)}")
    print_bounding_box(symbols[-1].layout, "Last Symbol")

def print_image_quality_scores(
    image_quality_scores: documentai.Document.Page.ImageQualityScores,
) -> None:
    print(f"    Quality score: {image_quality_scores.quality_score:.1%}")
    print("    Detected defects:")
    for detected_defect in image_quality_scores.detected_defects:
        print(f"        {detected_defect.type_}: {detected_defect.confidence:.1%}")

def print_style_info(style_info: documentai.Document.Page.Token.StyleInfo) -> None:
    print(f"           Font Size: {style_info.font_size}pt")
    print(f"           Font Type: {style_info.font_type}")
    print(f"           Bold: {style_info.bold}")
    print(f"           Italic: {style_info.italic}")
    print(f"           Underlined: {style_info.underlined}")
    print(f"           Handwritten: {style_info.handwritten}")
    print(
        f"           Text Color (RGBa): {style_info.text_color.red}, {style_info.text_color.green}, {style_info.text_color.blue}, {style_info.text_color.alpha}"
    )

def print_visual_elements(
    visual_elements: Sequence[documentai.Document.Page.VisualElement], text: str
) -> None:
    checkboxes = [x for x in visual_elements if "checkbox" in x.type]
    math_symbols = [x for x in visual_elements if x.type == "math_formula"]

    if checkboxes:
        print(f"    {len(checkboxes)} checkboxes detected:")
        print(f"        First checkbox: {repr(checkboxes[0].type)}")
        print_bounding_box(checkboxes[0].layout, "First Checkbox")
        print(f"        Last checkbox: {repr(checkboxes[-1].type)}")
        print_bounding_box(checkboxes[-1].layout, "Last Checkbox")

    if math_symbols:
        print(f"    {len(math_symbols)} math symbols detected:")
        first_math_symbol_text = layout_to_text(math_symbols[0].layout, text)
        print(f"        First math symbol: {repr(first_math_symbol_text)}")
        print_bounding_box(math_symbols[0].layout, "First Math Symbol")

def print_all_elements_with_boxes(page: documentai.Document.Page, text: str) -> None:
    """Print ALL elements with their bounding boxes - use this for detailed analysis"""
    print(f"\n=== DETAILED BOUNDING BOX ANALYSIS FOR PAGE {page.page_number} ===")
    
    # All tokens with bounding boxes
    print(f"\nALL TOKENS ({len(page.tokens)}):")
    for i, token in enumerate(page.tokens):
        token_text = layout_to_text(token.layout, text)
        print(f"  Token {i+1}: {repr(token_text)}")
        print_bounding_box(token.layout, f"Token {i+1}")
    
    # All lines with bounding boxes
    print(f"\nALL LINES ({len(page.lines)}):")
    for i, line in enumerate(page.lines):
        line_text = layout_to_text(line.layout, text)
        print(f"  Line {i+1}: {repr(line_text)}")
        print_bounding_box(line.layout, f"Line {i+1}")

def process_document(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
    process_options: Optional[documentai.ProcessOptions] = None,
) -> documentai.Document:
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )

    name = client.processor_version_path(
        project_id, location, processor_id, processor_version
    )

    with open(file_path, "rb") as image:
        image_content = image.read()

    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(content=image_content, mime_type=mime_type),
        process_options=process_options,
    )

    result = client.process_document(request=request)
    return result.document

def layout_to_text(layout: documentai.Document.Page.Layout, text: str) -> str:
    return "".join(
        text[int(segment.start_index) : int(segment.end_index)]
        for segment in layout.text_anchor.text_segments
    )

if __name__ == "__main__":
    main()

Full document text: Copie électronique
  
N° FINESS : 
 -  -  (:  7: 
  - Biologiste(s) Médical(aux)
Docteur    
CABINET MEDICAL " "
 
Copie à : Docteur    , DR 
X Demande n° 01/02/ -LABO--TP
Patient né(e)   le 
FSE Tiers payant  - 
Prélèvements effectués par le laboratoire le 01/02/21 à 10H27
Madame   

  (100)
Edité le, lundi 1 février 2021
Copie à : Docteur    , DR

Vos résultats sur internet : Accès sécurisé, rapide, gratuit, pratique, écoresponsable
1) Communiquez votre mail au laboratoire 2) Recevez un email dès que vos résultats sont disponibles 3) Cliquez sur le lien
INFORMATION COVID-19
Rendez-vous sur notre site internet dédié pour connaître notre organisation : https:// .fr/depistage-covid-19/
Hématologie
✔ Hémogramme
Valeurs de référence
Antériorités
(Sang total - Variation d'impédance, photométrie, cytométrie en flux)  - 
Hématies ........................................
Hémoglobine ....................................
Hématocrite ......................................
V.G

| Processor                        | 1 - 5,000,000 pages/month | 5,000,001+ pages/month         |
|-----------------------------------|--------------------------|-------------------------------|
| Enterprise Document OCR Processor | $1.50 per 1,000 pages    | $0.60 per 1,000 pages         |

Good coordinates and a cheap model. It can also detect images in PNG and many other formats. A great model, but it struggles with reading order: it splits documents vertically when that isn't really necessary. These splits get in the way of text extraction and scramble the output.
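
To make the tiered pricing table above concrete, here is a small sketch that estimates a monthly bill from a page count. The function name is hypothetical and the rates are simply the two listed in the table; actual billing may differ.

```python
def document_ai_ocr_cost(pages: int) -> float:
    """Estimated monthly cost in USD, using the two tiers from the table above."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    tier_limit = 5_000_000                    # pages billed at the first-tier rate
    first_tier = min(pages, tier_limit)
    second_tier = max(pages - tier_limit, 0)  # pages beyond the first tier
    return first_tier / 1000 * 1.50 + second_tier / 1000 * 0.60


for monthly_pages in (1_000, 5_000_000, 6_000_000):
    print(monthly_pages, round(document_ai_ocr_cost(monthly_pages), 2))
```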

## Method 9: Google Cloud Vision

In [None]:
%pip install --upgrade google-cloud-vision
%pip show google-cloud-vision

Creating a service account key: you need to install Google's CLI, authenticate the account, turn off telemetry (optional), and then add a billing account, because it won't work otherwise. Be sure to check the installation [guides](https://cloud.google.com/vision/product-search/docs/auth).
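
Once the key file is downloaded, one common way to make the client libraries find it is the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. The sketch below assumes that approach; the helper name and the example path are hypothetical.

```python
import os
from pathlib import Path


def use_service_account_key(key_path: str) -> str:
    """Point Google client libraries at a service account key file."""
    resolved = Path(key_path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"Service account key not found: {resolved}")
    # Application Default Credentials read this variable when a client is created.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(resolved)
    return str(resolved)


# Hypothetical path; replace it with the location of your own key file.
# use_service_account_key("~/keys/ocr-benchmark.json")
```

Call it before constructing `vision.ImageAnnotatorClient()` so the client picks up the key.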

In [6]:
from google.cloud import vision

def detect_document(path):
    client = vision.ImageAnnotatorClient()

    with open(path, "rb") as image_file:
        content = image_file.read()

    image = vision.Image(content=content)
    response = client.document_text_detection(image=image)
    print(response)
    if response.error.message:
        raise Exception(
            "{}\nFor more info on error messages, check: "
            "https://cloud.google.com/apis/design/errors".format(response.error.message)
        )

    # Print the full detected text
    print(response.full_text_annotation.text)

    for page in response.full_text_annotation.pages:
        # Process the page as needed
        pass

# Example usage
detect_document("task1c_reportsTemplates/pdf2image_denoised_page_1.png")



text_annotations {
  locale: "fr"
  description: "S\n \n  \nN\302\260 FINESS: \nCopie \303\251lectronique\nLABORATOIRE DE BIOLOGIE M\303\211DICALE\n du G\303\251n\303\251ral de  - -  : :\n  - Biologiste(s) M\303\251dical(aux)\nDocteur    \nCABINET MEDICAL \" \"\n \nCopie \303\240 Docteur    , DR \nX Demande n\302\272 01/02/ -LABO--TP\nPatient n\303\251(e)   le \nFSE Tiers payant  - \nPr\303\251l\303\250vements effectu\303\251s par le laboratoire le 01/02/21 \303\240 10H27\nMadame   \n\n  (100)\nEdit\303\251 le, lundi 1 f\303\251vrier 2021\nCopie \303\240 Docteur    , DR\n\nVos r\303\251sultats sur internet: Acc\303\250s s\303\251curis\303\251, rapide, gratuit, pratique, \303\251coresponsable\n1) Communiquez votre mail au laboratoire 2) Recevez un email d\303\250s que vos r\303\251sultats sont disponibles 3) Cliquez sur le lien\nINFORMATION COVID-19\nRendez-vous sur notre site internet d\303\251di\303\251 pour conna\303\256tre notre organisation: https:// .fr/depistage-covid-19/\nH\303\

Extremely quick. Sadly, even though the boxes seem correct, the results are not that good. The OCR returns words out of reading order, which can cause serious issues downstream. Tables aren't read well.
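
One possible workaround for the ordering problem is to re-sort the words by their boxes. The sketch below is an assumption, not something tested in this benchmark: it takes plain dicts with `text`, `x`, and `y` (the top-left corner of each word, which you would fill from the word bounding boxes in the response), groups words whose `y` values are close into lines, and reads each line left to right.

```python
from typing import Dict, List


def sort_reading_order(words: List[Dict], line_tolerance: float = 10.0) -> List[str]:
    """Group words into lines by vertical position, then order each line by x."""
    lines: List[List[Dict]] = []
    for word in sorted(words, key=lambda w: w["y"]):
        # A word joins the current line if it is vertically close to that line's first word.
        if lines and abs(word["y"] - lines[-1][0]["y"]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x"]))
        for line in lines
    ]


# Hypothetical word boxes in pixels, deliberately shuffled.
words = [
    {"text": "Dose", "x": 200, "y": 52},
    {"text": "Medication", "x": 20, "y": 50},
    {"text": "10 mg", "x": 200, "y": 101},
    {"text": "Cetirizine", "x": 20, "y": 100},
]
print(sort_reading_order(words))
```

The tolerance is in the same units as the coordinates and would need tuning per document; multi-column layouts need more than this simple grouping.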

| Feature                  | First 1000 units/month | Units 1001 - 5,000,000 / month | Units 5,000,001 and higher / month |
|--------------------------|-----------------------|-------------------------------|-------------------------------------|
| Text Detection           | Free                  | $1.50                         | $0.60                               |
| Document Text Detection  | Free                  | $1.50                         | $0.60                               |

## Method 10: AWS Textract

In [None]:
%pip install boto3

You will need to install their CLI and store the credentials under `~/.aws`.
The setup guide is here: [guides](https://docs.aws.amazon.com/textract/latest/dg/security-iam.html)

Not an easy setup; you would expect more and better documentation for a product like this.

In [3]:
import os

import boto3
from dotenv import load_dotenv

load_dotenv()  # read AWS_ACCESS_KEY and AWS_SECRET_KEY from a local .env file
# Document
documentName = "task1c_reportsTemplates/testPic.png"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())
# Amazon Textract client
session = boto3.Session(
    region_name='eu-south-2',
    aws_access_key_id=os.environ.get("AWS_ACCESS_KEY"),
    aws_secret_access_key=os.environ.get("AWS_SECRET_KEY"),
    )

textract=session.client('textract')
# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})

#print(response)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print (item["Text"])

Medications
Current Prescriptions
Medication
Name
Dose
Route
Frequency
Start Date
Prescriber
Indication
Saline Nasal
2
Intranasal
Every 4
2025-04-28
Dr. 
Nasal moisture
Spray (Ocean)
sprays/
hours PRN
Lane, MD
replacement
nostril
Fluticasone
50 mcg/
Intranasal
Once daily
2025-04-28
Dr. 
Suspected
Propionate
spray
Lane, MD
allergic rhinitis
Nasal Spray
Cetirizine
10 mg
Oral
Once daily
2024-11-12
Dr. 
Allergic rhinitis
(Zyrtec)
Kent, MD
Over-the-Counter (Reported)
Medication Name
Frequency
Last Use
Purpose
Vitamin C (500 mg)
Daily
2025-04-29
Immune support
Vaseline (petroleum jelly)
Applied PRN
2025-04-29
Applied around nostrils to prevent cracking
Discontinued Medications (Last 6 Months)
Medication Name
Dose
Reason for Discontinuation
Date Discontinued
Diphenhydramine (Benadryl)
25 mg at bedtime
Caused drowsiness & dry mucosa
2025-02-01


In [3]:
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        box = item["Geometry"]["BoundingBox"]
        print(f'Text: {item["Text"]}')
        print(f'BoundingBox - Left: {box["Left"]}, Top: {box["Top"]}, Width: {box["Width"]}, Height: {box["Height"]}')


Text: Copie électronique
BoundingBox - Left: 0.7135121822357178, Top: 0.007279777433723211, Width: 0.13882844150066376, Height: 0.011959744617342949
Text: LABORATOIRE DE BIOLOGIE MÉDICALE
BoundingBox - Left: 0.5724204778671265, Top: 0.028493167832493782, Width: 0.3976910710334778, Height: 0.01365593634545803
Text:  
BoundingBox - Left: 0.10864158719778061, Top: 0.041259318590164185, Width: 0.22936062514781952, Height: 0.02035103552043438
Text:   
BoundingBox - Left: 0.21428395807743073, Top: 0.0891493484377861, Width: 0.19549520313739777, Height: 0.011410328559577465
Text: N° FINESS 3432
BoundingBox - Left: 0.2138015329837799, Top: 0.10613875836133957, Width: 0.15449538826942444, Height: 0.0073406146839261055
Text:     8 
BoundingBox - Left: 0.21407967805862427, Top: 0.11628034710884094, Width: 0.5963878631591797, Height: 0.008507906459271908
Text:   Biologiste(s) Médical(aux)
BoundingBox - Left: 0.2132289558649063, Top: 0.12795408070087433, Width: 0.28845927119255066, Height: 0.010040

Cannot read PDF input unless the file is in a bucket on their own cloud, which is quite a limitation, and the setup was terrible. Otherwise a cheap and reliable option. Struggles with double-lined table cells.
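
For PDFs already uploaded to S3, Textract offers an asynchronous job API. The sketch below was not part of this benchmark; it assumes a boto3 Textract client is passed in, and the function name, bucket, and key are hypothetical. It starts a job, polls until it finishes, and collects the `LINE` blocks across result pages.

```python
import time
from typing import Any, List


def detect_pdf_lines(textract: Any, bucket: str, key: str,
                     poll_seconds: float = 2.0, max_polls: int = 150) -> List[str]:
    """Run asynchronous text detection on a PDF stored in S3 and return its lines."""
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    for _ in range(max_polls):
        result = textract.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended with {result['JobStatus']}")

    # Results may be paginated; follow NextToken until it is absent.
    lines: List[str] = []
    while True:
        lines.extend(
            block["Text"] for block in result.get("Blocks", [])
            if block["BlockType"] == "LINE"
        )
        token = result.get("NextToken")
        if not token:
            return lines
        result = textract.get_document_text_detection(JobId=job_id, NextToken=token)


# Usage, with the `textract` client created earlier (bucket and key are placeholders):
# lines = detect_pdf_lines(textract, "my-bucket", "reports/highlighted_output.pdf")
```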


| Amazon Textract API | First million pages in a month | Over 1 million pages in a month |
|---------------------|--------------------------------|---------------------------------|
| Per 1,000 pages     | $1.50                          | $0.60                           |

## Method 11: Google Vertex AI test

In [14]:
%pip install --upgrade google-genai

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
from google import genai
from google.genai.types import (
    GenerateContentConfig,
    HarmBlockThreshold,
    HarmCategory,
    HttpOptions,
    Part,
    SafetySetting,
)

api_key = os.environ.get("GCV_API_KEY")
client = genai.Client(http_options=HttpOptions(api_version="v1"), api_key=api_key)

# Path to your local PDF
local_pdf_path = "highlighted_output.pdf"  # Update if needed

# Read the PDF file as bytes
with open(local_pdf_path, "rb") as pdf_file:
    pdf_bytes = pdf_file.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        Part.from_bytes(
            data=pdf_bytes,
            mime_type="application/pdf",
        ),
        "Extract and output the positions of objects and the text in them for each page.",
    ],
    # config=config,  # Uncomment and define if you want custom response config
)

print(response.text)
# Second request: reuse the client created above, this time with a local PNG image.

# Path to your local image
local_image_path = "task1c_reportsTemplates/testPic.png"  # Update this with your actual image path

# Read the image file as bytes
with open(local_image_path, "rb") as image_file:
    image_bytes = image_file.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        Part.from_bytes(
            data=image_bytes,
            mime_type="image/png",  # Update this if your image is JPEG, etc.
        ),
        "Output the positions of objects in the image and the text in them.",
    ],
#    config=config,
)

print(response.text)

```json
[
  {
    "page": 1,
    "elements": [
      {
        "box_2d": [46, 568, 56, 972],
        "text_content": "LABORATOIRE DE BIOLOGIE MÉDICALE"
      },
      {
        "box_2d": [89, 213, 103, 412],
        "text_content": "  "
      },
      {
        "box_2d": [104, 213, 114, 396],
        "text_content": "N° FINESS: 343  2"
      },
      {
        "box_2d": [116, 213, 127, 810],
        "text_content": "  -  :85 15 :53 15 44"
      },
      {
        "box_2d": [128, 213, 139, 532],
        "text_content": "  - Biologiste(s) Médical (aux)"
      },
      {
        "box_2d": [170, 47, 183, 277],
        "text_content": "Docteur    "
      },
      {
        "box_2d": [178, 544, 192, 832],
        "text_content": "Madame   "
      },
      {
        "box_2d": [198, 48, 211, 461],
        "text_content": "CABINET MEDICAL \" \""
      },
      {
        "box_2d": [199, 549, 213, 790],
        "text_content": ""
      },
      {
        "box_2d": [221, 69, 234, 192],
        "te


| Type | Price per 1,000 pages (USD) |
| :-- | :-- |
| Input (text, image, video) | \$0.30 |
| Text output | \$2.50 |
| Total | \$2.80 |


Reads tables properly and has a good coordinate system. It reads double-lined cells and represents the line breaks with `\n`, which are fairly easy to remove. A very good option for medical documents, since the OCR quality is good. The only problem I noticed is that the element order wasn't always reliable, so finding specific values may become a hassle.
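
The `box_2d` values in the output above are not pixels. A minimal conversion sketch, assuming the convention Gemini documents for bounding boxes (`[y_min, x_min, y_max, x_max]`, normalized to 0-1000), is shown below; the helper name and the page size are hypothetical, so check the result against one of your own pages.

```python
from typing import Sequence, Tuple


def box_2d_to_pixels(box_2d: Sequence[float], image_width: int,
                     image_height: int) -> Tuple[int, int, int, int]:
    """Convert [y_min, x_min, y_max, x_max] on a 0-1000 scale to (left, top, right, bottom) pixels."""
    y_min, x_min, y_max, x_max = box_2d
    return (
        round(x_min * image_width / 1000),
        round(y_min * image_height / 1000),
        round(x_max * image_width / 1000),
        round(y_max * image_height / 1000),
    )


# First element of the output above, on a hypothetical 1000 x 2000 pixel page.
print(box_2d_to_pixels([46, 568, 56, 972], image_width=1000, image_height=2000))
```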

## OCR Speed Comparison

| Method / Model                        | Speed (seconds per page) | Total seconds | Notes / Comments          |
|---------------------------------------|--------------------------|---------------|---------------------------|
| meta-llama/llama-4-scout-17b-16e      | 2.42                     | 14.5          | LPU                       |
| meta-llama/llama-4-maverick-17b-128e  | 18.60                    | 111.6         | LPU                       |
| Qwen2.5-VL-72B-Instruct               | 48.33                    | 290           |                           |
| Mistral OCR                           | 4.78                     | 28.7          |                           |
| ChatGPT 4.1 Vision                    | 61.4                     | 61.4          | performed on only one png |
| Azure AI Document Intelligence        | 6.32                     | 31.6          | on pdf, high detail       |
| Google Cloud Document AI              | 1.08                     | 5.4           | on pdf                    |
| Google Cloud Vision                   | 0.9                      | 0.9           | performed on one png      |
| AWS Textract                          | 2.1                      | 2.1           | performed on one png      |
| Google Vertex AI                      | 25.7                     | 25.7          | performed on one png      |
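
For reference, per-page timings like these can be collected with a small wall-clock harness. This is only a sketch of one way to do it, not necessarily how the numbers above were measured; `time_ocr` and `ocr_fn` are hypothetical names, where `ocr_fn` stands for any function that sends one page to a service.

```python
import time
from typing import Callable, Sequence, Tuple


def time_ocr(ocr_fn: Callable[[str], object], pages: Sequence[str]) -> Tuple[float, float]:
    """Return (seconds_per_page, total_seconds) for calling ocr_fn on every page."""
    if not pages:
        raise ValueError("pages must not be empty")
    start = time.perf_counter()
    for page in pages:
        ocr_fn(page)  # network latency is included in the measurement
    total = time.perf_counter() - start
    return total / len(pages), total


# Example with a stand-in function instead of a real OCR call.
per_page, total = time_ocr(lambda path: len(path), ["page_1.png", "page_2.png"])
print(f"{per_page:.4f} s/page, {total:.4f} s total")
```

Remote services vary from run to run, so averaging several runs gives a fairer comparison.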