In [1]:
import base64
import httpx
import json
import pymupdf
import os

from pathlib import Path
from typing import Dict, Any, Optional, List
from dotenv import load_dotenv
load_dotenv()

True

# Combining `mistral-ocr` and `mistral-small-2503` for advanced document analysis 

This notebook shows an example of combining two Mistral AI models for advanced document analysis:

- `mistral-ocr-2503` to extract text and image content from a document,
- `mistral-small-2503` to process the extracted text and image content.

You will use the [Pixtral 12B](https://arxiv.org/pdf/2410.07073) technical report as an example of a raw input document. The entire workflow will be implemented using Azure AI Foundry deployments.

> **Important**: The OCR API endpoint does not have Internet access, so make sure the input documents are available on your client machine, or download them beforehand.

## 0. Gathering metadata

Fill in the following variables with your own endpoint URLs and API keys:

In [2]:
AZURE_MISTRAL_OCR_ENDPOINT = os.getenv("AZURE_MISTRAL_OCR_ENDPOINT", "")
AZURE_MISTRAL_OCR_API_KEY = os.getenv("AZURE_MISTRAL_OCR_API_KEY", "")
AZURE_MISTRAL_SMALL_ENDPOINT = os.getenv("AZURE_MISTRAL_SMALL_ENDPOINT", "")
AZURE_MISTRAL_SMALL_API_KEY = os.getenv("AZURE_MISTRAL_SMALL_API_KEY", "")
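
Because each variable above falls back to an empty string, a missing value would only surface later as a failed HTTP request. The following is a minimal sketch of an early check; the helper name `find_missing_settings` and the `REQUIRED_SETTINGS` tuple are illustrative and not part of the original notebook.

```python
import os
from typing import Iterable, List, Mapping

# Names of the environment variables read in the cell above.
REQUIRED_SETTINGS = (
    "AZURE_MISTRAL_OCR_ENDPOINT",
    "AZURE_MISTRAL_OCR_API_KEY",
    "AZURE_MISTRAL_SMALL_ENDPOINT",
    "AZURE_MISTRAL_SMALL_API_KEY",
)


def find_missing_settings(
    env: Mapping[str, str], names: Iterable[str] = REQUIRED_SETTINGS
) -> List[str]:
    """Return the names whose value is absent or blank in `env` (hypothetical helper)."""
    return [name for name in names if not env.get(name, "").strip()]


missing = find_missing_settings(os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
```

Running this right after `load_dotenv()` makes it easier to tell a configuration problem apart from an API error.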

Next, set the local path of the document to analyze:

In [3]:
# Issue with jpeg images in the Mistral OCR model (use png instead)
INPUT_DOCUMENT_PATH = os.path.join("docs", os.getenv("INPUT_DOCUMENT", "receipt.png"))
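
The default input here is a PNG file, while the `parse` method defined later in this notebook always prefixes the payload with `data:application/pdf`. The sketch below shows one way to derive the prefix from the file extension instead. The helper name `build_data_uri` is hypothetical, and whether the OCR endpoint accepts or requires other MIME types in this field is an assumption that should be checked against the deployment.

```python
import base64
import mimetypes
from pathlib import Path


def build_data_uri(document_path: str, default_mime: str = "application/pdf") -> str:
    """Encode a file as a base64 data URI, guessing the MIME type from its extension.

    Hypothetical helper: falls back to `default_mime` when the type cannot be guessed.
    """
    mime_type, _ = mimetypes.guess_type(document_path)
    payload = base64.b64encode(Path(document_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type or default_mime};base64,{payload}"


# Example (not executed here): build_data_uri(INPUT_DOCUMENT_PATH)
```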

The following cell contains default values for the system messages used by `mistral-small-2503` for each post-OCR task to execute:

In [4]:
SUMMARIZATION_SYSTEM_MESSAGE = """
Your mission is to summarize the following text in a short and concise way.
Always answer in a well-formatted JSON object containing a single string item called 'summary' 
"""

DESC_FIG_SYSTEM_MESSAGE = """
Your mission is to provide a brief and informative description of each image you will be shown.
Always answer in a well-formatted JSON object containing:
- type: a string describing the type of figure you see (plot, picture, diagram, etc.)
- description: the information you can derive from the figure
"""

## 1. Creating helper functions

Before getting started, we need to create a few building blocks that will make the code more modular:

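- The function `_encode_document_pages_to_base64` takes the file path of a PDF document, renders each page as a JPEG image, and returns a list of base64-encoded strings, one per page.

- The function `_encode_document_to_base64` takes the file path of a document and returns the whole file as a single base64-encoded string. This is the encoder used by the `Document` class below.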

- The function `_call_ocr_model` sends an API call to the `mistral-ocr-2503` model endpoint and returns a parsed version of the document with the extracted text and images.

- The function `_call_vlm_model` sends an API call to the `mistral-small-2503` model endpoint and returns a JSON-formatted response. 

In [5]:
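def _encode_document_pages_to_base64(pdf_doc_path: str) -> List[str]:
    """Render each page as a JPEG image and return one base64 string per page."""
    encoded_pages: List[str] = []
    # Open the document as a context manager so the file is closed afterwards.
    with pymupdf.open(pdf_doc_path) as doc:
        for page in doc:
            page_bytes = page.get_pixmap().tobytes("jpeg")
            page_b64_encoded = base64.b64encode(page_bytes).decode("utf-8")
            encoded_pages.append(page_b64_encoded)
    return encoded_pages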


def _encode_document_to_base64(document_path: str) -> str:
    with Path(document_path).open(mode="rb") as f_in:
        doc_encoded = base64.b64encode(f_in.read()).decode("utf-8")
        return doc_encoded


def _call_ocr_model(
    endpoint: str, api_key: str, base64_input_data: str
) -> Dict[str, Any]:
    endpoint_url = f"{endpoint}"
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    payload = {
        "model": os.getenv("AZURE_MISTRAL_OCR_MODEL", "mistral-ocr-2503-eus"),
        "document": {"type": "document_url", "document_url": base64_input_data},
        "include_image_base64": True,
    }
    with httpx.Client() as client:
        ocr_resp = client.post(
            url=endpoint_url, headers=headers, json=payload, timeout=60.0
        )
        ocr_resp.raise_for_status()
        return ocr_resp.json()


def _call_vlm_model(
    endpoint: str,
    api_key: str,
    user_message: Dict[str, Any],
    system_message: Dict[str, str],
) -> Dict[str, Any]:
    url = f"{endpoint}"
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    payload = {
        "model": os.getenv("AZURE_MISTRAL_SMALL_MODEL", "mistral-small-2503-glo"),
        "messages": [system_message, user_message],
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
    with httpx.Client() as client:
        resp = client.post(url=url, headers=headers, json=payload, timeout=60.0)
        resp.raise_for_status()
        return resp.json()
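
Both helpers above call `raise_for_status()`, so a transient network or server error ends the run. As a rough sketch, the calls could be wrapped in a small retry loop with exponential backoff. The `retry_call` function and its parameters are illustrative and not part of the original notebook; which exceptions are worth retrying is left to the caller.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_call(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn` up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            # Re-raise once the last attempt has failed.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2**attempt)
    raise ValueError("attempts must be at least 1")


# Example (not executed here), retrying on any httpx error:
# retry_call(lambda: _call_ocr_model(endpoint, api_key, data), retry_on=(httpx.HTTPError,))
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting.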

## 2. Creating the `Document` class

The `Document` class is designed to manage and process documents, particularly PDFs. It initializes with an optional source file path and provides three primary methods: 

- `parse`, which encodes the document to base64 and uses the `mistral-ocr-2503` model to extract text and images,
- `summarize`, which uses `mistral-small-2503` to summarize the document's content,
- `describe_figures`, which identifies and describes images within the parsed document using `mistral-small-2503`.

In [6]:
class Document:
    def __init__(self, source_file: str | None = None):
        self.source_file: str | None = source_file
        # Holds the JSON response of the OCR model once `parse` has run.
        self.parsed_doc: Dict[str, Any] | None = None

    def parse(self):
        encoded_doc = _encode_document_to_base64(document_path=self.source_file)
        self.parsed_doc = _call_ocr_model(
            endpoint=AZURE_MISTRAL_OCR_ENDPOINT,
            api_key=AZURE_MISTRAL_OCR_API_KEY,
            base64_input_data=f"data:application/pdf;base64,{encoded_doc}",
        )

    def summarize(self) -> Dict[str, Any]:
        system_message = {"role": "system", "content": SUMMARIZATION_SYSTEM_MESSAGE}
        user_message_content: List[Dict[str, Any]] = []
        pages = self.parsed_doc["pages"]
        for page in pages:
            user_message_content.append({"type": "text", "text": page["markdown"]})
        user_message = {"role": "user", "content": user_message_content}
        vlm_resp = _call_vlm_model(
            endpoint=AZURE_MISTRAL_SMALL_ENDPOINT,
            api_key=AZURE_MISTRAL_SMALL_API_KEY,
            system_message=system_message,
            user_message=user_message,
        )
        return json.loads(vlm_resp["choices"][0]["message"]["content"])

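    def describe_figures(self, pages: Optional[List[int]] = None) -> List[Dict[str, Any]]:
        system_message = {"role": "system", "content": DESC_FIG_SYSTEM_MESSAGE}
        figures: List[Dict[str, Any]] = []
        for idx, page in enumerate(self.parsed_doc["pages"]):
            # Skip pages that were not requested (0-based, like the `page` key in the output).
            if pages is not None and idx not in pages:
                continue
            for img in page["images"]: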
                user_message = {
                    "role": "user",
                    "content": [
                        {"type": "image_url", "image_url": {"url": img["image_base64"]}}
                    ],
                }
                vlm_resp = _call_vlm_model(
                    endpoint=AZURE_MISTRAL_SMALL_ENDPOINT,
                    api_key=AZURE_MISTRAL_SMALL_API_KEY,
                    system_message=system_message,
                    user_message=user_message,
                )
                desc_dict = json.loads(vlm_resp["choices"][0]["message"]["content"])
                fig_desc = {"page": idx, "desc": desc_dict}
                figures.append(fig_desc)
        return figures
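
`describe_figures` passes `img["image_base64"]` directly as an image URL, which suggests the OCR response stores each image as a base64 data URI. Assuming that format (it is not confirmed by the notebook itself), the sketch below decodes such a value back into raw bytes, for example to save a figure to disk. The helper name `decode_data_uri` is hypothetical.

```python
import base64
from typing import Tuple


def decode_data_uri(data_uri: str) -> Tuple[str, bytes]:
    """Split a `data:<mime>;base64,<payload>` string into its MIME type and raw bytes."""
    header, sep, payload = data_uri.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


# Example (not executed here):
# mime, raw = decode_data_uri(doc.parsed_doc["pages"][0]["images"][0]["image_base64"])
```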

## 3. Using the `Document` class

You can now analyze your documents with the `Document` class! Start by creating an instance with the path to the document you wish to study, then call `parse` to run the OCR model on it:

In [7]:
doc = Document(INPUT_DOCUMENT_PATH)
doc.parse()

Now that the document has been parsed into text and image blocks by the OCR model, you can run downstream processing on these blocks. For example, you can summarize the document's text:

In [8]:
doc.summarize()

{'summary': 'The text describes various chart patterns used in technical analysis, including double tops, double bottoms, flags, pennants, wedges, and triangles, with both bullish and bearish variations. Each pattern is accompanied by an image for visual reference.'}
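
Both `summarize` and `describe_figures` call `json.loads` on the model reply and trust that it contains the keys requested in the system messages. A minimal sketch of a stricter parser is shown below; the helper name `parse_json_object` is illustrative and not part of the original notebook.

```python
import json
from typing import Any, Dict, Iterable


def parse_json_object(raw_content: str, required_keys: Iterable[str]) -> Dict[str, Any]:
    """Parse a model reply and check that it is a JSON object with the expected keys."""
    data = json.loads(raw_content)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = sorted(set(required_keys) - data.keys())
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


# Example (not executed here):
# parse_json_object(vlm_resp["choices"][0]["message"]["content"], ["summary"])
```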

You can also leverage the multimodal abilities of `mistral-small-2503` to annotate all the images and figures that are present in the document:

In [9]:
doc.describe_figures()

[{'page': 0,
  'desc': {'type': 'plot',
   'description': "A line plot with a single data series. The plot shows a sharp increase followed by a leveling off. There is a horizontal blue line indicating a specific value on the y-axis. The x-axis is labeled with the word 'terms'. The plot appears to be part of a larger image or document, with a green banner at the top left corner."}},
 {'page': 0,
  'desc': {'type': 'diagram',
   'description': 'This image depicts a simple line diagram representing a breakout pattern. The black lines form a wedge pattern, indicating a period of consolidation or convergence. A horizontal blue line extends from the right side of the wedge, representing a resistance level. A green arrow points upward from the blue line, suggesting a breakout above the resistance level and indicating a potential upward price movement or trend.'}},
 {'page': 0,
  'desc': {'type': 'diagram',
   'description': 'This diagram features a zigzag line (black) that is bounded by two p
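
The list returned by `describe_figures` is flat, with one entry per image. As a small illustration, the entries can be regrouped by page number. The helper name `group_figures_by_page` and the sample data are made up for this sketch; only the `page` and `desc` keys come from the notebook.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_figures_by_page(figures: List[Dict[str, Any]]) -> Dict[int, List[Dict[str, Any]]]:
    """Map each page index to the list of figure descriptions found on that page."""
    grouped: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for fig in figures:
        grouped[fig["page"]].append(fig["desc"])
    return dict(grouped)


# Made-up sample in the same shape as the output of `describe_figures`.
sample_figures = [
    {"page": 0, "desc": {"type": "plot", "description": "first figure"}},
    {"page": 0, "desc": {"type": "diagram", "description": "second figure"}},
    {"page": 2, "desc": {"type": "picture", "description": "third figure"}},
]
figures_by_page = group_figures_by_page(sample_figures)
```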

## 4. Wrapping up

You now have a working example of how to perform advanced analysis on the text and images extracted from a document, using models deployed in Azure AI Foundry!