In [None]:
import base64
import json
import requests
from IPython.display import Markdown, display

# Mistral Document AI Basics

Mistral Document AI offers enterprise-level document processing, combining cutting-edge OCR technology with advanced structured data extraction. This notebook showcases a few examples of basic OCR extraction for text and images.

We will be using the `mistral-document-ai-2505` model with a few documents and images to show the capabilities of the model.

> **Note**: The Document AI endpoint on Azure Foundry cannot process sources from external URLs. Instead, we show you how to base64-encode documents and images and send them in the API request.

## 0. Setup

In [None]:
AZURE_MISTRAL_DOCUMENT_AI_ENDPOINT = ""
AZURE_MISTRAL_DOCUMENT_AI_KEY = ""
REQUEST_HEADERS = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {AZURE_MISTRAL_DOCUMENT_AI_KEY}",
}
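
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the endpoint and key could instead be read from environment variables. The variable names and the `build_request_headers` helper below are assumptions made for illustration, not part of the service.

```python
import os
from typing import Mapping

# Hypothetical environment variable names; use whatever your deployment defines.
ENDPOINT_VAR = "AZURE_MISTRAL_DOCUMENT_AI_ENDPOINT"
KEY_VAR = "AZURE_MISTRAL_DOCUMENT_AI_KEY"


def build_request_headers(env: Mapping[str, str] = os.environ) -> tuple[str, dict]:
    """Return (endpoint, headers), failing early if either value is missing."""
    endpoint = env.get(ENDPOINT_VAR, "").strip()
    key = env.get(KEY_VAR, "").strip()
    if not endpoint or not key:
        raise ValueError(f"Set both {ENDPOINT_VAR} and {KEY_VAR} before running.")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {key}",
    }
    return endpoint, headers
```

Failing early here avoids sending a request with an empty `Authorization` header and debugging the resulting error later.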

Download the sample PDF used in the examples below.

In [None]:
!wget https://raw.githubusercontent.com/mistralai/cookbook/refs/heads/main/mistral/ocr/mistral7b.pdf

## 1. Helper Functions

In [None]:
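from typing import Optional


def encode_image(image_path: str) -> Optional[str]:
    """Read a file (PDF or image) and return its contents as a base64 string."""
    try:
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")
    except FileNotFoundError:
        # Report the missing file and return None so the caller can check for it.
        print(f"Error: The file {image_path} was not found.")
        return None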


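def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:
    """Swap each image placeholder in the markdown for its base64 payload."""
    for img_name, base64_str in images_dict.items():
        # Placeholders have the form ![id](id); only the link target is replaced.
        markdown_str = markdown_str.replace(
            f"![{img_name}]({img_name})", f"![{img_name}]({base64_str})"
        )
    return markdown_str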


def simple_combined_markdown(responsePage: dict) -> str:
    """Return one page's markdown with its images embedded as base64 data."""
    # Map each image id (as referenced in the markdown) to its base64 payload.
    image_data = {img["id"]: img["image_base64"] for img in responsePage["images"]}
    return replace_images_in_markdown(responsePage["markdown"], image_data)

## 2. Basic OCR

In this example, we show how to extract text from a PDF document using Mistral Document AI. In addition to PDFs, the service supports .docx and .pptx files, as well as many common image formats.
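
Because the payload carries the file as a data URI, the MIME type in that URI should match the file being sent. The sketch below guesses the type from the file extension with the standard library. `to_data_uri` is a hypothetical helper, and whether the endpoint accepts every guessed type in this form is an assumption to verify against the service documentation.

```python
import base64
import mimetypes


def to_data_uri(path: str, raw: bytes) -> str:
    """Build a base64 data URI, guessing the MIME type from the file extension."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        # Unknown extension: fall back to a generic binary type.
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(raw).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

For a PDF this yields a string beginning with `data:application/pdf;base64,`, the same prefix used in the payloads in this notebook.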

In [None]:
encodedDocument = encode_image("mistral7b.pdf")

Next we construct the JSON for the request.

In [None]:
documentPayload = {
    "model": "mistral-document-ai-2505",
    "document": {
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{encodedDocument}",
    },
}

Send the request. The response is inspected in the cells that follow.

In [None]:
documentResponse = requests.post(
    url=AZURE_MISTRAL_DOCUMENT_AI_ENDPOINT,
    json=documentPayload,
    headers=REQUEST_HEADERS,
)
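
Before indexing into the response, it can help to confirm that the call succeeded. The following is a minimal sketch, assuming a successful call returns HTTP 200 with a top-level `pages` list, as the cells in this notebook expect. `pages_or_raise` is a hypothetical helper, and the shape of an error body is not assumed, so it is simply included in the message.

```python
def pages_or_raise(status_code: int, body: dict) -> list:
    """Return the list of pages, or raise with the service's reply attached."""
    if status_code != 200:
        raise RuntimeError(f"OCR request failed with HTTP {status_code}: {body}")
    pages = body.get("pages")
    if not isinstance(pages, list):
        raise RuntimeError(f"Response has no 'pages' list: {body}")
    return pages
```

With the response object from the cell above, this could be called as `pages_or_raise(documentResponse.status_code, documentResponse.json())`, provided the body is valid JSON.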

You will notice that for every page, the API returns text data in markdown format, along with information about detected images.

In [None]:
print(json.dumps(documentResponse.json(), indent=4))

In [None]:
print(documentResponse.json()["pages"][0]["markdown"])

In [None]:
display(Markdown(documentResponse.json()["pages"][0]["markdown"]))

If you only need the text, this is sufficient. If you also need the images, the request must ask for them to be included in the response, and they must then be combined with the markdown text.
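
The cells above show only the first page. To get the text of the whole document, the per-page markdown can be joined in order. This sketch assumes the response shape used in this notebook (a `pages` list whose items have a `markdown` field). `all_pages_markdown` and the sample dictionary are made up for illustration.

```python
def all_pages_markdown(response_body: dict, separator: str = "\n\n") -> str:
    """Concatenate the markdown of every page, in the order returned."""
    return separator.join(page["markdown"] for page in response_body["pages"])


# Toy response with the same shape as the real one.
sample_body = {
    "pages": [
        {"markdown": "# Page one"},
        {"markdown": "Text on page two."},
    ]
}
full_text = all_pages_markdown(sample_body)
```

With a real response, `all_pages_markdown(documentResponse.json())` would return the full document as a single markdown string.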

## 3. OCR with Images

In the previous section we saw how easy it is to get the text from a document. If you want the images along with the text, set the `include_image_base64` parameter to true in the request and handle the base64-encoded images returned in the response.

In [None]:
documentPayloadandImages = {
    "model": "mistral-document-ai-2505",
    "document": {
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{encodedDocument}",
    },
    # Use a Python boolean so it is serialized as JSON true, not the string "true".
    "include_image_base64": True,
}

In [None]:
docWithImagesResponse = requests.post(
    url=AZURE_MISTRAL_DOCUMENT_AI_ENDPOINT,
    json=documentPayloadandImages,
    headers=REQUEST_HEADERS,
)

In [None]:
display(Markdown(simple_combined_markdown(docWithImagesResponse.json()["pages"][0])))

And voilà! We have the combined text and images. This is useful for converting documents into markdown format and for extracting the text and images programmatically.
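
To work with the extracted images as files rather than inline markdown, each payload can be decoded back to bytes. The sketch below assumes each image entry has the `id` and `image_base64` fields used earlier, and it accepts the payload either as a full data URI or as bare base64. `image_bytes` and `collect_images` are hypothetical helpers.

```python
import base64


def image_bytes(image_base64: str) -> bytes:
    """Decode an image payload, with or without a 'data:...;base64,' prefix."""
    if image_base64.startswith("data:") and "," in image_base64:
        # Keep only the part after the first comma, which is the base64 body.
        image_base64 = image_base64.split(",", 1)[1]
    return base64.b64decode(image_base64)


def collect_images(page: dict) -> dict:
    """Map each image id on a page to its decoded bytes."""
    return {img["id"]: image_bytes(img["image_base64"]) for img in page["images"]}
```

Each entry of the returned dictionary can then be written to disk, for example with `pathlib.Path(name).write_bytes(data)`.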

## 4. Tabular Example

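Next we look at a document with tabular data. This example uses one of Microsoft's 8-K filings, available at https://www.microsoft.com/en-us/investor/sec-filings. Download the filing and save it next to this notebook as `0000950170-25-100226.pdf` before running the next cell.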

In [None]:
ms8kDocument = encode_image("0000950170-25-100226.pdf")

In [None]:
msRequestPayload = {
    "model": "mistral-document-ai-2505",
    "document": {
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{ms8kDocument}",
    },
}

In [None]:
ms8kResponse = requests.post(
    url=AZURE_MISTRAL_DOCUMENT_AI_ENDPOINT,
    json=msRequestPayload,
    headers=REQUEST_HEADERS,
)

Display a page that contains tables. Index 5 is the sixth entry in the response's `pages` list.

In [None]:
display(Markdown(simple_combined_markdown(ms8kResponse.json()["pages"][5])))

As observed, the extracted text in the table is accurate and true to the original tabular representation.
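
The page index above was chosen by hand. As a rough heuristic, pages containing tables can be located by scanning each page's markdown for a table delimiter row such as `| --- | --- |`. This sketch assumes tables are returned as pipe-delimited markdown tables. `pages_with_tables` is a hypothetical helper, and the sample data is invented.

```python
import re

# A markdown table delimiter row, e.g. "| --- | :--: |".
TABLE_SEPARATOR = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def pages_with_tables(response_body: dict) -> list[int]:
    """Return positions in response_body["pages"] whose markdown has a table."""
    hits = []
    for position, page in enumerate(response_body["pages"]):
        lines = page["markdown"].splitlines()
        # Require a pipe so that a plain horizontal rule ("---") is not counted.
        if any("|" in line and TABLE_SEPARATOR.match(line) for line in lines):
            hits.append(position)
    return hits


sample_body = {
    "pages": [
        {"markdown": "# Title\n\nPlain text only."},
        {"markdown": "| A | B |\n| :--: | :--: |\n| 1 | 2 |"},
        {"markdown": "Intro\n\n---\n\nMore text."},
    ]
}
table_pages = pages_with_tables(sample_body)
```

Calling `pages_with_tables(ms8kResponse.json())` would return candidate indices to pass to `simple_combined_markdown`.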

## 5. Wrap-up

Documents and images are a wealth of information, but that information is only useful if it can be extracted accurately. Mistral Document AI on Azure is a powerful tool that can help you extract text and images from documents accurately, with ease. We hope you found this notebook helpful and we look forward to seeing what you build with it!