# Multimodal Parsing using GPT-4o mini

<a href="https://colab.research.google.com/github/run-llama/llama_cloud_services/blob/main/examples/parse/multimodal/gpt4o_mini.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This cookbook shows you how to use LlamaParse to parse any document with the multimodal capabilities of GPT-4o mini.

LlamaParse lets you plug in external multimodal model vendors for parsing; we handle error correction, validation, scalability, and reliability for you.

Status:

| Last Executed | Version | State      |
|---------------|---------|------------|
| Aug-19-2025   | 0.6.61  | Maintained |

## Setup

Download the data: Meta's blog post on Llama 3.1, in PDF form.

In [None]:
!mkdir -p data
!wget "https://www.dropbox.com/scl/fi/8iu23epvv3473im5rq19g/llama3.1_blog.pdf?rlkey=5u417tbdox4aip33fdubvni56&st=dzozd11e&dl=1" -O "data/llama3.1_blog.pdf"

--2025-08-20 09:01:29--  https://www.dropbox.com/scl/fi/8iu23epvv3473im5rq19g/llama3.1_blog.pdf?rlkey=5u417tbdox4aip33fdubvni56&st=dzozd11e&dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.1.18, 2620:100:6016:18::a27d:112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.1.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://uc29796f0b776076192093df7b2d.dl.dropboxusercontent.com/cd/0/inline/CvxiobAxsMsABs0DEDrx1mQ4P4l3JsmP2sR43DDeERGKF46mpTn7IFVWd4tKNsnH5ktPFJS_XYJG7jzY4B_-hCc9sXoVRVL74CYo95FjlLfLroFwdAtq-f00E7BrSfVABBwjXltHN2LtIXuyNWsRg0_t/file?dl=1# [following]
--2025-08-20 09:01:29--  https://uc29796f0b776076192093df7b2d.dl.dropboxusercontent.com/cd/0/inline/CvxiobAxsMsABs0DEDrx1mQ4P4l3JsmP2sR43DDeERGKF46mpTn7IFVWd4tKNsnH5ktPFJS_XYJG7jzY4B_-hCc9sXoVRVL74CYo95FjlLfLroFwdAtq-f00E7BrSfVABBwjXltHN2LtIXuyNWsRg0_t/file?dl=1
Resolving uc29796f0b776076192093df7b2d.dl.dropboxusercontent.com (uc29796f0b776076192093df7b2d.dl.dropboxusercont

![llama_blog_img](llama3.1-p5.png)

## Initialize LlamaParse

Initialize LlamaParse in multimodal mode, and specify the vendor.

In [None]:
from llama_cloud_services import LlamaParse

parser = LlamaParse(
    parse_mode="parse_page_with_lvm",
    vendor_multimodal_model_name="openai-gpt-4o-mini",
    # vendor_multimodal_api_key="fake",
    high_res_ocr=True,
    adaptive_long_table=True,
    outlined_table_extraction=True,
    output_tables_as_HTML=True,
    api_key="llx-...",
)

result = await parser.aparse("./data/llama3.1_blog.pdf")

Started parsing the file under job_id 5c002568-5fcb-4741-abb2-6cfe598646c1
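
The `api_key="llx-..."` placeholder above is best kept out of the notebook itself. A minimal sketch of one way to do that, assuming the key has been exported as the `LLAMA_CLOUD_API_KEY` environment variable; the `require_env` helper is hypothetical and not part of `llama_cloud_services`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical usage with the parser cell above (other arguments unchanged):
# parser = LlamaParse(
#     parse_mode="parse_page_with_lvm",
#     vendor_multimodal_model_name="openai-gpt-4o-mini",
#     api_key=require_env("LLAMA_CLOUD_API_KEY"),
# )
```

Failing early with a named variable tends to be easier to debug than an authentication error returned after the upload has started. The same helper could supply the OpenAI key, for example from `OPENAI_API_KEY`.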


## View Results

Let's view the gpt-4o-mini parsing results alongside the original document page.

In [None]:
documents = result.get_markdown_documents(split_by_page=True)
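
With one document per page, it can be handy to find which pages contain tables before printing any of them. The sketch below works on plain strings, so it does not depend on any LlamaParse API; `pages_with_tables` is a hypothetical helper, and the detection rule (a Markdown separator row or a `<table` tag) is an assumption about what the parsed output looks like.

```python
import re

# Matches a Markdown table separator row such as "|-----|-----|".
_MD_TABLE_SEPARATOR = re.compile(
    r"^[ \t]*\|?[ \t]*:?-{3,}:?[ \t]*(\|[ \t]*:?-{3,}:?[ \t]*)+\|?[ \t]*$",
    re.MULTILINE,
)


def pages_with_tables(page_texts):
    """Return 1-based page numbers whose text contains a Markdown or HTML table."""
    hits = []
    for page_number, text in enumerate(page_texts, start=1):
        if _MD_TABLE_SEPARATOR.search(text) or "<table" in text.lower():
            hits.append(page_number)
    return hits


# Hypothetical usage with the documents above:
# table_pages = pages_with_tables([doc.get_content() for doc in documents])
```

Checking for both forms is deliberate: the parser is configured with `output_tables_as_HTML=True`, yet a page may still come back with pipe-delimited Markdown tables.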

In [None]:
print(documents[4].get_content(metadata_mode="all"))

page_number: 5
file_name: ./data/llama3.1_blog.pdf

  
Introducing Llama 3.1: Our most capable models to date  
  

# Category Benchmark

| Benchmark                     | Llama 3.1 8B | Gemma 2 9B IT | Mistral 7B Instruct | Llama 3.1 70B | Mixtral 8x228 Instruct | GPT 3.5 Turbo |
|-------------------------------|---------------|----------------|---------------------|----------------|------------------------|----------------|
| General                       |               |                |                     |                |                        |                |
| MMLU (0-shot, non-CoT)       | 73.0          | 72.3           | 60.5                | 86.0           | 79.9                   | 69.8           |
| MMLU PRO (5-shot, CoT)       | 48.3          | 36.9           | 36.9                | 66.4           | 56.3                   | 49.2           |
| IFEval                        | 80.4          | 73.6           | 57.6                | 87.5           | 72.7                  

## Set Up RAG Pipeline

Let's set up a RAG pipeline over this data.

We use gpt-5-mini for the text synthesis step and text-embedding-3-large for embeddings.

In [None]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-5-mini", api_key="sk-...")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large", api_key="sk-...")

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=5)
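
Setting `similarity_top_k=5` asks the retriever for the five highest-scoring chunks. Conceptually, top-k retrieval ranks the stored embedding vectors by their similarity to the query embedding and keeps the best `k`. The sketch below illustrates that idea in plain Python using cosine similarity; the function names are hypothetical, the vectors are toy values, and this is not LlamaIndex's actual implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return up to k (score, index) pairs, highest similarity first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy example: three 2-dimensional "embeddings" and one query vector.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k([1.0, 0.0], doc_vecs, k=2)
print([index for _, index in best])
```

A larger `k` gives the LLM more context at the cost of more tokens per query. Because each page is one document here, `k=5` means at most five pages are passed to the synthesis step.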

In [None]:
query = "How does Llama3.1 compare against gpt-4o and Claude 3.5 Sonnet in human evals?"

response = query_engine.query(query)

In [None]:
print(response)

Reported human-evaluation results for Llama 3.1 (405B):

- vs GPT-4-0125-Preview: Win 23.3%, Tie 52.2%, Loss 24.5%  
- vs GPT-4: Win 19.1%, Tie 51.7%, Loss 29.2%  
- vs Claude 3.5 Sonnet: Win 24.9%, Tie 50.8%, Loss 24.2%

There are no separate head-to-head human-eval numbers published specifically for GPT‑4o in the reported results.
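
To check which pages an answer was grounded in, it can help to list every retrieved node with its score and page number rather than printing only the first one. The following is a minimal sketch: `summarize_sources` is a hypothetical helper that assumes each item exposes `score`, `text`, and `metadata` attributes, and that `metadata` carries the `page_number` key seen in the parsed output earlier. The stand-in objects exist only so the sketch runs on its own.

```python
from types import SimpleNamespace


def summarize_sources(source_nodes, preview_chars=80):
    """Return one summary line per retrieved node: page, score, and a text preview."""
    rows = []
    for item in source_nodes:
        page = item.metadata.get("page_number", "?")
        score = item.score if item.score is not None else float("nan")
        # Collapse whitespace so multi-line tables fit on one preview line.
        preview = " ".join(item.text.split())[:preview_chars]
        rows.append(f"page={page} score={score:.3f} {preview}")
    return rows


# Stand-in objects so the sketch runs without a query engine.
demo_nodes = [
    SimpleNamespace(score=0.5, text="# Category Benchmark\n\n| Benchmark |", metadata={"page_number": 5}),
    SimpleNamespace(score=None, text="Introducing Llama 3.1", metadata={}),
]
demo_rows = summarize_sources(demo_nodes)

# Hypothetical usage with the response above:
# for row in summarize_sources(response.source_nodes):
#     print(row)
```

If the expected page does not appear in the list, the gap is more likely in retrieval or parsing than in the synthesis step.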


In [None]:
print(response.source_nodes[0].text)

Introducing Llama 3.1: Our most capable models to date  
  

# Category Benchmark

| Benchmark                     | Llama 3.1 8B | Gemma 2 9B IT | Mistral 7B Instruct | Llama 3.1 70B | Mixtral 8x228 Instruct | GPT 3.5 Turbo |
|-------------------------------|---------------|----------------|---------------------|----------------|------------------------|----------------|
| General                       |               |                |                     |                |                        |                |
| MMLU (0-shot, non-CoT)       | 73.0          | 72.3           | 60.5                | 86.0           | 79.9                   | 69.8           |
| MMLU PRO (5-shot, CoT)       | 48.3          | 36.9           | 36.9                | 66.4           | 56.3                   | 49.2           |
| IFEval                        | 80.4          | 73.6           | 57.6                | 87.5           | 72.7                   | 69.9           |
| Code                          |  