# Multimodal Parsing using GPT-4o-mini

<a href="https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/multimodal/gpt4o_mini.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This cookbook shows you how to use LlamaParse to parse a document with the multimodal capabilities of GPT-4o-mini.

LlamaParse lets you plug in external multimodal model vendors for parsing, while we handle the error correction, validation, and scalability/reliability for you.


## Setup

Apply `nest_asyncio` so that LlamaParse's async calls can run inside the notebook's event loop, then download the data: Meta's blog post on Llama 3.1, in PDF form.

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
!mkdir -p data
!wget "https://www.dropbox.com/scl/fi/8iu23epvv3473im5rq19g/llama3.1_blog.pdf?rlkey=5u417tbdox4aip33fdubvni56&st=dzozd11e&dl=1" -O "data/llama3.1_blog.pdf"

![llama_blog_img](llama3.1-p5.png)

## Initialize LlamaParse

Initialize LlamaParse in multimodal mode, and specify the vendor.

**NOTE**: You can optionally specify your own OpenAI API key. If you do, you will be charged our base LlamaParse price of 0.3c per page. If you don't, you will be charged 1.5c per page, as we will make the calls to GPT-4o-mini for you and give you price predictability.
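To make the pricing difference concrete, here is a minimal sketch that turns the per-page prices quoted in this cookbook into a dollar estimate. It assumes "c" means US cents; the function name and the 1000-page document are hypothetical, and the estimate covers only the per-page figures stated here.

```python
# Per-page prices in US cents, as quoted in this cookbook.
# Assumption: "c" in the notes means US cents.
PRICE_CENTS_PER_PAGE = {
    "gpt-4o-mini, own OpenAI key": 0.3,
    "gpt-4o-mini, no OpenAI key": 1.5,
    "gpt-4o (baseline section)": 3.0,
}


def estimate_cost_usd(num_pages: int, cents_per_page: float) -> float:
    """Return the estimated parsing cost in US dollars."""
    return num_pages * cents_per_page / 100


pages = 1000  # hypothetical document size
for option, cents in PRICE_CENTS_PER_PAGE.items():
    print(f"{option}: ${estimate_cost_usd(pages, cents):.2f}")
```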

In [None]:
from llama_index.core.schema import TextNode
from typing import List
import json


def get_text_nodes(json_list: List[dict]) -> List[TextNode]:
    """Build one TextNode per parsed page, keeping the page number as metadata."""
    text_nodes = []
    for page in json_list:
        text_node = TextNode(text=page["md"], metadata={"page": page["page"]})
        text_nodes.append(text_node)
    return text_nodes


def save_jsonl(data_list, filename):
    """Save a list of dictionaries as JSON Lines."""
    with open(filename, "w") as file:
        for item in data_list:
            json.dump(item, file)
            file.write("\n")


def load_jsonl(filename):
    """Load a list of dictionaries from JSON Lines."""
    data_list = []
    with open(filename, "r") as file:
        for line in file:
            data_list.append(json.loads(line))
    return data_list
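The two JSONL helpers above write one JSON object per line and read them back line by line. The following standalone sketch repeats that logic on a temporary file to show the round trip; the records are hypothetical stand-ins for serialized nodes, not real LlamaParse output.

```python
import json
import os
import tempfile

# Hypothetical records standing in for serialized nodes.
records = [
    {"text": "# Page one", "metadata": {"page": 1}},
    {"text": "# Page two", "metadata": {"page": 2}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "docs.jsonl")

    # Write: one JSON object per line, as save_jsonl does.
    with open(path, "w") as file:
        for item in records:
            json.dump(item, file)
            file.write("\n")

    # Read: parse each line independently, as load_jsonl does.
    with open(path, "r") as file:
        loaded = [json.loads(line) for line in file]

print("Round trip preserved the data:", loaded == records)
```

Because each line is an independent JSON document, a file in this format can be appended to or streamed without loading the whole list into memory.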

In [None]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt-4o-mini",
    invalidate_cache=True,
)
json_objs = parser.get_json_result("./data/llama3.1_blog.pdf")
json_list = json_objs[0]["pages"]
docs = get_text_nodes(json_list)

Started parsing the file under job_id bf3e7341-bb11-42d4-a5f7-bb5260ad792c


In [None]:
# Optional: Save
save_jsonl([d.dict() for d in docs], "docs.jsonl")

In [None]:
# Optional: Load
from llama_index.core import Document

docs_dicts = load_jsonl("docs.jsonl")
docs = [Document.parse_obj(d) for d in docs_dicts]

### Setup GPT-4o baseline

For comparison, we will also parse the document using GPT-4o (3c per page).

In [None]:
from llama_parse import LlamaParse

parser_gpt4o = LlamaParse(
    result_type="markdown",
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt4o",
    # invalidate_cache=True
)
json_objs_gpt4o = parser_gpt4o.get_json_result("./data/llama3.1_blog.pdf")
# json_objs_gpt4o = parser.get_json_result("./data/llama2-p33.pdf")
json_list_gpt4o = json_objs_gpt4o[0]["pages"]
docs_gpt4o = get_text_nodes(json_list_gpt4o)

Started parsing the file under job_id 391ff280-08e5-4143-85f2-90ada287e26c


In [None]:
# Optional: Save
save_jsonl([d.dict() for d in docs_gpt4o], "docs_gpt4o.jsonl")

In [None]:
# Optional: Load
from llama_index.core import Document

docs_gpt4o_dicts = load_jsonl("docs_gpt4o.jsonl")
docs_gpt4o = [Document.parse_obj(d) for d in docs_gpt4o_dicts]

## View Results

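Let's compare the GPT-4o-mini and GPT-4o results against the original document page.

We see that the GPT-4o-mini parse drops the "Llama 3.1 8B" column, so some scores land under the wrong model, while the GPT-4o parse keeps every column and the category grouping (though it misspells some benchmark names, such as "ITEval" for "IFEval").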

**NOTE**: If you're using llama2-p33, just use `docs[0]`

In [None]:
# using GPT4o-mini
print(docs[4].get_content(metadata_mode="all"))

page: 5

# Llama 3.1 Model Evaluation

## Category Benchmark

| Benchmark                     | Gemma 2 9B IT | Mistral 7B Instruct | Llama 3.1 70B | Mistral 8x228B Instruct | GPT 3.5 Turbo |
|-------------------------------|----------------|----------------------|----------------|-------------------------|----------------|
| General                       |                |                      |                |                         |                |
| MMLU (0-shot, CoT)           | 73.0           | 72.3                 | 86.0           | 79.9                    | 69.8           |
| MMLU PRO (5-shot, CoT)       | 48.3           | 36.9                 | 66.4           | 56.3                    | 49.2           |
| IFEval                        | 80.4           | 73.6                 | 87.5           | 72.7                    | 69.9           |
| Code                          |                |                      |                |                         |                |
| Huma

In [None]:
# using GPT-4o
print(docs_gpt4o[4].get_content(metadata_mode="all"))

page: 5

# Introducing Llama 3.1: Our most capable models to date

## Meta

| Category | Benchmark | Llama 3.1 8B | Gemma 2 9B IT | Mistral 7B Instruct | Llama 3.1 70B | Mixtral 8x22B Instruct | GPT 3.5 Turbo |
|----------|-----------|--------------|---------------|---------------------|---------------|-----------------------|---------------|
| General  | MMLU (0-shot, CoT) | 73.0 | 72.3 (0-shot, non-CoT) | 60.5 | 86.0 | 79.9 | 69.8 |
|          | MMLU PRO (5-shot, CoT) | 48.3 | 71.7 | 36.9 | 66.4 | 56.3 | 49.2 |
|          | ITEval | 80.4 | 73.6 | 57.6 | 87.5 | 72.7 | 69.9 |
| Code     | HumanEval (0-shot) | 72.6 | 54.3 | 40.2 | 80.5 | 75.6 | 68.0 |
|          | MBPP EvalPlus (5-shot) (0-shot) | 72.8 | 71.7 | 49.5 | 86.0 | 78.6 | 82.0 |
| Math     | GSM8K | 84.5 | 76.7 | 53.2 | 95.1 | 88.2 | 81.6 |
|          | MATH (0-shot, CoT) | 51.9 | 44.3 | 13.0 | 68.0 | 54.1 | 43.1 |
| Reasoning | ARC Challenge (0-shot) | 83.4 | 87.6 | 74.2 | 94.8 | 88.7 | 83.7 |
|          | GOPA (0-shot) | 32.
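One quick way to spot structural differences between the two parses is to diff the table header rows. The sketch below does that with plain string handling; the header rows are copied from the two outputs above (with padding whitespace collapsed), and `header_columns` is a hypothetical helper, not part of LlamaParse or LlamaIndex.

```python
def header_columns(markdown_table: str) -> list[str]:
    """Return the column names from the first row of a markdown table."""
    first_row = markdown_table.strip().splitlines()[0]
    return [cell.strip() for cell in first_row.strip().strip("|").split("|")]


# Header rows copied from the two outputs above (padding collapsed).
mini_header = (
    "| Benchmark | Gemma 2 9B IT | Mistral 7B Instruct | Llama 3.1 70B "
    "| Mistral 8x228B Instruct | GPT 3.5 Turbo |"
)
gpt4o_header = (
    "| Category | Benchmark | Llama 3.1 8B | Gemma 2 9B IT | Mistral 7B Instruct "
    "| Llama 3.1 70B | Mixtral 8x22B Instruct | GPT 3.5 Turbo |"
)

mini_cols = header_columns(mini_header)
gpt4o_cols = header_columns(gpt4o_header)

# Columns that appear in one parse but not the other.
only_in_gpt4o = [col for col in gpt4o_cols if col not in mini_cols]
only_in_mini = [col for col in mini_cols if col not in gpt4o_cols]

print("Only in GPT-4o parse:", only_in_gpt4o)
print("Only in GPT-4o-mini parse:", only_in_mini)
```

A diff like this only flags mismatched headers; it does not tell you which parse matches the source page, so the original image is still the reference.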

## Setup RAG Pipeline

Let's set up a RAG pipeline over this data.

We also use GPT-4o-mini for the actual text synthesis step.

In [None]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(docs)
query_engine = index.as_query_engine(similarity_top_k=5)

index_gpt4o = VectorStoreIndex(docs_gpt4o)
query_engine_gpt4o = index_gpt4o.as_query_engine(similarity_top_k=5)
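The `similarity_top_k=5` setting asks the retriever for the five nodes most similar to the query embedding. As a rough illustration of the idea, here is a pure-Python sketch of top-k retrieval, assuming cosine similarity as the scoring function. It is not LlamaIndex's implementation; the function names and the toy 3-dimensional vectors are invented for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], node_vecs: dict[str, list[float]], k: int):
    """Return (node_id, score) pairs for the k most similar nodes."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings, invented for illustration only.
node_vecs = {
    "page_1": [1.0, 0.0, 0.0],
    "page_2": [0.0, 1.0, 0.0],
    "page_3": [0.7, 0.7, 0.0],
}
query_vec = [1.0, 0.1, 0.0]

for node_id, score in top_k(query_vec, node_vecs, k=2):
    print(node_id, round(score, 3))
```

With one node per page, a larger `k` passes more pages to the synthesis step, which raises the chance that the relevant table is included at the cost of a longer prompt.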

In [None]:
query = "How does Llama3.1 compare against gpt-4o and Claude 3.5 Sonnet in human evals?"

response = query_engine.query(query)
response_gpt4o = query_engine_gpt4o.query(query)

In [None]:
print(response)

In human evaluations, Llama 3.1 405B has a win rate of 19.1% against GPT-4o and 24.9% against Claude 3.5 Sonnet. The tie rates for Llama 3.1 405B are 51.7% against GPT-4o and 50.8% against Claude 3.5 Sonnet, while the loss rates are 29.2% against GPT-4o and 24.2% against Claude 3.5 Sonnet. This indicates that Llama 3.1 performs competitively in comparison to both models, with a notable number of ties.


In [None]:
print(response.source_nodes[1].get_content())

# Llama 3.1 Model Evaluation

## Category Benchmark

| Benchmark                     | Gemma 2 9B IT | Mistral 7B Instruct | Llama 3.1 70B | Mistral 8x228B Instruct | GPT 3.5 Turbo |
|-------------------------------|----------------|----------------------|----------------|-------------------------|----------------|
| General                       |                |                      |                |                         |                |
| MMLU (0-shot, CoT)           | 73.0           | 72.3                 | 86.0           | 79.9                    | 69.8           |
| MMLU PRO (5-shot, CoT)       | 48.3           | 36.9                 | 66.4           | 56.3                    | 49.2           |
| IFEval                        | 80.4           | 73.6                 | 87.5           | 72.7                    | 69.9           |
| Code                          |                |                      |                |                         |                |
| HumanEval (0-

In [None]:
print(response_gpt4o)

In human evaluations, Llama 3.1 405B shows competitive performance against GPT-4o and Claude 3.5 Sonnet. Specifically, when compared to GPT-4o, Llama 3.1 won 19.1% of the time, tied 51.7%, and lost 29.2%. Against Claude 3.5 Sonnet, it won 24.9% of the time, tied 50.8%, and lost 24.2%. This indicates that Llama 3.1 performs comparably in real-world scenarios against these leading models.
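Both answers quote the same win, tie, and loss rates in different wording. A small sketch like the following can confirm that programmatically by extracting every percentage from each answer and comparing them. The strings are excerpts of the two responses printed above, and `percentages` is a hypothetical helper.

```python
import re

# Excerpts of the two responses printed above.
answer_mini = (
    "Llama 3.1 405B has a win rate of 19.1% against GPT-4o and 24.9% against "
    "Claude 3.5 Sonnet. The tie rates for Llama 3.1 405B are 51.7% against "
    "GPT-4o and 50.8% against Claude 3.5 Sonnet, while the loss rates are "
    "29.2% against GPT-4o and 24.2% against Claude 3.5 Sonnet."
)
answer_gpt4o = (
    "when compared to GPT-4o, Llama 3.1 won 19.1% of the time, tied 51.7%, "
    "and lost 29.2%. Against Claude 3.5 Sonnet, it won 24.9% of the time, "
    "tied 50.8%, and lost 24.2%."
)


def percentages(text: str) -> list[float]:
    """Extract every number that is immediately followed by a percent sign."""
    return [float(match) for match in re.findall(r"(\d+(?:\.\d+)?)%", text)]


# Compare as sorted lists so the order of mention does not matter.
same_figures = sorted(percentages(answer_mini)) == sorted(percentages(answer_gpt4o))
print("Same percentages in both answers:", same_figures)
```

Matching figures show that the two pipelines agree on this question; they do not by themselves prove that either answer matches the source document.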


In [None]:
print(response_gpt4o.source_nodes[1].get_content())

# Introducing Llama 3.1: Our most capable models to date

## Meta

| Category | Benchmark | Llama 3.1 8B | Gemma 2 9B IT | Mistral 7B Instruct | Llama 3.1 70B | Mixtral 8x22B Instruct | GPT 3.5 Turbo |
|----------|-----------|--------------|---------------|---------------------|---------------|-----------------------|---------------|
| General  | MMLU (0-shot, CoT) | 73.0 | 72.3 (0-shot, non-CoT) | 60.5 | 86.0 | 79.9 | 69.8 |
|          | MMLU PRO (5-shot, CoT) | 48.3 | 71.7 | 36.9 | 66.4 | 56.3 | 49.2 |
|          | ITEval | 80.4 | 73.6 | 57.6 | 87.5 | 72.7 | 69.9 |
| Code     | HumanEval (0-shot) | 72.6 | 54.3 | 40.2 | 80.5 | 75.6 | 68.0 |
|          | MBPP EvalPlus (5-shot) (0-shot) | 72.8 | 71.7 | 49.5 | 86.0 | 78.6 | 82.0 |
| Math     | GSM8K | 84.5 | 76.7 | 53.2 | 95.1 | 88.2 | 81.6 |
|          | MATH (0-shot, CoT) | 51.9 | 44.3 | 13.0 | 68.0 | 54.1 | 43.1 |
| Reasoning | ARC Challenge (0-shot) | 83.4 | 87.6 | 74.2 | 94.8 | 88.7 | 83.7 |
|          | GOPA (0-shot) | 32.8 | 40.8 