# Multimodal Parsing using Anthropic Claude (Sonnet 4.0)

<a href="https://colab.research.google.com/github/run-llama/llama_cloud_services/blob/main/examples/parse/multimodal/claude_parse.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This cookbook shows you how to use LlamaParse to parse any document with the multimodal capabilities of Sonnet 4.0. 

LlamaParse lets you plug in external multimodal model vendors for parsing, while we handle error correction, validation, and scalability/reliability for you.

Status:
| Last Executed | Version | State      |
|---------------|---------|------------|
| Aug-19-2025   | 0.6.61  | Maintained |


In [None]:
%pip install llama-cloud-services "llama-index>=0.13.0,<0.14.0" "llama-index-llms-anthropic>=0.8.4,<0.9.0"

## Setup

Download the data: both the full paper and a single page (page 33) of the PDF.

Swap in `data/llama2-p33.pdf` for `data/llama2.pdf` in the code blocks below if you want to save on parsing tokens. 

An image of this page is shown below.

In [None]:
!mkdir -p data
!wget "https://arxiv.org/pdf/2307.09288" -O data/llama2.pdf
!wget "https://www.dropbox.com/scl/fi/wpql661uu98vf6e2of2i0/llama2-p33.pdf?rlkey=64weubzkwpmf73y58vbmc8pyi&st=khgx5161&dl=1" -O data/llama2-p33.pdf

![page_33](llama2-p33.png)
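
Switching between the full paper and the single-page file changes two things at once: the path passed to the parser and the index of the page to inspect. A minimal sketch of keeping the two in sync, assuming the file names from the download cell above; `select_input` is a hypothetical helper, not part of LlamaParse.

```python
from pathlib import Path

# Paths come from the download cell above. Index 32 is page 33 of the full
# paper (zero-based); the single-page file only has index 0.
FULL_PDF = Path("data/llama2.pdf")
SINGLE_PAGE_PDF = Path("data/llama2-p33.pdf")


def select_input(single_page: bool) -> tuple[Path, int]:
    """Return (pdf_path, page_index) for the chosen input file."""
    if single_page:
        return SINGLE_PAGE_PDF, 0
    return FULL_PDF, 32


pdf_path, page_index = select_input(single_page=True)
print(pdf_path, page_index)
```

The returned path can be passed to `parser.aparse(str(pdf_path))`, and `page_index` used wherever the notebook indexes into the parsed documents.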

## Initialize LlamaParse

Initialize LlamaParse in multimodal mode, and specify the vendor.

**NOTE**: You can optionally pass your own Anthropic API key via `vendor_multimodal_api_key`. If you do, you will be charged less; if you don't, we make the calls to Claude for you and charge more.

In [None]:
from llama_cloud_services import LlamaParse

parser = LlamaParse(
    parse_mode="parse_page_with_lvm",
    vendor_multimodal_model_name="anthropic-sonnet-4.0",
    # vendor_multimodal_api_key="fake",
    high_res_ocr=True,
    adaptive_long_table=True,
    outlined_table_extraction=True,
    output_tables_as_HTML=True,
    api_key="llx-...",
)

result = await parser.aparse("./data/llama2.pdf")
documents = result.get_markdown_documents(split_by_page=True)
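
The cell above hardcodes the API key. One alternative is to read keys from environment variables and only pass a vendor key when one is set. This is a sketch under assumptions: the variable names `LLAMA_CLOUD_API_KEY` and `ANTHROPIC_API_KEY` are conventions assumed here, and `build_parser_kwargs` is a hypothetical helper that only assembles the keyword arguments already shown in this notebook.

```python
import os
from typing import Mapping, Optional


def build_parser_kwargs(
    model_name: str,
    vendor_key_var: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> dict:
    """Collect LlamaParse keyword arguments, reading keys from `env`."""
    kwargs = {
        # Same options as the cell above.
        "parse_mode": "parse_page_with_lvm",
        "vendor_multimodal_model_name": model_name,
        "high_res_ocr": True,
        "adaptive_long_table": True,
        "outlined_table_extraction": True,
        "output_tables_as_HTML": True,
        # Assumed variable name for the LlamaCloud key.
        "api_key": env["LLAMA_CLOUD_API_KEY"],
    }
    # Only pass a vendor key when one is actually set.
    vendor_key = env.get(vendor_key_var) if vendor_key_var else None
    if vendor_key:
        kwargs["vendor_multimodal_api_key"] = vendor_key
    return kwargs


# Demo with an explicit mapping instead of the real environment.
demo_env = {"LLAMA_CLOUD_API_KEY": "llx-...", "ANTHROPIC_API_KEY": "sk-ant-..."}
kwargs = build_parser_kwargs("anthropic-sonnet-4.0", "ANTHROPIC_API_KEY", demo_env)
print(sorted(kwargs))
```

In the notebook this would be used as `LlamaParse(**build_parser_kwargs("anthropic-sonnet-4.0", "ANTHROPIC_API_KEY"))`, which keeps secrets out of the saved cells.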

### Set up gpt-4o-mini baseline

For comparison, we will also parse the document using gpt-4o-mini.

In [None]:
from llama_cloud_services import LlamaParse

parser = LlamaParse(
    parse_mode="parse_page_with_lvm",
    vendor_multimodal_model_name="openai-gpt-4o-mini",
    # vendor_multimodal_api_key="fake",
    high_res_ocr=True,
    adaptive_long_table=True,
    outlined_table_extraction=True,
    output_tables_as_HTML=True,
    api_key="llx-...",
)

result = await parser.aparse("./data/llama2.pdf")
gpt_4o_documents = result.get_markdown_documents(split_by_page=True)

## View Results

Let's visualize the results along with the original document page.

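In this example, Sonnet extracts complex visual elements such as graphs in much more detail: its table separates the factual-prompt and creative-prompt series for each model, whereas gpt-4o-mini returns a single set of columns.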

**NOTE**: If you're using `llama2-p33.pdf`, the file has a single page, so use index `0` (e.g. `documents[0]`) instead of `32`.

In [None]:
# using Sonnet-4.0
print(documents[32].text)



**Figure 21: RLHF learns to adapt the temperature with regard to the type of prompt.** Lower Self-BLEU corresponds to more diversity: RLHF eliminates diversity in responses to factual prompts but retains more diversity when generating responses to creative prompts. We prompt each model with a diverse set of 10 creative and 10 factual instructions and sample 25 responses. This is repeated for the temperatures T ∈ {k/10 | k ∈ N : 1 ≤ k ≤ 15}. For each of the 25 responses we compute the Self-BLEU metric and report the mean and standard deviation against the temperature.

<table>
<thead>
<tr>
<th>Temperature</th>
<th>Factual Prompts - RLHF v3</th>
<th>Factual Prompts - RLHF v2</th>
<th>Factual Prompts - RLHF v1</th>
<th>Factual Prompts - SFT</th>
<th>Creative Prompts - RLHF v3</th>
<th>Creative Prompts - RLHF v2</th>
<th>Creative Prompts - RLHF v1</th>
<th>Creative Prompts - SFT</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.4</td>
<td>99</td>
<td>98</td>
<td>97</td>
<td>95</td>
<td>95</td>
<td>

In [None]:
# using gpt-4o-mini
print(gpt_4o_documents[32].text)



# Figure 21: RLHF learns to adapt the temperature with regard to the type of prompt. 
Lower Self-BLEU corresponds to more diversity: RLHF eliminates diversity in responses to factual prompts but retains more diversity when generating responses to creative prompts. We prompt each model with a diverse set of 10 creative and 10 factual instructions and sample 25 responses. This is repeated for the temperatures \( T \in \{k/10 | k \in \mathbb{N}: 1 \leq k \leq 15\} \). For each of the 25 responses we compute the Self-BLEU metric and report the mean and standard deviation against the temperature.

| Temperature | RLHF v3 | RLHF v2 | RLHF v1 | SFT |
|-------------|---------|---------|---------|-----|
| 0.0         | 95      | 90      | 85      | 80  |
| 0.6         | 90      | 85      | 80      | 75  |
| 0.8         | 85      | 80      | 75      | 70  |
| 1.0         | 80      | 75      | 70      | 65  |
| 1.2         | 75      | 70      | 65      | 60  |
| 1.4         | 70      | 65      
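
Beyond reading the two outputs side by side, a rough way to compare how much tabular detail each parse recovered is to count table cells and rows in the returned text. This is a heuristic sketch, not a quality metric: `table_stats` is a hypothetical helper that counts HTML `<td>`/`<th>` cells and Markdown table rows with simple pattern matching.

```python
import re


def table_stats(text: str) -> dict:
    """Count characters, HTML table cells, and Markdown table rows in `text`."""
    # Opening <td ...> / <th ...> tags only; <thead>, <tbody>, <tr> do not match.
    html_cells = len(re.findall(r"<t[dh]\b", text))
    markdown_rows = 0
    for line in text.splitlines():
        stripped = line.strip()
        # Skip separator rows such as |-----|-----|.
        is_separator = re.fullmatch(r"\|[\s:|-]+\|?", stripped) is not None
        if stripped.startswith("|") and not is_separator:
            markdown_rows += 1
    return {"chars": len(text), "html_cells": html_cells, "markdown_rows": markdown_rows}


sample = "| Temperature | SFT |\n|---|---|\n| 0.6 | 75 |\n<table><tr><td>0.4</td></tr></table>"
print(table_stats(sample))
```

Applied to the notebook's variables, this would be `table_stats(documents[32].text)` versus `table_stats(gpt_4o_documents[32].text)`. Note that more cells does not imply more accurate values; the numbers still need checking against the original figure.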

## Set Up RAG Pipeline

These parsing capabilities carry over to RAG as well. Let's set up a RAG pipeline over both parses of this data.

(We'll use `gpt-5-mini` from OpenAI for the actual text synthesis step.)

In [None]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-5-mini", api_key="sk-...")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large", api_key="sk-...")

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(documents)
query_engine = index.as_query_engine(similarity_top_k=5)

index_gpt4o = VectorStoreIndex(gpt_4o_documents)
query_engine_gpt4o = index_gpt4o.as_query_engine(similarity_top_k=5)

In [None]:
query = "Tell me more about all the values for each line in the 'RLHF learns to adapt the temperature with regard to the type of prompt' graph "

response = query_engine.query(query)
response_gpt4o = query_engine_gpt4o.query(query)

In [None]:
print(response)

Each line in that graph corresponds to the highest-scoring (reward_max) generation obtained when sampling with a particular softmax temperature. The plotted temperature values are:

- T = 0.6
- T = 0.8
- T = 0.9
- T = 1.0
- T = 1.1
- T = 1.2
- T = 1.3
- T = 1.4
- T = 1.5

What each line represents and how to interpret it
- Metric shown: reward_max — the top reward-model score among the set of sampled outputs for a given prompt and temperature.  
- Sampling regime: multiple outputs are sampled per prompt at each temperature and scored; the best-scoring sample defines the plotted point for that temperature.  
- Purpose: the lines show how the best attainable reward changes as sampling temperature varies.

Behavior by prompt type (what the lines reveal)
- Creative prompts (e.g., “Write a poem”): higher temperatures keep producing diverse outputs, and the curves for higher-T lines reflect that diversity remains usable — reward_max continues to benefit from sampling diversity. This is visib

In [None]:
print(response_gpt4o)

The chart reports mean Self-BLEU scores (lower = more diversity) at several temperatures for four models: RLHF v3, RLHF v2, RLHF v1, and the SFT model. The numeric values shown for each model at the listed temperatures are:

- Temperature 0.0
  - RLHF v3: 95
  - RLHF v2: 90
  - RLHF v1: 85
  - SFT:      80

- Temperature 0.6
  - RLHF v3: 90
  - RLHF v2: 85
  - RLHF v1: 80
  - SFT:      75

- Temperature 0.8
  - RLHF v3: 85
  - RLHF v2: 80
  - RLHF v1: 75
  - SFT:      70

- Temperature 1.0
  - RLHF v3: 80
  - RLHF v2: 75
  - RLHF v1: 70
  - SFT:      65

- Temperature 1.2
  - RLHF v3: 75
  - RLHF v2: 70
  - RLHF v1: 65
  - SFT:      60

- Temperature 1.4
  - RLHF v3: 70
  - RLHF v2: 65
  - RLHF v1: 60
  - SFT:      55

Experimental setup (how these numbers were produced): each model was prompted with 10 creative and 10 factual instructions; for each prompt 25 responses were sampled at a given temperature; Self-BLEU was computed over those responses and the reported values are the mean 
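
When two query engines give different answers to the same question, it helps to check which parsed chunks each answer was synthesized from. The sketch below assumes the response object exposes `source_nodes`, each with a `score` and a `node` offering `get_content()`, as LlamaIndex query responses do; `summarize_sources` is a hypothetical helper, and the stub response exists only so the example runs without an index.

```python
from types import SimpleNamespace


def summarize_sources(response, preview_chars: int = 80) -> list:
    """Return (score, one-line text preview) for each retrieved source node."""
    rows = []
    for item in response.source_nodes:
        # Collapse whitespace so each preview fits on one line.
        preview = " ".join(item.node.get_content().split())[:preview_chars]
        rows.append((item.score, preview))
    return rows


# Stub response for illustration; in the notebook pass `response` or `response_gpt4o`.
stub_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            score=0.5,
            node=SimpleNamespace(get_content=lambda: "Figure 21:\nRLHF learns"),
        )
    ]
)
for score, preview in summarize_sources(stub_response):
    print(score, preview)
```

If the previews for a query do not include the page containing the figure, the difference between answers comes from retrieval rather than from how that page was parsed.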