# LlamaParse

- Author: [syshin0116](https://github.com/syshin0116)
- Design: 
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)



## Overview

`LlamaParse` is a document parsing service developed by **LlamaIndex**, designed specifically for use with large language models (LLMs).

Key Features:

- Support for various document formats, such as PDF, Word, PowerPoint, and Excel
- Customized output formats through natural language instructions
- Advanced table and image extraction capabilities
- Multilingual support
- Multiple output format support

LlamaParse is available as a standalone API and is also integrated into the LlamaCloud platform. By parsing and cleaning up documents, the service aims to improve the performance of LLM-based applications such as Retrieval-Augmented Generation (RAG).

Users can process up to 1,000 pages per day for free, with additional capacity available through paid plans. LlamaParse is currently offered in public beta and is continuously expanding its features.

- Link: [https://cloud.llamaindex.ai](https://cloud.llamaindex.ai)


### Table of Contents
- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Data Preparation](#data-preparation)
- [LlamaParse Parameters](#llamaparse-parameters)
- [Simple Parsing](#simple-parsing)
- [Multimodal Model Parsing](#multimodal-model-parsing)
- [Custom Parsing Instructions](#custom-parsing-instructions)

### References

- [LlamaParse: Using in Python](https://docs.cloud.llamaindex.ai/llamaparse/getting_started/python)
- [LlamaParse: Getting Started](https://docs.cloud.llamaindex.ai/llamaparse/getting_started)

----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for the tutorials.
- See the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
!pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "llama-index-core",
        "llama-parse",
        "llama-index-readers-file",
    ],
    verbose=False,
    upgrade=False,
)

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-upstage 0.4.0 requires pypdf<5.0.0,>=4.2.0, but you have pypdf 5.1.0 which is incompatible.

[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: pip install --upgrade pip


### API Key Configuration

To use LlamaParse, you need to [obtain a Llama Cloud API key](https://docs.cloud.llamaindex.ai/llamaparse/getting_started/get_an_api_key).

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "LlamaParse",
        "LLAMA_CLOUD_API_KEY": "",
    }
)

Environment variables have been set successfully.


In [4]:
from dotenv import load_dotenv

load_dotenv(override=True)

True

In [5]:
import os
import nest_asyncio

# Allow nested event loops so that async calls work inside Jupyter
nest_asyncio.apply()
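
For context, the snippet below is a minimal standard-library sketch of the problem that `nest_asyncio.apply()` works around: Jupyter already runs an event loop, and plain `asyncio` refuses to start a second one inside it. The coroutine names are hypothetical and stand in for any library call that starts its own loop from synchronous code.

```python
import asyncio


async def fetch_result():
    """Stand-in for an async call such as a parsing request (hypothetical)."""
    return "parsed"


async def notebook_cell():
    """Simulates code that runs inside an already-active event loop."""
    coro = fetch_result()
    try:
        # Without nest_asyncio, starting a loop inside a running loop fails.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # Avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(notebook_cell())
print(message)
```

The printed message is the `RuntimeError` text raised by `asyncio.run()`. Applying `nest_asyncio` patches the loop so that this kind of nested call is allowed.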

## Data Preparation

In this tutorial, we will use the following PDF file:

- Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
- Download Link: [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- File name: "1706.03762v7.pdf"
- File path: "data/1706.03762v7.pdf"

In [6]:
# Download and save sample PDF file to ./data directory
import os
import requests

def download_pdf(url, save_path):
    """
    Downloads a PDF file from the given URL and saves it to the specified path.

    Args:
        url (str): The URL of the PDF file to download.
        save_path (str): The full path (including file name) where the file will be saved.
    """
    try:
        # Ensure the directory exists
        os.makedirs(os.path.dirname(save_path), exist_ok=True)

        # Download the file
        response = requests.get(url, stream=True, timeout=30)  # Fail instead of hanging on a stalled connection
        response.raise_for_status()  # Raise an error for bad status codes

        # Save the file to the specified path
        with open(save_path, "wb") as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

        print(f"PDF downloaded and saved to: {save_path}")
    except Exception as e:
        print(f"An error occurred while downloading the file: {e}")


# Configuration for the PDF file
pdf_url = "https://arxiv.org/pdf/1706.03762v7"
file_path = "./data/1706.03762v7.pdf"

# Download the PDF
download_pdf(pdf_url, file_path)

PDF downloaded and saved to: ./data/1706.03762v7.pdf


In [7]:
# Set file path
FILE_PATH = "data/1706.03762v7.pdf"  # modify to your file path
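
Before sending the file to LlamaParse, it can help to confirm that the download really is a PDF, since a redirect or failed request may leave an HTML page saved under the same name. The following is a minimal sketch, not part of the original notebook: `looks_like_pdf` is a hypothetical helper, and it relies on the assumption that the file starts with the `%PDF-` header, which holds for most PDFs.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file exists, is non-empty, and starts with the PDF header."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Demonstration on throwaway files (not the tutorial's real PDF)
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    bad = os.path.join(tmp, "bad.pdf")
    with open(good, "wb") as f:
        f.write(b"%PDF-1.5\n...")
    with open(bad, "wb") as f:
        f.write(b"<html>error page</html>")
    good_ok = looks_like_pdf(good)
    bad_ok = looks_like_pdf(bad)

print(good_ok, bad_ok)
```

In the notebook, the same check could be applied as `looks_like_pdf(FILE_PATH)` before parsing.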

## LlamaParse Parameters

### Key Parameters

These are the core settings that most users will configure:

| Parameter        | Description                                                                   | Default Value      |
| ---------------- | ----------------------------------------------------------------------------- | ------------------ |
| `api_key`        | A string representing the API key for authenticating with the LlamaParse API  | Required           |
| `base_url`       | The base URL for the LlamaParse API                                           | "DEFAULT_BASE_URL" |
| `check_interval` | Specifies the time (in seconds) between checks for the parsing job status     | 1                  |
| `ignore_errors`  | Boolean indicating whether to skip errors during parsing                      | True               |
| `max_timeout`    | Maximum time (in seconds) to wait for the parsing job to finish               | 2000               |
| `num_workers`    | Number of parallel workers for API requests (Range: 1-9)                      | 4                  |
| `result_type`    | Format of the parsing result (e.g., "text", "markdown", "json", "structured") | "text"             |
| `show_progress`  | Displays progress for multi-file parsing                                      | True               |
| `split_by_page`  | Splits the output by pages                                                    | True               |
| `language`       | Specifies the language of the document text                                   | "en"               |
| `verbose`        | Enables verbose output to show detailed parsing progress                      | True               |
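
As a rough illustration of how these settings fit together, the sketch below collects a few of them into keyword arguments and checks two constraints listed in the table (the `result_type` values and the `num_workers` range). `build_parser_kwargs` is a hypothetical helper, and its checks mirror the table above rather than the library's own validation.

```python
def build_parser_kwargs(result_type="markdown", num_workers=4, language="en", **extra):
    """Collect LlamaParse keyword arguments and catch obvious mistakes early.

    Hypothetical helper: the allowed values mirror the table above and are
    not an authoritative list taken from the library.
    """
    allowed_result_types = {"text", "markdown", "json", "structured"}
    if result_type not in allowed_result_types:
        raise ValueError(f"result_type must be one of {sorted(allowed_result_types)}")
    if not 1 <= num_workers <= 9:
        raise ValueError("num_workers must be between 1 and 9")
    return {
        "result_type": result_type,
        "num_workers": num_workers,
        "language": language,
        **extra,
    }


parser_kwargs = build_parser_kwargs(num_workers=8, verbose=True)
print(parser_kwargs)
# parser = LlamaParse(**parser_kwargs)  # requires llama-parse and LLAMA_CLOUD_API_KEY
```

Keeping the settings in a plain dictionary makes it easy to reuse one configuration across several parser instances.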


### Advanced Parameters

For specialized use cases, consider these options:

**Parsing Modes and Enhancements**

|Parameter|Description|Default Value|
|---|---|---|
|`auto_mode`|Automatically selects the optimal parsing mode|False|
|`auto_mode_trigger_on_image_in_page`|Upgrades pages with images to premium mode (if `auto_mode` is enabled)|False|
|`auto_mode_trigger_on_table_in_page`|Upgrades pages with tables to premium mode (if `auto_mode` is enabled)|False|
|`auto_mode_trigger_on_text_in_page`|Upgrades pages with specific text to premium mode (if `auto_mode` is enabled)|None|
|`auto_mode_trigger_on_regexp_in_page`|Upgrades pages matching a regex to premium mode (if `auto_mode` is enabled)|None|
|`premium_mode`|Uses the most advanced parsing capabilities|False|
|`fast_mode`|Enables faster parsing by skipping OCR and table reconstruction|False|
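
When choosing a value for `auto_mode_trigger_on_regexp_in_page`, it may be useful to try the pattern locally first. The sketch below is only an illustration with Python's `re` module on made-up page texts. It does not show how the service evaluates the pattern, and the regex dialect used by LlamaParse may differ from Python's.

```python
import re

# Hypothetical page texts; in practice these could come from a first, cheaper parse
pages = [
    "Attention Is All You Need. Abstract ...",
    "Table 1: Maximum path lengths, per-layer complexity ...",
    "Figure 2: (left) Scaled Dot-Product Attention.",
]

# Candidate value for auto_mode_trigger_on_regexp_in_page
trigger_pattern = r"^(Table|Figure)\s+\d+:"

# Indices of pages whose text contains a line matching the pattern
matching_pages = [
    index
    for index, text in enumerate(pages)
    if re.search(trigger_pattern, text, flags=re.MULTILINE)
]
print(matching_pages)
```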



**Content Extraction**

|Parameter|Description|Default Value|
|---|---|---|
|`disable_ocr`|Disables OCR, extracting only selectable text|False|
|`disable_image_extraction`|Prevents image extraction to speed up the parsing process|False|
|`extract_charts`|Extracts or tags charts in the document|False|
|`extract_layout`|Includes layout information in the parsed output|False|
|`annotate_links`|Annotates links in the document for URL extraction|False|
|`continuous_mode`|Improves parsing quality for documents with multi-page tables|False|
|`guess_xlsx_sheet_names`|Infers sheet names when parsing Excel files|False|



**Output Customization**

| Parameter                            | Description                                        | Default Value |
| ------------------------------------ | -------------------------------------------------- | ------------- |
| `page_separator`                     | Specifies a custom string to separate parsed pages | None          |
| `structured_output`                  | Outputs data in structured formats (e.g., JSON)    | False         |
| `structured_output_json_schema`      | JSON schema for formatting structured output       | None          |
| `structured_output_json_schema_name` | Predefined schema name for formatting output       | None          |
| `parsing_instruction`                | Custom instructions for parsing behavior           | ""            |
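
To make the role of `page_separator` concrete, here is a small sketch that splits a combined result back into pages on the client side. The separator string and the sample text are assumptions for illustration, and no LlamaParse call is made.

```python
# Hypothetical separator; any string unlikely to occur in the document works
PAGE_SEPARATOR = "\n=== PAGE BREAK ===\n"

# Stand-in for parser output produced with page_separator=PAGE_SEPARATOR
combined_text = PAGE_SEPARATOR.join(
    ["# Attention Is All You Need", "## 1 Introduction", "## 2 Background"]
)

# Recover the individual pages from the combined string
pages = [page.strip() for page in combined_text.split(PAGE_SEPARATOR)]
print(len(pages))
```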



**Targeting and Filtering**

| Parameter                                            | Description                                                                                    | Default Value |
| ---------------------------------------------------- | ---------------------------------------------------------------------------------------------- | ------------- |
| `target_pages`                                       | Comma-separated list of page numbers to parse                                                  | None          |
| `max_pages`                                          | Limits the number of pages to parse                                                            | None          |
| `bbox_top`, `bbox_bottom`, `bbox_left`, `bbox_right` | Defines margins for bounding boxes (0–1 range) for extracting specific regions of the document | None          |
| `skip_diagonal_text`                                 | Ignores text that appears diagonally (non-standard text rotations)                             | False         |
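
Because `target_pages` takes a comma-separated string, a small helper can build it from a list of integers. The function below is a hypothetical convenience, not part of `llama-parse`. The LlamaParse documentation describes page numbering as starting from 0, which is worth verifying for the version you use.

```python
def to_target_pages(page_numbers):
    """Format page numbers as the comma-separated string that target_pages expects.

    Hypothetical helper: duplicates are dropped and pages are sorted.
    """
    unique_pages = sorted(set(page_numbers))
    if any(page < 0 for page in unique_pages):
        raise ValueError("page numbers must be non-negative")
    return ",".join(str(page) for page in unique_pages)


target_pages = to_target_pages([5, 0, 3, 3])
print(target_pages)
# parser = LlamaParse(target_pages=target_pages)  # requires llama-parse and an API key
```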



**Integration and Webhooks**

|Parameter|Description|Default Value|
|---|---|---|
|`webhook_url`|URL to be called upon completion of the parsing job|None|
|`output_s3_path_prefix`|S3 path for uploading parsed output|None|
|`custom_client`|Custom HTTPX client for sending requests|None|
|`invalidate_cache`|Ignores cached documents, forcing re-parsing|False|
|`do_not_cache`|Prevents caching of parsed documents|False|


## Simple Parsing

The default usage of LlamaParse demonstrates how to parse documents using its core functionality. This mode is optimized for simplicity and works well for standard document types.

In [8]:
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

# Configure the LlamaParse instance
parser = LlamaParse(
    result_type="markdown",  # Output format ("text", "markdown", "json", or "structured")
    num_workers=8,
    verbose=True,
    language="en",
    show_progress=True,
)

# Define a file extractor mapping file extensions to parsers
file_extractor = {".pdf": parser}

# Use SimpleDirectoryReader to parse the specified PDF file
documents = SimpleDirectoryReader(
    input_files=[FILE_PATH],  # List of files to process
    file_extractor=file_extractor,
).load_data()

Started parsing the file under job_id e1295923-31af-4a61-89b5-1ab16790363c


In [9]:
# Display the number of documents parsed
len(documents)

15

In [10]:
# Display the first document
documents[0]

Document(id_='ef5f3d01-7d20-4996-8052-cd2e88bbdda6', embedding=None, metadata={'file_path': 'data/1706.03762v7.pdf', 'file_name': '1706.03762v7.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2025-01-03', 'last_modified_date': '2025-01-03'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.\n\n# Attention Is All You Need\n\narXiv:1706.03762v7 · [cs.CL] · 2 Aug 2023\n\nAshish Vaswani∗ Noam Shazeer∗ Niki Parmar∗ Jakob Uszkoreit∗\n\nGoogle Brai

### Conversion to LangChain Documents

The parsed documents are converted to the LangChain document format for further processing.

In [11]:
# Convert LlamaIndex documents to LangChain document format
docs = [doc.to_langchain_format() for doc in documents]

In [12]:
# Display the content of a specific document (e.g., the 6th document)
print(docs[5].page_content)

# Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. n is the sequence length, d is the representation dimension, k is the kernel size of convolutions and r the size of the neighborhood in restricted self-attention.

|Layer Type|Complexity per Layer|Sequential Operations|Maximum Path Length|
|---|---|---|---|
|Self-Attention|O(n2 · d)|O(1)|O(1)|
|Recurrent|O(k · n · d)|O(n)|O(logk(n))|
|Convolutional|O(r · n · d)|O(1)|O(n/r)|

# 3.5 Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed. There are many choic
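
Since tables come back as Markdown, they can be pulled out of `page_content` with plain string handling. The sketch below uses a hypothetical helper, `extract_markdown_tables`, on a sample copied from the output above. It assumes simple pipe tables and does not handle escaped `|` characters inside cells.

```python
def extract_markdown_tables(text):
    """Return each Markdown table in `text` as a list of row dicts keyed by header."""
    tables, block = [], []
    for line in text.splitlines() + [""]:  # The sentinel flushes a trailing table
        if line.strip().startswith("|"):
            block.append(line.strip())
            continue
        if len(block) >= 3:  # Header, separator, and at least one data row
            rows = [[cell.strip() for cell in row.strip("|").split("|")] for row in block]
            header, data = rows[0], rows[2:]
            tables.append([dict(zip(header, row)) for row in data])
        block = []
    return tables


sample = """# Table 1: Maximum path lengths

|Layer Type|Complexity per Layer|Sequential Operations|Maximum Path Length|
|---|---|---|---|
|Self-Attention|O(n2 · d)|O(1)|O(1)|
|Recurrent|O(k · n · d)|O(n)|O(logk(n))|
|Convolutional|O(r · n · d)|O(1)|O(n/r)|

# 3.5 Positional Encoding
"""

tables = extract_markdown_tables(sample)
print(tables[0][0]["Layer Type"])
```

In the notebook, the same function could be applied to `docs[5].page_content`.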

In [13]:
# Display metadata of the first document
docs[0].metadata

{'file_path': 'data/1706.03762v7.pdf',
 'file_name': '1706.03762v7.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2025-01-03',
 'last_modified_date': '2025-01-03'}

## Multimodal Model Parsing

Multimodal parsing in LlamaParse uses external AI models to process documents with complex content. Instead of extracting text directly, it processes screenshots of each page and generates a structured output based on visual interpretation. This method is particularly effective for non-standard layouts, scanned documents, or documents with embedded media.

### Key Features:

- Visual Processing: Operates on page screenshots, not raw text, to interpret document content.
- Advanced Models: Integrates with AI models like OpenAI's GPT-4o and others for enhanced document analysis.
- Customizable: Supports various models and optional API key usage for flexibility.

### Procedure:

1. Screenshot Generation: A screenshot is taken for each page of the document.
2. Model Processing: The page screenshots are sent to the selected multimodal model with instructions to process them visually.
3. Result Compilation: The model outputs the page content (e.g., as Markdown), which is then consolidated into the final result.
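
The steps above can be sketched conceptually as follows. This is not how the service is implemented: the screenshots are placeholder bytes, `fake_model` is a stub instead of a real multimodal model, and every name here is hypothetical.

```python
from typing import Callable, List


def parse_with_multimodal_model(
    page_images: List[bytes],
    describe_page: Callable[[bytes], str],
    page_separator: str = "\n\n---\n\n",
) -> str:
    """Send each page image to a model callable and join the per-page Markdown."""
    # Step 2: one model call per page screenshot
    page_markdown = [describe_page(image) for image in page_images]
    # Step 3: consolidate the per-page output into one result
    return page_separator.join(page_markdown)


def fake_model(image: bytes) -> str:
    """Stub standing in for a multimodal model call (no network access)."""
    return f"# Page with {len(image)} bytes"


# Step 1 is represented by placeholder bytes instead of real screenshots
screenshots = [b"page-one", b"page-two!"]
result = parse_with_multimodal_model(screenshots, fake_model)
print(result)
```

Passing the model as a callable keeps the page loop independent of any particular vendor, which loosely mirrors the role of `vendor_multimodal_model_name`.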

### Key Parameters

|Parameter|Description|Example Value|
|---|---|---|
|`use_vendor_multimodal_model`|Specifies whether to use an external vendor's multimodal model. Setting this to True enables multimodal parsing.|True|
|`vendor_multimodal_model_name`|Specifies the name of the multimodal model to use. In this case, "openai-gpt4o" is selected.|"openai-gpt4o"|
|`vendor_multimodal_api_key`|Sets the API key for the multimodal model. In this tutorial, the OpenAI API key is read from the `OPENAI_API_KEY` environment variable.|`os.environ["OPENAI_API_KEY"]`|
|`result_type`|Specifies the format of the parsing result. Here, it is set to "markdown", so the results are returned in Markdown format.|"markdown"|
|`language`|Specifies the language of the document to be parsed. |"en"|
|`skip_diagonal_text`|Determines whether to skip diagonal text during parsing.|True|
|`page_separator`|Specifies a custom page separator.|None|

In [14]:
# Configure the LlamaParse instance to use the vendor multimodal model
multimodal_parser = LlamaParse(
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt4o",
    vendor_multimodal_api_key=os.environ["OPENAI_API_KEY"],
    result_type="markdown",
    language="en",
)

If you encounter an `AttributeError` here, try re-running the code above.

In [15]:
# Parse the PDF file using the multimodal parser
parsed_docs = multimodal_parser.load_data(file_path=FILE_PATH)

Started parsing the file under job_id a67936dc-603f-4737-93b3-3bf1585da7e9


In [16]:
# Convert to langchain document format
docs = [doc.to_langchain_format() for doc in parsed_docs]
docs

[Document(id='abcb1ef8-703d-4193-81af-d5be5d851516', metadata={}, page_content='# Attention Is All You Need\n\nAshish Vaswani\\*  \nGoogle Brain  \navaswani@google.com  \n\nNoam Shazeer\\*  \nGoogle Brain  \nnoam@google.com  \n\nNiki Parmar\\*  \nGoogle Research  \nnikip@google.com  \n\nJakob Uszkoreit\\*  \nGoogle Research  \nusz@google.com  \n\nLlion Jones\\*  \nGoogle Research  \nllion@google.com  \n\nAidan N. Gomez\\* †  \nUniversity of Toronto  \naidan@cs.toronto.edu  \n\nŁukasz Kaiser\\*  \nGoogle Brain  \nlukaszkaiser@google.com  \n\nIlia Polosukhin\\* ‡  \nillia.polosukhin@gmail.com  \n\n## Abstract\n\nThe dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convoluti

In [17]:
# Display the content of the first document
print(docs[0].page_content)

# Attention Is All You Need

Ashish Vaswani\*  
Google Brain  
avaswani@google.com  

Noam Shazeer\*  
Google Brain  
noam@google.com  

Niki Parmar\*  
Google Research  
nikip@google.com  

Jakob Uszkoreit\*  
Google Research  
usz@google.com  

Llion Jones\*  
Google Research  
llion@google.com  

Aidan N. Gomez\* †  
University of Toronto  
aidan@cs.toronto.edu  

Łukasz Kaiser\*  
Google Brain  
lukaszkaiser@google.com  

Ilia Polosukhin\* ‡  
illia.polosukhin@gmail.com  

## Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more par

## Custom Parsing Instructions

You can also specify custom instructions for parsing. This allows you to fine-tune the parser’s behavior to meet specific requirements.

In [18]:
# Configure parsing instruction
parsing_instruction = (
    "You are parsing a research paper. Please extract tables in markdown format."
)

# LlamaParse configuration
parser = LlamaParse(
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt4o",
    vendor_multimodal_api_key=os.environ["OPENAI_API_KEY"],
    result_type="markdown",
    language="en",
    parsing_instruction=parsing_instruction,
)

# Parse the PDF file
parsed_docs = parser.load_data(file_path=FILE_PATH)

# Convert to langchain documents
docs = [doc.to_langchain_format() for doc in parsed_docs]

Started parsing the file under job_id a49267a2-61c1-4480-9466-b98af53a7b65


In [19]:
# Display the content of the fourth document
print(docs[3].page_content)

The image contains diagrams of Scaled Dot-Product Attention and Multi-Head Attention, along with a description of these concepts.

### Scaled Dot-Product Attention
- **Input**: Queries and keys of dimension \(d_k\), and values of dimension \(d_v\).
- **Process**:
  1. Compute dot products of the query with all keys.
  2. Divide each by \(\sqrt{d_k}\).
  3. Apply a softmax function to obtain weights on the values.

### Multi-Head Attention
- **Process**:
  1. Linearly project queries, keys, and values \(h\) times with different learned projections to \(d_k\), \(d_k\), and \(d_v\) dimensions.
  2. Perform attention function in parallel, yielding \(d_v\)-dimensional output.

### Equation
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

### Notes
- Dot-product attention is identical to the algorithm except for the scaling factor of \(\frac{1}{\sqrt{d_k}}\).
- Additive attention computes the compatibility function using a feed-forward network with a sin