# UpstageDocumentParseLoader 
- Author: [Taylor(Jihyun Kim)](https://github.com/Taylor0819)
- Design: 
- Peer Review : [JoonHo Kim](https://github.com/jhboyo), [Jaemin Hong](https://github.com/geminii01), [leebeanbin](https://github.com/leebeanbin), [Dooil Kwak](https://github.com/back2zion)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/11-UpstageDocumentParseLoader.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/3ad956cceef62c6e1adc831f6a11fac1977f8932/06-DocumentLoader/11-UpstageDocumentParseLoader.ipynb)


## Overview 

The `UpstageDocumentParseLoader` is a document analysis tool from Upstage that integrates with the LangChain framework as a document loader. It analyzes the layout and content of a document and converts it into structured HTML.

**Key Features** :

- Comprehensive Layout Analysis : 
  Analyzes and identifies structural elements such as headings, paragraphs, tables, and images across various document formats (e.g., PDFs, images).

- Automated Structural Recognition : 
  Automatically detects document elements and serializes them in reading order for accurate conversion to HTML.

- Optional OCR Support : 
  Includes optical character recognition for handling scanned or image-based documents. The `ocr` option supports two modes:

  - `force` : Extracts text from images using OCR.
  - `auto` : Extracts text from PDFs (throws an error if the input is not in PDF format).

By recognizing and preserving the relationships between document elements, the `UpstageDocumentParseLoader` enables precise and context-aware document analysis.
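
As a rough illustration of the two OCR modes listed above, the sketch below picks a mode from the file extension. The helper name `choose_ocr_mode` is hypothetical and not part of `langchain_upstage`; it only encodes the rule stated above that `auto` expects PDF input. A scanned PDF may still benefit from `force`, which this simple check does not detect.

```python
from pathlib import Path


def choose_ocr_mode(file_path: str) -> str:
    """Hypothetical helper: return "auto" for PDFs and "force" for other files."""
    suffix = Path(file_path).suffix.lower()
    # "auto" is described above as PDF-only, so fall back to "force" otherwise.
    return "auto" if suffix == ".pdf" else "force"


for name in ["paper.pdf", "scan.PNG", "receipt.jpg"]:
    print(name, "->", choose_ocr_mode(name))
```

The returned string could then be passed as the `ocr` argument when constructing the loader.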

**Migration from Layout Analysis** :
Upstage has launched Document Parse to replace Layout Analysis. Document Parse supports a wider range of document types, Markdown output, chart detection, and equation recognition, with additional features planned for upcoming releases. The last version of Layout Analysis, `layout-analysis-0.4.0`, will be officially discontinued by November 10, 2024.

### Table of Contents 

- [Overview](#overview)
- [Key Changes from Layout Analysis](#key-changes-from-layout-analysis)
- [Environment Setup](#environment-setup)
- [UpstageDocumentParseLoader Key Parameters](#upstagedocumentparseloader-key-parameters)
- [Usage Example](#usage-example)

### Key Changes from Layout Analysis

**Changes to Existing Options** :
1. `use_ocr` → `ocr`

   The `use_ocr` option has been replaced with `ocr`. Instead of `True`/`False`, it now accepts `force` or `auto` for more precise control.

2. `output_type` → `output_format`

   The `output_type` option has been renamed to `output_format`. It specifies the format of the output.

3. `exclude` → `base64_encoding`

   The `exclude` option has been replaced with `base64_encoding`. While `exclude` was used to exclude specific elements from the output, `base64_encoding` specifies which element categories to encode in Base64.
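
The option changes above can be sketched as a small translation step for existing Layout Analysis arguments. This is a minimal sketch, not an official migration tool: the function name `migrate_layout_analysis_kwargs` is hypothetical, and the mapping of `use_ocr=True` to `"force"` and `use_ocr=False` to `"auto"` is an assumption. Because `exclude` has no direct equivalent, the sketch drops it and emits a warning.

```python
import warnings


def migrate_layout_analysis_kwargs(old_kwargs: dict) -> dict:
    """Hypothetical helper: translate Layout Analysis options to Document Parse names."""
    new_kwargs = dict(old_kwargs)  # work on a copy so the caller's dict is untouched

    if "use_ocr" in new_kwargs:
        # Assumption: True maps to "force" and False maps to "auto".
        new_kwargs["ocr"] = "force" if new_kwargs.pop("use_ocr") else "auto"

    if "output_type" in new_kwargs:
        # Pure rename; the value is passed through unchanged.
        new_kwargs["output_format"] = new_kwargs.pop("output_type")

    if "exclude" in new_kwargs:
        # No one-to-one replacement exists, so drop the option and tell the user.
        new_kwargs.pop("exclude")
        warnings.warn("`exclude` was dropped; review `base64_encoding` instead.")

    return new_kwargs


old_options = {"use_ocr": True, "output_type": "html", "exclude": ["header"]}
print(migrate_layout_analysis_kwargs(old_options))
```

Options that are not renamed pass through unchanged, so the result can be reviewed and adjusted before it is used with the new loader.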
   

### References
- [UpstageDocumentParseLoader](https://python.langchain.com/api_reference/upstage/document_parse/langchain_upstage.document_parse.UpstageDocumentParseLoader.html)
- [UpstageLayoutAnalysisLoader](https://python.langchain.com/api_reference/upstage/layout_analysis/langchain_upstage.layout_analysis.UpstageLayoutAnalysisLoader.html)
- [Upstage Migrate to Document Parse from Layout Analysis](https://console.upstage.ai/docs/capabilities/document-parse/migration-dp)

----

## Environment Setup
Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]** 

- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.
- See [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.


### API Key Configuration
To use `UpstageDocumentParseLoader`, you need to [obtain an Upstage API key](https://console.upstage.ai/api-keys).

Once you have your API key, set it as the value of the `UPSTAGE_API_KEY` environment variable.


In [73]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [74]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain_upstage",
    ],
    verbose=False,
    upgrade=False,
)

In [75]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "UPSTAGE_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "11-UpstageDocumentParseLoader",
    }
)

Environment variables have been set successfully.


Alternatively, you can set `UPSTAGE_API_KEY` in a `.env` file and load it.

[Note] This is not necessary if you have already set `UPSTAGE_API_KEY` in the previous steps.

In [76]:
from dotenv import load_dotenv

load_dotenv(override=True)

True
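
Whichever way the key is set, it can help to confirm that it is actually visible to the process before calling the loader. The following is a minimal sketch; `require_env` is a hypothetical helper, not part of `langchain_upstage` or `langchain_opentutorial`.

```python
import os


def require_env(name: str) -> str:
    """Hypothetical helper: return the variable's value or raise if it is missing or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set. Set it with set_env or in a .env file.")
    return value


try:
    require_env("UPSTAGE_API_KEY")
    print("UPSTAGE_API_KEY is set.")
except RuntimeError as err:
    print(err)
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the API later on.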

In [77]:
import os
import nest_asyncio

# Allow nested event loops so async code can run inside Jupyter
nest_asyncio.apply()

## UpstageDocumentParseLoader Key Parameters

- `file_path` : Path(s) to the document(s) to be analyzed
- `split` : Document splitting mode; one of 'none' (default), 'element', 'page'
- `model` : Model name for document parsing (default: 'document-parse')
- `ocr` : OCR mode; "force" (always apply OCR) or "auto" (PDF input only)
- `output_format` : Format of the analysis results; one of 'html' (default), 'text', 'markdown'
- `coordinates` : Whether to include OCR coordinates in the output (default: True)
- `base64_encoding` : List of element categories to be Base64-encoded, e.g. ['paragraph', 'table', 'figure', 'header', 'footer', 'list', 'chart', '...']
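
To catch typos in these options before a request is sent, the allowed values listed above can be checked locally. This is a sketch under the assumption that the values in the list are exhaustive; `ALLOWED_VALUES` and `validate_loader_options` are hypothetical names, not part of `langchain_upstage`.

```python
# Allowed values copied from the parameter list above (assumed to be exhaustive).
ALLOWED_VALUES = {
    "split": {"none", "element", "page"},
    "ocr": {"auto", "force"},
    "output_format": {"html", "text", "markdown"},
}


def validate_loader_options(**options) -> dict:
    """Hypothetical helper: raise ValueError if a known option has an unknown value."""
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is invalid; expected one of {sorted(allowed)}")
    return options


loader_options = validate_loader_options(
    split="page",
    ocr="auto",
    output_format="markdown",
    coordinates=False,
)
print(loader_options)
```

The validated dictionary could then be unpacked into the loader, for example `UpstageDocumentParseLoader(FILE_PATH, **loader_options)`.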

## Usage Example
Let's run a code example using `UpstageDocumentParseLoader`.

### Data Preparation

In this tutorial, we will use the following PDF file:

- Download Link: [Modular-RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks](https://arxiv.org/abs/2407.21059)
- File name: "2407.21059.pdf"
- File path: "./data/2407.21059.pdf"
 
Download the PDF file from the link above and save it in a `data` folder in the current directory. The cell below creates the folder and downloads the file automatically.


In [78]:
# Download and save sample PDF file to ./data directory
import requests


def download_pdf(url, save_path):
    """
    Downloads a PDF file from the given URL and saves it to the specified path.

    Args:
        url (str): The URL of the PDF file to download.
        save_path (str): The full path (including file name) where the file will be saved.
    """
    try:
        # Ensure the directory exists
        os.makedirs(os.path.dirname(save_path), exist_ok=True)

        # Download the file
        response = requests.get(url, stream=True, timeout=30)  # Avoid hanging forever on a stalled connection
        response.raise_for_status()  # Raise an error for bad status codes

        # Save the file to the specified path
        with open(save_path, "wb") as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

        print(f"PDF downloaded and saved to: {save_path}")
    except Exception as e:
        print(f"An error occurred while downloading the file: {e}")


# Configuration for the PDF file
pdf_url = "https://arxiv.org/pdf/2407.21059"
file_path = "./data/2407.21059.pdf"

# Download the PDF
download_pdf(pdf_url, file_path)

PDF downloaded and saved to: ./data/2407.21059.pdf


In [79]:
# Set file path
FILE_PATH = "data/2407.21059.pdf"  # modify to your file path

In [80]:
from langchain_upstage import UpstageDocumentParseLoader

# Configure the document loader
loader = UpstageDocumentParseLoader(
    FILE_PATH,
    output_format="html",
    split="page",
    ocr="auto",
    coordinates=True,
    base64_encoding=["chart"],
)

# Load the document
docs = loader.load()

# Print the results
for doc in docs[:2]:
    print(doc)

page_content='<p id='0' data-category='paragraph' style='font-size:14px'>1</p> <h1 id='1' style='font-size:20px'>Modular RAG: Transforming RAG Systems into<br>LEGO-like Reconfigurable Frameworks</h1> <br><p id='2' data-category='paragraph' style='font-size:18px'>Yunfan Gao, Yun Xiong, Meng Wang, Haofen Wang</p> <p id='3' data-category='paragraph' style='font-size:16px'>Abstract—Retrieval-augmented Generation (RAG) has<br>markedly enhanced the capabilities of Large Language Models<br>(LLMs) in tackling knowledge-intensive tasks. The increasing<br>demands of application scenarios have driven the evolution<br>of RAG, leading to the integration of advanced retrievers,<br>LLMs and other complementary technologies, which in turn<br>has amplified the intricacy of RAG systems. However, the rapid<br>advancements are outpacing the foundational RAG paradigm,<br>with many methods struggling to be unified under the process<br>of “retrieve-then-generate”. In this context, this paper examines<br>the 