# PyPDFium2Loader

This notebook provides a quick overview for getting started with the [`PyPDFium2`](https://python.langchain.com/docs/concepts/document_loaders) [document loader](https://python.langchain.com/docs/concepts/document_loaders). For detailed documentation of all DocumentLoader features and configurations, head to the [API reference](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyPDFium2Loader.html).

## Overview
### Integration details

| Class | Package | Local | Serializable | JS support |
| :--- | :--- | :---: | :---: |  :---: |
| [PyPDFium2Loader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyPDFium2Loader.html) | [langchain_community](https://python.langchain.com/api_reference/community/index.html) | ✅ | ❌ | ❌ |
   
---------   

### Loader features

| Source | Document Lazy Loading | Native Async Support | Extract Images | Extract Tables |
|:-----------:| :---: | :---: | :---: |:---: |
| PyPDFium2Loader | ✅ | ❌ | ✅ | ❌  |

## Setup

### Credentials

No credentials are needed to use `PyPDFium2Loader`.

To enable automated tracing of your model calls, set your [LangSmith](https://docs.smith.langchain.com/) API key:

In [1]:
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

### Installation

Install **langchain_community** and **pypdfium2**.

In [2]:
%pip install -qU langchain_community pypdfium2

Note: you may need to restart the kernel to use updated packages.


## Initialization

Now we can instantiate our loader object and load documents:

In [3]:
from langchain_community.document_loaders import PyPDFium2Loader

file_path = "./example_data/layout-parser-paper.pdf"
loader = PyPDFium2Loader(file_path)

## Load

In [4]:
docs = loader.load()
docs[0]

Document(metadata={'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.21', 'creationdate': '2021-06-22T01:27:10+00:00', 'moddate': '2021-06-22T01:27:10+00:00', 'source': './example_data/layout-parser-paper.pdf', 'total_pages': 16, 'page': 0}, page_content='LayoutParser: A Unified Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen\n1\n(), Ruochen Zhang\n2\n, Melissa Dell\n3\n, Benjamin Charles Germain\nLee\n4\n, Jacob Carlson\n3\n, and Weining Li\n5\n1 Allen Institute for AI\nshannons@allenai.org 2 Brown University\nruochen zhang@brown.edu 3 Harvard University\n{melissadell,jacob carlson\n}@fas.harvard.edu\n4 University of Washington\nbcgl@cs.washington.edu 5 University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and exten

In [5]:
import pprint

pprint.pp(docs[0].metadata)

{'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': 'LaTeX with hyperref',
 'producer': 'pdfTeX-1.40.21',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'moddate': '2021-06-22T01:27:10+00:00',
 'source': './example_data/layout-parser-paper.pdf',
 'total_pages': 16,
 'page': 0}


## Lazy Load

In [6]:
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        pages = []
len(pages)

6
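The loop above resets `pages` after every full batch of 10, so the final partial batch (6 pages here) is still unprocessed when the loop ends. One way to handle this is a small batching helper that also yields the remainder. The sketch below uses plain strings as stand-ins for the `Document` objects yielded by `loader.lazy_load()`; `iter_batches` is an illustrative helper written for this example, not part of LangChain.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most `batch_size` items, including the final partial batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for loader.lazy_load(): 16 fake pages, matching the example PDF.
fake_pages = (f"page {i}" for i in range(16))

batch_sizes = [len(batch) for batch in iter_batches(fake_pages, 10)]
print(batch_sizes)  # [10, 6]
```

With a real loader, each yielded batch could be passed to the paged operation (for example an index upsert) in place of the `print` call.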

In [7]:
print(pages[0].page_content[:100])
pprint.pp(pages[0].metadata)

LayoutParser: A Unified Toolkit for DL-Based DIA 11
focuses on precision, efficiency, and robustness
{'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': 'LaTeX with hyperref',
 'producer': 'pdfTeX-1.40.21',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'moddate': '2021-06-22T01:27:10+00:00',
 'source': './example_data/layout-parser-paper.pdf',
 'total_pages': 16,
 'page': 10}


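The metadata attribute contains at least the following keys:
- source
- page (if in *page* mode)
- total_pages
- creationdate
- creator
- producer

Additional metadata are specific to each parser.
This information can be useful, for example to categorize your PDFs.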
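As an illustration of categorizing by metadata, the sketch below groups source files by their `producer` value. Plain dictionaries stand in for `Document.metadata`, `group_by_key` is a hypothetical helper, and the second sample file and its producer are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_key(metadatas: Iterable[dict], key: str) -> Dict[str, List[str]]:
    """Map each value of `key` to the distinct `source` paths that carry it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for metadata in metadatas:
        value = metadata.get(key, "unknown")
        source = metadata.get("source", "unknown")
        if source not in groups[value]:
            groups[value].append(source)
    return dict(groups)


# Stand-ins for doc.metadata; the last entry is invented for illustration.
sample_metadata = [
    {"source": "./example_data/layout-parser-paper.pdf", "producer": "pdfTeX-1.40.21", "page": 0},
    {"source": "./example_data/layout-parser-paper.pdf", "producer": "pdfTeX-1.40.21", "page": 1},
    {"source": "./example_data/scanned-report.pdf", "producer": "Hypothetical Scanner 1.0", "page": 0},
]

by_producer = group_by_key(sample_metadata, "producer")
for producer, sources in by_producer.items():
    print(producer, sources)
```

With real documents, the input would be `(doc.metadata for doc in docs)`.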

## Splitting mode and custom pages delimiter

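When loading a PDF file, you can split it in two different ways:
- By page
- As a single text flow

By default, `PyPDFium2Loader` splits the PDF by page, which is why the documents loaded earlier with default settings carry a `page` key in their metadata.

### Extract the PDF by page. Each page is extracted as a langchain Document object: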

In [8]:
loader = PyPDFium2Loader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
)
docs = loader.load()
print(len(docs))
pprint.pp(docs[0].metadata)

16
{'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': 'LaTeX with hyperref',
 'producer': 'pdfTeX-1.40.21',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'moddate': '2021-06-22T01:27:10+00:00',
 'source': './example_data/layout-parser-paper.pdf',
 'total_pages': 16,
 'page': 0}


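In this mode the PDF is split by page, and the metadata of each resulting Document contains the page number. In some cases, however, you may want to process the PDF as a single text flow, so that paragraphs are not cut at page boundaries. In that case you can use the *single* mode:

### Extract the whole PDF as a single langchain Document object: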

In [9]:
loader = PyPDFium2Loader(
    "./example_data/layout-parser-paper.pdf",
    mode="single",
)
docs = loader.load()
print(len(docs))
pprint.pp(docs[0].metadata)

1
{'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': 'LaTeX with hyperref',
 'producer': 'pdfTeX-1.40.21',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'moddate': '2021-06-22T01:27:10+00:00',
 'source': './example_data/layout-parser-paper.pdf',
 'total_pages': 16}


Logically, the `page` metadata key disappears in this mode. Here is how to clearly identify where pages end in the text flow:

### Add a custom *pages_delimiter* to identify page ends in *single* mode:

In [10]:
loader = PyPDFium2Loader(
    "./example_data/layout-parser-paper.pdf",
    mode="single",
    pages_delimiter="\n-------THIS IS A CUSTOM END OF PAGE-------\n",
)
docs = loader.load()
print(docs[0].page_content[:5780])

LayoutParser: A Unified Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen
1
(), Ruochen Zhang
2
, Melissa Dell
3
, Benjamin Charles Germain
Lee
4
, Jacob Carlson
3
, and Weining Li
5
1 Allen Institute for AI
shannons@allenai.org 2 Brown University
ruochen zhang@brown.edu 3 Harvard University
{melissadell,jacob carlson
}@fas.harvard.edu
4 University of Washington
bcgl@cs.washington.edu 5 University of Waterloo
w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model configurations complicate the easy reuse of important innovations by a wide audience. Though there have been on-going
efforts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural languag

The delimiter can simply be `\n`, or `\f` to clearly mark a page change, or `<!-- PAGE BREAK -->` for seamless injection into a Markdown viewer without any visual effect.
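A minimal sketch of how such a delimiter can be used downstream: page texts joined with a delimiter, as *single* mode conceptually produces, can be split on the same string to recover page indices. The page strings are made up, and the code works on plain strings rather than calling the loader.

```python
PAGES_DELIMITER = "\n-------THIS IS A CUSTOM END OF PAGE-------\n"

# Made-up page texts standing in for the content of a three-page PDF.
page_texts = ["Title and abstract", "Introduction", "Related work"]

# Conceptually, single mode yields one text with the delimiter between pages.
single_text = PAGES_DELIMITER.join(page_texts)

# Splitting on the same delimiter recovers the per-page text and page index.
recovered = single_text.split(PAGES_DELIMITER)
for page_index, text in enumerate(recovered):
    print(page_index, repr(text))
```

This only works reliably when the delimiter is a string that does not occur in the document text itself, which is one reason to prefer a distinctive marker over a bare `\n`.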

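## Extract images from the PDF

You can extract images from your PDFs with one of three solutions:
- rapidOCR (lightweight Optical Character Recognition tool)
- Tesseract (OCR tool with high precision)
- Multimodal language model

You can tune these functions to choose the output format of the extracted images among *html*, *markdown* or *text*. The format is set through `images_inner_format`; the examples below pass `"markdown-img"` and `"html-img"`.

The result is inserted between the last and second-to-last paragraphs of text on the page.

### Extract images from the PDF with rapidOCR: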

In [11]:
%pip install -qU rapidocr-onnxruntime

Note: you may need to restart the kernel to use updated packages.


In [12]:
from langchain_community.document_loaders.parsers import RapidOCRBlobParser

loader = PyPDFium2Loader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    images_inner_format="markdown-img",
    images_parser=RapidOCRBlobParser(),
)
docs = loader.load()

print(docs[5].page_content)

6 Z. Shen et al.
Fig. 2: The relationship between the three types of layout data structures.
Coordinate supports three kinds of variation; TextBlock consists of the coordinate information and extra features like block text, types, and reading orders;
a Layout object is a list of all possible layout elements, including other Layout
objects. They all support the same set of transformation and operation APIs for
maximum flexibility.
Shown in Table 1, LayoutParser currently hosts 9 pre-trained models trained
on 5 different datasets. Description of the training dataset is provided alongside
with the trained models such that users can quickly identify the most suitable
models for their tasks. Additionally, when such a model is not readily available,
LayoutParser also supports training customized layout models and community
sharing of the models (detailed in Section 3.5).
3.2 Layout Data Structures
A critical feature of LayoutParser is the implementation of a series of data
structures and op

Note that RapidOCR is designed to work with Chinese and English, not with other languages.

### Extract images from the PDF with Tesseract:

In [13]:
%pip install -qU pytesseract

Note: you may need to restart the kernel to use updated packages.


In [14]:
from langchain_community.document_loaders.parsers import TesseractBlobParser

loader = PyPDFium2Loader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    images_inner_format="html-img",
    images_parser=TesseractBlobParser(),
)
docs = loader.load()
print(docs[5].page_content)

6 Z. Shen et al.
Fig. 2: The relationship between the three types of layout data structures.
Coordinate supports three kinds of variation; TextBlock consists of the coordinate information and extra features like block text, types, and reading orders;
a Layout object is a list of all possible layout elements, including other Layout
objects. They all support the same set of transformation and operation APIs for
maximum flexibility.
Shown in Table 1, LayoutParser currently hosts 9 pre-trained models trained
on 5 different datasets. Description of the training dataset is provided alongside
with the trained models such that users can quickly identify the most suitable
models for their tasks. Additionally, when such a model is not readily available,
LayoutParser also supports training customized layout models and community
sharing of the models (detailed in Section 3.5).
3.2 Layout Data Structures
A critical feature of LayoutParser is the implementation of a series of data
structures and op

### Extract images from the PDF with a multimodal model:

In [15]:
%pip install -qU langchain_openai

Note: you may need to restart the kernel to use updated packages.


In [16]:
import os

from dotenv import load_dotenv

load_dotenv()

True

In [17]:
from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key =")

In [18]:
from langchain_community.document_loaders.parsers import LLMImageBlobParser
from langchain_openai import ChatOpenAI

loader = PyPDFium2Loader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    images_inner_format="markdown-img",
    images_parser=LLMImageBlobParser(model=ChatOpenAI(model="gpt-4o", max_tokens=1024)),
)
docs = loader.load()
print(docs[5].page_content)

6 Z. Shen et al.
Fig. 2: The relationship between the three types of layout data structures.
Coordinate supports three kinds of variation; TextBlock consists of the coordinate information and extra features like block text, types, and reading orders;
a Layout object is a list of all possible layout elements, including other Layout
objects. They all support the same set of transformation and operation APIs for
maximum flexibility.
Shown in Table 1, LayoutParser currently hosts 9 pre-trained models trained
on 5 different datasets. Description of the training dataset is provided alongside
with the trained models such that users can quickly identify the most suitable
models for their tasks. Additionally, when such a model is not readily available,
LayoutParser also supports training customized layout models and community
sharing of the models (detailed in Section 3.5).
3.2 Layout Data Structures
A critical feature of LayoutParser is the implementation of a series of data
structures and op

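## Working with files

Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed, rather than how the file is loaded. For example, you can use `open` to read the binary content of either a PDF or a markdown file, but you need different parsing logic to convert that binary data into text.

As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to reuse a given parser regardless of how the data was loaded.
You can use this strategy to analyze different files with the same parsing parameters.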

In [19]:
from langchain_community.document_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import PyPDFium2Parser

loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(
        path="./example_data/",
        glob="*.pdf",
    ),
    blob_parser=PyPDFium2Parser(),
)
docs = loader.load()
print(docs[0].page_content)
pprint.pp(docs[0].metadata)

LayoutParser: A Unified Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen
1
(), Ruochen Zhang
2
, Melissa Dell
3
, Benjamin Charles Germain
Lee
4
, Jacob Carlson
3
, and Weining Li
5
1 Allen Institute for AI
shannons@allenai.org 2 Brown University
ruochen zhang@brown.edu 3 Harvard University
{melissadell,jacob carlson
}@fas.harvard.edu
4 University of Washington
bcgl@cs.washington.edu 5 University of Waterloo
w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model configurations complicate the easy reuse of important innovations by a wide audience. Though there have been on-going
efforts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural languag
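The separation of loading from parsing used above can be sketched in a few lines of plain Python. The names below (`Blob`, `iter_blobs`, `parse_text`, `load`) are simplified stand-ins written for this illustration, not the LangChain classes; they only show how one parser can be reused across different sources of bytes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Blob:
    """Raw bytes plus where they came from (simplified stand-in)."""

    source: str
    data: bytes


def iter_blobs(files: Dict[str, bytes]) -> Iterator[Blob]:
    """Loading step: produce blobs without interpreting their content."""
    for source, data in files.items():
        yield Blob(source=source, data=data)


def parse_text(blob: Blob) -> dict:
    """Parsing step: turn bytes into text; knows nothing about storage."""
    return {
        "page_content": blob.data.decode("utf-8"),
        "metadata": {"source": blob.source},
    }


def load(blobs: Iterable[Blob], parser: Callable[[Blob], dict]) -> List[dict]:
    """Combine any blob source with any parser."""
    return [parser(blob) for blob in blobs]


# In-memory "files" standing in for a directory or a bucket.
local_files = {"a.txt": b"hello", "b.txt": b"world"}
parsed_docs = load(iter_blobs(local_files), parse_text)
print([doc["page_content"] for doc in parsed_docs])
```

Swapping `iter_blobs` for another source of `Blob` objects leaves `parse_text` untouched, which is the point of the design.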

It is also possible to work with files from cloud storage.

In [None]:
from langchain_community.document_loaders import CloudBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import PyPDFium2Parser

loader = GenericLoader(
    blob_loader=CloudBlobLoader(
        url="s3://mybucket",  # Supports s3://, az://, gs://, file:// schemes.
        glob="*.pdf",
    ),
    blob_parser=PyPDFium2Parser(),
)
docs = loader.load()
print(docs[0].page_content)
pprint.pp(docs[0].metadata)

## API reference

For detailed documentation of all `PyPDFium2Loader` features and configurations, head to the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyPDFium2Loader.html