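# How to handle long text when doing extraction

When working with files such as PDFs, you are likely to encounter text that exceeds your language model's context window. To process such text, consider these strategies:

1. **Change LLM** Choose a different LLM that supports a larger context window.
2. **Brute force** Chunk the document, and extract content from each chunk.
3. **RAG** Chunk the document, index the chunks, and only extract content from the subset of chunks that look "relevant".

Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application you are designing!

This guide demonstrates how to implement strategies 2 and 3.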
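Before picking a strategy, it can help to estimate whether the text fits in the model's context window at all. The snippet below is a minimal sketch, not part of LangChain: the helper names, the ratio of roughly 4 characters per token, and the 1,000 tokens reserved for the reply are all assumptions made for illustration. A real tokenizer will give different counts.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the characters-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token)


def fits_in_context(text: str, context_window: int, reserved_for_output: int = 1000) -> bool:
    """Return True if the estimated prompt size leaves room for the model's reply."""
    return estimate_tokens(text) + reserved_for_output <= context_window


# A placeholder string with the same character count as the article used in this guide.
sample = "x" * 78_865

print(estimate_tokens(sample))           # 19716
print(fits_in_context(sample, 8_000))    # False: chunking (or a larger model) is needed
print(fits_in_context(sample, 128_000))  # True: the whole text may fit in one call
```

If the check fails, strategies 2 and 3 apply. If it passes, a single call over the whole document may be enough.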

## Setup

First, we install the dependencies needed for this guide:

In [1]:
%pip install -qU langchain-community lxml faiss-cpu langchain-openai beautifulsoup4 langchain-text-splitters

Note: you may need to restart the kernel to use updated packages.


Now we need some example data! Let's download an article about [cars](https://en.wikipedia.org/wiki/Car) from Wikipedia and load it as a LangChain [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html).

In [1]:
import re

import requests
from langchain_community.document_loaders import BSHTMLLoader

# Download the content
response = requests.get("https://en.wikipedia.org/wiki/Car")
# Write it to a file
with open("car.html", "w", encoding="utf-8") as f:
    f.write(response.text)
# Load it with an HTML parser
loader = BSHTMLLoader("car.html")
document = loader.load()[0]
# Clean up code
# Replace consecutive new lines with a single new line
document.page_content = re.sub("\n\n+", "\n", document.page_content)

In [2]:
print(len(document.page_content))

78865


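## Define the schema

Following the [extraction tutorial](/docs/tutorials/extraction), we will use Pydantic to define the schema of the information we wish to extract. In this case, we will extract a list of "key developments" (e.g., important historical events) that include a year and a description.

Note that we also include an `evidence` key and instruct the model to provide, verbatim, the relevant sentences from the article. This allows us to compare the extraction results against (the model's reconstruction of) text from the original document.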

In [3]:
from typing import List

from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field


class KeyDevelopment(BaseModel):
    """Information about a development in the history of cars."""

    year: int = Field(
        ..., description="The year when there was an important historic development."
    )
    description: str = Field(
        ..., description="What happened in this year? What was the development?"
    )
    evidence: str = Field(
        ...,
        description="Repeat in verbatim the sentence(s) from which the year and description information were extracted",
    )


class ExtractionData(BaseModel):
    """Extracted information about key developments in the history of cars."""

    key_developments: List[KeyDevelopment]


# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert at identifying key historic development in text. "
            "Only extract important historic developments. Extract nothing if no important information can be found in the text.",
        ),
        ("human", "{text}"),
    ]
)
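
The `evidence` field makes a simple sanity check possible: after extraction, test whether each quoted passage really occurs in the source text. The following is a minimal sketch of such a check. The helper names are hypothetical, and the whitespace and case normalization is an assumption about which differences are harmless. The sample text reuses a sentence fragment from the article.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def evidence_is_verbatim(evidence: str, source_text: str) -> bool:
    """Return True if the quoted evidence appears in the source after normalization."""
    return normalize(evidence) in normalize(source_text)


# Toy source text with a line break in the middle of the sentence.
source_text = (
    "The French inventor Nicolas-Joseph Cugnot built\n"
    "the first steam-powered road vehicle in 1769."
)

good = "The French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769"
bad = "The French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1770"

print(evidence_is_verbatim(good, source_text))  # True
print(evidence_is_verbatim(bad, source_text))   # False
```

Records whose evidence fails the check are good candidates for manual review, since the model may have paraphrased or invented the quote.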

## Create an extractor

Let's select an LLM. Because we are using tool calling, we need a model that supports tool calling. See [this table](/docs/integrations/chat) for available LLMs.

import ChatModelTabs from "@theme/ChatModelTabs";

<ChatModelTabs
  customVarName="llm"
  overrideParams={{openai: {model: "gpt-4o", kwargs: "temperature=0"}}}
/>

In [4]:
# | output: false
# | echo: false

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

In [5]:
extractor = prompt | llm.with_structured_output(
    schema=ExtractionData,
    include_raw=False,
)

## Brute force approach

Split the document into chunks such that each chunk fits into the context window of the LLM.

In [6]:
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    # Controls the size of each chunk
    chunk_size=2000,
    # Controls overlap between chunks
    chunk_overlap=20,
)

texts = text_splitter.split_text(document.page_content)
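
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of a sliding window over a token list. It is not the implementation of `TokenTextSplitter`, which works with a real tokenizer. The function name and the placeholder tokens are assumptions made for illustration.

```python
from typing import List


def split_with_overlap(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


tokens = [f"t{i}" for i in range(10)]  # placeholder tokens t0..t9
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
# ['t0', 't1', 't2', 't3']
# ['t3', 't4', 't5', 't6']
# ['t6', 't7', 't8', 't9']
```

Each chunk repeats the last token of the previous one. A small overlap like this reduces the chance that a fact is cut in half at a chunk boundary, at the cost of some duplicated input.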

Use the [batch](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html) functionality to run the extraction in **parallel** across each chunk!

:::tip
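You can often use `.batch()` to parallelize the extractions! `.batch` uses a thread pool under the hood to help you parallelize workloads.

If your model is exposed via an API, this will likely speed up your extraction flow!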
:::

In [7]:
# Limit just to the first 3 chunks
# so the code can be re-run quickly
first_few = texts[:3]

extractions = extractor.batch(
    [{"text": text} for text in first_few],
    {"max_concurrency": 5},  # limit the concurrency by passing max concurrency!
)
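
The thread-pool idea behind `.batch` can be sketched with the standard library alone. This is an illustration of the general technique, not LangChain's actual code. `fake_extract` is a hypothetical stand-in for one LLM call, and `run_batch` is a name chosen for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def fake_extract(text: str) -> Dict[str, int]:
    """Hypothetical stand-in for one LLM call; sleeps to mimic network latency."""
    time.sleep(0.1)
    return {"chars": len(text)}


def run_batch(inputs: List[str], max_concurrency: int = 5) -> List[Dict[str, int]]:
    """Run fake_extract over all inputs with at most max_concurrency worker threads."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map returns results in input order, even if calls finish out of order.
        return list(pool.map(fake_extract, inputs))


inputs = ["chunk one", "chunk two!", "chunk three"]
results = run_batch(inputs)
print(results)  # [{'chars': 9}, {'chars': 10}, {'chars': 11}]
```

Because the work is I/O-bound (waiting on an API), threads overlap the waiting time. Capping the number of workers helps stay within provider rate limits.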

### Merge results

After extracting data from all the chunks, we need to merge the extractions together.

In [8]:
key_developments = []

for extraction in extractions:
    key_developments.extend(extraction.key_developments)

key_developments[:10]

[KeyDevelopment(year=1769, description='Nicolas-Joseph Cugnot built the first steam-powered road vehicle.', evidence='The French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769, while the Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),
 KeyDevelopment(year=1808, description='François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile.', evidence='The French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769, while the Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),
 KeyDevelopment(year=1886, description='Carl Benz invented the modern car, a practical, marketable automobile for everyday use, and patented his Benz Patent-Motorwagen.', evidence='The modern car—a practical, marketable automobile for everyday use—was invented in 1886, whe

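## RAG-based approach

Another simple idea is to chunk up the text, but instead of extracting information from every chunk, focus only on the most relevant chunks.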

:::caution
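It can be difficult to identify which chunks are relevant.

For example, in the `car` article we are using here, most of the article contains key development information. So by using
**RAG**, we will likely be throwing out a lot of relevant information.

We suggest experimenting with your use case and determining whether this approach works or not.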
:::

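To implement the RAG-based approach:

1.  Chunk up your document(s) and index them (e.g., in a vectorstore);
2.  Prepend the `extractor` chain with a retrieval step that uses the vectorstore.

Here's a simple example that relies on the `FAISS` vectorstore.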

In [9]:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

texts = text_splitter.split_text(document.page_content)
vectorstore = FAISS.from_texts(texts, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 1}
)  # Only extract from first document

In this case the RAG extractor only looks at the top document.

In [10]:
rag_extractor = {
    "text": retriever | (lambda docs: docs[0].page_content)  # fetch content of top doc
} | extractor

In [11]:
results = rag_extractor.invoke("Key developments associated with cars")

In [13]:
for key_development in results.key_developments:
    print(key_development)

year=2006 description='Car-sharing services in the US experienced double-digit growth in revenue and membership.' evidence='in the US, some car-sharing services have experienced double-digit growth in revenue and membership growth between 2006 and 2007.'
year=2020 description='56 million cars were manufactured worldwide, with China producing the most.' evidence='In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year. The automotive industry in China produces by far the most (20 million in 2020).'
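
To show what the retrieval step does without needing an embedding model, here is a toy sketch that ranks chunks by bag-of-words cosine similarity. This is a deliberate simplification: a vectorstore such as FAISS compares dense embeddings, not word counts. The function names and the three sample chunks are made up for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Count lowercase alphanumeric words; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k_chunks(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks most similar to the query, best first."""
    query_vector = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, bag_of_words(c)), reverse=True)
    return ranked[:k]


# Toy chunks, made up for this example.
chunks = [
    "Steam-powered road vehicles appeared in 1769.",
    "Car-sharing services grew between 2006 and 2007.",
    "Traffic collisions are a leading cause of injury.",
]

best = top_k_chunks("growth of car-sharing services", chunks, k=1)
print(best)  # ['Car-sharing services grew between 2006 and 2007.']

# With k > 1, the selected chunks can be joined into one text for the extractor.
combined = "\n\n".join(top_k_chunks("growth of car-sharing services", chunks, k=2))
```

With `k=1` the extractor never sees the other chunks, which is why the results above contain only developments from a single part of the article. Raising `k` trades extra cost for better coverage.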


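## Common issues

Different methods have their own pros and cons related to cost, speed, and accuracy.

Watch out for these issues:

*   Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.
*   Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!
*   LLMs can make up data. If you are looking for a single fact across a large text and using a brute force approach, you may end up getting more made-up data.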
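
De-duplication after merging can be sketched as follows. This is one possible approach under stated assumptions: plain dictionaries stand in for the extracted records, the function name is hypothetical, and two records count as duplicates only if they share the same year and the same description after whitespace and case normalization.

```python
from typing import Dict, List


def dedupe(developments: List[Dict]) -> List[Dict]:
    """Keep the first record for each (year, normalized description) pair."""
    seen = set()
    unique = []
    for item in developments:
        key = (item["year"], " ".join(item["description"].lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Toy records, as might come from two overlapping chunks.
merged = [
    {"year": 1769, "description": "Cugnot built the first steam-powered road vehicle."},
    {"year": 1769, "description": "cugnot built  the first steam-powered road vehicle."},
    {"year": 1808, "description": "De Rivaz built the first internal combustion-powered automobile."},
]

unique = dedupe(merged)
print(len(unique))  # 2
```

Exact matching like this will not catch paraphrased duplicates. Catching those would need a fuzzier comparison, for example matching on year plus a similarity threshold on the description.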