# How to handle long text

:::info Prerequisites

This guide assumes familiarity with the following:

- [Extraction](/docs/tutorials/extraction)

:::

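When working with files such as PDFs, you are likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:

1. **Change LLM** Choose a different LLM that supports a larger context window.
2. **Brute force** Chunk the document and extract content from each chunk.
3. **RAG** Chunk the document, index the chunks, and extract content only from a subset of chunks that look "relevant".

Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application you're designing!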
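Before choosing a strategy, it can help to estimate whether a text fits in the context window at all. The sketch below is a rough heuristic, not a tokenizer: it assumes about 4 characters per token, which is only a rule of thumb for English text. The names `estimateTokenCount`, `fitsInContextWindow`, and `reservedTokens` are hypothetical and are not part of LangChain.

```typescript
// Assumption: roughly 4 characters per token for English text.
// A real tokenizer will give different numbers, so treat this as a first check only.
const APPROX_CHARS_PER_TOKEN = 4;

// Estimate the token count from the character length.
function estimateTokenCount(text: string): number {
  return Math.ceil(text.length / APPROX_CHARS_PER_TOKEN);
}

// Decide whether the text probably fits, leaving room (reservedTokens)
// for the prompt instructions and the model's output.
function fitsInContextWindow(
  text: string,
  contextWindowTokens: number,
  reservedTokens: number = 1000
): boolean {
  return estimateTokenCount(text) + reservedTokens <= contextWindowTokens;
}
```

If the check fails, the text is a candidate for the chunking strategies described in this guide.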

## Set up

First, let's install some required dependencies:

```{=mdx}
import IntegrationInstallTooltip from "@mdx_components/integration_install_tooltip.mdx";
import Npm2Yarn from "@theme/Npm2Yarn";

<IntegrationInstallTooltip></IntegrationInstallTooltip>

<Npm2Yarn>
  @langchain/openai @langchain/core @langchain/community langchain zod cheerio
</Npm2Yarn>
```

Next, we need some example data! Let's download the [Wikipedia article on cars](https://en.wikipedia.org/wiki/Car) and load it as a LangChain `Document`.

```typescript
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
// Only required in a Deno notebook environment to load the peer dep.
import "cheerio";

const loader = new CheerioWebBaseLoader(
  "https://en.wikipedia.org/wiki/Car"
);

const docs = await loader.load();

docs[0].pageContent.length;
```

```text
97336
```

## Define the schema

Here, we'll define a schema to extract key developments from the text.

```typescript
import { z } from "zod";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";

const keyDevelopmentSchema = z.object({
  year: z.number().describe("The year when there was an important historic development."),
  description: z.string().describe("What happened in this year? What was the development?"),
  evidence: z.string().describe("Repeat verbatim the sentence(s) from which the year and description information were extracted"),
}).describe("Information about a development in the history of cars.");

const extractionDataSchema = z.object({
  key_developments: z.array(keyDevelopmentSchema),
}).describe("Extracted information about key developments in the history of cars");

const SYSTEM_PROMPT_TEMPLATE = [
  "You are an expert at identifying key historic development in text.",
  "Only extract important historic developments. Extract nothing if no important information can be found in the text."
].join("\n");

// Define a custom prompt to provide instructions and any additional context.
// 1) You can add examples into the prompt template to improve extraction quality
// 2) Introduce additional parameters to take context into account (e.g., include metadata
//    about the document from which the text was extracted.)
const prompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    SYSTEM_PROMPT_TEMPLATE,
  ],
  // Keep on reading through this use case to see how to use examples to improve performance
  // MessagesPlaceholder('examples'),
  ["human", "{text}"],
]);

// We will be using tool calling mode, which
// requires a tool calling capable model.
const llm = new ChatOpenAI({
  model: "gpt-4-0125-preview",
  temperature: 0,
});

const extractionChain = prompt.pipe(llm.withStructuredOutput(extractionDataSchema));
```

## Brute force approach

Split the document into chunks such that each chunk fits into the context window of the LLM.

```typescript
import { TokenTextSplitter } from "langchain/text_splitter";

const textSplitter = new TokenTextSplitter({
  chunkSize: 2000,
  chunkOverlap: 20,
});

// Note that this method takes an array of docs
const splitDocs = await textSplitter.splitDocuments(docs);
```
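To make the roles of `chunkSize` and `chunkOverlap` concrete, here is a minimal sliding-window sketch. It is a simplified stand-in, not the `TokenTextSplitter` implementation: it counts characters rather than tokens, and the function name `splitWithOverlap` is hypothetical.

```typescript
// Simplified illustration of chunking with overlap.
// Assumption: sizes are measured in characters, not tokens.
function splitWithOverlap(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be in the range [0, chunkSize)");
  }

  const chunks: string[] = [];
  // Each new chunk starts (chunkSize - chunkOverlap) characters after the previous one,
  // so consecutive chunks share chunkOverlap characters.
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current chunk reaches the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(splitWithOverlap("abcdefghij", 4, 1)); // [ 'abcd', 'defg', 'ghij' ]
```

The shared characters at chunk boundaries reduce the chance that a sentence is cut in a way that hides it from both neighboring chunks, at the cost of processing some text twice.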

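Use the `.batch` method, available on all runnables, to run the extraction **in parallel** across the chunks!

:::{.callout-tip}
You can often use `.batch()` to parallelize the extractions!

If your model is exposed via an API, this will likely speed up your extraction flow.
:::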

```typescript
// Limit just to the first 3 chunks
// so the code can be re-run quickly
const firstFewTexts = splitDocs.slice(0, 3).map((doc) => doc.pageContent);

const extractionChainParams = firstFewTexts.map((text) => {
  return { text };
});

const results = await extractionChain.batch(extractionChainParams, { maxConcurrency: 5 });
```

### Merge results

After extracting data from across the chunks, we'll want to merge the extractions together.

```typescript
const keyDevelopments = results.flatMap((result) => result.key_developments);

keyDevelopments.slice(0, 20);
```

```text
[
  { year: 0, description: "", evidence: "" },
  {
    year: 1769,
    description: "French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle.",
    evidence: "French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769."
  },
  {
    year: 1808,
    description: "French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu"... 25 more characters,
    evidence: "French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu"... 33 more characters
  },
  {
    year: 1886,
    description: "German inventor Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car—a practical,"... 40 more characters,
    evidence: "The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German"... 56 more c
```
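The merged output above includes a placeholder record with an empty description and evidence. A possible post-processing step, sketched below, drops such records and orders the rest chronologically. The `KeyDevelopment` interface mirrors the fields of the schema in this guide, while `cleanDevelopments` is a hypothetical helper name, not a LangChain function.

```typescript
// Shape of one extracted record, mirroring the fields of the schema in this guide.
interface KeyDevelopment {
  year: number;
  description: string;
  evidence: string;
}

// Drop records with no description or no evidence, then sort by year.
// Returns a new array and leaves the input untouched.
function cleanDevelopments(items: KeyDevelopment[]): KeyDevelopment[] {
  return items
    .filter(
      (item) => item.description.trim() !== "" && item.evidence.trim() !== ""
    )
    .sort((a, b) => a.year - b.year);
}
```

Whether an empty record should be dropped or kept as a signal that a chunk contained nothing relevant is a design choice that depends on the application.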

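## RAG-based approach

Another simple idea is to chunk up the text, but instead of extracting information from every chunk, focus only on the most relevant chunks.

:::{.callout-caution}
It can be difficult to identify which chunks are relevant.

For example, in the `car` article we're using here, most of the article contains key development information. So by using **RAG**, we'll likely be throwing out a lot of relevant information.

We suggest experimenting with your use case to determine whether this approach works.
:::

Here's a simple example that relies on an in-memory demo `MemoryVectorStore` vector store.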

```typescript
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

// Only load the first 10 docs for speed in this demo use-case
const vectorstore = await MemoryVectorStore.fromDocuments(
  splitDocs.slice(0, 10),
  new OpenAIEmbeddings()
);

// Only extract from top document
const retriever = vectorstore.asRetriever({ k: 1 });
```

In this case, the RAG extractor only looks at the top document.

```typescript
import { RunnableSequence } from "@langchain/core/runnables";

const ragExtractor = RunnableSequence.from([
  {
    text: retriever.pipe((docs) => docs[0].pageContent)
  },
  extractionChain,
]);
```

```typescript
const ragExtractorResults = await ragExtractor.invoke("Key developments associated with cars");

ragExtractorResults.key_developments;
```

```text
[
  {
    year: 2020,
    description: "The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 million km (1."... 33 more characters,
    evidence: "The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 millionkm (1.2"... 31 more characters
  },
  {
    year: 2030,
    description: "All fossil fuel vehicles will be banned in Amsterdam from 2030.",
    evidence: "all fossil fuel vehicles will be banned in Amsterdam from 2030."
  },
  {
    year: 2020,
    description: "In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year.",
    evidence: "In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year."
  }
]
```
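The retriever in this guide uses `k: 1`, so only a single chunk reaches the extraction chain. If you retrieve several chunks instead, their text has to be combined into one input that still fits the model. The sketch below shows one way this might be done with a simple character budget. `joinChunksWithinBudget` and `maxChars` are hypothetical names, and the character budget is an assumption standing in for a real token limit.

```typescript
// Concatenate chunks in ranked order until the character budget is used up.
// Assumption: chunks are already sorted from most to least relevant.
function joinChunksWithinBudget(
  chunks: string[],
  maxChars: number,
  separator: string = "\n\n"
): string {
  const selected: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    // The separator only costs characters when it sits between two chunks.
    const cost = chunk.length + (selected.length > 0 ? separator.length : 0);
    if (used + cost > maxChars) break;
    selected.push(chunk);
    used += cost;
  }
  return selected.join(separator);
}
```

With a retriever configured for a larger `k`, one might pass `docs.map((doc) => doc.pageContent)` as `chunks`. That wiring is an assumption and is not part of the example in this guide.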

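## Common issues

Different methods have their own pros and cons related to cost, speed, and accuracy.

Watch out for these issues:

* Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.
* Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!
* LLMs can make up data. If you're looking for a single fact across a large text and using a brute force approach, you may end up with more made-up data.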
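Two of the issues above lend themselves to simple post-processing checks. The sketch below de-duplicates records by year and normalized evidence, and flags records whose evidence does not appear in the source text. The helper names are hypothetical, and the exact-substring check is a deliberately strict assumption: it will reject evidence the model has paraphrased even slightly.

```typescript
// Shape of one extracted record, mirroring the fields of the schema in this guide.
interface KeyDevelopment {
  year: number;
  description: string;
  evidence: string;
}

// Lowercase and collapse whitespace so trivial formatting differences do not matter.
function normalizeText(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Keep the first record for each (year, normalized evidence) pair.
function dedupeDevelopments(items: KeyDevelopment[]): KeyDevelopment[] {
  const seen = new Set<string>();
  const unique: KeyDevelopment[] = [];
  for (const item of items) {
    const key = `${item.year}|${normalizeText(item.evidence)}`;
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(item);
    }
  }
  return unique;
}

// True only if the evidence is non-empty and occurs in the source text
// after normalization. A false result suggests the record needs review.
function isEvidenceInSource(item: KeyDevelopment, sourceText: string): boolean {
  const evidence = normalizeText(item.evidence);
  return evidence.length > 0 && normalizeText(sourceText).includes(evidence);
}
```

Neither check addresses the first issue: information that is split across chunk boundaries may still be missed.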

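## Next steps

You've now learned how to handle long text when performing extraction.

Next, check out some of the other guides in this section, such as [some tips on how to improve extraction quality with examples](/docs/how_to/extraction_examples).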