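# Enhancing RAG with Contextual Retrieval

We will use an LLM to generate a context sentence for each chunk, based on the document it comes from, to improve retrieval accuracy and to use the enriched chunks in hybrid search.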

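* [Loading a complex dataset of documents](#loading-a-complex-dataset-of-documents)
* [Split the documents into chunks](#split-the-documents-into-chunks)
* [Generate the context sentence](#generate-the-context-sentence)
* [Enrich the chunk embedding vectors with the context](#enrich-the-chunk-embedding-vectors-with-the-context)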

### Visual improvements

We will use the [rich library](https://github.com/Textualize/rich) to make the output more readable, and suppress warnings.

In [1]:
from rich.console import Console
from rich_theme_manager import Theme, ThemeManager
import pathlib

theme_dir = pathlib.Path("themes")
theme_manager = ThemeManager(theme_dir=theme_dir)
dark = theme_manager.get("dark")

# Create a console with the dark theme
console = Console(theme=dark)

In [2]:
import warnings

# Suppress warnings
warnings.filterwarnings('ignore')

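## Loading a complex dataset of documents

We will load a dataset of complex scientific documents from arXiv. Applying naive chunking to documents like these leads to poor results in RAG applications.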

In [3]:
from datasets import load_dataset

dataset = load_dataset("jamescalam/ai-arxiv2", split="train")
console.print(dataset)
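
To make the point about naive chunking concrete, here is a minimal sketch of fixed-size character chunking. `naive_chunks` and the sample text are hypothetical and used only for illustration; they are not part of the dataset or of the chunker used in this notebook.

```python
def naive_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` characters (hypothetical helper)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Made-up sample text standing in for a scientific document
sample = (
    "We propose a new attention mechanism. "
    "It reduces memory use by half. "
    "Results are reported in Table 2."
)

pieces = naive_chunks(sample, 40)
for piece in pieces:
    # Boundaries fall wherever the character count says, not at sentence ends
    print(repr(piece))
```

The first piece ends in the middle of the second sentence, so the "It" is cut off from what it refers to. This kind of lost context is what the rest of the notebook tries to restore.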

## Split the documents into chunks

We will use the statistical chunker that we used in the previous notebook.

In [4]:
from dotenv import load_dotenv

load_dotenv()

True

In [5]:
import os
from semantic_router.encoders import OpenAIEncoder

encoder = OpenAIEncoder(name="text-embedding-3-small")

In [6]:
from semantic_chunkers import StatisticalChunker
import logging

logging.disable(logging.CRITICAL)

chunker = StatisticalChunker(
    encoder=encoder,
    min_split_tokens=100,
    max_split_tokens=500,
)

In [7]:
chunks_0 = chunker(docs=[dataset["content"][0]])


In [8]:
from rich.text import Text
from rich.panel import Panel

chunk_0_0 = ' '.join(chunks_0[0][0].splits)

content = Text(chunk_0_0)
console.print(Panel(content, title="Chunk 0", expand=False, border_style="bold"))

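## Generate the context sentence

We will use Anthropic Claude to generate the context. It is one of the stronger LLMs for summarization, and it supports [Prompt Caching](https://www.anthropic.com/news/prompt-caching), which is very useful when generating context for many chunks of the same document: the full document is sent with every request, and caching lets later requests reuse it.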

In [9]:
from dotenv import load_dotenv

load_dotenv()

True

In [10]:
import anthropic

client = anthropic.Anthropic()


In [11]:
DOCUMENT_CONTEXT_PROMPT = """
<document>
{doc_content}
</document>
"""

CHUNK_CONTEXT_PROMPT = """
Here is the chunk we want to situate within the whole document
<chunk>
{chunk_content}
</chunk>

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else.
"""

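def situate_context(doc: str, chunk: str):
    """Ask Claude to situate `chunk` within `doc`; returns the full response (the text is in `.content[0].text`)."""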
    response = client.beta.prompt_caching.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1024,
        temperature=0.0,
        messages=[
            {
                "role": "user", 
                "content": [
                    {
                        "type": "text",
                        "text": DOCUMENT_CONTEXT_PROMPT.format(doc_content=doc),
                        "cache_control": {"type": "ephemeral"} #we will make use of prompt caching for the full documents
                    },
                    {
                        "type": "text",
                        "text": CHUNK_CONTEXT_PROMPT.format(chunk_content=chunk),
                    }
                ]
            }
        ],
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"}
    )
    return response

In [12]:
chunk_context = situate_context(dataset["content"][0], chunk_0_0)

In [13]:
console.print(chunk_context)

In [14]:
chunk_0_5 = ' '.join(chunks_0[0][5].splits)

In [15]:
second_chunk_context = situate_context(dataset["content"][0], chunk_0_5)

In [16]:
console.print(second_chunk_context)
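
Both calls above send the same document block, so the second one should be able to read it from the cache. The sketch below shows one way to check this. It assumes the response exposes `usage.cache_creation_input_tokens` and `usage.cache_read_input_tokens`, as described in Anthropic's prompt caching documentation. `cache_summary` is a hypothetical helper, and the stub responses use placeholder numbers, not measured values.

```python
from types import SimpleNamespace


def cache_summary(response) -> dict:
    """Summarize prompt-cache usage of a response (hypothetical helper)."""
    usage = response.usage
    # Fall back to 0 if a field is missing or None
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    return {"cache_written": written, "cache_read": read, "cache_hit": read > 0}


# Stub objects standing in for API responses; the token counts are placeholders
first = SimpleNamespace(
    usage=SimpleNamespace(cache_creation_input_tokens=1000, cache_read_input_tokens=0)
)
second = SimpleNamespace(
    usage=SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=1000)
)

print(cache_summary(first))
print(cache_summary(second))
```

With the real responses from this notebook, the call would be `cache_summary(second_chunk_context)`.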

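## Enrich the chunk embedding vectors with the context

### Concatenate the generated context with the chunk text

We will iterate over all chunks. Depending on the number of chunks, this may take a while.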

In [17]:
arxiv_id = dataset[0]["id"]
refs = list(dataset[0]["references"].values())
doc_text = dataset[0]["content"]
title = dataset[0]["title"]

from tqdm import tqdm

corpus_json = []
for i, chunk in tqdm(enumerate(chunks_0[0]), total=len(chunks_0[0]), desc="Processing chunks"):
    chunk_text = ' '.join(chunk.splits)
    contextualized_text = situate_context(doc_text, chunk_text).content[0].text
    corpus_json.append({
        "id": i,
        "text": f"{chunk_text}\n\n{contextualized_text}",
        "metadata" : {
            "title": title,
            "arxiv_id": arxiv_id,
            "references": refs
        }
    })

Processing chunks: 100%|██████████| 46/46 [26:50<00:00, 35.00s/it] 
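
Since the loop can run for a long time, it may be worth saving each generated context as soon as it arrives, so an interrupted run can resume without repeating finished calls. The following is a minimal sketch under stated assumptions. `contextualize_with_checkpoint` is a hypothetical helper, `situate` is any callable that returns the context as a string, and the checkpoint is keyed by chunk index, so it is only valid if the chunks and their order stay the same.

```python
import json
import pathlib
import tempfile


def contextualize_with_checkpoint(chunk_texts, doc_text, situate, checkpoint_path):
    """Generate one context per chunk, persisting progress to a JSON file."""
    path = pathlib.Path(checkpoint_path)
    # Load previously generated contexts, if any (keys are chunk indices as strings)
    done = json.loads(path.read_text()) if path.exists() else {}
    for i, chunk_text in enumerate(chunk_texts):
        key = str(i)
        if key in done:
            continue  # already generated in an earlier run
        done[key] = situate(doc_text, chunk_text)
        path.write_text(json.dumps(done))  # save after every chunk
    return [done[str(i)] for i in range(len(chunk_texts))]


# Demo with a fake `situate` so no API call is made
calls = []

def fake_situate(doc, chunk):
    calls.append(chunk)
    return f"context for {chunk}"

with tempfile.TemporaryDirectory() as tmp:
    ckpt = pathlib.Path(tmp) / "contexts.json"
    first_run = contextualize_with_checkpoint(["a", "b"], "doc", fake_situate, ckpt)
    second_run = contextualize_with_checkpoint(["a", "b", "c"], "doc", fake_situate, ckpt)
```

In this notebook, `situate` could be `lambda d, c: situate_context(d, c).content[0].text`.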


In [18]:
console.print(corpus_json[:2])

### Save corpus_json to a file

We want to use it in the next notebook.

In [19]:
import json
import os

os.makedirs('data', exist_ok=True)  # ensure the output directory exists
with open('data/corpus.json', 'w') as f:
    json.dump(corpus_json, f)
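
For reference, reading the file back in a later notebook could look like the sketch below. It writes a placeholder corpus to a temporary directory so it runs on its own. The entry only mirrors the shape of `corpus_json`, and its values are made up.

```python
import json
import pathlib
import tempfile

# Placeholder entry with the same shape as the items in corpus_json
sample_corpus = [
    {
        "id": 0,
        "text": "chunk text\n\ngenerated context sentence",
        "metadata": {"title": "placeholder title", "arxiv_id": "placeholder", "references": []},
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "corpus.json"
    # Write the corpus the same way the notebook does
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_corpus, f)
    # Read it back, as a later notebook would
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded[0]["text"])
```

In the real workflow the path would be `data/corpus.json` instead of the temporary file.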

