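## Evaluating Chunk Size in Simple RAG

Choosing the right chunk size is essential for retrieval accuracy in a retrieval-augmented generation (RAG) pipeline. The goal is to balance retrieval performance against response quality.

This section evaluates different chunk sizes through the following steps:

1. Extract text from a PDF.
2. Split the text into chunks of different sizes.
3. Create an embedding for each chunk.
4. Retrieve the chunks relevant to a query.
5. Generate a response from the retrieved chunks.
6. Evaluate the faithfulness and relevancy of the response.
7. Compare the results across chunk sizes.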

## Environment Setup

In [1]:
# The fitz module is provided by the pymupdf package
%pip install --quiet --force-reinstall pymupdf

In [2]:
import fitz
import os
import numpy as np
import json
from openai import OpenAI

## Setting Up the OpenAI API Client
We initialize the OpenAI client to generate embeddings and responses.

In [3]:
# Colab environment
from google.colab import userdata
# Credentials for Volcano Engine (Ark)
api_key = userdata.get("ARK_API_KEY")
base_url = userdata.get("ARK_BASE_URL")
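Outside Colab, `google.colab` cannot be imported. The sketch below shows one possible fallback to environment variables. `get_secret` is a hypothetical helper that is not part of the original notebook, and it assumes the same variable names, `ARK_API_KEY` and `ARK_BASE_URL`, are exported in the shell.

```python
import os


def get_secret(name, default=None):
    """Return a secret from Colab user data if available, else from os.environ."""
    try:
        # This import only succeeds inside Google Colab
        from google.colab import userdata
    except ImportError:
        return os.environ.get(name, default)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to the environment
        return os.environ.get(name, default)


api_key = get_secret("ARK_API_KEY")
base_url = get_secret("ARK_BASE_URL")
```

If neither source defines the variable, `get_secret` returns `default` (`None` here), so it is worth checking both values before constructing the client.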

In [4]:
model_name = "doubao-lite-128k-240828"
embedding_model = "doubao-embedding-text-240715"

In [5]:
client = OpenAI(
    base_url=base_url,
    api_key=api_key
)

## Extracting Text from a PDF File
Now we load the PDF file and extract its text.

In [6]:
def extract_text_from_pdf(pdf_path):
    """
    Extract text from a PDF file.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: The extracted text.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page in mypdf:
        # Extract text from the current page and add spacing
        all_text += page.get_text("text") + " "

    # Return the extracted text, stripped of leading/trailing whitespace
    return all_text.strip()

# Path to the PDF file
pdf_path = "./data/AI_Information.pdf"

extracted_text = extract_text_from_pdf(pdf_path)

print(extracted_text[:500])

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past f


## Chunking the Extracted Text
To improve retrieval, we split the extracted text into overlapping chunks of different sizes.

In [7]:
def chunk_text(text: str, n: int, overlap: int):
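    """
    Split text into overlapping chunks of n characters.

    Args:
    text (str): The text to chunk.
    n (int): Number of characters per chunk.
    overlap (int): Number of characters shared by consecutive chunks.

    Returns:
    List[str]: The list of text chunks.
    """
    # A step of zero or less would make range() fail or produce no chunks
    if not 0 <= overlap < n:
        raise ValueError("overlap must satisfy 0 <= overlap < n")

    chunks = []  # Holds the resulting chunks

    # Step through the text in strides of (n - overlap)
    for i in range(0, len(text), n - overlap):
        # Append the slice from index i to i + n
        chunks.append(text[i:i + n])

    return chunks  # Return the list of chunks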

# Chunk sizes to evaluate
chunk_sizes = [128, 256, 512]

# Map each chunk size to its list of text chunks (overlap is one fifth of the size)
text_chunks_dict = {size: chunk_text(extracted_text, size, size // 5) for size in chunk_sizes}

# Print the number of chunks created for each chunk size
for size, chunks in text_chunks_dict.items():
    print(f"Chunk Size: {size}, Number of Chunks: {len(chunks)}")

Chunk Size: 128, Number of Chunks: 326
Chunk Size: 256, Number of Chunks: 164
Chunk Size: 512, Number of Chunks: 82
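The chunk counts follow from the step size rather than the chunk size: with `overlap = size // 5`, the steps are 103, 205, and 410 characters, so the count is roughly `ceil(len(text) / step)` and about halves each time the size doubles. A minimal sketch of that arithmetic on a placeholder string (not the PDF text); `expected_chunk_count` is a hypothetical helper added for illustration:

```python
import math


def chunk_text(text, n, overlap):
    """Split text into n-character chunks, stepping by (n - overlap)."""
    step = n - overlap
    return [text[i:i + n] for i in range(0, len(text), step)]


def expected_chunk_count(text_len, n, overlap):
    """Number of chunks chunk_text produces: ceil(text_len / step)."""
    return math.ceil(text_len / (n - overlap))


toy_text = "x" * 1000  # placeholder text, not the PDF content
for size in [128, 256, 512]:
    overlap = size // 5
    chunks = chunk_text(toy_text, size, overlap)
    # Actual and predicted counts should agree
    print(size, overlap, len(chunks), expected_chunk_count(len(toy_text), size, overlap))
```

Note that the last chunk is usually shorter than `n`, because the final step rarely lands exactly `n` characters before the end of the text.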


## Creating Embeddings for the Text Chunks
Embeddings convert text into numerical vectors, which makes similarity search efficient.

In [None]:
from tqdm import tqdm

def create_embeddings(texts, model="doubao-embedding-text-240715"):
    """
    Generate an embedding vector for each text in a list.

    Args:
    texts (List[str]): The input texts.
    model (str): The embedding model. Defaults to doubao-embedding-text-240715.

    Returns:
    List[np.ndarray]: One embedding vector per input text, in input order.
    """
    response = client.embeddings.create(model=model, input=texts)
    return [np.array(embedding.embedding) for embedding in response.data]

def batch_read(lst, batch_size=10):
    """Yield consecutive slices of lst with at most batch_size items each."""
    for i in range(0, len(lst), batch_size):
        yield lst[i : i + batch_size]


# Generate embeddings for every chunk size:
# iterate over each chunk size and its chunks in text_chunks_dict
chunk_embeddings_dict = {}

for size, chunks in tqdm(text_chunks_dict.items(), desc="Generating Embeddings"):
    chunk_embeddings_dict[size] = []
    # Send the chunks to the embedding API in batches of 10
    for batch in tqdm(batch_read(chunks, 10), desc="batch"):
        chunk_embeddings_dict[size].extend(create_embeddings(batch))
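Batching keeps each embedding request small. The sketch below shows how the batches are formed and how the results are concatenated in order, without calling any API: `fake_embed` is a hypothetical stand-in for `create_embeddings` that returns a tiny deterministic vector per text.

```python
import numpy as np


def batch_read(lst, batch_size=10):
    """Yield consecutive slices of lst with at most batch_size items each."""
    for i in range(0, len(lst), batch_size):
        yield lst[i:i + batch_size]


def fake_embed(texts):
    """Stand-in embedder (assumption): vector of (character count, word count)."""
    return [np.array([len(t), len(t.split())], dtype=float) for t in texts]


chunks = [f"chunk number {i}" for i in range(25)]
embeddings = []
batch_sizes = []
for batch in batch_read(chunks, 10):
    batch_sizes.append(len(batch))
    # extend keeps embeddings aligned with the original chunk order
    embeddings.extend(fake_embed(batch))

print(batch_sizes)
print(len(embeddings))
```

With 25 chunks and a batch size of 10, the last batch holds the 5 leftover chunks, and the final list has one vector per chunk.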

## Performing Semantic Search
We use cosine similarity to find the text chunks most relevant to the user query.

In [15]:
def cosine_similarity(vec1, vec2):
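    """
    Compute the cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): The first vector.
    vec2 (np.ndarray): The second vector.

    Returns:
    float: The cosine similarity between the two vectors.
    """
    # Dot product of the two vectors divided by the product of their norms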
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
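Cosine similarity depends only on direction, not magnitude, and ranges from -1 to 1. When there are many chunk embeddings, the same quantity can be computed for all of them at once. The following is a sketch of a vectorized variant, not the notebook's method; `cosine_similarity_matrix` is a hypothetical name, and it assumes no vector is all zeros (which would divide by zero).

```python
import numpy as np


def cosine_similarity(vec1, vec2):
    """Pairwise version, as defined in the notebook."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))


def cosine_similarity_matrix(query_vec, matrix):
    """Cosine similarity between query_vec and every row of matrix."""
    matrix = np.asarray(matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    row_norms = np.linalg.norm(matrix, axis=1)
    return matrix @ query_vec / (row_norms * np.linalg.norm(query_vec))


query_vec = np.array([1.0, 0.0])
rows = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [-1.0, 0.0]])
sims = cosine_similarity_matrix(query_vec, rows)
print(np.round(sims, 4))
```

The four rows are parallel, orthogonal, at 45 degrees, and opposite to the query, giving similarities of 1, 0, about 0.707, and -1 respectively.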

In [16]:
def retrieve_relevant_chunks(query, text_chunks, chunk_embeddings, k=5):
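    """
    Retrieve the top k most relevant text chunks.

    Args:
    query (str): The query for the semantic search.
    text_chunks (List[str]): The text chunks to search.
    chunk_embeddings (List[np.ndarray]): Embedding vectors of the text chunks.
    k (int): Number of most relevant chunks to return. Defaults to 5.

    Returns:
    List[str]: The k chunks most relevant to the query, most similar first.
    """
    query_embedding = create_embeddings([query])[0]

    similarities = [cosine_similarity(query_embedding, emb) for emb in chunk_embeddings]

    # Indices of the k most similar chunks:
    # argsort sorts ascending, so the last k entries are the largest;
    # reversing them gives descending order of similarity
    top_indices = np.argsort(similarities)[-k:][::-1]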

    return [text_chunks[i] for i in top_indices]
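The top-k selection can be traced on a small example. The similarity values below are made up for illustration:

```python
import numpy as np

similarities = [0.1, 0.9, 0.4, 0.7]  # made-up scores for illustration
k = 2

order = np.argsort(similarities)   # indices in ascending order of similarity
top_indices = order[-k:][::-1]     # last k indices, reversed to descending order

print(order.tolist())        # [0, 2, 3, 1]
print(top_indices.tolist())  # [1, 3]
```

Index 1 (score 0.9) comes first, followed by index 3 (score 0.7). If `k` exceeds the number of chunks, the slice simply returns all indices.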

In [17]:
# Load the validation dataset
with open('data/val.json') as f:
    data = json.load(f)

query = data[0]['question']

retrieved_chunks_dict = {size: retrieve_relevant_chunks(query, text_chunks_dict[size], chunk_embeddings_dict[size]) for size in chunk_sizes}

# Print retrieved chunks for chunk size 256
print(retrieved_chunks_dict[256])

['Transparency and Explainability \nTransparency and explainability are key to building trust in AI. Making AI systems understandable \nand providing insights into their decision-making processes helps users assess their reliability \nand fairness. \nRobustness ', 'd explainability is \ncrucial for building trust and accountability. \n \n  \nPrivacy and Security \nAI systems often rely on large amounts of data, raising concerns about privacy and data security. \nProtecting sensitive information and ensuring responsible dat', 'he world are developing AI strategies and policy frameworks to guide the \ndevelopment and deployment of AI. These frameworks address ethical considerations, promote \ninnovation, and ensure responsible AI practices. \nRegulation of AI \nThe regulation of AI i', ', applications, ethical implications, and future directions of \nAI, we can better navigate the opportunities and challenges presented by this transformative \ntechnology. Continued research, responsible 

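## Generating a Response from the Retrieved Chunks
Let's generate a response from the retrieved text for each chunk size and print the one for chunk size `256`.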

In [18]:
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"

def generate_response(query, system_prompt, retrieved_chunks, model="doubao-lite-128k-240828"):
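    """
    Generate a response from the model using the system prompt, the query, and the retrieved chunks.

    Args:
    query (str): The user query.
    system_prompt (str): The system prompt.
    retrieved_chunks (List[str]): Retrieved chunks used as context.
    model (str): The LLM to use. Defaults to doubao-lite-128k-240828.

    Returns:
    str: The content of the model's response.
    """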
    # Combine retrieved chunks into a single context string
    context = "\n".join([f"Context {i+1}:\n{chunk}" for i, chunk in enumerate(retrieved_chunks)])

    # Create the user prompt by combining the context and the query
    user_prompt = f"{context}\n\nQuestion: {query}"

    # Generate the AI response using the specified model
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )

    # Return the content of the AI response
    return response.choices[0].message.content

# Generate AI responses for each chunk size
ai_responses_dict = {size: generate_response(query, system_prompt, retrieved_chunks_dict[size]) for size in chunk_sizes}

# Print the response for chunk size 256
print(ai_responses_dict[256])

Explainable AI (XAI) techniques aim to make AI decisions more understandable. It is considered important because transparency and explainability are key to building trust in AI. Making AI systems understandable and providing insights into their decision-making processes helps users assess their reliability and fairness. It is crucial for building trust and accountability.
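To make the prompt layout concrete, here is a small sketch of how the user message is assembled from the retrieved chunks. `build_user_prompt` is a hypothetical helper that mirrors the string formatting inside `generate_response`; the query and chunks are placeholders.

```python
def build_user_prompt(query, retrieved_chunks):
    """Label each chunk as 'Context i' and append the question at the end."""
    context = "\n".join(
        f"Context {i + 1}:\n{chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return f"{context}\n\nQuestion: {query}"


demo_prompt = build_user_prompt(
    "What is XAI?",                       # placeholder query
    ["First chunk.", "Second chunk."],    # placeholder chunks
)
print(demo_prompt)
```

Each chunk gets a numbered label, and a blank line separates the context from the question, so the model can tell the two apart.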


## Evaluating the AI Response
We use a strong LLM to score the responses on faithfulness and relevancy.

In [20]:
# Constants for the evaluation scoring scheme
SCORE_FULL = 1.0     # Full match or fully satisfactory
SCORE_PARTIAL = 0.5  # Partial match or somewhat satisfactory
SCORE_NONE = 0.0     # No match or unsatisfactory

In [21]:
FAITHFULNESS_PROMPT_TEMPLATE = """
Evaluate the faithfulness of the AI response compared to the true answer.
User Query: {question}
AI Response: {response}
True Answer: {true_answer}

Faithfulness measures how well the AI response aligns with facts in the true answer, without hallucinations.

INSTRUCTIONS:
- Score STRICTLY using only these values:
    * {full} = Completely faithful, no contradictions with true answer
    * {partial} = Partially faithful, minor contradictions
    * {none} = Not faithful, major contradictions or hallucinations
- Return ONLY the numerical score ({full}, {partial}, or {none}) with no explanation or additional text.
"""
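The placeholders in the template are filled with `str.format`. A shortened sketch with a hypothetical `DEMO_TEMPLATE` shows how the score constants appear in the final prompt:

```python
SCORE_FULL, SCORE_PARTIAL, SCORE_NONE = 1.0, 0.5, 0.0

# Shortened template for illustration; the notebook's templates are longer
DEMO_TEMPLATE = (
    "User Query: {question}\n"
    "AI Response: {response}\n"
    "Return ONLY the numerical score ({full}, {partial}, or {none})."
)

demo_prompt = DEMO_TEMPLATE.format(
    question="What is XAI?",
    response="XAI makes AI decisions understandable.",
    full=SCORE_FULL,
    partial=SCORE_PARTIAL,
    none=SCORE_NONE,
)
print(demo_prompt)
```

The floats are rendered as `1.0`, `0.5`, and `0.0`, which is the exact form the evaluator is asked to return. Literal braces in the template itself would need to be doubled (`{{` and `}}`), while braces inside the substituted values are left untouched.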

In [22]:
RELEVANCY_PROMPT_TEMPLATE = """
Evaluate the relevancy of the AI response to the user query.
User Query: {question}
AI Response: {response}

Relevancy measures how well the response addresses the user's question.

INSTRUCTIONS:
- Score STRICTLY using only these values:
    * {full} = Completely relevant, directly addresses the query
    * {partial} = Partially relevant, addresses some aspects
    * {none} = Not relevant, fails to address the query
- Return ONLY the numerical score ({full}, {partial}, or {none}) with no explanation or additional text.
"""
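Both templates ask for a bare number, but a model reply can still carry extra text such as `Score: 0.5`. The sketch below is one tolerant way to parse such replies; `parse_score` and `ALLOWED_SCORES` are hypothetical names, and the choice to fall back to `0.0` for anything unexpected is an assumption.

```python
import re

ALLOWED_SCORES = (1.0, 0.5, 0.0)


def parse_score(reply, default=0.0):
    """Extract the first number in reply; return default unless it is an allowed score."""
    match = re.search(r"\d+(?:\.\d+)?", reply or "")
    if match is None:
        return default
    value = float(match.group())
    return value if value in ALLOWED_SCORES else default


print(parse_score("1.0"))
print(parse_score("Score: 0.5"))
print(parse_score("no number here"))
```

Restricting the result to the three allowed values keeps an off-scale reply from skewing averages later on.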

In [24]:
def evaluate_response(question, response, true_answer):
    """
    Evaluate the quality of an AI-generated response on faithfulness and relevancy.

    Args:
    question (str): The user's original question.
    response (str): The AI-generated response to evaluate.
    true_answer (str): The correct answer used as ground truth.

    Returns:
    Tuple[float, float]: A tuple containing (faithfulness_score, relevancy_score).
        Each score is one of: 1.0 (full), 0.5 (partial), or 0.0 (none).
    """
    # Fill in the prompt templates
    faithfulness_prompt = FAITHFULNESS_PROMPT_TEMPLATE.format(
        question=question,
        response=response,
        true_answer=true_answer,
        full=SCORE_FULL,
        partial=SCORE_PARTIAL,
        none=SCORE_NONE
    )

    relevancy_prompt = RELEVANCY_PROMPT_TEMPLATE.format(
        question=question,
        response=response,
        full=SCORE_FULL,
        partial=SCORE_PARTIAL,
        none=SCORE_NONE
    )

    # Request faithfulness evaluation from the model
    faithfulness_response = client.chat.completions.create(
        model="deepseek-v3-250324",
        temperature=0,
        messages=[
            {"role": "system", "content": "You are an objective evaluator. Return ONLY the numerical score."},
            {"role": "user", "content": faithfulness_prompt}
        ]
    )

    # Request relevancy evaluation from the model
    relevancy_response = client.chat.completions.create(
        model="deepseek-v3-250324",
        temperature=0,
        messages=[
            {"role": "system", "content": "You are an objective evaluator. Return ONLY the numerical score."},
            {"role": "user", "content": relevancy_prompt}
        ]
    )

    # Extract scores and handle potential parsing errors
    try:
        faithfulness_score = float(faithfulness_response.choices[0].message.content.strip())
    except ValueError:
        print("Warning: Could not parse faithfulness score, defaulting to 0")
        faithfulness_score = 0.0

    try:
        relevancy_score = float(relevancy_response.choices[0].message.content.strip())
    except ValueError:
        print("Warning: Could not parse relevancy score, defaulting to 0")
        relevancy_score = 0.0

    return faithfulness_score, relevancy_score

true_answer = data[0]['ideal_answer']

faithfulness, relevancy = evaluate_response(query, ai_responses_dict[256], true_answer)
faithfulness2, relevancy2 = evaluate_response(query, ai_responses_dict[128], true_answer)
faithfulness3, relevancy3 = evaluate_response(query, ai_responses_dict[512], true_answer)


# print the evaluation scores
print(f"Faithfulness Score (Chunk Size 256): {faithfulness}")
print(f"Relevancy Score (Chunk Size 256): {relevancy}")

print(f"\n")

print(f"Faithfulness Score (Chunk Size 128): {faithfulness2}")
print(f"Relevancy Score (Chunk Size 128): {relevancy2}")

print(f"\n")

print(f"Faithfulness Score (Chunk Size 512): {faithfulness3}")
print(f"Relevancy Score (Chunk Size 512): {relevancy3}")

Faithfulness Score (Chunk Size 256): 1.0
Relevancy Score (Chunk Size 256): 1.0


Faithfulness Score (Chunk Size 128): 1.0
Relevancy Score (Chunk Size 128): 1.0


Faithfulness Score (Chunk Size 512): 1.0
Relevancy Score (Chunk Size 512): 1.0
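On this single question all three chunk sizes score 1.0, so one query does not separate them. A comparison across chunk sizes would need scores from several questions, averaged per size. The sketch below shows only the aggregation step; `summarize_scores` is a hypothetical helper, and the scores are placeholders, not results from this notebook.

```python
import numpy as np


def summarize_scores(scores_by_size):
    """Average (faithfulness, relevancy) pairs for each chunk size."""
    summary = {}
    for size, pairs in scores_by_size.items():
        arr = np.array(pairs, dtype=float)  # shape: (num_questions, 2)
        summary[size] = {
            "faithfulness": float(arr[:, 0].mean()),
            "relevancy": float(arr[:, 1].mean()),
        }
    return summary


# Placeholder scores for illustration only; these are not measured results
placeholder_scores = {
    128: [(1.0, 1.0), (0.5, 1.0)],
    256: [(1.0, 1.0), (1.0, 0.5)],
}
for size, means in summarize_scores(placeholder_scores).items():
    print(size, means)
```

In practice, each pair would come from calling `evaluate_response` on one question in `data/val.json` for the given chunk size.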
