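## Introduction to Semantic Chunking
Text chunking is a key step in Retrieval-Augmented Generation (RAG): a large body of text is divided into meaningful segments to improve retrieval accuracy. Unlike fixed-length chunking, semantic chunking splits the text according to the content similarity between sentences.

### Breakpoint methods:
- **Percentile**: find the X-th percentile of all similarity differences and split wherever the drop is greater than that value.
- **Standard deviation**: split wherever the similarity falls more than X standard deviations below the mean.
- **Interquartile range (IQR)**: use the interquartile range (Q3 - Q1) to determine the split points.

This notebook implements all three semantic chunking methods and evaluates the **percentile** method on a sample text.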
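
To make the three rules concrete, here is a minimal sketch on made-up similarity values. It is not the notebook's implementation: the array `toy_similarities`, the parameter choices (75th percentile, 1 standard deviation) and the idea of expressing each gap as a drop `1 - similarity` are assumptions for illustration. A drop above "mean plus X standard deviations" is the same condition as a similarity below "mean minus X standard deviations".

```python
import numpy as np

# Made-up cosine similarities between adjacent sentences (illustrative only).
toy_similarities = np.array([0.91, 0.88, 0.35, 0.90, 0.87, 0.42, 0.89, 0.93])

# Express each gap as a drop: how far the similarity falls short of 1.0.
drops = 1.0 - toy_similarities

# Percentile rule: the cut-off is the X-th percentile of the drops (X = 75 here).
percentile_cut = np.percentile(drops, 75)

# Standard-deviation rule: the cut-off is the mean drop plus X standard deviations (X = 1 here).
std_cut = drops.mean() + 1.0 * drops.std()

# IQR rule: the cut-off is the upper fence Q3 + 1.5 * (Q3 - Q1).
q1, q3 = np.percentile(drops, [25, 75])
iqr_cut = q3 + 1.5 * (q3 - q1)

cuts = {"percentile": percentile_cut, "standard_deviation": std_cut, "interquartile": iqr_cut}
# For each rule, list the gaps whose drop exceeds the cut-off.
splits = {name: [i for i, d in enumerate(drops) if d > cut] for name, cut in cuts.items()}

for name, cut in cuts.items():
    print(f"{name}: cut-off={cut:.3f}, split after sentence index {splits[name]}")
```

On this toy input all three rules pick the two obvious dips (after sentence indices 2 and 5). On real embeddings the rules can disagree, which is why the parameter X matters.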

## Environment Setup

In [1]:
# The fitz module is installed via the pymupdf package
%pip install --quiet --force-reinstall pymupdf

In [2]:
import fitz
import os
import numpy as np
import json
from openai import OpenAI

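## Extracting Text from the PDF
We first extract the raw text from the PDF file with PyMuPDF. The extracted text is split into semantic chunks in the sections that follow.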

In [3]:
def extract_text_from_pdf(pdf_path):
    """
    Extract text from a PDF file.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: The extracted text.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page in mypdf:
        # Extract text from the current page and add spacing
        all_text += page.get_text("text") + " "

    # Return the extracted text, stripped of leading/trailing whitespace
    return all_text.strip()

# Path to the PDF file
pdf_path = "./data/AI_Information.pdf"

extracted_text = extract_text_from_pdf(pdf_path)

print(extracted_text[:500])

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past f


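## Setting Up the OpenAI API Client

In [4]:
# Running in Google Colab: read secrets from Colab user data
from google.colab import userdata
# Use Alibaba Cloud's Bailian (DashScope) platform
api_key = userdata.get("DASHSCOPE_API_KEY")
base_url = userdata.get("DASHSCOPE_BASE_URL")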

In [5]:
client = OpenAI(
    base_url=base_url,
    api_key=api_key
)

## Creating Sentence-Level Embeddings
We split the text into sentences and generate an embedding for each one.

In [6]:
def get_embedding(text: str, model: str = "text-embedding-v2"):
    """
    Create an embedding vector for the given text with an embedding model.

    Args:
    text (str): Input text.
    model (str): Embedding model name; defaults to DashScope's text-embedding-v2.

    Returns:
    np.ndarray: The embedding vector.
    """
    response = client.embeddings.create(model=model, input=text)
    return np.array(response.data[0].embedding)

# Split the text into sentences (naive split on ". ")
sentences = extracted_text.split(". ")

# Generate embeddings for each sentence
embeddings = [get_embedding(sentence) for sentence in sentences]

print(f"Generated {len(embeddings)} sentence embeddings.")

Generated 257 sentence embeddings.
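
Splitting on `". "` ignores sentences that end in `?` or `!` and keeps line breaks left over from PDF extraction. As one possible refinement, the sketch below uses a regular expression instead. The helper name `split_sentences` is hypothetical and not part of the notebook, and the rule is still naive: abbreviations such as "e.g. " would be split too.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split text after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    # Collapse line breaks and repeated spaces left over from PDF extraction.
    normalized = re.sub(r"\s+", " ", text).strip()
    # Split at whitespace that follows sentence-ending punctuation; the punctuation is kept.
    parts = re.split(r"(?<=[.!?])\s+", normalized)
    return [p for p in parts if p]

sample = "AI is a broad field. It includes machine learning!\nDoes it include robotics? Yes."
print(split_sentences(sample))
```

Because this variant keeps the closing punctuation on each sentence, code that later rejoins sentences with `". "` would need to join with a single space instead.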


## Calculating Similarity Differences
We compute the cosine similarity between each pair of adjacent sentences.

In [7]:
def cosine_similarity(vec1, vec2):
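    """
    Compute the cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): The first vector.
    vec2 (np.ndarray): The second vector.

    Returns:
    float: Cosine similarity between the two vectors.
    """
    # Dot product of the two vectors divided by the product of their norms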
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

similarities = [cosine_similarity(embeddings[i], embeddings[i + 1]) for i in range(len(embeddings) - 1)]
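
For longer documents, the same adjacent-pair similarities can be computed in one vectorized step. The following is a sketch under assumptions: the helper `adjacent_cosine_similarities` is hypothetical, the embeddings are stacked into a 2-D array with one row per sentence, and no embedding is the zero vector.

```python
import numpy as np

def adjacent_cosine_similarities(embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity between each row and the next one (hypothetical helper)."""
    # Normalize every row to unit length so that a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Row-wise dot product of rows 0..n-2 with rows 1..n-1.
    return np.sum(unit[:-1] * unit[1:], axis=1)

# Toy 2-D "embeddings": identical, orthogonal, then 45 degrees apart.
toy_embeddings = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
adjacent = adjacent_cosine_similarities(toy_embeddings)
print(adjacent)
```

The result has one entry fewer than the number of sentences, just like the list comprehension above.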

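## Implementing Semantic Chunking
We implement three different methods for finding breakpoints:
- **Percentile**: find the X-th percentile of all similarity differences and split wherever the drop is greater than that value.
- **Standard deviation**: split wherever the similarity falls more than X standard deviations below the mean.
- **Interquartile range (IQR)**: use the interquartile range (Q3 - Q1) to determine the split points.

The function below applies each rule directly to the adjacent-sentence similarity scores and marks a breakpoint wherever a score falls below the computed threshold.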

In [8]:
def compute_breakpoints(similarities, method="percentile", threshold=90):
    """
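    Compute chunk breakpoints from the similarity scores.

    Args:
    similarities (List[float]): Similarity scores between adjacent sentences.
    method (str): 'percentile', 'standard_deviation', or 'interquartile'.
    threshold (float): Threshold parameter (the percentile for 'percentile', the number of standard deviations for 'standard_deviation'; unused for 'interquartile').

    Returns:
    List[int]: Indices after which a chunk split should occur.
    """
    # Determine the threshold value according to the chosen method
    if method == "percentile":
        # Compute the X-th percentile of the similarity scores
        threshold_value = np.percentile(similarities, threshold)
    elif method == "standard_deviation":
        # Compute the mean and standard deviation of the similarity scores
        mean = np.mean(similarities)
        std_dev = np.std(similarities)
        # Set the threshold to the mean minus X standard deviations
        threshold_value = mean - (threshold * std_dev)
    elif method == "interquartile":
        # Compute the first and third quartiles (Q1 and Q3)
        q1, q3 = np.percentile(similarities, [25, 75])
        # Set the threshold to the lower IQR fence, Q1 - 1.5 * IQR
        threshold_value = q1 - 1.5 * (q3 - q1)
    else:
        # Unsupported method
        raise ValueError("Invalid method. Choose 'percentile', 'standard_deviation', or 'interquartile'.")

    # Collect the indices whose similarity falls below the threshold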
    return [i for i, sim in enumerate(similarities) if sim < threshold_value]

breakpoints = compute_breakpoints(similarities, method="percentile", threshold=90)
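
Because the percentile is taken over the similarity scores themselves, `threshold=90` marks every gap whose similarity lies below the 90th percentile, which is roughly nine gaps in ten. The sketch below illustrates the effect on random, made-up scores; `toy_similarities` and `percentile_breakpoints` are hypothetical names, and the helper only mirrors the percentile branch of `compute_breakpoints`.

```python
import numpy as np

rng = np.random.default_rng(0)
# 256 made-up similarity scores: the number of adjacent pairs among 257 sentences.
toy_similarities = rng.uniform(0.3, 0.95, size=256)

def percentile_breakpoints(similarities, threshold):
    """Indices whose similarity is below the given percentile of all similarities."""
    cut = np.percentile(similarities, threshold)
    return [i for i, sim in enumerate(similarities) if sim < cut]

high = percentile_breakpoints(toy_similarities, 90)
low = percentile_breakpoints(toy_similarities, 10)
print(f"threshold=90 -> {len(high)} breakpoints, {len(high) + 1} chunks")
print(f"threshold=10 -> {len(low)} breakpoints, {len(low) + 1} chunks")
```

With 256 distinct scores, `threshold=90` gives 230 breakpoints and therefore 231 chunks, which is consistent with the chunk count this notebook reports for its 257 sentences. If the goal is to split only at the sharpest dips, a low percentile such as 10 may be closer to the description above; that is a suggestion, not something the notebook evaluates.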

## Splitting the Text into Semantic Chunks
We split the text into chunks at the computed breakpoints.

In [9]:
def split_into_chunks(sentences, breakpoints):
    """
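    Split sentences into semantic chunks.

    Args:
    sentences (List[str]): List of sentences.
    breakpoints (List[int]): Indices after which to split.

    Returns:
    List[str]: List of text chunks.
    """
    chunks = []
    start = 0  # Start index of the current chunk

    # Iterate over each breakpoint
    for bp in breakpoints:
        # Append the sentences from the start index through the current breakpoint as one chunk
        chunks.append(". ".join(sentences[start:bp + 1]) + ".")
        start = bp + 1  # Move the start index to the sentence after the breakpoint

    # Append the remaining sentences as the final chunk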
    chunks.append(". ".join(sentences[start:]))
    return chunks

text_chunks = split_into_chunks(sentences, breakpoints)

# Print the number of chunks created
print(f"Number of semantic chunks: {len(text_chunks)}")

# Print the first chunk to verify the result
print("\nFirst text chunk:")
print(text_chunks[0])


Number of semantic chunks: 231

First text chunk:
Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings.
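
The index arithmetic is easy to get wrong by one, so here is a small sketch of how breakpoints map to groups of sentences. The helper `group_by_breakpoints` and the labels in `toy_sentences` are hypothetical; a breakpoint `i` means "end the current chunk after sentence `i`", as in `split_into_chunks`.

```python
def group_by_breakpoints(items, breakpoints):
    """Group items so that each breakpoint index ends a group (hypothetical helper)."""
    groups, start = [], 0
    for bp in sorted(breakpoints):
        # Slice through bp inclusive, then continue with the next item.
        groups.append(items[start:bp + 1])
        start = bp + 1
    # Whatever follows the last breakpoint forms the final group.
    if start < len(items):
        groups.append(items[start:])
    return groups

toy_sentences = ["A1", "A2", "B1", "B2", "B3", "C1"]
toy_groups = group_by_breakpoints(toy_sentences, [1, 4])
print(toy_groups)
```

Breakpoints `[1, 4]` produce three groups: sentences 0 to 1, 2 to 4, and the remainder.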


## Creating Embeddings for the Semantic Chunks
We create an embedding for each chunk for later retrieval.

In [10]:
def create_embeddings(text_chunks: list[str]):
    """
    Create an embedding for each text chunk.

    Args:
    text_chunks (List[str]): List of text chunks.

    Returns:
    List[np.ndarray]: List of embedding vectors.
    """
    return [get_embedding(chunk) for chunk in text_chunks]

chunk_embeddings = create_embeddings(text_chunks)

## Performing Semantic Search
We use cosine similarity to retrieve the most relevant chunks.

In [11]:
def semantic_search(query, text_chunks, chunk_embeddings, k=5):
    """
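    Find the text chunks most relevant to the query.

    Args:
    query (str): The search query.
    text_chunks (List[str]): List of text chunks.
    chunk_embeddings (List[np.ndarray]): List of chunk embedding vectors.
    k (int): Number of top results to return.

    Returns:
    List[str]: The top-k most relevant text chunks.
    """
    # Create the embedding for the query
    query_embedding = get_embedding(query)

    # Compute the cosine similarity between the query and each chunk
    similarities = [cosine_similarity(query_embedding, emb) for emb in chunk_embeddings]

    # Get the indices of the k most similar chunks:
    # argsort sorts in ascending order, so take the last k entries
    # and reverse them to get descending order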
    top_indices = np.argsort(similarities)[-k:][::-1]

    return [text_chunks[i] for i in top_indices]
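
The ranking step can be checked without calling the embedding API. The sketch below runs the same `argsort` logic on hand-made 2-D vectors; `top_k_indices`, `toy_chunks` and `toy_query` are hypothetical names, and the similarities are computed in one matrix operation rather than in a loop.

```python
import numpy as np

def top_k_indices(query_vec, chunk_matrix, k=2):
    """Return (indices of the k most similar rows, best first; all similarities)."""
    # Cosine similarity of the query against every row at once.
    sims = chunk_matrix @ query_vec / (
        np.linalg.norm(chunk_matrix, axis=1) * np.linalg.norm(query_vec)
    )
    # argsort is ascending, so take the last k entries and reverse them.
    return np.argsort(sims)[-k:][::-1], sims

toy_chunks = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([0.8, 0.6])
indices, sims = top_k_indices(toy_query, toy_chunks, k=2)
print(indices, np.round(sims, 2))
```

Row 1 points in nearly the same direction as the query and ranks first, followed by row 0; the row pointing away from the query has a negative score and ranks last.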

In [12]:
# Load the validation dataset
with open('./data/val.json') as f:
    data = json.load(f)

# Use the first question
query = data[0]['question']

top_chunks = semantic_search(query, text_chunks, chunk_embeddings, k=2)

# Print the query
print(f"Query: {query}")

# Print the top 2 most relevant text chunks
for i, chunk in enumerate(top_chunks):
    print(f"Context {i+1}:\n{chunk}\n{'='*40}")

Query: What is 'Explainable AI' and why is it considered important?
Context 1:

Explainable AI (XAI) 
Explainable AI (XAI) aims to make AI systems more transparent and understandable. Research in 
XAI focuses on developing methods for explaining AI decisions, enhancing trust, and improving 
accountability.
Context 2:

Transparency and Explainability 
Transparency and explainability are essential for building trust in AI systems. Explainable AI (XAI) 
techniques aim to make AI decisions more understandable, enabling users to assess their 
fairness and accuracy.


## Generating a Response from the Retrieved Chunks

In [13]:
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"

def generate_response(system_prompt, user_message, model="qwen2.5-7b-instruct-1m"):
    """
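    Generate a response from the AI model based on the system prompt and user message.

    Args:
    system_prompt (str): The system prompt.
    user_message (str): The user's message or query.
    model (str): The LLM to use; defaults to qwen2.5-7b-instruct-1m.

    Returns:
    ChatCompletion: The model's response object.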
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response

# Build the user prompt from the most relevant chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"

ai_response = generate_response(system_prompt, user_prompt)

## Evaluating the AI Response
We compare the AI response with the expected answer and assign a score.

In [14]:
# System prompt for the evaluation
evaluate_system_prompt = "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5."

# Build the evaluation prompt by combining the query, the AI response, the true response, and the system prompt
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"

evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)

print(evaluation_response.choices[0].message.content)

Score: 1

Explanation: The AI response accurately captures the essence of Explainable AI (XAI), mentioning its goals of transparency and understanding, as well as its importance for trust, accountability, and fairness. The response closely aligns with the true response without any significant discrepancies, making it very close in meaning and scope. Therefore, a score of 1 is appropriate.
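
To aggregate results over many questions, the free-text verdict has to be turned into a number. The following is a minimal sketch that assumes the evaluator replies in the `Score: <value>` form shown above. The helper `parse_score` is hypothetical, and a model may not always follow that format, in which case the function returns `None`.

```python
import re

def parse_score(evaluation_text: str):
    """Extract a 0, 0.5 or 1 score from text such as 'Score: 1' (hypothetical helper)."""
    match = re.search(r"score\s*[:=]\s*(\d+(?:\.\d+)?)", evaluation_text, re.IGNORECASE)
    if not match:
        return None
    value = float(match.group(1))
    # Accept only the three values allowed by the evaluation prompt.
    return value if value in (0.0, 0.5, 1.0) else None

print(parse_score("Score: 1\n\nExplanation: The AI response accurately captures ..."))
print(parse_score("Score: 0.5"))
print(parse_score("The response looks fine."))
```

Returning `None` instead of guessing keeps unparseable verdicts visible, so they can be reviewed by hand rather than silently counted as zero.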
