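## Context-Enriched Retrieval in RAG
Retrieval-Augmented Generation (RAG) enhances AI responses by retrieving relevant knowledge from external sources. Traditional retrieval methods return isolated text chunks, which can lead to incomplete answers.

To address this, we introduce context-enriched retrieval, which returns each matching chunk together with its neighboring chunks to improve coherence.

Steps in this notebook:
- Data ingestion: extract text from a PDF.
- Chunking with overlap: split the text into overlapping chunks to preserve context.
- Embedding creation: convert the text chunks into numerical representations.
- Context-aware retrieval: retrieve the relevant chunk and its neighbors for better completeness.
- Response generation: use a language model to generate a response from the retrieved context.
- Evaluation: assess the accuracy of the model's response.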

## Setting Up the Environment
We begin by importing necessary libraries.

In [1]:
# The fitz module is provided by the pymupdf package
%pip install --quiet --force-reinstall pymupdf


In [2]:
import fitz
import os
import numpy as np
import json
from openai import OpenAI

## Extracting Text from a PDF File

In [3]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a PDF file.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: Extracted text from the PDF.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the page
        text = page.get_text("text")  # Extract text from the page
        all_text += text  # Append the extracted text to the all_text string

    return all_text  # Return the extracted text

## Chunking the Extracted Text
After extracting the text, we split it into smaller, overlapping segments to improve retrieval accuracy.

In [4]:
def chunk_text(text, n, overlap):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
    text (str): The text to be chunked.
    n (int): The number of characters in each chunk.
    overlap (int): The number of overlapping characters between chunks.

    Returns:
    List[str]: A list of text chunks.
    """
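    # The step (n - overlap) must be positive: a step of 0 makes range() raise,
    # and a negative step silently yields no chunks
    if not 0 <= overlap < n:
        raise ValueError("overlap must satisfy 0 <= overlap < n")

    chunks = []  # Initialize an empty list to store the chunks

    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), n - overlap):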
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks
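
To make the overlap concrete, here is a small sketch that applies the same sliding-window logic to a 10-character toy string. The sample string and the sizes are illustrative assumptions, not values from the notebook.

```python
def chunk_text(text, n, overlap):
    """Same sliding-window logic as the function above, written compactly."""
    return [text[i:i + n] for i in range(0, len(text), n - overlap)]

sample = "abcdefghij"  # toy input, 10 characters
chunks = chunk_text(sample, n=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each chunk starts `n - overlap` characters after the previous one, so consecutive chunks share `overlap` characters. The final chunk can be shorter than `n`. With `n=1000` and `overlap=200`, the window advances 800 characters per chunk.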

## OpenAI Client

In [6]:
# Colab environment
from google.colab import userdata
# Use Volcano Engine (Volcengine) as the provider
api_key = userdata.get("ARK_API_KEY")
base_url = userdata.get("ARK_BASE_URL")

In [7]:
# Initialize the OpenAI client with the base URL and API key
client = OpenAI(
    base_url=base_url,
    api_key=api_key
)
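
`google.colab.userdata` is only available inside Colab. Outside Colab, one option is to read the same two values from environment variables. The sketch below assumes the variables are named `ARK_API_KEY` and `ARK_BASE_URL` to mirror the Colab secrets; `load_credentials` is a hypothetical helper, and the demo values are placeholders.

```python
import os

def load_credentials(env=None):
    """Return (api_key, base_url) read from a mapping, defaulting to os.environ.

    Hypothetical helper: the variable names mirror the Colab secret names.
    """
    env = os.environ if env is None else env
    api_key = env.get("ARK_API_KEY")
    base_url = env.get("ARK_BASE_URL")
    if not api_key or not base_url:
        raise RuntimeError("Set ARK_API_KEY and ARK_BASE_URL before creating the client")
    return api_key, base_url

# Demo with an explicit mapping (placeholder values) instead of the real environment
api_key, base_url = load_credentials(
    {"ARK_API_KEY": "sk-demo", "ARK_BASE_URL": "https://example.invalid/v1"}
)
print(base_url)
```

The returned pair can then be passed to `OpenAI(base_url=..., api_key=...)` exactly as in the cell above.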

## Extracting and Chunking Text from a PDF File

In [8]:
# Define the path to the PDF file
pdf_path = "./data/AI_Information.pdf"

# Extract text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)

# Chunk the extracted text into segments of 1000 characters with an overlap of 200 characters
text_chunks = chunk_text(extracted_text, 1000, 200)

# Print the number of text chunks created
print("Number of text chunks:", len(text_chunks))

# Print the first text chunk
print("\nFirst text chunk:")
print(text_chunks[0])

Number of text chunks: 42

First text chunk:
Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past few decades, advancements in computing power and data availability 
have significantly accelerated the development and deployment of AI. 
Historical Context 
The idea of artificial intelligence has existed for centuries, often depicted in myths and fiction. 
However, the formal field of AI research began in the mid-20th century. The Dartmouth Workshop 
in 1956 is widely considered the birthplace of AI. Early AI research focused on problem-solving 
and 

## Creating Text Embeddings

In [9]:
from tqdm import tqdm

def create_embeddings(texts, model="doubao-embedding-text-240715"):
    """
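    Generates an embedding vector for each text in a list.

    Args:
    texts (List[str]): The input texts.
    model (str): The embedding model. Default is "doubao-embedding-text-240715".

    Returns:
    List[np.ndarray]: One embedding vector per input text.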
    """
    response = client.embeddings.create(model=model, input=texts)
    return [np.array(embedding.embedding) for embedding in response.data]

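def batch_read(lst, batch_size=10):
    """Yields consecutive slices of lst with at most batch_size items each."""
    for i in range(0, len(lst), batch_size):
        yield lst[i : i + batch_size]

# Embed the chunks batch by batch to keep each request small
embeddings = []
for batch in tqdm(batch_read(text_chunks), desc="batch"):
    embeddings.extend(create_embeddings(batch))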

batch: 5it [00:07,  1.47s/it]
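
The progress bar reports 5 iterations because the 42 chunks are split into batches of at most 10. A minimal sketch of the same batching logic, with a list of integers standing in for the text chunks:

```python
def batch_read(lst, batch_size=10):
    """Yield consecutive slices of lst with at most batch_size items each."""
    for i in range(0, len(lst), batch_size):
        yield lst[i:i + batch_size]

items = list(range(42))  # stand-in for the 42 text chunks
batch_sizes = [len(batch) for batch in batch_read(items)]
print(batch_sizes)  # [10, 10, 10, 10, 2]
```

Because `batch_read` is a generator, `tqdm` cannot infer the number of batches on its own; passing `total=` to `tqdm` would let it show a percentage and an estimated time remaining.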


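## Implementing Context-Aware Semantic Search
We modify retrieval so that the best-matching chunk is returned together with its neighboring chunks, which gives the model more context.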

In [10]:
def cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): The first vector.
    vec2 (np.ndarray): The second vector.

    Returns:
    float: The cosine similarity between the two vectors.
    """
    # Compute the dot product of the two vectors and divide by the product of their norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
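
As a quick sanity check on what the score means, the sketch below mirrors the function in plain Python (no NumPy, purely to keep the example dependency-free) and evaluates three toy vector pairs. The vectors are illustrative assumptions.

```python
import math

def cosine_similarity(vec1, vec2):
    """Cosine similarity for plain Python sequences (mirror of the NumPy version above)."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
opposite = cosine_similarity([1.0, 0.0], [-2.0, 0.0])       # opposite directions
print(same_direction, orthogonal, opposite)
```

The three results are approximately 1, 0, and -1: the score depends only on the angle between the vectors, not on their lengths. The score is undefined when either vector has zero length, since the denominator is then zero.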

In [11]:
def context_enriched_search(query, text_chunks, embeddings, k=1, context_size=1):
    """
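    Retrieves the most relevant chunks together with their neighboring chunks.

    Args:
    query (str): The search query.
    text_chunks (List[str]): The list of text chunks.
    embeddings (List[np.ndarray]): The embeddings of the text chunks.
    k (int): The number of top-matching chunks to retrieve.
    context_size (int): The number of neighboring chunks to include on each side of a match.

    Returns:
    List[str]: The relevant chunks with their surrounding context, in document order.
    """
    query_embedding = create_embeddings([query])[0]
    similarity_scores = []

    # Compute the similarity between the query and each chunk embedding
    for i, chunk_embedding in enumerate(embeddings):
        similarity_score = cosine_similarity(query_embedding, chunk_embedding)
        similarity_scores.append((i, similarity_score))

    # Sort by similarity score in descending order (most similar first)
    similarity_scores.sort(key=lambda x: x[1], reverse=True)

    # Collect the indices of the top-k chunks plus their neighbors,
    # clamped to the bounds of text_chunks and de-duplicated.
    # (The original code ignored k and always used only the single best match;
    # with k=1 the result is unchanged.)
    selected = set()
    for top_index, _ in similarity_scores[:k]:
        start = max(0, top_index - context_size)
        end = min(len(text_chunks), top_index + context_size + 1)
        selected.update(range(start, end))

    return [text_chunks[i] for i in sorted(selected)]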
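
The boundary handling is the part most easily gotten wrong, so here is a small sketch that isolates it. `neighbor_window` is a hypothetical helper that reproduces only the `start`/`end` clamping from the function above; no embeddings or API calls are involved.

```python
def neighbor_window(top_index, num_chunks, context_size=1):
    """Indices of the best chunk plus context_size neighbors per side, clamped to valid bounds."""
    start = max(0, top_index - context_size)
    end = min(num_chunks, top_index + context_size + 1)
    return list(range(start, end))

print(neighbor_window(0, 5))  # best match is the first chunk  -> [0, 1]
print(neighbor_window(2, 5))  # best match is in the middle    -> [1, 2, 3]
print(neighbor_window(4, 5))  # best match is the last chunk   -> [3, 4]
```

A match in the middle of the document yields `2 * context_size + 1` chunks, while a match at either end yields fewer because there is no neighbor on one side.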

## Running a Query with Context Retrieval
We now test context-enriched retrieval.

In [12]:
# Load the validation dataset from a JSON file
with open('./data/val.json') as f:
    data = json.load(f)

# Extract the first question from the dataset to use as our query
query = data[0]['question']

# Retrieve the most relevant chunk and its neighboring chunks for context
top_chunks = context_enriched_search(query, text_chunks, embeddings, k=1, context_size=1)

# Print the query for reference
print("Query:", query)
# Print each retrieved chunk with a heading and separator
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")

Query: What is 'Explainable AI' and why is it considered important?
Context 1:
inability 
Many AI systems, particularly deep learning models, are "black boxes," making it difficult to 
understand how they arrive at their decisions. Enhancing transparency and explainability is 
crucial for building trust and accountability. 
 
 
Privacy and Security 
AI systems often rely on large amounts of data, raising concerns about privacy and data security. 
Protecting sensitive information and ensuring responsible data handling are essential. 
Job Displacement 
The automation capabilities of AI have raised concerns about job displacement, particularly in 
industries with repetitive or routine tasks. Addressing the potential economic and social impacts 
of AI-driven automation is a key challenge. 
Autonomy and Control 
As AI systems become more autonomous, questions arise about control, accountability, and the 
potential for unintended consequences. Establishing clear guidelines and ethical framew

## Generating a Response Using Retrieved Context
We now use a large language model to generate a response.

In [13]:
# Define the system prompt for the AI assistant
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"

def generate_response(system_prompt, user_message, model="doubao-lite-128k-240828"):
    """
    Generates a response from the AI model based on the system prompt and user message.

    Args:
    system_prompt (str): The system prompt to guide the AI's behavior.
    user_message (str): The user's message or query.
    model (str): The model to be used for generating the response. Default is "doubao-lite-128k-240828".

    Returns:
    ChatCompletion: The response object from the API; the generated text is at response.choices[0].message.content.
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response

# Create the user prompt based on the top chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"

# Generate AI response
ai_response = generate_response(system_prompt, user_prompt)
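
To see what the model actually receives, the sketch below rebuilds the user prompt from two placeholder chunks. `build_user_prompt` is a hypothetical helper that wraps the same formatting as the cell above; the chunk texts and the question are made up for illustration.

```python
def build_user_prompt(chunks, question):
    """Format retrieved chunks and the question the same way as the cell above."""
    separator = "====================================="
    context = "\n".join(
        f"Context {i + 1}:\n{chunk}\n{separator}\n" for i, chunk in enumerate(chunks)
    )
    return f"{context}\nQuestion: {question}"

demo_prompt = build_user_prompt(["first chunk", "second chunk"], "What is XAI?")
print(demo_prompt)
```

Each chunk is labeled and followed by a separator line, and the question comes last, so the model reads all of the retrieved context before the question it has to answer.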

## Evaluating the AI Response
We compare the AI response with the expected answer and assign a score.

In [14]:
# Define the system prompt for the evaluation system
evaluate_system_prompt = "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5."

# Create the evaluation prompt by combining the user query, AI response, true response, and evaluation system prompt
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"

# Generate the evaluation response using the evaluation system prompt and evaluation prompt
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt, "deepseek-v3-250324")

# Print the evaluation response
print(evaluation_response.choices[0].message.content)

The AI assistant's response is very close to the true response, covering all the key points about Explainable AI (XAI) and its importance. Both responses emphasize transparency, understandability, trust, and accountability. The AI assistant's response is slightly more detailed, mentioning "black boxes," which adds value but does not detract from the alignment with the true response.

Score: 1
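
The evaluator returns free text, so aggregating scores over several questions requires extracting the number first. The sketch below is one possible approach, assuming the evaluator ends its answer with a line of the form `Score: <number>` as in the output above. `parse_score` is a hypothetical helper, and the model is not guaranteed to follow this format.

```python
import re

def parse_score(evaluation_text):
    """Return the last 'Score: <number>' value in the text as a float, or None if absent."""
    matches = re.findall(r"Score:\s*([01](?:\.\d+)?)", evaluation_text)
    return float(matches[-1]) if matches else None

sample = "The response covers all the key points.\n\nScore: 1"
print(parse_score(sample))  # 1.0
```

Returning `None` when no score is found keeps malformed evaluations visible instead of silently counting them as 0.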
