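# Introduction to Simple RAG

Retrieval-Augmented Generation (RAG) is a hybrid approach that combines information retrieval with a generative model. It grounds the language model's output in external knowledge, which improves accuracy and factual correctness.

A simple RAG setup follows these steps:

1. **Data ingestion**: Load and preprocess the text data.
2. **Chunking**: Split the data into smaller chunks to improve retrieval performance.
3. **Embedding creation**: Convert the text chunks into numerical representations with an embedding model.
4. **Semantic search**: Retrieve the chunks most relevant to the user's query.
5. **Response generation**: Use a language model to generate a response from the retrieved text.

This notebook implements the simple RAG approach, evaluates the model's response, and explores various ways to improve it.

## Environment Setup
Install and import the necessary libraries.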

In [1]:
# The fitz module is provided by the pymupdf package
%pip install --quiet --force-reinstall pymupdf

In [2]:
import fitz
import os
import numpy as np
import json
from openai import OpenAI

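## Extracting Text from a PDF File

To implement RAG, we first need a source of text data. In this example, we use the PyMuPDF library (also known as `fitz`) to extract text from a PDF file.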

In [3]:
def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extract the text of every page in a PDF file.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: The text extracted from the PDF.
    """
    all_text = ""  # Accumulates the text extracted from each page

    # Open the PDF file; the context manager closes it when done
    with fitz.open(pdf_path) as mypdf:
        # Iterate over every page in the PDF
        for page_num in range(mypdf.page_count):
            page = mypdf[page_num]  # Get the page
            text = page.get_text("text")  # Extract the page's plain text
            all_text += text  # Append it to `all_text`

    return all_text  # Return the extracted text

## Chunking the Extracted Text
After extracting the text, we split it into smaller, overlapping segments to improve retrieval accuracy.

In [4]:
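def chunk_text(text: str, n: int, overlap: int):
    """
    Split the given text into segments of n characters with overlap.

    Args:
    text (str): The text to be chunked.
    n (int): Number of characters per chunk.
    overlap (int): Number of characters shared by consecutive chunks; must be smaller than n.

    Returns:
    List[str]: The list of text chunks.
    """
    if overlap >= n:
        # A step of zero or less would make range() fail or return no chunks
        raise ValueError("overlap must be smaller than n")

    chunks = []  # Holds the resulting chunks

    # Walk through the text with a step of (n - overlap)
    for i in range(0, len(text), n - overlap):
        # Append the slice from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks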
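
To make the overlap concrete, here is a minimal sketch that applies the same sliding-window idea to a 12-character string. The helper name `chunk_text_demo` and the sample string are illustrative only; the slicing logic mirrors the function above.

```python
def chunk_text_demo(text: str, n: int, overlap: int) -> list:
    """Split text into n-character chunks; neighbours share `overlap` characters."""
    step = n - overlap  # how far the window advances each time
    return [text[i:i + n] for i in range(0, len(text), step)]

sample = "abcdefghijkl"  # 12 characters
chunks = chunk_text_demo(sample, n=5, overlap=2)
print(chunks)  # ['abcde', 'defgh', 'ghijk', 'jkl']
```

Each chunk starts 3 characters after the previous one (5 - 2), so the last 2 characters of one chunk reappear at the start of the next. The final chunk can be shorter than `n`.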

## Setting Up the OpenAI API Client
We initialize the OpenAI client, which is used to generate both embeddings and responses.

In [8]:
# Colab environment: read credentials from Colab's user secrets
from google.colab import userdata
# Use the Alibaba Cloud Bailian platform (DashScope)
api_key = userdata.get("DASHSCOPE_API_KEY")
base_url = userdata.get("DASHSCOPE_BASE_URL")

In [13]:
client = OpenAI(
    base_url=base_url,
    api_key=api_key
)

## Extracting and Chunking Text from the PDF File
Now we load the PDF file, extract its text, and split the text into chunks.

In [7]:
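# Path to the PDF file (it must be uploaded first when running on Colab)
pdf_path = "./data/AI_Information.pdf"

# Extract the text
extracted_text = extract_text_from_pdf(pdf_path)

# Split the extracted text into 1000-character chunks with a 200-character overlap
text_chunks = chunk_text(extracted_text, 1000, 200)

# Print the number of chunks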
print("Number of text chunks:", len(text_chunks))

print("\nFirst text chunk:")
print(text_chunks[0])

Number of text chunks: 42

First text chunk:
Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past few decades, advancements in computing power and data availability 
have significantly accelerated the development and deployment of AI. 
Historical Context 
The idea of artificial intelligence has existed for centuries, often depicted in myths and fiction. 
However, the formal field of AI research began in the mid-20th century. The Dartmouth Workshop 
in 1956 is widely considered the birthplace of AI. Early AI research focused on problem-solving 
and 

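## Creating Embeddings for the Text Chunks
Embeddings convert text into numerical vectors, which makes efficient similarity search possible.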

In [31]:
def create_embeddings(text, model="text-embedding-v2"):
    """
    Create embeddings for the given text using the specified embedding model.

    Args:
    text (str | List[str]): The text, or list of texts, to embed.
    model (str): The embedding model; defaults to DashScope's `text-embedding-v2`.

    Returns:
    The embeddings response returned by the API; `response.data` holds one embedding per input.
    """
    response = client.embeddings.create(
        model=model,
        input=text,
    )

    return response

def batch_read(lst, batch_size=10):
    """Yield successive slices of `lst` containing at most `batch_size` items."""
    for i in range(0, len(lst), batch_size):
        yield lst[i:i + batch_size]

# Embed the chunks batch by batch, accumulating every embedding in `response.data`
response = None

for batch in batch_read(text_chunks):
    if response is None:
        response = create_embeddings(batch)
    else:
        response.data.extend(create_embeddings(batch).data)
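
The batching pattern can be tried out without calling any API. The sketch below is only an illustration: `fake_embed` is a hypothetical stand-in for the embeddings endpoint, and it collects the vectors in a plain list rather than extending the first response object, which is one possible alternative to the loop above.

```python
from typing import Iterator, List

def batch_read(items: list, batch_size: int = 10) -> Iterator[list]:
    """Yield successive slices of at most `batch_size` items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def fake_embed(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for the embeddings API: one 2-d vector per text."""
    return [[float(len(t)), float(t.count(" "))] for t in batch]

texts = [f"chunk number {i}" for i in range(23)]

vectors: List[List[float]] = []
for batch in batch_read(texts, batch_size=10):
    vectors.extend(fake_embed(batch))  # keep vectors in the same order as texts

batch_sizes = [len(b) for b in batch_read(texts, batch_size=10)]
print(batch_sizes)   # [10, 10, 3]
print(len(vectors))  # 23
```

Because batches are processed in order, `vectors[i]` still corresponds to `texts[i]`, which is the property the semantic search step relies on.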

## Performing Semantic Search
We implement cosine similarity to find the text chunks most relevant to the user's query.

In [26]:
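def cosine_similarity(vec1, vec2):
    """
    Compute the cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): The first vector.
    vec2 (np.ndarray): The second vector.

    Returns:
    float: The cosine similarity between the two vectors.
    """
    # Dot product of the two vectors divided by the product of their norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))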
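
As a quick sanity check, cosine similarity depends only on the angle between two vectors, not on their lengths. The toy vectors below are made up for illustration; the function body is the same formula as above.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    # Dot product divided by the product of the two norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])    # same direction as a, twice as long
c = np.array([0.0, 3.0])    # orthogonal to a
d = np.array([-1.0, 0.0])   # opposite direction to a

same = float(cosine_similarity(a, b))
orthogonal = float(cosine_similarity(a, c))
opposite = float(cosine_similarity(a, d))
print(same, orthogonal, opposite)  # 1.0 0.0 -1.0
```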

In [27]:
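def semantic_search(query, text_chunks, embeddings, k=5):
    """
    Perform semantic search over the text chunks using the given query and embeddings.

    Args:
    query (str): The query for the semantic search.
    text_chunks (List[str]): The text chunks to search.
    embeddings (List): The embedding objects for the text chunks, in the same order; each exposes its vector as `.embedding`.
    k (int): The number of most relevant chunks to return. Defaults to 5.

    Returns:
    List[str]: The k text chunks most relevant to the query.
    """
    # Create the embedding for the query
    query_embedding = create_embeddings(query).data[0].embedding
    similarity_scores = []  # Holds (chunk index, similarity) pairs

    # Compute the similarity between the query embedding and each chunk embedding
    for i, chunk_embedding in enumerate(embeddings):
        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding.embedding))
        similarity_scores.append((i, similarity_score))

    # Sort by similarity in descending order
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    # Take the indices of the k most similar chunks
    top_indices = [index for index, _ in similarity_scores[:k]]

    return [text_chunks[index] for index in top_indices]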
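
The ranking step can also be written with matrix operations instead of a Python loop. The following is a minimal sketch under the assumption that all chunk embeddings have been stacked into one NumPy array; `top_k_indices` and the toy 2-d vectors are hypothetical and not part of the notebook.

```python
import numpy as np

def top_k_indices(query_vec: np.ndarray, chunk_matrix: np.ndarray, k: int = 2):
    """Return the indices and cosine scores of the k rows most similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)                              # unit-length query
    m = chunk_matrix / np.linalg.norm(chunk_matrix, axis=1, keepdims=True)  # unit-length rows
    scores = m @ q                                                         # cosine similarity per row
    order = np.argsort(-scores)[:k]                                        # highest scores first
    return order.tolist(), scores[order].tolist()

# Toy "embeddings" for three chunks (one row per chunk)
chunk_matrix = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [1.0, 1.0]])
query_vec = np.array([2.0, 1.0])

indices, scores = top_k_indices(query_vec, chunk_matrix, k=2)
print(indices)  # [2, 0]
```

Normalizing once and using a single matrix-vector product gives the same ranking as calling `cosine_similarity` per chunk, and it tends to be faster when there are many chunks.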


## Running a Query on the Extracted Chunks

In [32]:
# Load the validation data
with open('./data/val.json') as f:
    data = json.load(f)

# Take the first query from the validation data
query = data[0]['question']

# Run semantic search to find the 2 chunks most relevant to the query
top_chunks = semantic_search(query, text_chunks, response.data, k=2)

print("Query:", query)

# Print the retrieved chunks
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")

Query: What is 'Explainable AI' and why is it considered important?
Context 1:
systems. Explainable AI (XAI) 
techniques aim to make AI decisions more understandable, enabling users to assess their 
fairness and accuracy. 
Privacy and Data Protection 
AI systems often rely on large amounts of data, raising concerns about privacy and data 
protection. Ensuring responsible data handling, implementing privacy-preserving techniques, 
and complying with data protection regulations are crucial. 
Accountability and Responsibility 
Establishing accountability and responsibility for AI systems is essential for addressing potential 
harms and ensuring ethical behavior. This includes defining roles and responsibilities for 
developers, deployers, and users of AI systems. 
Chapter 20: Building Trust in AI 
Transparency and Explainability 
Transparency and explainability are key to building trust in AI. Making AI systems understandable 
and providing insights into their decision-making processes he

## Generating a Response Based on the Retrieved Chunks

In [35]:
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"

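def generate_response(system_prompt, user_message, model="qwen2.5-7b-instruct-1m"):
    """
    Generate a response from the AI model given a system prompt and a user message.

    Args:
    system_prompt (str): The system prompt that guides the model's behavior.
    user_message (str): The user's message or query.
    model (str): The LLM to use. Defaults to qwen2.5-7b-instruct-1m.

    Returns:
    The chat completion response from the AI model; the text is in `response.choices[0].message.content`.
    """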
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response

# Build the user prompt from the most relevant chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"

ai_response = generate_response(system_prompt, user_prompt, model="qwen-turbo")
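
The prompt assembly above can be wrapped in a small helper so it is easy to inspect what the model actually receives. This is a sketch only: `build_user_prompt` and the sample chunks are hypothetical, and the layout simply follows the format used in the cell above.

```python
from typing import List

SEPARATOR = "=" * 37  # visual divider between context blocks; the length is arbitrary

def build_user_prompt(chunks: List[str], question: str) -> str:
    """Number each retrieved chunk as a context block, then append the question."""
    blocks = [
        f"Context {i}:\n{chunk}\n{SEPARATOR}\n"
        for i, chunk in enumerate(chunks, start=1)
    ]
    return "\n".join(blocks) + f"\nQuestion: {question}"

demo_prompt = build_user_prompt(
    ["XAI aims to make AI decisions understandable.", "Transparency builds trust."],
    "What is Explainable AI?",
)
print(demo_prompt)
```

Printing the prompt before sending it is a cheap way to confirm that the retrieved chunks really contain the information the question asks for.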

## Evaluating the AI Response
We compare the AI response with the expected answer and assign a score.

In [36]:
# System prompt for the evaluation
evaluate_system_prompt = "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5."

# Build the evaluation prompt by combining the query, the AI response, the true response, and the evaluation system prompt
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"

evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)

print(evaluation_response.choices[0].message.content)

Score: 1

The AI response accurately captures the essence of Explainable AI (XAI), its definition, and its importance as stated in the true response. The explanation about building trust, accountability, ensuring fairness, and aligning with ethical standards is spot-on and closely matches the true response. There are no inaccuracies or significant omissions in the AI's response.
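
If the score needs to be used programmatically, for example to average it over several validation questions, it has to be pulled out of the free-form evaluator text. The sketch below assumes the evaluator writes a line of the form `Score: <value>` as in the output above; the model is not guaranteed to follow that format, so the hypothetical helper `parse_score` returns `None` when no score is found.

```python
import re
from typing import Optional

def parse_score(evaluation_text: str) -> Optional[float]:
    """Extract a 0, 0.5, or 1 score from text containing 'Score: <value>'."""
    match = re.search(r"Score:\s*(0\.5|0|1)\b", evaluation_text)
    return float(match.group(1)) if match else None

full = parse_score("Score: 1\n\nThe AI response accurately captures the essence of XAI.")
partial = parse_score("Score: 0.5")
missing = parse_score("The response looks reasonable.")
print(full, partial, missing)  # 1.0 0.5 None
```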
