# Building a Complete RAG System with Semantic Kernel

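This comprehensive notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system with Microsoft's Semantic Kernel. We start by showing the limitations of an AI model that has no access to your specific data, then build a complete RAG system that covers chunking strategies, a vector store, and evaluation methods.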

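## Installation and Setup

In this module, we will use Semantic Kernel.

Semantic Kernel is an open-source SDK that lets you easily build AI applications. It supports C#, Python, and Java. It is production-ready and has been adopted by many large enterprises.

It is designed to be modular, so you can swap models without rewriting your whole codebase.

SDKs like Semantic Kernel are popular because an LLM on its own can only process input and generate responses. It cannot access your database, call your APIs, execute code, or interact with external systems. Semantic Kernel therefore manages the connection to AI services (such as OpenAI), provides a plugin system in which you write functions the AI can call, and manages conversation history and context.

At the core of Semantic Kernel is the kernel, which acts as an orchestrator. In an AI application you need to coordinate many moving parts: AI services, databases, APIs, logging systems, and so on. The kernel is the central coordinator for all of them. It holds services (such as AI services, logging services, and authentication services) and plugins (custom functions the AI can call, such as one that accesses your database). Consider a real enterprise scenario in which an AI assistant needs to query a CRM, check inventory levels, generate proposals, and log every interaction for compliance. Without the kernel, every piece of code would need to know how to connect to all of these services. With the kernel, everything is configured once. Because all AI operations go through the kernel, you have a single point of control for logging and governance. The kernel now also supports MCP (Model Context Protocol): the kernel can be wrapped in a network-accessible server that speaks MCP, so that other kernels (or agents) can discover and use it.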
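
To make the "configure once, use everywhere" idea concrete, here is a toy sketch in plain Python. It is not how Semantic Kernel is implemented; `MiniKernel`, `FakeChatService`, and `FakeLogger` are hypothetical names invented for this illustration. It only mirrors the pattern of registering services on a central object and looking them up by type, which this notebook does later with `kernel.add_service(...)` and `kernel.get_service(type=...)`.

```python
from typing import Any, Dict, Type, TypeVar

T = TypeVar("T")


class MiniKernel:
    """Toy stand-in for a kernel: a registry of services keyed by their type."""

    def __init__(self) -> None:
        self._services: Dict[type, Any] = {}

    def add_service(self, service: Any) -> None:
        # Register the service once; callers later ask for it by type.
        self._services[type(service)] = service

    def get_service(self, service_type: Type[T]) -> T:
        if service_type not in self._services:
            raise KeyError(f"No service of type {service_type.__name__} registered")
        return self._services[service_type]


class FakeChatService:
    """Hypothetical chat service that just echoes the prompt."""

    def complete(self, prompt: str) -> str:
        return f"(model reply to: {prompt})"


class FakeLogger:
    """Hypothetical logging service that records every message in memory."""

    def __init__(self) -> None:
        self.records: list = []

    def log(self, message: str) -> None:
        self.records.append(message)


def answer(kernel: MiniKernel, question: str) -> str:
    # Application code only needs the kernel, not the connection details.
    kernel.get_service(FakeLogger).log(f"question: {question}")
    return kernel.get_service(FakeChatService).complete(question)


mini_kernel = MiniKernel()
mini_kernel.add_service(FakeChatService())
mini_kernel.add_service(FakeLogger())
print(answer(mini_kernel, "What is our refund policy?"))
```

Because every call goes through `mini_kernel`, the logger sees each question without the calling code having to know how logging is set up.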

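Semantic Kernel components:

1.  AI service connectors: Applications today often use several AI models, each with its own API and authentication method. A connector in Semantic Kernel (SK) is an abstraction layer that guards against vendor lock-in by letting you switch between models. For example, AzureChatCompletion is one service and GoogleAIChatCompletion is another. The kernel is responsible for invoking these connectors.
2.  Vector store connectors: These are the heart of RAG. They are the bridge between a vector store and the kernel.
3.  Functions and plugins: Plugins give LLMs access to tools. A function is a single capability you expose to the LLM (for example, a Python function), while a plugin is a group of related functions (a DatabasePlugin might contain functions such as GetUser, UpdateUser, and GetOrders).
4.  Prompt templates: Writing effective prompts as multi-line strings in code can become messy and hard to maintain, and it is impractical for non-developers such as prompt engineers. A prompt template is either a text file or a string that mixes static instructions with dynamic placeholders. A static instruction could be "You are a professional financial analyst; summarize the following report", and a dynamic placeholder could be user_input or a function that fetches a stock price.
5.  Filters: Pieces of code that intercept kernel execution at key moments, such as before a function is invoked or after a prompt is rendered. You can use them to filter PII (for example, making sure a user's credit card number is never sent to an external LLM).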
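
Items 4 and 5 can be illustrated with a small plain-Python sketch. It does not use Semantic Kernel's template or filter APIs; `PROMPT_TEMPLATE`, `redact_card_numbers`, and `render_prompt` are hypothetical names, and the regular expression is a deliberately simple assumption that only catches 16-digit card numbers written in groups of four.

```python
import re

# Static instructions plus a dynamic placeholder, kept separate from application logic.
PROMPT_TEMPLATE = (
    "You are a professional financial analyst. Summarize the following report.\n\n"
    "Report: {user_input}"
)

# Simplified pattern: four groups of four digits, optionally separated by spaces or dashes.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def redact_card_numbers(text: str) -> str:
    """Filter step: mask anything that looks like a card number."""
    return CARD_PATTERN.sub("[REDACTED CARD]", text)


def render_prompt(user_input: str) -> str:
    """Fill the placeholder, then run the filter on the rendered prompt."""
    rendered = PROMPT_TEMPLATE.format(user_input=user_input)
    return redact_card_numbers(rendered)


print(render_prompt("Customer paid with card 4111 1111 1111 1111 on March 3."))
```

The filter runs after the prompt is rendered, so the card number is masked before the text could be sent to any external model.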

**First, install the required packages:**

In [None]:
# Run this cell first to install all required packages
!pip install semantic-kernel openai numpy scikit-learn faiss-cpu python-dotenv

## Environment Setup

In [None]:
from typing import List, Dict
import numpy as np
from dataclasses import dataclass
from dotenv import load_dotenv

load_dotenv()


# Semantic Kernel core imports (verified working June 2025)
import semantic_kernel as sk
from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion, OpenAITextEmbedding
from semantic_kernel.contents import ChatHistory
from semantic_kernel.connectors.ai.open_ai import OpenAIChatPromptExecutionSettings


# For vector storage - we'll build our own simple system
import faiss
from sklearn.metrics.pairwise import cosine_similarity

print("Semantic Kernel environment setup complete!")

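## Creating Our Document Model and Vector Store

We will use FAISS for vector storage. FAISS is a vector similarity search library that runs entirely on your machine. In production we would typically use a cloud vector database such as Azure AI Search. The concept is the same, though: a vector store lets us store vectors and search them by similarity.

About the code:

1.  We create a class to represent a document chunk (objects are instances of a class). Detailed comments describe what each chunk object contains.
2.  We then create a second class whose methods let us search those chunks by vector similarity rather than exact word matching.
    -   Initializer: sets up the vector store with the following attributes:
        -   Embedding dimension: how many numbers each document chunk has once it is converted to a vector. The OpenAI text embedding model we use (text-embedding-ada-002) converts any text into exactly 1536 numbers. So "Hello world" becomes [0.123, -0.456, 0.789, ... 1536 numbers in total]. Every piece of text gets exactly 1536 numbers, whether it is one word or a whole paragraph.
        -   Index: Think of an index as a table of contents: instead of scanning every row of a database to find "John Smith", the index tells you which row John is in. In a vector database, an index is a data structure that organizes vectors so that similar ones can be found quickly. Approximate indexes (such as FAISS's IVF or HNSW indexes) pre-organize the vectors so a search can skip most of them. The IndexFlatIP we initialize here is the simplest kind: it stores the vectors as-is and compares the query against every one of them, which is exact and fast enough for a small collection.
        -   Documents: the actual document chunks (original text plus metadata), stored in a regular Python list.
        -   ID to index: a dictionary that maps each document ID to its position in the list, so we can jump to a given chunk without scanning the whole list.
    -   Add-documents method: takes a batch of document chunks and stores them in the vector store. Each chunk must already have been vectorized by the embedding model; we normalize its vector and add it to the FAISS index for similarity search. We keep the original text and metadata in the documents list so they can be retrieved later.
    -   Search method: takes the user's question (already converted to a 1536-number vector) and finds the most similar document chunks.
3.  That gives us a custom-built vector store. Next we need to connect an LLM to it, which is where Semantic Kernel comes in. We use Semantic Kernel for embeddings (converting text to vectors) and chat completion (the LLM reasoning engine).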
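
The search idea described above can be sketched without FAISS. The snippet below is a plain-Python illustration, not the notebook's implementation: it uses made-up 3-dimensional vectors instead of 1536-dimensional embeddings, and `normalize`, `inner_product`, and `top_k` are hypothetical helper names. It shows why vectors are normalized first: for unit-length vectors, the inner product equals the cosine similarity.

```python
import math
from typing import List, Tuple


def normalize(vector: List[float]) -> List[float]:
    """Scale a vector to length 1 so only its direction matters."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        raise ValueError("Cannot normalize a zero vector")
    return [x / length for x in vector]


def inner_product(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: List[float], stored: List[List[float]], k: int) -> List[Tuple[int, float]]:
    """Brute-force search: score every stored vector, return the k best (index, score) pairs."""
    q = normalize(query)
    scored = [(i, inner_product(q, normalize(v))) for i, v in enumerate(stored)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up toy "embeddings" for three chunks.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [2.0, 0.0, 0.0]  # same direction as the first vector, different length

for index, score in top_k(toy_query, toy_vectors, k=2):
    print(f"vector {index}: similarity {score:.3f}")
```

The query is twice as long as the first stored vector, yet after normalization the two match exactly, because only direction is compared.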

In [None]:


# We will create a DocumentChunk class to represent each piece of a document
@dataclass
class DocumentChunk:

   # Required fields - every chunk must have these
   id: str                    # Unique name for this chunk, like "policy_doc_chunk_1"
   content: str              # The actual text content of this chunk 
   source_doc_id: str        # Which original document this came from
   title: str                # Human-readable title of the original document
   chunk_index: int          # Which piece is this? (0=first chunk, 1=second, etc.)
   
   # Optional fields - these have default values
   department: str = ""      # Which team owns this document (optional)
   doc_type: str = ""        # What kind of doc is this - policy, guide, etc. (optional)
   embedding: List[float] = None  # The vector representation (list of numbers) for this text

# This is a class with methods to search for documents. Instead of exact word matches, it finds documents with similar meanings. 
class SimpleVectorStore:
  
   
   # This is a function to initialize our vector store. We will use FAISS, where we can store and search through many document chunks quickly. 
   def __init__(self, embedding_dimension: int = 1536):
       # How many numbers are in each embedding vector? What this means is that each document chunk will be represented by a list of 1536 numbers capturing its meaning.
       # OpenAI's text-embedding-ada-002 model (if we use it) gives us 1536 numbers for each piece of text.
       self.embedding_dimension = embedding_dimension
       
       # Create a FAISS index to store our document embeddings. Index FlatIP means we will use inner product similarity (like cosine similarity) to find similar documents.
       self.index = faiss.IndexFlatIP(embedding_dimension)
       
       # Store the actual document chunks (the text and metadata) in a list called documents
       self.documents: List[DocumentChunk] = []
       
       # This dictionary maps document IDs to their position in the documents list: a quick lookup table.
       # For example, looking up "policy_doc_chunk_1" here gives its position in self.documents without scanning the list.
       self.id_to_index = {}
   
   #This function adds a batch of documents to our vector store.
   def add_documents(self, documents: List[DocumentChunk]):
      
       embeddings = []  # This is where we will store the vector representations (embeddings) of each document chunk
       
       # Process each document one by one
       for doc in documents:
           # Every document must have its embedding (vector representation) already calculated
           if doc.embedding is None:
               raise ValueError(f"Document {doc.id} missing embedding")
           
           # CRITICAL: Normalize the embedding vector
           # Why? So we can use cosine similarity (comparing angles, not lengths)
           # Think of vectors as arrows - we want to compare which direction they point
           embedding_array = np.array(doc.embedding)  # Convert list to numpy array for math
           normalized_embedding = embedding_array / np.linalg.norm(embedding_array)  # Make length = 1
           embeddings.append(normalized_embedding) # This is the normalized vector representation of the document chunk stored in our embeddings list
           
           # These are the metadata fields we will use to identify and retrieve the document later
           doc_index = len(self.documents)           # What position will this doc be at?
           self.documents.append(doc)                # Add document to our storage
           self.id_to_index[doc.id] = doc_index     # Remember: this ID is at this position
       
       # Add all the normalized vectors to FAISS for lightning-fast search
       embeddings_array = np.array(embeddings).astype('float32')  # FAISS requires float32 type
       self.index.add(embeddings_array)
       
       print(f"Added {len(documents)} documents to vector store")
   
   # This function searches for documents similar to a user's question. 
   def search(self, query_embedding: List[float], k: int = 3, score_threshold: float = 0.0):
       """
       Find documents most similar to a query
       
       How it works:
       1. Take the query's embedding (vector representation)
       2. Compare it to all stored document embeddings
       3. Return the k most similar ones above the threshold
       
       Args:
           query_embedding: The vector representation of user's question
           k: How many results to return (top 3, top 5, etc.)
           score_threshold: Only return docs with similarity above this score
       
       Returns:
           List of (document, similarity_score) pairs, sorted by similarity
       """
       # Edge case: if no documents stored, return empty
       if self.index.ntotal == 0:
           return []
       
       # Normalize the query embedding just like we did for stored documents
       # This ensures fair comparison - we're comparing directions, not magnitudes
       query_array = np.array(query_embedding)
       normalized_query = query_array / np.linalg.norm(query_array)
       
       # Ask FAISS to find the k most similar vectors
       # reshape(1, -1) because FAISS expects 2D array (rows=queries, cols=dimensions)
       # Even though we only have 1 query, we need to format it as [[1, 2, 3, ...]]
       scores, indices = self.index.search(normalized_query.reshape(1, -1).astype('float32'), k)
       
       # Convert FAISS results into our format
       results = []
       for score, idx in zip(scores[0], indices[0]):  # [0] because we only sent 1 query
           # FAISS returns -1 if it runs out of documents before reaching k
           # Also filter by minimum similarity score
           if idx >= 0 and score >= score_threshold:
               # Use the index to get the actual document from our storage
               document = self.documents[idx]
               results.append((document, float(score)))
       
       return results

# Set up the AI services we'll use
kernel = Kernel()  # Semantic Kernel is our AI orchestration framework

# Service 1: Chat completion (generates responses to questions)
chat_service = OpenAIChatCompletion(
   ai_model_id="gpt-3.5-turbo"  # Which OpenAI model to use for chat
)
kernel.add_service(chat_service)  # Register this service with the kernel. This allows us to use the chat service in our semantic kernel for generating responses to user queries.

# Service 2: Text embedding (converts text into vector representations)
embedding_service = OpenAITextEmbedding(
   ai_model_id="text-embedding-ada-002"  # OpenAI's embedding model
)
kernel.add_service(embedding_service)  # Register this service with the kernel

#Now our kernel has both chat and embedding services ready to use. So we can ask questions and get answers, as well as convert text into vectors for similarity search.
# We also set up a simple vector store using FAISS to store and search through document chunks based on their meanings.

print("Semantic Kernel initialized with OpenAI services")
print("Using simple vector store with FAISS for document storage")

---

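# Part 1: Demonstrating the Problem - No Access to Private Data

Let's first see what happens when we ask an AI model about information it was never trained on.

The code below is simple: we have a set of "documents" stored in a list, and we ask one question about each of them. Because we have not yet implemented a RAG system, we should expect the model not to know what we are talking about.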

In [None]:
# Sample company data that the model wouldn't know about.
# This represents the private, "ground-truth" information.
company_documents = [
    {
        "id": "product_001",
        "title": "CloudSync Pro Enterprise Plan",
        "content": """CloudSync Pro Enterprise offers unlimited storage, advanced encryption, 
        real-time collaboration for up to 500 users, priority support, and custom integrations. 
        Pricing: $49/month per user with annual commitment. Features include: automatic backup, 
        version control, audit logs, SSO integration, and 99.9% uptime SLA.""",
        "metadata": {"department": "product", "type": "pricing"}
    },
    {
        "id": "policy_001", 
        "title": "Remote Work Policy 2024",
        "content": """Effective January 2024: All employees may work remotely up to 3 days per week. 
        Remote work requires approval from direct manager. Equipment stipend of $500 annually 
        for home office setup. Mandatory video calls for team meetings. Core hours: 10 AM - 3 PM 
        local time for collaboration.""",
        "metadata": {"department": "hr", "type": "policy"}
    },
    {
        "id": "process_001",
        "title": "Customer Refund Process",
        "content": """Step 1: Customer submits refund request through support portal. 
        Step 2: Support agent reviews within 24 hours. Step 3: For amounts under $100, 
        automatic approval. Step 4: For amounts over $100, requires manager approval. 
        Step 5: Refunds processed within 3-5 business days to original payment method. 
        Full refunds available within 30 days of purchase.""",
        "metadata": {"department": "support", "type": "process"}
    },
    {
        "id": "guide_001",
        "title": "New Employee Onboarding Checklist",
        "content": """Day 1: IT setup and system access. Day 2: Department orientation and mentor assignment. 
        Week 1: Complete mandatory training modules (security, compliance, company culture). 
        Week 2: Shadow team members and review project documentation. Month 1: Complete 
        probationary review and set 90-day goals.""",
        "metadata": {"department": "hr", "type": "guide"}
    }
]

# The questions we want to test against the model's base knowledge.
test_questions = [
    "What is the pricing for CloudSync Pro Enterprise?",
    "How many days per week can employees work remotely?",
    "What is the refund approval process for purchases over $100?",
    "What happens during the first week of employee onboarding?"
]


async def run_direct_to_model_test():
    """
    Tests questions directly against the base AI model to demonstrate
    its lack of knowledge about our private company data.
    """
    print("TESTING MODEL WITHOUT RAG - Questions about private company data:")
    print("=" * 70)

    chat_service = kernel.get_service(type=OpenAIChatCompletion)

    # Create a default settings object for OpenAI chat models.
    execution_settings = OpenAIChatPromptExecutionSettings()

    for i, question in enumerate(test_questions, 1):
        print(f"\nQuestion {i}: {question}")
        
        chat_history = ChatHistory()
        chat_history.add_user_message(question)

        # Pass the settings object into the call.
        response = await chat_service.get_chat_message_content(
            chat_history=chat_history,
            settings=execution_settings  # Execution settings are a required argument
        )
        
        print(f"Model Response: {str(response)}")
        print("-" * 50)


# NOTE: This code assumes your Kernel is initialized and the OpenAI API key 
# is configured in your environment before running.

print("✅ Starting test...")
# Run the single, simplified test function.
await run_direct_to_model_test()

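## What We Just Observed

The model will do one of the following:
1.  **Fail to answer**, because it has no access to this company's specific information
2.  **Give a generic response** that may not match your actual policies
3.  **Make assumptions** that may be wrong for your specific situation

This is exactly why we need RAG: it gives the model access to your specific data while preserving its reasoning abilities.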

---

# Part 2: Document Chunking Strategies

In [None]:
# In this section, we will explore different text chunking strategies. 
# We can either do a simple character-based split or a more semantic split that respects paragraphs and sentences. For example, on the latter, we will try to keep sentences together and avoid breaking them in the middle.

# At a high level, what this function does is take a long piece of text and break it into smaller pieces (chunks) that are easier to work with.
def simple_text_splitter(text: str, chunk_size: int = 300, overlap: int = 50) -> List[str]:
    """
    Simple character-based text splitter with overlap
    """
    chunks = [] # This will hold our text chunks
    start = 0 # Starting position in the text
    
    #Loop until we reach the end of the text.
    while start < len(text):
        # Calculate where this chunk should end
        end = min(start + chunk_size, len(text))
        
        # Try to end at a sentence boundary (but only if we're not at the very end)
        if end < len(text):
            last_period = text.rfind('.', start, end)
            if last_period > start + chunk_size // 2:
                end = last_period + 1
        
        # Extract the chunk
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        
        # Move to next position
        # FIXED: Ensure we always move forward, even with large overlap
        next_start = end - overlap
        start = max(next_start, start + 1)  # Always move at least 1 character forward
        
        # If we've reached the end, break
        if end >= len(text):
            break
    
    return chunks

import re  # used below to split paragraphs into sentences

# This function breaks a long piece of text into smaller chunks that respect paragraph and sentence boundaries. Instead of cutting at a fixed number of characters, it keeps whole sentences together, which helps preserve the meaning of the text.
def semantic_text_splitter(text: str, max_chunk_size: int = 400) -> List[str]:
    """
    Split text respecting paragraph and sentence boundaries
    """
    # Split into paragraphs on newlines (this covers both \n\n and single \n,
    # because empty pieces are discarded)
    paragraphs = [p.strip() for p in text.split('\n') if p.strip()]
    
    chunks = []
    current_chunk = ""
    
    for paragraph in paragraphs:
        # If this paragraph alone is too big, split it by sentences
        if len(paragraph) > max_chunk_size:
            # Split after sentence-ending punctuation followed by whitespace,
            # so decimals such as "99.9%" are not cut in half
            sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', paragraph) if s.strip()]
            
            for sentence in sentences:
                sentence_with_period = sentence if sentence[-1] in '.!?' else sentence + '.'
                
                # Check if adding this sentence would exceed our limit
                if current_chunk and len(current_chunk) + len(sentence_with_period) + 1 > max_chunk_size:
                    chunks.append(current_chunk.strip())
                    current_chunk = sentence_with_period
                else:
                    current_chunk += " " + sentence_with_period if current_chunk else sentence_with_period
        else:
            # Try to add the whole paragraph
            if current_chunk and len(current_chunk) + len(paragraph) + 2 > max_chunk_size:
                chunks.append(current_chunk.strip())
                current_chunk = paragraph
            else:
                current_chunk += "\n" + paragraph if current_chunk else paragraph
    
    # Don't forget the last chunk
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    
    return chunks

# Test different chunking strategies
sample_doc = company_documents[0]
print("CHUNKING STRATEGY COMPARISON:")
print("=" * 35)

print(f"Original document: {sample_doc['title']}")
print(f"Length: {len(sample_doc['content'])} characters")

# Test simple chunking with safer parameters
print("\n1. SIMPLE CHARACTER-BASED CHUNKING:")
simple_chunks = simple_text_splitter(sample_doc['content'], chunk_size=200, overlap=30)  # Reduced overlap
for i, chunk in enumerate(simple_chunks):
    print(f"Chunk {i+1} ({len(chunk)} chars): {chunk}")

# Test semantic chunking
print("\n2. SEMANTIC CHUNKING (respects paragraphs):")
semantic_chunks = semantic_text_splitter(sample_doc['content'], max_chunk_size=250)
for i, chunk in enumerate(semantic_chunks):
    print(f"Chunk {i+1} ({len(chunk)} chars): {chunk}")

print("\nTRADE-OFFS:")
print("- Simple chunking: Predictable sizes, may break mid-sentence")
print("- Semantic chunking: Preserves meaning, variable sizes")

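## Testing the Complete RAG Pipeline

We will now implement a simple RAG system that uses Semantic Kernel to handle document retrieval and answer generation.

1.  semantic_chunker: Takes a string of text and splits it into smaller strings (chunks). It first splits the text into paragraphs wherever it sees a double newline (\n\n), which keeps sentences that belong together in the same paragraph. It then walks through the paragraphs and groups them into chunks up to a maximum chunk size. A single paragraph longer than the limit is kept whole rather than cut. The sample documents contain only single newlines, so each of them ends up as one chunk. By respecting the natural breaks in the text, the function produces focused chunks of related information, which improves the signal-to-noise ratio of our data.
2.  ingest_documents_semantic: Handles the first major task of any RAG system: preparing data for the AI. It accepts a list of documents, a vector store (the custom store we built earlier), and an embedding_service. It iterates over the documents, splits each one into chunks, calls the embedding_service on each chunk, and creates a DocumentChunk object holding the vector, the original text, and metadata.
3.  ask_with_semantic_rag: The core RAG engine. It accepts the user's question, the kernel, and the vector_store (our knowledge base). It passes the question through the embedding service to get its vector, uses the vector store's search method to find the closest document chunks, then augments the prompt with those chunks and generates the answer.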
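
The retrieve and augment steps can be tried offline with a toy sketch. It assumes nothing from Semantic Kernel or OpenAI: `toy_embed` is a hypothetical stand-in that counts words instead of calling an embedding model, and the two chunks are shortened from the sample documents. The generation step is left out, because it needs a real chat model.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Dict[str, int]:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Dict[str, int], b: Dict[str, int]) -> float:
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: List[str], k: int = 1) -> List[Tuple[str, float]]:
    """RETRIEVE: rank chunks by similarity to the question."""
    q = toy_embed(question)
    ranked = sorted(((c, cosine(q, toy_embed(c))) for c in chunks),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def build_prompt(question: str, retrieved: List[Tuple[str, float]]) -> str:
    """AUGMENT: put the retrieved text in front of the question."""
    context = "\n\n---\n\n".join(chunk for chunk, _ in retrieved)
    return (
        "Answer the following question based ONLY on the context provided below.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION: {question}\n\nANSWER:"
    )


toy_chunks = [
    "CloudSync Pro Enterprise pricing: $49/month per user with annual commitment.",
    "All employees may work remotely up to 3 days per week.",
]
toy_question = "How many days per week can employees work remotely?"
print(build_prompt(toy_question, retrieve(toy_question, toy_chunks)))
```

Word counting only matches identical words, which is exactly the limitation real embeddings are meant to overcome; the flow of retrieve, then augment, is the same.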

In [None]:
# --- Helper Function for Semantic Chunking ---

def semantic_chunker(text: str, max_chunk_size: int = 300) -> List[str]:
    """
    Splits text into chunks, respecting paragraph boundaries to keep related sentences together.
    
    This is a pure utility function; it doesn't need any AI services or state.
    Its only job is to intelligently split text based on structure.
    """
    # First, split the text into paragraphs based on double newlines.
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    
    chunks = []
    current_chunk = ""
    
    # Iterate through each paragraph to build chunks up to the max size.
    for paragraph in paragraphs:
        # If adding the next paragraph would make the current chunk too large...
        if current_chunk and (len(current_chunk) + len(paragraph) + 2) > max_chunk_size:
            # ...finalize the current chunk...
            chunks.append(current_chunk)
            # ...and start a new chunk with the current paragraph.
            current_chunk = paragraph
        else:
            # Otherwise, add the paragraph to the current chunk.
            current_chunk += ("\n\n" + paragraph) if current_chunk else paragraph
            
    # Add the last remaining chunk to the list.
    if current_chunk:
        chunks.append(current_chunk)
        
    return chunks

# --- Core RAG Functions ---

async def ingest_documents_semantic(
    documents: List[Dict], 
    vector_store: SimpleVectorStore, 
    embedding_service: OpenAITextEmbedding
) -> None:
    """
    Processes and ingests documents using the semantic chunking strategy.
    """
    print(f"Ingesting {len(documents)} documents with semantic chunking...")
    all_chunks_to_add = []
    
    for doc in documents:
        # Use our standalone helper function to get semantically coherent chunks.
        text_chunks = semantic_chunker(doc["content"])
        
        for i, chunk_text in enumerate(text_chunks):
            # Skip chunks that are too short to have meaningful content.
            if len(chunk_text) < 20:
                continue

            # Generate the embedding vector for the chunk's content.
            embedding = (await embedding_service.generate_embeddings([chunk_text]))[0]
            
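            # Create the DocumentChunk object, carrying over the document's metadata.
            metadata = doc.get("metadata", {})
            chunk = DocumentChunk(
                id=f"{doc['id']}_chunk_{i}",
                content=chunk_text,
                source_doc_id=doc["id"],
                title=doc["title"],
                chunk_index=i,
                department=metadata.get("department", ""),
                doc_type=metadata.get("type", ""),
                embedding=embedding
            )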
            all_chunks_to_add.append(chunk)

    vector_store.add_documents(all_chunks_to_add)
    print(f"Added {len(all_chunks_to_add)} new chunks to the vector store.")


async def ask_with_semantic_rag(
    question: str, 
    kernel: Kernel, 
    vector_store: SimpleVectorStore
) -> str:
    """
    Asks a question using the RAG pattern with the semantically chunked documents.
    """
    # Get the necessary AI services from the kernel.
    embedding_service = kernel.get_service(type=OpenAITextEmbedding)
    chat_service = kernel.get_service(type=OpenAIChatCompletion)
    
    # 1. RETRIEVE: Convert the question to an embedding and search the vector store.
    query_embedding = (await embedding_service.generate_embeddings([question]))[0]
    search_results = vector_store.search(query_embedding, k=3, score_threshold=0.3)
    
    if not search_results:
        return "I could not find any relevant information in the documents to answer that question."
        
    # 2. AUGMENT: Build the context string from the retrieved document chunks.
    context = "\n\n---\n\n".join([result.content for result, score in search_results])
    
    # Create the final prompt that instructs the AI and provides the context.
    prompt = f"""
Answer the following question based ONLY on the context provided below.

CONTEXT:
---
{context}
---

QUESTION: {question}

ANSWER:
"""

    # 3. GENERATE: Send the augmented prompt to the chat model to get the final answer.
    chat_history = ChatHistory()
    chat_history.add_user_message(prompt)
    
    # Define execution settings for the AI call.
    settings = OpenAIChatPromptExecutionSettings(max_tokens=200, temperature=0.1)
    
    response = await chat_service.get_chat_message_content(chat_history, settings)
    
    return str(response)

# --- Main Execution Block ---

# 1. Initialize our vector store for this RAG process.
semantic_vector_store = SimpleVectorStore()

# 2. Get the embedding service from the kernel, as it's needed for ingestion.
embedding_service = kernel.get_service(type=OpenAITextEmbedding)

# 3. Call the ingestion function to process documents and populate the vector store.
await ingest_documents_semantic(company_documents, semantic_vector_store, embedding_service)

# 4. Ask a question using the populated vector store.
print("\n" + "="*50)
print("TESTING RAG SYSTEM WITH SEMANTIC CHUNKING:")
question_to_ask = "What is the pricing for CloudSync Pro Enterprise?"
answer = await ask_with_semantic_rag(question_to_ask, kernel, semantic_vector_store)

print(f"\nQ: {question_to_ask}")
print(f"A: {answer}")

---

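# Part 4: Advanced Configuration and Tuning

We test two simple ways to get better answers from our RAG system: first, changing how we ask the AI to respond (friendly vs. professional), and second, adjusting how selective we are about which document chunks to include (strict vs. loose matching).

The latter is controlled by the similarity threshold. The similarity threshold sets the bar for what counts as a "relevant" search result. It is a number, typically between 0 and 1, that determines how similar a document chunk must be to your question before we include it in the context for the answer.

When the threshold is too low: if you ask "What is our vacation policy?" with a threshold of 0.2, you might get results about the vacation policy, employee benefits, time tracking, and company holidays. All of these are HR-related, but most do not directly answer the question. The AI then has to sift through the extra context, which can dilute the quality of the final answer.

When the threshold is too high: if you ask "How do I request time off?" with a threshold of 0.7, you might get no results at all, because no chunk is worded closely enough to the question to score that high, even though your vacation policy document clearly explains the process. Users end up frustrated by a "no information found" response when the answer actually exists in your knowledge base.

Finding the sweet spot: the goal is a threshold that gives you enough relevant information without letting in noise. For most business documents a threshold between 0.3 and 0.5 is a reasonable starting point: high enough to filter out irrelevant content, yet low enough to catch relevant information that is worded differently from your question. Score ranges vary between embedding models, so check the scores your own data produces before settling on a value.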
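
To see the effect of the threshold in isolation, the following sketch filters a fixed list of scored results. The chunk titles and scores are invented for illustration, not measured from any model, and `filter_by_threshold` is a hypothetical helper.

```python
from typing import List, Tuple

# Invented (title, similarity score) pairs for the question "How do I request time off?"
scored_chunks: List[Tuple[str, float]] = [
    ("Vacation Policy", 0.62),
    ("Employee Benefits", 0.41),
    ("Time Tracking Guide", 0.33),
    ("Company Holidays", 0.27),
    ("Office Parking Rules", 0.12),
]


def filter_by_threshold(results: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Keep only the titles whose score is at or above the threshold."""
    return [title for title, score in results if score >= threshold]


for threshold in (0.2, 0.4, 0.7):
    kept = filter_by_threshold(scored_chunks, threshold)
    print(f"threshold {threshold}: {len(kept)} kept -> {kept}")
```

With these invented scores, 0.2 lets in four chunks, 0.4 keeps two, and 0.7 keeps none, which mirrors the too-low and too-high cases described above.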

In [None]:
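class OptimizedRAG:
    """Simple RAG with optimization features"""

    def __init__(self, kernel: Kernel):
        # Pull the services registered on the kernel and create a dedicated vector store
        self.kernel = kernel
        self.embedding_service = kernel.get_service(type=OpenAITextEmbedding)
        self.chat_service = kernel.get_service(type=OpenAIChatCompletion)
        self.vector_store = SimpleVectorStore()

    async def add_documents(self, documents: List[Dict]) -> None:
        """Chunk, embed, and store documents with the ingestion function defined earlier"""
        await ingest_documents_semantic(documents, self.vector_store, self.embedding_service)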
    
    async def ask_with_custom_prompt(self, question, prompt_template):
        """Ask question with a custom prompt"""
        # Search for relevant chunks
        query_embedding = await self.embedding_service.generate_embeddings([question])
        results = self.vector_store.search(query_embedding[0], k=3, score_threshold=0.3)
        
        if not results:
            return "No relevant information found."
        
        # Build context
        context = "\n".join([doc.content for doc, score in results])
        
        # Use custom prompt template
        prompt = prompt_template.format(context=context, question=question)
        
        # Generate answer
        chat_history = ChatHistory()
        chat_history.add_user_message(prompt)
        settings = OpenAIChatPromptExecutionSettings(max_tokens=200, temperature=0.1)
        
        response = await self.chat_service.get_chat_message_content(chat_history, settings)
        return str(response)
    
    async def test_thresholds(self, query, thresholds=(0.1, 0.3, 0.5, 0.7)):
        """Test different similarity thresholds"""
        query_embedding = await self.embedding_service.generate_embeddings([query])
        
        print(f"Testing thresholds for: '{query}'")
        
        for threshold in thresholds:
            results = self.vector_store.search(query_embedding[0], k=5, score_threshold=threshold)
            
            print(f"\nThreshold {threshold}: {len(results)} results")
            if results:
                scores = [score for _, score in results]
                print(f"  Score range: {min(scores):.3f} - {max(scores):.3f}")
                print(f"  Documents: {', '.join([doc.title for doc, _ in results[:2]])}")

# Create optimized RAG system
opt_rag = OptimizedRAG(kernel)
await opt_rag.add_documents(company_documents)

# Test custom prompts
customer_prompt = """You are a helpful customer service agent. 

Context: {context}

Customer question: {question}

Friendly response:"""

employee_prompt = """You are an internal HR assistant.

Context: {context}

Employee question: {question}

Professional response:"""

# Test different prompt styles
question = "What is our remote work policy?"

print("CUSTOMER SERVICE STYLE:")
customer_answer = await opt_rag.ask_with_custom_prompt(question, customer_prompt)
print(customer_answer)

print("\nHR ASSISTANT STYLE:")
hr_answer = await opt_rag.ask_with_custom_prompt(question, employee_prompt)
print(hr_answer)

# Test similarity thresholds
print("\n" + "="*40)
await opt_rag.test_thresholds("employee remote work")



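## Best Practices Summary

### Document Processing
- **Use semantic chunking** that respects paragraph and sentence boundaries
- **Optimal chunk size: 300-400 characters** for most business documents
- **Include meaningful overlap** (50-80 characters) to preserve context
- **Keep rich metadata** for filtering and source attribution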
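
A minimal sketch of what the chunk-size and overlap numbers mean in practice. It assumes a plain fixed-size sliding window with no sentence-boundary adjustment (unlike `simple_text_splitter` earlier in this notebook); `sliding_window_chunks` is a hypothetical helper.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 300, overlap: int = 50) -> List[str]:
    """Fixed-size chunks where each chunk repeats the last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample_text = "x" * 1000  # placeholder text of 1000 characters
window_chunks = sliding_window_chunks(sample_text, chunk_size=300, overlap=50)
print(len(window_chunks), [len(c) for c in window_chunks])
```

With a 1000-character text, 300-character chunks, and a 50-character overlap, the window advances 250 characters at a time and produces four chunks.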

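### Vector Search Configuration
- **Start with FAISS** for local development and small production deployments
- **Use a similarity threshold around 0.3** to balance precision and recall
- **Retrieve 3-5 documents** to provide enough context without noise
- **Normalize embeddings** for consistent similarity calculations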
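
The precision/recall trade-off mentioned above can be measured once you know which chunks are actually relevant to a question. The sketch below assumes you have such hand-labeled relevance judgments; the chunk IDs and `precision_recall` are hypothetical.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Precision: share of retrieved chunks that are relevant.
    Recall: share of relevant chunks that were retrieved."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical labels: two chunks truly answer the question.
relevant_ids = ["policy_001_chunk_0", "policy_001_chunk_1"]

# A loose threshold retrieves more chunks; a strict one retrieves fewer.
loose = ["policy_001_chunk_0", "policy_001_chunk_1", "guide_001_chunk_0", "product_001_chunk_0"]
strict = ["policy_001_chunk_0"]

print("loose :", precision_recall(loose, relevant_ids))
print("strict:", precision_recall(strict, relevant_ids))
```

In this made-up case the loose setting finds everything relevant but half of what it returns is noise, while the strict setting returns only relevant material but misses half of it.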

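### Prompt Engineering
- **Create role-specific prompts** for different user types (customers, employees, executives)
- **Include explicit instructions** for handling cases where the information is unavailable
- **Use structured templates** that separate the context from the question
- **Test prompt variations** to optimize for your specific use case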
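
As an illustration of the second and third points, here is one possible template. The wording and the names `STRUCTURED_PROMPT`, `FALLBACK_MESSAGE`, and `build_structured_prompt` are assumptions for this sketch, not prompts taken from the notebook; the template uses the same {context} and {question} placeholders as the custom prompts in Part 4.

```python
FALLBACK_MESSAGE = "I could not find that in the available documents."

# Role, rules, context, and question each get their own clearly labeled section.
STRUCTURED_PROMPT = """You are an internal HR assistant.

RULES:
- Answer using ONLY the context below.
- If the context does not contain the answer, reply exactly: "{fallback}"

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:"""


def build_structured_prompt(context: str, question: str) -> str:
    """Fill the template; the fallback sentence is fixed so it is easy to detect in logs."""
    return STRUCTURED_PROMPT.format(
        fallback=FALLBACK_MESSAGE, context=context, question=question
    )


print(build_structured_prompt(
    context="All employees may work remotely up to 3 days per week.",
    question="What is the dress code?",
))
```

Keeping the fallback sentence in one constant means the application can check for it and, for example, route the question to a human.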

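## Next Steps

1.  **Start with core functionality** - get basic RAG working with your documents
2.  **Add monitoring early** - implement logging and metrics collection
3.  **Customize for your domain** - tailor prompts and chunking to your content
4.  **Iterate based on feedback** - use real user interactions to improve the system
5.  **Plan for production** - consider scalability, monitoring, and maintenance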