# Complete RAG System Implementation with Semantic Kernel

This comprehensive notebook demonstrates how to build a Retrieval Augmented Generation (RAG) system using Microsoft's Semantic Kernel. We'll start by showing the limitations of AI models without access to specific data, then build a complete RAG system with chunking strategies, vector databases, and evaluation methods.

## Installation and Setup

In this module, we will leverage Semantic Kernel. 

Semantic Kernel is an open-source SDK for building AI applications. It supports C#, Python, and Java. It is production-ready and used by many large enterprises.

It is designed to be modular, so you can easily change models without rewriting your entire codebase. 

SDKs like Semantic Kernel became popular because LLMs by themselves can only process input and generate responses. They can’t access your database, call your APIs, execute code, or interact with external systems. Semantic Kernel fills that gap: it manages connections to AI services (such as OpenAI), provides a plugin system where you write functions that the AI can call, and manages conversation history and context.

At the heart of Semantic Kernel is the Kernel orchestrator. In AI applications, you need to coordinate multiple moving parts such as AI services, databases, APIs, and logging systems. The Kernel is the central orchestrator that holds all of these together. It contains Services (such as AI services, logging services, and authentication services) and Plugins (custom functions the AI can call, such as accessing your database). Consider a real enterprise scenario: an AI assistant needs to query the CRM, check inventory levels, generate a proposal, and log all interactions for compliance. Without a kernel, every piece of code would need to know how to connect to all these services. With a kernel, everything is configured once. Because all AI operations flow through the kernel, you have a single point of control for logging and management. The Kernel also supports MCP (Model Context Protocol): it can be wrapped in a network-aware server that speaks MCP, so that the kernel (or agent) is discoverable by others.
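
To make the orchestrator idea concrete, here is a conceptual sketch in plain Python. It is not the Semantic Kernel API: `ToyKernel`, `EchoChatService`, and `InventoryPlugin` are hypothetical names invented for illustration. It only shows the pattern of registering services and plugins once and routing every call through one object, which is also where logging can live.

```python
from typing import Callable, Dict, List


class EchoChatService:
    """Hypothetical stand-in for a chat connector; a real one would call a model API."""

    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt}"


class InventoryPlugin:
    """Hypothetical plugin: a group of related functions the AI could call."""

    def check_stock(self, sku: str) -> int:
        stock = {"A-100": 12, "B-200": 0}
        return stock.get(sku, 0)


class ToyKernel:
    """Conceptual orchestrator: holds services and plugin functions, logs every call."""

    def __init__(self) -> None:
        self.services: Dict[str, object] = {}
        self.functions: Dict[str, Callable] = {}
        self.log: List[str] = []

    def add_service(self, name: str, service: object) -> None:
        self.services[name] = service

    def add_plugin(self, plugin_name: str, plugin: object) -> None:
        # Register every public method as "<plugin>.<function>"
        for attr in dir(plugin):
            member = getattr(plugin, attr)
            if callable(member) and not attr.startswith("_"):
                self.functions[f"{plugin_name}.{attr}"] = member

    def invoke(self, function_name: str, *args):
        # Single point of control: every call is recorded here
        self.log.append(function_name)
        return self.functions[function_name](*args)


kernel_sketch = ToyKernel()
kernel_sketch.add_service("chat", EchoChatService())
kernel_sketch.add_plugin("inventory", InventoryPlugin())
units = kernel_sketch.invoke("inventory.check_stock", "A-100")
print(units, kernel_sketch.log)
```

Because callers only know the name `inventory.check_stock`, the plugin behind it could be replaced without touching the calling code.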

Semantic Kernel Components:

1. AI Service Connectors: Applications often use multiple AI models, and these models have different APIs and authentication methods. A connector in SK is an abstraction layer that prevents vendor lock-in by letting you switch between models. For instance, AzureChatCompletion is one service and GoogleAIChatCompletion is another. The kernel is responsible for calling these connectors.
2. Vector Store Connectors: This is the core of RAG. These connectors are the bridge between vector stores and the kernel.
3. Functions and Plugins: Plugins give LLMs access to tools. A function is a single capability you expose to the LLM (e.g. a Python function), and a plugin is a group of related functions (a DatabasePlugin might contain functions like GetUser, UpdateUser, and GetOrders).
4. Prompt Templates: Writing prompts as multi-line strings inside code gets messy and hard to maintain, and it makes them hard for non-developers such as prompt engineers to work with. A prompt template is either a text file or a string that mixes static instructions with dynamic placeholders. The static instruction can be “You are an expert financial analyst, summarize the following report”, and the dynamic placeholders could be user_input or a function that gets a stock price.
5. Filters: Pieces of code that intercept kernel execution at key moments, such as before a function is called or after a prompt is rendered. One use is filtering PII (for example, ensuring no user credit card number is ever sent to an external LLM).
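
The prompt template and filter ideas can be illustrated without any SDK. The sketch below is plain Python, not Semantic Kernel's template or filter API; `TEMPLATE`, `render_prompt`, and `redact_card_numbers` are hypothetical names, and the card-number pattern is deliberately simplified. It renders a template with placeholders and then masks anything that looks like a card number before the prompt would leave the application.

```python
import re

# Hypothetical template: static instructions plus {placeholders} filled at run time
TEMPLATE = (
    "You are an expert financial analyst. "
    "Summarize the following report for {audience}:\n{report}"
)

# Simplified pattern: 13 to 16 digits, optionally separated by spaces or hyphens
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def render_prompt(template: str, **values: str) -> str:
    """Fill the dynamic placeholders of a template."""
    return template.format(**values)


def redact_card_numbers(prompt: str) -> str:
    """Filter step: mask anything that looks like a card number before sending."""
    return CARD_PATTERN.sub("[REDACTED]", prompt)


rendered = render_prompt(
    TEMPLATE,
    audience="the sales team",
    report="Customer paid with card 4111 1111 1111 1111 on 3 March.",
)
safe_prompt = redact_card_numbers(rendered)
print(safe_prompt)
```

A production filter would need a more careful detector (for example, checksum validation), but the order of operations is the point: render first, then filter, then send.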


**First, install the required packages:**

In [None]:
# Run this cell first to install all required packages
!pip install semantic-kernel openai numpy scikit-learn faiss-cpu python-dotenv

## Environment Setup

In [None]:
from typing import List, Dict
import numpy as np
from dataclasses import dataclass
from dotenv import load_dotenv

load_dotenv()


# Semantic Kernel core imports (verified working June 2025)
import semantic_kernel as sk
from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion, OpenAITextEmbedding
from semantic_kernel.contents import ChatHistory
from semantic_kernel.connectors.ai.open_ai import OpenAIChatPromptExecutionSettings


# For vector storage - we'll build our own simple system
import faiss
from sklearn.metrics.pairwise import cosine_similarity

print("Semantic Kernel environment setup complete!")

## Creating Our Document Model and Vector Store

We will use FAISS as our vector store. FAISS is a vector search library that runs entirely on your machine. In production scenarios, we would typically use a cloud vector database such as Azure AI Search. The concepts are the same: a vector database stores vectors and lets us search them by similarity.

Code explanation: 

1. We will create a class (so we can create objects, i.e. instances, from it) to represent document chunks. Detailed comments in the code describe what each chunk object contains.
2. Then we will create another class whose methods let us search for these chunks using vector search (instead of exact word matches).
    - Initialization function: A function to initialize our vector store. We will initialize our vector store with these properties:
        - Embedding Dimension: How many numbers each document chunk has once it is converted to a vector. OpenAI’s text embedding model (which we will use) converts any text into exactly 1536 numbers. So “Hello world” becomes [0.123, -0.456, 0.789, …], with 1536 numbers in total. Every piece of text gets exactly 1536 numbers, whether it’s one word or a paragraph.
        - Index: Think of an index like a table of contents: instead of searching every row in a database to find “John Smith”, the index tells you which row John is in. In vector databases, an index is a data structure that organizes vectors so you can quickly find similar ones. Without an index, finding similar vectors would require comparing your query against every single stored vector, one by one. Approximate FAISS indexes pre-organize the vectors so a search can jump to the most similar ones. The flat index we initialize here (IndexFlatIP) is the simplest kind: it still compares the query against every stored vector, which is exact and fast enough for a small collection.
        - Documents: Stores the actual document chunks (original text plus metadata) in a regular Python list.
        - ID to Index: A dictionary that maps document IDs to their position in our list, so we can quickly find, say, document chunk 5 without searching through the entire list.
    - Add documents function: This takes a batch of document chunks and stores them in our vector store. We process each document (which has already been vectorized by our embedding model) and add its vector to the FAISS index for fast similarity search. We keep the original text and metadata in our documents list so we can retrieve them later.
    - Search function: This takes a user’s question (already converted to a 1536-number vector) and finds the most similar document chunks.
3. We now have our custom-built vector store. Next we need to connect an LLM to it, and this is where Semantic Kernel comes in. We will use Semantic Kernel to tap into embedding capabilities (converting text into vectors) and chat completion capabilities (the LLM reasoning engine).
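
Before the FAISS-backed version, the same ideas can be shown with toy data and no libraries. This is a minimal sketch, assuming made-up 3-dimensional vectors and hypothetical chunk IDs: it normalizes vectors so that the inner product equals cosine similarity, keeps an ID-to-position lookup, and scores the query against every stored vector.

```python
import math
from typing import Dict, List, Tuple


def normalize(vector: List[float]) -> List[float]:
    """Scale a vector to length 1 so only its direction matters."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def inner_product(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Toy 3-dimensional "embeddings" (real ones would have 1536 dimensions)
toy_chunks: Dict[str, List[float]] = {
    "pricing_chunk_0": [3.0, 0.0, 0.0],
    "policy_chunk_0": [0.0, 2.0, 0.0],
    "refund_chunk_0": [1.0, 1.0, 0.0],
}

# Parallel structures mirroring the store: vectors by position, plus an ID lookup
ids: List[str] = list(toy_chunks)
vectors: List[List[float]] = [normalize(toy_chunks[i]) for i in ids]
id_to_index: Dict[str, int] = {chunk_id: pos for pos, chunk_id in enumerate(ids)}


def brute_force_search(query: List[float], k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the top k."""
    q = normalize(query)
    scored = [(ids[pos], inner_product(q, vec)) for pos, vec in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


top = brute_force_search([10.0, 0.0, 0.0])
print(top)
print(id_to_index["refund_chunk_0"])
```

The query's length (10.0) has no effect on the ranking: after normalization only the direction of each vector is compared.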

In [None]:


# We will create a DocumentChunk class to represent each piece of a document
@dataclass
class DocumentChunk:

   # Required fields - every chunk must have these
   id: str                    # Unique name for this chunk, like "policy_doc_chunk_1"
   content: str              # The actual text content of this chunk 
   source_doc_id: str        # Which original document this came from
   title: str                # Human-readable title of the original document
   chunk_index: int          # Which piece is this? (0=first chunk, 1=second, etc.)
   
   # Optional fields - these have default values
   department: str = ""      # Which team owns this document (optional)
   doc_type: str = ""        # What kind of doc is this - policy, guide, etc. (optional)
   embedding: List[float] = None  # The vector representation (list of numbers) for this text

# This is a class with methods to search for documents. Instead of exact word matches, it finds documents with similar meanings. 
class SimpleVectorStore:
  
   
   # This is a function to initialize our vector store. We will use FAISS, where we can store and search through many document chunks quickly. 
   def __init__(self, embedding_dimension: int = 1536):
       # How many numbers are in each embedding vector? What this means is that each document chunk will be represented by a list of 1536 numbers capturing its meaning.
       # OpenAI's text-embedding-ada-002 model (if we use it) gives us 1536 numbers for each piece of text.
       self.embedding_dimension = embedding_dimension
       
       # Create a FAISS index to store our document embeddings. IndexFlatIP ranks results by inner product, which equals cosine similarity here because we normalize every vector to length 1.
       self.index = faiss.IndexFlatIP(embedding_dimension)
       
       # Store the actual document chunks (the text and metadata) in a list called documents
       self.documents: List[DocumentChunk] = []
       
       # This dictionary maps document IDs to their position in the documents list. It works as a quick lookup table:
       # if we have a document with ID "policy_doc_chunk_1", we can find its position by looking up "policy_doc_chunk_1" here.
       self.id_to_index = {}
   
   #This function adds a batch of documents to our vector store.
   def add_documents(self, documents: List[DocumentChunk]):
      
       embeddings = []  # This is where we will store the vector representations (embeddings) of each document chunk
       
       # Process each document one by one
       for doc in documents:
           # Every document must have its embedding (vector representation) already calculated
           if doc.embedding is None:
               raise ValueError(f"Document {doc.id} missing embedding")
           
           # CRITICAL: Normalize the embedding vector
           # Why? So we can use cosine similarity (comparing angles, not lengths)
           # Think of vectors as arrows - we want to compare which direction they point
           embedding_array = np.array(doc.embedding)  # Convert list to numpy array for math
           normalized_embedding = embedding_array / np.linalg.norm(embedding_array)  # Make length = 1
           embeddings.append(normalized_embedding) # This is the normalized vector representation of the document chunk stored in our embeddings list
           
           # These are the metadata fields we will use to identify and retrieve the document later
           doc_index = len(self.documents)           # What position will this doc be at?
           self.documents.append(doc)                # Add document to our storage
           self.id_to_index[doc.id] = doc_index     # Remember: this ID is at this position
       
       # Add all the normalized vectors to FAISS for lightning-fast search
       embeddings_array = np.array(embeddings).astype('float32')  # FAISS requires float32 type
       self.index.add(embeddings_array)
       
       print(f"Added {len(documents)} documents to vector store")
   
   # This function searches for documents similar to a user's question. 
   def search(self, query_embedding: List[float], k: int = 3, score_threshold: float = 0.0):
       """
       Find documents most similar to a query
       
       How it works:
       1. Take the query's embedding (vector representation)
       2. Compare it to all stored document embeddings
       3. Return the k most similar ones above the threshold
       
       Args:
           query_embedding: The vector representation of user's question
           k: How many results to return (top 3, top 5, etc.)
           score_threshold: Only return docs with similarity above this score
       
       Returns:
           List of (document, similarity_score) pairs, sorted by similarity
       """
       # Edge case: if no documents stored, return empty
       if self.index.ntotal == 0:
           return []
       
       # Normalize the query embedding just like we did for stored documents
       # This ensures fair comparison - we're comparing directions, not magnitudes
       query_array = np.array(query_embedding)
       normalized_query = query_array / np.linalg.norm(query_array)
       
       # Ask FAISS to find the k most similar vectors
       # reshape(1, -1) because FAISS expects 2D array (rows=queries, cols=dimensions)
       # Even though we only have 1 query, we need to format it as [[1, 2, 3, ...]]
       scores, indices = self.index.search(normalized_query.reshape(1, -1).astype('float32'), k)
       
       # Convert FAISS results into our format
       results = []
       for score, idx in zip(scores[0], indices[0]):  # [0] because we only sent 1 query
           # FAISS returns -1 if it runs out of documents before reaching k
           # Also filter by minimum similarity score
           if idx >= 0 and score >= score_threshold:
               # Use the index to get the actual document from our storage
               document = self.documents[idx]
               results.append((document, float(score)))
       
       return results

# Set up the AI services we'll use
kernel = Kernel()  # Semantic Kernel is our AI orchestration framework

# Service 1: Chat completion (generates responses to questions)
chat_service = OpenAIChatCompletion(
   ai_model_id="gpt-3.5-turbo"  # Which OpenAI model to use for chat
)
kernel.add_service(chat_service)  # Register this service with the kernel. This allows us to use the chat service in our semantic kernel for generating responses to user queries.

# Service 2: Text embedding (converts text into vector representations)
embedding_service = OpenAITextEmbedding(
   ai_model_id="text-embedding-ada-002"  # OpenAI's embedding model
)
kernel.add_service(embedding_service)  # Register this service with the kernel

#Now our kernel has both chat and embedding services ready to use. So we can ask questions and get answers, as well as convert text into vectors for similarity search.
# We also set up a simple vector store using FAISS to store and search through document chunks based on their meanings.

print("Semantic Kernel initialized with OpenAI services")
print("Using simple vector store with FAISS for document storage")

---

# Part 1: Demonstrating the Problem - No Access to Private Data

Let's start by showing what happens when we ask an AI model about information it wasn't trained on.

The code below is simple: we have a series of "documents" stored in a list. We will ask questions about each of these documents without implementing a RAG system yet, so we should expect the model not to know what we're talking about.



In [None]:
# Sample company data that the model wouldn't know about.
# This represents the private, "ground-truth" information.
company_documents = [
    {
        "id": "product_001",
        "title": "CloudSync Pro Enterprise Plan",
        "content": """CloudSync Pro Enterprise offers unlimited storage, advanced encryption, 
        real-time collaboration for up to 500 users, priority support, and custom integrations. 
        Pricing: $49/month per user with annual commitment. Features include: automatic backup, 
        version control, audit logs, SSO integration, and 99.9% uptime SLA.""",
        "metadata": {"department": "product", "type": "pricing"}
    },
    {
        "id": "policy_001", 
        "title": "Remote Work Policy 2024",
        "content": """Effective January 2024: All employees may work remotely up to 3 days per week. 
        Remote work requires approval from direct manager. Equipment stipend of $500 annually 
        for home office setup. Mandatory video calls for team meetings. Core hours: 10 AM - 3 PM 
        local time for collaboration.""",
        "metadata": {"department": "hr", "type": "policy"}
    },
    {
        "id": "process_001",
        "title": "Customer Refund Process",
        "content": """Step 1: Customer submits refund request through support portal. 
        Step 2: Support agent reviews within 24 hours. Step 3: For amounts under $100, 
        automatic approval. Step 4: For amounts over $100, requires manager approval. 
        Step 5: Refunds processed within 3-5 business days to original payment method. 
        Full refunds available within 30 days of purchase.""",
        "metadata": {"department": "support", "type": "process"}
    },
    {
        "id": "guide_001",
        "title": "New Employee Onboarding Checklist",
        "content": """Day 1: IT setup and system access. Day 2: Department orientation and mentor assignment. 
        Week 1: Complete mandatory training modules (security, compliance, company culture). 
        Week 2: Shadow team members and review project documentation. Month 1: Complete 
        probationary review and set 90-day goals.""",
        "metadata": {"department": "hr", "type": "guide"}
    }
]

# The questions we want to test against the model's base knowledge.
test_questions = [
    "What is the pricing for CloudSync Pro Enterprise?",
    "How many days per week can employees work remotely?",
    "What is the refund approval process for purchases over $100?",
    "What happens during the first week of employee onboarding?"
]


async def run_direct_to_model_test():
    """
    Tests questions directly against the base AI model to demonstrate
    its lack of knowledge about our private company data.
    """
    print("TESTING MODEL WITHOUT RAG - Questions about private company data:")
    print("=" * 70)

    chat_service = kernel.get_service(type=OpenAIChatCompletion)

    # Create a default settings object for OpenAI chat models.
    execution_settings = OpenAIChatPromptExecutionSettings()

    for i, question in enumerate(test_questions, 1):
        print(f"\nQuestion {i}: {question}")
        
        chat_history = ChatHistory()
        chat_history.add_user_message(question)

        # Pass the settings object into the function call.
        response = await chat_service.get_chat_message_content(
            chat_history=chat_history,
            settings=execution_settings  # This argument is now required
        )
        
        print(f"Model Response: {str(response)}")
        print("-" * 50)


# NOTE: This code assumes your Kernel is initialized and the OpenAI API key 
# is configured in your environment before running.

print("✅ Starting test...")
# Run the single, simplified test function.
await run_direct_to_model_test()

## What We Just Observed

The model either:
1. **Cannot answer** because it doesn't have access to this specific company information
2. **Provides generic responses** that might not match your actual policies
3. **Makes assumptions** that could be incorrect for your specific context

This is exactly why we need RAG - to give the model access to your specific data while preserving its reasoning capabilities.

---

# Part 2: Document Chunking Strategies
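
Character-based chunking with overlap boils down to a sliding window over the text. As a minimal sketch (the helper name `window_bounds` is hypothetical, and it ignores any sentence-boundary adjustment a splitter might add), the function below computes only the start and end offsets of each chunk, which makes the effect of `chunk_size` and `overlap` easy to see.

```python
from typing import List, Tuple


def window_bounds(text_length: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) offsets for fixed-size chunks that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # How far the window advances each time
    bounds = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        bounds.append((start, end))
        if end == text_length:
            break
        start += step
    return bounds


print(window_bounds(text_length=500, chunk_size=200, overlap=50))
# [(0, 200), (150, 350), (300, 500)]
```

Each chunk repeats the last 50 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.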

In [None]:
# In this section, we will explore different text chunking strategies. 
# We can either do a simple character-based split or a more semantic split that respects paragraphs and sentences. For example, on the latter, we will try to keep sentences together and avoid breaking them in the middle.

# At a high level, what this function does is take a long piece of text and break it into smaller pieces (chunks) that are easier to work with.
def simple_text_splitter(text: str, chunk_size: int = 300, overlap: int = 50) -> List[str]:
    """
    Simple character-based text splitter with overlap
    """
    chunks = [] # This will hold our text chunks
    start = 0 # Starting position in the text
    
    #Loop until we reach the end of the text.
    while start < len(text):
        # Calculate where this chunk should end
        end = min(start + chunk_size, len(text))
        
        # Try to end at a sentence boundary (but only if we're not at the very end)
        if end < len(text):
            last_period = text.rfind('.', start, end)
            if last_period > start + chunk_size // 2:
                end = last_period + 1
        
        # Extract the chunk
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        
        # Move to next position
        # Ensure we always move forward, even with large overlap
        next_start = end - overlap
        start = max(next_start, start + 1)  # Always move at least 1 character forward
        
        # If we've reached the end, break
        if end >= len(text):
            break
    
    return chunks

# What this function does is take a long piece of text and break it into smaller pieces (chunks) that respect paragraph and sentence boundaries. So instead of just cutting it off at a certain number of characters, it tries to keep whole sentences together and avoid breaking them in the middle. This is useful because it helps preserve the meaning of the text and makes it easier to understand.
def semantic_text_splitter(text: str, max_chunk_size: int = 400) -> List[str]:
    """
    Split text respecting paragraph and sentence boundaries
    """
    # Split by paragraphs first (handle both \n\n and single \n)
    paragraphs = [p.strip() for p in text.split('\n') if p.strip()]
    
    chunks = []
    current_chunk = ""
    
    for paragraph in paragraphs:
        # If this paragraph alone is too big, split it by sentences
        if len(paragraph) > max_chunk_size:
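            # Split after sentence-ending punctuation followed by whitespace,
            # so decimals such as "99.9%" are not broken apart
            import re  # Local import keeps this cell self-contained
            sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', paragraph) if s.strip()]
            
            for sentence in sentences:
                sentence_with_period = sentence if sentence[-1] in '.!?' else sentence + '.'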
                
                # Check if adding this sentence would exceed our limit
                if current_chunk and len(current_chunk) + len(sentence_with_period) + 1 > max_chunk_size:
                    chunks.append(current_chunk.strip())
                    current_chunk = sentence_with_period
                else:
                    current_chunk += " " + sentence_with_period if current_chunk else sentence_with_period
        else:
            # Try to add the whole paragraph
            if current_chunk and len(current_chunk) + len(paragraph) + 2 > max_chunk_size:
                chunks.append(current_chunk.strip())
                current_chunk = paragraph
            else:
                current_chunk += "\n" + paragraph if current_chunk else paragraph
    
    # Don't forget the last chunk
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    
    return chunks

# Test different chunking strategies
sample_doc = company_documents[0]
print("CHUNKING STRATEGY COMPARISON:")
print("=" * 35)

print(f"Original document: {sample_doc['title']}")
print(f"Length: {len(sample_doc['content'])} characters")

# Test simple chunking with safer parameters
print("\n1. SIMPLE CHARACTER-BASED CHUNKING:")
simple_chunks = simple_text_splitter(sample_doc['content'], chunk_size=200, overlap=30)  # Reduced overlap
for i, chunk in enumerate(simple_chunks):
    print(f"Chunk {i+1} ({len(chunk)} chars): {chunk}")

# Test semantic chunking
print("\n2. SEMANTIC CHUNKING (respects paragraphs):")
semantic_chunks = semantic_text_splitter(sample_doc['content'], max_chunk_size=250)
for i, chunk in enumerate(semantic_chunks):
    print(f"Chunk {i+1} ({len(chunk)} chars): {chunk}")

print("\nTRADE-OFFS:")
print("- Simple chunking: Predictable sizes, may break mid-sentence")
print("- Semantic chunking: Preserves meaning, variable sizes")

## Testing the Complete RAG Pipeline

Now we will implement a simple RAG system that uses the Semantic Kernel to handle document retrieval and answer generation.

1. semantic_chunker: Its job is to take a string of text and split it into smaller strings (chunks). It first splits the text into paragraphs wherever it sees a double newline (\n\n), which keeps sentences that belong together in the same piece. It then iterates through these paragraphs and groups them into chunks up to the max chunk size. By respecting natural breaks in the text, it produces focused chunks, which improves the "signal-to-noise" ratio of the retrieved context.
2. ingest_documents_semantic: This function solves the first major problem of any RAG system: preparing data for the AI. It accepts a list of documents, a vector_store (the custom store we built earlier), and the embedding_service. It loops through each document, splits it into chunks, calls the embedding_service on each chunk, and creates a DocumentChunk object (with the vector, original text, and metadata).
3. ask_with_semantic_rag: The core engine of RAG. It accepts the user’s question, the kernel, and the vector_store (our knowledge base). It sends the question to the embedding service to get its vector representation, then uses vector search to find the document chunks whose vectors are closest. It then augments the prompt with those chunks and generates an answer.
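
The "augment" step is the easiest part to show in isolation. The sketch below is plain Python with no AI services; `build_augmented_prompt` is a hypothetical helper, and the chunk texts and scores are made-up stand-ins for real search results. It joins the retrieved chunks into a context block and wraps the question around it.

```python
from typing import List, Tuple


def build_augmented_prompt(question: str, retrieved: List[Tuple[str, float]]) -> str:
    """Join retrieved chunk texts into a context block and wrap the question around it."""
    context = "\n\n---\n\n".join(text for text, _score in retrieved)
    return (
        "Answer the following question based ONLY on the context provided below.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {question}\n\nANSWER:"
    )


# Hypothetical search results: (chunk text, similarity score), best match first
hits = [
    ("Pricing: $49/month per user with annual commitment.", 0.82),
    ("Features include automatic backup and version control.", 0.55),
]
prompt = build_augmented_prompt("What is the pricing for CloudSync Pro Enterprise?", hits)
print(prompt)
```

The scores decide which chunks are included, but only the chunk text is placed in the prompt.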

In [None]:
# --- Helper Function for Semantic Chunking ---

def semantic_chunker(text: str, max_chunk_size: int = 300) -> List[str]:
    """
    Splits text into chunks, respecting paragraph boundaries to keep related sentences together.
    
    This is a pure utility function; it doesn't need any AI services or state.
    Its only job is to intelligently split text based on structure.
    """
    # First, split the text into paragraphs based on double newlines.
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    
    chunks = []
    current_chunk = ""
    
    # Iterate through each paragraph to build chunks up to the max size.
    for paragraph in paragraphs:
        # If adding the next paragraph would make the current chunk too large...
        if current_chunk and (len(current_chunk) + len(paragraph) + 2) > max_chunk_size:
            # ...finalize the current chunk...
            chunks.append(current_chunk)
            # ...and start a new chunk with the current paragraph.
            current_chunk = paragraph
        else:
            # Otherwise, add the paragraph to the current chunk.
            current_chunk += ("\n\n" + paragraph) if current_chunk else paragraph
            
    # Add the last remaining chunk to the list.
    if current_chunk:
        chunks.append(current_chunk)
        
    return chunks

# --- Core RAG Functions ---

async def ingest_documents_semantic(
    documents: List[Dict], 
    vector_store: SimpleVectorStore, 
    embedding_service: OpenAITextEmbedding
) -> None:
    """
    Processes and ingests documents using the semantic chunking strategy.
    """
    print(f"Ingesting {len(documents)} documents with semantic chunking...")
    all_chunks_to_add = []
    
    for doc in documents:
        # Use our standalone helper function to get semantically coherent chunks.
        text_chunks = semantic_chunker(doc["content"])
        
        for i, chunk_text in enumerate(text_chunks):
            # Skip chunks that are too short to have meaningful content.
            if len(chunk_text) < 20:
                continue

            # Generate the embedding vector for the chunk's content.
            embedding = (await embedding_service.generate_embeddings([chunk_text]))[0]
            
            # Create the DocumentChunk object.
            chunk = DocumentChunk(
                id=f"{doc['id']}_chunk_{i}",
                content=chunk_text,
                source_doc_id=doc["id"],
                title=doc["title"],
                chunk_index=i,
                embedding=embedding
            )
            all_chunks_to_add.append(chunk)

    vector_store.add_documents(all_chunks_to_add)
    print(f"Added {len(all_chunks_to_add)} new chunks to the vector store.")


async def ask_with_semantic_rag(
    question: str, 
    kernel: Kernel, 
    vector_store: SimpleVectorStore
) -> str:
    """
    Asks a question using the RAG pattern with the semantically chunked documents.
    """
    # Get the necessary AI services from the kernel.
    embedding_service = kernel.get_service(type=OpenAITextEmbedding)
    chat_service = kernel.get_service(type=OpenAIChatCompletion)
    
    # 1. RETRIEVE: Convert the question to an embedding and search the vector store.
    query_embedding = (await embedding_service.generate_embeddings([question]))[0]
    search_results = vector_store.search(query_embedding, k=3, score_threshold=0.3)
    
    if not search_results:
        return "I could not find any relevant information in the documents to answer that question."
        
    # 2. AUGMENT: Build the context string from the retrieved document chunks.
    context = "\n\n---\n\n".join([result.content for result, score in search_results])
    
    # Create the final prompt that instructs the AI and provides the context.
    prompt = f"""
Answer the following question based ONLY on the context provided below.

CONTEXT:
---
{context}
---

QUESTION: {question}

ANSWER:
"""

    # 3. GENERATE: Send the augmented prompt to the chat model to get the final answer.
    chat_history = ChatHistory()
    chat_history.add_user_message(prompt)
    
    # Define execution settings for the AI call.
    settings = OpenAIChatPromptExecutionSettings(max_tokens=200, temperature=0.1)
    
    response = await chat_service.get_chat_message_content(chat_history, settings)
    
    return str(response)

# --- Main Execution Block ---

# 1. Initialize our vector store for this RAG process.
semantic_vector_store = SimpleVectorStore()

# 2. Get the embedding service from the kernel, as it's needed for ingestion.
embedding_service = kernel.get_service(type=OpenAITextEmbedding)

# 3. Call the ingestion function to process documents and populate the vector store.
await ingest_documents_semantic(company_documents, semantic_vector_store, embedding_service)

# 4. Ask a question using the populated vector store.
print("\n" + "="*50)
print("TESTING RAG SYSTEM WITH SEMANTIC CHUNKING:")
question_to_ask = "What is the pricing for CloudSync Pro Enterprise?"
answer = await ask_with_semantic_rag(question_to_ask, kernel, semantic_vector_store)

print(f"\nQ: {question_to_ask}")
print(f"A: {answer}")

---

# Part 4: Advanced Configuration and Tuning

We're testing two simple ways to make our RAG system give better answers - first, by changing how we ask the AI to respond (friendly vs professional style), and second, by adjusting how picky we are about which documents to include (strict matching vs loose matching).

The latter is controlled by the similarity threshold. A similarity threshold sets the bar for "relevant" search results: it determines how similar a document chunk must be to your question before we include it in the answer. In this notebook it is a number between 0 and 1, compared against the cosine similarity score returned by the vector store.

When thresholds are too low: If you ask "What is our vacation policy?" with a threshold of 0.2, you might get results about vacation policy, employee benefits, time tracking, and company holidays. While all HR-related, this floods the user with information that's not directly answering their question. The AI then has to sift through all this extra context, potentially diluting the quality of the final answer.

When thresholds are too high: If you ask "How do I request time off?" with a threshold of 0.7, you might get no results at all because no chunk scores that high against the wording of the question, even though your vacation policy document clearly explains the process. Users end up frustrated with "no information found" responses when the answer actually exists in your knowledge base.

Finding the sweet spot: The goal is to find a threshold that gives you enough relevant information without including noise. For most business documents, thresholds between 0.3 and 0.5 work well: high enough to filter out unrelated content, but low enough to catch relevant information that might use different wording than your question. The best value also depends on the embedding model, so check it against your own documents.
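To make the trade-off concrete, here is a minimal, self-contained sketch of threshold filtering. It does not use the notebook's vector store: `cosine_similarity`, `filter_by_threshold`, and the toy three-dimensional vectors are hypothetical stand-ins chosen only for illustration, and real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filter_by_threshold(query_vec, chunks, threshold, k=3):
    """Return up to k (title, score) pairs scoring at or above threshold, best first."""
    scored = [(title, cosine_similarity(query_vec, vec)) for title, vec in chunks]
    kept = [(title, score) for title, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

# Toy "embeddings" (illustrative values only, not real model output).
toy_chunks = [
    ("Vacation Policy", [0.9, 0.1, 0.0]),
    ("Employee Benefits", [0.6, 0.6, 0.1]),
    ("Security Guidelines", [0.1, 0.2, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]

for threshold in (0.1, 0.5, 0.9, 0.999):
    hits = filter_by_threshold(toy_query, toy_chunks, threshold)
    print(f"threshold={threshold}: {[title for title, _ in hits]}")
```

With these toy vectors, the lowest threshold lets the unrelated chunk through, while the highest one returns nothing at all, which mirrors the two failure modes described above.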

In [None]:
class OptimizedRAG(SimpleRAG):
    """Simple RAG with optimization features"""
    
    async def ask_with_custom_prompt(self, question, prompt_template):
        """Ask question with a custom prompt"""
        # Search for relevant chunks
        query_embedding = await self.embedding_service.generate_embeddings([question])
        results = self.vector_store.search(query_embedding[0], k=3, score_threshold=0.3)
        
        if not results:
            return "No relevant information found."
        
        # Build context
        context = "\n".join([doc.content for doc, score in results])
        
        # Use custom prompt template
        prompt = prompt_template.format(context=context, question=question)
        
        # Generate answer
        from semantic_kernel.connectors.ai.open_ai import OpenAIChatPromptExecutionSettings
        
        chat_history = ChatHistory()
        chat_history.add_user_message(prompt)
        settings = OpenAIChatPromptExecutionSettings(max_tokens=200, temperature=0.1)
        
        response = await self.chat_service.get_chat_message_content(chat_history, settings)
        return str(response)
    
    async def test_thresholds(self, query, thresholds=(0.1, 0.3, 0.5, 0.7)):
        """Test different similarity thresholds"""
        query_embedding = await self.embedding_service.generate_embeddings([query])
        
        print(f"Testing thresholds for: '{query}'")
        
        for threshold in thresholds:
            results = self.vector_store.search(query_embedding[0], k=5, score_threshold=threshold)
            
            print(f"\nThreshold {threshold}: {len(results)} results")
            if results:
                scores = [score for _, score in results]
                print(f"  Score range: {min(scores):.3f} - {max(scores):.3f}")
                print(f"  Documents: {', '.join([doc.title for doc, _ in results[:2]])}")

# Create optimized RAG system
opt_rag = OptimizedRAG(kernel)
await opt_rag.add_documents(company_documents)

# Test custom prompts
customer_prompt = """You are a helpful customer service agent. 

Context: {context}

Customer question: {question}

Friendly response:"""

employee_prompt = """You are an internal HR assistant.

Context: {context}

Employee question: {question}

Professional response:"""

# Test different prompt styles
question = "What is our remote work policy?"

print("CUSTOMER SERVICE STYLE:")
customer_answer = await opt_rag.ask_with_custom_prompt(question, customer_prompt)
print(customer_answer)

print("\nHR ASSISTANT STYLE:")
hr_answer = await opt_rag.ask_with_custom_prompt(question, employee_prompt)
print(hr_answer)

# Test similarity thresholds
print("\n" + "="*40)
await opt_rag.test_thresholds("employee remote work")



## Best Practices Summary

### Document Processing
- **Use semantic chunking** that respects paragraph and sentence boundaries
- **Optimal chunk size: 300-400 characters** for most business documents
- **Include meaningful overlap** (50-80 characters) to preserve context
- **Preserve rich metadata** for filtering and source attribution
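
The chunking guidance above can be sketched as a small greedy chunker. This is an illustrative assumption, not the chunking function used earlier in the notebook: `chunk_by_sentences` and its parameters are hypothetical names, sentences are split with a simple regular expression, and a single sentence longer than `max_chars` is kept whole rather than cut.

```python
import re

def chunk_by_sentences(text, max_chars=400, overlap_chars=60):
    """Greedy sentence-boundary chunker that repeats trailing sentences as overlap."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry over trailing sentences totalling at most overlap_chars.
            carried, size = [], 0
            for prev in reversed(current):
                if size + len(prev) > overlap_chars:
                    break
                carried.insert(0, prev)
                size += len(prev)
            current = carried
            # Drop the overlap rather than exceed the size limit.
            if len(" ".join(current + [sentence])) > max_chars:
                current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

sample_text = " ".join(
    f"Sentence number {i} explains one detail of the policy." for i in range(1, 21)
)
for i, chunk in enumerate(chunk_by_sentences(sample_text, max_chars=200, overlap_chars=60)):
    print(i, len(chunk), chunk[:40] + "...")
```

Overlapping by whole sentences keeps every chunk readable on its own, at the cost of overlap sizes that vary with sentence length.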

### Vector Search Configuration
- **Start with FAISS** for local development and small-scale production
- **Use similarity thresholds** around 0.3 for balanced precision/recall
- **Retrieve 3-5 documents** to provide sufficient context without noise
- **Normalize embeddings** to unit length so that inner-product scores match cosine similarity
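
To see why normalization matters, the sketch below compares raw inner products with inner products of unit-length vectors. It is a plain-Python illustration with hypothetical helper names (`l2_normalize`, `dot`, `cosine`) and made-up two-dimensional vectors. It assumes the vector index scores by inner product, as a FAISS inner-product index does.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [30.0, 40.0]  # same direction as a, ten times the magnitude
c = [4.0, 3.0]    # different direction, same magnitude as a

# Raw inner products depend on vector magnitude as well as direction.
print("raw:       ", dot(a, b), dot(a, c))

# After normalization the inner product equals the cosine similarity.
print("normalized:", dot(l2_normalize(a), l2_normalize(b)),
      dot(l2_normalize(a), l2_normalize(c)))
print("cosine:    ", cosine(a, b), cosine(a, c))
```

Without normalization, a fixed threshold such as 0.3 has no stable meaning, because the raw scores grow with vector length.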

### Prompt Engineering
- **Create role-specific prompts** for different user types (customers, employees, executives)
- **Include clear instructions** for handling cases where information isn't available
- **Use structured templates** that separate context from questions
- **Test prompt variations** to optimize for your specific use cases
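
As one possible way to combine these points, the sketch below builds a role-specific prompt that keeps context and question separate and tells the model what to say when the context has no answer. `RAG_PROMPT_TEMPLATE`, `build_prompt`, and the fallback wording are hypothetical and only illustrate the pattern. The result could be passed to a method such as `ask_with_custom_prompt` after adapting it to that method's `{context}` and `{question}` placeholders.

```python
RAG_PROMPT_TEMPLATE = """You are {role}.

Answer using only the context below. If the context does not contain the answer, reply exactly: "{fallback}"

Context:
{context}

Question: {question}

Answer:"""

DEFAULT_FALLBACK = "I don't have that information in the provided documents."

def build_prompt(role, context_chunks, question, fallback=DEFAULT_FALLBACK):
    """Fill the template, joining non-empty chunks into one context block."""
    context = "\n\n".join(chunk.strip() for chunk in context_chunks if chunk.strip())
    if not context:
        # Make the empty case explicit so the fallback instruction applies.
        context = "(no relevant documents found)"
    return RAG_PROMPT_TEMPLATE.format(
        role=role, fallback=fallback, context=context, question=question
    )

print(build_prompt(
    role="an internal HR assistant",
    context_chunks=["Employees may work remotely up to three days per week."],
    question="What is our remote work policy?",
))
```

Keeping the role, context, and fallback as separate fields makes it easy to test prompt variations by changing one field at a time.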

## Next Steps

1. **Start with core functionality** - Get basic RAG working with your documents
2. **Add monitoring early** - Implement logging and metrics collection
3. **Customize for your domain** - Tailor prompts and chunking for your content
4. **Iterate based on feedback** - Use real user interactions to improve the system
5. **Plan for production** - Consider scalability, monitoring, and maintenance