# Building the RAG Collection

With the chunked data from the previous lesson, we now need a way to store and search it efficiently. A vector database like **ChromaDB** lets us store text chunks as vectors and quickly find similar ones.

Here’s how to build a collection, with detailed comments explaining each step:
```python
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions

def build_chroma_collection(chunks, collection_name="rag_collection"):
    # Choose a pre-trained embedding model to convert text to vectors
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
    
    # Create a ChromaDB client and collection (creates if not exists)
    client = Client(Settings())
    collection = client.get_or_create_collection(
        name=collection_name, 
        embedding_function=embed_func
    )

    # Prepare the data for insertion
    texts = [c["text"] for c in chunks]  # The actual text chunks
    ids = [f"chunk_{c['id']}_{c['chunk_id']}" for c in chunks]  # Unique IDs for each chunk
    metadatas = [{"id": c["id"], "chunk_id": c["chunk_id"]} for c in chunks]  # Extra info for each chunk

    # Add all chunks to the collection for fast vector search
    collection.add(documents=texts, metadatas=metadatas, ids=ids)
    return collection

rag_collection = build_chroma_collection(chunked_data)
# This function sets up the collection but does not print output
```

- We use a pre-trained embedding model to convert text into vectors.
- Each chunk gets a unique ID and metadata.
- All chunks are added to the collection for fast retrieval.
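
To make "finding similar vectors" concrete, here is a minimal sketch of the kind of comparison a vector store performs. The three-dimensional vectors and the `cosine_distance` helper are invented for illustration. Real embeddings from all-MiniLM-L6-v2 have 384 dimensions, and the distance metric ChromaDB actually uses depends on how the collection is configured, so treat cosine distance here as an assumption.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (hand-written, not produced by a model)
toy_vectors = {
    "sql joins": [0.9, 0.1, 0.0],
    "sql indexes": [0.8, 0.2, 0.1],
    "react hooks": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]  # pretend this encodes a question about SQL

# Rank the stored items from closest to farthest
ranked = sorted(
    toy_vectors,
    key=lambda name: cosine_distance(query_vector, toy_vectors[name]),
)
print(ranked)
```

The two SQL items rank ahead of the React item because their vectors point in nearly the same direction as the query vector.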

## Retrieving Relevant Chunks and Constructing Prompts

When a user asks a question, we want to find the most relevant chunks. This is **semantic search**: we look for text with similar meaning, not just matching words.

Here’s how to retrieve the top relevant chunks, with detailed comments:
```python
def retrieve_top_chunks(query, collection, top_k=1):
    # Use the embedding model to convert the query to a vector and search for similar chunks
    results = collection.query(query_texts=[query], n_results=top_k)
    # Format the results for easy use
    return [
        {
            "chunk": results['documents'][0][i],  # The retrieved text chunk
            "id": results['metadatas'][0][i]['id'],  # The original document ID
            "distance": results['distances'][0][i]  # Distance to the query (lower means more similar)
        }
        for i in range(len(results['documents'][0]))
    ]

query = "What are my learning plans for SQL?"
retrieved_chunks = retrieve_top_chunks(query, rag_collection, top_k=2)
print(retrieved_chunks)  # Example output: [{'chunk': 'Review different types of SQL joins — especially LEFT and FULL OUTER joins.', 'id': 2, 'distance': 0.12345}, ...]
```

- The function converts the query into a vector and finds the most similar chunks in the collection.
- The results include the chunk text, the ID of its source document, and a distance (lower means more similar).
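
Because every result carries a distance, one possible refinement is to drop chunks that are too far from the query before they reach the prompt. The sketch below assumes the list-of-dicts shape returned by retrieve_top_chunks. The `filter_by_distance` helper, the sample values, and the cutoff of 1.0 are all hypothetical; a sensible cutoff depends on the embedding model and distance metric in use.

```python
def filter_by_distance(retrieved_chunks, max_distance):
    """Keep only chunks whose distance is at or below max_distance."""
    return [rc for rc in retrieved_chunks if rc["distance"] <= max_distance]

# Hand-written sample in the same shape retrieve_top_chunks returns
sample = [
    {"chunk": "Review SQL joins.", "id": 2, "distance": 0.25},
    {"chunk": "Practice React hooks.", "id": 5, "distance": 1.40},
]

relevant = filter_by_distance(sample, max_distance=1.0)
print(relevant)
```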

To help the agent answer accurately, we build a prompt with the user’s question and the retrieved context. This way, the agent “sees” the most relevant information when generating a response.

Here’s a function to build the prompt, with comments:

```python
def build_prompt(user_prompt, retrieved_chunks=[]):
    # Combine the user's question with the most relevant context chunks
    prompt = f"Question: {user_prompt}\nContext:\n"
    for rc in retrieved_chunks:
        prompt += f"- {rc['chunk']}\n"
    prompt += "Answer:"
    return prompt

prompt = build_prompt(query, retrieved_chunks)
print(prompt)  
# Example output:
# Question: What are my learning plans for SQL?
# Context:
# - Review different types of SQL joins — especially LEFT and FULL OUTER joins.
# - ...
# Answer:
```

- This function creates a prompt that includes the user’s question and the most relevant context chunks.
- The agent can now use this prompt to generate a more accurate answer.
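
When many chunks are retrieved, the prompt can grow quickly. As a rough sketch, the variant below stops adding context once a character budget is reached. The function name `build_prompt_with_budget` and the `max_context_chars` parameter are assumptions made for this example, not part of the lesson's code.

```python
def build_prompt_with_budget(user_prompt, retrieved_chunks, max_context_chars=200):
    """Build the same prompt layout, but cap the total context length."""
    prompt = f"Question: {user_prompt}\nContext:\n"
    used = 0
    for rc in retrieved_chunks:
        chunk = rc["chunk"]
        if used + len(chunk) > max_context_chars:
            break  # stop before the context exceeds the budget
        prompt += f"- {chunk}\n"
        used += len(chunk)
    prompt += "Answer:"
    return prompt

# Two artificial chunks: the first fits the budget, the second would overflow it
chunks = [{"chunk": "a" * 150}, {"chunk": "b" * 100}]
prompt = build_prompt_with_budget("What are my learning plans for SQL?", chunks)
print(prompt)
```

Since chunks arrive ordered from most to least similar, cutting off at the end drops the least relevant context first.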

## Lesson Summary and Practice Introduction

You’ve learned how to build and optimize a RAG collection:

- Load and prepare your data
- Chunk documents for better retrieval
- Store chunks in a vector database
- Retrieve the most relevant information for a query
- Build prompts that combine user questions with context

These steps are key to creating AI agents that use your knowledge base to answer questions effectively.

Now it’s your turn! In the next section, you’ll practice building and optimizing your own RAG collection — loading data, chunking it, storing it in a vector database, and retrieving relevant information to answer questions. Let’s put your new skills to work!

Great job on learning how to build and optimize a RAG collection! Now, let's make a meaningful improvement to our setup by working with the build_chroma_collection function. In this task, you will:

- Create an embedding function using the model 'sentence-transformers/all-MiniLM-L6-v2'.
- Initialize a ChromaDB client and get (or create) a collection with the given name and embedding function.
- Add all the provided chunks to the collection, using the chunk text as the document, and including unique IDs and metadata for each chunk.
- Return the created collection.

Why does this matter? A well-implemented build_chroma_collection function ensures your RAG system can store and retrieve relevant information efficiently.

```python
import os
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions

def load_data(file_name):
    current_dir = os.path.dirname(__file__)
    dataset_file = os.path.join(current_dir, "data", file_name)
    with open(dataset_file, 'r') as file:
        return json.load(file)

def chunk_text(text, chunk_size=30):
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i + chunk_size])
    return chunks

def chunk_dataset(data, chunk_size=30):
    all_chunks = []
    for doc_id, doc in enumerate(data):
        doc_text = doc["content"]
        doc_chunks = chunk_text(doc_text, chunk_size)
        for chunk_id, chunk_str in enumerate(doc_chunks):
            all_chunks.append({
                "id": doc_id,
                "chunk_id": chunk_id,
                "text": chunk_str,
            })
    return all_chunks

def build_chroma_collection(chunks, collection_name="rag_collection"):
    # TODO: Implement this function:
    # 1. Create an embedding function using 'sentence-transformers/all-MiniLM-L6-v2'
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)

    # 2. Initialize a ChromaDB client and get or create a collection with the given name and embedding function
    client = Client(Settings())
    collection = client.get_or_create_collection(
        name=collection_name,
        embedding_function=embed_func
    )

    # 3. Add all chunks to the collection with their text, unique IDs, and metadata
    texts = [c["text"] for c in chunks]  # The actual text chunks
    ids = [f"chunk_{c['id']}_{c['chunk_id']}" for c in chunks]  # Unique IDs for each chunk
    metadatas = [{"id": c["id"], "chunk_id": c["chunk_id"]} for c in chunks]  # Extra info for each chunk
    collection.add(documents=texts, metadatas=metadatas, ids=ids)

    # 4. Return the collection
    return collection


def retrieve_top_chunks(query, collection, top_k=1):
    results = collection.query(
        query_texts=[query],
        n_results=top_k
    )
    retrieved_chunks = []
    for i in range(len(results['documents'][0])):
        retrieved_chunks.append({
            "chunk": results['documents'][0][i],
            "id": results['metadatas'][0][i]['id'],
            "distance": results['distances'][0][i]
        })
    return retrieved_chunks

def build_prompt(user_prompt, retrieved_chunks=[]):
    prompt = f"Question: {user_prompt}\nContext:\n"
    for rc in retrieved_chunks:
        prompt += f"- {rc['chunk']}\n"
    prompt += "Answer:"
    return prompt

def main():
    data = load_data("data.json")
    chunked_data = chunk_dataset(data)

    rag_collection = build_chroma_collection(chunked_data)

    print(rag_collection)

if __name__ == "__main__":
    main()
```
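
The IDs built in build_chroma_collection combine the document ID and the chunk ID, so they stay unique only as long as each pair appears once. A quick sanity check before adding chunks could look like the sketch below. The helpers `make_chunk_id` and `find_duplicate_ids` are hypothetical names introduced for this example, and the sample chunks are hand-written.

```python
from collections import Counter

def make_chunk_id(chunk):
    """Build the same ID string the lesson uses: chunk_<doc id>_<chunk id>."""
    return f"chunk_{chunk['id']}_{chunk['chunk_id']}"

def find_duplicate_ids(chunks):
    """Return the sorted list of IDs that occur more than once."""
    counts = Counter(make_chunk_id(c) for c in chunks)
    return sorted(cid for cid, n in counts.items() if n > 1)

sample_chunks = [
    {"id": 0, "chunk_id": 0, "text": "Learn SQL "},
    {"id": 0, "chunk_id": 1, "text": "joins"},
    {"id": 0, "chunk_id": 1, "text": "joins (repeated by mistake)"},
]

duplicates = find_duplicate_ids(sample_chunks)
print(duplicates)  # ['chunk_0_1']
```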

You've made great progress in building and optimizing a RAG collection. Now, let's make a small but meaningful change to our RAG retrieval process.

Currently, the retrieve_top_chunks function is set to retrieve only one top chunk. Modify the main function to change the top_k parameter from 1 to 2, allowing the retrieval of two top chunks. This will help you understand how retrieving more context can affect the agent's responses.

```python
import os
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions

def load_data(file_name):
    current_dir = os.path.dirname(__file__)
    dataset_file = os.path.join(current_dir, "data", file_name)
    with open(dataset_file, 'r') as file:
        return json.load(file)

def chunk_text(text, chunk_size=60):
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i + chunk_size])
    return chunks

def chunk_dataset(data, chunk_size=60):
    all_chunks = []
    for doc_id, doc in enumerate(data):
        doc_text = doc["content"]
        doc_chunks = chunk_text(doc_text, chunk_size)
        for chunk_id, chunk_str in enumerate(doc_chunks):
            all_chunks.append({
                "id": doc_id,
                "chunk_id": chunk_id,
                "text": chunk_str,
            })
    return all_chunks

def build_chroma_collection(chunks, collection_name="rag_collection"):
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
    client = Client(Settings())
    collection = client.get_or_create_collection(name=collection_name, embedding_function=embed_func)

    texts = [c["text"] for c in chunks]
    ids = [f"chunk_{c['id']}_{c['chunk_id']}" for c in chunks]
    metadatas = [{"id": c["id"], "chunk_id": c["chunk_id"]} for c in chunks]

    collection.add(documents=texts, metadatas=metadatas, ids=ids)

    return collection

def retrieve_top_chunks(query, collection, top_k=1):
    results = collection.query(
        query_texts=[query],
        n_results=top_k
    )
    retrieved_chunks = []
    for i in range(len(results['documents'][0])):
        retrieved_chunks.append({
            "chunk": results['documents'][0][i],
            "id": results['metadatas'][0][i]['id'],
            "distance": results['distances'][0][i]
        })
    return retrieved_chunks

def main():
    data = load_data("data.json")
    chunked_data = chunk_dataset(data)

    rag_collection = build_chroma_collection(chunked_data)

    query = "What are my learning plans for React"
    # TODO: Change the top_k parameter from 1 to 2
    retrieved_chunks = retrieve_top_chunks(query, rag_collection, top_k=2)

    print(retrieved_chunks)

if __name__ == "__main__":
    main()
```
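
To see what changing top_k does without running the full pipeline, the sketch below picks the k closest items from a hand-written list of scored chunks. The `top_k_by_distance` helper and the distances are invented for illustration; in the real code, ChromaDB performs this selection inside collection.query.

```python
import heapq

def top_k_by_distance(scored_chunks, top_k=1):
    """Return the top_k chunks with the smallest distance, closest first."""
    return heapq.nsmallest(top_k, scored_chunks, key=lambda rc: rc["distance"])

# Hypothetical distances for the query "What are my learning plans for React"
scored = [
    {"chunk": "Build a small React app with hooks.", "distance": 0.31},
    {"chunk": "Review SQL joins.", "distance": 1.12},
    {"chunk": "Read the React docs on state.", "distance": 0.45},
]

one = top_k_by_distance(scored, top_k=1)
two = top_k_by_distance(scored, top_k=2)
print(len(one), len(two))
```

With top_k=2 the second React chunk joins the context, which gives the agent more material to work with at the cost of a longer prompt.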

You've done well in learning how to build and optimize a RAG collection. Now, let's enhance the retrieve_top_chunks function to complete the querying process.

```python
# exercise 3

import os
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions

def load_data(file_name):
    current_dir = os.path.dirname(__file__)
    dataset_file = os.path.join(current_dir, "data", file_name)
    with open(dataset_file, 'r') as file:
        return json.load(file)

def chunk_text(text, chunk_size=60):
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i + chunk_size])
    return chunks

def chunk_dataset(data, chunk_size=60):
    all_chunks = []
    for doc_id, doc in enumerate(data):
        doc_text = doc["content"]
        doc_chunks = chunk_text(doc_text, chunk_size)
        for chunk_id, chunk_str in enumerate(doc_chunks):
            all_chunks.append({
                "id": doc_id,
                "chunk_id": chunk_id,
                "text": chunk_str,
            })
    return all_chunks

def build_chroma_collection(chunks, collection_name="rag_collection"):
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
    client = Client(Settings())
    collection = client.get_or_create_collection(name=collection_name, embedding_function=embed_func)

    texts = [c["text"] for c in chunks]
    ids = [f"chunk_{c['id']}_{c['chunk_id']}" for c in chunks]
    metadatas = [{"id": c["id"], "chunk_id": c["chunk_id"]} for c in chunks]

    collection.add(documents=texts, metadatas=metadatas, ids=ids)

    return collection

def retrieve_top_chunks(query, collection, top_k=1):
    # TODO: Pass the query and top_k to the collection's query method
    results = collection.query(
        query_texts=[query],
        n_results=top_k
    )
    retrieved_chunks = []
    for i in range(len(results['documents'][0])):
        retrieved_chunks.append({
            "chunk": results['documents'][0][i],
            "id": results['metadatas'][0][i]['id'],
            "distance": results['distances'][0][i]
        })
    return retrieved_chunks

def main():
    data = load_data("data.json")
    chunked_data = chunk_dataset(data)

    rag_collection = build_chroma_collection(chunked_data)

    query = "What are my learning plans for React"
    retrieved_chunks = retrieve_top_chunks(query, rag_collection)
    print(retrieved_chunks)

if __name__ == "__main__":
    main()
```
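
The repeated `[0]` in retrieve_top_chunks is there because query_texts is a list, so each field of the result holds one inner list per query. The sketch below walks a hand-written dictionary that mimics that nesting. The `mock_results` values are made up, and `flatten_query_results` is a hypothetical helper rather than a ChromaDB function.

```python
def flatten_query_results(results, query_index=0):
    """Turn the nested result lists for one query into a list of dicts."""
    docs = results["documents"][query_index]
    metas = results["metadatas"][query_index]
    dists = results["distances"][query_index]
    return [
        {"chunk": doc, "id": meta["id"], "distance": dist}
        for doc, meta, dist in zip(docs, metas, dists)
    ]

# Mimics the nesting for a single query that returned a single chunk
mock_results = {
    "ids": [["chunk_3_0"]],
    "documents": [["Build a small React app"]],
    "metadatas": [[{"id": 3, "chunk_id": 0}]],
    "distances": [[0.42]],
}

flat = flatten_query_results(mock_results)
print(flat)
```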

You've done well in enhancing the RAG retrieval process. Now, let's take it a step further by integrating an agent to handle user queries.

In this task, you will create a new file, rag_agent.py, to define an agent that processes prompts and returns responses. Then, integrate this agent into the main script to handle user queries effectively.

Implement rag_agent.py and define an Agent with the name "Learning Assistant" and instructions to use the provided context to answer questions.

Implement the ask_agent function to process prompts using the Runner and return the final output.

In the main function of solution.py, use the ask_agent function to get a response to the query and print it.

```python
from agents import Agent, Runner

# TODO: Define an agent that processes prompts and returns responses with the following properties
# - Name: "Learning Assistant"
# - Instructions: "You are a personal learning assistant. Whenever asked a question about learning plans, use the context provided to answer questions."
# A lowercase variable name keeps the Agent class itself from being shadowed.
learning_agent = Agent(
    name="Learning Assistant",
    instructions=(
        "You are a personal learning assistant. "
        "Whenever asked a question about learning plans, use the context provided to answer questions."
    ),
)

# TODO: Implement the ask_agent function to process prompts using the Runner and return the final output
def ask_agent(prompt):
    try:
        result = Runner.run_sync(learning_agent, prompt)
        return result.final_output
    except Exception as e:
        # Report the failure, then re-raise so the caller can handle it
        print(f"Agent error: {e}")
        raise
```

```python
import os
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions
from rag_agent import ask_agent

def load_data(file_name):
    """Load sample knowledge base content from JSON file."""
    current_dir = os.path.dirname(__file__)
    dataset_file = os.path.join(current_dir, "data", file_name)
    with open(dataset_file, 'r') as file:
        return json.load(file)

def chunk_text(text, chunk_size=60):
    """Chunk text into smaller pieces for better processing."""
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i + chunk_size])
    return chunks

def chunk_dataset(data, chunk_size=60):
    all_chunks = []
    for doc_id, doc in enumerate(data):
        doc_text = doc["content"]
        doc_chunks = chunk_text(doc_text, chunk_size)
        for chunk_id, chunk_str in enumerate(doc_chunks):
            all_chunks.append({
                "id": doc_id,
                "chunk_id": chunk_id,
                "text": chunk_str,
            })
    return all_chunks

def build_chroma_collection(chunks, collection_name="rag_collection"):
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
    client = Client(Settings())
    collection = client.get_or_create_collection(name=collection_name, embedding_function=embed_func)

    texts = [c["text"] for c in chunks]
    ids = [f"chunk_{c['id']}_{c['chunk_id']}" for c in chunks]
    metadatas = [{"id": c["id"], "chunk_id": c["chunk_id"]} for c in chunks]

    # Add documents to the collection
    collection.add(documents=texts, metadatas=metadatas, ids=ids)

    return collection

def retrieve_top_chunks(query, collection, top_k=1):
    results = collection.query(
        query_texts=[query],
        n_results=top_k
    )
    retrieved_chunks = []
    for i in range(len(results['documents'][0])):
        retrieved_chunks.append({
            "chunk": results['documents'][0][i],
            "id": results['metadatas'][0][i]['id'],
            "distance": results['distances'][0][i]
        })
    return retrieved_chunks

def build_prompt(user_prompt, retrieved_chunks=[]):
    prompt = f"Question: {user_prompt}\nContext:\n"
    # Combine multiple chunks
    for rc in retrieved_chunks:
        prompt += f"- {rc['chunk']}\n"
    prompt += "Answer:"
    return prompt

def main():
    # Make sure data/data.json exists with the expected format
    data = load_data("data.json")
    chunked_data = chunk_dataset(data)

    rag_collection = build_chroma_collection(chunked_data)

    query = "What are my learning plans for SQL?"
    retrieved_chunks = retrieve_top_chunks(query, rag_collection, top_k=2)

    prompt = build_prompt(query, retrieved_chunks)

    # TODO: Use the ask_agent function to get a response for the prompt and print it
    answer = ask_agent(prompt)
    print(answer)

if __name__ == "__main__":
    main()
```
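
The flow in main is retrieve, then build the prompt, then ask the agent. One way to try that flow without a vector database or a model call is to pass each step in as a function and substitute stubs. Everything in the sketch below (`answer_query`, `fake_retrieve`, `simple_prompt`, `echo_agent`) is a hypothetical stand-in written for this example, not part of the lesson's solution.

```python
def answer_query(query, retrieve_fn, build_prompt_fn, ask_fn, top_k=2):
    """Run the retrieve -> prompt -> ask steps with injectable functions."""
    chunks = retrieve_fn(query, top_k)
    prompt = build_prompt_fn(query, chunks)
    return ask_fn(prompt)

def fake_retrieve(query, top_k):
    # Stub retriever: always returns the same hand-written chunk
    return [{"chunk": "Review SQL joins.", "id": 2, "distance": 0.2}][:top_k]

def simple_prompt(user_prompt, retrieved_chunks):
    # Same layout as the lesson's build_prompt
    context = "".join(f"- {rc['chunk']}\n" for rc in retrieved_chunks)
    return f"Question: {user_prompt}\nContext:\n{context}Answer:"

def echo_agent(prompt):
    # Stub agent: reports how many context lines it received
    n_chunks = sum(line.startswith("- ") for line in prompt.splitlines())
    return f"(stub answer based on {n_chunks} context chunk(s))"

answer = answer_query(
    "What are my learning plans for SQL?",
    fake_retrieve,
    simple_prompt,
    echo_agent,
)
print(answer)
```

Swapping the stubs for retrieve_top_chunks (wrapped so it receives the collection), build_prompt, and ask_agent would give the real pipeline.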

You've come a long way in building and optimizing a RAG collection. Now, it's time to put all the pieces together.

Your task is to complete the Python script that includes building a ChromaDB collection, retrieving relevant chunks, and constructing prompts for the agent.

This exercise will help solidify your understanding of the entire process and ensure you can implement a RAG system effectively.

```python
from agents import Agent, Runner
import os
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions
from rag_agent import ask_agent

def load_data(file_name):
    current_dir = os.path.dirname(__file__)
    dataset_file = os.path.join(current_dir, "data", file_name)
    with open(dataset_file, 'r') as file:
        return json.load(file)

def chunk_text(text, chunk_size=30):
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i + chunk_size])
    return chunks

def chunk_dataset(data, chunk_size=30):
    all_chunks = []
    for doc_id, doc in enumerate(data):
        doc_text = doc["content"]
        doc_chunks = chunk_text(doc_text, chunk_size)
        for chunk_id, chunk_str in enumerate(doc_chunks):
            all_chunks.append({
                "id": doc_id,
                "chunk_id": chunk_id,
                "text": chunk_str,
            })
    return all_chunks


def build_chroma_collection(chunks, collection_name="rag_collection"):
    # TODO: Create an embedding function and collection, then add all chunks to the collection
    pass

def retrieve_top_chunks(query, collection, top_k=1):
    # TODO: Query the collection for top-k relevant chunks and return them
    pass

def build_prompt(user_prompt, retrieved_chunks=[]):
    # TODO: Build a prompt with the user question and retrieved context
    pass

def main():
    # TODO: Load data, chunk it, build the collection, retrieve chunks, build the prompt, and ask the agent
    pass

if __name__ == "__main__":
    main()
```

```python
from agents import Agent, Runner
import os
import json
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions
from rag_agent import ask_agent

def load_data(file_name):
    current_dir = os.path.dirname(__file__)
    dataset_file = os.path.join(current_dir, "data", file_name)
    with open(dataset_file, 'r') as file:
        return json.load(file)

def chunk_text(text, chunk_size=30):
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i + chunk_size])
    return chunks

def chunk_dataset(data, chunk_size=30):
    all_chunks = []
    for doc_id, doc in enumerate(data):
        doc_text = doc["content"]
        doc_chunks = chunk_text(doc_text, chunk_size)
        for chunk_id, chunk_str in enumerate(doc_chunks):
            all_chunks.append({
                "id": doc_id,
                "chunk_id": chunk_id,
                "text": chunk_str,
            })
    return all_chunks


def build_chroma_collection(chunks, collection_name="rag_collection"):
    # TODO: Create an embedding function and collection, then add all chunks to the collection
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
    client = Client(Settings())
    collection = client.get_or_create_collection(name=collection_name, embedding_function=embed_func)

    texts = [c["text"] for c in chunks]
    ids = [f"chunk_{c['id']}_{c['chunk_id']}" for c in chunks]
    metadatas = [{"id": c["id"], "chunk_id": c["chunk_id"]} for c in chunks]

    # Add documents to the collection
    collection.add(documents=texts, metadatas=metadatas, ids=ids)

    return collection

def retrieve_top_chunks(query, collection, top_k=1):
    # TODO: Query the collection for top-k relevant chunks and return them
    results = collection.query(
        query_texts=[query],
        n_results=top_k
    )
    retrieved_chunks = []
    for i in range(len(results['documents'][0])):
        retrieved_chunks.append({
            "chunk": results['documents'][0][i],
            "id": results['metadatas'][0][i]['id'],
            "distance": results['distances'][0][i]
        })
    return retrieved_chunks

def build_prompt(user_prompt, retrieved_chunks=[]):
    # TODO: Build a prompt with the user question and retrieved context
    prompt = f"Question: {user_prompt}\nContext:\n"
    # Combine multiple chunks
    for rc in retrieved_chunks:
        prompt += f"- {rc['chunk']}\n"
    prompt += "Answer:"
    return prompt

def main():
    # TODO: Load data, chunk it, build the collection, retrieve chunks, build the prompt, and ask the agent
    data = load_data("data.json")
    chunked_data = chunk_dataset(data)

    rag_collection = build_chroma_collection(chunked_data)

    # Same query and top_k as the previous exercise
    query = "What are my learning plans for SQL?"
    retrieved_chunks = retrieve_top_chunks(query, rag_collection, top_k=2)

    prompt = build_prompt(query, retrieved_chunks)
    answer = ask_agent(prompt)
    print(answer)

if __name__ == "__main__":
    main()
```