# Building the RAG Collection

With the chunked data from the previous lesson, we now need a way to store and search it efficiently. A vector database like **ChromaDB** lets us store text chunks as vectors and quickly find the ones most similar to a query.

Here’s how to build a collection, with detailed comments explaining each step:
```python
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions

def build_chroma_collection(chunks, collection_name="rag_collection"):
    # Choose a pre-trained embedding model to convert text to vectors
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
    
    # Create a ChromaDB client and collection (creates if not exists)
    client = Client(Settings())
    collection = client.get_or_create_collection(
        name=collection_name, 
        embedding_function=embed_func
    )

    # Prepare the data for insertion
    texts = [c["text"] for c in chunks]  # The actual text chunks
    ids = [f"chunk_{c['id']}_{c['chunk_id']}" for c in chunks]  # Unique IDs for each chunk
    metadatas = [{"id": c["id"], "chunk_id": c["chunk_id"]} for c in chunks]  # Extra info for each chunk

    # Add all chunks to the collection for fast vector search
    collection.add(documents=texts, metadatas=metadatas, ids=ids)
    return collection

rag_collection = build_chroma_collection(chunked_data)
# This function sets up the collection but does not print output
```

- We use a pre-trained embedding model to convert text into vectors.
- Each chunk gets a unique ID and metadata.
- All chunks are added to the collection for fast retrieval.
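
ChromaDB identifies each record by its ID, so it helps to confirm the IDs are unique before inserting. The following is a minimal sketch, assuming each chunk is a dict with `id`, `chunk_id`, and `text` keys as in the function above. The sample data and the helpers `make_chunk_id` and `find_duplicate_ids` are hypothetical and not part of ChromaDB.

```python
from collections import Counter

# Hypothetical sample data in the shape build_chroma_collection expects
sample_chunks = [
    {"id": 1, "chunk_id": 0, "text": "Practice SQL window functions."},
    {"id": 1, "chunk_id": 1, "text": "Review SQL joins."},
    {"id": 2, "chunk_id": 0, "text": "Read about vector databases."},
]

def make_chunk_id(chunk):
    """Build the same ID string used when inserting into the collection."""
    return f"chunk_{chunk['id']}_{chunk['chunk_id']}"

def find_duplicate_ids(chunks):
    """Return a sorted list of IDs that occur more than once."""
    counts = Counter(make_chunk_id(c) for c in chunks)
    return sorted(chunk_id for chunk_id, n in counts.items() if n > 1)

print(find_duplicate_ids(sample_chunks))  # [] (no duplicates in the sample)
```

If the list is not empty, the chunking step probably reused a `chunk_id` within one document, and that is worth fixing before calling `collection.add`.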

## Retrieving Relevant Chunks and Constructing Prompts

When a user asks a question, we want to find the most relevant chunks. This is **semantic search**: we look for text with similar meaning, not just matching words.
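
To make "similar meaning" concrete, here is a small sketch of ranking by vector distance. It assumes made-up 3-dimensional vectors in place of real embeddings and uses cosine distance purely as an illustration; the metric ChromaDB applies depends on how the collection is configured, and `cosine_distance` is a hypothetical helper.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Made-up vectors standing in for real embeddings (assumption for illustration)
query_vec = [0.9, 0.1, 0.0]
chunk_vecs = {
    "SQL joins": [0.8, 0.2, 0.1],
    "Cooking pasta": [0.0, 0.1, 0.9],
}

# Sort chunk names from smallest to largest distance to the query
ranked = sorted(chunk_vecs, key=lambda name: cosine_distance(query_vec, chunk_vecs[name]))
print(ranked[0])  # SQL joins
```

The chunk whose vector points in nearly the same direction as the query ranks first, even though no words were compared.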

Here’s how to retrieve the top relevant chunks, with detailed comments:
```python
def retrieve_top_chunks(query, collection, top_k=1):
    # Use the embedding model to convert the query to a vector and search for similar chunks
    results = collection.query(query_texts=[query], n_results=top_k)
    # Format the results for easy use
    return [
        {
            "chunk": results['documents'][0][i],  # The retrieved text chunk
            "id": results['metadatas'][0][i]['id'],  # The original document ID
            "distance": results['distances'][0][i]  # Distance to the query (lower means more similar)
        }
        for i in range(len(results['documents'][0]))
    ]

query = "What are my learning plans for SQL?"
retrieved_chunks = retrieve_top_chunks(query, rag_collection, top_k=2)
print(retrieved_chunks)  # Example output: [{'chunk': 'Review different types of SQL joins — especially LEFT and FULL OUTER joins.', 'id': 2, 'distance': 0.12345}, ...]
```

- The function converts the query into a vector and finds the most similar chunks in the collection.
- The results include the chunk text, the original document ID, and a distance to the query (lower means more similar).
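
Because `top_k` always returns that many chunks, some of them may be only loosely related to the query. One option is to drop results above a distance cutoff. This is a sketch only: `filter_by_distance` is a hypothetical helper, the sample results are invented, and a sensible `max_distance` depends on the embedding model and distance metric in use.

```python
def filter_by_distance(retrieved, max_distance=0.5):
    """Keep only chunks whose distance is at or below max_distance."""
    return [rc for rc in retrieved if rc["distance"] <= max_distance]

# Invented results in the same shape retrieve_top_chunks returns
sample_results = [
    {"chunk": "Review different types of SQL joins.", "id": 2, "distance": 0.12},
    {"chunk": "Plan a weekend hiking trip.", "id": 7, "distance": 0.91},
]

relevant = filter_by_distance(sample_results, max_distance=0.5)
print([rc["id"] for rc in relevant])  # [2]
```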

To help the agent answer accurately, we build a prompt with the user’s question and the retrieved context. This way, the agent “sees” the most relevant information when generating a response.

Here’s a function to build the prompt, with comments:

```python
def build_prompt(user_prompt, retrieved_chunks=None):
    # Combine the user's question with the most relevant context chunks.
    # The default is None rather than [] to avoid a shared mutable default argument.
    prompt = f"Question: {user_prompt}\nContext:\n"
    for rc in retrieved_chunks or []:
        prompt += f"- {rc['chunk']}\n"
    prompt += "Answer:"
    return prompt

prompt = build_prompt(query, retrieved_chunks)
print(prompt)  
# Example output:
# Question: What are my learning plans for SQL?
# Context:
# - Review different types of SQL joins — especially LEFT and FULL OUTER joins.
# - ...
# Answer:
```

- This function creates a prompt that includes the user’s question and the most relevant context chunks.
- The agent can now use this prompt to generate a more accurate answer.
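
When many chunks are retrieved, the context can grow longer than the model handles well. The sketch below is one possible variant that stops adding chunks once a character budget is reached. The function name `build_prompt_with_budget` and the `max_context_chars` parameter are hypothetical, and counting characters is only a rough stand-in for counting tokens.

```python
def build_prompt_with_budget(user_prompt, retrieved_chunks=None, max_context_chars=200):
    """Build a prompt, adding chunks in order until the character budget is used up."""
    lines = []
    used = 0
    for rc in retrieved_chunks or []:
        line = f"- {rc['chunk']}\n"
        if used + len(line) > max_context_chars:
            break  # The next chunk would exceed the budget, so stop here
        lines.append(line)
        used += len(line)
    return f"Question: {user_prompt}\nContext:\n{''.join(lines)}Answer:"

# Invented chunks; only the first fits within a 60-character budget
demo_chunks = [{"chunk": "a" * 50}, {"chunk": "b" * 50}]
print(build_prompt_with_budget("q", demo_chunks, max_context_chars=60))
```

Since retrieval returns chunks ordered from most to least similar, cutting off at the budget drops the least relevant context first.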

## Lesson Summary and Practice Introduction

You’ve learned how to build and optimize a RAG collection:

- Load and prepare your data
- Chunk documents for better retrieval
- Store chunks in a vector database
- Retrieve the most relevant information for a query
- Build prompts that combine user questions with context

These steps are key to creating AI agents that use your knowledge base to answer questions effectively.

Now it’s your turn! In the next section, you’ll practice building and optimizing your own RAG collection — loading data, chunking it, storing it in a vector database, and retrieving relevant information to answer questions. Let’s put your new skills to work!