# Introduction

In the previous notebook, we learned how to create embeddings from text data and store them in a persistent Chroma database. That gave us a powerful foundation: a searchable knowledge base where meaning, not just keywords, is captured.

Now we’ll build on that foundation. This notebook has two sections:

**1. Semantic Search with ChromaDB**:

We’ll explore how to **query our database and retrieve the most relevant results based on semantic similarity**. Instead of simply matching words, we’ll be able to find passages that mean the same thing, even if they’re phrased differently. This is the first step toward building truly intelligent search systems.
  
**2. Retrieval-Augmented Generation (RAG)**:

Once we understand how to perform semantic search, we’ll build a Retrieval-Augmented Generation (RAG) pipeline. This is where **we combine information retrieval (finding the right documents) with large language models** that can generate natural, human-like answers.

We’ll implement RAG in two different ways:

- **Using OpenAI’s model**: We’ll connect our Chroma database to an OpenAI model, letting it use retrieved context to answer questions in a conversational way.
- **Using a local Hugging Face model**: We’ll replicate the same process using a local model, showing how you can achieve RAG workflows without relying on cloud APIs.




# 1. Semantic Search with ChromaDB

Let's jump straight into using ChromaDB to perform searches against the database we built earlier.

In [None]:
# Importing
import chromadb
import os

First, we reconnect to the same on-disk database by creating a `PersistentClient` with the exact path we used before. This doesn’t recreate anything. It just points the new notebook at the existing Chroma files so all our collections and records are available again.

In [None]:
# Reconnecting to the persistent DB
chroma_client = chromadb.PersistentClient(path="./db/chroma_persist")

Next, we reopen the collection we created earlier, called "my_documents_locally":

In [None]:
# Reopening the collection
local_collection = chroma_client.get_collection("my_documents_locally")

Now we can run a semantic search using the `query()` method. When we call it, Chroma performs two steps:
1. It **embeds our query text** using the same embedding function that was used for the documents in the collection.
2. It **compares the query embedding with all stored embeddings** and then returns the most relevant results.

In this example, we tell Chroma to return the top 2 most relevant documents:

In [None]:
query_text = "What are the foundational principles and technologies used to secure modern internet traffic?"

results = local_collection.query(
    query_texts = [query_text],
    n_results = 2,
)
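
The two steps Chroma performs can be pictured with a small brute-force sketch. This is only an illustration under stated assumptions: the embeddings are hand-written 2-D toy vectors, `toy_store`, `squared_l2`, and `top_k` are hypothetical names, and squared Euclidean distance is assumed as the metric. A real collection embeds the text for you and typically relies on a nearest-neighbor index rather than scanning every vector.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_embedding, stored_embeddings, k=2):
    """Return the k (id, distance) pairs closest to the query, nearest first."""
    scored = [
        (doc_id, squared_l2(query_embedding, embedding))
        for doc_id, embedding in stored_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = closer match
    return scored[:k]


# Toy 2-D "embeddings" (hypothetical; real embeddings have hundreds of dimensions)
toy_store = {
    "doc_tls": [1.0, 0.0],
    "doc_cooking": [0.0, 1.0],
    "doc_https": [0.9, 0.1],
}

nearest = top_k([1.0, 0.0], toy_store, k=2)
for doc_id, distance in nearest:
    print(f"{doc_id}: {distance:.4f}")
```

The unrelated `doc_cooking` vector points in a different direction from the query, so it ends up farthest away and is dropped from the top 2.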

To make the output easier to read, let’s loop through the results and print a short preview of each one. The output includes **the matching documents, their IDs, and the distance scores**. A distance score measures how far apart two embeddings are, so the smaller the distance, the closer the match.

In [None]:
# Displaying results in readable format
for rank, (document, document_id, distance_score) in enumerate(
    zip(results["documents"][0], results["ids"][0], results["distances"][0]),
    start=1):
    preview = document[:600]
    print(f"Result {rank}")
    print(f"• ID: {document_id}")
    print(f"• Distance: {distance_score:.4f}")
    print(f"• Preview: {preview}…")
    print("-" * 80)
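
The `[0]` index in the loop above is needed because `query()` accepts a list of query texts and returns one inner list of hits per query. The sketch below uses a hand-built dictionary that mimics that shape (the IDs, texts, and distances are invented for illustration) and a hypothetical helper, `rows_for_query`, that pulls out the hits for a single query.

```python
def rows_for_query(results, query_index=0):
    """Turn one query's slice of a Chroma-style result dict into a list of row dicts."""
    return [
        {"id": doc_id, "document": document, "distance": distance}
        for doc_id, document, distance in zip(
            results["ids"][query_index],
            results["documents"][query_index],
            results["distances"][query_index],
        )
    ]


# Mock result for TWO query texts with two hits each (all values are invented)
mock_results = {
    "ids": [["doc_3", "doc_7"], ["doc_1", "doc_3"]],
    "documents": [
        ["Passage about TLS", "Passage about HTTPS"],
        ["Passage about phishing", "Passage about TLS"],
    ],
    "distances": [[0.41, 0.58], [0.37, 0.92]],
}

first_query_rows = rows_for_query(mock_results, query_index=0)
second_query_rows = rows_for_query(mock_results, query_index=1)
print(first_query_rows[0]["id"], second_query_rows[0]["id"])
```

With a single query text, as in this notebook, only index `0` exists.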

### 📝 EXERCISE 1: Try Your Own Semantic Search (5-7 minutes)

**What you'll practice:** Running semantic searches and understanding how query phrasing affects results.

**Your task:**
1. Create your own query about a cybersecurity topic (e.g., "How do hackers steal passwords?", "What is modern network security?", "How do viruses work?")
2. Use `local_collection.query()` to search for the top 3 most relevant documents
3. Print the results showing the document IDs and distance scores
4. Experiment: Try rephrasing your question differently - do you get similar results?

**Hint:** Use the same structure as the example above. Remember to set `n_results=3` to get 3 documents.

**Expected outcome:** You should see that semantically similar questions return similar documents, even if worded differently. Lower distance scores mean better matches.

In [None]:
# YOUR CODE HERE
# Example solution structure:
# my_query = "Your question here"
# results = local_collection.query(
#     query_texts=[my_query],
#     n_results=3
# )
# 
# for rank, (doc, doc_id, distance) in enumerate(
#     zip(results["documents"][0], results["ids"][0], results["distances"][0]),
#     start=1):
#     print(f"Result {rank}: {doc_id}, Distance: {distance:.4f}")
#     print(f"Preview: {doc[:200]}...\n")
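
For step 4 of the exercise, one rough way to compare two phrasings is to measure how many document IDs their result lists share. The helper below is a hypothetical sketch (`id_overlap` is not part of Chroma) that computes the Jaccard overlap of two ID lists. The IDs shown are placeholders for the values you would read from `results["ids"][0]`.

```python
def id_overlap(ids_a, ids_b):
    """Jaccard overlap of two ID lists: 1.0 = same documents, 0.0 = none shared."""
    set_a, set_b = set(ids_a), set(ids_b)
    if not set_a and not set_b:
        return 1.0  # two empty result lists are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)


# Placeholder IDs standing in for the results of two differently worded queries
ids_first_phrasing = ["doc_2", "doc_5", "doc_9"]
ids_second_phrasing = ["doc_5", "doc_2", "doc_4"]

overlap = id_overlap(ids_first_phrasing, ids_second_phrasing)
print(f"Overlap: {overlap:.2f}")
```

The measure ignores ranking order, so it only tells you whether the same documents came back, not whether they came back in the same positions.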

# 2. Retrieval-Augmented Generation (RAG)

We’ll keep doing the same search, but now we’ll hand the retrieved text to an LLM and have it compose an answer while staying grounded in our documents.

## 2.1 OpenAI API

We’ll access an LLM through the OpenAI API. To do this, we first load the API key from the environment and then create a client object that we will use to send requests to the model.

In [None]:
import os

# Configure OpenAI API key
OPENAI_API_KEY = None

try:
    from google.colab import userdata  # type: ignore
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    if OPENAI_API_KEY:
        print('✅ API key loaded from Colab secrets')
except Exception:
    pass

if not OPENAI_API_KEY:
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

if not OPENAI_API_KEY:
    try:
        from getpass import getpass
        print('💡 To use Colab secrets: Go to 🔑 (left sidebar) → Add new secret → Name: OPENAI_API_KEY')
        OPENAI_API_KEY = getpass('Enter your OpenAI API Key: ')
    except Exception as exc:
        raise ValueError('❌ ERROR: No API key provided! Set OPENAI_API_KEY as an environment variable or Colab secret.') from exc

if not OPENAI_API_KEY or OPENAI_API_KEY.strip() == '':
    raise ValueError('❌ ERROR: No API key provided!')

os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

print('✅ Authentication configured!')

OPENAI_MODEL = 'gpt-5-nano'  # Using gpt-5-nano for cost efficiency
print(f'🤖 Selected Model: {OPENAI_MODEL}')

OPENAI_EMBED_MODEL = 'text-embedding-3-small'
print(f'🧠 Embedding Model: {OPENAI_EMBED_MODEL}')


In [None]:
import openai
from openai import OpenAI

In [None]:
# Loading API key from the environment
openai.api_key = OPENAI_API_KEY

# Creating a client that we'll use to call the API
client = OpenAI(api_key=OPENAI_API_KEY)

We will query the Chroma collection to get the most relevant passages for the question. Here we ask for the top 3 results. Chroma returns the IDs of the matches automatically, in addition to the fields listed in `include`, and these IDs help us later show where the information came from. If no results are found, we stop early and tell the user that nothing matched.

In [None]:
query_text = "What is the difference between symmetric and asymmetric encryption?"

# Step 1: Retrieving relevant passages from Chroma
response = local_collection.query(
    query_texts = [query_text],
    n_results = 3,
    include = ["documents"]
)

docs = response["documents"][0]
ids = response["ids"][0]

if not docs:
    print("No relevant passages found in the vector store. Try rephrasing the question.")
else:
    # Step 2: Packing retrieved results into a labeled context block to make the input clearer for the LLM
    context_blocks = [f"[{i+1} | {ids[i]}]\n{docs[i]}" for i in range(len(docs))]
    context = "\n\n".join(context_blocks)
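
Note that in the cell above `context` is only defined when passages were found, so a later cell that uses it would fail on an empty result. One way to make that case explicit is to wrap the formatting in a small function. This is a sketch: `build_context` is a hypothetical helper name, and the IDs and texts are placeholders.

```python
def build_context(ids, docs):
    """Join retrieved passages into one labeled block, or return None if there are none."""
    if not docs:
        return None
    blocks = [
        f"[{position} | {doc_id}]\n{doc}"
        for position, (doc_id, doc) in enumerate(zip(ids, docs), start=1)
    ]
    return "\n\n".join(blocks)


# Placeholder inputs standing in for response["ids"][0] and response["documents"][0]
example_context = build_context(
    ["doc_4", "doc_8"],
    ["Symmetric encryption uses one shared key.", "Asymmetric encryption uses a key pair."],
)
print(example_context)
```

A caller can then check `if context is None` before contacting the model.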

Finally, we'll send the retrieved passages along with the question to the OpenAI model `gpt-5-nano`. The `system` message is an instruction that controls how the model should behave. The `user` message then supplies both the sources and the question, so the model has all the context it needs.

### 📝 EXERCISE 2: Build Your Own RAG Query with OpenAI (10-12 minutes)

**What you'll practice:** Creating a complete RAG pipeline from retrieval to answer generation.

**Your task:**
1. Think of a new question about the cybersecurity documents (e.g., "What is Stuxnet?", "How does Zero Trust work?", "What makes phishing attacks successful?")
2. Retrieve the top 2 relevant documents from `local_collection`
3. Create a context block from the retrieved documents
4. Send the context and your question to OpenAI's API to generate an answer
5. Print the final answer

**Hint:** Follow the same retrieve-then-generate pattern used in this section. You'll need to:
- Use `local_collection.query()` to retrieve documents
- Format the context with document IDs
- Use `client.chat.completions.create()` to generate the answer

**Expected outcome:** You should get a clear, cited answer that references specific source documents.

In [None]:
# YOUR CODE HERE
# Example solution structure:
# 
# my_question = "Your question here"
# 
# # Step 1: Retrieve
# response = local_collection.query(
#     query_texts=[my_question],
#     n_results=2,
#     include=["documents"]
# )
# 
# docs = response["documents"][0]
# ids = response["ids"][0]
# 
# # Step 2: Format context
# context_blocks = [f"[{i+1} | {ids[i]}]\n{docs[i]}" for i in range(len(docs))]
# context = "\n\n".join(context_blocks)
# 
# # Step 3: Generate answer
# resp = client.chat.completions.create(
#     model=OPENAI_MODEL,  # temperature is left at its default; GPT-5 family models may reject other values
#     messages=[
#         {"role": "system", "content": "You are a precise assistant. Answer using only the provided sources and cite them."},
#         {"role": "user", "content": f"SOURCES:\n{context}\n\nQUESTION: {my_question}\n\nANSWER:"}
#     ]
# )
# 
# print(resp.choices[0].message.content)

In [None]:
# Step 3: Generating an answer using the retrieved context
# Note: GPT-5 family models are reasoning models. With them, the Chat Completions API
# expects `max_completion_tokens` instead of `max_tokens` and accepts only the default temperature.
# Reasoning tokens count toward the limit, so it is set generously here.
resp = client.chat.completions.create(
    model = OPENAI_MODEL,
    max_completion_tokens = 2000,
    messages=[
        {
            "role": "system",
            "content": (
                # Trailing spaces keep the concatenated sentences from running together
                "You are a precise assistant that must answer ONLY using the provided sources. "
                "If the answer is not fully supported by the sources, explain that the information is not in the provided material and suggest how the user might rephrase their question. "
                "At the end, cite sources by their bracket labels, e.g., [1], [2]."
            ),
        },
        {
            "role": "user",
            "content": f"SOURCES:\n{context}\n\nQUESTION: {query_text}\n\nANSWER:"
        },
    ],
)

print(resp.choices[0].message.content)
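
Because the context labels each passage as `[n | id]`, the bracket numbers in the model's answer can be mapped back to document IDs. The sketch below assumes the model follows the citation instruction and writes labels such as `[1]`. `cited_ids` is a hypothetical helper, and the answer text and IDs are made up.

```python
import re


def cited_ids(answer_text, ids):
    """Map bracket citations like [2] in the answer back to the retrieved document IDs."""
    found = []
    for match in re.finditer(r"\[(\d+)\]", answer_text):
        position = int(match.group(1))
        # Ignore labels that do not correspond to a retrieved passage, and skip repeats
        if 1 <= position <= len(ids) and ids[position - 1] not in found:
            found.append(ids[position - 1])
    return found


# Made-up answer and placeholder IDs (in the notebook: resp.choices[0].message.content and ids)
sample_answer = "Symmetric encryption shares one key [1], while asymmetric uses a key pair [3]. [1] [7]"
sample_ids = ["doc_4", "doc_8", "doc_2"]

print(cited_ids(sample_answer, sample_ids))
```

A label such as `[7]` that points past the retrieved passages is silently ignored here. In practice it may be worth flagging, since it suggests the model cited something it was not given.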

## 2.2 Local Model from Hugging Face

In the previous section, we combined local retrieval with ChromaDB (embeddings stored on disk) and then sent the retrieved passages to OpenAI’s API for generation. That worked well, but it also meant our data had to leave our computer and be processed by an external service.

What if we don’t want to send private documents outside our environment? In this section we’ll keep **everything on-device by running a model locally**. We'll use a pretrained instruction-tuned model from Hugging Face called `Qwen/Qwen2.5-1.5B-Instruct`.

This model is small enough to run on a laptop or a single GPU, yet powerful enough to follow instructions and generate answers. With it, both retrieval and generation happen locally:
- our data never leaves our computer
- we avoid per-request API costs
- we reduce reliance on external services and avoid latency from network calls

You can read more about this model on the [Hugging Face website](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct).

We start by importing the Hugging Face `transformers` library, which gives us the tools to load and run local language models.

In [None]:
# Importing
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

We then load two key components:
- **Tokenizer**: This converts our input text (the context and the question) into numerical tokens the model can understand and process. Later, the tokenizer also converts the model’s output tokens back into human-readable text.
- **Model**: This is the actual neural network with all its trained weights.

In [None]:
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Downloading (on first run) and loading the model weights and config
llm = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map = "auto",       # place the model on GPU if available, CPU otherwise (needs the `accelerate` package)
    torch_dtype = "float16",   # half precision to save memory; on CPU-only machines "float32" may be faster
)
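
To make the tokenizer's two jobs concrete, here is a deliberately tiny stand-in. It is not how the Qwen tokenizer works internally (real tokenizers split text into subword pieces using a learned vocabulary). `ToyTokenizer` is a hypothetical class that only shows the round trip from text to token IDs and back.

```python
class ToyTokenizer:
    """Whitespace tokenizer with a vocabulary built from a sample text."""

    def __init__(self, sample_text):
        words = sorted(set(sample_text.split()))
        self.word_to_id = {word: index for index, word in enumerate(words)}
        self.id_to_word = {index: word for word, index in self.word_to_id.items()}

    def encode(self, text):
        # Raises KeyError for unknown words; real tokenizers fall back to subword pieces
        return [self.word_to_id[word] for word in text.split()]

    def decode(self, token_ids):
        return " ".join(self.id_to_word[token_id] for token_id in token_ids)


toy_tokenizer = ToyTokenizer("what is symmetric encryption")
token_ids = toy_tokenizer.encode("symmetric encryption")
round_trip = toy_tokenizer.decode(token_ids)
print(token_ids, "->", round_trip)
```

The model itself only ever sees the integer IDs. The text is recovered by decoding at the very end.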

Now we'll create a custom function that ties everything together. First, it opens the Chroma collection and retrieves the top-k most relevant documents for the given question. Those documents are combined into a prompt that instructs the model to answer using only the retrieved context. The prompt is then tokenized and passed to the language model for generation. The model produces new tokens, which are decoded back into text to form the final answer.

In [None]:
def get_answer(client, collection_name: str, question: str, k: int = 2, max_new_tokens: int = 120):
    # Opening the collection
    col = client.get_collection(collection_name)

    # Retrieving top-k documents
    res = col.query(query_texts=[question], n_results=k, include=["documents"])
    docs = res["documents"][0] if res.get("documents") else []
    if not docs:
        return "Insufficient context."

    # The prompt
    context = "\n\n".join(docs)
    prompt = (
        "You are a precise assistant.\n"
        "Answer ONLY using the CONTEXT below.\n"
        "If the CONTEXT is insufficient, reply exactly: Insufficient context.\n"
        "Keep the answer short (1–2 sentences).\n\n"
        "----- CONTEXT -----\n"
        f"{context}\n"
        "-------------------\n"
        f"QUESTION: {question}\n"
        "Answer:"
    )

    # Tokenizing and generating
    inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)
    outputs = llm.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )

    # Decoding only the newly generated tokens
    answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return answer.strip()
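
The slice in the decoding step is needed because, for a decoder-only model, `generate()` returns the prompt tokens followed by the newly generated ones. The sketch below uses plain Python lists with made-up token IDs to show the same slicing. `new_tokens_only` is a hypothetical helper.

```python
def new_tokens_only(output_ids, prompt_length):
    """Drop the echoed prompt and keep only the tokens generated after it."""
    return output_ids[prompt_length:]


# Made-up token IDs: the first four are the prompt, the rest are the model's continuation
prompt_ids = [101, 7, 42, 9]
full_output_ids = prompt_ids + [55, 13, 2]

generated_ids = new_tokens_only(full_output_ids, len(prompt_ids))
print(generated_ids)
```

Without the slice, the decoded answer would begin with the entire prompt, including the retrieved context.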

In [None]:
question = "What is the difference between symmetric and asymmetric encryption?"

# Calling the RAG function
answer = get_answer(
    client = chroma_client,
    collection_name = "my_documents_locally",
    question = question,
    k = 2
)

# Displaying the output
print("Question:", question)
print("Answer:", answer)