# 🧠 VectorDB Retrieval-Augmented Generation (RAG) Notebook

This notebook demonstrates how to build a local, persistent Vector Database (using ChromaDB) for semantic search and question answering over a collection of documents in various formats (`.pdf`, `.md`, `.txt`). It leverages local models from Ollama for both embedding and generation, making the entire workflow private and offline-friendly.

## Purpose

The goal is to support Retrieval-Augmented Generation (RAG) using LangChain with:
1. **Flexible ingestion** of documents from multiple file formats.
2. **Efficient updates** to the VectorDB when new documents are added.
3. **Persistent storage**, so data does not need to be reprocessed in every session.
4. **Interactive Q&A**, allowing natural language queries over the content.

## Design Choice

We considered converting all document types to Markdown using [`markitdown`](https://github.com/microsoft/markitdown) for a unified format. However, we chose **not** to take that path, in order to preserve the **fidelity and structure** of original formats like PDFs, which often contain critical layout and semantic cues that can be lost in conversion. Instead, we use a dedicated LangChain loader for each format: `PyMuPDFLoader` for PDFs, `UnstructuredMarkdownLoader` for Markdown, and `TextLoader` for plain text.
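To make the per-format idea concrete, here is a minimal sketch that previews which loader would handle each file, keyed by file extension. The notebook itself routes files by subdirectory (`pdfs`, `markdowns`, `txt`); the `LOADER_BY_SUFFIX` table and the `plan_loaders` helper are hypothetical names introduced only for this illustration, and the sketch uses the standard library alone so it runs without LangChain installed.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the loader class name used in this notebook.
LOADER_BY_SUFFIX = {
    ".pdf": "PyMuPDFLoader",
    ".md": "UnstructuredMarkdownLoader",
    ".txt": "TextLoader",
}


def plan_loaders(base_dir):
    """Group files under base_dir by the name of the loader that would handle them."""
    plan = {}
    for path in sorted(Path(base_dir).rglob("*")):
        # Suffix matching is case-insensitive; unsupported suffixes are skipped.
        loader_name = LOADER_BY_SUFFIX.get(path.suffix.lower())
        if path.is_file() and loader_name is not None:
            plan.setdefault(loader_name, []).append(path.name)
    return plan
```

Calling `plan_loaders("./data")` before a build is one way to check that every file will be picked up by some loader; files with other extensions are silently ignored here, which may or may not be the behavior you want.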


In [1]:
# Install necessary packages
#!pip install --upgrade pip --quiet
#!pip install --upgrade langchain langchain-community langchain-ollama langchain-chroma chromadb pymupdf "unstructured[local-inference]" ollama --quiet

# Required Ollama models - ensure these are installed:
# ollama pull gemma3
# ollama pull nomic-embed-text

In [2]:
# --- Configuration ---
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings, OllamaLLM
from langchain_chroma import Chroma

# Disable Chroma's anonymized telemetry (must be set before the client is created)
os.environ["ANONYMIZED_TELEMETRY"] = "False"

# Define directories
DATA_DIR = "./data"
CHROMA_DIR = "./chromadb_store"

SUPPORTED_FORMATS = {
    "pdfs": "PyMuPDFLoader",
    "markdowns": "UnstructuredMarkdownLoader",
    "txt": "TextLoader",
}

EMBEDDING_MODEL = "nomic-embed-text"
LLM_MODEL = "gemma3"


# --- Utility: Write Permission Check ---
def assert_directory_writable(path):
    os.makedirs(path, exist_ok=True)
    test_file = os.path.join(path, "test_write.tmp")
    try:
        with open(test_file, "w") as f:
            f.write("test")
        os.remove(test_file)
        print(f"✅ Confirmed write access to: {path}")
    except Exception as e:
        raise PermissionError(f"❌ Cannot write to {path}: {e}")


# --- Document Loaders ---
from langchain_community.document_loaders import (
    DirectoryLoader,
    PyMuPDFLoader,
    TextLoader,
    UnstructuredMarkdownLoader,
)


def load_documents(base_dir, formats_dict):
    glob_patterns = {
        PyMuPDFLoader: "*.pdf",
        UnstructuredMarkdownLoader: "*.md",
        TextLoader: "*.txt",
    }
    loaders = {
        "PyMuPDFLoader": PyMuPDFLoader,
        "UnstructuredMarkdownLoader": UnstructuredMarkdownLoader,
        "TextLoader": TextLoader,
    }
    docs = []
    for subdir, loader_name in formats_dict.items():
        dir_path = os.path.join(base_dir, subdir)
        os.makedirs(dir_path, exist_ok=True)
        loader_cls = loaders[loader_name]
        pattern = glob_patterns[loader_cls]
        loader = DirectoryLoader(
            dir_path, glob=pattern, loader_cls=loader_cls, show_progress=True
        )
        loaded = loader.load()
        print(f"📂 Loaded {len(loaded)} from {dir_path}")
        if loaded:
            print(f"📝 Sample:\n{loaded[0].page_content[:500]}")
        docs.extend(loaded)
    print(f"📄 Total docs loaded: {len(docs)}")
    return docs


# --- Text Splitting ---
def split_documents(docs):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
        length_function=len,
    )
    chunks = splitter.split_documents(docs)
    print(f"🔪 {len(chunks)} chunks created")
    return chunks


# --- Vector Store Build ---
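def build_vectordb(docs, embedding_model, persist_dir):
    assert_directory_writable(persist_dir)
    chunks = split_documents(docs)
    # Drop any previously persisted collection first; otherwise a rebuild
    # appends the same chunks again and retrieval returns duplicates.
    Chroma(
        persist_directory=persist_dir, embedding_function=embedding_model
    ).delete_collection()
    vectordb = Chroma.from_documents(
        documents=chunks, embedding=embedding_model, persist_directory=persist_dir
    )
    print(f"✅ VectorDB built at: {persist_dir}")
    return vectordb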


# --- Vector Store Load ---
def load_vectordb(persist_dir, embedding_model):
    vectordb = Chroma(persist_directory=persist_dir, embedding_function=embedding_model)
    print(f"✅ VectorDB loaded from: {persist_dir}")
    return vectordb


# --- QA Chain Setup ---
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

qa_template = """
You are a helpful assistant answering questions based on the provided context.
Use the following pieces of context to answer the user's question. If unsure, state you don't know.

Context:
{context}

Question: {question}

Answer:
"""

prompt = PromptTemplate.from_template(qa_template)


def create_qa_chain(vectordb):
    try:
        llm = OllamaLLM(model=LLM_MODEL)
        # Test LLM model
        llm.invoke("test")
        print(f"🤖 LLM ready: {LLM_MODEL}")
    except Exception as e:
        raise RuntimeError(f"❌ LLM model '{LLM_MODEL}' not available. Run: ollama pull {LLM_MODEL}. Error: {e}")
    return RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vectordb.as_retriever(search_kwargs={"k": 5}),
        return_source_documents=True,
        chain_type="stuff",
        chain_type_kwargs={"prompt": prompt},
    )


# --- Build Pipeline ---
def initialize_pipeline(build=True):
    """
    Initializes the RAG pipeline using ChromaDB and Ollama.

    Parameters:
        build (bool): If True, (re)build the vector database from the document source.
                      If False, load the existing persisted vector store.

    Returns:
        RetrievalQA: A ready-to-use QA chain for question answering.
    """
    print("🚀 Initializing pipeline...")
    
    # Validate Ollama models are available
    try:
        embedding_model = OllamaEmbeddings(model=EMBEDDING_MODEL)
        # Test embedding model
        embedding_model.embed_query("test")
        print(f"✅ Embedding model '{EMBEDDING_MODEL}' is ready")
    except Exception as e:
        raise RuntimeError(f"❌ Embedding model '{EMBEDDING_MODEL}' not available. Run: ollama pull {EMBEDDING_MODEL}. Error: {e}")

    if build or not os.path.exists(os.path.join(CHROMA_DIR, "chroma.sqlite3")):
        docs = load_documents(DATA_DIR, SUPPORTED_FORMATS)
        vectordb = build_vectordb(docs, embedding_model, CHROMA_DIR)
    else:
        vectordb = load_vectordb(CHROMA_DIR, embedding_model)

    qa_chain = create_qa_chain(vectordb)
    return qa_chain

## (Re)build the vector database from the document source
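During a build, `split_documents` cuts every loaded document into chunks of up to 1000 characters that overlap by 200. The sketch below illustrates only the overlap arithmetic with plain fixed-size windows. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break at the listed separators, so its real chunks are usually shorter and end at different positions. The `sliding_chunks` helper is a hypothetical name used for this illustration.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2600))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])  # [1000, 1000, 1000]
```

With these settings each new window advances by 800 characters, so the last 200 characters of one chunk reappear at the start of the next. That repetition is what lets a sentence that straddles a boundary still appear whole in at least one chunk.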

In [3]:
qa_chain = initialize_pipeline(
    build=True
)  # Use build=True to (re)build the vector database from the document source.

🚀 Initializing pipeline...
✅ Embedding model 'nomic-embed-text' is ready


100%|██████████| 20/20 [00:00<00:00, 20.36it/s]


📂 Loaded 406 from ./data/pdfs
📝 Sample:
ETSI TS 103 994-1 V1.1.1 (2024-03) 
Cyber Security (CYBER);  
Privileged Access Workstations; 
Part 1: Physical Device 
 
 
 
TECHNICAL SPECIFICATION


100%|██████████| 8/8 [00:11<00:00,  1.50s/it]


📂 Loaded 8 from ./data/markdowns
📝 Sample:
Python Networking Expansion

Reference source files: ClientPython.py, ServerPython.py

This tutorial will complete the functionality of the client, making it usable (when paired with an equally functional server, which will be written later) for real-world capabilities by adding the networking and logic code necessary for it to exchange query and log data with a server. This will also allow us to explore additional features and nuances of networking in Python.

Using SSL to secure connections

B


0it [00:00, ?it/s]


📂 Loaded 0 from ./data/txt
📄 Total docs loaded: 414
✅ Confirmed write access to: ./chromadb_store
🔪 1730 chunks created
✅ VectorDB built at: ./chromadb_store
🤖 LLM ready: gemma3


## Load the existing persisted vector store & ask a question

In [4]:
from IPython.display import Markdown, display

qa_chain = initialize_pipeline(
    build=False
)  # Use build=False to load the existing persisted vector store


# Ask a question
def ask(question):
    print(f"\n❓ Question: {question}")
    response = qa_chain.invoke({"query": question})
    display(Markdown(f"**Answer:**\n\n{response['result']}"))

    print("\n📄 Sources:")
    seen_sources = set()
    for doc in response["source_documents"][:2]:
        source_info = f"{os.path.basename(doc.metadata.get('source', 'unknown'))}, Page {doc.metadata.get('page', 'N/A')}"
        if source_info not in seen_sources:
            print(source_info)
            seen_sources.add(source_info)

ask("Why are pickles bad?")
ask("What is a ClickFix?")
ask("What is LummaC2?")

🚀 Initializing pipeline...
✅ Embedding model 'nomic-embed-text' is ready
✅ VectorDB loaded from: ./chromadb_store
🤖 LLM ready: gemma3

❓ Question: Why are pickles bad?


**Answer:**

I understand you're asking a question, but the provided context doesn't contain any information about pickles or why they might be bad. It focuses on a software product’s security.


📄 Sources:
E02781980_Telecommunications_Security_CoP_Accessible.pdf, Page 120

❓ Question: What is a ClickFix?


**Answer:**

The ClickFix tactic deceives users into downloading and running malware on their machines without them knowing. Threat actors initiate these campaigns by logging into websites with stolen credentials and installing fake plugins in compromised environments. Once installed, the plugins inject malicious JavaScript containing fake browser update malware that uses blockchain and smart contracts to obtain malicious payloads. When executed in the browser, JavaScript presents users with fake browser update notifications that guide them to install malware on their computer (usually remote access trojans and various infostealers like Vidar Stealer, DarkGate, and Lumma Stealer).


📄 Sources:
clickfix-attacks-sector-alert-tlpclear.pdf, Page 0

❓ Question: What is LummaC2?


**Answer:**

LummaC2.exe is a file that, upon execution, enters a main routine with four sub-routines. The first routine decrypts strings for a message box displayed to the user.


📄 Sources:
aa25-141b-threat-actors-deploy-lummac2-malware-to-exfiltrate-sensitive-data-from-organizations.pdf, Page 1


## Pro Tip
Reuse the same `qa_chain` object throughout the notebook. There is no need to reload or rebuild it unless you have added new source files under `./data/`.

If you did add new files, you'd need to:
1. Re-load and split documents.
2. Re-embed and persist them.
3. Optionally reinitialize the qa_chain.
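One way to avoid re-embedding everything when only a few files are new is to give each chunk a stable, content-derived ID and skip IDs that are already stored. The following is a minimal sketch of that idea using only the standard library; `chunk_id` and `select_new_chunks` are hypothetical helpers, and chunks are represented as plain `(source, text)` tuples rather than LangChain `Document` objects.

```python
import hashlib


def chunk_id(source, text):
    """Derive a stable ID from a chunk's source path and its text."""
    digest = hashlib.sha256()
    digest.update(source.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") hash differently
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()


def select_new_chunks(chunks, existing_ids):
    """Return (id, chunk) pairs whose ID is not already stored."""
    fresh = []
    seen = set(existing_ids)
    for source, text in chunks:
        cid = chunk_id(source, text)
        if cid not in seen:
            seen.add(cid)  # also drops duplicates within this batch
            fresh.append((cid, (source, text)))
    return fresh
```

In this notebook, the existing IDs could presumably be read with `vectordb.get()["ids"]` and the new chunks added with `vectordb.add_documents(..., ids=...)`. Treat both calls as assumptions to verify against your installed `langchain-chroma` version. Note that this scheme only adds chunks; it does not remove chunks whose source file was edited or deleted.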

### Option B: Load Persisted Vector Store and Set Up QA Chain
This cell loads the saved Chroma vector store and sets up a new QA chain, using Ollama for both the LLM and the embeddings. It does no re-indexing or file reprocessing, and it carries its own imports and configuration.


In [5]:
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from IPython.display import Markdown, display
import os

# Disable Chroma's anonymized telemetry (must be set before the client is created)
os.environ["ANONYMIZED_TELEMETRY"] = "False"

# Config
CHROMA_DIR = "./chromadb_store"
EMBEDDING_MODEL = "nomic-embed-text"
LLM_MODEL = "gemma3"

# Prompt template
qa_template = """
You are a helpful assistant answering questions based on the provided context.
Use the following pieces of context to answer the user's question. If unsure, state you don't know.

Context:
{context}

Question: {question}

Answer:
"""
prompt = PromptTemplate.from_template(qa_template)

# Reload vector DB + embedding
embedding_model = OllamaEmbeddings(model=EMBEDDING_MODEL)
vectordb = Chroma(persist_directory=CHROMA_DIR, embedding_function=embedding_model)

# Recreate QA chain
llm = OllamaLLM(model=LLM_MODEL)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,
    chain_type="stuff",
    chain_type_kwargs={"prompt": prompt},
)


# Ask a question
def ask(question):
    print(f"\n❓ Question: {question}")
    response = qa_chain.invoke({"query": question})
    display(Markdown(f"**Answer:**\n\n{response['result']}"))

    print("\n📄 Sources:")
    seen_sources = set()
    for doc in response["source_documents"][:2]:
        source_info = f"{os.path.basename(doc.metadata.get('source', 'unknown'))}, Page {doc.metadata.get('page', 'N/A')}"
        if source_info not in seen_sources:
            print(source_info)
            seen_sources.add(source_info)


In [6]:
ask("What is a ClickFix?")
ask("What is LummaC2?")
ask("Why is the sky blue?")


❓ Question: What is a ClickFix?


**Answer:**

The ClickFix tactic deceives users into downloading and running malware on their machines without realizing it. Threat actors initiate these campaigns by logging into websites with stolen credentials and installing fake plugins in compromised environments. Once installed, the plugins inject malicious JavaScript containing fake browser update malware that uses blockchain and smart contracts to obtain malicious payloads. When executed in the browser, JavaScript presents users with fake browser update notifications that guide them to install malware.


📄 Sources:
clickfix-attacks-sector-alert-tlpclear.pdf, Page 0

❓ Question: What is LummaC2?


**Answer:**

LummaC2.exe is a file that, upon execution, enters a main routine with four sub-routines. The first routine decrypts strings for a message box displayed to the user.


📄 Sources:
aa25-141b-threat-actors-deploy-lummac2-malware-to-exfiltrate-sensitive-data-from-organizations.pdf, Page 1

❓ Question: Why is the sky blue?


**Answer:**

I do not know. The provided context discusses cyber threats, geopolitical monitoring, and security intelligence – it does not contain information about why the sky is blue.


📄 Sources:
CTA-RU-2024-0530.pdf, Page 2


## Summary

This notebook demonstrates a complete RAG pipeline using:
- **ChromaDB** for persistent vector storage
- **Ollama** for local LLM and embedding models
- **LangChain** for document processing and QA chains

Key features:
- Supports multiple document formats (PDF, Markdown, Text)
- Persistent storage - no need to rebuild unless adding new documents
- Local execution - no external API calls required
- Error handling for missing models

To add new documents:
1. Place files in the appropriate `./data/` subdirectory
2. Run `initialize_pipeline(build=True)` to rebuild the vector store
3. Use the updated QA chain for queries