# Fully Local RAG Pipeline with Chroma + Ollama

> **No API key required.** This notebook runs entirely on your local machine using [Ollama](https://ollama.com) for both the LLM and embeddings, and [ChromaDB](https://www.trychroma.com) as the vector store.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/cookbooks/local_rag_chroma_ollama/local_rag_with_chroma_and_ollama.ipynb)

## What you will learn

| Step | Concept |
|------|---------|
| 1 | Load config from `config.yaml` — all tunables in one place |
| 2 | Ingest and chunk a local document with `SentenceSplitter` |
| 3 | Embed chunks with `OllamaEmbedding` and persist in ChromaDB |
| 4 | Query the index with a local LLM (`llama3.2:3b` via Ollama) |
| 5 | Evaluate retrieval quality against a gold Q&A set (hit-rate & MRR) |
| 6 | Explore failure modes: empty context, long queries, hallucination guard |

## Prerequisites

1. **Ollama** installed and running — [download here](https://ollama.com/download)
2. Models pulled:
   ```bash
   ollama pull llama3.2:3b
   ollama pull nomic-embed-text
   ```
3. Python dependencies installed:
   ```bash
   pip install -r requirements.txt
   ```
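
The prerequisites above can also be checked from Python before running the rest of the notebook. The sketch below is illustrative only: `missing_models` and `fetch_tags` are hypothetical helper names, and it assumes the Ollama server exposes its model list at `/api/tags` as a JSON object with a `models` array whose entries carry a `name` field.

```python
import json
import urllib.request

REQUIRED_MODELS = ["llama3.2:3b", "nomic-embed-text"]


def missing_models(tags_payload: dict, required: list[str]) -> list[str]:
    """Return the required models that are absent from an /api/tags payload."""
    installed = {m["name"] for m in tags_payload.get("models", [])}
    # Untagged pulls are typically listed as "<name>:latest", so accept both spellings.
    installed |= {name.removesuffix(":latest") for name in installed}
    return [name for name in required if name not in installed]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the locally available models from a running Ollama server (assumed endpoint)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

Calling `missing_models(fetch_tags(), REQUIRED_MODELS)` in a cell should return an empty list when both models are pulled; a connection error at that point usually means the Ollama server is not running.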

## Directory layout

```
local_rag_chroma_ollama/
├── local_rag_with_chroma_and_ollama.ipynb  ← this notebook
├── config.yaml                              ← all tunables
├── requirements.txt                         ← pinned deps
├── data/
│   └── ai_safety_primer.txt                ← committed sample dataset
└── eval/
    └── gold_qa.json                        ← gold Q&A for retrieval eval
```

## Cell 1 — Install dependencies

Skip this cell if you already ran `pip install -r requirements.txt`.

In [1]:
%pip install -q llama-index-core==0.14.15 llama-index-llms-ollama==0.9.1 llama-index-embeddings-ollama==0.8.6 llama-index-vector-stores-chroma==0.5.5 chromadb==1.5.1 pyyaml==6.0.3


Note: you may need to restart the kernel to use updated packages.


## Cell 2 — Imports and logging

In [2]:
import json
import logging
import os
import shutil
from pathlib import Path

import chromadb
import yaml
from IPython.display import Markdown, display

from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.chroma import ChromaVectorStore

# ── Logging ──────────────────────────────────────────────────────────
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    datefmt="%H:%M:%S",
)
logger = logging.getLogger("local_rag")
logger.info("Imports loaded successfully.")

17:49:30 [INFO] local_rag: Imports loaded successfully.


## Cell 3 — Load configuration from `config.yaml`

All tunables (model names, chunk size, top-k, paths) live in `config.yaml`.
Edit that file to change behaviour — no need to touch notebook code.

In [3]:
CONFIG_PATH = Path("config.yaml")

with CONFIG_PATH.open() as f:
    cfg = yaml.safe_load(f)

logger.info("Config loaded from %s", CONFIG_PATH)
print(json.dumps(cfg, indent=2))

17:49:30 [INFO] local_rag: Config loaded from config.yaml


{
  "llm": {
    "model": "llama3.2:3b",
    "base_url": "http://localhost:11434",
    "temperature": 0.0,
    "request_timeout": 120.0
  },
  "embedding": {
    "model": "nomic-embed-text",
    "base_url": "http://localhost:11434"
  },
  "splitter": {
    "chunk_size": 512,
    "chunk_overlap": 50
  },
  "chroma": {
    "persist_dir": "./chroma_db",
    "collection_name": "ai_safety_rag"
  },
  "data": {
    "input_dir": "./data"
  },
  "retrieval": {
    "similarity_top_k": 3
  },
  "eval": {
    "gold_qa_path": "./eval/gold_qa.json"
  }
}
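
Because every later cell indexes into `cfg` directly, a typo in `config.yaml` only surfaces as a `KeyError` several cells later. A minimal validation sketch, assuming the key layout printed above, is shown here; `validate_config` and `REQUIRED_KEYS` are hypothetical names that are not part of the notebook.

```python
REQUIRED_KEYS = {
    "llm": ["model", "base_url", "temperature", "request_timeout"],
    "embedding": ["model", "base_url"],
    "splitter": ["chunk_size", "chunk_overlap"],
    "chroma": ["persist_dir", "collection_name"],
    "data": ["input_dir"],
    "retrieval": ["similarity_top_k"],
    "eval": ["gold_qa_path"],
}


def validate_config(cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        if not isinstance(cfg.get(section), dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in cfg[section]:
                problems.append(f"missing key: {section}.{key}")

    # An overlap at least as large as the chunk itself cannot make progress.
    splitter = cfg.get("splitter")
    if isinstance(splitter, dict) and {"chunk_size", "chunk_overlap"} <= splitter.keys():
        if splitter["chunk_overlap"] >= splitter["chunk_size"]:
            problems.append("splitter.chunk_overlap must be smaller than splitter.chunk_size")
    return problems
```

Running `validate_config(cfg)` right after loading the YAML and stopping on a non-empty result keeps configuration mistakes close to their source.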


## Cell 4 — Initialise LLM and embedding model

In [4]:
llm = Ollama(
    model=cfg["llm"]["model"],
    base_url=cfg["llm"]["base_url"],
    temperature=cfg["llm"]["temperature"],
    request_timeout=cfg["llm"]["request_timeout"],
)

embed_model = OllamaEmbedding(
    model_name=cfg["embedding"]["model"],
    base_url=cfg["embedding"]["base_url"],
)

logger.info("LLM: %s | Embedding: %s", cfg["llm"]["model"], cfg["embedding"]["model"])

17:49:30 [INFO] local_rag: LLM: llama3.2:3b | Embedding: nomic-embed-text


## Cell 5 — Load and chunk the document

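We use `SentenceSplitter` with the chunk size and overlap from config. Both values are measured in tokens, not characters, which is why a 512-token chunk can run to a couple of thousand characters.
The splitter is deterministic: the same input always produces the same chunks.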

In [5]:
DATA_DIR = Path(cfg["data"]["input_dir"])

documents = SimpleDirectoryReader(str(DATA_DIR)).load_data()
logger.info("Loaded %d document(s) from %s", len(documents), DATA_DIR)

splitter = SentenceSplitter(
    chunk_size=cfg["splitter"]["chunk_size"],
    chunk_overlap=cfg["splitter"]["chunk_overlap"],
)
nodes = splitter.get_nodes_from_documents(documents)
logger.info("Split into %d nodes (chunk_size=%d, overlap=%d)",
            len(nodes),
            cfg["splitter"]["chunk_size"],
            cfg["splitter"]["chunk_overlap"])

print(f"\nFirst chunk preview ({len(nodes[0].text)} chars):")
print("-" * 60)
print(nodes[0].text[:400], "...")

17:49:30 [INFO] local_rag: Loaded 1 document(s) from data
17:49:31 [INFO] local_rag: Split into 3 nodes (chunk_size=512, overlap=50)



First chunk preview (2247 chars):
------------------------------------------------------------
# AI Safety Primer

## What is AI Safety?

AI safety is a field of research focused on ensuring that artificial intelligence systems
behave in ways that are safe, beneficial, and aligned with human values. As AI systems
become more capable, the importance of safety research grows correspondingly.

## Key Concepts

### Alignment
Alignment refers to the challenge of ensuring that an AI system's goal ...
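
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window chunker. It is not the algorithm `SentenceSplitter` uses: it counts whitespace-separated words instead of tokens and ignores sentence boundaries. `chunk_words` is a hypothetical helper for illustration only.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of `chunk_size` words that share `chunk_overlap` words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the current window already reached the end of the text
    return chunks


print(chunk_words("a b c d e f g h i j", chunk_size=4, chunk_overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```

Each chunk repeats the last word of the previous one, which is the purpose of the overlap: text that straddles a boundary still appears intact in at least one chunk.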


## Cell 6 — Build or load the Chroma index (with caching)

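If the collection named in `config.yaml` already exists under `chroma_db/`, we load it from disk and skip re-embedding.
The check looks only at the collection name, so delete the `chroma_db/` folder to force a full re-index after changing the data, the chunking settings, or the embedding model.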

In [6]:
PERSIST_DIR = Path(cfg["chroma"]["persist_dir"])
COLLECTION  = cfg["chroma"]["collection_name"]

chroma_client = chromadb.PersistentClient(path=str(PERSIST_DIR))
existing = [c.name for c in chroma_client.list_collections()]

if COLLECTION in existing:
    logger.info("Cache hit — loading existing collection '%s' from %s", COLLECTION, PERSIST_DIR)
    chroma_collection = chroma_client.get_collection(COLLECTION)
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    index = VectorStoreIndex.from_vector_store(
        vector_store,
        embed_model=embed_model,
    )
else:
    logger.info("Cache miss — embedding %d nodes into new collection '%s'", len(nodes), COLLECTION)
    chroma_collection = chroma_client.create_collection(COLLECTION)
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex(
        nodes,
        storage_context=storage_context,
        embed_model=embed_model,
    )
    logger.info("Index built and persisted to %s", PERSIST_DIR)

print(f"Collection '{COLLECTION}' has {chroma_collection.count()} vectors.")

17:49:31 [INFO] chromadb.telemetry.product.posthog: Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.
17:49:32 [INFO] local_rag: Cache miss — embedding 3 nodes into new collection 'ai_safety_rag'
17:49:33 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:49:33 [INFO] local_rag: Index built and persisted to chroma_db


Collection 'ai_safety_rag' has 3 vectors.


## Cell 7 — RAG query

The query engine retrieves the top-k most relevant chunks and passes them
as context to the local LLM to generate a grounded answer.

In [7]:
query_engine = index.as_query_engine(
    llm=llm,
    similarity_top_k=cfg["retrieval"]["similarity_top_k"],
)

QUERY = "What is Constitutional AI and who developed it?"
logger.info("Running query: %s", QUERY)

response = query_engine.query(QUERY)

display(Markdown(f"**Query:** {QUERY}\n\n**Answer:** {response}"))

print("\n--- Retrieved source nodes ---")
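for i, node in enumerate(response.source_nodes, 1):
    # score may be None, so format it defensively instead of applying :.4f blindly
    score = f"{node.score:.4f}" if node.score is not None else "n/a"
    print(f"[{i}] score={score}  |  {node.text[:120].strip()}...")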

17:49:33 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/show "HTTP/1.1 200 OK"
17:49:33 [INFO] local_rag: Running query: What is Constitutional AI and who developed it?
17:49:33 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:49:52 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


**Query:** What is Constitutional AI and who developed it?

**Answer:** Constitutional AI is an approach where an AI system is trained to follow a set of explicit principles (a "constitution"). This approach was developed by Anthropic. The model critiques and revises its own outputs against these principles, reducing reliance on human labelers for harmful content identification.


--- Retrieved source nodes ---
[1] score=0.4398  |  # AI Safety Primer

## What is AI Safety?

AI safety is a field of research focused on ensuring that artificial intellig...
[2] score=0.4063  |  ## Why AI Safety Matters Now

The rapid pace of AI development means that safety considerations must be integrated
early...
[3] score=0.3742  |  The model critiques
and revises its own outputs against these principles, reducing reliance on human
labelers for harmfu...
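
Conceptually, the query engine performs two steps: rank the stored chunks by similarity to the query embedding, then place the best ones into a prompt for the LLM. The sketch below illustrates that flow with plain Python and cosine similarity. It is an assumption-laden illustration: the similarity metric and prompt template that LlamaIndex and Chroma actually use may differ, and `cosine`, `top_k`, and `build_prompt` are hypothetical names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int) -> list[tuple[float, int]]:
    """Return (score, chunk_index) pairs for the k most similar chunks, best first."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt from the retrieved chunks (illustrative template)."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context above."


chunks = ["chunk about alignment", "chunk about bread", "chunk about RLHF"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-d embeddings
best = top_k([1.0, 0.0], vectors, k=2)
print(build_prompt("What is alignment?", [chunks[idx] for _, idx in best]))
```

Note that retrieval always returns `k` chunks, however weak the match, which is what the failure-mode cells later in the notebook explore.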


## Cell 8 — Retrieval evaluation: Hit-Rate and MRR

We loop over the gold Q&A set and for each question:
1. Retrieve the top-k nodes
2. Check if any retrieved chunk contains **all expected keywords** (hit)
3. Record the rank of the first hit (for MRR)

This is **CI-friendly** — no extra LLM calls, runs in seconds.

In [8]:
GOLD_QA_PATH = Path(cfg["eval"]["gold_qa_path"])

with GOLD_QA_PATH.open() as f:
    gold_qa = json.load(f)

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=cfg["retrieval"]["similarity_top_k"],
    embed_model=embed_model,
)

hits = 0
reciprocal_ranks = []
results = []

for item in gold_qa:
    retrieved_nodes = retriever.retrieve(item["question"])
    keywords = [kw.lower() for kw in item["expected_keywords"]]

    first_hit_rank = None
    for rank, node in enumerate(retrieved_nodes, 1):
        text_lower = node.text.lower()
        if all(kw in text_lower for kw in keywords):
            first_hit_rank = rank
            break

    hit = first_hit_rank is not None
    hits += int(hit)
    reciprocal_ranks.append(1 / first_hit_rank if hit else 0.0)

    results.append({
        "id": item["id"],
        "hit": hit,
        "rank": first_hit_rank,
        "question": item["question"][:60],
    })
    logger.info("[%s] hit=%s rank=%s | %s", item["id"], hit, first_hit_rank, item["question"][:50])

hit_rate = hits / len(gold_qa)
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)

print("\n" + "=" * 50)
print(f"  Retrieval Evaluation Results (top-k={cfg['retrieval']['similarity_top_k']})")
print("=" * 50)
print(f"  Hit-Rate : {hit_rate:.2%}  ({hits}/{len(gold_qa)} questions)")
print(f"  MRR      : {mrr:.4f}")
print("=" * 50)
print("\nPer-question breakdown:")
for r in results:
    status = "✅" if r["hit"] else "❌"
    print(f"  {status} [{r['id']}] rank={r['rank']}  {r['question']}")

17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q1] hit=True rank=1 | What is alignment in the context of AI safety?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q2] hit=True rank=1 | What is RLHF and how does it work?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q3] hit=True rank=2 | What is reward hacking?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q4] hit=True rank=1 | What is Constitutional AI and who developed it?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:00 [INFO] local_rag: [q5] hit=True rank=1 | What is red teaming in AI?
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"


  Retrieval Evaluation Results (top-k=3)
  Hit-Rate : 100.00%  (8/8 questions)
  MRR      : 0.8542

Per-question breakdown:
  ✅ [q1] rank=1  What is alignment in the context of AI safety?
  ✅ [q2] rank=1  What is RLHF and how does it work?
  ✅ [q3] rank=2  What is reward hacking?
  ✅ [q4] rank=1  What is Constitutional AI and who developed it?
  ✅ [q5] rank=1  What is red teaming in AI?
  ✅ [q6] rank=1  What is deceptive alignment?
  ✅ [q7] rank=1  What is interpretability in AI systems?
  ✅ [q8] rank=3  Which organizations are working on AI safety?
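
The two metrics can be reproduced from the per-question ranks alone. Hit-rate is the fraction of questions with any hit in the top-k, and MRR is the mean of `1 / rank` of the first hit, counting a miss as 0. The sketch below recomputes both from the ranks in the breakdown above; `hit_rate_and_mrr` is a hypothetical helper name.

```python
from typing import Optional


def hit_rate_and_mrr(first_hit_ranks: list[Optional[int]]) -> tuple[float, float]:
    """Compute hit-rate and MRR from 1-based first-hit ranks (None means no hit)."""
    if not first_hit_ranks:
        raise ValueError("need at least one question")
    total = len(first_hit_ranks)
    hits = sum(rank is not None for rank in first_hit_ranks)
    reciprocal = [1.0 / rank if rank is not None else 0.0 for rank in first_hit_ranks]
    return hits / total, sum(reciprocal) / total


ranks = [1, 1, 2, 1, 1, 1, 1, 3]  # q1..q8 from the per-question breakdown
hit_rate, mrr = hit_rate_and_mrr(ranks)
print(f"{hit_rate:.2%} {mrr:.4f}")
# 100.00% 0.8542
```

Six questions contribute 1, one contributes 1/2, and one contributes 1/3, so the mean is (6 + 0.5 + 0.3333) / 8 ≈ 0.8542.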


## Cell 9 — Failure mode demos

Understanding where a RAG pipeline breaks is as important as knowing where it works.
We demonstrate three common failure modes.

### Failure Mode 1: Empty / nonsense query

When the query has no semantic content, retrieval still returns the top-k chunks, only with low relevance scores,
and the LLM must either hallucinate an answer or admit it doesn't know.

In [9]:
empty_query = "asdfjkl qwerty zzz"
logger.info("[Failure Mode 1] Empty/nonsense query: '%s'", empty_query)

response_empty = query_engine.query(empty_query)

print("Query  :", empty_query)
print("Answer :", str(response_empty))
print("\nTop retrieved node score:",
      f"{response_empty.source_nodes[0].score:.4f}" if response_empty.source_nodes else "none")
print("\n⚠️  Note: Low retrieval score indicates the context is not relevant to the query.")

17:50:00 [INFO] local_rag: [Failure Mode 1] Empty/nonsense query: 'asdfjkl qwerty zzz'
17:50:00 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:06 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


Query  : asdfjkl qwerty zzz
Answer : I'm sorry, but it seems like you haven't provided a clear question. The input "asdfjkl qwerty zzz" doesn't appear to be related to any specific topic or subject matter discussed in the context information. Could you please rephrase your query so I can provide a helpful response?

Top retrieved node score: 0.3735

⚠️  Note: Low retrieval score indicates the context is not relevant to the query.


### Failure Mode 2: Query about a topic outside the document

The document covers AI safety. A query about an unrelated topic will retrieve
the least-bad chunks, but the answer will be unreliable.

In [10]:
ood_query = "What is the recipe for making sourdough bread?"
logger.info("[Failure Mode 2] Out-of-domain query: '%s'", ood_query)

response_ood = query_engine.query(ood_query)

print("Query  :", ood_query)
print("Answer :", str(response_ood))
print("\n⚠️  Guardrail tip: Add a relevance score threshold. If max(node.score) < 0.4,")
print("   return 'I don\'t have information about this topic' instead of hallucinating.")

17:50:06 [INFO] local_rag: [Failure Mode 2] Out-of-domain query: 'What is the recipe for making sourdough bread?'
17:50:07 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:14 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


Query  : What is the recipe for making sourdough bread?
Answer : I'm happy to help you with your question, but I have to say that the provided context information seems unrelated to baking or cooking. The text appears to be a discussion about AI safety and its importance in the field of artificial intelligence.

Unfortunately, I don't have any information on making sourdough bread from the given context. If you're looking for a recipe, I'd be happy to try and help you find one elsewhere!

⚠️  Guardrail tip: Add a relevance score threshold. If max(node.score) < 0.4,
   return 'I don't have information about this topic' instead of hallucinating.


### Failure Mode 3: Hallucination guardrail (score threshold)

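A simple guardrail: if the best retrieval score is below a
threshold, refuse to answer rather than hallucinate.
The threshold is specific to the data and embedding model. In the runs in this notebook, in-domain queries top out around 0.44 to 0.46 and off-topic ones around 0.34 to 0.37, so 0.40 separates them only narrowly; recalibrate it on your own documents.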

In [11]:
SCORE_THRESHOLD = 0.40

def safe_query(query_engine, retriever, question: str, threshold: float = SCORE_THRESHOLD) -> str:
    """Run RAG query with a relevance score guardrail.

    Returns the LLM answer if the best retrieved chunk scores at least `threshold`,
    otherwise returns a fallback message to prevent hallucination.
    """
    nodes = retriever.retrieve(question)
    # Ignore unscored nodes; calling max() on an empty sequence would raise ValueError
    scores = [n.score for n in nodes if n.score is not None]
    if not scores:
        return "[GUARDRAIL] No scored documents retrieved."

    best_score = max(scores)
    logger.info("[safe_query] best_score=%.4f threshold=%.2f", best_score, threshold)

    if best_score < threshold:
        return (
            f"[GUARDRAIL] Best retrieval score ({best_score:.4f}) is below "
            f"threshold ({threshold}). Cannot answer reliably."
        )
    return str(query_engine.query(question))


# In-domain question — should pass the guardrail
q_in  = "What is reward hacking?"
# Out-of-domain question — should be blocked
q_out = "What is the capital of France?"

print("=" * 55)
print(f"Q (in-domain) : {q_in}")
print(f"A             : {safe_query(query_engine, retriever, q_in)}")
print()
print(f"Q (out-domain): {q_out}")
print(f"A             : {safe_query(query_engine, retriever, q_out)}")
print("=" * 55)

17:50:14 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:14 [INFO] local_rag: [safe_query] best_score=0.4569 threshold=0.40
17:50:14 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"


Q (in-domain) : What is reward hacking?


17:50:22 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
17:50:22 [INFO] httpx: HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
17:50:22 [INFO] local_rag: [safe_query] best_score=0.3442 threshold=0.40


A             : Reward hacking occurs when an AI system finds unintended ways to maximize its reward signal without achieving the true underlying goal. For example, a robot trained to run fast might learn to make itself very tall and then fall forward repeatedly. This phenomenon highlights the potential for AI systems to develop behaviors that are not aligned with their intended objectives.

Q (out-domain): What is the capital of France?
A             : [GUARDRAIL] Best retrieval score (0.3442) is below threshold (0.4). Cannot answer reliably.
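
Rather than picking the threshold by eye, one simple option is to place it midway between the lowest in-domain best score and the highest out-of-domain best score observed on a small labelled set. The sketch below applies that idea to the four best scores logged in this notebook; `midpoint_threshold` is a hypothetical helper, and four samples are far too few for a robust setting.

```python
def midpoint_threshold(in_domain_scores: list[float], out_of_domain_scores: list[float]) -> float:
    """Return the midpoint between the weakest in-domain and strongest out-of-domain score."""
    lowest_in = min(in_domain_scores)
    highest_out = max(out_of_domain_scores)
    if highest_out >= lowest_in:
        raise ValueError("score ranges overlap; a single threshold cannot separate them")
    return (lowest_in + highest_out) / 2


in_domain = [0.4398, 0.4569]      # Constitutional AI query, reward hacking query
out_of_domain = [0.3735, 0.3442]  # nonsense query, capital of France query
print(f"{midpoint_threshold(in_domain, out_of_domain):.3f}")
# 0.407
```

The result is close to the hard-coded `SCORE_THRESHOLD = 0.40`, and the gap between the two groups is only about 0.07, so a different document set or embedding model could easily shift or erase the separation.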


## Cell 10 — Cleanup (optional)

Run this cell to delete the persisted Chroma database and start fresh.
Useful for testing the full pipeline from scratch.

In [12]:
# Uncomment to reset the vector store
# PERSIST_DIR = Path(cfg["chroma"]["persist_dir"])
# if PERSIST_DIR.exists():
#     shutil.rmtree(PERSIST_DIR)
#     logger.info("Deleted %s — re-run Cell 6 to rebuild the index.", PERSIST_DIR)
# else:
#     logger.info("%s does not exist, nothing to clean up.", PERSIST_DIR)
print("Cleanup cell ready. Uncomment the lines above to reset the vector store.")

Cleanup cell ready. Uncomment the lines above to reset the vector store.


## Summary

| Component | Choice | Why |
|-----------|--------|-----|
| LLM | `llama3.2:3b` via Ollama | Free, local, no API key |
| Embeddings | `nomic-embed-text` via Ollama | High quality, 274 MB, fully local |
| Vector store | ChromaDB (persistent) | Simple, file-based, no server needed |
| Chunking | `SentenceSplitter` | Respects sentence boundaries |
| Eval | Keyword hit-rate + MRR | CI-friendly, zero LLM cost |
| Guardrail | Score threshold | Reduces hallucination risk on OOD queries |

### Next steps

- Swap `llama3.2:3b` for `mistral` or `gemma3` in `config.yaml` and re-run
- Replace `ai_safety_primer.txt` with your own documents in `data/`
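- Lower `chunk_size` or add documents (the sample yields only 3 chunks, so `similarity_top_k=3` already retrieves every chunk), then vary `similarity_top_k` and observe the effect on hit-rate and MRR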
- Add a reranker (e.g. `llama-index-postprocessor-cohere-rerank`) after retrieval

### References

- [LlamaIndex docs](https://docs.llamaindex.ai)
- [ChromaDB docs](https://docs.trychroma.com)
- [Ollama model library](https://ollama.com/library)