**Building a Local RAG Document Knowledge Base with Ollama, LlamaIndex, and ChromaDB**

---

This article outlines the process of building a local Retrieval Augmented Generation (RAG) document knowledge base using Ollama for local Large Language Model (LLM) inference, LlamaIndex for data ingestion and indexing, and ChromaDB for vector storage.  This approach allows for efficient querying of a private document corpus without relying on external LLM APIs, enhancing privacy and reducing costs.

---

## 1. Introduction
**Retrieval Augmented Generation (RAG)** combines the strengths of LLMs with external knowledge sources.  Instead of relying solely on the LLM's pre-trained knowledge, RAG retrieves relevant information from a knowledge base before generating a response. This allows the LLM to provide more accurate, contextually grounded, and up-to-date answers.  This article focuses on building a local RAG pipeline, meaning all components run on your machine, offering increased data privacy and control.


> This setup was tested on a **Dell Latitude 5430 laptop** with the following specifications:
> > 1. **CPU** - 12th Gen Intel® Core™ i7-1256U vPro® *(12 MB Intel® Smart Cache, 10 cores, 12 threads, up to 4.80 GHz Turbo)*
> > 2. **GPU** - Integrated Intel® Iris® Xe Graphics 1.25 GHz - *8 GB shared memory*
> > 3. **RAM** - 16 GB, 2 x 8 GB, DDR4, 3200 MT/s, Non-ECC, dual-channel
> > 4. **OS** - Windows 11 Enterprise version 23H2 *build 22631.4751*
> > 5. **Conda** version 24.9.1 environment with the following:
> > > 1. **Python** - Python 3.11.0 | *packaged by Anaconda, Inc. | (main, Mar  1 2023, 18:18:21) [MSC v.1916 64 bit (AMD64)] on win32*
> > > 2. **Intel IPEX-LLM** for Ollama [***pip install --pre --upgrade ipex-llm[cpp]***]
> > > 3. **llama-index** 0.12.19, ***with the additional integrations below***
> > > > 1. **llama-index-vector-stores-chroma** 0.4.1
> > > > 2. **llama-index-llms-ollama** 0.5.2
> > > > 3. **llama-index-embeddings-huggingface** 0.5.1
> > > 4. **ollama** 0.4.7
> > > 5. **chromadb** 0.6.3
> > 6. **Ollama server** 0.5.4-ipexllm-20250223
> > 7. **LLM Model** - **"IBM Granite 3.1-dense 8B"** *[ollama run granite3.1-dense:8b] from [https://ollama.com/library/granite3.1-dense]*, a text-only dense LLM from IBM's Granite family of small open LLMs, trained on over 12 trillion tokens of data. *The IBM Granite 3.1 language models [https://www.ibm.com/granite] are designed for agentic workflows, RAG, text summarization, text analytics and extraction, classification, and content generation.*
> > 8. **Embedding Model** - **"BAAI/llm-embedder"** *from the official BAAI Hugging Face repo [https://huggingface.co/BAAI/llm-embedder]* - *LLM-Embedder [https://github.com/FlagOpen/FlagEmbedding/tree/master/research/llm_embedder] is a small unified embedding model that supports diverse retrieval augmentation needs for LLMs*

## 2. Key Components and Concepts
 * **Large Language Model (LLM)** :  A powerful AI model trained on a massive dataset, capable of understanding and generating human-like text.  Examples include Llama 2, Mistral, and others.  In this setup, we'll use Ollama to run the LLM locally.
 * **Ollama** : A tool for running LLMs locally. It simplifies the process of downloading, managing, and running open-source LLMs on your own hardware.
 * **LlamaIndex** : A framework that simplifies the process of connecting LLMs to external data sources. It provides tools for data ingestion, indexing, and querying.
 * **ChromaDB** : A vector database designed for storing and querying embeddings. Embeddings are numerical representations of text, capturing semantic meaning. ChromaDB enables efficient similarity searches, crucial for retrieving relevant documents.
 * **Embeddings** : Numerical vectors that represent the semantic meaning of text.  Similar texts have similar embeddings.  We'll use a locally loaded Hugging Face embedding model (BAAI/llm-embedder) to generate these vectors.
 * **Vector Database** : A specialized database designed to store and efficiently query vector embeddings.  ChromaDB is used here to store and retrieve document embeddings.
 * **Document Chunks** :  Large documents are often split into smaller chunks for processing and retrieval. This improves retrieval accuracy and reduces computational load.
 * **Query** : A user's question or request to the knowledge base.
 * **Retrieval** : The process of identifying relevant document chunks from the knowledge base based on the query.
 * **Augmentation** : The process of combining the retrieved document chunks with the user's query before sending it to the LLM.
 * **Prompt Engineering** : Designing effective prompts to guide the LLM towards generating desired responses.
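
To make the embedding and retrieval ideas above more concrete, here is a minimal sketch of similarity-based ranking in plain Python. The three-dimensional vectors and chunk names are made up for illustration; a real embedding model produces vectors with hundreds of dimensions. Cosine similarity is only one common measure, and the metric ChromaDB actually applies depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" keyed by chunk name (illustration only)
chunk_vectors = {
    "invoice policy": [0.9, 0.1, 0.0],
    "vacation rules": [0.1, 0.9, 0.1],
    "expense claims": [0.8, 0.2, 0.1],
}
query_vector = [1.0, 0.0, 0.0]  # pretend embedding of a finance-related question

def top_k(query, vectors, k=2):
    """Rank chunk names by similarity to the query and keep the best k."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in vectors.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

print(top_k(query_vector, chunk_vectors))
```

The two finance-related chunks rank ahead of the unrelated one because their vectors point in nearly the same direction as the query vector.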

## 3. System Architecture
The system follows a typical RAG architecture:
 1. **Data Ingestion** : Documents are loaded and processed.
 2. **Chunking** : Documents are split into smaller, manageable chunks.
 3. **Embedding Generation** : Embeddings are created for each chunk using an embedding model.
 4. **Vector Storage** : Embeddings are stored in ChromaDB.
 5. **Query Processing** : User queries are embedded.
 6. **Retrieval** : Relevant chunks are retrieved from ChromaDB based on embedding similarity.
 7. **Augmentation** : The query and retrieved chunks are combined into a prompt.
 8. **LLM Inference** : The prompt is sent to the local LLM (via Ollama).
 9. **Response Generation** : The LLM generates a response based on the provided context.
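
The chunking, retrieval, and augmentation steps above can be sketched without any external library. This is a toy illustration, not what LlamaIndex does internally: chunks are split by word count, a simple keyword overlap stands in for embedding similarity, and the function names and prompt wording are hypothetical.

```python
def chunk_text(text, chunk_size=8, overlap=2):
    """Split text into chunks of `chunk_size` words, each sharing `overlap` words with the previous one."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    # Stop early enough that no chunk consists only of already-covered overlap words
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + chunk_size]) for i in starts]

def keyword_score(query, chunk):
    """Stand-in for embedding similarity: number of lowercase words shared by query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query, chunks, k=1):
    """Return the k chunks with the highest score for the query."""
    return sorted(chunks, key=lambda c: keyword_score(query, c), reverse=True)[:k]

def build_prompt(query, retrieved_chunks):
    """Augmentation: combine the retrieved context and the question into one prompt."""
    context = "\n---\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

document = (
    "Ollama runs large language models on local hardware. "
    "ChromaDB stores embedding vectors for similarity search. "
    "LlamaIndex connects documents to the language model."
)
query = "where are embedding vectors stored"

chunks = chunk_text(document)
retrieved = retrieve(query, chunks)
prompt = build_prompt(query, retrieved)
print(prompt)
```

In the real pipeline the prompt would be sent to the local LLM through Ollama; here it is only printed so the structure of the augmented prompt is visible.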

## 4. Implementation Steps and Code

#### Step 1 : Configure logging

In [None]:
import os
import logging
# You may turn off logging or change levels once testing is complete and everything is working as intended
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s') # You can output to screen if you're using the code in an interactive notebook
logger = logging.getLogger()

#### Step 2 : Import required libraries

In [None]:
from llama_index.core import (
    SimpleDirectoryReader,
    Settings,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from tqdm.notebook import tqdm as log_progress

import chromadb

##### ***If you're running the code as an app, you can save logs to a rotating log file like this ->***
---

```Python
import logging
import os
from logging.handlers import RotatingFileHandler

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
log_file_path = "./logs"
log_filename = os.path.join(log_file_path, 'RAGBot.log')
os.makedirs(log_file_path, exist_ok=True)  # create the log directory if it is missing
# Rotate at roughly 100 KB per file and keep 3 backups
handler = RotatingFileHandler(log_filename, maxBytes=100000, backupCount=3, encoding='utf-8')
handler.setFormatter(formatter)
logger.addHandler(handler)
```
---

#### Step 3 : Setup default paths and directories

In [None]:
# Specify the directory containing your documents
DOCUMENT_PATH = "./data/pdf"

# Specify the file to hold the list of documents indexed
INDEXED_FILES_RECORD = os.path.join(DOCUMENT_PATH, 'indexed_files.txt')

# Directory to persist ChromaDB
PERSIST_DIR = "./data/storage/chroma_db"

# Set collection name
COLLECTION_NAME = "document_index"

#### Step 4 : Setup the Embedding Model and Ollama Local LLM of your choice

In [None]:
# Settings control global defaults
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/llm-embedder")
Settings.llm = Ollama(model="granite3.1-dense:latest", request_timeout=360.0)

#### Step 5 : Load Data
Our approach for scanning and adding files to the vector DB is as follows:
> * On the first run, scan and load all the files in DOCUMENT_PATH and create an INDEXED_FILES_RECORD (txt) to store the list of indexed files - *A first run is assumed if no INDEXED_FILES_RECORD is found in DOCUMENT_PATH*
> * The INDEXED_FILES_RECORD is then populated with the names of the currently loaded documents
> * This file is read in subsequent runs and compared with the DOCUMENT_PATH listing to see whether any new files have been added
> * Any new files found are indexed and vectorized, and the INDEXED_FILES_RECORD is updated
> * This way we can build our knowledge base as and when more data becomes available
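
The bookkeeping described above can be tried in isolation before any model is involved. The sketch below uses only the standard library and a temporary directory; `find_new_files` and `mark_indexed` are hypothetical helper names, not part of LlamaIndex or of the notebook code.

```python
import os
import tempfile

def find_new_files(document_dir, record_path):
    """Return the file names in document_dir that are not yet listed in record_path."""
    indexed = set()
    if os.path.exists(record_path):
        with open(record_path, "r", encoding="utf-8") as f:
            indexed = set(f.read().splitlines())
    record_name = os.path.basename(record_path)
    current = {
        name for name in os.listdir(document_dir)
        # Skip sub-directories and the record file itself
        if os.path.isfile(os.path.join(document_dir, name)) and name != record_name
    }
    return sorted(current - indexed)

def mark_indexed(record_path, file_names):
    """Append the given file names to the record, one per line."""
    with open(record_path, "a", encoding="utf-8") as f:
        for name in file_names:
            f.write(name + "\n")

with tempfile.TemporaryDirectory() as doc_dir:
    record = os.path.join(doc_dir, "indexed_files.txt")
    for name in ("a.pdf", "b.pdf"):
        open(os.path.join(doc_dir, name), "w").close()

    first_run = find_new_files(doc_dir, record)   # everything is new
    mark_indexed(record, first_run)

    open(os.path.join(doc_dir, "c.pdf"), "w").close()
    second_run = find_new_files(doc_dir, record)  # only the added file is new
    mark_indexed(record, second_run)

    third_run = find_new_files(doc_dir, record)   # nothing left to index

print(first_run, second_run, third_run)
```

Note that this scheme tracks files by name only, so a file that is edited in place but keeps its name is not picked up again.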

In [None]:
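# The record file does not exist on the first run, so create it and load every document
if not os.path.exists(INDEXED_FILES_RECORD):
    open(INDEXED_FILES_RECORD, "w").close()
    # Get all files in the directory. The listing includes the record file itself,
    # so later runs will not mistake it for a new document.
    all_files = [f for f in os.listdir(DOCUMENT_PATH) if os.path.isfile(os.path.join(DOCUMENT_PATH, f))]

    # Write file names to indexed_files.txt
    with open(INDEXED_FILES_RECORD, "w") as f:
        for file in all_files:
            f.write(file + "\n")

    # Load and preprocess the documents, leaving out the record file so it is not indexed as content
    record_name = os.path.basename(INDEXED_FILES_RECORD)
    doc_files = [os.path.join(DOCUMENT_PATH, f) for f in all_files if f != record_name]
    print(f"Recorded {len(doc_files)} document(s) from {DOCUMENT_PATH} in {INDEXED_FILES_RECORD}.")
    documents = SimpleDirectoryReader(input_files=doc_files).load_data()
    # 'documents' now holds a list of document objects ready for indexing.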

#### Step 6 : Set up the ChromaDB client with a persistent storage directory
> * Check for PERSIST_DIR; if it does not exist, build a fresh COLLECTION_NAME and store it
> * If PERSIST_DIR exists, update the existing COLLECTION_NAME with any new files in DOCUMENT_PATH
> * Update INDEXED_FILES_RECORD once all new files are vectorized and added to COLLECTION_NAME

In [None]:
if not os.path.exists(PERSIST_DIR):

    # Create the Chroma client, opt out of telemetry, and create our collection.
    # The alias keeps chromadb's Settings from shadowing the LlamaIndex Settings imported in Step 2.
    from chromadb.config import Settings as ChromaSettings
    chroma_client = chromadb.PersistentClient(path=PERSIST_DIR, settings=ChromaSettings(anonymized_telemetry=False))
    chroma_collection = chroma_client.create_collection(COLLECTION_NAME)

    # Create vector store and the storage context; the 'collection_name' groups our embeddings.
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # Build the index from the loaded documents.
    index = VectorStoreIndex.from_documents(documents, show_progress=True, storage_context=storage_context)
    print(f"FINISH: Built Chroma collection / index from the documents; it now has {chroma_collection.count()} vectors.  Each vector represents one chunk of a document.")
    # The index now holds the vector embeddings of our documents and is ready for querying.

    # Persist the index
    index.storage_context.persist(persist_dir=PERSIST_DIR)

else:

    # Load the existing index.
    # The alias keeps chromadb's Settings from shadowing the LlamaIndex Settings imported in Step 2.
    from chromadb.config import Settings as ChromaSettings
    chroma_client = chromadb.PersistentClient(path=PERSIST_DIR, settings=ChromaSettings(anonymized_telemetry=False))
    chroma_collection = chroma_client.get_collection(COLLECTION_NAME)
    print(f"START: Initial Chroma collection has {chroma_collection.count()} vectors.  Each vector represents one chunk of a document.")
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection, mode="append")
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_vector_store(
        vector_store,
        show_progress=True
    )
    print("Loaded existing index from Disk")

    # Load previously indexed files
    indexed_files = set()
    if os.path.exists(INDEXED_FILES_RECORD):
        with open(INDEXED_FILES_RECORD, "r") as f:
            indexed_files = set(f.read().splitlines())

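    # List the files in the directory (skipping sub-directories and the record file itself) and check for new ones
    record_name = os.path.basename(INDEXED_FILES_RECORD)
    all_docs = {f for f in os.listdir(DOCUMENT_PATH)
                if os.path.isfile(os.path.join(DOCUMENT_PATH, f)) and f != record_name}
    new_docs = sorted(all_docs - indexed_files)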

    if new_docs:
        print(f"Found {len(new_docs)} new Document(s). Adding to vector store...")

        new_documents = []
        for doc in new_docs:
            doc_path = os.path.join(DOCUMENT_PATH, doc)
            docs = SimpleDirectoryReader(input_files=[doc_path]).load_data() # Load and preprocess the new documents
            new_documents.extend(docs) # Collect the new documents

        # Build the index from the new documents.
        index = VectorStoreIndex.from_documents(new_documents, show_progress=True, storage_context=storage_context)
        print(f"FINISH: Updated Chroma collection / index with new documents now has {chroma_collection.count()} vectors.  Each vector represents one chunk of a document.")
        # The index now holds the vector embeddings of our new documents and is ready for querying.

        # Persist the index
        index.storage_context.persist(persist_dir=PERSIST_DIR)

        # Update the indexed files record
        with open(INDEXED_FILES_RECORD, "a") as f:
            for doc in new_docs:
                f.write(doc + "\n")

        print("Vector store updated successfully.")
    else:
        print("No new files to index.")

#### Step 7 : Query and test the model

In [None]:
# Try a sample query to test the response from our model
query_engine = index.as_query_engine()
# Example query
query = "What is the main topic of this document?"
response = query_engine.query(query)

# Print the generated response.
print("Response:", response)
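
When checking answers it often helps to see which chunks were retrieved. The sketch below assumes the response object exposes a `source_nodes` list whose items carry a `score` and a `node.metadata` dictionary, as LlamaIndex query responses do; the `file_name` and `page_label` keys are the ones SimpleDirectoryReader typically attaches for PDFs, and `format_sources` is a hypothetical helper. Stand-in objects are used so the snippet runs without a model.

```python
from types import SimpleNamespace

def format_sources(source_nodes):
    """Build one readable line per retrieved chunk: rank, file, page, similarity score."""
    lines = []
    for rank, item in enumerate(source_nodes, start=1):
        meta = item.node.metadata
        file_name = meta.get("file_name", "unknown file")
        page = meta.get("page_label", "?")
        score = "n/a" if item.score is None else f"{item.score:.3f}"
        lines.append(f"{rank}. {file_name} (page {page}), score {score}")
    return lines

# Stand-in objects shaped like retrieved nodes (illustration only)
fake_nodes = [
    SimpleNamespace(
        score=0.8123,
        node=SimpleNamespace(metadata={"file_name": "report.pdf", "page_label": "3"}),
    ),
    SimpleNamespace(score=None, node=SimpleNamespace(metadata={})),
]

for line in format_sources(fake_nodes):
    print(line)
```

In the notebook, the same helper could be called as `format_sources(response.source_nodes)` right after the query.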

#### Step 8 : (Optional) Run a simple chatbot interface to ask questions about the documents

In [None]:
# Start interactive query loop
while True:
    query = input("Hi, This is a simple RAGBot. Enter your query (or type 'exit' to quit) : \n")

    if query.lower() == 'exit':
        break

    # Keep retrieving relevant responses for the query
    response = query_engine.query(query)

    # Display the retrieved data
    print("RAGBot:\n", response, "\n")

## 5. Explanation of Code Sections
---
 1. **Logging** : Sets up basic logging for debugging.
 2. **Imports** : Imports the necessary libraries.
 3. **Configure** : Configures the data folder where documents are stored, the name of the ChromaDB collection, and the location of the persistent storage.
 4. **Model Settings** : Initializes the Ollama LLM and the embedding model. Replace "granite3.1-dense:latest" with the name of the model you have downloaded via ollama pull. These global settings are used throughout the indexing and querying process.
 5. **Data Loading** :  SimpleDirectoryReader loads documents from a specified directory.  LlamaIndex supports various data connectors (e.g., web pages, databases).
 6. **ChromaDB Setup and Index Creation/Loading** : The code checks for the persistence directory. If it doesn't exist, it creates a ChromaDB client, a collection, a ChromaVectorStore, and a StorageContext. It then builds the VectorStoreIndex from the documents and persists it.  If the directory does exist, the existing collection is loaded from disk and only files not yet listed in the record file are embedded and added.  This persistence step is crucial for efficiency: each document is embedded only once.
 7. **Querying** : Creates a query_engine from the index and uses it to answer a query.
 8. **Chatbot Interface (Optional)** :  Provides a simple loop for interactive querying.
 ---

## 6. Setting up Ollama
 * **Installation** : Follow the instructions on the Ollama website (https://ollama.ai/) for your operating system.
 * **Model Download** :  Use the `ollama pull <model_name>` command to download the desired LLM.  For example, `ollama pull granite3.1-dense:8b` downloads the model used in this article.  The llama-index-llms-ollama integration talks to whichever model the Ollama server hosts, so choose one that fits your hardware.
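
Before running the notebook, it can be useful to confirm that the Ollama server is up and that the model has been pulled. The sketch below queries the server's `/api/tags` endpoint with the standard library only. It assumes the default address `http://localhost:11434`; `list_ollama_models` is a hypothetical helper name, and the model name checked at the end is simply the one used in this article.

```python
import json
import urllib.request

def list_ollama_models(base_url="http://localhost:11434", timeout=5):
    """Return the names of locally available models, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers non-JSON replies
        return None
    return [model.get("name") for model in payload.get("models", [])]

if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama server is not reachable. Is it running?")
    elif not any(name and name.startswith("granite3.1-dense") for name in models):
        print("Server is up, but the granite3.1-dense model has not been pulled yet.")
    else:
        print("Available models:", models)
```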

## 7. Setting up ChromaDB
 * **Installation** : pip install chromadb

## 8. Setting up LlamaIndex
 * **Installation** : pip install llama-index llama-index-vector-stores-chroma llama-index-llms-ollama llama-index-embeddings-huggingface

## 9. Running the Code
 * **Data** : Place your documents in the "./data/pdf" directory (or adjust DOCUMENT_PATH in the code).
 * **Dependencies** : Ensure all required libraries are installed.
 * **Execution** : Run the Python script.  The first run will take longer as the index is built. Subsequent runs will be faster as the index is loaded from disk.
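
A quick way to check the dependencies is to ask Python which versions are installed. The sketch below uses `importlib.metadata` from the standard library; the package list mirrors the environment described at the top of this article, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Distribution names taken from the environment listed in the introduction
REQUIRED_PACKAGES = [
    "llama-index",
    "llama-index-vector-stores-chroma",
    "llama-index-llms-ollama",
    "llama-index-embeddings-huggingface",
    "chromadb",
    "ollama",
]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED_PACKAGES).items():
    print(f"{name:40s} {version or 'NOT INSTALLED'}")
```

Any package reported as missing can be installed with pip before the notebook is run.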

## 10.  Further Enhancements
 * **More Advanced Querying** : Explore LlamaIndex's query engine options for more sophisticated retrieval strategies (e.g., keyword-based retrieval, hybrid search).
 * **Evaluation** : Implement metrics to evaluate the performance of the RAG system.
 * **Prompt Engineering** : Experiment with different prompt templates to optimize the quality of LLM responses.
 * **Data Connectors** : Explore LlamaIndex's data connectors for integrating with various data sources.
 * **Asynchronous Operations** : For larger datasets, consider using asynchronous operations to speed up indexing and querying.
 * **Fine-tuning** :  For highly specific domains,