This notebook demonstrates how to build a Retrieval Augmented Generation (RAG) system using vector databases. I'll walk through the complete process of setting up a RAG pipeline that ingests classic literature (Dracula and The Adventures of Sherlock Holmes) and enables question-answering capabilities based solely on the content of these books.

**What You'll Learn**

- How RAG systems work and why they're important for grounding LLM responses
- Setting up document ingestion pipelines with text chunking strategies
- Creating and querying vector databases using ChromaDB
- Combining retrieval with language model generation for accurate, source-backed answers

# Background

### How will an LLM know your data?

LLMs are trained on publicly available internet data up to their training cutoff date. They have no knowledge of your private documents, proprietary data, or any information that wasn't part of their training dataset. When you ask questions about your specific documents, the LLM cannot provide accurate answers because it simply doesn't have access to that information.

<img src="assets/How would it know your data.png">

### Manually pass document with query

Since model context windows are large, we can provide documents as context to the model and ask questions directly. However, this approach requires manually selecting each document and does not scale effectively when dealing with large numbers of documents or extensive document collections.

<img src="assets/Query with document.png">

### Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) solves the above problems by following a 4-step process:

1. **Document Ingestion**: A collection of documents is passed to the system, where each document is broken down into smaller chunks. This chunking strategy helps provide the model with only relevant content from multiple sections of a document. These chunks are then passed to an embedding model that converts the text into numeric vectors (essentially series of numbers). These numeric vectors are ingested into a dedicated vector database.

2. **Query Processing**: When a user submits a query, it is converted into a numeric vector using the same embedding model, and that vector is used to search the vector database.

3. **Retrieval**: The chunks whose vectors are most similar to the query vector are fetched from the database. The vector database stores each chunk's original text alongside its vector, so the text is returned directly; embeddings are not converted back into text. These relevant text chunks are passed as context along with the original query to create the final prompt for the model.

4. **Generation**: The model takes the user's original query along with the relevant additional context from the vector database to generate the final response, ensuring the answer is grounded in the provided documents.

<img src="assets/Retrieval Augmented Generation.png">
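
The four steps above can be sketched end to end in plain Python. This is a toy illustration, not the implementation used later in this notebook: the `embed` function is a hypothetical stand-in that counts words instead of calling a real embedding model, the "vector database" is just a list, and the names `chunk_text`, `embed`, `cosine`, and `retrieve` are assumptions made for the example.

```python
import math
import re
from collections import Counter


def chunk_text(text, size=8):
    """Step 1a: split text into chunks of `size` words (toy chunking)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, store, k=1):
    """Steps 2-3: embed the query, rank stored chunks, return the top-k texts."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Step 1b: ingest by storing each chunk's text next to its vector.
document = (
    "Irene Adler outwitted Sherlock Holmes in London. "
    "Count Dracula lives in a castle in Transylvania."
)
store = [(chunk, embed(chunk)) for chunk in chunk_text(document)]

# Step 4 would send this prompt to an LLM; here we only build it.
question = "Where does Dracula live?"
context = retrieve(question, store, k=1)
prompt = f"Context: {context[0]}\n\nQuestion: {question}"
print(prompt)
```

Note that the store keeps the chunk text next to its vector, which is why retrieval can hand text back without "decoding" anything.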

# Code implementation

We now implement RAG using ChromaDB as the vector database, OpenAI's text-embedding-3-small as the embedding model, and GPT-4o-mini as the LLM. To create chunks, we use LangChain's recursive character text splitter.

### Ingest into database

This step loads the text files from `SOURCE_DIR`, splits them into manageable chunks using a recursive text splitter, converts the chunks into vector embeddings using OpenAI's embedding model, and stores the resulting vectors in a persistent ChromaDB database at `DB_DIR` for similarity search and retrieval.

In [1]:
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
import os

In [2]:
# make sure the environment variables are loaded
load_dotenv("configs.env")

True

In [3]:
# Folder with .txt files
SOURCE_DIR = "data_files"
# Persistent vector database directory
DB_DIR = "vector_db"

In [4]:
# MODELS
EMBEDDING_MODEL = "text-embedding-3-small"
LLM_MODEL = "gpt-4o-mini"

In [5]:
loader = DirectoryLoader(SOURCE_DIR, glob="**/*.txt", show_progress=True)
docs = loader.load()

  0%|          | 0/2 [00:00<?, ?it/s]libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
 50%|█████     | 1/2 [00:02<00:02,  2.46s/it]libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
100%|██████████| 2/2 [00:02<00:00,  1.45s/it]


In [6]:
print(f"{len(docs)} files loaded successfully.")

2 files loaded successfully.


In [7]:
# split each book into smaller chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=400)
chunks = splitter.split_documents(docs)

In [8]:
print(f"{len(chunks)} chunks created successfully.")

1381 chunks created successfully.
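
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraph and line breaks, so its chunks vary in length). The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in toy_chunks:
    print(chunk)
```

Each chunk starts with the last 4 characters of the previous one, so text near a boundary shows up in two chunks instead of being cut in only one place. The notebook's settings (1500 and 400) apply the same idea at a larger scale.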


In [9]:
# create folder if it doesn't exist
os.makedirs(DB_DIR, exist_ok=True)

# vectorize document chunks using text embedding model
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model=EMBEDDING_MODEL),
    persist_directory=DB_DIR,
)
print(f"✅ Ingestion complete. Persistent DB stored at: {DB_DIR}")

✅ Ingestion complete. Persistent DB stored at: vector_db


### Fetch from database

Here we fetch the most relevant chunks from the database using a sample query.

In [10]:
# === Load persistent vector DB ===
vectordb = Chroma(
    persist_directory=DB_DIR,
    embedding_function=OpenAIEmbeddings(model=EMBEDDING_MODEL),
)

In [11]:
# retrieval using chromaDB
query = "Who is Irene Adler?"

In [12]:
# fetch top 3 most similar chunks
results = vectordb.similarity_search(query, k=3)

In [13]:
for i, doc in enumerate(results, 1):
    print(f"\n-------------------Result {i}-------------------:")
    print(f"Source: {doc.metadata.get('source', 'unknown')}")
    print(f"Truncated Content: {doc.page_content[:100]}...")


-------------------Result 1-------------------:
Source: data_files/The Adventures of Sherlock Holmes.txt
Truncated Content: I. A SCANDAL IN BOHEMIA

I.

To Sherlock Holmes she is always _the_ woman. I have seldom heard him m...

-------------------Result 2-------------------:
Source: data_files/The Adventures of Sherlock Holmes.txt
Truncated Content: “I then lounged down the street and found, as I expected, that there was a mews in a lane which runs...

-------------------Result 3-------------------:
Source: data_files/The Adventures of Sherlock Holmes.txt
Truncated Content: It was close upon four before the door opened, and a drunken-looking groom, ill-kempt and side-whisk...


### Generate LLM response

Here we pass the user query, together with the relevant chunks retrieved from the vector database, to the LLM to get a final response grounded in the source text.

In [14]:
query = "Who is Irene Adler?"

In [15]:
# Set up the LLM and RetrievalQA chain
llm = ChatOpenAI(model=LLM_MODEL, temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    # default retriever settings; 4 chunks come back in the output below
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
)

In [16]:
response = qa_chain.invoke(query)

In [17]:
# print the source and a short preview of each retrieved chunk
for doc in response["source_documents"]:
    print(
        f"\n-------------------Source: {doc.metadata.get('source', 'unknown')}-------------------:"
    )
    # preview first 100 chars
    print(f"Truncated Content: {doc.page_content[:100]}...")


-------------------Source: data_files/The Adventures of Sherlock Holmes.txt-------------------:
Truncated Content: I. A SCANDAL IN BOHEMIA

I.

To Sherlock Holmes she is always _the_ woman. I have seldom heard him m...

-------------------Source: data_files/The Adventures of Sherlock Holmes.txt-------------------:
Truncated Content: “I then lounged down the street and found, as I expected, that there was a mews in a lane which runs...

-------------------Source: data_files/The Adventures of Sherlock Holmes.txt-------------------:
Truncated Content: It was close upon four before the door opened, and a drunken-looking groom, ill-kempt and side-whisk...

-------------------Source: data_files/The Adventures of Sherlock Holmes.txt-------------------:
Truncated Content: “Mr. Sherlock Holmes, I believe?” said she.

“I am Mr. Holmes,” answered my companion, looking at he...


In [18]:
print("# User Query:\n", response["query"])
print("\n# LLM Response:\n", response["result"])

# User Query:
 Who is Irene Adler?

# LLM Response:
 Irene Adler is a character in Arthur Conan Doyle's story "A Scandal in Bohemia." She is portrayed as a talented and beautiful woman who captivates Sherlock Holmes, who refers to her as "the woman." Adler is known for her intelligence and resourcefulness, and she plays a significant role in the story as she outsmarts Holmes, which is a rare occurrence for the famous detective.
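
The `"stuff"` chain type used above places all retrieved chunks into a single prompt. A rough sketch of that idea follows. The template wording and the `build_stuffed_prompt` helper are assumptions for illustration and do not reproduce LangChain's actual prompt; the two chunks are shortened excerpts from the retrieval output shown earlier.

```python
def build_stuffed_prompt(question, chunks):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use only the context below to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Shortened excerpts standing in for the retrieved chunks.
retrieved = [
    "To Sherlock Holmes she is always the woman.",
    '"Mr. Sherlock Holmes, I believe?" said she.',
]
stuffed_prompt = build_stuffed_prompt("Who is Irene Adler?", retrieved)
print(stuffed_prompt)
```

Because every chunk goes into one prompt, this approach is simple but only works while the combined chunks fit within the model's context window.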
