This notebook introduces Retrieval Augmented Generation (RAG) and demonstrates how to set up, ingest data into, and retrieve information from a vector database.

# Background

## How will an LLM know your data?

LLMs are pretrained on vast amounts of internet data. They have never seen your data, so if you ask a question related to it, the model cannot know the answer.

<img src="assets/How would it know your data.png">

## Manually pass document with query

<img src="assets/Query with document.png">

## Retrieval Augmented Generation

<img src="assets/Retrieval Augmented Generation.png">
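
The flow in the diagram can be sketched without any external services. The toy example below is only an illustration under loud assumptions: word overlap stands in for embedding similarity, the three snippets and the prompt wording are hypothetical, and no LLM is called. It shows the two steps that matter: retrieve the most relevant text, then prepend it to the question.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents that share the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Prepend the retrieved context to the user question."""
    joined = "\n\n".join(context)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {query}"
    )


# Hypothetical snippets standing in for your private documents
toy_documents = [
    "Irene Adler is an opera singer who outwits Sherlock Holmes.",
    "Dr. Watson served as an army surgeon before meeting Holmes.",
    "The Red-Headed League paid Jabez Wilson to copy the encyclopaedia.",
]
toy_query = "Who is Irene Adler?"

toy_context = retrieve(toy_query, toy_documents, k=1)
toy_prompt = build_prompt(toy_query, toy_context)
print(toy_prompt)
```

In the real pipeline that follows, the word-overlap ranking is replaced by embedding similarity in a vector database, and the assembled prompt is sent to an LLM.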

# Code implementation

## Ingest into database

In [1]:
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
import os

In [2]:
load_dotenv("configs.env")

True

In [3]:
# Folder with .txt files
SOURCE_DIR = "data_files"
# Persistent vector database directory
DB_DIR = "vector_db"

In [4]:
loader = DirectoryLoader(SOURCE_DIR, glob="**/*.txt", show_progress=True)
docs = loader.load()

  0%|          | 0/2 [00:00<?, ?it/s]libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
 50%|█████     | 1/2 [00:01<00:01,  1.50s/it]libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
100%|██████████| 2/2 [00:01<00:00,  1.03it/s]


In [5]:
print(f"{len(docs)} files loaded successfully.")

2 files loaded successfully.


In [6]:
# split each book into smaller chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=400)
chunks = splitter.split_documents(docs)

In [7]:
print(f"{len(chunks)} chunks created successfully.")

1381 chunks created successfully.
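
To see what `chunk_size` and `chunk_overlap` control, here is a minimal sketch of a fixed-width sliding window. This is a simplification, not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraphs and sentences, so its chunk boundaries will differ. The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(toy_text, chunk_size=10, chunk_overlap=4)
for piece in toy_chunks:
    print(piece)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

Each chunk repeats the last four characters of the previous one. The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk.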


In [8]:
# create folder if it doesn't exist
os.makedirs(DB_DIR, exist_ok=True)

# vectorize document chunks using text embedding model
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory=DB_DIR,
)
print(f"✅ Ingestion complete. Persistent DB stored at: {DB_DIR}")

✅ Ingestion complete. Persistent DB stored at: vector_db


## Fetch from database

In [9]:
# === Load persistent vector DB ===
vectordb = Chroma(
    persist_directory=DB_DIR,
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)

In [10]:
# query used for retrieval from Chroma
query = "Who is Irene Adler?"

In [11]:
# fetch top 3 most similar chunks
results = vectordb.similarity_search(query, k=3)

In [12]:
for i, doc in enumerate(results, 1):
    print(f"\n-------------------Result {i}-------------------:")
    print(f"Source: {doc.metadata.get('source', 'unknown')}")
    print(f"Truncated Content: {doc.page_content[:100]}...")


-------------------Result 1-------------------:
Source: data_files/The Adventures of Sherlock Holmes.txt
Truncated Content: I. A SCANDAL IN BOHEMIA

I.

To Sherlock Holmes she is always _the_ woman. I have seldom heard him m...

-------------------Result 2-------------------:
Source: data_files/The Adventures of Sherlock Holmes.txt
Truncated Content: “I then lounged down the street and found, as I expected, that there was a mews in a lane which runs...

-------------------Result 3-------------------:
Source: data_files/The Adventures of Sherlock Holmes.txt
Truncated Content: It was close upon four before the door opened, and a drunken-looking groom, ill-kempt and side-whisk...
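
Conceptually, a similarity search embeds the query and ranks the stored chunk vectors by how close they are to it. The sketch below illustrates that ranking with cosine similarity over made-up 3-dimensional vectors. Both are assumptions: real embeddings have far more dimensions, and the distance metric Chroma actually uses depends on how the collection is configured. The names `cosine_similarity`, `top_k`, and the `chunk_*` labels are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for chunk embeddings
toy_vectors = {
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.1, 0.9, 0.0],
    "chunk_c": [0.7, 0.3, 0.1],
    "chunk_d": [0.0, 0.1, 0.9],
}
toy_query_vector = [1.0, 0.0, 0.0]

ranked = top_k(toy_query_vector, toy_vectors, k=3)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

A vector database does the same job at scale, usually with an approximate index so that it does not have to score every stored vector.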


## Generate LLM response

In [13]:
query = "Who is Irene Adler?"

In [14]:
# Set up the LLM and RetrievalQA chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
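    # as_retriever() with no arguments uses the vector store's default k (4 for Chroma),
    # which is why four source documents are returned below
    retriever=vectordb.as_retriever(),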
    return_source_documents=True,
)
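
The `"stuff"` chain type places all retrieved documents into a single prompt alongside the question. A rough sketch of that idea follows, assuming a hypothetical `ToyDocument` class in place of LangChain's document objects and an illustrative prompt wording that is not the template LangChain ships.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a retrieved document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def stuff_documents(question: str, docs: list[ToyDocument]) -> str:
    """Concatenate every document into one context block, then append the question."""
    context = "\n\n".join(doc.page_content for doc in docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Illustrative documents; the text and source names are made up for this sketch
toy_docs = [
    ToyDocument("To Sherlock Holmes she is always the woman.", {"source": "toy_1.txt"}),
    ToyDocument("She outwitted the detective and left London.", {"source": "toy_2.txt"}),
]
stuffed_prompt = stuff_documents("Who is Irene Adler?", toy_docs)
print(stuffed_prompt)
```

Because everything goes into one prompt, this approach is simple but bounded by the model's context window, which is one reason the documents are split into chunks before ingestion.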

In [15]:
response = qa_chain.invoke(query)

In [16]:
# print the source and a short preview of each retrieved document
for doc in response["source_documents"]:
    print(
        f"\n-------------------Source: {doc.metadata.get('source', 'unknown')}-------------------:"
    )
    # preview first 100 chars
    print(f"Truncated Content: {doc.page_content[:100]}...")


-------------------Source: data_files/The Adventures of Sherlock Holmes.txt-------------------:
Truncated Content: I. A SCANDAL IN BOHEMIA

I.

To Sherlock Holmes she is always _the_ woman. I have seldom heard him m...

-------------------Source: data_files/The Adventures of Sherlock Holmes.txt-------------------:
Truncated Content: “I then lounged down the street and found, as I expected, that there was a mews in a lane which runs...

-------------------Source: data_files/The Adventures of Sherlock Holmes.txt-------------------:
Truncated Content: It was close upon four before the door opened, and a drunken-looking groom, ill-kempt and side-whisk...

-------------------Source: data_files/The Adventures of Sherlock Holmes.txt-------------------:
Truncated Content: “Mr. Sherlock Holmes, I believe?” said she.

“I am Mr. Holmes,” answered my companion, looking at he...


In [17]:
print("# User Query:\n", response["query"])
print("\n# LLM Response:\n", response["result"])

# User Query:
 Who is Irene Adler?

# LLM Response:
 Irene Adler is a character in Arthur Conan Doyle's story "A Scandal in Bohemia." She is portrayed as a talented and beautiful woman who captures the attention of Sherlock Holmes, who refers to her as "the woman." Adler is known for her intelligence and resourcefulness, and she plays a significant role in the story as she outsmarts Holmes, which is a rare occurrence for the famous detective.
