This notebook introduces Retrieval-Augmented Generation (RAG) and demonstrates how to set up a vector database, ingest data into it, and retrieve information from it.

# Chatbots that know your data

[Figure placeholder: diagram of the RAG pipeline]

# Ingest into database

In [12]:
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

In [2]:
load_dotenv("configs.env")

True
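The `True` above only says that `load_dotenv` set at least one variable; it does not say which ones. A minimal sketch of an explicit check follows, assuming `configs.env` is meant to define `OPENAI_API_KEY` (the variable the OpenAI client reads). The helper name `missing_settings` is hypothetical and not part of any library.

```python
import os

def missing_settings(required, env=None):
    """Return the names in `required` that are absent or empty in `env`."""
    # Fall back to the real process environment when no mapping is passed.
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Demonstration with plain dicts so no real key is needed.
print(missing_settings(["OPENAI_API_KEY"], env={"OPENAI_API_KEY": "sk-example"}))
print(missing_settings(["OPENAI_API_KEY"], env={}))
```

In the notebook itself, calling `missing_settings(["OPENAI_API_KEY"])` right after `load_dotenv` and stopping when the list is non-empty would surface a bad config file before any embedding call is made.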

In [3]:
# Folder with your .txt files
SOURCE_DIR = "data_files"
# Persistent vector database directory
DB_DIR = "vector_db"

In [4]:
loader = DirectoryLoader(SOURCE_DIR, glob="**/*.txt", show_progress=True)
docs = loader.load()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
100%|██████████| 5/5 [00:02<00:00,  1.90it/s]


In [5]:
print(f"{len(docs)} txt files loaded successfully.")

5 txt files loaded successfully.


In [6]:
# Split each book into overlapping chunks (up to 1,000 characters, 200-character overlap)
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

In [7]:
print(f"{len(chunks)} chunks created successfully.")

3676 chunks created successfully.
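To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking with a fixed-size sliding window. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter first tries to break on separators such as paragraphs and lines, so its chunks rarely end mid-word. The function name `chunk_with_overlap` and the sample string are illustrative assumptions.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows where neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing and embedding some text twice.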


In [8]:
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
    persist_directory=DB_DIR,
)
print(f"✅ Ingestion complete. Persistent DB stored at: {DB_DIR}")

✅ Ingestion complete. Persistent DB stored at: vector_db
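One caveat worth checking: as far as I know, re-running the ingestion cell against the same `DB_DIR` adds the chunks a second time, because each run assigns fresh random IDs. A common way around this is to derive a stable ID from each chunk's source and content. The sketch below shows only the ID derivation; the helper name `chunk_id` is hypothetical, and passing the result through an `ids=` argument to `Chroma.from_documents` is an assumption to verify against the installed version.

```python
import hashlib

def chunk_id(source: str, content: str) -> str:
    """Derive a deterministic ID from a chunk's source path and text."""
    digest = hashlib.sha256()
    digest.update(source.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    digest.update(content.encode("utf-8"))
    return digest.hexdigest()

first = chunk_id("data_files/example.txt", "some chunk text")
again = chunk_id("data_files/example.txt", "some chunk text")
other = chunk_id("data_files/example.txt", "different chunk text")
print(first == again, first == other)

# Assumed usage in this notebook (not executed here):
# ids = [chunk_id(c.metadata.get("source", ""), c.page_content) for c in chunks]
# Chroma.from_documents(documents=chunks, embedding=..., persist_directory=DB_DIR, ids=ids)
```

Two chunks with identical source and text produce the same ID, so exact duplicates would need to be dropped before ingestion.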


# Fetch from database

In [9]:
# === Load persistent vector DB ===
vectordb = Chroma(
    persist_directory=DB_DIR,
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
)

In [10]:
# Query to run against the Chroma vector store
query = "Who is Irene Adler?"

In [None]:
# Fetch the 3 chunks most similar to the query
results = vectordb.similarity_search(query, k=3)
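Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors lie closest to it. The sketch below illustrates that idea with brute-force cosine similarity over three made-up 3-dimensional vectors. These are assumptions for illustration only: real embeddings have far more dimensions, and the vector store uses an approximate index with a distance metric that depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k):
    """Return the k (score, text) pairs with the highest similarity to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy index: (chunk text, made-up embedding).
toy_index = [
    ("chunk about Irene Adler", [0.9, 0.1, 0.0]),
    ("chunk about a hound", [0.0, 1.0, 0.0]),
    ("chunk about Briony Lodge", [0.7, 0.3, 0.1]),
]
toy_query = [1.0, 0.0, 0.0]

best = top_k(toy_query, toy_index, k=2)
print([text for _, text in best])
# ['chunk about Irene Adler', 'chunk about Briony Lodge']
```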

In [22]:
for i, doc in enumerate(results, 1):
    print(f"\n-------------------Result {i}-------------------:")
    print(f"Source: {doc.metadata.get('source', 'unknown')}")
    print(doc.page_content)


-------------------Result 1-------------------:
Source: data_files/The Adventures of Sherlock Holmes.txt
“And what of Irene Adler?” I asked.

“Oh, she has turned all the men’s heads down in that part. She is the daintiest thing under a bonnet on this planet. So say the Serpentine-mews, to a man. She lives quietly, sings at concerts, drives out at five every day, and returns at seven sharp for dinner. Seldom goes out at other times, except when she sings. Has only one male visitor, but a good deal of him. He is dark, handsome, and dashing, never calls less than once a day, and often twice. He is a Mr. Godfrey Norton, of the Inner Temple. See the advantages of a cabman as a confidant. They had driven him home a dozen times from Serpentine-mews, and knew all about him. When I had listened to all they had to tell, I began to walk up and down near Briony Lodge once more, and to think over my plan of campaign.

-------------------Result 2-------------------:
Source: data_files/The Adventure

# Generate LLM response

In [13]:
query = "Who is Irene Adler?"

In [23]:
# Set up the LLM and RetrievalQA chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
)
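The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question and sends that to the LLM in one call. A rough sketch of the idea follows. The function `build_stuff_prompt` and the prompt wording are hypothetical and are not the template the library actually uses.

```python
def build_stuff_prompt(question: str, passages: list[str]) -> str:
    """Concatenate retrieved passages and the question into one prompt string."""
    context = "\n\n".join(passages)  # every passage goes in, in retrieval order
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuff_prompt(
    "Who is Irene Adler?",
    ["First retrieved passage.", "Second retrieved passage."],
)
print(prompt)
```

Because every chunk goes into one prompt, this approach is simple and needs a single LLM call, but the combined text must fit within the model's context window.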

In [24]:
response = qa_chain.invoke(query)

In [25]:
response

{'query': 'Who is Irene Adler?',
 'result': 'Irene Adler is a character from Arthur Conan Doyle\'s Sherlock Holmes stories. She is known as "the woman" to Sherlock Holmes, who regards her as a significant figure in his life. Adler is described as a beautiful and clever woman, an adventuress who has captured the attention of many men, including a king. In the story, she is married to an English lawyer named Godfrey Norton. Holmes admires her intelligence and resourcefulness, and she plays a crucial role in one of his most famous cases, "A Scandal in Bohemia."',
 'source_documents': [Document(id='cbb48401-0de1-42f1-a6c5-5600404f1aa7', metadata={'source': 'data_files/The Adventures of Sherlock Holmes.txt'}, page_content='“And what of Irene Adler?” I asked.\n\n“Oh, she has turned all the men’s heads down in that part. She is the daintiest thing under a bonnet on this planet. So say the Serpentine-mews, to a man. She lives quietly, sings at concerts, drives out at five every day, and return

In [21]:
print("# User Query:\n", response["query"])
print("\n# LLM Response:\n", response["result"])

# User Query:
 Who is Irene Adler?

# LLM Response:
 Irene Adler is a character from Arthur Conan Doyle's Sherlock Holmes stories. She is known as "the woman" to Sherlock Holmes, who regards her as a significant figure in his life. Adler is described as a beautiful and clever woman, an adventuress who has captured the attention of many men, including a king. In the story, she is married to an English lawyer named Godfrey Norton. Holmes admires her intelligence and resourcefulness, and she plays a crucial role in one of his most famous cases, "A Scandal in Bohemia."
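Since the chain was built with `return_source_documents=True`, the response also carries the chunks the answer was based on. A small sketch of listing the distinct source files is shown below. It runs on stand-in objects with hypothetical file names rather than the real `response`, and the helper `unique_sources` is an illustrative name.

```python
from types import SimpleNamespace

def unique_sources(source_documents):
    """Return distinct `source` metadata values, preserving first-seen order."""
    seen = []
    for doc in source_documents:
        source = doc.metadata.get("source", "unknown")
        if source not in seen:
            seen.append(source)
    return seen

# Stand-ins that mimic the `.metadata` attribute of retrieved documents.
fake_response = {
    "source_documents": [
        SimpleNamespace(metadata={"source": "data_files/book_a.txt"}),
        SimpleNamespace(metadata={"source": "data_files/book_a.txt"}),
        SimpleNamespace(metadata={"source": "data_files/book_b.txt"}),
        SimpleNamespace(metadata={}),
    ]
}

print(unique_sources(fake_response["source_documents"]))
# ['data_files/book_a.txt', 'data_files/book_b.txt', 'unknown']
```

With the real chain output, `unique_sources(response["source_documents"])` would show which files the answer drew on.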
