## Training for RAG

### Import Packages

In [2]:

from dotenv import load_dotenv
from langchain.chat_models import init_chat_model
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
from typing_extensions import TypedDict
from typing import List, Annotated

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import InjectedState
from langgraph.checkpoint.memory import MemorySaver
from IPython.display import Image, display, Markdown

import requests
import os
from typing import Optional
from pprint import pprint


In [3]:
load_dotenv()

True

### Retrieval Augmented Generation for a Chatbot

The objectives of this agent are:
1. We have a corpus of business documents for different departments, and we want to build a secure, role-based access controlled (RBAC) chatbot that answers employees' questions with relevant, department-specific data and does not leak data across departments.
2. Security is a crucial aspect of this project.

The RAG pipeline has two components:
1. Indexing: we feed the documents to a vector database and index them.
2. Retrieval: when a user asks a question, we query the vector database for similar content and add it to the prompt.
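
To make the two stages concrete before wiring up real services, here is a minimal, dependency-free sketch. It is not the pipeline used in this notebook: the bag-of-words `embed` function stands in for a real embedding model, the in-memory `index` list stands in for the vector database, and the second corpus sentence is made up for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system calls an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Indexing: embed each chunk once and store the vector next to the text.
corpus = [
    "FinSolve was founded in 2018",
    "Expense reports are due at the end of the month",  # made-up sample text
]
index = [(embed(chunk), chunk) for chunk in corpus]

def retrieve(question: str, k: int = 1) -> list[str]:
    """Retrieval: rank stored chunks by similarity to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

question = "When was FinSolve founded?"
context = retrieve(question)
# The retrieved text is added to the prompt that goes to the LLM.
prompt = f"Context: {context[0]}\nQuestion: {question}"
```

The real pipeline below follows the same shape, with Gemini embeddings in place of `embed` and ChromaDB in place of `index`.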

### Init the chat model

In [4]:
model = init_chat_model("gemini-2.5-flash", model_provider="google_genai")

In [8]:
# this package is from google with langchain supported methods
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

NOTE: 
The Gemini API supports different `task_types` for embeddings. The LangChain library above supports this by default, so we don't need to set it explicitly for RAG-based tasks. [Check this official doc](https://python.langchain.com/docs/integrations/text_embedding/google_generative_ai/)

In [7]:
idx = embeddings.embed_query("Who am I?")

In [9]:
len(idx)

768

### Init the Vector Database

In [15]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="documentation",
    embedding_function=embeddings,
    persist_directory="./chroma_db" # we need to put it in the backend
)

### Indexing raw docs

1. Load the documents (which come in different formats: `.md` and `.csv`)
2. Split them into smaller, meaningful chunks
3. Index them in ChromaDB
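
As an illustration of step 2, the sketch below splits Markdown text into one chunk per heading using only the standard library. This is an assumption-laden stand-in, not what the notebook does: the cells below rely on `UnstructuredMarkdownLoader` with `mode="elements"` to produce the chunks. The function name `split_by_headings` and the sample text are hypothetical, and the sketch treats any line starting with `#` as a heading, so it would mis-split `#` comments inside fenced code.

```python
def split_by_headings(markdown_text: str) -> list[dict]:
    """Split Markdown into chunks, one per heading, keeping the heading as metadata."""
    chunks = []
    heading, lines = "", []

    def flush() -> None:
        # Only emit a chunk if the section has some non-blank body text.
        if any(line.strip() for line in lines):
            chunks.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown_text.splitlines():
        if line.startswith("#"):
            flush()
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks

# Made-up sample text, for illustration only.
sample = (
    "# Intro\n"
    "FinSolve is a FinTech company.\n"
    "\n"
    "## Architecture\n"
    "Services talk over APIs.\n"
)
chunks = split_by_headings(sample)
```

Keeping the heading with each chunk gives the retriever some context about where a passage came from.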

[Load Markdown](https://python.langchain.com/docs/how_to/document_loader_markdown/)

In [16]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader, CSVLoader

In [17]:
markdown_loader = UnstructuredMarkdownLoader(
    file_path="data/engineering/engineering_master_doc.md", mode="elements")

docs = markdown_loader.load()

In [12]:
len(docs)

573

In [21]:
pprint(docs[3])

Document(metadata={'source': 'data/engineering/engineering_master_doc.md', 'languages': ['eng'], 'file_directory': 'data/engineering', 'filename': 'engineering_master_doc.md', 'filetype': 'text/markdown', 'last_modified': '2025-07-12T15:19:27', 'parent_id': '131718987add1083e37146e6cf1491dd', 'category': 'NarrativeText', 'element_id': '40cfe29cd980e28cda6a9e4376c53a2d'}, page_content='FinSolve Technologies is a leading FinTech company headquartered in Bangalore, India, with operations across North America, Europe, and Asia-Pacific. Founded in 2018, FinSolve provides innovative financial solutions, including digital banking, payment processing, wealth management, and enterprise financial analytics, serving over 2 million individual users and 10,000 businesses globally.')


### Embed and store in vector db

NOTE:
`chromadb` throws an error if you try to add the raw documents, because their metadata contains lists, which are not supported. So we need to filter the metadata. Additionally, we need to add some extra fields to the metadata to support the RBAC requirement.
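
The idea behind the metadata clean-up can be sketched in plain Python. This is only a rough approximation of what `filter_complex_metadata` is assumed to do (keep scalar values, drop lists and other nested types); the helper name `drop_complex_values` is hypothetical, and the sample dictionary is abbreviated from the loader metadata shown in this notebook.

```python
# Scalar types assumed to be acceptable as Chroma metadata values.
ALLOWED_TYPES = (str, int, float, bool)

def drop_complex_values(metadata: dict) -> dict:
    """Return a copy of the metadata without list, dict, or other nested values."""
    return {k: v for k, v in metadata.items() if isinstance(v, ALLOWED_TYPES)}

raw = {
    "source": "data/engineering/engineering_master_doc.md",
    "languages": ["eng"],  # a list: this is the kind of value Chroma rejects
    "category_depth": 0,
}
clean = drop_complex_values(raw)

# Extra field to support RBAC filtering at query time.
clean["department"] = "engineering"
```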

In [18]:
from langchain_community.vectorstores.utils import filter_complex_metadata

filtered_docs = []

for doc in docs[:4]:  # Start with the first 4 documents as a test
    print(f"Original doc type: {type(doc)}")
    print(f"Original metadata: {doc.metadata}")


    # Add simple metadata for RBAC
    doc.metadata["department"] = "engineering"
    doc.metadata["source_file"] = "engineering_master_doc.md"
    doc.metadata["access_level"] = "engineering_team"
    filtered_docs.append(doc)

# filtered_docs

Original doc type: <class 'langchain_core.documents.base.Document'>
Original metadata: {'source': 'data/engineering/engineering_master_doc.md', 'category_depth': 0, 'languages': ['eng'], 'file_directory': 'data/engineering', 'filename': 'engineering_master_doc.md', 'filetype': 'text/markdown', 'last_modified': '2025-07-12T15:19:27', 'category': 'Title', 'element_id': 'a9818428aac624384a9da60690636112'}
Original doc type: <class 'langchain_core.documents.base.Document'>
Original metadata: {'source': 'data/engineering/engineering_master_doc.md', 'category_depth': 1, 'languages': ['eng'], 'file_directory': 'data/engineering', 'filename': 'engineering_master_doc.md', 'filetype': 'text/markdown', 'last_modified': '2025-07-12T15:19:27', 'parent_id': 'a9818428aac624384a9da60690636112', 'category': 'Title', 'element_id': 'f963a63a1a2de90bdc2e356d9f0edc35'}
Original doc type: <class 'langchain_core.documents.base.Document'>
Original metadata: {'source': 'data/engineering/engineering_master_doc.
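
The RBAC fields in the loop above are hard-coded for a single file. When indexing several departments, they could be derived from the file path instead. The sketch below assumes a `data/<department>/<file>` layout and an `access_level` of the form `<department>_team`; both conventions are inferred from the one engineering example and may not hold for other departments. The helper name `rbac_metadata` is hypothetical.

```python
from pathlib import PurePosixPath

def rbac_metadata(file_path: str) -> dict:
    """Derive RBAC metadata from a path shaped like data/<department>/<file>."""
    path = PurePosixPath(file_path)
    department = path.parent.name  # assumed: the parent folder names the department
    return {
        "department": department,
        "source_file": path.name,
        "access_level": f"{department}_team",  # assumed naming convention
    }

meta = rbac_metadata("data/engineering/engineering_master_doc.md")
# Inside the loop, doc.metadata.update(meta) would replace the three assignments.
```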

In [19]:
# Try embedding a few documents and test the result before doing it for all.
# This method calls the embedding model to generate vector embeddings and stores them.

doc_ids = vector_store.add_documents(filter_complex_metadata(filtered_docs))

In [20]:
print(doc_ids)

['34f1342e-735c-4b8e-bfe8-00701ddc8654', 'c19e7484-36fe-4c9b-97ec-28873e07aec0', 'd8259a1f-ec2b-40f7-904d-41e7898a3048', 'bd368ca6-00ba-4108-9912-3b87b727f515']
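
Once the small test succeeds, the remaining documents could be added in batches rather than in one call. The following is a minimal sketch of such batching; the batch size of 100 is an arbitrary assumption (useful limits depend on the embedding API's quotas), and the helper name `batched` is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator

def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

# Hypothetical usage with the notebook's objects:
# for batch in batched(filter_complex_metadata(docs), 100):
#     vector_store.add_documents(batch)

# With 573 loaded elements and a batch size of 100:
sizes = [len(b) for b in batched(range(573), 100)]
```

Batching keeps each embedding request small and makes it easier to resume after a failed call.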


In [39]:
vector_store.get_by_ids(['a6144b78-a2b4-4848-b5b7-068329bd159d'])

[Document(id='a6144b78-a2b4-4848-b5b7-068329bd159d', metadata={'access_level': 'engineering_team', 'filetype': 'text/markdown', 'element_id': '40cfe29cd980e28cda6a9e4376c53a2d', 'file_directory': 'data/engineering', 'department': 'engineering', 'filename': 'engineering_master_doc.md', 'last_modified': '2025-07-12T15:19:27', 'parent_id': '131718987add1083e37146e6cf1491dd', 'source': 'data/engineering/engineering_master_doc.md', 'category': 'NarrativeText', 'source_file': 'engineering_master_doc.md'}, page_content='FinSolve Technologies is a leading FinTech company headquartered in Bangalore, India, with operations across North America, Europe, and Asia-Pacific. Founded in 2018, FinSolve provides innovative financial solutions, including digital banking, payment processing, wealth management, and enterprise financial analytics, serving over 2 million individual users and 10,000 businesses globally.')]

In [21]:
vector_store.similarity_search(query="When was FinSolve founded?", k=1)

[Document(id='bd368ca6-00ba-4108-9912-3b87b727f515', metadata={'source_file': 'engineering_master_doc.md', 'filename': 'engineering_master_doc.md', 'parent_id': '131718987add1083e37146e6cf1491dd', 'file_directory': 'data/engineering', 'last_modified': '2025-07-12T15:19:27', 'access_level': 'engineering_team', 'department': 'engineering', 'element_id': '40cfe29cd980e28cda6a9e4376c53a2d', 'category': 'NarrativeText', 'filetype': 'text/markdown', 'source': 'data/engineering/engineering_master_doc.md'}, page_content='FinSolve Technologies is a leading FinTech company headquartered in Bangalore, India, with operations across North America, Europe, and Asia-Pacific. Founded in 2018, FinSolve provides innovative financial solutions, including digital banking, payment processing, wealth management, and enterprise financial analytics, serving over 2 million individual users and 10,000 businesses globally.')]

In [43]:
res = vector_store.get(limit=2, include=["embeddings", "documents"])

In [44]:
res['embeddings'][0][:5]

array([ 0.02460302, -0.01240093, -0.03797761,  0.00284989,  0.05209341])

## Test the LLM response without RAG
Our LLM does not know about this fictitious company, so it will either:
1. fail gracefully, or
2. hallucinate

In [45]:
response = model.invoke("When was FinSolve founded?")
response.content

'I cannot find any publicly available information about a company called "FinSolve" or its founding date.\n\nIt\'s possible:\n* The company is very new and information hasn\'t been widely published yet.\n* It\'s a private company that doesn\'t widely disclose this information.\n* The name might be slightly different.\n\nCould you provide any more details or confirm the spelling?'

Now our `RAG` system comes into the picture.

In [46]:
# we can create a template for the prompt
prompt_temp = """
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: When was FinSolve founded?
Context: FinSolve Technologies is a leading FinTech company headquartered in Bangalore, India, with operations across North America, Europe, and Asia-Pacific. Founded in 2018, FinSolve provides innovative financial solutions, including digital banking, payment processing, wealth management, and enterprise financial analytics, serving over 2 million individual users and 10,000 businesses globally.
Answer:
"""

response = model.invoke(prompt_temp)
response.content

'FinSolve Technologies was founded in 2018. It is a leading FinTech company headquartered in Bangalore, India. FinSolve provides innovative financial solutions globally.'
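
The prompt above hard-codes both the question and the context. A reusable version might look like the sketch below, which uses plain `str.format` placeholders. The names `RAG_TEMPLATE` and `build_prompt` are hypothetical, and the wording is copied from the prompt above.

```python
RAG_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)

def build_prompt(question: str, chunks: list[str]) -> str:
    """Fill the template with the question and the retrieved chunks."""
    return RAG_TEMPLATE.format(question=question, context="\n\n".join(chunks))

prompt = build_prompt(
    "When was FinSolve founded?",
    ["Founded in 2018, FinSolve provides innovative financial solutions."],
)
# Hypothetical usage with the notebook's objects:
# chunks = [d.page_content for d in vector_store.similarity_search(question, k=1)]
# response = model.invoke(build_prompt(question, chunks))
```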

### Role Based Retrieval

Given the role of the user, we only retrieve the data scoped to that role.

In [24]:
vector_store.similarity_search(
    query="When was FinSolve founded?", k=1, filter={"access_level": "engineering_team"})

[Document(id='bd368ca6-00ba-4108-9912-3b87b727f515', metadata={'department': 'engineering', 'source': 'data/engineering/engineering_master_doc.md', 'parent_id': '131718987add1083e37146e6cf1491dd', 'source_file': 'engineering_master_doc.md', 'file_directory': 'data/engineering', 'filename': 'engineering_master_doc.md', 'element_id': '40cfe29cd980e28cda6a9e4376c53a2d', 'last_modified': '2025-07-12T15:19:27', 'access_level': 'engineering_team', 'category': 'NarrativeText', 'filetype': 'text/markdown'}, page_content='FinSolve Technologies is a leading FinTech company headquartered in Bangalore, India, with operations across North America, Europe, and Asia-Pacific. Founded in 2018, FinSolve provides innovative financial solutions, including digital banking, payment processing, wealth management, and enterprise financial analytics, serving over 2 million individual users and 10,000 businesses globally.')]
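
To avoid writing the filter by hand for every query, the role could be mapped to a filter in one place. This is a sketch under assumptions: the `ROLE_ACCESS` table and every role or access level other than `engineering_team` are hypothetical, and the `$in` operator is used on the understanding that Chroma's metadata filters support it.

```python
# Hypothetical mapping from user role to the access levels it may read.
ROLE_ACCESS = {
    "engineering": ["engineering_team"],
    "c_level": ["engineering_team", "finance_team", "hr_team"],  # made-up levels
}

def build_filter(role: str) -> dict:
    """Build a metadata filter for the role; refuse roles with no access."""
    levels = ROLE_ACCESS.get(role)
    if not levels:
        # Fail closed: an unknown role must never fall back to an unfiltered search.
        raise PermissionError(f"Role {role!r} has no document access")
    if len(levels) == 1:
        return {"access_level": levels[0]}
    return {"access_level": {"$in": levels}}

engineering_filter = build_filter("engineering")
# Hypothetical usage with the notebook's vector store:
# vector_store.similarity_search(
#     query="When was FinSolve founded?", k=1, filter=engineering_filter)
```

Raising on unknown roles, instead of returning an empty filter, keeps a misconfigured role from silently searching the whole collection.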