# Retrieval Augmented Generation (RAG)<br>
<ul>
Indexing<ol>
Document loading<br>
Text splitting<br>
Embedding<br>
Vector storage<br></ol>
Retrieval<ol>
Query embedding<br>
Similarity search<br>
Retrieve the top-k documents (k-nearest neighbors)<br>
Context augmentation<br></ol>
</ul>
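
The last step in the outline, context augmentation, amounts to placing the retrieved chunks in the prompt ahead of the question. Below is a minimal sketch of that idea; the function name `build_augmented_prompt`, the prompt wording, and the placeholder chunks are all assumptions made for illustration, not part of LangChain.

```python
def build_augmented_prompt(question: str, chunks: list[str]) -> str:
    """Place retrieved chunks ahead of the question so the model can ground its answer."""
    # Number each chunk so an answer can point back to the chunk it used.
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Placeholder chunks (hypothetical); in the notebook these would come from the retriever.
example_chunks = ["first retrieved chunk", "second retrieved chunk"]
prompt = build_augmented_prompt("What is this document about?", example_chunks)
print(prompt)
```

The resulting string would then be sent to a chat or completion model; that call is left out here because it depends on the provider in use.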


## Data loaders

In [9]:
from langchain_community.document_loaders.text import TextLoader
from langchain_community.document_loaders import PyPDFLoader
import os

loader = TextLoader("data/text.txt", encoding="utf-8")
documents = loader.load()

print(f"Loaded {len(documents)} documents from text file.")
print("metadata:", documents[0].metadata)

Loaded 1 documents from text file.
metadata: {'source': 'data/text.txt'}


In [10]:
loader = PyPDFLoader("data/pdf.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} documents from PDF file.")
print("metadata:", documents[0].metadata)

Loaded 3 documents from PDF file.
metadata: {'producer': 'WeasyPrint 65.1', 'creator': 'ChatGPT', 'creationdate': '', 'title': 'Smart Water Bottle Proposal', 'author': 'ChatGPT Canvas', 'source': 'data/pdf.pdf', 'total_pages': 3, 'page': 0, 'page_label': '1'}


In [12]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://en.wikipedia.org/wiki/William_Hanna")
documents = loader.load()
print(f"Loaded {len(documents)} documents from web page.")
print("metadata:", documents[0].metadata)

Loaded 1 documents from web page.
metadata: {'source': 'https://en.wikipedia.org/wiki/William_Hanna', 'title': 'William Hanna - Wikipedia', 'language': 'en'}
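
All three loaders above return the same shape: a list of documents, each carrying `page_content` and a `metadata` dict. The text and web loaders produced one document each, while the PDF loader produced one per page. The sketch below mimics that shape with the standard library only; `SimpleDocument` and `load_text` are hypothetical stand-ins, not LangChain classes, and the temporary file replaces the notebook's `data/` folder so the example runs anywhere.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loader's output: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read one file and return it as a single-document list, recording its source."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Write a small sample file, then load it back through the stand-in loader.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "sample.txt")
    Path(sample_path).write_text("hello loaders", encoding="utf-8")
    docs = load_text(sample_path)

print(len(docs), docs[0].page_content)
```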


## Text splitter

In [15]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

long_text = """
LangChain is a powerful framework for developing applications powered by language models.
It enables applications that are:
1. Data-aware: connect a language model to other sources of data.
2. Agentic: allow a language model to interact with its environment.

The core idea of LangChain is to "chain" together different components to create more advanced use cases around LLMs.
This includes modules for:
- Models: LLMs, ChatModels, Embeddings
- Prompts: PromptTemplate, ChatPromptTemplate
- Output Parsers: StrOutputParser, JsonOutputParser, PydanticOutputParser
- Indexes: Document Loaders, Text Splitters, Vectorstores, Retrievers
- Chains: Combining components with LCEL
- Agents: LLMs that can make decisions and use tools
- Memory: Persisting state between turns

LangChain is available in Python and JavaScript/TypeScript.
It also has related projects like LangServe for deployment and LangSmith for observability.
"""
doc_to_split = Document(page_content=long_text, metadata={"source": "internal_doc"})

In [17]:
# Separators are tried in order: paragraph breaks, line breaks, spaces, then single characters.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    separators=["\n\n", "\n", " ", ""],
)

chunks = text_splitter.split_documents([doc_to_split])

print(f"Split document into {len(chunks)} chunks:")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk.page_content[:50]}... (metadata: {chunk.metadata})")

Split document into 13 chunks:
Chunk 1: LangChain is a powerful framework for developing a... (metadata: {'source': 'internal_doc'})
Chunk 2: It enables applications that are:
1. Data-aware: c... (metadata: {'source': 'internal_doc'})
Chunk 3: 2. Agentic: allow a language model to interact wit... (metadata: {'source': 'internal_doc'})
Chunk 4: The core idea of LangChain is to "chain" together ... (metadata: {'source': 'internal_doc'})
Chunk 5: more advanced use cases around LLMs.... (metadata: {'source': 'internal_doc'})
Chunk 6: This includes modules for:
- Models: LLMs, ChatMod... (metadata: {'source': 'internal_doc'})
Chunk 7: - Prompts: PromptTemplate, ChatPromptTemplate... (metadata: {'source': 'internal_doc'})
Chunk 8: - Output Parsers: StrOutputParser, JsonOutputParse... (metadata: {'source': 'internal_doc'})
Chunk 9: - Indexes: Document Loaders, Text Splitters, Vecto... (metadata: {'source': 'internal_doc'})
Chunk 10: - Chains: Combining components with LCEL
- Agents:... (metad
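
To see what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-window splitter. It is a sketch of the overlap idea only and is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on the listed separators, whereas this one cuts at fixed character offsets. The function name `split_with_overlap` is an assumption.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each piece repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.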

## Vector Store

In [25]:
import os

from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Read GOOGLE_API_KEY from a local .env file.
load_dotenv()
google_api_key = os.getenv("GOOGLE_API_KEY")

embedding_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=google_api_key)

loader = PyPDFLoader("data/pdf.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} documents from PDF file.")

# chunk_size=100 keeps chunks very short; larger values usually give the retriever more context per chunk.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    separators=["\n\n", "\n", " ", ""],
)
chunks = text_splitter.split_documents(documents)

# Preview the first three chunks.
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}: {chunk.page_content[:50]}... (metadata: {chunk.metadata})")

# Embed every chunk and persist the collection to ./chroma_db.
DB = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./chroma_db",
)
print(f"Created vector store with {len(DB)} documents.")

Loaded 3 documents from PDF file.
Chunk 1: Title of the Innovation:Smart Heating and Cooling ... (metadata: {'producer': 'WeasyPrint 65.1', 'creator': 'ChatGPT', 'creationdate': '', 'title': 'Smart Water Bottle Proposal', 'author': 'ChatGPT Canvas', 'source': 'data/pdf.pdf', 'total_pages': 3, 'page': 0, 'page_label': '1'})
Chunk 2: Theme:
Low carbon footprint solution/Technology... (metadata: {'producer': 'WeasyPrint 65.1', 'creator': 'ChatGPT', 'creationdate': '', 'title': 'Smart Water Bottle Proposal', 'author': 'ChatGPT Canvas', 'source': 'data/pdf.pdf', 'total_pages': 3, 'page': 0, 'page_label': '1'})
Chunk 3: Problem Statement :\ Access to clean and temperatu... (metadata: {'producer': 'WeasyPrint 65.1', 'creator': 'ChatGPT', 'creationdate': '', 'title': 'Smart Water Bottle Proposal', 'author': 'ChatGPT Canvas', 'source': 'data/pdf.pdf', 'total_pages': 3, 'page': 0, 'page_label': '1'})
Created vector store with 124 documents.
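
Conceptually, the vector store keeps one embedding per chunk and answers a query by ranking the stored chunks by similarity to the query embedding. The sketch below shows that idea in memory with the standard library. Everything in it is an assumption made for illustration: `toy_embed` uses word counts in place of a real embedding model, and `TinyVectorStore` is a plain list in place of Chroma.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real model returns dense float vectors."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Hypothetical in-memory store: a list of (vector, text) rows."""

    def __init__(self):
        self._rows = []

    def add_texts(self, texts):
        # Indexing: embed each text once and keep the vector next to it.
        for text in texts:
            self._rows.append((toy_embed(text), text))

    def __len__(self):
        return len(self._rows)

    def similarity_search(self, query, k=3):
        # Retrieval: embed the query, score every row, return the k best.
        query_vec = toy_embed(query)
        scored = [(cosine(query_vec, vec), text) for vec, text in self._rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = TinyVectorStore()
store.add_texts([
    "vector stores index embeddings",
    "text splitters create chunks",
    "loaders read documents",
])
hits = store.similarity_search("how are embeddings stored in a vector index", k=2)
for score, text in hits:
    print(f"{score:.2f}  {text}")
```

A production store such as Chroma replaces the linear scan with an approximate nearest-neighbor index and persists the rows to disk, but the add-then-rank flow is the same.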


## Search and Retrieval

In [28]:
db_retriever = Chroma(persist_directory="./chroma_db", embedding_function=embedding_model)  # load the persisted vector store
retriever = db_retriever.as_retriever(search_kwargs={"k": 3})  # return the 3 nearest chunks per query
print(f"Created retriever with {retriever.invoke('LangChain')}")


Created retriever with [Document(metadata={'total_pages': 3, 'producer': 'WeasyPrint 65.1', 'creator': 'ChatGPT', 'page': 1, 'author': 'ChatGPT Canvas', 'source': 'data/pdf.pdf', 'creationdate': '', 'title': 'Smart Water Bottle Proposal', 'page_label': '2'}, page_content='making it cost-effective'), Document(metadata={'page_label': '2', 'creator': 'ChatGPT', 'total_pages': 3, 'producer': 'WeasyPrint 65.1', 'creationdate': '', 'source': 'data/pdf.pdf', 'author': 'ChatGPT Canvas', 'page': 1, 'title': 'Smart Water Bottle Proposal'}, page_content='dependency  on  a  continuous'), Document(metadata={'producer': 'WeasyPrint 65.1', 'page_label': '2', 'creator': 'ChatGPT', 'title': 'Smart Water Bottle Proposal', 'total_pages': 3, 'author': 'ChatGPT Canvas', 'creationdate': '', 'page': 1, 'source': 'data/pdf.pdf'}, page_content='integration of simple')]
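
The query `'LangChain'` has nothing to do with the water bottle PDF, yet three chunks come back: a top-k retriever returns the k nearest neighbors no matter how far away they are. One common remedy is to drop hits below a minimum similarity score. The sketch below shows the filtering step on its own; `keep_relevant` is a hypothetical helper and the scores are made up for illustration, not measured from this notebook's store.

```python
def keep_relevant(scored_hits, min_score):
    """Drop hits whose similarity falls below min_score; keep the original order."""
    return [(score, text) for score, text in scored_hits if score >= min_score]


# Made-up (score, text) pairs; real scores would come from the vector store.
scored_hits = [(0.82, "chunk A"), (0.41, "chunk B"), (0.12, "chunk C")]
relevant = keep_relevant(scored_hits, min_score=0.5)
print(relevant)  # [(0.82, 'chunk A')]
```

When the filtered list is empty, the application can report that it found nothing relevant instead of passing unrelated context to the model. LangChain's `as_retriever` is documented to accept `search_type="similarity_score_threshold"` with a `score_threshold` entry in `search_kwargs` for the same purpose; check the behavior against the installed version before relying on it.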


In [32]:
query_1 = "who is the information about"
query_2 = "what is the lenghts of amazon river"

retrieved_docs_1 = retriever.invoke(query_1)
for i, doc in enumerate(retrieved_docs_1):
    print(f"Retrieved Document {i+1} for query '{query_1}': {doc.page_content[:50]}... (metadata: {doc.metadata})")

print("\n")
retrieved_docs_2 = retriever.invoke(query_2)
for i, doc in enumerate(retrieved_docs_2):
    print(f"Retrieved Document {i+1} for query '{query_2}': {doc.page_content[:50]}... (metadata: {doc.metadata})")

Retrieved Document 1 for query 'who is the information about': way to access... (metadata: {'author': 'ChatGPT Canvas', 'creator': 'ChatGPT', 'source': 'data/pdf.pdf', 'page_label': '1', 'total_pages': 3, 'producer': 'WeasyPrint 65.1', 'page': 0, 'title': 'Smart Water Bottle Proposal', 'creationdate': ''})
Retrieved Document 2 for query 'who is the information about': products.... (metadata: {'title': 'Smart Water Bottle Proposal', 'page': 0, 'page_label': '1', 'creator': 'ChatGPT', 'producer': 'WeasyPrint 65.1', 'creationdate': '', 'total_pages': 3, 'source': 'data/pdf.pdf', 'author': 'ChatGPT Canvas'})
Retrieved Document 3 for query 'who is the information about': components.... (metadata: {'page_label': '2', 'title': 'Smart Water Bottle Proposal', 'producer': 'WeasyPrint 65.1', 'source': 'data/pdf.pdf', 'creator': 'ChatGPT', 'total_pages': 3, 'author': 'ChatGPT Canvas', 'creationdate': '', 'page': 1})


Retrieved Document 1 for query 'what is the lenghts of amazon river': way to acc