## RAG Experimentation

We first write and test all of the code in this notebook, then convert it into modular code.

In [2]:
from langchain_openai import ChatOpenAI
from langchain_groq import ChatGroq
from dotenv import load_dotenv
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_google_genai import GoogleGenerativeAIEmbeddings
load_dotenv()

True

In [3]:
## Initialize LLM
groq_llm = ChatGroq(model='qwen/qwen3-32b')

In [4]:
groq_llm.invoke('What is capital of Inida').content

'<think>\nOkay, the user is asking for the capital of Inida. First, I need to check if "Inida" is a recognized country or a misspelling. Let me think. There\'s no country named Inida in the world. The correct spelling might be India. The user probably made a typo. So, I should consider that they meant India. The capital of India is New Delhi. Let me confirm that. Yes, New Delhi has been the capital since 1911 when the British moved the capital from Calcutta. It\'s part of the National Capital Territory of Delhi. I should mention that the correct spelling is India and provide the capital. Also, maybe add a bit about why it\'s New Delhi, maybe a brief historical note. But the user might just want the answer quickly. So the main points: correct the spelling, state the capital, and maybe a short fact. I should make sure to be clear and helpful.\n</think>\n\nThe capital of **India** is **New Delhi**. It is located in the National Capital Territory of Delhi and has been the capital since 191
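
The response above starts with the model's `<think>` reasoning trace, followed by the actual answer. A minimal sketch for dropping that trace before showing the answer, assuming the tags always arrive as a matched pair; `strip_think` is a hypothetical helper and not part of `langchain_groq`:

```python
import re

# Matches one <think>...</think> block, including any newlines inside it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks and trim surrounding whitespace (hypothetical helper)."""
    return _THINK_BLOCK.sub("", text).strip()


# Shortened stand-in for the response shown above.
raw = "<think>\nThe user probably meant India.\n</think>\n\nThe capital of **India** is **New Delhi**."
print(strip_think(raw))
```

Text without a `<think>` block passes through unchanged apart from the whitespace trim.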

In [5]:
## Embedding model
google_embedding_model = GoogleGenerativeAIEmbeddings(model='models/embedding-001')

In [18]:
embedding = google_embedding_model.embed_query('What is the capital of India?')
len(embedding)

768
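
Each query or chunk becomes a 768-dimensional vector, and retrieval works by comparing such vectors. A minimal sketch of one common comparison, cosine similarity, using toy 3-dimensional vectors in place of real embeddings; `cosine_similarity` is a hypothetical helper written for illustration:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for 768-dimensional embeddings.
query_vec = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]
orthogonal = [0.0, 1.0, 0.0]

print(cosine_similarity(query_vec, same_direction))  # close to 1.0
print(cosine_similarity(query_vec, orthogonal))      # 0.0
```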

## RAG

### 1. Data Ingestion

In [7]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import os
from dotenv import load_dotenv
load_dotenv()

True

In [8]:
file_path = os.path.join(os.getcwd(),'data','sample.pdf')
pdf_loader = PyPDFLoader(file_path=file_path)

In [9]:
documents = pdf_loader.load()
len(documents)

77

In [10]:
documents[0]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-07-20T00:30:36+00:00', 'author': '', 'keywords': '', 'moddate': '2023-07-20T00:30:36+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'c:\\Users\\ashut\\OneDrive\\Documents\\study material\\Agentic_AI_Krish\\Projects\\document_analysis_portal\\notebook\\data\\sample.pdf', 'total_pages': 77, 'page': 0, 'page_label': '1'}, page_content='Llama 2: Open Foundation and Fine-Tuned Chat Models\nHugo Touvron∗ Louis Martin† Kevin Stone†\nPeter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra\nPrajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen\nGuillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller\nCynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou\nHakan I

In [11]:
### Output
'''
Document(
    metadata={
        'producer': 'pdfTeX-1.40.25', 
        'creator': 'LaTeX with hyperref', 
        'creationdate': '2023-07-20T00:30:36+00:00', 
        'author': '', 
        'keywords': '', 
        'moddate': '2023-07-20T00:30:36+00:00', 
        'ptex.fullbanner': 'This is pdfTeX, 
        Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 
        'subject': '', 
        'title': '', 
        'trapped': '/False', 
        'source': 'c:\\Users\\ashut\\OneDrive\\Documents\\study material\\Agentic_AI_Krish\\Projects\\document_analysis_portal\\notebook\\data\\sample.pdf', 
        'total_pages': 77, 
        'page': 0, 
        'page_label': '1'
    }, 
    page_content=
        'Llama 2: Open Foundation and Fine-Tuned Chat Models\nHugo Touvron∗ Louis Martin† Kevin Stone†\nPeter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra\nPrajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen\nGuillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller\nCynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou\nHakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev\nPunit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich\nYinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra\nIgor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi\nAlan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang\nRoss Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang\nAngela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic\nSergey Edunov Thomas Scialom∗\nGenAI, Meta\nAbstract\nIn this work, we develop and release Llama 2, a collection of pretrained and fine-tuned\nlarge language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.\nOur fine-tuned LLMs, calledLlama 2-Chat, are optimized for dialogue use cases. Our\nmodels outperform open-source chat models on most benchmarks we tested, and based on\nour human evaluations for helpfulness and safety, may be a suitable substitute for closed-\nsource models. We provide a detailed description of our approach to fine-tuning and safety\nimprovements ofLlama 2-Chatin order to enable the community to build on our work and\ncontribute to the responsible development of LLMs.\n∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com\n†Second author\nContributions for all the authors can be found in Section A.1.\narXiv:2307.09288v2  [cs.CL]  19 Jul 2023'
)
'''
print()




### Chunking

In [22]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 100,
    length_function = len
)

In [23]:
docs = text_splitter.split_documents(documents)
len(docs)

662
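
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of overlapping fixed-size windows. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which first splits on separators such as paragraph and line breaks; `sliding_window_chunks` is a hypothetical helper:

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so neighbouring chunks share their boundary text.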

In [24]:
docs[0]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-07-20T00:30:36+00:00', 'author': '', 'keywords': '', 'moddate': '2023-07-20T00:30:36+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'c:\\Users\\ashut\\OneDrive\\Documents\\study material\\Agentic_AI_Krish\\Projects\\document_analysis_portal\\notebook\\data\\sample.pdf', 'total_pages': 77, 'page': 0, 'page_label': '1'}, page_content='Llama 2: Open Foundation and Fine-Tuned Chat Models\nHugo Touvron∗ Louis Martin† Kevin Stone†\nPeter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra\nPrajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen\nGuillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller\nCynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou\nHakan I

In [25]:
docs[0].metadata

{'producer': 'pdfTeX-1.40.25',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2023-07-20T00:30:36+00:00',
 'author': '',
 'keywords': '',
 'moddate': '2023-07-20T00:30:36+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5',
 'subject': '',
 'title': '',
 'trapped': '/False',
 'source': 'c:\\Users\\ashut\\OneDrive\\Documents\\study material\\Agentic_AI_Krish\\Projects\\document_analysis_portal\\notebook\\data\\sample.pdf',
 'total_pages': 77,
 'page': 0,
 'page_label': '1'}
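
The metadata carried by every chunk is enough to build a source citation for an answer. A small sketch, assuming only the `source` and `page_label` keys shown above; `cite` is a hypothetical helper and the path below is a shortened, made-up example:

```python
import ntpath


def cite(metadata: dict) -> str:
    """Build a short 'file, p. N' citation from chunk metadata (hypothetical helper)."""
    # ntpath handles Windows-style backslash paths on any operating system.
    filename = ntpath.basename(str(metadata.get("source", "unknown")))
    page = metadata.get("page_label", "?")
    return f"{filename}, p. {page}"


# Shortened, made-up path; only the keys mirror the metadata above.
meta = {"source": "c:\\Users\\example\\data\\sample.pdf", "page": 0, "page_label": "1"}
print(cite(meta))
```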

## Vector Store

In [26]:
from langchain_community.vectorstores import FAISS

vectore_store = FAISS.from_documents(
    documents=docs,
    embedding=google_embedding_model
)
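
Conceptually, a flat vector index compares the query embedding with every stored vector and returns the closest ones. A toy brute-force sketch using Euclidean distance; it does not reflect how FAISS is implemented, and all names here are hypothetical:

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(
    query: Sequence[float],
    vectors: List[Sequence[float]],
    texts: List[str],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k texts whose vectors lie closest to the query, nearest first."""
    scored = [(text, l2_distance(query, vec)) for vec, text in zip(vectors, texts)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy 2-dimensional "embeddings" for three chunks.
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
texts = ["chunk A", "chunk B", "chunk C"]

results = top_k([0.9, 1.2], vectors, texts, k=2)
print([text for text, _ in results])
```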

## Similarity Search

In [27]:
relevant_doc = vectore_store.similarity_search("llama2 finetuning bechmark experiment")
relevant_doc

[Document(id='1ab7fe95-49e9-4978-9d27-dca62e7a73cd', metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-07-20T00:30:36+00:00', 'author': '', 'keywords': '', 'moddate': '2023-07-20T00:30:36+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'c:\\Users\\ashut\\OneDrive\\Documents\\study material\\Agentic_AI_Krish\\Projects\\document_analysis_portal\\notebook\\data\\sample.pdf', 'total_pages': 77, 'page': 43, 'page_label': '44'}, page_content='Learning Representations, 2022.\n44'),
 Document(id='c45161a6-b0e4-454c-a9d2-6a037448d288', metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-07-20T00:30:36+00:00', 'author': '', 'keywords': '', 'moddate': '2023-07-20T00:30:36+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3

In [28]:
relevant_doc[0]

Document(id='1ab7fe95-49e9-4978-9d27-dca62e7a73cd', metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-07-20T00:30:36+00:00', 'author': '', 'keywords': '', 'moddate': '2023-07-20T00:30:36+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'c:\\Users\\ashut\\OneDrive\\Documents\\study material\\Agentic_AI_Krish\\Projects\\document_analysis_portal\\notebook\\data\\sample.pdf', 'total_pages': 77, 'page': 43, 'page_label': '44'}, page_content='Learning Representations, 2022.\n44')

In [29]:
relevant_doc[0].metadata

{'producer': 'pdfTeX-1.40.25',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2023-07-20T00:30:36+00:00',
 'author': '',
 'keywords': '',
 'moddate': '2023-07-20T00:30:36+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5',
 'subject': '',
 'title': '',
 'trapped': '/False',
 'source': 'c:\\Users\\ashut\\OneDrive\\Documents\\study material\\Agentic_AI_Krish\\Projects\\document_analysis_portal\\notebook\\data\\sample.pdf',
 'total_pages': 77,
 'page': 43,
 'page_label': '44'}
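
The top hit above is a short fragment of the reference list (`'Learning Representations, 2022.\n44'`), which carries little useful context. One possible mitigation is to drop very short chunks before building the index. A sketch under that assumption; `Chunk`, `drop_short_chunks`, and the 50-character threshold are all hypothetical choices for illustration:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in exposing the two attributes used in this notebook."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def drop_short_chunks(chunks: List[Chunk], min_chars: int = 50) -> List[Chunk]:
    """Keep only chunks with at least min_chars non-whitespace-trimmed characters."""
    return [c for c in chunks if len(c.page_content.strip()) >= min_chars]


chunks = [
    Chunk("Learning Representations, 2022.\n44", {"page": 43}),
    Chunk(
        "In this work, we develop and release Llama 2, a collection of "
        "pretrained and fine-tuned large language models.",
        {"page": 0},
    ),
]

kept = drop_short_chunks(chunks)
print([c.metadata["page"] for c in kept])
```

The threshold is arbitrary and would need tuning against the actual document.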

## Creating Retriever

In [31]:
retriever = vectore_store.as_retriever(search_kwargs = {"k":10})

In [32]:
retriever.invoke('google_embedding_model')

[Document(id='1ab7fe95-49e9-4978-9d27-dca62e7a73cd', metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-07-20T00:30:36+00:00', 'author': '', 'keywords': '', 'moddate': '2023-07-20T00:30:36+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'c:\\Users\\ashut\\OneDrive\\Documents\\study material\\Agentic_AI_Krish\\Projects\\document_analysis_portal\\notebook\\data\\sample.pdf', 'total_pages': 77, 'page': 43, 'page_label': '44'}, page_content='Learning Representations, 2022.\n44'),
 Document(id='c45161a6-b0e4-454c-a9d2-6a037448d288', metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-07-20T00:30:36+00:00', 'author': '', 'keywords': '', 'moddate': '2023-07-20T00:30:36+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3

In [33]:
prompt_template = """
        Answer the question based on the context provided below. 
        If the context does not contain sufficient information, respond with: 
        "I do not have enough information about this."

        Context: {context}

        Question: {question}

        Answer:"""

In [34]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=['context','question']
)

parser = StrOutputParser()

def format_docs(docs):
    return "\n\n".join([doc.page_content for doc in docs])
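
To see what the model eventually receives, the sketch below fills a shortened copy of the template using plain `str.format` as a stand-in for `PromptTemplate`. `Doc` is a hypothetical stand-in for the retrieved documents, and the sample texts are paraphrased from the abstract shown earlier:

```python
from collections import namedtuple

# Hypothetical stand-in: only the page_content attribute is needed here.
Doc = namedtuple("Doc", ["page_content"])

# Shortened copy of the notebook's prompt template.
TEMPLATE = (
    "Answer the question based on the context provided below.\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)


def format_docs(docs):
    """Join the text of all retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    Doc("Llama 2 ranges in scale from 7 billion to 70 billion parameters."),
    Doc("Llama 2-Chat is optimized for dialogue use cases."),
]

filled = TEMPLATE.format(
    context=format_docs(docs),
    question="How large are the Llama 2 models?",
)
print(filled)
```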

In [35]:
from langchain_core.runnables import RunnablePassthrough

In [36]:
rag_chain = (
    # RunnablePassthrough must be instantiated so the question is forwarded unchanged
    {'context': retriever | format_docs, 'question': RunnablePassthrough()}
    | prompt
    | groq_llm
    | parser
)
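
The `|` operator chains steps so that each step's output becomes the next step's input, and the leading dict sends the same input to every branch. The toy sketch below mimics that flow with plain Python; `Step` and `fan_out` are hypothetical and are not LangChain's implementation:

```python
from typing import Any, Callable, Dict


class Step:
    """Toy pipeline step (hypothetical; not LangChain's Runnable)."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def invoke(self, value: Any) -> Any:
        return self.func(value)

    def __or__(self, other: "Step") -> "Step":
        # a | b runs a first, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


def fan_out(branches: Dict[str, Step]) -> Step:
    """Send the same input to every branch and collect the results in a dict."""
    return Step(lambda value: {key: step.invoke(value) for key, step in branches.items()})


# Stand-ins for the retriever, format_docs, the passthrough, and the prompt.
fake_retriever = Step(lambda question: ["chunk one", "chunk two"])
join_chunks = Step(lambda chunks: "\n\n".join(chunks))
passthrough = Step(lambda value: value)
build_prompt = Step(
    lambda d: f"Context: {d['context']}\n\nQuestion: {d['question']}\n\nAnswer:"
)

toy_chain = (
    fan_out({"context": fake_retriever | join_chunks, "question": passthrough})
    | build_prompt
)
print(toy_chain.invoke("What is Llama 2?"))
```

The notebook's chain is called the same way, by passing the question string to `rag_chain.invoke(...)`.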