# A Gentle Introduction to RAG Applications
This notebook creates a simple RAG (Retrieval-Augmented Generation) system to answer questions from a PDF document using an open-source model.
Download the PDF file: https://www.researchgate.net/publication/337053615_Do_health_care_companies_of_India_fulfil_government's_new_orientation_towards_CSR_activities_A_special_consideration_towards_maternal_health

Steps: <br>
<b>1. Document Loaders:</b> Tools that import documents from various formats (like PDFs, Word files, or plain text) into a processable format.<br>
<b>2. Document Transformers:</b> These transform the imported documents, performing tasks like cleaning, formatting, or normalizing the text.<br>
<b>3. Text Splitting:</b> The process of dividing documents into smaller sections (such as sentences or paragraphs) to make analysis and processing easier.<br>
<b>4. Text Embedding Models:</b> Models that convert these text sections into numerical vectors, capturing their meanings to facilitate comparison and retrieval.<br>
<b>5. Vector Stores:</b> Databases designed to hold these vector representations, enabling efficient searches and similarity comparisons.<br>
<b>6. Prompt Template:</b> A standardized format for crafting prompts used with a language model, ensuring clarity and consistency in interactions.<br>
<b>7. LLM Model:</b> The large language model that processes these prompts and generates responses grounded in the text sections retrieved from the vector store.<br>
<b>8. Output:</b> The final results produced by the LLM, which can include summaries, answers, or other text based on the input and prompts.<br>

<img src='pic/0.png' width="500">

![1.gif](attachment:1.gif)

In [2]:
PDF_FILE = "Fulfillment_report_slides.pdf"
MODEL = "llama3.2"

## 1. Loading the PDF document
Let's start by loading the PDF document and breaking it down into separate pages.
<img src='pic/1.png' width="500">

In [4]:
# Import Document loader
from langchain_community.document_loaders import PyPDFLoader
# Load 'Fulfillment_report_slides.pdf'
loaders = [PyPDFLoader(PDF_FILE, extract_images=True)]
docs_pdf = []
for loader in loaders:
    docs_pdf.extend(loader.load())
print('Number of pages:', len(docs_pdf))  # Each page is loaded as one Document
print(docs_pdf[1].page_content[:200])  # First 200 characters of the second page

Number of pages: 11
Introduction
In terms of Marquina[1] classification of stages of growth in corporate social responsibility
(CSR), Indian CSR needs to move towards a greater and detailed developmental stage.
According


## 3. Splitting the pages in chunks
Whole pages are too long to embed and retrieve precisely, so let's split them into smaller, overlapping chunks.
<img src='pic/2.png' width="500">

Fixed-size chunking
<img src='pic/11.png' width="500">
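
As a rough illustration of fixed-size chunking with overlap, the pure-Python sketch below slices a string into windows of `chunk_size` characters, each starting `chunk_size - chunk_overlap` characters after the previous one. The helper name `fixed_size_chunks` and the sample string are hypothetical. It is also a simplification: `RecursiveCharacterTextSplitter` tries to break on separators such as paragraph and line breaks instead of cutting at arbitrary character positions.

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap (illustrative helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = fixed_size_chunks(sample, chunk_size=10, chunk_overlap=3)
for i, chunk in enumerate(chunks):
    print(i, chunk)
```

The last three characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary is still partly visible in both chunks.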

In [5]:
# Document splitting
# Import the text splitter from LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Initialize splitter with specified chunk size and overlap (both measured in characters)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_documents(docs_pdf)  # Split the documents into chunks
print('Number of chunks:', len(splits))  # Print total number of chunks created
# Display the first 200 characters of the first two chunks
print('Data associated with each split 0:', splits[0].page_content[:200])
print('\nData associated with each split 1:', splits[1].page_content[:200])

Number of chunks: 33
Data associated with each split 0: Do health care companies of
India fulfil government ’s new
orientation towards CSR activities
A special consideration towards maternal health
Shashi Lata Yadav
Department of Obstetrical and Gynecologi

Data associated with each split 1: turnover of Rs 10bn or a market capitalization of Rs 5bn or a net profit of Rs 50m) are taken as an initial
frame of reference.Findings –In total, 89.83 per cent of these companies have initiatives re


## Storing the chunks in a vector store
We can now generate embeddings for every chunk and store them in a vector store.
<img src='pic/3.png' width="500">

### 4. Text embedding models
Text embedding models generate numerical representations of text segments, capturing their semantic meaning.
<img src='pic/31.png' width="500">
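
To make the idea of "text in, fixed-length vector out" concrete, here is a toy stand-in for an embedding model. It is an assumption-laden sketch: `toy_embed` is a hypothetical helper that hashes each word into one of `dim` buckets, so it does not capture semantic meaning the way a learned model such as mxbai-embed-large does. It only shows the interface: every text maps to a vector of the same length, and the same text always maps to the same vector.

```python
import hashlib
import math


def toy_embed(text, dim=8):
    """Map text to a unit-length vector by hashing each word into one of `dim` buckets."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # md5 gives a hash that is stable across Python processes
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


a = toy_embed("maternal health in India")
b = toy_embed("maternal health in India")
print(len(a), a == b)
```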

In [8]:
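from langchain_community.embeddings import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="mxbai-embed-large")
# embed_documents expects a list of strings, so pass each chunk's text rather than the Document objects
chunk_embeddings = embeddings.embed_documents([doc.page_content for doc in splits])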

### 5. Vector stores
After the text splitting and embedding processes are complete, the embeddings are stored in a vector store.
Vector stores hold text embeddings and retrieve them efficiently, enabling fast, context-aware lookups. Because a query is embedded with the same model as the stored chunks, the store can run similarity searches between the query and the stored data.

<img src='pic/32.png' width="300">
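
The core of a vector store can be sketched in a few lines: keep (vector, text) pairs and, for a query vector, return the entries with the smallest distance. The class `TinyVectorStore` and the two-dimensional vectors below are hypothetical and hand-made for illustration. The sketch compares the query against every stored vector, whereas FAISS uses optimized index structures for the same job.

```python
import math


class TinyVectorStore:
    """Minimal in-memory store: keeps (vector, text) pairs and searches by L2 distance."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((list(vector), text))

    def search(self, query_vector, k=1):
        # Brute force: measure the distance to every stored vector, then keep the k closest
        scored = [(math.dist(query_vector, vec), text) for vec, text in self._items]
        scored.sort(key=lambda pair: pair[0])
        return scored[:k]


store = TinyVectorStore()
store.add([1.0, 0.0], "chunk about CSR spending")
store.add([0.0, 1.0], "chunk about maternal health")
store.add([0.9, 0.1], "chunk about CSR committees")

results = store.search([1.0, 0.0], k=2)
for distance, text in results:
    print(round(distance, 3), text)
```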

In [9]:
from langchain_community.vectorstores import FAISS
db=FAISS.from_documents(documents=splits,embedding=embeddings)

FAISS indexes can be saved to disk and loaded again, so they don't have to be rebuilt on every run. Saving creates a folder named pdf_faiss_index.

In [10]:
# Save the index to disk
db.save_local("pdf_faiss_index")
# A folder named pdf_faiss_index is created in the current directory

In [19]:
# Set `allow_dangerous_deserialization=True` only if you trust the source of the index.
# Loading relies on pickle, which can execute arbitrary code, so never enable it for
# files from an untrusted source such as a random site on the internet.
new_db = FAISS.load_local("pdf_faiss_index", embeddings, allow_dangerous_deserialization=True)
print('Number of chunks:', new_db.index.ntotal)

Number of chunks: 33


A vector similarity search finds the items most similar to a given query by comparing their vector representations. Similarity is measured with a metric such as cosine similarity, Euclidean distance (L2 norm), or inner product (dot product).
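
The three metrics can be written out in plain Python to see how they differ. The vectors below are made up for illustration: `doc_a` points in the same direction as the query but is twice as long, and `doc_b` is orthogonal to it.

```python
import math


def dot(a, b):
    """Inner product: larger means more similar (sensitive to vector length)."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors: ignores length, 1.0 means same direction."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query_vec = [1.0, 2.0, 2.0]
doc_a = [2.0, 4.0, 4.0]   # same direction as the query, twice the length
doc_b = [2.0, -1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query_vec, doc_a), l2_distance(query_vec, doc_a), dot(query_vec, doc_a))
print(cosine_similarity(query_vec, doc_b), l2_distance(query_vec, doc_b), dot(query_vec, doc_b))
```

Cosine similarity treats `doc_a` as a perfect match because only direction counts, while the L2 distance between the query and `doc_a` is still greater than zero because the lengths differ.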
The code below queries the vector database and returns the top 5 matching results.

In [12]:
query="Clinical study reports"
docs=new_db.similarity_search(query,k=5)
docs[0].page_content

'2013 were studied and results were outlined.\nKeywords Corporate social responsibility, Companies act, Healthcare sector, Maternal health,\nMaternal mortality\nPaper type Research paper\nJournal of Health Research\nEmerald Publishing Limited\n2586-940X\nDOI 10.1108/JHR-01-2019-0014Received 8 February 2019\nAccepted 24 June 2019The current issue and full text archive of this journal is available on Emerald Insight at:\nwww.emeraldinsight.com /2586-940X.htm\n© Shashi Lata Yadav, Debasis Patnaik and Babitha Vishwanath. Published in Journal of Health\nResearch . Published by Emerald Publishing Limited. This article is published under the Creative\nCommons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and createderivative works of this article (for both commercial and non-commercial purposes), subject to fullattribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcodeHe

The similarity_search_with_score method returns both the documents and the distance score of the query relative to them. The distance score represents the Euclidean (L2) distance, where a lower L2 distance indicates greater similarity.

In [16]:
docs_and_score=new_db.similarity_search_with_score(query,k=2)
docs_and_score[0]

(Document(metadata={'source': 'Fulfillment_report_slides.pdf', 'page': 0}, page_content='2013 were studied and results were outlined.\nKeywords Corporate social responsibility, Companies act, Healthcare sector, Maternal health,\nMaternal mortality\nPaper type Research paper\nJournal of Health Research\nEmerald Publishing Limited\n2586-940X\nDOI 10.1108/JHR-01-2019-0014Received 8 February 2019\nAccepted 24 June 2019The current issue and full text archive of this journal is available on Emerald Insight at:\nwww.emeraldinsight.com /2586-940X.htm\n© Shashi Lata Yadav, Debasis Patnaik and Babitha Vishwanath. Published in Journal of Health\nResearch . Published by Emerald Publishing Limited. This article is published under the Creative\nCommons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and createderivative works of this article (for both commercial and non-commercial purposes), subject to fullattribution to the original publication and authors. The full ter

The max_marginal_relevance_search method of new_db lets us run a search with the following parameters:

<b>query:</b> The search term or query.<br>
<b>k:</b> The number of results to return.<br>
<b>fetch_k:</b> The number of candidate documents fetched by similarity before they are re-ranked for diversity; it should be at least k.<br>
This method aims to retrieve documents that not only match the query but are also different from one another, which reduces near-duplicate results.
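
The selection logic behind maximal marginal relevance can be sketched in plain Python. This is a minimal version under stated assumptions, not LangChain's implementation: the function `mmr_select`, the trade-off weight `lambda_mult=0.5`, and the two-dimensional vectors are illustrative, and cosine similarity is used both for relevance and for redundancy. The sketch first keeps the `fetch_k` most relevant candidates, then repeatedly picks the candidate that is relevant to the query while being least similar to what has already been selected.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def mmr_select(query_vec, doc_vecs, k, fetch_k, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance."""
    # Step 1: keep the fetch_k documents most similar to the query.
    by_relevance = sorted(range(len(doc_vecs)),
                          key=lambda i: cosine(query_vec, doc_vecs[i]),
                          reverse=True)
    candidates = by_relevance[:fetch_k]
    selected = []
    # Step 2: greedily add the candidate with the best relevance/novelty trade-off.
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [
    [1.0, 0.1],   # 0: very relevant
    [1.0, 0.12],  # 1: near-duplicate of 0
    [0.7, -0.7],  # 2: less relevant, but different from 0
    [0.0, 1.0],   # 3: irrelevant
]
print(mmr_select(query_vec, doc_vecs, k=2, fetch_k=3))
print(mmr_select(query_vec, doc_vecs, k=5, fetch_k=2))
```

In the first call the near-duplicate (index 1) is skipped in favour of the less relevant but more novel document (index 2). In the second call only two indices come back even though `k=5`, because the candidate pool is limited to `fetch_k=2` documents.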

In [24]:
# Using MMR search criteria
# fetch_k is the size of the candidate pool, so it should be at least k
mmr = new_db.max_marginal_relevance_search(query, k=5, fetch_k=20)
mmr[0].page_content

'2013 were studied and results were outlined.\nKeywords Corporate social responsibility, Companies act, Healthcare sector, Maternal health,\nMaternal mortality\nPaper type Research paper\nJournal of Health Research\nEmerald Publishing Limited\n2586-940X\nDOI 10.1108/JHR-01-2019-0014Received 8 February 2019\nAccepted 24 June 2019The current issue and full text archive of this journal is available on Emerald Insight at:\nwww.emeraldinsight.com /2586-940X.htm\n© Shashi Lata Yadav, Debasis Patnaik and Babitha Vishwanath. Published in Journal of Health\nResearch . Published by Emerald Publishing Limited. This article is published under the Creative\nCommons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and createderivative works of this article (for both commercial and non-commercial purposes), subject to fullattribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcodeHe

## Setting up a retriever
We can use a retriever to find chunks in the vector store that are similar to a supplied question.
<img src='pic/4.png' width="500">

In [33]:
retriever = new_db.as_retriever(search_type="mmr")
docs = retriever.invoke(query)
docs[0].page_content  # show the content of the first document

'2013 were studied and results were outlined.\nKeywords Corporate social responsibility, Companies act, Healthcare sector, Maternal health,\nMaternal mortality\nPaper type Research paper\nJournal of Health Research\nEmerald Publishing Limited\n2586-940X\nDOI 10.1108/JHR-01-2019-0014Received 8 February 2019\nAccepted 24 June 2019The current issue and full text archive of this journal is available on Emerald Insight at:\nwww.emeraldinsight.com /2586-940X.htm\n© Shashi Lata Yadav, Debasis Patnaik and Babitha Vishwanath. Published in Journal of Health\nResearch . Published by Emerald Publishing Limited. This article is published under the Creative\nCommons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and createderivative works of this article (for both commercial and non-commercial purposes), subject to fullattribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcodeHe

## 6. Setting up a prompt
In addition to the question we want to ask, we also want to provide the model with the context from the PDF file. We can use a prompt template to define and reuse the prompt we'll use with the model.
<img src='pic/6.png' width="500">
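
Before looking at the LangChain version, it may help to see what filling such a template amounts to. The sketch below uses only `str.format`. The shortened `TEMPLATE`, the helper `build_prompt`, the placeholder chunks, and the choice of a blank line as separator between chunks are all assumptions made for illustration.

```python
TEMPLATE = """Use the following context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""


def build_prompt(chunks, question):
    """Join retrieved chunks into one context string and fill the template."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return TEMPLATE.format(context=context, question=question)


retrieved_chunks = [
    "Chunk one: text returned by the retriever.",
    "Chunk two: more text returned by the retriever.",
]
prompt = build_prompt(retrieved_chunks, "What does the document say?")
print(prompt)
```

The model only ever sees the final string, so everything it knows about the PDF comes from the chunks placed in `{context}`.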

In [26]:
# Build prompt
from langchain.prompts import PromptTemplate
template = """Use the following context to accurately answer the question at the end. If you don't
know the answer, say don't have information on it. Keep the answer concise, precise as possible.
{context} 
Question: {question} 
Helpful Answer:"""
QA_PROMPT = PromptTemplate(input_variables=["context", "question"],template=template)
QA_PROMPT

PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="Use the following context to accurately answer the question at the end. If you don't\nknow the answer, say don't have information on it. Keep the answer concise, precise as possible.\n{context} \nQuestion: {question} \nHelpful Answer:")

## 7. Configuring the model
We'll be using Ollama to load the local model in memory. After creating the model, we can invoke it with a question to get the response back.
<img src='pic/5.png' width="500">

In [28]:
# Ollama model
from langchain_community.llms import Ollama
local_llm = Ollama(model=MODEL)

In [30]:
# Define the text for prediction
text="What is the capital of India"
# Predicting with model
print(local_llm.invoke(text))

The capital of India is New Delhi.


## Adding the prompt to the chain
We can now chain the retriever, the prompt, and the model.
<img src='pic/7.png' width="500">
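
Conceptually, the chain is just function composition: retrieve chunks for the question, build a prompt from them, and pass the prompt to the model. The sketch below shows that flow with stand-ins so it runs without a vector store or a model. `make_rag_chain` and the three `fake_*` functions are hypothetical, and the returned dictionary keys are chosen to resemble the `result` and `source_documents` keys used in this notebook.

```python
def make_rag_chain(retrieve, build_prompt, llm):
    """Compose three callables into one question-answering function."""
    def run(question):
        docs = retrieve(question)              # 1. find relevant chunks
        prompt = build_prompt(docs, question)  # 2. put them into the prompt
        return {"result": llm(prompt), "source_documents": docs}  # 3. generate
    return run


# Stand-ins for the real retriever, prompt template, and model.
def fake_retrieve(question):
    return ["All 59 sample companies have formulated CSR policies."]


def fake_build_prompt(docs, question):
    return "Context:\n" + "\n\n".join(docs) + "\nQuestion: " + question


def fake_llm(prompt):
    return "(model output for a prompt of %d characters)" % len(prompt)


chain = make_rag_chain(fake_retrieve, fake_build_prompt, fake_llm)
out = chain("How many companies have CSR policies?")
print(out["result"])
print(out["source_documents"])
```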

In [34]:
# Establishing a QA chain
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
    local_llm,
    retriever=retriever,
    return_source_documents=True,  # also return the chunks used to build the answer
    chain_type_kwargs={"prompt": QA_PROMPT},
)

## Using the chain to answer questions
Finally, we can use the chain to ask questions that will be answered using the PDF document.

In [36]:
# Run chain
question = "Summarise the report Fulfillment_report_slides.pdf"
result = qa_chain.invoke({"query": question})
result["result"]

'The report, titled "Corporate Social Responsibility in Healthcare Sector of India," is a research paper published in the Journal of Health Research. It studied 59 healthcare sector companies in India that have formulated CSR policies and obtained approval from their Board of Directors.\n\nKey findings include:\n\n* Only 74.6% of the funds allocated for CSR activities were spent, with some companies exceeding the required amount.\n* The average spending on CSR was Rs 3.1787 billion (2% of total revenue), despite a sanctioned amount of Rs 40.55 crores.\n* Four companies had negative net profits and did not spend on CSR activities.\n* The CSR Committee size varies from 3 to 5 directors, with most companies having either one or two independent directors.\n\nThe report highlights the need for improvement in the implementation of CSR policies by healthcare companies in India.'

In [31]:
question = "What are the CSR activities undertaken by the companies in Goa?"
result = qa_chain.invoke({"query": question})
result["result"]

'According to the study, the CSR activities undertaken by companies in Goa include:\n\n* Promotion of education (100% of pharmaceutical and non-pharmaceutical companies)\n* Environmental sustainability and enhancing vocational skills (71.42% of pharmaceutical and non-pharmaceutical companies)\n* Social business projects and contribution to government funds (55.14% of pharmaceutical and non-pharmaceutical companies)\n\nHowever, there is a lack of data on specific CSR activities related to maternal health, antenatal care, immunization, health insurance, etc. Only one company provided medical assistance/cataract/dialysis services.'

To enable smooth conversations, it’s important to include conversation memory that can handle multiple questions in a chat-like format.<br>

<b>ConversationBufferMemory:</b> This class stores the full history of the conversation so the system can refer to previous exchanges. The memory_key specifies the key under which the chat history is stored ("chat_history" here), and return_messages=True returns the history as a list of message objects rather than as a single concatenated string.<br>
<b>ConversationalRetrievalChain:</b> This chain builds on the capabilities of the retrieval QA system by incorporating memory. It allows the model to refer back to prior messages in the conversation, enhancing context awareness and enabling more coherent and contextually relevant responses.<br>

<b>Benefits:</b><br>
<b>Contextual Awareness:</b> The model can provide responses that consider the entire conversation, leading to more meaningful interactions.<br>
<b>User Experience:</b> It creates a more natural and engaging experience for users, as the system can maintain context over multiple exchanges.<br>
<b>Efficiency:</b> By leveraging prior interactions, the system can avoid unnecessary repetition and streamline the conversation.
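
The idea of buffer memory can be sketched without any library: keep every turn in a list and prepend the history when a follow-up question arrives, so the follow-up can be understood in context. The class `BufferMemory`, the helper `with_history`, and the sample turn are hypothetical, and the wording of the combined text is an assumption rather than the prompt LangChain actually uses.

```python
class BufferMemory:
    """Keeps every (role, text) turn of the conversation in order."""

    def __init__(self):
        self.messages = []

    def add_turn(self, question, answer):
        self.messages.append(("human", question))
        self.messages.append(("ai", answer))

    def as_text(self):
        return "\n".join(f"{role}: {text}" for role, text in self.messages)


def with_history(memory, follow_up):
    """Build the text a model would see when interpreting a follow-up question."""
    return f"Chat history:\n{memory.as_text()}\nFollow-up question: {follow_up}"


memory = BufferMemory()
memory.add_turn("Which sector does the study cover?", "The healthcare sector in India.")
prompt = with_history(memory, "How many companies were sampled?")
print(prompt)
```

Because the buffer keeps every turn, the text sent to the model grows with the length of the conversation.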

In [41]:
# Adding memory
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
# Chain that also accounts for conversational memory
qa = ConversationalRetrievalChain.from_llm(local_llm, retriever=retriever, memory=memory)

In [42]:
question = "Create a summary on how do the Indian health care companies fulfill government’s new orientation towards CSR activities"
result = qa.invoke({"question": question})
print(result['answer'])

Based on the provided context, it appears that Indian healthcare companies are slowly fulfilling the government's new orientation towards Corporate Social Responsibility (CSR) activities. The context mentions various reports and articles related to CSR in India, including those specifically focused on the healthcare sector.

However, I couldn't find any specific information or data on how Indian healthcare companies are fulfilling the government's new orientation towards CSR activities, particularly with a special consideration towards maternal health. The provided tables and sources do not provide clear answers to this question.

Therefore, I don't have enough information to provide a detailed summary on this topic.


In [35]:
# Follow-up question
question = "What are the different CSR policies that have been formulated and committees dedicated to implementing these policies?"
result = qa.invoke({"question": question})
result['answer']

'The text does not provide information on the specific CSR policies that have been formulated for the Indian healthcare sector or the committees dedicated to implementing them. However, it does mention that all 59 sample companies in the healthcare sector have formulated CSR policies and obtained approval from their Board of Directors (CSR Committee), according to Companies Act, 2013 guidelines.'