<a href="https://colab.research.google.com/github/sachinkumar99/Custom_RAG_LLM/blob/main/Custom_RAG_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let's install the required Python libraries.

In [1]:
!pip install -q langchain==0.1.11
!pip install -q torch
!pip install -q transformers
!pip install -q sentence-transformers
!pip install -q datasets
!pip install -q faiss-cpu
!pip install -q pypdf
!pip install unstructured
!pip install pdf2image
!pip install pdfminer.six
!pip install pillow_heif
!pip install pikepdf
!pip install unstructured_inference
!pip install pytesseract
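# poppler-utils is a system package (used by pdf2image), so install it with apt.
!apt-get install -y poppler-utils
# PyMuPDF provides the `fitz` module; the separate `fitz` package on PyPI is unrelated.
!pip install -q pymupdf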
!pip install langchainhub

Collecting urllib3==1.26.18 (from unstructured)
  Using cached urllib3-1.26.18-py2.py3-none-any.whl (143 kB)
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 2.2.1
    Uninstalling urllib3-2.2.1:
      Successfully uninstalled urllib3-2.2.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.15.0 requires wrapt<1.15,>=1.11.0, but you have wrapt 1.16.0 which is incompatible.
types-requests 2.31.0.20240311 requires urllib3>=2, but you have urllib3 1.26.18 which is incompatible.
Successfully installed urllib3-1.26.18
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 38 not upgraded.


Let's import the Python libraries.

In [5]:
from langchain.document_loaders import HuggingFaceDatasetLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import PyMuPDFLoader

In [6]:
# RAG prompt
from langchain import hub

# Loads the latest version
prompt = hub.pull("rlm/rag-prompt", api_url="https://api.hub.langchain.com")
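
The pulled prompt is a template with two placeholders, one for the question and one for the retrieved context. As a rough illustration of how it gets filled, here is a plain-Python sketch. The template text is copied from the prompt as it appears rendered in this notebook's final output; `fill_rag_prompt` is a hypothetical helper, and joining chunks with a blank line is an assumption based on that same output.

```python
# Template text reconstructed from the prompt rendered later in this notebook.
RAG_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question} \n"
    "Context: {context} \n"
    "Answer:"
)


def fill_rag_prompt(question: str, chunks: list[str]) -> str:
    """Hypothetical helper: join retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)
    return RAG_TEMPLATE.format(question=question, context=context)


filled = fill_rag_prompt("What is Homeostasis?", ["Chunk one.", "Chunk two."])
print(filled)
```

The result is a single string, which suits a text-generation model that reads a prompt and continues after `Answer:`.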

Create the PDF loader instance and load the external book data.

---



In [7]:
loader = PyMuPDFLoader('ConceptsofBiology-WEB.pdf')

In [8]:
data = loader.load()

We will consider only the first two chapters for sentence embeddings and storage. Pages 18 to 67 of the loaded PDF hold this data, hence the slice `data[18:68]` below.

In [9]:
# Create an instance of the RecursiveCharacterTextSplitter class with specific parameters.
# It splits text into chunks of 1000 characters each with a 150-character overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

# 'data' holds the loaded pages; split only the slice covering the first two chapters into chunked documents.
docs = text_splitter.split_documents(data[18:68])
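
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-width chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that class prefers to break on separators such as paragraphs and newlines); `chunk_with_overlap` is a hypothetical helper that only shows how consecutive chunks share their boundary characters.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Simplified fixed-width chunker; NOT the recursive separator logic LangChain uses."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# The last 3 characters of each chunk reappear at the start of the next one.
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.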

**Text Embeddings**

**Model - 'all-mpnet-base-v2'**  
It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

In [10]:
# Define the path to the pre-trained model
modelPath = 'sentence-transformers/all-mpnet-base-v2'

# Create a dictionary with model configuration options, specifying to use the CPU for computations
model_kwargs = {'device':'cpu'}

# Create a dictionary with encoding options, specifically setting 'normalize_embeddings' to False
encode_kwargs = {'normalize_embeddings': False}

# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,     # Provide the pre-trained model's path
    model_kwargs=model_kwargs, # Pass the model configuration options
    encode_kwargs=encode_kwargs # Pass the encoding options
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  return self.fget.__get__(instance, owner)()
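
The `normalize_embeddings` option matters for how distances are interpreted. As a small worked example (toy 2-D vectors, and assuming the index ranks neighbours by Euclidean distance), once vectors are scaled to unit length the squared L2 distance becomes `2 - 2 * cosine`, so L2 and cosine rankings agree. The helper names below are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = [3.0, 4.0]
b = [1.0, 0.0]
a_n, b_n = l2_normalize(a), l2_normalize(b)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
lhs = squared_l2(a_n, b_n)
rhs = 2 - 2 * cosine(a, b)
print(lhs, rhs)
```

Without normalization, a long vector can be far away in L2 terms even when it points in nearly the same direction as the query.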


**Vector Stores**

I am using the FAISS vector store provided by LangChain to store the embeddings and retrieve them.

In [11]:
db = FAISS.from_documents(docs, embeddings)

In [12]:
# Let's do similarity search against the vector storage
question = "What is biology?"
searchDocs = db.similarity_search(question)
print(searchDocs[0].page_content)

Biology is the science that studies life. What exactly is life? This may sound like a silly question
with an obvious answer, but it is not easy to define life. For example, a branch of biology called
virology studies viruses, which exhibit some of the characteristics of living entities but lack others.
It turns out that although viruses can attack living organisms, cause diseases, and even reproduce,
FIGURE 1.1 This NASA image is a composite of several satellite-based views of Earth. To make the whole-Earth
image, NASA scientists combine observations of different parts of the planet. (credit: modification of work by NASA)
CHAPTER OUTLINE
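
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The brute-force sketch below illustrates that idea with hypothetical 2-D vectors and a hypothetical `top_k_l2` helper; FAISS uses optimized index structures rather than this loop, and Euclidean distance is an assumption here.

```python
import math


def top_k_l2(query_vec, doc_vecs, k=4):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = [(i, math.dist(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Hypothetical 2-D "embeddings"; real all-mpnet-base-v2 vectors have 768 dimensions.
doc_vecs = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
query_vec = [1.0, 0.0]

hits = top_k_l2(query_vec, doc_vecs, k=2)
print([i for i, _ in hits])
```

The returned indices play the role of `searchDocs`: position 0 is the closest chunk, which is why the notebook prints `searchDocs[0].page_content`.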


**LLM Model for question answering**

I selected the question-answering model below from Hugging Face based on leaderboard results.

In [13]:
# Create a tokenizer object by loading the pretrained "deepset/roberta-base-squad2" tokenizer.
tokenizer = AutoTokenizer.from_pretrained('deepset/roberta-base-squad2')

# Create a question-answering model object by loading the pretrained "deepset/roberta-base-squad2" model.
model = AutoModelForQuestionAnswering.from_pretrained('deepset/roberta-base-squad2')

In [14]:
# Specify the model name you want to use
model_name = "deepset/roberta-base-squad2"

# Load the tokenizer associated with the specified model
tokenizer = AutoTokenizer.from_pretrained(model_name, padding=True, truncation=True, max_length=512)

# Define a question-answering pipeline using the model and tokenizer
question_answerer = pipeline(
    "question-answering",
    model=model_name,
    tokenizer=tokenizer,
    return_tensors='pt',
)

# Create an instance of the HuggingFacePipeline, which wraps the question-answering pipeline
# with additional model-specific arguments (temperature and max_length)
llm = HuggingFacePipeline(
    pipeline=question_answerer,
    model_kwargs={"temperature": 0.7, "max_length": 512},
)

In [15]:
# Wrap the FAISS store 'db' as a retriever using the 'as_retriever' method.
# With no search_kwargs, it returns the 4 most similar chunks for a query.
retriever = db.as_retriever()

In [16]:
docs = retriever.get_relevant_documents("What is biology?")
print(docs[0].page_content)

Biology is the science that studies life. What exactly is life? This may sound like a silly question
with an obvious answer, but it is not easy to define life. For example, a branch of biology called
virology studies viruses, which exhibit some of the characteristics of living entities but lack others.
It turns out that although viruses can attack living organisms, cause diseases, and even reproduce,
FIGURE 1.1 This NASA image is a composite of several satellite-based views of Earth. To make the whole-Earth
image, NASA scientists combine observations of different parts of the planet. (credit: modification of work by NASA)
CHAPTER OUTLINE


In [18]:
# Create a retriever from 'db' that returns up to 2 relevant splits/documents per query.
retriever = db.as_retriever(search_kwargs={"k": 2})

# Create a question-answering chain (qa) using the RetrievalQA class.
# It is configured with the language model (llm), the retriever created above, the RAG prompt,
# and an option to not return source documents. The chain type is left at its default ("stuff").
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=False,
    chain_type_kwargs={"prompt": prompt},
)

Note for evaluators: I spent significant time trying to fix the error below. The issue is a mismatch with the prompt template: although I am passing the question correctly, the pipeline expects the data in SquadExample format. After thorough research, I believe this error is due to a LangChain version incompatibility. If I had more time, I would have rolled the version back and tested it. Thanks!

Things I wish I could have done but skipped due to time constraints:
1. Spend more time cleaning the external data passed in.
2. Fine-tune the prompt template for the current use case.
3. Build a Streamlit app and host this solution.

In [21]:
question = "What is Homeostasis?"
# invoke() returns a dict with a 'result' key; run() would return a bare string.
result = qa.invoke({'query': question})
print(result['result'])

ValueError: Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: What is Homeostasis? 
Context: FIGURE 1.4 Although no two look alike, these kittens have inherited genes from both parents and share many of the same characteristics.
(credit: Pieter & Renée Lanser)
Regulation/Homeostasis
Even the smallest organisms are complex and require multiple regulatory mechanisms to coordinate internal
functions, such as the transport of nutrients, response to stimuli, and coping with environmental stresses.
Homeostasis (literally, “steady state”) refers to the relatively stable internal environment required to maintain life.
For example, organ systems such as the digestive or circulatory systems perform specific functions like carrying
oxygen throughout the body, removing wastes, delivering nutrients to every cell, and cooling the body.
To function properly, cells require appropriate conditions such as proper temperature, pH, and concentrations of
diverse chemicals. These conditions may, however, change from one moment to the next. Organisms are able to

diverse chemicals. These conditions may, however, change from one moment to the next. Organisms are able to
maintain homeostatic internal conditions within a narrow range almost constantly, despite environmental changes,
by activation of regulatory mechanisms. For example, many organisms regulate their body temperature in a process
known as thermoregulation. Organisms that live in cold climates, such as the polar bear (Figure 1.5), have body
structures that help them withstand low temperatures and conserve body heat. In hot climates, organisms have
methods (such as perspiration in humans or panting in dogs) that help them to shed excess body heat.
FIGURE 1.5 Polar bears and other mammals living in ice-covered regions maintain their body temperature by generating heat and reducing
heat loss through thick fur and a dense layer of fat under their skin. (credit: "longhorndave"/Flickr)
Energy Processing 
Answer: argument needs to be of type (SquadExample, dict)
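
The final line of the message matches the error the `transformers` question-answering pipeline raises when its input is not a question/context pair, and a single rendered prompt string is exactly such an input. One possible workaround, sketched below and not tested against this notebook, is to skip the prompt template for this extractive model and pass the retrieved chunks as `context` directly. `answer_from_chunks` is a hypothetical helper, and `toy_qa` is a stand-in so the sketch runs without downloading a model; in the notebook, `question_answerer` would take its place.

```python
from typing import Callable, Dict, List


def answer_from_chunks(question: str, chunks: List[str],
                       qa_fn: Callable[..., Dict]) -> str:
    """Hypothetical helper: feed retrieved text to an extractive QA callable."""
    context = "\n\n".join(chunks)
    # The transformers question-answering pipeline accepts question= and context=
    # keyword arguments and returns a dict that includes an "answer" key.
    result = qa_fn(question=question, context=context)
    return result["answer"]


def toy_qa(question: str, context: str) -> Dict:
    """Stand-in for `question_answerer`: returns the first sentence of the context."""
    first_sentence = context.split(". ")[0].rstrip(".") + "."
    return {"answer": first_sentence, "score": 0.0}


chunks = [
    "Homeostasis refers to the relatively stable internal environment "
    "required to maintain life. Cells require appropriate conditions."
]
answer = answer_from_chunks("What is Homeostasis?", chunks, toy_qa)
print(answer)

# In the notebook (assumption, not run here):
# chunks = [d.page_content for d in retriever.get_relevant_documents(question)]
# print(answer_from_chunks(question, chunks, question_answerer))
```

An alternative would be to keep `RetrievalQA` and the RAG prompt but swap in a text-generation model, since that prompt is written for a model that produces free-form text rather than extracting a span.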