# Retrieval Augmented Generation

Retrieval-Augmented Generation (RAG) is a technique that enhances generative AI models by integrating information retrieval capabilities. It allows the model to reference external knowledge bases, providing more accurate and contextually relevant responses. This approach optimizes the output of large language models without the need for extensive retraining.

The image below shows a simple RAG process flow:
![RAG Architecture](https://truera.com/wp-content/uploads/2023/09/truera-architecture-for-chatot-figure-1-1024x561.png)

- **Query Generation:** The user inputs a query or prompt.

- **Retrieval:** The system searches external knowledge bases or vector databases to retrieve relevant information.

- **Integration:** The retrieved information is added to the prompt as context for the generative model.

- **Generation:** The model generates a response that incorporates both the retrieved information and the user query.

- **Output:** The final response is presented to the user, providing a more accurate and contextually relevant answer.

This process ensures that the generated content is enriched with up-to-date and relevant information from external sources.
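
The steps above can be sketched in a few lines of plain Python. This is only a toy illustration, not the pipeline built later in this notebook: the three-sentence knowledge base, the word-overlap scoring rule, and the `generate` stub are all hypothetical stand-ins for a vector database and an LLM.

```python
# Toy illustration of the RAG flow: retrieve -> integrate -> generate.
# KNOWLEDGE_BASE, the scoring rule, and generate() are hypothetical stand-ins.

KNOWLEDGE_BASE = [
    "Homeostasis is the ability of an organism to maintain constant internal conditions.",
    "Atoms are the smallest units of matter that retain the properties of an element.",
    "Evolution is the process of gradual change in a population over time.",
]


def retrieve(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda doc: len(query_words & set(doc.lower().rstrip(".").split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Integration step: place the retrieved passages in front of the question."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"


def generate(prompt: str) -> str:
    """Stand-in for an LLM call; a real system would send the prompt to a model."""
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"


query = "what is homeostasis"
context = retrieve(query, KNOWLEDGE_BASE)
prompt = build_prompt(query, context)
print(generate(prompt))
```

In a real system, `retrieve` would query a vector store and `generate` would call the language model; the prompt assembly in the middle is the part that makes the generation "retrieval-augmented".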



### Assumptions & Considerations


1.   I have used **LlamaIndex** as the framework for this Retrieval-Augmented Generation assignment.
2.   The open-source vector database used is **ChromaDB**.
3.   **Phi-3.5-mini-instruct** from Microsoft, loaded from Hugging Face, is the LLM. The model is loaded in 16-bit precision to reduce memory use and improve response time.
4.   The **PyPDF2** Python library is used to extract the first two chapters from the full PDF.
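
To make the memory argument for 16-bit loading concrete, here is a rough back-of-the-envelope estimate. It assumes a parameter count of about 3.8 billion (the figure I understand the model card to list for Phi-3.5-mini) and counts the weights only, so real usage will be higher.

```python
# Back-of-the-envelope weight memory at different precisions.
# ASSUMPTION: ~3.8 billion parameters for Phi-3.5-mini (taken from the model card).
PARAMS = 3.8e9

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (10^9 bytes); excludes activations and KV cache."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(PARAMS, dtype):.1f} GB")
# float32: 15.2 GB
# float16: 7.6 GB
```

Halving the bytes per parameter halves the weight footprint, which is what makes the model fit on a smaller GPU.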



### 1. Libraries installation

In [1]:
!pip install llama-index==0.11
!pip install chromadb
!pip install llama-index-vector-stores-chroma
!pip install llama-index-embeddings-huggingface
!pip install llama-index-llms-huggingface
!pip install PyPDF2


### 2. Libraries import

In [10]:
import os
import torch
import chromadb                                                       #Open source vector DB

from PyPDF2 import PdfWriter, PdfReader, PdfMerger                    #Extract only 2 chapters from entire PDF

from llama_index.llms.huggingface import HuggingFaceLLM               #Open source LLM from HF
from llama_index.core.node_parser import TokenTextSplitter            #For chunking of data
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader  #Index for querying, Directory to read files
from llama_index.vector_stores.chroma import ChromaVectorStore        #Chromadb integration with llama index
from llama_index.core import Settings
from llama_index.core import StorageContext
from llama_index.embeddings.huggingface import HuggingFaceEmbedding   #HuggingFace embedding model

from IPython.display import Markdown, display

### 3. Data Ingestion

#### 3.1 Downloading data

In [9]:
!mkdir -p 'data'
!wget 'https://assets.openstax.org/oscms-prodcms/media/documents/ConceptsofBiology-WEB.pdf?_gl=1*hrlp5i*_gcl_au*NjEzNDc3NjM5LjE3MzE0ODUyNTg.*_ga*MTcwOTY3OTQ2NC4xNzMxNDg1MjU5*_ga_T746F8B0QC*MTczMTczMzYxNy4yLjEuMTczMTczMzYxNy42MC4wLjA.' -O 'data/ConceptsofBiology_full.pdf'




#### 3.2 Selecting only the first 2 chapters

In [16]:
!mkdir -p 'data/pdf_split'

In [30]:
file_path = 'data/ConceptsofBiology_full.pdf'
file_name = os.path.splitext(os.path.basename(file_path))[0]

# Chapter 1 starts at PDF page 19 and chapter 2 ends at PDF page 68 (1-based page numbers).
start_pg = 19
last_pg = 68

if not os.path.isfile(file_path):
    print(f"The file {file_path} does not exist.")
else:
    # Passing the path lets PyPDF2 manage the file handle itself
    inputpdf = PdfReader(file_path)

    # Write each selected page to its own single-page PDF (i is the 0-based page index)
    for i in range(start_pg-1, last_pg):
        writer = PdfWriter()
        writer.add_page(inputpdf.pages[i])
        with open(f"data/pdf_split/{file_name}-pg_{str(i).zfill(2)}.pdf", "wb") as output_pdf:
            writer.write(output_pdf)

    # Merge the split PDF files
    merge_pdf = PdfMerger()

    for i in range(start_pg-1, last_pg):
        merge_pdf.append(f"data/pdf_split/{file_name}-pg_{str(i).zfill(2)}.pdf")

    with open("data/data.pdf", "wb") as output_pdf:
        merge_pdf.write(output_pdf)

    # Release the file handles held by the merger
    merge_pdf.close()
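

The page arithmetic in the cell above is easy to get wrong by one, so here is a small check of how the 1-based page numbers map to the 0-based indices used by `inputpdf.pages`.

```python
# Map 1-based page numbers (as shown in a PDF viewer) to 0-based page indices.
start_pg, last_pg = 19, 68

# range() excludes its stop value, so stopping at last_pg still includes page 68 (index 67)
page_indices = list(range(start_pg - 1, last_pg))

print(page_indices[0], page_indices[-1], len(page_indices))
# 18 67 50
```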


#### 3.3 Embedding data into Vector DB

In [None]:
# load documents
documents = SimpleDirectoryReader(input_files=['data/data.pdf']).load_data()
splitter = TokenTextSplitter(chunk_size=200, chunk_overlap=20)
nodes = splitter.get_nodes_from_documents(documents)
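

To show what `chunk_size=200` and `chunk_overlap=20` mean, here is a simplified sliding-window chunker. It is only a sketch of the idea: `chunk_tokens` is a hypothetical helper, and the real `TokenTextSplitter` counts tokens with a tokenizer and prefers to break on separators, so its chunk boundaries will differ.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# 500 placeholder tokens standing in for a tokenized document
tokens = [f"t{i}" for i in range(500)]
chunks = chunk_tokens(tokens, chunk_size=200, chunk_overlap=20)

print([len(c) for c in chunks])
# [200, 200, 140]
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some tokens twice.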

In [18]:
# create client and a new collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.get_or_create_collection("Data_DB_01")

In [19]:
# define embedding function
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")



In [38]:
# set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
    nodes, storage_context=storage_context, embed_model=embed_model
)


### 4. Querying the Data

In [39]:
llm_model = 'microsoft/Phi-3.5-mini-instruct'

In [40]:
Settings.llm = HuggingFaceLLM(
    model_name=llm_model,
    tokenizer_name=llm_model,
    context_window=3900,
    max_new_tokens=2000,
    model_kwargs={"torch_dtype": torch.float16},
    generate_kwargs={"temperature": 0.1, "top_k": 5, "top_p": 0.9},
    device_map="auto",
)


In [41]:
# Query Data
query_engine = index.as_query_engine(
        llm=Settings.llm,
        streaming=False,
        similarity_top_k = 5
)
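

The `similarity_top_k = 5` setting asks the retriever for the five chunks whose embeddings are closest to the query embedding. The sketch below illustrates that idea with cosine similarity on hypothetical 3-dimensional vectors; the real embeddings have many more dimensions, and the distance metric actually used by the vector store may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical toy embeddings, for illustration only
chunk_vecs = {
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.5, 0.5, 0.0],
}
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, chunk_vecs, k=2))
# ['chunk_a', 'chunk_c']
```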

In [51]:
user_query = "explain Homeostasis mentioned in the book"

In [52]:
response = query_engine.query(user_query)


In [53]:
display(Markdown(f"{response}"))


Homeostasis, as mentioned in the book, refers to the relatively stable internal environment that living organisms maintain to function effectively. It is a regulatory mechanism that ensures the constancy of various physiological parameters such as temperature, pH, and nutrient levels, despite changes in the external environment. For instance, organisms like polar bears have adapted to cold climates by developing thick fur and dense layers of fat to generate and retain heat, thus maintaining their body temperature. Similarly, humans and other mammals regulate their body temperature through processes like sweating or panting to shed excess heat in hot climates. These examples illustrate how homeostasis is crucial for survival, allowing organisms to adapt to their surroundings and maintain a steady state of internal conditions.




### 5. Future Scope

#### Challenge 1:
* The current solution uses only semantic similarity when retrieving chunks from the DB.
* We can instead use hybrid search, i.e. combine the semantic retriever with a keyword-based retriever such as BM25.
* Since this is a biology textbook, there may be situations where the user needs to search for a specific keyword, and a BM25-type retriever would help there.
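
As a rough illustration of the hybrid idea, the sketch below scores documents with a hand-rolled BM25 and merges that keyword ranking with a semantic ranking using reciprocal rank fusion. Everything here is an assumption for illustration: the three documents, the `semantic_ranking` list (which would normally come from the vector index), the BM25 parameters `k1` and `b`, and the choice of reciprocal rank fusion as the merging rule. In practice a library retriever would replace the hand-written scorer.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the BM25 formula (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for other in tokenized if term in other)  # document frequency
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge several rankings of document ids; ids ranked high in any list float to the top."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical mini corpus, for illustration only
docs = [
    "mitochondria produce atp for the cell",
    "the cell membrane controls what enters the cell",
    "plants capture light energy in chloroplasts",
]
scores = bm25_scores("mitochondria atp", docs)
keyword_ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

# ASSUMPTION: pretend the vector index returned this order for the same query
semantic_ranking = [2, 0, 1]

print(reciprocal_rank_fusion([keyword_ranking, semantic_ranking]))
# [0, 2, 1]
```

Document 0 wins because it ranks first for the exact keywords and second semantically, which is the behaviour hybrid search is meant to provide for precise biology terms.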

#### Challenge 2:
* Currently, the entire vector database is held in system memory.
* The recommended practice would be to persist the database locally or in the cloud so that the data ingestion pipeline does not have to be rerun every time.
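
One possible way to do this is to swap the in-memory Chroma client for a persistent one. The sketch below is a minimal outline under stated assumptions: `load_persistent_index` is a hypothetical helper, the `./chroma_db` path in the usage note is made up, and the collection must already have been populated once through the ingestion steps in section 3.

```python
def load_persistent_index(db_path: str, collection_name: str, embed_model):
    """Open (or create) an on-disk Chroma collection and wrap it in a LlamaIndex index."""
    # Imports are local so this sketch can be defined without the packages installed.
    import chromadb
    from llama_index.core import VectorStoreIndex
    from llama_index.vector_stores.chroma import ChromaVectorStore

    # PersistentClient stores the collection on disk, so it survives restarts
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(collection_name)
    vector_store = ChromaVectorStore(chroma_collection=collection)

    # Reuse the vectors already stored in the collection: no re-chunking or re-embedding
    return VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
```

With this in place, a later session could call something like `index = load_persistent_index("./chroma_db", "Data_DB_01", embed_model)` and go straight to querying.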

#### Challenge 3:
* Because of system constraints, I have used the small Phi-3.5-mini-instruct model.
* There are larger and more capable LLMs, such as the Llama family of models, which could give richer responses.

