# RAG application built on Gemini 

<img src="app.png">

## Project Overview

This project uses Retrieval-Augmented Generation (RAG) to improve the responses of a large language model (LLM) by supplying it with knowledge from external databases or documents. It combines two steps: retrieval (finding relevant information) and generation (producing coherent, context-aware output).

The goal is a system that looks up external information at query time and uses it to give more accurate, better-grounded answers across a range of natural language processing tasks.

## Key Objectives:
1. Integration of Retrieval and Generation:
The core objective is to integrate a retrieval layer, backed by a vector store such as ChromaDB, with a language model to provide high-quality, context-aware outputs. The retrieval step fetches relevant passages from a knowledge base or set of documents, and the generation step uses them to compose the response.

2. Knowledge Base Integration:
A key part of the project is the use of a knowledge base or vector database (e.g., ChromaDB) to store and retrieve relevant embeddings or information. By utilizing semantic search, the system can find the most relevant information for any given query.

3. Enhanced Language Model Performance:
By augmenting a general-purpose language model (Gemini in this project) with external knowledge through the retrieval step, the model can produce more informed and accurate answers, especially for specific or domain-related queries.

## Components of the Project:
1. Retrieval Mechanism:
The retrieval phase searches a database of embeddings (e.g., ChromaDB) built from preprocessed text. It identifies the passages most relevant to the query so that the language model has the right context for its response.

2. Generation Mechanism:
The generation phase uses an LLM (Gemini in this project) to compose a coherent and accurate response from the retrieved information. The model draws on both its internal knowledge and the retrieved context.

3. Embedding and Vector Storage:
Embeddings (vector representations of data) are stored in a vector database such as ChromaDB, which supports fast similarity comparison between a query and the stored chunks so that the most relevant ones can be retrieved.

# ChromaDB

ChromaDB is a vector database designed for storing and retrieving embeddings efficiently. Embeddings are dense vector representations of data (e.g., text, images, or other data types) that are generated by machine learning models. These embeddings allow for fast similarity search and retrieval, which is crucial for tasks like information retrieval or retrieval-augmented generation (RAG), where you need to find relevant information from large datasets or knowledge bases.

Key Features:

1. Efficient Storage: It stores embeddings and associated metadata (e.g., text, documents, etc.).
2. Similarity Search: It supports high-performance similarity search for vector-based queries, allowing you to find the closest matches to a given query.
3. Scalability: ChromaDB is optimized for scalability and can handle large-scale datasets efficiently.
4. Integration: It can easily integrate with machine learning pipelines, such as those used for language models and RAG, where it retrieves relevant information to generate contextually appropriate responses.
   
Use Case in RAG: ChromaDB stores and retrieves relevant information (text data and its embeddings) that can be fed into an LLM for more accurate and context-aware generation.
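
To make the store-and-search idea concrete, here is a minimal in-memory sketch in plain Python. It is not ChromaDB's implementation or API: the `ToyVectorStore` class, its `add` and `query` methods, and the three-dimensional sample vectors are all hypothetical, and cosine similarity is only one common choice of metric (the metric a real collection uses depends on how it is configured).

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class ToyVectorStore:
    """Hypothetical in-memory stand-in for a vector database."""
    records: list = field(default_factory=list)

    def add(self, embedding, text, metadata=None):
        # Keep the vector together with its text and metadata.
        self.records.append((embedding, text, metadata or {}))

    def query(self, embedding, k=2):
        """Return the k (score, text, metadata) tuples most similar to the query."""
        scored = [(cosine_similarity(embedding, e), t, m) for e, t, m in self.records]
        scored.sort(key=lambda record: record[0], reverse=True)
        return scored[:k]


# Made-up vectors; real embeddings have hundreds of dimensions.
store = ToyVectorStore()
store.add([1.0, 0.0, 0.0], "fuzzy inference basics", {"page": 1})
store.add([0.0, 1.0, 0.0], "IoT sensor layer", {"page": 2})
store.add([0.9, 0.1, 0.0], "membership functions", {"page": 3})

top = store.query([1.0, 0.0, 0.0], k=2)
print([text for _, text, _ in top])
```

A real vector database adds persistence and approximate nearest-neighbour indexing so that the search does not have to scan every record as this sketch does.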

# LangChain
LangChain is an open-source framework designed to simplify the development of applications that leverage large language models (LLMs) and external tools. It provides an abstraction layer to easily chain together different components, such as models, retrievers, and output processors, making it easier to create complex workflows involving LLMs.

Key Features:

1. Model Abstraction: LangChain provides a common interface to language models from different providers (e.g., GPT, Gemini), which makes it easy to swap one model for another.
2. Multi-Step Workflows: It allows for chaining multiple actions, such as using LLMs for generation, interacting with external APIs, running custom logic, and so on. This is particularly useful for building end-to-end applications that require multiple steps beyond simple model inference.
3. Retrieval-Augmented Generation (RAG): LangChain simplifies the integration of retrieval mechanisms, like ChromaDB, into your LLM-based workflows. It can retrieve documents or context before running the model to generate a more accurate response.
4. Tool Integration: LangChain supports integration with other tools such as databases, APIs, and data processing tools (e.g., Pandas, SQL) to enhance model outputs.

Use Case in RAG: LangChain makes it easy to build retrieval-augmented generation workflows by chaining together LLMs, document retrieval tools (like ChromaDB), and custom logic to ensure the LLM has access to external data when generating responses.
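
The chaining idea can be illustrated without LangChain itself. The sketch below uses a hypothetical `Step` class that composes plain functions with `|`; it is not LangChain's implementation, and `retrieve`, `build_prompt`, and `fake_llm` are stubs standing in for a retriever, a prompt template, and a model call.

```python
class Step:
    """Hypothetical wrapper that lets plain functions be chained with `|`."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # Run this step first, then feed its result to the next step.
        return Step(lambda value: other.fn(self.fn(value)))

    def invoke(self, value):
        return self.fn(value)


# Stub components; a real application would call a retriever and an LLM here.
def retrieve(question):
    corpus = {"fis": "FIS stands for Fuzzy Inference System."}
    context = [text for key, text in corpus.items() if key in question.lower()]
    return {"question": question, "context": context}


def build_prompt(state):
    context = "\n".join(state["context"])
    return "Context:\n" + context + "\n\nQuestion: " + state["question"]


def fake_llm(prompt):
    return f"[model reply to {len(prompt)} prompt characters]"


chain = Step(retrieve) | Step(build_prompt) | Step(fake_llm)
print(chain.invoke("What is FIS?"))
```

Because each step only depends on the output of the previous one, any of the three stubs can be replaced (for example, a different retriever) without touching the others.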

## Loading the Document

In [1]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("Research Paper.pdf")
data = loader.load()  # returns a list with one Document per PDF page
#data

In [2]:
len(data)

16

### Split long documents into smaller, more manageable chunks that an LLM can process.

##### chunk_size=1000: Each chunk will have a maximum of 1000 characters.

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# split data
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs = text_splitter.split_documents(data)


print("Total number of documents: ",len(docs))

Total number of documents:  50
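
To make the splitting step concrete, here is a simplified sketch of the idea behind a recursive character splitter: try a coarse separator first (paragraph breaks), and fall back to finer ones (line breaks, spaces) only for pieces that are still too long. The function `split_text` and the sample text are hypothetical, and the sketch leaves out features of the real `RecursiveCharacterTextSplitter`, such as overlap between chunks (`chunk_overlap`).

```python
def split_text(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters.

    Simplified sketch: coarse separators are tried first, finer ones only
    for pieces that are still too long. No overlap between chunks.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: hard-cut at chunk_size.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(split_text(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Fuzzy logic handles uncertainty.\n\n"
    "IoT sensors collect patient data such as fever and cough."
)
chunks = split_text(sample, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The first paragraph fits within the limit and stays whole, while the second is broken at word boundaries rather than mid-word.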


In [4]:
docs[7]

Document(metadata={'source': 'Research Paper.pdf', 'page': 2}, page_content='and European countries. COVID-19 is a broad community of viruses containing a nucleus of\ngenetic material surrounded by protein spikes envelope that gives crown appearance [3]. Airborne\nillness might spread from common cold to pneumonia, and symptoms tend to be mild in most\nindividuals.\nSevere Acute Respiratory Syndrome, first described by China in 2003, and Middle East\nRespiratory Syndrome, first by Saudi Arabia in 2012, are coronavirus forms that cause serious\nillness [4]. The COVID-19 has a near resemblance to the bat coronaviruses and is known as\nbats are the primary source. However, the origins of COVID-19 remain under investigation.\nRecent evidence indicates that the transmission has spread to humans from illicit wildlife sold\nin the Hunan Seafood Wholesale Market. The first time COVID-19 was reported in China in\n2019 and initially occurred in a group of pneumonia associated people in the city 

In [15]:
docs[48]

Document(metadata={'source': 'Research Paper.pdf', 'page': 15}, page_content='CMC, 2022, vol.70, no.3 5319\n[26] H. Thakkar, V. Shah, H. Yagnik and M. Shah “Comparative anatomisation of data mining and fuzzy\nlogic techniques used in diabetes prognosis,”ClinicalEHealth, vol. 4, pp. 12–23, 2020.\n[27] H. N. K. Al-Behadili and K. R. Ku-Mahamud, “Fuzzy unordered rule using greedy hill climb-\ning feature selection method: An application to diabetes classification,” Journal of Information and\nCommunicationTechnology, vol. 20, pp. 391–422, 2021.\n[28] N. Chandgude and S. Pawar, “Diagnosis of diabetes using fuzzy inference system,” in2016Int.Conf.\nonComputingCommunicationControlandAutomation(ICCUBEA) , India, pp. 1–6, 2016.\n[29] W. M. Shaban, A. H. Rabie, A. I. Saleh and M. Abo-Elsoud, “Detecting COVID-19 patients based on\nfuzzy inference engine and deep neural network,”AppliedSoftComputing , vol. 99, pp. 106906, 2021.\n[30] H. Garg and D. Rani, “New generalised Bonferroni mean aggregati

##### Chroma: A vector storage system to persist embeddings for efficient search and retrieval.
##### GoogleGenerativeAIEmbeddings: A class to generate embeddings using Google's generative AI models.

In [5]:
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings

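from dotenv import load_dotenv
load_dotenv()  # loads GOOGLE_API_KEY from the .env file into the environment

# Get an API key:
# Go to https://ai.google.dev/gemini-api/docs/api-key to generate a Google AI API key,
# then add it to the .env file as GOOGLE_API_KEY=<your key>.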

# Embedding models: https://python.langchain.com/v0.1/docs/integrations/text_embedding/

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vector = embeddings.embed_query("hello, world!")
vector[:5]
#vector


[0.05168594419956207,
 -0.030764883384108543,
 -0.03062233328819275,
 -0.02802734263241291,
 0.01813093200325966]

In [6]:
# Reuse the embeddings object created above instead of constructing a second one
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings)

In [16]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 10})
retrieved_docs = retriever.invoke("What is new in FIS?")

In [17]:
len(retrieved_docs)

10

In [18]:
print(retrieved_docs[5].page_content)

CMC, 2022, vol.70, no.3 5309
Figure 1:IoT enabled smart monitoring of diseases empowered with FIS
This equation alters the membership functions of fuzzy sets of Fever, Cough, Respiratory
Rate, Headache, Sore Throat, Flu, Blood Pressure, and Diarrhea. We have defined some common
methods which are used within the context of disease diagnosis. We have two phases for the
disease diagnosis first phase is the training phase, and the second phase is the validation phase.
We formulated the training phase as taking different diseases as input and sending it to the IoT
layer. The raw data has been sent to the processing layer, where the outliers are removed from the
data. Missing values are filled with mean, mod and average values. After the processing layer, the
preprocessed data has been sent to the application layer; this layer comprises the prediction and
performance layers. First, the cleaned data has been passed through the prediction layer. After the


In [10]:
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", temperature=0.3, max_tokens=500)

In [11]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
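
The `{context}` placeholder in the system prompt is where the retrieved chunks end up. The sketch below mimics that "stuffing" step with plain string formatting. The helper name `stuff_context`, the separator, the shortened prompt, and the two sample chunk strings are illustrative assumptions, not LangChain internals.

```python
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question."
    "\n\n"
    "{context}"
)


def stuff_context(template, chunks, separator="\n\n"):
    """Join the retrieved chunk texts and substitute them for {context}."""
    return template.format(context=separator.join(chunks))


# Placeholder chunk texts, made up for illustration.
retrieved_chunks = [
    "The sensory layer collects input parameters.",
    "The preprocessing layer removes outliers.",
]

filled = stuff_context(system_prompt, retrieved_chunks)
print(filled)
```

Since every retrieved chunk is pasted into a single prompt, the number of chunks (`k`) and the chunk size together determine how much of the model's context window the prompt consumes.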

In [12]:
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

In [13]:
response = rag_chain.invoke({"input": "Can you tell me about the architecture of this Fuzzy Inference System?"})
print(response["answer"])

The Fuzzy Inference System (FIS) architecture consists of a sensory layer collecting input parameters (fever, cough, etc.), a preprocessing layer handling noise and null values, and an application layer.  The application layer further divides into prediction and performance layers.  The FIS within the prediction layer uses fuzzy logic to diagnose diseases based on the input parameters.
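
Besides `answer`, the dictionary returned by a chain built with `create_retrieval_chain` also carries the retrieved documents under `context`, which is useful for checking where an answer came from. The sketch below shows one way to list the pages behind an answer. `FakeDocument`, `cited_pages`, and `fake_response` are hypothetical stand-ins so the example runs without an API key.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def cited_pages(response):
    """Return the sorted, de-duplicated page values of the retrieved chunks."""
    pages = {doc.metadata.get("page") for doc in response["context"]}
    return sorted(p for p in pages if p is not None)


# Made-up response in the same shape: "input", "context", "answer".
fake_response = {
    "input": "What layers does the system have?",
    "context": [
        FakeDocument("placeholder chunk A", {"source": "Research Paper.pdf", "page": 4}),
        FakeDocument("placeholder chunk B", {"source": "Research Paper.pdf", "page": 2}),
        FakeDocument("placeholder chunk C", {"source": "Research Paper.pdf", "page": 4}),
    ],
    "answer": "placeholder answer",
}

print(cited_pages(fake_response))  # [2, 4]
```

With the real chain, `cited_pages(response)` would work the same way on the `response` above. The `page` values appear to be zero-based here, since the 16-page PDF yields a chunk with `'page': 15`.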



In [14]:
response = rag_chain.invoke({"input": "For which type of diseases this Fuzzy Inference System is developed?"})
print(response["answer"])

This Fuzzy Inference System (FIS) is designed to diagnose various diseases, including COVID-19, Typhoid, Malaria, and Pneumonia.  It also has been used in other studies for diagnosing sepsis, heart disease, diabetes, and cholera.  The system uses symptoms and other patient attributes as input and predicts the likelihood of different diseases.



In [21]:
response = rag_chain.invoke({"input": "Is there any author with surname of Muhammad?"})
print(response["answer"])

Yes, there are two authors with Muhammad in their name.  Hafiz Muhammad Ehtisham Raza and Muhammad Idrees are listed as authors of the article "Disease Diagnosis System Using IoT Empowered with Fuzzy Inference System".  This article can be found in *Computers, Materials & Continua*.



In [23]:
response = rag_chain.invoke({"input": "In which journal is this paper published?"})
print(response["answer"])

This article, "Disease Diagnosis System Using IoT Empowered with Fuzzy Inference System," is published in Computers, Materials & Continua (CMC).  The provided text specifies "CMC, 2022, vol.70, no.3" in multiple locations.  Additionally, the DOI is provided as "10.32604/cmc.2022.020344".

