# Part 1: Retrieval-Augmented Generation (RAG) Model for QA Bot

## Summary

This Colab notebook demonstrates the implementation of a Retrieval-Augmented Generation (RAG) model for a Question Answering (QA) bot designed for business use. The QA bot is capable of handling queries related to a provided document or dataset, retrieving relevant information, and generating coherent responses using a generative model (Cohere API).

The key components of the model include:

* Document Processing: Extracting text from the document and splitting it into manageable chunks.
* Vector Database: Using FAISS for efficient storage and retrieval of document embeddings.
* Query Handling: For each user query, the model retrieves the most relevant document chunks using FAISS and passes them to the Cohere API, which generates an answer grounded in that context.

**This notebook covers the entire pipeline: data loading, document embedding, query processing, retrieval, and answer generation. It includes several example queries that show which document segments the system retrieves and which answers it generates from them.**

Deliverables:

* An end-to-end demonstration of the RAG model pipeline.
* Explanation of the architecture, retrieval approach, and generative response mechanism.
* Examples showing the effectiveness of the model in answering queries based on the document content.







In [None]:
print("Hi this is a colab notebook for QA bot ")

Hi this is a colab notebook for QA bot 


# Step 1: Setup Environment

We'll use Python for this project in a Google Colab environment.

Packages Required:
* Sentence-Transformers: To embed document chunks and queries with a pre-trained model.
* Cohere: Cohere API for text generation.
* Faiss (alternative to Pinecone if not available): For vector similarity search.
* Streamlit or Gradio: For an interactive UI in Part 2.
* PDFPlumber: To handle PDF document processing.

  

In [None]:

!pip install faiss-cpu




# Step 2: Load and Process the Dataset
* First, we need to load and pre-process the document or dataset for which we are building the QA bot.
* Assuming it's a text-based document, extract the content from the document (here, a PDF).
* Tokenize or segment the document into chunks suitable for embedding generation.

## Extracting text from the PDF

In [None]:
!pip install pdfplumber



In [None]:
import pdfplumber

# Function to load and extract text from a PDF
def extract_text_from_pdf(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        text = ""
        for page in pdf.pages:
            # Guard against pages with no extractable text (extract_text may return None)
            text += page.extract_text() or ""
    return text

document_path = "/content/Gen AI Engineer _ Machine Learning Engineer Assignment.pdf"
document_text = extract_text_from_pdf(document_path)
print(document_text)  # comment out to hide the text extracted from the PDF

Gen AI Engineer / Machine Learning Engineer Assignment
Part 1: Retrieval-Augmented Generation (RAG) Model for QA Bot
Problem Statement:
Develop a Retrieval-Augmented Generation (RAG) model for a Question Answering (QA)
bot for a business. Use a vector database like Pinecone DB and a generative model like
Cohere API (or any other available alternative). The QA bot should be able to retrieve
relevant information from a dataset and generate coherent answers.
Task Requirements:
1. Implement a RAG-based model that can handle questions related to a provided
document or dataset.
2. Use a vector database (such as Pinecone) to store and retrieve document
embeddings efficiently.
3. Test the model with several queries and show how well it retrieves and generates
accurate answers from the document.
Deliverables:
● A Colab notebook demonstrating the entire pipeline, from data loading to question
answering.
● Documentation explaining the model architecture, approach to retrieval, and how
generative 

# Step 3: Create Embeddings Using a Pre-trained Model
* To handle document retrieval, we need to convert the document into embeddings using a SentenceTransformers model.
* These embeddings capture the semantic meaning of the text and are used for similarity search.

In [None]:
!pip install sentence-transformers



* **Load the SentenceTransformer Model:**
A pre-trained SentenceTransformer model (all-MiniLM-L6-v2) is loaded. It maps each piece of text to a 384-dimensional sentence embedding.
* **Split Document Text into Chunks:**
The document text is divided into 300-character chunks, with a new chunk starting every 512 characters. Because the step is larger than the chunk length, the 212 characters between consecutive chunks are not indexed; a step of 300 or less would cover the whole document.
* **Generate Embeddings:**
The SentenceTransformer model is used to encode each document chunk into a dense vector representation (embedding). These embeddings capture the semantic meaning of the text, allowing us to compare the similarity between different chunks and the user's query.
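
The chunking step can be sketched as a small helper. This is only an illustration: `chunk_text` and `coverage` are hypothetical names that do not appear in the notebook, and the sketch assumes character-based chunks defined by a chunk size and a step between chunk starts. It compares the notebook's setting (300-character chunks every 512 characters) with an overlapping alternative.

```python
def chunk_text(text, chunk_size=300, stride=512):
    """Return fixed-size character chunks, starting a new chunk every `stride` characters."""
    if chunk_size <= 0 or stride <= 0:
        raise ValueError("chunk_size and stride must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]


def coverage(text, chunk_size, stride):
    """Fraction of characters that fall inside at least one chunk."""
    if not text:
        return 1.0
    covered = set()
    for start in range(0, len(text), stride):
        covered.update(range(start, min(start + chunk_size, len(text))))
    return len(covered) / len(text)


sample = "x" * 1024  # stand-in for the extracted document text

# Notebook setting: stride larger than the chunk size leaves gaps.
gapped_chunks = chunk_text(sample, chunk_size=300, stride=512)
gapped_coverage = coverage(sample, chunk_size=300, stride=512)

# Alternative (an assumption, not the notebook's setting): stride smaller
# than the chunk size, so neighbouring chunks overlap by 50 characters.
overlapping_chunks = chunk_text(sample, chunk_size=300, stride=250)
overlapping_coverage = coverage(sample, chunk_size=300, stride=250)

print(len(gapped_chunks), gapped_coverage)
print(len(overlapping_chunks), overlapping_coverage)
```

With a step larger than the chunk size, part of the text is never embedded and cannot be retrieved; a step smaller than the chunk size produces more chunks but covers every character.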

In [None]:
from sentence_transformers import SentenceTransformer

# Load pre-trained Sentence Transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

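# Split document into chunks for embedding.
# Note: each chunk is 300 characters long, but a new chunk starts every 512 characters,
# so the 212 characters between consecutive chunks are never embedded.
document_chunks = [document_text[i:i + 300] for i in range(0, len(document_text), 512)]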
document_embeddings = model.encode(document_chunks)


# Step 4: Setup FAISS for Vector Storage



**FAISS (Facebook AI Similarity Search) is an efficient library used for similarity search and clustering of dense vectors. In this step, we utilize FAISS to store and index the document embeddings generated in the previous step.**




* Import the FAISS library (faiss) for similarity search.

* Define Embedding Dimension:
Set the embedding_dim variable to 384, which is the fixed dimension of the embeddings generated by the SentenceTransformer model.
* Build FAISS Index:
Create a FAISS index using IndexFlatL2. This index uses the L2 distance metric to measure similarity between vectors.
* Convert Embeddings to NumPy Array:
Convert the document embeddings into a NumPy array and ensure they are in float32 format, as required by FAISS.
* Add Embeddings to Index:
Add the NumPy array of document embeddings to the FAISS index. This allows FAISS to efficiently search for similar vectors.

**By indexing the embeddings, FAISS enables fast retrieval of relevant document chunks based on the similarity between the query embedding and the stored document embeddings.**
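
To make the index's behavior concrete, the exhaustive search that a flat L2 index performs can be approximated in plain NumPy. This is an illustrative stand-in for a small in-memory array, not FAISS's implementation, and `flat_l2_search` is a hypothetical helper. Like `IndexFlatL2`, it reports squared L2 distances, sorted from nearest to farthest.

```python
import numpy as np


def flat_l2_search(stored, queries, k):
    """Brute-force nearest-neighbour search using squared L2 distance.

    stored:  array of shape (n, d) holding the indexed vectors
    queries: array of shape (m, d) holding the query vectors
    Returns (distances, indices), each of shape (m, k).
    """
    stored = np.asarray(stored, dtype=np.float32)
    queries = np.asarray(queries, dtype=np.float32)
    # Pairwise differences between every query and every stored vector.
    diffs = queries[:, None, :] - stored[None, :, :]
    squared = np.sum(diffs ** 2, axis=2)
    # Keep the k smallest distances per query, nearest first.
    indices = np.argsort(squared, axis=1)[:, :k]
    distances = np.take_along_axis(squared, indices, axis=1)
    return distances, indices


stored_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
query_vectors = [[0.9, 0.0]]
distances, indices = flat_l2_search(stored_vectors, query_vectors, k=2)
print(indices, distances)
```

A flat index compares the query against every stored vector, which is exact and simple; that is a reasonable fit for a single document with a modest number of chunks.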

In [None]:
import faiss
import numpy as np
embedding_dim = 384  # SentenceTransformer embedding dimension
# Build FAISS index
index = faiss.IndexFlatL2(embedding_dim)  # L2 distance metric
faiss_embeddings = np.array(document_embeddings).astype(np.float32)
index.add(faiss_embeddings)

# Step 5: Query Processing and Retrieval
For the retrieval step, when a question is asked, we:

* Encode the query.
* Retrieve the top-k relevant document chunks based on the L2 distance between the query embedding and the document embeddings (the index built in Step 4 is an `IndexFlatL2`, so smaller distances mean closer matches).

## Extract Relevant Text to the Query (question) from Document Text
By searching for the nearest neighbors in the FAISS index, the code identifies the most relevant document chunks based on semantic similarity.



Update: *we increase the number of nearest neighbors to top_k = 5* to give the generative model more context.

* **Encode the Query:**
The user's query is first encoded into an embedding using the same SentenceTransformer model used to encode the document chunks. This ensures that the query and document chunks are represented in the same vector space.
* **Retrieve Relevant Chunks:**
The query embedding is used to search for the most similar document chunks within the FAISS index. The retrieve_relevant_chunks_faiss function performs this search, returning the top_k most relevant chunks and their corresponding distances to the query.
* **Rank and Return Results:**
FAISS returns the retrieved chunks sorted by increasing distance to the query embedding. The smaller the distance, the more relevant the chunk is to the query.
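
The distances reported by `IndexFlatL2` are squared L2 distances rather than cosine similarities. For unit-length vectors the two are related by `squared_distance = 2 - 2 * cosine`, so both give the same ranking. The sketch below checks this relation on random vectors; it assumes the embeddings are normalized, which is worth verifying for the embedding model in use (for example with `np.linalg.norm`), and `squared_l2_to_cosine` is a hypothetical helper.

```python
import numpy as np


def squared_l2_to_cosine(squared_distance):
    """Convert a squared L2 distance between unit vectors to a cosine similarity."""
    return 1.0 - squared_distance / 2.0


rng = np.random.default_rng(0)

# Two random vectors, scaled to unit length (the assumption this relation needs).
a = rng.normal(size=384)
b = rng.normal(size=384)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

squared_l2 = float(np.sum((a - b) ** 2))
cosine = float(a @ b)

print(squared_l2, cosine, squared_l2_to_cosine(squared_l2))
```

Under that assumption, a reported distance can be read as a similarity, which is sometimes easier to interpret when comparing retrieved chunks.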

In [None]:
# Function to retrieve relevant chunks using FAISS
def retrieve_relevant_chunks_faiss(query, model, index, document_embeddings, document_texts, top_k=2):
    # Step 1: Encode the query into an embedding
    query_embedding = model.encode([query])[0].astype(np.float32)  # Convert to float32 for FAISS compatibility

    # Step 2: Reshape the query embedding for FAISS (it should be 2D)
    query_embedding = query_embedding.reshape(1, -1)

    # Step 3: Perform the search using the FAISS index
    distances, indices = index.search(query_embedding, top_k)  # Search for the top_k nearest neighbors (2 by default)

    # Step 4: Retrieve the corresponding text chunks based on indices
    relevant_chunks = [document_texts[i] for i in indices[0]]  # Look up the text of each retrieved chunk

    return relevant_chunks, distances[0]



### Query 1
First, let's search for the 2 nearest neighbors of the query "What is the Problem Statement?"
*Direct relevance in document*

In [None]:

query = "What is the Problem Statement?"

# Use document_chunks as document_texts, since these are the actual text chunks
relevant_chunks, distances = retrieve_relevant_chunks_faiss(query, model, index, document_embeddings, document_chunks)

# Print the relevant text chunks and their distances
for i, chunk in enumerate(relevant_chunks):
    print(f"Chunk {i + 1}: {chunk} (Distance: {distances[i]})\n")

Chunk 1:  Provide several example queries and the corresponding outputs.Part 2: Interactive QA Bot Interface
Problem Statement:
Develop an interactive interface for the QA bot from Part 1, allowing users to input queries
and retrieve answers in real time. The interface should enable users to upload documents (Distance: 1.480650544166565)

Chunk 2: Gen AI Engineer / Machine Learning Engineer Assignment
Part 1: Retrieval-Augmented Generation (RAG) Model for QA Bot
Problem Statement:
Develop a Retrieval-Augmented Generation (RAG) model for a Question Answering (QA)
bot for a business. Use a vector database like Pinecone DB and a generative model (Distance: 1.5173135995864868)



### Query 2
How is GitHub useful?
*Indirect relevance in document*

In [None]:
# Example usage:
query = "How is GitHub useful?"

# Use document_chunks as document_texts, since these are the actual text chunks
relevant_chunks, distances = retrieve_relevant_chunks_faiss(query, model, index, document_embeddings, document_chunks)

# Print the relevant text chunks and their distances
for i, chunk in enumerate(relevant_chunks):
    print(f"Chunk {i + 1}: {chunk} (Distance: {distances[i]})\n")

Chunk 1: proach thoroughly, explaining your decisions, challenges faced,
and solutions.
3. Provide a detailed ReadMe file in your GitHub repository, including setup and usage
instructions.
4. Submissions should include:
○ Source code for both the notebook and the interface.
○ A fully functional Colab noteboo (Distance: 1.2285795211791992)

Chunk 2: ions, and view the bot's
responses.
● Example interactions demonstrating the bot's capabilities.
Guidelines:
● Use Docker to containerize the application for easy deployment.
● Ensure the system can handle large documents and multiple queries without
significant performance drops.
● Share the code,  (Distance: 1.3773610591888428)



**AFTER UPDATE**

**Changes Made:**
* `top_k = 5` instead of 2, so that more context is passed to the generative model
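
One thing to watch when raising `top_k`: if the index holds fewer than `top_k` vectors, FAISS pads the result with index `-1`, which Python list indexing would silently map to the last chunk. A minimal guard is sketched below, assuming the `indices` and `distances` rows returned by `index.search`; the helper name `select_valid_chunks` is hypothetical.

```python
def select_valid_chunks(indices, distances, document_texts):
    """Pair each retrieved chunk with its distance, skipping FAISS's -1 padding."""
    return [
        (document_texts[i], float(d))
        for i, d in zip(indices, distances)
        if i >= 0  # -1 marks a missing neighbour, not the last list element
    ]


# Example: a 3-chunk document searched with top_k = 5.
example_chunks = ["chunk A", "chunk B", "chunk C"]
example_indices = [2, 0, 1, -1, -1]
example_distances = [0.1, 0.5, 0.9, 3.4e38, 3.4e38]

valid = select_valid_chunks(example_indices, example_distances, example_chunks)
print(valid)
```

For documents with at least five chunks the guard changes nothing; it only matters for very short inputs.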


In [None]:
def retrieve_relevant_chunks_faiss(query, model, index, document_embeddings, document_texts, top_k=5):
    # Step 1: Encode the query into an embedding
    query_embedding = model.encode([query])[0].astype(np.float32)  # Convert to float32 for FAISS compatibility

    # Step 2: Reshape the query embedding for FAISS (it should be 2D)
    query_embedding = query_embedding.reshape(1, -1)

    # Step 3: Perform the search using the FAISS index
    distances, indices = index.search(query_embedding, top_k)  # Search for top_k nearest neighbors

    # Step 4: Retrieve the corresponding text chunks based on indices
    relevant_chunks = [document_texts[i] for i in indices[0]]  # Use document_texts instead of embeddings

    return relevant_chunks, distances[0]

# Step 6: Generate answer with Cohere API

In [None]:
!pip install cohere
import cohere
from google.colab import userdata
api_key = userdata.get('COHERE_API_KEY')  # read your Cohere API key from Colab secrets
cohere_client = cohere.Client(api_key)


* This step utilizes the Cohere API to generate a comprehensive answer based on the user's query and the retrieved relevant chunks.

A Cohere client is initialized using an API key. This key grants access to Cohere's language model, which will be used for answer generation.

**Generate answer**
* The generate_answer function takes the user's query and the relevant chunks as input.
* It constructs a prompt by combining the query and the relevant chunks, providing context for the language model.
* The Cohere API is called with this prompt, generating a natural language response based on the provided information.
* The generated answer is extracted from the API response and returned.
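
The prompt-construction step can be looked at in isolation. The sketch below is one possible layout, not the prompt used in this notebook: `build_prompt` is a hypothetical helper that numbers the retrieved chunks and asks the model to stay within the provided context. It only builds a string, so it runs without an API key.

```python
def build_prompt(query, relevant_chunks):
    """Combine the user's query and the retrieved chunks into a single prompt string."""
    # Number each chunk so the separate pieces of context stay distinguishable.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(relevant_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


example_prompt = build_prompt(
    "What is the Problem Statement?",
    ["Develop a RAG model for a QA bot. ", " Use a vector database."],
)
print(example_prompt)
```

Separating the chunks and stating the instruction first is a common pattern; whether it improves answers for a given document has to be checked by comparing outputs.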


In [None]:
# Generate answer based on relevant text
def generate_answer(query, relevant_chunks):
    # Directly join the relevant_chunks, as they are strings
    context = " ".join(relevant_chunks)
    response = cohere_client.generate(prompt=f"Answer the question: {query} using the document's relevant context:{context}")
    return response.generations[0].text

## Testing with the Sample Set Assignment PDF

In [None]:
query = "What is the Problem Statement?"
relevant_chunks, distances = retrieve_relevant_chunks_faiss(query, model, index, document_embeddings, document_chunks)
answer = generate_answer(query, relevant_chunks)
print("Your query : ", query)
print("Answer:", answer)

Your query :  What is the Problem Statement?
Answer:  The provided context outlines two distinct problem statements centered around developing AI engineering solutions. Part 1 of the assignment focuses on creating a Retrieval-Augmented Generation (RAG) model for a Question Answering (QA) bot. This entails harnessing a vector database like Pinecone DB and a generative model to construct a robust QA bot that adeptly retrieves and generates answers to user queries. 

Part 2, on the other hand, calls for the development of an interactive interface tailored for the QA bot conceived in Part 1. The interface must provide real-time query input and answer retrieval, enabling users to upload relevant documents to enhance the bot's performance.

These two parts collectively form a comprehensive assignment, targeting both the functional core (the RAG model) and the user-centric interface (interactive design) of the QA bot construction. 


## QA bot function for querying a document
To make testing easier, let's define a function that takes a query and prints the answer generated from the **document provided**.

In [None]:
def QAbot(query):
  relevant_chunks, distances = retrieve_relevant_chunks_faiss(query, model, index, document_embeddings, document_chunks)
  answer = generate_answer(query, relevant_chunks)
  print("Your query : ", query)
  print("Answer:", answer)
  return

## Testing QA bot for the document

In [None]:
QAbot("What is the purpose of the document?")

Your query :  What is the purpose of the document?
Answer:  The purpose of the document is to outline specifications for a project to develop a system that can efficiently integrate and process PDFs, store document embeddings, and provide real-time answers to user queries. The system should be able to handle multiple queries accurately and provide contextually relevant responses. The document also outlines specific steps for achieving the desired functionality and lists the expected deliverables. 


In [None]:
QAbot("What is Part 1 of the assignment and which proffesion is it intended towards?")

Your query :  What is Part 1 of the assignment and which proffesion is it intended towards?
Answer:  Part 1 of the assignment is to develop a retrieval-augmented generation (RAG) model for a question-answering (QA) bot for a business. It is intended towards the profession of a Gen AI Engineer or Machine Learning Engineer. 


In [None]:
QAbot("Which databases can be used ?")

Your query :  Which databases can be used ?
Answer:  Based on the provided information, the relevant context for the question "Which databases can be used ?" is primarily focused on document embedding storage and retrieval efficiencies. Here's how the different options are presented:

1. Using a standard, relational database (as mentioned in paragraph [3] ): This option is suitable for storing structured data but may not be the most efficient for storing and retrieving document embeddings. Standard databases are designed for structured queries and may not handle unstructured data like documents as effectively.

2. Using a vector database (such as Pinecone)**: This type of database is specifically designed to work with vector data, which makes it more efficient for storing and retrieving document embeddings. Vector databases can handle the mathematical operations required for working with vector data, providing faster performance and better compatibility with embedding representations o

In [None]:

QAbot("Is there a deadline?")


Your query :  Is there a deadline?
Answer:  The provided text does not contain any explicit reference to a deadline for the assignment. However, it is always a good idea to confirm deadlines with your instructor, as they may have provided specific due dates for various parts of the assignment or project. To receive accurate and up-to-date information regarding deadlines, please reach out to the appropriate individual, such as your teacher or professor, who can provide you with the exact deadline or any flexibility regarding the submission date. 


## "QA bot Doc" function for inputting other document PDFs and a query

In [None]:
def QAbotdoc(document_path, query):
  top_k = 5
  # Extract the text and split it into chunks (300 characters, one chunk every 512 characters)
  document_text = extract_text_from_pdf(document_path)
  document_chunks = [document_text[i:i + 300] for i in range(0, len(document_text), 512)]
  # Embed the chunks and build a fresh FAISS index for this document
  document_embeddings = model.encode(document_chunks)
  index = faiss.IndexFlatL2(embedding_dim)  # L2 distance metric
  faiss_embeddings = np.array(document_embeddings).astype(np.float32)
  index.add(faiss_embeddings)
  # Retrieve the top_k nearest chunks (query encoding and search happen inside the helper)
  relevant_chunks, distances = retrieve_relevant_chunks_faiss(query, model, index, document_embeddings, document_chunks, top_k=top_k)

  answer = generate_answer(query, relevant_chunks)
  print("Your Query : \n", query)
  print("Answer:\n", answer)
  return


# Step 7: Testing with other documents


## 1. Online program offer pdf

In [None]:
document_path_1 = "/content/OnlineProgramOffer.pdf"
query = "What is the purpose of the document?"
QAbotdoc(document_path_1, query)

Your Query : 
 What is the purpose of the document?
Answer:
  The purpose of the document is to communicate the program fee schedule for a Post Graduate Program in Artificial Intelligence and Machine Learning, as well as important details regarding additional costs, hardware requirements, and providing contact information for admission-related queries. 


In [None]:
query = "Important dates and deadlines?"
QAbotdoc(document_path_1, query)

Your Query : 
 Important dates and deadlines?
Answer:
  The answer is can be found in the provided document under the section B. Commencement Date:
The Post Graduate Program in Artificial Intelligence and Machine Learning: Business Applications will
commence in the month of September 2024. The commencement details and other login credentials will
be shared with all admitted candidates soon. In case there is a  eks before the Commencement Date
are eligible for a full refund of the amount paid in excess of the admission fee
2. Refund or dropout requests requested more than 2 weeks before the Commencement Date
are eligible for a 75% refund of the amount paid in excess of the admission fee
3. Refund or dropout answers requested within 2 weeks prior to the Commencement date forfeit all paid amounts. 


In [None]:
query = "I dont have a laptop, what should i do?"
QAbotdoc(document_path_1, query)

Your Query : 
 I dont have a laptop, what should i do?
Answer:
  You must have a laptop or desktop PC to participate in the program. You can borrow one from a friend or family member if you do not possess one. Contact the admission helpline if you need assistance or have any further questions. 


## 2. Resume pdf

In [None]:
document_path_2 = "/content/resume.pdf"


In [None]:
query = "Would this candidate be relevant for AI engineer in SampleSet?"
QAbotdoc(document_path_2, query)

Your Query : 
 Would this candidate be relevant for AI engineer in SampleSet?
Answer:
  Yes, this candidate would be relevant for an AI engineer role in SampleSet, given their experience in generative AI tools and Agile environments, as well as their technical skills and education in electronic communications. 


In [None]:
query = "How many years of experience does the candidate have?"
QAbotdoc(document_path_2, query)

Your Query : 
 How many years of experience does the candidate have?
Answer:
  Tejaswi Reddy has 1 year of experience in the industry. 


In [None]:
query = "What are the candidates educational qualifications?"
QAbotdoc(document_path_2, query)

Your Query : 
 What are the candidates educational qualifications?
Answer:
  The candidate whose resume is presented here completed a Bachelor of Engineering with a focus on Electronics and Communication at the Hyderabad campus of the BITS Pilani university in India in 2024. 


In [None]:
query = "What are the skills of the candidate?"
QAbotdoc(document_path_2, query)

Your Query : 
 What are the skills of the candidate?
Answer:
  Here is what the candidate's skills are, according to the resume:

- Technical knowledge of programming languages Python and Java
- Proficiency with Agile methodologies, specifically scaled Agile frameworks (as evidenced by courses completed)
- Knowledge of generative AI and prompt engineering
- Experience with designing and developing REST APIs
- Advanced writing skills, possibly specialized in AWS (Amazon Web Services)

It is worth noting that the candidate is also a USCitizen, however it is unclear whether this is relevant information regarding their skillset. 
Let me know if you would like me to clarify any of the skills listed, or rearrange the information in a more coherent manner.  I am always happy to help. 


# Returning the relevant chunks and the answer for a query

Output: the exact document text that the answer is generated from, followed by the answer to the query.


In [None]:
def process_document(document_path):
  """Processes a document to extract text, create chunks, and build a FAISS index."""
  document_text = extract_text_from_pdf(document_path)
  document_chunks = [document_text[i:i + 300] for i in range(0, len(document_text), 512)]
  document_embeddings = model.encode(document_chunks)
  index = faiss.IndexFlatL2(embedding_dim)
  faiss_embeddings = np.array(document_embeddings).astype(np.float32)
  index.add(faiss_embeddings)
  return document_chunks, document_embeddings, index

In [None]:
def QAbotdoc(document_path, query):
    """Processes a query by retrieving relevant document chunks and generating an answer."""
    # Extract, chunk, embed, and index the document (repeated on every call)
    document_chunks, document_embeddings, index = process_document(document_path)

    # Retrieve the 5 most relevant document chunks from the FAISS index
    relevant_chunks, distances = retrieve_relevant_chunks_faiss(
        query, model, index, document_embeddings, document_chunks, top_k=5
    )

    # Generate the answer
    answer = generate_answer(query, relevant_chunks)
    print("Answer extracted from document text : ", relevant_chunks)

    print("Answer:\n", answer)
    return  # relevant_chunks, answer

## The function now outputs the answer as well as the relevant chunks of text from the document.

In [None]:
docpath = '/content/OnlineProgramOffer.pdf'
query="what is the program fee schedule?"
QAbotdoc(docpath,query)

Answer extracted from document text :  [' Fee 05-Sep-2024 USD 800\n1st Installment 07-Oct-2024 USD 1100\n2nd Installment 07-Nov-2024 USD 1100\n3rd Installment 07-Dec-2024 USD 1200\nTotal USD 4200\nNote: You are entitled to a discount of USD 500 (Scholarship) and USD 200 (One\nTime Full Payment). This will be adjusted against the appropriate inst', 'entioned fee schedule will lead to disqualification from the program.\nD. Cancellation Policy:\nPlease note that submitting the admission fee does constitute enrolling in the program and the below\ncancellation penalties will be applied.\n1. Full refund can only be issued within 48 hours of enrollment.\n', '\nhand book.\nBy accepting this offer, you agree to our Terms of Use and Privacy Policy\nDelivered in Collaboration with:Post Graduate Program in Artificial Intelligence and Machine\nLearning:\nBusiness Applications\nAnnexure 2\nProgram Fee Schedule\nThe program fee for candidates pursuing Post Graduate Pro', ' for a refund.\nCancellation

In [None]:
docpath = '/content/Gen AI Engineer _ Machine Learning Engineer Assignment.pdf'
query="What is Part 1 ?"
QAbotdoc(docpath,query)


Answer extracted from document text :  ['tegrate the backend from Part 1 to process the PDF, store document embeddings,\nand provide real-time answers to user queries.\n3. Ensure that the system can handle multiple queries efficiently and provide accurate,\ncontextually relevant responses.\n4. Allow users to see the retrieved document segments', 'proach thoroughly, explaining your decisions, challenges faced,\nand solutions.\n3. Provide a detailed ReadMe file in your GitHub repository, including setup and usage\ninstructions.\n4. Submissions should include:\n○ Source code for both the notebook and the interface.\n○ A fully functional Colab noteboo', 'Gen AI Engineer / Machine Learning Engineer Assignment\nPart 1: Retrieval-Augmented Generation (RAG) Model for QA Bot\nProblem Statement:\nDevelop a Retrieval-Augmented Generation (RAG) model for a Question Answering (QA)\nbot for a business. Use a vector database like Pinecone DB and a generative model', ' Provide several example queries 

# Part 2: The user interface for the QA bot is in https://colab.research.google.com/drive/1NrnVZIBMROlVMVbGLanN2YxUabyjsYTd#