<a href="https://colab.research.google.com/github/wajihh/genai_projects/blob/main/rag_bot_pdf_qdrant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A QUESTION-ANSWERING BOT IMPLEMENTATION USING RAG FOR PDF DOCUMENTS

This notebook implements a RAG question-answering bot that runs on Google Colab. It loads API keys from a .env file, reads a PDF from a Google Drive folder, stores the embeddings and a local Qdrant database on Google Drive, and provides a simple interactive loop for querying the document.

## How the Code Works:

### Google Drive Integration:

The code mounts your Google Drive to access and store files such as the .env file, the PDF, and the embeddings.

### Loading the API Key:

The API key is loaded from a .env file stored in Google Drive rather than being hard-coded in the notebook.

### PDF Chunking:

The PDF text is split into fixed-size chunks (2,000 characters by default) so that each chunk stays within OpenAI's token limits during embedding and querying.

### Embedding Generation:

An embedding is generated for each chunk with the text-embedding-ada-002 model. The embeddings are saved to Google Drive as a .npy file (for persistence) and loaded into Qdrant (for retrieval).

### Bot Interaction:

The bot interaction at the end lets users ask questions about the document's content. The bot retrieves the most relevant chunks and generates an answer using GPT-4.

### Storage Paths:

Example paths (the cells below use the author's own folders):

- PDF File: /content/drive/MyDrive/your_folder/your_pdf_file.pdf
- Embeddings File: /content/drive/MyDrive/your_folder/pdf_embeddings.npy
- Qdrant Database: /content/drive/MyDrive/your_folder/qdrant.db

This setup lets you handle large PDFs, retrieve content efficiently, and interact with a bot that answers queries based on the document.


## Setup:

- Load .env file: We'll use python-dotenv to load the API key from a .env file on Google Drive.
- Google Drive integration: We'll mount Google Drive to access and store the PDF and embeddings.
- Qdrant storage: The embeddings and the Qdrant database will be stored on Google Drive for persistence.
- Bot interaction: A simple chatbot will take user questions and retrieve relevant chunks from the PDF.

## Step 1: Install necessary libraries

In [2]:
# Install required libraries
!pip install openai==0.28 PyPDF2 qdrant-client python-dotenv

from google.colab import drive
import openai
import PyPDF2
import os
import qdrant_client
import numpy as np
from dotenv import load_dotenv

# Mount Google Drive to access and store files
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Step 2: Load the API Key from the .env File Stored in Google Drive:

In [3]:
# Specify the path to your .env file in Google Drive
env_path = '/content/drive/MyDrive/Colab Notebooks/.env'

# Load the API key from the .env file
load_dotenv(env_path)
openai.api_key = os.getenv("OPENAI_API_KEY")
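
A missing or misnamed variable in the .env file leaves `openai.api_key` as `None`, and the failure only surfaces at the first API call. A minimal sketch of an early check, assuming a hypothetical helper named `require_env` (it is not part of python-dotenv or of the original notebook):

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; check the path passed to load_dotenv "
            "and the variable name in the .env file"
        )
    return value

# Demonstration with a placeholder variable (hypothetical name and value)
os.environ["DEMO_API_KEY"] = "sk-placeholder"
print(require_env("DEMO_API_KEY"))
```

In this notebook the equivalent call would be `openai.api_key = require_env("OPENAI_API_KEY")`, placed after `load_dotenv(env_path)`.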


## Step 3: Load the PDF from Google Drive and split it into chunks

In [4]:
# Path to the PDF file in your Google Drive
pdf_path = '/content/drive/MyDrive/2024_Advance_AI/Data/The_Sublime_Quran.pdf'

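# Function to read the PDF and split its text into fixed-size character chunks
def extract_pdf_text(file_path, chunk_size=2000):
    with open(file_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        # extract_text() can yield nothing for pages without a text layer,
        # so fall back to an empty string
        page_texts = [page.extract_text() or "" for page in reader.pages]
    # Join pages with a newline so words at page boundaries do not run together
    text = "\n".join(page_texts)

    # Split text into chunks of at most chunk_size characters
    text_chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    return text_chunks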

# Extract and chunk the PDF content
text_chunks = extract_pdf_text(pdf_path)
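
Fixed-size slicing can cut a sentence or verse in half at a chunk boundary, so that neither chunk holds the full passage. One common variation, not used in the original notebook, is to let consecutive chunks overlap. A minimal sketch, with `chunk_with_overlap` and its `overlap` parameter as hypothetical names:

```python
def chunk_with_overlap(text, chunk_size=2000, overlap=200):
    """Split text into chunks of at most chunk_size characters, where each
    chunk starts with the last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks

# Toy example with a tiny chunk size so the overlap is visible
demo = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(demo)
```

With `overlap=0` this reduces to the plain slicing used in `extract_pdf_text`. Overlap increases the number of chunks, and therefore the number of embeddings to compute and store.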


## Step 4: Generate embeddings and store them in Google Drive

In [8]:
# This notebook uses the legacy openai 0.28 interface installed in Step 1;
# uncomment to reinstall that version if needed
#!pip install openai==0.28

import openai
import numpy as np

# Function to generate embeddings for text chunks
def generate_embeddings(text_chunks):
    embeddings = []

    # Iterate through text chunks and generate embeddings
    for chunk in text_chunks:
        response = openai.Embedding.create(  # Legacy openai==0.28 embeddings call
            model="text-embedding-ada-002",
            input=chunk
        )
        # Extract the embedding directly from the response object
        embedding = response.data[0].embedding  # Access embedding using attribute syntax
        embeddings.append(embedding)

    return embeddings

# Generate embeddings for the text chunks from the PDF
embeddings = generate_embeddings(text_chunks)

# Save embeddings in Google Drive
np.save('/content/drive/MyDrive/2024_Advance_AI/Data/pdf_embeddings.npy', embeddings)


In [9]:
np.save('/content/drive/MyDrive/2024_Advance_AI/Data/text_chunks.npy', text_chunks)
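
The loop in `generate_embeddings` sends one API request per chunk. Grouping chunks into batches reduces the number of requests. A minimal sketch of the grouping logic, assuming a hypothetical `embed_batch` callable that takes a list of texts and returns one vector per text in the same order; a stand-in function is used here so the sketch runs without an API key:

```python
def embed_in_batches(text_chunks, embed_batch, batch_size=100):
    """Embed chunks in groups of batch_size, preserving the original order."""
    embeddings = []
    for start in range(0, len(text_chunks), batch_size):
        batch = list(text_chunks[start:start + batch_size])
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_batch must return one vector per input text")
        embeddings.extend(vectors)
    return embeddings

# Stand-in embedder for demonstration only: one 1-dimensional vector per text
def fake_embed_batch(texts):
    return [[float(len(t))] for t in texts]

print(embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2))
```

In the notebook, `embed_batch` could wrap `openai.Embedding.create` with a list passed as `input`. That wiring is an assumption to verify against the installed library version, including that the returned items are read back in input order.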

## Step 5: Retrieve Embeddings

In [10]:
# Load embeddings and text chunks from Google Drive
embeddings = np.load('/content/drive/MyDrive/2024_Advance_AI/Data/pdf_embeddings.npy', allow_pickle=True)
text_chunks = np.load('/content/drive/MyDrive/2024_Advance_AI/Data/text_chunks.npy', allow_pickle=True)
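
Because the embeddings and the chunks are saved in separate files, it is worth confirming after loading that they still line up. A minimal sketch, with `check_alignment` as a hypothetical helper; it accepts any sequences that support `len()` and indexing, which includes the NumPy arrays loaded above:

```python
def check_alignment(embeddings, text_chunks):
    """Verify there is one embedding per chunk and that all embeddings
    share one dimension. Returns that dimension."""
    if len(embeddings) != len(text_chunks):
        raise ValueError(
            f"{len(embeddings)} embeddings but {len(text_chunks)} text chunks"
        )
    if len(embeddings) == 0:
        raise ValueError("no embeddings loaded")
    dim = len(embeddings[0])
    for i, vector in enumerate(embeddings):
        if len(vector) != dim:
            raise ValueError(f"embedding {i} has length {len(vector)}, expected {dim}")
    return dim

# Toy data for demonstration
print(check_alignment([[0.1, 0.2], [0.3, 0.4]], ["first chunk", "second chunk"]))
```

For text-embedding-ada-002 the returned dimension should be 1536.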


## Step 6: Qdrant Setup and Store Embeddings:

In [15]:
from qdrant_client import QdrantClient, models
from qdrant_client.models import PointStruct

# Set up a local (on-disk) Qdrant client whose data lives in Google Drive
qdrant_path = '/content/drive/MyDrive/2024_Advance_AI/qdrant2.db'
qdrant = QdrantClient(path=qdrant_path)

# Store embeddings in Qdrant
def store_embeddings_in_qdrant(embeddings, text_chunks):
    # Start from an empty collection. recreate_collection is deprecated in
    # recent qdrant-client releases, so delete and create explicitly.
    if qdrant.collection_exists("pdf_chunks"):
        qdrant.delete_collection("pdf_chunks")
    qdrant.create_collection(
        collection_name="pdf_chunks",
        vectors_config=models.VectorParams(size=len(embeddings[0]), distance=models.Distance.COSINE)
    )

    for idx, (embedding, text) in enumerate(zip(embeddings, text_chunks)):
        qdrant.upsert(
            collection_name="pdf_chunks",
            # Convert NumPy values to plain Python types before storing them
            points=[PointStruct(id=idx, vector=embedding.tolist(), payload={"text": str(text)})]
        )

store_embeddings_in_qdrant(embeddings, text_chunks)
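
The collection is configured with cosine distance, so search ranks chunks by the angle between the query vector and each stored vector rather than by vector length. The sketch below illustrates that ranking in plain Python on toy vectors. It is not how Qdrant implements search internally, and `cosine_similarity` and `top_k_by_cosine` are illustrative names:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_by_cosine(query, vectors, k=5):
    """Return (index, score) pairs for the k vectors most similar to query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-dimensional vectors; the third points in the same direction as the query
vectors = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
ranked = top_k_by_cosine([1.0, 1.0], vectors, k=2)
print(ranked)
```

The vector `[3.0, 3.0]` ranks first even though it is three times longer than the query, because only direction matters for cosine similarity.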



## Step 7: Retrieve Relevant Chunks and Answer Queries Using GPT-4:

In [16]:
# Function to retrieve relevant chunks based on the user's query
def retrieve_relevant_chunks(query, top_k=5):
    query_embedding = openai.Embedding.create(
        input=query,
        model="text-embedding-ada-002"
    )['data'][0]['embedding']

    search_result = qdrant.search(
        collection_name="pdf_chunks",
        query_vector=query_embedding,
        limit=top_k
    )

    return [result.payload['text'] for result in search_result]

# Function to generate answers using GPT-4
def generate_answer(retrieved_chunks, query):
    context = "\n\n".join(retrieved_chunks)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Context:\n\n{context}\n\nQuestion: {query}"}
        ],
        temperature=0.2,
        max_tokens=500
    )
    return response['choices'][0]['message']['content']
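
The system prompt in `generate_answer` does not tell the model to stay within the retrieved text, so answers can draw on the model's general knowledge instead of the PDF. A minimal sketch of a stricter prompt builder, assuming a hypothetical `build_messages` helper and a hypothetical `max_context_chars` budget (neither is in the original notebook); the instruction wording is only one possible choice:

```python
def build_messages(retrieved_chunks, query, max_context_chars=8000):
    """Build chat messages that ask the model to answer only from the given context."""
    context_parts = []
    used = 0
    for chunk in retrieved_chunks:
        chunk = str(chunk)
        if used + len(chunk) > max_context_chars:
            break  # stop before exceeding the character budget
        context_parts.append(chunk)
        used += len(chunk)
    context = "\n\n".join(context_parts)
    system = (
        "Answer using only the provided context. "
        "If the context does not contain the answer, say that you do not know."
    )
    user = f"Context:\n\n{context}\n\nQuestion: {query}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_messages(["chunk one", "chunk two"], "What is in chunk one?")
print(messages[1]["content"])
```

The returned list has the same shape as the `messages` argument passed to `openai.ChatCompletion.create` above. Because search results arrive ordered by relevance, the character budget drops the least relevant chunks first.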


## Step 8: Bot Interaction for Querying

In [19]:
# Simple bot interaction to answer questions from the PDF
def ask_bot():
    while True:
        user_input = input("Ask me a question (or type 'exit' to quit): ").strip()
        if user_input.lower() == 'exit':
            break
        if not user_input:
            continue  # Skip empty input instead of sending an empty query

        retrieved_chunks = retrieve_relevant_chunks(user_input, top_k=5)
        answer = generate_answer(retrieved_chunks, user_input)
        print(f"Answer: {answer}\n")

# Run the bot
ask_bot()


Ask me a question (or type 'exit' to quit): what does quran say about jihad?
Answer: The term "Jihad" in the Quran is often misunderstood. It is derived from the Arabic root "Jahada" which means to strive or struggle. In the Quran, Jihad is used in different contexts, primarily referring to the struggle in the path of God. It can be a personal struggle to maintain faith, to improve one's moral character, or to achieve righteous goals. 

In a broader social context, Jihad can refer to the struggle for social justice, to fight against oppression, or to spread the message of Islam. It can also refer to a physical struggle or warfare in self-defense or against aggression, but only under strict conditions and as a last resort. It's important to note that the Quran strongly emphasizes that any form of aggression or violence must be proportional, and non-combatants should not be harmed.

The Quran also highlights the importance of peace, mercy, and forgiveness. For instance, in Quran 2:190, i