# **Retrieval-Augmented Generation (RAG) Model for QA Bot**

# **Installing Required Libraries**

This cell installs the Python libraries required for the RAG (Retrieval-Augmented Generation) QA Bot. Here's a breakdown of each library:


* langchain: A framework for working with language models to build more advanced NLP applications.
* langchain_community: Community-maintained LangChain integrations; this notebook uses its Chroma vector store, Hugging Face embeddings, and PDF loader.
* langchain_core: Core components of LangChain to support LLM-powered applications.
* chromadb: A vector database that stores embeddings, which are useful for retrieving relevant documents.
* sentence_transformers: A library to generate sentence embeddings, useful for comparing and retrieving text.
* cohere: A library that allows interaction with Cohere's LLMs for natural language processing tasks.
* python-dotenv: Manages environment variables, making it easier to load sensitive information like API keys.
* pypdf: A library to read and process PDF files, allowing the bot to extract text from documents.


Together, these libraries cover the full pipeline: calling the language model, storing and searching embeddings, and reading PDF files.

In [None]:
!pip install langchain langchain_community langchain_core chromadb sentence_transformers cohere python-dotenv pypdf

# **Importing Required Modules and Libraries**

This cell imports all the necessary Python modules and libraries used throughout the code. Here's a brief overview:

* Standard libraries: Modules like base64, logging, io, tempfile, os, and typing provide utility functions for encoding, logging, handling input/output streams, and managing temporary files.
* cohere: Connects to Cohere’s language models.
* dotenv: Loads environment variables from a .env file.
* langchain_core: Provides core components such as document handling, vector retrieval, runnable processes, and output parsing.
* langchain_community: Adds community-driven features like embedding generation using Hugging Face models and document storage using Chroma.
* google.colab.files: Facilitates file handling within Google Colab.

In [10]:
import base64
import logging
from typing import List
from io import BytesIO
import tempfile
import os
import cohere
from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_core.vectorstores import VectorStoreRetriever
from langchain_core.runnables.base import Runnable
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_core.prompt_values import StringPromptValue
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from google.colab import files

# **Configuring Logging and Defining Prompt Template**

This cell sets up logging and defines a prompt template for the RAG QA bot.

* Logging Configuration: Logging is configured at the INFO level, so messages at INFO and above (WARNING, ERROR, CRITICAL) are recorded. This helps with debugging and monitoring the bot's operations.

* Prompt Template: A prompt template is created to guide the bot in generating responses. It instructs the model to use the provided context to answer the user's question. If the answer is unknown, the bot is directed to simply acknowledge that, avoiding any fabricated responses. This structure ensures that answers are helpful and relevant to the context given.

In [11]:
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

TEMPLATE = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Be helpful in your answer and be sure to reference the following context when possible.

{context}

Question: {question}

Answer:"""

prompt = PromptTemplate.from_template(TEMPLATE)
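
To see what the model actually receives, the template can be filled by hand. The sketch below uses plain `str.format`, which substitutes `{context}` and `{question}` in the same way for a simple template like this one. The sample context and question are made up for illustration.

```python
# Same text as TEMPLATE in the cell above, repeated so this sketch runs on its own.
TEMPLATE = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Be helpful in your answer and be sure to reference the following context when possible.

{context}

Question: {question}

Answer:"""

# Hypothetical sample values; in the notebook these come from the retriever and the user.
sample_context = "Football is played by two teams of 11 players."
sample_question = "How many players are on a team?"

filled = TEMPLATE.format(context=sample_context, question=sample_question)
print(filled)
```

The filled prompt ends with `Answer:`, which cues the model to continue with its answer rather than restating the question.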

# **Setting Up Environment Variables and Cohere Client**

This cell sets up the environment variable for the Cohere API key and initializes the Cohere client.

* Environment Variable: The %env magic command assigns the Cohere API key to an environment variable for the current notebook session. Typing a real key into the cell still stores it in the notebook, so replace it with a placeholder before sharing, or keep the key in a .env file instead.

* Loading Environment Variables: The load_dotenv() function loads environment variables from a .env file, making them accessible in the code.

* Retrieving the API Key: The code retrieves the Cohere API key using os.getenv(). If the key is not found, it raises a ValueError, ensuring that the program does not proceed without the necessary credentials.

* Cohere Client Initialization: Finally, a client instance for the Cohere API is created using the retrieved API key, allowing the bot to interact with Cohere’s language models.

In [None]:
# Replace the placeholder with your own key; do not leave a real key in a shared notebook.
%env COHERE_API_KEY=YOUR_COHERE_API_KEY

load_dotenv()
COHERE_API_KEY = os.getenv("COHERE_API_KEY")
if not COHERE_API_KEY:
    raise ValueError("Cohere API key not found in environment variables.")

co = cohere.Client(COHERE_API_KEY)
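
One way to avoid writing the key into a cell at all is to read it from the environment and fall back to a hidden prompt. The following is a minimal sketch using only the standard library; the helper name `read_api_key` and its `ask` parameter are hypothetical and not part of the notebook.

```python
import os
from getpass import getpass


def read_api_key(var_name: str = "COHERE_API_KEY", ask=getpass) -> str:
    """Return the key from the environment, or prompt for it without echoing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # `ask` defaults to getpass, which hides the typed value.
        key = ask(f"Enter {var_name}: ").strip()
    if not key:
        raise ValueError(f"{var_name} was not provided.")
    return key
```

With this helper, the cell above could call `COHERE_API_KEY = read_api_key()` after `load_dotenv()`, so a key from a .env file is picked up first and the prompt only appears when none is set.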

# **Defining Functions for Document Handling and Retrieval**

* get_embedding_function(): Initializes and returns a Hugging Face model for generating embeddings, logging the creation process.

* get_embedding_retriever(splits: List[Document]): Embeds the document chunks using the embedding function and stores them in a Chroma vector store. It logs the embedding process and returns a retriever configured for similarity searches.

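* save_doc_locally(pdf_string: str): Decodes a base64-encoded PDF string, checks that the decoded bytes start with the %PDF signature, and wraps them in an in-memory BytesIO object. Despite its name, it does not write to disk. It logs the operation and any error.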

* load_and_split_doc(pdf_file: BytesIO): Writes the PDF to a temporary file, then loads and splits it into manageable chunks using a specified text splitter. It ensures the temporary file is deleted afterward.

* get_chain(retriever: VectorStoreRetriever, prompt: PromptTemplate): Constructs a processing chain that formats retrieved documents and passes them to the prompt template for answer generation.

* format_docs(docs: List[Document]): Formats the content of document objects into a single string, separating them with newlines.

* generate_answer_cohere(prompt_output: StringPromptValue): Sends the prompt output to Cohere to generate an answer, logging the generation process and handling potential errors.

* upload_pdf(): Facilitates the uploading of a PDF file from the user's local system, encoding its content in base64 format for further processing. If no file is uploaded, it prints a message and returns None.
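
The effect of `chunk_size=500` and `chunk_overlap=50` in load_and_split_doc can be illustrated with a simplified fixed-window splitter. This is only a sketch: RecursiveCharacterTextSplitter prefers to cut at separators such as paragraph and line breaks, so its real chunk boundaries differ. The function name `split_with_overlap` is hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 1200 characters of dummy text.
text = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(text)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.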

In [13]:
def get_embedding_function():
    logger.info("Creating new embedding function")
    return HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

def get_embedding_retriever(splits: List[Document]) -> VectorStoreRetriever:
    logger.info("Embedding document chunks")
    embedding_function = get_embedding_function()
    vectorstore = Chroma.from_documents(
        documents=splits,
        embedding=embedding_function,
        persist_directory=None
    )
    return vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})

def save_doc_locally(pdf_string: str) -> BytesIO:
    logger.info("Saving pdf document locally")
    try:
        decoded_bytes = base64.b64decode(pdf_string)
        if decoded_bytes[:4] != b"%PDF":
            raise ValueError("Invalid PDF file received.")
        return BytesIO(decoded_bytes)
    except Exception as e:
        logger.error(f"Error saving PDF: {e}")
        raise

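def load_and_split_doc(pdf_file: BytesIO) -> List[Document]:
    logger.info("Splitting pdf document into chunks")
    temp_file_path = None  # stays None if the temporary file could not be created
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
            temp_file.write(pdf_file.getvalue())
            temp_file_path = temp_file.name

        # PyPDFLoader needs a path on disk, so the bytes go through a temporary file.
        loader = PyPDFLoader(temp_file_path)
        splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50, add_start_index=True)
        return loader.load_and_split(text_splitter=splitter)
    finally:
        # Remove the temporary file only if it was actually created.
        if temp_file_path is not None:
            os.remove(temp_file_path)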

def get_chain(retriever: VectorStoreRetriever, prompt: PromptTemplate) -> Runnable:
    logger.info("Getting RAG chain")
    return (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough()
        }
        | prompt
        | generate_answer_cohere
        | StrOutputParser()
    )

def format_docs(docs: List[Document]) -> str:
    return "\n\n".join(doc.page_content for doc in docs)

def generate_answer_cohere(prompt_output: StringPromptValue) -> str:
    logger.info("Generating answer with Cohere")
    try:
        response = co.generate(
            model="command-r-plus-04-2024",
            prompt=prompt_output.to_string(),
            max_tokens=300,
            temperature=0.7
        )
        return response.generations[0].text.strip()
    except Exception as e:
        logger.error(f"Error generating answer with Cohere: {e}")
        raise

def upload_pdf():
    uploaded = files.upload()
    if not uploaded:
        print("No file was uploaded.")
        return None

    file_name = next(iter(uploaded))
    pdf_content = uploaded[file_name]
    return base64.b64encode(pdf_content).decode('utf-8')
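
The retriever returned by get_embedding_retriever picks the `k` chunks whose embeddings are closest to the query embedding. The sketch below shows the idea with toy three-dimensional vectors and cosine similarity. Cosine similarity is used here as one common measure; the metric Chroma actually applies depends on how the collection is configured. The names `cosine_similarity`, `top_k`, and the vectors are all illustrative.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], chunk_vecs: Dict[str, Sequence[float]], k: int = 5) -> List[str]:
    """Return the names of the k chunks most similar to the query, best first."""
    ranked = sorted(
        chunk_vecs,
        key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings standing in for real model output.
toy_vectors = {
    "rules": [0.9, 0.1, 0.0],
    "history": [0.1, 0.9, 0.0],
    "equipment": [0.7, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]
print(top_k(query, toy_vectors, k=2))
```

Only the selected chunks reach the prompt, so anything outside the top `k` is invisible to the model for that question.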

# **PDF Processing Function**

This cell defines the process_pdf function, which handles the entire workflow for processing a PDF document and generating a response based on a user’s question or a summary request.

* Function Purpose: It takes a base64-encoded PDF string and an optional question, processing the PDF to either answer the question or summarize the document.

* Logging: The function logs the start of the PDF processing and any subsequent actions, such as answer generation or summarization.

* Saving the PDF: It first saves the PDF locally by calling save_doc_locally.

* Loading and Splitting: The function then loads and splits the PDF into smaller chunks for easier processing using load_and_split_doc.

* Embedding and Retrieval: It creates an embedding retriever for the document chunks through get_embedding_retriever.

* Chain Setup: The function constructs a processing chain that will handle the interaction with the model using get_chain.

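* Question Handling: If a question is provided, it invokes the chain to generate an answer. If no question is given, it invokes the chain with the fixed question "Summarize the document". Because the chain passes only the five most similar chunks (k=5) to the model, this summary reflects those retrieved chunks rather than the full document.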

* Error Handling: In case of any errors during processing, it logs the error and returns a user-friendly error message.
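
The steps above can be sketched as a plain-Python pipeline to make the data flow explicit: retrieve chunks, join them, fill the prompt, call the model, and return a string. This is not how LangChain implements the chain; `build_chain`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins so the sketch runs without a model or vector store.

```python
from typing import Callable, List


def build_chain(retrieve: Callable[[str], List[str]],
                template: str,
                generate: Callable[[str], str]) -> Callable[[str], str]:
    """Return a function that mirrors the retrieve -> format -> prompt -> generate flow."""
    def run(question: str) -> str:
        # Corresponds to `retriever | format_docs`: join retrieved chunks with blank lines.
        context = "\n\n".join(retrieve(question))
        # Corresponds to the prompt step: fill both placeholders.
        prompt_text = template.format(context=context, question=question)
        # Corresponds to the model call; the result is returned as a plain string.
        return generate(prompt_text).strip()
    return run


def fake_retrieve(question: str) -> List[str]:
    return ["Each team has 11 players.", "Points are scored by crossing the goal line."]


def fake_generate(prompt_text: str) -> str:
    return f"(model saw {len(prompt_text.splitlines())} prompt lines) "


chain = build_chain(fake_retrieve, "{context}\n\nQuestion: {question}\n\nAnswer:", fake_generate)
print(chain("How many players are on a team?"))
```

A summary request follows exactly the same path, with "Summarize the document" passed in as the question.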



In [14]:
def process_pdf(pdf_string: str, question: str = None) -> str:
    logger.info("Processing PDF")
    try:
        pdf_file = save_doc_locally(pdf_string)
        splits = load_and_split_doc(pdf_file)
        retriever = get_embedding_retriever(splits)
        chain = get_chain(retriever, prompt)

        if question:
            logger.info("Generating answer")
            return chain.invoke(question)
        else:
            logger.info("Generating summary")
            return chain.invoke("Summarize the document")
    except Exception as e:
        logger.error(f"Error processing PDF: {e}")
        return f"An error occurred: {str(e)}"
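
As written, process_pdf decodes, splits, and embeds the PDF on every call. A possible refinement is to build the chain once per document and reuse it for later questions. The sketch below shows that idea with a small cache keyed by a hash of the base64 string; `ChainCache` and `fake_build` are hypothetical and stand in for the notebook's save, split, embed, and get_chain steps.

```python
import hashlib
from typing import Callable, Dict


class ChainCache:
    """Build the chain once per distinct PDF and reuse it for later questions."""

    def __init__(self, build_chain: Callable[[str], Callable[[str], str]]):
        self._build_chain = build_chain
        self._chains: Dict[str, Callable[[str], str]] = {}
        self.builds = 0  # counts how often the expensive build step actually ran

    def ask(self, pdf_string: str, question: str) -> str:
        # Identical PDF content hashes to the same key, so the chain is reused.
        key = hashlib.sha256(pdf_string.encode("utf-8")).hexdigest()
        if key not in self._chains:
            self._chains[key] = self._build_chain(pdf_string)
            self.builds += 1
        return self._chains[key](question)


# Hypothetical stand-in for "save, split, embed, get_chain" on one PDF.
def fake_build(pdf_string: str) -> Callable[[str], str]:
    return lambda question: f"answer to {question!r} from {len(pdf_string)} chars"


cache = ChainCache(fake_build)
# A short placeholder string stands in for the base64-encoded PDF.
cache.ask("JVBERi0xLjQ=", "What is this?")
cache.ask("JVBERi0xLjQ=", "Summarize the document")
print(cache.builds)
```

The trade-off is memory: each cached chain keeps its vector store alive until the cache is discarded.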

# **Main Function for User Interaction**

This cell defines the main function, which serves as the interactive entry point for the user to upload a PDF and engage with the RAG QA bot.

* User Prompt: The function begins by prompting the user to upload a PDF file. It calls the upload_pdf function to handle this.

* PDF Validation: If no PDF is uploaded, the function returns early, preventing further interaction.

* Continuous Interaction: A while loop allows the user to choose between asking a question, summarizing the document, or exiting the program.

* Input Handling:

   * If the user enters 'e', the loop breaks, and the program exits.
   * If 'q' is entered, the user is prompted to input their question, which is then processed using the process_pdf function. The answer is printed.
   * If 's' is chosen, the function generates a summary of the document and prints it.
* Invalid Input Handling: If the user enters an invalid option, a message is displayed, prompting them to try again.

In [15]:
def main():
    print("Please upload a PDF file.")
    pdf_string = upload_pdf()
    if pdf_string is None:
        return

    while True:
        action = input("Enter 'q' to ask a question, 's' to summarize, or 'e' to exit: ").lower()
        if action == 'e':
            break
        elif action == 'q':
            question = input("Enter your question: ")
            answer = process_pdf(pdf_string, question)
            print(f"Answer: {answer}")
        elif action == 's':
            summary = process_pdf(pdf_string)
            print(f"Summary: {summary}")
        else:
            print("Invalid input. Please try again.")
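
Because main reads from `input` and calls the model directly, it is hard to try out without a PDF and an API key. The sketch below restates the same q/s/e loop with its input and processing functions passed in, so it can be exercised with scripted values. `run_menu` and the stub lambdas are hypothetical; unlike the original, this version also strips surrounding whitespace from the chosen action.

```python
from typing import Callable, List


def run_menu(read: Callable[[str], str],
             answer: Callable[[str], str],
             summarize: Callable[[], str]) -> List[str]:
    """Run the q/s/e loop and return everything that would have been printed."""
    printed: List[str] = []
    while True:
        action = read("Enter 'q' to ask a question, 's' to summarize, or 'e' to exit: ").strip().lower()
        if action == "e":
            break
        elif action == "q":
            question = read("Enter your question: ")
            printed.append(f"Answer: {answer(question)}")
        elif action == "s":
            printed.append(f"Summary: {summarize()}")
        else:
            printed.append("Invalid input. Please try again.")
    return printed


# Scripted inputs replace the keyboard; the lambdas stand in for process_pdf.
scripted = iter(["x", "q", "How many players?", "s", "e"])
output = run_menu(
    read=lambda prompt: next(scripted),
    answer=lambda question: f"stub answer to {question}",
    summarize=lambda: "stub summary",
)
print(output)
```

In the notebook, `read` would be `input`, `answer` would wrap `process_pdf(pdf_string, question)`, and `summarize` would wrap `process_pdf(pdf_string)`.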

# **How to Use the RAG QA Bot**

* Run the Code: Start by running the entire notebook. This will set up everything needed for the bot to function.

* Upload a PDF: When prompted, upload a PDF file. This will be the document the bot will process.

* Choose an Action:

     * Ask a Question: Type 'q' and hit Enter. Then, enter your question about the PDF.
     * Summarize the Document: Type 's' and hit Enter to get a summary of the PDF.
     * Exit: Type 'e' to exit the program.
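* Processing Time: Each question or summary request re-processes the PDF (decoding, splitting, and embedding), so every request may take some time, especially if the document is large. The first request is typically the slowest because the embedding model also has to be downloaded. Please be patient while the bot works.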

* Error Handling: If you see any errors, follow the messages to troubleshoot. You can always restart the process.

Enjoy interacting with your RAG QA bot!





In [None]:
if __name__ == "__main__":
    main()

Please upload a PDF file.


Saving f.pdf to f (2).pdf




Summary: Football is a popular sport played at various levels, including high schools, colleges, and professional stadiums. The game involves running, passing, kicking, and bodily contact, with two teams of 11 players each attempting to move the ball across the opposing team's goal line to score points.
