# RAG with PDF Documents 📄

Build a Retrieval-Augmented Generation (RAG) system that answers questions about PDF documents, using LangChain for the pipeline, ChromaDB for vector storage, and Gradio for the interactive question-answering interface.

## Content 📚

- Use **LangChain** for building RAG pipelines 🔧.
- Manage and query document embeddings using **ChromaDB** 🔍.
- Build an interactive user interface for querying documents using **Gradio** 🖥️.


In [3]:
import os
import glob
import shutil
import gradio as gr

from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import Chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

from dotenv import load_dotenv, find_dotenv

## Setting Up the Environment

In this step, we ensure that the required environment variables are properly loaded for the lab.

### Key Points:
- The `dotenv` library is used to load environment variables from a `.env` file.
- The code checks whether `OPENAI_API_KEY` is set, which is required for accessing OpenAI's API.
- If the API key is found, a confirmation message is printed. If it is missing, the cell prints nothing.

In [4]:
_ = load_dotenv(find_dotenv())

if os.getenv("OPENAI_API_KEY"):
	print("OpenAI API key found!")

OpenAI API key found!
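
The check above stays silent when the key is missing, so a later API call would be the first thing to fail. A minimal sketch of a stricter check follows. `require_env` is a hypothetical helper (it is not part of `dotenv` or LangChain), and it assumes `load_dotenv` has already been called.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Hypothetical error message; adapt it to your own setup
        raise RuntimeError(f"{name} is not set; add it to your .env file or export it.")
    return value


# Example usage in the notebook (commented out so the snippet has no side effects):
# api_key = require_env("OPENAI_API_KEY")
```

Failing early like this makes a missing key show up in the setup cell rather than in the middle of the pipeline.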


## Defining Global Variables

This cell sets up important global variables that will be used throughout the lab:

- **`PDF_DOCS_PATH`**: Specifies the path to the directory containing the PDF documents.

- **`EMBEDDINGS`**: Initializes the embeddings model using OpenAI's embeddings.

In [5]:
PDF_DOCS_PATH = "./docs"

EMBEDDINGS = OpenAIEmbeddings()

## Listing PDF Files

This cell retrieves all PDF files from the specified directory:

- **`glob.glob`**: Searches for files matching the `*.pdf` pattern directly inside the directory defined by `PDF_DOCS_PATH`. Subdirectories are not searched.

- The resulting list, `pdf_files`, contains the paths of all PDF documents found.

The output displays the identified PDF files, which will be processed in subsequent steps.


In [6]:
# Get all .pdf files directly inside the base directory (subdirectories are not searched)
pdf_files = glob.glob(os.path.join(PDF_DOCS_PATH, "*.pdf"))

pdf_files

['./docs/C1M1_scripts.pdf',
 './docs/C1M2_scripts.pdf',
 './docs/C1M3_scripts.pdf',
 './docs/C1M4_scripts.pdf',
 './docs/C2M1_scripts.pdf',
 './docs/C2M2_scripts.pdf',
 './docs/C2M3_scripts.pdf',
 './docs/C2M4_scripts.pdf',
 './docs/C3M1_scripts.pdf',
 './docs/C3M2_scripts.pdf',
 './docs/C3M3_scripts.pdf',
 './docs/Technical-Gardening-Manual-FINAL.pdf']

In this case you will be working with a gardening manual provided by [High Rocks](https://highrocks.org/). Once you are done with this assignment, you can reuse all of this code for your own documents. Notice that the code above locates every PDF file inside the `./docs/` directory, which is useful when you have more than one file.
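
The `glob` pattern used above only matches files directly inside `PDF_DOCS_PATH`. If your own documents live in nested folders, a recursive search may be more convenient. The sketch below assumes that layout; `find_pdfs` is a hypothetical helper name, not something the lab defines.

```python
import glob
import os


def find_pdfs(base_dir: str, recursive: bool = False) -> list[str]:
    """Return the PDF paths under base_dir, optionally including subdirectories."""
    if recursive:
        # "**" matches zero or more directories when recursive=True
        pattern = os.path.join(base_dir, "**", "*.pdf")
    else:
        pattern = os.path.join(base_dir, "*.pdf")
    return sorted(glob.glob(pattern, recursive=recursive))
```

Calling `find_pdfs("./docs")` mirrors the cell above, while `find_pdfs("./docs", recursive=True)` also descends into subfolders.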

## Helper Function: `clean_text`

This pre-defined function is provided for your convenience and requires no additional modifications. It preprocesses text to ensure consistent formatting by:

- Replacing newline characters (`\n`) with spaces.
- Removing extra spaces to create clean, well-structured text.

You can directly use this function later in the lab for cleaning text extracted from the PDF documents.

Text pipelines usually include a preprocessing function like this one, because cleaner input tends to yield better results. This version is deliberately basic; adapt the cleaning steps to whatever your data requires.


In [7]:
def clean_text(text: str) -> str:
    # Replace '\n' with spaces and collapse repeated whitespace
    cleaned = " ".join(text.split("\n"))  # Split on newlines, rejoin with spaces
    cleaned = " ".join(cleaned.split())  # Collapse runs of whitespace into one space
    return cleaned
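
A quick check of what the helper does to a messy string. The function is repeated here only so the snippet runs on its own, and the sample text is made up for illustration.

```python
def clean_text(text: str) -> str:
    # Same logic as the helper above, repeated so this snippet is self-contained
    cleaned = " ".join(text.split("\n"))
    cleaned = " ".join(cleaned.split())
    return cleaned


raw = "Plant   tomatoes\nin  full\n\nsun. "
print(clean_text(raw))  # prints: Plant tomatoes in full sun.
```

Since `str.split()` with no arguments already splits on any whitespace, including newlines, the first step is redundant but harmless.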

## Implement the `load_pdf` Function

Write a function to process and clean the content of a PDF document. The function should:

1. Load the PDF file from the specified path.
2. Extract the content of the document, processing each page individually.
3. Use the `clean_text` helper function to ensure the text is properly formatted.

### Hints:
- Utilize [`PyPDFLoader`](https://python.langchain.com/docs/integrations/document_loaders/pypdfloader/) to handle the PDF loading.
- Iterate through the documents to apply text cleaning.
- Return the cleaned content in a format suitable for further processing.

This function is a foundational part of the pipeline for working with the PDF data. If you later reuse this code for your own RAG pipelines, you may need a different loader depending on the format of your data. LangChain provides many loaders, listed in the [docs](https://python.langchain.com/docs/how_to/#document-loaders), that cover formats such as HTML, JSON and CSV.


In [8]:
def load_pdf(pdf_path):

	# Use the PyPDFLoader by specifying the correct path
	loader = PyPDFLoader(pdf_path)

	# Use the load method from the loader to get the documents
	documents = loader.load()

	# Iterate over the documents
	for document in documents:
		# Apply the clean_text function to the page_content attribute of each document
		document.page_content = clean_text(document.page_content)

	return documents


Now, the `load_pdf` function will be applied to all the PDF files, combining the cleaned pages into a single list. With a small collection like this one the process is quick, but it gets slower as the number and size of the files grow.

In [9]:
docs = [doc for pdf in pdf_files for doc in load_pdf(pdf)]

print(f"There are a total of {len(docs)} documents")

There are a total of 549 documents
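
The count above is much larger than the number of files because the loader yields one document per page, and the nested comprehension flattens the per-file lists into a single list. A small sketch of that flattening pattern follows, with `fake_load` as a hypothetical stand-in for `load_pdf` so that no PDF files are needed.

```python
from itertools import chain


def fake_load(path: str) -> list[str]:
    """Hypothetical stand-in for load_pdf: pretend every file has two pages."""
    return [f"{path} page 1", f"{path} page 2"]


paths = ["a.pdf", "b.pdf"]

# One list per file: [[a1, a2], [b1, b2]]
nested = [fake_load(p) for p in paths]

# Same shape of comprehension as in the notebook: a single flat list of pages
flat = [page for p in paths for page in fake_load(p)]

# itertools.chain gives the same result from the nested form
assert flat == list(chain.from_iterable(nested))
print(len(flat))  # prints: 4
```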


## Implement the `split_docs` function

Write a function to split the loaded document content into smaller chunks for easier processing. The function should:

1. Use the `RecursiveCharacterTextSplitter` class to split the document content into chunks. You can read more about it [here](https://python.langchain.com/docs/how_to/recursive_text_splitter/#splitting-text-from-languages-without-word-boundaries).
2. Adjust the chunk size and overlap to ensure that each chunk is appropriately sized while maintaining context between chunks.
3. Return the resulting list of document splits.

### Hints:
- The `RecursiveCharacterTextSplitter` class allows you to configure how the text is split. You’ll need to adjust the following parameters:

  - **`chunk_size`**: Set the maximum size of each chunk (use 1500 characters for this exercise).
  - **`chunk_overlap`**: Define the number of characters to overlap between chunks (use 150 characters for this exercise).
  - **`separators`**: Specify the separators that should be used to break the text into chunks. These can include sentence-ending punctuation or newlines. These are already provided.
  - **`keep_separator`**: Set to `True` to retain the separator in the split text.

- To perform the actual splitting you need to use the [`split_documents`](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html#langchain_text_splitters.character.RecursiveCharacterTextSplitter.split_documents) method from `RecursiveCharacterTextSplitter`.


This function will help break down long documents into manageable pieces that are suitable for embedding and processing.
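
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified fixed-window chunker in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on the separators you give it); `sliding_chunks` is a hypothetical name used only for this illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints: ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 150-character overlap plays in the exercise: text that straddles a boundary still appears whole in at least one chunk.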


In [11]:
def split_docs(docs):

	### START CODE HERE ###

	# Instantiate the RecursiveCharacterTextSplitter class with the appropriate parameters
	text_splitter = RecursiveCharacterTextSplitter( 
		chunk_size=1500,
		chunk_overlap=150,
		separators=[". ", "? ", "! ", "\n\n", "\n", " ", ""], 
		keep_separator=True,
	)

	# Use the splitter to split the documents (use the split_documents method)
	splits = text_splitter.split_documents(docs)

	### END CODE HERE ###

	return splits

Now apply the splitting function to the actual documents:

In [12]:
splitted_docs = split_docs(docs)

print(f"There are a total of {len(splitted_docs)} documents after splitting")

There are a total of 1091 documents after splitting


## Implement the `setup_vectordb` function

Write a function to set up a vector database for efficient document retrieval. The function should:

1. Use the `Chroma.from_documents` method to create a vector database from the split documents.
2. Embed the documents using the specified embeddings model.
3. Store the vector database in the specified directory for persistent storage.

### Hints:

- You need to pass three keyword arguments to the method:

  - **`documents`**: The list of document splits to be embedded (the `splitted_docs` parameter of this function).
  - **`embedding`**: The embeddings model used to convert documents into vectors. Here you can use the global `EMBEDDINGS` variable defined earlier.
  - **`persist_directory`**: The directory where the vector database will be stored. Pass the provided `db_docs_path` parameter (default `db/chroma/`) so that grading works correctly.


This function sets up the vector database that will be used for document retrieval in subsequent steps.


In [14]:
def setup_vectordb(splitted_docs, db_docs_path="db/chroma/"):

	# Delete the on-disk directory that will hold the data
	# This avoids duplicated documents if you run this function multiple times
	if os.path.exists(db_docs_path) and os.path.isdir(db_docs_path):
		shutil.rmtree(db_docs_path)

	### START CODE HERE ###

	# Create an instance of the vector database
	vectordb = Chroma.from_documents( 
		documents=splitted_docs,
		embedding=EMBEDDINGS,
		persist_directory=db_docs_path,
	)

	### END CODE HERE ###

	return vectordb

Now run the function to create the vector database that contains the post-splitting documents:

In [15]:
DATABASE = setup_vectordb(splitted_docs)

if os.path.exists("./db/chroma/"):
	print("Successfully created the vector database!")
else:
	print("The directory to store the vector database was not created, double check your code.")

Successfully created the vector database!


Now try asking the database to retrieve the top k (5 in this case) documents given a question. You might get duplicated results but that is ok!

In [17]:
question = "How can I plant tomatoes?"
retrieved_docs = DATABASE.similarity_search(question, k=5)

for rd in retrieved_docs:
	print(rd)

page_content='32 Tomatoes... It’s difficult to imagine a complete garden without them. And in the kitchen, they’re a favorite--a rich, versatile vegetable that is equally at home in nutritious salad or on a satisfying burger. And yet when you’re growing them, tomatoes can often be problematic. Their culinary versatility also translate into their growing habits as well, as they show a willingness to grow however they please, seeming to defy both gravity and the nearness of your garden plot. Biologically, your tomato plants want to mature and spread their seeds, dropping the tomatoes to the earth, so gardeners often seek out ways to elevate their tomatoes. Other reasons to support your tomatoes include: • Your tomatoes are less prone to insect and disease damage. If you choose to support your tomatoes, then you’ve got another problem: What option to do you choose? Do you follow in the footsteps of your grandfather and create a cage for your tomatoes? Do you stake them? • Your tomatoes st
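
Conceptually, a similarity search embeds the question and returns the stored chunks whose vectors are closest to it. The toy sketch below shows that idea with hand-written two-dimensional vectors and cosine similarity. Both are assumptions made for illustration: real embeddings have far more dimensions, and the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int) -> list[str]:
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(store.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Made-up vectors standing in for embedded chunks
toy_store = {
    "tomatoes": [1.0, 0.1],
    "compost": [0.2, 1.0],
    "staking": [0.9, 0.3],
}

print(top_k([1.0, 0.0], toy_store, k=2))  # prints: ['tomatoes', 'staking']
```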

## Helper Function: `format_docs`

To visualize the retrieved documents in a more organized manner, the `format_docs` function is provided. It returns a single string containing each document's result number, contents, source file name and page:


In [18]:
def format_docs(docs):
    results = []
    for i, doc in enumerate(docs, 1):
        source_path = doc.metadata.get("source", "Unknown")
        filename = (
            os.path.basename(source_path) if source_path != "Unknown" else "Unknown"
        )

        result = (
            f"Result {i}:\n"
            f"{doc.page_content}\n\n"
            f"Document: {filename}\n"
            f"Page: {doc.metadata.get('page', 'Unknown')}"
        )
        results.append(result)

    return "\n---\n".join(results)

Try it out with the retrieved documents:

In [19]:
print(format_docs(retrieved_docs))

Result 1:
32 Tomatoes... It’s difficult to imagine a complete garden without them. And in the kitchen, they’re a favorite--a rich, versatile vegetable that is equally at home in nutritious salad or on a satisfying burger. And yet when you’re growing them, tomatoes can often be problematic. Their culinary versatility also translate into their growing habits as well, as they show a willingness to grow however they please, seeming to defy both gravity and the nearness of your garden plot. Biologically, your tomato plants want to mature and spread their seeds, dropping the tomatoes to the earth, so gardeners often seek out ways to elevate their tomatoes. Other reasons to support your tomatoes include: • Your tomatoes are less prone to insect and disease damage. If you choose to support your tomatoes, then you’ve got another problem: What option to do you choose? Do you follow in the footsteps of your grandfather and create a cage for your tomatoes? Do you stake them? • Your tomatoes stay d

## Implement the `process_query` function

Complete the logic for handling user queries in a retrieval-augmented generation (RAG) pipeline. The function should:

1. Initialize a language model (LLM) for generating responses.

2. Define a custom prompt template to ensure detailed, thorough answers.


### Hints:
- Use **`PromptTemplate.from_template`** to convert the provided template into a prompt object.
- Retrieve the source documents:
  - Use the **`as_retriever`** method on `DATABASE` to set up a retriever.
  - Invoke the retriever with the given question to get the relevant documents.
  - Use the `format_docs` helper function to process and format the retrieved documents.
- Set up the QA pipeline:
  - Define a chain where:
    - **Context** is set to the retriever output.
    - **Question** is directly passed through with `RunnablePassthrough`.
    - **Prompt** is piped into the QA chain.
    - **LLM** processes the prompt.
    - **StrOutputParser** is used to parse the LLM's output.
  - Use the `invoke` method on the QA chain to generate the LLM's response.

This task will test your ability to integrate multiple components in the RAG pipeline effectively. For more info be sure to check the [docs](https://python.langchain.com/docs/how_to/inspect/).
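
The data flow described in the hints can be sketched with plain Python functions. Nothing here is LangChain: every function is a hypothetical stand-in whose only purpose is to show that the question is fed to both branches of the dictionary, and that the result then flows through the prompt, the model and the parser in order.

```python
def fake_retriever(question: str) -> list[str]:
    """Stand-in for the retriever: returns made-up passages."""
    return [f"(stand-in passage relevant to: {question})"]


def passthrough(value):
    """Stand-in for RunnablePassthrough: returns its input unchanged."""
    return value


def run_branches(branches: dict, value) -> dict:
    """Feed the same input to every branch, like the dict at the start of the chain."""
    return {key: fn(value) for key, fn in branches.items()}


def fill_prompt(values: dict) -> str:
    """Stand-in for the prompt template: slot context and question into text."""
    context = "\n".join(values["context"])
    return f"{context}\nQuestion: {values['question']}\nDetailed Answer:"


def fake_llm(prompt: str) -> str:
    """Stand-in for the LLM plus output parser: returns a plain string."""
    return f"(stand-in answer to a prompt of {len(prompt)} characters)"


def run_chain(question: str) -> str:
    values = run_branches({"context": fake_retriever, "question": passthrough}, question)
    return fake_llm(fill_prompt(values))


print(run_chain("How can I plant tomatoes?"))
```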



In [24]:
def process_query(question):
	
	# Initialize the LLM
	llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

	# Define a template for the QA
	template = """Use the following pieces of context to answer the question at the end. Provide a detailed, thorough response that:
		1. Answers the main question
		2. Provides relevant examples or details from the context
		3. Explains any important concepts mentioned
		4. If relevant, discusses implications or applications

		If you don't know the answer, provide a detailed explanation of what aspects you're uncertain about and why.

		{context}
		Question: {question}
		Detailed Answer:"""

	### START CODE HERE ###

	# Instantiate a PromptTemplate using the template given
	prompt = PromptTemplate.from_template(template)

	# Use the as_retriever method to use the DATABASE as a retriever
	retriever = DATABASE.as_retriever()
	
	# Get the source documents by using the invoke method on the retriever and passing the question
	source_documents = retriever.invoke(question)

	# Format the source documents using the format_docs helper function
	doc_references = format_docs(source_documents)

	# Set up the QA chain
	qa_chain = (
		# Use the retriever as context and a RunnablePassthrough as question
		{"context": retriever, "question": RunnablePassthrough()}
		# Pipe to the prompt
		| prompt
		# Pipe to the llm
		| llm
		# Pipe to the StrOutputParser
		| StrOutputParser()
	)

	# Get response from qa_chain by using the invoke method and passing the question
	llm_response = qa_chain.invoke(question)

	### END CODE HERE ###

	return llm_response, doc_references

In [25]:
question = "How can I plant flowers?"

llm_response, doc_references = process_query(question)

print(f"### LLM Response #################\n\n{llm_response}\n")
print(f"### References ###################\n\n{doc_references}")

### LLM Response #################

Planting flowers involves several steps to ensure successful growth and blooming. Here are some key considerations based on the provided context:

1. Soil Collection: Before planting flowers, it is essential to understand the soil composition of your garden. Collect soil samples from different areas of your garden, mix them in a bucket, and send them for analysis. The results will help you determine the amendments needed for optimal flower growth.

2. Garden Planning: Create a layout of your garden on paper or using computer software. Consider the mature size of the flowers you intend to plant to avoid overcrowding or sparse arrangements. Professionals often collect data on plant growth each year to make informed decisions for future plantings.

3. Crop Rotation: While primarily discussed in the context of vegetable gardening, crop rotation can also benefit flower gardens. By rotating different types of flowers in the same spot, you can reduce the ri

## Gradio User Interface

This cell creates a Gradio interface for the Q&A Assistant. Users can ask questions and receive detailed answers along with source document references.

### Key Features:
- **Input Section**: A textbox for entering questions, with a "Submit Question" button for submission.
- **Tabs for Output**:
  - **AI Response**: Displays the detailed answer generated by the AI.
  - **Document References**: Shows the source documents referenced in the response.
- **Interactivity**:
  - The "Submit Question" button triggers the `process_query` function to process the input and display the results.
  - Pressing "Enter" in the question textbox also submits the question.

This user-friendly interface enables interaction with the QA chatbot in a structured and visually organized way.
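
Because the interface calls the handler on every submission, an empty textbox would still trigger a retrieval and an LLM call. One possible guard is sketched below. `guard_empty` and `fake_process_query` are hypothetical names, and the placeholder message is an assumption rather than part of the lab.

```python
from typing import Callable, Tuple

Handler = Callable[[str], Tuple[str, str]]


def guard_empty(handler: Handler) -> Handler:
    """Wrap a (response, references) handler so blank questions skip the pipeline."""

    def wrapped(question: str) -> Tuple[str, str]:
        if not question or not question.strip():
            # Same tuple shape as the real handler: (response, references)
            return "Please enter a question.", ""
        return handler(question.strip())

    return wrapped


def fake_process_query(question: str) -> Tuple[str, str]:
    """Stand-in for process_query so this snippet runs without an API key."""
    return f"answer to: {question}", "references"


safe_handler = guard_empty(fake_process_query)
print(safe_handler("   "))  # prints: ('Please enter a question.', '')
```

In the notebook, `guard_empty(process_query)` could be passed as `fn` to both event listeners. The order of the returned tuple matters, since Gradio assigns the values to the components listed in `outputs` in the same order.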


In [27]:
with gr.Blocks(theme=gr.themes.Monochrome()) as iface:
    gr.Markdown(
        """
        # 📚 DAG Scripts Q&A Assistant
        Ask any question about the DAG C1 & C2 content and get detailed answers with source references.
        """
    )

    with gr.Column():
        gr.Markdown("### Your Question")
        question_input = gr.Textbox(
            lines=3,
            placeholder="Enter your question here...",
            label="",  # Removed the label since we're using Markdown
        )
        submit_btn = gr.Button("Submit Question", variant="primary", size="lg")

    with gr.Tabs():
        with gr.TabItem("📝 Response"):
            gr.Markdown("### AI Response")
            response_output = gr.Textbox(
                lines=15,
                label="",  # Removed the label since we're using Markdown
                show_copy_button=True,
            )
        with gr.TabItem("🔍 Document References"):
            gr.Markdown("### Source Documents")
            references_output = gr.Textbox(
                lines=15,
                label="",  # Removed the label since we're using Markdown
                show_copy_button=True,
            )

    # Add submit button click event and enter key functionality
    submit_btn.click(
        fn=process_query,
        inputs=[question_input],
        outputs=[response_output, references_output],
    )
    question_input.submit(
        fn=process_query,
        inputs=[question_input],
        outputs=[response_output, references_output],
    )

--------


In [28]:
# Close the server (in case you run this cell multiple times)
iface.close()

# Spin up the gradio app
iface.launch(server_name="0.0.0.0", share=False)

Running on local URL:  https://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.


