In [1]:
!pip install -r /content/requirements.txt



In [2]:
!apt-get install poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.9).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


# Task
Amend the `app.py` file to:
1. Display the full text of the document after OCR.
2. Add a RAG (Retrieval-Augmented Generation) function after the summary is provided. This function should allow for multiple queries of all uploaded documents until the user types 'exit'.
3. Include instructions for users on how to use the RAG functionality.

## Display extracted text

### Subtask:
Modify the `app.py` file to display the full text extracted from the PDF(s) after the OCR process is complete. This will appear before the summary.


**Reasoning**:
I will modify the `app.py` file to display the extracted text from the PDF. I will add a subheader and a text area to show the `all_pdf_text` content before the summarization part, as requested in the instructions.
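
When several PDFs are uploaded, the combined text is easier to read if each document is labelled. The sketch below is one way to do that, not part of the `app.py` in this notebook; `combine_documents` and the `=====` header format are hypothetical names chosen for illustration.

```python
def combine_documents(texts_by_name: dict[str, str]) -> str:
    """Join per-file OCR text, prefixing each document with a header line."""
    sections = []
    for name, text in texts_by_name.items():
        # Strip surrounding whitespace so the separators stay uniform.
        sections.append(f"===== {name} =====\n{text.strip()}")
    return "\n\n".join(sections)


# Example with two made-up file names and short placeholder texts.
combined = combine_documents(
    {"a.pdf": "First speech.\n\n", "b.pdf": "Second speech."}
)
print(combined)
```

The returned string could be passed to `st.text_area` in place of a plain concatenation.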



## Implement RAG chat

### Subtask:
Add a new section to the Streamlit interface titled "Chat with your Documents (RAG)". This section will appear after the summary and will include a text input field for you to ask questions about the uploaded documents. The application should also allow for multiple questions until the user types "exit".


**Reasoning**:
I will overwrite the `app.py` file with the provided code, which includes the necessary changes for this subtask. The new section, with the subheader "Chat with your Documents (RAG)", comes after the summary section in the `main` function. Because Streamlit reruns the script on every interaction, a blocking `while` loop does not fit here; instead, the section uses `st.chat_input` to collect each query and stores the conversation in `st.session_state.messages`, so earlier questions and answers are redrawn on each run. If the query is not "exit", the existing `qa_chain` answers it and the response is displayed with `st.markdown`.
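
The `app.py` below keeps the conversation in `st.session_state.messages` and handles one prompt per script run. Stripped of Streamlit, that pattern can be sketched as follows. `handle_prompt` and `answer_fn` are hypothetical stand-ins for the chat block and for `qa_chain`, and the `.strip()` call is an addition that `app.py` does not make.

```python
from typing import Callable

Message = dict[str, str]


def handle_prompt(
    messages: list[Message], prompt: str, answer_fn: Callable[[str], str]
) -> bool:
    """Record one user turn; return False once the user types 'exit'."""
    messages.append({"role": "user", "content": prompt})
    if prompt.strip().lower() == "exit":
        # No answer is generated for the exit command.
        return False
    messages.append({"role": "assistant", "content": answer_fn(prompt)})
    return True


def fake_chain(question: str) -> str:
    """Placeholder for the real retrieval chain."""
    return f"(answer to: {question})"


history: list[Message] = []
keep_going = handle_prompt(history, "Who gave the speech?", fake_chain)
after_exit = handle_prompt(history, "exit", fake_chain)
```

Because the history list outlives each call, any number of questions can be asked before "exit"; in the app, `st.session_state` plays the role of `history`.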



**Reasoning**:
The `app.py` file has been updated. Now, I will restart the Streamlit application to apply the changes.



In [3]:
%%writefile app.py
import os
import openai
import pdf2image
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
import streamlit as st
import tiktoken

def pdf_to_img(pdf_file):
    """Converts a PDF file to a list of PIL Images."""
    return pdf2image.convert_from_path(pdf_file)


def ocr_core(file):
    """Performs OCR on a single image and returns the extracted text."""
    return pytesseract.image_to_string(file)


def extract_text_from_pdf(pdf_file):
    """Extracts text from all pages of a PDF file."""
    images = pdf_to_img(pdf_file)
    extracted_text = ""
    for img in images:
        extracted_text += ocr_core(img) + "\n\n"
    return extracted_text


def count_tokens(text: str) -> int:
    """Counts the number of tokens in a string."""
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

def main():
    """Orchestrates the script's execution with a Streamlit GUI."""
    openai_api_key = os.getenv("OPENAI_API_KEY")
    openai.api_key = openai_api_key

    st.title("PDF OCR & Summarize & RAG")

    with st.expander("IMPORTANT NOTICE"):
        st.write("""
        This web application is a prototype developed for educational purposes only. The information provided here is NOT intended for real-world usage and should not be relied upon for making any decisions, especially those related to financial, legal, or healthcare matters.

        Furthermore, please be aware that the LLM may generate inaccurate or incorrect information. You assume full responsibility for how you use any generated output.

        Always consult with qualified professionals for accurate and personalized advice.
        """)

    uploaded_files = st.file_uploader(
        "Upload PDF files", type="pdf", accept_multiple_files=True
    )

    if "qa_chain" not in st.session_state:
        st.session_state.qa_chain = None

    if uploaded_files:
        all_pdf_text = ""
        for uploaded_file in uploaded_files:
            try:
                with open(uploaded_file.name, "wb") as f:
                    f.write(uploaded_file.getbuffer())
                pdf_text = extract_text_from_pdf(uploaded_file.name)
                all_pdf_text += pdf_text
                st.success(f"Successfully extracted text from '{uploaded_file.name}'.")
            except Exception as e:
                st.error(f"An error occurred while processing '{uploaded_file.name}': {e}")

        if all_pdf_text:
            st.subheader("Extracted Text")
            st.text_area("Full text from PDF(s)", all_pdf_text, height=300)

            text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=1000, chunk_overlap=200
            )
            splits = text_splitter.split_text(all_pdf_text)

            embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
            vectordb = FAISS.from_texts(splits, embeddings)

            llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key=openai_api_key)
            st.session_state.qa_chain = RetrievalQA.from_chain_type(
                llm, retriever=vectordb.as_retriever()
            )
            query = "Summarize the content of the PDF(s)."
            summary = st.session_state.qa_chain.run(query)
            st.write("---")
            st.subheader("Summary")
            st.write(summary)
            st.write("---")
            st.write(f"Token count: {count_tokens(all_pdf_text)}")

    if st.session_state.qa_chain:
        st.subheader("Chat with your Documents (RAG)")
        st.write("Ask questions about the uploaded documents. Type 'exit' to stop.")

        if "messages" not in st.session_state:
            st.session_state.messages = []

        for message in st.session_state.messages:
            with st.chat_message(message["role"]):
                st.markdown(message["content"])

        if prompt := st.chat_input("Ask a question:"):
            st.session_state.messages.append({"role": "user", "content": prompt})
            with st.chat_message("user"):
                st.markdown(prompt)

            if prompt.lower() != "exit":
                with st.chat_message("assistant"):
                    response = st.session_state.qa_chain.run(prompt)
                    st.markdown(response)
                st.session_state.messages.append({"role": "assistant", "content": response})

if __name__ == "__main__":
    main()

Overwriting app.py
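
Streamlit reruns the script from top to bottom on every widget interaction, so with `app.py` as written above, each chat question repeats the OCR, embedding, and summary steps while files remain uploaded. One possible mitigation is to cache the OCR result per file, keyed by a hash of the file contents. This is a minimal sketch, not something the notebook does: `cached_extract` and `fake_ocr` are hypothetical names, and in the app the `cache` dict would presumably live in `st.session_state`.

```python
import hashlib
from typing import Callable


def cached_extract(
    file_bytes: bytes, cache: dict[str, str], extract_fn: Callable[[bytes], str]
) -> str:
    """Return OCR text for file_bytes, calling extract_fn only on a cache miss."""
    key = hashlib.sha256(file_bytes).hexdigest()
    if key not in cache:
        cache[key] = extract_fn(file_bytes)
    return cache[key]


calls: list[bytes] = []


def fake_ocr(data: bytes) -> str:
    """Placeholder for the real OCR step; records how often it runs."""
    calls.append(data)
    return data.decode().upper()


cache: dict[str, str] = {}
first = cached_extract(b"speech", cache, fake_ocr)
second = cached_extract(b"speech", cache, fake_ocr)  # served from the cache
```

Streamlit's own `st.cache_data` decorator is another route to the same effect.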


In [4]:
!mkdir -p pages


In [5]:
%%writefile pages/AboutUs.py
import streamlit as st

st.title("About Us")

st.write("""

My initial project scope on Padlet was a simple RAG tool that reads the FAQ section of Archives Online and provides information to its users. Later on, I wanted to try something well beyond my current capability that, if successful, would be a practical tool I could see myself using: converting the PDFs of speeches on the Archives Online website into text using OCR. Here is a link to the speeches, where you can download PDFs to try: https://www.nas.gov.sg/archivesonline/speeches/search-result?search-type=advanced&speaker=Ong+Teng+Cheong

I had some limitations. AISAY, which I had initially wanted to try out, was still being prepared for use, so I tried PyTesseract OCR as recommended by Mr Aldrian. Unfortunately, I soon found that I had a bigger problem: my version of macOS is too outdated. I could not install the Tesseract packages on my local machine, nor could I install and run Visual Studio Code. The browser version of VS Code has no terminal access, so that was not an option either. As a workaround, I am using Google Colab and running Streamlit from there. This means the Streamlit link only works while the Colab notebook is running.

The main objective of this tool is to extract text from the PDF documents on Archives Online using OCR.

These are the features that I wanted:
- To allow users to upload their own PDFs into the tool
- To allow for uploading of multiple files
- To summarize the text extracted from the PDFs

Bonus:
- To display the full text of the document after OCR.
- To add a RAG (Retrieval-Augmented Generation) function after the summary is provided. This function should allow for multiple queries of all uploaded documents until the user types 'exit'.
- To include instructions for users on how to use the RAG functionality.

I MUST admit that Gemini made these functions possible. At my current level, I could not have coded all of this by myself. As you can see from this notebook, Gemini helped me with much of the code.

""")

Overwriting pages/AboutUs.py


In [6]:
!pip install graphviz



In [7]:
%%writefile pages/Methodology.py
import streamlit as st
import graphviz

st.title("Methodology")

st.header("Process Flowchart")

# Create a new directed graph
graph = graphviz.Digraph()

# Add nodes for each step in the process
graph.node("A", "User uploads PDF(s)")
graph.node("B", "Extract text using OCR")
graph.node("C", "Display extracted text")
graph.node("D", "Summarize the text")
graph.node("E", "Display summary")
graph.node("F", "Chat with documents (RAG)")

# Add edges to show the flow
graph.edge("A", "B")
graph.edge("B", "C")
graph.edge("C", "D")
graph.edge("D", "E")
graph.edge("E", "F")

# Display the flowchart
st.graphviz_chart(graph)

Overwriting pages/Methodology.py


In [8]:
from google.colab import userdata
import os

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

!streamlit run /content/app.py &>/content/logs.txt &
!npx localtunnel --port 8501 & curl ipv4.icanhazip.com

34.168.2.60
your url is: https://evil-newt-100.loca.lt
