# Automating PDF Document Summarization Using AI: Streamlining Information Retrieval

## Introduction

Documents are an integral part of everyday activities, whether in business, academic, or personal contexts, and new ones are produced constantly: every year, and often in every meeting. The challenge arises when we need to find information inside them. Opening a document and reading it from start to finish is inefficient when only a small portion contains what we need. This manual process costs time and energy and can slow down decisions and follow-up actions.

To address this issue, the proposed solution is to utilize artificial intelligence (AI) capable of **reading and understanding PDF documents, and performing automatic text summarization processes**. By leveraging AI's capabilities in natural language processing (NLP), we can easily access relevant information from these documents without having to read them in their entirety. This technology will enable users to quickly and efficiently obtain comprehensive summaries of documents, speeding up the decision-making process and enhancing overall productivity.

With AI reading and summarizing PDF documents, we expect access to information to become faster and less labor-intensive. This can help individuals and organizations manage and use their documents more effectively, and it opens up opportunities to improve productivity across various fields.

## Method and Approach

1. **PDF Reader for File Extraction:**
   - In this project, we employed a PDF reader to extract text content from PDF files. This step is crucial as it allows us to access the textual data within the documents and prepare it for further processing.

2. **Text Splitting with LangChain:**
   - We used the LangChain framework to split the extracted text into smaller chunks with `CharacterTextSplitter`. LangChain also supplies the glue for the rest of the pipeline: the embedding wrapper, the vector store integration, and the retrieval question-answering chain.

3. **Embedding Storage with Chroma:**
   - We embedded each chunk with a SentenceTransformer model and stored the resulting vectors in Chroma, a vector store. Chroma lets us retrieve the chunks most similar to a query instead of passing the whole document to the language model.

4. **OpenAI for Summary Generation:**
   - To generate concise and informative summaries, we used OpenAI's language model through LangChain. The model is not fine-tuned in this project; it is used as provided and receives the retrieved chunks together with the user's question in a single prompt.


## Workflow Overview

Our workflow begins with the PDF reader extracting text content from the input PDF file. Next, LangChain's text splitter divides the extracted text into chunks. A SentenceTransformer model then embeds each chunk, and Chroma stores the embeddings so that relevant chunks can be retrieved by similarity. Finally, OpenAI's language model generates a concise summary from the retrieved chunks, giving users the key insights from the document in a succinct format.


### Read PDF Files

We begin by using the `PdfReader` class from `PyPDF2` to access the content of PDF files. PDF files are prevalent in various domains, and extracting text from them programmatically is essential for automated summarization tasks. With `PdfReader`, we can iterate through the pages of the PDF document and extract the text content of each one.

In [1]:
from PyPDF2 import PdfReader

pdf_file_path = "assets/Laporan-Keuangan-Tahunan-BI-2022.pdf"
loader = PdfReader(pdf_file_path)

In [2]:
raw_text = ""

for page in loader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

**Explanation:**

- `raw_text = ""`: We initialize an empty string variable `raw_text` to store the extracted text from the PDF pages.

- `for page in loader.pages:`: This loop iterates through each page of the PDF document loaded by the `loader` object.

- `content = page.extract_text()`: Within each iteration, we use the `extract_text()` method to retrieve the text content of the current page. This method returns the text content as a string.

- `if content:`: We check if the extracted content is not empty. This condition is necessary because some pages may contain only images or non-text elements.

- `raw_text += content`: If the extracted content is not empty, we append it to the `raw_text` variable. This concatenates the text content of all pages into a single string.

The reason behind this approach is to consolidate all text content from each page of the PDF document into one cohesive string (`raw_text`). This enables further processing, such as text preprocessing and summarization, to be performed on the entire document's content collectively.

Aggregating the text from individual pages keeps all extractable text available to the later steps and simplifies subsequent operations, such as splitting the text into smaller chunks or generating embeddings for semantic search. Note that the pages are concatenated without a separator, so the last word of one page can run directly into the first word of the next.
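The aggregation step can be sketched as a small reusable function. This is a minimal illustration, not the notebook's exact cell: it joins pages with a newline, which is an assumption added here, and `FakePage` is a hypothetical stand-in so the example runs without a PDF file. Any object with an `extract_text()` method, such as a PyPDF2 page, should work in its place.

```python
from typing import Iterable, Protocol


class PageLike(Protocol):
    """Anything exposing extract_text(), such as a PyPDF2 page object."""

    def extract_text(self) -> str: ...


def join_page_text(pages: Iterable[PageLike], separator: str = "\n") -> str:
    """Concatenate the text of every page that yields any, skipping empty pages."""
    parts = []
    for page in pages:
        content = page.extract_text()
        if content:  # pages with no extractable text are skipped
            parts.append(content)
    # Assumption: a newline separator keeps words on adjacent pages apart.
    return separator.join(parts)


class FakePage:
    """Hypothetical stand-in for a PDF page, used only for this illustration."""

    def __init__(self, text: str) -> None:
        self._text = text

    def extract_text(self) -> str:
        return self._text


pages = [FakePage("Balance sheet"), FakePage(""), FakePage("Notes")]
print(repr(join_page_text(pages)))
```

Building a list and joining once also avoids repeated string concatenation inside the loop, which can matter for documents with many pages.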

In [3]:
raw_text



### Text Preprocessing

After extracting raw text from the PDF, we split it into smaller chunks before summarization. Here, we use LangChain's `CharacterTextSplitter`. Chunking matters for two reasons: each chunk is embedded and retrieved as a unit, and the chunks passed to the language model must fit within its context window.

In [4]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(separator = "\n", 
                                      chunk_size = 1000, 
                                      chunk_overlap = 10, 
                                      length_function = len)
text = text_splitter.split_text(raw_text)

Created a chunk of size 1049, which is longer than the specified 1000
Created a chunk of size 1752, which is longer than the specified 1000
Created a chunk of size 2647, which is longer than the specified 1000
Created a chunk of size 3523, which is longer than the specified 1000
Created a chunk of size 1362, which is longer than the specified 1000
Created a chunk of size 1989, which is longer than the specified 1000
Created a chunk of size 2687, which is longer than the specified 1000
Created a chunk of size 2243, which is longer than the specified 1000
Created a chunk of size 1344, which is longer than the specified 1000
Created a chunk of size 4742, which is longer than the specified 1000
Created a chunk of size 19900, which is longer than the specified 1000
Created a chunk of size 3590, which is longer than the specified 1000
Created a chunk of size 7644, which is longer than the specified 1000
Created a chunk of size 7706, which is longer than the specified 1000
Created a chunk of 

**Explanation:**

- `from langchain.text_splitter import CharacterTextSplitter`: We import the `CharacterTextSplitter` class from the LangChain library. This class provides functionality for splitting text into smaller chunks based on specified parameters.

- `text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=10, length_function=len)`: We instantiate an object of the `CharacterTextSplitter` class with the following parameters:
  - `separator="\n"`: Specifies the separator used to split the text. The text is first split at every newline character (`\n`), and the resulting pieces are then merged back together into chunks of up to `chunk_size` characters.
  - `chunk_size=1000`: Defines the target maximum size (in characters) of each chunk. Because `CharacterTextSplitter` only cuts at the separator, a stretch of text that contains no newline for more than 1000 characters is kept whole, which is why the output above warns about chunks longer than the specified 1000.
  - `chunk_overlap=10`: Specifies the maximum number of characters shared between adjacent chunks. A small overlap helps preserve context that would otherwise be cut at a chunk boundary.
  - `length_function=len`: Specifies the function used to calculate the length of the text. In this case, we use the built-in `len()` function to determine the length of the text.

- `text = text_splitter.split_text(raw_text)`: We call the `split_text()` method of the `text_splitter` object, passing the `raw_text` variable as the input. This method splits the raw text into smaller chunks based on the specified parameters and returns a list of chunks.

The purpose of this code snippet is to preprocess the raw text extracted from the PDF documents before proceeding with summarization. By splitting the text into smaller, manageable chunks, we facilitate more efficient processing and analysis. 
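To see why some chunks exceed the requested size, the split-then-merge idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: `split_on_separator` is a hypothetical helper, and it omits the overlap logic entirely.

```python
def split_on_separator(text: str, separator: str = "\n", chunk_size: int = 1000) -> list[str]:
    """Split on `separator`, then greedily merge pieces up to `chunk_size` characters."""
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # ignore empty pieces produced by consecutive separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than chunk_size has no separator to cut at,
            # so it is kept whole and the resulting chunk is oversized.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "short line\n" + "x" * 30 + "\nanother line\nlast"
for chunk in split_on_separator(sample, chunk_size=20):
    print(len(chunk), repr(chunk))
```

In the sample, the 30-character run of `x` has no newline inside it, so it comes out as one chunk larger than the limit of 20. If oversized chunks are a problem, a splitter that falls back to finer separators is one option to explore.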


### Embedding Text and Storing It in Chroma

To support retrieval, we use Chroma, an open-source vector store that LangChain integrates with. The embeddings themselves are generated by `SentenceTransformerEmbeddings`, which transforms each text chunk into a dense vector (384 dimensions for the model used here); Chroma stores these vectors and finds the ones closest to a query. Because the embeddings capture semantic similarity between pieces of text, the retriever can select the chunks most relevant to a question.

In [5]:
from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

embedding_function = SentenceTransformerEmbeddings(model_name = "all-MiniLM-L6-v2")

- `from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings`: We import the `SentenceTransformerEmbeddings` class from the LangChain library. This class provides functionality for generating sentence embeddings using pre-trained Transformer models.
- `from langchain.vectorstores import Chroma`: We import the `Chroma` class from the LangChain library. `Chroma` is a vector store component that allows for efficient storage and retrieval of text embeddings.
- `embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")`: We instantiate an object of the `SentenceTransformerEmbeddings` class with the specified pre-trained model name, "all-MiniLM-L6-v2". This model is based on the MiniLM architecture and maps sentences and short paragraphs to 384-dimensional vectors suited to semantic search and similarity tasks, which is what retrieval in this pipeline relies on.

In [6]:
vectordb = Chroma.from_texts(text, embedding_function)

- `vectordb = Chroma.from_texts(text, embedding_function)`: We build the vector store directly from the list of chunks, passing the following arguments:
  - `text`: The list of chunks produced by the text splitter. Each chunk is embedded and stored as one entry.
  - `embedding_function`: The embedding function Chroma uses to generate embeddings. We pass the `embedding_function` object instantiated earlier, which uses the SentenceTransformer model. No persistence directory is configured in this call, so the store has to be rebuilt when the notebook is restarted.


The purpose of this code snippet is to set up the infrastructure for generating embeddings from text data and storing them efficiently for later retrieval during the summarization process.

- **Embedding Generation:** Using the SentenceTransformer model, we generate one embedding for each chunk of text. These embeddings capture the semantic meaning of the chunks, so that chunks can be compared with a query by similarity.

- **Vector Store Initialization:** `Chroma.from_texts` creates the vector store and fills it with the chunks and their embeddings in one call. Chroma manages storage and similarity search over these embeddings, which the retriever relies on during the summarization step.

- **Model Selection:** We chose the "all-MiniLM-L6-v2" model because it is a compact, general-purpose sentence-embedding model that has been pre-trained on a diverse range of text data, making it a reasonable default for retrieving relevant chunks.
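The retrieval idea behind a vector store can be sketched without any external libraries. The example below is only an illustration: `toy_embed` is a hypothetical bag-of-words stand-in for a real sentence-embedding model, and the ranking uses cosine similarity, whereas Chroma's actual index and distance metric may differ.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vec = toy_embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, toy_embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


chunks = [
    "interest expenses on deposits",
    "revenue from payment system services",
    "notes on accounting policies",
]
print(top_k("payment system revenue", chunks, k=1))
```

A real embedding model replaces word counts with learned dense vectors, so it can also match chunks that share meaning but not vocabulary with the query.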


### Summarization

For generating summaries, we use OpenAI's language model. Rather than passing the whole document to the model, we frame summarization as retrieval-based question answering with LangChain's `RetrievalQA` chain: a question is embedded, the most similar chunks are retrieved from Chroma, and the model answers from those chunks. As a consequence, the summary reflects the retrieved chunks, not necessarily the entire document.

In [7]:
from dotenv import load_dotenv
from langchain import OpenAI
from langchain.chains import RetrievalQA

# please provide your OpenAI API key
load_dotenv()

retriever = vectordb.as_retriever()
llm = OpenAI(temperature = 0.9)

qa_chain = RetrievalQA.from_chain_type(llm = llm,
                                       chain_type = "stuff",
                                       retriever = retriever,
                                       return_source_documents = True,
                                       verbose = False)



**Explanation:**

- `from dotenv import load_dotenv`: We import the `load_dotenv` function from the dotenv library. This function loads environment variables from a .env file into the script's environment. Environment variables are useful for storing sensitive information or configuration settings.

- `from langchain import OpenAI`: We import the `OpenAI` class from the LangChain library. This class provides an interface for interacting with OpenAI's language model API.

- `from langchain.chains import RetrievalQA`: We import the `RetrievalQA` class from the LangChain library. This class represents a retrieval-based question-answering chain, which combines a retriever and a language model to generate answers to questions based on retrieved documents.

- `load_dotenv()`: We call the `load_dotenv()` function to load environment variables from a .env file. Here it makes the OpenAI API key available; LangChain's `OpenAI` class reads the key from the `OPENAI_API_KEY` environment variable.

- `retriever = vectordb.as_retriever()`: We initialize the retriever component of the question-answering chain by converting the Chroma vector store (`vectordb`) into a retriever object. This retriever is responsible for retrieving relevant documents based on user queries.

- `llm = OpenAI(temperature=0.9)`: We instantiate an object of the `OpenAI` class with a temperature parameter of 0.9. The temperature parameter controls the randomness of the language model's output: values near 0 make the output more deterministic, while higher values such as 0.9 produce more varied wording, so repeated runs may yield different summaries.

- `qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True, verbose=False)`: We construct a retrieval-based question-answering chain using the `from_chain_type()` method of the `RetrievalQA` class. `llm` and `retriever` supply the language model and the retriever. `chain_type="stuff"` means that all retrieved chunks are inserted ("stuffed") into a single prompt together with the question. `return_source_documents=True` adds the retrieved chunks to the result so they can be inspected, and `verbose=False` suppresses intermediate logging.


The code snippet sets up a question-answering system that retrieves relevant chunks of the PDF document and uses them to answer user queries, including requests for a summary.
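The "stuff" strategy can be pictured as plain string assembly. The sketch below is an assumption-laden illustration: `build_stuff_prompt` is a hypothetical helper and the template wording is invented here, not the prompt LangChain actually uses.

```python
def build_stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Concatenate every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(retrieved_chunks)
    # Hypothetical template; the real chain uses its own wording.
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "Can you give me the summary",
    ["Chunk about revenues.", "Chunk about expenses."],
)
print(prompt)
```

Because everything goes into one prompt, this strategy is simple and needs only one model call, but the retrieved chunks together must fit within the model's context window.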

### Display Summarization

Finally, we demonstrate the summarization process by passing a question to the retrieval QA chain. The chain embeds the question, retrieves the most similar chunks from Chroma, and asks the language model to generate a concise summary from them. This summary provides users with key insights from the PDF document, facilitating faster information retrieval and decision-making.

In [8]:
chain_result = qa_chain("Can you give me the summary")
answer = chain_result["result"]
print(answer)



 The summary is that this document is the Bank Indonesia Annual Financial Statements for the year 2022. It includes the Bank's revenues from various activities such as monetary policy implementation and payment system services, as well as its expenses including interest and remuneration on deposits. The statement also shows a surplus or deficit for the year and the Bank's accumulated surplus or deficit as of December 31, 2022 and December 31, 2021. 


**Explanation:**

- `chain_result = qa_chain("Can you give me the summary")`: We execute the question-answering chain (`qa_chain`) by passing a question as input. In this case, the question is "Can you give me the summary". The question is formulated based on the user's request for a summary of the document content.

- `answer = chain_result["result"]`: We retrieve the result of the question-answering process from the `chain_result` dictionary. The result typically contains the generated answer to the input question. Here, we access the answer using the key "result" and assign it to the variable `answer`.
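Since the chain was built with `return_source_documents=True`, the result dictionary also carries the retrieved chunks under the key `source_documents`, each exposing its text as `page_content`. A minimal sketch of inspecting them follows; `describe_result` is a hypothetical helper, and `fake_result` is a stand-in built with `SimpleNamespace` so the example runs without calling the API.

```python
from types import SimpleNamespace


def describe_result(chain_result: dict, preview_chars: int = 60) -> str:
    """Format the answer followed by a short preview of each source chunk."""
    lines = [chain_result["result"].strip()]
    sources = chain_result.get("source_documents", [])
    for i, doc in enumerate(sources, start=1):
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"[source {i}] {preview}")
    return "\n".join(lines)


# Hypothetical stand-in for the dictionary returned by the chain.
fake_result = {
    "result": " Annual financial statements for 2022. ",
    "source_documents": [
        SimpleNamespace(page_content="Revenues from monetary policy\nimplementation"),
        SimpleNamespace(page_content="Expenses on deposits"),
    ],
}
print(describe_result(fake_result))
```

Looking at the source chunks is a quick way to check which parts of the document the summary was actually based on.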

By integrating these methods and components, our workflow demonstrates an effective approach to automate PDF document summarization. Each step in the workflow contributes to improving the accuracy and efficiency of the summarization process, ultimately enhancing user experience and productivity.

## Summary

This project addresses the challenge of efficiently summarizing information from PDF documents by leveraging artificial intelligence (AI) techniques. PDF documents are ubiquitous in various domains, but finding relevant information in them can be time-consuming and labor-intensive, particularly when dealing with large volumes of documents. To streamline this process, we built a pipeline that reads a PDF with PyPDF2, splits the text into chunks with LangChain, embeds the chunks with a SentenceTransformer model, stores them in Chroma, and uses OpenAI's language model to answer a summary request from the retrieved chunks.

Overall, the project provides a starting point for automating the summarization of PDF documents, enabling users to quickly access key insights without reading and analyzing each document manually.

## Future Directions

Moving forward, several avenues for future development and enhancement of the project can be explored:

1. **Enhanced Summarization Techniques:** Investigate advanced natural language processing (NLP) techniques and models to improve the quality and coherence of generated summaries. This could involve fine-tuning language models on specific summarization tasks or exploring novel approaches for summarization.

2. **Multi-Modal Summarization:** Extend the project to support summarization of multi-modal content, including images, graphs, and tables, in addition to text. This would enable more comprehensive summarization of diverse types of documents.

3. **User Interface Improvements:** Develop a user-friendly interface for interacting with the summarization system, allowing users to input queries, view summaries, and navigate through document content seamlessly.

4. **Scalability and Performance:** Optimize the system's performance and scalability to handle large volumes of PDF documents efficiently. This could involve parallel processing, distributed computing, or integration with cloud-based services.

5. **Domain-Specific Summarization:** Tailor the summarization system to specific domains or industries by training it on domain-specific data and fine-tuning the language model accordingly. This would improve the relevance and accuracy of generated summaries for users in those domains.

By pursuing these avenues for future development, the project can evolve into a robust and versatile tool for automating document summarization tasks, catering to a wide range of users and use cases.

## Reference

Here are some references that were instrumental in the development of this project:

1. [PyPDF2 Documentation](https://pypdf2.readthedocs.io/en/3.0.0/)
   - PyPDF2 library was used for parsing PDF files and extracting text content.

2. [LangChain Documentation](https://langchain.readthedocs.io/)
   - LangChain library provided various components for text splitting, embedding generation, and question-answering.

3. [SentenceTransformer Documentation](https://www.sbert.net/)
   - SentenceTransformer library was utilized for generating sentence embeddings using pre-trained transformer models.

4. [OpenAI API Documentation](https://beta.openai.com/docs/)
   - OpenAI API was integrated into the project for language modeling and question-answering capabilities.

5. [Python-dotenv Documentation](https://pypi.org/project/python-dotenv/)
   - Python-dotenv library was used for loading environment variables from a .env file.

6. GitHub Repositories:
   - PyPDF2: [https://github.com/mstamy2/PyPDF2](https://github.com/mstamy2/PyPDF2)
   - Sentence-Transformers: [https://github.com/UKPLab/sentence-transformers](https://github.com/UKPLab/sentence-transformers)

These resources provided valuable documentation, tutorials, and code examples that guided the implementation of the project's various components. Additionally, community forums, research papers, and online discussions were consulted for troubleshooting and gaining insights into best practices for document summarization and natural language processing tasks.