**Coursebook: Developing a PDF Summarizer and Q&A System**

- Course Length: 6 hours
- Last Updated: May 2024

___

Developed by Algoritma's Product Team

# Developing a PDF Summarizer and Q&A System

## Background

In today's digital age, the volume of information stored in PDF documents has exponentially grown across various industries, including education, research, legal, and corporate sectors. PDF files serve as an essential medium for sharing, archiving, and disseminating knowledge. However, the sheer volume and complexity of these documents often make it challenging for users to extract relevant information efficiently.

**Summarizer Challenge:** 
Users often face long PDFs with both important and irrelevant details. Reading and summarizing these manually takes time and can lead to mistakes. PDFs can also have various content types like text, tables, images, and graphs, making it tricky to summarize them neatly. Plus, everyone's needs for summaries are different, so a one-size-fits-all approach might not work well.

**Q&A System Challenge:** 
Understanding what users are asking about in relation to PDF content can be tough due to language differences and context. To give users the right answers, the system needs to pull the relevant info from the PDFs quickly and accurately. The answers also need to be spot-on to build trust and satisfaction with the system.

**LLM Integration:** 
Using a large language model (LLM) like GPT-3 can help tackle these challenges. LLMs are great at understanding language, summarizing text, and answering questions. But to make this work with PDFs, we need to:

1. **Document Pre-processing:** Prepare the PDF content for LLMs by converting it into a usable format, handling text from images or tables, and keeping the document's structure intact.
  
2. **Fine-tuning and Customization:** Adjust the LLM to better suit specific topics or user needs. This makes the summarization and Q&A features more accurate.
  
3. **Scalability and Efficiency:** Make sure the system can manage lots of PDFs smoothly and respond to user questions quickly.

To solve these issues, we'll need expertise in natural language processing, machine learning, document handling, and user interface design. Creating a PDF summarizer and Q&A system using an LLM could greatly improve how we access and manage information across many fields.


## Objective

The objective of this coursebook is to provide learners with a comprehensive understanding and practical skills in working with PDF files and Large Language Models (LLMs) for developing a Q&A system and summarizer.

**Coursebook Outline:**

1. **Introduction to PDF File:**
   - Learn what PDF files are and why they're important.
   - Find out how to open and use PDF files in code.

2. **Introduction to LLM:**
   - Get to know Large Language Models like GPT-3, GPT-2, and BERT.
   - Understand what LLMs can and can't do in language tasks.
   - Learn about LangChain for using LLMs effectively.

3. **Text Preprocessing:**
   - Learn basic steps to get text ready for analysis.
   - Understand how to clean and organize text.

4. **Extracting Text Using Vector Database:**
   - Use Chroma to store embedded text chunks and retrieve the ones relevant to a query.
   - Set up API keys and handle environment settings with .env files.
   - Practice embedding and storing text with Chroma.

5. **Q&A System and Summarizer Development:**
   - Build a Q&A system that uses LLMs to answer questions from PDFs.
   - Create a summarization tool to make short summaries of PDFs with LLMs.

By the end of this coursebook, you'll know how to use these tools to get information from PDF files and make Q&A and summarization systems using Large Language Models.

# 1. Introduction to PDF File

PDF stands for Portable Document Format. It's a file format developed by Adobe that captures all the elements of a printed document as an electronic image that can be viewed, printed, or transmitted easily. PDF files are widely used because they preserve the formatting, fonts, and layout of the original document, making them ideal for sharing documents across different platforms and devices without losing their appearance.

PDF files have become essential in today's digital world for several reasons:

- **Universal Compatibility:** PDFs can be opened and viewed on virtually any device and operating system using free software like Adobe Acrobat Reader, making them universally accessible.
  
- **Document Preservation:** Unlike other file formats, PDFs preserve the original layout, fonts, and graphics of a document, ensuring that it looks the same regardless of where or how it's viewed.
  
- **Security Features:** PDFs can be encrypted and password-protected, allowing users to control who can access, edit, or print the document.
  
- **Multi-page Support:** PDFs can contain multiple pages, making them suitable for creating reports, presentations, and ebooks.


**Opening and Using PDF Files in Python:**

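To work with PDF files in Python, we can use libraries that provide functionality for reading and manipulating PDF documents. One popular library for this purpose is `PyPDF2` (its development has since continued under the name `pypdf`, which offers a very similar `PdfReader` interface). The following cells show how to open a PDF file and extract its text with `PyPDF2`: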



In [1]:
from PyPDF2 import PdfReader

pdf_file_path = "data_input/bei_annual_report_2022.pdf"
loader = PdfReader(pdf_file_path)

In [2]:
raw_text = ""

for page in loader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

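In the code above, we first instantiate a `PdfReader` object for the PDF document; its `pages` attribute gives access to every page. We then iterate through the pages, extract the text of each one, and append it to `raw_text`, skipping pages for which no text could be extracted (for example, pages that contain only images).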
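The loop above concatenates pages without a separator, so the last word of one page can fuse with the first word of the next. The following is a minimal sketch of a slightly more defensive variant. `collect_page_text` and `FakePage` are hypothetical names introduced only for illustration; the function assumes nothing more than that each page exposes an `extract_text()` method, as the `PdfReader` pages used above do.

```python
def collect_page_text(pages, separator="\n"):
    """Join the text of every page, recording pages that yield no text."""
    parts = []
    empty_pages = []
    for number, page in enumerate(pages, start=1):
        content = page.extract_text()
        if content and content.strip():
            parts.append(content)
        else:
            # Scanned or image-only pages often yield no extractable text.
            empty_pages.append(number)
    return separator.join(parts), empty_pages


class FakePage:
    """Hypothetical stand-in for a PDF page, used only to demo the function."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo_text, demo_empty = collect_page_text(
    [FakePage("Page one."), FakePage(""), FakePage("Page three.")]
)
print(repr(demo_text))
print(demo_empty)
```

With a real document, calling `collect_page_text(loader.pages)` would return the joined text together with the 1-based numbers of the pages from which nothing could be extracted.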

# 2. Introduction to Large Language Models

**Large Language Models (LLMs)** like GPT-3 offer powerful capabilities in natural language processing, making them well-suited for tasks involving **text analysis, summarization, and question answering**. Their ability to understand and generate human-like text can greatly enhance the efficiency and accuracy of systems that work with textual data. By integrating LLMs into our task of **PDF summarization and Q&A system development**, we can leverage their advanced language understanding capabilities to create more intelligent and effective solutions.


**What is LLM?**

A Large Language Model (LLM) is a type of artificial intelligence model trained on vast amounts of text data to understand and generate human-like text. LLMs, such as GPT-3 (Generative Pre-trained Transformer 3), are designed to perform various natural language processing tasks, including text generation, translation, summarization, and question answering, among others. These models learn from the patterns and structures in the data they are trained on, allowing them to generate coherent and contextually relevant text.

**History of LLM**

The development of Large Language Models has been a significant milestone in the field of artificial intelligence and natural language processing. The history of LLMs can be traced back to earlier language models like recurrent neural networks (RNNs) and long short-term memory networks (LSTMs). However, the breakthroughs in transformer architectures, particularly with models like GPT (Generative Pre-trained Transformer), have led to the development of more powerful and scalable LLMs.

The evolution of LLMs has been marked by advancements in training techniques, model architectures, and data sources. With each iteration, these models have become larger, more capable, and better at understanding and generating human-like text, driving innovations across various applications and industries.

**Understand What LLMs Can and Can't Do in Language Tasks**

LLMs excel at many natural language processing tasks, thanks to their ability to understand context, generate coherent text, and perform complex language tasks. Here are some tasks LLMs are good at:

- **Text Generation:** Generating human-like text based on the input and context.
  
- **Translation:** Translating text between different languages with reasonable accuracy.
  
- **Summarization:** Creating concise summaries of longer texts.
  
- **Question Answering:** Providing relevant answers to questions based on the input text.

However, LLMs also have limitations:

- **Context Understanding:** While they are good at understanding context within a single passage, they may struggle with broader or multi-document contexts.
  
- **Fact-checking:** They may generate plausible but incorrect information if not guided by accurate data.
  
- **Ethical Considerations:** LLMs can sometimes generate biased or inappropriate content if not carefully controlled and monitored.



## LangChain 🦜🔗
[LangChain](https://python.langchain.com/docs/get_started/introduction.html) is a framework for developing applications powered by language models. Rather than calling a model in isolation, LangChain connects it with other components, such as prompt templates, text splitters, vector stores, and output parsers, and provides integrations with model providers like OpenAI.

The idea behind the framework is to compose these building blocks into a pipeline, or chain, in which the output of one component becomes the input of the next. This lets developers assemble applications for tasks like question answering, text generation, translation, summarization, and sentiment analysis without writing all of the glue code from scratch.

In the context of our task, the chain consists of a text splitter that breaks the PDF content into chunks, an embedding model and a vector database that index those chunks, a retriever that selects the chunks relevant to a question, and an LLM that writes the answer or summary from them.

# 3. Text Preprocessing

Previously, we read the content of the PDF file and assigned it to the `raw_text` variable. Let's inspect the number of characters in this variable!

In [3]:
len(raw_text)

1440092

Our loaded PDF content contains more than 1.4 million characters, far more than we can pass to an LLM in a single prompt. We therefore need additional processing before providing the information in the PDF document to the LLM: we will convert this raw text into a more manageable form by chunking it into smaller units. To achieve this, we will use the `CharacterTextSplitter()` object from the `langchain.text_splitter` module.

In [4]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(separator = "\n", 
                                      chunk_size = 1000, 
                                      chunk_overlap = 10, 
                                      length_function = len)
text = text_splitter.split_text(raw_text)

In [5]:
len(text)

1500

> Visit [this documentation](https://python.langchain.com/docs/modules/data_connection/document_transformers/character_text_splitter/) for more information about `CharacterTextSplitter()`.

`CharacterTextSplitter()` is the most straightforward method LangChain provides for processing text data. It splits the raw text on a single separator string, in our case the newline character (`\n`), and then merges the resulting pieces into chunks. The chunk size is measured with `length_function`, which here is `len`, so it counts characters. We set `chunk_size` to 1000 characters and `chunk_overlap` to 10, meaning that up to 10 characters of trailing text can be repeated at the start of the next chunk so that context is not cut off abruptly at a boundary.

By the end of this process, we have 1500 chunks, each containing at most roughly 1000 characters.
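To make the roles of `separator`, `chunk_size`, and `chunk_overlap` more concrete, here is a simplified, pure-Python sketch of separator-based chunking. It is not LangChain's implementation, and `split_by_separator` is a hypothetical helper; it only illustrates the general idea of packing separator-delimited pieces into size-limited chunks and carrying a small tail forward as overlap. A single piece longer than `chunk_size` would still end up in an oversized chunk.

```python
def split_by_separator(text, separator="\n", chunk_size=1000, chunk_overlap=10):
    """Greedy, simplified splitter: pack separator-delimited pieces into chunks."""
    pieces = [p for p in text.split(separator) if p]
    chunks = []
    current = []      # pieces in the chunk being built
    current_len = 0   # length of separator.join(current)

    for piece in pieces:
        extra = len(piece) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            chunks.append(separator.join(current))
            # Carry trailing pieces forward while they fit in the overlap budget.
            carried, carried_len = [], 0
            for prev in reversed(current):
                add = len(prev) + (len(separator) if carried else 0)
                if carried_len + add > chunk_overlap:
                    break
                carried.insert(0, prev)
                carried_len += add
            current, current_len = carried, carried_len
            extra = len(piece) + (len(separator) if current else 0)
        current.append(piece)
        current_len += extra

    if current:
        chunks.append(separator.join(current))
    return chunks


# Fifty short lines, split into chunks of at most 50 characters.
sample = "\n".join(f"line {i:03d}" for i in range(50))
chunks = split_by_separator(sample, chunk_size=50, chunk_overlap=10)
print(len(chunks), [len(c) for c in chunks[:3]])
```

In this toy run, the last line of one chunk reappears as the first line of the next, which is the effect the overlap setting is meant to achieve.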

# 4. Extracting Text Using Vector Database

When working with PDF files, we often deal with a large amount of unstructured data. To efficiently handle and retrieve information from such data, we need a structured and optimized approach. This is where **vector databases** come into play.

**Vector databases** are specialized databases designed to store and search vector data efficiently. In our case, the text has already been extracted from the PDF with `PyPDF2` and split into chunks; the vector database stores a numerical vector for each chunk so that the chunks most relevant to a question can be found quickly.

**Why Use Chroma?**

Chroma is an open-source vector database. It does not extract text from PDF files itself; it stores the embeddings of text that has already been extracted and retrieves them by similarity. Here's why we use Chroma:

1. **Efficiency:** Chroma searches by vector similarity instead of scanning the raw text, so it can find relevant chunks among the 1500 we created without sending the whole document to the LLM.

2. **Relevance:** Because embeddings capture meaning rather than exact wording, a query can match chunks that express the same idea in different words.

3. **Ease of use:** Chroma can run locally inside the Python process, and LangChain provides a wrapper for it, so storing and querying chunks takes only a few lines of code.


The specific vector database that we will use is the **ChromaDB** vector database.

> Visit [this website](https://docs.trychroma.com/getting-started#:~:text=Chroma%20is%20a%20database%20for,hosted%20version%20is%20coming%20soon!) for more information about Chroma.


As a vector database, Chroma stores embeddings that represent various types of data, including text and images. In simple terms, embeddings convert our data into a format that is processable by computers (numbers). In the following code, we will perform actions to convert the previous chunks into numerical representations (embeddings) and store them in Chroma.
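As a rough illustration of what "retrieving by similarity" means, the sketch below ranks a few stored texts against a query vector using cosine similarity. Everything in it is an assumption made for the example: the three-dimensional vectors are hand-made rather than produced by an embedding model, `cosine_similarity` and `top_k` are hypothetical helpers, and cosine similarity is only one common choice of metric (the metric a real vector store uses depends on its configuration).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hand-made 3-dimensional "embeddings"; real models produce far longer vectors.
toy_store = [
    ("The exchange lists new companies.", [0.9, 0.1, 0.0]),
    ("Employees attended a workshop.", [0.1, 0.9, 0.1]),
    ("Trading volume increased.", [0.8, 0.2, 0.1]),
]

print(top_k([1.0, 0.0, 0.0], toy_store, k=2))
```

The two exchange-related sentences rank highest because their vectors point in nearly the same direction as the query vector, while the workshop sentence points elsewhere.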

In preparation for interfacing Chroma functionalities using LangChain, we will undertake the following prerequisite step: environment setup.

## Environment Set-up

Using LangChain will usually require integrations with one or more model providers, data stores, APIs, etc. For this example, we'll use OpenAI's model APIs.

### Setting API key and `.env`

Accessing the API requires an API key, which you can get by creating an account with OpenAI and generating a key from your account settings. When setting up an API key and using a .env file in your Python project, you follow these general steps:

1. **Obtain an API key**: If you're working with an external API or service that requires an API key, you need to obtain one from the provider. This usually involves signing up for an account and generating an API key specific to your project.

2. **Create a .env file**: In your project directory, create a new file and name it ".env". This file will store your API key and other sensitive information securely.

3. **Store API key in .env**: Open the .env file in a text editor and add a line to store your API key. The general format is `API_KEY=your_api_key`, where "API_KEY" is the name of the variable and "your_api_key" is the actual value of your API key. For this coursebook, name the variable `OPENAI_API_KEY`, because that is the name the OpenAI integration looks for by default. Avoid stray spaces around the value.

4. **Load environment variables**: In your Python code, you need to load the environment variables from the .env file before accessing them. Import the dotenv module and add the following code at the beginning of your script:

```python
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
```

> `dotenv` library is a popular Python library that simplifies the process of loading environment variables from a .env file into your Python application. It allows you to store configuration variables separately from your code, making it easier to manage sensitive information such as API keys, database credentials, or other environment-specific settings.


In [6]:
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

True

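The `True` output indicates that `load_dotenv()` found the .env file and loaded at least one variable from it; it does not by itself confirm that the key is valid. Because the OpenAI integration reads `OPENAI_API_KEY` from the environment by default, we don't need to pass the API key explicitly when calling OpenAI-related functionality.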
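Since a `True` return value says little about which variables were loaded, it can help to check for the key explicitly. The sketch below uses only the standard library; `require_env` and `mask` are hypothetical helpers, and `OPENAI_API_KEY` is assumed to be the variable name stored in the .env file.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check your .env file")
    return value


def mask(secret, visible=4):
    """Hide all but the last `visible` characters of a secret."""
    return "*" * max(len(secret) - visible, 0) + secret[-visible:]
```

After `load_dotenv()` has run, `print(mask(require_env("OPENAI_API_KEY")))` would confirm that the key is present without printing it in full.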

## Embedding and Storing

Before storing the chunks of information in Chroma, we need to define which embedding model we will utilize to convert the text to the vector or its numerical representation. Since we will use the chat model developed by OpenAI, we will use the embedding model from the same source.

In [7]:
from langchain_openai import OpenAIEmbeddings
# In recent LangChain releases the Chroma wrapper lives in langchain_community
from langchain_community.vectorstores import Chroma

embedding_function = OpenAIEmbeddings(model = "text-embedding-3-large")

Chroma, via LangChain, provides a convenient way to store text embeddings. We simply pass the texts that we intend to convert to vectors and the embedding function. This process may take some time, depending on the length of the document, because every chunk is sent to the embedding model.

In [8]:
vectordb = Chroma.from_texts(text, embedding_function)

Our PDF document has been converted into embeddings in `vectordb`. Now, we will provide this `vectordb` as additional knowledge for the LLM.

# 5. Q&A System and Summarizer Development

## Q&A

When developing a Q&A system with LangChain, we require several basic building blocks to construct a unified chain. This chain will receive an input and pass it through every component of the chain to ultimately achieve the desired outcome.

To further understand each of these components, let's import the necessary tools to build a Q&A chain!

In [9]:
# preparing prompt
from langchain_core.prompts import PromptTemplate
# LLM
from langchain_openai import ChatOpenAI
# passing the query
from langchain_core.runnables import RunnablePassthrough
# parsing the output
from langchain_core.output_parsers import StrOutputParser

In [10]:
template = """Use the following pieces of context to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    Use three sentences minimum and give the answer in the complete way.
    Always say "thanks for asking!" at the end of the answer.

    {context}

    Question: {question}

    Helpful Answer:"""

custom_rag_prompt = PromptTemplate.from_template(template)

The first component of the chain is the prompt. We can think of the prompt as an instruction on what output should be generated by the LLM. In the case of creating a Q&A system, we want the LLM to answer based only on the provided context, so the prompt tells it to admit when it does not know the answer instead of falling back on its general knowledge. Keep in mind that this is an instruction rather than a guarantee, so answers should still be checked against the source.
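To see what the two placeholders do, the sketch below fills a shortened copy of the template with plain Python string formatting. This is only an analogy under the assumption that the default template format is used: `PromptTemplate` performs a comparable substitution when the chain runs. The `demo_template` text and the sample context are illustrative values, not part of the chain.

```python
# A shortened, illustrative template with the same two placeholders.
demo_template = """Use the following pieces of context to answer the question at the end.

{context}

Question: {question}

Helpful Answer:"""

filled = demo_template.format(
    context="IDX was formed through the merger of JSX and SSX.",
    question="How was IDX formed?",
)
print(filled)
```

The model only ever sees the filled-in string, which is why the quality of the retrieved context matters as much as the wording of the instructions.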

In [11]:
llm = ChatOpenAI(model_name = 'gpt-3.5-turbo-0125', temperature = 0)

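The second component of the chain is the LLM. The LLM receives the filled-in prompt, which contains the retrieved context and the question, and generates a response that follows the prompt's instructions. Setting `temperature = 0` makes the output as deterministic as possible, which suits factual Q&A.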

In [12]:
retriever = vectordb.as_retriever()

The third component of our Q&A chain is the retriever. A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store: a retriever does not need to be able to store documents; it only needs to return (or retrieve) them. This component plays an important role because it provides the base knowledge for the LLM. The query passed to our chain is embedded and compared with the chunk embeddings stored in the database, and the most similar chunks are returned as context for the answer.

In [13]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [14]:

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm
    | StrOutputParser()
) 


Finally, the components of the Q&A chain are declared using LangChain Expression Language, or LCEL. The `|` symbol is similar to a Unix pipe operator: it chains the components together, feeding the output of one component as input into the next.

* We pass in a query (question) we would like to ask about the PDF document.
* The first step builds a dictionary: the retriever fetches the chunks most similar to the query and `format_docs` joins them into the `context` string, while `RunnablePassthrough()` forwards the query unchanged as `question`.
* The prompt component fills the template with `context` and `question` and produces a `PromptValue`.
* The model component passes the generated prompt to the OpenAI chat model. The output of the model is an `AIMessage` object.
* Finally, the `StrOutputParser()` component takes the `AIMessage` and transforms it into a Python string, which is returned from the `.invoke()` method.
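The piping behaviour described in the steps above can be imitated in a few lines of plain Python. The sketch below is a toy, not how LangChain implements LCEL: `Step` is a hypothetical class that overloads `|`, and `build_prompt`, `fake_model`, and `parse_output` are stand-ins that make no API calls.

```python
class Step:
    """Toy wrapper that lets plain functions be chained with `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that runs a, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Hypothetical stand-ins for the prompt, model, and parser stages.
build_prompt = Step(lambda question: f"Question: {question}\nHelpful Answer:")
fake_model = Step(
    lambda prompt: {"content": f"(answer to a {len(prompt)}-character prompt)"}
)
parse_output = Step(lambda message: message["content"])

toy_chain = build_prompt | fake_model | parse_output
print(toy_chain.invoke("What is IDX?"))
```

Each stage only needs to accept the previous stage's output, which is what makes it easy to swap one component (for example, the model) without touching the rest of the chain.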

Let's test out the Q&A chain functionality using the `.invoke()` method! In the following code, we will use the `textwrap` package functionality to print the chain response in a more convenient appearance.

In [15]:
import textwrap

print(textwrap.fill(rag_chain.invoke("What is Indonesian Stock Exchange?"), width = 85))

The Indonesian Stock Exchange (IDX) was formed on November 30, 2007, through the
merger of the Jakarta Stock Exchange (JSX) and the Surabaya Stock Exchange (SSX). It
aims to become a credible exchange that drives financial deepening and empowers
Indonesia to become the 5th largest economy by 2045. IDX is located at Gedung Bursa
Efek Indonesia, Tower I, Jakarta, Indonesia. Thanks for asking!


In [16]:
print(textwrap.fill(rag_chain.invoke("Berapakah nilai IHSG pada 13 September 2022?"), width = 85))

IHSG pada 13 September 2022 mencapai level 7.318,016. Thanks for asking!


## Summarizer

To build the summarizer functionality, we will use the `RetrievalQA` class from the `langchain.chains` module. This chain is designed for question answering against an index; we will adapt it by asking for a general summary of the document instead of a specific question.

Let's import the required object!

In [17]:
from langchain.chains import RetrievalQA

In [18]:
qa_chain = RetrievalQA.from_chain_type(llm = llm,
                                       chain_type = "stuff",
                                       retriever = retriever,
                                       return_source_documents = True,
                                       verbose = False)

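Pay attention to where we define `qa_chain` above! With the `from_chain_type()` method, we pass the LLM that will generate the summary and the retriever that supplies the document chunks via the `llm` and `retriever` parameters, respectively. When we declare `chain_type = "stuff"`, the chain takes the list of retrieved documents, inserts them all into a single prompt, and passes that prompt to the LLM. Note that the retriever returns only a small number of top-ranked chunks, so the resulting summary reflects those chunks rather than the entire PDF.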
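The "stuff" strategy can be pictured as simple concatenation. The sketch below is a hypothetical illustration rather than LangChain's code: `stuff_prompt` and its `max_chars` budget are assumptions introduced to show why this approach only works when the retrieved chunks fit into one prompt.

```python
def stuff_prompt(chunks, question, max_chars=4000):
    """Place every retrieved chunk into a single prompt, 'stuff' style."""
    context = "\n\n".join(chunks)
    prompt = f"{context}\n\nQuestion: {question}\nHelpful Answer:"
    if len(prompt) > max_chars:
        # A real model enforces a token limit; characters are a rough proxy here.
        raise ValueError("retrieved chunks do not fit into a single prompt")
    return prompt


retrieved_chunks = [
    "Chunk A: organization and governance updates.",
    "Chunk B: finance and accounting updates.",
]
prompt = stuff_prompt(retrieved_chunks, "Give me the summary in general!")
print(prompt)
```

Because everything is sent in one request, the approach is simple and needs only a single model call, but it cannot cover more text than the model's context window allows.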

In [19]:
chain_result = qa_chain.invoke({"query": "Give me the summary in general!"})


We execute the question-answering chain (`qa_chain`) by passing a question as input. In this case, the question is "Give me the summary in general!" The question is formulated based on the user's request for a summary of the document content.

In [20]:
chain_result

{'query': 'Give me the summary in general!',
 'result': 'The summary of the provided context is about the activities and initiatives undertaken by a company, particularly related to organizational development, information dissemination, and system development. It includes details on various divisions, workshops, e-learning sessions, capacity building programs, and information dissemination services. Additionally, there is information on the number of agreements and memorandums issued by the Legal Division, as well as the development phases of a reporting system and special notation module.',
 'source_documents': [Document(page_content='membership), as well as for boositng future collaboration. The \ndetails of the activities are as follows:1. Organization Enabler Part 1:\n ›Organization & HC Updates;\n ›Corporate Strategy & Subsidiary Management;\n ›Corporate Secretary;\n ›Governance, Risk, and Compliance.\n2. Organization Enabler Part 2:\n ›Finance & Accounting Updates;\n ›General Aff

When we set `return_source_documents = True`, the resulting `chain_result` also contains the `source_documents` key, which lists the retrieved chunks that the LLM used to produce the summary. To extract only the response from the LLM, access the `result` key.

In [21]:
print(textwrap.fill(chain_result['result']))

The summary of the provided context is about the activities and
initiatives undertaken by a company, particularly related to
organizational development, information dissemination, and system
development. It includes details on various divisions, workshops,
e-learning sessions, capacity building programs, and information
dissemination services. Additionally, there is information on the
number of agreements and memorandums issued by the Legal Division, as
well as the development phases of a reporting system and special
notation module.


# Summary

In this coursebook, we have learned the introductory concepts of Large Language Models (LLMs) and explored how we can extend their capabilities to text-based specific use cases, such as PDF Q&A and summarization. We have delved into the workflows for using LLMs for these cases, starting from text preprocessing, embedding the text, and storing the embeddings in a vector database. Finally, we provided these embeddings as context for the LLM, enabling effective information retrieval and the extraction of a general summary of the document.

Lastly, this coursebook has provided a brief overview of LLM implementation for unstructured data. Find more resources in further readings to explore other astonishing implementations of LLMs using LangChain.

# Further Readings

* [LangChain AI Handbook](https://www.pinecone.io/learn/series/langchain/).
* [LangChain Documentation](https://python.langchain.com/docs/get_started/introduction).