**Coursebook: Developing a PDF Summarizer and Q&A System**

- Course Length: 9 hours
- Last Updated: April 2024

___

Developed by Algoritma's Product Team

# Developing a PDF Summarizer and Q&A System

## Background

In today's digital age, the volume of information stored in PDF documents has grown rapidly across various industries, including education, research, legal, and corporate sectors. PDF files serve as an essential medium for sharing, archiving, and disseminating knowledge. However, the sheer volume and complexity of these documents often make it challenging for users to extract relevant information efficiently.

**Summarizer Challenge:** 
Users often face long PDFs with both important and irrelevant details. Reading and summarizing these manually takes time and can lead to mistakes. PDFs can also have various content types like text, tables, images, and graphs, making it tricky to summarize them neatly. Plus, everyone's needs for summaries are different, so a one-size-fits-all approach might not work well.

**Q&A System Challenge:** 
Understanding what users are asking about in relation to PDF content can be tough due to language differences and context. To give users the right answers, the system needs to pull the relevant info from the PDFs quickly and accurately. The answers also need to be spot-on to build trust and satisfaction with the system.

**LLM Integration:** 
Using a large language model (LLM) like GPT-3 can help tackle these challenges. LLMs are great at understanding language, summarizing text, and answering questions. But to make this work with PDFs, we need to:

1. **Document Pre-processing:** Prepare the PDF content for LLMs by converting it into a usable format, handling text from images or tables, and keeping the document's structure intact.
  
2. **Fine-tuning and Customization:** Adjust the LLM to better suit specific topics or user needs. This makes the summarization and Q&A features more accurate.
  
3. **Scalability and Efficiency:** Make sure the system can manage lots of PDFs smoothly and respond to user questions quickly.

To solve these issues, we'll need expertise in natural language processing, machine learning, document handling, and user interface design. Creating a PDF summarizer and Q&A system using an LLM could greatly improve how we access and manage information across many fields.


## Objective

The objective of this coursebook is to provide learners with a comprehensive understanding and practical skills in working with PDF files and Large Language Models (LLMs) for developing a Q&A system and summarizer.

**Coursebook Outline:**

1. **Introduction to PDF File:**
   - Learn what PDF files are and why they're important.
   - Find out how to open and use PDF files in code.

2. **Introduction to LLM:**
   - Get to know Large Language Models like GPT-3, GPT-2, and BERT.
   - Understand what LLMs can and can't do in language tasks.
   - Learn about LangChain for using LLMs effectively.

3. **Text Preprocessing:**
   - Learn basic steps to get text ready for analysis.
   - Understand how to clean and organize text.

4. **Extracting Text Using Vector Database:**
   - Use Chroma to store and retrieve text taken from different sources, including PDFs.
   - Practice retrieving relevant text with Chroma.

5. **Q&A System and Summarizer Development:**
   - Set up API keys and handle environment settings with .env files.
   - Build a Q&A system that uses LLMs to answer questions from PDFs.
   - Create a summarization tool to make short summaries of PDFs with LLMs.

By the end of this coursebook, you'll know how to use these tools to get information from PDF files and make Q&A and summarization systems using Large Language Models.

# 1. Introduction to PDF File

PDF stands for Portable Document Format. It's a file format developed by Adobe that captures all the elements of a printed document as an electronic image that can be viewed, printed, or transmitted easily. PDF files are widely used because they preserve the formatting, fonts, and layout of the original document, making them ideal for sharing documents across different platforms and devices without losing their appearance.

PDF files have become essential in today's digital world for several reasons:

- **Universal Compatibility:** PDFs can be opened and viewed on virtually any device and operating system using free software like Adobe Acrobat Reader, making them universally accessible.
  
- **Document Preservation:** Unlike other file formats, PDFs preserve the original layout, fonts, and graphics of a document, ensuring that it looks the same regardless of where or how it's viewed.
  
- **Security Features:** PDFs can be encrypted and password-protected, allowing users to control who can access, edit, or print the document.
  
- **Multi-page Support:** PDFs can contain multiple pages, making them suitable for creating reports, presentations, and ebooks.


**Opening and Using PDF Files in Python:**

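To work with PDF files in Python, we can use libraries that read and manipulate PDF documents. One popular library for this purpose is `PyPDF2`. Note that `PyPDF2` is no longer actively developed; its maintainers continue the project under the name `pypdf`, which offers the same `PdfReader` interface. Here's a guide on how to open and use PDF files in Python using `PyPDF2`: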



In [None]:
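from PyPDF2 import PdfReader

# Path to the PDF used in this coursebook
pdf_file_path = "assets/Laporan-Keuangan-Tahunan-BI-2022.pdf"
# PdfReader parses the file and exposes its pages through `.pages`
loader = PdfReader(pdf_file_path)

In [None]:
raw_text = ""

# Collect the text of every page into one string
for page in loader.pages:
    content = page.extract_text()
    # Pages without extractable text (e.g. scanned images) give an empty result
    if content:
        # Add a newline so words at page boundaries do not run together
        raw_text += content + "\n"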
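
Pages that consist only of scanned images return no text from `extract_text()`, and the loop above skips them silently. A minimal sketch of how such pages could be flagged for OCR follows. The helper name `find_empty_pages` and the sample page strings are assumptions for illustration, not part of the original notebook.

```python
def find_empty_pages(page_texts):
    """Return 1-based page numbers whose extracted text is empty or whitespace."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()
    ]


# Hypothetical per-page results; with PyPDF2 they could be collected as
# [page.extract_text() for page in loader.pages].
sample_pages = ["Annual report 2022", "", None, "Balance sheet\n"]
empty_pages = find_empty_pages(sample_pages)
print(empty_pages)
```

A non-empty result suggests that part of the document needs OCR before it can be summarized or queried.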

# 2. Introduction to Large Language Models

**Large Language Models (LLMs)** like GPT-3 offer powerful capabilities in natural language processing, making them well-suited for tasks involving **text analysis, summarization, and question answering**. Their ability to understand and generate human-like text can greatly enhance the efficiency and accuracy of systems that work with textual data. By integrating LLMs into our task of **PDF summarization and Q&A system development**, we can leverage their advanced language understanding capabilities to create more intelligent and effective solutions.


**What is an LLM?**

A Large Language Model (LLM) is a type of artificial intelligence model trained on vast amounts of text data to understand and generate human-like text. LLMs, such as GPT-3 (Generative Pre-trained Transformer 3), are designed to perform various natural language processing tasks, including text generation, translation, summarization, and question answering, among others. These models learn from the patterns and structures in the data they are trained on, allowing them to generate coherent and contextually relevant text.

**History of LLMs**

The development of Large Language Models has been a significant milestone in the field of artificial intelligence and natural language processing. The history of LLMs can be traced back to earlier language models like recurrent neural networks (RNNs) and long short-term memory networks (LSTMs). However, the breakthroughs in transformer architectures, particularly with models like GPT (Generative Pre-trained Transformer), have led to the development of more powerful and scalable LLMs.

The evolution of LLMs has been marked by advancements in training techniques, model architectures, and data sources. With each iteration, these models have become larger, more capable, and better at understanding and generating human-like text, driving innovations across various applications and industries.

**Understand What LLMs Can and Can't Do in Language Tasks**

LLMs excel at many natural language processing tasks, thanks to their ability to understand context, generate coherent text, and perform complex language tasks. Here are some tasks LLMs are good at:

- **Text Generation:** Generating human-like text based on the input and context.
  
- **Translation:** Translating text between different languages with reasonable accuracy.
  
- **Summarization:** Creating concise summaries of longer texts.
  
- **Question Answering:** Providing relevant answers to questions based on the input text.

However, LLMs also have limitations:

- **Context Understanding:** While they are good at understanding context within a single passage, they may struggle with broader or multi-document contexts.
  
- **Fact-checking:** They may generate plausible but incorrect information if not guided by accurate data.
  
- **Ethical Considerations:** LLMs can sometimes generate biased or inappropriate content if not carefully controlled and monitored.
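
One common way to reduce plausible but incorrect answers is to place the relevant document text directly in the prompt and instruct the model to answer only from it. The sketch below builds such a prompt as a plain string. The function name `build_grounded_prompt` and the wording of the instructions are assumptions, and the resulting string would still need to be sent to an LLM of your choice.

```python
def build_grounded_prompt(question, passages):
    """Combine retrieved passages and a question into one grounded prompt."""
    # Number each passage so an answer can be traced back to its source.
    context = "\n\n".join(
        f"[{index}] {passage.strip()}"
        for index, passage in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say that you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


# Hypothetical passages standing in for text extracted from a PDF.
prompt = build_grounded_prompt(
    "What period does the report cover?",
    ["The report covers the 2022 financial year.", "It was audited externally."],
)
print(prompt)
```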



## LangChain 🦜🔗
[LangChain](https://python.langchain.com/docs/get_started/introduction.html) is a framework for developing applications powered by language models. It connects language models, such as OpenAI's GPT-3 or GPT-2, with other tools, data sources, and APIs to build a flexible language processing pipeline that addresses specific business needs.

The LangChain concept aims to leverage the strengths of each language model and API to create a comprehensive language processing system. It allows developers to combine different models for tasks like question answering, text generation, translation, summarization, sentiment analysis, and more.

In the context of our task, LangChain can link the different stages of PDF summarization and Q&A system development. For example, one component could handle text extraction and preprocessing, another could retrieve relevant passages from a vector database, and an LLM could handle question answering and summarization. By chaining these components together effectively, we can leverage their complementary strengths to create a more powerful and efficient system.
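
The idea of chaining can be illustrated without any framework: each stage is a function, and the output of one stage becomes the input of the next. The sketch below is plain Python, not LangChain's API. `make_chain`, the stage functions, and the stand-in `fake_llm` are hypothetical names used only to show the data flow.

```python
from functools import reduce


def make_chain(*steps):
    """Compose steps left to right: the output of one feeds the next."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def clean(text):
    # Collapse runs of whitespace left over from PDF extraction.
    return " ".join(text.split())


def to_prompt(text):
    return f"Summarize the following text in one sentence:\n{text}"


def fake_llm(prompt):
    # Stand-in for a real model call; it only reports the prompt length.
    return f"(model output for a {len(prompt)}-character prompt)"


summarize = make_chain(clean, to_prompt, fake_llm)
print(summarize("The  annual report\ncovers the 2022\nfinancial year."))
```

Replacing `fake_llm` with a real model call would leave the rest of the chain unchanged, which is the main appeal of this design.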

# 3. Text Preprocessing
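
Text extracted from PDFs usually needs light cleaning before it is embedded or sent to an LLM, and long documents have to be split into smaller chunks. The sketch below shows one possible approach using only the standard library. The function names `clean_text` and `split_into_chunks`, as well as the chunk size and overlap values, are assumptions chosen for illustration.

```python
import re


def clean_text(raw_text):
    """Normalize text extracted from a PDF."""
    # Rejoin words hyphenated across line breaks, e.g. "finan-\ncial".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw_text)
    # Collapse every run of whitespace (including newlines) into one space.
    return re.sub(r"\s+", " ", text).strip()


def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining text is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = "The annual finan-\ncial report   covers\nthe 2022 period."
cleaned = clean_text(sample)
print(cleaned)
print(split_into_chunks(cleaned, chunk_size=20, overlap=5))
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks. Note that the hyphen rule also joins genuine compound words that happen to break across lines, so it may need adjusting for your documents.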

# 4. Extracting Text Using Vector Database

When working with PDF files, we often deal with a large amount of unstructured data. To efficiently handle and retrieve information from such data, we need a structured and optimized approach. This is where **vector databases** come into play.

**Vector databases** are specialized databases designed to store and search vector data efficiently. In the context of PDF files, each piece of extracted text is converted into a numeric vector (an embedding) and stored alongside the text. A query is converted into a vector in the same way, and the database returns the pieces of text whose vectors are closest to it, which allows quick and accurate retrieval.
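
The retrieval idea behind a vector database can be sketched in a few lines: every text is represented by a vector, and a query returns the texts whose vectors point in the most similar direction. The example below uses hand-written three-dimensional vectors and cosine similarity purely for illustration. Real embeddings come from a model and have hundreds of dimensions, and the names `cosine_similarity` and `most_similar` are assumptions rather than part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_vector, store, k=1):
    """Return the k texts whose vectors are closest to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy (text, vector) pairs; the vectors are made up for the example.
toy_store = [
    ("Total assets increased during the year.", [0.9, 0.1, 0.0]),
    ("The audit was completed in March.", [0.1, 0.8, 0.2]),
    ("Staff training programs were expanded.", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]
print(most_similar(query, toy_store, k=2))
```

A vector database performs the same kind of ranking, but with indexing structures that avoid comparing the query against every stored vector.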

**Why Use Chroma?**

Chroma is an open-source vector database that stores text together with its embeddings and retrieves the passages most similar to a query. It does not read PDF files itself: we first extract the text with a library such as `PyPDF2` and then store it in Chroma. Here's why we use Chroma:

1. **Efficiency:** Chroma looks up stored text by vector similarity, so it can find relevant passages quickly even when the document collection is large.

2. **Accuracy:** Because retrieval is based on embeddings, Chroma can match passages by meaning rather than by exact keywords. This helps the Q&A system answer from the right part of a document, which is crucial when dealing with important or sensitive data.

3. **Versatility:** Chroma can store text from any source once it has been extracted, including PDFs and text obtained from scanned documents through OCR, and it can keep metadata alongside each entry.


The specific vector database that we will use is the **ChromaDB** vector database.

[Chroma Website](https://docs.trychroma.com/getting-started):

> Chroma is a database for building AI applications with embeddings. It comes with everything you need to get started built in, and runs on your machine.

In [None]:
# Feel free to change these imports if they do not match your setup
# (newer LangChain releases provide them under `langchain_community`)
from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

embedding_function = SentenceTransformerEmbeddings(model_name = "all-MiniLM-L6-v2")

In [None]:
vectordb = Chroma(persist_directory = "assets/chroma_db", embedding_function = embedding_function)

# 5. Q&A System and Summarizer Development

## Environment Set-up

Using LangChain will usually require integrations with one or more model providers, data stores, APIs, etc. For this example, we'll use OpenAI's model APIs.

### Setting API key and `.env`

Accessing the API requires an API key, which you can get by creating an account with the model provider. When setting up an API key and using a .env file in your Python project, you follow these general steps:

1. **Obtain an API key**: If you're working with an external API or service that requires an API key, you need to obtain one from the provider. This usually involves signing up for an account and generating an API key specific to your project.

2. **Create a .env file**: In your project directory, create a new file and name it ".env". This file will store your API key and other sensitive information securely.

3. **Store API key in .env**: Open the .env file in a text editor and add a line to store your API key. The format should be `API_KEY=your_api_key`, where "API_KEY" is the name of the variable and "your_api_key" is the actual value of your API key. Make sure not to include any quotes or spaces around the value.

4. **Load environment variables**: In your Python code, you need to load the environment variables from the .env file before accessing them. Import the dotenv module and add the following code at the beginning of your script:

```python
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
```

> The `python-dotenv` library (imported as `dotenv`) is a popular Python library that simplifies the process of loading environment variables from a .env file into your Python application. It allows you to store configuration variables separately from your code, making it easier to manage sensitive information such as API keys, database credentials, or other environment-specific settings.


In [None]:
from dotenv import load_dotenv

load_dotenv()
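
After `load_dotenv()` has run, the values from `.env` are available as ordinary environment variables. A minimal sketch of reading the key and failing early when it is missing follows. The helper name `get_api_key` is an assumption, and `API_KEY` is the variable name from step 3 above; your provider's client library may expect a specific name instead.

```python
import os


def get_api_key(name="API_KEY"):
    """Read an API key from the environment, raising a clear error if absent."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# For demonstration only: set a placeholder so the example runs on its own.
os.environ.setdefault("API_KEY", "your_api_key")
key = get_api_key()
# Print only a short prefix so the full key never appears in notebook output.
print(key[:4] + "...")
```

Checking for the key once at start-up gives a clearer error than a failed API request later in the notebook.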