# Extracting Text from PDF Files Using Three Methods: PyPDF2, pdfplumber, and PyMuPDF
This code demonstrates three different methods for extracting text from PDF files using Python libraries: **PyPDF2**, **pdfplumber**, and **PyMuPDF (fitz)**. Each of these libraries offers unique strengths, depending on the complexity of the PDF file and the type of data extraction needed.

In [5]:
# Define the path to the PDF file
path = r"C:\Users\Quynh Pham\Desktop\Import pdf\Newsletter.pdf"

## 1. PyPDF2
**PyPDF2** is a lightweight, easy-to-use library for working with PDFs. The project is now maintained under the name `pypdf`, which is why the code below imports `PdfReader` from `pypdf`. It extracts raw text but has limited support for complex layouts such as tables or multi-column pages, so it is best suited to PDFs that contain mostly plain text with little formatting or structure.

- **Use Case**: Extracting text from basic PDFs (e.g., reports, documents).
- **Limitations**: Limited support for complex layouts such as tables, and no OCR for text inside embedded images.

In [4]:
from pypdf import PdfReader

In [6]:
# Create a PdfReader object to read the PDF
reader = PdfReader(path)

# Print the number of pages in the PDF
print(f"There are {len(reader.pages)} Pages")

There are 3 Pages


In [10]:
# Get the first page (index 0) 
page = reader.pages[0]

# Use extract_text() to get the text of the page
print(page.extract_text())

Drylab Newsfor in vestors & friends · Ma y 2017
Welcome to our first newsletter of 2017! It's
been a while since the last one, and a lot has
happened. W e promise to k eep them coming
every two months hereafter , and permit
ourselv es to mak e this one r ather long. The
big news is the beginnings of our launch in
the American mark et, but there are also
interesting updates on sales, de velopment,
mentors and ( of course ) the in vestment
round that closed in January .
New c apital: The in vestment round was
successful. W e raised 2.13 MNOK to matchthe 2.05 MNOK loan from Inno vation
Norwa y. Including the de velopment
agreement with Filmlance International, the
total new capital is 5 MNOK, partly tied to
the successful completion of milestones. All
formalities associated with this process are
now finalized.
New o wners: We would especially lik e to
warmly welcome our new owners to the
Drylab family: Unni Jacobsen, T orstein Jahr ,
Suzanne Bolstad, Eivind Bergene, T urid Brun,
Vigdis T 

In [8]:
# Extract and print the text of every page in the PDF
for page in reader.pages:
    print(page.extract_text())

Drylab Newsfor in vestors & friends · Ma y 2017
Welcome to our first newsletter of 2017! It's
been a while since the last one, and a lot has
happened. W e promise to k eep them coming
every two months hereafter , and permit
ourselv es to mak e this one r ather long. The
big news is the beginnings of our launch in
the American mark et, but there are also
interesting updates on sales, de velopment,
mentors and ( of course ) the in vestment
round that closed in January .
New c apital: The in vestment round was
successful. W e raised 2.13 MNOK to matchthe 2.05 MNOK loan from Inno vation
Norwa y. Including the de velopment
agreement with Filmlance International, the
total new capital is 5 MNOK, partly tied to
the successful completion of milestones. All
formalities associated with this process are
now finalized.
New o wners: We would especially lik e to
warmly welcome our new owners to the
Drylab family: Unni Jacobsen, T orstein Jahr ,
Suzanne Bolstad, Eivind Bergene, T urid Brun,
Vigdis T 

As you can see, **PyPDF2** extracted the text from both columns in the correct reading order. However, the spacing is unreliable: spaces were inserted inside words ("in vestors", "W e", "mak e") and dropped between others ("Newsfor", "matchthe"), which hurts readability. Problems like these are common in PDF text extraction because the file stores positioned glyphs rather than words and lines, so the extractor has to infer where spaces and line breaks belong.
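
Some of these artifacts can be reduced with light post-processing. The sketch below is only an illustration: `tidy_extracted_text` is a hypothetical helper, not part of pypdf, and it only handles stray spaces around punctuation. It cannot repair words split mid-token such as "in vestors", since that would need a dictionary or a different extractor.

```python
import re


def tidy_extracted_text(text: str) -> str:
    """Hypothetical cleanup helper for extracted text (not a pypdf API)."""
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Remove the space before closing punctuation: "January ." -> "January."
    text = re.sub(r" ([,.;:!?)])", r"\1", text)
    # Remove the space after an opening parenthesis: "( of" -> "(of"
    text = re.sub(r"\( ", "(", text)
    # Trim trailing whitespace on every line
    return "\n".join(line.rstrip() for line in text.splitlines())


sample = (
    "every two months hereafter , and permit\n"
    "mentors and ( of course ) the in vestment\n"
    "round that closed in January ."
)
print(tidy_extracted_text(sample))
```

On the three sample lines taken from the output above, this fixes "hereafter ,", "( of course )" and "January ." but leaves "in vestment" untouched.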

## 2. pdfplumber

**pdfplumber** is a library designed for handling complex PDF layouts. In addition to text, it can extract structured data such as tables, along with images and metadata. It is a strong choice when you need to pull data out of PDFs that contain tables or complex formatting.

- **Use Case**: Extracting tables and structured text from complex PDFs (e.g., financial reports, forms).
- **Strength**: Can handle tables and complex document structures effectively.
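
As a rough illustration of the table use case, the sketch below uses pdfplumber's `page.extract_tables()`, which returns each table as a list of rows whose cells are strings or `None`. The helper names `table_to_csv` and `tables_as_csv` are hypothetical, and the sample rows are made up for demonstration; the newsletter used in this notebook is not known to contain tables.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize one extracted table to CSV text; None cells become empty."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def tables_as_csv(pdf_path):
    """Return one CSV string per table found in the PDF (hypothetical helper)."""
    import pdfplumber  # imported lazily so table_to_csv works without it

    with pdfplumber.open(pdf_path) as pdf:
        return [
            table_to_csv(table)
            for page in pdf.pages
            for table in page.extract_tables()
        ]


# Made-up rows in the shape that extract_tables() returns
print(table_to_csv([["name", "value"], ["a", None]]))
```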

In [11]:
import pdfplumber

with pdfplumber.open(path) as pdf:
    # iterate over each page
    for page in pdf.pages:
        # extract text
        text = page.extract_text()
        print(text)

DrylabNews
for investors & friends · May 2017
Welcome to our first newsletter of 2017! It's the 2.05 MNOK loan from Innovation
been a while since the last one, and a lot has Norway. Including the development
happened. We promise to keep them coming agreement with Filmlance International, the
every two months hereafter, and permit total new capital is 5 MNOK, partly tied to
ourselves to make this one rather long. The the successful completion of milestones. All
big news is the beginnings of our launch in formalities associated with this process are
the American market, but there are also now finalized.
interesting updates on sales, development,
New owners:We would especially like to
mentors and (of course) the investment
warmly welcome our new owners to the
round that closed in January.
Drylab family: Unni Jacobsen, Torstein Jahr,
New capital:The investment round was Suzanne Bolstad, Eivind Bergene, Turid Brun,
successful. We raised 2.13 MNOK to match Vigdis Trondsen, Lea Blindheim, Kri

**pdfplumber** is usually praised for handling complex PDF layouts such as tables, but it performed poorly on this document. The words themselves are clean, with none of the spacing errors seen with PyPDF2, but the reading order is wrong: pdfplumber read each line across the full width of the page, so text from the left and right columns is interleaved (for example, "It's the 2.05 MNOK loan from Innovation"). The result is hard to follow, which makes pdfplumber's default text extraction less suitable for this specific document.
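
One possible workaround is to crop each page into vertical strips and extract every strip separately, using pdfplumber's `page.crop()` with a `(x0, top, x1, bottom)` box. The sketch below assumes equal-width columns, which may not match the real gutter position of this newsletter; `column_bboxes` and `extract_by_column` are hypothetical helper names.

```python
def column_bboxes(page_bbox, n_columns=2):
    """Split a page box (x0, top, x1, bottom) into equal-width vertical strips."""
    x0, top, x1, bottom = page_bbox
    step = (x1 - x0) / n_columns
    # Reuse the exact right edge for the last strip to avoid float overshoot
    edges = [x0 + i * step for i in range(n_columns)] + [x1]
    return [(edges[i], top, edges[i + 1], bottom) for i in range(n_columns)]


def extract_by_column(pdf_path, n_columns=2):
    """Extract text strip by strip (assumes equal-width columns)."""
    import pdfplumber  # imported lazily so column_bboxes works without it

    parts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for bbox in column_bboxes(page.bbox, n_columns):
                text = page.crop(bbox).extract_text()
                if text:
                    parts.append(text)
    return "\n".join(parts)


print(column_bboxes((0, 0, 600, 800)))
```

Elements that span both columns, such as the newsletter title, would be cut in two by this approach, so the strip boundaries usually need adjusting per document.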

## 3. PyMuPDF (fitz)

**PyMuPDF (fitz)** is a highly versatile and efficient library for working with PDFs, allowing you to extract text, images, and other content from PDF documents. It provides excellent performance for text extraction and gives you access to the coordinates of the extracted text, which can be especially helpful when working with complex layouts or interactive PDFs.

- **Use Case**: Extracting text and metadata from PDFs, especially when dealing with complex or interactive files.
- **Strength**: Can handle text, images, and even coordinates of the content, enabling advanced document handling.

In [12]:
import fitz  # PyMuPDF

# Open the PDF document; the context manager closes it automatically
with fitz.open(path) as pdf_document:
    # Extract the plain text of each page and join it into one string
    raw_text = "".join(page.get_text("text") for page in pdf_document)

# Print the extracted text
print(raw_text)

DrylabNews
for investors & friends · May 2017
Welcome to our first newsletter of 2017! It's
been a while since the last one, and a lot has
happened. We promise to keep them coming
every two months hereafter, and permit
ourselves to make this one rather long. The
big news is the beginnings of our launch in
the American market, but there are also
interesting updates on sales, development,
mentors and (of course) the investment
round that closed in January.
New capital: The investment round was
successful. We raised 2.13 MNOK to match
the 2.05 MNOK loan from Innovation
Norway. Including the development
agreement with Filmlance International, the
total new capital is 5 MNOK, partly tied to
the successful completion of milestones. All
formalities associated with this process are
now finalized.
New owners: We would especially like to
warmly welcome our new owners to the
Drylab family: Unni Jacobsen, Torstein Jahr,
Suzanne Bolstad, Eivind Bergene, Turid Brun,
Vigdis Trondsen, Lea Blindheim, K

**PyMuPDF** performed best on this document, outperforming both PyPDF2 and pdfplumber. It extracted the text accurately, read the two columns in the correct order, and preserved the spacing and line breaks within each block of text, so the output closely mirrors the structure and readability of the original. That makes PyMuPDF the best choice for this extraction task.
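
The coordinate access mentioned earlier can be sketched as follows. `page.get_text("blocks")` returns one tuple per block in the form `(x0, y0, x1, y1, text, block_no, block_type)`, where a `block_type` of 0 marks a text block. The helpers `reading_order` and `blocks_in_reading_order`, and the assumption of equal-width columns, are illustrative and not part of PyMuPDF. PyMuPDF already returned the columns in the right order for this document, so explicit sorting like this would only matter for files where it does not.

```python
def reading_order(blocks, page_width, n_columns=2):
    """Sort text blocks column by column, then top to bottom."""
    col_width = page_width / n_columns
    text_blocks = [b for b in blocks if b[6] == 0]  # keep text, drop images

    def sort_key(block):
        # Assign the block to a column by its left edge, clamped to valid range
        column = min(max(int(block[0] // col_width), 0), n_columns - 1)
        return (column, block[1])

    return sorted(text_blocks, key=sort_key)


def blocks_in_reading_order(pdf_path, n_columns=2):
    """Return (page_number, bbox, text) for every text block in the PDF."""
    import fitz  # PyMuPDF, imported lazily so reading_order works without it

    results = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            blocks = page.get_text("blocks")
            for b in reading_order(blocks, page.rect.width, n_columns):
                results.append((page.number, b[:4], b[4]))
    return results


# Made-up blocks in the shape returned by get_text("blocks")
demo = [
    (320, 50, 580, 100, "right top\n", 0, 0),
    (20, 300, 280, 350, "left bottom\n", 1, 0),
    (20, 50, 280, 100, "left top\n", 2, 0),
    (20, 120, 280, 280, "", 3, 1),  # image block, skipped
]
print([b[4] for b in reading_order(demo, page_width=600)])
```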

The effectiveness of each tool still depends on the specific use case. PyPDF2 may be sufficient for simple text layouts, while pdfplumber may perform better on structured data such as tables. Evaluate each tool against your own PDFs, try more than one method, and choose the one that works best for your scenario.
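
When trying several methods, a simple score can make the comparison less subjective. The sketch below is a rough heuristic under stated assumptions: `stray_letter_ratio` and `compare_extractors` are hypothetical names, and the score only measures how many tokens are isolated single letters, which is the kind of damage seen in the PyPDF2 output ("W e", "k eep").

```python
def stray_letter_ratio(text: str) -> float:
    """Fraction of tokens that are a lone letter other than 'a', 'A' or 'I'."""
    tokens = text.split()
    if not tokens:
        return 0.0
    stray = [
        t for t in tokens
        if len(t) == 1 and t.isalpha() and t not in {"a", "A", "I"}
    ]
    return len(stray) / len(tokens)


def compare_extractors(extractors, pdf_path):
    """Score each extractor; `extractors` maps a name to a callable(path) -> str."""
    return {
        name: stray_letter_ratio(extract(pdf_path))
        for name, extract in extractors.items()
    }


print(stray_letter_ratio("W e promise to k eep them coming"))  # 0.375
print(stray_letter_ratio("We promise to keep them coming"))    # 0.0
```

A low score does not mean the output is correct: the interleaved columns produced by pdfplumber would score well here, so reading order still has to be checked by eye.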