### **pdfplumber**
In today’s digital age, many critical documents—ranging from financial reports to academic papers—are available only in PDF format. Extracting data from PDFs can be challenging due to their complex layouts and structures. **pdfplumber** is a powerful Python library that simplifies the process of scraping text, tables, and metadata from PDF files.  

While **pdfplumber** provides customizable methods for extracting text and tables, it does not include Optical Character Recognition (OCR) capabilities. Therefore, it cannot extract text from scanned or image-only PDFs, which contain pictures of text rather than the text itself.
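
One quick way to tell whether a page has an extractable text layer is to check whether `extract_text()` returns anything. The sketch below is a minimal, hypothetical helper (`pages_without_text` is not part of pdfplumber); it only assumes that each page object exposes `extract_text()`, as pdfplumber pages do.

```python
def pages_without_text(pages):
    """Return the zero-based indexes of pages that yield no extractable text.

    Hypothetical helper, not part of pdfplumber. Pages listed here are
    likely scanned images and would need OCR.
    """
    empty = []
    for index, page in enumerate(pages):
        text = page.extract_text() or ""  # treat None the same as an empty string
        if not text.strip():
            empty.append(index)
    return empty


# Usage (assumes ca-warn-report.pdf is in the current folder):
# import pdfplumber
# with pdfplumber.open("ca-warn-report.pdf") as pdf:
#     print(pages_without_text(pdf.pages))
```

An empty result suggests that every page has a text layer; it does not guarantee that the extracted text is complete.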

Read the [documentation](https://github.com/jsvine/pdfplumber) for further details.

For practice, we will use this sample file from the pdfplumber repository: https://github.com/jsvine/pdfplumber/blob/c562774331905a9770f03c0aaba13a69c7c6d683/examples/pdfs/ca-warn-report.pdf


In [None]:
"""
Objective: Read metadata from a document
"""

import pdfplumber

# TODO: Download the PDF file and place it in the current folder

filename = "ca-warn-report.pdf"

pdf = pdfplumber.open(filename)
metadata = pdf.metadata
print("PDF Metadata:")
print(metadata)

# TODO: What information is available in the metadata?

In [None]:
"""
Objective: Open a PDF file and display a page as an image
"""

first_page = pdf.pages[0] # get the first page (pages are zero-indexed)
img = first_page.to_image(resolution=300) # render the page as an image
img.show() # this will show the image in your default image viewer
# this helps you check which page you are working with without opening the PDF manually

# TODO: If the image looks low-resolution, raise the value of the resolution parameter
# TODO: Use the pages attribute to count the number of pages

In [None]:
"""
Objective: Check whether pdfplumber can detect the table
"""

img.debug_tablefinder()
# TODO: What is the output? How does it differ from the original page?

In [None]:
"""
Objective: Extract table for the first page
"""

tables = first_page.extract_table()
tables
# TODO: You might realize that the output is not clean. What is the issue?
# TODO: pdfplumber provides methods for `extract_tables()` and `extract_table()`. What is the difference, and when should each be used?
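
As a starting point for the question above: according to the pdfplumber documentation, `extract_tables()` returns every table found on the page as a list of tables, while `extract_table()` returns only the largest one (or `None` when no table is found). The sketch below uses a hypothetical helper, `table_shapes`, to list the size of each detected table so the two results can be compared.

```python
def table_shapes(page):
    """Return (row_count, column_count) for every table detected on a page.

    Hypothetical helper built on page.extract_tables(), which returns a
    list of tables, each table being a list of rows.
    """
    shapes = []
    for table in page.extract_tables():
        widest_row = max((len(row) for row in table), default=0)
        shapes.append((len(table), widest_row))
    return shapes


# Usage with the page opened in the cells above:
# print(table_shapes(first_page))
```

If only one shape is reported, both methods describe the same table; with several shapes, `extract_table()` would drop all but the largest.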

In [None]:
"""
Objective: Data Cleansing
"""

# TODO: Analyze the structure of the previous output and decide which part is the header (column names) and which part is the data (rows)

import pandas as pd

header = tables[0] # TODO: Replace the value with the index slicing from the previous output
data = tables[1:] # TODO: Replace the value with the index slicing from the previous output

df = pd.DataFrame(data, columns=header)
for column in ["Effective", "Received"]:
    df[column] = df[column].str.replace(" ", "") # this will remove unnecessary spaces

df

# TODO: Convert the code above into a function
# TODO: Make sure it works for the first page, a middle page, and the last page
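
One possible shape for such a function is sketched below. It is not the only solution: the name `table_to_dataframe` and the `compact_columns` parameter are made up for illustration, and the sample rows are invented to mimic the list-of-rows structure that `extract_table()` returns.

```python
import pandas as pd


def table_to_dataframe(table, compact_columns=("Effective", "Received")):
    """Turn the list-of-rows output of extract_table() into a DataFrame.

    Assumes the first row holds the column names. `compact_columns` names
    the columns whose values should have spaces removed; columns missing
    from the table are skipped.
    """
    if not table:  # extract_table() returns None when no table is found
        return pd.DataFrame()
    header, *rows = table
    df = pd.DataFrame(rows, columns=header)
    for column in compact_columns:
        if column in df.columns:
            df[column] = df[column].str.replace(" ", "", regex=False)
    return df


# Made-up rows that mimic the shape of the extract_table() output
sample = [
    ["Notice Date", "Effective", "Received", "Company"],
    ["06/22/2015", "03/25/2016", "07/01/2015", "Example Co"],
    ["06/30/2015", "08/ 29/2015", "07/ 01/2015", "Another Co"],
]
print(table_to_dataframe(sample))
```

Returning an empty DataFrame for a missing table keeps the function safe to call on pages without tables, which matters once it runs over a whole document.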

In [None]:
"""
Objective: Extract all the tables
"""

result = []

for i in range(3, 7): # sample page range
    table = pdf.pages[i].extract_table() # extract the table
    df = pd.DataFrame(table) # convert this page's table to a dataframe
    result.append(df) # append to the list

df = pd.concat(result) # concatenate all the dataframes
df
    

# TODO: Improve the code to extract the tables from all the pages
# TODO: Make sure the header and the footer are included
# TODO: Clean the data
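
The loop over every page could be sketched as follows. This is one option under two stated assumptions: the first row of the first table is the header, and later pages may repeat that header as their first row. The helper name `collect_table_rows` is hypothetical.

```python
def collect_table_rows(pages):
    """Gather table rows from every page into one header and one list of rows.

    Pages where extract_table() finds nothing (it returns None) are skipped.
    A first row identical to the header is treated as a repeat and dropped.
    """
    header = None
    rows = []
    for page in pages:
        table = page.extract_table()
        if not table:
            continue
        if header is None:
            header, table = table[0], table[1:]
        elif table[0] == header:
            table = table[1:]  # drop the repeated header row
        rows.extend(table)
    return header, rows


# Usage with the objects from the cells above:
# header, rows = collect_table_rows(pdf.pages)
# df = pd.DataFrame(rows, columns=header)
```

If the last page carries a summary or footer with a different number of columns, those rows may need separate handling before building the DataFrame.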

In [None]:
"""
Objective: Extract words
"""

first_page.extract_words()

# TODO: pdfplumber has the methods `extract_words()` and `extract_text()`. What is the difference, and when should each be used?
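
As a hint for the question above: `extract_text()` returns the page text as a single string, while `extract_words()` returns one dictionary per word, including its position (`x0`, `x1`, `top`, `bottom`). The sketch below shows one use of those positions, grouping words back into lines. The helper `words_to_lines` is hypothetical, and the 3-point tolerance is an arbitrary assumption.

```python
def words_to_lines(words, tolerance=3):
    """Group extract_words() output into text lines by vertical position.

    A word starts a new line when its `top` is more than `tolerance`
    points below the first word of the current line.
    """
    lines = []       # each item is a list of word dicts on one line
    line_top = None  # `top` of the first word in the current line
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if line_top is None or word["top"] - line_top > tolerance:
            lines.append([])
            line_top = word["top"]
        lines[-1].append(word)
    # Within each line, order the words from left to right
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]


# Usage with the page opened in the cells above:
# for line in words_to_lines(first_page.extract_words()):
#     print(line)
```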

In [None]:
"""
Objective: Handle a table that is not detected correctly
"""
# TODO: Download this PDF file and place it in the current folder
download_this = "https://www.bps.go.id/id/publication/2024/09/13/c96f4e54ed31739e138ea6f1/tinjauan-regional-berdasarkan-pdrb-kabupaten-kota-2019-2023-buku-1-pulau-sumatera.html"

# TODO: Open page 75 and find the table using debug_tablefinder()

pdf2 = pdfplumber.open("tinjauan-regional-berdasarkan-pdrb-kabupaten-kota-2019-2023--buku-1-pulau-sumatera.pdf")
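page = pdf2.pages[75] # pages is zero-indexed, so index 75 is the 76th page
page_image = page.to_image(resolution=300) # keep the image separate from the page object
page_image.debug_tablefinder()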

# TODO: Notice that the output shows the table is found, but the rows are not properly highlighted.

In [None]:
"""
Objective: Inspect word coordinates on the page with the undetected table
"""
page = pdf2.pages[75]
for word in page.extract_words():
    print(f"Text: {word['text']}, Coordinates: (x0={word['x0']}, top={word['top']}, x1={word['x1']}, bottom={word['bottom']})")

# TODO: Use the extract_words() output to locate the x coordinates of the table's column boundaries
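
Reading the coordinates by eye works, but candidate positions can also be derived from the word positions. The sketch below is a hypothetical helper (`suggest_vertical_lines`, with an assumed `min_gap` of 5 points): it merges overlapping word spans into column blocks and proposes a line in the middle of each gap.

```python
def suggest_vertical_lines(words, min_gap=5):
    """Suggest x positions for explicit vertical lines from word positions.

    Merges the horizontal spans (x0..x1) of all words into column blocks,
    then returns the left edge of the first block, the midpoint of every
    gap wider than `min_gap`, and the right edge of the last block.
    """
    spans = sorted((w["x0"], w["x1"]) for w in words)
    if not spans:
        return []
    blocks = [list(spans[0])]
    for x0, x1 in spans[1:]:
        if x0 - blocks[-1][1] > min_gap:
            blocks.append([x0, x1])  # gap is wide enough: start a new column
        else:
            blocks[-1][1] = max(blocks[-1][1], x1)  # extend the current column
    lines = [blocks[0][0]]
    lines += [(left[1] + right[0]) / 2 for left, right in zip(blocks, blocks[1:])]
    lines.append(blocks[-1][1])
    return lines


# Usage: pass only the words inside the table area, for example
# suggest_vertical_lines(page.crop(bbox).extract_words())
# where bbox is an (x0, top, x1, bottom) tuple you choose yourself.
```

A title or caption that spans several columns would merge them into one block, so the suggestions are only a starting point and should still be checked with `debug_tablefinder()`.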

In [None]:
"""
Objective: Extracting with explicit coordinates
"""
# x positions of the column boundaries, found by manually inspecting the extract_words() output
x_coordinates = [353, 396, 438, 480, 522, 554]

table = page.extract_table(table_settings={
    "vertical_strategy": "explicit",
    "explicit_vertical_lines": x_coordinates,
    "horizontal_strategy": "text",
})
table

# TODO: Clean this data to make it more readable
# TODO: Each page has 2 tables, and all of them should be merged into a single table. How would you do it? Explain in a comment below

In [None]:
"""
Objective: Extracting a non-table document (Optional)
"""

# TODO: Download this PDF file and place it in the current folder
download_this = "https://www.bps.go.id/id/publication/2023/09/29/8c2d8435fe0c552c6ffdc528/direktori-industri-manufaktur-indonesia-2023.html"

# TODO: Extract information you can get from this document

### **Reflection**
What challenges might arise when extracting text from different documents?

(answer here)

### **Exploration**
For further learning, explore the integration of Optical Character Recognition (OCR) to extract text from scanned PDFs or those with image-based content. Consider using the **pytesseract** module alongside libraries like **Pillow** for image processing.
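
As a starting point, the sketch below renders a pdfplumber page to a Pillow image and passes it to `pytesseract.image_to_string`. The helper `ocr_page` and its `image_to_string` parameter are hypothetical; the default path assumes that both the pytesseract package and the Tesseract program are installed.

```python
def ocr_page(page, resolution=300, image_to_string=None):
    """Run OCR on one pdfplumber page and return the recognized text.

    `image_to_string` defaults to pytesseract.image_to_string. Passing a
    different callable makes it possible to swap in another OCR engine.
    """
    if image_to_string is None:
        import pytesseract  # imported lazily so the helper loads without it
        image_to_string = pytesseract.image_to_string
    image = page.to_image(resolution=resolution).original  # Pillow image of the page
    return image_to_string(image)


# Usage (assumes a scanned PDF named scanned.pdf in the current folder):
# import pdfplumber
# with pdfplumber.open("scanned.pdf") as pdf:
#     print(ocr_page(pdf.pages[0]))
```

Higher resolutions usually help recognition but make rendering slower, so the default of 300 is only a reasonable first guess.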