Replies: 5 comments 4 replies
-
A typical "Discussions" item, so let me convert this first.
-
There is no reliable, foolproof way to identify headers or footers: it is all just text on the page.
So as soon as you can provide rules that filter text in this way, removal itself is easy-peasy in PyMuPDF using redaction annotations.
-
Since different PDFs have different headers and footers, with varying positions and margins, hard-coded rule-based algorithms are not recommended. Two example results of the categorization (headers/footers vs. body text, without any tuning) are attached as plots. Note that there are only two clusters because I take the cluster with the most points as the "body text" cluster and merge all the others into a single "headers/footers" cluster.

Here are my scripts; you may need to fill in your own real-world values in a few places.
```python
from collections import Counter
from sklearn.cluster import DBSCAN
import numpy as np


class PDFTextBlockCategorizer:
    def __init__(self, blocks):
        self.blocks = blocks

    def run(self):
        # Feature vector per block: bounding box plus text length
        X = np.array(
            [(x0, y0, x1, y1, len(text)) for x0, y0, x1, y1, text in self.blocks]
        )
        dbscan = DBSCAN()
        dbscan.fit(X)
        labels = dbscan.labels_
        self.n_clusters = len(np.unique(labels))
        label_counter = Counter(labels)
        most_common_label = label_counter.most_common(1)[0][0]
        # Densest cluster = body text (0); merge everything else into 1
        labels = [0 if label == most_common_label else 1 for label in labels]
        self.labels = labels
        print(f"{self.n_clusters} clusters for {len(self.blocks)} blocks")
```

and

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import fitz
from pathlib import Path
from itertools import islice
from utils.categorizer import PDFTextBlockCategorizer


class PDFExtractor:
    pdf_root = Path("...")  # fill in your PDF directory

    def __init__(self):
        pdf_filename = "***.pdf"  # fill in your PDF filename
        self.pdf_fullpath = self.pdf_root / pdf_filename
        self.pdf_doc = fitz.open(self.pdf_fullpath)

    def calc_rect_center(self, rect, reverse_y=False):
        if reverse_y:
            x0, y0, x1, y1 = rect[0], -rect[1], rect[2], -rect[3]
        else:
            x0, y0, x1, y1 = rect
        x_center = (x0 + x1) / 2
        y_center = (y0 + y1) / 2
        return (x_center, y_center)

    def extract_all_text_blocks(self):
        # * https://pymupdf.readthedocs.io/en/latest/textpage.html#TextPage.extractBLOCKS
        rect_centers = []
        rects = []
        visual_label_texts = []
        categorize_vectors = []
        for page_idx, page in islice(enumerate(self.pdf_doc), len(self.pdf_doc)):
            blocks = page.get_text("blocks")
            page_cnt = page_idx + 1
            print(f"=== Start Page {page_cnt}: {len(blocks)} blocks ===")
            for block in blocks:
                block_rect = block[:4]  # (x0, y0, x1, y1)
                x0, y0, x1, y1 = block_rect
                rects.append(block_rect)
                block_text = block[4]
                block_num = block[5]
                block_cnt = block_num + 1
                rect_center = self.calc_rect_center(block_rect, reverse_y=True)
                rect_centers.append(rect_center)
                visual_label_text = f"({page_cnt}.{block_cnt})"
                visual_label_texts.append(visual_label_text)
                block_type = "text" if block[6] == 0 else "image"
                print(f"Block: {page_cnt}.{block_cnt}")
                print(f"<{block_type}> {rect_center} - {block_rect}")
                print(block_text)
                categorize_vectors.append((*block_rect, block_text))
            print(f"=== End Page {page_cnt}: {len(blocks)} blocks ===\n")

        categorizer = PDFTextBlockCategorizer(categorize_vectors)
        categorizer.run()

        # Visualize the blocks, colored by cluster label
        fig, ax = plt.subplots()
        colors = ["b", "r", "g", "c", "m", "y", "k"]
        for i, rect_center in enumerate(rect_centers):
            label_idx = categorizer.labels[i]
            color = colors[label_idx]
            x0, y0, x1, y1 = rects[i]
            rect = Rectangle((x0, -y0), x1 - x0, -y1 + y0, fill=False, edgecolor=color)
            ax.add_patch(rect)
            x, y = rect_center
            plt.scatter(x, y, color=color)
            plt.annotate(visual_label_texts[i], rect_center)
        plt.show()

    def run(self):
        self.extract_all_text_blocks()


if __name__ == "__main__":
    pdf_extractor = PDFExtractor()
    pdf_extractor.run()
```

Run the script and you should get results similar to mine.
-
I've implemented an effective solution to remove headers/footers. You can check it out here: https://medium.com/@hussainshahbazkhawaja/paper-implementation-header-and-footer-extraction-by-page-association-3a499b2552ae
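The core idea behind page association can be sketched in pure Python: a line that recurs, nearly verbatim, as the first or last line on neighboring pages is probably a header or footer. Everything below (the function names, the 0.8 similarity threshold, the list-of-lines page structure) is my own illustrative assumption, not code from the linked article:

```python
from difflib import SequenceMatcher


def is_repeating(line_a, line_b, threshold=0.8):
    # Fuzzy match, so "Page 3" vs "Page 4" still counts as a repeat
    return SequenceMatcher(None, line_a, line_b).ratio() >= threshold


def strip_headers_footers(pages, threshold=0.8):
    """pages: list of pages, each a list of text lines."""
    cleaned = []
    for i, lines in enumerate(pages):
        # Compare against the previous and next page, where they exist
        neighbors = [pages[j] for j in (i - 1, i + 1) if 0 <= j < len(pages)]
        drop_first = any(
            n and lines and is_repeating(lines[0], n[0], threshold) for n in neighbors
        )
        drop_last = any(
            n and lines and is_repeating(lines[-1], n[-1], threshold) for n in neighbors
        )
        start = 1 if drop_first else 0
        end = len(lines) - 1 if drop_last and len(lines) > start else len(lines)
        cleaned.append(lines[start:end])
    return cleaned
```

A real implementation would also compare line positions on the page, not just text, but the fuzzy cross-page comparison is the essential trick.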
-
I am also working on a solution that defines the text-box boundaries while excluding headers and footers: https://github.com/mirix/retrieval-augmented-generation/blob/main/test_hdbscan_fitz.py It was tested on the following document: https://links.imagerelay.com/cdn/2958/ql/general-terms-and-conditions-sqbe-en A few hints are given in the script for building a more robust solution. It uses HDBSCAN, but DBSCAN is probably a better fit.
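On the DBSCAN side, the essential trick is the same as in the categorizer above: the densest cluster of block rectangles is the body, and everything else is boilerplate. A toy, self-contained illustration (all coordinates and DBSCAN parameters below are made up for the demo, not tuned values from either linked script):

```python
from collections import Counter

import numpy as np
from sklearn.cluster import DBSCAN

# Fake (x0, y0, x1, y1) block rectangles: five identical header blocks
# near the top of the page, seven body blocks flowing down the page.
blocks = np.array(
    [[72, 20, 300, 35]] * 5
    + [[72, y, 500, y + 90] for y in range(100, 800, 110)]
)

labels = DBSCAN(eps=160, min_samples=3).fit_predict(blocks)
body_label = Counter(labels).most_common(1)[0][0]  # biggest cluster = body
is_body = labels == body_label
print(is_body)
```

With real documents the `eps`/`min_samples` values need tuning per layout, which is exactly why a density-based method beats hard-coded margins.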
-
🤔 Is your feature request related to a problem? Please describe.
Most AI models are not trained on PDF data, since parsing it is difficult. I'm working on a PDF parsing project that removes tables, charts, headers, etc., so that extraction libraries like PyMuPDF can improve significantly.
I've solved table removal; now I would love to solve header removal.
💡 Describe the solution you'd like
Can we remove headers/footers from PDFs so that the output of `page.get_text()` is cleaner?