Replies: 5 comments 4 replies
-
A typical "Discussions" item, so let me convert this first.
-
There is no reliable, foolproof way to identify headers or footers: it is all just text on the page.
So as soon as you can provide rules that filter text in this way, removal itself is easy-peasy in PyMuPDF using redaction annotations.
-
Since different PDFs have different headers and footers, with varying positions and margins, hard-coded rule-based algorithms are not recommended. Two example results of the categorization (headers/footers vs. body text, without any tuning) are attached as plots. Note that there are only two clusters because I take the cluster with the most points as the "body text" cluster and merge all the others into a single "headers/footers" cluster.

Here are my scripts; you may need to fill in your own real-world values in a few places.
```python
from collections import Counter
from sklearn.cluster import DBSCAN
import numpy as np


class PDFTextBlockCategorizer:
    def __init__(self, blocks):
        self.blocks = blocks

    def run(self):
        # Feature vector per block: bounding box plus text length
        X = np.array(
            [(x0, y0, x1, y1, len(text)) for x0, y0, x1, y1, text in self.blocks]
        )
        dbscan = DBSCAN()
        dbscan.fit(X)
        labels = dbscan.labels_
        self.n_clusters = len(np.unique(labels))
        label_counter = Counter(labels)
        most_common_label = label_counter.most_common(1)[0][0]
        # Densest cluster = body text (0); merge everything else into 1
        labels = [0 if label == most_common_label else 1 for label in labels]
        self.labels = labels
        print(f"{self.n_clusters} clusters for {len(self.blocks)} blocks")
```

and

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import fitz
from pathlib import Path
from itertools import islice
from utils.categorizer import PDFTextBlockCategorizer


class PDFExtractor:
    pdf_root = Path("...")  # fill in your PDF directory

    def __init__(self):
        pdf_filename = "***.pdf"  # fill in your PDF filename
        self.pdf_fullpath = self.pdf_root / pdf_filename
        self.pdf_doc = fitz.open(self.pdf_fullpath)

    def calc_rect_center(self, rect, reverse_y=False):
        if reverse_y:
            x0, y0, x1, y1 = rect[0], -rect[1], rect[2], -rect[3]
        else:
            x0, y0, x1, y1 = rect
        x_center = (x0 + x1) / 2
        y_center = (y0 + y1) / 2
        return (x_center, y_center)

    def extract_all_text_blocks(self):
        # * https://pymupdf.readthedocs.io/en/latest/textpage.html#TextPage.extractBLOCKS
        rect_centers = []
        rects = []
        visual_label_texts = []
        categorize_vectors = []
        for page_idx, page in islice(enumerate(self.pdf_doc), len(self.pdf_doc)):
            blocks = page.get_text("blocks")
            page_cnt = page_idx + 1
            print(f"=== Start Page {page_cnt}: {len(blocks)} blocks ===")
            for block in blocks:
                block_rect = block[:4]  # (x0, y0, x1, y1)
                x0, y0, x1, y1 = block_rect
                rects.append(block_rect)
                block_text = block[4]
                block_num = block[5]
                block_cnt = block_num + 1
                rect_center = self.calc_rect_center(block_rect, reverse_y=True)
                rect_centers.append(rect_center)
                visual_label_text = f"({page_cnt}.{block_cnt})"
                visual_label_texts.append(visual_label_text)
                block_type = "text" if block[6] == 0 else "image"
                print(f"Block: {page_cnt}.{block_cnt}")
                print(f"<{block_type}> {rect_center} - {block_rect}")
                print(block_text)
                categorize_vectors.append((*block_rect, block_text))
            print(f"=== End Page {page_cnt}: {len(blocks)} blocks ===\n")

        categorizer = PDFTextBlockCategorizer(categorize_vectors)
        categorizer.run()

        # Visualize the blocks, colored by cluster label
        fig, ax = plt.subplots()
        colors = ["b", "r", "g", "c", "m", "y", "k"]
        for i, rect_center in enumerate(rect_centers):
            label_idx = categorizer.labels[i]
            color = colors[label_idx]
            x0, y0, x1, y1 = rects[i]
            rect = Rectangle((x0, -y0), x1 - x0, -y1 + y0, fill=False, edgecolor=color)
            ax.add_patch(rect)
            x, y = rect_center
            plt.scatter(x, y, color=color)
            plt.annotate(visual_label_texts[i], rect_center)
        plt.show()

    def run(self):
        self.extract_all_text_blocks()


if __name__ == "__main__":
    pdf_extractor = PDFExtractor()
    pdf_extractor.run()
```

Run the script and you should get results similar to mine.
-
I've implemented an effective solution to remove headers/footers. You can check it out here: https://medium.com/@hussainshahbazkhawaja/paper-implementation-header-and-footer-extraction-by-page-association-3a499b2552ae
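The core idea behind page association can be sketched in pure Python: a line that recurs, nearly verbatim, as the first or last line on neighboring pages is probably a header or footer. Everything below (the function names, the 0.8 similarity threshold, the list-of-lines page structure) is my own illustrative assumption, not code from the linked article:

```python
from difflib import SequenceMatcher


def is_repeating(line_a, line_b, threshold=0.8):
    # Fuzzy match, so "Page 3" vs "Page 4" still counts as a repeat
    return SequenceMatcher(None, line_a, line_b).ratio() >= threshold


def strip_headers_footers(pages, threshold=0.8):
    """pages: list of pages, each a list of text lines."""
    cleaned = []
    for i, lines in enumerate(pages):
        # Compare against the previous and next page, where they exist
        neighbors = [pages[j] for j in (i - 1, i + 1) if 0 <= j < len(pages)]
        drop_first = any(
            n and lines and is_repeating(lines[0], n[0], threshold) for n in neighbors
        )
        drop_last = any(
            n and lines and is_repeating(lines[-1], n[-1], threshold) for n in neighbors
        )
        start = 1 if drop_first else 0
        end = len(lines) - 1 if drop_last and len(lines) > start else len(lines)
        cleaned.append(lines[start:end])
    return cleaned
```

A real implementation would also compare line positions on the page, not just text, but the fuzzy cross-page comparison is the essential trick.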
-
I am also working on a solution that defines the text-box boundaries while excluding headers and footers: https://github.com/mirix/retrieval-augmented-generation/blob/main/test_hdbscan_fitz.py It was tested on the following document: https://links.imagerelay.com/cdn/2958/ql/general-terms-and-conditions-sqbe-en A few hints are given in the script for building a more robust solution. It uses HDBSCAN, but DBSCAN is probably a better fit.
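On the DBSCAN side, the essential trick is the same as in the categorizer above: the densest cluster of block rectangles is the body, and everything else is boilerplate. A toy, self-contained illustration (all coordinates and DBSCAN parameters below are made up for the demo, not tuned values from either linked script):

```python
from collections import Counter

import numpy as np
from sklearn.cluster import DBSCAN

# Fake (x0, y0, x1, y1) block rectangles: five identical header blocks
# near the top of the page, seven body blocks flowing down the page.
blocks = np.array(
    [[72, 20, 300, 35]] * 5
    + [[72, y, 500, y + 90] for y in range(100, 800, 110)]
)

labels = DBSCAN(eps=160, min_samples=3).fit_predict(blocks)
body_label = Counter(labels).most_common(1)[0][0]  # biggest cluster = body
is_body = labels == body_label
print(is_body)
```

With real documents the `eps`/`min_samples` values need tuning per layout, which is exactly why a density-based method beats hard-coded margins.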
-
🤔 Is your feature request related to a problem? Please describe.
Most AI models are not trained on PDF data, since parsing it is difficult. I'm working on a PDF parsing project that removes tables, charts, headers, etc., so that extraction libraries like PyMuPDF can improve significantly.
I've solved table removal; now I would love to solve header removal.
💡 Describe the solution you'd like
Can we remove headers/footers from PDFs so that the output of `page.get_text()` is cleaner?