Redaction Annotation Fill Not Matching Up With Redacted Section #3575

Closed
lyon-tonic opened this issue Jun 13, 2024 · 4 comments

@lyon-tonic

Description of the bug

I am trying to redact words from a PDF, based on OCR-generated rectangles.

PyMuPDF has worked well for us, but I have run into a problem with a specific file that has some unusual properties (I've attached the file). The pages in this file are an unusual size (8.5 x 6.5 in) and some of them are rotated.

I would like to have the coordinates in the rectangles relative to the top left, but even before I do that, I have noticed that the redacted rectangle is not in the same place as the fill.

If this is not a bug, I would like to understand why these appear to be drawn in separate coordinate systems, and how to reconcile them.


How to reproduce the bug

This is a simple script that shows the problem in the files below:

Input:
input.pdf

Output:
output.pdf

import fitz  # PyMuPDF

def process_pdf(input_pdf_path, output_pdf_path):
    # Open the input PDF file
    document = fitz.open(input_pdf_path)
    
    # Iterate through each page
    for page_num in range(len(document)):
        page = document.load_page(page_num)  # load page
        
        # 234 is half of the width of the page
        rect = fitz.Rect(0, 0, 234, 234)

        redact_annot = page.add_redact_annot(rect)
        redact_annot.update(fill_color=(0, 0, 0))  # set fill color to black
        page.apply_redactions()
        page.insert_textbox(rect, f"Page {page_num + 1}", fontsize=12, fontname="helv", color=(1, 0, 0))


    document.save(output_pdf_path)

if __name__ == "__main__":
    input_pdf_path = "input.pdf"  # Replace with the path to your input PDF
    output_pdf_path = "output.pdf"  # Replace with the path to your output PDF
    
    process_pdf(input_pdf_path, output_pdf_path)
    print(f"Processed PDF saved to {output_pdf_path}")

PyMuPDF version

1.24.5

Operating system

Windows

Python version

3.11

@JorjMcKie
Collaborator

Inserting / Adding stuff to rotated pages can be confusing. For most methods in PyMuPDF you must pass rotated coordinates (for points, rectangles, ...) to get them in the right place.
I think this script does what you want:

import pymupdf as fitz  # PyMuPDF

RED = fitz.pdfcolor["red"]


def process_pdf(input_pdf_path, output_pdf_path):
    # Open the input PDF file
    document = fitz.open(input_pdf_path)

    # Iterate through each page
    for page in document:
        # 234 is half of the width of the page
        rect = fitz.Rect(0, 0, 234, 234)
        rot_rect = rect * page.derotation_matrix
        redact_annot = page.add_redact_annot(
            rot_rect, text=f"{page.number=}", text_color=RED
        )
        page.apply_redactions()

    document.ez_save(output_pdf_path)


if __name__ == "__main__":
    input_pdf_path = "input.pdf"  # Replace with the path to your input PDF
    output_pdf_path = "output.pdf"  # Replace with the path to your output PDF

    process_pdf(input_pdf_path, output_pdf_path)
    print(f"Processed PDF saved to {output_pdf_path}")
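
To make the coordinate conversion above more concrete, here is a pure-Python sketch of the geometry behind derotating a rectangle. It is only an illustration, not PyMuPDF's implementation: the function name `derotate_rect` and the page size are hypothetical, and it assumes a top-left origin, y pointing down, and clockwise page rotation in multiples of 90 degrees.

```python
def derotate_rect(rect, rotation, width, height):
    """Map a rectangle from visible (rotated) to unrotated page coordinates.

    rect is (x0, y0, x1, y1); width and height are the unrotated page size.
    Assumes a top-left origin, y pointing down, and clockwise page rotation.
    """
    def derotate_point(u, v):
        if rotation == 0:
            return (u, v)
        if rotation == 90:
            return (v, height - u)
        if rotation == 180:
            return (width - u, height - v)
        if rotation == 270:
            return (width - v, u)
        raise ValueError("rotation must be 0, 90, 180 or 270")

    x0, y0, x1, y1 = rect
    ax, ay = derotate_point(x0, y0)
    bx, by = derotate_point(x1, y1)
    # Corners can swap order after the transformation, so normalize them.
    return (min(ax, bx), min(ay, by), max(ax, bx), max(ay, by))


# Hypothetical unrotated page of 468 x 612 pt, displayed with rotation 90.
visible_rect = (0, 0, 234, 234)
unrotated_rect = derotate_rect(visible_rect, 90, 468, 612)
```

Under these assumptions, the top-left square of the rotated view ends up in the bottom-left corner of the unrotated page, which is why a rectangle passed without conversion can land in an unexpected place.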

@lyon-tonic
Author

Thanks for responding!

This addresses part of the problem, but it still does not fix the redact_annot fill. The fill rectangle appears to be rendered separately from the redact_annot, and I'm not sure why.

The black fill rect is not showing up here.

import pymupdf as fitz  # PyMuPDF

RED = fitz.pdfcolor["red"]


def process_pdf(input_pdf_path, output_pdf_path):
    # Open the input PDF file
    document = fitz.open(input_pdf_path)

    # Iterate through each page
    for page in document:
        # 234 is half of the width of the page
        rect = fitz.Rect(0, 0, 234, 234)
        rot_rect = rect * page.derotation_matrix
        redact_annot = page.add_redact_annot(
            rot_rect, text=f"{page.number=}", text_color=RED
        )
        redact_annot.update(fill_color=(0, 0, 0))  # set fill color to black
        page.apply_redactions()

    document.ez_save(output_pdf_path)


if __name__ == "__main__":
    input_pdf_path = "input.pdf"  # Replace with the path to your input PDF
    output_pdf_path = "output.pdf"  # Replace with the path to your output PDF

    process_pdf(input_pdf_path, output_pdf_path)
    print(f"Processed PDF saved to {output_pdf_path}")

@JorjMcKie
Collaborator

This file indeed does a few unexpected things!
Here is a complete solution that removes the page rotations.

import pymupdf as fitz  # PyMuPDF

RED = fitz.pdfcolor["red"]
BLACK = fitz.pdfcolor["black"]


def process_pdf(input_pdf_path, output_pdf_path):
    rect = fitz.Rect(0, 0, 234, 234)

    # Open the input PDF file
    src = fitz.open(input_pdf_path)
    doc = fitz.open()  # output file

    # Iterate through each page
    for src_page in src:
        # the output PDF will contain pages with rotation 0
        src_rect = src_page.rect
        w, h = src_rect.br
        src_rot = src_page.rotation
        src_page.set_rotation(0)
        # make output page having the visible dimension of the input
        page = doc.new_page(width=w, height=h)
        page.show_pdf_page(  # insert source page
            page.rect,
            src,
            src_page.number,
            rotate=-src_rot,  # reversed original rotation
        )
        
        # now we can redact in a worry-free manner
        redact_annot = page.add_redact_annot(
            rect, text=f"{page.number=}", text_color=RED, fill=BLACK
        )
        page.apply_redactions()

    doc.ez_save(output_pdf_path)


if __name__ == "__main__":
    input_pdf_path = "input.pdf"  # Replace with the path to your input PDF
    output_pdf_path = "output.pdf"  # Replace with the path to your output PDF

    process_pdf(input_pdf_path, output_pdf_path)
    print(f"Processed PDF saved to {output_pdf_path}")

@JorjMcKie
Collaborator

Closing the issue for lack of reaction.
