apply_redactions moves some unredacted text #3278

Closed
indigoviolet opened this issue Mar 19, 2024 · 11 comments
Labels
upstream bug bug outside this package

Comments

@indigoviolet

Description of the bug

Before:

image

After:

image

source.pdf
result_annotated.pdf


Notes:

  • a similar issue was previously reported in apply_redactions() moving text #2957
  • this is being reported on behalf of an Enterprise support customer (I can email details separately if needed).
  • We have noticed that some issues like this are resolved by using Preview.app to (re-)print the PDF with "scale to fit" turned on.

How to reproduce the bug

import fitz

source_doc = fitz.open('source.pdf')
dst_doc = fitz.open()
dst_doc.insert_pdf(source_doc, 0, 0)

redaction_coords = [(507.3806485, 35.06467372292002, 552.9736259804001, 24.838430185399943),
                    (212.5271666879999, 137.43544772292012, 258.12014416840003, 127.20920418540004),
                    (212.527166688, 129.4908031440001, 258.1201441684001, 139.2412260440001)]

page = dst_doc.load_page(0)

for coords in redaction_coords:
    redaction = page.add_redact_annot(
        coords,
        text="  ",
        fill=(0.5, 0.5, 0.5),  # grey
    )
    redaction.update()

fitz.utils.apply_redactions(page)

dst_doc.save("result.pdf")
dst_doc.close()

PyMuPDF version

1.23.26

Operating system

MacOS

Python version

3.11

@JorjMcKie JorjMcKie added the not a bug not a bug / user error / unable to reproduce label Mar 19, 2024
@JorjMcKie JorjMcKie reopened this Mar 19, 2024
@JorjMcKie
Collaborator

Sorry - I was wrong: got confused by the first two coords rectangles: they are empty and thus invalid.
Re-opened this issue.
It is an error in the base library, and I will submit an issue there.

The error is not specific to redactions: the page's /Contents stream is already destroyed by merely "cleaning" the PDF, a step that is part of the internal redaction process.
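
To illustrate why the first two coordinate tuples in the reproduction script count as empty, here is a minimal pure-Python sketch. It assumes the usual (x0, y0, x1, y1) convention, where a rectangle is empty when x0 >= x1 or y0 >= y1. The helper names `is_empty` and `normalize` are hypothetical stand-ins that are only meant to mirror the behaviour of `fitz.Rect.is_empty` and `fitz.Rect.normalize()`; they are not the library's implementation.

```python
from typing import Tuple

RectTuple = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def is_empty(rect: RectTuple) -> bool:
    """Assumed rule: empty (invalid) if x0 >= x1 or y0 >= y1."""
    x0, y0, x1, y1 = rect
    return x0 >= x1 or y0 >= y1


def normalize(rect: RectTuple) -> RectTuple:
    """Swap coordinates where needed so that x0 <= x1 and y0 <= y1."""
    x0, y0, x1, y1 = rect
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


# Coordinates copied from the reproduction script in this issue.
redaction_coords = [
    (507.3806485, 35.06467372292002, 552.9736259804001, 24.838430185399943),
    (212.5271666879999, 137.43544772292012, 258.12014416840003, 127.20920418540004),
    (212.527166688, 129.4908031440001, 258.1201441684001, 139.2412260440001),
]

for coords in redaction_coords:
    # The first two tuples have y0 > y1, so they are reported as empty.
    print(is_empty(coords), normalize(coords))
```

In the first two tuples y0 is larger than y1, which is why they are treated as empty until they are normalized.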

@JorjMcKie JorjMcKie added upstream bug bug outside this package and removed not a bug not a bug / user error / unable to reproduce labels Mar 19, 2024
@JorjMcKie
Collaborator

In this case, the PDF resulting from a PDF-to-PDF conversion will not cause a problem. So you can use this approach if you need an immediate circumvention:

import fitz

source_doc = fitz.open("source.pdf")
pdfdata = source_doc.convert_to_pdf()  # make a PDF/PDF conversion
# this will issue some error messages: obviously, some fonts contain errors.
source_doc = fitz.open("pdf", pdfdata)  # re-open source after conversion
dst_doc = fitz.open()
dst_doc.insert_pdf(source_doc, 0, 0)

redaction_coords = [
    (507.3806485, 35.06467372292002, 552.9736259804001, 24.838430185399943),
    (212.5271666879999, 137.43544772292012, 258.12014416840003, 127.20920418540004),
    (212.527166688, 129.4908031440001, 258.1201441684001, 139.2412260440001),
]

page = dst_doc.load_page(0)

for coords in redaction_coords:
    if fitz.Rect(coords).is_empty:
        print(f"normalizing empty rect {coords=}")
        coords = tuple(fitz.Rect(coords).normalize())
    redaction = page.add_redact_annot(
        coords,
        text="  ",
        fill=(0.5, 0.5, 0.5),  # grey
    )
    page.draw_rect(coords, color=(1, 0, 0))

page._apply_redactions(0, 0)

dst_doc.save("result.pdf")
dst_doc.close()

@JorjMcKie
Collaborator

Issue number in MuPDF's issue system: https://bugs.ghostscript.com/show_bug.cgi?id=707673.

@manjuadditive

manjuadditive commented Mar 19, 2024


@JorjMcKie
We are trying this solution, but we are getting too many "cannot create ToUnicode mapping" warnings, which make redaction harder. In such cases convert_to_pdf probably (can you confirm?) converts the underlying text to images. When we redact the area with the PDF_REDACT_IMAGE_REMOVE option, the images don't seem to get removed.

UPDATE: I don't think convert_to_pdf is creating images when "cannot create ToUnicode mapping" is encountered. But I am lost as to what the replacement object is, and it is certainly not getting redacted although it overlaps the redaction bounds.

@JorjMcKie
Collaborator

Document.convert_to_pdf() does not convert page content to images. However, as the documentation explains, only a subset of fonts is eligible for such a conversion.
And of course: internally all text and other content is rewritten to the new PDF, therefore any errors in that area are detected and complained about.

@manjuadditive

What happens if the conversion fails?

@JorjMcKie
Collaborator

An exception is raised. There is also a risk of incomplete or distorted output.
As I wrote: this is a possible circumvention for the time being,
not meant as a permanent solution.
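
Since a failed conversion raises an exception, a caller may want to guard the workaround. The following is only a sketch: `convert_or_none` is a hypothetical helper, it accepts any object that has a `convert_to_pdf()` method (such as a document opened with `fitz.open`), and it catches the broad `Exception` class because the thread does not say which exception type is raised.

```python
from typing import Optional


def convert_or_none(doc) -> Optional[bytes]:
    """Return the result of doc.convert_to_pdf(), or None if the call raises.

    `doc` is assumed to expose convert_to_pdf(), as a PyMuPDF document does.
    """
    try:
        return doc.convert_to_pdf()
    except Exception as exc:  # assumption: exact exception type is not documented here
        print(f"PDF-to-PDF conversion failed: {exc!r}")
        return None


# Possible usage with PyMuPDF (not executed here):
#   source_doc = fitz.open("source.pdf")
#   pdfdata = convert_or_none(source_doc)
#   if pdfdata is not None:
#       source_doc = fitz.open("pdf", pdfdata)
```

Note that this only covers the case where an exception is raised. Incomplete or distorted output, the other risk mentioned above, would not be detected by such a guard and still needs a visual or textual check.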

@indigoviolet
Author

@JorjMcKie do you need our customer details to fill in on the MuPDF issue so it gets the corresponding priority?

@JorjMcKie
Collaborator

Yes, please provide it if available.

@JorjMcKie
Collaborator

Received the customer ID - thank you!

@julian-smith-artifex-com
Collaborator

Fixed in 1.24.1.
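
For code that has to run against several installed versions, a simple version gate can decide whether the conversion workaround is still needed. This is a minimal sketch assuming that "1.24.1" refers to the PyMuPDF release and that the version string is purely numeric (for example the value of `fitz.VersionBind`); `version_tuple` and `has_fix` are hypothetical helper names.

```python
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Convert a numeric version string such as '1.23.26' to (1, 23, 26)."""
    return tuple(int(part) for part in version.split(".")[:3])


def has_fix(version: str, fixed_in: Tuple[int, ...] = (1, 24, 1)) -> bool:
    """True if `version` is at or above the release that contains the fix."""
    return version_tuple(version) >= fixed_in


# Possible usage with PyMuPDF (not executed here):
#   import fitz
#   if not has_fix(fitz.VersionBind):
#       ...apply the convert_to_pdf() workaround first...
```

Comparing integer tuples avoids the pitfall of plain string comparison, which would rank "1.24.10" below "1.24.2".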
