Page.delete_widget() doesn't fully remove the widget, other programs still detect the widgets #3478

Open
ag-gaphp opened this issue May 14, 2024 · 14 comments

Comments

@ag-gaphp

ag-gaphp commented May 14, 2024

Description of the bug

I am unable to completely delete widgets in any of my documents using Page.delete_widget(widget) and then Document.ez_save(). Although the resulting PDF looks like the fields are removed when viewed in a reader like Acrobat, the fields are still present in some lingering form and get picked up by other programs like eSign platforms.

This is causing me issues in my process, as I need to add proper signature fields in my documents created from LibreOffice, which doesn't have the ability to add signature fields to PDFs. What my app is doing is getting the box coords for the sig and init fields, deleting them from the PDF entirely, and then appending new signature fields using the box coords in pyHanko.
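The coordinate hand-off in this workflow can be sketched as follows. This is only an illustration, not part of either library: `to_pdf_box` is a hypothetical helper, and it assumes an unrotated page whose origin is at (0, 0) and a consumer that expects `(llx, lly, urx, ury)` with the y-axis pointing up.

```python
def to_pdf_box(rect, page_height):
    """Convert a top-left-origin rectangle to a bottom-left-origin box.

    rect is (x0, y0, x1, y1) with y growing downward (as PyMuPDF reports it);
    the result is (llx, lly, urx, ury) with y growing upward.
    Hypothetical helper; assumes an unrotated page with origin (0, 0).
    """
    x0, y0, x1, y1 = rect
    left, right = min(x0, x1), max(x0, x1)
    top, bottom = min(y0, y1), max(y0, y1)
    # The lower edge in the flipped system comes from the larger top-down y.
    return (left, page_height - bottom, right, page_height - top)


# Example: a 100 x 30 field placed 20 units below the top of an 800-unit-tall page.
box = to_pdf_box((10, 20, 110, 50), 800)
```

Normalizing with `min`/`max` keeps the lower-left corner first even if the input rectangle was stored with its corners swapped.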

My only workaround for the moment is to rename the fields I want to delete so that pyHanko creates brand-new fields with new names. The problem with this, however, is that an extra text field sits underneath my signatures when I upload the documents to an eSign platform, even though PyMuPDF does not detect that the fields are there.

There must be some remaining aspect of the deleted fields that is still present after the delete and save. I thought it was an issue in pyHanko for a while, but now I am seeing these "deleted" fields pop up in 4 different eSign platforms when importing fields automatically from the PDF. This leads me to believe that there is something not being removed from the PDF when a widget is deleted with PyMuPDF.

Note: the widgets being removed are basic text fields with nothing special about them other than a specific name scheme. I am able to affect everything else about these fields in PyMuPDF like the name, flags, etc. without issue.

How to reproduce the bug

My Process:

  1. Open file
  2. Iterate each page and find widgets that start with "sig" or "init"
  3. Save their box dimensions for later and mark the widget for removal
  4. Save the document with garbage > 0
  5. Close the document
  6. View PDF in Acrobat
  7. Upload to eSign platform and import fields automatically

Expected:

  • Fields are not present when viewed in Acrobat
  • Fields are not present when automatically imported using an eSign platform

Actual:

  • Fields are not present when viewed in Acrobat
  • Fields are present when automatically imported using an eSign platform

Screenshot from signNow after uploading deleted_signatures.pdf and having their system auto-import fields:
image

The same location in the same file, viewed through Acrobat, looks like there is no field there:
image

Notes:

  • I've confirmed this behavior on the following eSign platforms: DocuSign, signNow, OneSpan Sign, and RightSignature
  • I've uploaded 2 files. Initial and signature fields have either a light-yellow or light-red box that marks their position in these files.
  • Here is my test script. It's an isolated function from my overall app that handles removing the widgets. If you run this against the attached PDF example file, you'll see that PyMuPDF does not report that it found the fields after deleting them and re-opening the file. However, when I upload to one of the above platforms, or try to use pyHanko, the fields are still present in the file somewhere.
import fitz, os

OLD_FILE="exported_from_libreoffice.pdf"
NEW_FILE="deleted_signatures.pdf"

if os.path.exists(NEW_FILE):
    os.remove(NEW_FILE)

_fdoc = fitz.open(OLD_FILE)

# iterate the pages
for page in _fdoc:
    # iterate the fields on this page
    for field in page.widgets():
        n = field.field_name
        # if it's a signature, remove it
        if n.startswith("sig") or n.startswith("init"):
            page.delete_widget(field)

# save the document updates
_fdoc.ez_save(NEW_FILE)
_fdoc.close()

# now re-opening the document to check if the fields I removed are still there or not according to PyMuPDF
_check_doc = fitz.open(NEW_FILE)

# iterate the pages again
for page in _check_doc:
    # iterate the fields on this page
    for field in page.widgets():
        n = field.field_name
        # if it's a signature, print to console
        if n.startswith("sig") or n.startswith("init"):
            print(f"...'{n}' is still present")

_check_doc.close()

PyMuPDF version

1.24.3

Operating system

Windows

Python version

3.12

@JorjMcKie
Collaborator

This is not a bug!
As always with iteration, changing the collection while you are iterating over it is error-prone.
The following modification works:

import pymupdf as fitz, os

OLD_FILE = "exported_from_libreoffice.pdf"
NEW_FILE = "deleted_signatures2.pdf"

if os.path.exists(NEW_FILE):
    os.remove(NEW_FILE)

_fdoc = fitz.open(OLD_FILE)

# iterate the pages
for page in _fdoc:
    # iterate the fields on this page
    # BUT: use widget xrefs for iteration!!!
    xrefs = [w.xref for w in page.widgets()]
    for xref in xrefs:
        field = page.load_widget(xref)
        n = field.field_name
        # if it's a signature, remove it
        if n.startswith("sig") or n.startswith("init"):
            page.delete_widget(field)

# save the document updates
_fdoc.ez_save(NEW_FILE)
_fdoc.close()

# now re-opening the document to check if the fields I removed are still there or not according to PyMuPDF
_check_doc = fitz.open(NEW_FILE)

# iterate the pages again
for page in _check_doc:
    # iterate the fields on this page
    for field in page.widgets():
        n = field.field_name
        # if it's a signature, print to console
        if n.startswith("sig") or n.startswith("init"):
            print(f"...'{n}' is still present")

_check_doc.close()

Another safe way to iterate:

field = page.first_widget
while field:
    n = field.field_name
    if n.startswith("sig") or n.startswith("init"):
        # delete_widget() returns the widget that follows the deleted one
        field = page.delete_widget(field)
    else:
        field = field.next
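As a generic illustration of why mutating a collection during iteration is error-prone, here is a sketch with plain Python lists. It is an analogy only: the names are made up, and PyMuPDF's widget chain is not a Python list, so the exact failure mode there may differ.

```python
names = ["sig1", "sig2", "text1", "init1"]

# Unsafe: removing from the list being iterated shifts the remaining items,
# so the element right after each removal is skipped.
unsafe = list(names)
for name in unsafe:
    if name.startswith(("sig", "init")):
        unsafe.remove(name)

# Safe: iterate over a snapshot and mutate the original.
safe = list(names)
for name in list(safe):
    if name.startswith(("sig", "init")):
        safe.remove(name)

print(unsafe)  # "sig2" survives because it was skipped
print(safe)    # only "text1" remains
```

Collecting the xrefs first, as in the modified script, plays the same role as the snapshot here.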

@JorjMcKie added the "not a bug" (not a bug / user error / unable to reproduce) label May 16, 2024
@ag-gaphp
Author

Unfortunately, both of the presented solutions give me the same end result as my original bug post. Even when I use the xrefs to iterate as you suggested in the modification, the fields that I mark for deletion are still present in some way in the file. The eSign platforms and pyHanko are still able to find them and try to utilize them as form fields, even when Acrobat doesn't display them.

Is there something extra I can do while saving the file to make sure that any lingering references are removed from the xref table?

@JorjMcKie
Collaborator

But the file check iteration shows that they are gone!
A signature is often associated with an image. Are those images what is bothering you?

@JorjMcKie
Collaborator

I see that the result PDF has some yellow and red rectangle graphics where there were fields before. These have nothing to do with widgets / fields - they are just vector graphics.
If you don't want them either, you must remove them separately.

@ag-gaphp
Author

The graphic boxes are meant to stay, it's only the widgets/fields that I am trying to remove.

I understand that PyMuPDF says they are gone. What confuses me is this: if they are gone, how are the other programs able to see the removed widgets and their names/dimensions? In pyHanko, I can still retrieve that info even if PyMuPDF says they are not there, and the eSign platforms do the same. I must be misunderstanding something, or there is something off about the way LibreOffice is setting up the widgets in the first place. I'm not sure how to troubleshoot that at the moment.

@JorjMcKie
Collaborator

Attached my output
deleted_signatures2.pdf
On which page do you still see deleted widgets?

@ag-gaphp
Author

ag-gaphp commented May 17, 2024

First, my apologies for not doing this initially. There might be a slight language barrier and I didn't give the most complete example that I could have given.

I just tried your PDF and I get the same results. When I upload to an eSign platform, the removed fields are detected during the import. When I attempt to add signatures in those same locations with pyHanko, it also complains that the fields already exist in the PDF.

Here is my complete testing script for this that prints out some data along the way. This is a watered down version of what my tool is doing. First it stores the coords of signature/initial fields, deletes them, then uses pyHanko to add proper signature fields.

requirements.txt

pymupdf
pyhanko

test.py

import fitz, os
from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter
from pyhanko.sign.fields import SigFieldSpec, append_signature_field

OLD_FILE="exported_from_libreoffice.pdf"
NEW_FILE="deleted_signatures.pdf"

if os.path.exists(NEW_FILE):
    os.remove(NEW_FILE)

boxes = {}

doc = fitz.open(OLD_FILE)

# iterate the pages
print("Removing fields with PyMuPDF")
for page in doc:
    # store the page's height for placement
    _page_rect = page.bound()
    _page_height = _page_rect.y1

    # iterate the fields on this page
    field = page.first_widget
    while field:
        n = field.field_name

        # if it's a signature, remove it
        if n.startswith("sig") or n.startswith("init"):
            # PyMuPDF y coords go top-to-bottom, but pyHanko goes bottom-to-top
            # Subtract the y coords from the current page height for pyHanko
            boxes[n] = {
                "page": page.number,
                "box": (
                    field.rect.x0,
                    _page_height-field.rect.y0,
                    field.rect.x1,
                    _page_height-field.rect.y1
                )
            }
            print("Removing field: ", n)
            field = page.delete_widget(field)

        else:
            field = field.next

# save the document updates
doc.ez_save(NEW_FILE)
doc.close()

# now re-opening the document to check if the fields I removed are still there or not
check_doc = fitz.open(NEW_FILE)

# iterate the pages again
print("Checking PDF for removed fields with PyMuPDF")
found = 0
for page in check_doc:
    # iterate the fields on this page
    field = page.first_widget
    while field:
        n = field.field_name
        # if it's a signature, print to console
        if n.startswith("sig") or n.startswith("init"):
            found += 1
            print(f"...'{n}' is still present")
        field = field.next

if found == 0:
    print("PyMuPDF did not find any fields")

check_doc.close()

# now let's try to use pyHanko to add new signatures
# if we find that a field already exists, print the error
print("Adding signatures to new PDF with pyHanko")
found = 0
with open(NEW_FILE, 'rb+') as sig_doc:
    writer = IncrementalPdfFileWriter(sig_doc, strict=False)
    for name in boxes.keys():
        _dict = boxes[name]
        try:
            append_signature_field(writer, SigFieldSpec(
                                        sig_field_name=name,
                                        on_page=_dict["page"],
                                        box=_dict["box"]
                                    ))

        except Exception as e:
            found += 1
            print("ERROR: ", e)

    writer.write_in_place()

if found > 0:
    print(f"pyHanko found {found} fields")

Even when I run only the last for loop on the PDF you uploaded, I get the same results where pyHanko can still see the fields.

Example of the full output that I see when I run this against my originally uploaded exported_from_libreoffice.pdf:

Removing fields with PyMuPDF
Removing field:  init0
Removing field:  init1
Removing field:  init2
Removing field:  init3
Removing field:  init4
Removing field:  init5
Removing field:  init6
Removing field:  init7
Removing field:  init8
Removing field:  init9
Removing field:  init10
Removing field:  init11
Removing field:  init12
Removing field:  init13
Removing field:  init14
Removing field:  init15
Removing field:  init16
Removing field:  init17
Removing field:  init18
Removing field:  init19
Removing field:  init20
Removing field:  init21
Removing field:  init22
Removing field:  init23
Removing field:  init24
Removing field:  init25
Removing field:  init26
Removing field:  init27
Removing field:  init28
Removing field:  sigPrimary1
Removing field:  init29
Removing field:  sigSecondary1
Checking PDF for removed fields with PyMuPDF
PyMuPDF did not find any fields
Adding signatures to new PDF with pyHanko
ERROR:  Field with name init0 exists but is not a signature field
ERROR:  Field with name init1 exists but is not a signature field
ERROR:  Field with name init2 exists but is not a signature field
ERROR:  Field with name init3 exists but is not a signature field
ERROR:  Field with name init4 exists but is not a signature field
ERROR:  Field with name init5 exists but is not a signature field
ERROR:  Field with name init6 exists but is not a signature field
ERROR:  Field with name init7 exists but is not a signature field
ERROR:  Field with name init8 exists but is not a signature field
ERROR:  Field with name init9 exists but is not a signature field
ERROR:  Field with name init10 exists but is not a signature field
ERROR:  Field with name init11 exists but is not a signature field
ERROR:  Field with name init12 exists but is not a signature field
ERROR:  Field with name init13 exists but is not a signature field
ERROR:  Field with name init14 exists but is not a signature field
ERROR:  Field with name init15 exists but is not a signature field
ERROR:  Field with name init16 exists but is not a signature field
ERROR:  Field with name init17 exists but is not a signature field
ERROR:  Field with name init18 exists but is not a signature field
ERROR:  Field with name init19 exists but is not a signature field
ERROR:  Field with name init20 exists but is not a signature field
ERROR:  Field with name init21 exists but is not a signature field
ERROR:  Field with name init22 exists but is not a signature field
ERROR:  Field with name init23 exists but is not a signature field
ERROR:  Field with name init24 exists but is not a signature field
ERROR:  Field with name init25 exists but is not a signature field
ERROR:  Field with name init26 exists but is not a signature field
ERROR:  Field with name init27 exists but is not a signature field
ERROR:  Field with name init28 exists but is not a signature field
ERROR:  Field with name sigPrimary1 exists but is not a signature field
ERROR:  Field with name init29 exists but is not a signature field
ERROR:  Field with name sigSecondary1 exists but is not a signature field
pyHanko found 32 fields

I mentioned this before, but it's possible that something in LibreOffice is part of my issue. I'm trying to do some more testing on that today when I can to see if I can figure anything out.

My bad if I'm annoying you, but I'm just baffled at how the other apps are able to still detect the fields if they are removed from the PDF. I will admit it could completely be a misunderstanding on my part about how things are done in PDFs, but from a basic logic standpoint it seems like some sort of reference to the deleted fields remain in the PDF.

@ag-gaphp
Author

ag-gaphp commented May 21, 2024

I've tracked this down a little more specifically. Looks like the delete method deletes the widget from the annotations list of the PDF, but it remains in the fields list as an object. When pyHanko and the eSign platforms iterate, they are using the fields list and not the annotations list, so this is why it seems like the fields still remain in the PDF, but they don't appear in regular viewers.

Would it not make sense to remove all references to a widget from the PDF entirely if it's meant to be deleted? Or is that operation not possible in PDF for some reason?
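One way to check for this kind of leftover is to compare the document-level field list with the widgets that pages still expose. The sketch below is hedged: `parse_ref_array` and `orphaned_fields` are hypothetical helpers, and it assumes a flat /Fields array in which every entry is itself the widget object (no parent/kid hierarchy).

```python
import re

# Matches one indirect reference such as "12 0 R".
_REF = re.compile(r"(\d+)\s+(\d+)\s+R")


def parse_ref_array(value):
    """Return the object numbers in a PDF array of indirect references."""
    return [int(num) for num, _gen in _REF.findall(value)]


def orphaned_fields(fields_value, page_widget_xrefs):
    """Field xrefs listed in the /Fields array that no page shows as a widget."""
    on_pages = set(page_widget_xrefs)
    return [x for x in parse_ref_array(fields_value) if x not in on_pages]


# Possible use with PyMuPDF (assumes the installed version accepts a
# path-style key in xref_get_key):
#   kind, value = doc.xref_get_key(doc.pdf_catalog(), "AcroForm/Fields")
#   on_pages = [w.xref for page in doc for w in page.widgets()]
#   print(orphaned_fields(value, on_pages))
```

A non-empty result would point to field objects that are no longer on any page but can still be found by tools that read the field list.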

@JorjMcKie
Collaborator

> I've tracked this down a little more specifically. Looks like the delete method deletes the widget from the annotations list of the PDF, but it remains in the fields list as an object. When pyHanko and the eSign platforms iterate, they are using the fields list and not the annotations list, so this is why it seems like the fields still remain in the PDF, but they don't appear in regular viewers.
>
> Would it not make sense to remove all references to a widget from the PDF entirely if it's meant to be deleted? Or is that operation not possible in PDF for some reason?

I was beginning to suspect something like this. In my experience, about half of the PDF viewers I use look only at the page's /Annots list and disregard anything else.
The others do indeed still look at the central array of form fields (the /Fields array under /AcroForm in the PDF catalog).

We should probably make sure to either remove the entry there as well (perfectly possible) or empty the PDF object definition.

Let me re-open this issue as an enhancement request.

@JorjMcKie reopened this May 21, 2024
@JorjMcKie added the "enhancement" label and removed the "not a bug" (not a bug / user error / unable to reproduce) label May 21, 2024
@JorjMcKie
Collaborator

Here is an upfront solution - I hope.
The trick is to empty the object definition of the field.
None of the viewers I tried still see a field that has been treated like this.
Uploading test2.zip…

@ag-gaphp
Author

There might have been a GitHub malfunction when you posted the comment; it looks like it is linking back to this issue instead of a zip file. I'll definitely check it out when I can download it.

Is it a dev version of the module, or code examples on how to empty an object definition using existing methods?

And thanks for bearing with me on this. I know my terminology isn't completely accurate, I'm still learning PDF/python and clearly have a ways to go.

@JorjMcKie
Collaborator

My internet connection is terrible at the moment. So the ZIP upload was interrupted / incomplete. It is actually 2 statements only that you must add:

  • Before deleting the field do xref = field.xref to get the xref
  • After deleting the field do _fdoc.update_object(xref, "<<>>"). This will empty the field's object definition. It effectively will no longer be a field and everything reading it will not gain any information from it.
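Put together, the two statements above could look like the sketch below. `delete_widgets_completely` and its `should_delete` predicate are hypothetical names introduced for illustration; the function only relies on `first_widget`, `delete_widget`, `next`, `xref` and `update_object` as they are used elsewhere in this thread.

```python
def delete_widgets_completely(doc, should_delete):
    """Delete matching widgets and blank out their PDF objects.

    doc is expected to behave like a PyMuPDF Document; should_delete is a
    hypothetical predicate that receives a field name and returns True/False.
    Returns the xrefs whose object definitions were emptied.
    """
    emptied = []
    for page in doc:
        field = page.first_widget
        while field:
            if should_delete(field.field_name):
                xref = field.xref                  # remember the xref first
                field = page.delete_widget(field)  # returns the next widget
                doc.update_object(xref, "<<>>")    # empty the object definition
                emptied.append(xref)
            else:
                field = field.next
    return emptied


# Possible use:
#   import pymupdf
#   doc = pymupdf.open("exported_from_libreoffice.pdf")
#   delete_widgets_completely(doc, lambda n: n.startswith(("sig", "init")))
#   doc.ez_save("deleted_signatures.pdf")
```

Reading `field.xref` before the deletion matters, because the widget object should not be relied on once it has been removed from the page.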

@ag-gaphp
Author

Yes! Absolutely perfect. When I first started hunting this issue down, this exact process is something I thought might need to be done, I just didn't do enough reading to connect all the dots, I guess.

At least in my use case, adding these two lines removes the widgets entirely, and pyHanko and the eSign platforms no longer see the erroneous fields.

@JorjMcKie
Collaborator

Thanks for the feedback!
I will add these instructions to the standard behavior ...
