scrub fails to remove hidden text after clean_contents stopped including line breaks (≥ 1.24.0) #4670

@nge06

Description of the bug

I understand that clean_contents() is no longer expected to generate line breaks.
However, scrub calls clean_contents and then passes cont.splitlines() (now a single line) to remove_hidden, which still expects separate lines when checking for markers like b"3 Tr". As a result, hidden text is not removed.
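To illustrate the mismatch without needing a PDF, here is a minimal sketch in plain Python. The function find_marker_lines is a hypothetical stand-in for a line-oriented check, not the actual remove_hidden code, and both byte strings are made-up content streams that only assume the layouts described above (one operator per line versus everything on one line).

```python
# Hypothetical stand-in for a line-oriented marker check.
# This is NOT PyMuPDF's remove_hidden; it only mimics the idea of
# looking for a line that consists solely of the marker.
def find_marker_lines(lines, marker=b"3 Tr"):
    """Return the indices of lines that are exactly the marker."""
    return [i for i, line in enumerate(lines) if line.strip() == marker]


# Assumed layout with line breaks: one operator per line.
multi_line = b"BT\n/F1 12 Tf\n3 Tr\n(hidden) Tj\nET\n"

# Assumed layout without line breaks: same operators, single line.
single_line = b"BT /F1 12 Tf 3 Tr (hidden) Tj ET"

multi_hits = find_marker_lines(multi_line.splitlines())
single_hits = find_marker_lines(single_line.splitlines())

print(multi_hits)   # [2]
print(single_hits)  # []
```

With the single-line input, splitlines() returns one element, so a check that compares whole lines against the marker never matches, even though the marker bytes are present in the stream.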

Replacing clean_contents with pretty_contents, as suggested in issue 3419, is a possible solution.

The issue affects all versions from 1.24.0 up to and including 1.26.4 (current).

How to reproduce the bug

I believe the issue is visible in the code for scrub and remove_hidden, but it can also be reproduced with any PDF containing hidden text, such as TestOCR.pdf from issue 3533:

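import pymupdf

doc = pymupdf.open("TestOCR.pdf")
# Text extracted before scrubbing, including the hidden text.
text_before_scrub = doc[0].get_text()
doc.scrub(hidden_text=True)
# Expected False: scrubbing should have removed the hidden text.
print(doc[0].get_text() == text_before_scrub)
# Expected empty output if the hidden text was removed.
print(doc[0].get_text())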

In 1.23.26: prints False and an empty string.
In 1.26.4: prints True and the full hidden text.

PyMuPDF version

1.26.4

Operating system

Windows

Python version

3.12
