Description of the bug
I understand that clean_contents() is no longer expected to generate line breaks.
However, scrub calls clean_contents and then passes cont.splitlines() to remove_hidden. Because the cleaned stream is now a single line, splitlines() returns a single element, while remove_hidden still expects separate lines when checking for markers such as b"3 Tr". As a result, hidden text is not removed.
Replacing clean_contents with pretty_contents, as suggested in issue 3419, is a possible solution.
The issue affects all versions from 1.24.0 up to and including 1.26.4 (current).
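To illustrate the failure mode described above without needing a PDF, here is a minimal sketch in plain Python. The function drop_hidden_lines is a hypothetical, simplified stand-in for a line-based filter and is not PyMuPDF's actual remove_hidden; the content stream bytes are also made up for the example. It only assumes that the filter compares whole lines against b"3 Tr".

```python
def drop_hidden_lines(lines):
    """Hypothetical, simplified line-based hidden-text filter (not PyMuPDF code)."""
    out = []
    suppress = False
    for line in lines:
        if line == b"3 Tr":  # text render mode 3 = invisible text
            suppress = True
            continue
        if line.endswith(b" Tr") or line in (b"ET", b"Q"):
            suppress = False  # another render mode, end of text, or restore
        if suppress:
            continue  # skip operators that draw hidden text
        out.append(line)
    return out


# Made-up content stream: one operator per line vs. everything on one line.
multi_line = b"BT\n/F1 12 Tf\n3 Tr\n(secret) Tj\nET"
single_line = multi_line.replace(b"\n", b" ")

filtered_multi = drop_hidden_lines(multi_line.splitlines())
filtered_single = drop_hidden_lines(single_line.splitlines())

print(filtered_multi)   # the "(secret) Tj" operator is gone
print(filtered_single)  # the single line passes through unchanged
```

With one operator per line, the marker is matched and the hidden text operator is dropped. With the single-line stream, no element ever equals b"3 Tr", so nothing is removed, which matches the behavior reported here.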
How to reproduce the bug
I believe the issue is visible in the code for scrub and remove_hidden, but it can also be reproduced with any PDF containing hidden text, such as TestOCR.pdf from issue 3533:
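import pymupdf

doc = pymupdf.open("TestOCR.pdf")  # any PDF containing hidden text
text_before_scrub = doc[0].get_text()
doc.scrub(hidden_text=True)  # expected to remove the hidden text
print(doc[0].get_text() == text_before_scrub)  # False if the text was removed
print(doc[0].get_text())  # empty if the text was removed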
In 1.23.26: prints False and an empty string.
In 1.26.4: prints True and the full hidden text.
PyMuPDF version
1.26.4
Operating system
Windows
Python version
3.12