Description
Hello,
Thanks for this awesome work. I'm using PyMuPDF in Bookworm to provide blind people with an accessible eBook reading experience.
Describe the bug (mandatory)
PyMuPDF v1.16.0 and later introduced a weird bug that affects certain PDF documents. When using the text extraction functions, excessive spaces are inserted between characters, so the extracted text is just a blob of characters rather than words. I have to admit that this problem is rare and happens only with some kinds of PDF documents. In the v1.14.x versions, the text of the same documents is extracted correctly.
To Reproduce (mandatory)
Here is the text extraction function straight from the FitzDocument backend:
def _text_from_page(page):
    bloks = page.getTextBlocks()
    text = [blk[4].replace("\n", " ") for blk in bloks]
    return "\r\n".join(text)
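To make the symptom easier to spot across many documents, here is a minimal sketch of a check that could be run on the string returned by the function above. It is only a heuristic and not part of PyMuPDF or Bookworm: the helper names `single_char_token_ratio` and `looks_char_spaced` are hypothetical, and the 0.6 threshold is an arbitrary assumption.

```python
def single_char_token_ratio(text):
    """Return the fraction of whitespace-separated tokens that are one character long."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if len(tok) == 1) / len(tokens)


def looks_char_spaced(text, threshold=0.6):
    """Heuristic (assumed threshold): True if most tokens are single characters,
    which is what the extracted text looks like when spaces separate every letter."""
    return single_char_token_ratio(text) >= threshold


# Illustrative strings, not real output from the attached PDF.
good_sample = "The text is extracted correctly."
bad_sample = "T h e t e x t i s a b l o b"
```

Ordinary prose contains few one-letter words, so a high ratio on a whole page is a reasonable hint that the page is affected.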
Expected behavior (optional)
The text is extracted preserving its basic structure (words and paragraphs).
Your configuration (mandatory)
- Operating system: Windows 10 Pro, 64-bit
- Python: Python 3.7.4 64-bit/32-bit
- PyMuPDF version: v1.16.0, installed from PyPI
Additional context (optional)
The problem still happens with the latest version of PyMuPDF (v1.16.2). Playing with the text extraction flags didn't help: it produces several variations of the output, none of which solves the issue.
This problem happens with the Dart language specification and several other documents. As an example, the Dart language specification is attached.
DartLangSpec-v2.2.pdf