Excessive spacing when extracting text from PDF #364

@mush42

Hello,

Thanks for this awesome work. I'm using PyMuPDF in Bookworm to provide blind people with an accessible eBook reading experience.

Describe the bug (mandatory)

PyMuPDF v1.16.0 and later introduced a weird bug with certain PDF documents. When using the text extraction functions, excessive spaces are inserted between characters, so the extracted text is just a blob of characters. I have to admit that this problem is rare; it only happens with some PDF documents. In the v1.14.x versions the text is extracted correctly.
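To make the symptom easier to spot automatically, here is a rough sketch of a check for over-spaced text. It is only a heuristic: the function names and the 0.6 threshold are made up for illustration, and the `broken` string is a hand-written imitation of the symptom, not real output from the attached PDF.

```python
def single_char_token_ratio(text):
    """Return the fraction of whitespace-separated tokens that are one character long."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if len(tok) == 1) / len(tokens)


def looks_over_spaced(text, threshold=0.6):
    """Guess whether text shows the 'space between every character' symptom.

    The threshold is an arbitrary illustrative cutoff, not a tuned value.
    """
    return single_char_token_ratio(text) >= threshold


# Hand-written samples (hypothetical, not taken from the attached document).
normal = "The Dart programming language specification"
broken = "T h e D a r t p r o g r a m m i n g"

print(looks_over_spaced(normal))
print(looks_over_spaced(broken))
```

Running something like this over the extracted text of each page under v1.14.x and v1.16.x should show which pages are affected.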

To Reproduce (mandatory)

Here is the text extraction function straight from the FitzDocument backend:

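def _text_from_page(page):
    # Each block is a tuple; index 4 holds the block's text.
    blocks = page.getTextBlocks()
    # Flatten line breaks inside a block so each block becomes one paragraph.
    text = [blk[4].replace("\n", " ") for blk in blocks]
    return "\r\n".join(text)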

Expected behavior (optional)

The text is extracted preserving its basic structure (words and paragraphs).

Your configuration (mandatory)

  • Operating system: Windows 10 Pro, 64-bit
  • Python: Python 3.7.4 64-bit/32-bit
  • PyMuPDF version: v1.16.0, installed from PyPI

Additional context (optional)

The problem still happens with the latest version of PyMuPDF (v1.16.2). Playing with the text extraction flags didn't help: each combination gives a different variation of the output, but none of them solves the issue.
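For anyone who wants to repeat the flag experiment, a minimal sketch of how the combinations could be enumerated follows. It assumes that `page.getText("text", flags=...)` is available in the installed PyMuPDF version, and it hard-codes the bit values of the three `TEXT_PRESERVE_*` flags so that it runs without importing `fitz`; in real use the constants from `fitz` should be preferred. The helper names `flag_combinations` and `extract_variants` are hypothetical.

```python
from itertools import combinations

# Bit values of the text extraction flags (assumed here; in real code
# take them from fitz.TEXT_PRESERVE_LIGATURES and friends).
TEXT_FLAGS = {
    "TEXT_PRESERVE_LIGATURES": 1,
    "TEXT_PRESERVE_WHITESPACE": 2,
    "TEXT_PRESERVE_IMAGES": 4,
}


def flag_combinations(flags=TEXT_FLAGS):
    """Yield (label, bitmask) for every subset of the flags, including none."""
    names = sorted(flags)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            mask = 0
            for name in subset:
                mask |= flags[name]
            yield ("|".join(subset) or "none", mask)


def extract_variants(page, flags=TEXT_FLAGS):
    """Map each flag combination to the text the page returns for it.

    Assumes page.getText("text", flags=...) exists, which may depend
    on the installed PyMuPDF version.
    """
    return {
        label: page.getText("text", flags=mask)
        for label, mask in flag_combinations(flags)
    }
```

With a real document, `page` would be a page object loaded through `fitz`, and the resulting dictionary can be compared entry by entry to see how each combination changes the spacing.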

This problem happens with the Dart language specification document and several others. As an example, the Dart language specification is attached:
DartLangSpec-v2.2.pdf

Labels

  • postpone: postpone to a future version
  • upstream bug: bug outside this package
