PyMuPDF-1.22.2: extractText shows unreadable output from specific pdf #2377

@zdenop

Describe the bug (mandatory)

I am unable to extract text from this specific PDF. The PDF is displayed correctly by Adobe Reader (and also in Chrome and Firefox).
AD-3.HANO.pdf

To Reproduce (mandatory)

import fitz

fname = "AD-3.HANO.pdf"

with fitz.open(fname) as doc:
    for page_id in range(doc.page_count):
        page = doc.load_page(page_id)
        content = page.get_textpage()
        print(f"Page {page_id + 1}:")
        print(content.extractText())

Expected behavior (optional)

Readable text is extracted, as it is from other PDF files.
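
To compare this file against PDFs that extract correctly, a rough readability check can be run on the extracted string. This is only a sketch: `unreadable_ratio` is a hypothetical helper, not part of PyMuPDF, and it assumes that unreadable output appears as U+FFFD replacement characters or as control, private-use, or unassigned code points. Garbled output made of ordinary letters would not be detected.

```python
import unicodedata


def unreadable_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like extraction garbage.

    Hypothetical heuristic: counts U+FFFD plus characters in the Unicode
    categories Cc (control), Co (private use) and Cn (unassigned).
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    suspicious = sum(
        1
        for ch in visible
        if ch == "\ufffd" or unicodedata.category(ch) in {"Cc", "Co", "Cn"}
    )
    return suspicious / len(visible)
```

The string returned by `content.extractText()` in the script above could be passed to this function; a ratio close to 1.0 on this file and close to 0.0 on other PDFs would match the behavior described here.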

Screenshots (optional)

image

Your configuration (mandatory)

  • Windows 10 64bit
  • Python 3.9.13 64bit, Python 3.11.2
  • pymupdf-1.22.2, installation method: wheel.

Labels: not a bug (not a bug / user error / unable to reproduce)