extractText() extracts broken text from pdf #3186

Closed
spagliarini opened this issue Feb 20, 2024 · 1 comment

@spagliarini

Description of the bug

Hi,

I noticed a bug in PyMuPDF versions >= 1.23.9 when using get_text() to extract text from PDF documents.

To reproduce the bug

  • Consider the attached PDF file: test_file.pdf

  • Extract text using the code below (see "How to reproduce the bug")

  • To reproduce the correct behavior, install a PyMuPDF version < 1.23.9 (e.g., 1.23.8). This yields the following complete text: doc_text_1238.txt

  • To reproduce the buggy behavior, install a PyMuPDF version >= 1.23.9 (e.g., 1.23.24). This yields the following broken text: doc_text_12324.txt

ADDITIONAL NOTES

  • The bug behavior can only be observed on certain documents (e.g., the one attached above)
  • extractBLOCKS, extractWORDS and extractDICT work fine; the bug seems to affect only extractTEXT
  • We tested on both Windows and Linux, and the bug occurs on both
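
Since the word-level extraction is reported to work while the plain-text extraction does not, one way to spot affected pages is to compare the two outputs. The following is only a rough sketch: the helper names `words_missing_from_text` and `check_pdf` are hypothetical, the issue does not say in what way the text is broken, and whitespace tokenization may produce false positives on some documents.

```python
from collections import Counter


def words_missing_from_text(plain_text, words):
    """Return the words (in order) that have no matching token in plain_text."""
    remaining = Counter(plain_text.split())
    missing = []
    for word in words:
        if remaining[word] > 0:
            remaining[word] -= 1  # consume one matching token
        else:
            missing.append(word)
    return missing


def check_pdf(pdf_path):
    """Map page number -> words found by get_text("words") but absent from get_text()."""
    import fitz  # PyMuPDF; imported lazily so the helper above works without it

    report = {}
    with fitz.open(pdf_path) as doc:
        for page in doc:
            plain = page.get_text()  # default plain-text mode
            # Each "words" item is (x0, y0, x1, y1, word, block_no, line_no, word_no)
            words = [item[4] for item in page.get_text("words")]
            missing = words_missing_from_text(plain, words)
            if missing:
                report[page.number] = missing
    return report
```

Calling `check_pdf("test_file.pdf")` under an affected and an unaffected version should make the difference visible, assuming the breakage shows up as lost or altered words.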

Thank you for your help

How to reproduce the bug

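Extract text using the following code:

import fitz  # PyMuPDF

pdf_path = "test_file.pdf"  # the PDF attached above

fitz_doc = fitz.open(pdf_path)

# Collect the plain-text extraction of every page
doc_text = list()
for page in fitz_doc:
    doc_text.append(page.get_text())

# Join the pages into a single string
doc_text = ' '.join(doc_text)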

To reproduce the bug behavior install a PyMuPDF version >= 1.23.9 (e.g., 1.23.24).

PyMuPDF version

1.23.24

Operating system

Windows

Python version

3.10

@julian-smith-artifex-com
Collaborator

Fixed in 1.23.25.
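
For anyone who wants to guard against this in their own code, here is a minimal sketch of a version check. The affected range is an assumption pieced together from this thread (first seen in 1.23.9 per the report, fixed in 1.23.25 per the reply above), and the function names are hypothetical.

```python
import re
from importlib.metadata import PackageNotFoundError, version

FIRST_AFFECTED = (1, 23, 9)   # assumption: taken from the bug report
FIRST_FIXED = (1, 23, 25)     # assumption: taken from the maintainer's reply


def parse_version(text):
    """Parse the leading 'X.Y.Z' of a version string into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(version_text):
    """True if the version falls in the assumed affected range."""
    return FIRST_AFFECTED <= parse_version(version_text) < FIRST_FIXED


def installed_pymupdf_is_affected():
    """Check the installed PyMuPDF, or return None if it is not installed."""
    try:
        return is_affected(version("PyMuPDF"))
    except PackageNotFoundError:
        return None
```

If `installed_pymupdf_is_affected()` returns True, upgrading to 1.23.25 or later should pick up the fix.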
