extractText() extracts broken text from pdf #3186

spagliarini · 2024-02-20T11:38:16Z

Description of the bug

Hi,

I noticed a bug in PyMuPDF versions >= 1.23.9 when using get_text() to extract text from PDF documents.

To reproduce the bug

1. Consider the attached PDF file: test_file.pdf
2. Extract text using the code below (see "How to reproduce the bug").
3. To reproduce the correct behavior, install a PyMuPDF version < 1.23.9 (e.g., 1.23.8). We obtain the following complete text: doc_text_1238.txt
4. To reproduce the bug, install a PyMuPDF version >= 1.23.9 (e.g., 1.23.24). We obtain the following broken text: doc_text_12324.txt

ADDITIONAL NOTES

The bug can only be observed on certain documents (e.g., the one attached above).
extractBLOCKS, extractWORDS and extractDICT work fine; the bug seems to affect only extractTEXT.
We tried on both Windows and Linux, and the bug occurs on both.
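Since word-level extraction is reported to work while plain-text extraction is not, one way to confirm the problem on a given page is to compare the two outputs. The following is a minimal sketch, not part of the original report: the helper name words_missing_from_text is hypothetical, and it assumes each entry returned by page.get_text("words") is a tuple whose fifth element (index 4) is the word string. It also assumes whitespace-separated tokens in the plain text match the extracted words, which may not hold for every document.

```python
def words_missing_from_text(plain_text, words):
    """Return words found by word-level extraction but absent from plain_text.

    plain_text: the string returned by page.get_text()
    words: the list returned by page.get_text("words"); the word string is
           assumed to sit at index 4 of each tuple.
    """
    present = set(plain_text.split())
    return [w[4] for w in words if w[4] not in present]


# Hypothetical usage with PyMuPDF (requires the attached file):
#   import fitz
#   for page in fitz.open("test_file.pdf"):
#       missing = words_missing_from_text(page.get_text(), page.get_text("words"))
#       if missing:
#           print(page.number, "plain text is missing", len(missing), "words")
```

A non-empty result on an affected version and an empty result on 1.23.8 would be consistent with the behavior described above.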

Thank you for your help

How to reproduce the bug

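Extract text using the following code:

import fitz  # PyMuPDF

pdf_path = "test_file.pdf"  # path to the attached PDF

fitz_doc = fitz.open(pdf_path)

# Collect the plain text of each page
doc_text = list()
for page in fitz_doc:
    doc_text.append(page.get_text())

doc_text = ' '.join(doc_text)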


PyMuPDF version

1.23.24

Operating system

Windows

Python version

3.10

julian-smith-artifex-com · 2024-02-20T22:52:59Z

Fixed in 1.23.25.
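For anyone who cannot upgrade immediately, the version range stated in this thread (broken from 1.23.9, fixed in 1.23.25) can be checked programmatically. This is a rough sketch only: the function name is_affected is hypothetical, and it assumes the version string is purely numeric and dot-separated, so pre-release suffixes are not handled.

```python
def is_affected(version):
    """Return True if a PyMuPDF version string falls in the range reported
    in this issue: >= 1.23.9 and < 1.23.25."""
    parts = tuple(int(p) for p in version.split("."))
    return (1, 23, 9) <= parts < (1, 23, 25)


print(is_affected("1.23.24"))  # version from the report
print(is_affected("1.23.25"))  # version containing the fix
```

In a real script the string would presumably come from the installed package, for example fitz.VersionBind.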

julian-smith-artifex-com added a commit that referenced this issue Feb 20, 2024

Address #3186: don't terminate extracted text at chr(0) characters.

9983ce2
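The commit message points at chr(0) characters as the cause. The sketch below only simulates that failure mode in pure Python and is not the actual PyMuPDF patch: the helper names and the sample string are made up for illustration. It shows how treating text as NUL-terminated drops everything after the first chr(0). Whether the real fix removes or preserves those characters is not stated in this thread, so removing them here is just one option.

```python
def c_string_view(text):
    """Mimic NUL-terminated handling: keep only what precedes the first chr(0)."""
    return text.partition("\x00")[0]


def drop_nul(text):
    """Keep the full text, discarding chr(0) characters instead of stopping at them."""
    return text.replace("\x00", "")


# Hypothetical page text with an embedded chr(0)
page_text = "first line\n\x00second line\n"

print(repr(c_string_view(page_text)))  # truncated view
print(repr(drop_nul(page_text)))       # full text without chr(0)
```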

julian-smith-artifex-com mentioned this issue Feb 20, 2024

Address #3182: Pixmap.invert_irect argument type error. #3187

Merged

julian-smith-artifex-com added a commit that referenced this issue Feb 20, 2024

Address #3186: don't terminate extracted text at chr(0) characters.

25adc23

julian-smith-artifex-com added the Fixed in next release label Feb 20, 2024

julian-smith-artifex-com closed this as completed Feb 20, 2024
