Buffer size mismatch when calling `get_text_range` #298

elonzh · 2024-02-22T07:37:10Z

Issue Template

Checklist

I confirm this is not a question or feature request. Otherwise, use the Discussions page.
I confirm this is not an issue encountered with an installed build of pypdfium2, but about some other aspect of the project (specify below). Otherwise, use one of the package templates (PyPA/conda), even if you believe this is not a package-specific issue.
I confirm this is not about an unofficial build of pypdfium2. We do not support third-party builds, and they are not eligible for a bug report.
I have read and acknowledged the response policy.

Reason for Generic issue (keyword/topic)

text extraction

Description

> poetry show pypdfium2
Using python3 (3.10.8)
 name         : pypdfium2                 
 version      : 4.27.0

88d1ebc0-75ea-4fac-b398-202848e59d80.pdf

Script to reproduce issue

import pypdfium2

path = "88d1ebc0-75ea-4fac-b398-202848e59d80.pdf"

doc = pypdfium2.PdfDocument(path, autoclose=True)
try:
    for page_number, page in enumerate(doc):
        print(f"Reading Page {page_number + 1}")
        text_page = page.get_textpage()
        content = text_page.get_text_range()
        text_page.close()
        page.close()
finally:
    doc.close()

Output:

Reading Page 1
Reading Page 2
Reading Page 3
Reading Page 4
Reading Page 5
Reading Page 6
Reading Page 7
Reading Page 8
Reading Page 9
Traceback (most recent call last):
  File "/home/elonzh/workspace/ivysci/unob/poc.py", line 12, in <module>
    content = text_page.get_text_range()
  File "/home/elonzh/.cache/pypoetry/virtualenvs/unob-hX3-r8Vo-py3.10/lib/python3.10/site-packages/pypdfium2/_helpers/textpage.py", line 89, in get_text_range
    assert in_size == out_size, f"Buffer size mismatch: {in_size} vs {out_size}"
AssertionError: Buffer size mismatch: 3208 vs 3210

mara004 · 2024-02-22T12:15:34Z

Oh, that's not good. Thank you for the bug report; I will investigate.
The assert that fired here was added during refactoring in the course of this issue:
#261 (comment)
#261 (comment)

mara004 · 2024-02-22T12:37:58Z

I think this must be due to a pdfium change.
I reverted to a pdfium version from the time of #261 via ./run emplace auto:6056 (assuming an editable install).

With this, the following test code passes just fine:

from pypdfium2 import *
from pypdfium2.raw import *
from pathlib import Path

in_path = Path("~/Downloads/88d1ebc0-75ea-4fac-b398-202848e59d80.pdf").expanduser()
pdf = PdfDocument(in_path)
tp_a = pdf[8].get_textpage()
tp_b = pdf[9].get_textpage()

print( tp_a.get_text_range() )
print( tp_b.get_text_range() )

But it breaks with the latest pdfium (./run emplace auto:6309), triggering the above assert.

So I guess we'll need to find out what has changed, why, and which commit(s) are responsible.

mara004 · 2024-02-22T12:52:22Z

I have consulted pdfium via this ticket: https://bugs.chromium.org/p/pdfium/issues/detail?id=2133

mara004 · 2024-02-26T14:42:53Z

In the meantime, please use get_text_bounded(), unless you specifically need the ability to define a character range.
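
As a rough sketch of this advice (not pypdfium2's own code), a small helper can default to get_text_bounded() and only fall back to get_text_range() when a character range is explicitly requested. The helper name `extract_text` and its `char_range` parameter are hypothetical; the sketch only assumes a text page object that exposes `get_text_bounded()` and `get_text_range(index, count)`.

```python
def extract_text(textpage, char_range=None):
    """Return page text, avoiding get_text_range() unless a range is requested.

    `textpage` is assumed to behave like a pypdfium2 text page object.
    `char_range` is a hypothetical (index, count) tuple.
    """
    if char_range is None:
        # No explicit range: take everything inside the page bounds.
        return textpage.get_text_bounded()
    # A range was requested, so get_text_range() is the only option.
    index, count = char_range
    return textpage.get_text_range(index, count)
```

In the reproduction script above, `content = extract_text(text_page)` would take the place of the `get_text_range()` call.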

I've also written a patch that would work around the get_text_range() issue by doubling the allocation, but I'd prefer not to have this. Let's first wait some time for upstream.

mara004 · 2024-03-01T15:48:54Z

Upstream is tough and bureaucratic as usual.
They've classified the issue as WontFix and suggest generally allocating 4 bytes per character (which I'd consider bad practice). Otherwise they seem to request a new bug report and new evidence for why the API should report the exact buffer size. Duh...
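
To illustrate why a fixed two-bytes-per-character estimate can come up short, here is a self-contained sketch in plain Python that does not call pdfium at all. It assumes the extra bytes come from characters outside the Basic Multilingual Plane, which UTF-16 encodes as a 4-byte surrogate pair; the thread does not confirm that this is what happened in the reported PDF. All function names are hypothetical.

```python
def utf16le_size(text):
    """Number of bytes needed to store `text` as UTF-16LE (no terminator)."""
    return len(text.encode("utf-16-le"))


def naive_size(text):
    """Estimate that assumes every character takes exactly 2 bytes."""
    return 2 * len(text)


def worst_case_size(text):
    """Over-allocation of 4 bytes per character, always large enough."""
    return 4 * len(text)


# 'A' plus MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP)
sample = "A\U0001D11E"
print(naive_size(sample), utf16le_size(sample), worst_case_size(sample))
# prints: 4 6 8
```

The 2-byte estimate undercounts by 2 bytes for each non-BMP character, while the 4-byte allocation wastes space for ordinary text, which is the trade-off discussed here.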

elonzh · 2024-03-01T16:23:29Z

Seems we have no choice.

mara004 · 2024-03-01T16:29:36Z

Yeah... I suppose I should merge #301, and if pdfium improves, we can still update the code.
In the meantime, we can add a warning to prefer get_text_bounded() where possible, and maybe even make get_text_range() translate to get_text_bounded() if called without arguments.

mara004 · 2024-03-01T22:39:41Z

Workaround implemented as per previous comment.
GH auto-closed the issue on merge – or would you say we should keep it open?

elonzh · 2024-03-02T03:34:33Z

The current implementation is already very good. Thank you for your hard work.

mara004 · 2024-04-12T14:25:41Z

It seems the pdfium team actually got back to the issue and reverted FPDFText_GetText() to 2-byte characters only, seeing as the problem was not only that we could no longer tell the exact buffer size, but also that API stability expectations were broken.

This is speculative, but I suspect they might have received a separate security bug that led them to revisit the previously raised concerns, as the commit-linked issue is non-public.

mara004 added the bug label Feb 22, 2024

mara004 pinned this issue Feb 25, 2024

mara004 added the pdfium and severe labels Feb 25, 2024

mara004 added a commit that referenced this issue Feb 26, 2024

Possible workaround against #298 / pdfium bug 2133

b0890a7

mara004 linked a pull request Feb 26, 2024 that will close this issue

Possible workaround against #298 / pdfium bug 2133 #301

Merged

mara004 closed this as completed in #301 Mar 1, 2024

mara004 added a commit that referenced this issue Mar 1, 2024

Possible workaround against #298 / pdfium bug 2133 (#301)

47a2b81

mara004 unpinned this issue Mar 1, 2024

mara004 added a commit that referenced this issue Mar 1, 2024

Changelog for #298

eafec33

mara004 added a commit that referenced this issue Mar 1, 2024

Changelog for #298

380f30f
