Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buffer size mismatch when calling get_text_range #298

Closed
4 tasks done
elonzh opened this issue Feb 22, 2024 · 13 comments · Fixed by #301
Closed
4 tasks done

Buffer size mismatch when calling get_text_range #298

elonzh opened this issue Feb 22, 2024 · 13 comments · Fixed by #301
Labels
bug Something isn't working pdfium This issue may be caused by (or related to) pdfium itself severe

Comments

@elonzh
Copy link

elonzh commented Feb 22, 2024

Issue Template

Checklist

  • I confirm this is not a question or feature request. Otherwise, use the Discussions page.
  • I confirm this is not an issue encountered with an installed build of pypdfium2, but about some other aspect of the project (specify below). Otherwise, use one of the package templates (PyPA/conda), even if you believe this is not a package-specific issue.
  • I confirm this is not about an unofficial build of pypdfium2. We do not support third-party builds, and they are not eligible for a bug report.
  • I have read and acknowledged the response policy.

Reason for Generic issue (keyword/topic)

text extraction

Description

> poetry show pypdfium2
Using python3 (3.10.8)
 name         : pypdfium2                 
 version      : 4.27.0                    

88d1ebc0-75ea-4fac-b398-202848e59d80.pdf

Script to reproduce issue

import pypdfium2

path = "88d1ebc0-75ea-4fac-b398-202848e59d80.pdf"

doc = pypdfium2.PdfDocument(path, autoclose=True)
try:
    for page_number, page in enumerate(doc):
        print(f"Reading Page {page_number + 1}")
        text_page = page.get_textpage()
        content = text_page.get_text_range()
        text_page.close()
        page.close()
finally:
    doc.close()

Output:

Reading Page 1
Reading Page 2
Reading Page 3
Reading Page 4
Reading Page 5
Reading Page 6
Reading Page 7
Reading Page 8
Reading Page 9
Traceback (most recent call last):
  File "/home/elonzh/workspace/ivysci/unob/poc.py", line 12, in <module>
    content = text_page.get_text_range()
  File "/home/elonzh/.cache/pypoetry/virtualenvs/unob-hX3-r8Vo-py3.10/lib/python3.10/site-packages/pypdfium2/_helpers/textpage.py", line 89, in get_text_range
    assert in_size == out_size, f"Buffer size mismatch: {in_size} vs {out_size}"
AssertionError: Buffer size mismatch: 3208 vs 3210
@mara004 mara004 added the bug Something isn't working label Feb 22, 2024
@mara004
Copy link
Member

mara004 commented Feb 22, 2024

Oh, that's not good. Thank you for the bug report; I will investigate.
The assert which caught here was added during refactoring in the course of this issue:
#261 (comment)
#261 (comment)

@mara004
Copy link
Member

mara004 commented Feb 22, 2024

I think this must be due to a pdfium change.
I reverted to a pdfium version at the time of #261 via ./run emplace auto:6056 (assuming an editable install).

With this, the following test code passes just fine:

from pypdfium2 import *
from pypdfium2.raw import *
from pathlib import Path

in_path = Path("~/Downloads/88d1ebc0-75ea-4fac-b398-202848e59d80.pdf").expanduser()
pdf = PdfDocument(in_path)
tp_a = pdf[8].get_textpage()
tp_b = pdf[9].get_textpage()

print( tp_a.get_text_range() )
print( tp_b.get_text_range() )

But it breaks with latest pdfium (./run emplace auto:6309), triggering the above assert.

So I guess we'll need to find out what has changed, why, and which commit(s) are responsible.

@mara004
Copy link
Member

mara004 commented Feb 22, 2024

I have consulted pdfium via this ticket: https://bugs.chromium.org/p/pdfium/issues/detail?id=2133

@mara004

This comment was marked as off-topic.

@elonzh

This comment was marked as off-topic.

@mara004

This comment was marked as off-topic.

@mara004 mara004 pinned this issue Feb 25, 2024
@mara004 mara004 added pdfium This issue may be caused by (or related to) pdfium itself severe labels Feb 25, 2024
@mara004 mara004 linked a pull request Feb 26, 2024 that will close this issue
@mara004
Copy link
Member

mara004 commented Feb 26, 2024

In the meantime, please use get_text_bounded(), unless you specifically need the ability to define a character range.

I've also written a patch that would work around the get_text_range() issue by doubling the allocation, but I'd prefer not to have this. Let's first wait some time for upstream.

@mara004
Copy link
Member

mara004 commented Mar 1, 2024

Upstream is tough and bureaucratic as usual.
They've classified the issue as WontFix and suggest to generally allocate 4 bytes (which I'd consider bad practice). Otherwise they seem to request a new bug report and new evidence why the API should tell the exact buffer size. Duh...

@elonzh
Copy link
Author

elonzh commented Mar 1, 2024

Seems we have no choice.

@mara004
Copy link
Member

mara004 commented Mar 1, 2024

Yeah... I suppose I should merge #301, and if pdfium improves, we can still update the code.
In the meantime, we can add a warning to prefer get_text_bounded() where possible, and maybe even make get_text_range() translate to get_text_bounded() if called without arguments.

@mara004
Copy link
Member

mara004 commented Mar 1, 2024

Workaround implemented as per previous comment.
GH auto-closed the issue on merge – or would you say we should keep it open?

@mara004 mara004 unpinned this issue Mar 1, 2024
mara004 added a commit that referenced this issue Mar 1, 2024
mara004 added a commit that referenced this issue Mar 1, 2024
@elonzh
Copy link
Author

elonzh commented Mar 2, 2024

The current implementation is already very good. Thank you for your hard work.

@mara004
Copy link
Member

mara004 commented Apr 12, 2024

It seems like pdfium team actually got back to the issue and reverted FPDFText_GetText() to 2 byte characters only. (Seeing as it was not only that we couldn't tell the exact buffer size anymore, but also that API stability expectations were broken.)

This is speculative, but I suspect they might have received a separate security bug that led to revisiting of the previously raised concerns, as the commit-linked issue is non-public.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working pdfium This issue may be caused by (or related to) pdfium itself severe
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants