Buffer size mismatch when calling get_text_range
#298
Comments
Oh, that's not good. Thank you for the bug report; I will investigate.
I think this must be due to a pdfium change. With this, the following test code passes just fine:
from pypdfium2 import *
from pypdfium2.raw import *
from pathlib import Path
in_path = Path("~/Downloads/88d1ebc0-75ea-4fac-b398-202848e59d80.pdf").expanduser()
pdf = PdfDocument(in_path)
tp_a = pdf[8].get_textpage()
tp_b = pdf[9].get_textpage()
print( tp_a.get_text_range() )
print( tp_b.get_text_range() )
But it breaks with latest pdfium ( So I guess we'll need to find out what has changed, why, and which commit(s) are responsible.
I have consulted pdfium via this ticket: https://bugs.chromium.org/p/pdfium/issues/detail?id=2133
This comment was marked as off-topic.
In the meantime, please use I've also written a patch that would work around the
Upstream is tough and bureaucratic as usual.
Seems we have no choice.
Yeah... I suppose I should merge #301, and if pdfium improves, we can still update the code.
Workaround implemented as per previous comment.
The current implementation is already very good. Thank you for your hard work.
It seems like the pdfium team actually got back to the issue and reverted FPDFText_GetText() to 2-byte characters only. (Not only could we no longer tell the exact buffer size, but API stability expectations were also broken.) This is speculative, but I suspect they might have received a separate security bug that led them to revisit the previously raised concerns, as the commit-linked issue is non-public.
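For anyone reading along, here is a rough sketch of why a fixed 2 bytes per character matters for sizing the buffer. It does not call pdfium or pypdfium2 at all: `fake_get_text` and `read_text` are hypothetical stand-ins written only for illustration, and the assumed contract (UTF-16LE output, a 2-byte NUL terminator, return value counting units including the terminator) is my reading of the situation, not a copy of the actual implementation.

```python
import ctypes


def fake_get_text(text, count, buffer):
    """Hypothetical stand-in for a C function that writes UTF-16LE text.

    Writes at most `count` 2-byte units of `text` into `buffer`, followed by
    a 2-byte NUL terminator, and returns the number of units written
    (terminator included). This is NOT the real pdfium function.
    """
    # Assumption: text stays within the Basic Multilingual Plane, so slicing
    # at a 2-byte boundary never splits a surrogate pair.
    encoded = text.encode("utf-16-le")[: count * 2]
    data = encoded + b"\x00\x00"
    if len(data) > len(buffer):
        raise ValueError("buffer too small for requested text")
    ctypes.memmove(buffer, data, len(data))
    return len(data) // 2


def read_text(text, count):
    """Caller side: size the buffer as (count + 1) units of 2 bytes each."""
    n_bytes = (count + 1) * 2
    buffer = ctypes.create_string_buffer(n_bytes)
    n_written = fake_get_text(text, count, buffer)
    # Drop the terminator, then decode only the bytes actually written.
    return buffer.raw[: (n_written - 1) * 2].decode("utf-16-le")


if __name__ == "__main__":
    print(read_text("héllo wörld", 11))
    print(read_text("hello", 3))
```

The point is that `(count + 1) * 2` is only a safe size as long as every character takes exactly 2 bytes. If the callee starts writing more bytes per character than the caller assumed, the caller can no longer compute the size up front, which is the kind of mismatch this issue is about.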
Issue Template
Checklist
Reason for Generic issue (keyword/topic)
text extraction
Description
88d1ebc0-75ea-4fac-b398-202848e59d80.pdf
Script to reproduce issue
Output: