Inconsistent textpage.get_text_range results #261
Comments
I'm afraid I don't know any more than you do here. Well, this is currently the code for get_text_range():
    def get_text_range(self, index=0, count=-1, errors="ignore"):
        if count == -1:
            count = self.count_chars() - index
        n_bytes = count * 2
        buffer = ctypes.create_string_buffer(n_bytes+2)
        buffer_ptr = ctypes.cast(buffer, ctypes.POINTER(ctypes.c_ushort))
        pdfium_c.FPDFText_GetText(self, index, count, buffer_ptr)
        return buffer.raw[:n_bytes].decode("utf-16-le", errors=errors)
What catches the eye is that we don't handle the return of
I pushed commit a9c6485 which (I think) should fix the trailing null char part of the issue in accordance with the hypothesis above.
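To make the trailing null char hypothesis concrete, here is a minimal sketch. It does not call PDFium: `fake_get_text` is a hypothetical stand-in that only mimics the documented contract of FPDFText_GetText (write UTF-16LE code units, append a null terminator, return the number of units written including the terminator). The `text` parameter is likewise an assumption made for illustration. The idea is to slice the buffer by the returned count rather than by the requested count.

```python
import ctypes

def fake_get_text(text, index, count, buffer_ptr):
    # Hypothetical stand-in for FPDFText_GetText (BMP characters only).
    # Writes up to `count` code units, then a null terminator, and returns
    # the number of units written including that terminator.
    units = text[index:index + count]
    for i, ch in enumerate(units):
        buffer_ptr[i] = ord(ch)
    buffer_ptr[len(units)] = 0
    return len(units) + 1

def get_text_range_naive(text, index, count):
    # Mirrors the code quoted above: ignores the return value.
    n_bytes = count * 2
    buffer = ctypes.create_string_buffer(n_bytes + 2)
    buffer_ptr = ctypes.cast(buffer, ctypes.POINTER(ctypes.c_ushort))
    fake_get_text(text, index, count, buffer_ptr)
    return buffer.raw[:n_bytes].decode("utf-16-le")

def get_text_range_checked(text, index, count):
    # Uses the return value to decide how many bytes to decode.
    # Assumes a little-endian host, since c_ushort uses native byte order.
    n_bytes = count * 2
    buffer = ctypes.create_string_buffer(n_bytes + 2)  # room for the terminator
    buffer_ptr = ctypes.cast(buffer, ctypes.POINTER(ctypes.c_ushort))
    written = fake_get_text(text, index, count, buffer_ptr)
    out_bytes = (written - 1) * 2  # drop the terminator
    return buffer.raw[:out_bytes].decode("utf-16-le")
```

When fewer characters come back than were requested, the naive variant pads the result with null characters, while the checked variant returns only what was actually written.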
Continuing on this, I made the following observation:
    >>> import pypdfium2.raw as pdfium_c
    >>> pdfium_c.FPDFText_GetTextIndexFromCharIndex(textpage, 3400-2)
    3398
    >>> pdfium_c.FPDFText_GetTextIndexFromCharIndex(textpage, 3400-1)
    -1
    >>> pdfium_c.FPDFText_GetTextIndexFromCharIndex(textpage, 3400)
    3399
It looks like you've found the sample for https://groups.google.com/g/pdfium/c/LNkXslbSRPY 😅
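The three return values above are consistent with one internal character being left out of the extracted text, so that every later character sits one position earlier in the text than in the character list. A toy model of that mapping, assuming a hypothetical per-character `included` flag (this is not PDFium's actual implementation), could look like this:

```python
def build_text_index_map(included):
    """Map each char index to a text index, or -1 if the char is skipped.

    `included[i]` is a hypothetical flag saying whether internal char i
    appears in the extracted text.
    """
    mapping = []
    next_text_index = 0
    for flag in included:
        if flag:
            mapping.append(next_text_index)
            next_text_index += 1
        else:
            mapping.append(-1)
    return mapping

# Five internal chars, the fourth one is not part of the extracted text.
included = [True, True, True, False, True]
mapping = build_text_index_map(included)
```

In this model the skipped character maps to -1 and the character after it maps to the text index the skipped one would otherwise have taken, which is the same pattern as 3398, -1, 3399 in the session above.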
Putting the clues together, I think we need to use
Thank you for reporting this, that was very helpful!
However, unfortunately that commit was not correct. I'll continue to investigate:
    >>> textpage.get_text_range(3400-2, 2)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../pypdfium2/_helpers/textpage.py", line 65, in get_text_range
        assert in_size == out_size
               ^^^^^^^^^^^^^^^^^^^
    AssertionError
Thank you for looking into this! I've created a ticket with pdfium, in case they might have more information. (I'm hoping they'll still classify it as a bug, as constantly converting from one indexing scheme to another is bound to be annoying.)
I think I've finally got
Now I get this, which should be correct:
    >>> textpage.get_text_range(3398, 1)
    'z'
    >>> textpage.get_text_range(3399, 1)
    ''
    >>> textpage.get_text_range(3400, 1)
    '\r'
    >>> textpage.get_text_range(3398, 2)
    'z'
    >>> textpage.get_text_range(3399, 2)
    '\r'
    >>> textpage.get_text_range(3398, 3)
    'z\r'
Char
Hmm, I suppose it's just two different representations / API layers. One is the internal character list as stored in the PDF, the other the "polished" output by
That said, even if we accept this design as intentional, here are some concerns with the present situation:
Don't lose l_passive in the second recursion clause. Also pass r_passive in the first recursion clause for consistency, though it should be a no-op.
I believe this is fixed as far as the pypdfium2 side is concerned.
Essentially, get_text_range(index, 1) != get_text_range(index - 1, 2)[1]. It's quite rare, but because the shift propagates throughout the rest of the page, it's quite annoying. It means you cannot use APIs like textpage.get_charbox if you used get_text_range(0, -1) before, because indexes will be out of sync for the rest of the page.
I suspect this is a PDFium issue, but I don't have a C++ setup to test it directly, and it may be some crazy encoding thing, so I thought I'd report here first.
Full code, using the attached document.
Barclays-PLC-Annual-Report-2022.pdf
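The inequality described in this report can be checked mechanically. The sketch below is not the reporter's original code (which is not reproduced in this thread); `find_range_mismatches` is a hypothetical helper that takes any callable with the shape of get_text_range(index, count) and reports the indexes where a single-character read disagrees with the second character of a two-character read starting one position earlier.

```python
def find_range_mismatches(get_range, n_chars):
    """Return indexes i where get_range(i, 1) != get_range(i - 1, 2)[1]."""
    mismatches = []
    for i in range(1, n_chars):
        single = get_range(i, 1)
        pair = get_range(i - 1, 2)
        # Slice instead of indexing so a short result does not raise.
        if single != pair[1:2]:
            mismatches.append(i)
    return mismatches

# Self-check with a plain string, where slicing is always consistent.
sample = "hello"
consistent = find_range_mismatches(lambda i, n: sample[i:i + n], len(sample))
```

With pypdfium2, one would presumably pass textpage.get_text_range and textpage.count_chars() as the two arguments; that usage is untested here against the attached document.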