Inconsistent textpage.get_text_range results #261

nikitar · 2023-09-21T04:00:15Z

Essentially get_text_range(index, 1) != get_text_range(index - 1, 2)[1]. It's quite rare, but because the shift propagates throughout the rest of the page it's quite annoying. It means you cannot use api like textpage.get_charbox if you used get_text_range(0, -1) before, because indexes will be out of sync for the rest of the page.

I suspect this is a PDFium issue, but I don't have a C++ setup to test it directly, and it may be some crazy encoding thing, so though I'd report here first.

Full code, using the attached document.

PDF_PATH = .....

import pypdfium2
from contextlib import closing

with closing(pypdfium2.PdfDocument(PDF_PATH)) as pdf:
    # page index is 159 (0-based), but it's named 158 in pdf viewers because there are two named pages in the beginning
    page = pdf[159]
    textpage = page.get_textpage()

    char_index = 3400
    char1 = textpage.get_text_range(char_index, 1, 'strict')
    char2 = textpage.get_text_range(char_index - 1, 2, 'strict')[1]
    assert char1 == char2

Barclays-PLC-Annual-Report-2022.pdf

The text was updated successfully, but these errors were encountered:

nikitar · 2023-09-21T04:04:48Z

Additionally, if you retrieve whole page text with textpage.get_text_range(0, -1, 'strict'), it will have a null char at the end. That does not happen for other pages, so I assume it's the same issue.

mara004 · 2023-09-21T11:22:52Z

I'm afraid I don't know any more than you do here.

Well, this currently is the code for get_text_range():

def get_text_range(self, index=0, count=-1, errors="ignore"):
    
    if count == -1:
        count = self.count_chars() - index
    
    n_bytes = count * 2
    buffer = ctypes.create_string_buffer(n_bytes+2)
    buffer_ptr = ctypes.cast(buffer, ctypes.POINTER(ctypes.c_ushort))
    
    pdfium_c.FPDFText_GetText(self, index, count, buffer_ptr)
    return buffer.raw[:n_bytes].decode("utf-16-le", errors=errors)

What catches the eye is that we don't handle the return of FPDFText_GetText(), but rely entirely on the count * 2.
(Unlike many other APIs, FPDFText_GetText() can't be called with a NULL buffer to get the size first.)
However, I guess we should use the exact return from FPDFText_GetText() when slicing the buffer (and probably add some check, too - CC #208). The trailing null char looks a bit odd, like the actual number of bytes written to the buffer might be 2 less than calculated.

mara004 · 2023-09-21T12:55:22Z

However, I guess we should use the exact return from FPDFText_GetText() when slicing the buffer

I pushed commit a9c6485 which (I think) should fix the trailing null char part of the issue in accordance with the hypothesis above.
(Sorry for the accidental close, GH doesn't understand "Partially fix" correctly.)

mara004 · 2023-09-21T13:10:16Z

Continuing on this, I made the following observation:

>>> import pypdfium2.raw as pdfium_c
>>> pdfium_c.FPDFText_GetTextIndexFromCharIndex(textpage, 3400-2)
3398
>>> pdfium_c.FPDFText_GetTextIndexFromCharIndex(textpage, 3400-1)
-1
>>> pdfium_c.FPDFText_GetTextIndexFromCharIndex(textpage, 3400)
3399

It looks like you've found the sample for https://groups.google.com/g/pdfium/c/LNkXslbSRPY 😅

mara004 · 2023-09-21T14:07:15Z

Putting the clues together, I think we need to use FPDFText_GetTextIndexFromCharIndex() to get the exact required buffer size to fix this correctly, not just working around a too large buffer. I will push a commit shortly.

Thank you for reporting this, that was very helpful!

@nikitar

Thanks to @nikitar (Nikita Rybak)

mara004 · 2023-09-21T14:55:56Z

However, unfortunately that commit was not correct. I'll continue to investigate:

>>> textpage.get_text_range(3400-2, 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../pypdfium2/_helpers/textpage.py", line 65, in get_text_range
    assert in_size == out_size
           ^^^^^^^^^^^^^^^^^^^
AssertionError

nikitar · 2023-09-21T15:03:14Z

Thank you for looking into this! I've created a ticket with pdfium, in case they might have more information. (I'm hoping they'll still classify it as a bug, as constantly converting from one indexing scheme to another is bound to be annoying)

Finally try to fix it properly.

mara004 · 2023-09-21T17:39:21Z

I think I've finally got get_text_range() correct with commit 535398d, but this was fairly complicated.
(Commit 71bdc71 was wrong and is now reverted.)

Now I get this, which should be correct:

>>> textpage.get_text_range(3398, 1)
'z'
>>> textpage.get_text_range(3399, 1)
''
>>> textpage.get_text_range(3400, 1)
'\r'
>>> textpage.get_text_range(3398, 2)
'z'
>>> textpage.get_text_range(3399, 2)
'\r'
>>> textpage.get_text_range(3398, 3)
'z\r'

Char 3399 is "passive", i.e. it exists in the internal character list but is excluded from the FPDFText_GetText() output.
In this light, you simply cannot expect calls like get_text_range(index-1, 2)[1] to work in the general case, e.g. get_text_range(3399, 2)[1] on the given PDF would now raise an index error, which is correct.
That's just how the API is, FPDFText_GetText() may exclude or insert characters, so the returned text's length does not have to match the given count, even if it may for most PDFs.

mara004 · 2023-09-21T19:57:04Z

I'm hoping they'll still classify it as a bug, as constantly converting from one indexing scheme to another is bound to be annoying

Hmm, I suppose it's just two different representations / API layers. One is the internal character list as stored in the PDF, the other the "polished" output by FPDFText_GetText(). It may be confusing for a caller, but I don't think it's technically wrong. Though, I agree it might be better to avoid the distinction in the first place if it could be helped.

That said, even if we accept this design as intentional, here some concerns with the present situation:

Incomplete and non-obvious documentation (even if one can kind of see what is meant in retrospect)
FPDFText_GetText() behaving inconsistently when it comes to passive chars:
- Trailing passive chars are counted, meaning the function stops with the last active char in range(index, index+count) (OK).
- Leading passive chars, however, are not counted, meaning pdfium just increments the start index to the first active char and consumes count chars from that point onward, i.e. it may read beyond the scope range(index, index+count) (ODD).
pypdfium2 now accounts for this by adjusting index and count to exclude leading/trailing passive chars in the first place, so we are independent of how FPDFText_GetText() handles them.
That was a bit troublesome to find out.
FPDFText_GetText() not allowing to be called with a NULL buffer first (unlike other text APIs), which means the caller has to install some fairly complicated logic to get the allocation right. A naive caller just constructing the buffer from char count without FPDFText_GetTextIndexFromCharIndex() might even threaten to run into OOB write in case of newly inserted characters.

Don't loose l_passive in the second recursion clause. Also pass r_passive in the first recursion clause for consistency, though it should be a no-op.

Don't lose l_passive in the second recursion clause. Also pass r_passive in the first recursion clause for consistency, though it should be a no-op.

mara004 · 2023-09-23T12:50:49Z

I believe this is fixed for what the pypdfium2 side is concerned.
Any further considerations are up to pdfium (https://bugs.chromium.org/p/pdfium/issues/detail?id=2079), so I'd close this issue, OK? Feel free to reopen if you think there's something left to do in pypdfium2.

mara004 closed this as completed in a9c6485 Sep 21, 2023

mara004 added a commit that referenced this issue Sep 21, 2023

Partially fix #261

1fd8fef

mara004 reopened this Sep 21, 2023

mara004 added a commit that referenced this issue Sep 21, 2023

Further correct get_text_range() (CC #261)

71bdc71

Thanks to @nikitar (Nikita Rybak)

mara004 added a commit that referenced this issue Sep 21, 2023

Continue on #261

535398d

Finally try to fix it properly.

mara004 added a commit that referenced this issue Sep 22, 2023

Update docs for get_text_range() (CC #261)

d50d157

mara004 added a commit that referenced this issue Sep 22, 2023

Update docs for get_text_range() (CC #261)

e32bdd3

mara004 added a commit that referenced this issue Sep 22, 2023

Update docs for get_text_range() (CC #261)

22926fb

mara004 added a commit that referenced this issue Sep 22, 2023

Update docs for get_text_range() (CC #261)

73601aa

mara004 added a commit that referenced this issue Sep 22, 2023

Update changelog (CC #261)

f238883

mara004 added a commit that referenced this issue Sep 22, 2023

Update docs for get_text_range() (CC #261)

fdf95fc

mara004 added a commit that referenced this issue Sep 22, 2023

Update changelog (CC #261)

3d23c56

mara004 added a commit that referenced this issue Sep 22, 2023

[Hotfix] get_text_range: don't loose l_passive (CC #261)

ccaef58

Don't loose l_passive in the second recursion clause. Also pass r_passive in the first recursion clause for consistency, though it should be a no-op.

mara004 added a commit that referenced this issue Sep 22, 2023

[Hotfix] get_text_range: don't loose l_passive (CC #261)

bebd682

Don't loose l_passive in the second recursion clause. Also pass r_passive in the first recursion clause for consistency, though it should be a no-op.

mara004 added a commit that referenced this issue Sep 22, 2023

[Hotfix] get_text_range: don't lose l_passive (CC #261)

456899c

Don't lose l_passive in the second recursion clause. Also pass r_passive in the first recursion clause for consistency, though it should be a no-op.

mara004 closed this as completed Sep 23, 2023

mara004 mentioned this issue Feb 22, 2024

Buffer size mismatch when calling get_text_range #298

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent textpage.get_text_range results #261

Inconsistent textpage.get_text_range results #261

nikitar commented Sep 21, 2023 •

edited by mara004

nikitar commented Sep 21, 2023

mara004 commented Sep 21, 2023 •

edited

mara004 commented Sep 21, 2023

mara004 commented Sep 21, 2023 •

edited

mara004 commented Sep 21, 2023 •

edited

mara004 commented Sep 21, 2023 •

edited

nikitar commented Sep 21, 2023

mara004 commented Sep 21, 2023 •

edited

mara004 commented Sep 21, 2023 •

edited

mara004 commented Sep 23, 2023

Inconsistent textpage.get_text_range results #261

Inconsistent textpage.get_text_range results #261

Comments

nikitar commented Sep 21, 2023 • edited by mara004

nikitar commented Sep 21, 2023

mara004 commented Sep 21, 2023 • edited

mara004 commented Sep 21, 2023

mara004 commented Sep 21, 2023 • edited

mara004 commented Sep 21, 2023 • edited

mara004 commented Sep 21, 2023 • edited

nikitar commented Sep 21, 2023

mara004 commented Sep 21, 2023 • edited

mara004 commented Sep 21, 2023 • edited

mara004 commented Sep 23, 2023

nikitar commented Sep 21, 2023 •

edited by mara004

mara004 commented Sep 21, 2023 •

edited

mara004 commented Sep 21, 2023 •

edited

mara004 commented Sep 21, 2023 •

edited

mara004 commented Sep 21, 2023 •

edited

mara004 commented Sep 21, 2023 •

edited

mara004 commented Sep 21, 2023 •

edited