Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent textpage.get_text_range results #261

Closed
nikitar opened this issue Sep 21, 2023 · 10 comments
Closed

Inconsistent textpage.get_text_range results #261

nikitar opened this issue Sep 21, 2023 · 10 comments

Comments

@nikitar
Copy link

nikitar commented Sep 21, 2023

Essentially get_text_range(index, 1) != get_text_range(index - 1, 2)[1]. It's quite rare, but because the shift propagates throughout the rest of the page it's quite annoying. It means you cannot use api like textpage.get_charbox if you used get_text_range(0, -1) before, because indexes will be out of sync for the rest of the page.

I suspect this is a PDFium issue, but I don't have a C++ setup to test it directly, and it may be some crazy encoding thing, so though I'd report here first.

Full code, using the attached document.

PDF_PATH = .....

import pypdfium2
from contextlib import closing

with closing(pypdfium2.PdfDocument(PDF_PATH)) as pdf:
    # page index is 159 (0-based), but it's named 158 in pdf viewers because there are two named pages in the beginning
    page = pdf[159]
    textpage = page.get_textpage()

    char_index = 3400
    char1 = textpage.get_text_range(char_index, 1, 'strict')
    char2 = textpage.get_text_range(char_index - 1, 2, 'strict')[1]
    assert char1 == char2

Barclays-PLC-Annual-Report-2022.pdf

@nikitar
Copy link
Author

nikitar commented Sep 21, 2023

Additionally, if you retrieve whole page text with textpage.get_text_range(0, -1, 'strict'), it will have a null char at the end. That does not happen for other pages, so I assume it's the same issue.

Screenshot 2023-09-21 at 2 03 21 pm

@mara004
Copy link
Member

mara004 commented Sep 21, 2023

I'm afraid I don't know any more than you do here.

Well, this currently is the code for get_text_range():

def get_text_range(self, index=0, count=-1, errors="ignore"):
    
    if count == -1:
        count = self.count_chars() - index
    
    n_bytes = count * 2
    buffer = ctypes.create_string_buffer(n_bytes+2)
    buffer_ptr = ctypes.cast(buffer, ctypes.POINTER(ctypes.c_ushort))
    
    pdfium_c.FPDFText_GetText(self, index, count, buffer_ptr)
    return buffer.raw[:n_bytes].decode("utf-16-le", errors=errors)

What catches the eye is that we don't handle the return of FPDFText_GetText(), but rely entirely on the count * 2.
(Unlike many other APIs, FPDFText_GetText() can't be called with a NULL buffer to get the size first.)
However, I guess we should use the exact return from FPDFText_GetText() when slicing the buffer (and probably add some check, too - CC #208). The trailing null char looks a bit odd, like the actual number of bytes written to the buffer might be 2 less than calculated.

mara004 added a commit that referenced this issue Sep 21, 2023
@mara004 mara004 reopened this Sep 21, 2023
@mara004
Copy link
Member

mara004 commented Sep 21, 2023

However, I guess we should use the exact return from FPDFText_GetText() when slicing the buffer

I pushed commit a9c6485 which (I think) should fix the trailing null char part of the issue in accordance with the hypothesis above.
(Sorry for the accidental close, GH doesn't understand "Partially fix" correctly.)

@mara004
Copy link
Member

mara004 commented Sep 21, 2023

Continuing on this, I made the following observation:

>>> import pypdfium2.raw as pdfium_c
>>> pdfium_c.FPDFText_GetTextIndexFromCharIndex(textpage, 3400-2)
3398
>>> pdfium_c.FPDFText_GetTextIndexFromCharIndex(textpage, 3400-1)
-1
>>> pdfium_c.FPDFText_GetTextIndexFromCharIndex(textpage, 3400)
3399

It looks like you've found the sample for https://groups.google.com/g/pdfium/c/LNkXslbSRPY 😅

@mara004
Copy link
Member

mara004 commented Sep 21, 2023

Putting the clues together, I think we need to use FPDFText_GetTextIndexFromCharIndex() to get the exact required buffer size to fix this correctly, not just working around a too large buffer. I will push a commit shortly.

Thank you for reporting this, that was very helpful!

mara004 added a commit that referenced this issue Sep 21, 2023
@mara004
Copy link
Member

mara004 commented Sep 21, 2023

However, unfortunately that commit was not correct. I'll continue to investigate:

>>> textpage.get_text_range(3400-2, 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../pypdfium2/_helpers/textpage.py", line 65, in get_text_range
    assert in_size == out_size
           ^^^^^^^^^^^^^^^^^^^
AssertionError

@nikitar
Copy link
Author

nikitar commented Sep 21, 2023

Thank you for looking into this! I've created a ticket with pdfium, in case they might have more information. (I'm hoping they'll still classify it as a bug, as constantly converting from one indexing scheme to another is bound to be annoying)

mara004 added a commit that referenced this issue Sep 21, 2023
Finally try to fix it properly.
@mara004
Copy link
Member

mara004 commented Sep 21, 2023

I think I've finally got get_text_range() correct with commit 535398d, but this was fairly complicated.
(Commit 71bdc71 was wrong and is now reverted.)

Now I get this, which should be correct:

>>> textpage.get_text_range(3398, 1)
'z'
>>> textpage.get_text_range(3399, 1)
''
>>> textpage.get_text_range(3400, 1)
'\r'
>>> textpage.get_text_range(3398, 2)
'z'
>>> textpage.get_text_range(3399, 2)
'\r'
>>> textpage.get_text_range(3398, 3)
'z\r'

Char 3399 is "passive", i.e. it exists in the internal character list but is excluded from the FPDFText_GetText() output.
In this light, you simply cannot expect calls like get_text_range(index-1, 2)[1] to work in the general case, e.g. get_text_range(3399, 2)[1] on the given PDF would now raise an index error, which is correct.
That's just how the API is, FPDFText_GetText() may exclude or insert characters, so the returned text's length does not have to match the given count, even if it may for most PDFs.

@mara004
Copy link
Member

mara004 commented Sep 21, 2023

I'm hoping they'll still classify it as a bug, as constantly converting from one indexing scheme to another is bound to be annoying

Hmm, I suppose it's just two different representations / API layers. One is the internal character list as stored in the PDF, the other the "polished" output by FPDFText_GetText(). It may be confusing for a caller, but I don't think it's technically wrong. Though, I agree it might be better to avoid the distinction in the first place if it could be helped.

That said, even if we accept this design as intentional, here some concerns with the present situation:

  • Incomplete and non-obvious documentation (even if one can kind of see what is meant in retrospect)

  • FPDFText_GetText() behaving inconsistently when it comes to passive chars:

    • Trailing passive chars are counted, meaning the function stops with the last active char in range(index, index+count) (OK).
    • Leading passive chars, however, are not counted, meaning pdfium just increments the start index to the first active char and consumes count chars from that point onward, i.e. it may read beyond the scope range(index, index+count) (ODD).

    pypdfium2 now accounts for this by adjusting index and count to exclude leading/trailing passive chars in the first place, so we are independent of how FPDFText_GetText() handles them.
    That was a bit troublesome to find out.

  • FPDFText_GetText() not allowing to be called with a NULL buffer first (unlike other text APIs), which means the caller has to install some fairly complicated logic to get the allocation right. A naive caller just constructing the buffer from char count without FPDFText_GetTextIndexFromCharIndex() might even threaten to run into OOB write in case of newly inserted characters.

mara004 added a commit that referenced this issue Sep 22, 2023
mara004 added a commit that referenced this issue Sep 22, 2023
mara004 added a commit that referenced this issue Sep 22, 2023
mara004 added a commit that referenced this issue Sep 22, 2023
mara004 added a commit that referenced this issue Sep 22, 2023
mara004 added a commit that referenced this issue Sep 22, 2023
mara004 added a commit that referenced this issue Sep 22, 2023
mara004 added a commit that referenced this issue Sep 22, 2023
Don't loose l_passive in the second recursion clause.
Also pass r_passive in the first recursion clause for consistency,
though it should be a no-op.
mara004 added a commit that referenced this issue Sep 22, 2023
Don't loose l_passive in the second recursion clause.
Also pass r_passive in the first recursion clause for consistency,
though it should be a no-op.
mara004 added a commit that referenced this issue Sep 22, 2023
Don't lose l_passive in the second recursion clause.
Also pass r_passive in the first recursion clause for consistency,
though it should be a no-op.
@mara004
Copy link
Member

mara004 commented Sep 23, 2023

I believe this is fixed for what the pypdfium2 side is concerned.
Any further considerations are up to pdfium (https://bugs.chromium.org/p/pdfium/issues/detail?id=2079), so I'd close this issue, OK? Feel free to reopen if you think there's something left to do in pypdfium2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants