Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected behavior: get_rect() requires count_rects() #207

Closed
guyrosin opened this issue Apr 19, 2023 · 10 comments
Closed

Unexpected behavior: get_rect() requires count_rects() #207

guyrosin opened this issue Apr 19, 2023 · 10 comments
Assignees
Labels
bug Something isn't working pdfium This issue may be caused by (or related to) pdfium itself

Comments

@guyrosin
Copy link

Hi,
It seems get_rect() returns the actual rectangle only if count_rects() was previously executed on the same PdfTextPage object.

Here's a code example to reproduce:

import urllib.request

import pypdfium2 as pdfium

pdf_path = "test.pdf"
urllib.request.urlretrieve(
    "https://www.africau.edu/images/default/sample.pdf", pdf_path
)
doc = pdfium.PdfDocument(pdf_path)
page = doc[0]
text_page = page.get_textpage()
print(text_page.get_rect(0))
print("----------")
text_page = page.get_textpage()
print(text_page.count_rects())
print(text_page.get_rect(0))

Output:

(0.0, 0.0, 0.0, 0.0)
----------
10
(57.375, 716.9340209960938, 292.9229736328125, 741.93603515625)
@mara004
Copy link
Member

mara004 commented Apr 19, 2023

Thanks for reporting this!

PDFium docs say

// Function: FPDFText_CountRects
//          Counts number of rectangular areas occupied by a segment of text,
//          and caches the result for subsequent FPDFText_GetRect() calls.

so presumably this is a known (deliberate) requirement of pdfium that FPDFText_CountRects() needs to be called once before FPDFText_GetRect() (as should virtually always be the case in real-world code, but nvm).
However, I agree that our Python helper class should somehow handle this.

Apart from that, get_rect() also lacks a return code check. A failed get_rect() call should raise an exception rather than returning an all-zero rect (CC #208).

@mara004 mara004 self-assigned this Apr 19, 2023
@mara004 mara004 added bug Something isn't working minor Low importance pdfium This issue may be caused by (or related to) pdfium itself labels Apr 19, 2023
@guyrosin
Copy link
Author

Yeah, I agree (and must say it's a weird design choice...)
Thanks for the quick response!

@mara004
Copy link
Member

mara004 commented Apr 20, 2023

and must say it's a weird design choice...

Hmm, yeah, I guess GetRect() needs to know rect count for an OOB check of the caller's index (as this is C), but then one would think GetRect() could call CountRects() internally if no value was assigned to the cache yet.

@guyrosin
Copy link
Author

Exactly!

@mara004
Copy link
Member

mara004 commented Apr 20, 2023

Do you think a proper exception for get_rect() with hint to check if count_rects() was called would be sufficient?

I could also implement the implicit count_rects() call, but that would involve a state variable and an additional if check for all get_rect() calls on our side, just to handle this edge case. Hardly seems worth it TBH.

mara004 added a commit that referenced this issue Apr 20, 2023
@guyrosin
Copy link
Author

OK, that makes sense. An exception along with a note in get_rect()'s documentation will probably suffice :)

@guyrosin
Copy link
Author

Just saw your commit. Looks great, thanks!

@mara004
Copy link
Member

mara004 commented Apr 20, 2023

In fact this issue reveals a bigger problem than initially thought.
The test code that should have detected this missed assert statements, so the expectations were never confirmed (my bad, sorry). The problem is that count_rects() can be called with specific spans (as our tests do), but then the cached value is not suitable for an OOB check, so count_rects() needs to be called again with the full range. Our docs now clarify this, but it sounds like a design mistake of pdfium.

@guyrosin
Copy link
Author

Oh :/ yeah that makes it sounds like a mistake rather than just weird

mara004 added a commit that referenced this issue Apr 20, 2023
@mara004
Copy link
Member

mara004 commented Apr 20, 2023

anyway, thanks a lot for reporting this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working pdfium This issue may be caused by (or related to) pdfium itself
Projects
None yet
Development

No branches or pull requests

2 participants