BUG: Handle Sequence as an IndirectObject when extracting text with layout mode #2788
The spec allows an int or float to be an IndirectObject as well, but this commit does not address that theoretical possibility.
I see https://pypdf.readthedocs.io/en/latest/dev/intro.html now so will fix the commit message, etc, then re-request |
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2788      +/-   ##
==========================================
+ Coverage   95.12%   95.16%   +0.04%
==========================================
  Files          51       51
  Lines        8547     8550       +3
  Branches     1705     1704       -1
==========================================
+ Hits         8130     8137       +7
+ Misses        263      261       -2
+ Partials      154      152       -2
Just edit the title of the PR - no need to force push or completely recreate the PR. |
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
Is If you agree, I will add a commit to simply call |
As it is, this sounds good to me. It is very, very unusual to replace an int with an IndirectObject: this is not efficient from a memory/size point of view. I would have really liked to have an example of a failing document.
Definitely a bug, as I mentioned it was raising a TypeError for me because of the attempt to add
I agree about the int/float thing. I'm just thinking: if it's not harmful, why not handle dumb documents that are technically correct? On both of these though, I defer. Like I keep saying, it's my first contribution to the package, and I don't have strong opinions.
EDIT: I'm checking with my client if I can have permission to publicly share the document that triggers the bug.
you can just extract 1 failing page and use remove_text() |
Sweet. |
- Rename w_1 to w_next_entry
- Utilize ParseError instead of PdfReadError
- Write a test (both positive and negative)
Also adds a comment to clarify that we don't explicitly handle the IndexError exception. Rather, we let it be raised as an IndexError.
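For readers following along, here is a minimal sketch of the kind of width-definition parsing this commit describes. It is not the code from the PR: `parse_widths` and the local `ParseError` class are hypothetical stand-ins, and the input is a plain Python list rather than real PDF objects. It assumes the two entry formats for a CID font's /W array (a start code followed by a list of widths, or a first code, last code and a single width), raises `ParseError` on a malformed entry, and lets a truncated list surface as a plain `IndexError`.

```python
from collections.abc import Sequence


class ParseError(Exception):
    """Hypothetical stand-in for the library's parse error."""


def parse_widths(w):
    """Map character codes to widths from a /W-style list (illustrative only).

    Two entry formats may be mixed:
      (1) start [w0 w1 ...]  -> widths for consecutive codes from start
      (2) first last width   -> one width for every code in the range
    """
    widths = {}
    idx = 0
    while idx < len(w):
        start = w[idx]
        # No explicit bounds check: a truncated list raises IndexError here.
        w_next_entry = w[idx + 1]
        if isinstance(w_next_entry, Sequence) and not isinstance(
            w_next_entry, (str, bytes)
        ):
            # Format (1): a list of widths for consecutive character codes.
            for offset, width in enumerate(w_next_entry):
                widths[start + offset] = width
            idx += 2
        elif isinstance(w_next_entry, (int, float)) and isinstance(
            w[idx + 2], (int, float)
        ):
            # Format (2): one constant width for an inclusive code range.
            for code in range(start, w_next_entry + 1):
                widths[code] = w[idx + 2]
            idx += 3
        else:
            raise ParseError(f"Invalid font width definition near index {idx}")
    return widths


# Positive case: both formats mixed in one definition.
valid = parse_widths([45, [500, 600, 700], 50, 52, 250])

# Negative case: the second element is neither a list nor a number.
try:
    parse_widths([45, "bad"])
    negative_result = "no error"
except ParseError:
    negative_result = "ParseError"
```

The positive and negative cases at the bottom mirror the shape of the test mentioned above, though the real test works against PDF files rather than literal lists.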
I'm at a loss why this broke Windows tests.
FAILED tests/test_cmap.py::test_text_extraction_fast[None-ASurveyofImageClassificationBasedTechniques.pdf-False] - FileNotFoundError: [Errno 2] No such file or directory: 'D:\a\pypdf\pypdf\tests\pdf_cache\ASurveyofImageClassificationBasedTechniques.pdf'
What does that have to do with what I added? I could use guidance on that. Was this a random failure?
Are the PDF files covered by your own copyright? If not, they should probably be uploaded to a comment inside this PR and referenced by a URL. Otherwise, they might better fit into the sample files repository. |
Sometimes, downloading some test files fails, which causes the corresponding tests to fail. I have just re-run the corresponding job and it passed. Could you please have a look at the code style issue?
No, but I removed all of the text. I'm fine moving them here as you suggest. |
UPDATE: This was due to a stale pdf_cache file.
So, I tried using the hosted file URLs here, but the Any ideas? I'm way out of my element here, and really cannot afford much more time on this. I downloaded the link (https://github.com/user-attachments/files/16491621/2788_example.pdf) and it's identical at a binary level to the version in the resources folder, which works.
This raises the new ParseError. |
Two other things which are bugging me about the code, now.
The test failure above was due to a bad pdf_cache. Now I know about pdf_cache. |
Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>
I'm a bit out of my element, so it's possible there's a better way to write this to handle IndirectObjects in all cases, but this fixed the TypeError I was getting with one of my PDFs (trying to add 1 to an IndirectObject).
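To illustrate the failure mode described above, here is a small self-contained sketch. The `Indirect` class and `resolve` helper are hypothetical stand-ins, not pypdf classes; the only assumption carried over from pypdf is that an indirect reference exposes a `get_object()` method returning the object it points to. The point is that an unresolved reference is neither a `Sequence` nor a number, so type checks miss it and arithmetic on it raises a TypeError.

```python
from collections.abc import Sequence


class Indirect:
    """Hypothetical stand-in for an indirect reference wrapping a target."""

    def __init__(self, target):
        self._target = target

    def get_object(self):
        return self._target


def resolve(obj):
    """Return the referenced object if obj is indirect, else obj itself."""
    return obj.get_object() if isinstance(obj, Indirect) else obj


# A width definition whose list of widths is stored indirectly.
w_entries = [45, Indirect([500, 600, 700])]

raw = w_entries[1]
resolved = resolve(raw)

# Before resolving, the entry does not look like a list of widths.
is_seq_before = isinstance(raw, Sequence)
# After resolving, the usual Sequence check works as intended.
is_seq_after = isinstance(resolved, Sequence)

# Arithmetic on the unresolved reference fails with TypeError.
try:
    Indirect(1) + 1
    add_outcome = "ok"
except TypeError:
    add_outcome = "TypeError"
```

Resolving every entry before inspecting its type would also cover the int/float case mentioned in the commit message, at the cost of one extra call per entry.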
I didn't open an issue because I think this is strictly better. I also fixed logic which I believe simply didn't work if the width definition was invalid; it will now raise a PdfReadError. That could theoretically be a breaking change, but I think anywhere this suddenly raises an exception was silently wrong before.