Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Handle Sequence as an IndirectObject when extracting text with layout mode #2788

Merged
merged 10 commits into from
Aug 5, 2024

Conversation

owurman
Copy link
Contributor

@owurman owurman commented Aug 3, 2024

The spec allows an int or float to be an IndirectObject as well, but this commit does not address that theoretical possibility.

I'm a bit out of my element, so it's possible there's a better way to write this to handle IndirectObjects in all cases, but this fixed the TypeError I was getting with one of my PDFs (trying to add 1 to an IndirectObject).

I didn't open an issue because I think this is strictly better. I also fixed logic which I believe simply didn't work if the width definition was invalid, and it will now return a PdfReadError, so that could theoretically be a breaking change, but I think anywhere this suddenly raises an Exception was silently wrong before.

The spec allows an int or float to be an IndirectObject as well, but this commit does not address that theoretical possibility.
@owurman owurman marked this pull request as draft August 3, 2024 03:55
@owurman
Copy link
Contributor Author

owurman commented Aug 3, 2024

I see https://pypdf.readthedocs.io/en/latest/dev/intro.html now so will fix the commit message, etc, then re-request

Copy link

codecov bot commented Aug 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.16%. Comparing base (38f3925) to head (6892666).
Report is 74 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2788      +/-   ##
==========================================
+ Coverage   95.12%   95.16%   +0.04%     
==========================================
  Files          51       51              
  Lines        8547     8550       +3     
  Branches     1705     1704       -1     
==========================================
+ Hits         8130     8137       +7     
+ Misses        263      261       -2     
+ Partials      154      152       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@stefan6419846
Copy link
Collaborator

I see https://pypdf.readthedocs.io/en/latest/dev/intro.html now so will fix the commit message, etc, then re-request

Just edit the title of the PR - no need to force push or completely recreate the PR.

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
@owurman owurman changed the title Handle Sequence as an IndirectObject when extracting text with layout mode BUG: Handle Sequence as an IndirectObject when extracting text with layout mode Aug 3, 2024
@owurman owurman marked this pull request as ready for review August 3, 2024 18:02
@owurman
Copy link
Contributor Author

owurman commented Aug 3, 2024

Is get_object() safe to call on int and float types? It looks like it is, so it's probably a good idea to do that. In theory a PDF could have those as IndirectObjects, as well.

If you agree, I will add a commit to simply call get_object() behind _w[idx + 1] and _w[idx + 2] in each usage. That seems cleaner to me, but I defer.

@pubpub-zz
Copy link
Collaborator

as it is, this sounds good to me : It very very very unusual to replace an int with a indirectobject: this is not efficient from a memory/size point of view. I would have really liked to have an example of a failling document
PS : you should consider this PR as an ENH not a BUG specially if no issues exist

@owurman
Copy link
Contributor Author

owurman commented Aug 3, 2024

Definitely a bug, as I mentioned it was raising a TypeError for me because of the attempt to add (IndirectObject) + 1.

I agree about the int/float thing, I'm just thinking if it's not harmful, why not handle dumb documents that are technically correct.

On both of these though, I defer. Like I keep saying, it's my first contribution to the package, and I don't have strong opinions.

EDIT: I'm checking with my client if I can have permission to publicly share the document that triggers the bug.

@pubpub-zz
Copy link
Collaborator

EDIT: I'm checking with my client if I can have permission to publicly share the document that triggers the bug.

you can just extract 1 failing page and use remove_text()

@owurman
Copy link
Contributor Author

owurman commented Aug 3, 2024

Sweet.
2788_example.pdf

-Rename w_1 to w_next_entry
-Utilize ParseError instead of PdfReadError
-Write a test (both positive and negative)
Also adds a comment to clarify that we don't explicitly handle the IndexError exception. Rather, we let it be raised as an IndexError.
@owurman owurman requested a review from pubpub-zz August 5, 2024 01:18
@owurman
Copy link
Contributor Author

owurman commented Aug 5, 2024

I'm at a loss why this broke Windows tests.

FAILED tests/test_cmap.py::test_text_extraction_fast[None-ASurveyofImageClassificationBasedTechniques.pdf-False] - FileNotFoundError: [Errno 2] No such file or directory: 'D:\a\pypdf\pypdf\tests\pdf_cache\ASurveyofImageClassificationBasedTechniques.pdf'

What does that have to do with what I added? I could use guidance on that. Was this a random failure?

@stefan6419846
Copy link
Collaborator

Are the PDF files covered by your own copyright? If not, they should probably be uploaded to a comment inside this PR and referenced by a URL. Otherwise, they might better fit into the sample files repository.

@stefan6419846
Copy link
Collaborator

What does that have to do with what I added? I could use guidance on that. Was this a random failure?

Sometimes, downloading some test files fails and thus let's the corresponding tests fail. I have just re-run the corresponding job and it passed.

Could you please have a look at the code style issue?

@owurman
Copy link
Contributor Author

owurman commented Aug 5, 2024

Are the PDF files covered by your own copyright? If not, they should probably be uploaded to a comment inside this PR and referenced by a URL. Otherwise, they might better fit into the sample files repository.

No, but I removed all of the text. I'm fine moving them here as you suggest.
2788_example_malformed.pdf
2788_example.pdf

@owurman
Copy link
Contributor Author

owurman commented Aug 5, 2024

UPDATE: This was due to a stale pdf_cache file.

So, I tired using the hosted file urls here, but the 2788_example.pdf has a strange issue. It reads fine from the resources folder, but when I download it, the actual integer element sequence (not Sequence) 255, 257, 330 seems to be getting interpreted as 255, 257, 33 0 R. In code it's then being treated as 255, 257, [500] ([500] being the Sequence value of the IndirectObject 33 0).

Any ideas? I'm way out of my element here, and really cannot afford much more time on this.

I downloaded the link (https://github.com/user-attachments/files/16491621/2788_example.pdf) and it's identical at a binary level to the version in the resources folder which works.

@pytest.mark.enable_socket()
def test_layout_mode_indirect_sequence_font_widths():
    # Cover the situation where the sequence for font widths is an IndirectObject
    # ref https://github.com/py-pdf/pypdf/pull/2788
    url = "https://github.com/user-attachments/files/16491621/2788_example.pdf"
    reader = PdfReader(BytesIO(get_data_from_url(url, name="2788_example.pdf")))
    assert reader.pages[0].extract_text(extraction_mode="layout") == ""

This raises the new ParseError.

@owurman
Copy link
Contributor Author

owurman commented Aug 5, 2024

Two other things which are bugging me about the code, now.

  1. Per the spec, I think all of the numbers must be integers, not floats. Maybe the width itself can be a float, but certainly not the indices, right?
  2. Any objection if I add a ParseError instead of the continue on line 69?

@owurman
Copy link
Contributor Author

owurman commented Aug 5, 2024

The test failure above was due to a bad pdf_cache. Now I know about pdf_cache.

resources/2788_example.pdf Outdated Show resolved Hide resolved
resources/2788_example_malformed.pdf Outdated Show resolved Hide resolved
pypdf/_text_extraction/_layout_mode/_font.py Show resolved Hide resolved
tests/test_text_extraction.py Outdated Show resolved Hide resolved
@owurman owurman requested a review from pubpub-zz August 5, 2024 16:46
@pubpub-zz pubpub-zz merged commit b2d7204 into py-pdf:main Aug 5, 2024
17 checks passed
@owurman owurman deleted the handle_font_indirect_sequences branch August 5, 2024 19:17
@owurman owurman restored the handle_font_indirect_sequences branch August 5, 2024 19:26
@pubpub-zz pubpub-zz mentioned this pull request Sep 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants