BUG: Handle Sequence as an IndirectObject when extracting text with layout mode #2788

owurman · 2024-08-03T03:53:51Z

The spec allows an int or float to be an IndirectObject as well, but this commit does not address that theoretical possibility.

I'm a bit out of my element, so it's possible there's a better way to write this to handle IndirectObjects in all cases, but this fixed the TypeError I was getting with one of my PDFs (trying to add 1 to an IndirectObject).

I didn't open an issue because I think this is strictly better. I also fixed logic which I believe simply didn't work if the width definition was invalid, and it will now return a PdfReadError, so that could theoretically be a breaking change, but I think anywhere this suddenly raises an Exception was silently wrong before.

The spec allows an int or float to be an IndirectObject as well, but this commit does not address that theoretical possibility.

owurman · 2024-08-03T03:56:27Z

I see https://pypdf.readthedocs.io/en/latest/dev/intro.html now so will fix the commit message, etc, then re-request

codecov · 2024-08-03T04:03:54Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.16%. Comparing base (38f3925) to head (6892666).
Report is 74 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2788      +/-   ##
==========================================
+ Coverage   95.12%   95.16%   +0.04%     
==========================================
  Files          51       51              
  Lines        8547     8550       +3     
  Branches     1705     1704       -1     
==========================================
+ Hits         8130     8137       +7     
+ Misses        263      261       -2     
+ Partials      154      152       -2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

pypdf/_text_extraction/_layout_mode/_font.py

stefan6419846 · 2024-08-03T08:37:52Z

I see https://pypdf.readthedocs.io/en/latest/dev/intro.html now so will fix the commit message, etc, then re-request

Just edit the title of the PR - no need to force push or completely recreate the PR.

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

owurman · 2024-08-03T18:08:57Z

Is get_object() safe to call on int and float types? It looks like it is, so it's probably a good idea to do that. In theory a PDF could have those as IndirectObjects, as well.

If you agree, I will add a commit to simply call get_object() behind _w[idx + 1] and _w[idx + 2] in each usage. That seems cleaner to me, but I defer.

pubpub-zz · 2024-08-03T18:48:46Z

as it is, this sounds good to me : It very very very unusual to replace an int with a indirectobject: this is not efficient from a memory/size point of view. I would have really liked to have an example of a failling document
PS : you should consider this PR as an ENH not a BUG specially if no issues exist

owurman · 2024-08-03T18:54:29Z

Definitely a bug, as I mentioned it was raising a TypeError for me because of the attempt to add (IndirectObject) + 1.

I agree about the int/float thing, I'm just thinking if it's not harmful, why not handle dumb documents that are technically correct.

On both of these though, I defer. Like I keep saying, it's my first contribution to the package, and I don't have strong opinions.

EDIT: I'm checking with my client if I can have permission to publicly share the document that triggers the bug.

pubpub-zz · 2024-08-03T19:08:56Z

EDIT: I'm checking with my client if I can have permission to publicly share the document that triggers the bug.

you can just extract 1 failing page and use remove_text()

owurman · 2024-08-03T19:42:42Z

Sweet.
2788_example.pdf

pypdf/_text_extraction/_layout_mode/_font.py

-Rename w_1 to w_next_entry -Utilize ParseError instead of PdfReadError -Write a test (both positive and negative)

Also adds a comment to clarify that we don't explicitly handle the IndexError exception. Rather, we let it be raised as an IndexError.

owurman · 2024-08-05T01:32:32Z

I'm at a loss why this broke Windows tests.

FAILED tests/test_cmap.py::test_text_extraction_fast[None-ASurveyofImageClassificationBasedTechniques.pdf-False] - FileNotFoundError: [Errno 2] No such file or directory: 'D:\a\pypdf\pypdf\tests\pdf_cache\ASurveyofImageClassificationBasedTechniques.pdf'

What does that have to do with what I added? I could use guidance on that. Was this a random failure?

stefan6419846 · 2024-08-05T06:06:31Z

Are the PDF files covered by your own copyright? If not, they should probably be uploaded to a comment inside this PR and referenced by a URL. Otherwise, they might better fit into the sample files repository.

stefan6419846 · 2024-08-05T06:15:34Z

What does that have to do with what I added? I could use guidance on that. Was this a random failure?

Sometimes, downloading some test files fails and thus let's the corresponding tests fail. I have just re-run the corresponding job and it passed.

Could you please have a look at the code style issue?

owurman · 2024-08-05T06:32:09Z

Are the PDF files covered by your own copyright? If not, they should probably be uploaded to a comment inside this PR and referenced by a URL. Otherwise, they might better fit into the sample files repository.

No, but I removed all of the text. I'm fine moving them here as you suggest.
2788_example_malformed.pdf
2788_example.pdf

owurman · 2024-08-05T06:47:46Z

UPDATE: This was due to a stale pdf_cache file.

So, I tired using the hosted file urls here, but the 2788_example.pdf has a strange issue. It reads fine from the resources folder, but when I download it, the actual integer element sequence (not Sequence) 255, 257, 330 seems to be getting interpreted as 255, 257, 33 0 R. In code it's then being treated as 255, 257, [500] ([500] being the Sequence value of the IndirectObject 33 0).

Any ideas? I'm way out of my element here, and really cannot afford much more time on this.

I downloaded the link (https://github.com/user-attachments/files/16491621/2788_example.pdf) and it's identical at a binary level to the version in the resources folder which works.

@pytest.mark.enable_socket()
def test_layout_mode_indirect_sequence_font_widths():
    # Cover the situation where the sequence for font widths is an IndirectObject
    # ref https://github.com/py-pdf/pypdf/pull/2788
    url = "https://github.com/user-attachments/files/16491621/2788_example.pdf"
    reader = PdfReader(BytesIO(get_data_from_url(url, name="2788_example.pdf")))
    assert reader.pages[0].extract_text(extraction_mode="layout") == ""

This raises the new ParseError.

owurman · 2024-08-05T06:57:59Z

Two other things which are bugging me about the code, now.

Per the spec, I think all of the numbers must be integers, not floats. Maybe the width itself can be a float, but certainly not the indices, right?
Any objection if I add a ParseError instead of the continue on line 69?

owurman · 2024-08-05T07:05:56Z

The test failure above was due to a bad pdf_cache. Now I know about pdf_cache.

resources/2788_example.pdf

resources/2788_example_malformed.pdf

pypdf/_text_extraction/_layout_mode/_font.py

tests/test_text_extraction.py

Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>

Handle Sequence as an IndirectObject

9a50a9c

The spec allows an int or float to be an IndirectObject as well, but this commit does not address that theoretical possibility.

owurman marked this pull request as draft August 3, 2024 03:55

stefan6419846 reviewed Aug 3, 2024

View reviewed changes

pypdf/_text_extraction/_layout_mode/_font.py Outdated Show resolved Hide resolved

stefan6419846 reviewed Aug 3, 2024

View reviewed changes

pypdf/_text_extraction/_layout_mode/_font.py Outdated Show resolved Hide resolved

Update pypdf/_text_extraction/_layout_mode/_font.py

341e255

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

owurman changed the title ~~Handle Sequence as an IndirectObject when extracting text with layout mode~~ BUG: Handle Sequence as an IndirectObject when extracting text with layout mode Aug 3, 2024

owurman requested a review from stefan6419846 August 3, 2024 18:02

owurman marked this pull request as ready for review August 3, 2024 18:02

pubpub-zz reviewed Aug 4, 2024

View reviewed changes

pypdf/_text_extraction/_layout_mode/_font.py Outdated Show resolved Hide resolved

pypdf/_text_extraction/_layout_mode/_font.py Show resolved Hide resolved

pypdf/_text_extraction/_layout_mode/_font.py Outdated Show resolved Hide resolved

owurman added 2 commits August 4, 2024 18:04

Address PR comments

b3f39cc

-Rename w_1 to w_next_entry -Utilize ParseError instead of PdfReadError -Write a test (both positive and negative)

Handle unlikely case of IndirectObjects for float/int width elements

87dcc74

Also adds a comment to clarify that we don't explicitly handle the IndexError exception. Rather, we let it be raised as an IndexError.

owurman requested a review from pubpub-zz August 5, 2024 01:18

Yoda condition I removed

9cd998c

Last commit was a bad patch, confused by non-committed changes

f593e7b

Use test files from URL rather than resources

0731c04

pubpub-zz reviewed Aug 5, 2024

View reviewed changes

resources/2788_example.pdf Outdated Show resolved Hide resolved

resources/2788_example_malformed.pdf Outdated Show resolved Hide resolved

pypdf/_text_extraction/_layout_mode/_font.py Show resolved Hide resolved

tests/test_text_extraction.py Outdated Show resolved Hide resolved

owurman and others added 3 commits August 5, 2024 07:17

Update tests/test_text_extraction.py

952c4a3

Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>

Fix code style warnings in range() call

2076657

Merge branch 'main' into handle_font_indirect_sequences

6892666

owurman requested a review from pubpub-zz August 5, 2024 16:46

pubpub-zz approved these changes Aug 5, 2024

View reviewed changes

pubpub-zz merged commit b2d7204 into py-pdf:main Aug 5, 2024
17 checks passed

owurman deleted the handle_font_indirect_sequences branch August 5, 2024 19:17

owurman restored the handle_font_indirect_sequences branch August 5, 2024 19:26

pubpub-zz mentioned this pull request Sep 15, 2024

REL: 5.0.0 #2851

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Handle Sequence as an IndirectObject when extracting text with layout mode #2788

BUG: Handle Sequence as an IndirectObject when extracting text with layout mode #2788

owurman commented Aug 3, 2024

owurman commented Aug 3, 2024

codecov bot commented Aug 3, 2024 •

edited

Loading

stefan6419846 commented Aug 3, 2024

owurman commented Aug 3, 2024

pubpub-zz commented Aug 3, 2024

owurman commented Aug 3, 2024 •

edited

Loading

pubpub-zz commented Aug 3, 2024

owurman commented Aug 3, 2024

owurman commented Aug 5, 2024 •

edited

Loading

stefan6419846 commented Aug 5, 2024

stefan6419846 commented Aug 5, 2024

owurman commented Aug 5, 2024

owurman commented Aug 5, 2024 •

edited

Loading

owurman commented Aug 5, 2024 •

edited

Loading

owurman commented Aug 5, 2024

BUG: Handle Sequence as an IndirectObject when extracting text with layout mode #2788

BUG: Handle Sequence as an IndirectObject when extracting text with layout mode #2788

Conversation

owurman commented Aug 3, 2024

owurman commented Aug 3, 2024

codecov bot commented Aug 3, 2024 • edited Loading

Codecov Report

stefan6419846 commented Aug 3, 2024

owurman commented Aug 3, 2024

pubpub-zz commented Aug 3, 2024

owurman commented Aug 3, 2024 • edited Loading

pubpub-zz commented Aug 3, 2024

owurman commented Aug 3, 2024

owurman commented Aug 5, 2024 • edited Loading

stefan6419846 commented Aug 5, 2024

stefan6419846 commented Aug 5, 2024

owurman commented Aug 5, 2024

owurman commented Aug 5, 2024 • edited Loading

UPDATE: This was due to a stale pdf_cache file.

owurman commented Aug 5, 2024 • edited Loading

owurman commented Aug 5, 2024

codecov bot commented Aug 3, 2024 •

edited

Loading

owurman commented Aug 3, 2024 •

edited

Loading

owurman commented Aug 5, 2024 •

edited

Loading

owurman commented Aug 5, 2024 •

edited

Loading

owurman commented Aug 5, 2024 •

edited

Loading