ENH: Add "layout" mode for text extraction #2388

Merged (41 commits) on Jan 11, 2024


shartzog
Contributor

@shartzog shartzog commented Jan 3, 2024

PageObject.extract_text gets a new extraction_mode parameter. The existing extraction behavior is now called "plain"; it aims at producing text that is useful for NLP or a Text-to-Speech (TTS) system.

The new extraction_mode="layout" aims at reproducing the visual layout of the PDF page. This is useful for detecting and extracting tables.
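A minimal sketch of how the two modes might be used side by side. The `extract_text(extraction_mode="layout")` call is the one described above; the helper names (`extract_both_modes`, `split_columns`, `demo`) and the "two or more spaces mean a column gap" heuristic are assumptions made for illustration and are not part of pypdf.

```python
import re
from typing import Any, Dict, List


def extract_both_modes(page: Any) -> Dict[str, str]:
    """Extract the text of one page in both modes.

    `page` is expected to be a pypdf PageObject (pypdf >= 4.0.0).
    """
    return {
        "plain": page.extract_text(),  # default mode is "plain"
        "layout": page.extract_text(extraction_mode="layout"),
    }


def split_columns(layout_line: str) -> List[str]:
    """Hypothetical heuristic: treat runs of 2+ spaces as column gaps."""
    return [cell for cell in re.split(r" {2,}", layout_line.strip()) if cell]


def demo(pdf_path: str) -> None:
    """Print a rough row/column view of the first page of `pdf_path`."""
    from pypdf import PdfReader  # imported lazily; needs pypdf installed

    page = PdfReader(pdf_path).pages[0]
    for line in extract_both_modes(page)["layout"].splitlines():
        if line.strip():
            print(split_columns(line))
```

Calling `demo("example.pdf")` (a placeholder file name) would print one list of cells per non-empty line. Real tables usually need a more careful heuristic than a fixed two-space gap.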

For Reviewers

  • add _text_extraction/_layout_mode subpackage (initial version)
  • expose subpackage functionality via new private PageObject methods _layout_mode_fonts() and _layout_mode_text()
  • add "extraction_mode" parameter and layout_mode kwargs to existing PageObject.extract_text() method for experimental usage
  • A unit test was added
  • User documentation was added.


codecov bot commented Jan 3, 2024

Codecov Report

Attention: 16 lines in your changes are missing coverage. Please review.

Comparison is base (cfd8712) 94.34% compared to head (f43201a) 94.42%.

| Files | Patch % | Lines |
| --- | --- | --- |
| ..._text_extraction/_layout_mode/_fixed_width_page.py | 97.04% | 3 Missing and 2 partials ⚠️ |
| ...text_extraction/_layout_mode/_text_state_params.py | 92.30% | 2 Missing and 2 partials ⚠️ |
| pypdf/_page.py | 90.00% | 1 Missing and 2 partials ⚠️ |
| pypdf/_text_extraction/_layout_mode/_font.py | 95.65% | 1 Missing and 1 partial ⚠️ |
| ...ext_extraction/_layout_mode/_text_state_manager.py | 97.61% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2388      +/-   ##
==========================================
+ Coverage   94.34%   94.42%   +0.07%     
==========================================
  Files          43       49       +6     
  Lines        7569     7961     +392     
  Branches     1520     1608      +88     
==========================================
+ Hits         7141     7517     +376     
- Misses        265      274       +9     
- Partials      163      170       +7     

@MartinThoma MartinThoma added this to the pypdf==4.0.0 milestone Jan 3, 2024
@MartinThoma MartinThoma changed the title ENH: text extraction "layout" mode ENH: Add "layout" mode for text extraction Jan 3, 2024
shartzog and others added 4 commits January 3, 2024 22:13
- DOC: standardize language. use "layout", not "structure/structural".
- BUG: address bug introduced by ruff refactoring (remove "TYPE_CHECKING" block for Literal import)
- DEV: use sys.version_info based import switch (not try/except) for Literal and TypedDict to correct vscode colors and prevent odd mypy errors
- TST: add test created by @MartinThoma in py-pdf#2390
- ENH: add remaining standard fonts and aliases
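
The version-based import switch mentioned in the DEV item above might look roughly like the following. This is a sketch under assumptions: the `typing_extensions` fallback, the `ExtractionMode` alias, and the `FontInfo` dictionary are illustrative and not taken from the PR itself.

```python
import sys

# Branch on the interpreter version instead of using try/except ImportError,
# so that editors and mypy can resolve the import statically.
if sys.version_info >= (3, 8):
    from typing import Literal, TypedDict
else:
    from typing_extensions import Literal, TypedDict

# Hypothetical alias for the two extraction modes discussed in this PR.
ExtractionMode = Literal["plain", "layout"]


class FontInfo(TypedDict):
    """Hypothetical typed dictionary, used only to exercise TypedDict."""

    name: str
    size: float


def describe(mode: ExtractionMode, font: FontInfo) -> str:
    """Format a short description of a mode/font pair."""
    return f"{mode}: {font['name']} @ {font['size']}"
```

Type checkers understand `sys.version_info` comparisons, which is presumably why this form avoids the odd mypy errors the commit note mentions.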
MartinThoma
MartinThoma previously approved these changes Jan 4, 2024
Member

@MartinThoma MartinThoma left a comment


This PR introduces a layout-mode type of text extraction.

The only public API change is adding the extraction_mode: Literal["plain", "layout"] = "plain" parameter to extract_text.

It has a workflow-test and user documentation.

While there are still quite a few lines not covered by any test, I think the PR adds massive value for many users. The feature works reasonably well. We can cover the remaining lines later.

@MartinThoma
Member

@shartzog From my side the PR looks good. Is there anything you would want to change before it gets merged?

I'd also ping pubpub for a review before I merge. I guess he is currently on vacation, so it might take longer than usual. I just want to prepare all of the other stuff so that pubpub can focus on the more complex PDF internals (he is far more knowledgeable than I am when it comes to PDF, especially text extraction).

@MartinThoma
Member

I might add a few more PDFs for testing so that we increase the line coverage :-) Is it fine with you if I update this PR (merging changes from main) so that I can see which lines are covered when I add more PDFs for testing?

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
@MartinThoma
Member

@stefan6419846 Thanks for the thorough review! I didn't know that one can read directly from pathlib objects - nice!

I think I addressed all comments within the PR :-)
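
The remark about reading directly from pathlib objects presumably refers to `Path.read_text()`, which opens, reads, and closes a file in one call. A small sketch, with a temporary file and a hypothetical `read_expected` helper standing in for whatever the test suite actually uses:

```python
import tempfile
from pathlib import Path


def read_expected(path: Path) -> str:
    """Read a text file without an explicit open()/close() pair."""
    return path.read_text(encoding="utf-8")


# Create a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    expected_file = Path(tmp_dir) / "expected_layout.txt"  # hypothetical name
    expected_file.write_text("Name   Qty\n", encoding="utf-8")
    content = read_expected(expected_file)
```

In a test, `content` could then be compared with the result of `extract_text(extraction_mode="layout")`.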

MartinThoma
MartinThoma previously approved these changes Jan 9, 2024
Collaborator

@pubpub-zz pubpub-zz left a comment


I think this PR could be merged.
In my view, this work is in good enough shape to be released so that users can start beta testing it.
It would be good to add a note of caution to the release notes so that people help to improve the feature rather than complain about it.

@stefan6419846
Collaborator

I didn't know that one can read directly from pathlib objects - nice!

No worries - there is enough functionality in the stdlib I have never heard of as well ;)

Collaborator

@stefan6419846 stefan6419846 left a comment


Thanks for your patience. LGTM for the first official version.

@MartinThoma MartinThoma merged commit fc893d5 into py-pdf:main Jan 11, 2024
16 checks passed
@MartinThoma
Member

@shartzog Congratulations on your first merged PR! That's quite an impressive first contribution 👏 👏 👏

If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

@MartinThoma
Member

This feature will be part of the pypdf==4.0.0 release. I would like to release in January, but it's more important to me to finish everything we have planned than to be "on time" with the release. You can see the plan here: https://github.com/py-pdf/pypdf/milestone/5

@MartinThoma
Member

@stefan6419846 @pubpub-zz Thank you both for the many valuable comments 🙏 Your review helps to keep the code quality high and to avoid too many changes / follow-up fixes. I appreciate that 🤗

MartinThoma added a commit that referenced this pull request Jan 19, 2024
## What's new

pypdf==4.0.0 is a big milestone forward:

* We finally have a layout-mode text extraction.
  This enables users who want to detect / extract tables
  with heuristics to give it a try.
* We deprecated a lot of the old PyPDF2 API that was either
  not following PEP8 naming styles or was not using a
  property. Users coming from PyPDF2 might want to switch
  first to pypdf<4.0.0 to get helpful error messages
  that show the new API in their specific cases.

A big 'Thank you!' to the whole pypdf community for your
work. Thanks to you, pypdf is better than ever.

Kudos to @shartzog who added the layout-mode with his first
contribution!

### Deprecations (DEP)
-  Drop Python 3.6 support (#2369) by @MartinThoma
-  Remove deprecated code (#2367) by @MartinThoma
-  Remove deprecated XMP properties (#2386) by @stefan6419846

### New Features (ENH)
-  Add "layout" mode for text extraction (#2388) by @shartzog
-  Add Jupyter Notebook integration for PdfReader (#2375) by @MartinThoma
-  Improve/rewrite PDF permission retrieval (#2400) by @stefan6419846

### Bug Fixes (BUG)
-  PdfWriter.add_uri was setting the wrong type (#2406) by @pmiller66
-  Add support for GBK2K cmaps (#2385) by @stefan6419846

### Documentation (DOC)
-  Add pmiller66 for #2406 as a contributor by @MartinThoma
-  Add missing expand parameter (#2393) by @Atomnp
-  Resolve build warnings (#2380) by @stefan6419846
-  Fix testing prerequisites (#2381) by @stefan6419846
-  Improve formatting of contributors page (#2383) by @stefan6419846
-  Add Tobeabellwether as a contributor for #2341 by @MartinThoma

### Developer Experience (DEV)
-  Make dependabot aware of our PR prefixes (#2415) by @stefan6419846
-  Fail on Sphinx issues (#2405) by @stefan6419846
-  Move title check to own workflow (#2384) by @MasterOdin
-  Write to temporary files instead of the working directory (#2379) by @stefan6419846
-  Ensure that the PR titles have the correct format (#2378) by @stefan6419846

### Maintenance (MAINT)
-  Complete FileSpecificationDictionaryEntries constants (#2416) by @MartinThoma
-  Return None instead of -1 when page is not attached (#2376) by @MartinThoma
-  Replace warning with logging.error (#2377) by @MartinThoma

### Testing (TST)
-  Add missing pytest.mark.samples annotations (#2412) by @kitterma
-  Correctly close temporary files (#2396) by @stefan6419846
-  Fix side effect #2379 (#2395) by @pubpub-zz
-  Add test for layout extraction mode (#2390) by @MartinThoma

### Code Style (STY)
-  Use the UserAccessPermissions enum (#2398) by @MartinThoma
-  Run black (#2370) by @MartinThoma

[Full Changelog](3.17.4...4.0.0)
@shartzog
Contributor Author

@shartzog Congratulations on your first merged PR! That's quite an impressive first contribution 👏 👏 👏

If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

Thanks, @MartinThoma. I'm just getting caught up after a much-needed vacation, so sorry for the late response! I'd definitely appreciate the nod on the CONTRIBUTORS page, but I see 4.0.0 is released (🎉🥳), so no worries either way.

@shartzog
Contributor Author

@stefan6419846 @pubpub-zz Thank you both for the many valuable comments 🙏 Your review helps to keep the code quality high and to avoid too many changes / follow-up fixes. I appreciate that 🤗

Ditto! Thanks so much for the thorough reviews, @stefan6419846 and @pubpub-zz!

Labels
is-feature A feature request