ENH: Add PageObject.images attribute #1330
Merged
20 commits
5d1d71c ENH: Add PageObject.images attribute (MartinThoma)
55669e3 Add docs (MartinThoma)
8cb5c98 Add docs (MartinThoma)
85a67e5 fix flake8 (MartinThoma)
4aed7ee Add mime types (MartinThoma)
82e4796 Add more mime types (MartinThoma)
ad19cc3 Fix image extraction (MartinThoma)
bb7185b Update workflows (MartinThoma)
fe7a965 Flake8 fix (MartinThoma)
c18d2a6 Update (MartinThoma)
4d8ac66 mime type (MartinThoma)
53bd98c Fix (MartinThoma)
b0cf635 mypy (MartinThoma)
f1e7b84 fix imports (MartinThoma)
80479ef Merge branch 'main' into image-extraction (MartinThoma)
30fbf41 Rename 'file_extension' to 'format' (MartinThoma)
4b77a6a Format rename (MartinThoma)
c491881 Remove mime type (MartinThoma)
44efe78 Add docstring mentioning inline images (MartinThoma)
50447af Fix test (MartinThoma)
@@ -0,0 +1,18 @@
# Extract Images

Every page of a PDF document can contain an arbitrary amount of images.
The names of the files may not be unique.

```python
from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")

page = reader.pages[0]
count = 0

for image_file_object in page.images:
    with open(str(count) + image_file_object.name, "wb") as fp:
        fp.write(image_file_object.data)
    count += 1
```
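
The documentation example above prefixes a counter because image names may repeat on a page. The same idea can be sketched as a small helper. This is only an illustration: `save_images` and `ImageStub` are hypothetical names that are not part of this PR, and the helper assumes nothing beyond the `name` and `data` attributes used in the example.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List


@dataclass
class ImageStub:
    """Hypothetical stand-in exposing the `name` and `data` attributes used above."""

    name: str
    data: bytes


def save_images(images: Iterable, out_dir: Path) -> List[Path]:
    """Write each image into out_dir, prefixing an index so repeated names cannot collide."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, image in enumerate(images):
        # Keep only the final path component in case a name contains separators.
        safe_name = Path(image.name).name
        target = out_dir / f"{index}_{safe_name}"
        target.write_bytes(image.data)
        written.append(target)
    return written
```

With PyPDF2 installed, `save_images(reader.pages[0].images, Path("out"))` would be the corresponding call, assuming `page.images` behaves as in the example above.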
@@ -10,3 +10,4 @@ pytest-benchmark
pycryptodome
typeguard
types-Pillow
types-dataclasses
This strikes me as an odd abstraction, where we are passing in the `mime_type` as part of the `File` constructor, but we also need to construct the full filename, using a private static function to boot, but also that the `file_extension` method doesn't correspond to the extension of the passed in `name`, but rather `mime_type`.

If we go the route of passing in the `mime_type` for the File, I'd advocate for just passing in `name` sans extension altogether and we can have a special property function that does the concatenation of name + extension to give a "filename" on demand as needed by users.

The only caveat would be for attachments, it may make sense to pass in the full filename, but I'm not well versed on that part of the spec to even know how that API might look.
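
The suggestion above (a `name` without extension plus a property that builds the filename on demand) could look roughly like the following. This is a sketch under assumptions, not the code in the PR: the `filename` property and the `_EXTENSIONS` mapping are hypothetical, and the mapping covers only a few MIME types for illustration.

```python
from dataclasses import dataclass

# Illustrative subset only; not the mapping used in the PR.
_EXTENSIONS = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/tiff": ".tiff",
}


@dataclass
class File:
    name: str  # stem without extension, e.g. "Im0"
    data: bytes
    mime_type: str

    @property
    def file_extension(self) -> str:
        """Extension derived from mime_type; empty string if the type is unknown."""
        return _EXTENSIONS.get(self.mime_type, "")

    @property
    def filename(self) -> str:
        """Concatenate name and extension on demand."""
        return self.name + self.file_extension
```

In this shape the extension always follows from `mime_type`, so `name` and `file_extension` cannot disagree.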
I think I'll go with 'File only has name + data (no mime_type)' for the moment, because it seems to have only advantages:

- `_xobj_to_image` just passes the file extension as before
- `_xobj_to_image` is a private function, so we can easily change the behavior if we see a clear advantage
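
For comparison, the "name + data only" option chosen above can be sketched as follows. This assumes a private converter that yields a file extension and the raw bytes. `build_image_file` is a hypothetical helper used for illustration and does not appear in the PR.

```python
from dataclasses import dataclass


@dataclass
class File:
    name: str  # full file name, extension included
    data: bytes


def build_image_file(stem: str, extension: str, data: bytes) -> File:
    """Hypothetical helper: join the stem with the extension the converter reported."""
    return File(name=stem + extension, data=data)
```

Here the extension is fixed when the object is created, so `File` needs no MIME type lookup.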