Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add parameter to select images to be removed #2214

Merged
merged 8 commits into from
Oct 29, 2023

Conversation

pubpub-zz
Copy link
Collaborator

closes #2208

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma
can you have a look or rerun the action. I do not understand the problem (if any)

MartinThoma pushed a commit that referenced this pull request Sep 26, 2023
INLINE_IMAGES = auto()
DRAWING_IMAGES = auto()
ALL = XOBJECT_IMAGES | INLINE_IMAGES | DRAWING_IMAGES
IMAGES = ALL # for consistancy with ObjectDeletionFlag
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo consistency

@codecov
Copy link

codecov bot commented Sep 26, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (56e191d) 94.44% compared to head (3103748) 94.46%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2214      +/-   ##
==========================================
+ Coverage   94.44%   94.46%   +0.01%     
==========================================
  Files          43       43              
  Lines        7638     7641       +3     
  Branches     1511     1509       -2     
==========================================
+ Hits         7214     7218       +4     
  Misses        262      262              
+ Partials      162      161       -1     
Files Coverage Δ
pypdf/__init__.py 88.23% <100.00%> (+0.73%) ⬆️
pypdf/_writer.py 88.37% <100.00%> (+0.03%) ⬆️
pypdf/constants.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma
All yours

@MartinThoma
Copy link
Member

I've just noticed that you now went with the flag-based design. Also, the code looks more complex than I expected.

I'll try to find an index-based approach + a simple way for users to get image data (including this flag) so that they have an easy time to decide which images (by index) they want to remove.

The reasons why I prefer the index-based approach are:

  1. Internal Consistency: We can use this approach for many other cases, especially annotations. But maybe also other objects
  2. External Recognizability: Deleting by index (or by lists of indices) is nothing completely new. It can be done in numpy, which is well-known by many Python devs.
  3. Simplicity: We don't have to think about which parameters to add to this function / in which order to add / how to call them. We can just extend the function that gives meta-data about images (which is good for other use-cases than deletion)

@MartinThoma
Copy link
Member

MartinThoma commented Sep 27, 2023

I'm sorry that I communicated that plan / idea poorly :-/

@MartinThoma
Copy link
Member

We could even implement __delitem__(self, key) for _VirtualListImages so that we could do:

del reader.pages[4].images[2]  # delete image at index 2 on page with index 4
del reader.pages[4].images[2:4]  # delete image at index 2 and 3 on page with index 4
del reader.pages[4].images[[3, 5, 9]]  # delete image at index 3, 5, and 9 on page with index 4

so keys can either be an integer, a range-object or an iterable.

@MartinThoma
Copy link
Member

I would suggest to do this in at least two PRs:

  1. Extend ImageFile: Add the ImageType enum (no need to make it a flag as one image is always exactly one type, right?) to the ImageFile so that the images function exposes the type.
  2. Refactor _VirtualListImages: I think we might need to store some additional information to be able to reference images by index.
  3. Add deletion: Either via __delitem__ to the images property or by adding an indices: Union[int, Iterable[int]] parameter to remove_images

@pubpub-zz
Copy link
Collaborator Author

pubpub-zz commented Sep 27, 2023

I've just noticed that you now went with the flag-based design. Also, the code looks more complex than I expected.

I'll try to find an index-based approach + a simple way for users to get image data (including this flag) so that they have an easy time to decide which images (by index) they want to remove.

The reasons why I prefer the index-based approach are:

1. Internal Consistency: We can use this approach for many other cases, especially annotations. But maybe also other objects

2. External Recognizability: Deleting by index (or by lists of indices) is nothing completely new. It can be done in numpy, which is well-known by many Python devs.

3. Simplicity: We don't have to think about which parameters to add to this function / in which order to add / how to call them. We can just extend the function that gives meta-data about images (which is good for other use-cases than deletion)

The current solution allows to delete drawings that the images/index will not be able to do it. the solution also addresses to delete all images (on all pages) at once which will be more complex for users to do
What you are proposing is a new feature that will complete this PR

We could even implement __delitem__(self, key) for _VirtualListImages so that we could do:

del reader.pages[4].images[2]  # delete image at index 2 on page with index 4
del reader.pages[4].images[2:4]  # delete image at index 2 and 3 on page with index 4
del reader.pages[4].images[[3, 5, 9]]  # delete image at index 3, 5, and 9 on page with index 4

so keys can either be an integer, a range-object or an iterable.

The index is not for me the only way to access data : we will have also to consider strings as indexes
also consider images withing XForm
(from the help)

   Examples:
       reader.pages[0].images[0]        # return fist image
       reader.pages[0].images['/I0']    # return image '/I0'
       reader.pages[0].images['/TP1','/Image1']            # return image '/Image1' within '/TP1' Xobject/Form

@pubpub-zz
Copy link
Collaborator Author

pubpub-zz commented Sep 27, 2023

I would suggest to do this in at least two PRs:

1. **Extend ImageFile**: Add the `ImageType` enum (no need to make it a flag as one image is always exactly one type, right?) to the `ImageFile` so that the `images` function exposes the type.

I do not see what you mean by "exposes the type". If you want to distinguish between inline and xobjects it is already possible through the file name

2. **Refactor _VirtualListImages**: I think we might need to store some additional information to be able to reference images by index.

except adding __delitem__() I see no need to change anything else.

3. **Add deletion**: Either via `__delitem__` to the `images` property or by adding an `indices: Union[int, Iterable[int]]` parameter to `remove_images`

this is more complex : str can be an index and images within forms have to be addressed

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma
any answers/opinions on my responses ?

@MartinThoma
Copy link
Member

@pubpub-zz Yes, I've seen the comment. I was just trying to write the feature as I imagine it, but that takes quite some time. I think I first need to refactor remove_objects_from_page / _remove_annots_from_page .

str can be an index

What do you mean by that?

@pubpub-zz
Copy link
Collaborator Author

str can be an index

What do you mean by that?

the images virtual table can be either get data either by numbers (position in the image list) but also by str (name of the object)
also there is the cases where the str, but it can be also tuples to access images within forms.

There is also my comment about the drawings

pypdf/constants.py Outdated Show resolved Hide resolved
@MartinThoma
Copy link
Member

@pubpub-zz Please give me another week to think about how to proceed here. I fixed the merge conflicts so that I have an easier time merging.

Currently, I tend to the following:

  1. Your PR fixes a user problem (ENH: Remove only raster graphics #2208) and hence is valuable
  2. My idea might be cleaner, but super hard to get done. So it will likely not be feasable any time soon.

For those two reasons, I tend to merge it + make the clean solution whenever I have time (which could also be never)

The main point I still want to check is if this adds too much maintenance complexity.

@MartinThoma MartinThoma added the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Oct 29, 2023
@MartinThoma MartinThoma merged commit 9afda0a into py-pdf:main Oct 29, 2023
13 of 14 checks passed
@MartinThoma MartinThoma removed the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Oct 29, 2023
@MartinThoma
Copy link
Member

Sorry that it took me so long to merge this 🙈 And thank you for your patience with me / for solving this issue 🤗

MartinThoma added a commit that referenced this pull request Oct 29, 2023
## What's new

### Security (SEC)
-  Infinite recursion when using PdfWriter(clone_from=reader) (#2264) by @Alexhuszagh

### New Features (ENH)
-  Add parameter to select images to be removed (#2214) by @pubpub-zz

### Bug Fixes (BUG)
-  Correctly handle image mode 1 with FlateDecode (#2249) by @stefan6419846
-  Error when filling a value with parentheses #2268 (#2269) by @KanorUbu
-  Handle empty root outline (#2239) by @pubpub-zz

### Documentation (DOC)
-  Improve merging docs (#2247) by @stefan6419846

### Developer Experience (DEV)
-  Test Python 3.7 with cryptopgraphy provider as well (#2276) by @stefan6419846
-  Run CI with windows-latest (#2258) by @MartinThoma
-  Use pytest-xdist (#2254) by @MartinThoma
-  Attribute correct authors in the release notes (#2246) by @stefan6419846

### Maintenance (MAINT)
-  Apply pre-commit hooks (#2277) by @MartinThoma
-  Update requirements + mypy fixes (#2275) by @MartinThoma
-  Explicitly provide Any for IO generic argument (#2272) by @nilehmann

### Testing (TST)
-  Fix test_image_without_pillow in windows environment (#2257) by @pubpub-zz

### Code Style (STY)
-  Remove unused import by @MartinThoma

[Full Changelog](3.16.4...3.17.0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Remove only raster graphics
4 participants