Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Saving Duplicate Images #359

Closed
mlh758 opened this issue Mar 19, 2019 · 2 comments
Closed

Saving Duplicate Images #359

mlh758 opened this issue Mar 19, 2019 · 2 comments

Comments

@mlh758
Copy link
Contributor

mlh758 commented Mar 19, 2019

Description

When adding an image, it would be useful to detect if the image is a duplicate and instead of adding the image again, return a reference to the existing media.

We have a header that gets added in some large workbooks to each sheet, and this substantially increases the file size. If we open the file in Excel and save it again, Excel detects the duplicate images and consolidates them on save, reducing the file size.

I think it would be possible to change addMedia in picture.go to handle this behavior. AddPictureFromBytes would change slightly as well:

  1. addMedia steps through all the saved media and looks for byte slices that have the same length as the file we are trying to save. Compute the hash for both (we can obviously re-use the hash of the file being saved for future comparisons)
  2. If the hash matches, return the media path for the existing image. If no matches are found, save the new media and return its path
  3. AddPictureFromBytes calls addMedia before calling addDrawingRelationships and uses the media path provided by addMedia in the call to addDrawingRelationships.

Since checking the length of a slice is a constant time operation, and very few slices should have the exact same length without actually being the same media there shouldn't be much of a performance impact from the hashing. However, it would be worth adding a benchmark to ensure this doesn't cause a regression for #274. It would also be useful to incorporate a benchmark for actually saving the xlsx file since it's likely that the performance impact of this check would be offset by not having to write as many files.

I will wait for feedback on this one since while I think it may be useful but I understand not wanting to risk the performance impact.

@xuri
Copy link
Member

xuri commented Mar 20, 2019

Thanks for the insight @mlh758. That's a useful feature. I'll certainly accept that patch if somebody did that. we can store the hash of the image in the File object for reducing the impact on performance.

@mlh758
Copy link
Contributor Author

mlh758 commented Mar 20, 2019

A hash may actually be overkill, the bytes package has has an Equal function that already handles equality and would probably be more efficient.

@xuri xuri closed this as completed in a94dcb9 Mar 26, 2019
xuri added a commit that referenced this issue Mar 26, 2019
nullfy pushed a commit to nullfy/excelize that referenced this issue Oct 23, 2020
Adding the same image should create a drawing referencing the
already stored copy of the image.

Closes qax-os#359
nullfy pushed a commit to nullfy/excelize that referenced this issue Oct 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants