Make image metadata available to the processors. #625

Closed
sbaechler opened this issue Dec 4, 2015 · 16 comments

Comments

@sbaechler
Contributor

Image metadata (not just Exif, but also IPTC and XMP) can contain information that Thumbor could use.

As far as I can see, PIL can only read Exif data. This would mean a CLI tool such as exiftool would have to be used. Exif and IPTC are key-value based. XMP has a tree structure.

Possible use cases would be:

  1. The photographer could set the focal point of an image and store it in the Exif data, so that whoever uploads the image into the CMS would not have to define it.
  2. Some images should not be cropped because of copyright issues. If a metadata entry existed that prevents an image from being cropped, then it would just have padding added instead.
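
The two use cases above could be sketched roughly as follows. This is only an illustration: the metadata keys `no_crop` and `focus_point` and the function `plan_resize` are hypothetical names, not real Exif/IPTC/XMP tags and not part of Thumbor's API.

```python
def plan_resize(metadata, default_focus=(0.5, 0.5)):
    """Decide how to fit an image based on (hypothetical) metadata entries.

    Returns a (mode, focus) tuple: mode is "pad" when cropping is forbidden,
    otherwise "crop" together with the relative focal point to crop around.
    """
    # Hypothetical flag: the rights holder does not allow cropping.
    if metadata.get("no_crop"):
        return ("pad", None)
    # Hypothetical entry: focal point stored by the photographer,
    # expressed as relative (x, y) coordinates in the range 0..1.
    focus = metadata.get("focus_point", default_focus)
    return ("crop", focus)
```

With this sketch, an image without any of these entries would simply be cropped around its center, as it would be without metadata support.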

You could assign the issue to me. I'll try to implement a solution within the next few weeks.

@masom
Contributor

masom commented Dec 4, 2015

There was a discussion about creating a Media object to hold the image buffer and metadata in #577.

Adding IPTC and XMP metadata would make a lot of sense.

@heynemann
Member

+1

@sbaechler
Contributor Author

This turned out to be more difficult than I thought.

There is a Python XMP Toolkit that can read and write XMP metadata to and from an image file. Unfortunately, it only works with image files, not image buffers. The limitation seems to come from the underlying C library, Exempi.

For now, the only way to extract XMP metadata from an image buffer is to do it by hand. I have created a prototype implementation for JPEG images and the PIL engine here: https://github.com/sbaechler/thumbor/commit/3cb53677870029ca09045f0ebfb03693a08899ee
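
To illustrate what extracting XMP "by hand" from a JPEG buffer can look like, here is a minimal sketch. It is not the code from the linked commit. It assumes the XMP packet sits in a single APP1 segment that starts with the standard `http://ns.adobe.com/xap/1.0/` namespace header, and it ignores extended XMP and other edge cases. The function name `extract_xmp_packet` is made up for this example.

```python
import struct

# Standard identifier that prefixes the XMP packet inside a JPEG APP1 segment.
XMP_HEADER = b"http://ns.adobe.com/xap/1.0/\x00"


def extract_xmp_packet(buffer):
    """Return the raw XMP packet bytes from a JPEG buffer, or None if absent."""
    if buffer[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        return None
    pos = 2
    while pos + 4 <= len(buffer):
        if buffer[pos] != 0xFF:
            return None  # not at a marker; give up instead of guessing
        marker = buffer[pos + 1]
        if marker in (0xDA, 0xD9):  # SOS or EOI: no more metadata segments
            return None
        # The 2-byte big-endian length includes the length field itself.
        (length,) = struct.unpack(">H", buffer[pos + 2:pos + 4])
        payload = buffer[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(XMP_HEADER):
            return payload[len(XMP_HEADER):]
        pos += 2 + length
    return None
```

The returned bytes are the XML packet itself, which could then be handed to an XMP parser.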

The XMP data can then be further processed by the XMP toolkit.

My ultimate goal is to be able to write XMP data as well. If we used the Media object and the FileStorage, it would be possible to pass the file path along to a filter that would modify the metadata.

@gi11es
Contributor

gi11es commented Jan 13, 2016

If that framework only deals with files, why not dump the buffer into a temp file?
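
A minimal sketch of the temp-file approach suggested here, using only the Python standard library. The helper name `run_on_temp_file` and the `callback` parameter are assumptions for illustration; the callback stands in for any file-based metadata tool.

```python
import os
import tempfile


def run_on_temp_file(buffer, callback, suffix=".jpg"):
    """Write buffer to a temporary file, call callback(path), then clean up."""
    # delete=False so the file can be reopened by name on every platform;
    # it is removed explicitly in the finally block below.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(buffer)
        path = tmp.name
    try:
        return callback(path)
    finally:
        os.remove(path)
```

The cost of this approach is one extra write and read per image.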

@sbaechler
Contributor Author

@gi11es This would add a dependency on the file system. I found another Python XMP framework, Pyexiv2, which is based on a different underlying library. I'll give that one a try first.

@masom
Contributor

masom commented Jan 14, 2016

@sbaechler the current filters and optimizers receive a tmp file buffer, already on the filesystem.

@sbaechler
Contributor Author

@masom Maybe that temp file won't be available once the Media object is used.
I tried another version that uses Pyexiv2 to read the image metadata. This library works well with image buffers, but it is a bit tricky to install because it needs some C libraries that have to be installed first. It is also deprecated in favor of another library called GExiv2 (a GNOME library), which I did not manage to get working on my system. Since both are based on the same underlying library (Exiv2), it should be easy to swap them once the maintainers fix the installation issues. It may even be possible to support both libraries if they have the same interface.

The best thing about Pyexiv2 is that it supports not only XMP but also IPTC and Exif. It can also convert all data into Python objects.

https://github.com/sbaechler/thumbor/commit/060f0e6ed70e3fbbe67b2339f63b0a2e4f47bb44

I added the code to the Engine class. Maybe it would be better to keep it in a separate module. That way, any filter that needs to access the metadata would have to instantiate the Metadata class, but the code itself would be independent of the engine (or the Media object).

@masom
Contributor

masom commented Jan 22, 2016

A lot of tools don't work well with STDIN / STDOUT.

ffmpeg for instance will stutter and produce weird output if the input file is STDIN.

@gi11es
Contributor

gi11es commented Jan 22, 2016

That's because by default ffmpeg is interactive while it runs. You can press a key to abort its processing. There's an ffmpeg option to turn that off, though.

@gi11es
Contributor

gi11es commented Jan 22, 2016

Actually maybe not an option, but I remember that there's a way to work around that problem.

@sbaechler
Contributor Author

@masom Pyexiv2 doesn't use STDIN/STDOUT. It's a Python library. The biggest issue is that it requires the boost and exiv2 C++ libraries, which have to be built with Python bindings. Still, this is the easiest and most comfortable way of getting the image metadata that I have found so far.

@heynemann
Member

I don't think hard dependencies are an issue as long as this is a plugin and not built-in.

@sbaechler
Contributor Author

The metadata can only be extracted from the raw buffer, not from engine.image, because PIL strips all metadata when creating an Image object. Therefore, if the metadata is to be available to the app, it has to be extracted in the Engine.load() or Engine.create_image() method, both of which are in core.

The metadata extraction is only done if the library is installed.
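
The "only if the library is installed" behavior can be sketched as an optional import plus a guarded extraction step at load time. This is an illustration under stated assumptions, not the code that was merged: `load_optional`, `SketchEngine` and `metadata_reader` are hypothetical names, and `SketchEngine` is a stand-in rather than Thumbor's actual Engine class.

```python
import importlib


def load_optional(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


class SketchEngine:
    """Hypothetical stand-in for an engine; not Thumbor's Engine class."""

    def __init__(self, metadata_reader=None):
        # metadata_reader: a callable taking the raw buffer and returning
        # parsed metadata, or None when no metadata library is available.
        self.metadata_reader = metadata_reader
        self.buffer = None
        self.metadata = None

    def load(self, buffer):
        self.buffer = buffer
        # Extract from the raw buffer here, before the imaging library
        # builds its own image object and the metadata is lost.
        if self.metadata_reader is not None:
            self.metadata = self.metadata_reader(buffer)
```

In such a design, the core would carry no hard dependency: when the optional library is missing, `metadata` simply stays `None`.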

Another option is to extract the metadata from the temp file in the filters. The downside of this is that it creates additional I/O.

@sbaechler
Contributor Author

Fixed in #661

@jimas14

jimas14 commented Feb 23, 2021

@sbaechler Does this fix only provide metadata to the engine? Is there still more work to do to have IPTC data persist on output images?

@sbaechler
Contributor Author

@jimas14 Back then there was an effort (#577) to create a rich Media object that gets passed through the pipeline, instead of using a global context and just passing the buffer around. I don't know the current state of that effort, but it would simplify adding metadata back to the transformed image.
