Umlaut in Title #358

Closed
dmenne opened this Issue Dec 27, 2018 · 3 comments

dmenne commented Dec 27, 2018

An umlaut in the Exif title written by Lightroom gives an error:

 File "d:\anaconda\lib\site-packages\sigal\image.py", line 250, in get_iptc_data
    iptc_data["title"] = raw_iptc[(2, 5)].decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 10: invalid start byte

I do not know whether the uploaded image below preserves the EXIF data, so a copy for direct download is at

https://menne-biomed.de/uni/umlautintitle.jpg


Owner

saimn commented Dec 28, 2018

It seems that your IPTC data is encoded as iso8859-1:

ipdb> pp raw_iptc                                                                       
{(2, 0): b'\x00\x02',
 (2, 5): b'Heinrichsh\xfctte',
 (2, 25): [b'09', b'2003', b'Bochum', b'Deutschland', b'Jahr', b'Land']}
ipdb> pp raw_iptc[(2,5)].decode('utf8')                                                 
*** UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 10: invalid start byte
ipdb> pp raw_iptc[(2,5)].decode('iso8859-1')                                            
'Heinrichshütte'

IPTC has a CodedCharacterSet tag that should give the encoding for the file (drewnoakes/metadata-extractor#12, https://stackoverflow.com/questions/15003031/how-to-properly-write-utf8-iptc-metadata-with-python-library-iptcinfo) but I cannot see this tag in your file.
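As a rough sketch of how the CodedCharacterSet tag could be consulted: in the IPTC IIM layout it is dataset 1:90, and the escape sequence `ESC % G` marks UTF-8. The helper name `guess_iptc_encoding` and the ISO 8859-1 default are assumptions made for illustration, not sigal or Pillow code; the dict mirrors the shape of `raw_iptc` printed above.

```python
# Hypothetical helper for illustration; not part of sigal or Pillow.
UTF8_ESCAPE = b"\x1b%G"  # ISO 2022 escape sequence that selects UTF-8


def guess_iptc_encoding(raw_iptc, default="iso8859-1"):
    """Pick a codec name from the CodedCharacterSet dataset (1:90), if present."""
    charset = raw_iptc.get((1, 90))
    if charset == UTF8_ESCAPE:
        return "utf-8"
    # Tag absent or not recognised: fall back to an assumed default.
    return default


# Same shape as the dict shown in the debugger session; note there is no (1, 90) key.
raw_iptc = {(2, 0): b"\x00\x02", (2, 5): b"Heinrichsh\xfctte"}
encoding = guess_iptc_encoding(raw_iptc)
title = raw_iptc[(2, 5)].decode(encoding)
```

With the tag missing, as in this file, the result depends entirely on the chosen default, so this only helps for files that actually declare their character set.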

I also discovered and tried iptcinfo3 which seems to handle this encoding tag, but is not able to decode the metadata in your file:

In [20]: info = IPTCInfo('umlautintitle.jpg')                                           
INFO:iptcinfo:File is JPEG, proceeding with JpegScan
DEBUG:iptcinfo:jpeg_next_marker: at marker E1 (225)
DEBUG:iptcinfo:JPEG variable length: 11296
DEBUG:iptcinfo:jpeg_next_marker: at marker E1 (225)
DEBUG:iptcinfo:JPEG variable length: 3429
DEBUG:iptcinfo:jpeg_next_marker: at marker ED (237)
DEBUG:iptcinfo:JPEG variable length: 144
DEBUG:iptcinfo:blindScan: starting scan, max length 142
DEBUG:iptcinfo:BlindScan: found IIM start at offset 26
DEBUG:iptcinfo:tag: 28	record: 2	dataset: 0	length: 2
DEBUG:iptcinfo:tag: 28	record: 2	dataset: 5	length: 14
DEBUG:iptcinfo:tag: 28	record: 2	dataset: 25	length: 2
DEBUG:iptcinfo:tag: 28	record: 2	dataset: 25	length: 4
DEBUG:iptcinfo:tag: 28	record: 2	dataset: 25	length: 6
DEBUG:iptcinfo:tag: 28	record: 2	dataset: 25	length: 11
DEBUG:iptcinfo:tag: 28	record: 2	dataset: 25	length: 4
DEBUG:iptcinfo:tag: 28	record: 2	dataset: 25	length: 4

In [21]: info._data                                                                     
Out[21]: 
{20: [],
 25: [b'09', b'2003', b'Bochum', b'Deutschland', b'Jahr', b'Land'],
 118: [],
 5: b'Heinrichsh\xfctte'}

So one easy way to fix the issue would be to decode the tags less strictly. We could also think about switching to iptcinfo3 if it does a better job than Pillow.
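One possible reading of "less strict" decoding is a fallback chain: try UTF-8 first and, if that fails, try a legacy codec. The function name `decode_lenient` and the choice of ISO 8859-1 as the second candidate are assumptions for this sketch, not what sigal does.

```python
# Hypothetical helper for illustration; not part of sigal.
def decode_lenient(value, encodings=("utf-8", "iso8859-1")):
    """Decode bytes with the first codec in `encodings` that succeeds."""
    for encoding in encodings:
        try:
            return value.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if every candidate failed (ISO 8859-1 itself never fails,
    # so this matters only when a caller passes a different list).
    return value.decode("utf-8", errors="replace")


old_title = decode_lenient(b"Heinrichsh\xfctte")  # legacy single-byte data
new_title = decode_lenient("Heinrichsh\u00fctte".encode("utf-8"))  # UTF-8 data
```

Because ISO 8859-1 maps every byte to a character, the fallback never raises, but it can silently produce wrong text when the data is in some other legacy encoding.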

saimn added a commit that referenced this issue Dec 28, 2018

Author

dmenne commented Dec 29, 2018

I agree, and I found the source of the problem: these fields were written by an old version of Lightroom; Lightroom switched to consistent UTF-8 only in a later version. After rewriting these fields with Lightroom 5, everything went smoothly.

I have also tried thumbsup, which handled both versions smoothly; as far as I can see, ExifTool does it the non-strict way.

Owner

saimn commented Dec 29, 2018

Not sure if it is safe to always assume UTF-8, as the encoding seems to depend on the filesystem encoding when writing the file. For now I added a change that replaces encoding errors to avoid a crash; we can revisit this later if better handling of IPTC encoding is needed.
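For reference, a minimal sketch of what replacing encoding errors looks like with Python's built-in `errors="replace"` handler. This is not the actual commit, only an illustration using the title bytes from this issue.

```python
# Title bytes from the reported file: ISO 8859-1 data, not valid UTF-8.
raw_title = b"Heinrichsh\xfctte"

# errors="replace" substitutes U+FFFD for each undecodable byte instead of raising.
title = raw_title.decode("utf-8", errors="replace")
```

The decode no longer crashes, but the umlaut is lost: the byte `0xfc` becomes the Unicode replacement character.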

saimn closed this Dec 29, 2018

saimn added this to the 2.0 milestone Dec 29, 2018
