Unexpected UnicodeDecodeError while reading an OSX file comment. #109

Closed
fabiocaccamo opened this issue Nov 16, 2022 · 10 comments

Comments

@fabiocaccamo

Hi,
I'm experiencing an unexpected UnicodeDecodeError while reading an OSX file comment:

import xattr

# Read the raw bytes of the Finder comment extended attribute
comment = xattr.getxattr(filepath, "com.apple.metadata:kMDItemFinderComment")
print(comment)

Output:

b'bplist00o\x10\x1f\x001\x000\x00-\x002\x00:\x00 \x00\xa9\x00 \x00d\x00e\x00l\x00p\x00i\x00e\x00r\x00o\x00o\x00/\x00D\x00e\x00p\x00o\x00s\x00i\x00t\x00p\x00h\x00o\x00t\x00o\x00s\x08\x00\x00\x00\x00\x00\x00\x01\x01\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00I'

Then I try to decode the output:

comment_str = comment.decode()

But the following error is raised:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 24: invalid start byte
@etrepum
Member

etrepum commented Nov 16, 2022

You can’t decode that as UTF-8 because it isn’t. That is a binary property list (bplist). You’ll need to use another library to decode it, such as https://docs.python.org/3/library/plistlib.html
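
As a minimal sketch of that suggestion, the standard-library plistlib can parse such a value. The helper name `decode_finder_comment` is hypothetical, and the sample bytes are generated with `plistlib.dumps` as a stand-in for what `xattr.getxattr` returned in the report, so this is an illustration rather than anything xattr itself provides:

```python
import plistlib


def decode_finder_comment(raw: bytes) -> str:
    """Parse a binary property list that is expected to hold one string.

    Hypothetical helper for illustration; not part of xattr.
    """
    # plistlib.loads detects the binary format from the "bplist00" header.
    value = plistlib.loads(raw)
    if not isinstance(value, str):
        raise TypeError(f"expected a string, got {type(value).__name__}")
    return value


# Stand-in for the bytes returned by xattr.getxattr(...); assumed sample data.
sample_text = "10-2: \u00a9 delpieroo/Depositphotos"
sample_raw = plistlib.dumps(sample_text, fmt=plistlib.FMT_BINARY)

print(decode_finder_comment(sample_raw))
```

The type check is there because a property list can hold lists, dicts, dates and other types; a Finder comment is assumed here to be a single string.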

@etrepum etrepum closed this as completed Nov 16, 2022
@fabiocaccamo
Author

fabiocaccamo commented Nov 16, 2022

UPDATE:
I solved it by passing "unicode_escape" to the decode method (thanks to the @darkhaniop comment in #90).

comment_str = comment.decode("unicode_escape")

I think this decoding attempt should be made inside the xattr library.

@fabiocaccamo
Author

@etrepum you should not close this issue since it can be solved directly inside this library.

@etrepum
Member

etrepum commented Nov 16, 2022

I think you might change your mind about that if you had more experience working with binary formats and text encodings. unicode_escape is really only for very specialized situations, mostly for source code. The library will not use this codec by default, and this issue will remain closed.
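
To illustrate why that codec does not help here, the following sketch compares the two approaches on a generated binary plist. The sample bytes are an assumption, built with `plistlib.dumps` rather than read from a real file. `unicode_escape` returns without an error, but the result is the raw container (header, NUL padding, trailer) mapped byte-for-byte to characters, not the stored string:

```python
import plistlib

# Assumed sample: a binary plist holding one non-ASCII string.
expected = "10-2: \u00a9 delpieroo/Depositphotos"
raw = plistlib.dumps(expected, fmt=plistlib.FMT_BINARY)

# "Succeeds", but only because every byte is mapped to a code point.
wrong = raw.decode("unicode_escape")

# Actually parses the property list structure.
right = plistlib.loads(raw)

print(repr(wrong[:16]))  # still begins with the bplist header
print(right)
```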

@fabiocaccamo
Author

UPDATE:
I was wrong: the output was not correct. I solved it by reading the binary format with the bplist library.

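# Install first (shell command): pip install bplist
# BPListReader is provided by the bplist package.
comment = xattr.getxattr(
    filepath, "com.apple.metadata:kMDItemFinderComment"
)
# Parse the binary property list into a Python object
comment_str = BPListReader(comment).parse()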

Anyway... this could be implemented inside this library to help many people.

@etrepum
Member

etrepum commented Nov 16, 2022

The problem here is that the data in these attributes could be encoded in any possible way; there's no universal decoding strategy that could be used. For reading Apple-specific metadata, it would be a good idea to have a separate library that uses xattr to abstract this. xattr is a very low-level library that is only concerned with the direct reading and writing of these attributes.
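
A rough sketch of what such a layer on top of xattr might look like, assuming only two cases are handled (binary plists and UTF-8 text). The function name `decode_xattr_value` and the fallback order are hypothetical choices for illustration, not an existing API:

```python
import plistlib
from typing import Any

BPLIST_MAGIC = b"bplist00"


def decode_xattr_value(raw: bytes) -> Any:
    """Best-effort decoding of an extended attribute value.

    Hypothetical helper: tries a binary plist, then UTF-8, and otherwise
    returns the bytes unchanged, since the encoding is not knowable in general.
    """
    if raw.startswith(BPLIST_MAGIC):
        # May raise plistlib.InvalidFileException on malformed data.
        return plistlib.loads(raw, fmt=plistlib.FMT_BINARY)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw


print(decode_xattr_value(plistlib.dumps(["apple.com"], fmt=plistlib.FMT_BINARY)))
print(decode_xattr_value(b"plain text"))
print(decode_xattr_value(b"\xff\xfe"))
```

Even this small dispatcher has to guess, which is the point made above: the guessing belongs in a higher-level library, not in the low-level read and write calls.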

@fabiocaccamo
Author

I understand, thank you for the good explanation!

@RhetTbull
Contributor

RhetTbull commented Nov 16, 2022

@fabiocaccamo as @etrepum said, this is a low level library. For working with Mac metadata, I recommend you look at osxmetadata (disclaimer: I'm the author) which provides direct access to all macOS metadata indexed by Spotlight as well as many other attributes. It does use xattr under the hood for some functions.

For example, to read Finder comments:

import osxmetadata
md = osxmetadata.OSXMetaData(filepath)
comment = md.findercomment

# also
comment = md.kMDItemFinderComment

# also, something you cannot do via setting the xattr:
md.findercomment = "My new comment"

You can also directly access the extended attributes (but be aware that the extended attribute is not the source of truth for macOS Spotlight metadata and changing it won't necessarily update the Spotlight database)

>>> from osxmetadata import *
>>> import plistlib
>>> from plistlib import FMT_BINARY
>>> from functools import partial
>>> md = OSXMetaData("test_file.txt")
>>> md.kMDItemWhereFroms = ["apple.com"]
>>> md.kMDItemWhereFroms
['apple.com']
>>> decode = partial(plistlib.loads, fmt=FMT_BINARY)
>>> encode = partial(plistlib.dumps, fmt=FMT_BINARY)
>>> md.get_xattr("com.apple.metadata:kMDItemWhereFroms")
b'bplist00\xa1\x01Yapple.com\x08\n\x00\x00\x00\x00\x00\x00\x01\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x14'
>>> md.get_xattr("com.apple.metadata:kMDItemWhereFroms", decode=decode)
['apple.com']
>>> md.set_xattr("com.apple.metadata:kMDItemWhereFroms", ["google.com"], encode=encode)
>>> md.get_xattr("com.apple.metadata:kMDItemWhereFroms", decode=decode)
['google.com']
>>> md.remove_xattr("com.apple.metadata:kMDItemWhereFroms")
>>>

@etrepum Thank you for your work on xattr by the way! It has been extremely useful for me.

@fabiocaccamo
Author

@RhetTbull thank you very much for pointing me to osxmetadata; I didn't know about it!

@etrepum
Member

etrepum commented Nov 16, 2022

@RhetTbull if you'd like to plug osxmetadata in the xattr README.md I'd be happy to merge it! Making it easier to discover your library seems like a win for everyone.
