Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

disparity between methods accepting filenames as bytes or str #62

Closed
tisdall opened this issue Mar 11, 2014 · 17 comments
Closed

disparity between methods accepting filenames as bytes or str #62

tisdall opened this issue Mar 11, 2014 · 17 comments
Labels

Comments

@tisdall
Copy link

tisdall commented Mar 11, 2014

I upgraded from 0.13.1 to 0.14 and got several errors. One was that several methods that took filenames have been changed to only accept binary strings instead of regular unicode strings. So, in py3, bytes instead of str.

I think 0.13.1 actually had the opposite situation for some of those same methods where it only accepted a unicode string and not a binary string.

I think the underlying C library uses binary strings (ie no encoding enforced), so it makes sense to use that (on *nix). However, the python wrapper should be able to accept both and translate accordingly.

@exarkun
Copy link
Member

exarkun commented Mar 11, 2014

Please be explicit. Which methods?

@tisdall
Copy link
Author

tisdall commented Mar 11, 2014

OpenSSL.SSL.Context.use_privatekey_file - "keyfile must be a byte string"
OpenSSL.SSL.Context.use_certificate_file - "certfile must be bytes or unicode"
OpenSSL.SSL.Context.load_verify_locations - "cafile must be None or a byte string"

Three different methods all accepting a file name and each has different requirements. There are other methods I haven't used that I didn't check. I think use_privatekey_file is the one that threw an exception when I switched from 0.13.1 to 0.14 as I was using a unicode string.

@exarkun exarkun added the bug label Mar 19, 2014
@exarkun
Copy link
Member

exarkun commented Mar 19, 2014

Thank you.

@flavio
Copy link

flavio commented May 29, 2014

This looks like a duplicate of issue #15.

@tisdall
Copy link
Author

tisdall commented May 29, 2014

@flavio - looks related, but not a duplicate. My issue is that there's no consistency between what the different methods take as far as "string" type. Also, that it changed between the versions and broke some of my existing code.

@exarkun
Copy link
Member

exarkun commented Mar 29, 2015

Considering the direction Python has chosen for handling paths, I think the sensible thing for pyOpenSSL is to:

  • accept bytes or unicode for any path value
  • automatically encode unicode to bytes using sys.getfilesystemencoding()

It would be nice to have a function which performed the necessary check and encoding for a single value. This could be used throughout pyOpenSSL anywhere an API accepts a path.

@tisdall
Copy link
Author

tisdall commented Mar 29, 2015

I've been using something like this:

file_name_encoding = sys.getfilesystemencoding()
if file_name_encoding is None:
    file_name_encoding = sys.getdefaultencoding()

As per the docs saying "Return the name of the encoding used to convert Unicode filenames into system file names, or None if the system default encoding is used." (https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding)

@tisdall
Copy link
Author

tisdall commented Mar 29, 2015

how is this for a utility function:

from six import PY3, binary_type, text_type

def filename_to_bytes(filename):
    if isinstance(filename, binary_type):
        return filename
    elif isinstance(filename, text_type):
        fs_encoding = sys.getfilesystemencoding()
        if fs_encoding is None:
            fs_encoding = sys.getdefaultencoding()
        return filename.encode(fs_encoding)
    else:
        raise TypeError("filenames must either be a string or bytes")

@exarkun
Copy link
Member

exarkun commented Mar 29, 2015

I'm not sure it makes sense to apply sys.getdefaultencoding() to filesystem paths. If there's no indication of how path names are represented in the underlying system then it's probably better not to guess or to make a very simple, safe guess (like "ascii").

Another thing to consider is that, according to the documentation (https://docs.python.org/3/library/sys.html#sys.getfilesystemencoding), getfilesystemencoding cannot return None in Python 3.2 and newer. That means the case where getfilesystemencoding() returns None is limited to Python 2.6 and Python 2.7 - where support for unicode path names is perhaps not something folks expect, anyway.

@tisdall
Copy link
Author

tisdall commented Mar 30, 2015

As I quoted, the docs say it returns None when the file encoding is the
system default... So, when it's None I use the system default, no? But,
yes, it seems to only matter with py2.
On Mar 29, 2015 3:16 PM, "Jean-Paul Calderone" notifications@github.com
wrote:

I'm not sure it makes sense to apply sys.getdefaultencoding() to
filesystem paths. If there's no indication of how path names are
represented in the underlying system then it's probably better not to guess
or to make a very simple, safe guess (like "ascii").

Another thing to consider is that, according to the documentation (
https://docs.python.org/3/library/sys.html#sys.getfilesystemencoding),
getfilesystemencoding cannot return None in Python 3.2 and newer. That
means the case where getfilesystemencoding() returns None is limited to
Python 2.6 and Python 2.7 - where support for unicode path names is
perhaps not something folks expect, anyway.


Reply to this email directly or view it on GitHub
#62 (comment).

@exarkun
Copy link
Member

exarkun commented Mar 30, 2015

So, when it's None I use the system default, no?

I don't think so, no. sys.getdefaultencoding() isn't the "system default" and it isn't particularly related to path names. It's just "ascii" on Python 2 and "utf-8" on Python 3 - unless someone screws around with it (using sys.setdefaultencoding).

I don't think that making the fallback be either "ascii" or "utf-8" depending on the Python version is a good idea, I don't think that tying it to the Python "default encoding" (which itself is simply a bad idea) is particularly useful, and I don't think that letting people screw with it by using sys.setdefaultencoding will be particularly beneficial.

I'm not entirely decided on whether hard-coding "ascii" or "utf-8" makes more sense.

The advantage of ascii is that if the encoding succeeds you can be pretty sure you're addressing the file the user meant to address (there are ascii-incompatible encodings but ... for all intents and purposes they are unused). The disadvantage is that many possible path names will not be encodable using ascii. The mitigating factor, perhaps, is that if you're using non-ascii path names your environment really, really should be set up so that sys.getfilesystemencoding() returns something (even if it isn't the right thing, but that's not pyOpenSSL's problem, that's just the downside of how Python has decided to handle path names).

The advantage of utf-8 is that it can encode anything. The disadvantage is that it might be the wrong encoding and so the referenced file won't be found.

What's worse? Having UnicodeEncodeError raised? Or an OpenSSL error indicating file-not-found because the wrong filename was used?

@tisdall
Copy link
Author

tisdall commented Mar 30, 2015

Well, wouldn't the best thing be to mimic open()'s functionality? I'd say if you pass a string into open and it couldn't find the file due to encoding issues, then it'd be too much to expect any third party to do it better.

I guess when I came across this issue, that was the most surprising thing... I expected that I could pass it a string in the same way I had been doing with open() and it would work.

@exarkun
Copy link
Member

exarkun commented Mar 30, 2015

Well, wouldn't the best thing be to mimic open()'s functionality?

Yes. That's fine. How does open work? The library documentation doesn't actually say. All that's documented on https://docs.python.org/3/library/functions.html#open is:

file is either a string or bytes object giving the pathname (absolute or relative
to the current working directory) of the file to be opened or an integer file
descriptor of the file to be wrapped.

@tisdall
Copy link
Author

tisdall commented Mar 30, 2015

^_^ I knew that was coming next...

I've tried making sense of https://hg.python.org/cpython/file/00c982c9f681/Python/fileutils.c but it's a little beyond me. PyUnicode_EncodeFSDefault says it uses the filesytem default encoding but if it's not set that it defaults to the current locale encoding. That may be what getdefaultencoding is already doing in py3k instead of returning None.

@tisdall
Copy link
Author

tisdall commented Mar 30, 2015

sorry.. I meant that's what getfilesystemencoding is doing...

@tisdall
Copy link
Author

tisdall commented Mar 30, 2015

It seems like Py2 does what you're saying... if it has no value for Py_FileSystemDefaultEncoding it defaults to ASCII only. So, with no default encoding and something not covered by ASCII you get an UnicodeEncodeError when using open(). I'm not sure how to confirm that, though.

So, maybe this is a good starting point:

from six import PY3, binary_type, text_type

def filename_to_bytes(filename):
    if isinstance(filename, binary_type):
        return filename
    elif isinstance(filename, text_type):
        return filename.encode(sys.getfilesystemencoding() or 'ascii')
    else:
        raise TypeError("filenames must either be a string or bytes")

@tisdall
Copy link
Author

tisdall commented May 15, 2015

This seems to be fixed by #209 so I'll close it.

@tisdall tisdall closed this as completed May 15, 2015
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Development

No branches or pull requests

3 participants