Define backwards compatibility rules regarding encoding #1
If you start to require the user to use u"ou=people,dc=example,dc=org", it will break a lot of code. A smoother option would be to require Unicode on Python 3, but accept ASCII-encoded byte strings on Python 2. For the result, you can add an option on the connection to choose between bytes and Unicode. The obvious choice would be to use bytes on Python 2 and Unicode on Python 3 by default. You can suggest in the doc to enable Unicode explicitly on Python 2 too. Example with MySQL-Python: db = MySQLdb.connect(..., use_unicode=True, charset="utf8"). In the case of LDAP, you should not allow the user to configure the encoding (always use UTF-8). |
Yes, for Py2 the API should be unchanged. We should be aiming for a drop-in replacement. Encoding text fields to Unicode only under Python 3 sounds good, though. Please do not add a per-connection "output style" option. Keep in mind there may be libraries between pyldap and "the application". If you could mix and match output types, library functions that handle LDAP results would get way more complicated than necessary. Whether output is bytes or text depends on the schema. And, in some cases, the schema is unknown. Since bytes can't always be represented as text, but text can always be encoded, I think data should be text by default. Accepting both UTF-8 bytes and unicode on input sounds good. |
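A minimal sketch of the input handling discussed above, assuming Python 3: text is passed through unchanged and bytes are decoded as UTF-8 only, with no configurable encoding. `to_text` is a hypothetical helper name used for illustration; it is not part of pyldap or python-ldap.

```python
def to_text(value):
    """Return `value` as text, accepting either text or UTF-8 bytes.

    Hypothetical helper: the encoding is fixed to UTF-8 on purpose,
    following the suggestion that LDAP callers should not configure it.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        # Raises UnicodeDecodeError for bytes that are not valid UTF-8.
        return value.decode("utf-8")
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


# Both spellings of the same DN normalize to the same text value.
dn_from_bytes = to_text(b"ou=people,dc=example,dc=org")
dn_from_text = to_text("ou=people,dc=example,dc=org")
```

Rejecting anything that is neither text nor bytes keeps mistakes such as passing `None` visible at the call site instead of deep inside the library.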
+1 for ASCII bytes in Python 2 and text in Python 3. For the return value it is a bit more tricky. Some elements can always be interpreted as unicode, e.g. attribute names, DNs, RDNs and a couple of other things. For attribute values it depends on the schema. Sometimes the schema is unknown. |
In this case, the library should be responsible for handling the connection, not the application. |
@Haypo, sometimes you just need to transform output. Besides, decoding data is not just bytes to text. Sometimes the values are integers or booleans, with all the same problems as decoding text, e.g. being dependent on schema. Traditionally, python-ldap did not do this, and I believe it is still a job for another library on top of python-ldap. (Though as a co-author of one, ipaldap, I might be biased.) To be clear, for things that are always text, I still think python-ldap should do the decoding, as long as it's perfectly clear what the user wants. |
Is it possible to easily know if a field is binary, like a photo (JPEG data)? |
No. |
Yes, the information is part of the schema. But in order to decide if a value is binary, text, bool, int or some other format, the library has to fetch the schema and look up the type. The schema can change at almost any time, too. The lookup and processing of the schema has a performance impact. The schema may not be available, either. In my opinion it makes more sense to have different layers. The low-level LDAP interface should only decode elements that are always unicode: DN, RDN, attribute names and maybe more. Attribute values are returned as raw bytes. A high-level API can fetch and cache the schema in order to convert bytes to text. It's even a bit more complicated. The schema also contains rules for equality, ordering and substring matching. In LDAP, text isn't just text. |
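The layering described above can be sketched roughly as follows, assuming Python 3. The low-level layer hands over raw bytes, and a separate high-level function converts values only when a syntax is known for the attribute. The syntax OIDs are the ones defined in RFC 4517; `decode_entry` and the hand-written `syntax_by_attr` mapping are hypothetical stand-ins for a real fetched and cached schema.

```python
# LDAP syntax OIDs defined in RFC 4517.
DIRECTORY_STRING = "1.3.6.1.4.1.1466.115.121.1.15"
INTEGER = "1.3.6.1.4.1.1466.115.121.1.27"
BOOLEAN = "1.3.6.1.4.1.1466.115.121.1.7"

# One decoder per known syntax; anything else stays as raw bytes.
DECODERS = {
    DIRECTORY_STRING: lambda raw: raw.decode("utf-8"),
    INTEGER: lambda raw: int(raw.decode("ascii")),
    BOOLEAN: lambda raw: raw == b"TRUE",
}


def decode_entry(attrs, syntax_by_attr):
    """Decode attribute values whose syntax is known; keep the rest as bytes.

    `attrs` maps attribute names to lists of raw byte values.
    `syntax_by_attr` maps lowercased attribute names to syntax OIDs
    (an assumption standing in for a cached schema lookup).
    """
    decoded = {}
    for name, values in attrs.items():
        # Attribute names are case-insensitive, so look them up lowercased.
        decoder = DECODERS.get(syntax_by_attr.get(name.lower()))
        if decoder is None:
            decoded[name] = list(values)  # unknown or binary: do not guess
        else:
            decoded[name] = [decoder(value) for value in values]
    return decoded


schema = {"cn": DIRECTORY_STRING, "uidnumber": INTEGER}
raw_attrs = {
    "cn": [b"Ren\xc3\xa9"],
    "uidNumber": [b"1000"],
    "jpegPhoto": [b"\xff\xd8\xff"],  # no syntax known here, stays bytes
}
entry = decode_entry(raw_attrs, schema)
```

Because unknown attributes fall through untouched, a missing or stale schema degrades to raw bytes rather than to a wrong decoding.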
Oh. In this case, please don't guess anything. It's better to return raw bytes than to use the wrong encoding.
Can't we push the responsibility to the application directly? The application knows the schema, so it should decode the values itself. Or pyldap can provide a helper to explicitly define which fields are text. By default, we may use a list of well-known fields like DN, but the application may clear this list or add items to it. Sorry, I don't know LDAP well enough to make a more concrete proposal. |
As I read the discussion, I think making the API compatible is the way to go now. Further changes may be discussed with python-ldap devs later, if we convince them to cooperate. |
As mentioned above, there are a few types of fields in LDAP:
The python-ldap version uses bytes for all those fields. Py3 comes with much better rules on that front, and forcing users to use bytes for fields that are clearly defined as "MUST contain valid UTF-8" seems a bad design; it is actually riskier, since some users may use another charset by mistake. In other words, I want to avoid the following:

    def get_org_name(org_id):
        base = "ou=%s,dc=example,dc=org" % org_id
        # search_ext_s is the synchronous call that returns the entries directly.
        results = connection.search_ext_s(base.encode('utf-8'), scope=ldap.SCOPE_SUBTREE)
        # Each entry is (dn, attrs); attrs maps a name to a list of values.
        bname = results[0][1][b'name'][0]
        return bname.decode('utf-8')

Instead, I'd want the following API:

    def get_org_name(org_id):
        results = connection.search_ext_s("ou=%s,dc=example,dc=org" % org_id, scope=ldap.SCOPE_SUBTREE)
        # The attribute name is text; the attribute value is still bytes.
        bname = results[0][1]['name'][0]
        return bname.decode('utf-8')

Obviously, attribute values should be kept as bytes on both sides. |
Right. At this point it looks like we're all in agreement. Thanks for the summary @rbarrois. I'll just add that the Python 2 API should be unchanged. So, in py3c terms, everything that's UTF-8 should be PyStr. |
I could, if @rbarrois doesn't plan to do it himself. |
Not sure about what we agree on :) The current implementation (the one I wrote some time ago) already does the following:
The problem with this approach is that libraries relying on pyldap must have different code paths depending on whether they run under Py2 or under Py3:
Since having inconsistent APIs between Py2 and Py3 seems weird, I see 3 options:
In my opinion, we should provide a simple way for library users to "upgrade" to the unicode behaviour when they enable Py3 support. I'll try to implement option 2 during the week-end. |
Well, "native string" is bytes on py2 and unicode on py3, so the respective dances are pretty much par for the course when porting something. Python itself has an inconsistent API between the two versions. So, I can live without the unicode-enabling flag, but having it won't hurt. |
@encukou having ported a few libraries to Py3, a Py2-and-Py3-compatible library tends to switch to « the Py3 way » and drops "native strings" for unicode (obviously not true for one-time scripts). |
OK, I've just added an implementation of option 2. With this, code should run as follows:
|
Thanks @rbarrois. I'll look at the code tomorrow, if you want. |
With this commit, all ldap connections accept a new parameter, ``bytes_mode``. When set to ``True``, this flag emulates the old Python 2 behavior, where all fields are bytes - including those declared as UTF-8 by the RFC (DN, RDN, attribute names). If this flag is set to ``False``, the code works with text (unicode) for all text fields (everything except attribute values). If no value is set under Python 2, the code will raise a BytesWarning and proceed with the flag set to ``True``, for backwards compatibility. Under Python 3, the value can only be set to ``False``. For safety and ease of upgrade, the code checks that all provided arguments are of the expected type (unicode with ``bytes_mode=False``, bytes with ``bytes_mode=True``).
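The rules in this commit message can be illustrated with a small sketch. It is not the actual pyldap implementation: `resolve_bytes_mode` and `check_text_arg` are hypothetical names, the interpreter version is passed in as an explicit `is_py2` flag so the logic can be read and run on Python 3, and the choice of `ValueError` for the Python 3 restriction is an assumption.

```python
import warnings


def resolve_bytes_mode(bytes_mode, is_py2):
    """Return the effective mode following the rules described above."""
    if is_py2:
        if bytes_mode is None:
            # Unset under Python 2: warn, then keep the old bytes behaviour.
            warnings.warn(
                "bytes_mode not set; defaulting to bytes_mode=True",
                BytesWarning,
                stacklevel=2,
            )
            return True
        return bool(bytes_mode)
    if bytes_mode:
        # Assumption: the exact exception type is not stated in the source.
        raise ValueError("bytes_mode can only be False under Python 3")
    return False


def check_text_arg(value, bytes_mode):
    """Check that a text field (DN, RDN, attribute name) has the expected type."""
    expected = bytes if bytes_mode else str  # `str` is text on Python 3
    if not isinstance(value, expected):
        raise TypeError(
            "expected %s with bytes_mode=%r, got %s"
            % (expected.__name__, bytes_mode, type(value).__name__)
        )
    return value


mode = resolve_bytes_mode(None, is_py2=False)
base_dn = check_text_arg("dc=example,dc=org", mode)
```

Failing early on a mismatched type is what makes the flag useful during a port: a leftover `b"..."` literal raises at the call site instead of producing a silently wrong query.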
The long-term goal of this repository is to be merged back into the upstream python-ldap package.
However, the python-ldap API doesn't feel natural under Python 3, mostly because of its encoding handling (it uses bytes everywhere, even where unicode strings could be used).
According to the RFCs, some fields MUST be encoded as UTF-8, and should thus be represented as unicode under Py2 / str under Py3.
This issue appears mostly with base DNs, in filters, and in returned distinguished names:
If we want third party libraries integrating python-ldap to use the same code on both Py2 and Py3, we have only a few options:
The goal is to be able to write the following code: