Define backwards compatibility rules regarding encoding #1

rbarrois · 2015-07-16T18:54:43Z

The long-term goal of this repository is to be merged back into the upstream python-ldap package.

However, the python-ldap API doesn't feel natural under Python3, mostly with its encoding handling (it uses bytes everywhere, even when unicode strings could be used).

From the RFCs, some fields MUST be encoded as UTF-8, thus represented as unicode under Py2 / str under Py3.

This issue appears mostly on base DNs in filters and in return distinguished names:

# In Py2:
>>> base = u"ou=people,dc=example,dc=org"
>>> results = connection.search_ext(base.encode('utf-8'), scope=ldap.SCOPE_SUBTREE)
>>> results[0][0]  # results is a list of (dn, data)
b'uid=rbarrois,ou=people,dc=example,dc=org'
>>> results[0][1].keys()[0]
b'distinguisedName'

For filters, since data is fed from the calling program to python-ldap, it is simple to support both unicode and bytes and convert as needed
For output, we must decide on a type - either bytes or unicode.

If we want third party libraries integrating python-ldap to use the same code on both Py2 and Py3, we have only a few options:

Keep the Py2, bytes-only API for Py3
Change the Py2 API from python-ldap's version
Use a new object to represent "modern" connections

The goal is to be able to write the following code:

# In Py3 ("" for str, b'' for bytes)
>>> base = "ou=people,dc=example,dc=org"
>>> results = connection.search_ext(base, scope=ldap.SCOPE_SUBTREE)
>>> results[0]
("uid=rbarrois,ou=people,dc=example,dc=org", {
    "uid": [b'rbarrois'],
    "picture": [b'XXX.UZYGFUYGFUEY...'],
})

# In Py2 (u"" for unicode, '' for bytes)
>>> base = u"ou=people,dc=example,dc=org"
>>> results = connection.search_ext(base, scope=ldap.SCOPE_SUBTREE)
>>> results[0]
(u"uid=rbarrois,ou=people,dc=example,dc=org", {
    u"uid": ['rbarrois'],
    u"picture": ['XXX.UZYGFUYGFUEY...'],
})

vstinner · 2015-07-17T08:18:15Z

If you start to require the user to use u"ou=people,dc=example,dc=org", it will break a lot of code. A smoother option would be to require Unicode on Python 3, but accept ASCII-encoded byte strings on Python 2.

For the result, you can add an option on the connection to choose between bytes and Unicode. The obvious choice would be to use bytes on Python 2 and Unicode on Python 3 by default. You can suggest in the doc to enable Unicode explicitly on Python 2 too.

Example with MySQL-Python: db = MySQLdb.connect(..., use_unicode="True", charset="utf8"). In the case of LDAP, you should not allow the user to configure the encoding (always use UTF-8).

encukou · 2015-07-17T08:56:27Z

Yes, for Py2 the API should be unchanged. We should be aminig for a drop-in replacement. Encoding text fields to Unicode only under Python 3 sounds good, though.

Please do not add a per-connection "output style" option. Keep in mind there may be libraries between pyldap and "the application". If you could mix and match output types, library functions that handle ldap results would get way more complicated than necessary.

Whether output is bytes or text depends on the schema. And, in some cases the schema is unknown. Since bytes can't always be represented as text, but text can always be encoded, I think data should be text by default.

Accepting both UTF-8 bytes and unicode on input sounds good.

tiran · 2015-07-17T09:00:44Z

+1 for ASCII bytes in Python 2 and text in Python 3

For the return value it is a bit more tricky. Some elements can always be interpreted as unicode, e.g. attribute names, DNs, RDNs and a couple of other things. For attribute values it depends on the schema. Sometimes the schema is unknown.

vstinner · 2015-07-17T09:07:48Z

Please do not add a per-connection "output style" option. Keep in mind there may be libraries between pyldap and "the application".

In this case, the library should be responsible the handle the connection, not the application.

encukou · 2015-07-17T10:22:02Z

@Haypo, someties you just need to transform output.

Besides, decoding data is not just bytes to text. Sometimes the values are integers or booleans – with all the same problems as decoding text, e.g. being dependent on schema. Traditionally, python-ldap did not do this, and I believe it is still a job for another library on top of python-ldap. (Though as a co-author of one, ipaldap, I might be biased.)
If one of the higher-level libraries manages to solve the entire problem well, we can think about rolling it into pyldap. If we try to solve it now, I fear we'll get a non-optimal/half-baked solution.

To be clear, for things that are always text, I still think python-ldap should do the decoding – as long as it's perfectly clear what the user wants.

vstinner · 2015-07-17T11:58:46Z

Is it possible to easily know if a field is binary, like a photo (JPEG data)?

encukou · 2015-07-17T12:07:37Z

No.
AFAIK, sometimes you'll find non-UTF8 data in fields where the schema says they should be textual. A real-world library needs to support that.

tiran · 2015-07-17T12:11:00Z

Yes, the information is part of the schema. But in order to decide if a value is binary, text, bool, int or some other format the library has to fetch the schema and look up the type. The schema can change almost anytime time, too. The lookup and processing of the schema has a performance impact. The schema may not be available, too.

In my opinion it makes more sense to have different layers. The low level LDAP interface should only decode elements that are always unicode: DN, RDN, attribute names and maybe more. Attribute values are returned as raw bytes. A high level API can fetch and cache the schema in order to convert bytes to text.

It's even a bit more complicated. The schema also contains rules for equality, ordering and substrings matching. In LDAP text isn't just text.

vstinner · 2015-07-17T12:41:07Z

Yes, the information is part of the schema. (...) The schema can change almost anytime time, too.

Oh. In this case, please don't guess anyone. It's better to return raw bytes than using the wrong encoding.

A high level API can fetch and cache the schema in order to convert bytes to text.

Can't we push the responsability to the application directly? The application knows the schema, so it should decode itself.

Or pyldap can provide an helper to explicitly define which fields are text. By default, we may use a list of well known fields like DN, but the application may clear this list or add item to this list. Sorry, I don't know LDAP well enough to have a more concrete proposition.

hroncok · 2015-07-17T14:25:16Z

As I read the discussion, I think making the API compatible is the way to go now. Further changes may be discussed with python-ldap devs later, if we convince them to cooperate.

rbarrois · 2015-07-17T14:47:35Z

As mentionned above, there are a few type of fields in LDAP:

distinguishedName (including bases and RDN) are always UTF-8
filters/queries are always UTF-8
attribute names are always UTF-8
attribute values may be bytes, booleans, integers, anything

The python-ldap version uses bytes for all those fields.
Users tend to store only ASCII chars in those fields, and this will convert nicely to unicode objects thanks to Py2's automagic casting rules — and break horribly once they add UTF-8 chars in there.

Py3 comes with much better rules on that front, and forcing users to use bytes for fields that are clearly defined as "MUST contain valid UTF-8" seems a bad design — actually riskier, some users may use another charset by mistake.

In other words, I want to avoid the following:

def get_org_name(org_id):
    base = "ou=%s,dc=example,dc=org" % org_id
    results = connection.search_ext(base.encode('utf-8'), scope=ldap.SCOPE_SUBTREE)
    bname = results[0][1][b'name']
    return bname.decode('utf-8')

The .encode('utf-8') and the b'name' are boilerplate, imposed by the library design yet useless from the RFC point of view.
The need to use b'name' is unnatural, and surprise many users, who won't understand why they get a KeyError.

Instead, I'd want the following API:

def get_org_name(org_id):
    results = connection.search_ext("ou=%s,dc=example,dc=org" % org_id, scope=ldap.SCOPE_SUBTREE)
    bname = results[0][1]['name']
    return bname.decode('utf-8')

Obviously, attribute values should be kept as bytes on both sides.

encukou · 2015-07-17T15:06:12Z

Right. At this point it looks like we're all in agreement. Thanks for the summary @rbarrois. I'll just add that the Python 2 API should be unchanged. So, in py3c terms, everything that's UTF-8 should be PyStr.
@hroncok, will you be able to implement this next week? I can do a review after EuroPython (if no one beats me to it), and I'll try to be available if you have any questions.

hroncok · 2015-07-17T15:14:30Z

I could, if @rbarrois doesn't plan to do it himself.

rbarrois · 2015-07-17T15:25:11Z

Not sure about what we agree on :)

The current implementation (the one I wrote some time ago) already does the following:

On Py2, keep the old, clumsy API
On Py3, use PyStr instead of PyBytes for DN / filter / attribute names

The problem with this approach is that libraries relying on pyldap must have different code paths depending on whether they run under Py2 or under Py3:

If they wish to provide a "unicode/str" interface, they need to play the .encode('utf-8') / .decode('utf-8') dans on Py2
If they prefer the "bytes only" version, they need to play that dance on Py3.

Since having inconsistent APIs between Py2 and Py3 seems weird, I see 3 options:

Break backwards compatibility and use unicode on Py2
Add an optional flag on methods to opt-in to unicode mode on Py2, on by default on Py3
Go back to the bytes-only API and provide a nicer wrapper for both Py2 and Py3

In my opinion, we should provide a simple way for library users to "upgrade" to the unicode behaviour when they enable Py3 support.
This could be easy if, under Py3, functions receiving bytes where they expect str fail with a message akin to Under Py3, this method takes bytes as argument. If you're adding Py3 support to your library, see pyldap.readthedocs.org/#upgrading-from-py2 for help.

I'll try to implement option 2 during the week-end.

encukou · 2015-07-17T16:08:00Z

Well, "native string" is bytes on py2 and unicode on py3, so the respective dances are pretty much par for the course when porting something. Python itself has an inconsistent API between the two versions.

So, I can live without the unicode-enabling flag, but having it won't hurt.

rbarrois · 2015-07-17T16:46:10Z

@encukou having ported a few libraries to Py3, a Py2-and-Py3-compatible library tends to switch to « the Py3 way » and drops "native strings" for unicode (obviously not true for one-time scripts).

rbarrois · 2015-07-20T23:20:17Z

OK, I've just added an implementation of option 2.

With this, code should run as follows:

If starting a new project with Py3, do nothing and use usual types
If porting a project from Py2, begin with setting bytes_mode=True in calls to ldap.initialize() / ldap.open().
When ready, set bytes_mode=False and fix code accordingly
Developers of third-party libraries should force bytes_mode=False everywhere: this ensures that the ldap API is consistent accross Python 2 and Python 3.

hroncok · 2015-07-21T01:41:01Z

Thanks @rbarrois. I'll at the code tomorrow, if you want.

With this commit, all ldap connections accept a new parameter, ``bytes_mode``. When set to ``True``, this flag emulates the old Python 2 behavior, where all fields are bytes - including those declared as UTF-8 by the RFC (DN, RDN, attribute names). If this flag is set to ``False``, the code works with text (unicode) for all text fields (everything except attribute values). If no value is set under Python 2, the code will raise a BytesWarning and proceed with the flag set to ``True``, for backwards compatibility. Under Python 3, the value can only be set to ``False``. For safety and ease of upgrade, the code checks that all provided arguments are of the expected type (unicode with ``bytes_mode=False``, bytes with ``bytes_mode=True``).

rbarrois · 2015-07-25T16:21:19Z

OK, I've just added a proper implementation proposal for this over in #7, taking into account @hroncok suggestions.

…pyldap#1)

…#1)

rbarrois added the design label Jul 16, 2015

tiran mentioned this issue Jul 16, 2015

Define upgrade path from upstream's python-ldap #2

Closed

hroncok mentioned this issue Jul 23, 2015

Release on PyPI #3

Closed

rbarrois mentioned this issue Jul 25, 2015

Implement backwards compatibility mode for Python 2 #7

Merged

hroncok closed this as completed Jul 26, 2015

mkurek mentioned this issue Mar 7, 2016

Fixed nested group syncing when logging in allegro/ralph#2267

Merged

3 tasks

bd808 mentioned this issue Jun 29, 2016

Switch to pyldap with 'bytes_mode=False' likely to break many Python2 consumers django-ldapdb/django-ldapdb#91

Closed

pdpotter added a commit to UGentPortaal/pyldap-ugent that referenced this issue Jan 13, 2017

Add bytes_mode flag to NonblockingLDAPObject, ReconnectLDAPObject (See …

d49f9a0

…pyldap#1)

rbarrois pushed a commit that referenced this issue Jan 19, 2017

Add bytes_mode flag to NonblockingLDAPObject, ReconnectLDAPObject (See …

f67a968

…#1)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Define backwards compatibility rules regarding encoding #1

Define backwards compatibility rules regarding encoding #1

rbarrois commented Jul 16, 2015

vstinner commented Jul 17, 2015

encukou commented Jul 17, 2015

tiran commented Jul 17, 2015

vstinner commented Jul 17, 2015

encukou commented Jul 17, 2015

vstinner commented Jul 17, 2015 via email

encukou commented Jul 17, 2015

tiran commented Jul 17, 2015

vstinner commented Jul 17, 2015

hroncok commented Jul 17, 2015

rbarrois commented Jul 17, 2015

encukou commented Jul 17, 2015

hroncok commented Jul 17, 2015

rbarrois commented Jul 17, 2015

encukou commented Jul 17, 2015

rbarrois commented Jul 17, 2015

rbarrois commented Jul 20, 2015

hroncok commented Jul 21, 2015

rbarrois commented Jul 25, 2015

Define backwards compatibility rules regarding encoding #1

Define backwards compatibility rules regarding encoding #1

Comments

rbarrois commented Jul 16, 2015

vstinner commented Jul 17, 2015

encukou commented Jul 17, 2015

tiran commented Jul 17, 2015

vstinner commented Jul 17, 2015

encukou commented Jul 17, 2015

vstinner commented Jul 17, 2015 via email

encukou commented Jul 17, 2015

tiran commented Jul 17, 2015

vstinner commented Jul 17, 2015

hroncok commented Jul 17, 2015

rbarrois commented Jul 17, 2015

encukou commented Jul 17, 2015

hroncok commented Jul 17, 2015

rbarrois commented Jul 17, 2015

encukou commented Jul 17, 2015

rbarrois commented Jul 17, 2015

rbarrois commented Jul 20, 2015

hroncok commented Jul 21, 2015

rbarrois commented Jul 25, 2015