Skip to content
This repository has been archived by the owner on Mar 21, 2019. It is now read-only.

Define backwards compatibility rules regarding encoding #1

Closed
rbarrois opened this issue Jul 16, 2015 · 19 comments
Closed

Define backwards compatibility rules regarding encoding #1

rbarrois opened this issue Jul 16, 2015 · 19 comments
Labels

Comments

@rbarrois
Copy link
Contributor

The long-term goal of this repository is to be merged back into the upstream python-ldap package.

However, the python-ldap API doesn't feel natural under Python3, mostly with its encoding handling (it uses bytes everywhere, even when unicode strings could be used).

From the RFCs, some fields MUST be encoded as UTF-8, thus represented as unicode under Py2 / str under Py3.

This issue appears mostly on base DNs in filters and in return distinguished names:

# In Py2:
>>> base = u"ou=people,dc=example,dc=org"
>>> results = connection.search_ext(base.encode('utf-8'), scope=ldap.SCOPE_SUBTREE)
>>> results[0][0]  # results is a list of (dn, data)
b'uid=rbarrois,ou=people,dc=example,dc=org'
>>> results[0][1].keys()[0]
b'distinguisedName'
  • For filters, since data is fed from the calling program to python-ldap, it is simple to support both unicode and bytes and convert as needed
  • For output, we must decide on a type - either bytes or unicode.

If we want third party libraries integrating python-ldap to use the same code on both Py2 and Py3, we have only a few options:

  • Keep the Py2, bytes-only API for Py3
  • Change the Py2 API from python-ldap's version
  • Use a new object to represent "modern" connections

The goal is to be able to write the following code:

# In Py3 ("" for str, b'' for bytes)
>>> base = "ou=people,dc=example,dc=org"
>>> results = connection.search_ext(base, scope=ldap.SCOPE_SUBTREE)
>>> results[0]
("uid=rbarrois,ou=people,dc=example,dc=org", {
    "uid": [b'rbarrois'],
    "picture": [b'XXX.UZYGFUYGFUEY...'],
})

# In Py2 (u"" for unicode, '' for bytes)
>>> base = u"ou=people,dc=example,dc=org"
>>> results = connection.search_ext(base, scope=ldap.SCOPE_SUBTREE)
>>> results[0]
(u"uid=rbarrois,ou=people,dc=example,dc=org", {
    u"uid": ['rbarrois'],
    u"picture": ['XXX.UZYGFUYGFUEY...'],
})
@vstinner
Copy link

If you start to require the user to use u"ou=people,dc=example,dc=org", it will break a lot of code. A smoother option would be to require Unicode on Python 3, but accept ASCII-encoded byte strings on Python 2.

For the result, you can add an option on the connection to choose between bytes and Unicode. The obvious choice would be to use bytes on Python 2 and Unicode on Python 3 by default. You can suggest in the doc to enable Unicode explicitly on Python 2 too.

Example with MySQL-Python: db = MySQLdb.connect(..., use_unicode="True", charset="utf8"). In the case of LDAP, you should not allow the user to configure the encoding (always use UTF-8).

@encukou
Copy link
Contributor

encukou commented Jul 17, 2015

Yes, for Py2 the API should be unchanged. We should be aminig for a drop-in replacement. Encoding text fields to Unicode only under Python 3 sounds good, though.

Please do not add a per-connection "output style" option. Keep in mind there may be libraries between pyldap and "the application". If you could mix and match output types, library functions that handle ldap results would get way more complicated than necessary.

Whether output is bytes or text depends on the schema. And, in some cases the schema is unknown. Since bytes can't always be represented as text, but text can always be encoded, I think data should be text by default.

Accepting both UTF-8 bytes and unicode on input sounds good.

@tiran
Copy link
Contributor

tiran commented Jul 17, 2015

+1 for ASCII bytes in Python 2 and text in Python 3

For the return value it is a bit more tricky. Some elements can always be interpreted as unicode, e.g. attribute names, DNs, RDNs and a couple of other things. For attribute values it depends on the schema. Sometimes the schema is unknown.

@vstinner
Copy link

Please do not add a per-connection "output style" option. Keep in mind there may be libraries between pyldap and "the application".

In this case, the library should be responsible the handle the connection, not the application.

@encukou
Copy link
Contributor

encukou commented Jul 17, 2015

@Haypo, someties you just need to transform output.

Besides, decoding data is not just bytes to text. Sometimes the values are integers or booleans – with all the same problems as decoding text, e.g. being dependent on schema. Traditionally, python-ldap did not do this, and I believe it is still a job for another library on top of python-ldap. (Though as a co-author of one, ipaldap, I might be biased.)
If one of the higher-level libraries manages to solve the entire problem well, we can think about rolling it into pyldap. If we try to solve it now, I fear we'll get a non-optimal/half-baked solution.

To be clear, for things that are always text, I still think python-ldap should do the decoding – as long as it's perfectly clear what the user wants.

@vstinner
Copy link

vstinner commented Jul 17, 2015 via email

@encukou
Copy link
Contributor

encukou commented Jul 17, 2015

No.
AFAIK, sometimes you'll find non-UTF8 data in fields where the schema says they should be textual. A real-world library needs to support that.

@tiran
Copy link
Contributor

tiran commented Jul 17, 2015

Yes, the information is part of the schema. But in order to decide if a value is binary, text, bool, int or some other format the library has to fetch the schema and look up the type. The schema can change almost anytime time, too. The lookup and processing of the schema has a performance impact. The schema may not be available, too.

In my opinion it makes more sense to have different layers. The low level LDAP interface should only decode elements that are always unicode: DN, RDN, attribute names and maybe more. Attribute values are returned as raw bytes. A high level API can fetch and cache the schema in order to convert bytes to text.

It's even a bit more complicated. The schema also contains rules for equality, ordering and substrings matching. In LDAP text isn't just text.

@vstinner
Copy link

Yes, the information is part of the schema. (...) The schema can change almost anytime time, too.

Oh. In this case, please don't guess anyone. It's better to return raw bytes than using the wrong encoding.

A high level API can fetch and cache the schema in order to convert bytes to text.

Can't we push the responsability to the application directly? The application knows the schema, so it should decode itself.

Or pyldap can provide an helper to explicitly define which fields are text. By default, we may use a list of well known fields like DN, but the application may clear this list or add item to this list. Sorry, I don't know LDAP well enough to have a more concrete proposition.

@hroncok
Copy link
Member

hroncok commented Jul 17, 2015

As I read the discussion, I think making the API compatible is the way to go now. Further changes may be discussed with python-ldap devs later, if we convince them to cooperate.

@rbarrois
Copy link
Contributor Author

As mentionned above, there are a few type of fields in LDAP:

  • distinguishedName (including bases and RDN) are always UTF-8
  • filters/queries are always UTF-8
  • attribute names are always UTF-8
  • attribute values may be bytes, booleans, integers, anything

The python-ldap version uses bytes for all those fields.
Users tend to store only ASCII chars in those fields, and this will convert nicely to unicode objects thanks to Py2's automagic casting rules — and break horribly once they add UTF-8 chars in there.

Py3 comes with much better rules on that front, and forcing users to use bytes for fields that are clearly defined as "MUST contain valid UTF-8" seems a bad design — actually riskier, some users may use another charset by mistake.

In other words, I want to avoid the following:

def get_org_name(org_id):
    base = "ou=%s,dc=example,dc=org" % org_id
    results = connection.search_ext(base.encode('utf-8'), scope=ldap.SCOPE_SUBTREE)
    bname = results[0][1][b'name']
    return bname.decode('utf-8')

The .encode('utf-8') and the b'name' are boilerplate, imposed by the library design yet useless from the RFC point of view.
The need to use b'name' is unnatural, and surprise many users, who won't understand why they get a KeyError.

Instead, I'd want the following API:

def get_org_name(org_id):
    results = connection.search_ext("ou=%s,dc=example,dc=org" % org_id, scope=ldap.SCOPE_SUBTREE)
    bname = results[0][1]['name']
    return bname.decode('utf-8')

Obviously, attribute values should be kept as bytes on both sides.

@encukou
Copy link
Contributor

encukou commented Jul 17, 2015

Right. At this point it looks like we're all in agreement. Thanks for the summary @rbarrois. I'll just add that the Python 2 API should be unchanged. So, in py3c terms, everything that's UTF-8 should be PyStr.
@hroncok, will you be able to implement this next week? I can do a review after EuroPython (if no one beats me to it), and I'll try to be available if you have any questions.

@hroncok
Copy link
Member

hroncok commented Jul 17, 2015

I could, if @rbarrois doesn't plan to do it himself.

@rbarrois
Copy link
Contributor Author

Not sure about what we agree on :)

The current implementation (the one I wrote some time ago) already does the following:

  • On Py2, keep the old, clumsy API
  • On Py3, use PyStr instead of PyBytes for DN / filter / attribute names

The problem with this approach is that libraries relying on pyldap must have different code paths depending on whether they run under Py2 or under Py3:

  • If they wish to provide a "unicode/str" interface, they need to play the .encode('utf-8') / .decode('utf-8') dans on Py2
  • If they prefer the "bytes only" version, they need to play that dance on Py3.

Since having inconsistent APIs between Py2 and Py3 seems weird, I see 3 options:

  1. Break backwards compatibility and use unicode on Py2
  2. Add an optional flag on methods to opt-in to unicode mode on Py2, on by default on Py3
  3. Go back to the bytes-only API and provide a nicer wrapper for both Py2 and Py3

In my opinion, we should provide a simple way for library users to "upgrade" to the unicode behaviour when they enable Py3 support.
This could be easy if, under Py3, functions receiving bytes where they expect str fail with a message akin to Under Py3, this method takes bytes as argument. If you're adding Py3 support to your library, see pyldap.readthedocs.org/#upgrading-from-py2 for help.

I'll try to implement option 2 during the week-end.

@encukou
Copy link
Contributor

encukou commented Jul 17, 2015

Well, "native string" is bytes on py2 and unicode on py3, so the respective dances are pretty much par for the course when porting something. Python itself has an inconsistent API between the two versions.

So, I can live without the unicode-enabling flag, but having it won't hurt.

@rbarrois
Copy link
Contributor Author

@encukou having ported a few libraries to Py3, a Py2-and-Py3-compatible library tends to switch to « the Py3 way » and drops "native strings" for unicode (obviously not true for one-time scripts).

@rbarrois
Copy link
Contributor Author

OK, I've just added an implementation of option 2.

With this, code should run as follows:

  • If starting a new project with Py3, do nothing and use usual types
  • If porting a project from Py2, begin with setting bytes_mode=True in calls to ldap.initialize() / ldap.open().
  • When ready, set bytes_mode=False and fix code accordingly
  • Developers of third-party libraries should force bytes_mode=False everywhere: this ensures that the ldap API is consistent accross Python 2 and Python 3.

@hroncok
Copy link
Member

hroncok commented Jul 21, 2015

Thanks @rbarrois. I'll at the code tomorrow, if you want.

@hroncok hroncok mentioned this issue Jul 23, 2015
rbarrois added a commit that referenced this issue Jul 25, 2015
With this commit, all ldap connections accept a new parameter,
``bytes_mode``.

When set to ``True``, this flag emulates the old Python 2 behavior,
where all fields are bytes - including those declared as UTF-8 by the RFC (DN,
RDN, attribute names).

If this flag is set to ``False``, the code works with text (unicode) for all
text fields (everything except attribute values).

If no value is set under Python 2, the code will raise a BytesWarning
and proceed with the flag set to ``True``, for backwards compatibility.
Under Python 3, the value can only be set to ``False``.

For safety and ease of upgrade, the code checks that all provided
arguments are of the expected type (unicode with ``bytes_mode=False``,
bytes with ``bytes_mode=True``).
@rbarrois
Copy link
Contributor Author

OK, I've just added a proper implementation proposal for this over in #7, taking into account @hroncok suggestions.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

5 participants