IDNA encoding/decoding issue #46

ghost · 2013-11-29T14:33:10Z

Hello,

I have trouble decoding a punycoded IDN domain back to Unicode. Here is an example (Python 2.7):

>>> n = dns.name.from_unicode(u'ee tč')

>>> n.to_text()
'xn--ee\\032t-jua.'

>>> print n.to_text()
xn--ee\032t-jua.

>>> n.to_unicode()
---------------------------------------------------------------------------
UnicodeError                              Traceback (most recent call last)
<ipython-input-66-38cdd9f34ab1> in <module>()
----> 1 n.to_unicode()

/usr/lib/python2.7/site-packages/dns/name.pyc in to_unicode(self, omit_final_dot)
    346         else:
    347             l = self.labels
--> 348         s = u'.'.join([encodings.idna.ToUnicode(_escapify(x)) for x in l])
    349         return s
    350 

/usr/lib64/python2.7/encodings/idna.pyc in ToUnicode(label)
    137     # label2 will already be in lower case.
    138     if label.lower() != label2:
--> 139         raise UnicodeError("IDNA does not round-trip", label, label2)
    140 
    141     # Step 8: return the result of step 5

UnicodeError: ('IDNA does not round-trip', 'xn--ee\\032t-jua', 'xn--ee\\032t-u1a')

But when I use the standard library, it works:

>>> a = encodings.idna.ToASCII(u'ee tč')

>>> a
'xn--ee t-jua'

>>> u = encodings.idna.ToUnicode(a)

>>> u
u'ee t\u010d'

>>> print u
ee tč

When I substitute the space with \032, it works with the encodings module, but it gives me a different punycoded value:

>>> a = encodings.idna.ToASCII(u'ee\\032tč')

>>> a
'xn--ee\\032t-p6a'

>>> u = encodings.idna.ToUnicode(a)

>>> u
u'ee\\032t\u010d'

>>> print u
ee\032tč
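For readers following along, here is a minimal sketch of the failure using only the standard library. It assumes Python 3 (where `encodings.idna.ToASCII` returns bytes) rather than the 2.7 session above, and the helper name `round_trips` is made up for illustration. Punycode records where each non-ASCII character goes relative to the ASCII characters in the label, so rewriting the space as `\032` after encoding changes what the suffix `jua` decodes to, and the round-trip check fails.

```python
import encodings.idna


def round_trips(ace_label):
    """Return True if encodings.idna.ToUnicode() accepts the label."""
    try:
        encodings.idna.ToUnicode(ace_label)
        return True
    except UnicodeError:
        return False


# Python 3 returns bytes from ToASCII, so decode for display.
plain = encodings.idna.ToASCII("ee tč").decode("ascii")

# Mimic escaping the space as a DNS decimal escape *after* encoding,
# which is roughly what happens when an escaped label reaches ToUnicode().
escaped = plain.replace(" ", "\\032")

print(plain, round_trips(plain))
print(escaped, round_trips(escaped))
```

The unescaped label decodes cleanly, while the escaped one is rejected, which matches the traceback above.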

rthalley · 2013-12-02T02:05:20Z

On Nov 29, 2013, at 6:33, bastiak notifications@github.com wrote:

--> 348 s = u'.'.join([encodings.idna.ToUnicode(_escapify(x)) for x in l])

I think we need to call something like _escapify after calling encodings.idna.ToUnicode(), but I'm not quite sure what that something is. Clearly "." (and perhaps Unicode equivalents to ".") needs to be escaped, and escaping space seems sensible, but on the other hand just calling _escapify() is bad because it ends up escaping Unicode code points > 127.

I don't know enough about Unicode to know if other whitespace-like things need escaping, if there are any.

I will ponder this further, but anyone who knows Unicode well is encouraged to help :)

/Bob

ghost · 2014-02-06T17:19:50Z

Sorry for the late answer, I was busy.

In my opinion, the value passed to encodings.idna.ToUnicode(value) should be a "de-escapified" string.
Likewise, the value2 passed to encodings.idna.ToASCII(value2) should be pure Unicode without escaping.
The Python library handles special cases and will throw an error if one occurs.

I'm not sure, but in my opinion the dns.name.to_unicode() method should provide only the human-readable form of the domain name, without any escaping.

ghost · 2014-03-28T13:44:54Z

Hi,
do you accept my proposal?

dns.name.from_unicode() should remove any escaping and make a pure unicode string per label, then encode labels with encodings.idna.ToASCII() (labels are stored in punycoded notation without escaping)

dns.name.Name.to_unicode() should convert labels with encodings.idna.ToUnicode() and then escape characters per label

dns.name.Name.to_text() should only escape some required characters per label

I haven't inspected to_wire() and from_wire() yet.

I would like to make a patch.
Bastiak
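The first step of the proposal (remove escaping, then encode) could be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: it targets Python 3, works on a single label, and `unescape_label` is a hypothetical helper that understands only the `\DDD` decimal and `\X` literal escape forms.

```python
import encodings.idna

ASCII_DIGITS = "0123456789"


def unescape_label(text):
    """Undo DNS-style escapes in one label (hypothetical helper)."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling backslash at end of label")
        digits = text[i + 1:i + 4]
        if len(digits) == 3 and all(d in ASCII_DIGITS for d in digits):
            # \DDD: a decimal value in the range 0..255
            value = int(digits)
            if value > 255:
                raise ValueError("decimal escape out of range")
            out.append(chr(value))
            i += 4
        else:
            # \X: the next character taken literally
            out.append(text[i + 1])
            i += 2
    return "".join(out)


# Unescape first, then hand pure Unicode to ToASCII().
label = unescape_label("ee\\032tč")
encoded = encodings.idna.ToASCII(label)
print(repr(label), encoded)
```

With the escape removed before encoding, the stored label is the same one the standard library produces for the unescaped input.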

rthalley · 2014-03-28T17:27:08Z

On Mar 28, 2014, at 6:44, bastiak notifications@github.com wrote:

Hi,
do you accept my proposal?

dns.name.from_unicode() should remove any escaping and make a pure unicode string per label, then encode labels with encodings.idna.ToASCII() (labels are stored in punycoded notation without escaping)

Yes

dns.name.Name.to_unicode() should convert labels with encodings.idna.ToUnicode() and then escape characters per label

This is probably the right thing, but the escaping is different. Probably only the "special characters" in _escaped should be escaped, and perhaps values < 0x20, but values >= 0x80 need to be left alone.

dns.name.Name.to_text() should only escape some required characters per label

dns.name.Name.to_text() should not be changed at all.

I haven't inspected to_wire() and from_wire() yet.

These should not be changed either.

/Bob
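A possible shape for the Unicode-side escaping described above, as a sketch only. The function name `escape_unicode_label` is hypothetical, the `SPECIAL` set is an assumption about what `_escaped` contains, and escaping space (0x20) and DEL (0x7F) along with the control characters is a choice made here, not something settled in this thread.

```python
# Assumed contents of dnspython's _escaped set; check dns/name.py to confirm.
SPECIAL = set('"().;\\@$')


def escape_unicode_label(label):
    """Escape one decoded label for display (hypothetical helper)."""
    out = []
    for ch in label:
        code = ord(ch)
        if ch in SPECIAL:
            # Special characters get a backslash prefix.
            out.append("\\" + ch)
        elif code <= 0x20 or code == 0x7F:
            # Control characters and space become \DDD decimal escapes.
            out.append("\\%03d" % code)
        else:
            # Everything else, including code points >= 0x80, is left alone.
            out.append(ch)
    return "".join(out)


print(escape_unicode_label("ee tč"))
print(escape_unicode_label("a.b"))
```

Applied after `encodings.idna.ToUnicode()`, this keeps `č` readable while still marking the space and any literal dot inside a label.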

underrun · 2014-04-17T20:17:24Z

but spaces are not allowed in domain names ...

ghost mentioned this issue Apr 15, 2014

IDNA encoding/decoding issue fix #67

Merged

rthalley closed this as completed Jun 21, 2014
