IDNA encoding/decoding issue #46

Closed
ghost opened this issue Nov 29, 2013 · 5 comments

Comments


ghost commented Nov 29, 2013

Hello,

I have trouble decoding a punycoded IDN domain back to Unicode. Here is an example (Python 2.7):

>>> n = dns.name.from_unicode(u'ee tč')

>>> n.to_text()
'xn--ee\\032t-jua.'

>>> print n.to_text()
xn--ee\032t-jua.

>>> n.to_unicode()
---------------------------------------------------------------------------
UnicodeError                              Traceback (most recent call last)
<ipython-input-66-38cdd9f34ab1> in <module>()
----> 1 n.to_unicode()

/usr/lib/python2.7/site-packages/dns/name.pyc in to_unicode(self, omit_final_dot)
    346         else:
    347             l = self.labels
--> 348         s = u'.'.join([encodings.idna.ToUnicode(_escapify(x)) for x in l])
    349         return s
    350 

/usr/lib64/python2.7/encodings/idna.pyc in ToUnicode(label)
    137     # label2 will already be in lower case.
    138     if label.lower() != label2:
--> 139         raise UnicodeError("IDNA does not round-trip", label, label2)
    140 
    141     # Step 8: return the result of step 5

UnicodeError: ('IDNA does not round-trip', 'xn--ee\\032t-jua', 'xn--ee\\032t-u1a')
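
The traceback shows the label passing through `_escapify` before it reaches `ToUnicode`, so the decoder sees the four characters `\032` where the stored label has a space. A minimal sketch that reproduces the mismatch with the standard library alone, assuming Python 3 (where `ToASCII` returns bytes) and reusing the escaped label from the traceback:

```python
import encodings.idna

# Label as produced by ToASCII: the space stays a literal basic character.
stored = encodings.idna.ToASCII(u'ee t\u010d')

# Decoding the stored label directly round-trips.
decoded = encodings.idna.ToUnicode(stored)

# The same label after the space has been escaped as \032,
# which is the value the traceback shows being decoded.
escaped = 'xn--ee\\032t-jua'

try:
    encodings.idna.ToUnicode(escaped)
    escaped_round_trips = True
except UnicodeError:
    # The suffix "jua" was computed for the 4-character prefix "ee t",
    # so it no longer matches the 7-character prefix "ee\032t".
    escaped_round_trips = False

print(stored, repr(decoded), escaped_round_trips)
```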

But when I use the standard library, it works:

>>> a = encodings.idna.ToASCII(u'ee tč')

>>> a
'xn--ee t-jua'

>>> u = encodings.idna.ToUnicode(a)

>>> u
u'ee t\u010d'

>>> print u
ee tč

When I substitute the space with \032, it works with the encodings module, but it gives me a different punycoded value:

>>> a = encodings.idna.ToASCII(u'ee\\032tč')

>>> a
'xn--ee\\032t-p6a'

>>> u = encodings.idna.ToUnicode(a)

>>> u
u'ee\\032t\u010d'

>>> print u
ee\032tč
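
The different suffix is expected. Punycode copies ASCII characters through unchanged and encodes where each non-ASCII character is inserted relative to them, so replacing one space with the four characters `\032` shifts the position of `č`. A small sketch using Python's built-in `punycode` codec (Python 3 assumed) puts the two encodings side by side:

```python
# Punycode keeps the ASCII characters and encodes where č is inserted.
with_space = u'ee t\u010d'.encode('punycode')

# Here the backslash and digits are ordinary characters, not an escape,
# so the ASCII prefix is three characters longer and the suffix changes.
with_escape = u'ee\\032t\u010d'.encode('punycode')

print(with_space)
print(with_escape)
```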
Owner

rthalley commented Dec 2, 2013

On Nov 29, 2013, at 6:33, bastiak notifications@github.com wrote:

> --> 348 s = u'.'.join([encodings.idna.ToUnicode(_escapify(x)) for x in l])

I think we need to call something like _escapify after calling encodings.idna.ToUnicode(), but I'm not quite sure what that something is. Clearly "." (and perhaps Unicode equivalents of ".") needs to be escaped, and escaping space seems sensible. On the other hand, just calling _escapify() is wrong because it ends up escaping Unicode code points > 127.

I don't know enough about Unicode to know if other whitespace-like things need escaping, if there are any.

I will ponder this further, but anyone who knows Unicode well is encouraged to help :)

/Bob

Author

ghost commented Feb 6, 2014

Sorry for the late answer, I was busy.

In my opinion, the value passed to encodings.idna.ToUnicode(value) should be a "de-escapified" string.
Likewise, the value2 passed to encodings.idna.ToASCII(value2) should be pure Unicode without escaping.
The Python library handles the special cases and raises an error when one occurs.

I'm not sure, but in my opinion the dns.name.to_unicode() method should provide only the human-readable form of the domain name, without any escaping.

Author

ghost commented Mar 28, 2014

Hi,
do you accept my proposal?

dns.name.from_unicode() should remove any escaping and make a pure unicode string per label, then encode labels with encodings.idna.ToASCII() (labels are stored in punycoded notation without escaping)

dns.name.Name.to_unicode() should convert labels with encodings.idna.ToUnicode() and then escape characters per label

dns.name.Name.to_text() should only escape some required characters per label

I haven't inspected to_wire() and from_wire() yet.

I would like to make a patch.
Bastiak
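
The from_unicode() half of this proposal could look roughly like the sketch below. It is not dnspython code: `unescape_label` and `label_from_unicode` are hypothetical names, Python 3 is assumed, and the input is assumed to be already split into labels, so an escaped `\.` is no longer treated as a separator.

```python
import encodings.idna

ASCII_DIGITS = '0123456789'


def unescape_label(label):
    """Hypothetical helper: turn \\DDD and \\X escapes into plain characters."""
    out = []
    i = 0
    while i < len(label):
        c = label[i]
        if c != '\\':
            out.append(c)
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(label):
            raise ValueError('dangling backslash')
        if label[i] in ASCII_DIGITS:
            digits = label[i:i + 3]
            if len(digits) != 3 or any(d not in ASCII_DIGITS for d in digits):
                raise ValueError('bad \\DDD escape')
            value = int(digits)
            if value > 255:
                raise ValueError('\\DDD escape out of range')
            out.append(chr(value))
            i += 3
        else:
            # \X stands for the character X itself.
            out.append(label[i])
            i += 1
    return ''.join(out)


def label_from_unicode(label):
    """Sketch of the proposed per-label step: unescape first, then ToASCII."""
    return encodings.idna.ToASCII(unescape_label(label))


print(label_from_unicode(u'ee\\032t\u010d'))
```

With this ordering the escaped and unescaped spellings of the label map to the same stored value, which is what lets the later ToUnicode() call round-trip.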

@rthalley
Owner

On Mar 28, 2014, at 6:44, bastiak notifications@github.com wrote:

> Hi,
> do you accept my proposal?
>
> dns.name.from_unicode() should remove any escaping and make a pure unicode string per label, then encode labels with encodings.idna.ToASCII() (labels are stored in punycoded notation without escaping)

Yes

> dns.name.Name.to_unicode() should convert labels with encodings.idna.ToUnicode() and then escape characters per label

This is probably the right thing, but the escaping is different. Probably only the "special characters" in _escaped should be escaped, and perhaps values < 0x20, but values >= 0x80 need to be left alone.

> dns.name.Name.to_text() should only escape some required characters per label

dns.name.Name.to_text() should not be changed at all.

> I haven't inspected to_wire() and from_wire() yet.

These should not be changed either.

/Bob
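
The escaping rule described for to_unicode() might be sketched as follows. This is an illustration under assumptions, not the library's code: `SPECIAL` stands in for the contents of `_escaped` (the exact set is a guess), `escape_unicode_label` and `label_to_unicode` are hypothetical names, space is escaped along with the control characters as suggested earlier in the thread, and Python 3 is assumed.

```python
import encodings.idna

# Assumption: stand-in for the "special characters" held in _escaped.
SPECIAL = set('"().;\\@$')


def escape_unicode_label(label):
    """Escape specials and control characters, leave code points >= 0x80 alone."""
    out = []
    for c in label:
        if c in SPECIAL:
            out.append('\\' + c)
        elif ord(c) <= 0x20:
            # Control characters and space become \DDD (decimal).
            out.append('\\%03d' % ord(c))
        else:
            # Everything else, including non-ASCII, is kept readable.
            out.append(c)
    return ''.join(out)


def label_to_unicode(stored):
    """Sketch of the per-label step: decode with ToUnicode first, then escape."""
    return escape_unicode_label(encodings.idna.ToUnicode(stored))


print(label_to_unicode(b'xn--ee t-jua'))
```

Escaping after decoding keeps the punycode round-trip check working on the stored label, while the returned text still marks the space unambiguously.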

@underrun

but spaces are not allowed in domain names ...
