
Incorrect normalization behaviour on character sequence '%e2%80%b3' #160

Closed
gh2k opened this issue May 13, 2014 · 1 comment

gh2k commented May 13, 2014

Specifically, this produces an incorrect result:

1.9.3-p392 :019 > u = Addressable::URI.parse('http://example.org/%e2%80%b3')
 => #<Addressable::URI:0xd005e8 URI:http://example.org/%e2%80%b3> 
1.9.3-p392 :020 > u.normalize!
 => #<Addressable::URI:0xd005e8 URI:http://example.org/%E2%80%B2%E2%80%B2> 

Note that the normalized URL no longer matches.

I think this is related to Addressable::IDNA.unicode_normalize_kc

Specifically:

1.9.3-p392 :013 > s = Addressable::URI.unencode('%e2%80%b3')
 => "″" 
1.9.3-p392 :014 > Addressable::IDNA.unicode_normalize_kc(s)
 => "′′" 

The output is now two characters (U+2032 twice), where previously it was one (U+2033).
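The same decomposition can be reproduced without Addressable. The following is a minimal sketch, assuming Ruby 2.2 or later (where `String#unicode_normalize` is part of the standard library, which was not the case on the 1.9.3 used above). It compares NFKC with NFC for the character in question; the variable names are only illustrative.

```ruby
# Assumes Ruby 2.2+, where String#unicode_normalize is available.
double_prime = "\u2033" # the character that %E2%80%B3 decodes to

# NFKC applies compatibility decompositions; NFC does not.
nfkc = double_prime.unicode_normalize(:nfkc)
nfc  = double_prime.unicode_normalize(:nfc)

# Format each code point as U+XXXX for easy comparison.
nfkc_codepoints = nfkc.codepoints.map { |c| format("U+%04X", c) }
nfc_codepoints  = nfc.codepoints.map { |c| format("U+%04X", c) }

puts "NFKC: #{nfkc_codepoints.join(' ')}"
puts "NFC:  #{nfc_codepoints.join(' ')}"
```

Under NFKC the single DOUBLE PRIME becomes two PRIME characters, which is why the path re-encodes as `%E2%80%B2%E2%80%B2`; under NFC it is left unchanged.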

@sporkmonger
Owner

This is not a bug. URIs, and particularly IRIs, use Unicode normalization form KC (NFKC) to eliminate visual ambiguities that could be exploited in phishing attacks. NFKC decomposes that codepoint into the two characters that Addressable is giving you. If this behavior is undesirable for your use case, you can instead normalize on a component-by-component basis.
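As a rough illustration of what component-by-component normalization could look like, here is a sketch that uses only Ruby's standard `uri` library rather than Addressable's own API. The helper `normalize_without_nfkc` is hypothetical and not part of any library. It assumes an absolute `http` URL, lowercases the scheme and host, and upper-cases the hex digits of percent-escapes in the path, without decoding the path or applying NFKC to it.

```ruby
require "uri"

# Hypothetical helper (not part of Addressable or the stdlib):
# normalizes scheme, host, and percent-escape case only.
# Assumes an absolute URL with a scheme and a host.
def normalize_without_nfkc(url)
  uri = URI.parse(url)
  uri.scheme = uri.scheme.downcase
  uri.host = uri.host.downcase
  # Upper-case the hex digits of each escape; the bytes are never decoded,
  # so no Unicode normalization is applied to the path.
  uri.path = uri.path.gsub(/%[0-9a-fA-F]{2}/) { |escape| escape.upcase }
  uri.to_s
end

normalized = normalize_without_nfkc("http://Example.ORG/%e2%80%b3")
puts normalized
```

The trade-off is the one described in the comment above: skipping NFKC keeps the path byte-for-byte equivalent to the input, but it also gives up the protection against visually ambiguous characters.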
