HTML Entities and Numeric character references in URL #5

Closed
stav opened this issue Mar 29, 2013 · 3 comments

@stav
Contributor

stav commented Mar 29, 2013

URLs on some sites contain valid "safe" characters used in an invalid way, and the standard Python library is unable to deal with this, so it might be nice if w3lib could. For example, the hash character # normally marks the beginning of the fragment, but a URL may also contain Numeric Character References (NCRs) such as &#174;, which carry a hash of their own.

w3lib.url.safe_url_string() uses urllib.quote() with the following safe chars:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-%;/?:@&=+$|,#-_.!~*'()

and the following (invalid) url does not get altered:

>>> url = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"
>>> assert url == safe_url_string(url)
>>> 

urlparse.urldefrag() is confused:

>>> urlparse.urldefrag(url)
('/Pioneer_Speakers_with_iPod&reg;~iPhone&', '174;_Dock?id=123#ipad')
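
The same split can be reproduced on Python 3, where `urlparse` became `urllib.parse`. This is only a sketch. The URL literal assumes the link contained the references `&reg;` and `&#174;`, as the percent-encoded output later in this report suggests.

```python
from urllib.parse import urldefrag

# Assumed raw link text, with the character references still unconverted.
url = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"

# urldefrag() cuts at the first '#', which here belongs to the NCR "&#174;".
base, fragment = urldefrag(url)
print(base)      # /Pioneer_Speakers_with_iPod&reg;~iPhone&
print(fragment)  # 174;_Dock?id=123#ipad
```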

Since safe_url_string() is used in SgmlLinkExtractor, for example with canonicalization turned on, the fragment is misinterpreted: the first hash triggers the slice and the URL is cut down to:

/Pioneer_Speakers_with_iPod&reg;~iPhone&

Using urllib.quote() directly does not work since it encodes all hashes, including the fragment hash:

>>> print urllib.quote(url)
/Pioneer_Speakers_with_iPod%26reg%3B%7EiPhone%26%23174%3B_Dock%3Fid%3D123%23ipad
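
A rough Python 3 comparison of the two behaviours described so far. Here `urllib.parse.quote()` with the safe set listed above stands in for `safe_url_string()`; that is an assumption for illustration, not the function's actual code.

```python
from urllib.parse import quote

# Safe set copied from the list earlier in this report.
SAFE_CHARS = ("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
              "0123456789_.-%;/?:@&=+$|,#-_.!~*'()")

# Assumed raw link text, with the character references still unconverted.
url = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"

# Every character of the URL is in the safe set, so nothing is escaped.
unchanged = quote(url, safe=SAFE_CHARS)

# With the default safe set ('/'), every '#' is escaped, including the
# one that really starts the fragment.
over_quoted = quote(url)
print(over_quoted)
```

Python 3.7 and later treat `~` as unreserved, so the printed string keeps `~` where the Python 2 output above shows `%7E`.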

What is needed is perhaps an Entity/NCR regex that first converts the references and then does the safe encoding, so that in the end we get:

/Pioneer_Speakers_with_iPod%26reg%3B~iPhone%26%23174%3B_Dock?id=123#ipad
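
A minimal Python 3 sketch of that idea, not w3lib code. The names `REFERENCE_RE` and `quote_references` are hypothetical, the pattern is a heuristic guess at what counts as a reference, and `urllib.parse.quote()` with the safe set above stands in for the safe encoding step.

```python
import re
from urllib.parse import quote

# Safe set copied from the list earlier in this report.
SAFE_CHARS = ("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
              "0123456789_.-%;/?:@&=+$|,#-_.!~*'()")

# Heuristic (assumption): anything shaped like "&name;", "&#123;" or
# "&#x7B;" is treated as a character reference.
REFERENCE_RE = re.compile(
    r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);"
)


def quote_references(url):
    """Percent-encode entity/NCR references, then apply the safe encoding."""
    # Pass 1: escape every character of each reference, '#' included.
    escaped = REFERENCE_RE.sub(lambda m: quote(m.group(0), safe=""), url)
    # Pass 2: '%' is in the safe set, so pass 1 is not double-encoded.
    return quote(escaped, safe=SAFE_CHARS)


url = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"
print(quote_references(url))
# /Pioneer_Speakers_with_iPod%26reg%3B~iPhone%26%23174%3B_Dock?id=123#ipad
```

The pattern cannot know whether a match was really meant as a reference, so a URL whose fragment happens to look like one would be altered as well.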
@kmike
Member

kmike commented Oct 10, 2013

I think that NCRs should be converted earlier, by the HTML or XML parser, and that they shouldn't be handled by w3lib.url.safe_url_string, because

  • there is no way to tell whether the # in &#174; belongs to an NCR or starts a fragment,
  • safe_url_string implements RFC 3986, and escaping '#' would violate that RFC.

A similar bug was closed as invalid in Firefox: https://bugzilla.mozilla.org/show_bug.cgi?id=353719
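
The parser-first order can be sketched on Python 3 as follows. This is an illustration under assumptions: `html.unescape()` stands in for whatever HTML or XML parser extracts the link, and `urllib.parse.quote()` with the safe set from the report stands in for `safe_url_string()`.

```python
from html import unescape
from urllib.parse import quote, urldefrag

# Safe set copied from the original report.
SAFE_CHARS = ("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
              "0123456789_.-%;/?:@&=+$|,#-_.!~*'()")

# Assumed raw attribute value as it appears in the page source.
href = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"

# Step 1: resolve character references, as a parser would for an attribute.
url = unescape(href)

# Step 2: only now treat the string as a URL. The single remaining '#'
# is the real fragment delimiter.
base, fragment = urldefrag(url)
print(fragment)  # ipad

# The registered sign is non-ASCII, so it is percent-encoded as UTF-8.
print(quote(url, safe=SAFE_CHARS))
# /Pioneer_Speakers_with_iPod%C2%AE~iPhone%C2%AE_Dock?id=123#ipad
```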

@nramirezuy

@stav I agree with @kmike: the HTML or XML parser should handle all the NCRs.

@stav
Contributor Author

stav commented Oct 10, 2013

Yes, ok, I agree as well.
