HTML Entities and Numeric character references in URL #5

stav · 2013-03-29T18:28:52Z

URLs on some sites use valid "safe" characters in an invalid way, and the standard Python library is unable to deal with this, so it might be nice if w3lib could. For example, the hash character # normally marks the beginning of the fragment, but a URL may also contain Numeric Character References (NCRs) such as &#174; (®), where the # is part of the reference rather than a fragment delimiter.

w3lib.url.safe_url_string() uses urllib.quote() with the following safe chars:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-%;/?:@&=+$|,#-_.!~*'()

and the following (invalid) url does not get altered:

>>> url = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"
>>> assert url == safe_url_string(url)
>>>

urlparse.urldefrag() is confused:

>>> urlparse.urldefrag(url)
('/Pioneer_Speakers_with_iPod&reg;~iPhone&', '174;_Dock?id=123#ipad')

Since safe_url_string() is used in SgmlLinkExtractor, the fragment is misinterpreted when, for example, canonicalization is turned on: the first hash triggers the slice, so the URL is cut down to:

/Pioneer_Speakers_with_iPod&reg;~iPhone&

Using urllib.quote() directly does not work since it encodes all hashes, including the fragment hash:

>>> print urllib.quote(url)
/Pioneer_Speakers_with_iPod%26reg%3B%7EiPhone%26%23174%3B_Dock%3Fid%3D123%23ipad

What is needed is perhaps an entity/NCR regex that first converts the references and then does the safe encoding, so that in the end we get:

/Pioneer_Speakers_with_iPod%26reg%3B~iPhone%26%23174%3B_Dock?id=123#ipad
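A minimal sketch of that idea, written for Python 3 (urllib.parse.quote instead of the Python 2 urllib.quote used above). The function name escape_references and the regex are hypothetical and not part of w3lib; the safe set is the punctuation part of the one quoted earlier.

```python
import re
from urllib.parse import quote

# Hypothetical pattern: named entities (&reg;) and decimal or hex NCRs
# (&#174;, &#xAE;). It only matches references terminated by ';'.
REFERENCE_RE = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

# Punctuation from the safe set listed above; letters, digits and "_.-~"
# are never quoted by urllib.parse.quote, so they are omitted here.
SAFE_CHARS = "%;/?:@&=+$|,#-_.!~*'()"


def escape_references(url):
    """Percent-encode each entity/NCR, then apply the usual safe quoting."""
    # safe="" forces '&', '#' and ';' inside the reference to be encoded.
    protected = REFERENCE_RE.sub(lambda m: quote(m.group(0), safe=""), url)
    # '%' is in the safe set, so the escapes produced above are not re-encoded.
    return quote(protected, safe=SAFE_CHARS)


url = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"
print(escape_references(url))
```

With the references escaped first, the only remaining # is the real fragment delimiter, so splitting off the fragment afterwards yields ipad.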


kmike · 2013-10-10T18:54:25Z

I think that NCRs should be converted earlier, by the HTML or XML parser, and that they shouldn't be handled by w3lib.url.safe_url_string because:

- there is no way to tell whether the # in &#174; belongs to an NCR or starts a fragment;
- safe_url_string implements RFC 3986, and escaping '#' would be a violation of this RFC.

A similar bug was closed as invalid in Firefox: https://bugzilla.mozilla.org/show_bug.cgi?id=353719
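To illustrate the "convert earlier" approach, here is a rough Python 3 sketch. It assumes the URL came from an HTML attribute whose references were never decoded, and uses html.unescape as a stand-in for what an HTML parser would do; url_from_attribute is a hypothetical helper, not a w3lib function.

```python
import html
from urllib.parse import quote, urldefrag

# Punctuation from the safe set quoted in the issue description.
SAFE_CHARS = "%;/?:@&=+$|,#-_.!~*'()"


def url_from_attribute(raw_value):
    """Decode entities/NCRs first, then percent-encode what is left."""
    # After this step the '#' of '&#174;' is gone; only the fragment '#' remains.
    decoded = html.unescape(raw_value)
    # Non-ASCII characters such as the registered sign are encoded as UTF-8.
    return quote(decoded, safe=SAFE_CHARS)


raw = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"
url = url_from_attribute(raw)
print(url)
print(urldefrag(url))
```

Note that html.unescape also decodes some references that lack a trailing semicolon, so a query parameter named like an entity could be altered; a real HTML parser applies extra rules for attribute values, which is another reason to leave this step to the parser.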

nramirezuy · 2013-10-10T19:13:33Z

@stav I agree with @kmike: the HTML or XML parser should handle all the NCRs.

stav · 2013-10-10T19:23:13Z

Yes, ok, I agree as well.

stav closed this as completed Oct 10, 2013

jvanasco mentioned this issue May 19, 2017

canonicalize_url use of safe_url_string breaks when an encoded hash character is encountered #91

Closed
