Fix tests on non-ASCII characters in URL + new safe_url_string() #45
def safe_url_string(url, encoding='utf8', path_encoding='utf8'):
    """Convert the given url into a legal URL by escaping unsafe characters
    according to RFC-3986.
I think the docstring below should also be updated: the unicode url is not converted to str using given encoding. I'd suggest something like (following the summary from the PR):
If a bytes url is given, it is first converted to str using the given
encoding (which defaults to 'utf-8'). 'utf-8' encoding is used for
URL path component (unless overridden by path_encoding), and given
encoding is used for query string or form data.
When passing an encoding, you should use the encoding of the
original page (the page from which the url was extracted).
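To make the first sentence of the suggested docstring concrete, here is a minimal sketch of the bytes-to-str step. The helper name `to_text` is hypothetical (it is not w3lib's `to_unicode()`), and strict decoding is an assumption made for illustration only.

```python
def to_text(url, encoding="utf-8"):
    """Hypothetical helper: return `url` as str, decoding bytes with `encoding`."""
    if isinstance(url, bytes):
        # Strict decoding is an assumption of this sketch.
        return url.decode(encoding)
    return url


# A URL extracted from a page that was served as latin-1.
raw = "http://example.com/?q=café".encode("latin-1")

# Decoding with the page's own encoding recovers the original text.
print(to_text(raw, "latin-1"))

# Decoding with the wrong encoding fails: b"\xe9" alone is not valid UTF-8.
try:
    to_text(raw, "utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode as utf-8:", exc.reason)
```

This is why the docstring recommends passing the encoding of the page the URL came from.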
@lopuhin, indeed. Very nice docstring, thanks!
@redapple I've checked all the tests, apart from the old docstring everything looks good!
@lopuhin, correct. I've omitted unicode domains tests.
@redapple agreed, I think it'll just mean adding more tests, all current tests will stay the same
                                    urldefrag, urlencode, urlparse,
                                    quote, parse_qs, parse_qsl)
from six.moves.urllib.request import pathname2url, url2pathname
from w3lib.util import to_bytes, to_native_str, to_unicode
I wonder if it is time to add __all__ to w3lib.url; see https://github.com/redapple/w3lib/commit/7d83b092f4054ba439a115ec041e9b190385b155 and https://github.com/scrapy/scrapy/blob/ebef6d7c6dd8922210db8a4a44f48fe27ee0cd16/scrapy/utils/url.py#L16
Was deprecated since v1.1: https://github.com/scrapy/w3lib/blob/v1.1/w3lib/url.py
Note:
@lopuhin, I've added support for IDNs
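For readers unfamiliar with IDNs, the sketch below shows one way a non-ASCII host name can be turned into its ASCII (punycode) form using Python's built-in `idna` codec. The function name `sketch_idna_url` is hypothetical, and whether the PR uses this exact codec is an assumption. The sketch also assumes the netloc has no userinfo or port.

```python
from urllib.parse import urlsplit, urlunsplit


def sketch_idna_url(url):
    """Hypothetical helper: encode the host of `url` with the stdlib idna codec."""
    parts = urlsplit(url)
    # Assumes netloc is a bare host name (no "user@" prefix, no ":port" suffix).
    ascii_host = parts.netloc.encode("idna").decode("ascii")
    return urlunsplit(
        (parts.scheme, ascii_host, parts.path, parts.query, parts.fragment)
    )


print(sketch_idna_url("http://müller.example/path"))
# http://xn--mller-kva.example/path
```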
                       b'0123456789' b'_.-')

def urljoin_rfc(base, ref, encoding='utf-8'):
So this backfired on Scrapinghub's testing bots.
@kmike, do you think we should bring it back, announcing it will be removed in, say, 1.16?
I'm fine with bringing it back, there is no need for aggressive removal of deprecated code.
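A gentle way to bring the function back is to keep it as a thin wrapper that emits a `DeprecationWarning`. The sketch below is only an illustration: the warning text, the str return type, and the delegation to `urllib.parse.urljoin` are assumptions, not the code that was restored in w3lib.

```python
import warnings
from urllib.parse import urljoin


def urljoin_rfc(base, ref, encoding="utf-8"):
    """Illustrative deprecated wrapper (assumed behaviour, not w3lib's code)."""
    warnings.warn(
        "urljoin_rfc is deprecated, use urllib.parse.urljoin instead",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this wrapper
    )
    # Accept bytes or str; decode bytes with the given encoding.
    if isinstance(base, bytes):
        base = base.decode(encoding)
    if isinstance(ref, bytes):
        ref = ref.decode(encoding)
    return urljoin(base, ref)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(urljoin_rfc("http://example.com/a/b", "c"))  # http://example.com/a/c
    print(caught[0].category.__name__)  # DeprecationWarning
```

Existing callers keep working and see the warning, which gives them time to migrate before the announced removal.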
Fixes scrapyGH-2321
The idea is to not guess the encoding of "Location" header value and simply percent-encode non-ASCII bytes, which should then be re-interpreted correctly by the remote website in whatever encoding was used originally.
See https://tools.ietf.org/html/rfc3987#section-3.2
This is similar to the changes to safe_url_string in scrapy/w3lib#45

#2322)
* Do not interpret non-ASCII bytes in "Location" and percent-encode them
  Fixes GH-2321
  The idea is to not guess the encoding of "Location" header value and simply percent-encode non-ASCII bytes, which should then be re-interpreted correctly by the remote website in whatever encoding was used originally.
  See https://tools.ietf.org/html/rfc3987#section-3.2
  This is similar to the changes to safe_url_string in scrapy/w3lib#45
* Remove unused import
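The idea in that commit message, percent-encoding raw non-ASCII bytes without ever decoding them, can be sketched with the standard library alone. The function name `sketch_escape_location` and the exact set of characters left untouched are assumptions for illustration. This is not Scrapy's implementation.

```python
from urllib.parse import quote

# Reserved characters and "%" are left alone so that URL structure and
# existing escapes survive. The exact set is an assumption of this sketch.
SAFE_CHARS = "!#$%&'()*+,/:;=?@[]~"


def sketch_escape_location(value: bytes) -> str:
    """Hypothetical helper: percent-encode bytes of a Location header value."""
    # quote() accepts bytes and escapes each byte outside the safe set,
    # so no guess about the original text encoding is needed.
    return quote(value, safe=SAFE_CHARS)


# One latin-1 byte in the path, one UTF-8 sequence in the query.
print(sketch_escape_location(b"http://example.com/caf\xe9?x=\xc3\xa9"))
# http://example.com/caf%E9?x=%C3%A9
```

Because the bytes are escaped one by one, the remote server receives exactly the byte sequence it sent, whatever encoding it had in mind.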
This is a continuation of #44, related to scrapy/scrapy#1783.
The rule I followed here is to always use UTF-8 encoding for the URL path component, and the encoding argument for query string or form data (the use of UTF-8 can be overridden if needed using the new path_encoding argument to safe_url_string()).
This follows what I understand from http://tools.ietf.org/html/rfc3987#page-10 and what browser tests (Chrome and Firefox) exhibit.
This PR changes safe_url_string() in an incompatible way, but I believe it's more correct this way.
It also includes verbatim copies of Scrapy's to_unicode(), to_bytes() and to_native_str().
Note: This PR also removes urljoin_rfc, which has been deprecated since v1.1.
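As a rough illustration of the path/query rule described above, the following standard-library sketch escapes the path with `path_encoding` and the query with `encoding`. The name `sketch_safe_url` and the set of safe characters are assumptions. The real `safe_url_string()` in this PR may differ in details such as error handling and host processing.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left unescaped; an assumption of this sketch, not w3lib's exact set.
SAFE = "!$&'()*+,/:;=?@%"


def sketch_safe_url(url, encoding="utf-8", path_encoding="utf-8"):
    """Hypothetical sketch: UTF-8 for the path, `encoding` for the query."""
    if isinstance(url, bytes):
        url = url.decode(encoding)
    parts = urlsplit(url)
    path = quote(parts.path, safe=SAFE, encoding=path_encoding)
    query = quote(parts.query, safe=SAFE, encoding=encoding)
    # Scheme, host and fragment are passed through unchanged in this sketch.
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


# URL taken from a page served as latin-1: the same "é" is escaped differently
# in the path (UTF-8) and in the query (page encoding).
print(sketch_safe_url("http://example.com/café?q=café", encoding="latin-1"))
# http://example.com/caf%C3%A9?q=caf%E9
```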