
Conversation

@redapple (Contributor)

This is a continuation of #44, related to scrapy/scrapy#1783.

The rule I followed here is to always use UTF-8 encoding for the URL path component, and the page encoding for the query string or form data. (The use of UTF-8 can be overridden if needed using the new path_encoding argument to safe_url_string().)

This follows my reading of http://tools.ietf.org/html/rfc3987#page-10 and matches what browser tests (Chrome and Firefox) show.
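That path-vs-query rule can be sketched with the stdlib alone. This is a rough illustration, not w3lib's implementation; encode_url and the example URL are made up for the demo:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_url(url, page_encoding):
    # Percent-encode the path as UTF-8, but the query string using the
    # encoding of the page the URL was extracted from.
    parts = urlsplit(url)
    path = quote(parts.path.encode('utf-8'), safe='/%')
    query = quote(parts.query.encode(page_encoding), safe='=&%')
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

# Path is always UTF-8, query follows the page encoding:
print(encode_url('http://example.com/café?q=café', 'latin-1'))
# http://example.com/caf%C3%A9?q=caf%E9
```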

This PR changes safe_url_string() in an incompatible way, but I believe it's more correct this way.

It also includes verbatim copies of Scrapy's to_unicode(), to_bytes() and to_native_str().

Note: this PR also removes urljoin_rfc, which has been deprecated since v1.1.

@codecov-io

Current coverage is 89.58%

Merging #45 into master will decrease coverage by 1.38% as of 123d4b2

def safe_url_string(url, encoding='utf8', path_encoding='utf8'):
    """Convert the given url into a legal URL by escaping unsafe characters
    according to RFC-3986.
@lopuhin (Member)

I think the docstring below should also be updated: the unicode url is not converted to str using the given encoding. I'd suggest something like this (following the summary from the PR):

If a bytes url is given, it is first converted to str using the given
encoding (which defaults to 'utf-8'). 'utf-8' encoding is used for the
URL path component (unless overridden by path_encoding), and the given
encoding is used for the query string or form data.
When passing an encoding, you should use the encoding of the
original page (the page from which the url was extracted).

@redapple (Contributor, PR author)

@lopuhin, indeed. Very nice docstring, thanks!

@lopuhin (Member)

lopuhin commented Mar 23, 2016

@redapple I've checked all the tests; apart from the old docstring, everything looks good!
I think the only remaining difference from browser behavior is the handling of unicode domains, but that is a separate issue, right?

@redapple (Contributor, PR author)

@lopuhin, correct. I've omitted unicode domain tests.
I believe support for them can be built on top (I hope it would not break too much of this PR).

@lopuhin (Member)

lopuhin commented Mar 23, 2016

@redapple agreed. I think it'll just mean adding more tests; all current tests will stay the same.

urldefrag, urlencode, urlparse,
quote, parse_qs, parse_qsl)
from six.moves.urllib.request import pathname2url, url2pathname
from w3lib.util import to_bytes, to_native_str, to_unicode

@redapple (Contributor, PR author)

Note: str_to_unicode and unicode_to_str can be removed (but #46 also needs to happen)

@redapple (Contributor, PR author)

@lopuhin, I've added support for IDNs.
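A minimal illustration of what IDN support means, using the stdlib 'idna' codec. This is a sketch, not the code added in this PR; ascii_netloc is a made-up name:

```python
from urllib.parse import urlsplit, urlunsplit

def ascii_netloc(url):
    # Encode only the host part to its ASCII-compatible (punycode) form,
    # leaving path and query untouched.
    parts = urlsplit(url)
    host = parts.netloc.encode('idna').decode('ascii')
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

print(ascii_netloc('http://münchen.de/path'))
# http://xn--mnchen-3ya.de/path
```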

b'0123456789' b'_.-')


def urljoin_rfc(base, ref, encoding='utf-8'):
@redapple (Contributor, PR author)

So this backfired on Scrapinghub's testing bots.
@kmike, do you think we should bring it back, announcing that it will be removed in, say, 1.16?

@kmike (Member)

I'm fine with bringing it back; there is no need for aggressive removal of deprecated code.
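For anyone migrating off the deprecated helper: in the plain-text case, the stdlib urljoin performs the same RFC 3986 relative-reference resolution (a sketch only; urljoin_rfc also accepted bytes input and an encoding argument, which this does not cover):

```python
from urllib.parse import urljoin

# Resolve a relative reference against a base URL, per RFC 3986:
print(urljoin('http://example.com/a/b.html', '../c.html'))
# http://example.com/c.html
```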

@redapple redapple added this to the v1.14 milestone Apr 4, 2016
redapple added a commit to redapple/scrapy that referenced this pull request Oct 12, 2016
Fixes scrapyGH-2321

The idea is to not guess the encoding of the "Location" header value
and simply percent-encode its non-ASCII bytes,
which should then be re-interpreted correctly by the remote website
in whatever encoding was used originally.

See https://tools.ietf.org/html/rfc3987#section-3.2

This is similar to the changes to safe_url_string in
scrapy/w3lib#45
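The idea in that commit can be sketched with the stdlib quote, which passes ASCII and existing percent-escapes through and encodes each remaining byte. This is an illustration, not Scrapy's code; safe_location is a made-up name:

```python
from urllib.parse import quote

def safe_location(raw):
    # Percent-encode non-ASCII bytes of a raw Location value without
    # guessing their encoding; reserved ASCII characters pass through.
    return quote(raw, safe="/:?#[]@!$&'()*+,;=%~")

print(safe_location(b'/caf\xe9?next=\xff'))
# /caf%E9?next=%FF
```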
dangra pushed a commit to scrapy/scrapy that referenced this pull request Oct 20, 2016
#2322)

* Do not interpret non-ASCII bytes in "Location" and percent-encode them

Fixes GH-2321


* Remove unused import
