
Conversation

@redapple (Contributor)

This is a continuation of #44, related to scrapy/scrapy#1783.

The rule I followed here is to always use UTF-8 encoding for the URL path component, and the page encoding for the query string or form data. (The use of UTF-8 can be overridden if needed using the new path_encoding argument to safe_url_string().)

This follows my reading of http://tools.ietf.org/html/rfc3987#page-10 and matches what browser tests (Chrome and Firefox) show.
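That path-vs-query rule can be sketched with the stdlib alone. This is a rough illustration, not w3lib's implementation; encode_url and the example URL are made up for the demo:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_url(url, page_encoding):
    # Percent-encode the path as UTF-8, but the query string using the
    # encoding of the page the URL was extracted from.
    parts = urlsplit(url)
    path = quote(parts.path.encode('utf-8'), safe='/%')
    query = quote(parts.query.encode(page_encoding), safe='=&%')
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

# Path is always UTF-8, query follows the page encoding:
print(encode_url('http://example.com/café?q=café', 'latin-1'))
# http://example.com/caf%C3%A9?q=caf%E9
```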

This PR changes safe_url_string() in an incompatible way, but I believe it's more correct this way.

It also includes verbatim copies of Scrapy's to_unicode(), to_bytes() and to_native_str().

Note: this PR also removes urljoin_rfc, which has been deprecated since v1.1.

@codecov-io

Current coverage is 89.58%

Merging #45 into master will decrease coverage by 1.38% as of 123d4b2

def safe_url_string(url, encoding='utf8', path_encoding='utf8'):
    """Convert the given url into a legal URL by escaping unsafe characters
    according to RFC-3986.
@lopuhin (Member)

I think the docstring below should also be updated: the unicode url is not converted to str using the given encoding. I'd suggest something like this (following the summary from the PR):

If a bytes url is given, it is first converted to str using the given
encoding (which defaults to 'utf-8'). 'utf-8' encoding is used for the
URL path component (unless overridden by path_encoding), and the given
encoding is used for the query string or form data.
When passing an encoding, you should use the encoding of the
original page (the page from which the url was extracted).

@redapple (Contributor, PR author)

@lopuhin, indeed. Very nice docstring, thanks!

@lopuhin (Member)

lopuhin commented Mar 23, 2016

@redapple I've checked all the tests; apart from the old docstring, everything looks good!
I think the only remaining difference from browser behavior is the handling of unicode domains, but that is a separate issue, right?

@redapple (Contributor, PR author)

@lopuhin, correct. I've omitted unicode domain tests.
I believe support for them can be built on top (I hope it would not break too much of this PR).

@lopuhin (Member)

lopuhin commented Mar 23, 2016

@redapple agreed. I think it'll just mean adding more tests; all current tests will stay the same.

urldefrag, urlencode, urlparse,
quote, parse_qs, parse_qsl)
from six.moves.urllib.request import pathname2url, url2pathname
from w3lib.util import to_bytes, to_native_str, to_unicode

@redapple (Contributor, PR author)

Note: str_to_unicode and unicode_to_str can be removed (but #46 also needs to happen)

@redapple (Contributor, PR author)

@lopuhin, I've added support for IDNs.
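A minimal illustration of what IDN support means, using the stdlib 'idna' codec. This is a sketch, not the code added in this PR; ascii_netloc is a made-up name:

```python
from urllib.parse import urlsplit, urlunsplit

def ascii_netloc(url):
    # Encode only the host part to its ASCII-compatible (punycode) form,
    # leaving path and query untouched.
    parts = urlsplit(url)
    host = parts.netloc.encode('idna').decode('ascii')
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

print(ascii_netloc('http://münchen.de/path'))
# http://xn--mnchen-3ya.de/path
```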

b'0123456789' b'_.-')


def urljoin_rfc(base, ref, encoding='utf-8'):
@redapple (Contributor, PR author)

So this backfired on Scrapinghub's testing bots.
@kmike, do you think we should bring it back, announcing that it will be removed in, say, 1.16?

@kmike (Member)

I'm fine with bringing it back; there is no need for aggressive removal of deprecated code.
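For anyone migrating off the deprecated helper: in the plain-text case, the stdlib urljoin performs the same RFC 3986 relative-reference resolution (a sketch only; urljoin_rfc also accepted bytes input and an encoding argument, which this does not cover):

```python
from urllib.parse import urljoin

# Resolve a relative reference against a base URL, per RFC 3986:
print(urljoin('http://example.com/a/b.html', '../c.html'))
# http://example.com/c.html
```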

@redapple redapple added this to the v1.14 milestone Apr 4, 2016
redapple added a commit to redapple/scrapy that referenced this pull request Oct 12, 2016
Fixes scrapyGH-2321

The idea is to not guess the encoding of the "Location" header value
and simply percent-encode its non-ASCII bytes,
which should then be re-interpreted correctly by the remote website
in whatever encoding was used originally.

See https://tools.ietf.org/html/rfc3987#section-3.2

This is similar to the changes to safe_url_string in
scrapy/w3lib#45
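The idea in that commit can be sketched with the stdlib quote, which passes ASCII and existing percent-escapes through and encodes each remaining byte. This is an illustration, not Scrapy's code; safe_location is a made-up name:

```python
from urllib.parse import quote

def safe_location(raw):
    # Percent-encode non-ASCII bytes of a raw Location value without
    # guessing their encoding; reserved ASCII characters pass through.
    return quote(raw, safe="/:?#[]@!$&'()*+,;=%~")

print(safe_location(b'/caf\xe9?next=\xff'))
# /caf%E9?next=%FF
```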
dangra pushed a commit to scrapy/scrapy that referenced this pull request Oct 20, 2016
#2322)

* Do not interpret non-ASCII bytes in "Location" and percent-encode them

Fixes GH-2321


* Remove unused import
