Fix get_base_url for non-ascii urls for Python 3 #44

lopuhin · 2016-02-29T08:00:34Z

I added tests by @redapple, and fixed get_base_url to do urljoin on unicode strings.

The line `baseurl = str_to_unicode(baseurl, encoding)` is required because get_base_url seems to be expected to work with byte URLs as well (see test_get_base_url in the same file). I am not sure which encoding to use here; I decided to be more permissive to avoid breaking existing Python 2 code that might be passing UTF-8 encoded strings.
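For readers following along, here is a minimal sketch of the idea described above: decode a byte base URL to text before joining, so that `urljoin` only ever sees unicode strings. The helper names `to_text` and `join_base` are hypothetical stand-ins, not the functions from this PR, and the UTF-8 default is an assumption.

```python
from urllib.parse import urljoin


def to_text(value, encoding="utf-8"):
    """Hypothetical stand-in for str_to_unicode: decode bytes, pass text through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def join_base(baseurl, href, encoding="utf-8"):
    # Normalise both arguments to text first: on Python 3, urljoin raises
    # TypeError when given a mix of bytes and str.
    return urljoin(to_text(baseurl, encoding), to_text(href, encoding))


print(join_base(b"http://example.org/dir/", "sterling\u00a3"))
```

The result is still a unicode string; percent-escaping is a separate step.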

Related to scrapy/scrapy#1783

codecov-io · 2016-02-29T08:02:14Z

Current coverage is `90.96%`

Merging #44 into master will not affect coverage as of 6a657d3

Powered by Codecov. Updated on successful CI builds.

redapple · 2016-03-16T17:23:55Z

@lopuhin ,
after testing a few cases with Chrome and Firefox, and also discussing with @kmike ,
I'm having second thoughts on the tests that expect non-UTF-8 encoded data prior to percent-escaping in URL paths.

What I'm seeing with modern browsers is that the encoding of the page (or of a <form> within the page) only affects the query part of URLs, not paths.
So I believe base URL (paths) should always be crafted with UTF-8 encoded characters, then percent-escaped.
Hence,
u'http://example.org/sterling\u00a3' should translate to b'http://example.org/sterling%C2%A3'
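The behaviour described above could be sketched roughly as follows: the path is always UTF-8 encoded and then percent-escaped, while only the query uses the page encoding. `escape_url` and its `page_encoding` parameter are hypothetical names for illustration, not an API from this PR, and the sketch ignores non-ASCII host names.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left unescaped: "/" and "%" (so existing escapes are not
# double-escaped), plus the RFC 3986 sub-delims, ":" and "@".
PATH_SAFE = "/%!$&'()*+,;=:@"
QUERY_SAFE = PATH_SAFE + "?"


def escape_url(url, page_encoding="utf-8"):
    """Percent-escape a text URL: UTF-8 for the path, page encoding for the query."""
    parts = urlsplit(url)
    path = quote(parts.path.encode("utf-8"), safe=PATH_SAFE)
    query = quote(parts.query.encode(page_encoding), safe=QUERY_SAFE)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


print(escape_url("http://example.org/sterling\u00a3"))
print(escape_url("http://example.org/sterling\u00a3?q=\u00a3", page_encoding="latin-1"))
```

With a Latin-1 page, the pound sign in the path still becomes `%C2%A3`, while the one in the query becomes `%A3`.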

dvdbng · 2016-03-17T05:53:51Z

If I'm reading the spec correctly, compliant web servers should interpret encoded and decoded paths the same way, but in practice not all of them do (e.g. this vs this).

redapple · 2016-03-17T09:38:30Z

@Youwotma , true, RFC 3986, §3.3 Path does allow "&" in path segments:

sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

This server seems to treat "&" as special within the path component, while it shouldn't:

URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
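As a small illustration of the point above, the sketch below escapes a path segment twice with Python's `urllib.parse.quote`: once treating the RFC 3986 sub-delims (including "&") as allowed, once escaping everything reserved. Both forms decode to the same data, which is what a compliant server would be expected to see. The segment value is made up for the example.

```python
from urllib.parse import quote, unquote

SUB_DELIMS = "!$&'()*+,;="
# pchar also allows ":" and "@"; unreserved characters are never escaped by quote().
PCHAR_SAFE = SUB_DELIMS + ":@"

segment = "fish&chips"  # hypothetical path segment containing "&"

literal = quote(segment, safe=PCHAR_SAFE)  # "&" kept as-is
escaped = quote(segment, safe="")          # "&" percent-encoded

print(literal, escaped)
# Both spellings carry the same data octets once decoded.
print(unquote(literal) == unquote(escaped))
```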

redapple · 2016-03-22T22:03:51Z

@lopuhin , I submitted a new PR #45 continuing your work and following #44 (comment)
would you mind having a look?

lopuhin · 2016-03-23T08:44:00Z

@redapple #45 looks excellent! Closing this

Add tests by @redapple, do urljoin on unicode strings.

24dbe88

redapple mentioned this pull request Mar 22, 2016

Fix tests on non-ASCII characters in URL + new safe_url_string() #45

Merged

lopuhin closed this Mar 23, 2016
