@lopuhin
Member

lopuhin commented Feb 29, 2016

I added tests by @redapple, and fixed get_base_url to do urljoin on unicode strings.

The line baseurl = str_to_unicode(baseurl, encoding) is required because get_base_url appears to be expected to work with byte URLs as well (see test_get_base_url in the same file). I am not sure which encoding to use here, so I decided to be more permissive to avoid breaking existing Python 2 code that might be passing UTF-8 encoded strings.
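
To illustrate why the decoding step matters, here is a minimal Python 3 sketch, not the actual w3lib code. The helper names to_text and join_base are hypothetical, and the UTF-8 default is an assumption taken from the comment above. In Python 3, urljoin refuses to mix bytes and str arguments, so both inputs are normalised to text first.

```python
from urllib.parse import urljoin


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode bytes to str, pass str through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def join_base(page_url, base_href, encoding="utf-8"):
    """Join a page URL and a <base href> value after normalising both to text."""
    # urljoin raises TypeError when given a mix of bytes and str,
    # so decode any byte input before joining.
    return urljoin(to_text(page_url, encoding), to_text(base_href, encoding))


# A byte URL and a text href can now be combined safely.
print(join_base(b"http://example.org/a/page.html", "../b/"))
```

The permissive part is the default encoding: callers who already pass text are unaffected, and callers who pass bytes get them decoded instead of hitting an error.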

Related to scrapy/scrapy#1783

@codecov-io

Current coverage is 90.96%

Merging #44 into master will not affect coverage as of 6a657d3


@redapple
Contributor

@lopuhin,
after testing a few cases with Chrome and Firefox, and discussing with @kmike,
I'm having second thoughts about the tests that expect non-UTF-8 encoded data prior to percent-escaping in URL paths.

What I'm seeing with modern browsers is that the encoding of the page (or of a <form> within the page) only affects the query part of URLs, not paths.
So I believe base URL paths should always be built from UTF-8 encoded characters, then percent-escaped.
Hence,
u'http://example.org/sterling\u00a3' should translate to b'http://example.org/sterling%C2%A3'
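
The expected translation can be reproduced with the standard library. This is a small sketch under the assumption that urllib.parse.quote is an acceptable stand-in for the escaping step; the function names escape_path_utf8 and escape_with_page_encoding are hypothetical and only serve the comparison.

```python
from urllib.parse import quote


def escape_path_utf8(path):
    """Encode the path as UTF-8 first, then percent-escape the bytes."""
    return quote(path.encode("utf-8"))


def escape_with_page_encoding(text, page_encoding):
    """For contrast: percent-escape using the page's own encoding."""
    return quote(text, encoding=page_encoding)


path = "/sterling\u00a3"

# UTF-8 path escaping, as proposed in the comment above.
print(escape_path_utf8(path))                      # /sterling%C2%A3

# The same character escaped from its Latin-1 byte, which is what
# a page-encoding-dependent scheme would produce.
print(escape_with_page_encoding(path, "latin-1"))  # /sterling%A3
```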

@dvdbng
Contributor

dvdbng commented Mar 17, 2016

If I'm reading the spec correctly, compliant web servers should interpret encoded and decoded paths the same way, but in practice not all of them do (e.g. this vs this)

@redapple
Contributor

@Youwotma, true: RFC 3986, §3.3 (Path) does allow "&" in path segments:

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="
pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"

This server seems to treat "&" as special within the path component, which it shouldn't, according to §2.2 (Reserved Characters):

URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
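
As a rough illustration of the grammar quoted above, the following sketch escapes a single path segment while leaving every pchar character untouched. The name encode_segment and the constant PCHAR_SAFE are hypothetical, and using urllib.parse.quote for this is an assumption, not something taken from the PR.

```python
from urllib.parse import quote, unquote

# sub-delims plus ":" and "@" from the pchar rule; quote() already
# leaves unreserved characters (letters, digits, "-", ".", "_", "~") alone.
PCHAR_SAFE = "!$&'()*+,;=:@"


def encode_segment(segment):
    """Percent-escape one path segment, keeping pchar characters literal."""
    # "/" is deliberately not in the safe set: inside a single segment
    # it is data, so it must be escaped.
    return quote(segment, safe=PCHAR_SAFE)


print(encode_segment("a&b c"))           # a&b%20c
print(encode_segment("sterling\u00a3"))  # sterling%C2%A3
print(encode_segment("a/b"))             # a%2Fb

# With no delimiting role for "&" in a path, the literal and the
# escaped form decode to the same data.
print(unquote("a%26b") == unquote("a&b"))  # True
```

A server that treats "a&b" and "a%26b" differently in the path is therefore giving "&" a delimiting role that the generic syntax does not define there.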

@redapple
Contributor

@lopuhin, I submitted a new PR, #45, continuing your work and following #44 (comment).
Would you mind having a look?

@lopuhin
Member Author

lopuhin commented Mar 23, 2016

@redapple #45 looks excellent! Closing this.

lopuhin closed this Mar 23, 2016