Decoding of "Location" header on redirects using latin-1 can be wrong #2321

Closed
redapple opened this issue Oct 12, 2016 · 0 comments
@redapple redapple commented Oct 12, 2016

Web servers should use percent-encoded URLs in their "Location" headers, but they don't always do so.

For example, on this website the URL http://www.yjc.ir/fa/news/1815565/
redirects to www.yjc.ir/fa/news/1815565/اعزام-كوهنوردان-ايراني-به-كيليمانجارو

but the bytes received are UTF-8 encoded, and not percent-escaped:

'Location': ['/fa/news/1815565/\xd8\xa7\xd8\xb9\xd8\xb2\xd8\xa7\xd9\x85-\xd9\x83\xd9\x88\xd9\x87\xd9\x86\xd9\x88\xd8\xb1\xd8\xaf\xd8\xa7\xd9\x86-\xd8\xa7\xd9\x8a\xd8\xb1\xd8\xa7\xd9\x86\xd9\x8a-\xd8\xa8\xd9\x87-\xd9\x83\xd9\x8a\xd9\x84\xd9\x8a\xd9\x85\xd8\xa7\xd9\x86\xd8\xac\xd8\xa7\xd8\xb1\xd9\x88']

RedirectMiddleware decodes the header as "latin1" (this is new in Scrapy 1.1) and issues a request to http://www.yjc.ir/fa/news/1815565/%C3%98%C2%A7%C3%98%C2%B9%C3%98%C2%B2%C3%98%C2%A7%C3%99%C2%85-%C3%99%C2%83%C3%99%C2%88%C3%99%C2%87%C3%99%C2%86%C3%99%C2%88%C3%98%C2%B1%C3%98%C2%AF%C3%98%C2%A7%C3%99%C2%86-%C3%98%C2%A7%C3%99%C2%8A%C3%98%C2%B1%C3%98%C2%A7%C3%99%C2%86%C3%99%C2%8A-%C3%98%C2%A8%C3%99%C2%87-%C3%99%C2%83%C3%99%C2%8A%C3%99%C2%84%C3%99%C2%8A%C3%99%C2%85%C3%98%C2%A7%C3%99%C2%86%C3%98%C2%AC%C3%98%C2%A7%C3%98%C2%B1%C3%99%C2%88

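which is not correct: each byte of the UTF-8 sequence has been treated as a separate latin-1 character, re-encoded as UTF-8, and only then percent-escaped.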
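The double encoding can be reproduced with the standard library alone. This is only a sketch and not Scrapy's actual code path: `urllib.parse.quote` stands in for whatever URL-escaping step the middleware performs, and `raw` holds just the first two non-ASCII characters of the header shown above.

```python
from urllib.parse import quote

# First bytes of the "Location" value from the report (UTF-8 for "اع").
raw = b'/fa/news/1815565/\xd8\xa7\xd8\xb9'

# Decoding as latin-1 turns every byte into its own character (U+00D8, U+00A7, ...).
as_latin1 = raw.decode('latin-1')

# Escaping that string as UTF-8 then encodes each of those characters again.
wrong = quote(as_latin1, safe='/')
print(wrong)   # /fa/news/1815565/%C3%98%C2%A7%C3%98%C2%B9

# Decoding with the encoding the server actually used gives the expected escapes.
right = quote(raw.decode('utf-8'), safe='/')
print(right)   # /fa/news/1815565/%D8%A7%D8%B9
```

The first result matches the start of the URL that Scrapy requested, the second matches what curl and wget follow.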

curl -i "http://www.yjc.ir/fa/news/1815565/" and wget http://www.yjc.ir/fa/news/1815565/ handle it just fine and correctly follow http://www.yjc.ir/fa/news/1815565/%D8%A7%D8%B9%D8%B2%D8%A7%D9%85-%D9%83%D9%88%D9%87%D9%86%D9%88%D8%B1%D8%AF%D8%A7%D9%86-%D8%A7%D9%8A%D8%B1%D8%A7%D9%86%D9%8A-%D8%A8%D9%87-%D9%83%D9%8A%D9%84%D9%8A%D9%85%D8%A7%D9%86%D8%AC%D8%A7%D8%B1%D9%88

(curl fixed this issue not too long ago.)

Thanks @stav for reporting!

@redapple redapple added the bug label Oct 12, 2016
redapple added a commit to redapple/scrapy that referenced this issue Oct 12, 2016
Fixes scrapyGH-2321

The idea is to not guess the encoding of "Location" header value
and simply percent-encode non-ASCII bytes,
which should then be re-interpreted correctly by the remote website
in whatever encoding was used originally.

See https://tools.ietf.org/html/rfc3987#section-3.2

This is similar to the changes to safe_url_string in
scrapy/w3lib#45
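A minimal sketch of the idea in this commit message, using only the standard library. It is not the code from the commit: the helper name `percent_encode_location` and the `SAFE` character set are assumptions made for illustration. The point is that the raw bytes are escaped directly, so no encoding has to be guessed.

```python
from urllib.parse import quote, urljoin

# Assumption: URL delimiters and "%" are left alone so that existing
# escapes and the URL structure are preserved.
SAFE = "/:?#[]@!$&'()*+,;=%"

def percent_encode_location(raw: bytes) -> str:
    """Hypothetical helper: escape non-ASCII bytes without decoding them."""
    # quote() accepts bytes and escapes each byte outside the safe set as %XX.
    return quote(raw, safe=SAFE)

# First path segment of the header from the report (UTF-8 for "اعزام").
location = percent_encode_location(
    b'/fa/news/1815565/\xd8\xa7\xd8\xb9\xd8\xb2\xd8\xa7\xd9\x85'
)
print(location)  # /fa/news/1815565/%D8%A7%D8%B9%D8%B2%D8%A7%D9%85

# Resolving against the original request URL yields an ASCII-only redirect target.
print(urljoin('http://www.yjc.ir/fa/news/1815565/', location))
```

Because each byte maps to exactly one `%XX` escape, the server receives the same byte sequence it sent, whatever encoding it used.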
@redapple redapple added this to the v1.2.1 milestone Oct 12, 2016
@dangra dangra closed this in #2322 Oct 20, 2016
dangra added a commit that referenced this issue Oct 20, 2016
#2322)

* Do not interpret non-ASCII bytes in "Location" and percent-encode them

Fixes GH-2321

The idea is to not guess the encoding of "Location" header value
and simply percent-encode non-ASCII bytes,
which should then be re-interpreted correctly by the remote website
in whatever encoding was used originally.

See https://tools.ietf.org/html/rfc3987#section-3.2

This is similar to the changes to safe_url_string in
scrapy/w3lib#45

* Remove unused import