Decoding of "Location" header on redirects using latin-1 can be wrong #2321
Comments
redapple added a commit to redapple/scrapy that referenced this issue on Oct 12, 2016
Fixes scrapyGH-2321

The idea is to not guess the encoding of "Location" header value and simply percent-encode non-ASCII bytes, which should then be re-interpreted correctly by the remote website in whatever encoding was used originally.

See https://tools.ietf.org/html/rfc3987#section-3.2

This is similar to the changes to safe_url_string in scrapy/w3lib#45
dangra added a commit that referenced this issue on Oct 20, 2016
#2322)

* Do not interpret non-ASCII bytes in "Location" and percent-encode them

  Fixes GH-2321

  The idea is to not guess the encoding of "Location" header value and simply percent-encode non-ASCII bytes, which should then be re-interpreted correctly by the remote website in whatever encoding was used originally.

  See https://tools.ietf.org/html/rfc3987#section-3.2

  This is similar to the changes to safe_url_string in scrapy/w3lib#45

* Remove unused import
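The idea in the commit message can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the commit: `percent_encode_non_ascii` is a hypothetical helper name, and the sample bytes are the first word of the redirect path reported in this issue.

```python
def percent_encode_non_ascii(raw: bytes) -> str:
    """Hypothetical helper: escape every byte >= 0x80 and keep ASCII bytes as-is.

    No charset is guessed, so the server gets back exactly the bytes it sent.
    """
    return "".join(
        "%{:02X}".format(b) if b >= 0x80 else chr(b)
        for b in raw
    )


# UTF-8 bytes, as sent by the site in this report (shortened to one word).
utf8_location = b"/fa/news/1815565/\xd8\xa7\xd8\xb9\xd8\xb2\xd8\xa7\xd9\x85"
encoded_utf8 = percent_encode_non_ascii(utf8_location)

# A made-up latin-1 path, to show the same helper works for other encodings.
latin1_location = b"/caf\xe9"
encoded_latin1 = percent_encode_non_ascii(latin1_location)

print(encoded_utf8)
print(encoded_latin1)
```

Because the escaping works byte by byte, the result is the same whichever encoding the server used, which is the point of not decoding the header at all.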
Web servers should use percent-encoded URLs in their "Location" headers, but they don't always do so.
For example, this website redirects http://www.yjc.ir/fa/news/1815565/
to www.yjc.ir/fa/news/1815565/اعزام-كوهنوردان-ايراني-به-كيليمانجارو
but the bytes received in the header are UTF-8 encoded and not percent-escaped.
RedirectMiddleware decodes the header as "latin1" (this is new in Scrapy 1.1) and issues a request to
http://www.yjc.ir/fa/news/1815565/%C3%98%C2%A7%C3%98%C2%B9%C3%98%C2%B2%C3%98%C2%A7%C3%99%C2%85-%C3%99%C2%83%C3%99%C2%88%C3%99%C2%87%C3%99%C2%86%C3%99%C2%88%C3%98%C2%B1%C3%98%C2%AF%C3%98%C2%A7%C3%99%C2%86-%C3%98%C2%A7%C3%99%C2%8A%C3%98%C2%B1%C3%98%C2%A7%C3%99%C2%86%C3%99%C2%8A-%C3%98%C2%A8%C3%99%C2%87-%C3%99%C2%83%C3%99%C2%8A%C3%99%C2%84%C3%99%C2%8A%C3%99%C2%85%C3%98%C2%A7%C3%99%C2%86%C3%98%C2%AC%C3%98%C2%A7%C3%98%C2%B1%C3%99%C2%88
which is not correct.
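The doubled escapes can be reproduced with the standard library alone. The snippet below is a rough reproduction, not Scrapy's actual code path: it assumes the latin-1 decoded text is later percent-encoded as UTF-8, and it uses only the first word of the path to stay short.

```python
from urllib.parse import quote

# Raw "Location" bytes as received: UTF-8, not percent-escaped.
raw_location = b"/fa/news/1815565/\xd8\xa7\xd8\xb9\xd8\xb2\xd8\xa7\xd9\x85"

# Decoding as latin-1 maps every byte to its own code point (0xD8 -> U+00D8) ...
as_latin1 = raw_location.decode("latin-1")
# ... and quoting that text as UTF-8 turns each original byte into two.
wrong = quote(as_latin1, safe="/")

# Decoding with the encoding the server really used gives the expected path.
right = quote(raw_location.decode("utf-8"), safe="/")

print(wrong)
print(right)
```

The first printed path starts with the same `%C3%98%C2%A7` sequence as the incorrect request above, while the second matches the URL that curl and wget follow.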
curl -i "http://www.yjc.ir/fa/news/1815565/"
and wget http://www.yjc.ir/fa/news/1815565/
handle it just fine and correctly follow
http://www.yjc.ir/fa/news/1815565/%D8%A7%D8%B9%D8%B2%D8%A7%D9%85-%D9%83%D9%88%D9%87%D9%86%D9%88%D8%B1%D8%AF%D8%A7%D9%86-%D8%A7%D9%8A%D8%B1%D8%A7%D9%86%D9%8A-%D8%A8%D9%87-%D9%83%D9%8A%D9%84%D9%8A%D9%85%D8%A7%D9%86%D8%AC%D8%A7%D8%B1%D9%88
(curl fixed the issue not too long ago).
Thanks @stav for reporting!