Fixes scrapy GH-2321
The idea is to not guess the encoding of "Location" header value
and simply percent-encode non-ASCII bytes,
which should then be re-interpreted correctly by the remote website
in whatever encoding was used originally.
See https://tools.ietf.org/html/rfc3987#section-3.2
This is similar to the changes to safe_url_string in
scrapy/w3lib#45
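A minimal sketch of the idea described above, not the actual Scrapy or w3lib code: the raw header bytes are never decoded with a guessed encoding; ASCII bytes pass through and every byte >= 0x80 becomes a `%XX` escape. The function name `percent_encode_non_ascii` is hypothetical and used only for illustration.

```python
def percent_encode_non_ascii(raw: bytes) -> str:
    """Return an ASCII-only string for a raw "Location" header value.

    Hypothetical helper for illustration: ASCII bytes are kept as-is,
    and each non-ASCII byte is replaced by its %XX escape, so the
    original byte sequence (whatever its encoding) is preserved.
    """
    return "".join(
        chr(byte) if byte < 0x80 else "%{:02X}".format(byte)
        for byte in raw
    )


# UTF-8 bytes of the first two letters of the redirect path in the report
utf8_location = b"/fa/news/1815565/\xd8\xa7\xd8\xb9"
print(percent_encode_non_ascii(utf8_location))

# The same function works for a header sent in another encoding (here latin-1 "é")
latin1_location = b"/caf\xe9"
print(percent_encode_non_ascii(latin1_location))
```

Because the bytes are escaped one by one, the remote server receives exactly the byte sequence it sent and can interpret it in its own encoding.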
(#2322)
* Do not interpret non-ASCII bytes in "Location" and percent-encode them
* Remove unused import
Web servers should use encoded URLs in their "Location" headers, but they don't always do so.
For example, this website redirects http://www.yjc.ir/fa/news/1815565/
to www.yjc.ir/fa/news/1815565/اعزام-كوهنوردان-ايراني-به-كيليمانجارو
but the bytes received in the "Location" header are UTF-8 encoded and not percent-escaped.
RedirectMiddleware decodes the header as "latin1" (this is new in Scrapy 1.1) and issues a request to http://www.yjc.ir/fa/news/1815565/%C3%98%C2%A7%C3%98%C2%B9%C3%98%C2%B2%C3%98%C2%A7%C3%99%C2%85-%C3%99%C2%83%C3%99%C2%88%C3%99%C2%87%C3%99%C2%86%C3%99%C2%88%C3%98%C2%B1%C3%98%C2%AF%C3%98%C2%A7%C3%99%C2%86-%C3%98%C2%A7%C3%99%C2%8A%C3%98%C2%B1%C3%98%C2%A7%C3%99%C2%86%C3%99%C2%8A-%C3%98%C2%A8%C3%99%C2%87-%C3%99%C2%83%C3%99%C2%8A%C3%99%C2%84%C3%99%C2%8A%C3%99%C2%85%C3%98%C2%A7%C3%99%C2%86%C3%98%C2%AC%C3%98%C2%A7%C3%98%C2%B1%C3%99%C2%88 which is not correct.
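The mangled URL can be reproduced with a small sketch. It assumes the faulty path is "decode the header bytes as latin1, then re-encode the resulting text as UTF-8 and percent-escape it", which is consistent with the URL shown above but is an inference, not a quote of Scrapy's code. The function names `mis_decoded_path` and `byte_preserving_path` are hypothetical; only `urllib.parse.quote` from the standard library is used.

```python
from urllib.parse import quote


def mis_decoded_path(raw: bytes) -> str:
    # Assumed faulty behaviour: treat the UTF-8 bytes as latin1 text,
    # then percent-encode that text as UTF-8 (double encoding).
    text = raw.decode("latin1")
    return quote(text.encode("utf-8"), safe="/")


def byte_preserving_path(raw: bytes) -> str:
    # Percent-encode the received bytes directly, without decoding them.
    return quote(raw, safe="/")


# UTF-8 bytes of the first two letters of the redirect path
raw_location = b"/fa/news/1815565/\xd8\xa7\xd8\xb9"

print(mis_decoded_path(raw_location))      # starts with .../%C3%98%C2%A7
print(byte_preserving_path(raw_location))  # starts with .../%D8%A7
```

Each original byte such as 0xD8 becomes the two-byte UTF-8 sequence C3 98 after the latin1 round trip, which is why the incorrect URL is twice as long as the one curl and wget follow.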
curl -i "http://www.yjc.ir/fa/news/1815565/"
and wget http://www.yjc.ir/fa/news/1815565/
handle it just fine and correctly follow http://www.yjc.ir/fa/news/1815565/%D8%A7%D8%B9%D8%B2%D8%A7%D9%85-%D9%83%D9%88%D9%87%D9%86%D9%88%D8%B1%D8%AF%D8%A7%D9%86-%D8%A7%D9%8A%D8%B1%D8%A7%D9%86%D9%8A-%D8%A8%D9%87-%D9%83%D9%8A%D9%84%D9%8A%D9%85%D8%A7%D9%86%D8%AC%D8%A7%D8%B1%D9%88 (curl fixed the issue not too long ago).
Thanks @stav for reporting!