BOM should take precedence over Content-Type header when detecting the encoding #5601
Comments
This is a mistake on the target web page's side. I don't think Scrapy can add methods that detect the encoding by inspecting the page content. For example, if a web page is made up of strings in different encodings, which one is the right encoding?
We already try to detect the encoding of the page.
Hey @simapple! Scrapy aims to have a behavior which is similar to web browsers.
Real-world browser behavior is documented in various WHATWG standards, so they have an answer for this. This BOM handling is not a random hack that only Scrapy would have; it's something all browsers do, and it is described in the standard.
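To illustrate the rule, here is a minimal sketch of BOM-first decoding in plain Python. It is not Scrapy's or w3lib's code: `sniff_bom` and `decode_bom_first` are hypothetical helper names, and the sketch assumes only the three BOMs (UTF-8, UTF-16BE, UTF-16LE) are checked, with the header encoding used as the fallback.

```python
import codecs

# Byte order marks checked in this sketch (UTF-8, UTF-16BE, UTF-16LE).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def sniff_bom(body: bytes):
    """Return (encoding, bom_length) if body starts with a known BOM, else (None, 0)."""
    for bom, encoding in _BOMS:
        if body.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_bom_first(body: bytes, header_encoding: str) -> str:
    """Decode body, letting a BOM override the encoding declared in the header."""
    encoding, skip = sniff_bom(body)
    if encoding is None:
        # No BOM: fall back to what the Content-Type header declared.
        encoding = header_encoding
    return body[skip:].decode(encoding, errors="replace")


# A UTF-8 body with a BOM, served with a (wrong) cp1251 charset in the header.
body = codecs.BOM_UTF8 + "Привет!".encode("utf-8")
header_first = body.decode("cp1251", errors="replace")  # header wins: mojibake
bom_first = decode_bom_first(body, "cp1251")            # BOM wins: readable text
```

With the header taking priority the text is decoded as cp1251 and comes out garbled; with the BOM taking priority it decodes to the intended string.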
Sorry, I was not very familiar with the WHATWG standard. Following the standard is always right.
BOM should take precedence over Content-Type header when detecting the encoding. Fixes GH-5601.
Currently Scrapy checks the headers first when detecting the encoding. But browsers actually give the BOM a higher priority; this is also in the WHATWG standard. This can be checked, e.g., by running this server and opening its URL in a browser. The browser uses UTF-8, but Scrapy uses cp1251:
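A minimal sketch of such a server, using only Python's standard `http.server`. This is an illustrative reconstruction: the handler name `BomHandler`, the helper `make_server`, the HTML wrapper, and port 8000 are assumptions. It sends a UTF-8 body that starts with a BOM while declaring `charset=cp1251` in the Content-Type header.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# UTF-8 BOM followed by UTF-8 encoded HTML (the markup itself is an assumption).
BODY = b"\xef\xbb\xbf" + "<html><body>Привет!</body></html>".encode("utf-8")


class BomHandler(BaseHTTPRequestHandler):
    """Serve a UTF-8 page with a BOM, but declare cp1251 in the header."""

    def do_GET(self):
        self.send_response(200)
        # Deliberately conflicting charset: the header says cp1251, the BOM says UTF-8.
        self.send_header("Content-Type", "text/html; charset=cp1251")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, format, *args):
        # Silence the default per-request logging.
        pass


def make_server(port: int = 8000) -> HTTPServer:
    """Create (but do not start) the test server bound to localhost."""
    return HTTPServer(("127.0.0.1", port), BomHandler)
```

Calling `make_server().serve_forever()` starts it; the page is then available at `http://127.0.0.1:8000/`.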
When opening this page in a browser, it shows "Привет!".
Spider code to check it:
Spider outputs
See also: scrapy/w3lib#189 - it's a similar issue, but fixing it in w3lib is not enough to make it work in Scrapy.