BOM should take precedence over Content-Type header when detecting the encoding #5601

Closed
kmike opened this issue Aug 16, 2022 · 4 comments · Fixed by #5611

@kmike
Member

kmike commented Aug 16, 2022

Currently Scrapy consults the HTTP headers first when detecting the encoding. Browsers, however, give the BOM higher priority, and the WHATWG standard specifies the same. This can be checked by running the server below and opening the URL in a browser: the browser uses UTF-8, but Scrapy uses cp1251:

import codecs
from http.server import BaseHTTPRequestHandler
from http.server import HTTPServer


class HttpGetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type", "text/html; charset=cp1251")
        self.end_headers()
        self.wfile.write(codecs.BOM_UTF8)
        self.wfile.write("<!DOCTYPE html>".encode('utf8'))
        self.wfile.write("Привет!".encode('utf8'))


if __name__ == '__main__':
    httpd = HTTPServer(('', 8000), HttpGetHandler)
    httpd.serve_forever()

When opening this page in a browser, it shows "Привет!".
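
To see why the two clients can disagree, the bytes written by the server above can be decoded both ways without any network involved. This is a minimal sketch using only the standard library; the variable names are illustrative and are not part of Scrapy.

```python
import codecs

# The exact bytes the test server writes after the headers.
body = (
    codecs.BOM_UTF8
    + "<!DOCTYPE html>".encode("utf8")
    + "Привет!".encode("utf8")
)

# BOM wins (browser behavior): "utf-8-sig" decodes UTF-8 and drops the BOM.
as_browser = body.decode("utf-8-sig")

# Header wins: every byte, including the three BOM bytes, is read as cp1251.
as_header = body.decode("cp1251")

print(as_browser)
print(as_header)
```

The second string is mojibake: each two-byte UTF-8 sequence is split into two unrelated cp1251 characters, and the BOM bytes appear as three extra characters at the start.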

Spider code to check it:

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = "tst"

    start_urls = ["http://0.0.0.0:8000"]

    def parse(self, response):
        return {"encoding": response.encoding, "text": response.text}


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(MySpider)
    process.start()

Spider outputs

{'encoding': 'cp1251', 'text': 'п»ї<!DOCTYPE html>РџСЂРёРІРµС‚!'}

See also: scrapy/w3lib#189. It's a similar issue, but fixing it in w3lib is not enough to make it work in Scrapy.
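
The precedence requested here can be sketched as a small function: look for a BOM first, and fall back to the Content-Type charset only when no BOM is present. This is an illustration under assumptions, not Scrapy's or w3lib's actual implementation. `sniff_bom`, `charset_from_header`, `detect_encoding`, and the `default` fallback are hypothetical names, and the sketch only checks the UTF-8 and UTF-16 BOMs.

```python
import codecs
from email.message import Message

# BOMs checked by this sketch: UTF-8, UTF-16BE, UTF-16LE.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def sniff_bom(body):
    """Return (encoding, bom_length) if body starts with a known BOM, else (None, 0)."""
    for bom, encoding in _BOMS:
        if body.startswith(bom):
            return encoding, len(bom)
    return None, 0


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def detect_encoding(body, content_type, default="utf-8"):
    """Hypothetical precedence: BOM first, then the header charset, then a default."""
    bom_encoding, _ = sniff_bom(body)
    return bom_encoding or charset_from_header(content_type) or default


# Same situation as the test server: UTF-8 BOM in the body, cp1251 in the header.
body = codecs.BOM_UTF8 + "Привет!".encode("utf8")
print(detect_encoding(body, "text/html; charset=cp1251"))
```

With this ordering the header is still honored for BOM-less bodies, so only responses like the one from the test server change behavior.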

@simapple

This is a mistake on the target web page's side. I don't think Scrapy could add methods to detect the encoding by checking the encoding of the web content. For example, if a web page is composed of strings in different encodings, which is the right encoding?

@wRAR
Member

wRAR commented Aug 18, 2022

We already try to detect the encoding of the page.

@kmike
Member Author

kmike commented Aug 18, 2022

Hey @simapple! Scrapy aims to behave similarly to web browsers.

For example, if a web page is composed of strings in different encodings, which is the right encoding?

Real-world browser behavior is documented in various WHATWG standards, so they have an answer for this. This BOM issue is not a random hack that only Scrapy would have; it's something all browsers do, and it is described in the standard.

@simapple

Sorry, I was not very clear about the WHATWG standard. Following the standard is always right.
