BOM should take precedence over Content-Type header when detecting the encoding #189

kmike · 2022-08-16T15:38:12Z

Currently, html_to_unicode prefers the Content-Type header even when a BOM is present.
But browsers, as well as the WHATWG standard, check the BOM first. This can be verified by running the server below and opening its URL in a browser: UTF-8 is used if the BOM is present, and cp1251 is used if it is not:

import codecs
from http.server import BaseHTTPRequestHandler
from http.server import HTTPServer


class HttpGetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
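        # The header declares cp1251, but the body below is UTF-8 with a BOM.
        self.send_header("Content-type", "text/html; charset=cp1251")
        self.end_headers()
        # Remove this line to see the browser fall back to cp1251.
        self.wfile.write(codecs.BOM_UTF8)
        self.wfile.write("<!DOCTYPE html>".encode('utf8'))
        self.wfile.write("Привет!".encode('utf8'))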


if __name__ == '__main__':
    httpd = HTTPServer(('', 8000), HttpGetHandler)
    httpd.serve_forever()
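
For comparison, the BOM-first order that browsers follow could be sketched roughly as below. This is only an illustration, not the actual html_to_unicode code: `pick_encoding` and `_BOMS` are hypothetical names, the charset regex is a simplification, and only the UTF-8 and UTF-16 BOMs are checked.

```python
import codecs
import re

# BOMs checked before anything else (hypothetical helper table).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def pick_encoding(body, content_type, default="utf-8"):
    """Return (encoding, body without BOM), preferring the BOM over the header."""
    # 1. A BOM wins over whatever the Content-Type header says.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name, body[len(bom):]
    # 2. Otherwise fall back to the charset declared in the header.
    match = re.search(r"charset=\s*[\"']?([\w-]+)", content_type, re.I)
    if match:
        return match.group(1), body
    # 3. No BOM and no declared charset: use the default (an assumption here).
    return default, body


# Same situation as the test server above: cp1251 header, UTF-8 body with BOM.
body = codecs.BOM_UTF8 + "Привет!".encode("utf8")
encoding, payload = pick_encoding(body, "text/html; charset=cp1251")
print(encoding, payload.decode(encoding))
```

With the BOM removed from `body`, the same call would return the header's cp1251, which matches the browser behaviour described above.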

kmike changed the title ~~BOM should take precedence over Content-Type header when detecting an encoding~~ BOM should take precedence over Content-Type header when detecting the encoding Aug 16, 2022

This was referenced Aug 16, 2022

BOM should take precedence over Content-Type header when detecting the encoding scrapy/scrapy#5601

Closed

BOM should take precedence over Content-Type header when detecting the encoding scrapinghub/web-poet#64

Closed

BurnzZ mentioned this issue Sep 23, 2022

update html_to_unicode() so that the BOM is used first to check the e… #191

Merged

Gallaecio closed this as completed in #191 Oct 11, 2022
