
BOM should take precedence over Content-Type header when detecting the encoding #189

Closed
kmike opened this issue Aug 16, 2022 · 0 comments · Fixed by #191

kmike commented Aug 16, 2022

Currently, html_to_unicode prefers the Content-Type header even when a BOM is present.
But browsers, as well as the WHATWG standard, use the BOM first. This can be checked, e.g., by running this server and opening its URL in a browser (UTF-8 is used if the BOM is present, and cp1251 is used if it's not):

import codecs
from http.server import BaseHTTPRequestHandler
from http.server import HTTPServer


class HttpGetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # The header declares cp1251, but the body below is UTF-8 with a BOM.
        self.send_header("Content-type", "text/html; charset=cp1251")
        self.end_headers()
        # Remove this line to see the browser fall back to cp1251.
        self.wfile.write(codecs.BOM_UTF8)
        self.wfile.write("<!DOCTYPE html>".encode('utf8'))
        self.wfile.write("Привет!".encode('utf8'))


if __name__ == '__main__':
    httpd = HTTPServer(('', 8000), HttpGetHandler)
    httpd.serve_forever()
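
For illustration, the BOM-first order described above could be sketched roughly as follows. This is not the html_to_unicode implementation; `pick_encoding` and `charset_from_content_type` are hypothetical helper names, and the sketch assumes only the UTF-8 and UTF-16 BOMs are sniffed, with the Content-Type charset used only when no BOM is found.

```python
import codecs
import re

# Assumption: only these three BOMs are sniffed.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def charset_from_content_type(header):
    """Hypothetical helper: extract the charset parameter, if any."""
    match = re.search(r"charset\s*=\s*[\"']?([\w-]+)", header or "", re.I)
    return match.group(1).lower() if match else None


def pick_encoding(content_type, body, default="utf-8"):
    """Return (encoding, body without BOM), checking the BOM first."""
    for bom, encoding in _BOMS:
        if body.startswith(bom):
            return encoding, body[len(bom):]
    # No BOM: fall back to the header, then to the default.
    return charset_from_content_type(content_type) or default, body


if __name__ == "__main__":
    header = "text/html; charset=cp1251"
    with_bom = codecs.BOM_UTF8 + "Привет!".encode("utf8")
    without_bom = "Привет!".encode("cp1251")
    for body in (with_bom, without_bom):
        encoding, rest = pick_encoding(header, body)
        print(encoding, rest.decode(encoding))
```

With the same header and body as the server above, the sketch picks UTF-8 for the BOM-prefixed body and cp1251 only when the BOM is absent, which matches the browser behavior described in this issue.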

kmike changed the title from "BOM should take precedence over Content-Type header when detecting an encoding" to "BOM should take precedence over Content-Type header when detecting the encoding" on Aug 16, 2022