cgitb html: wrong encoding for utf-8 #66935

Closed
wrohdewald mannequin opened this issue Oct 27, 2014 · 13 comments
Labels
stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@wrohdewald
Mannequin

wrohdewald mannequin commented Oct 27, 2014

BPO 22746
Nosy @amauryfa, @vstinner, @ezio-melotti, @bitdancer, @serhiy-storchaka
Files
  • cgibug.py
  • 22746.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2014-10-27.18:48:57.478>
    labels = ['type-bug', 'library', 'expert-unicode']
    title = 'cgitb html: wrong encoding for utf-8'
    updated_at = <Date 2014-12-03.07:50:08.854>
    user = 'https://bugs.python.org/wrohdewald'

    bugs.python.org fields:

    activity = <Date 2014-12-03.07:50:08.854>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)', 'Unicode']
    creation = <Date 2014-10-27.18:48:57.478>
    creator = 'wrohdewald'
    dependencies = []
    files = ['37044', '37047']
    hgrepos = []
    issue_num = 22746
    keywords = ['patch']
    message_count = 11.0
    messages = ['230085', '230099', '230117', '230131', '230133', '230134', '230148', '230149', '230159', '230361', '232073']
    nosy_count = 6.0
    nosy_names = ['amaury.forgeotdarc', 'vstinner', 'ezio.melotti', 'r.david.murray', 'wrohdewald', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'needs patch'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue22746'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

    @wrohdewald
    Mannequin Author

    wrohdewald mannequin commented Oct 27, 2014

    The attached script shows the non-ascii characters wrong wherever they occur, including the exception message and the comment in the source code.

    Looking at the produced .html, I can say that cgitb simply passes the single byte utf-8 codes without encoding them as needed.

    Same happens with Python3.4 (after applying some quick and dirty changes to cgitb.py, see bug bpo-22745).

    @wrohdewald wrohdewald mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Oct 27, 2014
    @bitdancer
    Member

    If you look at the file, you'll find that the data is in utf-8 (at least if your locale is a utf-8 locale). However, HTML is by default interpreted as latin-1, so that's what the web browser displays when you pass the file on disk to it. If you add "encoding='latin-1'" to your open call, your script will work. What you do if you need to display non-latin1 characters, I don't know. (See https://bugzil.la/760050, for example).

    Note: the above is for python3. I don't remember how you do the equivalent in python2...a naive codecs.open call just got me a UnicodeDecodeError.
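The mismatch described above can be reproduced without a browser. The following is a minimal sketch, assuming the report was written as UTF-8 and is then read as latin-1; the variable names are illustrative only and do not come from cgitb.

```python
# A single non-ASCII character, as it might appear in a traceback message.
original = "\u00e4"  # "ä"

# What ends up on disk when the file is written under a UTF-8 locale.
utf8_bytes = original.encode("utf-8")

# What a reader sees when those bytes are interpreted as latin-1 instead.
misread = utf8_bytes.decode("latin-1")

print(repr(utf8_bytes))  # two bytes, 0xC3 0xA4
print(repr(misread))     # two characters instead of one
```

Each UTF-8 byte is mapped to its own latin-1 character, which is why one "ä" shows up as two unrelated characters.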

    @wrohdewald
    Mannequin Author

    wrohdewald mannequin commented Oct 28, 2014

    If you cannot offer a solution for arbitrary unicode, you have no solution at all. After all, that is what unicode is about: support ALL languages, not only your own.

    I do not quite understand why you think this is not a bug.

    If cgitb encodes unicode like & x e 4 ; (remove spaces), the browser does not have to guess the encoding, it will always show the correct character. This works for all of unicode. See https://en.wikipedia.org/wiki/Unicode_and_HTML#Numeric_character_references

    So this bug is fixable, I am reopening it.

    For Python3, the fix is actually very simple: Do not write doc but str(doc.encode('ascii', 'xmlcharrefreplace')), like in the attached patch. This patch works for me, but there might be code paths it does not yet cover. My source file is encoded in utf-8; other source file encodings should be tested too. I do not know if cgitb correctly honors a source file header like # -*- coding: utf-8 -*-

    Fixing this for Python2 is certainly doable too but perhaps more difficult because a Python2 str() may have an unknown encoding.
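A minimal sketch of the encoding step proposed above, assuming `doc` is a Python 3 `str` holding the generated HTML (the sample string is made up for illustration). This is not the attached patch. Note that in Python 3, calling `str()` on a `bytes` object returns its repr, including the `b'...'` wrapper, so this sketch decodes the ASCII bytes back to text instead.

```python
# Hypothetical HTML fragment standing in for the document cgitb builds.
doc = "<p>Fehler: \u00e4 \u4e0e</p>"

# Replace every character that ASCII cannot represent with a numeric
# character reference; the markup itself is pure ASCII and is unchanged.
ascii_bytes = doc.encode("ascii", "xmlcharrefreplace")

# str() on bytes yields the repr, which is not valid HTML output.
as_repr = str(ascii_bytes)

# Decoding gives a plain str that contains only ASCII characters.
as_text = ascii_bytes.decode("ascii")

print(as_repr)
print(as_text)
```

The decoded result can be written with any ASCII-compatible encoding, because nothing outside ASCII remains in it.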

    @wrohdewald wrohdewald mannequin removed the invalid label Oct 28, 2014
    @wrohdewald wrohdewald mannequin reopened this Oct 28, 2014
    @amauryfa
    Member

    What about
    open(..., encoding='latin-1', errors='xmlcharrefreplace')
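A short sketch of how that call behaves in Python 3, assuming the HTML is written to a temporary file (the path and sample text are placeholders). Characters that latin-1 can represent are written as single bytes; everything else becomes a numeric character reference.

```python
import os
import tempfile

# Sample text: "ä" exists in latin-1, the Chinese character does not.
doc = "<p>\u00e4 \u4e0e</p>"

fd, path = tempfile.mkstemp(suffix=".html")
os.close(fd)
try:
    # The error handler is applied only to unencodable characters.
    with open(path, "w", encoding="latin-1", errors="xmlcharrefreplace") as f:
        f.write(doc)
    # Read the raw bytes back to see what was actually stored.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(raw)
```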

    @wrohdewald
    Mannequin Author

    wrohdewald mannequin commented Oct 28, 2014

    > What about
    > open(..., encoding='latin-1', errors='xmlcharrefreplace')

    That works fine. I tested with a Chinese character 与

    But I do not think the application should work around something that cgitb is supposed to handle. More so since the documentation is dead silent about this. You need to use codecs.open instead of open and add those kw arguments. As long as this is not explained in the documentation, I guess it is a bug for everyone not using latin-1.

    @wrohdewald
    Mannequin Author

    wrohdewald mannequin commented Oct 28, 2014

    correction: A bug for everyone using non-ascii characters.

    @amauryfa
    Member

    > You need to use codecs.open instead of open

    No, why? In Python 3, open() supports the errors handler.

    @bitdancer
    Member

    In normal HTML utf-8 works fine, doesn't it? It's only when reading from a file (where the browser doesn't know the encoding) that it fails. Do you have a use case for xmlcharrefreplace in the HTML context (which is what cgitb is primarily targeted at)? Some place where the web page can't be declared as utf-8, perhaps?

    I suppose it might be a not-unreasonable enhancement request to have a parameter to Hook that says "do xmlcharrefreplace", but since the workaround is actually simpler than that, I don't know if that is worthwhile or not. Or do people feel that doing the replacement all the time (it's only in tracebacks, after all) would be the right thing to do?

    @wrohdewald
    Mannequin Author

    wrohdewald mannequin commented Oct 28, 2014

    > > You need to use codecs.open instead of open
    > No, why? in python3 open() supports the errors handler.

    Right, but not in Python 2, which has the same problem. I need my code to run with both.

    > Do you have a use case for xmlcharrefreplace in the HTML context?

    No, my only use case is the local file.
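For code that must run on both Python 2 and Python 3, one option is `io.open`, which accepts `encoding` and `errors` in Python 2.6+ and is the same function as the built-in `open` in Python 3. The sketch below is only an illustration: `write_html_report` is a hypothetical helper, not part of cgitb, and on Python 2 the text passed in has to be a `unicode` object.

```python
import io
import os
import tempfile


def write_html_report(path, html_text):
    """Write html_text as pure ASCII, using charrefs for everything else."""
    with io.open(path, "w", encoding="ascii",
                 errors="xmlcharrefreplace") as f:
        f.write(html_text)


# Usage with a temporary file; the u"" prefix keeps this valid on Python 2.
fd, report_path = tempfile.mkstemp(suffix=".html")
os.close(fd)
try:
    write_html_report(report_path, u"<p>\u4e0e</p>")
    with io.open(report_path, "rb") as f:
        report_bytes = f.read()
finally:
    os.remove(report_path)

print(report_bytes)
```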

    @ezio-melotti
    Member

    > In normal HTML utf-8 works fine, doesn't it?

    It does; in fact, as long as the encoding used by the browser matches the one used in the file, no charrefs need to be used (except > < and "). Of course, if non-Unicode encodings are used, the range of available characters that can go directly in the HTML will be more limited, but this can be solved by using charrefs -- the browser will display the corresponding character no matter what the encoding is. This also means that if charrefs are used for all non-ASCII characters, then the browser will be able to display the page no matter what encoding is being used (as long as it's ASCII-compatible, and most encodings are). The downside is that it will make the source less readable and possibly longer, especially if there are a lot of non-ASCII characters, but if most of the characters are expected to be ASCII, using charrefs might be ok.
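The point about ASCII-compatible encodings can be checked directly. In this sketch (sample text and codec list chosen for illustration), the charref-encoded bytes decode to the same string under several encodings, and `html.unescape` stands in for the browser's charref resolution.

```python
import html

text = "caf\u00e9 \u4e0e"

# All non-ASCII characters become numeric character references.
encoded = text.encode("ascii", "xmlcharrefreplace")

# A reader may guess any of these encodings; the bytes are plain ASCII,
# so every guess yields the same decoded markup.
guesses = ["ascii", "latin-1", "utf-8", "cp1252"]
decoded = {codec: encoded.decode(codec) for codec in guesses}

# Resolving the charrefs recovers the original text for every guess.
recovered = {codec: html.unescape(s) for codec, s in decoded.items()}

print(encoded)
print(recovered)
```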

    @serhiy-storchaka
    Member

    We can convert cgitb.hook to produce ASCII-compatible output with charrefs in 3.x. But there is a problem with str in 2.7. An 8-bit string can contain non-ASCII data, and the encoding is not known in the general case.
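The 2.7 difficulty can be illustrated with Python 3 `bytes`, which behave like a Python 2 8-bit `str` in this respect. This is a sketch with an arbitrarily chosen byte: the same value is a different character under different encodings, and not valid at all under others, so there is no single correct charref to emit for it.

```python
# One byte with the high bit set, with no record of its encoding.
raw = b"\xe4"

as_latin1 = raw.decode("latin-1")  # Western European reading
as_cp1251 = raw.decode("cp1251")   # Cyrillic reading

# As UTF-8 the byte is an incomplete multi-byte sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_latin1), repr(as_cp1251), utf8_ok)
```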

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @hugovk
    Member

    hugovk commented Apr 11, 2022

    I think we can close this? The cgitb module is deprecated in 3.11 and set for removal in 3.13.

    See PEP 594 – Removing dead batteries from the standard library, #91217 and #32410.

    @hugovk hugovk closed this as completed Apr 11, 2022
    @AlexWaygood
    Member

    > I think we can close this? The cgitb module is deprecated in 3.11 and set for removal in 3.13.
    >
    > See PEP 594 – Removing dead batteries from the standard library, #91217 and #32410.

    Cc. @amauryfa, @wrohdewald
