cgitb html: wrong encoding for utf-8 #66935

wrohdewald · 2014-10-27T18:48:57Z

BPO	22746
Nosy	@amauryfa, @vstinner, @ezio-melotti, @bitdancer, @serhiy-storchaka
Files	cgibug.py 22746.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2014-10-27.18:48:57.478>
labels = ['type-bug', 'library', 'expert-unicode']
title = 'cgitb html: wrong encoding for utf-8'
updated_at = <Date 2014-12-03.07:50:08.854>
user = 'https://bugs.python.org/wrohdewald'

bugs.python.org fields:

activity = <Date 2014-12-03.07:50:08.854>
actor = 'serhiy.storchaka'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)', 'Unicode']
creation = <Date 2014-10-27.18:48:57.478>
creator = 'wrohdewald'
dependencies = []
files = ['37044', '37047']
hgrepos = []
issue_num = 22746
keywords = ['patch']
message_count = 11.0
messages = ['230085', '230099', '230117', '230131', '230133', '230134', '230148', '230149', '230159', '230361', '232073']
nosy_count = 6.0
nosy_names = ['amaury.forgeotdarc', 'vstinner', 'ezio.melotti', 'r.david.murray', 'wrohdewald', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue22746'
versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

wrohdewald · 2014-10-27T18:48:57Z

The attached script shows the non-ascii characters wrong wherever they occur, including the exception message and the comment in the source code.

Looking at the produced .html, I can say that cgitb simply passes the single byte utf-8 codes without encoding them as needed.

Same happens with Python3.4 (after applying some quick and dirty changes to cgitb.py, see bug bpo-22745).

bitdancer · 2014-10-27T19:54:33Z

If you look at the file, you'll find that the data is in utf-8 (at least if your locale is a utf-8 locale). However, HTML is by default interpreted as latin-1, so that's what the web browser displays when you pass it the file on disk. If you add "encoding='latin-1'" to your open call, your script will work. What you do if you need to display non-latin-1 characters, I don't know. (See https://bugzil.la/760050, for example.)

Note: the above is for python3. I don't remember how you do the equivalent in python2...a naive codecs.open call just got me a UnicodeDecodeError.

wrohdewald · 2014-10-28T04:21:13Z

If you cannot offer a solution for arbitrary unicode, you have no solution at all. After all, that is what unicode is about: support ALL languages, not only your own.

I do not quite understand why you think this is not a bug.

If cgitb encodes unicode as numeric character references like &#xe4;, the browser does not have to guess the encoding, it will always show the correct character. This works for all of unicode. See https://en.wikipedia.org/wiki/Unicode_and_HTML#Numeric_character_references

So this bug is fixable, I am reopening it.

For Python3, the fix is actually very simple: Do not write doc but str(doc.encode('ascii', 'xmlcharrefreplace')), like in the attached patch. This patch works for me but there might be code paths it does not cover yet. And my source file is encoded in utf-8, other source file encodings should be tested too. I do not know if cgitb correctly honors the source file header like # -*- coding: utf-8 -*-
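A minimal sketch of the conversion described above, for Python 3 only. The helper name `to_ascii_html` is hypothetical and not part of cgitb or the attached patch. It decodes the encoded bytes back to `str` instead of calling `str()` on them, because `str()` on a `bytes` object in Python 3 returns its `b'...'` repr.

```python
def to_ascii_html(doc):
    """Replace every non-ASCII character in doc with a numeric character reference.

    Hypothetical helper for illustration; cgitb does not provide it.
    """
    # The 'xmlcharrefreplace' error handler emits decimal references such as &#228;
    return doc.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "<p>Fehler: \u00e4 \u4e0e</p>"
converted = to_ascii_html(sample)
print(converted)
```

The result is pure ASCII, so a browser should render it the same way regardless of which ASCII-compatible encoding it guesses for the file.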

Fixing this for Python2 is certainly doable too but perhaps more difficult because a Python2 str() may have an unknown encoding.

amauryfa · 2014-10-28T09:12:12Z

What about
open(..., encoding='latin-1', errors='xmlcharrefreplace')
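As a rough illustration of this suggestion in Python 3: characters that latin-1 can represent are written as single bytes, and everything else falls back to a numeric character reference. The file name and sample text below are assumptions made for the example.

```python
import os
import tempfile

# Sample HTML with one latin-1 character (é) and one character outside latin-1 (与).
html_doc = "<p>caf\u00e9 \u4e0e</p>"

path = os.path.join(tempfile.mkdtemp(), "traceback.html")
with open(path, "w", encoding="latin-1", errors="xmlcharrefreplace") as f:
    f.write(html_doc)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()
print(raw)
```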

wrohdewald · 2014-10-28T09:32:36Z

> What about
> open(..., encoding='latin-1', errors='xmlcharrefreplace')

That works fine. I tested with the Chinese character 与.

But I do not think the application should work around something that cgitb is supposed to handle. More so since the documentation is dead silent about this. You need to use codecs.open instead of open and add those kw arguments. As long as this is not explained in the documentation, I guess it is a bug for everyone not using latin-1.

wrohdewald · 2014-10-28T09:37:21Z

correction: A bug for everyone using non-ascii characters.

amauryfa · 2014-10-28T13:19:28Z

> You need to use codecs.open instead of open

No, why? In python3, open() supports the errors handler.

bitdancer · 2014-10-28T13:43:52Z

In normal HTML utf-8 works fine, doesn't it? It's only when reading from a file (where the browser doesn't know the encoding) that it fails. Do you have a use case for xmlcharrefreplace in the HTML context (which is what cgitb is primarily targeted at)? Some place where the web page can't be declared as utf-8, perhaps?

I suppose it might be a not-unreasonable enhancement request to have a parameter to Hook that says "do xmlcharrefreplace", but since the workaround is actually simpler than that, I don't know if that is worthwhile or not. Or do people feel that doing the replacement all the time (it's only in tracebacks, after all) would be the right thing to do?

wrohdewald · 2014-10-28T16:01:18Z

> > You need to use codecs.open instead of open
> No, why? in python3 open() supports the errors handler.

Right, but not in python2, which has the same problem. I need my code to run with both.

> Do you have a use case for xmlcharrefreplace in the HTML context?

No, my only use case is the local file.
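For code that has to run on both Python 2 and Python 3, one option is `io.open`, which accepts `encoding` and `errors` on both (on Python 3 it is the same function as the built-in `open`). This is a sketch, not something proposed in the thread. The helper name `write_html` and the file name are hypothetical, and on Python 2 the text passed in must be a `unicode` object.

```python
import io
import os
import tempfile


def write_html(path, text):
    """Write text as latin-1, using character references for anything else.

    Hypothetical helper; on Python 2, text must be unicode, not a byte str.
    """
    with io.open(path, "w", encoding="latin-1", errors="xmlcharrefreplace") as f:
        f.write(text)


demo_path = os.path.join(tempfile.mkdtemp(), "report.html")
write_html(demo_path, u"<p>\u00e4 \u4e0e</p>")

# Inspect the raw bytes that were written.
with io.open(demo_path, "rb") as f:
    written = f.read()
print(written)
```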

ezio-melotti · 2014-10-31T17:49:33Z

> In normal HTML utf-8 works fine, doesn't it?

It does. In fact, as long as the encoding used by the browser matches the one used in the file, no charrefs need to be used (except > < and "). Of course, if non-Unicode encodings are used, the range of available characters that can go directly in the HTML will be more limited, but this can be solved by using charrefs -- the browser will display the corresponding character no matter what the encoding is. This also means that if charrefs are used for all non-ASCII characters, then the browser will be able to display the page no matter what encoding is being used (as long as it's ASCII-compatible, and most encodings are). The downside is that it will make the source less readable and possibly longer, especially if there are a lot of non-ASCII characters, but if most of the characters are expected to be ASCII, using charrefs might be ok.
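The encoding-independence claim can be checked with a small experiment: once all non-ASCII characters are charrefs, the bytes decode to the same markup under any ASCII-compatible codec, and resolving the references recovers the original text. The sample string and the list of codecs are arbitrary choices for this sketch, and `html.unescape` stands in for what a browser does when it resolves references.

```python
import html

text = "na\u00efve \u4e0e"

# Pure-ASCII bytes: every non-ASCII character becomes a decimal charref.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

# Decode the same bytes as if the browser had guessed different encodings.
results = {}
for guessed in ("utf-8", "latin-1", "cp1252", "koi8-r"):
    markup = ascii_bytes.decode(guessed)
    results[guessed] = html.unescape(markup)

print(ascii_bytes)
print(results)
```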

serhiy-storchaka · 2014-12-03T07:50:09Z

We can convert cgitb.hook to produce ASCII-compatible output with charrefs in 3.x. But there is a problem with str in 2.7. An 8-bit string can contain non-ASCII data, and the encoding is not known in the general case.
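To make the 2.7 problem concrete, here is a heuristic sketch written with Python 3 `bytes` standing in for a Python 2 `str`. It is not a proposed fix: the function name `bytes_to_ascii_html` and the guess list are assumptions, and the latin-1 fallback never fails but can silently map bytes to the wrong characters, which is exactly the ambiguity described above.

```python
def bytes_to_ascii_html(data, guesses=("utf-8",)):
    """Turn bytes of unknown encoding into ASCII text with charrefs.

    Hypothetical heuristic: try each guessed encoding, then fall back to latin-1.
    """
    for encoding in guesses:
        try:
            text = data.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        # latin-1 maps every byte to a code point, so this cannot raise,
        # but the result is only correct if the data really was latin-1.
        text = data.decode("latin-1")
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


from_utf8 = bytes_to_ascii_html("\u00e4".encode("utf-8"))
from_latin1 = bytes_to_ascii_html(b"\xe4")
print(from_utf8, from_latin1)
```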

hugovk · 2022-04-11T11:18:44Z

I think we can close this? The cgitb module is deprecated in 3.11 and set for removal in 3.13.

See PEP 594 – Removing dead batteries from the standard library, #91217 and #32410.

AlexWaygood · 2022-04-12T14:03:37Z

> I think we can close this? The cgitb module is deprecated in 3.11 and set for removal in 3.13.
>
> See PEP 594 – Removing dead batteries from the standard library, #91217 and #32410.

Cc. @amauryfa, @wrohdewald

wrohdewald mannequin added the stdlib (Python modules in the Lib dir) and type-bug (An unexpected behavior, bug, or error) labels Oct 27, 2014

bitdancer closed this as completed Oct 27, 2014

bitdancer added the invalid label Oct 27, 2014

wrohdewald mannequin removed the invalid label Oct 28, 2014

wrohdewald mannequin reopened this Oct 28, 2014

vstinner added the topic-unicode label Oct 28, 2014

ezio-melotti transferred this issue from another repository Apr 10, 2022

hugovk closed this as completed Apr 11, 2022
