
Force BOM option in UTF output. #45669

Closed
jgsack mannequin opened this issue Oct 25, 2007 · 19 comments
Assignees
Labels
topic-unicode type-feature A feature request or enhancement

Comments

jgsack mannequin commented Oct 25, 2007

BPO 1328
Nosy @doerwalter

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/doerwalter'
closed_at = <Date 2008-03-22.14:42:01.782>
created_at = <Date 2007-10-25.22:59:35.971>
labels = ['type-feature', 'expert-unicode']
title = 'Force BOM option in UTF output.'
updated_at = <Date 2008-03-22.14:44:52.747>
user = 'https://bugs.python.org/jgsack'

bugs.python.org fields:

activity = <Date 2008-03-22.14:44:52.747>
actor = 'doerwalter'
assignee = 'doerwalter'
closed = True
closed_date = <Date 2008-03-22.14:42:01.782>
closer = 'doerwalter'
components = ['Unicode']
creation = <Date 2007-10-25.22:59:35.971>
creator = 'jgsack'
dependencies = []
files = []
hgrepos = []
issue_num = 1328
keywords = []
message_count = 19.0
messages = ['56759', '56780', '56782', '56801', '56813', '56814', '56817', '57028', '57033', '57041', '57522', '57527', '57529', '57691', '63705', '64189', '64217', '64324', '64325']
nosy_count = 5.0
nosy_names = ['doerwalter', 'jafo', 'jgsack', 'ggenellina', 'Rhamphoryncus']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = None
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue1328'
versions = ['Python 2.6', 'Python 2.5']

jgsack mannequin commented Oct 25, 2007

The behavior of codecs utf_16_[bl]e is to omit the BOM.

In a testing environment (and perhaps elsewhere), a forced BOM is useful.
I'm requesting an optional argument, something like
    force_BOM=False

I guess it would require such an option in multiple function calls; sorry, I
don't know enough to itemize them.

If this is implemented, it might be desirable to think about the aliases
like unicode*unmarked.

Regards,
..jim
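For concreteness, the behavior being requested against can be checked directly (a Python 3 sketch; the thread itself predates Python 3, but the codec names are the same):

```python
# Quick check: the byte-order-specific UTF-16 codecs omit the BOM,
# while plain utf-16 prepends one in native byte order.
import codecs

with_bom = "abc".encode("utf-16")      # BOM prepended, native endianness
no_bom_le = "abc".encode("utf-16-le")  # explicit order, no BOM
no_bom_be = "abc".encode("utf-16-be")  # explicit order, no BOM

assert with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
assert no_bom_le == b"a\x00b\x00c\x00"
assert no_bom_be == b"\x00a\x00b\x00c"
```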

@jgsack jgsack mannequin added type-bug An unexpected behavior, bug, or error topic-unicode labels Oct 25, 2007
jgsack mannequin commented Oct 26, 2007

Feature Request REVISION
========================
Upon reflection and more playing around with some test cases, I wish to
revise my feature request.

I think the utf8 codecs should accept input with or without the "sig".
On output, only the utf_8_sig should write the 3-byte "sig". This behavior
change would not seem disruptive to current applications.

For utf16, (arguably) a missing BOM should merely assume machine endianness.
For utf_16_le, utf_16_be input, both should accept & discard a BOM.
On output, I'm not sure; maybe all should write a BOM unless passed a flag
signifying no bom?
Or to preserve backward compat, could have a param write_bom defaulting to
True for utf16 and False for utf_16_le and utf_16_be. This is a
modification of the original request (for a force_bom flag).

Unless I have confused myself with my test cases, the current codecs are
slightly inconsistent for the utf8 codecs:

utf8 treats "sig" as real data, if present, but..
utf_8_sig works right even without the "sig" (so this one I like as is!)

The 16'ers seem to match the (inferred) specs, but for completeness here:
utf_16 refuses to proceed w/o BOM (even with correct endian input data)
utf_16_le treats BOM as data
utf_16_be treats BOM as data
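The utf8 observations can be reproduced directly (Python 3 sketch; utf_16 behavior without a BOM has varied across versions, so only the utf8 cases are shown):

```python
import codecs

data = codecs.BOM_UTF8 + "hi".encode("utf-8")  # b'\xef\xbb\xbfhi'

# utf8 treats the signature as real data (it decodes to U+FEFF)...
assert data.decode("utf-8") == "\ufeffhi"
# ...while utf_8_sig strips it, and also accepts input without it.
assert data.decode("utf-8-sig") == "hi"
assert b"hi".decode("utf-8-sig") == "hi"
```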

Regards,
..jim

jgsack mannequin commented Oct 26, 2007

Later note: kind of weird!

On my LE machine, utf16 reads my BE-formatted test data (no BOM),
apparently assuming some kind of surrogate format, until it finds
an "illegal UTF-16 surrogate".

That I fail to understand, especially since it quits upon seeing
a BOM with valid LE data.

Test data and test code available on request.

Regards,
..jim

gvanrossum (Member) commented:

Can't you force a BOM by simply writing \ufeff at the start of the file?
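A minimal Python 3 sketch of this workaround: write U+FEFF yourself and the byte-order-specific codec encodes it as the appropriate BOM bytes.

```python
import codecs
import io

buf = io.BytesIO()
writer = codecs.getwriter("utf-16-le")(buf)
writer.write("\ufeff")  # encoded as b"\xff\xfe", the LE BOM
writer.write("abc")

assert buf.getvalue() == codecs.BOM_UTF16_LE + b"a\x00b\x00c\x00"
```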

jgsack mannequin commented Oct 26, 2007

re: msg56782

Yes, of course I can explicitly write the BOM. I did realize that after
my first post ( my-'duh' :-[ ).

But after playing some more, I do think this issue has become a
worthwhile one. My second post msg56780 asks that utf_8 be tolerant
of the 3-byte sig BOM, and utf_16_[bl]e be tolerant of their BOMs,
which I argue is consistent with "be liberal on what you accept".

A second half of that message suggests that it might be worth
considering something like a write_bom parameter with utf_16
defaulting to True, and utf_16_[bl]e defaulting to False.

My third post (m56782) may actually represent a bug. I have a
unittest for this and would be glad to provide it (although I need
to reduce a larger test to a simple case). I will look at this
again, and re-pester you as required.

Regards (and thanks for the reply),
..jim

gvanrossum (Member) commented:

If you can, please submit a patch that fixes all those issues, with
unit tests and doc changes if at all possible. That will make it much
easier to evaluate the ramifications of your proposal(s).

jgsack mannequin commented Oct 26, 2007

OK, I will work on it. I have just downloaded trunk and will see what
I can do. Might be a week or two.

..jim

Rhamphoryncus mannequin commented Nov 1, 2007

The problem with "being tolerant" as you suggest is you lose the ability
to round-trip. Read in a file using the UTF-8 signature, write it back
out, and suddenly nothing else can open it.

Conceptually, these signatures shouldn't even be part of the encoding;
they're a prefix in the file indicating which encoding to use.

Note that the BOM signature (ZWNBSP) is a valid code point. Although it
seems unlikely for a file to start with ZWNBSP, if you were to chop a file
up into smaller chunks and decode them individually you'd be more likely
to run into it. (However, it seems general use of ZWNBSP is being
discouraged precisely due to this potential for confusion[1]).

In summary, guessing the encoding should never be the default. Although
it may be appropriate in some contexts, we must ensure we emit the right
encoding for those contexts as well. [2]

[1] http://unicode.org/faq/utf_bom.html#38
[2] http://unicode.org/faq/utf_bom.html#28

jgsack mannequin commented Nov 1, 2007

Adam Olsen wrote:

Adam Olsen added the comment:

The problem with "being tolerant" as you suggest is you lose the ability
to round-trip. Read in a file using the UTF-8 signature, write it back
out, and suddenly nothing else can open it.

I'm sorry, I don't see the round-trip problem you describe.

If codec utf_8 or utf_8_sig were to accept input with or without the
3-byte BOM, and write it as currently specified without/with the BOM
respectively, then _I_ can reread again with either utf_8 or utf_8_sig.

No round trip problem _for me_.

Now if I need to exchange with someone else, that's a different matter. One
way or another I need to know what format they need and create the
output they require for their input.

Am I missing something in your statement of a problem?

Conceptually, these signatures shouldn't even be part of the encoding;
they're a prefix in the file indicating which encoding to use.

Yes, I'm aware of that, but you can't predict what you may find in dusty
archives, or what someone may give to you. IMO, that's the basis of
being tolerant in what you accept, is it not?

Note that the BOM signature (ZWNBSP) is a valid code point. Although it
seems unlikely for a file to start with ZWNBSP, if you were to chop a file
up into smaller chunks and decode them individually you'd be more likely
to run into it. (However, it seems general use of ZWNBSP is being
discouraged precisely due to this potential for confusion[1]).

I understand that throwing away a ZWNBSP at the beginning of a file does
risk discarding data rather than metadata. I also believe the standards
people recognized that and deliberately picked a BOM character that is a
calculated low risk. I'm willing to accept that risk.

In summary, guessing the encoding should never be the default. Although
it may be appropriate in some contexts, we must ensure we emit the right
encoding for those contexts as well. [2]

[1] http://unicode.org/faq/utf_bom.html#38
[2] http://unicode.org/faq/utf_bom.html#28

From my point of view, I don't see that being tolerant in what _I_ (or
my applications) accept violates any guidelines.

Please explain where I am wrong.

Regards,
..jim

Rhamphoryncus mannequin commented Nov 1, 2007

On 11/1/07, James G. sack (jim) <report@bugs.python.org> wrote:

James G. sack (jim) added the comment:

Adam Olsen wrote:
> Adam Olsen added the comment:
>
> The problem with "being tolerant" as you suggest is you lose the ability
> to round-trip. Read in a file using the UTF-8 signature, write it back
> out, and suddenly nothing else can open it.

I'm sorry, I don't see the round-trip problem you describe.

If codec utf_8 or utf_8_sig were to accept input with or without the
3-byte BOM, and write it as currently specified without/with the BOM
respectively, then _I_ can reread again with either utf_8 or utf_8_sig.

No round trip problem _for me_.

Now if I need to exchange with someone else, that's a different matter. One
way or another I need to know what format they need and create the
output they require for their input.

Am I missing something in your statement of a problem?

You don't seem to think it's important to interact with other
programs. If you're importing with no intent to write out to a common
format, then yes, autodetecting the BOM is just fine. Python needs a
more general default though, and not guessing is part of that.

> Conceptually, these signatures shouldn't even be part of the encoding;
> they're a prefix in the file indicating which encoding to use.

Yes, I'm aware of that, but you can't predict what you may find in dusty
archives, or what someone may give to you. IMO, that's the basis of
being tolerant in what you accept, is it not?

Garbage in, garbage out. There's a lot of protocols with whitespace,
capitalization, etc that you can fudge around while retaining the same
contents; character set encodings aren't one of them.

jgsack mannequin commented Nov 15, 2007

re: msg57041, I'm sorry if I gave the wrong impression about interacting
with other programs. I started this feature request with some half-baked
thinking, which I tried to revise in my second post.

Anyway I'm most interested right now in lobbying for a change to utf_8 to
accept input with an _optional_ BOM-signature so that the input part would
behave just like utf_8_sig, where the BOM-sig is already optional (on
input).

In the process of trying to come up with a test and patch for this, I
discovered a bug in utf_8_sig (bpo-1444, http://bugs.python.org/issue1444).

After there is some action on that I will return here to continue with
utf_8, which I have convinced myself (anyways) is a reasonable and safe
revision.

..jim

doerwalter (Contributor) commented:

jgsack wrote:

If codec utf_8 or utf_8_sig were to accept input with or without the
3-byte BOM, and write it as currently specified without/with the BOM
respectively, then _I_ can reread again with either utf_8 or utf_8_sig.

That's exactly what the utf_8_sig codec does. The decoder accepts input
with or without the BOM (the (first) BOM doesn't get returned). The
encoder always prepends a BOM.

Or do you want a codec that behaves like utf_8 on reading and like
utf_8_sig on writing? Such a codec indeed wouldn't roundtrip.
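The utf_8_sig behavior described here can be seen directly (Python 3 sketch):

```python
import codecs

encoded = "data".encode("utf-8-sig")
assert encoded.startswith(codecs.BOM_UTF8)    # encoder always prepends a BOM
assert encoded.decode("utf-8-sig") == "data"  # decoder strips it, not returned

# The decoder also accepts input without the BOM:
assert "data".encode("utf-8").decode("utf-8-sig") == "data"
```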

doerwalter (Contributor) commented:

For utf16, (arguably) a missing BOM should merely assume machine
endianness.
For utf_16_le, utf_16_be input, both should accept & discard a BOM.
On output, I'm not sure; maybe all should write a BOM unless passed a flag
signifying no bom?
Or to preserve backward compat, could have a param write_bom defaulting to
True for utf16 and False for utf_16_le and utf_16_be. This is a
modification of the original request (for a force_bom flag).

The Unicode FAQ (http://unicode.org/faq/utf_bom.html#28) clearly states:

"""
Q: How I should deal with BOMs?
[...]
Where the precise type of the data stream is known (e.g. Unicode
big-endian or Unicode little-endian), the BOM should not be used. In
particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE,
UTF-32BE or UTF-32LE a BOM *must* not be used. [...]

jgsack mannequin commented Nov 20, 2007

More discussion of utf_8.py decoding behavior (and possible change):

For my needs, I would like the decoding parts of the utf_8 module to treat
an initial BOM as an optional signature and skip it if there is one (just
like the utf_8_sig decoder). In fact I have a working patch that replaces
the utf_8 decode, IncrementalDecoder and StreamReader components by
direct transplants from utf_8_sig (as recently repaired -- there was a
StreamReader error).

However the reason for discussion is to ask how it might impact existing
code.

I can imagine there might be utf_8 client code out there which expects to
see a leading U+feff as (perhaps) a clue that the output should be returned
with a BOM-signature (say) to accommodate the guessed input requirements of
the remote correspondent.

Making my work easier might actually make someone else's work (probably,
annoyingly) harder.

So what to do?

I can just live with code like
    if input[0] == u"\ufeff":
        input = input[1:]
spread around, and of course slightly different for incremental and stream
inputs.

But I probably wouldn't. I would probably substitute a
"my_utf_8" encoding to make my code a little cleaner.

Another thought I had would require "the other guy" to update his code, but
at least it wouldn't make his work annoyingly difficult like my original
change might have.

Here's the basic outline:

  • Add another decoder function that returns a 3-tuple:
        decode3(input, errors='strict') => (data, consumed, had_bom)
    where had_bom is true if a leading BOM was seen and skipped

  • then the usual decode is just something like
        def decode(input, errors='strict'):
            return decode3(input, errors)[:2]

  • add a member variable and an accessor to both the IncrementalDecoder
    and StreamReader classes, something like
        def had_bom(self):
            return self._had_bom
    and initialize/set the self._had_bom variable as required.
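As a sketch only (decode3 and decode are the proposal's names, not an existing API), the module-level half of this outline might look like:

```python
# Hypothetical sketch of the proposed decode3(), built on the stdlib
# UTF-8 decoder; the names follow the proposal and are not a real API.
import codecs

def decode3(input, errors="strict"):
    """Decode UTF-8, skipping one leading BOM-signature if present.

    Returns (data, consumed, had_bom).
    """
    had_bom = input[:3] == codecs.BOM_UTF8
    if had_bom:
        input = input[3:]
    data, consumed = codecs.utf_8_decode(input, errors, True)
    return data, consumed + (3 if had_bom else 0), had_bom

def decode(input, errors="strict"):
    return decode3(input, errors)[:2]
```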

This complicates the interface somewhat and requires some additional
documentation.

To document my original simple[-minded] idea would require possibly only a
few more words in the existing paragraph on utf_8_sig, mentioning that both
modules have the same decoding behavior but different encoding.

I thought of a secondary consideration: if utf_8 and utf_8_sig are "almost
the same", it's possible that future refactoring might unify them, with the
differences contained in behavior flags (e.g., skip_leading_bom). The leading
BOM processing might even be pushed into codecs.utf_8_decode for possible
minor advantages.

Is there anybody monitoring this who has an opinion on this?

..jim

jafo mannequin commented Mar 17, 2008

It sounds like the Unicode FAQ has an authoritative statement on this,
is this a "wontfix", or does this need more discussion? Perhaps on
python-dev or at the sprints this week?

@jafo jafo mannequin changed the title feature request: force BOM option Force BOM option in UTF output. Mar 17, 2008
@jafo jafo mannequin assigned doerwalter Mar 17, 2008
@jafo jafo mannequin added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Mar 17, 2008
doerwalter (Contributor) commented:

I don't see exactly what James is proposing.

For my needs, I would like the decoding parts of the utf_8 module
to treat an initial BOM as an optional signature and skip it if
there is one (just like the utf_8_sig decoder). In fact I have
a working patch that replaces the utf_8 decode,
IncrementalDecoder and StreamReader components by direct
transplants from utf_8_sig (as recently repaired -- there was a
StreamReader error).

If you want a decoder that behaves like the utf-8-sig decoder, use the
utf-8-sig decoder. I don't see how changing the utf-8 decoder helps here.

I can imagine there might be utf_8 client code out there which
expects to see a leading U+feff as (perhaps) a clue that the
output should be returned with a BOM-signature (say) to
accommodate the guessed input requirements of the remote
correspondent.

In this case use UTF-8: The leading BOM will be passed to the application.

I can just live with code like
    if input[0] == u"\ufeff":
        input = input[1:]
spread around, and of course slightly different for incremental
and stream inputs.

Can you post an example that requires this code?

jgsack mannequin commented Mar 20, 2008

Can you post an example that requires this code?

This is not a big issue, and it wouldn't hurt if it got declared "go away
and come back later if you have patch, test, docs, and a convincing use
case".

..But, for the record..

Suppose I want to both read and write some utf8. It is unknown whether the
input has a BOM, but it is known to be utf8. I want to write utf8 without
any BOM. I see two options, which I find slightly ugly/annoying/error-prone:

a) Use 2 separate encodings: read via utf_8_sig so as to transparently
accept input with/without BOM; use utf_8 on output to not emit any BOM.

b) Use utf_8 for read and write and explicitly check for and discard
leading BOM on input if any.
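Option (a) is easy to express; a minimal Python 3 sketch:

```python
# Option (a) sketched: decode with utf-8-sig (BOM optional on input),
# encode with plain utf-8 (never emits a BOM).
import codecs

def read_text(raw: bytes) -> str:
    return raw.decode("utf-8-sig")  # accepts input with or without a BOM

def write_text(text: str) -> bytes:
    return text.encode("utf-8")     # never writes a BOM

for raw in (b"hello", codecs.BOM_UTF8 + b"hello"):
    assert write_text(read_text(raw)) == b"hello"
```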

What _I_ would prefer is that utf_8 would ignore a BOM, if present (just
like utf_8_sig).

(What I was talking about in my last post was a complication in
consideration of someone else who would prefer otherwise, or of code that
might break upon my change.)

Regards,
..jim

doerwalter (Contributor) commented:

If you want to use UTF-8-sig for decoding and UTF-8 for encoding and
have this available as one codec, you can define your own codec for this:

import codecs

def search_function(name):
    if name == "myutf8":
        utf8 = codecs.lookup("utf-8")
        utf8_sig = codecs.lookup("utf-8-sig")
        return codecs.CodecInfo(
            name='myutf8',
            encode=utf8.encode,
            decode=utf8_sig.decode,
            incrementalencoder=utf8.IncrementalEncoder,
            incrementaldecoder=utf8_sig.IncrementalDecoder,
            streamreader=utf8_sig.StreamReader,
            streamwriter=utf8.StreamWriter,
        )


codecs.register(search_function)

Closing the issue as "won't fix".

doerwalter (Contributor) commented:

Oops, that code was supposed to read:

import codecs

def search_function(name):
    if name == "myutf8":
        utf8 = codecs.lookup("utf-8")
        utf8_sig = codecs.lookup("utf-8-sig")
        return codecs.CodecInfo(
            name='myutf8',
            encode=utf8.encode,
            decode=utf8_sig.decode,
            incrementalencoder=utf8.incrementalencoder,
            incrementaldecoder=utf8_sig.incrementaldecoder,
            streamreader=utf8_sig.streamreader,
            streamwriter=utf8.streamwriter,
        )


codecs.register(search_function)
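For the record, the corrected registration behaves as intended; a self-contained Python 3 check (the codec name "myutf8" is this example's own, not a stdlib name):

```python
import codecs

def search_function(name):
    if name == "myutf8":
        utf8 = codecs.lookup("utf-8")
        utf8_sig = codecs.lookup("utf-8-sig")
        return codecs.CodecInfo(
            name="myutf8",
            encode=utf8.encode,
            decode=utf8_sig.decode,
            incrementalencoder=utf8.incrementalencoder,
            incrementaldecoder=utf8_sig.incrementaldecoder,
            streamreader=utf8_sig.streamreader,
            streamwriter=utf8.streamwriter,
        )

codecs.register(search_function)

# Reads strip an optional BOM; writes never emit one.
assert (codecs.BOM_UTF8 + b"hi").decode("myutf8") == "hi"
assert b"hi".decode("myutf8") == "hi"
assert "hi".encode("myutf8") == b"hi"
```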

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022